08. HuggingFace Pipeline

Hugging Face Local Pipelines

The HuggingFacePipeline class allows Hugging Face models to be run locally.

The Hugging Face Model Hub hosts over 120,000 models, 20,000 datasets, and 50,000 demo apps on its online platform, all open source and publicly available, making it easy for people to collaborate and build ML together.

These models can be used from LangChain either by calling them through this local pipeline wrapper, or by calling their hosted inference endpoints through the HuggingFaceHub class. For more information on hosted pipelines, see the HuggingFaceHub notebook.

To use them, the transformers Python package should be installed, along with PyTorch.

You can also install xformers for a more memory-efficient attention implementation.


%pip install --upgrade --quiet transformers

Set the path to download the model


# Path to download the Hugging Face model/tokenizer (example)
import os

# Set the download path to ./cache/
os.environ["TRANSFORMERS_CACHE"] = "./cache/"
os.environ["HF_HOME"] = "./cache/"

Model Loading

You can load a model by specifying the model parameters with the from_model_id method.

  • Load a pretrained Hugging Face model using the HuggingFacePipeline class.

  • Use the from_model_id method to specify the beomi/llama-2-ko-7b model and set the task to "text-generation".

  • Use the pipeline_kwargs parameter to limit the maximum number of generated tokens to 10.

  • The loaded model is assigned to the hf variable, which can then be used to perform text generation tasks.

Model used: https://huggingface.co/beomi/llama-2-ko-7b

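A minimal sketch of this step, assuming the langchain_huggingface integration package (older versions expose the same class from langchain_community.llms):

from langchain_huggingface import HuggingFacePipeline

# Load beomi/llama-2-ko-7b for text generation, limiting output to 10 new tokens
hf = HuggingFacePipeline.from_model_id(
    model_id="beomi/llama-2-ko-7b",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 10},
)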

You can also load a model by passing in an existing transformers pipeline directly.

Implement a text generation model using HuggingFacePipeline.

  • Load the beomi/llama-2-ko-7b model and tokenizer using AutoTokenizer and AutoModelForCausalLM.

  • Create a "text-generation" pipeline with the pipeline function, setting the model and tokenizer. The maximum number of generated tokens is limited to 10.

  • Create an hf object with the HuggingFacePipeline class, passing in the pipeline created above.

The hf object created this way can be used to generate text for a given prompt.

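A minimal sketch of this step; the model id and token limit follow the description above, and the import paths assume the standard transformers and langchain_huggingface APIs:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_huggingface import HuggingFacePipeline

model_id = "beomi/llama-2-ko-7b"

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Build a transformers text-generation pipeline capped at 10 new tokens
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=10)

# Wrap the existing pipeline in a LangChain LLM
hf = HuggingFacePipeline(pipeline=pipe)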

Create Chain

Once the model is loaded into memory, it can be combined with a prompt to form a chain.

  • Create a prompt template with the PromptTemplate class to define the format of the question and answer.

  • Connect the prompt object and the hf object with the pipe operator to create a chain object.

  • Call the chain.invoke() method to generate and print the answer to a given question.

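A minimal sketch of the chain, reusing the hf object from the previous step; the template wording and example question are illustrative:

from langchain_core.prompts import PromptTemplate

# Define the question/answer format of the prompt (illustrative wording)
template = """Answer the following question concisely.
#Question: {question}

#Answer: """
prompt = PromptTemplate.from_template(template)

# Connect the prompt and the local model into a chain
chain = prompt | hf

question = "What is the capital of South Korea?"
print(chain.invoke({"question": question}))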

GPU Inference

When running on a GPU, you can place the model on a specific device by specifying the device=n parameter.

The default is -1, which runs inference on the CPU.

If you have multiple GPUs, or if your model is too large for a single GPU, you can specify device_map="auto".

In this case, the Accelerate library is required; it is used to automatically determine how to load the model weights.

Caution: device and device_map should not be specified together, as this can cause unexpected behavior.

  • Load the gpt2 model with HuggingFacePipeline, setting the device parameter to 0 so that it runs on the GPU.

  • Use the pipeline_kwargs parameter to limit the maximum number of generated tokens to 10.

  • Connect prompt and gpu_llm with the pipe operator to create gpu_chain.

  • Call the gpu_chain.invoke() method to generate and print the answer to a given question.

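A minimal sketch of GPU inference with gpt2, reusing the prompt from the chain above; device=0 assumes a single CUDA device is available, and the example question is illustrative:

gpu_llm = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation",
    device=0,  # -1 means CPU, 0..n selects a specific GPU
    pipeline_kwargs={"max_new_tokens": 10},
)

# Connect the prompt and the GPU-backed model into a chain
gpu_chain = prompt | gpu_llm

question = "What is the capital of South Korea?"
print(gpu_chain.invoke({"question": question}))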

Batch GPU Inference

When running on a GPU device, you can run inference on the GPU in batch mode.

  • Load the beomi/llama-2-ko-7b model with HuggingFacePipeline and set it to run on the GPU.

  • When creating gpu_llm, set batch_size to 2, temperature to 0, and max_length to 64.

  • Connect prompt and gpu_llm with the pipe operator to create gpu_chain, and set the stop token to "\n\n".

  • Use gpu_chain.batch() to generate answers to the questions in parallel.

  • Print the generated answers with a loop.

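A minimal sketch of batch inference following the settings listed above; the example questions are illustrative, and whether temperature and max_length belong in model_kwargs or pipeline_kwargs may depend on your library version:

gpu_llm = HuggingFacePipeline.from_model_id(
    model_id="beomi/llama-2-ko-7b",
    task="text-generation",
    device=0,      # run on the first GPU
    batch_size=2,  # adjust to your GPU memory and model size
    model_kwargs={"temperature": 0, "max_length": 64},
)

# Stop generation at a blank line
gpu_chain = prompt | gpu_llm.bind(stop=["\n\n"])

# Illustrative batch of questions, answered in parallel
questions = [{"question": f"What number comes after {i}?"} for i in range(4)]

answers = gpu_chain.batch(questions)
for answer in answers:
    print(answer)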
