08. HuggingFace Pipeline

Hugging Face Local Pipelines

The HuggingFacePipeline class allows Hugging Face models to be run locally.

The Hugging Face Model Hub hosts over 120,000 models, 20,000 datasets, and 50,000 demo apps on its online platform, all open source and publicly available, making it easy for people to collaborate and build ML together.

These models can be used from LangChain either by calling them through this local pipeline wrapper, or by calling their hosted inference endpoints through the HuggingFaceHub class. For more information on hosted pipelines, see the HuggingFaceHub notebook.

To use them, the transformers Python package should be installed, along with PyTorch.

You can also install xformers for a more memory-efficient attention implementation.


%pip install --upgrade --quiet transformers

Set the path to download the model


# Path to download the Hugging Face model/tokenizer (example)
import os

# Set the download path to ./cache/
os.environ["TRANSFORMERS_CACHE"] = "./cache/"
os.environ["HF_HOME"] = "./cache/"

Model Loading

You can load a model by specifying the model parameters with the from_model_id method.

  • Load a pretrained Hugging Face model using the HuggingFacePipeline class.

  • Use the from_model_id method to specify the beomi/llama-2-ko-7b model and set the task to "text-generation".

  • Use the pipeline_kwargs parameter to limit the maximum number of generated tokens to 10.

  • The loaded model is assigned to the hf variable, which can then be used to perform text generation tasks.

Model used: https://huggingface.co/beomi/llama-2-ko-7b

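A minimal sketch of this step, assuming the langchain_huggingface integration package (older versions expose the same class from langchain_community.llms):

from langchain_huggingface import HuggingFacePipeline

# Load beomi/llama-2-ko-7b for text generation, limiting output to 10 new tokens
hf = HuggingFacePipeline.from_model_id(
    model_id="beomi/llama-2-ko-7b",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 10},
)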

You can also load a model by passing in an existing transformers pipeline directly.

Implement a text generation model using HuggingFacePipeline.

  • Load the beomi/llama-2-ko-7b model and tokenizer using AutoTokenizer and AutoModelForCausalLM.

  • Create a "text-generation" pipeline with the pipeline function, setting the model and tokenizer. The maximum number of generated tokens is limited to 10.

  • Create an hf object with the HuggingFacePipeline class, passing in the pipeline created above.

The hf object created this way can be used to generate text for a given prompt.

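A minimal sketch of this step; the model id and token limit follow the description above, and the import paths assume the standard transformers and langchain_huggingface APIs:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_huggingface import HuggingFacePipeline

model_id = "beomi/llama-2-ko-7b"

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Build a transformers text-generation pipeline capped at 10 new tokens
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=10)

# Wrap the existing pipeline in a LangChain LLM
hf = HuggingFacePipeline(pipeline=pipe)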

Create Chain

Once the model is loaded into memory, it can be combined with a prompt to form a chain.

  • Create a prompt template with the PromptTemplate class to define the format of the question and answer.

  • Connect the prompt object and the hf object with the pipe operator to create a chain object.

  • Call the chain.invoke() method to generate and print the answer to a given question.

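A minimal sketch of the chain, reusing the hf object from the previous step; the template wording and example question are illustrative:

from langchain_core.prompts import PromptTemplate

# Define the question/answer format of the prompt (illustrative wording)
template = """Answer the following question concisely.
#Question: {question}

#Answer: """
prompt = PromptTemplate.from_template(template)

# Connect the prompt and the local model into a chain
chain = prompt | hf

question = "What is the capital of South Korea?"
print(chain.invoke({"question": question}))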

GPU Inference

When running on a GPU, you can place the model on a specific device by specifying the device=n parameter.

The default is -1, which runs inference on the CPU.

If you have multiple GPUs, or if your model is too large for a single GPU, you can specify device_map="auto".

In this case, the Accelerate library is required; it is used to automatically determine how to load the model weights.

Caution: device and device_map should not be specified together, as this can cause unexpected behavior.

  • Load the gpt2 model with HuggingFacePipeline, setting the device parameter to 0 so that it runs on the GPU.

  • Use the pipeline_kwargs parameter to limit the maximum number of generated tokens to 10.

  • Connect prompt and gpu_llm with the pipe operator to create gpu_chain.

  • Call the gpu_chain.invoke() method to generate and print the answer to a given question.

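A minimal sketch of GPU inference with gpt2, reusing the prompt from the chain above; device=0 assumes a single CUDA device is available, and the example question is illustrative:

gpu_llm = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation",
    device=0,  # -1 means CPU, 0..n selects a specific GPU
    pipeline_kwargs={"max_new_tokens": 10},
)

# Connect the prompt and the GPU-backed model into a chain
gpu_chain = prompt | gpu_llm

question = "What is the capital of South Korea?"
print(gpu_chain.invoke({"question": question}))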

Batch GPU Inference

When running on a GPU device, you can run inference on the GPU in batch mode.

  • Load the beomi/llama-2-ko-7b model with HuggingFacePipeline and set it to run on the GPU.

  • When creating gpu_llm, set batch_size to 2, temperature to 0, and max_length to 64.

  • Connect prompt and gpu_llm with the pipe operator to create gpu_chain, and set the stop token to "\n\n".

  • Use gpu_chain.batch() to generate answers to the questions in parallel.

  • Print the generated answers with a loop.

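A minimal sketch of batch inference following the settings listed above; the example questions are illustrative, and whether temperature and max_length belong in model_kwargs or pipeline_kwargs may depend on your library version:

gpu_llm = HuggingFacePipeline.from_model_id(
    model_id="beomi/llama-2-ko-7b",
    task="text-generation",
    device=0,      # run on the first GPU
    batch_size=2,  # adjust to your GPU memory and model size
    model_kwargs={"temperature": 0, "max_length": 64},
)

# Stop generation at a blank line
gpu_chain = prompt | gpu_llm.bind(stop=["\n\n"])

# Illustrative batch of questions, answered in parallel
questions = [{"question": f"What number comes after {i}?"} for i in range(4)]

answers = gpu_chain.batch(questions)
for answer in answers:
    print(answer)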
