08. HuggingFace Pipeline
Hugging Face Local Pipelines
The HuggingFacePipeline class allows Hugging Face models to run locally.
The Hugging Face Model Hub hosts over 120,000 models, 20,000 datasets, and 50,000 demo apps, all open source and publicly available on an online platform where people can easily collaborate and build ML together.
These models can be used from LangChain either through this local pipeline wrapper or by calling their hosted inference endpoints through the HuggingFaceHub class. For more information on hosted pipelines, see the HuggingFaceHub notebook.
To use it, the transformers Python package should be installed, along with PyTorch.
You can also install xformers for a more memory-efficient attention implementation.
%pip install --upgrade --quiet transformers
Set the path to download the model
# Path to download the hugging face model/tokenizer
# (example)
import os
# Set the download path to ./cache/
os.environ["TRANSFORMERS_CACHE"] = "./cache/"
os.environ["HF_HOME"] = "./cache/"
Model Loading
Models can be loaded by specifying the model parameters with the from_model_id method.
Load a pretrained Hugging Face model with the HuggingFacePipeline class. Using the from_model_id method, specify the beomi/llama-2-ko-7b model, set the task to "text-generation", and use the pipeline_kwargs parameter to limit the maximum number of generated tokens to 10.
The loaded model is assigned to the hf variable, which can then be used to perform text generation tasks.
Model used: https://huggingface.co/beomi/llama-2-ko-7b
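Below is a minimal sketch of this step; it assumes HuggingFacePipeline is imported from the langchain_huggingface package (it is also provided by langchain_community.llms).

from langchain_huggingface import HuggingFacePipeline

# Load the model with from_model_id; beomi/llama-2-ko-7b is a 7B model, so the
# download and load require substantial disk space and memory.
hf = HuggingFacePipeline.from_model_id(
    model_id="beomi/llama-2-ko-7b",          # Hugging Face model ID
    task="text-generation",                  # task the pipeline performs
    pipeline_kwargs={"max_new_tokens": 10},  # limit generation to 10 new tokens
)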
You can also load a model by passing an existing transformers pipeline directly.
Implement a text generation model using HuggingFacePipeline.
Load the beomi/llama-2-ko-7b model and tokenizer using AutoTokenizer and AutoModelForCausalLM. Create a "text-generation" pipeline with the pipeline function, passing the model and tokenizer and limiting the maximum number of generated tokens to 10. Then create an hf object with the HuggingFacePipeline class, passing it the pipeline you created.
The resulting hf object can be used to generate text for a given prompt.
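A sketch of this approach, under the same import assumption:

from langchain_huggingface import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "beomi/llama-2-ko-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)     # load the tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id)  # load the model weights

# Build a transformers text-generation pipeline, capped at 10 new tokens.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=10)

# Wrap the existing pipeline in a LangChain LLM object.
hf = HuggingFacePipeline(pipeline=pipe)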
Create Chain
When the model is loaded into memory, it can be configured with a prompt to form a chain.
Create a prompt template with the PromptTemplate class to define the format of the question and answer. Connect the prompt object to the hf object with the pipe operator to create a chain object. Call the chain.invoke() method to generate and print the answer to a given question.
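A minimal sketch, reusing the hf object from above with a hypothetical example question:

from langchain_core.prompts import PromptTemplate

# Define the question/answer format for the prompt.
template = """Question: {question}

Answer: """
prompt = PromptTemplate.from_template(template)

# Connect the prompt to the model with the pipe operator to form a chain.
chain = prompt | hf

question = "What is the capital of South Korea?"  # hypothetical example question
print(chain.invoke({"question": question}))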
GPU Inference
When running on a GPU, you can place the model on a specific device by specifying the device=n parameter.
The default is -1, which runs inference on the CPU.
If you have multiple GPUs, or if the model is too large for a single GPU, you can specify device_map="auto".
In this case, the Accelerate library is required; it is used to automatically determine how to load the model weights.
Caution: device and device_map should not be specified together, as this can cause unexpected behavior.
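One way to sketch the multi-GPU case is to build the transformers pipeline directly with device_map="auto" and wrap it; this assumes the accelerate package is installed and reuses the same model ID for illustration.

from langchain_huggingface import HuggingFacePipeline
from transformers import pipeline

# Let Accelerate decide how to place the model weights across available devices.
pipe = pipeline(
    "text-generation",
    model="beomi/llama-2-ko-7b",
    device_map="auto",   # requires the accelerate library; do not also pass device=
    max_new_tokens=10,
)
multi_gpu_llm = HuggingFacePipeline(pipeline=pipe)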
Load the gpt2 model with HuggingFacePipeline, setting the device parameter to 0 so it runs on the GPU, and use the pipeline_kwargs parameter to limit the maximum number of generated tokens to 10. Connect prompt and gpu_llm with the pipe operator to create gpu_chain. Call the gpu_chain.invoke() method to generate and print the answer to a given question.
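A sketch of this step, assuming a CUDA-capable GPU at index 0 and the prompt object defined earlier:

from langchain_huggingface import HuggingFacePipeline

gpu_llm = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation",
    device=0,                                # place the model on GPU 0 (-1 = CPU)
    pipeline_kwargs={"max_new_tokens": 10},  # limit generation to 10 new tokens
)

gpu_chain = prompt | gpu_llm  # connect the prompt to the GPU-backed model

question = "What is the capital of France?"  # hypothetical example question
print(gpu_chain.invoke({"question": question}))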
Batch GPU Inference
When running on a GPU device, you can run inference in batch mode.
Load the beomi/llama-2-ko-7b model with HuggingFacePipeline and set it to run on the GPU. When creating gpu_llm, set batch_size to 2, temperature to 0, and max_length to 64. Connect prompt and gpu_llm with the pipe operator to create gpu_chain, and set the stop token to "\n\n". Use gpu_chain.batch() to generate answers to the questions in parallel, then print the generated answers in a loop.
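A sketch of the batch setup, reusing the earlier prompt with hypothetical example questions:

from langchain_huggingface import HuggingFacePipeline

gpu_llm = HuggingFacePipeline.from_model_id(
    model_id="beomi/llama-2-ko-7b",
    task="text-generation",
    device=0,        # run on GPU 0
    batch_size=2,    # number of prompts processed together
    model_kwargs={"temperature": 0, "max_length": 64},
)

# Stop generation when the "\n\n" token is produced.
gpu_chain = prompt | gpu_llm.bind(stop=["\n\n"])

# Hypothetical example questions.
questions = [{"question": f"What is the number {i} in Korean?"} for i in range(4)]

answers = gpu_chain.batch(questions)  # generate answers for all questions in parallel
for answer in answers:
    print(answer)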