03. HuggingFace Embeddings

from dotenv import load_dotenv

load_dotenv()

True

import os
import warnings

warnings.filterwarnings("ignore")

Sample Data

from langchain_core.documents import Document

texts = [
    "Hello, nice to meet you.",
    "LangChain simplifies the process of building applications with large language models",
    "Langchain Korean Tutorial LangChain Official documentation of, cookbook and based on various practical examples, users can LangChain It is structured to make it easier and more effective to use. ",
    "LangChainSimplifies the process of building applications with super-large language models.",
    "Retrieval-Augmented Generation (RAG) is an effective technique for improving AI responses.",
]

(Reference)

  • (Source) Kor-IR: Embedding Benchmark for Korean Search

HuggingFace Endpoint Embedding

HuggingFaceEndpointEmbeddings internally uses InferenceClient to compute embeddings, so it is very similar in usage to what HuggingFaceEndpoint does for LLMs.
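
A minimal sketch of creating the embedding model (assuming the langchain_huggingface package is installed and HUGGINGFACEHUB_API_TOKEN is set in the environment):

import os

from langchain_huggingface import HuggingFaceEndpointEmbeddings

model_name = "intfloat/multilingual-e5-large-instruct"

hf_embeddings = HuggingFaceEndpointEmbeddings(
    model=model_name,
    task="feature-extraction",  # embedding task
    huggingfacehub_api_token=os.environ["HUGGINGFACEHUB_API_TOKEN"],
)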

Document embeddings are created by calling embed_documents().
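
A sketch of the call, reusing hf_embeddings and the sample texts from above (the printed fields are illustrative):

embedded_documents = hf_embeddings.embed_documents(texts)

print("[HuggingFace Endpoint Embedding]")
print(f"Model: \t\t{model_name}")
print(f"Number of documents: \t{len(embedded_documents)}")
print(f"Dimension: \t{len(embedded_documents[0])}")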

Calculating the similarity between the query and the embedded documents
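
A sketch of the similarity computation with NumPy: embed the query with embed_query(), take the dot product against every document vector (equivalent to cosine similarity when the vectors are normalized), and sort in descending order:

import numpy as np

embedded_query = hf_embeddings.embed_query("Please tell me about LangChain.")

# Dot product between the query vector and every document vector
similarity = np.array(embedded_query) @ np.array(embedded_documents).T

# Document indices sorted by descending similarity
sorted_idx = similarity.argsort()[::-1]

print("[Query] Please tell me about LangChain.\n====================================")
for rank, idx in enumerate(sorted_idx):
    print(f"[{rank}] Similarity: {similarity[idx]:.3f} | {texts[idx]}")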

HuggingFace Embeddings

intfloat/multilingual-e5-large-instruct
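
A sketch of running the same model locally through HuggingFaceEmbeddings (assumes the sentence-transformers package is installed; switch device to "cuda" if a GPU is available):

from langchain_huggingface import HuggingFaceEmbeddings

model_name = "intfloat/multilingual-e5-large-instruct"

hf_embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs={"device": "cpu"},  # or "cuda"
    encode_kwargs={"normalize_embeddings": True},
)

embedded_documents = hf_embeddings.embed_documents(texts)
print(f"Dimension: \t{len(embedded_documents[0])}")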

BGE-M3 Embedding
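
The same HuggingFaceEmbeddings interface can be pointed at BAAI/bge-m3; a sketch:

from langchain_huggingface import HuggingFaceEmbeddings

model_name = "BAAI/bge-m3"

bge_embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)

embedded_documents = bge_embeddings.embed_documents(texts)
print(f"Dimension: \t{len(embedded_documents[0])}")  # BGE-M3 dense vectors are 1024-dimensional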

How to use FlagEmbedding

(Reference) FlagEmbedding - BGE-M3 Usage

By combining the three approaches provided by FlagEmbedding, you can build a more powerful search system (see the sketch after this list).

  • Dense Vector: builds on BGE-M3's multilingual, multi-task capabilities.

  • Sparse Embedding (Lexical Weight): performs exact word matching using per-word lexical weights.

  • Multi-Vector (ColBERT): performs fine-grained, context-aware matching using multiple vectors per text.
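
A minimal sketch of dense retrieval with the FlagEmbedding package (pip install FlagEmbedding):

from FlagEmbedding import BGEM3FlagModel

model_name = "BAAI/bge-m3"

# use_fp16=True speeds up encoding with a small loss of precision
bge_flagmodel = BGEM3FlagModel(model_name, use_fp16=True)

bge_encoded = bge_flagmodel.encode(texts, return_dense=True)
print(bge_encoded["dense_vecs"].shape)  # (5, 1024): one dense vector per text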

Sparse Embedding (Lexical Weight)

Sparse embedding is an embedding method that uses high-dimensional vectors in which most of the values are 0. The lexical-weight approach creates the embedding by taking each word's importance into account.

How it works:

1. Compute the lexical weight for each word. This can be done using methods such as TF-IDF or BM25.

2. For each word in the document or query, assign a value to the corresponding dimension of the sparse vector using the lexical weight of that word.

3. As a result, the document or query is represented as a high-dimensional vector where most of the values are 0.
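
For intuition, a toy sketch of steps 1-3 using scikit-learn's TfidfVectorizer (an illustrative substitute, not part of BGE-M3):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)  # shape: (num_documents, vocabulary_size)

# Each row is a high-dimensional vector in which most entries are 0;
# only the non-zero (weighted) entries are actually stored.
print(tfidf_matrix.shape)
print(tfidf_matrix[0])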

Advantages:

  • It directly reflects the importance of individual words.

  • It can match specific words or phrases exactly.

  • The calculation is relatively fast.
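
A sketch of lexical-weight matching with BGE-M3, reusing bge_flagmodel from above; compute_lexical_matching_score compares two sparse weight dictionaries:

bge_encoded = bge_flagmodel.encode(
    texts,
    return_sparse=True,  # also return the lexical (sparse) weights
)
lexical_weights = bge_encoded["lexical_weights"]

# Similarity of document 0 with itself, and of document 0 with document 1
print(bge_flagmodel.compute_lexical_matching_score(lexical_weights[0], lexical_weights[0]))
print(bge_flagmodel.compute_lexical_matching_score(lexical_weights[0], lexical_weights[1]))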

Multi-Vector (ColBERT)

ColBERT (Contextualized Late Interaction over BERT) is an efficient method for document retrieval. It uses a multi-vector approach that represents documents and queries as multiple vectors.

How it works:

1. A separate vector is generated for each token in the document, so a single document is represented by multiple vectors.

2. The query likewise generates a separate vector for each token.

3. At search time, the similarity between each query token vector and all document token vectors is computed, and these similarities are combined into the final retrieval score.

Advantages:

  • Allows fine-grained, token-level matching.

  • Generates context-sensitive embeddings.

  • Works well for long documents.
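
A sketch of ColBERT-style scoring with BGE-M3, again reusing bge_flagmodel; colbert_score computes the late-interaction similarity between two sets of token vectors:

bge_encoded = bge_flagmodel.encode(
    texts,
    return_colbert_vecs=True,  # one vector per token
)
colbert_vecs = bge_encoded["colbert_vecs"]

# Similarity of document 0 with itself, and of document 0 with document 1
print(bge_flagmodel.colbert_score(colbert_vecs[0], colbert_vecs[0]))
print(bge_flagmodel.colbert_score(colbert_vecs[0], colbert_vecs[1]))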
