03. HuggingFace Embeddings

from dotenv import load_dotenv

load_dotenv()

True

import os
import warnings

warnings.filterwarnings("ignore")

Sample Data

from langchain_core.documents import Document

texts = [
    "Hello, nice to meet you.",
    "LangChain simplifies the process of building applications with large language models",
    "Langchain Korean Tutorial LangChain Official documentation of, cookbook and based on various practical examples, users can LangChain It is structured to make it easier and more effective to use. ",
    "LangChainSimplifies the process of building applications with super-large language models.",
    "Retrieval-Augmented Generation (RAG) is an effective technique for improving AI responses.",
]

(Reference)

  • (Source) Kor-IR: Embedding Benchmark for Korean Search

HuggingFace Endpoint Embedding

HuggingFaceEndpointEmbeddings internally uses InferenceClient to compute embeddings, so it is very similar in usage to what HuggingFaceEndpoint does for LLMs.
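
A minimal sketch of creating the embedding model (assuming the langchain_huggingface package is installed and HUGGINGFACEHUB_API_TOKEN is set in the environment):

import os

from langchain_huggingface import HuggingFaceEndpointEmbeddings

model_name = "intfloat/multilingual-e5-large-instruct"

hf_embeddings = HuggingFaceEndpointEmbeddings(
    model=model_name,
    task="feature-extraction",  # embedding task
    huggingfacehub_api_token=os.environ["HUGGINGFACEHUB_API_TOKEN"],
)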

Document embeddings are created by calling embed_documents().
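
A sketch of the call, reusing hf_embeddings and the sample texts from above (the printed fields are illustrative):

embedded_documents = hf_embeddings.embed_documents(texts)

print("[HuggingFace Endpoint Embedding]")
print(f"Model: \t\t{model_name}")
print(f"Number of documents: \t{len(embedded_documents)}")
print(f"Dimension: \t{len(embedded_documents[0])}")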

Calculating the similarity between the query and the embedded documents
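
A sketch of the similarity computation with NumPy: embed the query with embed_query(), take the dot product against every document vector (equivalent to cosine similarity when the vectors are normalized), and sort in descending order:

import numpy as np

embedded_query = hf_embeddings.embed_query("Please tell me about LangChain.")

# Dot product between the query vector and every document vector
similarity = np.array(embedded_query) @ np.array(embedded_documents).T

# Document indices sorted by descending similarity
sorted_idx = similarity.argsort()[::-1]

print("[Query] Please tell me about LangChain.\n====================================")
for rank, idx in enumerate(sorted_idx):
    print(f"[{rank}] Similarity: {similarity[idx]:.3f} | {texts[idx]}")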

HuggingFace Embeddings

intfloat/multilingual-e5-large-instruct
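
A sketch of running the same model locally through HuggingFaceEmbeddings (assumes the sentence-transformers package is installed; switch device to "cuda" if a GPU is available):

from langchain_huggingface import HuggingFaceEmbeddings

model_name = "intfloat/multilingual-e5-large-instruct"

hf_embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs={"device": "cpu"},  # or "cuda"
    encode_kwargs={"normalize_embeddings": True},
)

embedded_documents = hf_embeddings.embed_documents(texts)
print(f"Dimension: \t{len(embedded_documents[0])}")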

BGE-M3 Embedding
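
The same HuggingFaceEmbeddings interface can be pointed at BAAI/bge-m3; a sketch:

from langchain_huggingface import HuggingFaceEmbeddings

model_name = "BAAI/bge-m3"

bge_embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)

embedded_documents = bge_embeddings.embed_documents(texts)
print(f"Dimension: \t{len(embedded_documents[0])}")  # BGE-M3 dense vectors are 1024-dimensional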

How to use FlagEmbedding

(Reference) FlagEmbedding - BGE-M3 Usage

By combining the three approaches provided by FlagEmbedding, you can build a more powerful search system (see the sketch after this list).

  • Dense Vector: builds on BGE-M3's multilingual, multi-task capabilities.

  • Sparse Embedding (Lexical Weight): performs exact word matching using per-word lexical weights.

  • Multi-Vector (ColBERT): performs fine-grained, context-aware matching using multiple vectors per text.
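
A minimal sketch of dense retrieval with the FlagEmbedding package (pip install FlagEmbedding):

from FlagEmbedding import BGEM3FlagModel

model_name = "BAAI/bge-m3"

# use_fp16=True speeds up encoding with a small loss of precision
bge_flagmodel = BGEM3FlagModel(model_name, use_fp16=True)

bge_encoded = bge_flagmodel.encode(texts, return_dense=True)
print(bge_encoded["dense_vecs"].shape)  # (5, 1024): one dense vector per text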

Sparse Embedding (Lexical Weight)

Sparse embedding is an embedding method that uses high-dimensional vectors in which most of the values are 0. The lexical-weight approach creates the embedding by taking each word's importance into account.

How it works:

1. Compute the lexical weight for each word. This can be done using methods such as TF-IDF or BM25.

2. For each word in the document or query, assign a value to the corresponding dimension of the sparse vector using the lexical weight of that word.

3. As a result, the document or query is represented as a high-dimensional vector where most of the values are 0.
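
For intuition, a toy sketch of steps 1-3 using scikit-learn's TfidfVectorizer (an illustrative substitute, not part of BGE-M3):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)  # shape: (num_documents, vocabulary_size)

# Each row is a high-dimensional vector in which most entries are 0;
# only the non-zero (weighted) entries are actually stored.
print(tfidf_matrix.shape)
print(tfidf_matrix[0])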

Advantages:

  • It directly reflects the importance of individual words.

  • It can match specific words or phrases exactly.

  • The calculation is relatively fast.
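
A sketch of lexical-weight matching with BGE-M3, reusing bge_flagmodel from above; compute_lexical_matching_score compares two sparse weight dictionaries:

bge_encoded = bge_flagmodel.encode(
    texts,
    return_sparse=True,  # also return the lexical (sparse) weights
)
lexical_weights = bge_encoded["lexical_weights"]

# Similarity of document 0 with itself, and of document 0 with document 1
print(bge_flagmodel.compute_lexical_matching_score(lexical_weights[0], lexical_weights[0]))
print(bge_flagmodel.compute_lexical_matching_score(lexical_weights[0], lexical_weights[1]))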

Multi-Vector (ColBERT)

ColBERT (Contextualized Late Interaction over BERT) is an efficient method for document retrieval. It uses a multi-vector approach that represents documents and queries as multiple vectors.

How it works:

1. A separate vector is generated for each token in the document, so a single document is represented by multiple vectors.

2. The query likewise generates a separate vector for each token.

3. At search time, the similarity between each query token vector and all document token vectors is computed, and these similarities are combined into the final retrieval score.

Advantages:

  • Allows fine-grained, token-level matching.

  • Generates context-sensitive embeddings.

  • Works well for long documents.
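
A sketch of ColBERT-style scoring with BGE-M3, again reusing bge_flagmodel; colbert_score computes the late-interaction similarity between two sets of token vectors:

bge_encoded = bge_flagmodel.encode(
    texts,
    return_colbert_vecs=True,  # one vector per token
)
colbert_vecs = bge_encoded["colbert_vecs"]

# Similarity of document 0 with itself, and of document 0 with document 1
print(bge_flagmodel.colbert_score(colbert_vecs[0], colbert_vecs[0]))
print(bge_flagmodel.colbert_score(colbert_vecs[0], colbert_vecs[1]))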
