CH08 Embedding

Embedding is the third stage of a Retrieval-Augmented Generation (RAG) system. It converts the document units (chunks) created during the document-splitting stage into a numerical form that machines can process. This step is one of the key parts of a RAG system: by expressing the meaning of each document as a vector (an array of numbers), it makes it possible to calculate similarity when retrieving the document fragments or paragraphs (chunks) stored in the DB for a question (query) entered by the user.
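To make the text-to-vector step concrete, here is a minimal sketch. The function `toy_embed` is a hypothetical stand-in for a real embedding model: it only hashes words into buckets, so it reflects word overlap rather than meaning. The in-memory list `vector_store` is likewise an assumption standing in for the DB.

```python
import hashlib
import math
from typing import List

def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Map text to a fixed-length vector by hashing each word into a bucket.

    Hypothetical stand-in for a real embedding model: it captures
    word overlap only, not meaning.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        # A stable hash (unlike built-in hash()) gives the same vector on every run.
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    # Normalize to unit length so long and short texts are comparable.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

# Example chunks (made up for illustration).
chunks = [
    "The AI software market is growing quickly.",
    "Embedding converts text into vectors.",
]

# The "DB" here is just a list of (chunk, vector) pairs.
vector_store = [(chunk, toy_embed(chunk)) for chunk in chunks]
for chunk, vec in vector_store:
    print(len(vec), chunk)
```

Every chunk ends up as a vector of the same length, whatever the length of the text, which is what allows vectors to be compared with one another later.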

The need for embedding

  1. Understanding meaning: Natural language is highly complex and can carry many different meanings. Converting text into a numerical form through embedding lets computers process and compare the content and meaning of documents.

  2. Improved information retrieval: Converting text into numerical vectors is essential for calculating similarity between documents. This makes it easier to search for related documents or to find the document that best matches a question.
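The text does not say which similarity measure is used. Cosine similarity is one common choice for comparing embedding vectors, so the sketch below assumes it; the function name `cosine_similarity` is illustrative.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat its similarity as 0.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

A score close to 1 means the two vectors point in nearly the same direction, which is read as the two texts being similar.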

Example

Embedding: converting sentences into numerical representations

  • Paragraph 1: [0.1, 0.5, 0.9, ..., 0.1, 0.2]

  • Paragraph 2: [0.7, 0.1, 0.3, ..., 0.5, 0.6]

  • Paragraph 3: [0.9, 0.4, 0.5, ..., 0.4, 0.3]

Question: "What is the average annual growth rate of the AI software market predicted by market research firm IDC?"

  • [0.1, 0.5, 0.9, ..., 0.2, 0.4]

Similarity calculation example

  • Number 1: 80% -> Selected!

  • Number 2: 30%

  • Number 3: 25%
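The selection step in this example can be sketched as follows. The 3-dimensional vectors are hypothetical placeholders, since the real vectors are elided in the example, so the scores will not match the 80% / 30% / 25% figures. Cosine similarity is assumed as the measure, and `rank_by_similarity` is an illustrative name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical, shortened vectors; NOT the elided values from the example.
stored = {
    "Paragraph 1": [0.1, 0.5, 0.9],
    "Paragraph 2": [0.7, 0.1, 0.3],
    "Paragraph 3": [0.9, 0.4, 0.5],
}
query_vector = [0.1, 0.5, 0.8]

def rank_by_similarity(query, store):
    """Score every stored vector against the query, highest score first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in store.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

ranking = rank_by_similarity(query_vector, stored)
for name, score in ranking:
    print(f"{name}: {score:.0%}")

# The top-ranked paragraph is the one passed on as context.
best_name, best_score = ranking[0]
print("Selected:", best_name)
```

With these made-up vectors, Paragraph 1 ranks first because its vector points in almost the same direction as the query vector. In practice a system often keeps the top few results rather than only one.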
