09. TimeWeightedVectorStoreRetriever

TimeWeightedVectorStoreRetriever is a retriever that combines semantic similarity with decay over time. As a result, it takes both the "freshness" and the "relevance" of a document or data item into account when returning results.

The scoring algorithm consists of:

$$\text{semantic\_similarity} + (1.0 - \text{decay\_rate})^{\text{hours\_passed}}$$

Here, semantic_similarity is the semantic similarity between the query and a document or data item, decay_rate is the rate at which the score decreases over time, and hours_passed is the time (in hours) that has elapsed since the object was last accessed.
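The formula can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not LangChain's implementation; the function name combined_score and the sample values (similarity 0.8, decay_rate 0.01) are assumptions chosen for the example.

```python
def combined_score(semantic_similarity: float, decay_rate: float, hours_passed: float) -> float:
    """Illustrative score: similarity plus a recency term that decays with time."""
    recency = (1.0 - decay_rate) ** hours_passed
    return semantic_similarity + recency


# Same similarity, different time since last access (hypothetical values).
fresh = combined_score(0.8, decay_rate=0.01, hours_passed=0)
day_old = combined_score(0.8, decay_rate=0.01, hours_passed=24)

print(fresh, day_old)
```

With the same similarity, the object accessed 24 hours ago scores lower than the one accessed just now, because its recency term has decayed.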

The main feature of this approach is that it evaluates the "freshness of information" based on when an object was last accessed. In other words, frequently accessed objects keep a high score over time, so frequently used or important information is more likely to appear at the top of the search results. This yields dynamic search results that take both recency and relevance into account.

Notably, hours_passed refers to the time elapsed since the object was last accessed, not since it was created in the retriever. In other words, frequently accessed objects stay "fresh".
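The effect of measuring time from the last access rather than from creation can be illustrated as follows. This is a minimal sketch under stated assumptions: the helper recency, the timestamps, and decay_rate=0.01 are made up for the example and do not come from the library.

```python
from datetime import datetime, timedelta


def recency(last_accessed_at: datetime, now: datetime, decay_rate: float) -> float:
    """Recency term computed from the hours elapsed since the last access."""
    hours_passed = (now - last_accessed_at).total_seconds() / 3600
    return (1.0 - decay_rate) ** hours_passed


now = datetime(2024, 8, 30, 12, 0)
created_at = now - timedelta(days=7)  # both objects were created a week ago

# Object A has not been accessed since creation; object B was accessed an hour ago.
never_touched = recency(created_at, now, decay_rate=0.01)
touched_recently = recency(now - timedelta(hours=1), now, decay_rate=0.01)

print(never_touched, touched_recently)
```

Both objects are equally old, but the one accessed an hour ago keeps a recency term close to 1, while the untouched one has decayed substantially.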

# Configuration for managing API keys as environment variables.
from dotenv import load_dotenv

# Load the API key information.
load_dotenv()

 True

# Set up LangSmith tracing. https://smith.langchain.com
# !pip install langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH11-Retriever")

 Start tracking LangSmith. 
[Project name] 
CH11-Retriever

Low decay_rate

A low decay_rate (here set extremely close to 0) means that memories are "remembered" longer.
A decay_rate of 0 means memories are never forgotten, which makes this retriever equivalent to a plain vector lookup.

Initialize TimeWeightedVectorStoreRetriever with the vector store, set the decay rate (decay_rate) to a very small value, and set the number of vectors to retrieve (k) to 1.

from datetime import datetime, timedelta

import faiss
from langchain.docstore import InMemoryDocstore
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Define an embedding model.
embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")

# Initialize an empty vector store.
embedding_size = 1536
index = faiss.IndexFlatL2(embedding_size)
vectorstore = FAISS(embeddings_model, index, InMemoryDocstore({}), {})

# Initialize the time-weighted vector store retriever (here with a very low decay rate).
retriever = TimeWeightedVectorStoreRetriever(
    vectorstore=vectorstore, decay_rate=0.0000000000000000000000001, k=1
)

Add simple example data.

# Calculate yesterday's date.
yesterday = datetime.now() - timedelta(days=1)

retriever.add_documents(
    # Add a document and set yesterday's date in metadata.
    [
        Document(
            page_content="Please subscribe to Teddy Note.",
            metadata={"last_accessed_at": yesterday},
        )
    ]
)

# Add another document without setting metadata explicitly.
retriever.add_documents([Document(page_content="Would you like to subscribe to Teddy Note? Please!")])

 ['a6c732c4-adb2-45d1-bcbb-a5108a9778f7']

Call retriever.invoke() to perform a search.

"Please subscribe to Teddy Note." is returned first because it is the most salient document.
Because decay_rate is close to 0, that document is still considered recent.

# "Please subscribe to Teddy Note." is returned first because it is the most salient.
# Since the decay rate is close to 0, the document is still considered recent.
retriever.invoke("teddy note")

 [Document (metadata={'last_accessed_at': datetime.datetime (2024, 8, 30, 22, 1, 49, 841379),'created_at': datetime.datetime (2024, 8, 30,2

High decay_rate

With a high decay_rate (e.g., 0.9999...), the recency score quickly converges to 0.

(If you set this value to 1, the recency value is 0 for every object, and you again get the same result as a plain vector lookup.)

Initialize the retriever with TimeWeightedVectorStoreRetriever, setting decay_rate to 0.999 so that the time weight drops off quickly.

# Define an embedding model.
embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")

# Initialize an empty vector store.
embedding_size = 1536
index = faiss.IndexFlatL2(embedding_size)
vectorstore = FAISS(embeddings_model, index, InMemoryDocstore({}), {})

# Initialize the time-weighted vector store retriever (here with a high decay rate).
retriever = TimeWeightedVectorStoreRetriever(
    vectorstore=vectorstore, decay_rate=0.999, k=1
)

Add a new document again.

# Calculate yesterday's date.
yesterday = datetime.now() - timedelta(days=1)

retriever.add_documents(
    # Add a document and set yesterday's date in metadata.
    [
        Document(
            page_content="Please subscribe to Teddy Note.",
            metadata={"last_accessed_at": yesterday},
        )
    ]
)

# Add another document without setting metadata explicitly.
retriever.add_documents([Document(page_content="Would you like to subscribe to Teddy Note? Please!")])

 ['c3349ba9-75c7-49ec-be7a-017bc0917fa2']

When you call retriever.invoke("teddy note"), "Would you like to subscribe to Teddy Note? Please!" is returned first. This is because the retriever has mostly forgotten the document "Please subscribe to Teddy Note.", which was last accessed a day earlier.

# Check results after search
retriever.invoke("teddy note")

 [Document (metadata={'last_accessed_at': datetime.datetime (2024, 8, 30, 22, 3, 18, 331780),'created_at': datetime.datetime(2024, 8, 30,2 Please!')]

Summary of the decay rate (decay_rate)

When decay_rate is very small (e.g., 0.000001):
The decay rate (i.e., the rate at which information is forgotten) is very low, so information is almost never forgotten. The time weight therefore differs little between recent and old documents, and similarity dominates the score.

When decay_rate is close to 1 (e.g., 0.999):
The decay rate is very high, so past information is almost entirely forgotten. In this case, recently accessed documents receive higher scores.
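The two regimes can be compared with a small numeric sketch. The document names and similarity values below are hypothetical, and combined_score only mirrors the scoring formula from the beginning of this page; it is not the retriever's actual code.

```python
def combined_score(semantic_similarity: float, decay_rate: float, hours_passed: float) -> float:
    """Illustrative score: similarity plus a recency term that decays with time."""
    return semantic_similarity + (1.0 - decay_rate) ** hours_passed


# Hypothetical documents: (semantic_similarity, hours since last access)
docs = {
    "old_but_similar": (0.9, 24),
    "new_but_less_similar": (0.7, 0),
}


def top_document(decay_rate: float) -> str:
    """Return the name of the document with the highest combined score."""
    return max(
        docs,
        key=lambda name: combined_score(docs[name][0], decay_rate, docs[name][1]),
    )


low_decay_winner = top_document(0.000001)
high_decay_winner = top_document(0.999)

print(low_decay_winner, high_decay_winner)
```

With the low decay rate, the recency terms are nearly equal and the more similar document wins. With the high decay rate, the recency term of the day-old document collapses toward 0 and the recently accessed document wins.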

Adjusting `decay_rate` with virtual time

LangChain provides utilities that let you mock the time component.

mock_now is a utility function provided by LangChain that mocks the current time.

import datetime

from langchain.utils import mock_now

# mock_now is a context manager, so the mocked time applies only inside the with block.
with mock_now(datetime.datetime(2024, 8, 30, 0, 0)):
    # Print the (mocked) current time
    print(datetime.datetime.now())

 2024-08-30 00:00:00

With mock_now, you can test your search results while changing the current time.

This makes it easier to find an appropriate decay_rate.

[Caution] If you set the time too far in the past, an error may occur when the decayed score is calculated.

# Temporarily change the current time to an arbitrary point.
with mock_now(datetime.datetime(2024, 8, 29, 0, 0)):
    # Run the search as of the mocked time.
    print(retriever.invoke("teddy note"))

 [Document (metadata={'last_accessed_at': MockDateTime (2024, 8, 29, 0, 0),'created_at': datetime.datetime (2024, 8, 30, 22, 2, 44, 6187 Please!')]
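To see why the mocked time applies only inside a with block, here is a minimal sketch of how a time-mocking context manager can be built with the standard library. The helper fake_now and the class FrozenDateTime are hypothetical names for this illustration; this is not LangChain's mock_now, only the same general idea.

```python
import contextlib
import datetime


@contextlib.contextmanager
def fake_now(fixed: datetime.datetime):
    """Temporarily make datetime.datetime.now() return `fixed` (hypothetical helper)."""
    real_datetime = datetime.datetime

    class FrozenDateTime(real_datetime):
        @classmethod
        def now(cls, tz=None):
            return fixed

    datetime.datetime = FrozenDateTime  # patch the module attribute
    try:
        yield
    finally:
        datetime.datetime = real_datetime  # always restore the real class


fixed_time = datetime.datetime(2024, 8, 29, 0, 0)

with fake_now(fixed_time):
    inside = datetime.datetime.now()  # returns the fixed time

outside = datetime.datetime.now()  # real clock again

print(inside)
```

Code that calls datetime.datetime.now() through the datetime module sees the fixed time while the block is active, and the real clock is restored as soon as the block exits.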

Last updated 5 months ago
