09. Text (TextLoader)

TXT Loader

Let's look at how to load files with the .txt extension using a loader.


from langchain_community.document_loaders import TextLoader

# Create a text loader
loader = TextLoader("data/appendix-keywords.txt")

# load document
docs = loader.load()
print(f"Number of documents: {len(docs)}\n")
print("[Metadata]\n")
print(docs[0].metadata)
print("\n========= [Preview] =========\n")
print(docs[0].page_content[:500])


Number of documents: 1

[Metadata]

{'source': 'data/appendix-keywords.txt'}

========= [Preview] =========

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching, grasps the meaning of the user's query, and returns relevant results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated keywords: natural language processing, search algorithms, data mining

Embedding

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
Example: The word "apple" is represented as a vector such as [0.65, -0.23, 0.17].
Associated keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a smaller unit obtained by splitting text. It is usually a word, sentence, or phrase.
Example: Split the sentence "I go to school" into "I", "go", and "to school".
Associated keywords: tokenization, natural language
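
Conceptually, a loader like this reads the file and wraps its text in a single document object together with a source entry in the metadata, which matches the output above (one document, one source). The sketch below mimics that idea with only the standard library. SimpleDocument and load_text_file are hypothetical names used for illustration, and this is not LangChain's implementation.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document (not a LangChain class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path, encoding="utf-8"):
    """Read one text file and return it as a one-element list of documents."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Usage: write a small sample file, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("Semantic Search\n\nDefinition: ...", encoding="utf-8")
    sample_docs = load_text_file(sample)

print(f"Number of documents: {len(sample_docs)}")
print(sample_docs[0].page_content[:15])
```

The whole file becomes one document, so any splitting into smaller chunks has to happen in a later step.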

Automatic detection of file encoding via TextLoader

In this example, we'll look at some strategies that are useful when using the TextLoader class to bulk-load an arbitrary list of files from a directory.

First, let's load multiple text files with different encodings to illustrate the problem.
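
The underlying problem can be reproduced with the standard library alone. The sketch below writes the same sample text in three encodings and then reads each file back as UTF-8. The file names and the sample text are made up for illustration.

```python
import tempfile
from pathlib import Path

sample_text = "의미론적 검색 (Semantic Search)"  # hypothetical sample content
encodings = {"utf8.txt": "utf-8", "cp949.txt": "cp949", "euckr.txt": "euc-kr"}

results = {}
with tempfile.TemporaryDirectory() as tmp:
    # Store the same text once per encoding.
    for name, enc in encodings.items():
        (Path(tmp) / name).write_bytes(sample_text.encode(enc))

    # Read every file as UTF-8, the way a reader with one fixed encoding would.
    for name in encodings:
        try:
            (Path(tmp) / name).read_text(encoding="utf-8")
            results[name] = "ok"
        except UnicodeDecodeError:
            results[name] = "UnicodeDecodeError"

for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

Only the UTF-8 file decodes cleanly. The other two raise UnicodeDecodeError, which is the kind of failure the options below are meant to handle.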

silent_errors : Pass the silent_errors parameter to DirectoryLoader to skip files that cannot be loaded and continue the loading process.
autodetect_encoding : Pass autodetect_encoding to the loader class to have the file encoding detected automatically before the load fails.
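
To make the two options concrete, here is a minimal standard-library sketch of the behavior they describe: try to decode each file, fall back to another encoding, and either skip the file or raise when nothing works. read_with_fallback, load_directory, and the candidate encoding list are hypothetical, and LangChain's actual detection logic may differ.

```python
import tempfile
from pathlib import Path


def read_with_fallback(path, candidates=("utf-8", "cp949")):
    """Return (text, encoding) using the first candidate that decodes cleanly."""
    raw = Path(path).read_bytes()
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path} with {candidates}")


def load_directory(directory, silent_errors=False):
    """Load every .txt file under directory; optionally skip unreadable ones."""
    loaded = {}
    for path in sorted(Path(directory).glob("**/*.txt")):
        try:
            loaded[path.name] = read_with_fallback(path)
        except ValueError:
            if not silent_errors:
                raise
            # silent_errors=True: skip this file and keep going
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a-utf8.txt").write_bytes("검색".encode("utf-8"))
    (Path(tmp) / "b-cp949.txt").write_bytes("검색".encode("cp949"))
    (Path(tmp) / "c-broken.txt").write_bytes(b"\xff\xff\xff")  # invalid in both encodings
    loaded = load_directory(tmp, silent_errors=True)

for name, (text, enc) in loaded.items():
    print(name, enc, text)
```

UTF-8 is tried first on purpose: it rejects most text written in other encodings, whereas a legacy encoding such as CP949 can silently decode unrelated bytes into the wrong characters.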


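from langchain_community.document_loaders import DirectoryLoader, TextLoader

path = "data/"

# Ask TextLoader to detect each file's encoding automatically
text_loader_kwargs = {"autodetect_encoding": True}

loader = DirectoryLoader(
    path,
    glob="**/*.txt",  # match every .txt file under path
    loader_cls=TextLoader,
    silent_errors=True,  # skip files that fail to load
    loader_kwargs=text_loader_kwargs,
)
docs = loader.load()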

data/appendix-keywords.txt and the files with similar names derived from it all contain the same text, saved with different encodings.


doc_sources = [doc.metadata["source"] for doc in docs]
doc_sources


['data/appendix-keywords-CP949.txt', 'data/reference.txt', 'data/appendix-keywords-EUCKR.txt', 'data/chain-of-density.txt', 'data/appendix-keywords.txt', 'data/appendix
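
The list above is not in alphabetical order, so a positional index such as docs[2] may point to a different file in another environment. A small sketch of looking documents up by their source metadata instead follows. FakeDoc is a hypothetical stand-in so the snippet runs without the data files; with real results you would pass docs to index_by_source directly.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in exposing page_content and metadata only."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def index_by_source(documents):
    """Map each document's metadata['source'] to the document itself."""
    return {doc.metadata["source"]: doc for doc in documents}


fake_docs = [
    FakeDoc("first file", {"source": "data/appendix-keywords-CP949.txt"}),
    FakeDoc("second file", {"source": "data/reference.txt"}),
]
by_source = index_by_source(fake_docs)
print(by_source["data/reference.txt"].page_content)
```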


print("[Metadata]\n")
print(docs[2].metadata)
print("\n========= [Preview] =========\n")
print(docs[2].page_content[:500])


[Metadata]

{'source': 'data/appendix-keywords-EUCKR.txt'}

========= [Preview] =========

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching, grasps the meaning of the user's query, and returns relevant results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated keywords: natural language processing, search algorithms, data mining

Embedding

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
Example: The word "apple" is represented as a vector such as [0.65, -0.23, 0.17].
Associated keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a smaller unit obtained by splitting text. It is usually a word, sentence, or phrase.
Example: Split the sentence "I go to school" into "I", "go", and "to school".
Associated keywords: tokenization, natural language


print("[Metadata]\n")
print(docs[3].metadata)
print("\n========= [Preview] =========\n")
print(docs[3].page_content[:500])


[Metadata]

{'source': 'data/chain-of-density.txt'}

========= [Preview] =========

Selecting the “right” amount of information to include in a summary is a difficult task.
A good summary should be detailed and entity-centric without being overly dense and hard to follow. To better understand this tradeoff, we solicit increasingly dense GPT-4 summaries with what we refer to as a “Chain of Density” (CoD) prompt. Specifically, GPT-4 generates an initial entity-sparse summary before iteratively incorporating missing salient entities without increasing the length. Summaries genera


print("[Metadata]\n")
print(docs[4].metadata)
print("\n========= [Preview] =========\n")
print(docs[4].page_content[:500])


[Metadata]

{'source': 'data/appendix-keywords.txt'}

========= [Preview] =========

Semantic Search

Definition: Semantic search is a search method that goes beyond simple keyword matching, grasps the meaning of the user's query, and returns relevant results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated keywords: natural language processing, search algorithms, data mining

Embedding

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
Example: The word "apple" is represented as a vector such as [0.65, -0.23, 0.17].
Associated keywords: natural language processing, vectorization, deep learning

Token

Definition: A token is a smaller unit obtained by splitting text. It is usually a word, sentence, or phrase.
Example: Split the sentence "I go to school" into "I", "go", and "to school".
Associated keywords: tokenization, natural language


