09. Text (TextLoader)

TXT Loader

Let's look at how to load files with the .txt extension using a loader.


from langchain_community.document_loaders import TextLoader

# Create a text loader
loader = TextLoader("data/appendix-keywords.txt")

# load document
docs = loader.load()
print(f"Number of documents: {len(docs)}\n")
print("[Metadata]\n")
print(docs[0].metadata)
print("\n========= [Preview] =========\n")
print(docs[0].page_content[:500])


Number of documents: 1 

[Metadata] 

{'source':'data/appendix-keywords.txt'} 

========= [Preview] =========

Semantic Search 

Definition: A semantic search is a search method that goes beyond a simple keyword match for a user's query and grasps its meaning and returns related results. 
Example: When a user searches for "solar system planets", it returns information about related planets, such as "Jupiter", "Mars", etc.
Associated Keywords: natural language processing, search algorithms, data mining

Embedding 

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. 
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. 
Associated Keywords: natural language processing, vectorization, deep learning 

Token 

Definition: A token is one of the smaller units that text is split into. This is usually a word, sentence, or phrase.
Example: Split the sentence "I go to school" into "I", "go", and "to school".
Associated Keywords: Tokenization, Natural Language

Automatic detection of file encoding via TextLoader

In this example, we'll look at some strategies that are useful when using the TextLoader class to load an arbitrary list of files from a directory in bulk.

First, let's load multiple text files with arbitrary encodings to illustrate the problem.
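To make the problem concrete, here is a minimal sketch that uses only the Python standard library (it is not LangChain code). It writes the same text under two encodings and shows that reading with a fixed UTF-8 assumption fails for one of them. The file names, the sample text, and the `try_read` helper are assumptions made up for illustration.

```python
import tempfile
from pathlib import Path
from typing import List, Optional

# Sample text containing non-ASCII characters (assumed for illustration).
SAMPLE = "Semantic Search: 의미 기반 검색"


def write_samples(directory: Path) -> List[Path]:
    """Write the same text under two different encodings (hypothetical file names)."""
    paths = []
    for name, encoding in [("keywords-utf8.txt", "utf-8"), ("keywords-cp949.txt", "cp949")]:
        path = directory / name
        path.write_text(SAMPLE, encoding=encoding)
        paths.append(path)
    return paths


def try_read(path: Path, encoding: str = "utf-8") -> Optional[str]:
    """Return the decoded text, or None if the bytes are not valid in `encoding`."""
    try:
        return path.read_text(encoding=encoding)
    except UnicodeDecodeError:
        return None


with tempfile.TemporaryDirectory() as tmp:
    results = {p.name: try_read(p) for p in write_samples(Path(tmp))}

for name, text in results.items():
    print(name, "->", "ok" if text is not None else "failed to decode as UTF-8")
```

A loader that assumes a single encoding runs into the same failure on the second file, which is why the options described next are useful when the encodings in a directory are not known in advance.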

  • silent_errors : You can pass the silent_errors parameter to DirectoryLoader to skip files that cannot be loaded and continue the load process.

  • autodetect_encoding : You can also pass autodetect_encoding to the loader class to request that the file encoding be detected automatically before the load fails.


data/appendix-keywords.txt and the derivative files with similar names are all files with different encodings.
