09. Text (TextLoader)
TXT Loader
.txt Let's look at how to load files with extensions into loaders.
Copy
from langchain_community.document_loaders import TextLoader
# Create a text loader
loader = TextLoader("data/appendix-keywords.txt")
# load document
docs = loader.load()
print(f"Number of documents: {len(docs)}\n")
print("[Metadata]\n")
print(docs[0].metadata)
print("\n========= [Preview] =========\n")
print(docs[0].page_content[:500])Copy
Number of documents: 1
[Metadata]
{'source':'data/appendix-keywords.txt'}
========= [Front] Preview =========
Semantic Search
Definition: A semantic search is a search method that goes beyond a simple keyword match for a user's query and grasps its meaning and returns related results.
Example: When a user searches for a "solar planet", it returns information about the related planet, such as "Vegetic", "Mars", etc.
Associates: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17].
Associated Keywords: natural language processing, vectorization, deep learning
Token
Definition: Token means splitting text into smaller units. This can usually be a word, sentence, or verse.
Example: Split the sentence "I go to school" into "I", "to school", and "go".
Associated Keyword: Tokenization, Natural Language Automatic detection of file encoding via TextLoader
In this example, we'll look at some of the strategies that are useful when using the TextLoader class to load random file lists in bulk in a directory.
First, let's load multiple texts with random encodings to illustrate the problem.
silent_errors: You can pass the silent_errors parameter to the directoryer to cross the file that cannot be loaded and continue the load process.autodetect_encoding: You can also pass the auto-sensing_ encoding to the loader class to request that the file encoding be detected automatically before it fails.
Copy
data/appendix-keywords.txt Derivatives with similar file names and file names are all files with different encoding methods.
Copy
Copy
Copy
Copy
Copy
Copy
Copy
Copy
Last updated