This JSON splitter creates smaller JSON chunks via a depth-first traversal of the JSON data.
The splitter tries to keep nested JSON objects intact as much as possible, but splits them when necessary to keep chunk sizes between min_chunk_size and max_chunk_size. If a value is a very large string rather than a nested JSON object, that string is not split.
If you need a strict limit on chunk size, consider running a RecursiveCharacterTextSplitter on those chunks afterward.
Split criteria
Text splitting method: based on JSON value
Chunk size measurement method: based on number of characters
%pip install -qU langchain-text-splitters
Use the requests.get() function to fetch JSON data from the "https://api.smith.langchain.com/openapi.json" URL.
The fetched data is converted to a Python dictionary via the .json() method and stored in the json_data variable.
import requests
# Load the JSON data.
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()
The following is an example of splitting JSON data with RecursiveJsonSplitter.
Split the JSON data recursively using the splitter.split_json() method.
Use the splitter.create_documents() method to convert the JSON data into Document objects.
Use the splitter.split_text() method to split the JSON data into a list of strings.
If you print texts[2] and review one of the larger chunks, you can see that it contains a list object.
The reason this chunk exceeds the 300-character limit is that list:
RecursiveJsonSplitter does not split list objects.
You can parse the chunk at index 2 using the json module.
By setting the convert_lists parameter to True, lists inside the JSON are converted into dictionaries of index:item key:value pairs.
You can inspect the document at a specific index of the docs list.
from langchain_text_splitters import RecursiveJsonSplitter

# Create a RecursiveJsonSplitter object that splits JSON data into chunks of up to 300 characters.
splitter = RecursiveJsonSplitter(max_chunk_size=300)

# Split the JSON data recursively. Use this when you need to access or manipulate small pieces of JSON.
json_chunks = splitter.split_json(json_data=json_data)

# Generate Document objects from the JSON data.
docs = splitter.create_documents(texts=[json_data])

# Generate string chunks from the JSON data.
texts = splitter.split_text(json_data=json_data)

# Print the content of the first document.
print(docs[0].page_content)
print("===" * 20)
# Print the first string chunk.
print(texts[0])

# Check the sizes of the chunks.
print([len(text) for text in texts][:10])

# Examining one of the larger chunks, we can see that it contains a list object.
print(texts[1])

# Preprocess the JSON, converting lists into dictionaries with index:item key:value pairs.
texts = splitter.split_text(json_data=json_data, convert_lists=True)

# The list has been converted to a dictionary; check the result.
print(texts[2])