When working with large datasets, it is often more efficient to process data in smaller chunks rather than loading everything into memory at once. This approach reduces memory usage and can speed up processing, especially when dealing with very large files or databases.
Below are some Python code snippets that demonstrate how to handle large datasets in chunks for different scenarios:
1. Reading a Large File in Chunks
Using read() to process large text files in smaller chunks.
def read_large_file(file_path, chunk_size=1024):
    with open(file_path, 'r') as file:
        while chunk := file.read(chunk_size):
            process_chunk(chunk)

def process_chunk(chunk):
    # Process the chunk (e.g., parse, analyze, etc.)
    print(f"Processing chunk: {chunk[:50]}...")  # Preview of first 50 characters

# Example usage
read_large_file('large_file.txt', chunk_size=2048)
Explanation:
The read_large_file function reads the file in fixed-size chunks (1,024 characters by default, since the file is opened in text mode).
The process_chunk function processes each chunk; replace it with your actual data processing logic.
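One caveat with fixed-size read() chunks is that a boundary can fall in the middle of a line or record. If that matters, a minimal sketch like the one below (with a hypothetical process_line helper) carries the incomplete final line of each chunk over into the next one:

def read_large_file_by_lines(file_path, chunk_size=1024):
    leftover = ''
    with open(file_path, 'r') as file:
        while chunk := file.read(chunk_size):
            chunk = leftover + chunk
            lines = chunk.split('\n')
            leftover = lines.pop()  # The possibly incomplete final line
            for line in lines:
                process_line(line)
    if leftover:
        process_line(leftover)  # Whatever remains after the last chunk

def process_line(line):
    print(f"Processing line: {line[:50]}")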
2. Processing Large CSV Files in Chunks with pandas
Using pandas.read_csv() to read large CSV files in chunks.
import pandas as pd

def process_csv_in_chunks(file_path, chunk_size=100000):
    chunk_iter = pd.read_csv(file_path, chunksize=chunk_size)
    for chunk in chunk_iter:
        process_chunk(chunk)

def process_chunk(chunk):
    # Process the chunk (e.g., filtering, aggregation)
    print(f"Processing chunk with {len(chunk)} rows")

# Example usage
process_csv_in_chunks('large_data.csv', chunk_size=50000)
Explanation:
Passing chunksize to pandas.read_csv() returns an iterator of DataFrames, so only one chunk is held in memory at a time.
chunksize defines the number of rows per chunk.
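Chunked reading pairs naturally with incremental aggregation, since each chunk is an ordinary DataFrame. A minimal sketch, assuming the CSV has a numeric column named 'amount' (a made-up column used only for illustration), keeps a running total across chunks:

import pandas as pd

def total_amount(file_path, chunk_size=100000):
    total = 0.0
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        total += chunk['amount'].sum()  # Aggregate each chunk, then combine the results
    return total

# Example usage
print(total_amount('large_data.csv'))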
3. Processing Large JSON Files in Chunks
Using ijson for iterating through large JSON files.
import ijson

def process_large_json(file_path):
    with open(file_path, 'r') as file:
        objects = ijson.items(file, 'item')  # 'item' matches each element of a top-level JSON array
        for obj in objects:
            process_chunk(obj)

def process_chunk(chunk):
    # Process each chunk of the JSON data
    print(f"Processing chunk: {chunk}")

# Example usage
process_large_json('large_file.json')
Explanation:
ijson allows you to parse JSON files incrementally, loading data one object at a time.
This approach avoids loading the entire file into memory.
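The second argument to ijson.items() is a prefix describing where the target objects live: 'item' matches the elements of a top-level array, while nested data needs the parent keys in the prefix. A minimal sketch, assuming a file shaped like {"results": [...]} (a made-up structure for illustration):

import ijson

def process_nested_json(file_path):
    with open(file_path, 'rb') as file:
        # 'results.item' matches each element of the array under the "results" key
        for obj in ijson.items(file, 'results.item'):
            print(f"Processing object: {obj}")

# Example usage
process_nested_json('large_file.json')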
4. Handling Large SQL Queries in Chunks
Using sqlite3 to fetch large results in chunks.
import sqlite3

def fetch_data_in_chunks(db_path, query, chunk_size=1000):
    connection = sqlite3.connect(db_path)
    cursor = connection.cursor()
    cursor.execute(query)
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        process_chunk(rows)
    connection.close()

def process_chunk(rows):
    # Process the chunk (e.g., data analysis, transformation)
    print(f"Processing {len(rows)} rows")

# Example usage
fetch_data_in_chunks('large_database.db', 'SELECT * FROM large_table')
Explanation:
The fetchmany() method fetches a fixed number of rows at a time from the database, so the full result set never has to be held in memory.
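If you prefer to keep the chunking separate from the processing, the fetchmany() loop can be wrapped in a generator that yields one chunk of rows at a time. A minimal sketch:

import sqlite3

def iter_query_in_chunks(db_path, query, chunk_size=1000):
    connection = sqlite3.connect(db_path)
    try:
        cursor = connection.cursor()
        cursor.execute(query)
        while rows := cursor.fetchmany(chunk_size):
            yield rows  # Each item is a list of up to chunk_size row tuples
    finally:
        connection.close()

# Example usage
for rows in iter_query_in_chunks('large_database.db', 'SELECT * FROM large_table'):
    print(f"Got {len(rows)} rows")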
5. Processing Large Data in Chunks with Dask
Dask allows for parallel processing of large datasets.
import dask.dataframe as dd

def process_large_data_with_dask(file_path):
    df = dd.read_csv(file_path)  # Lazily reads the CSV in partitions (chunks)
    df = df[df['column_name'] > 100]  # Example filtering
    return df.compute()  # Triggers the computation and returns a pandas DataFrame

# Example usage
result = process_large_data_with_dask('large_file.csv')
Explanation:
Dask splits the DataFrame into partitions and parallelizes work across them, handling the chunking internally. It is particularly useful for large-scale data operations that need to scale out across cores or machines.
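Because Dask splits the DataFrame into partitions (its version of chunks), per-chunk logic can be expressed with map_partitions, and reductions such as sums are computed partition by partition. A minimal sketch, assuming hypothetical columns 'column_name' and 'value':

import dask.dataframe as dd

def summarize_with_dask(file_path):
    df = dd.read_csv(file_path)
    filtered = df[df['column_name'] > 100]
    # map_partitions applies a function to each underlying pandas DataFrame
    rows_per_partition = filtered.map_partitions(len).compute()
    total = filtered['value'].sum().compute()
    return rows_per_partition, total

# Example usage
counts, total = summarize_with_dask('large_file.csv')
print(f"Rows per partition: {list(counts)}, total value: {total}")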
6. Processing Large Data with NumPy in Chunks
Using numpy to load large datasets in chunks and process them.
import numpy as np

def process_large_array(file_path, chunk_size=100000):
    total_data = np.loadtxt(file_path, delimiter=',')
    for i in range(0, len(total_data), chunk_size):
        chunk = total_data[i:i+chunk_size]
        process_chunk(chunk)

def process_chunk(chunk):
    # Example: Compute the sum of the chunk
    print(f"Sum of chunk: {np.sum(chunk)}")

# Example usage
process_large_array('large_data.csv')
Explanation:
Note that loadtxt() still reads the entire array into memory; chunking here only limits how much is processed at a time, which helps when the per-chunk work is expensive but not when the file itself is too large to load (for that case, see the memory-mapped sketch below). You can change the processing logic according to your needs.
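For arrays too large for loadtxt(), numpy's memmap can map a file into memory and slice it lazily. This is a minimal sketch and assumes the data was saved as raw float64 values (for example with ndarray.tofile()), not as CSV:

import numpy as np

def process_large_binary(file_path, chunk_size=100000):
    data = np.memmap(file_path, dtype=np.float64, mode='r')  # Nothing is read until sliced
    for i in range(0, len(data), chunk_size):
        chunk = np.asarray(data[i:i + chunk_size])  # Only this slice is pulled into memory
        print(f"Sum of chunk: {chunk.sum()}")

# Example usage
process_large_binary('large_data.bin')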
7. Streaming Data from an API in Chunks
Streaming a large API response in chunks using requests.
import requests

def process_api_data_in_chunks(api_url):
    response = requests.get(api_url, stream=True)
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            process_chunk(chunk)

def process_chunk(chunk):
    print(f"Processing chunk: {chunk[:50]}...")

# Example usage
process_api_data_in_chunks('https://api.example.com/large_data')
Explanation:
requests.get() with stream=True defers downloading the response body, and iter_content() then yields it in fixed-size chunks, so the full response never has to be held in memory.
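When the API returns newline-delimited JSON (one object per line), response.iter_lines() is often more convenient than raw byte chunks, because each line can be parsed on its own. A minimal sketch, assuming a hypothetical NDJSON endpoint:

import json
import requests

def process_ndjson_stream(api_url):
    response = requests.get(api_url, stream=True)
    for line in response.iter_lines():
        if line:  # Skip keep-alive blank lines
            record = json.loads(line)
            print(f"Processing record: {record}")

# Example usage
process_ndjson_stream('https://api.example.com/large_data.ndjson')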
8. Processing Large Data from a Queue
Processing data from a queue in chunks.
import queue
import threading

def process_data_in_chunks(data_queue, chunk_size=5):
    while not data_queue.empty():
        chunk = [data_queue.get() for _ in range(chunk_size) if not data_queue.empty()]
        process_chunk(chunk)

def process_chunk(chunk):
    print(f"Processing chunk of size {len(chunk)}")

# Example usage
data_queue = queue.Queue()
for i in range(20):  # Add data to the queue
    data_queue.put(i)

# Start a thread to process data in chunks
thread = threading.Thread(target=process_data_in_chunks, args=(data_queue,))
thread.start()
thread.join()
Explanation:
A queue buffers incoming items, which a worker thread then drains and processes in chunks.
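The empty() check works when the queue is filled up front, but it is racy if producers and consumers run at the same time. A common alternative, sketched below, is to block on get() and stop when a sentinel value arrives:

import queue
import threading

SENTINEL = None  # Placed on the queue to signal "no more data"

def consume_in_chunks(data_queue, chunk_size=5):
    chunk = []
    while True:
        item = data_queue.get()  # Blocks until an item is available
        if item is SENTINEL:
            break
        chunk.append(item)
        if len(chunk) == chunk_size:
            print(f"Processing chunk of size {len(chunk)}")
            chunk = []
    if chunk:  # Flush whatever is left over
        print(f"Processing chunk of size {len(chunk)}")

# Example usage
data_queue = queue.Queue()
worker = threading.Thread(target=consume_in_chunks, args=(data_queue,))
worker.start()
for i in range(20):
    data_queue.put(i)
data_queue.put(SENTINEL)
worker.join()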
9. Memory-Mapped Files with mmap
Using memory-mapped files for processing large data without loading it entirely into memory.
import mmap

def process_large_file_mmap(file_path):
    with open(file_path, 'rb') as f:  # mmap needs the file opened in binary mode
        mmapped_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        chunk_size = 1024
        for i in range(0, len(mmapped_file), chunk_size):
            chunk = mmapped_file[i:i+chunk_size]
            process_chunk(chunk)
        mmapped_file.close()

def process_chunk(chunk):
    print(f"Processing chunk: {chunk[:50]}...")

# Example usage
process_large_file_mmap('large_file.txt')
Explanation:
mmap maps the file into the process's address space, so slices are paged in on demand and chunk-based processing never requires reading the whole file at once.
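Because the mapped file behaves like a bytes object, you can also search it in place without materializing chunks at all. A minimal sketch that counts occurrences of a byte pattern:

import mmap

def count_occurrences(file_path, pattern=b'ERROR'):
    count = 0
    with open(file_path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            position = mm.find(pattern)
            while position != -1:
                count += 1
                position = mm.find(pattern, position + 1)  # Continue after the last match
    return count

# Example usage
print(count_occurrences('large_file.txt'))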
10. Using Generators for Chunked Data Processing
Processing data in chunks using generators.
def chunked_generator(file_path, chunk_size=100):
    with open(file_path, 'r') as file:
        while True:
            lines = [file.readline() for _ in range(chunk_size)]
            lines = [line for line in lines if line]  # Drop the empty strings returned at EOF
            if not lines:
                break
            yield lines

def process_data(file_path):
    for chunk in chunked_generator(file_path):
        process_chunk(chunk)

def process_chunk(chunk):
    print(f"Processing chunk with {len(chunk)} lines")

# Example usage
process_data('large_file.txt')
Explanation:
A generator is used to yield chunks of data, allowing for memory-efficient processing in a loop.
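The same idea generalizes beyond files: any iterable can be split into chunks with a small generator built on itertools.islice. A minimal sketch:

from itertools import islice

def chunked(iterable, chunk_size=100):
    iterator = iter(iterable)
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk

# Example usage: works for file objects, generators, database cursors, etc.
with open('large_file.txt', 'r') as file:
    for lines in chunked(file, chunk_size=100):
        print(f"Processing chunk with {len(lines)} lines")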
Conclusion:
By processing large datasets in chunks, you can handle big files and data sources efficiently without overwhelming system memory. These techniques can be applied across various data types, such as files, databases, or even API responses, making them versatile for large-scale data processing.