02. PDF

Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a way that is independent of application software, hardware, and operating systems.

This guide covers how to load PDF documents into the LangChain Document format, which is then used downstream.

LangChain is integrated with various PDF parsers. Some are simple and relatively low-level, others support OCR and image processing or perform advanced document layout analysis.

The right choice depends on the user's application.

Reference

LangChain

PDF experiment on AutoRAG team

Leaderboard based on experiments conducted by AutoRAG

The numbers shown below represent ranks. (The lower, the better)

Parsers compared (columns): PDFMiner, PDFPlumber, PyPDFium2, PyMuPDF, PyPDF2

Document domains (rows): Medical, Law, Finance, Public, Sum

source: AutoRAG Medium Blog

# Load API keys and other environment variables from a .env configuration file
from dotenv import load_dotenv

# Load the API key information
load_dotenv()

True

Documents utilized for practice
Software Policy & Research Institute (SPRi), December 2023 issue

Author: Jaeheung Lee (AI Policy Institute Office Liability Institute), Lee Ji-soo (AI Policy Lab Yi Phyang Institute)
Link: https://spri.kr/posts/view/23669
File name: SPRI_AI_Brief_2023 December issue_F.pdf
Note: Download the file above and place it in the data folder.

FILE_PATH = "./data/SPRI_AI_Brief_2023 December issue_F.pdf"

def show_metadata(docs):
    # Print the metadata keys of the first document, then each key-value pair
    if docs:
        print("[metadata]")
        print(list(docs[0].metadata.keys()))
        print("\n[examples]")
        max_key_length = max(len(k) for k in docs[0].metadata.keys())
        for k, v in docs[0].metadata.items():
            print(f"{k:<{max_key_length}} : {v}")
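To see what this helper prints without loading a PDF, the following minimal sketch applies the same formatting logic to a stand-in object. `SimpleNamespace` is only a hypothetical substitute for a LangChain Document here, `metadata_lines` is an illustrative name, and the metadata values are made up.

```python
from types import SimpleNamespace


def metadata_lines(docs):
    """Return the lines that show_metadata would print for the first document."""
    if not docs:
        return []
    meta = docs[0].metadata
    width = max(len(k) for k in meta)
    lines = ["[metadata]", str(list(meta.keys())), "", "[examples]"]
    lines += [f"{k:<{width}} : {v}" for k, v in meta.items()]
    return lines


# Stand-in for a loaded page (assumption: real loaders return Document
# objects that expose the same .metadata dictionary).
fake_docs = [
    SimpleNamespace(page_content="...", metadata={"source": "example.pdf", "page": 0})
]
print("\n".join(metadata_lines(fake_docs)))
```

Keys are left-aligned to the width of the longest key, which is why the values line up in a column.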

PyPDF

Here we use pypdf to load a PDF into an array of documents, where each document contains the page content and metadata including the page number.

# installation
# !pip install -qU pypdf

from langchain_community.document_loaders import PyPDFLoader

# Initialize the PDF loader with the file path
loader = PyPDFLoader(FILE_PATH)

# Load the PDF pages as documents
docs = loader.load()

# Print the contents of the document
print(docs[10].page_content[:300])

SPRi AI Brief |  
December 2023 
8Cohir unveils data source explorer to ensure data transparency 
The original data source, relicense status, through audits of a wide range of datasets by ncohires and 12 institutions,  
Launched the ‘data source explorer ’ platform, providing a variety of information, including authors 
The n-interactive platform allows developers to easily grasp the license status of the dataset, and the dataset  
Configuration and genealogy traceable KEY Contents 
£Data source explorer improves data transparency by providing a wide range of dataset information 
nAI corporate coheir (

# Output metadata
show_metadata(docs)

 [metadata] 
['source','page'] 

[examples] 
source: ./data/SPRI_AI_Brief_2023 December issue_F.pdf 
page: 0
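
Because each document carries a `page` entry in its metadata (0 for the first page in the output above), a subset of pages can be selected after loading. The following is a minimal sketch; `Doc` is a hypothetical stand-in for a LangChain Document so that the example runs without a PDF, and `select_pages` is an illustrative helper, not a LangChain function.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (not the real class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def select_pages(docs, first, last):
    """Keep documents whose zero-based metadata["page"] lies in [first, last]."""
    return [d for d in docs if first <= d.metadata.get("page", -1) <= last]


# Made-up pages for illustration only.
docs = [Doc(f"text of page {i}", {"source": "example.pdf", "page": i}) for i in range(5)]
subset = select_pages(docs, 1, 2)
print([d.metadata["page"] for d in subset])
```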

PyPDF (OCR)

Some PDFs contain images of text, such as scanned documents or figures. You can also extract text from these images using the rapidocr-onnxruntime package.

# installation
# !pip install -qU rapidocr-onnxruntime

# Initialize the PDF loader with the image extraction option enabled
loader = PyPDFLoader("https://arxiv.org/pdf/2103.15348.pdf", extract_images=True)

# Load the PDF pages
docs = loader.load()

# Access page content
print(docs[4].page_content[:300])

LayoutParser: A Unified Toolkit for DL-Based DIA 5 
Table 1: Current layout detection models in the LayoutParser model zoo 
Dataset Base Model1Large Model Notes 
PubLayNet [38] F/M M Layouts of modern scientific documents 
PRImA [3] M - Layouts of scanned modern magazines and scientific reports 
Newspaper

show_metadata(docs)

[metadata] 
['source','page'] 

[examples] 
source: https://arxiv.org/pdf/2103.15348.pdf 
page: 0
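
One way to judge whether image extraction is worth enabling is to check how much text a plain load returned for each page. The sketch below is a simple heuristic and not part of LangChain; the 20-character threshold, the `Doc` stand-in class, and the helper name are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (not the real class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_with_little_text(docs, min_chars=20):
    """Return page numbers whose extracted text is shorter than min_chars."""
    return [
        d.metadata.get("page")
        for d in docs
        if len(d.page_content.strip()) < min_chars
    ]


# Made-up pages for illustration only.
docs = [
    Doc("A full page of extracted text that is clearly long enough.", {"page": 0}),
    Doc("   ", {"page": 1}),  # e.g. a scanned page without a text layer
    Doc("Fig. 3", {"page": 2}),  # e.g. a page that is mostly an image
]
print(pages_with_little_text(docs))
```

Pages flagged this way are candidates for a second pass with `extract_images=True`.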

PyMuPDF

PyMuPDF is optimized for speed and includes detailed metadata about the PDF and its pages. It returns one document per page:

# installation
# !pip install -qU pymupdf

from langchain_community.document_loaders import PyMuPDFLoader

# Create a PyMuPDF loader instance
loader = PyMuPDFLoader(FILE_PATH)

# Load the documents
docs = loader.load()

# Print the contents of the document
print(docs[10].page_content[:300])

SPRi AI Brief |  
2023-December 
8 
Cohir unveils data source explorer to ensure data transparency 
n Original data sources, relicense status, through coheirs and 12 agencies auditing a wide range of data sets  
Launched the ‘data source explorer ’ platform, providing a variety of information, including authors 
n The interactive platform allows developers to easily grasp the license status of the dataset, and the dataset  
Configuration and genealogy are also traceable 
KEY Contents 
£ Data source explorer improves data transparency by providing a wide range of data set information 
n AI company cohir

show_metadata(docs)

[metadata] 
['source','file_path','page','total_pages','format','title','author','subject','keywords','creator','producer','creationDate', 'modDate','trapped'] 

[examples] 
source: ./data/SPRI_AI_Brief_2023 December issue_F.pdf 
file_path: ./data/SPRI_AI_Brief_2023 December issue_F.pdf 
page: 0 
total_pages: 23 
format: PDF 1.4 
title:  
author: dj 
subject :  
keywords:  
creator: Hwp 2018 10.0.0.13462 
producer: Hancom PDF 1.3.0.542 
creationDate: D:20231208132838+09'00' 
modDate: D:20231208132838+09'00' 
trapped:
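
The `creationDate` and `modDate` values above use the PDF date string format. If a regular `datetime` is more convenient downstream, a conversion could look like the following minimal sketch. It assumes the full `D:YYYYMMDDHHmmSS` form with an optional UTC offset, as in the output above; shorter variants of the format are not handled, and `parse_pdf_date` is an illustrative helper name.

```python
import re
from datetime import datetime, timedelta, timezone

_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"  # date and time
    r"(?:([+-])(\d{2})'?(\d{2})'?|Z)?$"  # optional UTC offset or Z
)


def parse_pdf_date(value):
    """Parse a full PDF date string such as D:20231208132838+09'00'."""
    m = _PDF_DATE.match(value)
    if not m:
        raise ValueError(f"unsupported PDF date: {value!r}")
    year, month, day, hour, minute, second = (int(g) for g in m.groups()[:6])
    sign, off_h, off_m = m.groups()[6:]
    if sign:
        offset = timedelta(hours=int(off_h), minutes=int(off_m))
        tz = timezone(offset if sign == "+" else -offset)
    elif value.endswith("Z"):
        tz = timezone.utc
    else:
        tz = None  # no offset given, so the datetime stays naive
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


created = parse_pdf_date("D:20231208132838+09'00'")
print(created.isoformat())
```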

Unstructured

Unstructured supports a common interface for working with unstructured or semi-structured file formats such as Markdown or PDF.

LangChain's UnstructuredPDFLoader integrates with Unstructured to parse PDF documents into LangChain Document objects.

# installation
# !pip install -qU unstructured

from langchain_community.document_loaders import UnstructuredPDFLoader

# Create an UnstructuredPDFLoader instance
loader = UnstructuredPDFLoader(FILE_PATH)

# Load the data
docs = loader.load()

# Print the contents of the document
print(docs[0].page_content[:300])

December 2023 

December 2023 

Ⅰ. Artificial Industry Trends Brief 

One. Policy/legal 

▹ United States announces administrative orders on safe and reliable AI development and use ··························1 

▹ G7, Hiroshima AI process to agree on the International Code of Conduct for AI companies ························2 

▹ 28 countries participating in the UK AI Safety Summit, joint response to AI risks ·························3 

▹ US court, artist-generated AI company

show_metadata(docs)

 [metadata] 
['source'] 

[examples] 
source: ./data/SPRI_AI_Brief_2023 December issue_F.pdf

Internally, Unstructured creates a different "element" for each chunk of text. By default these are combined, but you can easily keep them separate by specifying mode="elements".

# Create an UnstructuredPDFLoader instance (mode="elements")
loader = UnstructuredPDFLoader(FILE_PATH, mode="elements")

# load data
docs = loader.load()

# Print the contents of the document
print(docs[0].page_content)

 December 2023

See the full set of element types for this particular document:

set(doc.metadata["category"] for doc in docs)  # extract the element categories

 {'ListItem','NarrativeText','Title','UncategorizedText'}

show_metadata(docs)

 [metadata] 
['source','coordinates','filename','file_directory','last_modified','filetype','page_number','links','category'] 

[examples] 
source: ./data/SPRI_AI_Brief_2023 December issue_F.pdf 
coordinates: {'points': (256.579467, 282.44348), (256.579467, 303.423873000007), (355.4236898438, 30 
filename: SPRI_AI_Brief_2023 December issue_F.pdf 
file_directory: ./data 
last_modified: 2024-03-11T18:59:07 
filetype: application/pdf 
page_number: 1 
links: [] 
category: UncategorizedText
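
With mode="elements", the `category` and `page_number` metadata shown above make it possible to keep only certain element types and regroup them per page. The following is a minimal sketch of that idea; `Doc` is a hypothetical stand-in for a LangChain Document, `text_by_page` is an illustrative helper, and the sample elements are made up.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (not the real class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def text_by_page(elements, keep=("Title", "NarrativeText")):
    """Join the text of the selected element categories, grouped by page_number."""
    pages = defaultdict(list)
    for el in elements:
        if el.metadata.get("category") in keep:
            pages[el.metadata.get("page_number")].append(el.page_content)
    return {page: "\n".join(parts) for page, parts in pages.items()}


# Made-up elements for illustration only.
elements = [
    Doc("December 2023", {"category": "UncategorizedText", "page_number": 1}),
    Doc("Policy/legal", {"category": "Title", "page_number": 1}),
    Doc("A paragraph of body text.", {"category": "NarrativeText", "page_number": 1}),
    Doc("Another paragraph.", {"category": "NarrativeText", "page_number": 2}),
]
print(text_by_page(elements))
```

Which categories to keep depends on the document; the default tuple here is only one plausible choice.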

PyPDFium2

from langchain_community.document_loaders import PyPDFium2Loader

# PyPDFium2 create a loader instance
loader = PyPDFium2Loader(FILE_PATH)

# load data
docs = loader.load()

# print the contents of the document
print(docs[10].page_content[:300])

SPRi AI Brief |  
2023-December 
8 
Cohir unveils data source explorer to ensure data transparency 
n Cohir and 12 organizations launch a ‘data source explorer’ platform that provides a variety of information, including original data sources, relicense status, and authors, through audits of a wide range of data sets. 
n The interactive platform allows developers to easily grasp the license status of the dataset, and the dataset  
Configuration and genealogy are also traceable 
KEY Contents 
£ Data source explorer improves data transparency by providing a wide range of data set information 
n AI

show_metadata(docs)

 [metadata] 
['source','page'] 

[examples] 
source: ./data/SPRI_AI_Brief_2023 December issue_F.pdf 
page: 0

PDFMiner

from langchain_community.document_loaders import PDFMinerLoader

# Creating a PDFMiner Loader Instance
loader = PDFMinerLoader(FILE_PATH)

# load data
docs = loader.load()

# print the contents of the document
print(docs[0].page_content[:300])

December 2023 

December 2023 

Ⅰ.  Artificial Industry Trends Brief 

  One.  Policy/legal  

      ▹ United States announces administrative orders on safe and reliable AI development and use ··························1 

      ▹ G7, Hiroshima AI process to agree on the International Code of Conduct for AI companies ························2 

      ▹ 28 countries participating in the UK AI Safety Summit, joint to AI risk

show_metadata(docs)

[metadata] 
['source'] 

[examples] 
source: ./data/SPRI_AI_Brief_2023 December issue_F.pdf

Generate HTML text using PDFMiner

With this method, you can parse the output HTML content with BeautifulSoup to get more structured and richer information about font sizes, page numbers, PDF headers/footers, and so on, which can help you split the text semantically into sections.

from langchain_community.document_loaders import PDFMinerPDFasHTMLLoader

# Create a PDFMinerPDFasHTMLLoader instance
loader = PDFMinerPDFasHTMLLoader(FILE_PATH)

# load document
docs = loader.load()

# print the contents of the document
print(docs[0].page_content[:300])

<html><head>
<meta http-equiv="Content-Type" content="text/html">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:612px; height:858px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border

show_metadata(docs)

[metadata] 
['source'] 

[examples] 
source: ./data/SPRI_AI_Brief_2023 December issue_F.pdf

from bs4 import BeautifulSoup

soup = BeautifulSoup(docs[0].page_content, "html.parser")  # Initializing the HTML parser
content = soup.find_all("div")  # find all div tags

import re

cur_fs = None
cur_text = ""
snippets = []  # collect all snippets with the same font size

for c in content:
    sp = c.find("span")
    if not sp:
        continue
    st = sp.get("style")
    if not st:
        continue
    fs = re.findall(r"font-size:(\d+)px", st)
    if not fs:
        continue
    fs = int(fs[0])
    if not cur_fs:
        cur_fs = fs
    if fs == cur_fs:
        cur_text += c.text
    else:
        snippets.append((cur_text, cur_fs))
        cur_fs = fs
        cur_text = c.text
snippets.append((cur_text, cur_fs))
# A duplicate-snippet removal strategy could be added here: PDF headers and footers repeat across pages, so snippets found more than once may be treated as redundant.
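
The duplicate removal mentioned in the last comment is left open in this guide. One possible strategy, sketched below under our own assumptions, is to drop every snippet whose text occurs more than once, on the theory that running headers and footers repeat across pages. The function name, the `max_repeats` parameter, and the sample data are hypothetical, and note that this removes all copies of a repeated text, including the first.

```python
from collections import Counter


def drop_repeated_snippets(snippets, max_repeats=1):
    """Remove (text, font_size) snippets whose text occurs more than max_repeats times."""
    counts = Counter(text.strip() for text, _ in snippets)
    return [(text, fs) for text, fs in snippets if counts[text.strip()] <= max_repeats]


# Made-up snippets in the same (text, font_size) shape as the list built above.
sample = [
    ("SPRi AI Brief\n", 9),
    ("First article heading\n", 15),
    ("Body of the first article.\n", 12),
    ("SPRi AI Brief\n", 9),
    ("Second article heading\n", 15),
]
cleaned = drop_repeated_snippets(sample)
print(cleaned)
```

Applied to the real data, this would be called as `snippets = drop_repeated_snippets(snippets)` before building the sections. A legitimately repeated heading would also be dropped, so the result is worth inspecting.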

from langchain_core.documents import Document

cur_idx = -1
semantic_snippets = []
# Title assumption: high font size
for s in snippets:
    # New title determination: Current snippet font > Previous title font
    if (
        not semantic_snippets
        or s[1] > semantic_snippets[cur_idx].metadata["heading_font"]
    ):
        metadata = {"heading": s[0], "content_font": 0, "heading_font": s[1]}
        metadata.update(docs[0].metadata)
        semantic_snippets.append(Document(page_content="", metadata=metadata))
        cur_idx += 1
        continue

    # Determine same section content: current snippet font <= previous content font
    if (
        not semantic_snippets[cur_idx].metadata["content_font"]
        or s[1] <= semantic_snippets[cur_idx].metadata["content_font"]
    ):
        semantic_snippets[cur_idx].page_content += s[0]
        semantic_snippets[cur_idx].metadata["content_font"] = max(
            s[1], semantic_snippets[cur_idx].metadata["content_font"]
        )
        continue

    # New section: current snippet font > previous content font, but not larger than the previous heading font
    metadata = {"heading": s[0], "content_font": 0, "heading_font": s[1]}
    metadata.update(docs[0].metadata)
    semantic_snippets.append(Document(page_content="", metadata=metadata))
    cur_idx += 1

print(semantic_snippets[4])

page_content='KEY Contents 
n US President Biden signs ‘Administrative Decree ’ on the development and use of safe and reliable AI  
Specify a wide range of administrative measures 
n Administrative orders △AI safety and security standards established △Personal information protection △Equity and citizenship improvement △Consumer  
Protection △Worker Support △Innovation and Competition Promotion △International Cooperation is a Goal 
' metadata={'heading': 'USA announces administrative orders on safe and reliable AI development and use \n', 'content_font': 12, 'heading_font': 15, 'source': './data/SPRI_AI_Brief_2023 December issue_F.pdf'}

PyPDF directory

Load all PDFs from a directory.

from langchain_community.document_loaders import PyPDFDirectoryLoader

# directory path
loader = PyPDFDirectoryLoader("data/")

# load document
docs = loader.load()

# output the number of documents
print(len(docs))

# print the contents of the document
print(docs[50].page_content[:300])

Note 2 Main tasks by year Promotion schedule by year 
Promotion task promotion 
Buddha 2019 2020 2021 2022 
Sang Ha Sang Ha Sang Ha 
One.Preemptive · Integrated Great National Service Innovation 
□1st National Benefit Service Customized Guidance Subsidy ISP Departmental, Geomagnetic Public Institution · Complex High Painting Faculty, Footnote,  
Local pilot headquarters 
□2 Life cycle service Significantly expanded 2 4 7 10 10 ‧ 
Education ‧ Employment, etc. 
□Providing preemptive services to prevent blind spots Building a masquerade system Demonstration Ministry of Welfare ‧Ministry 
Local law amendment 
□4 Exceeds Existing Limits Public Service Innovation Business Promotion 21 Year Budget Reflection

# Print the metadata of the document
print(docs[50].metadata)

{'source':'data/digital government innovation promotion plan.pdf','page': 7}
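
Since a directory load mixes pages from several files into one list, the `source` metadata is what tells them apart. The following minimal sketch counts page documents per source file; `Doc` is a hypothetical stand-in for a LangChain Document, `pages_per_source` is an illustrative helper, and the file names are made up.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (not the real class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_per_source(docs):
    """Count how many page documents came from each source file."""
    return Counter(d.metadata.get("source", "<unknown>") for d in docs)


# Made-up documents for illustration only.
docs = [Doc("", {"source": "data/a.pdf", "page": i}) for i in range(3)]
docs.append(Doc("", {"source": "data/b.pdf", "page": 0}))

for source, count in pages_per_source(docs).items():
    print(f"{source}: {count} page(s)")
```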

PDFPlumber

Like PyMuPDF, the output documents contain detailed metadata about the PDF and its pages, and the loader returns one document per page.

from langchain_community.document_loaders import PDFPlumberLoader

# Create a PDF document loader instance
loader = PDFPlumberLoader(FILE_PATH)

# loading documents
docs = loader.load()

# Print the contents of the document
print(docs[10].page_content[:300])

SPRi AI Brief | 
2023-December 
Cohir unveils data source explorer to ensure data transparency 
KEY Contents 
n Original data sources, relicense status, through coheirs and 12 agencies auditing a wide range of data sets 
Launched the ‘data source explorer ’ platform, providing a variety of information, including authors 
n The interactive platform allows developers to easily grasp the license status of the dataset, and the dataset 
Configuration and genealogy are also traceable 
£Data source explorer improves data transparency by providing a wide range of data set information 
n AI company Cohere

show_metadata(docs)

[metadata] 
['source','file_path','page','total_pages','Author','Creator','Producer','CreationDate','ModDate','PDFVersion'] 

[examples] 
source: ./data/SPRI_AI_Brief_2023 December issue_F.pdf 
file_path: ./data/SPRI_AI_Brief_2023 December issue_F.pdf 
page: 0 
total_pages: 23 
Author: dj 
Creator: Hwp 2018 10.0.0.13462 
Producer: Hancom PDF 1.3.0.542 
CreationDate: D:20231208132838+09'00' 
ModDate: D:20231208132838+09'00' 
PDFVersion: 1.4
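
All of the loaders in this guide are created from a file path and expose the same `load()` method, so switching parsers to compare results can be reduced to a lookup. The sketch below is one way to arrange that, not an established LangChain utility: the short labels and the `make_loader` helper are our own, while the class names are the ones imported earlier in this guide.

```python
import importlib

# Loader class names used in this guide, keyed by a short label (labels are our own).
LOADER_CLASSES = {
    "pypdf": "PyPDFLoader",
    "pymupdf": "PyMuPDFLoader",
    "unstructured": "UnstructuredPDFLoader",
    "pypdfium2": "PyPDFium2Loader",
    "pdfminer": "PDFMinerLoader",
    "pdfplumber": "PDFPlumberLoader",
}


def make_loader(name, file_path):
    """Create the loader registered under `name` for `file_path`."""
    if name not in LOADER_CLASSES:
        raise ValueError(f"unknown loader {name!r}; choose from {sorted(LOADER_CLASSES)}")
    # Imported lazily so the registry can be inspected even when
    # langchain_community is not installed.
    module = importlib.import_module("langchain_community.document_loaders")
    return getattr(module, LOADER_CLASSES[name])(file_path)


print(sorted(LOADER_CLASSES))
```

With the relevant parser package installed, `docs = make_loader("pymupdf", FILE_PATH).load()` would behave like the PyMuPDF example above.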
