02. PDF

Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.

This guide covers how to load PDF documents into the LangChain Document format, which is the format used by downstream steps.

LangChain is integrated with various PDF parsers. Some are simple and relatively low-level, others support OCR and image processing or perform advanced document layout analysis.

The right choice depends on the user's application.

Reference

PDF experiments by the AutoRAG team

Leaderboard based on experiments conducted by AutoRAG

The numbers below are ranks (lower is better).

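| Domain  | PDFMiner | PDFPlumber | PyPDFium2 | PyMuPDF | PyPDF2 |
|---------|----------|------------|-----------|---------|--------|
| Medical | 1        | 2          | 3         | 4       | 5      |
| Law     | 3        | 1          | 1         | 3       | 5      |
| Finance | 1        | 2          | 2         | 4       | 5      |
| Public  | 1        | 1          | 1         | 4       | 5      |
| Sum     | 5        | 5          | 7         | 15      | 20     |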

source: AutoRAG Medium Blog

PyPDF

Here we use pypdf to load a PDF into an array of documents, where each document contains the content of one page along with metadata that includes the page number.
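The loading step can be sketched roughly as follows. This is a minimal example, assuming the `langchain-community` and `pypdf` packages are installed; `load_pdf_pages` and `page_summary` are hypothetical helper names introduced here for illustration, and the file path in the usage comment is a placeholder.

```python
def load_pdf_pages(file_path):
    """Load a PDF with pypdf and return a list of Documents, one per page."""
    # Imported lazily so this module can be imported even when the
    # optional langchain-community dependency is not installed.
    from langchain_community.document_loaders import PyPDFLoader

    loader = PyPDFLoader(file_path)
    return loader.load()


def page_summary(page_content, metadata, width=80):
    """Format a one-line preview: the page number plus the start of the text."""
    page = metadata.get("page", "?")
    # Collapse runs of whitespace so the preview fits on one line.
    preview = " ".join(page_content.split())[:width]
    return f"[page {page}] {preview}"


# Usage ("example.pdf" is a placeholder path, not a file from this guide):
# docs = load_pdf_pages("example.pdf")
# for doc in docs[:3]:
#     print(page_summary(doc.page_content, doc.metadata))
```

The `page` key read by `page_summary` is where the loader typically records the page index; the helper falls back to `?` when it is absent.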

PyPDF (OCR)

Some PDFs contain images of text, such as scanned documents or pictures. With the rapidocr-onnxruntime package installed, you can also extract text from those images.
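A minimal sketch of OCR-enabled loading, assuming `langchain-community`, `pypdf`, and `rapidocr-onnxruntime` are installed. It relies on the `extract_images` argument of `PyPDFLoader`; the helper names `load_pdf_with_ocr` and `pages_changed_by_ocr` are hypothetical and exist only for this illustration.

```python
def load_pdf_with_ocr(file_path):
    """Load a PDF and run OCR on embedded images while extracting text."""
    # Lazy import: requires langchain-community, pypdf and rapidocr-onnxruntime.
    from langchain_community.document_loaders import PyPDFLoader

    loader = PyPDFLoader(file_path, extract_images=True)
    return loader.load()


def pages_changed_by_ocr(plain_texts, ocr_texts):
    """Return the indexes of pages whose text grew once OCR was enabled."""
    return [
        index
        for index, (plain, ocr) in enumerate(zip(plain_texts, ocr_texts))
        if len(ocr) > len(plain)
    ]


# Usage ("scanned.pdf" is a placeholder path):
# ocr_docs = load_pdf_with_ocr("scanned.pdf")
# print(ocr_docs[0].page_content[:300])
```

Comparing the page texts of a plain load and an OCR load with `pages_changed_by_ocr` is one simple way to check whether OCR contributed anything for a given file.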

PyMuPDF

PyMuPDF is optimized for speed and provides detailed metadata about the PDF and its pages. It returns one document per page.
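The following sketch loads a file with `PyMuPDFLoader` and prints the metadata of the first page. It assumes `langchain-community` and `pymupdf` are installed; `load_with_pymupdf` and `format_metadata` are hypothetical helper names, and the exact metadata keys depend on the file and the library version.

```python
def load_with_pymupdf(file_path):
    """Load a PDF with PyMuPDF and return one Document per page."""
    # Lazy import: requires langchain-community and pymupdf.
    from langchain_community.document_loaders import PyMuPDFLoader

    return PyMuPDFLoader(file_path).load()


def format_metadata(metadata):
    """Render a metadata dict as sorted 'key: value' lines for inspection."""
    return "\n".join(f"{key}: {metadata[key]}" for key in sorted(metadata))


# Usage ("example.pdf" is a placeholder path):
# docs = load_with_pymupdf("example.pdf")
# print(format_metadata(docs[0].metadata))
```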

Unstructured

Unstructured supports a common interface for working with unstructured or semi-structured file formats such as Markdown or PDF.

LangChain's UnstructuredPDFLoader integrates with Unstructured to parse PDF documents into LangChain Document objects.

Internally, Unstructured creates a separate "element" for each chunk of text. By default these are combined into one document, but you can easily keep them separate by specifying mode="elements".
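A sketch of both behaviours, assuming `langchain-community` and `unstructured` (with its PDF extras) are installed and that the installed version still ships `UnstructuredPDFLoader`. The wrapper `load_with_unstructured` is a hypothetical name; its mode check is a convenience added for this example, not part of the library.

```python
# Modes this sketch accepts; "single" combines all elements into one
# Document, "elements" keeps one Document per element.
SUPPORTED_MODES = ("single", "elements", "paged")


def load_with_unstructured(file_path, mode="single"):
    """Parse a PDF with Unstructured in the requested mode."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"mode must be one of {SUPPORTED_MODES}, got {mode!r}")

    # Lazy import: requires langchain-community and unstructured.
    from langchain_community.document_loaders import UnstructuredPDFLoader

    return UnstructuredPDFLoader(file_path, mode=mode).load()


# Usage ("example.pdf" is a placeholder path):
# combined = load_with_unstructured("example.pdf")               # combined text
# elements = load_with_unstructured("example.pdf", "elements")   # one per element
# print(len(combined), len(elements))
```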

See the full set of element types for this particular document
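One way to list the element types is to count the `category` value that Unstructured stores in each element's metadata when loading with mode="elements". The helper name `count_categories` is hypothetical, and the "unknown" fallback is an assumption made here for elements that carry no category.

```python
from collections import Counter


def count_categories(metadatas):
    """Count how many elements fall into each Unstructured category."""
    return Counter(meta.get("category", "unknown") for meta in metadatas)


# Usage, given `docs` loaded with mode="elements":
# for category, count in count_categories(d.metadata for d in docs).most_common():
#     print(f"{category}: {count}")
```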

PyPDFium2
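A minimal sketch of loading with `PyPDFium2Loader`, assuming `langchain-community` and `pypdfium2` are installed. It uses `lazy_load()` so pages are produced one at a time rather than all at once; `iter_pages_pypdfium2` and `take` are hypothetical helper names.

```python
from itertools import islice


def iter_pages_pypdfium2(file_path):
    """Return an iterator that yields Documents lazily, page by page."""
    # Lazy import: requires langchain-community and pypdfium2.
    from langchain_community.document_loaders import PyPDFium2Loader

    return PyPDFium2Loader(file_path).lazy_load()


def take(iterable, n):
    """Return the first n items of an iterable as a list."""
    return list(islice(iterable, n))


# Usage ("example.pdf" is a placeholder path):
# first_pages = take(iter_pages_pypdfium2("example.pdf"), 2)
# for doc in first_pages:
#     print(doc.metadata, doc.page_content[:200])
```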

PDFMiner
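A minimal sketch of loading with `PDFMinerLoader`, assuming `langchain-community` and `pdfminer.six` are installed. Whether the loader returns one document for the whole file or one per page depends on the installed version and its settings, so the hypothetical `corpus_stats` helper simply reports what came back.

```python
def load_with_pdfminer(file_path):
    """Load a PDF with PDFMiner and return the resulting Documents."""
    # Lazy import: requires langchain-community and pdfminer.six.
    from langchain_community.document_loaders import PDFMinerLoader

    return PDFMinerLoader(file_path).load()


def corpus_stats(texts):
    """Return (number of documents, total characters) for a list of texts."""
    texts = list(texts)
    return len(texts), sum(len(text) for text in texts)


# Usage ("example.pdf" is a placeholder path):
# docs = load_with_pdfminer("example.pdf")
# print(corpus_stats(doc.page_content for doc in docs))
```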

Generating HTML text with PDFMiner

This method outputs HTML content that can be parsed with BeautifulSoup to obtain more structured and richer information about font sizes, page numbers, PDF headers/footers, and so on, which can help you split the text into semantically meaningful sections.
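The idea can be sketched as follows: load the PDF as HTML, read the font size from each block's inline style, and merge consecutive blocks that share a font size. This assumes `langchain-community`, `pdfminer.six`, and `beautifulsoup4` are installed and that the HTML carries `font-size:NNpx` in `span` styles; all function names are hypothetical, and grouping by font size is only one possible sectioning heuristic.

```python
import re


def font_size_of(style):
    """Return the font size in px from an inline style string, or None."""
    match = re.search(r"font-size:\s*(\d+)px", style or "")
    return int(match.group(1)) if match else None


def group_by_font_size(pieces):
    """Merge consecutive (text, font_size) pieces that share a font size."""
    snippets = []
    for text, size in pieces:
        if snippets and snippets[-1][1] == size:
            snippets[-1] = (snippets[-1][0] + text, size)
        else:
            snippets.append((text, size))
    return snippets


def html_to_snippets(html):
    """Turn PDFMiner HTML into font-size-grouped (text, size) snippets."""
    from bs4 import BeautifulSoup  # lazy import: requires beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    pieces = []
    for div in soup.find_all("div"):
        span = div.find("span")
        size = font_size_of(span.get("style")) if span is not None else None
        if size is not None:
            pieces.append((div.text, size))
    return group_by_font_size(pieces)


# Usage ("example.pdf" is a placeholder path):
# from langchain_community.document_loaders import PDFMinerPDFasHTMLLoader
# html_doc = PDFMinerPDFasHTMLLoader("example.pdf").load()[0]
# for text, size in html_to_snippets(html_doc.page_content)[:5]:
#     print(size, text[:80])
```

A change in font size often marks a heading, so the grouped snippets can serve as a starting point for splitting the text into sections.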

PyPDF directory

Load all PDFs from a directory.
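A minimal sketch using `PyPDFDirectoryLoader`, assuming `langchain-community` and `pypdf` are installed. The directory name in the usage comment is a placeholder, and `load_pdf_directory` and `pages_per_source` are hypothetical helper names.

```python
from collections import Counter


def load_pdf_directory(directory):
    """Load every PDF found in a directory, one Document per page."""
    # Lazy import: requires langchain-community and pypdf.
    from langchain_community.document_loaders import PyPDFDirectoryLoader

    return PyPDFDirectoryLoader(directory).load()


def pages_per_source(metadatas):
    """Count how many page Documents came from each source file."""
    return Counter(meta.get("source", "unknown") for meta in metadatas)


# Usage ("data/" is a placeholder directory):
# docs = load_pdf_directory("data/")
# for source, pages in pages_per_source(d.metadata for d in docs).items():
#     print(f"{source}: {pages} pages")
```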

PDFPlumber

Like PyMuPDF, the output documents contain detailed metadata about the PDF and its pages, and one document is returned per page.
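A minimal sketch of loading with `PDFPlumberLoader`, assuming `langchain-community` and `pdfplumber` are installed. Because the metadata is described as similar to PyMuPDF's, the hypothetical `metadata_key_diff` helper shows one way to compare the keys two loaders produce for the same page.

```python
def load_with_pdfplumber(file_path):
    """Load a PDF with pdfplumber and return one Document per page."""
    # Lazy import: requires langchain-community and pdfplumber.
    from langchain_community.document_loaders import PDFPlumberLoader

    return PDFPlumberLoader(file_path).load()


def metadata_key_diff(meta_a, meta_b):
    """Return (keys only in meta_a, keys only in meta_b), each sorted."""
    keys_a, keys_b = set(meta_a), set(meta_b)
    return sorted(keys_a - keys_b), sorted(keys_b - keys_a)


# Usage ("example.pdf" is a placeholder path):
# docs = load_with_pdfplumber("example.pdf")
# print(docs[0].metadata)
```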
