02. PDF
Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a way that is independent of application software, hardware, and operating systems.
This guide covers how to load PDF documents into the LangChain Document format, which is then used in downstream steps.
LangChain is integrated with various PDF parsers. Some are simple and relatively low-level, others support OCR and image processing or perform advanced document layout analysis.
The right choice depends on the user's application.
Reference
PDF experiments by the AutoRAG team
The ranking below is based on experiments conducted by AutoRAG.
The numbers in the table are ranks (lower is better).
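|         | PDFMiner | PDFPlumber | PyPDFium2 | PyMuPDF | PyPDF2 |
|---------|----------|------------|-----------|---------|--------|
| Medical | 1        | 2          | 3         | 4       | 5      |
| Law     | 3        | 1          | 1         | 3       | 5      |
| Finance | 1        | 2          | 2         | 4       | 5      |
| Public  | 1        | 1          | 1         | 4       | 5      |
| Sum     | 5        | 5          | 7         | 15      | 20     |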
source: AutoRAG Medium Blog
PyPDF
Here, pypdf is used to load a PDF into an array of documents, where each document contains the content of one page along with metadata that includes the page number.
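A minimal sketch of page-wise loading, assuming the `langchain-community` and `pypdf` packages are installed. The path `data/example.pdf` and the helper `describe_page` are hypothetical and exist only for illustration.

```python
def load_pdf_pages(file_path):
    """Load a PDF with PyPDFLoader; the result holds one document per page."""
    # Imported inside the function so the helper below has no third-party dependency.
    from langchain_community.document_loaders import PyPDFLoader

    return PyPDFLoader(file_path).load()


def describe_page(doc, width=80):
    """Hypothetical helper: one-line summary of a page-level document."""
    page = doc.metadata.get("page", "?")
    text = " ".join(doc.page_content.split())  # collapse whitespace and newlines
    return f"page {page}: {text[:width]}"


# Usage (hypothetical path):
# docs = load_pdf_pages("data/example.pdf")
# print(len(docs))
# print(describe_page(docs[0]))
```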
PyPDF (OCR)
Some PDFs contain images of text, such as scanned documents or figures. With the rapidocr-onnxruntime package installed, you can also extract text from these images.
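One way this might look, assuming `langchain-community`, `pypdf`, and `rapidocr-onnxruntime` are installed and that the loader's `extract_images` flag is available in your version. The helper `non_empty_pages` and the path in the usage comment are hypothetical.

```python
def load_pdf_with_ocr(file_path):
    """Load a PDF and also run OCR on images embedded in its pages."""
    from langchain_community.document_loaders import PyPDFLoader

    # extract_images=True asks the loader to pass page images through OCR.
    return PyPDFLoader(file_path, extract_images=True).load()


def non_empty_pages(docs):
    """Hypothetical helper: keep only documents that contain some text."""
    return [doc for doc in docs if doc.page_content.strip()]


# Usage (hypothetical path):
# docs = load_pdf_with_ocr("data/scanned.pdf")
# print(len(non_empty_pages(docs)), "of", len(docs), "pages yielded text")
```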
PyMuPDF
PyMuPDF is optimized for speed and provides detailed metadata about the PDF and its pages. It returns one document per page.
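A short sketch for inspecting that metadata, assuming `langchain-community` and `pymupdf` are installed. The helper `metadata_keys` and the example path are hypothetical, and the exact metadata keys depend on the file and the library version.

```python
def load_with_pymupdf(file_path):
    """Load a PDF with PyMuPDFLoader; the result holds one document per page."""
    from langchain_community.document_loaders import PyMuPDFLoader

    return PyMuPDFLoader(file_path).load()


def metadata_keys(docs):
    """Hypothetical helper: sorted union of metadata keys across all pages."""
    keys = set()
    for doc in docs:
        keys.update(doc.metadata)
    return sorted(keys)


# Usage (hypothetical path):
# docs = load_with_pymupdf("data/example.pdf")
# print(metadata_keys(docs))
```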
Unstructured
Unstructured provides a common interface for working with unstructured or semi-structured file formats such as Markdown or PDF.
LangChain's UnstructuredPDFLoader integrates with Unstructured to parse PDF documents into LangChain Document objects.
Internally, Unstructured creates a separate "element" for each chunk of text. By default these are combined, but you can easily keep them separate by specifying mode="elements".
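The two modes can be sketched as follows, assuming `langchain-community` and `unstructured` (with its PDF extras) are installed. The default value `"single"`, the `"category"` metadata key, the helper `format_element`, and the example path reflect my understanding of the loader and should be treated as assumptions.

```python
def load_unstructured(file_path, mode="single"):
    """Parse a PDF with Unstructured.

    mode="single" (assumed default) combines all text into one document;
    mode="elements" keeps one document per detected element.
    """
    from langchain_community.document_loaders import UnstructuredPDFLoader

    return UnstructuredPDFLoader(file_path, mode=mode).load()


def format_element(doc):
    """Hypothetical helper: label an element with its category, if present."""
    category = doc.metadata.get("category", "Unknown")
    return f"[{category}] {doc.page_content}"


# Usage (hypothetical path):
# combined = load_unstructured("data/example.pdf")
# elements = load_unstructured("data/example.pdf", mode="elements")
# print(len(combined), len(elements))
# print(format_element(elements[0]))
```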
See the full set of element types for this particular document
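A small sketch for listing element types, assuming `docs` was loaded with UnstructuredPDFLoader in mode="elements" and that each document's metadata carries a `"category"` key (an assumption about the loader's output). The helper name `element_categories` is hypothetical.

```python
from collections import Counter


def element_categories(docs):
    """Count how many elements fall into each category."""
    return Counter(doc.metadata.get("category", "Unknown") for doc in docs)


# Usage, given element-level documents in `docs`:
# counts = element_categories(docs)
# print(set(counts))            # the distinct element types
# print(counts.most_common())   # how often each one occurs
```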
PyPDFium2
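As a rough sketch, PyPDFium2 can be used through LangChain's PyPDFium2Loader, assuming `langchain-community` and `pypdfium2` are installed. This version streams pages with `lazy_load()` rather than loading them all at once; the helper `take` and the example path are hypothetical.

```python
from itertools import islice


def iter_pdf_pages(file_path):
    """Return an iterator over page documents produced by PyPDFium2Loader."""
    from langchain_community.document_loaders import PyPDFium2Loader

    return PyPDFium2Loader(file_path).lazy_load()


def take(iterable, n):
    """Hypothetical helper: return the first n items of an iterable as a list."""
    return list(islice(iterable, n))


# Usage (hypothetical path): read only the first two pages.
# first_pages = take(iter_pdf_pages("data/example.pdf"), 2)
# print(first_pages[0].page_content[:300])
```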
PDFMiner
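A minimal sketch of loading with PDFMinerLoader, assuming `langchain-community` and `pdfminer.six` are installed. How many documents are returned (one for the whole file or one per page) may depend on the library version; the helper `word_count` and the example path are hypothetical.

```python
def load_with_pdfminer(file_path):
    """Load a PDF with PDFMinerLoader."""
    from langchain_community.document_loaders import PDFMinerLoader

    return PDFMinerLoader(file_path).load()


def word_count(docs):
    """Hypothetical helper: total whitespace-separated words across documents."""
    return sum(len(doc.page_content.split()) for doc in docs)


# Usage (hypothetical path):
# docs = load_with_pdfminer("data/example.pdf")
# print(len(docs), "document(s),", word_count(docs), "words")
```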
Generating HTML text with PDFMiner
This approach parses the output HTML content with BeautifulSoup to obtain more structured and richer information about font sizes, page numbers, PDF headers/footers, and so on, which can help you split the text into semantically meaningful sections.
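The idea can be sketched as below, assuming `langchain-community`, `pdfminer.six`, and `beautifulsoup4` are installed, and assuming the generated HTML places text in `<div>` elements whose `<span>` carries a `font-size:NNpx` style. The function names and the example path are hypothetical; grouping consecutive text by font size is one simple heuristic, not the only way to find sections.

```python
import re


def load_pdf_as_html(file_path):
    """Return the PDF rendered as a single HTML document."""
    from langchain_community.document_loaders import PDFMinerPDFasHTMLLoader

    return PDFMinerPDFasHTMLLoader(file_path).load()[0]


def extract_spans(html):
    """Return (text, font_size) for each <div> whose first <span> has a font size."""
    from bs4 import BeautifulSoup

    pairs = []
    for div in BeautifulSoup(html, "html.parser").find_all("div"):
        span = div.find("span")
        style = span.get("style") if span else None
        match = re.search(r"font-size:(\d+)px", style) if style else None
        if match:
            pairs.append((div.text, int(match.group(1))))
    return pairs


def group_by_font_size(pairs):
    """Merge consecutive (text, font_size) pairs that share a font size."""
    snippets = []
    for text, size in pairs:
        if snippets and snippets[-1][1] == size:
            snippets[-1] = (snippets[-1][0] + text, size)
        else:
            snippets.append((text, size))
    return snippets


# Usage (hypothetical path):
# html_doc = load_pdf_as_html("data/example.pdf")
# snippets = group_by_font_size(extract_spans(html_doc.page_content))
# Larger font sizes often indicate headings that start a new section.
```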
PyPDF directory
Load all PDFs from a directory.
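A minimal sketch of directory loading, assuming `langchain-community` and `pypdf` are installed and that each page document records its file in the `"source"` metadata key (an assumption about the loader's output). The directory `data/` and the helper `pages_per_file` are hypothetical.

```python
from collections import Counter


def load_pdf_directory(dir_path):
    """Load every PDF found in a directory, one document per page."""
    from langchain_community.document_loaders import PyPDFDirectoryLoader

    return PyPDFDirectoryLoader(dir_path).load()


def pages_per_file(docs):
    """Hypothetical helper: count loaded pages for each source file."""
    return Counter(doc.metadata.get("source", "unknown") for doc in docs)


# Usage (hypothetical directory):
# docs = load_pdf_directory("data/")
# for source, pages in pages_per_file(docs).items():
#     print(source, pages)
```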
PDFPlumber
Like PyMuPDF, the output documents contain detailed metadata about the PDF and its pages, and one document is returned per page.
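A brief sketch, assuming `langchain-community` and `pdfplumber` are installed. The helper `longest_page` and the example path are hypothetical.

```python
def load_with_pdfplumber(file_path):
    """Load a PDF with PDFPlumberLoader; the result holds one document per page."""
    from langchain_community.document_loaders import PDFPlumberLoader

    return PDFPlumberLoader(file_path).load()


def longest_page(docs):
    """Hypothetical helper: the document with the most text, or None if empty."""
    return max(docs, key=lambda doc: len(doc.page_content), default=None)


# Usage (hypothetical path):
# docs = load_with_pdfplumber("data/example.pdf")
# page = longest_page(docs)
# print(page.metadata)
```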