Every LLM app that ingests PDFs has the same bug. The text extraction returns multi-column garbage, collapsed tables, broken equations. The team blames the model. The model is fine. The parser is the problem. Docling is the only open-source parser that actually parses structure.

Docling Is the Only Open-Source Document Parser That Reads a PDF Like a Human, and Every RAG Pipeline Using pypdf Has a Latent Bug

I spent two months debugging a "RAG quality regression" for a client. The model was the same. The chunking was the same. The retrieval was the same. The answers went sideways. The team spent three weeks tuning prompts before someone ran the input through a print statement.

The PDF extraction was reading multi-column whitepapers as a single string run, in the wrong order, with figures chopped out and tables collapsed into one word per line. The model was receiving the document the way a stroke patient reads a stock ticker. The bug was in pypdf, not in the LLM.

I rewrote the ingestion with Docling. The same documents came out as a DoclingDocument with section headings intact, tables as pandas DataFrames, and figures separated. The RAG answers improved before I touched a prompt. The "prompt engineering problem" was a parser problem in a costume.

The Document Parsing Problem Nobody Wants to Solve

Every LLM app that ingests documents inherits the same mistake: it treats PDF parsing as a solved problem. pip install pypdf, call page.extract_text() on every page of a PdfReader, ship. The string goes to the chunker. The chunker feeds the retriever. The retriever feeds the model. The model hallucinates. The team blames the model.

It is not the model. pypdf and pdfplumber do one job: extract a stream of glyphs, by default in roughly the order the page's content stream draws them. A two-column paper becomes a sentence that hops between columns. A financial table becomes a wall of cells with no row boundaries. A scanned document returns an empty string.
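The column-hopping failure is easy to reproduce without a PDF. Here is a toy sketch, assuming text fragments arrive with page coordinates and that the page has evenly sized columns. The Fragment type, both functions, and the sample page are hypothetical; a real layout model does far more than sort by a fixed column width.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A run of text plus the position where the page draws it."""
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page


def draw_order_text(fragments: list[Fragment]) -> str:
    """Join fragments in draw order, as a glyph-stream extractor would."""
    return " ".join(f.text for f in fragments)


def column_order_text(fragments: list[Fragment], page_width: float, columns: int = 2) -> str:
    """Join fragments column by column, top to bottom within each column."""
    column_width = page_width / columns

    def key(f: Fragment) -> tuple[int, float]:
        column = min(int(f.x // column_width), columns - 1)
        return (column, f.y)

    return " ".join(f.text for f in sorted(fragments, key=key))


# A hypothetical two-column page whose producer drew it row by row.
page = [
    Fragment("Parsing is", x=50, y=100),
    Fragment("Tables need", x=350, y=100),
    Fragment("a layout problem.", x=50, y=120),
    Fragment("row boundaries.", x=350, y=120),
]

print(draw_order_text(page))
print(column_order_text(page, page_width=600))
```

The first print interleaves the two columns; the second reads each column to the bottom before moving right. That gap is what the chunker inherits.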

The team rewrites chunking. The team rewrites the prompt. The team rewrites the retriever. The team does not rewrite the parser, because a one-liner feels like something you do not have to think about. This is the worst engineering instinct in the LLM stack in 2026.

What Docling Actually Is

Docling is an MIT-licensed open-source document parser from IBM Research and an LF AI & Data Foundation graduated project, with ~46k GitHub stars and 2M+ monthly PyPI downloads. It supports PDF, DOCX, PPTX, XLSX, HTML, EPUB, LaTeX, images, audio, and email. It runs locally. It does not phone home. After pip install docling and a one-time model download, it works in an air-gapped environment.

Under the hood, Docling does the thing the rest of the field skipped: layout analysis. A model identifies regions on each page (title, paragraph, list, table, figure, code, formula), assigns a reading order, and emits a DoclingDocument — a typed representation with structure intact. The Markdown export reads like the PDF. The JSON export is lossless. Tables come out as tables; figures come out as image objects with captions.
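A quick way to see that structure is to walk the document in reading order. The sketch below assumes DoclingDocument.iterate_items() yields (item, level) pairs and that each item carries a label, with text items also carrying text. The helper names and the stand-in items are mine, not part of Docling.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Iterable


def label_name(item: Any) -> str:
    """Return an item's label as a plain string, whether enum or str."""
    label = getattr(item, "label", "unknown")
    return str(getattr(label, "value", label))


def summarize_layout(items: Iterable[tuple[Any, int]]) -> Counter:
    """Count items per label from (item, level) pairs."""
    return Counter(label_name(item) for item, _level in items)


def outline(items: Iterable[tuple[Any, int]]) -> list[str]:
    """Return the title and section headings, in reading order."""
    wanted = {"title", "section_header"}
    return [item.text for item, _level in items if label_name(item) in wanted]


# Stand-in items for illustration; with Docling, pass doc.iterate_items().
demo_items = [
    (SimpleNamespace(label="title", text="Quarterly Report"), 0),
    (SimpleNamespace(label="section_header", text="Revenue"), 1),
    (SimpleNamespace(label="text", text="Revenue grew."), 2),
    (SimpleNamespace(label="table"), 2),
]

print(summarize_layout(demo_items))
print(outline(demo_items))
```

With a real document, call doc.iterate_items() once per helper, since it returns an iterator. If the outline matches the PDF's table of contents, the layout pass did its job.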

python

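from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Figure crops are not kept by default; ask the PDF pipeline to generate them.
opts = PdfPipelineOptions(generate_picture_images=True)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
result = converter.convert("data/whitepaper.pdf")
doc = result.document
print(doc.export_to_markdown())                    # clean, ordered Markdown
print(doc.tables[0].export_to_dataframe(doc=doc))  # actual pandas DataFrame
print(doc.pictures[0].get_image(doc))              # extracted figure as a PIL image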

No model API to wire. No SaaS to call. No YAML config. The converter downloads the layout model on first run and caches it locally. From the second document onward, it is just CPU and a cache directory.
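For machines with no network at all, the weights have to arrive some other way. A minimal sketch, assuming the models were already fetched on a connected machine and copied across, and assuming PdfPipelineOptions accepts an artifacts_path pointing at that directory. The function name and the directory are hypothetical.

```python
from pathlib import Path


def build_offline_converter(artifacts_dir: str):
    """Return a DocumentConverter that loads models from a local directory."""
    path = Path(artifacts_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"model directory not found: {path}")

    # Imported lazily so the path check above runs even where docling is absent.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Assumption: artifacts_path makes the pipeline read weights from disk
    # instead of downloading them on first use.
    options = PdfPipelineOptions(artifacts_path=str(path))
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
```

Usage would look like build_offline_converter("/opt/docling-models").convert("data/whitepaper.pdf"), where the path is whatever directory you copied the weights into.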

The Weakness

Layout models are slow on CPU. First run downloads ~200MB of weights. Throughput on a 500-page engineering manual is 1-2 pages/sec on a modern CPU and 10-20x that on a T4 GPU. Plan a GPU if you have millions of pages.
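The throughput figures are easier to judge as calendar time. A back-of-the-envelope sketch, using only the rates quoted above and a hypothetical corpus of one million pages on a single worker:

```python
def conversion_days(pages: int, pages_per_second: float) -> float:
    """Wall-clock days to convert `pages` at a steady single-worker rate."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return pages / pages_per_second / 86_400  # seconds per day


CORPUS_PAGES = 1_000_000   # hypothetical corpus size
CPU_RATES = (1.0, 2.0)     # pages/sec, the CPU range quoted above
GPU_RATES = (10.0, 40.0)   # 10x the slow CPU rate up to 20x the fast one

for name, (slow, fast) in {"cpu": CPU_RATES, "gpu": GPU_RATES}.items():
    print(
        f"{name}: {conversion_days(CORPUS_PAGES, fast):.1f} to "
        f"{conversion_days(CORPUS_PAGES, slow):.1f} days"
    )
```

On those numbers, a million pages is roughly 6 to 12 days on one CPU worker and about a day or less on the GPU. Real rates vary with page complexity and OCR, so measure on your own corpus before sizing hardware.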

It is opinionated about output formats. Markdown is good. Lossless JSON is great. Plain text is the worst output you can pick. If your downstream consumer wants a string, you are doing it wrong, and Docling will not save you from yourself.

OCR has its own failure modes. Scanned PDFs route through a pluggable OCR engine (EasyOCR, Tesseract, and RapidOCR are among the supported backends). For clean scans it is fine. For old or low-resolution documents you will find edge cases. Budget time.

The chunking story is younger than LangChain's. HybridChunker is the right default, but the ecosystem is newer. Custom chunking means writing against the DoclingDocument API — a better place to start than a string, but still on you.
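To give a sense of what custom chunking against the document looks like, here is a minimal heading-scoped chunker. It is not HybridChunker and does not use Docling's chunking API. It assumes only that items arrive as (item, level) pairs with a label and optional text, as from doc.iterate_items(). All names and the stand-in items are hypothetical.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import Any, Iterable

HEADING_LABELS = {"title", "section_header"}


@dataclass
class Chunk:
    heading: str
    lines: list[str] = field(default_factory=list)

    @property
    def text(self) -> str:
        """Heading plus body, so each chunk carries its own context."""
        return "\n".join([self.heading, *self.lines]).strip()


def label_name(item: Any) -> str:
    """Return an item's label as a plain string, whether enum or str."""
    label = getattr(item, "label", "unknown")
    return str(getattr(label, "value", label))


def chunk_by_heading(items: Iterable[tuple[Any, int]]) -> list[Chunk]:
    """Start a new chunk at every heading; attach following text to it."""
    chunks: list[Chunk] = []
    current = Chunk(heading="")
    for item, _level in items:
        text = getattr(item, "text", "")
        if label_name(item) in HEADING_LABELS:
            if current.heading or current.lines:
                chunks.append(current)
            current = Chunk(heading=text)
        elif text:
            current.lines.append(text)
    if current.heading or current.lines:
        chunks.append(current)
    return chunks


# Stand-in items for illustration; with Docling, pass doc.iterate_items().
demo_items = [
    (SimpleNamespace(label="section_header", text="Limits"), 1),
    (SimpleNamespace(label="text", text="Layout models are slow on CPU."), 2),
    (SimpleNamespace(label="section_header", text="OCR"), 1),
    (SimpleNamespace(label="text", text="Budget time for old scans."), 2),
]
demo_chunks = chunk_by_heading(demo_items)
print([c.text for c in demo_chunks])
```

This sketch skips items without text, which includes tables, and it ignores token limits. Those are the gaps a production chunker has to close.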

The Take

Docling is the only open-source document parser in 2026 that treats a PDF as a structured document. Every other library in the Python ecosystem treats it as a glyph bucket. If you have shipped a RAG pipeline, an agent, or a doc-grounded LLM app with anything else, you have a latent bug in your ingestion.

Use this if: you ingest PDFs, Office docs, scanned documents, or anything that is not clean text. You have ever had a "RAG quality regression" that did not respond to prompt tuning. You cannot send customer documents to a third-party OCR API.

Skip this if: your inputs are clean Markdown or HTML from day one. Your scale is so high that CPU throughput is a deal-breaker and you cannot afford a GPU.

Try first: the DocumentConverter hello world above, on the worst PDF in your corpus: multi-column, merged cells, embedded figures, embedded equations. Compare the Markdown export to your current pipeline's output. The bug is usually visible in the first ten lines of the converted output. The HybridChunker is what finishes the job.
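The comparison itself takes a few lines of standard library. A small sketch, assuming you already have both extractions as strings. The function name and the sample strings are hypothetical.

```python
import difflib


def head_diff(old_text: str, new_text: str, lines: int = 10) -> str:
    """Unified diff of the first `lines` lines of two extractions."""
    old_head = old_text.splitlines()[:lines]
    new_head = new_text.splitlines()[:lines]
    diff = difflib.unified_diff(
        old_head, new_head, fromfile="current", tofile="docling", lineterm=""
    )
    return "\n".join(diff)


# Hypothetical outputs for illustration only.
current_output = "Parsing is Tables need\na layout problem. row boundaries."
docling_output = "## Parsing\n\nParsing is a layout problem."

print(head_diff(current_output, docling_output))
```

An empty result means the two extractions agree on the opening lines. Anything else is worth reading by eye before you touch a prompt.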

The document parsing layer of the LLM stack has been ignored for two years. It is the slowest, ugliest, and most embarrassing part of most production RAG systems. Docling is the first project that took the problem seriously. It is the parser I use. It is the parser I tell every team to use.

— Mr. Technology

*Docling: github.com/docling-project/docling — ~46k stars, MIT-licensed, LF AI & Data Foundation graduated project, IBM Research origin. 2M+ monthly PyPI downloads. Latest docling-parse release: May 27, 2026. Install: pip install docling. Runs locally. No SaaS. Supported formats: PDF, DOCX, PPTX, XLSX, HTML, EPUB, LaTeX, images, audio, email. VLM backend: GraniteDocling (258M). Integrations: LangChain, LlamaIndex, CrewAI, Haystack.*