IBM's Docling Gains Traction for Enterprise RAG Pipelines
Docling, IBM Research's open-source document processing library, has seen wide adoption in retrieval-augmented generation (RAG) pipelines. The tool addresses a persistent problem: traditional PDF extractors mangle complex layouts, turning structured tables into unusable text strings.
The numbers suggest real usage. 42,000 GitHub stars. 1.5 million monthly PyPI downloads. 2,400 organizations using it. IBM donated the project to the Linux Foundation's AI & Data Foundation in April 2025. It's now embedded in IBM Granite, Red Hat InstructLab, Watsonx.ai, and OpenSearch pipelines.
What It Actually Does
Docling converts PDFs, DOCX, HTML, images, and audio into structured JSON or Markdown while preserving document semantics. Tables stay tables. Section hierarchies remain intact. Multi-column layouts are read one column at a time rather than straight across the page.
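To make the multi-column point concrete, here is a minimal sketch of why reading order matters. It is not Docling's layout model, which is learned from labeled pages; the `Block` type, the coordinates, and the fixed two-column assumption are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block with a hypothetical position on the page."""
    text: str
    x: float  # left edge of the block
    y: float  # top edge of the block (smaller means higher on the page)


def naive_order(blocks):
    # Top-to-bottom, then left-to-right: this interleaves the columns.
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_order(blocks, page_width, n_columns=2):
    # Assumption: columns have equal width. Assign each block to a column
    # by its left edge, then read each column from top to bottom.
    col_width = page_width / n_columns

    def key(b):
        col = min(int(b.x // col_width), n_columns - 1)
        return (col, b.y)

    return [b.text for b in sorted(blocks, key=key)]


PAGE_WIDTH = 600
blocks = [
    Block("Left column, first paragraph.", x=50, y=100),
    Block("Right column, first paragraph.", x=320, y=100),
    Block("Left column, second paragraph.", x=50, y=200),
    Block("Right column, second paragraph.", x=320, y=200),
]

print(naive_order(blocks))               # left and right paragraphs alternate
print(column_order(blocks, PAGE_WIDTH))  # whole left column, then right column
```

A parser that sorts only by vertical position produces the first ordering, which splices unrelated sentences together before any chunking happens.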
The library uses models trained on 81,000 labeled pages for layout analysis. That training shows in table extraction accuracy, an area where basic PDF parsers typically fail. It handles financial reports, technical specifications, and other documents where structure matters.
IBM has processed 2.1 million Common Crawl PDFs with it. They're planning to run it across 1.8 billion documents for Granite multimodal training.
The RAG Pipeline Context
RAG systems retrieve relevant document chunks to provide context for LLM responses. The quality of chunk boundaries directly affects answer accuracy. Split a table mid-row, and your system can't answer "What was Q3 revenue?" reliably.
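The effect of chunk boundaries can be shown with a small, stdlib-only sketch. It compares a naive fixed-size splitter with one that only cuts at blank lines, so a Markdown table is never divided. The function names and the sample document (including its made-up revenue figures) are assumptions for illustration, not Docling's implementation.

```python
def fixed_size_chunks(text, size):
    # Naive splitter: cut every `size` characters, ignoring structure.
    return [text[i:i + size] for i in range(0, len(text), size)]


def block_chunks(text, max_chars):
    # Structure-aware splitter: blank-line-separated blocks (paragraphs,
    # whole tables) are packed into chunks and never cut in the middle.
    # A single block longer than max_chars is kept whole.
    blocks = [b for b in text.split("\n\n") if b.strip()]
    chunks, current = [], ""
    for block in blocks:
        candidate = block if not current else current + "\n\n" + block
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Illustrative document with invented figures.
DOC = (
    "Quarterly results summary.\n\n"
    "| Quarter | Revenue |\n"
    "|---|---|\n"
    "| Q2 | 4.1M |\n"
    "| Q3 | 4.8M |\n\n"
    "Revenue grew in the second half."
)
Q3_ROW = "| Q3 | 4.8M |"

naive = fixed_size_chunks(DOC, 40)
aware = block_chunks(DOC, 80)

# The naive splitter cuts the Q3 row in two; the block splitter keeps
# the row together with its header.
print(any(Q3_ROW in chunk for chunk in naive))
print(any(Q3_ROW in chunk and "| Quarter | Revenue |" in chunk for chunk in aware))
```

With the naive splitter, no single chunk contains both "Q3" and its revenue value, so a retriever has nothing coherent to hand the model for that question.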
Docling chunks by semantic units: complete sections, full tables, intact paragraphs with their headers. Each chunk includes metadata: section hierarchy, page number, content type, document position. That metadata enables filtering ("search only tables") or prioritization ("boost executive summary matches").
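A rough sketch of how such metadata might be used at retrieval time follows. The `Chunk` fields mirror the metadata listed above, but the class, the helper names, the similarity scores, and the 1.5 boost factor are hypothetical; this is not Docling's chunk schema or any particular retriever's API.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical chunk record carrying the metadata described above."""
    text: str
    headings: list      # section hierarchy, outermost first
    page: int
    content_type: str   # e.g. "table" or "text"


def only_type(chunks, content_type):
    # Filtering: "search only tables".
    return [c for c in chunks if c.content_type == content_type]


def rank(chunks, base_scores, boost_heading, boost=1.5):
    # Prioritization: multiply the retriever's score for any chunk that
    # sits under the boosted heading, then sort by the adjusted score.
    scored = []
    for chunk, score in zip(chunks, base_scores):
        if boost_heading in chunk.headings:
            score *= boost
        scored.append((score, chunk))
    return [c for _score, c in sorted(scored, key=lambda p: p[0], reverse=True)]


chunks = [
    Chunk("Revenue rose year over year.",
          ["Annual Report", "Executive Summary"], 1, "text"),
    Chunk("| Quarter | Revenue |",
          ["Annual Report", "Financials"], 7, "table"),
    Chunk("Notes on methodology.",
          ["Annual Report", "Appendix"], 20, "text"),
]
base_scores = [0.6, 0.8, 0.5]  # invented similarity scores

tables = only_type(chunks, "table")
ranked = rank(chunks, base_scores, boost_heading="Executive Summary")
print([c.page for c in tables])
print([c.page for c in ranked])
```

In this toy run the executive summary chunk starts with a lower score than the table but ends up ranked first after the boost.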
A Pathway integration added real-time multimodal RAG capabilities, though Pathway's documentation notes you may need additional token splitters for long passages.
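The idea of a secondary token splitter can be sketched as follows. This is a generic illustration, not Pathway's or Docling's splitter: the function name is hypothetical, and whitespace-separated words stand in for real model tokens, which a production pipeline would count with the embedding model's tokenizer.

```python
def split_long_chunk(text, max_tokens, overlap=0):
    # Assumption: "tokens" are whitespace-separated words.
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return [text]  # short passages pass through unchanged
    step = max_tokens - overlap
    pieces = []
    for start in range(0, len(tokens), step):
        pieces.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end
    return pieces


passage = " ".join(f"w{i}" for i in range(10))
pieces = split_long_chunk(passage, max_tokens=4, overlap=1)
for piece in pieces:
    print(piece)
```

Applying a splitter like this only to chunks that exceed the limit keeps tables and short sections intact while still bounding the size of long prose passages.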
Alternative Approaches
The market offers options: PyMuPDF for speed, Unstructured for format variety, LlamaParse for LLM-powered extraction, and MarkItDown, recently released by Microsoft Research, for lightweight document conversion.
Docling's differentiator is hierarchical structure preservation at scale. Whether that matters depends on your documents. For simple text PDFs, simpler tools work fine. For financial reports, technical documentation, or multi-format processing pipelines, the structured output appears to justify the additional complexity.
IBM Research presented a PyData Global 2025 tutorial on RAG integration. The code is Apache-licensed. Whether it becomes infrastructure or gets replaced by the next approach remains to be seen.