MarkItDown: free, local, 140K GitHub stars, MIT license. Unstructured: enterprise-grade layout detection, 25+ data connectors. LlamaParse: AI-powered PDF parsing, cloud-only. Which one fits your stack? Start here.
| | MarkItDown | Unstructured | LlamaParse |
|---|---|---|---|
| Creator | Microsoft | Unstructured.io | LlamaIndex |
| License | MIT | Apache 2.0 | Proprietary (free tier) |
| Install | pip install markitdown | pip install unstructured | Cloud API only |
| Runs locally? | Yes | Yes | No (cloud) |
| Formats | DOCX, PDF, PPTX, XLSX, HTML, CSV, JSON, XML, ZIP, images, audio | DOCX, PDF, PPTX, XLSX, HTML, CSV, JSON, XML, TXT, Markdown, email, RTF, EPUB | PDF (primary), DOCX, PPTX, images |
| PDF accuracy | Basic text extraction | Good layout detection | Excellent (AI-powered) |
| Table handling | Basic (merged cells break) | Good (partitioning API) | Excellent |
| Image description | Built-in LLM support | Separate pipeline needed | Built-in (limited) |
| MCP Server | Yes (official) | No | No |
| Docker | Manual setup | Official image | N/A (cloud) |
| Free tier | Unlimited | Unlimited (OSS) | 1,000 pages/day |
| Paid pricing | Free | $10 per 1,000 pages (API) | $3 per 1,000 pages ($0.003/page) |
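
To make the pricing row easier to compare, here is a rough cost sketch in Python. It assumes the listed rates apply flat to every page and ignores the free tiers, volume discounts, and the compute cost of running a tool locally; the function name `estimate_cost` is made up for this example.

```python
# Paid rates in USD per 1,000 pages, taken from the "Paid pricing" row.
# Assumption: flat per-page pricing, no free-tier allowance or discounts.
RATE_PER_1000_PAGES = {
    "MarkItDown": 0.0,     # free, runs locally
    "Unstructured": 10.0,  # hosted API rate
    "LlamaParse": 3.0,     # $0.003/page * 1,000 pages
}


def estimate_cost(pages: int) -> dict[str, float]:
    """Return the estimated USD cost of parsing `pages` pages with each tool."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return {tool: pages * rate / 1000 for tool, rate in RATE_PER_1000_PAGES.items()}


if __name__ == "__main__":
    for tool, cost in estimate_cost(50_000).items():
        print(f"{tool}: ${cost:,.2f}")
```

At 50,000 pages this works out to $500 for the Unstructured API and $150 for LlamaParse, so the gap only starts to matter at volume.
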
MarkItDown is Microsoft's lightweight converter, with 140,000+ GitHub stars and growing at roughly 200 stars per day. It installs in one line, and core conversion works entirely offline with no API keys. The standout feature is optional LLM-powered image description: it uses GPT or Claude to describe images embedded in documents, which makes the output Markdown searchable. That feature calls an external model, so it is the one part that does need an API key.
Strengths: Zero cost, no cloud dependency, MCP Server for Claude Desktop integration, 29+ format support, MIT license.
Weaknesses: PDF extraction is basic — no layout detection, no table parsing. Complex PDFs with multi-column layouts or merged table cells produce garbled output. Designed as a demo tool — production hardening is on you.
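A minimal sketch of local batch conversion with MarkItDown's Python API, assuming the package exposes `MarkItDown().convert(path)` and a `text_content` attribute on the result. The helpers `make_converter` and `convert_folder` are hypothetical names for this example, and the import is deferred so a stand-in converter can be passed for testing.

```python
from pathlib import Path


def make_converter():
    """Create a MarkItDown converter (requires `pip install markitdown`)."""
    # Imported lazily so this module loads even when the package is absent.
    from markitdown import MarkItDown
    return MarkItDown()


def convert_folder(folder, suffixes=(".docx", ".pptx", ".xlsx"), converter=None):
    """Convert every matching file in `folder`; returns {filename: markdown}.

    `converter` is any object with a `.convert(path)` method whose result
    has a `.text_content` attribute. Defaults to a MarkItDown instance.
    """
    if converter is None:
        converter = make_converter()
    converted = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in suffixes:
            converted[path.name] = converter.convert(str(path)).text_content
    return converted
```

The default suffix list sticks to the Office formats where MarkItDown is strongest; add `.pdf` only if basic text extraction is enough for your files.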
Unstructured is the enterprise-grade option. Its document partitioning engine understands layouts, columns, and table structures. It can chunk documents for RAG pipelines and has built-in connectors for 25+ data sources.
Strengths: Superior layout detection, excellent table extraction, official Docker images, enterprise support, broader format coverage.
Weaknesses: Heavier install (~500MB with all dependencies), slower on simple documents, API pricing adds up at scale, no built-in MCP support.
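To illustrate what partitioning gives you, here is a small sketch using the open-source `unstructured` package. It assumes `partition(filename=...)` from `unstructured.partition.auto` returns elements that carry a `category` (such as `Title`, `NarrativeText`, or `Table`) and a `text` attribute; `summarize_elements` and `tables_only` are hypothetical helpers written for this example.

```python
from collections import Counter


def partition_file(path: str):
    """Split a document into typed elements (requires `pip install unstructured`)."""
    # Imported lazily because the package and its dependencies are heavy.
    from unstructured.partition.auto import partition
    return partition(filename=path)


def summarize_elements(elements) -> Counter:
    """Count elements by category to get a quick picture of the layout."""
    return Counter(el.category for el in elements)


def tables_only(elements) -> list[str]:
    """Return the text of every element that was detected as a table."""
    return [el.text for el in elements if el.category == "Table"]
```

Calling `summarize_elements(partition_file("report.pdf"))` on a sample file is a quick way to check whether the layout detection justifies the heavier install for your documents.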
LlamaParse is LlamaIndex's cloud-based PDF parser. It uses LLMs natively to understand document structure, which makes it the most accurate of the three for complex PDFs. It is particularly good at tables, charts, and multi-column academic papers.
Strengths: Best PDF accuracy, excellent at table understanding, native LlamaIndex integration for RAG, handles scanned documents.
Weaknesses: Cloud-only (no local processing), requires API key, free tier limited to 1,000 pages/day, not suitable for sensitive documents that can't leave your infrastructure.
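The free-tier limit is easiest to judge with a little arithmetic. The sketch below assumes the figures quoted in this comparison (1,000 free pages per day, $0.003 per paid page) and assumes any pages beyond the free allowance are billed at the paid rate; check the current pricing page before relying on it. Both function names are made up for this example.

```python
import math

FREE_PAGES_PER_DAY = 1_000       # free-tier limit quoted in this comparison
PAID_USD_PER_1000_PAGES = 3.0    # $0.003/page, as quoted in this comparison


def days_on_free_tier(total_pages: int) -> int:
    """Days needed to parse a backlog using only the free daily allowance."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    return math.ceil(total_pages / FREE_PAGES_PER_DAY)


def plan_backlog(total_pages: int, deadline_days: int) -> tuple[int, int, float]:
    """Split a backlog into (free_pages, paid_pages, paid_cost_usd).

    Assumption: the free allowance is used in full on each day before the
    deadline, and the remainder is billed at the flat paid rate.
    """
    free_pages = min(total_pages, deadline_days * FREE_PAGES_PER_DAY)
    paid_pages = total_pages - free_pages
    return free_pages, paid_pages, paid_pages * PAID_USD_PER_1000_PAGES / 1000
```

For example, a 12,000-page backlog takes 12 days on the free tier alone. With a 7-day deadline, 5,000 pages fall outside the allowance, which costs about $15 at the quoted rate.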
| Your Use Case | Best Tool | Why |
|---|---|---|
| Converting Office docs to Markdown locally | MarkItDown | Fast, free, handles DOCX/PPTX/XLSX well |
| Building a RAG pipeline over PDFs | Unstructured | Best chunking and layout detection |
| Parsing complex academic papers | LlamaParse | AI-powered accuracy for complex layouts |
| Claude Desktop automation | MarkItDown | Only one with official MCP server |
| Processing sensitive/NDA documents | MarkItDown | 100% local — data never leaves your machine |
| Enterprise document pipeline | Unstructured | Official Docker, support contracts, 25+ connectors |
| Low budget, high volume | MarkItDown | Completely free, unlimited pages |
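
The table above can be sketched as a small rule-based helper. The rule order and the `Requirements` fields are this example's own simplification of the recommendations, not an official decision procedure, so treat the output as a starting point.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical checklist distilled from the use-case table."""
    must_stay_local: bool = False       # sensitive or NDA documents
    needs_mcp_server: bool = False      # e.g. Claude Desktop automation
    complex_pdf_layouts: bool = False   # academic papers, dense tables
    rag_or_enterprise: bool = False     # chunking, connectors, support contracts


def recommend(req: Requirements) -> str:
    """Return the tool the use-case table suggests for these requirements."""
    if req.needs_mcp_server:
        return "MarkItDown"      # the only one listed with an official MCP server
    if req.rag_or_enterprise:
        return "Unstructured"    # chunking, layout detection, connectors
    if req.complex_pdf_layouts:
        # LlamaParse is cloud-only, so fall back to local layout detection
        # when documents cannot leave your infrastructure.
        return "Unstructured" if req.must_stay_local else "LlamaParse"
    return "MarkItDown"          # free, local default for Office documents
```

For instance, `recommend(Requirements(complex_pdf_layouts=True, must_stay_local=True))` returns `"Unstructured"`, because the most accurate PDF option is ruled out by the cloud-only constraint.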