Convert PDF, DOCX, PPTX, Excel, and 10+ formats to clean Markdown with Microsoft's 140K-star open-source tool. Batch processing scripts, 5 common error fixes, Docker deployment, and LLM comparison (GPT vs Claude vs Gemini) — everything you need to run MarkItDown in production.
Processing one file is easy. Processing 100 files without losing your mind takes a script. MarkItDown's Markdown output is 3-8x more token-efficient than raw HTML for LLM consumption — an <h1 class="title"> costs 23 tokens, # Title costs 3. Here's the production batch script that converts entire directories.
import os
from markitdown import MarkItDown
from pathlib import Path
# Initialize once — reuse for all files
md = MarkItDown()
INPUT_DIR = "./files_to_convert"
OUTPUT_DIR = "./converted"
Path(OUTPUT_DIR).mkdir(exist_ok=True)
for filename in os.listdir(INPUT_DIR):
if filename.endswith(('.docx', '.pdf', '.pptx', '.xlsx')):
filepath = os.path.join(INPUT_DIR, filename)
result = md.convert(filepath)
out_name = os.path.splitext(filename)[0] + '.md'
with open(os.path.join(OUTPUT_DIR, out_name), 'w') as f:
f.write(result.text_content)
print(f"✓ {filename} → {out_name}")
MarkItDown() instance across all files. Creating a new instance per file adds ~200ms overhead each time — noticeable when processing 100+ documents.
These are the errors I hit in the first week of production use. Each one cost me at least an hour.
Complex tables (merged cells, nested tables) lose structure during conversion. The output is technically valid Markdown but unreadable.
Pre-process the document. For DOCX, use python-docx to extract tables separately. For PDFs with heavy tables, use camelot or tabula before feeding to MarkItDown. → Full table fix guide: DOCX, PDF, XLSX & merged cells
PDFs over 50MB or 100+ pages can cause MarkItDown to hang for minutes — no progress indicator, no timeout.
Add a timeout wrapper. Split large PDFs into chunks with PyPDF2 before conversion.
import signal
from markitdown import MarkItDown
def convert_with_timeout(filepath, timeout=60):
signal.alarm(timeout)
try:
md = MarkItDown()
return md.convert(filepath)
finally:
signal.alarm(0) # cancel alarm
MarkItDown silently returns empty content for encrypted PDFs. No error, no warning — just an empty string.
Check if the PDF is encrypted before passing it to MarkItDown. If it is, decrypt with pikepdf first.
import pikepdf
from pathlib import Path
def is_encrypted(filepath):
try:
pikepdf.open(filepath)
return False
except pikepdf.PasswordError:
return True
MarkItDown extracts text from PDFs, but the output is raw. Every real-world PDF has the same problems: duplicate sentences, CID markers (cid:1), zero structure. This script fixes all of them in one pass. → Full PDF to Markdown workflow guide
Every PDF I've processed has the same four problems. Here's what they look like:
| Problem | Example |
|---|---|
| 🔁 Duplicate text | Every sentence appears twice — PDF text layer + embedded metadata both get extracted. |
| 🔈 CID noise | (cid:1) (cid:2) (cid:3) — font glyph markers from PDF internals, meaningless in output. |
| 📄 Flat structure | No headings, no hierarchy. Section titles look exactly like body text. |
| ✄ Broken lines | Sentences split mid-phrase by PDF layout boxes, not by meaning. |
Run this after MarkItDown. It handles all four problems — remove CID tokens, deduplicate, detect section headings, and compress blank lines.
import re
with open("input.md", "r", encoding="utf-8") as f:
lines = f.readlines()
# 1. Strip CID markers: (cid:1), (cid:2), etc.
lines = [re.sub(r'\(cid:\d+\)\s*', '', l).strip() for l in lines]
# 2. Dedupe: remove identical consecutive lines
deduped = []
prev = ""
for l in lines:
if l != prev:
deduped.append(l)
prev = l
# 3. Detect headers: time-stamped segments get ###
# Short standalone Chinese lines get **bold**
result = []
for l in deduped:
if re.search(r'\[\d{2}:\d{2}[~~-]\d{2}:\d{2}\]', l):
result.append("### " + l)
elif re.match(r'^[一-鿿\w]{2,18}$', l):
result.append("**" + l + "**")
else:
result.append(l)
# 4. Collapse multiple blank lines into one
final = []
blank = False
for l in result:
if l == "":
if not blank: final.append(l)
blank = True
else:
final.append(l)
blank = False
with open("output.md", "w", encoding="utf-8") as f:
f.write("\n".join(final))
print(f"Done. {len(lines)} -> {len(final)} lines")
For complex PDFs — multi-column layouts, mixed languages, heavy tables — regex cleanup only goes so far. Feed the raw MarkItDown output to any LLM with this prompt:
You are a document cleanup assistant. Given raw MarkItDown output:
1. Remove all duplicate sentences (even near-duplicates)
2. Remove transcription noise: CID markers, page numbers, watermarks
3. Add proper Markdown headings (##, ###) by detecting section breaks
4. Merge broken sentences that PDF layout split across lines
5. Keep ALL factual content — don't summarize, only clean
6. Output valid Markdown
Here is the raw text:
[paste MarkItDown output]
This one-shot prompt produces publishable Markdown from even the messiest PDFs. Use it for critical documents where quality matters.
Production-ready setup with FastAPI + MarkItDown in Docker. Accepts file uploads, returns Markdown. No API keys needed.
version: '3.9'
services:
markitdown-api:
build: .
ports:
- "8000:8000"
environment:
- MAX_FILE_SIZE_MB=50
- REQUEST_TIMEOUT=120
restart: unless-stopped
volumes:
- ./tmp:/app/tmp
from fastapi import FastAPI, UploadFile, File, HTTPException
from markitdown import MarkItDown
from pathlib import Path
import tempfile, os
app = FastAPI(title="MarkItDown API")
md = MarkItDown()
MAX_SIZE = int(os.getenv("MAX_FILE_SIZE_MB", 50)) * 1024 * 1024
@app.post("/convert")
async def convert_file(file: UploadFile = File(...)):
content = await file.read()
if len(content) > MAX_SIZE:
raise HTTPException(413, "File too large")
with tempfile.NamedTemporaryFile(suffix=Path(file.filename).suffix, delete=False) as tmp:
tmp.write(content)
result = md.convert(tmp.name)
os.unlink(tmp.name)
return {"filename": file.filename, "markdown": result.text_content}
FROM python:3.11-slim
WORKDIR /app
RUN apt-get update && apt-get install -y --no-install-recommends \
libgl1 libglib2.0-0 && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
markitdown, fastapi, uvicorn, python-multipart
Turn MarkItDown into an MCP (Model Context Protocol) server. Connect it to Claude Desktop and convert any file to Markdown directly from chat — no terminal, no scripts.
MCP is an open protocol that lets AI assistants like Claude Desktop talk to external tools. With a MarkItDown MCP server running, you can paste a file path into Claude and get back clean Markdown instantly. No more switching windows.
pip install markitdown-mcp
Or the community edition with extra features (OCR, audio transcription):
pipx install git+https://github.com/trsdn/markitdown-mcp.git
pipx inject markitdown-mcp 'markitdown[all]' openpyxl xlrd pandas pymupdf pdfplumber
Open your Claude Desktop config file:
| OS | Path |
|---|---|
| Windows | %APPDATA%\Claude\claude_desktop_config.json |
| macOS | ~/Library/Application Support/Claude/claude_desktop_config.json |
Add this block:
{
"mcpServers": {
"markitdown": {
"command": "markitdown-mcp",
"args": []
}
}
}
convert_to_markdown tool appear. Paste a file URI — file:///C:/docs/report.pdf or https://example.com/doc.docx — and Claude converts it for you.
The MCP server supports all 29+ formats that MarkItDown handles: DOCX, PDF, PPTX, XLSX, HTML, CSV, JSON, XML, ZIP, EPUB, images (via LLM description), and audio (with optional dependencies).
If you prefer not to install Python packages locally:
{
"mcpServers": {
"markitdown": {
"command": "docker",
"args": ["run", "--rm", "-i", "markitdown-mcp:latest"]
}
}
}
"-v", "/home/user/data:/workdir" to the args array. Then Claude can convert local files without uploading them anywhere.
MarkItDown uses an LLM to describe images embedded in documents. Which model gives the best descriptions? I tested GPT-4o, Claude 4 Sonnet, and Gemini 2.5 Pro on the same set of 10 images from real-world documents.
| Criteria | GPT-4o | Claude 4 Sonnet | Gemini 2.5 Pro |
|---|---|---|---|
| Accuracy Correctly identifies objects |
9.2/10 |
9.0/10 |
8.5/10 |
| Detail level How thorough are descriptions |
8.7/10 |
9.3/10 |
8.0/10 |
| Chart understanding Bar charts, pie, line graphs |
9.0/10 |
8.8/10 |
8.3/10 |
| Speed Average per image (lower is better) |
1.2s |
0.9s |
2.1s |
| Cost per 1000 images Approximate (USD) |
~$2.50 | ~$1.80 | ~$1.50 |
| Overall | Best accuracy | Best detail | Best price |
Partially. Bold, italic, headings, lists, and links convert correctly. Tables with merged cells, multi-column layouts, and embedded charts often break. Test a representative sample of your documents before committing to a pipeline.
Yes. MarkItDown works without an LLM for text extraction. The LLM is only used for image description — if your documents don't have images, or you don't need image descriptions in the output, you can skip configuring an LLM entirely.
MarkItDown is simpler and lighter than both. Unstructured has more format support and partitioning features but is heavier. LlamaParse (by LlamaIndex) is better for complex PDFs but requires a cloud API. MarkItDown's sweet spot: simple, local, free, works for 80% of common document types. → Full comparison: MarkItDown vs Unstructured vs LlamaParse
Yes, with caveats. Wrap it in the FastAPI + Docker setup shown above, add a timeout (the default has none), and implement file size limits. The library itself is stable but was designed as a demo tool — production hardening is on you.
DOCX, PDF, PPTX, XLSX, HTML, CSV, JSON, XML, ZIP (iterates contents), and images (via LLM description). EPUB support is partial. Legacy formats like .doc require conversion to .docx first.