A complete production workflow — from raw PDF to clean, publishable Markdown using MarkItDown, batch scripts, and post-processing.
Convert a single PDF in a few lines:
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
Works. But real-world PDFs need more. Here's the full pipeline.
Reuse a single MarkItDown instance across all files. Creating a new instance per file adds ~200ms overhead — noticeable at scale.
from pathlib import Path

from markitdown import MarkItDown

md = MarkItDown()  # one shared instance for the whole batch

INPUT_DIR = Path("./pdfs_to_convert")
OUTPUT_DIR = Path("./markdown_output")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

for pdf_path in sorted(INPUT_DIR.iterdir()):
    # Compare the extension case-insensitively so "REPORT.PDF" is not skipped.
    if pdf_path.suffix.lower() != ".pdf":
        continue
    result = md.convert(str(pdf_path))
    out_path = OUTPUT_DIR / (pdf_path.stem + ".md")
    # Write UTF-8 explicitly; the platform default encoding can fail on non-ASCII text.
    out_path.write_text(result.text_content, encoding="utf-8")
    print(f"Converted: {pdf_path.name} -> {out_path.name}")
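One malformed PDF should not abort a long batch. The sketch below shows one way to collect failures instead of raising them. `convert_batch` and its `convert` parameter are hypothetical names, not part of MarkItDown; `convert` is assumed to be any callable that takes a path string and returns Markdown text, for example `lambda p: md.convert(p).text_content`.

```python
from pathlib import Path
from typing import Callable


def convert_batch(input_dir, output_dir, convert: Callable[[str], str]):
    """Convert every PDF in input_dir; return (converted, failed) lists.

    `convert` is an assumed callable: path string in, Markdown text out.
    """
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converted, failed = [], []
    for pdf_path in sorted(in_dir.iterdir()):
        if pdf_path.suffix.lower() != ".pdf":
            continue
        try:
            text = convert(str(pdf_path))
        except Exception as exc:  # keep going; report the failure at the end
            failed.append((pdf_path.name, repr(exc)))
            continue
        (out_dir / (pdf_path.stem + ".md")).write_text(text, encoding="utf-8")
        converted.append(pdf_path.name)
    return converted, failed
```

The `failed` list pairs each file name with the error it raised, so the batch can be rerun on just those files after the cause is fixed.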
MarkItDown extracts the text, but real-world PDFs tend to show the same four problems:
| Problem | Example |
|---|---|
| Duplicate text | Every sentence appears twice — PDF text layer + embedded metadata |
| CID noise | (cid:1) (cid:2) — font glyph markers from PDF internals |
| Flat structure | No headings, no hierarchy — section titles look like body text |
| Broken lines | Sentences split mid-phrase by PDF layout boxes |
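Run this after MarkItDown. In one pass it strips CID markers, drops identical consecutive lines, marks up likely headings, and collapses runs of blank lines. It does not rejoin broken lines.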
import re

with open("raw_output.md", "r", encoding="utf-8") as f:
    lines = f.readlines()

# 1. Strip CID markers: (cid:1), (cid:2), etc.
lines = [re.sub(r"\(cid:\d+\)\s*", "", line).strip() for line in lines]

# 2. Deduplicate: remove identical consecutive lines
deduped = []
prev = ""
for line in lines:
    if line != prev:
        deduped.append(line)
    prev = line

# 3. Detect headers: time-stamped segments such as [00:00~01:30] get ###
#    Short standalone lines (2-18 word characters, no spaces) get **bold**
result = []
for line in deduped:
    # The separator may be ~, full-width ～, or -
    if re.search(r"\[\d{2}:\d{2}[~～-]\d{2}:\d{2}\]", line):
        result.append("### " + line)
    elif re.match(r"^[一-鿿\w]{2,18}$", line):
        result.append("**" + line + "**")
    else:
        result.append(line)

# 4. Collapse multiple blank lines into one
final = []
blank = False
for line in result:
    if line == "":
        if not blank:
            final.append(line)
        blank = True
    else:
        final.append(line)
        blank = False

with open("clean_output.md", "w", encoding="utf-8") as f:
    f.write("\n".join(final) + "\n")

print(f"Done. {len(lines)} -> {len(final)} lines")
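The script above leaves sentences that were split mid-phrase untouched. One possible heuristic, sketched here under stated assumptions, joins a line to the previous one when the previous line does not end in sentence-final punctuation and neither line looks structural. `merge_broken_lines` is a hypothetical helper, and both the punctuation set and the rule for skipping headings, bold lines, list items, and table rows are assumptions to tune for your corpus.

```python
import re

# Assumed sentence terminators (Latin and CJK); extend for your corpus.
SENTENCE_END = re.compile(r"[.!?:;。！？：；…)\]]$")
# Lines that are never merged with a neighbour: headings, bold titles,
# list items, table rows.
STRUCTURAL = re.compile(r"^(#{1,6} |\*\*|[-*+] |\d+[.)] |\|)")


def _is_cjk(ch):
    return "\u4e00" <= ch <= "\u9fff"


def merge_broken_lines(lines):
    """Join lines that look like one sentence split by PDF layout boxes."""
    merged = []
    for line in lines:
        prev = merged[-1] if merged else ""
        can_join = (
            prev != ""
            and line != ""
            and not SENTENCE_END.search(prev)
            and not STRUCTURAL.match(prev)
            and not STRUCTURAL.match(line)
        )
        if not can_join:
            merged.append(line)
        elif prev.endswith("-") and line[:1].islower():
            # Treat a trailing hyphen as a line-break hyphen. This can also
            # remove a genuine hyphen in a compound word.
            merged[-1] = prev[:-1] + line
        elif _is_cjk(prev[-1]) and _is_cjk(line[0]):
            merged[-1] = prev + line  # CJK text takes no joining space
        else:
            merged[-1] = prev + " " + line
    return merged
```

To try it in the script, add `final = merge_broken_lines(final)` just before the output file is written.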
For complex PDFs — multi-column layouts, mixed languages, heavy tables — regex cleanup only goes so far. Feed the raw MarkItDown output to any LLM with this prompt:
You are a document cleanup assistant. Given raw MarkItDown output:
1. Remove all duplicate sentences (even near-duplicates)
2. Remove transcription noise: CID markers, page numbers, watermarks
3. Add proper Markdown headings (##, ###) by detecting section breaks
4. Merge broken sentences that PDF layout split across lines
5. Keep ALL factual content — don't summarize, only clean
6. Output valid Markdown
Here is the raw text:
[paste MarkItDown output]
This one-shot prompt can produce publishable Markdown even from messy PDFs, but check the result against the source: an LLM may still drop or reword content despite rule 5.
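Long documents may not fit in a single request. The following is a minimal sketch of how the prompt could be applied chunk by chunk. `split_into_chunks`, `clean_with_llm`, and the `llm` callable are hypothetical names, and the 8,000-character limit is an arbitrary assumption rather than a property of any particular model. `llm` stands in for whatever client you use: it takes a prompt string and returns the model's reply as a string.

```python
CLEANUP_PROMPT = """You are a document cleanup assistant. Given raw MarkItDown output:
1. Remove all duplicate sentences (even near-duplicates)
2. Remove transcription noise: CID markers, page numbers, watermarks
3. Add proper Markdown headings (##, ###) by detecting section breaks
4. Merge broken sentences that PDF layout split across lines
5. Keep ALL factual content: don't summarize, only clean
6. Output valid Markdown
Here is the raw text:
"""


def split_into_chunks(text, max_chars=8000):
    """Split text on blank lines into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = para if not current else current + "\n\n" + para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def clean_with_llm(raw_text, llm, max_chars=8000):
    """Send each chunk through `llm` (assumed: prompt str -> reply str)."""
    chunks = split_into_chunks(raw_text, max_chars)
    return "\n\n".join(llm(CLEANUP_PROMPT + chunk) for chunk in chunks)
```

Chunking has a cost: the model only sees one chunk at a time, so duplicates that straddle a chunk boundary will survive and need a second pass.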
MarkItDown can return empty content for encrypted PDFs without raising an error or warning.
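import pikepdf

def decrypt_pdf(input_path, output_path, password=""):
    # An empty password is enough when the PDF only sets an owner password.
    with pikepdf.open(input_path, password=password) as pdf:
        pdf.save(output_path)  # saved without encryption by default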
Decrypt first, then feed to MarkItDown.
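Because the failure mode is empty output rather than an exception, one option is to treat empty text as a signal and retry after decryption. This is a sketch, not MarkItDown behavior: `convert_with_decrypt_fallback` is a hypothetical helper, `convert` is assumed to be a callable that returns Markdown text for a path, and `decrypt` is assumed to take `(input_path, output_path)` like `decrypt_pdf` above.

```python
import os
import tempfile


def convert_with_decrypt_fallback(path, convert, decrypt):
    """Convert `path`; if the text comes back empty, decrypt and retry once.

    convert: callable(path) -> str   (assumed, e.g. wraps MarkItDown)
    decrypt: callable(src, dst)      (assumed, e.g. decrypt_pdf)
    """
    text = convert(path)
    if text.strip():
        return text
    # Empty output: possibly encrypted. Decrypt into a temp file and retry.
    fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)
    try:
        decrypt(path, tmp_path)
        return convert(tmp_path)
    finally:
        os.remove(tmp_path)
```

Empty output has other causes too, such as a scanned PDF with no text layer, so an empty result after the retry still needs a manual look.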
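Add a timeout wrapper to prevent hanging. This version is Unix-only and must run in the main thread, because it relies on signal.alarm:
import signal
from markitdown import MarkItDown

md = MarkItDown()  # reuse one instance instead of creating one per call

def _on_timeout(signum, frame):
    raise TimeoutError("PDF conversion timed out")

def convert_with_timeout(filepath, timeout=60):
    # Without a handler, SIGALRM would terminate the whole process.
    old_handler = signal.signal(signal.SIGALRM, _on_timeout)
    signal.alarm(timeout)
    try:
        return md.convert(filepath)
    finally:
        signal.alarm(0)  # cancel alarm
        signal.signal(signal.SIGALRM, old_handler)  # restore previous handler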
For extremely large PDFs, split into chunks with pypdf (the maintained successor to PyPDF2) before conversion.
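A minimal sketch of the splitting step, assuming pypdf is installed (pypdf is the maintained successor to PyPDF2). `page_ranges` and `split_pdf` are hypothetical helpers, and the default of 50 pages per chunk and the `_partNNN` file naming are arbitrary choices for illustration.

```python
from pathlib import Path


def page_ranges(total_pages, pages_per_chunk):
    """Return (start, stop) page index pairs covering 0..total_pages."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    return [
        (start, min(start + pages_per_chunk, total_pages))
        for start in range(0, total_pages, pages_per_chunk)
    ]


def split_pdf(input_path, output_dir, pages_per_chunk=50):
    """Write input_path as numbered chunk files; return their paths."""
    # Imported here so page_ranges stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(input_path))
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(input_path).stem
    chunk_paths = []
    ranges = page_ranges(len(reader.pages), pages_per_chunk)
    for i, (start, stop) in enumerate(ranges, 1):
        writer = PdfWriter()
        for n in range(start, stop):
            writer.add_page(reader.pages[n])
        chunk_path = out_dir / f"{stem}_part{i:03d}.pdf"
        with open(chunk_path, "wb") as f:
            writer.write(f)
        chunk_paths.append(chunk_path)
    return chunk_paths
```

Each chunk can then be converted on its own and the Markdown outputs concatenated in order. Sentences that cross a chunk boundary will still be split at that boundary.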
Wrap the entire pipeline as an API for team use:
# docker-compose.yml
services:
  pdf-converter:
    build: .
    ports:
      - "8000:8000"
    environment:
      - MAX_FILE_SIZE_MB=50
      - REQUEST_TIMEOUT=120
    volumes:
      - ./output:/app/output
See the Docker deployment guide for the full app.py and Dockerfile.
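Purely as an illustration of how the two environment variables in the compose file might be consumed, here is a hypothetical settings loader. `Settings`, `load_settings`, and `check_upload` are invented names and do not come from the deployment guide; treating `REQUEST_TIMEOUT` as seconds is also an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    max_file_size_mb: int
    request_timeout: int  # seconds (assumed unit)

    @property
    def max_file_size_bytes(self):
        return self.max_file_size_mb * 1024 * 1024


def load_settings(env=None):
    """Read limits from the environment, defaulting to the compose values."""
    env = os.environ if env is None else env
    return Settings(
        max_file_size_mb=int(env.get("MAX_FILE_SIZE_MB", "50")),
        request_timeout=int(env.get("REQUEST_TIMEOUT", "120")),
    )


def check_upload(size_bytes, settings):
    """Raise ValueError if an uploaded file exceeds the configured limit."""
    if size_bytes > settings.max_file_size_bytes:
        raise ValueError(
            f"file is {size_bytes} bytes; limit is {settings.max_file_size_mb} MB"
        )
```

Rejecting oversized uploads before conversion starts keeps a single large file from consuming the whole request timeout.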
| Tool | Best For | Cost |
|---|---|---|
| MarkItDown | Simple PDFs, local processing | Free |
| Unstructured | Complex layouts, partitioning | Free (OSS) / Paid API |
| LlamaParse | Complex PDFs, cloud API | Free tier / Paid |
See the full comparison guide for detailed benchmarks.