Python PDF to Markdown

A complete production workflow — from raw PDF to clean, publishable Markdown using MarkItDown, batch scripts, and post-processing.

Quick Start: One File

Convert a single PDF in four lines:

from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)

Works. But real-world PDFs need more. Here's the full pipeline.

Batch Processing: 100+ Files

Reuse a single MarkItDown instance across all files. Creating a new instance per file adds ~200ms overhead — noticeable at scale.

from pathlib import Path

from markitdown import MarkItDown

md = MarkItDown()  # one instance, reused for every file
INPUT_DIR = Path("./pdfs_to_convert")
OUTPUT_DIR = Path("./markdown_output")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

for pdf_path in sorted(INPUT_DIR.iterdir()):
    # Match .pdf and .PDF alike
    if pdf_path.suffix.lower() != ".pdf":
        continue
    try:
        result = md.convert(str(pdf_path))
    except Exception as exc:  # keep the batch going if one file fails
        print(f"Failed: {pdf_path.name} ({exc})")
        continue
    out_path = OUTPUT_DIR / (pdf_path.stem + ".md")
    # Explicit UTF-8 so non-ASCII text survives on every platform
    out_path.write_text(result.text_content, encoding="utf-8")
    print(f"Converted: {pdf_path.name} -> {out_path.name}")
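
For long batches it can help to make the run resumable. The sketch below is one possible approach, not part of MarkItDown: it walks the input folder recursively and yields only the PDFs that have no matching .md file yet. The helper name pending_conversions and the mirrored output layout are assumptions made for illustration.

```python
from pathlib import Path


def pending_conversions(input_dir, output_dir):
    """Yield (pdf_path, md_path) pairs for PDFs without a converted .md file.

    Hypothetical helper: the output tree mirrors the input tree, and a PDF
    counts as done as soon as its .md file exists.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    for pdf_path in sorted(input_dir.rglob("*")):
        # Accept .pdf and .PDF, skip directories and other file types
        if not pdf_path.is_file() or pdf_path.suffix.lower() != ".pdf":
            continue
        md_path = (output_dir / pdf_path.relative_to(input_dir)).with_suffix(".md")
        if md_path.exists():
            continue  # already converted in an earlier run
        yield pdf_path, md_path
```

In the batch loop, iterate over pending_conversions(INPUT_DIR, OUTPUT_DIR), create md_path.parent if needed, and write result.text_content to md_path.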

The Cleanup Problem

MarkItDown extracts text, but real-world PDFs tend to show the same four problems:

Problem	Example
Duplicate text	Every sentence appears twice — PDF text layer + embedded metadata
CID noise	`(cid:1)` `(cid:2)` — font glyph markers from PDF internals
Flat structure	No headings, no hierarchy — section titles look like body text
Broken lines	Sentences split mid-phrase by PDF layout boxes
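
Before cleaning, it can be useful to measure how badly a given file is affected by each problem in the table. The following is a rough diagnostic sketch using only the standard library. The function name diagnose and the heuristics (for example, treating a line without closing punctuation as possibly broken) are assumptions, not MarkItDown features.

```python
import re

CID_PATTERN = re.compile(r"\(cid:\d+\)")
SENTENCE_END = (".", "!", "?", ":", "。", "！", "？")


def diagnose(raw_text):
    """Count rough indicators of the four problems in raw MarkItDown output."""
    lines = [line.strip() for line in raw_text.splitlines()]
    non_blank = [line for line in lines if line]
    return {
        # CID noise: font glyph markers such as (cid:1)
        "cid_markers": len(CID_PATTERN.findall(raw_text)),
        # Duplicate text: a non-blank line identical to the line before it
        "consecutive_duplicates": sum(
            1 for prev, cur in zip(lines, lines[1:]) if cur and cur == prev
        ),
        # Flat structure: zero Markdown headings suggests lost hierarchy
        "markdown_headings": sum(1 for line in non_blank if line.startswith("#")),
        # Broken lines: lines that stop without sentence-ending punctuation
        "unterminated_lines": sum(
            1 for line in non_blank if not line.endswith(SENTENCE_END)
        ),
    }
```

Call diagnose(result.text_content) on a few sample files; high counts indicate which cleanup steps matter most for that batch.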

The Post-Processing Script

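Run this after MarkItDown. In one pass it strips CID markers, removes consecutive duplicate lines, marks likely headings, and collapses blank lines. It does not rejoin broken lines.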

import re

with open("raw_output.md", "r", encoding="utf-8") as f:
    lines = f.readlines()

# 1. Strip CID markers: (cid:1), (cid:2), etc.
lines = [re.sub(r'\(cid:\d+\)\s*', '', l).strip() for l in lines]

# 2. Deduplicate: remove identical consecutive lines
deduped = []
prev = ""
for l in lines:
    if l != prev:
        deduped.append(l)
    prev = l

# 3. Detect headers: time-stamped segments get ###
#    Short standalone lines get **bold**
result = []
for l in deduped:
    if re.search(r'\[\d{2}:\d{2}[～~-]\d{2}:\d{2}\]', l):
        result.append("### " + l)
    elif re.match(r'^[一-鿿\w]{2,18}$', l):
        result.append("**" + l + "**")
    else:
        result.append(l)

# 4. Collapse multiple blank lines into one
final = []
blank = False
for l in result:
    if l == "":
        if not blank: final.append(l)
        blank = True
    else:
        final.append(l)
        blank = False

with open("clean_output.md", "w", encoding="utf-8") as f:
    f.write("\n".join(final))

print(f"Done. {len(lines)} -> {len(final)} lines")

Real result: this script cut a 1,766-line PDF dump to 1,293 lines, removing nearly 500 duplicate lines and all CID markers, and adding structural headings.
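
The script above does not rejoin sentences that the PDF layout split across lines, the fourth problem in the table. A minimal heuristic sketch follows. The function merge_broken_lines is hypothetical, and it assumes a line is a continuation when the previous line lacks sentence-ending punctuation and neither line looks like a heading, list item, table row, quote, or bold marker. It joins with a single space, which suits space-delimited languages; CJK text would need a join without the space, and words hyphenated across a break are not repaired.

```python
import re

SENTENCE_END = (".", "!", "?", ":", ";", "。", "！", "？", "：")
# Lines that start a Markdown block and should never be glued to a neighbour
BLOCK_START = re.compile(r"^(#{1,6}\s|[-*+]\s|\d+[.)]\s|\||>|\*\*)")


def merge_broken_lines(lines):
    """Join a line onto the previous one when it looks like a continuation."""
    merged = []
    for line in lines:
        line = line.strip()
        can_join = bool(
            merged
            and merged[-1]  # previous line is not blank
            and line  # current line is not blank
            and not merged[-1].endswith(SENTENCE_END)
            and not BLOCK_START.match(merged[-1])
            and not BLOCK_START.match(line)
        )
        if can_join:
            merged[-1] = merged[-1] + " " + line
        else:
            merged.append(line)
    return merged
```

To try it with the script above, call final = merge_broken_lines(final) just before writing clean_output.md, then review a sample of the output, since the heuristic can merge lines that were meant to stay separate.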

When Regex Isn't Enough: Use an LLM

For complex PDFs — multi-column layouts, mixed languages, heavy tables — regex cleanup only goes so far. Feed the raw MarkItDown output to any LLM with this prompt:

You are a document cleanup assistant. Given raw MarkItDown output:
1. Remove all duplicate sentences (even near-duplicates)
2. Remove transcription noise: CID markers, page numbers, watermarks
3. Add proper Markdown headings (##, ###) by detecting section breaks
4. Merge broken sentences that PDF layout split across lines
5. Keep ALL factual content — don't summarize, only clean
6. Output valid Markdown

Here is the raw text:
[paste MarkItDown output]

This single prompt often turns even messy PDFs into publishable Markdown, but spot-check the result against the source, because an LLM can silently drop or reword content.
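
Long documents may not fit in one request. The sketch below shows one way to build the prompt per chunk. It splits on blank lines so paragraphs stay intact. The names split_into_chunks and build_prompts and the 12,000-character default are assumptions chosen for illustration. Sending each prompt to a model is left to whichever LLM client you use.

```python
CLEANUP_PROMPT = """You are a document cleanup assistant. Given raw MarkItDown output:
1. Remove all duplicate sentences (even near-duplicates)
2. Remove transcription noise: CID markers, page numbers, watermarks
3. Add proper Markdown headings (##, ###) by detecting section breaks
4. Merge broken sentences that PDF layout split across lines
5. Keep ALL factual content - don't summarize, only clean
6. Output valid Markdown

Here is the raw text:
"""


def split_into_chunks(raw_text, max_chars=12000):
    """Group paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars becomes its own oversized chunk.
    """
    chunks, current, size = [], [], 0
    for para in raw_text.split("\n\n"):
        added = len(para) + (2 if current else 0)  # 2 for the "\n\n" separator
        if current and size + added > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [para], len(para)
        else:
            current.append(para)
            size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def build_prompts(raw_text, max_chars=12000):
    """Return one complete cleanup prompt per chunk of raw text."""
    return [CLEANUP_PROMPT + chunk for chunk in split_into_chunks(raw_text, max_chars)]
```

Chunking has a cost: duplicates or headings that span a chunk boundary are invisible to the model, so a final pass over the joined output is still worthwhile.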

Handling Edge Cases

Encrypted / Password-Protected PDFs

MarkItDown silently returns empty content for encrypted PDFs. No error, no warning.

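import pikepdf

def decrypt_pdf(input_path, output_path, password=""):
    # Raises pikepdf.PasswordError if the password is wrong
    with pikepdf.open(input_path, password=password) as pdf:
        pdf.save(output_path)  # saved without encryption by default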

Decrypt first, then feed to MarkItDown.
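
Because the failure is silent, it helps to flag encrypted files before conversion. The following is a rough standard-library heuristic, not a pikepdf or MarkItDown API: encrypted PDFs carry an /Encrypt entry in their trailer, so the sketch simply searches the raw bytes for that key. The function name looks_encrypted is hypothetical, and the check can report false positives if the same bytes appear elsewhere in the file.

```python
from pathlib import Path


def looks_encrypted(pdf_path):
    """Heuristic: True if the raw PDF bytes contain an /Encrypt key.

    Reads the whole file into memory, which is fine for typical documents
    but worth replacing with a streamed scan for very large files.
    """
    return b"/Encrypt" in Path(pdf_path).read_bytes()
```

In a batch run, route files where looks_encrypted returns True through decrypt_pdf first, and treat an empty text_content after conversion as a second warning sign.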

Large PDFs (50MB+, 100+ pages)

Add a timeout wrapper to prevent hanging:

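import signal
from markitdown import MarkItDown

md = MarkItDown()

def _raise_timeout(signum, frame):
    raise TimeoutError("PDF conversion timed out")

def convert_with_timeout(filepath, timeout=60):
    # SIGALRM is Unix-only and must be used from the main thread
    previous = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.alarm(timeout)
    try:
        return md.convert(filepath)
    finally:
        signal.alarm(0)  # cancel alarm
        signal.signal(signal.SIGALRM, previous)  # restore old handler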

For extremely large PDFs, split into chunks with pypdf (the maintained successor to the deprecated PyPDF2) before conversion.
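
A sketch of that splitting step follows. It uses pypdf, the maintained successor to the PyPDF2 package named above. The function names, the 25-page default, and the output file naming are assumptions for illustration. The pypdf import lives inside split_pdf so the page-range helper can be used and tested on its own.

```python
from pathlib import Path


def page_ranges(total_pages, chunk_size):
    """Return (start, stop) page-index pairs covering all pages, stop exclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(input_path, output_dir, chunk_size=25):
    """Write the PDF as several smaller PDFs and return their paths."""
    # Imported here so page_ranges stays usable without pypdf installed
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(input_path))
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(input_path).stem
    chunk_paths = []
    for start, stop in page_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        # File name records the 1-based page span, e.g. report_p0001-0025.pdf
        chunk_path = out_dir / f"{stem}_p{start + 1:04d}-{stop:04d}.pdf"
        with open(chunk_path, "wb") as handle:
            writer.write(handle)
        chunk_paths.append(chunk_path)
    return chunk_paths
```

Convert each chunk in order and concatenate the resulting Markdown. Sentences that cross a chunk boundary will be split, so the post-processing step should run on the joined text.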

Docker: API Endpoint

Wrap the entire pipeline as an API for team use:

# docker-compose.yml
services:
  pdf-converter:
    build: .
    ports:
      - "8000:8000"
    environment:
      - MAX_FILE_SIZE_MB=50
      - REQUEST_TIMEOUT=120
    volumes:
      - ./output:/app/output

See the Docker deployment guide for the full app.py and Dockerfile.
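
The two environment variables in the compose file only take effect if the application reads them. Since app.py is not shown here, the following is a hypothetical sketch of how a service might load and apply them. Settings, load_settings, and is_upload_allowed are illustrative names, not code from the deployment guide.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    max_file_size_bytes: int
    request_timeout_seconds: int


def load_settings(env=None):
    """Read limits from the environment, falling back to the compose defaults."""
    env = os.environ if env is None else env
    max_mb = int(env.get("MAX_FILE_SIZE_MB", "50"))
    timeout = int(env.get("REQUEST_TIMEOUT", "120"))
    if max_mb <= 0 or timeout <= 0:
        raise ValueError("MAX_FILE_SIZE_MB and REQUEST_TIMEOUT must be positive")
    return Settings(max_file_size_bytes=max_mb * 1024 * 1024,
                    request_timeout_seconds=timeout)


def is_upload_allowed(size_bytes, settings):
    """Reject uploads larger than the configured limit before converting."""
    return size_bytes <= settings.max_file_size_bytes
```

Checking the size before conversion keeps oversized uploads from tying up a worker, and the timeout value can be passed to a wrapper such as convert_with_timeout.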

Alternatives: MarkItDown vs the Field

Tool	Best For	Cost
MarkItDown	Simple PDFs, local processing	Free
Unstructured	Complex layouts, partitioning	Free (OSS) / Paid API
LlamaParse	Complex PDFs, cloud API	Free tier / Paid

See the full comparison guide for detailed benchmarks.