Tables are MarkItDown's weakest link. Merged cells scramble, PDF tables vanish, XLSX formulas disappear. Here's how to fix each one — with pre-extraction scripts and an LLM fallback.
MarkItDown is a text extractor, not a document parser. It reads content streams — not layout boxes, not table grids. This means:
| Source | What goes wrong |
|---|---|
| DOCX | Merged cells lose hierarchy; nested tables flatten into a single jumble; cell alignment and borders disappear |
| PDF | No table concept at all: columns become space-separated text; merged cells overwrite neighbors; multi-page tables split at page breaks |
| XLSX | Formulas — gone (only values survive); date formats reset to raw numbers; only the first sheet is processed; empty rows with formatting get dropped |
The fix is the same pattern for all three: pre-extract tables with a format-aware library, then let MarkItDown handle the rest of the document.
MarkItDown handles simple DOCX tables fine. The moment you have merged cells, nested tables, or right-to-left columns, the output breaks. The solution: use python-docx to extract every table as Markdown before MarkItDown touches the file.
```python
from docx import Document
from markitdown import MarkItDown


def extract_docx_tables(docx_path):
    """Extract every top-level table from a DOCX as Markdown.

    Returns a list of Markdown table strings in document order,
    so you can recombine them with the body text later."""
    doc = Document(docx_path)
    tables_md = []
    for table in doc.tables:  # top-level tables, in document order
        rows = []
        for row in table.rows:
            cells = [
                cell.text.replace('\n', ' ').replace('|', '\\|').strip()
                for cell in row.cells
            ]
            rows.append('| ' + ' | '.join(cells) + ' |')
        # Insert separator row after header
        if rows:
            cols = len(table.rows[0].cells)
            rows.insert(1, '|' + '|'.join(['---'] * cols) + '|')
            tables_md.append('\n'.join(rows))
    return tables_md


# Usage: extract tables first, then feed the DOCX to MarkItDown
# for the non-table content. Merge outputs afterward.
tables = extract_docx_tables("report.docx")

# Now run MarkItDown for everything else
md = MarkItDown()
result = md.convert("report.docx")

# Replace scrambled tables with clean pre-extracted ones
# (Implementation depends on your document structure)
```
python-docx expands merged cells by default — a cell spanning 3 columns becomes 3 separate cells with identical text. For true merge detection, check cell._tc.xpath('./w:tcPr/w:gridSpan') and w:vMerge in the underlying XML. See the merged cells section below.
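The final step, replacing the scrambled tables, is left open above because it depends on the document. As a minimal sketch, assuming MarkItDown emits each table as a contiguous run of lines starting with `|` and in the same order python-docx returns them, the swap could look like this. The name `swap_tables` is made up for this example, and nested tables or stray pipe-prefixed lines would break the assumption.

```python
def swap_tables(markdown_text, clean_tables):
    """Replace the nth pipe-table block in markdown_text with clean_tables[n].

    Blocks without a matching clean table are kept unchanged."""
    out = []
    n = 0                # index of the next table block
    in_table = False
    replacement = None
    for line in markdown_text.splitlines():
        is_row = line.lstrip().startswith('|')
        if is_row and not in_table:
            # First row of a new table block: pick its replacement, if any
            in_table = True
            replacement = clean_tables[n] if n < len(clean_tables) else None
            n += 1
            if replacement is not None:
                out.append(replacement)
        if is_row:
            if replacement is None:
                out.append(line)  # no clean version, keep the original row
            continue
        in_table = False
        out.append(line)
    return '\n'.join(out)


if __name__ == "__main__":
    scrambled = "Intro\n\n| a | a | b |\n| --- | --- | --- |\n| 1 | 1 | 2 |\n\nOutro"
    clean = ["| a | b |\n|---|---|\n| 1 | 2 |"]
    print(swap_tables(scrambled, clean))
```

With the variables from the script above, the call would be along the lines of `swap_tables(result.text_content, tables)`. Check the table counts on both sides first, since a mismatch means the positional pairing is off.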
PDFs have no table markup — just positioned text blocks. MarkItDown reads the text stream in order, which means nearby columns get concatenated as one line. pdfplumber can detect table boundaries by analyzing text positions.
```python
import pdfplumber
from markitdown import MarkItDown


def pdf_with_tables(pdf_path):
    """Extract body text via MarkItDown and tables via pdfplumber.

    Returns (body_markdown, tables), where tables is a list of
    (page_number, table_number, table_markdown) tuples."""
    md = MarkItDown()
    result = md.convert(pdf_path)
    tables_md = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            for j, table in enumerate(page.extract_tables()):
                # Build Markdown table
                lines = []
                for ri, row in enumerate(table):
                    # pdfplumber returns None for empty cells
                    cells = [(c or '').replace('\n', ' ') for c in row]
                    lines.append('| ' + ' | '.join(cells) + ' |')
                    # Separator after header (first row)
                    if ri == 0:
                        lines.append('|' + '|'.join(['---'] * len(cells)) + '|')
                tables_md.append((i + 1, j + 1, '\n'.join(lines)))
    return result.text_content, tables_md


# Full pipeline: pdfplumber tables + MarkItDown body text
body, tables = pdf_with_tables("financial_report.pdf")
for page_no, table_no, table_md in tables:
    print(f"Page {page_no}, Table {table_no}:")
    print(table_md)
    print()
```
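Multi-page tables arrive from pdfplumber as one table per page. A rough way to stitch them back together, sketched here under a strong assumption: the first table on a page continues the last table of the previous page whenever the column counts match. The function name `stitch_across_pages` is hypothetical, and the input is the list-of-rows shape that `page.extract_tables()` returns, gathered per page.

```python
def stitch_across_pages(pages):
    """Join tables that appear to be split by a page break.

    pages: one list of tables per page; each table is a list of rows
    (lists of cell values). Returns a flat list of stitched tables."""
    stitched = []
    carry = None  # last table seen on the previous page
    for page_tables in pages:
        current = None  # last table seen on this page
        for idx, table in enumerate(t for t in page_tables if t):
            same_width = carry is not None and len(table[0]) == len(carry[0])
            if idx == 0 and same_width:
                # Treat as a continuation; drop a header row repeated on the new page
                rows = table[1:] if table[0] == carry[0] else table
                carry.extend(list(r) for r in rows)
                current = carry
            else:
                current = [list(r) for r in table]
                stitched.append(current)
        carry = current  # a page without tables breaks the chain
    return stitched


if __name__ == "__main__":
    pages = [
        [[["Item", "Cost"], ["A", "1"]]],
        [[["Item", "Cost"], ["B", "2"]]],
    ]
    print(stitch_across_pages(pages))
```

Two unrelated tables that happen to share a column count and sit on either side of a page break would be merged by mistake, so treat the result as a candidate to review rather than a guaranteed reconstruction.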
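MarkItDown reads XLSX through openpyxl but only grabs cell.value, losing formulas, number formats, and everything past the first sheet. This script handles date formats and walks every sheet; it loads cached values (data_only=True), so formulas still come through as their computed results.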
```python
from datetime import date, datetime

import openpyxl
from openpyxl.utils.datetime import from_excel


def xlsx_to_markdown(xlsx_path, include_empty_rows=False):
    """Convert every sheet in an XLSX to clean Markdown tables.

    Formats dates as YYYY-MM-DD and marks bold cells with **...**."""
    # data_only=True returns cached results instead of formula strings
    wb = openpyxl.load_workbook(xlsx_path, data_only=True)
    output = []
    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        output.append(f"## {sheet_name}\n")
        header_written = False
        for row in ws.iter_rows():
            cells = []
            for cell in row:
                val = cell.value
                if val is None:
                    cells.append('')
                    continue
                # Format dates properly
                if isinstance(val, (datetime, date)):
                    # openpyxl already converted the date cell
                    val = val.strftime('%Y-%m-%d')
                elif (isinstance(val, (int, float)) and cell.number_format
                        and 'yy' in cell.number_format.lower()):
                    # Raw serial number that carries a date format
                    val = from_excel(val).strftime('%Y-%m-%d')
                text = str(val).replace('\n', ' ').replace('|', '\\|')
                # Bold headers
                if cell.font and cell.font.bold:
                    text = f'**{text}**'
                cells.append(text)
            if cells and (include_empty_rows or any(cells)):
                output.append('| ' + ' | '.join(cells) + ' |')
                # Separator after the first emitted (header) row
                if not header_written:
                    output.append('|' + '|'.join(['---'] * len(cells)) + '|')
                    header_written = True
        output.append('')
    return '\n'.join(output)


# Usage
markdown = xlsx_to_markdown("financial_model.xlsx")
with open("financial_model.md", 'w', encoding='utf-8') as f:
    f.write(markdown)
```
To keep formulas instead of their cached results, load the workbook with data_only=False and read cell.value (which will then be the formula string), then decide whether to keep it as text.
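The "raw numbers" that dates reset to are Excel serial dates, counted in days from an epoch. As a worked illustration of the conversion that `from_excel` performs in the script above, here is a standard-library sketch. It assumes the workbook uses the default 1900 date system (not the 1904 one) and serials of 61 or more, which sit past Excel's phantom 29 February 1900. The name `excel_serial_to_iso` is invented for this example.

```python
from datetime import datetime, timedelta

# Epoch that reproduces Excel's 1900 date system for serials >= 61
EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_iso(serial):
    """Convert an Excel serial date (days since the epoch) to YYYY-MM-DD.

    The fractional part of the serial is the time of day and is ignored
    by the date-only format string."""
    return (EXCEL_EPOCH + timedelta(days=serial)).strftime('%Y-%m-%d')


if __name__ == "__main__":
    print(excel_serial_to_iso(45292))  # 2024-01-01
```

In the real pipeline, prefer openpyxl's own helper, which also knows about the workbook epoch; the sketch only shows why a cell that reads 45292 is really a date.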
Merged cells are the #1 source of table bugs in MarkItDown. A merged cell spanning 3 columns produces 3 copies of the same text in MarkItDown's output — making tables unreadable.
Use python-docx to detect merge spans before extraction:
```python
from docx import Document
from docx.oxml.ns import qn


def detect_docx_merges(docx_path):
    """Find all merged cells in a DOCX.

    Returns a list of dicts with keys: table, row, col, span_cols, v_merge."""
    doc = Document(docx_path)
    merges = []
    for ti, table in enumerate(doc.tables):
        seen = set()  # python-docx repeats a merged cell once per grid slot
        for ri, row in enumerate(table.rows):
            for ci, cell in enumerate(row.cells):
                tc = cell._tc
                if tc in seen:
                    continue  # already reported at its first grid position
                seen.add(tc)
                tcPr = tc.find(qn('w:tcPr'))
                if tcPr is None:
                    continue
                grid_span = tcPr.find(qn('w:gridSpan'))
                v_merge = tcPr.find(qn('w:vMerge'))
                span_cols = (int(grid_span.get(qn('w:val')))
                             if grid_span is not None else 1)
                if span_cols > 1 or v_merge is not None:
                    merges.append({
                        'table': ti, 'row': ri, 'col': ci,
                        'span_cols': span_cols,
                        'v_merge': v_merge is not None
                    })
    return merges


# Print merge map
for m in detect_docx_merges("merged_cells.docx"):
    print(f"Table {m['table']}, Row {m['row']}, Col {m['col']}: "
          f"spans {m['span_cols']} cols, v_merge={m['v_merge']}")
```
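Once the merge map exists, one way to use it is to keep the text in the first grid slot of each horizontal span and blank the repeats, so the Markdown row keeps its column count without the duplicated text. The sketch below is an assumption about how you might apply the map, not part of python-docx. `blank_span_duplicates` is a hypothetical helper that takes one row's expanded cell texts plus the merge dicts for that row, in the shape produced by `detect_docx_merges`.

```python
def blank_span_duplicates(cells, merges_in_row):
    """Blank the repeated texts that a horizontal merge leaves in a row.

    cells: expanded cell texts for one row.
    merges_in_row: dicts with 'col' (start grid column) and 'span_cols'."""
    out = list(cells)
    covered = set()  # grid slots that belong to an earlier span
    for m in sorted(merges_in_row, key=lambda m: m['col']):
        start, span = m['col'], m['span_cols']
        if start in covered:
            continue  # duplicate report for a slot inside an earlier span
        for ci in range(start + 1, min(start + span, len(out))):
            out[ci] = ''
            covered.add(ci)
    return out


if __name__ == "__main__":
    row = ["Revenue", "Revenue", "Revenue", "Notes"]
    print(blank_span_duplicates(row, [{'col': 0, 'span_cols': 3}]))
```

Vertical merges (`v_merge`) need a separate pass over rows, since the duplicate sits in the same column of the next row rather than in the neighboring cell.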
pdfplumber does its best with extract_tables(), but for heavily merged layouts, use extract_text() with layout preservation and then interpret alignment manually. If this sounds tedious — it is. Skip to the LLM fallback.
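To give a feel for the manual route: pdfplumber's `page.extract_text(layout=True)` pads the text with spaces to mimic the page layout, and the simplest interpretation is to split each line on runs of two or more spaces. The sketch below does only that, on a sample string standing in for the extracted text. The helper names are invented here, and the approach assumes no empty cells, because a missing value shifts every later column to the left.

```python
import re


def split_layout_line(line, min_gap=2):
    """Split one layout-preserved line into cells on runs of >= min_gap spaces."""
    parts = re.split(r' {%d,}' % min_gap, line.strip())
    return [p for p in parts if p]


def layout_text_to_rows(text, min_gap=2):
    """Turn layout-preserved text into a list of rows, skipping blank lines."""
    return [split_layout_line(l, min_gap) for l in text.splitlines() if l.strip()]


if __name__ == "__main__":
    # Stand-in for page.extract_text(layout=True) output
    sample = (
        "Region         Q1        Q2\n"
        "North          5,000     3,000\n"
        "Grand Total    12,000    15,000\n"
    )
    for row in layout_text_to_rows(sample):
        print('| ' + ' | '.join(row) + ' |')
```

Single spaces inside a cell ("Grand Total") survive because only wider gaps count as column boundaries. Handling empty cells properly means comparing character offsets across lines, which is where the tedium starts.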
Sometimes the output is so scrambled that no script can reconstruct it — multi-column PDFs, deeply nested merged cells, tables inside images. For these cases, feed the raw MarkItDown output to an LLM with a single prompt.
Example of scrambled MarkItDown output:

| Q1 Revenue | Q2 Revenue | Q3 Revenue | Q4 Revenue |
| --- | --- | --- | --- |
| 12,000 | 15,000 |
| | | 18,000 | 22,000 |
| North | South | East |
| 5k | 3k | 7k | 2k | | | |

The same table after LLM repair:

| Region | Q1 | Q2 | Q3 | Q4 |
| --- | --- | --- | --- | --- |
| Total | 12,000 | 15,000 | 18,000 | 22,000 |
| North | 5,000 | - | - | - |
| South | 3,000 | - | - | - |
| East | 7,000 | - | - | - |
| West | 2,000 | - | - | - |

The prompt:
You are a table repair assistant. I have raw Markdown output from a
document converter. The tables are scrambled — misaligned columns,
duplicated cells from merged regions, split rows.
Your task:
1. Detect table boundaries in the raw text below
2. Reconstruct each table with correct column alignment
3. Merge cells that clearly span multiple columns (like section headers)
4. Remove duplicate text caused by merged-cell extraction
5. Output valid Markdown tables — no explanation, just the tables
6. If you can't determine column boundaries, note it with
Raw text:
[paste scrambled MarkItDown table output here]
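If you run this repair more than once, it helps to assemble the prompt in code. The sketch below is deliberately model-agnostic: `call_llm` is a hypothetical callable (string in, string out) that you would wire to whatever client you use, and `instructions` is the prompt text above up to the "Raw text:" line. Neither name comes from MarkItDown or any specific LLM SDK.

```python
def build_repair_prompt(instructions, raw_tables):
    """Join the instruction text and the scrambled tables into one prompt."""
    return f"{instructions.strip()}\n\nRaw text:\n{raw_tables.strip()}\n"


def repair_tables(raw_tables, instructions, call_llm):
    """Send the prompt through call_llm, a caller-supplied str -> str function."""
    return call_llm(build_repair_prompt(instructions, raw_tables)).strip()


if __name__ == "__main__":
    # Stand-in for a real model call so the sketch runs offline
    def fake_llm(prompt):
        return "| Region | Q1 |\n| --- | --- |\n| North | 5,000 |\n"

    scrambled = "| North | South |\n| 5k | 3k | | |"
    print(repair_tables(scrambled, "You are a table repair assistant.", fake_llm))
```

Keeping the model call behind a plain function makes it easy to test the pipeline with a stub and to compare the repaired table against the source document before trusting it.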
| Your file is… | Table complexity | Use |
|---|---|---|
| DOCX | Simple (no merged cells) | MarkItDown directly |
| DOCX | Merged cells, nested tables | python-docx pre-extraction |
| PDF | Simple aligned columns | pdfplumber detection |
| PDF | Complex layout, merged cells | LLM fallback |
| XLSX | Single sheet, no formulas | MarkItDown directly |
| XLSX | Multi-sheet, formulas, dates | openpyxl pre-extraction |
| Any | Unsalvageable mess | LLM fallback |
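The decision table can be encoded as a small lookup if you want a batch job to route files automatically. This is only a sketch of the table above: the function name and the two boolean flags are invented for illustration, and deciding whether a file has "complex" tables is left to you (for example, a non-empty merge map for DOCX).

```python
import os

# (file type, complex tables?) -> approach, taken from the decision table
STRATEGY = {
    ('docx', False): 'MarkItDown directly',
    ('docx', True): 'python-docx pre-extraction',
    ('pdf', False): 'pdfplumber detection',
    ('pdf', True): 'LLM fallback',
    ('xlsx', False): 'MarkItDown directly',
    ('xlsx', True): 'openpyxl pre-extraction',
}


def choose_strategy(path, complex_tables=False, unsalvageable=False):
    """Pick an extraction approach from the file extension and two flags."""
    if unsalvageable:
        return 'LLM fallback'
    ext = os.path.splitext(path)[1].lower().lstrip('.')
    try:
        return STRATEGY[(ext, complex_tables)]
    except KeyError:
        raise ValueError(f"no table strategy for file type: {ext!r}")


if __name__ == "__main__":
    print(choose_strategy("report.docx", complex_tables=True))
```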