MarkItDown Table Conversion — Fixes & Workarounds

Tables are MarkItDown's weakest link. Merged cells scramble, PDF tables vanish, XLSX formulas disappear. Here's how to fix each one — with pre-extraction scripts and an LLM fallback.

1. Why Tables Break in MarkItDown

MarkItDown is a text extractor, not a document parser. It reads content streams — not layout boxes, not table grids. This means:

| Source | What goes wrong |
| --- | --- |
| DOCX | Merged cells lose hierarchy; nested tables flatten into a single jumble; cell alignment and borders disappear |
| PDF | No table concept at all: columns become space-separated text; merged cells overwrite neighbors; multi-page tables split at page breaks |
| XLSX | Formulas are gone (only values survive); date formats reset to raw numbers; only the first sheet is processed; empty rows with formatting get dropped |

The fix is the same pattern for all three: pre-extract tables with a format-aware library, then let MarkItDown handle the rest of the document.

2. DOCX: Pre-Extract Tables with python-docx

MarkItDown handles simple DOCX tables fine. The moment you have merged cells, nested tables, or right-to-left columns, the output breaks. The solution: use python-docx to extract every table as Markdown before MarkItDown touches the file.

from docx import Document
from markitdown import MarkItDown

def extract_docx_tables(docx_path):
    """Extract every top-level table from a DOCX as a Markdown string.
    Returns a list of tables in document order."""
    doc = Document(docx_path)
    tables_md = []

    # doc.tables lists the top-level tables in document order
    for table in doc.tables:
        rows = []
        for row in table.rows:
            # Flatten line breaks and escape pipes so each cell stays on one line
            cells = [cell.text.replace('\n', ' ').replace('|', '\\|').strip()
                     for cell in row.cells]
            rows.append('| ' + ' | '.join(cells) + ' |')

        # Insert separator row after header
        if rows:
            cols = len(table.rows[0].cells)
            rows.insert(1, '|' + '|'.join(['---'] * cols) + '|')

        tables_md.append('\n'.join(rows))

    return tables_md

# Usage: extract tables first, then feed the DOCX to MarkItDown
# for the non-table content. Merge outputs afterward.
tables = extract_docx_tables("report.docx")

# Now run MarkItDown for everything else
md = MarkItDown()
result = md.convert("report.docx")

# Replace scrambled tables in result.text_content with the clean
# pre-extracted ones (implementation depends on your document structure)

Handling merged cells: python-docx expands merged cells by default, so a cell spanning 3 columns comes back as 3 cells with identical text. For true merge detection, check cell._tc.xpath('./w:tcPr/w:gridSpan') and w:vMerge in the underlying XML. See the merged cells section below.
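
The recombination step left open above can be sketched as a plain text substitution. This is a minimal sketch, not MarkItDown functionality: `replace_tables` is a hypothetical helper, and it assumes MarkItDown emitted each table as a run of lines starting with `|`, in the same order and number as python-docx's top-level tables.

```python
def replace_tables(markdown_text, clean_tables):
    """Swap the n-th pipe-table block in markdown_text for clean_tables[n].

    A table block is a run of consecutive lines that start with '|'.
    Blocks beyond len(clean_tables) are left untouched.
    """
    out = []
    in_table = False
    use_clean = False
    index = 0
    for line in markdown_text.splitlines():
        is_row = line.lstrip().startswith('|')
        if is_row and not in_table:
            # First row of a new table block: emit the clean version once
            in_table = True
            use_clean = index < len(clean_tables)
            if use_clean:
                out.append(clean_tables[index])
            index += 1
        if is_row:
            if not use_clean:
                out.append(line)  # no replacement available, keep the row
            continue
        in_table = False
        out.append(line)
    return '\n'.join(out)
```

With the snippet above, the call would be `replace_tables(result.text_content, tables)`. If the table counts differ (for example because nested tables were flattened), compare the two outputs by hand before trusting the positional match.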

3. PDF: Use pdfplumber for Table Detection

PDFs have no table markup, just positioned text blocks. MarkItDown reads the text stream in order, which means neighboring columns get concatenated into one line. pdfplumber can detect table boundaries from ruling lines drawn on the page or, with its text-based strategy, from aligned text positions.

import pdfplumber
from markitdown import MarkItDown

def pdf_with_tables(pdf_path):
    """Extract body text via MarkItDown and tables via pdfplumber.
    Returns (body_markdown, tables), where tables is a list of
    (page_number, table_markdown) tuples."""
    md = MarkItDown()
    body = md.convert(pdf_path).text_content
    tables_md = []

    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            for table in page.extract_tables():
                # Build Markdown table
                lines = []
                for ri, row in enumerate(table):
                    # Empty cells come back as None; cell text may contain newlines
                    cells = [(c or '').replace('\n', ' ') for c in row]
                    lines.append('| ' + ' | '.join(cells) + ' |')

                    # Separator after header (first row)
                    if ri == 0:
                        lines.append('|' + '|'.join(['---'] * len(cells)) + '|')

                tables_md.append((i + 1, '\n'.join(lines)))
    return body, tables_md

# Full pipeline: pdfplumber tables + MarkItDown body text
body, tables = pdf_with_tables("financial_report.pdf")
for page_no, table in tables:
    print(f"Page {page_no}:")
    print(table)
    print()
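
pdfplumber limitation: By default it finds tables from ruling lines drawn on the page. Borderless tables need the text-based strategy (pass table_settings with vertical_strategy and horizontal_strategy set to "text"), which infers columns from aligned words. Complex layouts (rotated text, heavy borders, tables inside images) will still be missed. For those, skip to the LLM fallback.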
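
Tables that split at page breaks can often be stitched back together after extraction. The sketch below works on the plain row lists that `extract_tables()` returns. The continuation rule is an assumption of this example, not pdfplumber behavior: the first table on a page is treated as a continuation when its column count matches the last table seen so far. `stitch_page_tables` is a hypothetical helper name.

```python
def stitch_page_tables(pages):
    """Merge tables that continue across a page break.

    pages: one entry per page, each a list of tables; a table is a list of
    rows (lists of cell values, None for empty cells), e.g.
    pages = [page.extract_tables() for page in pdf.pages]

    Heuristic: the first table on a page continues the previous table when
    both have the same column count. A repeated header row is dropped.
    """
    merged = []
    for tables in pages:
        for position, table in enumerate(tables):
            if not table:
                continue
            continues = bool(
                position == 0
                and merged
                and len(table[0]) == len(merged[-1][0])
            )
            if continues:
                previous = merged[-1]
                # Skip the first row if it repeats the running table's header
                rows = table[1:] if table[0] == previous[0] else table
                previous.extend(list(row) for row in rows)
            else:
                merged.append([list(row) for row in table])
    return merged
```

Two unrelated tables with the same width on consecutive pages would be merged by this rule, so check the result against the PDF when the document has many similar tables.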

4. XLSX: Multi-Sheet, Formula-Aware Extraction

MarkItDown reads XLSX through openpyxl but only grabs cell.value, losing formulas, number formats, and everything past the first sheet. The script below covers every sheet and writes dates in a readable format; for formulas, see the data_only note after the code.

from datetime import date

import openpyxl

def xlsx_to_markdown(xlsx_path, include_empty_rows=False):
    """Convert every sheet in an XLSX to Markdown tables.
    Dates are written as YYYY-MM-DD and bold cells stay bold."""
    # data_only=True returns cached values instead of formula strings
    wb = openpyxl.load_workbook(xlsx_path, data_only=True)
    output = []

    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        output.append(f"## {sheet_name}\n")
        header_written = False

        for row in ws.iter_rows(min_row=1, max_row=ws.max_row,
                                max_col=ws.max_column):
            cells = []
            for cell in row:
                val = cell.value
                if val is None:
                    cells.append('')
                    continue

                # openpyxl returns date-formatted cells as date/datetime objects
                if isinstance(val, date):
                    val = val.strftime('%Y-%m-%d')

                text = str(val).replace('\n', ' ').replace('|', '\\|')
                # Keep bold (typically header) cells bold
                if cell.font and cell.font.bold:
                    text = f'**{text}**'

                cells.append(text)

            if not (include_empty_rows or any(cells)):
                continue
            output.append('| ' + ' | '.join(cells) + ' |')

            # Separator after the first emitted row (the header)
            if not header_written:
                output.append('|' + '|'.join(['---'] * len(cells)) + '|')
                header_written = True

        output.append('')

    return '\n'.join(output)

# Usage
markdown = xlsx_to_markdown("financial_model.xlsx")
with open("financial_model.md", 'w', encoding='utf-8') as f:
    f.write(markdown)

data_only=True: This reads the values Excel cached at the last save instead of the formulas. If the file was never recalculated and saved by Excel (for example, it was generated by a script), formula cells come back as None. To preserve formulas, load with data_only=False; cell.value is then the formula string (such as "=SUM(A1:A3)"), and you can decide whether to keep it as text.
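
To show both the formula and its cached result, one option is to load the workbook twice and pair the cells. This is a sketch under assumptions: `render_formula_cell` and `sheet_rows_with_formulas` are hypothetical helpers, the "value (`formula`)" layout is just one possible convention, and array formulas are not handled.

```python
def render_formula_cell(raw, cached):
    """Render one cell for Markdown from its raw and cached values.

    raw: cell.value loaded with data_only=False (formula string or plain value)
    cached: cell.value loaded with data_only=True (None if never calculated)
    """
    if isinstance(raw, str) and raw.startswith('='):
        formula = raw.replace('|', '\\|')  # keep pipes from breaking the table
        if cached is None:
            return f'`{formula}`'
        return f'{cached} (`{formula}`)'
    return '' if raw is None else str(raw)


def sheet_rows_with_formulas(xlsx_path, sheet_name):
    """Yield rows of rendered cells for one sheet, pairing formulas with values."""
    import openpyxl  # imported here so render_formula_cell works without it

    formulas = openpyxl.load_workbook(xlsx_path, data_only=False)[sheet_name]
    values = openpyxl.load_workbook(xlsx_path, data_only=True)[sheet_name]
    for f_row, v_row in zip(formulas.iter_rows(), values.iter_rows()):
        yield [render_formula_cell(f.value, v.value)
               for f, v in zip(f_row, v_row)]
```

Loading twice doubles the read time, which is usually acceptable for report-sized workbooks; for very large files it may be better to keep only the formula strings.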

5. Merged Cells: The Hardest Problem

Merged cells are the #1 source of table bugs in MarkItDown. A merged cell spanning 3 columns produces 3 copies of the same text in MarkItDown's output — making tables unreadable.

DOCX Merged Cells

Use python-docx to detect merge spans before extraction:

from docx import Document
from docx.oxml.ns import qn

def detect_docx_merges(docx_path):
    """Find all merged cells in a DOCX. Returns a list of dicts with
    keys: table, row, col, span_cols, v_merge."""
    doc = Document(docx_path)
    merges = []

    for ti, table in enumerate(doc.tables):
        seen = set()  # row.cells repeats a merged cell once per grid slot
        for ri, row in enumerate(table.rows):
            for ci, cell in enumerate(row.cells):
                tc = cell._tc
                if tc in seen:
                    continue  # already reported at its first position
                seen.add(tc)

                tcPr = tc.find(qn('w:tcPr'))
                if tcPr is None:
                    continue
                grid_span = tcPr.find(qn('w:gridSpan'))
                v_merge = tcPr.find(qn('w:vMerge'))
                span_cols = int(grid_span.get(qn('w:val'))) if grid_span is not None else 1
                if span_cols > 1 or v_merge is not None:
                    merges.append({
                        'table': ti, 'row': ri, 'col': ci,
                        'span_cols': span_cols,
                        'v_merge': v_merge is not None
                    })

    return merges

# Print merge map
for m in detect_docx_merges("merged_cells.docx"):
    print(f"Table {m['table']}, Row {m['row']}, Col {m['col']}: "
          f"spans {m['span_cols']} cols, v_merge={m['v_merge']}")
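
Detecting a merge is only half the job: the duplicated text still has to be removed when the row is written out. A minimal sketch, assuming python-docx's behavior of returning the same cell (backed by the same `_tc` element) once per grid column it spans. `collapse_merged_row` is a hypothetical helper and handles horizontal merges only; vertically merged cells still repeat from row to row.

```python
def collapse_merged_row(cells):
    """Return cell texts for one row, blanking repeats of a merged cell.

    cells: the objects from python-docx's row.cells. A repeat is detected
    by identity of the underlying `_tc` element with the previous cell.
    """
    texts = []
    previous_tc = None
    for cell in cells:
        if cell._tc is previous_tc:
            # Keep the column count stable, but drop the duplicated text
            texts.append('')
        else:
            texts.append(cell.text.replace('\n', ' ').strip())
        previous_tc = cell._tc
    return texts
```

In the extraction function from section 2, this would replace the per-row list of `cell.text` values, so a header spanning 3 columns prints once followed by two empty cells.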

PDF Merged Cells

pdfplumber does its best with extract_tables(), but for heavily merged layouts you may need extract_text(layout=True), which preserves horizontal spacing, and then interpret the alignment manually. If this sounds tedious, it is. Skip to the LLM fallback.

6. LLM Fallback: When Code Can't Fix It

Sometimes the output is so scrambled that no script can reconstruct it — multi-column PDFs, deeply nested merged cells, tables inside images. For these cases, feed the raw MarkItDown output to an LLM with a single prompt.

Before / After

Before (MarkItDown raw)
| Q1 Revenue | Q2 Revenue | Q3 Revenue | Q4 Revenue |
| --- | --- | --- | --- |
| 12,000 | 15,000 |
| | | 18,000 | 22,000 |
| North | South | East |
| 5k | 3k | 7k | 2k | | | |

After (LLM cleaned)
| Region | Q1 | Q2 | Q3 | Q4 |
| --- | --- | --- | --- | --- |
| Total | 12,000 | 15,000 | 18,000 | 22,000 |
| North | 5,000 | - | - | - |
| South | 3,000 | - | - | - |
| East | 7,000 | - | - | - |
| West | 2,000 | - | - | - |

The Prompt

You are a table repair assistant. I have raw Markdown output from a
document converter. The tables are scrambled — misaligned columns,
duplicated cells from merged regions, split rows.

Your task:
1. Detect table boundaries in the raw text below
2. Reconstruct each table with correct column alignment
3. Merge cells that clearly span multiple columns (like section headers)
4. Remove duplicate text caused by merged-cell extraction
5. Output valid Markdown tables — no explanation, just the tables
6. If you can't determine column boundaries, note it with 

Raw text:
[paste scrambled MarkItDown table output here]

Model choice: GPT-4o and Claude Sonnet 4 both handle this well. See the LLM comparison on the main guide for cost/speed tradeoffs. One call typically costs under $0.01 for a table-heavy page.
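
LLM output is worth checking before it replaces anything, since a model can return a table that is still ragged. The sketch below validates structure only (equal cell counts and a separator row) and retries when the check fails. `table_problems` and `repair_table` are hypothetical helpers, and `call_llm` stands in for whatever client you use: any callable that takes a prompt string and returns the model's text. The checker expects a single table per call.

```python
import re


def _split_cells(row):
    # Drop one outer pipe on each side, then split on unescaped pipes
    inner = row[1:] if row.startswith('|') else row
    inner = inner[:-1] if inner.endswith('|') else inner
    return [c.strip() for c in re.split(r'(?<!\\)\|', inner)]


def table_problems(markdown_table):
    """Return a list of structural problems in one Markdown pipe table.

    An empty list means every row has the same number of cells and the
    second row is a separator such as | --- | --- |.
    """
    rows = [line.strip() for line in markdown_table.strip().splitlines()]
    if len(rows) < 2:
        return ['table needs a header row and a separator row']
    problems = []
    width = len(_split_cells(rows[0]))
    if not all(re.fullmatch(r':?-+:?', c) for c in _split_cells(rows[1])):
        problems.append('second row is not a separator row')
    for number, row in enumerate(rows, start=1):
        count = len(_split_cells(row))
        if count != width:
            problems.append(f'row {number} has {count} cells, expected {width}')
    return problems


def repair_table(raw_markdown, prompt, call_llm, max_attempts=2):
    """Ask the model to repair a table; return None if no attempt validates."""
    for _ in range(max_attempts):
        candidate = call_llm(prompt + '\n' + raw_markdown)
        if not table_problems(candidate):
            return candidate
    return None
```

A structurally valid table can still contain wrong values, so spot-check the numbers against the source document.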

7. Which Approach Should You Use?

| Your file is… | Table complexity | Use |
| --- | --- | --- |
| DOCX | Simple (no merged cells) | MarkItDown directly |
| DOCX | Merged cells, nested tables | python-docx pre-extraction |
| PDF | Simple aligned columns | pdfplumber detection |
| PDF | Complex layout, merged cells | LLM fallback |
| XLSX | Single sheet, no formulas | MarkItDown directly |
| XLSX | Multi-sheet, formulas, dates | openpyxl pre-extraction |
| Any | Unsalvageable mess | LLM fallback |

Rule of thumb: If you can see the table structure when you open the file, a pre-extraction script can capture it. If the file itself looks scrambled, only an LLM can make sense of it.
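
For batch jobs, the decision table can be encoded as a small routing function. This is an illustrative sketch: `choose_approach` is a hypothetical helper, the returned labels simply mirror the table, and sending unknown file types with simple tables straight to MarkItDown is an assumption of this example.

```python
def choose_approach(file_type, complex_tables, unsalvageable=False):
    """Map a file type and table complexity to an approach from the table.

    file_type: extension such as 'docx', '.pdf' or 'XLSX'
    complex_tables: True for merged cells, nested tables, multi-sheet
        workbooks, formulas, or complex PDF layouts
    unsalvageable: True when the source itself looks scrambled
    """
    file_type = file_type.lower().lstrip('.')
    if unsalvageable:
        return 'LLM fallback'
    if not complex_tables:
        # Simple PDFs still go through pdfplumber; DOCX and XLSX need no help
        return 'pdfplumber detection' if file_type == 'pdf' else 'MarkItDown directly'
    by_type = {
        'docx': 'python-docx pre-extraction',
        'pdf': 'LLM fallback',
        'xlsx': 'openpyxl pre-extraction',
    }
    return by_type.get(file_type, 'LLM fallback')
```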