MarkItDown Table Conversion — Fixes & Workarounds

Tables are MarkItDown's weakest link. Merged cells scramble, PDF tables vanish, XLSX formulas disappear. Here's how to fix each one — with pre-extraction scripts and an LLM fallback.

1. Why Tables Break in MarkItDown

MarkItDown is a text extractor, not a document parser. It reads content streams — not layout boxes, not table grids. This means:

Source	What goes wrong
DOCX	Merged cells lose hierarchy; nested tables flatten into a single jumble; cell alignment and borders disappear
PDF	No table concept at all — columns become space-separated text; merged cells overwrite neighbors; multi-page tables split at page breaks
XLSX	Formulas — gone (only values survive); date formats reset to raw numbers; only the first sheet is processed; empty rows with formatting get dropped

The fix is the same pattern for all three: pre-extract tables with a format-aware library, then let MarkItDown handle the rest of the document.

2. DOCX: Pre-Extract Tables with python-docx

MarkItDown handles simple DOCX tables fine. The moment you have merged cells, nested tables, or right-to-left columns, the output breaks. The solution: use python-docx to extract every table as Markdown before MarkItDown touches the file.

from docx import Document

def extract_docx_tables(docx_path):
    """Extract every top-level table from a DOCX as a Markdown string.
    Returns a list of Markdown tables in document order."""
    doc = Document(docx_path)
    tables_md = []

    # doc.tables lists top-level tables in document order;
    # nested tables are reachable through cell.tables.
    for table in doc.tables:
        rows = []
        for row in table.rows:
            cells = []
            for cell in row.cells:
                text = cell.text.replace('\n', ' ').strip()
                # Escape pipes so cell text cannot break the row
                cells.append(text.replace('|', '\\|'))
            rows.append('| ' + ' | '.join(cells) + ' |')

        # Insert separator row after header
        if rows:
            cols = len(table.rows[0].cells)
            rows.insert(1, '|' + '|'.join(['---'] * cols) + '|')

        tables_md.append('\n'.join(rows))

    return tables_md

# Usage: extract tables first, then feed the DOCX to MarkItDown
# for the non-table content. Merge outputs afterward.
tables = extract_docx_tables("report.docx")

# Now run MarkItDown for everything else
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("report.docx")

# Replace scrambled tables with clean pre-extracted ones
# (Implementation depends on your document structure)
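One way to do that replacement is sketched below. It assumes MarkItDown emitted the same number of tables, in the same order, as the pre-extraction step, and it treats any run of consecutive lines starting with `|` as one table. The function name `replace_markdown_tables` is hypothetical and is not part of MarkItDown or python-docx.

```python
def replace_markdown_tables(markdown_text, clean_tables):
    """Swap the nth table block in markdown_text for clean_tables[n].

    A table block is a run of consecutive lines that start with '|'.
    Blocks beyond len(clean_tables) are left untouched.
    """
    out = []
    replacements = iter(clean_tables)
    in_table = False
    replaced = False

    for line in markdown_text.splitlines():
        if line.lstrip().startswith('|'):
            if not in_table:
                # First line of a new table block: fetch its replacement
                in_table = True
                clean = next(replacements, None)
                replaced = clean is not None
                if replaced:
                    out.append(clean)
            if not replaced:
                out.append(line)  # no replacement available, keep original
        else:
            in_table = False
            out.append(line)

    return '\n'.join(out)
```

With the variables from the usage snippet above, the call would be `replace_markdown_tables(result.text_content, tables)`. If the table counts differ, compare the outputs by hand before trusting the merge.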

Handling merged cells: python-docx expands merged cells by default — a cell spanning 3 columns becomes 3 separate cells with identical text. For true merge detection, check cell._tc.xpath('./w:tcPr/w:gridSpan') and w:vMerge in the underlying XML. See the merged cells section below.

3. PDF: Use pdfplumber for Table Detection

PDFs have no table markup, just positioned text blocks. MarkItDown reads the text stream in order, so neighboring columns get concatenated into one line. pdfplumber can detect table boundaries from the ruling lines drawn on the page or, optionally, from text alignment.

import pdfplumber
from markitdown import MarkItDown

def pdf_with_tables(pdf_path):
    """Extract tables via pdfplumber and body text via MarkItDown.
    Returns (body_markdown, tables), where tables is a list of
    (page_number, table_markdown) tuples."""
    md = MarkItDown()
    body = md.convert(pdf_path).text_content
    tables_md = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables():
                # Build Markdown table
                lines = []
                for ri, row in enumerate(table):
                    # Empty cells come back as None; text may contain newlines
                    cells = [(c or '').replace('\n', ' ').replace('|', '\\|')
                             for c in row]
                    lines.append('| ' + ' | '.join(cells) + ' |')

                    # Separator after header (first row)
                    if ri == 0:
                        lines.append('|' + '|'.join(['---'] * len(cells)) + '|')

                tables_md.append((page_number, '\n'.join(lines)))

    return body, tables_md

# Full pipeline: pdfplumber tables + MarkItDown body text
body, tables = pdf_with_tables("financial_report.pdf")
for page_number, table_md in tables:
    print(f"Page {page_number}:")
    print(table_md)
    print()

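pdfplumber limitation: By default it detects tables from ruling lines drawn on the page. Borderless tables need the text-alignment strategy (pass table_settings with "vertical_strategy": "text" and "horizontal_strategy": "text" to extract_tables()). Complex layouts, such as rotated text or tables inside images, will still be missed. For those, skip to the LLM fallback.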
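Multi-page tables, which split at page breaks, can often be stitched back together after extraction. The sketch below works on plain nested lists in the shape `extract_tables()` returns (one list of tables per page, each table a list of rows, each cell a string or `None`). The merge rule is a heuristic assumption: the first table on a page continues the last table of the previous page when the column counts match. The name `stitch_page_tables` is hypothetical.

```python
def stitch_page_tables(page_tables):
    """Join tables that were split across page breaks.

    page_tables: one entry per page, in order; each entry is the list of
    tables found on that page (rows of str-or-None cells).
    Returns a flat list of tables with continuation pages merged in.
    """
    stitched = []
    for tables in page_tables:
        for ti, table in enumerate(tables):
            if not table:
                continue
            rows = [[(c or '').strip() for c in row] for row in table]
            prev = stitched[-1] if stitched else None
            # Heuristic: only the first table on a page can be a continuation
            continues = (ti == 0 and prev is not None
                         and len(prev[0]) == len(rows[0]))
            if continues:
                if rows[0] == prev[0]:
                    rows = rows[1:]  # drop the header repeated on the new page
                prev.extend(rows)
            else:
                stitched.append(rows)
    return stitched
```

Feeding it `[page.extract_tables() for page in pdf.pages]` would be the natural call. Two unrelated tables with the same column count on consecutive pages would be merged by mistake, so check the result on real documents.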

4. XLSX: Multi-Sheet, Formula-Aware Extraction

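MarkItDown reads XLSX through pandas with the openpyxl engine and keeps only cell values, so formulas, number formats, and cell styling are lost. The script below reads the workbook directly with openpyxl, walks every sheet, and keeps dates and bold headers readable.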

from datetime import date, datetime

import openpyxl

def xlsx_to_markdown(xlsx_path, include_empty_rows=False):
    """Convert every worksheet in an XLSX to Markdown tables.
    Renders dates as YYYY-MM-DD and keeps bold cells bold."""
    # data_only=True returns cached values instead of formula strings
    wb = openpyxl.load_workbook(xlsx_path, data_only=True)
    output = []

    for ws in wb.worksheets:  # worksheets only; chart sheets have no cells
        output.append(f"## {ws.title}\n")
        header_written = False

        for row in ws.iter_rows(min_row=1, max_row=ws.max_row,
                                max_col=ws.max_column):
            cells = []
            for cell in row:
                val = cell.value
                if val is None:
                    cells.append('')
                    continue

                # openpyxl returns date-formatted cells as date/datetime objects
                if isinstance(val, (date, datetime)):
                    val = val.strftime('%Y-%m-%d')

                text = str(val).replace('\n', ' ').replace('|', '\\|')
                # Bold headers
                if cell.font and cell.font.bold:
                    text = f'**{text}**'

                cells.append(text)

            if not (include_empty_rows or any(cells)):
                continue
            output.append('| ' + ' | '.join(cells) + ' |')

            # Separator after the first emitted row (the header)
            if not header_written:
                output.append('|' + '|'.join(['---'] * len(cells)) + '|')
                header_written = True

        output.append('')

    return '\n'.join(output)

# Usage
markdown = xlsx_to_markdown("financial_model.xlsx")
with open("financial_model.md", 'w', encoding='utf-8') as f:
    f.write(markdown)

data_only=True: This reads the values Excel cached at the last save instead of the formulas. If the file was never calculated and saved by Excel (for example, it was written by a library), those cached values are missing and the cells come back as None. For formula preservation, use data_only=False and read cell.value (which will be the formula string), then decide whether to keep it as text.
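To show both the formula and its result, one option is to load the workbook twice, once with data_only=False and once with data_only=True, and pair the cells up. The sketch below assumes ordinary (non-array) formulas, which openpyxl exposes as strings starting with `=`. The names `describe_cell` and `sheet_with_formulas` are hypothetical helpers, not part of openpyxl.

```python
def describe_cell(formula_value, cached_value):
    """Render one cell as 'value (`=formula`)' when it holds a formula."""
    is_formula = isinstance(formula_value, str) and formula_value.startswith('=')
    if not is_formula:
        return '' if formula_value is None else str(formula_value)
    # A missing cached value means the file was saved without calculation
    shown = '?' if cached_value is None else str(cached_value)
    return f"{shown} (`{formula_value}`)"


def sheet_with_formulas(xlsx_path, sheet_name):
    """Return rows of rendered cells for one sheet (requires openpyxl)."""
    import openpyxl  # imported here so describe_cell works on its own

    formulas = openpyxl.load_workbook(xlsx_path, data_only=False)[sheet_name]
    values = openpyxl.load_workbook(xlsx_path, data_only=True)[sheet_name]
    rows = []
    for f_row, v_row in zip(formulas.iter_rows(values_only=True),
                            values.iter_rows(values_only=True)):
        rows.append([describe_cell(f, v) for f, v in zip(f_row, v_row)])
    return rows
```

Loading twice doubles the read time, which is usually acceptable for one-off conversions. Formulas that contain a pipe character would still need escaping before they go into a Markdown row.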

5. Merged Cells: The Hardest Problem

Merged cells are the #1 source of table bugs in MarkItDown. A merged cell spanning 3 columns produces 3 copies of the same text in MarkItDown's output — making tables unreadable.

DOCX Merged Cells

Use python-docx to detect merge spans before extraction:

from docx import Document
from docx.oxml.ns import qn

def detect_docx_merges(docx_path):
    """Find merged cells in a DOCX. Returns a list of dicts with
    table, row, col, span_cols and v_merge keys."""
    doc = Document(docx_path)
    merges = []

    for ti, table in enumerate(doc.tables):
        seen = set()  # row.cells repeats a merged cell once per grid slot
        for ri, row in enumerate(table.rows):
            for ci, cell in enumerate(row.cells):
                tc = cell._tc
                if tc in seen:
                    continue  # same underlying cell, already reported
                seen.add(tc)
                tcPr = tc.find(qn('w:tcPr'))
                if tcPr is None:
                    continue
                grid_span = tcPr.find(qn('w:gridSpan'))
                v_merge = tcPr.find(qn('w:vMerge'))
                span_cols = (int(grid_span.get(qn('w:val')))
                             if grid_span is not None else 1)
                if span_cols > 1 or v_merge is not None:
                    merges.append({
                        'table': ti, 'row': ri, 'col': ci,
                        'span_cols': span_cols,
                        'v_merge': v_merge is not None
                    })

    return merges

# Print merge map
for m in detect_docx_merges("merged_cells.docx"):
    print(f"Table {m['table']}, Row {m['row']}, Col {m['col']}: "
          f"spans {m['span_cols']} cols, v_merge={m['v_merge']}")
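Once the spans are known, a row can be rendered with the merged text written once and the remaining grid slots left empty, since Markdown tables have no colspan. This is a minimal sketch that takes hypothetical `(text, span_cols)` pairs rather than python-docx objects. The name `render_spanned_row` is an illustration, not an existing API.

```python
def render_spanned_row(cells):
    """Render one Markdown row from (text, span_cols) pairs.

    A cell spanning N columns becomes its text followed by N - 1 empty
    cells, so every row keeps the same column count.
    """
    out = []
    for text, span_cols in cells:
        out.append(text.replace('|', '\\|'))
        out.extend([''] * (span_cols - 1))  # pad the slots the merge covers
    return '| ' + ' | '.join(out) + ' |'


# A header cell spanning three columns, followed by a normal cell
row = render_spanned_row([("Revenue", 3), ("Notes", 1)])
```

Padding with empty cells keeps the table valid for any Markdown renderer. Repeating the text in every slot is the alternative, and it is what produces the duplicated output described above.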

PDF Merged Cells

pdfplumber does its best with extract_tables(), but for heavily merged layouts, use extract_text(layout=True) to preserve horizontal spacing and then interpret alignment manually. If this sounds tedious, it is. Skip to the LLM fallback.
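For a sense of what interpreting alignment manually involves, here is a minimal sketch. It assumes the layout-preserved text is monospaced, so that a character position that is blank on every line marks a gap between columns. The function names and the `min_gap` parameter are hypothetical.

```python
def find_column_spans(lines, min_gap=2):
    """Return (start, end) character spans of columns shared by all lines.

    A run of at least min_gap positions that are blank on every line
    is treated as a column separator.
    """
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    blank = [all(row[i] == ' ' for row in padded) for i in range(width)]

    spans, start, gap = [], None, 0
    for i, is_blank in enumerate(blank):
        if not is_blank:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, i - gap + 1))  # end just after last text
                start, gap = None, 0
    if start is not None:
        spans.append((start, width - gap))
    return spans


def split_layout_rows(lines, min_gap=2):
    """Cut each non-empty line into cells at the shared column spans."""
    lines = [line for line in lines if line.strip()]
    if not lines:
        return []
    spans = find_column_spans(lines, min_gap)
    return [[line[a:b].strip() for a, b in spans] for line in lines]
```

This only works when the table is the sole content of the lines passed in. Body text on the same lines fills the gaps and hides the column boundaries.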

6. LLM Fallback: When Code Can't Fix It

Sometimes the output is so scrambled that no script can reconstruct it — multi-column PDFs, deeply nested merged cells, tables inside images. For these cases, feed the raw MarkItDown output to an LLM with a single prompt.

Before / After

Before (MarkItDown raw)

| Q1 Revenue | Q2 Revenue | Q3 Revenue | Q4 Revenue |
| --- | --- | --- | --- |
| 12,000 | 15,000 |
| | | 18,000 | 22,000 |
| North | South | East |
| 5k | 3k | 7k | 2k | | | |

After (LLM cleaned)

| Region | Q1 | Q2 | Q3 | Q4 |
| --- | --- | --- | --- | --- |
| Total | 12,000 | 15,000 | 18,000 | 22,000 |
| North | 5,000 | - | - | - |
| South | 3,000 | - | - | - |
| East | 7,000 | - | - | - |
| West | 2,000 | - | - | - |

The Prompt

You are a table repair assistant. I have raw Markdown output from a
document converter. The tables are scrambled — misaligned columns,
duplicated cells from merged regions, split rows.

Your task:
1. Detect table boundaries in the raw text below
2. Reconstruct each table with correct column alignment
3. Merge cells that clearly span multiple columns (like section headers)
4. Remove duplicate text caused by merged-cell extraction
5. Output valid Markdown tables — no explanation, just the tables
6. If you can't determine column boundaries, note it with 

Raw text:
[paste scrambled MarkItDown table output here]

Model choice: GPT-4o and Claude Sonnet 4 both handle this well. See the LLM comparison on the main guide for cost/speed tradeoffs. One call typically costs under $0.01 for a table-heavy page.
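LLM output is not guaranteed to be well formed, so a cheap structural check before accepting it can help. The sketch below only verifies that every row of a Markdown table has the same number of cells. It does not check that the values are right, and the function names are hypothetical.

```python
import re


def column_counts(md_table):
    """Return the number of cells in each non-empty line of a Markdown table."""
    counts = []
    for line in md_table.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('|'):
            line = line[1:]  # drop the leading border pipe
        if line.endswith('|') and not line.endswith('\\|'):
            line = line[:-1]  # drop the trailing border pipe
        # Split on pipes that are not escaped with a backslash
        counts.append(len(re.split(r'(?<!\\)\|', line)))
    return counts


def is_rectangular(md_table):
    """True when the table has a header, a separator, and equal-width rows."""
    counts = column_counts(md_table)
    return len(counts) >= 2 and len(set(counts)) == 1
```

Run on the Before example above, the check fails because the rows have 4, 4, 2, 4, 3 and 7 cells. The After example passes with 5 cells in every row. Passing the check says nothing about whether the model moved or invented values, so spot-check numbers against the source document.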

7. Which Approach Should You Use?

Your file is…	Table complexity	Use
DOCX	Simple (no merged cells)	MarkItDown directly
DOCX	Merged cells, nested tables	`python-docx` pre-extraction
PDF	Simple aligned columns	`pdfplumber` detection
PDF	Complex layout, merged cells	LLM fallback
XLSX	Single sheet, no formulas	MarkItDown directly
XLSX	Multi-sheet, formulas, dates	`openpyxl` pre-extraction
Any	Unsalvageable mess	LLM fallback

Rule of thumb: If you can see the table structure when you open the file, a pre-extraction script can capture it. If the file itself looks scrambled, only an LLM can make sense of it.
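The decision table above can be encoded as a small lookup for batch pipelines. This is a sketch only: the function name, its parameters, and the returned labels are hypothetical, and deciding whether a file has complex tables is left to the caller.

```python
def choose_approach(extension, complex_tables=False, unsalvageable=False):
    """Map a file type and table complexity to the approach from the table."""
    if unsalvageable:
        return "LLM fallback"  # applies to any file type

    approaches = {
        ("docx", False): "MarkItDown directly",
        ("docx", True): "python-docx pre-extraction",
        ("pdf", False): "pdfplumber detection",
        ("pdf", True): "LLM fallback",
        ("xlsx", False): "MarkItDown directly",
        ("xlsx", True): "openpyxl pre-extraction",
    }
    key = (extension.lower().lstrip('.'), bool(complex_tables))
    if key not in approaches:
        # The table only covers DOCX, PDF and XLSX
        raise ValueError(f"no table-handling rule for {extension!r}")
    return approaches[key]
```

For example, `choose_approach(".pdf", complex_tables=True)` selects the LLM fallback, matching the fourth row of the table.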