Complex Table and Layout Extraction
Multi-page tables, merged cells, nested layouts — extracted with structure intact.
The problem: structure is the first casualty
Documents with complex tables and layouts are where most extraction tools fail. Standard OCR reads characters but loses spatial relationships. LLMs process text but cannot reliably reconstruct table structure from visual input. Even specialized tools often convert tables to Markdown, which flattens merged cells, destroys row spans, and strips the relational structure that makes the data meaningful.
The real-world documents that matter most — financial statements with nested sub-tables, medical records with multi-page lab results, insurance policies with coverage matrices, engineering specifications with parameter tables — are exactly the ones that break.
A table that spans three pages with merged header cells and hierarchical row groups is not an edge case. It is Tuesday.
Multi-stage pipeline: layout to structure to extraction
anyformat processes complex documents through a multi-stage pipeline that separates layout analysis, structure recognition, and data extraction into distinct phases:
Stage 1 — Layout analysis: The system identifies regions of the document: text blocks, tables, figures, headers, footers, and page furniture. Each region is classified and spatially anchored.
Stage 2 — Structure recognition: For tables, the system reconstructs the full grid structure: row and column boundaries, merged cells, header hierarchies, and spanning relationships. For multi-page tables, structure is stitched across page breaks with continuity preserved.
Stage 3 — Data extraction: With structure understood, the system extracts values into their correct positions within the recognized structure. The output is structured JSON that preserves every relationship — not flattened text.
This separation matters. When layout analysis, structure recognition, and extraction are collapsed into a single step, as they are in most LLM-based tools, the system has to solve three problems simultaneously. Errors compound. anyformat solves them in sequence, with each stage validating the previous one.
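To make the staged design concrete, here is a minimal Python sketch of three stages that each check their input before passing it on. It is an illustration only, not anyformat's implementation or API: the names Region, Grid, analyze_layout, recognize_structure, and extract_values are hypothetical, and the grid sizes are supplied by hand where a real system would infer them from the page.

```python
from dataclasses import dataclass

@dataclass
class Region:
    kind: str    # e.g. "text", "table", "figure"
    page: int
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates

@dataclass
class Grid:
    region: Region
    n_rows: int
    n_cols: int

def analyze_layout(raw_regions):
    """Stage 1 stand-in: classify regions and anchor them on the page."""
    regions = [Region(r["kind"], r["page"], tuple(r["bbox"])) for r in raw_regions]
    for region in regions:
        x0, y0, x1, y1 = region.bbox
        if not (x0 < x1 and y0 < y1):
            raise ValueError(f"degenerate region on page {region.page}")
    return regions

def recognize_structure(regions, grid_sizes):
    """Stage 2 stand-in: attach a grid to each table region.

    grid_sizes maps a region's index to (n_rows, n_cols). This is a
    hand-supplied assumption; a real system would infer it.
    """
    grids = []
    for index, region in enumerate(regions):
        if region.kind != "table":
            continue
        n_rows, n_cols = grid_sizes[index]
        if n_rows < 1 or n_cols < 1:
            raise ValueError(f"empty grid for region {index}")
        grids.append(Grid(region, n_rows, n_cols))
    return grids

def extract_values(grid, tokens):
    """Stage 3 stand-in: place each (row, col, text) token into the grid."""
    cells = []
    for row, col, text in tokens:
        # Reject values that do not fit the structure found in stage 2.
        if not (0 <= row < grid.n_rows and 0 <= col < grid.n_cols):
            raise ValueError(f"token {text!r} falls outside the recognized grid")
        cells.append({"row": row, "col": col, "text": text})
    return {"page": grid.region.page, "n_rows": grid.n_rows,
            "n_cols": grid.n_cols, "cells": cells}

raw = [
    {"kind": "text", "page": 1, "bbox": (50, 40, 550, 90)},
    {"kind": "table", "page": 1, "bbox": (50, 120, 550, 400)},
]
regions = analyze_layout(raw)
grids = recognize_structure(regions, {1: (2, 2)})
result = extract_values(
    grids[0],
    [(0, 0, "Item"), (0, 1, "Amount"), (1, 0, "Fees"), (1, 1, "120.00")],
)
```

Because each stage raises on input that contradicts the previous stage's output, a mistake surfaces where it occurs instead of propagating silently into the extracted values.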
Table structure preservation across page breaks
When a table starts on page 4 and ends on page 7, most tools treat each page fragment independently. The result is four separate partial tables with lost headers, duplicated rows, and broken relationships.
anyformat detects table continuations across page breaks and reconstructs the complete table as a single structure. Headers are associated with all rows they govern, even when the header row is pages away from the data. Row groups and sub-totals maintain their hierarchical relationships throughout.
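As a rough illustration of stitching, the sketch below merges per-page fragments into one table under a deliberately simple rule: a fragment continues the previous table if it sits on the next page and either has no header or repeats the previous header. The function stitch_fragments, the fragment dictionary shape, and the continuation rule are all assumptions made for this example; they do not describe how anyformat detects continuations.

```python
def stitch_fragments(fragments):
    """Merge per-page table fragments into complete tables.

    Each fragment is a dict with "page", "header" (list or None), and
    "rows" (list of lists). This shape is hypothetical.
    """
    tables = []
    for frag in sorted(fragments, key=lambda f: f["page"]):
        current = tables[-1] if tables else None
        continues = (
            current is not None
            and frag["page"] == current["pages"][-1] + 1
            and frag["header"] in (None, current["header"])
        )
        if continues:
            current["pages"].append(frag["page"])
        else:
            if frag["header"] is None:
                raise ValueError(f"table on page {frag['page']} starts without a header")
            current = {"header": frag["header"], "pages": [frag["page"]], "rows": []}
            tables.append(current)
        for row in frag["rows"]:
            # Every row is keyed by the governing header, however far away it is.
            current["rows"].append(
                {"page": frag["page"], "cells": dict(zip(current["header"], row))}
            )
    return tables

fragments = [
    {"page": 4, "header": ["Test", "Result"], "rows": [["Glucose", "5.1"]]},
    {"page": 5, "header": None, "rows": [["Sodium", "140"], ["Potassium", "4.2"]]},
    {"page": 6, "header": ["Test", "Result"], "rows": [["Calcium", "2.3"]]},
    {"page": 7, "header": None, "rows": [["Urea", "5.0"]]},
]
tables = stitch_fragments(fragments)
```

In this example the four fragments from pages 4 to 7 become a single table, the repeated header on page 6 is not emitted as a data row, and each row keeps the page it came from.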
Merged cell and nested layout handling
Merged cells — both horizontal and vertical spans — are a persistent source of extraction errors. A cell that spans three rows creates ambiguity: which row does the value belong to? A header that spans five columns groups those columns semantically, but most tools lose that grouping.
anyformat explicitly models cell spans in its output structure. A merged cell is represented with its span coordinates, not duplicated across rows or collapsed into the first cell. Nested tables (tables within table cells) are recursively extracted as structured sub-objects.
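One way to picture span-aware output is a cell record that stores its anchor position plus its row and column spans, with an optional nested table. The Python sketch below is a hypothetical data model written for this explanation. The class and field names (Cell, Table, row_span, col_span, nested) are assumptions, not anyformat's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Cell:
    row: int                 # anchor row (top edge of the span)
    col: int                 # anchor column (left edge of the span)
    text: str = ""
    row_span: int = 1
    col_span: int = 1
    nested: Optional["Table"] = None  # a table inside this cell, if any

    def covers(self, row, col):
        """True if the grid position (row, col) lies inside this cell's span."""
        return (self.row <= row < self.row + self.row_span
                and self.col <= col < self.col + self.col_span)

@dataclass
class Table:
    cells: list = field(default_factory=list)

    def cell_at(self, row, col):
        """Return the single cell that covers a grid position, or None."""
        for cell in self.cells:
            if cell.covers(row, col):
                return cell
        return None

    def to_dict(self):
        """Serialize to plain dicts, recursing into nested tables."""
        out = []
        for cell in self.cells:
            record = {"row": cell.row, "col": cell.col, "text": cell.text,
                      "row_span": cell.row_span, "col_span": cell.col_span}
            if cell.nested is not None:
                record["nested"] = cell.nested.to_dict()
            out.append(record)
        return {"cells": out}

header = Cell(row=0, col=0, text="2024", col_span=5)
group = Cell(row=1, col=0, text="Revenue", row_span=3)
inner = Table([Cell(0, 0, "inner value")])
holder = Cell(row=1, col=1, nested=inner)
table = Table([header, group, holder])
```

With this representation, the question "which row does the value belong to?" has a direct answer: table.cell_at(3, 0) returns the same "Revenue" cell as table.cell_at(1, 0), because the value is stored once with its span instead of being copied or dropped.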
Figure detection and classification
Complex documents contain more than text and tables. Charts, diagrams, photographs, signatures, stamps, and embedded images carry information that text extraction misses entirely.
anyformat detects figures within documents, classifies them by type (chart, diagram, photograph, signature), and generates structured descriptions that capture what the visual element represents in context. This is particularly valuable for technical documents, inspection reports, and scientific papers where figures are load-bearing content.
Confidence scoring per cell
Not every cell in a complex table is equally easy to extract. A clearly printed numeric value in a well-structured table might deserve 99% confidence. A handwritten annotation in a merged cell spanning a page break might deserve 60%.
anyformat assigns calibrated confidence scores at the cell level, not just the document or field level. This means downstream systems and human reviewers know exactly which values to trust and which to verify. The cost of a wrong value in a financial table or medical record is not abstract — per-cell confidence makes review efficient and targeted.
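A downstream consumer might use per-cell scores to build a review queue. The sketch below assumes each extracted cell arrives as a dictionary with a "confidence" value between 0 and 1. That field name, the cells_needing_review helper, and the 0.90 threshold are illustrative assumptions, not documented anyformat behavior.

```python
def cells_needing_review(cells, threshold=0.90):
    """Return cells below the threshold, least confident first."""
    flagged = [cell for cell in cells if cell["confidence"] < threshold]
    return sorted(flagged, key=lambda cell: cell["confidence"])

cells = [
    {"row": 2, "col": 1, "text": "1,204.50", "confidence": 0.99},
    {"row": 7, "col": 3, "text": "approx. 40", "confidence": 0.60},
    {"row": 3, "col": 1, "text": "88.10", "confidence": 0.85},
]
queue = cells_needing_review(cells)
```

Here the reviewer sees two cells instead of three, with the handwritten-style 0.60 value first. The right threshold depends on the cost of an error in the specific workflow.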
No lossy Markdown conversion
Many extraction tools, including LlamaParse and other RAG-focused parsers, convert documents to Markdown as an intermediate representation. Markdown is a lightweight text format. Its pipe-table syntax has no notation for merged cells, row or column spans, or multi-level headers.
When a table with merged cells, hierarchical headers, and multi-page spans is converted to Markdown, the result is a pipe-delimited grid that has lost most of its structural information. That loss is usually unrecoverable — downstream extraction cannot reconstruct what the conversion destroyed.
anyformat outputs structured JSON that preserves the complete table structure. Row spans, column spans, header hierarchies, cell types, and positional relationships are all retained. No intermediate Markdown step. No lossy conversion.
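The loss can be shown in a few lines. The sketch below renders a span-aware cell list as a pipe-delimited grid and compares two structurally different tables. The to_markdown function and the cell dictionary shape are hypothetical, and the renderer assumes one common convention (text goes in the span's first cell, the rest stay empty). A renderer that repeats the text across the span loses the grouping in a different way.

```python
def to_markdown(cells, n_rows, n_cols):
    """Render cells as a pipe table; span fields are ignored because
    the pipe-table syntax has nowhere to put them."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        grid[cell["row"]][cell["col"]] = cell["text"]
    lines = ["| " + " | ".join(row) + " |" for row in grid]
    lines.insert(1, "|" + " --- |" * n_cols)  # header separator row
    return "\n".join(lines)

# "Q1" is a merged header that groups both month columns.
merged = [
    {"row": 0, "col": 0, "text": "Q1", "col_span": 2},
    {"row": 1, "col": 0, "text": "Jan"},
    {"row": 1, "col": 1, "text": "Feb"},
]
# "Q1" labels only the first column; the second header is blank.
unmerged = [
    {"row": 0, "col": 0, "text": "Q1"},
    {"row": 1, "col": 0, "text": "Jan"},
    {"row": 1, "col": 1, "text": "Feb"},
]

same_markdown = to_markdown(merged, 2, 2) == to_markdown(unmerged, 2, 2)
same_structure = merged == unmerged
```

The two inputs differ in structure but produce identical Markdown, so nothing downstream of the conversion can tell whether "Q1" governed one column or two.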
Reducto's RD-TableBench benchmark demonstrates how challenging complex table extraction is. anyformat addresses that challenge by preserving structure all the way through the pipeline, from layout analysis to final JSON output.
Visual grounding: see what the system sees
Every extracted value in anyformat is visually grounded — linked back to its exact position in the source document. When a reviewer questions a value, they can see the bounding box on the original document, verifying not just the extracted text but where the system found it.
For complex layouts where the same number might appear in multiple table cells, visual grounding eliminates ambiguity about which cell was extracted.
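As a sketch of how grounding resolves that ambiguity, suppose each extracted value carries a page number and a bounding box. The record shape, the field names, and the values_at helper below are assumptions for illustration and are not taken from anyformat's output format.

```python
def values_at(extracted, page, x, y):
    """Return extracted values whose bounding box contains the point (x, y)."""
    hits = []
    for value in extracted:
        x0, y0, x1, y1 = value["bbox"]
        if value["page"] == page and x0 <= x <= x1 and y0 <= y <= y1:
            hits.append(value)
    return hits

# The same text appears in two different cells on the same page.
extracted = [
    {"field": "deductible", "text": "1,000", "page": 3, "bbox": (120, 200, 180, 220)},
    {"field": "copay_limit", "text": "1,000", "page": 3, "bbox": (400, 200, 460, 220)},
]
hits = values_at(extracted, page=3, x=410, y=215)
```

The text alone cannot separate the two "1,000" values, but the location can: the lookup returns only the copay_limit record, which is what a reviewer would see highlighted on the source page.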
Built for the documents that break everything else
If your documents are simple single-page forms with consistent layouts, most tools will work. If your documents contain multi-page tables with merged cells, nested structures, figures, and handwritten annotations, you need a pipeline built for complexity.
Try anyformat on your most complex documents →
anyformat is the document intelligence platform built for enterprises that process complex, high-stakes documents. ISO 27001 certified, GDPR-compliant, with zero-retention processing and on-premise deployment. Learn more at anyformat.ai

