Can anyformat extract data from multi-page tables?

Yes. anyformat's multi-stage pipeline preserves table structure across page breaks, merged cells, and nested layouts. Unlike tools that output Markdown (losing structural information), anyformat outputs structured JSON that preserves all relational data.

How does anyformat handle complex document layouts?

The extraction pipeline includes layout analysis, structure recognition, and contextual field extraction. It handles multi-column layouts, tables within tables, headers spanning multiple rows, and mixed content (text + tables + images) in a single pass.

What makes anyformat different from OCR tools on complex documents?

Standard OCR returns raw text or bounding boxes. anyformat goes further — it understands the semantic structure of the document, extracting fields into structured JSON with confidence scoring. The multi-stage pipeline handles edge cases that break simpler tools.

Complex Table and Layout Extraction

Multi-page tables, merged cells, nested layouts — extracted with structure intact.

The problem: structure is the first casualty

Documents with complex tables and layouts are where most extraction tools fail. Standard OCR reads characters but loses spatial relationships. LLMs process text but cannot reliably reconstruct table structure from visual input. Even specialized tools often convert tables to Markdown, which flattens merged cells, destroys row spans, and strips the relational structure that makes the data meaningful.

The real-world documents that matter most — financial statements with nested sub-tables, medical records with multi-page lab results, insurance policies with coverage matrices, engineering specifications with parameter tables — are exactly the ones that break.

A table that spans three pages with merged header cells and hierarchical row groups is not an edge case. It is Tuesday.

Multi-stage pipeline: layout to structure to extraction

anyformat processes complex documents through a multi-stage pipeline that separates layout analysis, structure recognition, and data extraction into distinct phases:

Stage 1 — Layout analysis: The system identifies regions of the document: text blocks, tables, figures, headers, footers, and page furniture. Each region is classified and spatially anchored.

Stage 2 — Structure recognition: For tables, the system reconstructs the full grid structure: row and column boundaries, merged cells, header hierarchies, and spanning relationships. For multi-page tables, structure is stitched across page breaks with continuity preserved.

Stage 3 — Data extraction: With structure understood, the system extracts values into their correct positions within the recognized structure. The output is structured JSON that preserves every relationship — not flattened text.

This separation matters. When layout analysis, structure recognition, and extraction are collapsed into a single step, as they are in most LLM-based tools, the system has to solve three problems simultaneously and errors compound. anyformat solves them in sequence, with each stage validating the output of the previous one.
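The idea of running stages in sequence with a validation gate between them can be sketched in a few lines of Python. This is only an illustration of the general pattern, not anyformat's implementation: the names Stage, StageError, run_pipeline, and the three toy stage functions are all hypothetical, and the input is a hand-made list of tagged rows rather than a real document.

```python
from typing import Any, Callable, NamedTuple


class Stage(NamedTuple):
    name: str
    run: Callable[[Any], Any]
    check: Callable[[Any], bool]  # gate applied to this stage's output


class StageError(Exception):
    """Raised when a stage's output fails its validation gate."""


def run_pipeline(document: Any, stages: list) -> Any:
    result = document
    for stage in stages:
        result = stage.run(result)
        # Stop early instead of letting a bad intermediate result propagate.
        if not stage.check(result):
            raise StageError(f"{stage.name} output failed validation")
    return result


# Toy stages (hypothetical): the input rows are tagged as "text" or "table".
def find_table_rows(tagged_rows):
    return [cells for kind, cells in tagged_rows if kind == "table"]


def build_grid(rows):
    return {"header": rows[0], "body": rows[1:]}


def extract_records(grid):
    return [dict(zip(grid["header"], row)) for row in grid["body"]]


STAGES = [
    Stage("layout", find_table_rows, lambda rows: len(rows) >= 2),
    Stage("structure", build_grid,
          lambda g: all(len(r) == len(g["header"]) for r in g["body"])),
    Stage("extraction", extract_records, lambda recs: all(recs)),
]

document = [
    ("text", ["Quarterly report"]),
    ("table", ["Item", "Amount"]),
    ("table", ["Rent", "1200"]),
    ("table", ["Power", "300"]),
]
records = run_pipeline(document, STAGES)
```

In this sketch a ragged table (a body row with the wrong number of cells) is rejected at the structure stage, before any values are extracted from it.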

Table structure preservation across page breaks

When a table starts on page 4 and ends on page 7, most tools treat each page fragment independently. The result is four separate partial tables with lost headers, duplicated rows, and broken relationships.

anyformat detects table continuations across page breaks and reconstructs the complete table as a single structure. Headers are associated with all rows they govern, even when the header row is pages away from the data. Row groups and sub-totals maintain their hierarchical relationships throughout.
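As a rough illustration of what stitching involves, the sketch below merges per-page table fragments into one table and attaches the header to every row. It is a simplified model under stated assumptions, not anyformat's algorithm: TableFragment and stitch_fragments are hypothetical names, the first row of the earliest fragment is assumed to be the header, and a repeated header on a continuation page is detected by exact equality.

```python
from dataclasses import dataclass


@dataclass
class TableFragment:
    page: int
    rows: list  # list of rows, each row a list of cell strings


def stitch_fragments(fragments):
    """Merge page fragments of one logical table into a single structure."""
    if not fragments:
        raise ValueError("no fragments to stitch")
    ordered = sorted(fragments, key=lambda f: f.page)
    header = ordered[0].rows[0]  # assumption: first row of first page
    body = []
    for frag in ordered:
        rows = frag.rows
        # Skip the header row, including copies repeated on later pages.
        if rows and rows[0] == header:
            rows = rows[1:]
        for row in rows:
            # Every row carries its header mapping and its source page.
            body.append({"page": frag.page, "values": dict(zip(header, row))})
    return {"header": header, "pages": [f.page for f in ordered], "rows": body}


fragments = [
    TableFragment(5, [["Test", "Result"], ["Glucose", "5.4"]]),
    TableFragment(4, [["Test", "Result"], ["Sodium", "140"], ["Potassium", "4.1"]]),
]
table = stitch_fragments(fragments)
```

Here the row found on page 5 is still keyed by the header that first appeared on page 4, which is the property the paragraph above describes.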

Merged cell and nested layout handling

Merged cells — both horizontal and vertical spans — are a persistent source of extraction errors. A cell that spans three rows creates ambiguity: which row does the value belong to? A header that spans five columns groups those columns semantically, but most tools lose that grouping.

anyformat explicitly models cell spans in its output structure. A merged cell is represented with its span coordinates, not duplicated across rows or collapsed into the first cell. Nested tables (tables within table cells) are recursively extracted as structured sub-objects.
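One way to picture a span-aware representation is a cell record that stores its anchor position plus its row and column span, so a merged cell exists once and still answers for every grid position it covers. The sketch below is an assumption made for illustration: the Cell fields and the helper names are hypothetical and do not describe anyformat's actual output schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: int            # anchor row (top-left corner of the cell)
    col: int            # anchor column
    text: str
    row_span: int = 1
    col_span: int = 1


def covered_positions(cell):
    """All (row, col) grid positions that a possibly merged cell occupies."""
    return [
        (r, c)
        for r in range(cell.row, cell.row + cell.row_span)
        for c in range(cell.col, cell.col + cell.col_span)
    ]


def build_lookup(cells):
    """Map each grid position to the single cell that owns it."""
    lookup = {}
    for cell in cells:
        for pos in covered_positions(cell):
            if pos in lookup:
                raise ValueError(f"cells overlap at {pos}")
            lookup[pos] = cell
    return lookup


cells = [
    Cell(0, 0, "Region", row_span=2),    # vertical merge over two header rows
    Cell(0, 1, "Revenue", col_span=2),   # horizontal merge grouping Q1 and Q2
    Cell(1, 1, "Q1"),
    Cell(1, 2, "Q2"),
]
lookup = build_lookup(cells)
```

Because the merged cell is stored once and referenced from each position, the grouping of Q1 and Q2 under Revenue can be recovered by looking up position (0, 2), with no duplicated or collapsed values.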

Figure detection and classification

Complex documents contain more than text and tables. Charts, diagrams, photographs, signatures, stamps, and embedded images carry information that text extraction misses entirely.

anyformat detects figures within documents, classifies them by type (chart, diagram, photograph, signature), and generates structured descriptions that capture what the visual element represents in context. This is particularly valuable for technical documents, inspection reports, and scientific papers where figures are load-bearing content.

Confidence scoring per cell

Not every cell in a complex table is equally easy to extract. A clearly printed numeric value in a well-structured table might deserve 99% confidence. A handwritten annotation in a merged cell spanning a page break might deserve 60%.

anyformat assigns calibrated confidence scores at the cell level, not just the document or field level. Downstream systems and human reviewers can see which values to trust and which to verify. A wrong value in a financial table or medical record has a real cost, and per-cell confidence lets review effort go to the cells most likely to be wrong.
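A downstream system might use per-cell scores to build a review queue. The sketch below is a minimal example of that idea, assuming each cell arrives as a dictionary with a confidence value between 0 and 1. The field names, the cells_needing_review function, and the 0.9 threshold are assumptions for illustration, not part of a documented anyformat format.

```python
def cells_needing_review(cells, threshold=0.9):
    """Return cells below the threshold, least confident first."""
    flagged = [cell for cell in cells if cell["confidence"] < threshold]
    return sorted(flagged, key=lambda cell: cell["confidence"])


cells = [
    {"row": 0, "col": 1, "text": "1,200", "confidence": 0.99},       # clean print
    {"row": 3, "col": 2, "text": "approx. 40", "confidence": 0.60},  # handwritten
    {"row": 2, "col": 1, "text": "875", "confidence": 0.85},
]
queue = cells_needing_review(cells)
```

A reviewer working through this queue sees the handwritten annotation first and never has to re-check the clearly printed value.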

No lossy Markdown conversion

Many extraction tools, including LlamaParse and other RAG-focused parsers, convert documents to Markdown as an intermediate representation. Markdown is a text format. It was not designed to represent table structure beyond a flat grid of rows and columns.

When a table with merged cells, hierarchical headers, and multi-page spans is converted to Markdown, the result is a pipe-delimited grid that has lost most of its structural information. That loss is usually unrecoverable — downstream extraction cannot reconstruct what the conversion destroyed.

anyformat outputs structured JSON that preserves the complete table structure. Row spans, column spans, header hierarchies, cell types, and positional relationships are all retained. No intermediate Markdown step. No lossy conversion.
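The loss can be shown with a small experiment: render two structurally different tables as pipe-delimited Markdown and compare the results. The sketch below uses a hypothetical cell format (dictionaries with row, col, text, and an optional col_span) and a deliberately naive to_markdown function written for this example. It does not reproduce any particular tool's converter.

```python
def to_markdown(cells, n_rows, n_cols):
    """Render cells as a pipe table; span fields are silently dropped."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        # Pipe-table syntax has no slot for row or column spans,
        # so only the anchor position and text survive.
        grid[cell["row"]][cell["col"]] = cell["text"]
    lines = ["| " + " | ".join(row) + " |" for row in grid]
    lines.insert(1, "|" + " --- |" * n_cols)  # delimiter row after the header
    return "\n".join(lines)


# A header cell merged across two columns.
merged = [
    {"row": 0, "col": 0, "text": "Revenue", "col_span": 2},
    {"row": 1, "col": 0, "text": "Q1"},
    {"row": 1, "col": 1, "text": "Q2"},
]

# A different table: "Revenue" labels only the first column.
unmerged = [
    {"row": 0, "col": 0, "text": "Revenue"},
    {"row": 0, "col": 1, "text": ""},
    {"row": 1, "col": 0, "text": "Q1"},
    {"row": 1, "col": 1, "text": "Q2"},
]

same_output = to_markdown(merged, 2, 2) == to_markdown(unmerged, 2, 2)
```

Both inputs produce identical Markdown, so nothing downstream can tell whether Q2 belongs under Revenue. A JSON record that keeps col_span does not have this ambiguity.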

Reducto's RD-TableBench benchmark demonstrates how challenging complex table extraction is. anyformat addresses that challenge by preserving structure all the way through the pipeline, from layout analysis to final JSON output.

Visual grounding: see what the system sees

Every extracted value in anyformat is visually grounded — linked back to its exact position in the source document. When a reviewer questions a value, they can see the bounding box on the original document, verifying not just the extracted text but where the system found it.

For complex layouts where the same number might appear in multiple table cells, visual grounding eliminates ambiguity about which cell was extracted.
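A grounded value can be thought of as the extracted text plus the page and bounding box it came from. The sketch below shows how that resolves two cells containing the same number. The BBox and GroundedValue classes, the value_at helper, and the coordinate convention (page units with x0 <= x1 and y0 <= y1) are assumptions made for this example rather than anyformat's documented format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BBox:
    page: int
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


@dataclass(frozen=True)
class GroundedValue:
    label: str
    text: str
    bbox: BBox


def value_at(values, page: int, x: float, y: float) -> Optional[GroundedValue]:
    """Return the extracted value whose box contains the point, if any."""
    for value in values:
        if value.bbox.page == page and value.bbox.contains(x, y):
            return value
    return None


values = [
    GroundedValue("subtotal", "1,200", BBox(2, 400, 310, 460, 325)),
    GroundedValue("total", "1,200", BBox(2, 400, 350, 460, 365)),
]
hit = value_at(values, page=2, x=430, y=357)
```

The two values have identical text, but their boxes differ, so a reviewer clicking on the lower cell is shown the total and not the subtotal.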

Built for the documents that break everything else

If your documents are simple single-page forms with consistent layouts, most tools will work. If your documents contain multi-page tables with merged cells, nested structures, figures, and handwritten annotations, you need a pipeline built for complexity.

Try anyformat on your most complex documents →

anyformat is the document intelligence platform built for enterprises that process complex, high-stakes documents. ISO 27001 certified, GDPR-compliant, with zero-retention processing and on-premise deployment. Learn more at anyformat.ai

