Surface · Extraction

One typed document model for every source.

A PDF brief, a DOCX redline, an EDGAR filing, an Excel exhibit, a public web page — all become the same shape. Every paragraph, table cell, and footnote carries the page it came from, its bounding box, and its character span. Search, citation checking, and AI answers all read the same model and can always trace back to the source.

pip install kaos-pdf kaos-web kaos-office kaos-tabular kaos-source

One model, every source

A practice that handles briefs, redlines, deal documents, expert reports, and EDGAR filings cannot afford a different output format per source. The five extraction packages all produce the same typed ContentDocument defined by kaos-content. Search, chunking, citation extraction, AI grounding, and agent recipes read that one shape and do not care which extractor produced it.
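The "one shape, many extractors" idea can be sketched in a few lines. This is an illustration only: `Paragraph`, `Document`, `from_pdf_text`, `from_html_text`, and `find_mentions` are hypothetical stand-ins invented for this example, not the kaos-content API, and the real ContentDocument is richer than this.

```python
from dataclasses import dataclass, field


@dataclass
class Paragraph:
    text: str
    extractor: str  # name of the (hypothetical) extractor that produced this node


@dataclass
class Document:
    source: str
    paragraphs: list[Paragraph] = field(default_factory=list)


def from_pdf_text(source: str, pages: list[str]) -> Document:
    """Stand-in for a PDF extractor: one paragraph per page string."""
    return Document(source, [Paragraph(p, "pdf") for p in pages])


def from_html_text(source: str, bodies: list[str]) -> Document:
    """Stand-in for an HTML extractor: one paragraph per body string."""
    return Document(source, [Paragraph(b, "web") for b in bodies])


def find_mentions(doc: Document, phrase: str) -> list[Paragraph]:
    """Downstream consumer: reads only the shared shape, never the extractor."""
    needle = phrase.lower()
    return [p for p in doc.paragraphs if needle in p.text.lower()]


docs = [
    from_pdf_text("agreement.pdf", ["Definitions.", "Change of Control means ..."]),
    from_html_text("10-k.html", ["Upon a change of control, ...", "Risk factors."]),
]
# One loop serves both sources because both produce the same Document type.
hits = [(d.source, p.extractor) for d in docs for p in find_mentions(d, "change of control")]
```

The consumer has a single code path; adding another source type means adding an extractor, not touching `find_mentions`.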

Provenance is preserved end-to-end. The source identifier, page number, bounding box, character span, an optional confidence score, and the name of the extractor ride along on every node. A unique node ID and a JSON-pointer reference let any downstream consumer — a search hit, a citation verdict, an LLM answer — point back at the exact location in the original file.
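As a rough sketch of what such a record and back-reference could look like: the `Provenance` class, its field names, the bounding-box and span conventions, and the dict-based document tree below are all assumptions made for illustration, not the kaos-content schema. The pointer resolver is a minimal reading of RFC 6901 (JSON Pointer) and skips edge cases such as the `-` array token.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Provenance:
    source_id: str
    page: int
    bbox: tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)
    char_span: tuple[int, int]               # assumed (start, end) offsets
    extractor: str
    confidence: Optional[float] = None       # optional, as in the text above


def resolve_pointer(tree: Any, pointer: str) -> Any:
    """Resolve a JSON Pointer against nested dicts and lists."""
    if pointer == "":
        return tree  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    node = tree
    for raw in pointer[1:].split("/"):
        # Unescape ~1 before ~0, as RFC 6901 requires.
        token = raw.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


# Hypothetical two-node document tree with provenance on every node.
doc = {
    "children": [
        {"id": "n1", "type": "paragraph", "text": "Definitions.",
         "prov": Provenance("agreement.pdf", 1, (72.0, 90.0, 540.0, 110.0), (0, 12), "pdf")},
        {"id": "n2", "type": "paragraph", "text": "Change of Control means ...",
         "prov": Provenance("agreement.pdf", 2, (72.0, 90.0, 540.0, 130.0), (13, 40), "pdf", 0.98)},
    ]
}

# A search hit or LLM answer that stored "/children/1" can recover the location.
node = resolve_pointer(doc, "/children/1")
citation = f"{node['prov'].source_id} p.{node['prov'].page}"
```

Storing a pointer plus a node ID, rather than a copy of the text, keeps the citation tied to one location in the source even when the same sentence appears more than once.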

[Pipeline diagram: 01 Source (PDF, DOCX/PPTX, HTML / EDGAR, XLSX / CSV) -> 02 Extract (kaos-pdf, kaos-web, kaos-office, kaos-tabular) -> 03 Annotate (provenance: BBox + page, char_span, confidence) -> 04 ContentDocument (typed AST, JSON round-trip, Markdown export). Outcome: searchable.]

What each extractor handles

Every package ships an MCP server an agent can call directly. Tool counts come from the live kaos status --json baseline (2026-05-04).

kaos-pdf: 7 MCP tools
kaos-web: 31 MCP tools
kaos-office: 14 MCP tools
kaos-tabular: 8 MCP tools
kaos-source: 22 MCP tools
Total: 82 tools across these five

A taste

Read three forms of one deal — the executed PDF, an opposing-counsel redline, and the 10-K incorporating the agreement by reference. Same return shape; same provenance contract; one downstream loop pulls the change-of-control language out of each.

from kaos_pdf import parse_pdf
from kaos_office import parse_docx
from kaos_web import html_to_document
from kaos_content.views import DocumentView

# Three sources, one merger agreement.
executed = parse_pdf("merger-agreement-executed.pdf")
redline = parse_docx("opposing-counsel-redline.docx", track_changes=True)
with open("acquirer-10k.html") as f:  # close the file once it is read
    filing = html_to_document(f.read())

# Pull every paragraph that mentions change of control,
# with the page it came from for the diligence memo.
# The same loop runs over all three documents.
for label, doc in [("executed", executed), ("redline", redline), ("10-K", filing)]:
    for view in DocumentView(doc).paragraphs:
        if "change of control" in view.text.lower():
            print(f"{label} p.{view.page}: {view.text[:120]}...")

How it compares

vs. Docling, MarkItDown, mammoth. Verified head-to-head on 8 real legal DOCX fixtures (benchmarks in the kaos-office repository): kaos-office is 5–8× faster than mammoth, 8–12× faster than MarkItDown, and 6–75× faster than Docling. None of those alternatives carries the AST + provenance contract through the rest of the platform; switching extractors loses the downstream search / citation / verification machinery.

vs. Harvey ingestion, CoCounsel federated search. The proprietary platforms ship integrated ingestion against their own document stores. KAOS is the open-source layer underneath either approach: bring your own corpus, get the same typed model, the same provenance, and the same citation pipeline. No platform lock-in.

License posture: pypdfium2 over PyMuPDF on purpose — Apache-2.0 friendly, no AGPL exposure, OEM redistribution unblocked. python-calamine is the optional fast XLSX engine. See /compare for the full named-competitor table.

Get started

See the quickstart, browse all 18 packages, or read the docs.