Surface · Extraction

One typed document model for every source.

A PDF brief, a DOCX redline, an EDGAR filing, an Excel exhibit, a public web page — all become the same shape. Every paragraph, table cell, and footnote carries the page it came from, its bounding box, and its character span. Search, citation checking, and AI answers all read the same model and can always trace back to the source.

pip install kaos-pdf kaos-web kaos-office kaos-tabular kaos-source

One model, every source

A practice that handles briefs, redlines, deal documents, expert reports, and EDGAR filings cannot afford a different output format per source. The five extraction packages all produce the same typed ContentDocument defined by kaos-content. Search, chunking, citation extraction, AI grounding, and agent recipes read that one shape and do not care which extractor produced it.

Provenance is preserved end-to-end. The source identifier, page number, bounding box, character span, an optional confidence score, and the name of the extractor ride along on every node. A unique node ID and a JSON-pointer reference let any downstream consumer — a search hit, a citation verdict, an LLM answer — point back at the exact location in the original file.
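
To make the provenance contract concrete, here is a minimal sketch in plain Python. The names `Provenance`, `Node`, and `resolve_pointer` are hypothetical illustrations, not the kaos-content API, and the sample paragraph is made up. The sketch only shows how a node ID plus a JSON-pointer reference could let a search hit or citation verdict get back to the exact node and its location in the original file.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical types for illustration only; the real kaos-content
# model may name and structure these fields differently.


@dataclass(frozen=True)
class Provenance:
    source_id: str                      # identifier of the original file
    page: Optional[int]                 # page number, if the source is paginated
    bbox: Optional[tuple[float, float, float, float]]  # x0, y0, x1, y1
    char_span: tuple[int, int]          # (start, end) character offsets
    extractor: str                      # name of the extractor that produced the node
    confidence: Optional[float] = None  # optional confidence score


@dataclass
class Node:
    node_id: str
    kind: str
    text: str
    provenance: Provenance
    children: list["Node"] = field(default_factory=list)


def resolve_pointer(root: Node, pointer: str) -> Node:
    """Follow a JSON-pointer style path such as "/children/0/children/2"."""
    if pointer == "":
        return root  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty pointer must start with '/'")
    tokens = pointer.split("/")[1:]
    if len(tokens) % 2 != 0:
        raise ValueError("expected 'children/<index>' pairs")
    node = root
    # In this simplified tree every step is a ("children", index) pair.
    for name, index in zip(tokens[0::2], tokens[1::2]):
        if name != "children":
            raise KeyError(f"unsupported pointer segment: {name}")
        node = node.children[int(index)]
    return node


# Build a tiny two-node document with made-up values.
para = Node(
    node_id="n1",
    kind="paragraph",
    text="Example paragraph mentioning change of control.",
    provenance=Provenance(
        source_id="merger-agreement-executed.pdf",
        page=12,
        bbox=(72.0, 540.0, 523.0, 588.0),
        char_span=(10432, 10479),
        extractor="pdf",
        confidence=0.98,
    ),
)
doc = Node(
    node_id="n0",
    kind="document",
    text="",
    provenance=Provenance("merger-agreement-executed.pdf", None, None, (0, 0), "pdf"),
    children=[para],
)

# A downstream consumer holding only the pointer can recover the location.
hit = resolve_pointer(doc, "/children/0")
print(f"{hit.provenance.source_id} p.{hit.provenance.page} span={hit.provenance.char_span}")
```

Keeping the provenance record immutable (`frozen=True`) is a design choice in this sketch: downstream stages can pass nodes around without any risk of rewriting where the text came from.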

What each extractor handles

Every package ships an MCP server an agent can call directly. Tool counts come from the live kaos status --json baseline (2026-05-04).

A taste

Read three forms of one deal: the executed PDF, an opposing-counsel redline, and the 10-K incorporating the agreement by reference. All three return the same shape under the same provenance contract, so one downstream loop pulls the change-of-control language out of each.

from kaos_pdf import parse_pdf
from kaos_office import parse_docx
from kaos_web import html_to_document
from kaos_content.views import DocumentView

# Three sources, one merger agreement.
executed = parse_pdf("merger-agreement-executed.pdf")
redline  = parse_docx("opposing-counsel-redline.docx", track_changes=True)
filing   = html_to_document(open("acquirer-10k.html").read())

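# Pull every paragraph that mentions change of control from each source,
# with the page it came from for the diligence memo.
sources = {"executed": executed, "redline": redline, "filing": filing}
for label, doc in sources.items():
    for view in DocumentView(doc).paragraphs:
        if "change of control" in view.text.lower():
            print(f"{label} p.{view.page}: {view.text[:120]}...")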

Packages in this group

kaos-pdf

7 tools. PDF → ContentDocument via pypdfium2. OCR, tables, vision optional. No AGPL.

kaos-web

31 tools. HTTP + Playwright + 4 search backends + DNS/WHOIS/TLS domain intelligence.

kaos-office

14 tools. DOCX/PPTX/XLSX read AND write. Track changes first-class. 5–75× faster than alternatives.

kaos-tabular

8 tools. SQL on legal data via DuckDB. CSV/JSON/Parquet/XLSX/SQLite, agent-ready errors.

kaos-source

22 tools. Federal Register, eCFR, EDGAR, GovInfo, GLEIF, EML/MBOX/EXIF forensics.

How it compares

vs. Docling, MarkItDown, mammoth. Verified head-to-head on 8 real legal DOCX fixtures (benchmarks in the kaos-office repository): kaos-office is 5–8× faster than mammoth, 8–12× faster than MarkItDown, and 6–75× faster than Docling. None of those alternatives carries the AST + provenance contract through the rest of the platform, so switching to one of them loses the downstream search / citation / verification machinery.

vs. Harvey ingestion, CoCounsel federated search. The proprietary platforms ship integrated ingestion against their own document stores. KAOS is the open-source layer underneath either approach: bring your own corpus, get the same typed model, the same provenance, and the same citation pipeline. No platform lock-in.

License posture: pypdfium2 over PyMuPDF on purpose — Apache-2.0 friendly, no AGPL exposure, OEM redistribution unblocked. python-calamine is the optional fast XLSX engine. See /compare for the full named-competitor table.

Get started

See the quickstart, browse all 18 packages, or read the docs.