📄 PDF Extraction (pdf-extraction)

PdfExtractionSkill parses unstructured enterprise PDFs into per-page text and document metadata so the agent can ingest them as context. It uses pypdf (a pure-Python library) — no native dependencies, no OCR. Scanned PDFs without an embedded text layer return empty strings; pre-process them with Tesseract or an equivalent OCR pipeline before invoking the skill.

When to use it

  • The agent needs to read content from a digitally-generated PDF (invoices, reports, contracts).
  • You want a single tool surface for PDF metadata (page count, author, encryption flag).
  • You're feeding PDF text into a downstream RAG pipeline and need a chunk-by-page boundary.
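A minimal sketch of the chunk-by-page idea from the last bullet. It assumes the text tool's result can be reduced to a list of strings, one per page; that shape, the PageChunk class, and the chunk_by_page helper are illustrative and not part of the skill.

```python
from dataclasses import dataclass


@dataclass
class PageChunk:
    """One retrieval chunk that never crosses a page boundary (hypothetical type)."""
    source: str
    page_number: int  # 1-based, so citations match what a human reader sees
    text: str


def chunk_by_page(source: str, pages: list[str]) -> list[PageChunk]:
    """Turn per-page text into one chunk per non-empty page."""
    chunks = []
    for number, text in enumerate(pages, start=1):
        cleaned = text.strip()
        if not cleaned:
            # Blank pages, or scanned pages with no text layer, add no context.
            continue
        chunks.append(PageChunk(source=source, page_number=number, text=cleaned))
    return chunks


# Assumed input shape: one string per page, empty string for a page with no text.
pages = ["Invoice 1001\nTotal: 42.00", "", "Payment terms: net 30"]
chunks = chunk_by_page("invoice.pdf", pages)
```

Keeping the page number on each chunk lets the downstream retriever cite the page a passage came from.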

Tools

Tool                  Purpose
extract_pdf_metadata  Returns page count, document metadata (title, author, …), and encryption flag.
extract_pdf_text      Per-page text extraction, capped at 100 pages and 64,000 chars per page.
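The caps in the table can be pictured with the sketch below. The two limits come from the table; everything else is an assumption for illustration, including the helper names, the reading of page_range as a 1-based inclusive range, and the choice to truncate silently rather than raise when a cap is hit.

```python
MAX_PAGES = 100               # page cap stated in the tool table
MAX_CHARS_PER_PAGE = 64_000   # per-page character cap stated in the tool table


def parse_page_range(page_range: str, page_count: int) -> list[int]:
    """Expand a range such as "1-10" (assumed 1-based, inclusive) into 0-based indexes."""
    start_text, _, end_text = page_range.partition("-")
    start = int(start_text)
    end = int(end_text) if end_text else start  # "3" means page 3 only
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {page_range!r}")
    end = min(end, page_count)        # clamp to the pages that actually exist
    indexes = list(range(start - 1, end))
    return indexes[:MAX_PAGES]        # assumed behavior: keep only the first 100 pages


def cap_page_text(text: str) -> str:
    """Truncate one page of text to the per-page character cap."""
    return text[:MAX_CHARS_PER_PAGE]
```

For a document longer than 100 pages, a caller would likely need several calls with consecutive ranges to cover it.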

Configuration

No environment variables. The skill reads bytes that the agent (or the calling system) supplies, typically as a file path or a base64-encoded blob passed through a tool argument.
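As a rough sketch of how a calling system might turn either input form into bytes before handing them to the skill: the path argument name appears in the example on this page, while data_base64 and the load_pdf_bytes helper are hypothetical names, and the header check is an extra sanity test rather than documented skill behavior.

```python
import base64
from pathlib import Path
from typing import Optional


def load_pdf_bytes(path: Optional[str] = None, data_base64: Optional[str] = None) -> bytes:
    """Return raw PDF bytes from exactly one of a file path or a base64 blob."""
    if (path is None) == (data_base64 is None):
        raise ValueError("provide exactly one of path or data_base64")
    if path is not None:
        raw = Path(path).read_bytes()
    else:
        # validate=True rejects characters outside the base64 alphabet.
        raw = base64.b64decode(data_base64, validate=True)
    # Cheap sanity check: PDF files normally begin with the "%PDF-" header.
    if not raw.startswith(b"%PDF-"):
        raise ValueError("input does not look like a PDF")
    return raw
```

Rejecting non-PDF input this early keeps parser errors out of the agent's tool output.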

Example

from mirai_shared_skills.pdf_extraction import PdfExtractionSkill

skill = PdfExtractionSkill()
metadata_tool, text_tool = skill.get_tools()
# Agent calls: text_tool(path="/tmp/contract.pdf", page_range="1-10")

Caveats

  • OCR: text-bearing PDFs only. Scanned PDFs need OCR pre-processing.
  • Layout: text extraction is linear; complex multi-column layouts may produce out-of-order text.
  • Encryption: encrypted PDFs raise pypdf.errors.PdfReadError unless decrypted first.
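Because scanned pages come back as empty strings, a caller can detect them after extraction and route only those pages to OCR. The sketch below assumes the extracted text is available as a list of per-page strings; pages_needing_ocr and its min_chars threshold are hypothetical.

```python
def pages_needing_ocr(pages: list[str], min_chars: int = 1) -> list[int]:
    """Return the 1-based numbers of pages whose extracted text is (nearly) empty."""
    return [
        number
        for number, text in enumerate(pages, start=1)
        if len(text.strip()) < min_chars
    ]


# A mixed document: page 1 has a text layer, pages 2 and 3 look scanned.
needs_ocr = pages_needing_ocr(["Quarterly report", "", "  \n"])
```

Raising min_chars is one way to also catch pages where only a stray header or page number was extracted.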

Security considerations

Standard per ADR-0001: the skill reads bytes and parses them, and it does not write files. No SecureSkill wrapping is required for typical use.

If the calling system passes file paths that the agent can choose, consider wrapping the skill with a path-allowlist policy. Otherwise an agent could be prompted to attempt reads of sensitive files, for example extract_pdf_text(path="/etc/passwd"). That call would simply error, but the access attempt itself may be undesirable in some environments.
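One possible shape for a path-allowlist check, run by the calling system before any path reaches the tool. The allowed root /srv/agent-uploads, the is_path_allowed helper, and the .pdf suffix rule are illustrative assumptions, not part of the skill or of SecureSkill. The sketch uses pathlib only and needs Python 3.9 or later for is_relative_to.

```python
from pathlib import Path

# Hypothetical directory that the agent is allowed to read PDFs from.
ALLOWED_ROOTS = [Path("/srv/agent-uploads")]


def is_path_allowed(candidate: str, allowed_roots: list[Path]) -> bool:
    """Accept only .pdf paths that resolve to a location inside an allowed root."""
    # resolve() collapses ".." segments and follows symlinks, so a path such as
    # "/srv/agent-uploads/../../etc/passwd" is judged by where it really points.
    resolved = Path(candidate).resolve()
    if resolved.suffix.lower() != ".pdf":
        return False
    return any(resolved.is_relative_to(root.resolve()) for root in allowed_roots)
```

Checking the resolved path rather than the raw string is what defeats traversal tricks; a plain prefix comparison on the string would not.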