Data Extraction Patterns
Extracting structured data from unstructured text reliably.
Learning Objectives
- Design extraction pipelines for documents
- Handle edge cases in unstructured data
- Build confidence scoring for extractions
Data Extraction Patterns
Data extraction — pulling structured information from unstructured text — is one of the most common production use cases for Claude. Whether you are extracting entities from legal contracts, parsing financial reports, or pulling product details from web pages, the patterns are similar. This lesson covers the architecture of robust extraction pipelines, techniques for handling edge cases, and strategies for confidence scoring.
Extraction Pipeline Architecture
A production extraction pipeline is more than a single API call. It typically involves preprocessing, extraction, validation, and post-processing stages. Each stage adds reliability.
Stage 1: Preprocessing
- Text cleaning: Remove boilerplate, headers, footers, and irrelevant content that could confuse the model.
- Chunking: For long documents, split into manageable chunks that fit within the context window while preserving semantic boundaries.
- Deduplication: Ensure the same content is not extracted multiple times when processing overlapping chunks.
Stage 2: Extraction
The core extraction call to Claude, using the prompting and structured output techniques from Lessons 4.1 and 4.2.
Stage 3: Validation and Normalization
- Schema validation: Verify the output matches the expected structure.
- Business rule validation: Check that extracted values are plausible (e.g., a date is not in the future, a price is not negative).
- Normalization: Standardize formats (dates, phone numbers, addresses) to canonical forms.
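A minimal sketch of stage 3, assuming extracted values arrive as strings in a dict; the field names (`effective_date`, `total_value`), the expected ISO date input, and the error-list convention are all illustrative assumptions, not a fixed API:

```python
from datetime import date, datetime

def validate_and_normalize(record: dict) -> dict:
    """Apply business rules and normalize formats for one extracted record."""
    errors = []
    cleaned = dict(record)

    # Business rule: an effective date should not be in the future.
    if cleaned.get("effective_date"):
        parsed = datetime.strptime(cleaned["effective_date"], "%Y-%m-%d").date()
        if parsed > date.today():
            errors.append("effective_date is in the future")
        cleaned["effective_date"] = parsed.isoformat()  # canonical ISO form

    # Business rule: a price must not be negative. Normalize "$1,200.50" -> 1200.5.
    if cleaned.get("total_value") is not None:
        value = float(str(cleaned["total_value"]).replace("$", "").replace(",", ""))
        if value < 0:
            errors.append("total_value is negative")
        cleaned["total_value"] = value

    cleaned["validation_errors"] = errors
    return cleaned
```

Records with a non-empty `validation_errors` list can then be routed to review rather than rejected outright.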
Stage 4: Post-processing
- Deduplication: Merge duplicate entities found across chunks.
- Enrichment: Add metadata like extraction confidence, source location, and timestamps.
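The stage 4 merge-and-enrich step might look like this sketch, which assumes each extracted entity carries a `name` and the index of the chunk it came from (both hypothetical field names):

```python
from datetime import datetime, timezone

def merge_entities(extractions: list) -> list:
    """Merge duplicate entities found across chunks, keyed on a normalized name."""
    merged = {}
    for item in extractions:
        key = item["name"].strip().lower()
        if key not in merged:
            # First sighting: keep its values and enrich with metadata.
            merged[key] = {
                **item,
                "source_chunks": [item["chunk_index"]],
                "extracted_at": datetime.now(timezone.utc).isoformat(),
            }
        else:
            # Duplicate from an overlapping chunk: record the extra source only.
            merged[key]["source_chunks"].append(item["chunk_index"])
    return list(merged.values())
```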
```python
import anthropic
import json
from pydantic import BaseModel, Field
from typing import List, Optional


class ContractParty(BaseModel):
    name: str
    role: str = Field(description="buyer, seller, lessor, lessee, etc.")
    address: Optional[str] = None


class ContractClause(BaseModel):
    clause_type: str = Field(
        description="termination, payment, liability, confidentiality, etc."
    )
    summary: str
    original_text: str
    section_reference: Optional[str] = None


class ContractExtraction(BaseModel):
    parties: List[ContractParty]
    effective_date: Optional[str] = None
    expiration_date: Optional[str] = None
    total_value: Optional[str] = None
    key_clauses: List[ContractClause]
    governing_law: Optional[str] = None


def extract_contract_data(contract_text: str) -> ContractExtraction:
    client = anthropic.Anthropic()
    schema_str = json.dumps(
        ContractExtraction.model_json_schema(), indent=2
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=(
            "You are a legal document analyst. Extract structured "
            "data from contracts with high precision. If a field "
            "cannot be determined from the text, use null. Never "
            "invent information that is not present in the document."
        ),
        messages=[
            {"role": "user", "content": (
                f"Extract structured data from this contract.\n\n"
                f"<contract>\n{contract_text}\n</contract>\n\n"
                f"Return a JSON object matching this schema:\n"
                f"<schema>\n{schema_str}\n</schema>\n\n"
                f"Return ONLY the JSON object."
            )},
            # Prefill "{" so the model continues the JSON object directly.
            {"role": "assistant", "content": "{"}
        ]
    )
    json_str = "{" + response.content[0].text
    data = json.loads(json_str)
    return ContractExtraction.model_validate(data)
```

Handling Edge Cases
Real-world data is messy. A robust extraction system must handle cases where the expected information is missing, ambiguous, contradictory, or in an unexpected format.
Missing Data
Instruct Claude to return null or a sentinel value when data is not present, rather than guessing. This is critical — a hallucinated value is worse than a missing one.
```python
# In your system prompt or instructions:
SYSTEM_PROMPT = (
    "You are a precise data extraction assistant. Follow these rules:\n"
    "1. Only extract information explicitly stated in the text\n"
    "2. If a field is not mentioned or cannot be determined, use null\n"
    "3. If a value is ambiguous, extract the most likely interpretation "
    "and set confidence below 0.8\n"
    "4. Never infer, assume, or hallucinate values not in the source text\n"
    "5. If the same field appears multiple times with different values, "
    "extract all instances and flag the conflict"
)
```

Ambiguous Data
When the source text contains ambiguous information, you have two strategies:
- Return all candidates: Extract every possible interpretation and let downstream logic choose.
- Return the best guess with low confidence: Extract the most likely interpretation and flag it for human review.
```python
from pydantic import BaseModel, Field
from typing import List, Optional


class ExtractedField(BaseModel):
    """A single extracted field with confidence metadata."""
    field_name: str
    value: Optional[str]
    confidence: float = Field(
        ge=0.0, le=1.0,
        description="1.0 = explicitly stated, 0.5 = inferred, 0.0 = not found"
    )
    source_text: Optional[str] = Field(
        default=None,
        description="The exact text span this value was extracted from"
    )
    alternatives: List[str] = Field(
        default_factory=list,
        description="Other possible values if ambiguous"
    )
    needs_review: bool = Field(
        default=False,
        description="True if the extraction is uncertain or ambiguous"
    )
```

Confidence Scoring
Confidence scoring adds a reliability signal to each extraction, enabling downstream systems to route uncertain extractions to human reviewers. There are two main approaches:
Approach 1: Model Self-Assessment
Ask Claude to provide confidence scores alongside its extractions. This is simple but not perfectly calibrated — models tend to be overconfident.
```python
import anthropic

client = anthropic.Anthropic()
document_text = "..."  # the document to extract from

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    messages=[
        {"role": "user", "content": (
            "Extract the following fields from the document below. "
            "For each field, provide:\n"
            "- value: the extracted value (or null if not found)\n"
            "- confidence: a score from 0.0 to 1.0 where:\n"
            "  - 1.0 = explicitly and unambiguously stated\n"
            "  - 0.8 = clearly implied or stated with minor ambiguity\n"
            "  - 0.5 = inferred from context but not directly stated\n"
            "  - 0.3 = weak inference, likely needs human verification\n"
            "  - 0.0 = not present in the document at all\n"
            "- evidence: the exact quote from the document\n\n"
            "Fields to extract: company_name, revenue, fiscal_year, "
            "employee_count, headquarters_location\n\n"
            # This piece must be an f-string, or the document is never interpolated.
            f"<document>\n{document_text}\n</document>"
        )},
        {"role": "assistant", "content": "{"}
    ]
)
```

Approach 2: Dual-Pass Verification
Run the extraction twice with different prompting strategies and compare results. Fields where both passes agree get high confidence; disagreements get low confidence.
```python
import anthropic
import json


def dual_pass_extraction(text: str, fields: list) -> dict:
    client = anthropic.Anthropic()

    # Pass 1: Direct extraction
    pass1_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[
            {"role": "user", "content": (
                f"Extract these fields from the text: {', '.join(fields)}\n\n"
                f"<text>\n{text}\n</text>\n\n"
                f"Return a JSON object with each field as a key."
            )},
            {"role": "assistant", "content": "{"}
        ]
    )

    # Pass 2: Question-based extraction
    questions = {
        field: f"What is the {field.replace('_', ' ')} mentioned in this text? "
               f"Quote the relevant text. If not mentioned, say NOT_FOUND."
        for field in fields
    }
    pass2_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[
            {"role": "user", "content": (
                f"Answer each question based on this text:\n\n"
                f"<text>\n{text}\n</text>\n\n"
                f"<questions>\n"
                + "\n".join(f"- {k}: {v}" for k, v in questions.items())
                + "\n</questions>\n\n"
                "Return a JSON object with field names as keys and "
                "extracted values as values."
            )},
            {"role": "assistant", "content": "{"}
        ]
    )

    # Compare the two passes and compute confidence
    r1 = json.loads("{" + pass1_response.content[0].text)
    r2 = json.loads("{" + pass2_response.content[0].text)
    results = {}
    for field in fields:
        v1 = r1.get(field)
        v2 = r2.get(field)
        if v1 == v2:
            results[field] = {"value": v1, "confidence": 1.0}
        elif v1 and v2:
            results[field] = {
                "value": v1,
                "confidence": 0.5,
                "alternative": v2,
                "needs_review": True
            }
        else:
            results[field] = {
                "value": v1 or v2,
                "confidence": 0.3,
                "needs_review": True
            }
    return results
```

Chunking Strategies for Long Documents
When documents exceed the practical context window (extraction quality can degrade on very long documents even when they technically fit), you need a chunking strategy.
Overlapping Window Chunking
- Split the document into fixed-size chunks with overlap (e.g., 4000 tokens per chunk with 500-token overlap).
- Extract from each chunk independently.
- Deduplicate entities found in overlapping regions.
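A sketch of the overlapping-window splitter, using whitespace-separated words as a stand-in for real token counting (a production system would count tokens with the model's tokenizer instead):

```python
def overlapping_chunks(text: str, chunk_size: int = 4000, overlap: int = 500) -> list:
    """Split text into fixed-size word chunks with overlap between neighbors."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # each window starts `step` words after the last
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the document
    return chunks
```

The tail of each chunk repeats at the head of the next, which is exactly the overlap that the deduplication step above must reconcile.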
Semantic Chunking
- Split on semantic boundaries — section headers, paragraph breaks, page breaks.
- Preserves context better than fixed-size chunking.
- More complex to implement but produces higher-quality extractions.
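One way to sketch semantic chunking is to split on numbered section headings and then pack whole sections into chunks; the heading regex here is an assumption and should be adapted to your documents:

```python
import re

def semantic_chunks(text: str, max_chars: int = 12000) -> list:
    """Split on numbered section headings, then pack whole sections into
    chunks of up to max_chars without breaking a section in half."""
    # Boundary: a newline followed by a heading like "1. Term" or "2.3 Payment".
    sections = re.split(r"\n(?=\d+(?:\.\d+)*\.?\s+[A-Z])", text)
    chunks, current = [], ""
    for section in sections:
        if current and len(current) + len(section) > max_chars:
            chunks.append(current)   # current chunk is full; start a new one
            current = section
        else:
            current = current + "\n" + section if current else section
    if current:
        chunks.append(current)
    return chunks
```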
Hierarchical Extraction
- First pass: Extract high-level structure (sections, topics).
- Second pass: Extract detailed fields from each relevant section.
- This is essentially the orchestrator-workers pattern applied to extraction.
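The hierarchical pattern can be sketched as an orchestrator that takes the two passes as pluggable callables; the callables here are stand-ins for the actual Claude calls:

```python
from typing import Callable, Dict, List

def hierarchical_extract(
    document: str,
    outline_fn: Callable[[str], List[dict]],   # pass 1: document -> [{"title", "text"}]
    detail_fn: Callable[[str, str], dict],     # pass 2: (title, section text) -> fields
) -> Dict[str, dict]:
    """Orchestrator-workers applied to extraction: outline first, then
    detailed extraction per section."""
    results = {}
    for section in outline_fn(document):
        results[section["title"]] = detail_fn(section["title"], section["text"])
    return results
```

Keeping the two passes behind callables also makes the orchestration logic testable without touching the API.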
Extraction with Citations
For applications that require traceability — legal, compliance, medical — every extracted value should be linked back to its source in the original document.
```python
import anthropic
import json
from pydantic import BaseModel, Field
from typing import List, Optional


class CitedExtraction(BaseModel):
    field_name: str
    value: Optional[str]  # null allowed when no supporting quote exists
    source_quote: str = Field(
        description="Exact quote from the document that supports this extraction"
    )
    page_or_section: Optional[str] = Field(
        default=None,
        description="Page number or section reference"
    )
    confidence: float = Field(ge=0.0, le=1.0)


def extract_with_citations(document: str, fields: List[str]) -> List[CitedExtraction]:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=(
            "You are a precise document analyst. For every value you extract, "
            "you MUST provide the exact quote from the document that supports "
            "the extraction. The quote must be a verbatim substring of the "
            "original document. If you cannot find a supporting quote, set "
            "the value to null and confidence to 0.0."
        ),
        messages=[
            {"role": "user", "content": (
                f"Extract these fields from the document: {', '.join(fields)}\n\n"
                f"<document>\n{document}\n</document>\n\n"
                f"Return a JSON array of objects, each with: field_name, value, "
                f"source_quote, page_or_section, confidence."
            )},
            # Prefill "[" so the model continues the JSON array directly.
            {"role": "assistant", "content": "["}
        ]
    )
    json_str = "[" + response.content[0].text
    data = json.loads(json_str)
    return [CitedExtraction.model_validate(item) for item in data]
```
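Because models occasionally paraphrase instead of quoting, it is worth verifying the citations mechanically after extraction. This sketch (working on plain dicts rather than the Pydantic models, for brevity) zeroes the confidence of any extraction whose `source_quote` is not a verbatim substring of the document:

```python
def verify_citations(document: str, extractions: list) -> list:
    """Flag extractions whose source_quote does not appear verbatim in the document."""
    verified = []
    for item in extractions:
        item = dict(item)  # do not mutate the caller's data
        quote = item.get("source_quote") or ""
        if quote and quote in document:
            item["quote_verified"] = True
        else:
            item["quote_verified"] = False
            item["confidence"] = 0.0
            item["needs_review"] = True
        verified.append(item)
    return verified
```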