Data Extraction Patterns
Extracting structured data from unstructured text reliably.
Learning Objectives
- Design extraction pipelines for documents
- Handle edge cases in unstructured data
- Build confidence scoring for extractions
Data Extraction Patterns
Data extraction — pulling structured information from unstructured text — is one of the most common production use cases for Claude. Whether you are extracting entities from legal contracts, parsing financial reports, or pulling product details from web pages, the patterns are similar. This lesson covers the architecture of robust extraction pipelines, techniques for handling edge cases, and strategies for confidence scoring.
Extraction Pipeline Architecture
A production extraction pipeline is more than a single API call. It typically involves preprocessing, extraction, validation, and post-processing stages. Each stage adds reliability.
Stage 1: Preprocessing
- Text cleaning: Remove boilerplate, headers, footers, and irrelevant content that could confuse the model.
- Chunking: For long documents, split into manageable chunks that fit within the context window while preserving semantic boundaries.
- Deduplication: Ensure the same content is not extracted multiple times when processing overlapping chunks.
Stage 2: Extraction
The core extraction call to Claude, using the prompting and structured output techniques from Lessons 4.1 and 4.2.
Stage 3: Validation and Normalization
- Schema validation: Verify the output matches the expected structure.
- Business rule validation: Check that extracted values are plausible (e.g., a date is not in the future, a price is not negative).
- Normalization: Standardize formats (dates, phone numbers, addresses) to canonical forms.
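A minimal sketch of stage 3, assuming extracted values arrive as strings in a dict; the field names (`effective_date`, `total_value`), the expected ISO date input, and the error-list convention are all illustrative assumptions, not a fixed API:

```python
from datetime import date, datetime

def validate_and_normalize(record: dict) -> dict:
    """Apply business rules and normalize formats for one extracted record."""
    errors = []
    cleaned = dict(record)

    # Business rule: an effective date should not be in the future.
    if cleaned.get("effective_date"):
        parsed = datetime.strptime(cleaned["effective_date"], "%Y-%m-%d").date()
        if parsed > date.today():
            errors.append("effective_date is in the future")
        cleaned["effective_date"] = parsed.isoformat()  # canonical ISO form

    # Business rule: a price must not be negative. Normalize "$1,200.50" -> 1200.5.
    if cleaned.get("total_value") is not None:
        value = float(str(cleaned["total_value"]).replace("$", "").replace(",", ""))
        if value < 0:
            errors.append("total_value is negative")
        cleaned["total_value"] = value

    cleaned["validation_errors"] = errors
    return cleaned
```

Records with a non-empty `validation_errors` list can then be routed to review rather than rejected outright.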
Stage 4: Post-processing
- Deduplication: Merge duplicate entities found across chunks.
- Enrichment: Add metadata like extraction confidence, source location, and timestamps.
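The stage 4 merge-and-enrich step might look like this sketch, which assumes each extracted entity carries a `name` and the index of the chunk it came from (both hypothetical field names):

```python
from datetime import datetime, timezone

def merge_entities(extractions: list) -> list:
    """Merge duplicate entities found across chunks, keyed on a normalized name."""
    merged = {}
    for item in extractions:
        key = item["name"].strip().lower()
        if key not in merged:
            # First sighting: keep its values and enrich with metadata.
            merged[key] = {
                **item,
                "source_chunks": [item["chunk_index"]],
                "extracted_at": datetime.now(timezone.utc).isoformat(),
            }
        else:
            # Duplicate from an overlapping chunk: record the extra source only.
            merged[key]["source_chunks"].append(item["chunk_index"])
    return list(merged.values())
```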
```python
import anthropic
import json
from pydantic import BaseModel, Field
from typing import List, Optional


class ContractParty(BaseModel):
    name: str
    role: str = Field(description="buyer, seller, lessor, lessee, etc.")
    address: Optional[str] = None


class ContractClause(BaseModel):
    clause_type: str = Field(
        description="termination, payment, liability, confidentiality, etc."
    )
    summary: str
    original_text: str
    section_reference: Optional[str] = None


class ContractExtraction(BaseModel):
    parties: List[ContractParty]
    effective_date: Optional[str] = None
    expiration_date: Optional[str] = None
    total_value: Optional[str] = None
    key_clauses: List[ContractClause]
    governing_law: Optional[str] = None


def extract_contract_data(contract_text: str) -> ContractExtraction:
    client = anthropic.Anthropic()
    schema_str = json.dumps(
        ContractExtraction.model_json_schema(), indent=2
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=(
            "You are a legal document analyst. Extract structured "
            "data from contracts with high precision. If a field "
            "cannot be determined from the text, use null. Never "
            "invent information that is not present in the document."
        ),
        messages=[
            {"role": "user", "content": (
                f"Extract structured data from this contract.\n\n"
                f"<contract>\n{contract_text}\n</contract>\n\n"
                f"Return a JSON object matching this schema:\n"
                f"<schema>\n{schema_str}\n</schema>\n\n"
                f"Return ONLY the JSON object."
            )},
            # Prefill "{" so the model continues the JSON object directly.
            {"role": "assistant", "content": "{"}
        ]
    )
    json_str = "{" + response.content[0].text
    data = json.loads(json_str)
    return ContractExtraction.model_validate(data)
```

Handling Edge Cases
Real-world data is messy. A robust extraction system must handle cases where the expected information is missing, ambiguous, contradictory, or in an unexpected format.
Missing Data
Instruct Claude to return null or a sentinel value when data is not present, rather than guessing. This is critical — a hallucinated value is worse than a missing one.
```python
# In your system prompt or instructions:
SYSTEM_PROMPT = (
    "You are a precise data extraction assistant. Follow these rules:\n"
    "1. Only extract information explicitly stated in the text\n"
    "2. If a field is not mentioned or cannot be determined, use null\n"
    "3. If a value is ambiguous, extract the most likely interpretation "
    "and set confidence below 0.8\n"
    "4. Never infer, assume, or hallucinate values not in the source text\n"
    "5. If the same field appears multiple times with different values, "
    "extract all instances and flag the conflict"
)
```

Ambiguous Data
When the source text contains ambiguous information, you have two strategies:
- Return all candidates: Extract every possible interpretation and let downstream logic choose.
- Return the best guess with low confidence: Extract the most likely interpretation and flag it for human review.
```python
from pydantic import BaseModel, Field
from typing import List, Optional


class ExtractedField(BaseModel):
    """A single extracted field with confidence metadata."""
    field_name: str
    value: Optional[str]
    confidence: float = Field(
        ge=0.0, le=1.0,
        description="1.0 = explicitly stated, 0.5 = inferred, 0.0 = not found"
    )
    source_text: Optional[str] = Field(
        default=None,
        description="The exact text span this value was extracted from"
    )
    alternatives: List[str] = Field(
        default_factory=list,
        description="Other possible values if ambiguous"
    )
    needs_review: bool = Field(
        default=False,
        description="True if the extraction is uncertain or ambiguous"
    )
```

Confidence Scoring
Confidence scoring adds a reliability signal to each extraction, enabling downstream systems to route uncertain extractions to human reviewers. There are two main approaches:
Approach 1: Model Self-Assessment
Ask Claude to provide confidence scores alongside its extractions. This is simple but not perfectly calibrated — models tend to be overconfident.
```python
import anthropic

client = anthropic.Anthropic()
document_text = "..."  # the document to extract from

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    messages=[
        {"role": "user", "content": (
            "Extract the following fields from the document below. "
            "For each field, provide:\n"
            "- value: the extracted value (or null if not found)\n"
            "- confidence: a score from 0.0 to 1.0 where:\n"
            "  - 1.0 = explicitly and unambiguously stated\n"
            "  - 0.8 = clearly implied or stated with minor ambiguity\n"
            "  - 0.5 = inferred from context but not directly stated\n"
            "  - 0.3 = weak inference, likely needs human verification\n"
            "  - 0.0 = not present in the document at all\n"
            "- evidence: the exact quote from the document\n\n"
            "Fields to extract: company_name, revenue, fiscal_year, "
            "employee_count, headquarters_location\n\n"
            # This piece must be an f-string, or the document is never interpolated.
            f"<document>\n{document_text}\n</document>"
        )},
        {"role": "assistant", "content": "{"}
    ]
)
```

Approach 2: Dual-Pass Verification
Run the extraction twice with different prompting strategies and compare results. Fields where both passes agree get high confidence; disagreements get low confidence.
```python
import anthropic
import json


def dual_pass_extraction(text: str, fields: list) -> dict:
    client = anthropic.Anthropic()

    # Pass 1: Direct extraction
    pass1_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[
            {"role": "user", "content": (
                f"Extract these fields from the text: {', '.join(fields)}\n\n"
                f"<text>\n{text}\n</text>\n\n"
                f"Return a JSON object with each field as a key."
            )},
            {"role": "assistant", "content": "{"}
        ]
    )

    # Pass 2: Question-based extraction
    questions = {
        field: f"What is the {field.replace('_', ' ')} mentioned in this text? "
               f"Quote the relevant text. If not mentioned, say NOT_FOUND."
        for field in fields
    }
    pass2_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[
            {"role": "user", "content": (
                f"Answer each question based on this text:\n\n"
                f"<text>\n{text}\n</text>\n\n"
                f"<questions>\n"
                + "\n".join(f"- {k}: {v}" for k, v in questions.items())
                + "\n</questions>\n\n"
                "Return a JSON object with field names as keys and "
                "extracted values as values."
            )},
            {"role": "assistant", "content": "{"}
        ]
    )

    # Compare the two passes and compute confidence
    r1 = json.loads("{" + pass1_response.content[0].text)
    r2 = json.loads("{" + pass2_response.content[0].text)
    results = {}
    for field in fields:
        v1 = r1.get(field)
        v2 = r2.get(field)
        if v1 == v2:
            results[field] = {"value": v1, "confidence": 1.0}
        elif v1 and v2:
            results[field] = {
                "value": v1,
                "confidence": 0.5,
                "alternative": v2,
                "needs_review": True
            }
        else:
            results[field] = {
                "value": v1 or v2,
                "confidence": 0.3,
                "needs_review": True
            }
    return results
```

Chunking Strategies for Long Documents
When documents exceed the practical context window (extraction quality can degrade on very long documents even when they technically fit), you need a chunking strategy.
Overlapping Window Chunking
- Split the document into fixed-size chunks with overlap (e.g., 4000 tokens per chunk with 500-token overlap).
- Extract from each chunk independently.
- Deduplicate entities found in overlapping regions.
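A sketch of the overlapping-window splitter, using whitespace-separated words as a stand-in for real token counting (a production system would count tokens with the model's tokenizer instead):

```python
def overlapping_chunks(text: str, chunk_size: int = 4000, overlap: int = 500) -> list:
    """Split text into fixed-size word chunks with overlap between neighbors."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # each window starts `step` words after the last
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the document
    return chunks
```

The tail of each chunk repeats at the head of the next, which is exactly the overlap that the deduplication step above must reconcile.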
Semantic Chunking
- Split on semantic boundaries — section headers, paragraph breaks, page breaks.
- Preserves context better than fixed-size chunking.
- More complex to implement but produces higher-quality extractions.
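One way to sketch semantic chunking is to split on numbered section headings and then pack whole sections into chunks; the heading regex here is an assumption and should be adapted to your documents:

```python
import re

def semantic_chunks(text: str, max_chars: int = 12000) -> list:
    """Split on numbered section headings, then pack whole sections into
    chunks of up to max_chars without breaking a section in half."""
    # Boundary: a newline followed by a heading like "1. Term" or "2.3 Payment".
    sections = re.split(r"\n(?=\d+(?:\.\d+)*\.?\s+[A-Z])", text)
    chunks, current = [], ""
    for section in sections:
        if current and len(current) + len(section) > max_chars:
            chunks.append(current)   # current chunk is full; start a new one
            current = section
        else:
            current = current + "\n" + section if current else section
    if current:
        chunks.append(current)
    return chunks
```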
Hierarchical Extraction
- First pass: Extract high-level structure (sections, topics).
- Second pass: Extract detailed fields from each relevant section.
- This is essentially the orchestrator-workers pattern applied to extraction.
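The hierarchical pattern can be sketched as an orchestrator that takes the two passes as pluggable callables; the callables here are stand-ins for the actual Claude calls:

```python
from typing import Callable, Dict, List

def hierarchical_extract(
    document: str,
    outline_fn: Callable[[str], List[dict]],   # pass 1: document -> [{"title", "text"}]
    detail_fn: Callable[[str, str], dict],     # pass 2: (title, section text) -> fields
) -> Dict[str, dict]:
    """Orchestrator-workers applied to extraction: outline first, then
    detailed extraction per section."""
    results = {}
    for section in outline_fn(document):
        results[section["title"]] = detail_fn(section["title"], section["text"])
    return results
```

Keeping the two passes behind callables also makes the orchestration logic testable without touching the API.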
Extraction with Citations
For applications that require traceability — legal, compliance, medical — every extracted value should be linked back to its source in the original document.
```python
import anthropic
import json
from pydantic import BaseModel, Field
from typing import List, Optional


class CitedExtraction(BaseModel):
    field_name: str
    value: Optional[str]  # null allowed when no supporting quote exists
    source_quote: str = Field(
        description="Exact quote from the document that supports this extraction"
    )
    page_or_section: Optional[str] = Field(
        default=None,
        description="Page number or section reference"
    )
    confidence: float = Field(ge=0.0, le=1.0)


def extract_with_citations(document: str, fields: List[str]) -> List[CitedExtraction]:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=(
            "You are a precise document analyst. For every value you extract, "
            "you MUST provide the exact quote from the document that supports "
            "the extraction. The quote must be a verbatim substring of the "
            "original document. If you cannot find a supporting quote, set "
            "the value to null and confidence to 0.0."
        ),
        messages=[
            {"role": "user", "content": (
                f"Extract these fields from the document: {', '.join(fields)}\n\n"
                f"<document>\n{document}\n</document>\n\n"
                f"Return a JSON array of objects, each with: field_name, value, "
                f"source_quote, page_or_section, confidence."
            )},
            # Prefill "[" so the model continues the JSON array directly.
            {"role": "assistant", "content": "["}
        ]
    )
    json_str = "[" + response.content[0].text
    data = json.loads(json_str)
    return [CitedExtraction.model_validate(item) for item in data]
```
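Because models occasionally paraphrase instead of quoting, it is worth verifying the citations mechanically after extraction. This sketch (working on plain dicts rather than the Pydantic models, for brevity) zeroes the confidence of any extraction whose `source_quote` is not a verbatim substring of the document:

```python
def verify_citations(document: str, extractions: list) -> list:
    """Flag extractions whose source_quote does not appear verbatim in the document."""
    verified = []
    for item in extractions:
        item = dict(item)  # do not mutate the caller's data
        quote = item.get("source_quote") or ""
        if quote and quote in document:
            item["quote_verified"] = True
        else:
            item["quote_verified"] = False
            item["confidence"] = 0.0
            item["needs_review"] = True
        verified.append(item)
    return verified
```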