🛡️ Context & Reliability · Lesson 5.2

Long-Document Processing

Chunking, map-reduce, and hierarchical summarization patterns.

25 min

Learning Objectives

  • Design chunking strategies for large documents
  • Implement map-reduce processing patterns
  • Build hierarchical summarization pipelines

Long-Document Processing

Many real-world applications require Claude to process documents that are far too large to fit in a single context window, or that would consume so much of the window that meaningful analysis becomes impossible. Long-document processing techniques — chunking, map-reduce, and hierarchical summarization — allow you to break large documents into manageable pieces, process each piece independently, and then combine the results.

When Do You Need Chunking?

Even though Claude's 200K-token context window can hold roughly 150,000 words (about 500 pages of text), there are several reasons to chunk documents rather than sending them whole:

  • The document exceeds the context window. Legal contracts, codebases, book manuscripts, and research paper collections can easily exceed 200K tokens.
  • You need room for instructions and output. Sending a 180K-token document leaves only 20K tokens for the system prompt, tools, and model output.
  • Accuracy degrades with very long inputs. Research shows that model attention is strongest at the beginning and end of the context window — the "lost in the middle" effect. Chunking helps ensure every part of the document receives adequate attention.
  • You need parallel processing. Processing 10 chunks simultaneously is 10x faster than processing one massive document sequentially.
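This decision can be sketched as a quick token-budget check. The helper below is illustrative (the names `estimate_tokens` and `needs_chunking` are not part of any SDK), and the four-characters-per-token ratio is a rough heuristic for English text:

```python
CONTEXT_WINDOW = 200_000   # Claude's context window, in tokens
SAFETY_FRACTION = 0.75     # chunk once the document uses >75% of the window

def estimate_tokens(text: str) -> int:
    """Approximate token count: ~4 characters per token for English."""
    return len(text) // 4

def needs_chunking(text: str) -> bool:
    """True if the document should be chunked before sending."""
    return estimate_tokens(text) > CONTEXT_WINDOW * SAFETY_FRACTION

print(needs_chunking("x" * 400_000))   # ~100K tokens: fits comfortably
print(needs_chunking("x" * 800_000))   # ~200K tokens: chunk it
```

For precise counts in production you would use a real tokenizer rather than this character heuristic, but the heuristic is good enough for a chunk-or-not decision.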

Chunking Strategies

How you split a document matters enormously. Naive splitting (every N characters) can break sentences, paragraphs, and semantic units. Here are the main strategies:

Fixed-Size Chunking with Overlap

def chunk_by_tokens(text, chunk_size=4000, overlap=200):
    """
    Split text into chunks of approximately chunk_size tokens
    with overlap to preserve context across boundaries.

    Args:
        text: The full document text
        chunk_size: Target tokens per chunk
        overlap: Number of tokens to overlap between chunks
    Returns:
        List of text chunks
    """
    # Approximate: 1 token ~ 4 characters for English text
    char_chunk = chunk_size * 4
    char_overlap = overlap * 4

    chunks = []
    start = 0
    while start < len(text):
        end = start + char_chunk
        chunk = text[start:end]

        # Try to end at a sentence boundary
        if end < len(text):
            last_period = chunk.rfind(".")
            last_newline = chunk.rfind("\n")
            boundary = max(last_period, last_newline)
            if boundary > len(chunk) * 0.8:  # Only if boundary is near the end
                chunk = chunk[:boundary + 1]
                end = start + boundary + 1

        chunks.append(chunk.strip())
        start = end - char_overlap  # Overlap with previous chunk

    return chunks
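To see the overlap mechanics in isolation, here is a stripped-down sketch without the sentence-boundary logic. It is for illustration only, not a replacement for the function above:

```python
def simple_chunk(text, chunk_chars=100, overlap_chars=20):
    """Minimal fixed-size chunker: each chunk starts overlap_chars
    before the previous chunk ended, so no boundary text is lost."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_chars])
        start += chunk_chars - overlap_chars
    return chunks

text = "abcdefghij" * 30  # 300 characters
chunks = simple_chunk(text)
print(len(chunks))                          # 4 chunks
print(chunks[0][-20:] == chunks[1][:20])    # adjacent chunks share a boundary region
```

Because each chunk repeats the tail of the previous one, a sentence cut at a chunk boundary still appears intact in the next chunk.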

Semantic Chunking by Document Structure

import re

def chunk_by_sections(text, max_chunk_tokens=8000):
    """
    Split a document by its natural section boundaries
    (headings, chapters, etc.) while respecting a max size.
    """
    # Split on markdown-style headings or numbered sections
    section_pattern = r"\n(?=#{1,3} |\d+\.\s|Chapter \d|SECTION \d)"
    sections = re.split(section_pattern, text)

    chunks = []
    current_chunk = ""
    char_limit = max_chunk_tokens * 4

    for section in sections:
        # re.split consumes the delimiter newline, so re-insert it
        # when recombining sections to keep headings on their own line
        candidate = current_chunk + "\n" + section if current_chunk else section
        if len(candidate) <= char_limit:
            current_chunk = candidate
        else:
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            # If a single section exceeds the limit, split it further
            if len(section) > char_limit:
                sub_chunks = chunk_by_tokens(
                    section, chunk_size=max_chunk_tokens, overlap=200
                )
                chunks.extend(sub_chunks)
                current_chunk = ""
            else:
                current_chunk = section

    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    return chunks
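The heading pattern can be exercised on a small sample document to see exactly where splits occur. The sample text here is made up purely for illustration:

```python
import re

# The same heading pattern used in chunk_by_sections
section_pattern = r"\n(?=#{1,3} |\d+\.\s|Chapter \d|SECTION \d)"

doc = (
    "# Introduction\nSome intro text.\n"
    "## Background\nBackground details.\n"
    "Chapter 2\nThe second chapter."
)
sections = re.split(section_pattern, doc)
# re.split consumes the delimiter newline, so each section
# begins directly at its heading line.
for s in sections:
    print(repr(s.splitlines()[0]))
```

The lookahead `(?=...)` is what keeps each heading attached to the section it introduces; a plain split would discard the headings along with the delimiter.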

The Map-Reduce Pattern

Map-reduce is the most important pattern for processing large documents. It has two phases: the map phase processes each chunk independently, and the reduce phase combines the results into a final output.

import anthropic
from concurrent.futures import ThreadPoolExecutor, as_completed

client = anthropic.Anthropic()

def map_reduce_summarize(document, chunk_size=6000):
    """
    Summarize a large document using map-reduce.

    Map phase: Summarize each chunk independently (parallelizable)
    Reduce phase: Combine chunk summaries into a final summary
    """
    # Step 1: Chunk the document
    chunks = chunk_by_tokens(document, chunk_size=chunk_size, overlap=200)
    print(f"Document split into {len(chunks)} chunks")

    # Step 2: Map — Summarize each chunk in parallel
    def summarize_chunk(chunk_index, chunk_text):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": (
                    f"You are summarizing part {chunk_index + 1} of "
                    f"{len(chunks)} of a larger document.\n\n"
                    f"Summarize the following section, preserving key facts, "
                    f"names, dates, and conclusions:\n\n{chunk_text}"
                ),
            }],
        )
        return chunk_index, response.content[0].text

    chunk_summaries = [None] * len(chunks)
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = {
            executor.submit(summarize_chunk, i, chunk): i
            for i, chunk in enumerate(chunks)
        }
        for future in as_completed(futures):
            idx, summary = future.result()
            chunk_summaries[idx] = summary
            print(f"  Chunk {idx + 1}/{len(chunks)} summarized")

    # Step 3: Reduce — Combine all chunk summaries
    combined_summaries = "\n\n".join(
        f"--- Section {i+1} ---\n{s}"
        for i, s in enumerate(chunk_summaries)
    )

    final_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": (
                "Below are summaries of consecutive sections of a document. "
                "Synthesize them into a single coherent summary that:\n"
                "1. Captures the main themes and conclusions\n"
                "2. Preserves important details and data points\n"
                "3. Maintains logical flow\n"
                "4. Eliminates redundancy from overlapping sections\n\n"
                f"{combined_summaries}"
            ),
        }],
    )

    return final_response.content[0].text
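The parallel-map bookkeeping in the function above generalizes to any map and reduce step. Here is a minimal, API-free sketch (function names are illustrative) showing how results from `as_completed`, which arrive in completion order, are restored to document order:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def map_reduce(chunks, map_fn, reduce_fn, max_workers=5):
    """Generic map-reduce skeleton: map_fn runs on each chunk in
    parallel, results are reassembled in document order, then
    reduce_fn combines them."""
    results = [None] * len(chunks)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(map_fn, chunk): i
            for i, chunk in enumerate(chunks)
        }
        for future in as_completed(futures):
            # as_completed yields in completion order, not submission
            # order, so use the stored index to restore document order
            results[futures[future]] = future.result()
    return reduce_fn(results)

# Stub example: "summarize" by taking the first word of each chunk
final = map_reduce(
    ["alpha one", "beta two", "gamma three"],
    map_fn=lambda c: c.split()[0],
    reduce_fn=lambda parts: " | ".join(parts),
)
print(final)  # alpha | beta | gamma
```

In the real pipeline, `map_fn` would wrap a `client.messages.create` call; the ordering logic is identical.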

Map-Reduce for Question Answering

Map-reduce is not limited to summarization. You can use it to answer specific questions about large documents, extract structured data, or perform analysis.

def map_reduce_qa(document, question, chunk_size=6000):
    """
    Answer a question about a large document using map-reduce.

    Map: Extract relevant information from each chunk
    Reduce: Synthesize extractions into a final answer
    """
    chunks = chunk_by_tokens(document, chunk_size=chunk_size)

    # Map phase: Extract relevant info from each chunk
    def extract_from_chunk(chunk_index, chunk_text):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": (
                    f"Given the following question:\n{question}\n\n"
                    f"Extract any information from this text that is relevant "
                    f"to answering the question. If nothing is relevant, "
                    f"respond with 'NO_RELEVANT_INFO'.\n\n"
                    f"Text (section {chunk_index + 1} of {len(chunks)}):\n"
                    f"{chunk_text}"
                ),
            }],
        )
        return chunk_index, response.content[0].text

    extractions = []
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = {
            executor.submit(extract_from_chunk, i, c): i
            for i, c in enumerate(chunks)
        }
        for future in as_completed(futures):
            idx, extraction = future.result()
            if "NO_RELEVANT_INFO" not in extraction:
                extractions.append((idx, extraction))

    # Sort by chunk index to maintain document order
    extractions.sort(key=lambda x: x[0])

    if not extractions:
        return "No relevant information found in the document."

    # Reduce phase: Synthesize into a final answer
    context = "\n\n".join(
        f"[From section {idx + 1}]: {text}"
        for idx, text in extractions
    )

    final_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n\n"
                f"Based on the following extractions from a large document, "
                f"provide a comprehensive answer:\n\n{context}"
            ),
        }],
    )

    return final_response.content[0].text

Hierarchical Summarization

For very large documents (books, legal filings, research paper collections), a single reduce step may not be sufficient — the combined chunk summaries might themselves exceed the context window. Hierarchical summarization solves this by applying multiple levels of reduction.

def hierarchical_summarize(document, chunk_size=6000, summary_chunk_size=10000):
    """
    Multi-level summarization for very large documents.

    Level 1: Summarize individual chunks
    Level 2: Group chunk summaries and summarize groups
    Level 3: Combine group summaries into final summary
    Repeat until everything fits in one context window.
    """
    # Level 1: Chunk and summarize
    chunks = chunk_by_tokens(document, chunk_size=chunk_size)
    print(f"Level 1: Processing {len(chunks)} chunks")

    level_summaries = []
    for i, chunk in enumerate(chunks):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=512,
            messages=[{
                "role": "user",
                "content": (
                    f"Summarize this section concisely, preserving key "
                    f"information:\n\n{chunk}"
                ),
            }],
        )
        level_summaries.append(response.content[0].text)

    # Continue reducing until summaries fit in one context window
    level = 2
    while True:
        combined = "\n\n".join(level_summaries)
        combined_tokens = len(combined) // 4  # Approximate

        if combined_tokens < summary_chunk_size:
            # Everything fits — do the final synthesis
            print(f"Final synthesis from {len(level_summaries)} summaries")
            response = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=2048,
                messages=[{
                    "role": "user",
                    "content": (
                        "Synthesize these section summaries into a coherent "
                        f"final summary:\n\n{combined}"
                    ),
                }],
            )
            return response.content[0].text

        # Need another level of reduction
        print(f"Level {level}: Reducing {len(level_summaries)} summaries")
        summary_chunks = chunk_by_tokens(
            combined, chunk_size=summary_chunk_size
        )
        new_summaries = []
        for chunk in summary_chunks:
            response = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,
                messages=[{
                    "role": "user",
                    "content": (
                        "Consolidate these summaries into a single shorter "
                        f"summary:\n\n{chunk}"
                    ),
                }],
            )
            new_summaries.append(response.content[0].text)

        level_summaries = new_summaries
        level += 1
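The convergence behavior of the reduction loop can be checked without API calls by substituting a deterministic stub for the model. This is a sketch with hypothetical names, not part of the pipeline above:

```python
def hierarchical_reduce(items, summarize, budget_chars=50):
    """Repeatedly combine and re-summarize until the combined text
    fits the budget. `summarize` stands in for a model call.
    Returns the final text and the number of reduction levels used."""
    levels = 0
    while True:
        combined = "\n".join(items)
        if len(combined) <= budget_chars:
            return combined, levels
        # Pair up items and summarize each pair, halving the count
        items = [
            summarize(items[i] + items[i + 1]) if i + 1 < len(items)
            else items[i]
            for i in range(0, len(items), 2)
        ]
        levels += 1

# Stub summarizer: keep the first 10 characters of its input
result, levels = hierarchical_reduce(["x" * 40] * 8, lambda t: t[:10])
print(levels)                 # 1 reduction level was enough
print(len(result) <= 50)      # final output fits the budget
```

Because each level shrinks the total size, the loop is guaranteed to terminate as long as the stand-in summarizer actually compresses its input; the same invariant is what makes the model-backed version above safe.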

Choosing the Right Chunk Size

Chunk size is a critical parameter that requires balancing multiple concerns:

  • Too small (under 1,000 tokens): Chunks lack sufficient context for meaningful analysis. The model cannot understand relationships between ideas that span multiple chunks.
  • Too large (over 20,000 tokens): Fewer chunks mean less parallelism, and each chunk takes longer to process. Also, the "lost in the middle" effect becomes more pronounced.
  • Sweet spot (2,000-8,000 tokens): Large enough for coherent analysis, small enough for efficient parallel processing.

The overlap between chunks (typically 5-10% of chunk size) is also important. Too little overlap and you lose information at boundaries. Too much overlap and you waste tokens re-processing the same content.

Exam Tip: The exam will test your knowledge of when to use map-reduce versus simply sending the entire document. Key decision factors: (1) Does the document exceed the context window? If yes, chunking is mandatory. (2) Does the document consume more than 75% of the context window? If yes, chunking is recommended to leave room for instructions and output. (3) Do you need to process the document quickly? Parallel map-reduce can be significantly faster than sequential processing.

Exam Tip: A common exam question asks about the "lost in the middle" problem. When Claude processes very long inputs, information in the middle of the context receives less attention than information at the beginning or end. Chunking mitigates this because each chunk is processed independently, and every piece of the document appears at the "beginning" of some chunk.

Key Takeaways

Chunk at semantic boundaries (sections, paragraphs, sentences) rather than at arbitrary character positions. Use overlap to prevent information loss at chunk boundaries.

Map-reduce is the workhorse pattern for large documents. The map phase processes chunks in parallel, and the reduce phase synthesizes results. It works for summarization, Q&A, extraction, and analysis.

Hierarchical summarization handles documents of arbitrary length by applying multiple levels of reduction until the combined output fits in a single context window.

Optimal chunk size is 2,000-8,000 tokens for most use cases, with 5-10% overlap between adjacent chunks to preserve cross-boundary context.