LLM Context Windows: Why Your AI Forgets and How to Work Around It

You're having a long conversation with ChatGPT. Suddenly, it forgets something you mentioned 20 messages ago. You remind it. It apologizes and continues. This isn't a bug — it's a fundamental limitation called the context window. Understanding how context windows work helps you use AI more effectively.

LLMs don't have infinite memory. They can only "see" a fixed amount of recent conversation. Once that limit is reached, older messages are forgotten. Working within this constraint requires strategy.

What Is a Context Window?

A context window is the maximum amount of text an LLM can process at once, measured in tokens. A token is roughly 3/4 of a word of English text, or about four characters.

**Common context windows:**
GPT-3.5: 4K tokens (~3,000 words)
GPT-4: 8K tokens (~6,000 words), 32K option (~24,000 words)
Claude 2: 100K tokens (~75,000 words)
GPT-4 Turbo: 128K tokens (~96,000 words)

When conversation exceeds the context window, the model "forgets" the oldest messages. It only sees recent context.

Context windows are hard limits. No amount of prompting can make an LLM remember beyond its window.

How Token Counting Works

Not all text uses tokens equally. Uncommon words, code, and special characters consume more tokens than common English words.

**Examples:**
"Hello" = 1 token
"ChatGPT" = 2 tokens
"Artificial Intelligence" = 3 tokens
Code snippet (10 lines) = ~50-100 tokens

Your prompt + the AI's response + conversation history all count toward the limit. A 1,000-token prompt leaves only 3,000 tokens for the response in a 4K model.
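The arithmetic above is easy to automate. Here's a minimal sketch using the common rule of thumb of about four characters per token; the function names are illustrative, and a real tokenizer (such as OpenAI's tiktoken package) would give exact counts:

```python
# Rough token estimator and budget check. The 4-characters-per-token
# ratio is a heuristic for English text, not an exact count.

def estimate_tokens(text: str) -> int:
    """Estimate token count as roughly one token per 4 characters."""
    return max(1, len(text) // 4)

def response_budget(prompt: str, context_window: int = 4096) -> int:
    """Tokens left for the model's response after the prompt."""
    return context_window - estimate_tokens(prompt)

prompt = "Summarize the following report in three bullet points. " * 20
prompt_tokens = estimate_tokens(prompt)
remaining = response_budget(prompt)  # what's left of a 4K window
```

Estimates like this are useful for budgeting before a request; for billing-accurate numbers you'd count with the provider's actual tokenizer.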

Why Context Windows Matter

**Long conversations:** After 20-30 exchanges, early messages are dropped. The AI "forgets" initial instructions or context.

**Document analysis:** Analyzing a 50-page document requires 30K+ tokens. GPT-3.5 can't handle it. GPT-4 32K or Claude 100K can.

**Code generation:** Large codebases exceed context windows. The AI can't see the entire codebase at once.

**Multi-turn tasks:** Complex tasks requiring many steps hit context limits before completion.

Strategies for Working Within Limits

**1. Summarize periodically:** Every 10-15 messages, ask the AI to summarize key points. Start a new conversation with that summary.

**2. Use system messages:** Put critical instructions in system message (if available). These persist longer than conversation messages.

**3. Break tasks into chunks:** Instead of analyzing an entire document, analyze sections separately, then synthesize.

**4. Repeat important context:** Restate critical information periodically. "Remember, we're building a Python app for..."

**5. Use external memory:** Store information outside the conversation (notes, files) and reference as needed.
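Strategy 3 can be sketched in a few lines: split a long document into paragraph-aligned chunks that each fit a token budget, analyze each chunk in its own request, then synthesize the per-chunk results. The function name and the ~4-characters-per-token estimate are assumptions for illustration:

```python
# Split text into chunks small enough to fit a per-request token budget.
# Token counts are estimated at ~4 characters per token (heuristic).

def chunk_document(text: str, max_tokens: int = 3000) -> list[str]:
    """Split text into paragraph-aligned chunks under max_tokens each."""
    max_chars = max_tokens * 4
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries (rather than fixed character offsets) keeps each chunk coherent, which tends to produce better per-chunk summaries.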

The Sliding Window Effect

As conversation grows, older messages are dropped to make room for new ones. This creates a "sliding window" of recent context.

**Example (4K token window):**
Messages 1-10: 2K tokens (visible)
Messages 11-20: 2K tokens (visible)
Message 21: 500 tokens added
→ Messages 1-3 dropped to stay under 4K limit

The AI only sees messages 4-21. It has no memory of messages 1-3.
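The trimming behavior described above can be sketched directly. In this toy version each message carries a precomputed token count; a real client would measure counts with the provider's tokenizer:

```python
# Sliding-window trim: drop the oldest messages until the
# conversation fits inside the context window.

def trim_history(messages: list[tuple[str, int]], window: int = 4096):
    """messages: list of (text, token_count). Returns the visible tail."""
    total = sum(tokens for _, tokens in messages)
    visible = list(messages)
    while visible and total > window:
        _, dropped_tokens = visible.pop(0)  # oldest message goes first
        total -= dropped_tokens
    return visible

# 21 messages of 200 tokens each = 4,200 tokens; a 4K window
# forces the oldest message out.
history = [(f"message {i}", 200) for i in range(1, 22)]
kept = trim_history(history, window=4096)
```

Real chat APIs vary in exactly how they truncate (some summarize, some drop whole messages), but the effect is the same: the model only ever sees the tail that fits.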

Long Context Models

Newer models have larger context windows:

**Claude 2 (100K tokens):** Can process entire books, large codebases, or very long conversations.

**GPT-4 Turbo (128K tokens):** Similar capability. Useful for document analysis and long-form content.

**Trade-offs:** Longer context = slower response, higher cost. Use long context only when necessary.

When to Start Fresh

Sometimes it's better to start a new conversation than continue a long one:

**Start fresh when:**
- Conversation has drifted from original topic
- AI is giving inconsistent responses
- You've hit the context limit and are losing important early context
- Task is complete and you're starting something new

**Continue when:**
- Building on previous work
- Iterating on a design or solution
- Context from earlier in conversation is still relevant

The Cost of Long Context

Longer context windows cost more. API pricing is per token, and larger-context tiers typically charge a higher per-token rate: GPT-4's 32K tier cost twice as much per input token as the 8K tier.

**GPT-4 pricing (example):**
8K context: $0.03/1K tokens input
32K context: $0.06/1K tokens input

For applications processing many requests, context window size significantly impacts cost.
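A quick calculation shows how the per-token rates above compound at scale. The request volume and average prompt size here are made-up numbers for illustration:

```python
# Compare input-token cost at the two example rates above.
# Rates are illustrative dollars per 1K input tokens; real pricing
# varies by provider and changes over time.

def input_cost(tokens: int, rate_per_1k: float) -> float:
    """Cost in dollars for a given number of input tokens."""
    return tokens / 1000 * rate_per_1k

# Hypothetical workload: 10,000 requests averaging 6,000 input tokens.
requests, avg_tokens = 10_000, 6_000
cost_8k = requests * input_cost(avg_tokens, 0.03)
cost_32k = requests * input_cost(avg_tokens, 0.06)
```

At this volume the rate difference alone doubles the monthly input bill, before output tokens are even counted.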

Retrieval-Augmented Generation (RAG)

RAG is a technique to work around context limits. Instead of putting the entire document in context, the system:

1. Stores documents in a vector database
2. Retrieves only relevant sections based on query
3. Puts retrieved sections in context
4. LLM generates response using retrieved context

This allows working with massive datasets (millions of documents) within small context windows.
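The retrieval step (step 2) can be illustrated with a toy scorer. Production systems use embedding similarity in a vector database; this sketch substitutes simple keyword overlap just to show the shape of the pipeline, and the passages are invented examples:

```python
# Toy RAG retrieval: score stored passages by word overlap with the
# query and keep only the best matches for the prompt. Real systems
# use embedding similarity in a vector database instead.

def retrieve(query: str, passages: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k passages sharing the most words with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        passages,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

passages = [
    "The billing API retries failed charges after 24 hours.",
    "Deployment uses a blue-green strategy behind the load balancer.",
    "Refunds are issued to the original payment method within 5 days.",
]
# Only the retrieved passages, not the whole corpus, enter the prompt.
context = retrieve("how are failed charges retried", passages)
```

Because only the top-scoring passages reach the model, the corpus can be arbitrarily large while the prompt stays within the context window.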

Prompt Engineering for Context Efficiency

**Be concise:** Don't use 100 words when 20 will do. Every word counts toward token limit.

**Front-load important info:** Put critical context early in the prompt. Models tend to attend most reliably to content near the beginning (and end) of a prompt, so material buried in the middle is more likely to be overlooked.

**Use structured formats:** Bullet points and terse key-value lists are usually more token-efficient than full prose.

**Avoid repetition:** Don't repeat instructions in every message. State once, then reference.

The Future of Context Windows

Context windows are growing rapidly:

2020: GPT-3 (2K tokens)
2022: GPT-3.5 (4K tokens)
2023: GPT-4 (8K-32K tokens), Claude 2 (100K tokens)
2024: GPT-4 Turbo (128K tokens)

Eventually, context windows may be large enough that limitations rarely matter. But for now, understanding and working within limits is essential.

Building AI applications? The context calculator helps you estimate token usage and optimize prompts for context efficiency.