LLMs are stateless. Each msg you send is a fresh start - even if it's in a thread.

So your favourite AI apps probably aren't designed to "remember" the early msgs in a chat - which means you should probably be starting new chats frequently.

Me: Can you tell me how AI text models work in a chat interface? Specifically, when I'm in a chat and I send new msgs in the same chat, does the app send all the previous msgs as input to the model, or does it send a compressed version?

AI: AI text models in chat interfaces, like GPT, are stateless, meaning they don't retain memory of past interactions. Here's how context is managed when you send new messages in the same chat:

  1. Full Conversation History (Within Limits):
    The app typically sends all previous messages in the chat to the model with each new request, up to the model's token limit (e.g., 4,096 tokens for GPT-3.5, 8,192 for GPT-4). This allows the model to understand the context of the ongoing conversation.
  2. Token Limitations and Truncation:
    If the conversation exceeds the model's token limit, the app will truncate older messages (usually from the beginning) to stay under the limit. For example, in a long chat, the earliest messages might be dropped to prioritize recent exchanges.

  3. Advanced Compression Strategies (Optional):
    Some systems use more sophisticated methods to compress context, such as:
      • Summarization: Generating a brief summary of prior interactions and appending it to the prompt.
      • Relevant Context Extraction: Using embeddings or vector databases to retrieve only the most relevant parts of the conversation.
      • System-Level Memory: Explicitly storing key details (e.g., user preferences) in a database and injecting them into prompts.
  4. State Management by the App:
    The app (not the AI model) manages the conversation history. Each time you send a message, the app rebuilds the full context (truncated/summarized if needed) and sends it to the model.
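
As a rough illustration of the summarization strategy in point 3, here's a minimal sketch assuming an OpenAI-style client; the character threshold, the "keep the last four turns" rule, and the model choice are all made up for the example, not how any particular app actually does it:

```python
# A minimal sketch of summarization-based context compression, assuming an
# OpenAI-style client. Threshold, model, and "keep last 4 turns" rule are
# illustrative assumptions only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def compress_history(messages, max_chars=8000):
    """Replace the older part of a long chat with a short generated summary."""
    if sum(len(m["content"]) for m in messages) <= max_chars:
        return messages  # short enough: send everything as-is

    old, recent = messages[:-4], messages[-4:]  # keep the last few turns verbatim
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in old)
    summary = client.chat.completions.create(
        model="gpt-4o-mini",  # a cheaper model, used only for summarizing
        messages=[{"role": "user",
                   "content": "Summarize this conversation in a few sentences:\n" + transcript}],
    ).choices[0].message.content

    return [{"role": "system",
             "content": f"Summary of the earlier conversation: {summary}"}] + recent
```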

Example Workflow:

  1. You send Message 3 in a chat.
  2. The app combines Message 1 + Message 2 + Message 3 into a single prompt.
  3. If the total tokens exceed the limit, the app removes Message 1 (oldest) and sends Message 2 + Message 3.
  4. The model generates a response based on the truncated context.
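
In code, that workflow might look roughly like this (a sketch: the 4-characters-per-token estimate and the helper names are assumptions, not any vendor's actual implementation; a real app would use the model's tokenizer, e.g. tiktoken):

```python
# A minimal sketch of the rebuild-and-truncate loop an app might run each turn.
MAX_CONTEXT_TOKENS = 4096  # illustrative limit

def estimate_tokens(text):
    # Very rough: ~4 characters per token.
    return max(1, len(text) // 4)

def build_prompt(history, new_message):
    messages = history + [new_message]
    # Drop the oldest messages until the conversation fits the context window.
    while len(messages) > 1 and sum(
        estimate_tokens(m["content"]) for m in messages
    ) > MAX_CONTEXT_TOKENS:
        messages.pop(0)
    return messages

history = [
    {"role": "user", "content": "Message 1"},
    {"role": "assistant", "content": "Reply 1"},
    {"role": "user", "content": "Message 2"},
    {"role": "assistant", "content": "Reply 2"},
]
prompt = build_prompt(history, {"role": "user", "content": "Message 3"})
# `prompt` (the full, possibly truncated list) is what gets sent to the model.
```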

Why This Matters:

  • Consistency: The model behaves as if it "remembers" the chat because the app feeds it the full history.
  • Limitations: Very long chats may lose earlier context due to truncation.
  • Custom Solutions: Some apps implement workarounds (e.g., auto-summarization) for better long-context handling.

In short, the app sends all previous messages by default but may truncate or compress them to fit technical constraints. The model itself has no memory—it’s the app’s job to manage the context.


Me: Wait, so all these chat models are essentially stateless at their core?? Like Claude 3.5 Sonnet or GPT-4. Like every time I ask a question in a msg, it is essentially giving a new response from scratch. That must get expensive for long chats or when using them inside the Cursor IDE

AI: Yes, you're absolutely right! All major LLMs (like GPT-4, Claude 3.5 Sonnet, etc.) are stateless at their core, and this has significant implications for cost, latency, and context handling. Let's break this down:

1. Statelessness = Zero Memory

  • Every message is a fresh start: When you send a new message, the app re-sends the entire conversation history (or a compressed version) to the model. The model itself has no memory of prior interactions – it’s like hitting "reset" after every response.
  • Example: If your chat has 10 messages, the app sends all 10 messages (as a single text block) to the model for message 11. The model treats this as one long prompt, not a continuation.
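
As a concrete picture, an OpenAI-style request for message 11 might look like this (a sketch; the contents are invented for the example):

```python
# Sketch of what the request for "message 11" carries: every prior turn rides
# along in one payload (OpenAI-style chat format; contents are made up).
request = {
    "model": "gpt-4o",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Message 1"},
        {"role": "assistant", "content": "Reply 1"},
        # ... the rest of messages 2 through 10 and their replies ...
        {"role": "user", "content": "Message 11"},
    ],
}
```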

2. Why This Gets Expensive

  • Token costs scale with history: Most LLM APIs charge per input token (on the order of a few dollars per 1M input tokens for GPT-4-class models). A long chat with 50 messages might send 5,000 tokens with each new request, even if your latest message only added 10 new tokens.
  • Cursor IDE/Code Use Case: If you’re using an AI-powered IDE like Cursor for code assistance, every keystroke/query re-sends the entire file context (or workspace), which can quickly burn through tokens for large projects.
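
A quick back-of-the-envelope calculation shows how this compounds; the per-message size and the price are assumptions for illustration, not real pricing:

```python
# Back-of-the-envelope cost of re-sending history on every turn. Assumed:
# ~100 tokens per message and $2.50 per 1M input tokens (check actual pricing).
TOKENS_PER_MESSAGE = 100
PRICE_PER_1M_INPUT = 2.50

def total_input_tokens(num_turns):
    # Turn n re-sends all n messages so far, so input grows roughly quadratically.
    return sum(n * TOKENS_PER_MESSAGE for n in range(1, num_turns + 1))

for turns in (10, 50, 200):
    tokens = total_input_tokens(turns)
    print(f"{turns:>3} turns -> {tokens:>9,} input tokens -> ${tokens / 1e6 * PRICE_PER_1M_INPUT:.2f}")
```

This ignores the assistant's replies in the history and any provider-side prompt caching, but the shape of the growth is the point: the cost per turn keeps rising even when your new messages stay short.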

3. How Apps Compensate

To mitigate costs and latency, apps use tricks like:

  • Truncation: Cut off older messages (e.g., keep the last 4K tokens). Trade-off: loses early context.
  • Summarization: Use cheaper models (e.g., GPT-3.5) to summarize old messages. Trade-off: adds latency/complexity.
  • Embedding Search: Store chat history in a vector DB and retrieve only the most relevant snippets. Trade-off: requires infrastructure.
  • System-Level Memory: Manually store key details (e.g., "User likes Python") in a database. Trade-off: limited to predefined data.
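
Here's a rough sketch of the embedding-search strategy, assuming an OpenAI-style embeddings client and a plain in-memory list standing in for a real vector database; the model name and top_k are assumptions:

```python
# Sketch of embedding-based retrieval over chat history: embed each past
# message, then pull only the ones most similar to the new query.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def relevant_history(history, query, top_k=3):
    vectors = embed([m["content"] for m in history])
    q = embed([query])[0]
    # Cosine similarity between the new query and every past message.
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    best = np.argsort(sims)[::-1][:top_k]
    return [history[i] for i in sorted(best)]  # keep chronological order
```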

4. Model-Specific Optimizations

  • Claude 3.5 Sonnet: Supports 200K token windows (~150K words), so truncation is less frequent for long chats (but you still pay for all tokens sent).
  • GPT-4 Turbo: 128K token context, cheaper than original GPT-4, but still costly for repeated long prompts.
  • Local Models (e.g., Llama 3): Avoid API costs but require heavy hardware for long contexts.

Why This Architecture?

  1. Scalability: Statelessness lets providers parallelize requests (no need to track session state).
  2. Simplicity: No risk of "memory leaks" or corrupted context between sessions.
  3. Privacy: Conversations aren’t stored on the model side (though app providers might log them).

The IDE Problem (Cursor, etc.)

In code environments, the "context" (files, terminal output, etc.) can easily balloon to 50,000+ tokens. Solutions include:

  • Smart context selection: Only send open files or code near your cursor.
  • Diff-based prompts: Send changes instead of entire files.
  • Caching: Reuse embeddings of common code snippets.
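
Here's a rough sketch of the diff-based idea using Python's standard difflib; it's not how Cursor actually builds its prompts (that isn't public), just the shape of the trick:

```python
# Sketch of a diff-based prompt: send only what changed in a file rather than
# re-sending the whole file on every request.
import difflib

def diff_prompt(old_text, new_text, path="example.py"):
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "The user just made this change:\n" + "".join(diff)

old = "def add(a, b):\n    return a + b\n"
new = 'def add(a, b):\n    """Add two numbers."""\n    return a + b\n'
print(diff_prompt(old, new))
```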

The Future?

New approaches are emerging:

  • Stateful APIs (e.g., OpenAI’s "session" experiments): Pay to keep a session alive with stored context.
  • Recurrent Models: Architectures that natively handle long-term memory (still research-stage).
  • Hybrid Systems: Combine cheap/fast models for summarization with powerful models for final responses.

TL;DR

  • Yes: Every message is a fresh prompt with full history – statelessness is fundamental to today’s LLMs.
  • Costs explode for long chats because you’re re-paying for old tokens repeatedly.
  • Apps mitigate this with truncation/summarization, but it’s a band-aid. True statefulness would require architectural overhauls.