Why Memory Transforms an Agent from a Chatbot into a System
The single most important difference between a chatbot and a production AI agent is memory. Without memory, an agent resets on every request — it cannot track progress, recall prior decisions, or build on previous work. It degrades, as one practical guide on AI agent engineering puts it, into a reactive system rather than a goal-driven entity.
Memory is what lets an agent remember that three steps ago it called a web search tool, got three results, and is now on step four of an eight-step research plan. It is what lets a customer support agent recall that this user contacted support twice last month and escalated both times. It is what lets a coding assistant know which files it already analysed before suggesting a refactor.
This article covers the four memory types that appear in real agent architectures, the storage backends that implement them, retrieval patterns that keep token costs under control, and concrete Python structures you can adapt to your own agents.
The Four Memory Types in AI Agent Architecture
Agent memory is not a single component. Production agents use a combination of four distinct types, each serving a different purpose at a different time horizon.
1. Short-Term / Working Memory (In-Context)
Short-term memory is everything currently loaded into the LLM's context window: the system prompt, the conversation history for this session, tool call results from this run, and the current plan step. It is fast (no I/O), immediately available, and completely ephemeral — when the session ends, it is gone.
The constraint is the context window itself. GPT-4o supports 128k tokens; Claude 3.5 Sonnet supports 200k. Those numbers sound large until you account for the system prompt (often 1–2k tokens), conversation history (grows with each turn), tool schemas (each tool definition costs tokens), and retrieved documents (RAG chunks). A multi-step agent that calls five tools per step, with each result running to a few thousand tokens, adds 10k or more tokens per step and can exhaust a 128k window in roughly ten steps.
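To make that budget concrete, here is a rough sketch of the arithmetic. Every per-component figure below is an illustrative assumption rather than a measurement, and the helper names are hypothetical; substitute counts from your own tokenizer.

```python
# Illustrative context-budget arithmetic. All figures are assumed for the
# sake of the example; measure your own prompts with a real tokenizer.
CONTEXT_WINDOW = 128_000  # e.g. a 128k-token model


def remaining_budget(system_prompt: int, tool_schemas: int, history: int,
                     retrieved: int, reserved_for_output: int) -> int:
    """Tokens left once the fixed and growing costs are subtracted."""
    used = (system_prompt + tool_schemas + history
            + retrieved + reserved_for_output)
    return CONTEXT_WINDOW - used


def steps_until_full(budget: int, tokens_per_step: int) -> int:
    """How many more agent steps fit if each step adds tokens_per_step."""
    return budget // tokens_per_step


budget = remaining_budget(
    system_prompt=2_000,
    tool_schemas=3_000,        # assumed: 10 tools at ~300 tokens each
    history=8_000,
    retrieved=10_000,
    reserved_for_output=4_000,
)
# Assumed: five tool calls per step, ~2,000 tokens of results per call
steps = steps_until_full(budget, tokens_per_step=5 * 2_000)
print(budget, steps)
```

Under these assumptions the agent has room for about ten more steps before something has to be dropped or summarised.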
The canonical implementation is a bounded message buffer:
class ShortTermMemory:
    """Bounded message buffer that keeps only the most recent entries."""

    def __init__(self, max_entries: int = 20):
        self.max_entries = max_entries
        self.entries: list[dict] = []

    def add(self, role: str, content: str) -> None:
        self.entries.append({"role": role, "content": content})
        if len(self.entries) > self.max_entries:
            self.entries.pop(0)  # discard oldest

    def get_messages(self) -> list[dict]:
        # Return a copy so callers cannot mutate the buffer by accident
        return list(self.entries)
2. Long-Term External Memory
Long-term memory lives outside the LLM — in a database, file system, or vector store — and persists indefinitely across sessions. It stores facts, documents, user preferences, historical interactions, and any knowledge base the agent needs to reference.
The key engineering principle is that long-term memory should never rely solely on prompts. Stuffing a 500-page policy manual into a system prompt is not long-term memory; it is context abuse. Long-term memory should be indexed, searchable, and retrieved on demand.
Common storage backends: relational databases (PostgreSQL, SQLite) for structured facts and user profiles; document stores (MongoDB) for semi-structured data; and vector databases (ChromaDB, Pinecone, pgvector) for semantic retrieval of unstructured content.
3. Episodic Memory
Episodic memory stores specific past interactions as retrievable episodes: "On 2026-05-10, the user asked about contract clause 4.2 and was unsatisfied with the first answer." It captures what happened at a specific time, as distinct from semantic memory, which captures general knowledge.
In practice, episodic memory is implemented as a structured log of agent runs, stored in a database with timestamps, user IDs, task descriptions, tool calls made, and outcomes. When a new session starts, the agent can retrieve relevant past episodes and use them to inform its behaviour — effectively giving the agent the ability to learn from its own history without fine-tuning.
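import json
from datetime import datetime, timezone


class EpisodicStore:
    def __init__(self, db_connection):
        # Assumes a sqlite3-style connection exposing execute() and commit()
        self.db = db_connection

    def record_episode(self, user_id: str, task: str,
                       actions: list, outcome: str) -> None:
        # Assumes the episodes table has five columns in this order:
        # user_id, task, actions, outcome, timestamp
        self.db.execute(
            "INSERT INTO episodes VALUES (?, ?, ?, ?, ?)",
            (user_id, task, json.dumps(actions, default=str), outcome,
             datetime.now(timezone.utc).isoformat())
        )
        self.db.commit()

    def recall_recent(self, user_id: str,
                      limit: int = 5) -> list:
        # Rows use the connection's row format (plain tuples by default)
        return self.db.execute(
            "SELECT * FROM episodes WHERE user_id = ? "
            "ORDER BY timestamp DESC LIMIT ?",
            (user_id, limit)
        ).fetchall()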
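As a usage sketch, the following standalone example shows one way recalled episodes might be turned into context for a new session. It uses an in-memory SQLite database; the table layout, the helper names `record` and `recall_as_context`, and the sample data are all assumptions made for illustration.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Assumed schema: column names other than user_id and timestamp are
# illustrative choices, not something a particular framework prescribes.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by name
conn.execute(
    "CREATE TABLE episodes ("
    "user_id TEXT, task TEXT, actions TEXT, outcome TEXT, timestamp TEXT)"
)


def record(user_id, task, actions, outcome, when=None):
    """Store one episode; actions are serialised as JSON text."""
    when = when or datetime.now(timezone.utc)
    conn.execute(
        "INSERT INTO episodes VALUES (?, ?, ?, ?, ?)",
        (user_id, task, json.dumps(actions), outcome, when.isoformat()),
    )
    conn.commit()


def recall_as_context(user_id, limit=5):
    """Format the most recent episodes as a block of prompt text."""
    rows = conn.execute(
        "SELECT * FROM episodes WHERE user_id = ? "
        "ORDER BY timestamp DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
    lines = [
        f"- {row['timestamp'][:10]}: {row['task']} -> {row['outcome']}"
        for row in rows
    ]
    return "[Past episodes]\n" + "\n".join(lines)


record("u1", "explain clause 4.2", ["search_contract"], "unsatisfied",
       datetime(2026, 5, 10, tzinfo=timezone.utc))
record("u1", "explain clause 4.2 again", ["search_contract", "cite_source"],
       "resolved", datetime(2026, 5, 11, tzinfo=timezone.utc))

context = recall_as_context("u1")
print(context)
```

ISO 8601 timestamps with a shared UTC offset sort correctly as text, which is why a plain `ORDER BY timestamp DESC` is enough here.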
4. Semantic / Vector Memory
Semantic memory is the agent's general knowledge base: documents, policies, product manuals, engineering standards — any unstructured text that the agent needs to reason about. Unlike episodic memory, it is not tied to specific past events. Unlike short-term memory, it does not fit in a context window. The solution is embedding: convert text to dense vector representations and store them in a vector database, then retrieve the most semantically relevant chunks at query time.
Vector Stores: The Backbone of Long-Term Semantic Memory
A vector database stores numerical embeddings — typically 1,536-dimensional vectors from OpenAI's text-embedding-3-small, or 768-dimensional vectors from open-source models like nomic-embed-text. When an agent needs to find relevant knowledge, it embeds the query and performs a cosine similarity search, returning the top-k most similar document chunks.
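The search step can be sketched in a few lines of plain Python. The three-dimensional vectors below are hand-written stand-ins for real embeddings, and the brute-force scan is only for illustration; a vector database would replace it with an index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=2):
    """chunks is a list of (text, vector); return the k most similar texts."""
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, chunk[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy "embeddings", invented for this example only
chunks = [
    ("attention mechanisms", [0.9, 0.1, 0.0]),
    ("refund policy",        [0.0, 0.2, 0.9]),
    ("transformer layers",   [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]
print(top_k(query, chunks))
```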
The three most common vector stores used in production agents are:
- ChromaDB: Open-source. Runs embedded in the Python process with no separate server during development, and can run as a persistent server in production. Zero-cost to start, ideal for prototyping and small-to-medium deployments.
- Pinecone: Fully managed, serverless vector database. No infrastructure to manage. Pricing starts at free for 2 million vectors (1536-dim), with paid plans from approximately $70/month for 5 million vectors with higher throughput. The right choice when you need SLA guarantees and don't want to operate infrastructure.
- pgvector: PostgreSQL extension that adds a vector column type and cosine, L2, and inner-product distance operators. If you already run PostgreSQL, pgvector lets you store embeddings in the same database as your structured data, simplifying your stack. Approximate nearest-neighbour search is available via IVFFlat or HNSW indexes, which support vectors of up to 2,000 dimensions (the column type itself can store larger vectors, but they cannot be indexed).
A minimal ChromaDB retrieval pattern:
import os

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# Read the key from the environment rather than hard-coding it
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

client = chromadb.PersistentClient(path="./agent_memory")
embed_fn = OpenAIEmbeddingFunction(
    api_key=OPENAI_API_KEY,
    model_name="text-embedding-3-small"
)
collection = client.get_or_create_collection(
    name="knowledge_base",
    embedding_function=embed_fn
)

# Store
collection.add(
    documents=["The transformer uses attention mechanisms..."],
    metadatas=[{"source": "ml_primer.pdf", "page": 12}],
    ids=["doc_001"]
)

# Retrieve top-3 relevant chunks
results = collection.query(
    query_texts=["how does attention work?"],
    n_results=3
)
Conversation History Management: Sliding Window vs Summarisation
For multi-turn agents, the conversation history is the primary form of short-term memory. Two strategies keep it within token budgets:
Sliding window keeps the N most recent messages and discards older ones. Simple and predictable — you know exactly how many tokens the history costs. The downside is that context from earlier in a long conversation is silently dropped, which can cause the agent to forget decisions made many steps ago.
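A common variant bounds the window by token cost rather than by message count. The sketch below assumes a caller-supplied `count_tokens` function; the `rough_count` word counter is a crude placeholder, not a real tokenizer.

```python
def sliding_window(messages, max_tokens, count_tokens):
    """Keep the most recent messages whose combined cost fits max_tokens."""
    kept, total = [], 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(msg["content"])
        if total + cost > max_tokens:
            break  # everything older than this point is dropped
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order


def rough_count(text):
    """Placeholder estimate: one token per whitespace-separated word."""
    return len(text.split())


history = [
    {"role": "user", "content": "plan an eight step research task"},
    {"role": "assistant", "content": "step one search the web"},
    {"role": "user", "content": "continue"},
    {"role": "assistant", "content": "step two read the top results"},
]
window = sliding_window(history, max_tokens=12, count_tokens=rough_count)
print([m["content"] for m in window])
```

With a budget of 12 "tokens", the oldest message no longer fits and is dropped, which is exactly the forgetting behaviour described above.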
Summarisation periodically compresses older messages into a summary that is prepended to the active window. More complex to implement but preserves the semantic content of early conversation at a fraction of the token cost. A typical implementation summarises after every 10 exchanges, maintaining a rolling summary plus the last 5 exchanges in full.
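def maybe_summarise(memory: ShortTermMemory, llm) -> None:
    # Fires at 30 messages, so the buffer needs max_entries of at least 30;
    # with the default of 20 this branch would never trigger.
    if len(memory.entries) >= 30:
        old_entries = memory.entries[:20]  # the oldest 10 exchanges
        summary_prompt = (
            "Summarise the following conversation "
            "in 3-5 sentences, preserving key decisions "
            "and facts:\n"
            + "\n".join(e["content"] for e in old_entries)
        )
        # llm.generate stands in for whatever client call you use
        summary = llm.generate(summary_prompt)
        memory.entries = (
            [{"role": "system",
              "content": f"[Earlier context summary]: {summary}"}]
            + memory.entries[20:]  # the last 5 exchanges stay in full
        )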
Memory Retrieval Patterns
Not all memory should be sent to the LLM on every request. The standard pattern is selective retrieval: given the current task or query, retrieve only the most relevant long-term memory and inject it into the context alongside short-term history.
Top-k retrieval is the default: embed the current query, search the vector store, return the top 3–10 chunks. Works well when relevant content is clearly identifiable by semantic similarity.
Filtered retrieval adds metadata filters to narrow the search before similarity ranking — for example, filtering by source_type = "policy_document" or project_id = "proj_42" before running the vector search. Dramatically reduces irrelevant results in large knowledge bases.
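The filter-then-rank order can be sketched without any particular database. The record layout, the `filtered_search` helper, and the two-dimensional vectors are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(y * y for y in b)))


def filtered_search(query_vec, records, filters, k=3):
    """Apply exact-match metadata filters first, then rank by similarity."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value
               for key, value in filters.items())
    ]
    candidates.sort(key=lambda r: cosine(query_vec, r["vector"]),
                    reverse=True)
    return [r["text"] for r in candidates[:k]]


# Toy records with invented vectors and metadata
records = [
    {"text": "Refunds are accepted within 30 days", "vector": [0.9, 0.1],
     "metadata": {"source_type": "policy_document", "project_id": "proj_42"}},
    {"text": "Forum thread complaining about refunds", "vector": [1.0, 0.0],
     "metadata": {"source_type": "forum_post", "project_id": "proj_42"}},
    {"text": "Shipping takes five working days", "vector": [0.1, 0.9],
     "metadata": {"source_type": "policy_document", "project_id": "proj_42"}},
]
hits = filtered_search([1.0, 0.0], records,
                       {"source_type": "policy_document"}, k=2)
print(hits)
```

The forum post is the closest vector to the query, yet it never reaches the ranking step because the filter removes it first. In ChromaDB the same idea is expressed by passing a `where` dictionary to `collection.query`.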
Hybrid retrieval combines vector similarity search with keyword (BM25) search, then uses Reciprocal Rank Fusion to merge results. Used in production RAG pipelines where domain-specific terminology may not embed well but exact keyword matches matter.
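Reciprocal Rank Fusion itself is small enough to sketch directly. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The constant k = 60 is the commonly used default; the document IDs and the two input rankings are invented for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    score(doc) = sum over lists of 1 / (k + rank), with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]   # from similarity search
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # from BM25 keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Documents that appear in both lists rise to the top, and because only ranks are used, the incompatible score scales of BM25 and cosine similarity never need to be normalised against each other.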
The retrieved memory is then injected into the prompt alongside the current task, keeping the total token count within budget:
def build_agent_prompt(
    task: str,
    short_term: ShortTermMemory,
    vector_store,
    max_memory_tokens: int = 2000
) -> list[dict]:
    # AGENT_SYSTEM_PROMPT, count_tokens and truncate_to_tokens are
    # assumed to be defined elsewhere in your codebase.
    relevant_chunks = vector_store.query(
        query_texts=[task], n_results=5
    )["documents"][0]
    memory_text = "\n".join(relevant_chunks)
    if count_tokens(memory_text) > max_memory_tokens:
        # Truncate to budget
        memory_text = truncate_to_tokens(
            memory_text, max_memory_tokens
        )
    system_msg = {
        "role": "system",
        "content": (
            AGENT_SYSTEM_PROMPT
            + "\n[Relevant knowledge]:\n" + memory_text
        )
    }
    return [system_msg] + short_term.get_messages()
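The prompt builder relies on two token helpers that are not shown. A minimal sketch of what they might look like follows, assuming a crude one-token-per-word estimate; a production version would call the tokenizer that matches your model.

```python
def count_tokens(text: str) -> int:
    """Rough estimate: one token per whitespace-separated word.

    This is an approximation for illustration; real tokenizers usually
    produce more tokens than words.
    """
    return len(text.split())


def truncate_to_tokens(text: str, max_tokens: int) -> str:
    """Cut text down to at most max_tokens estimated tokens."""
    words = text.split()
    if len(words) <= max_tokens:
        return text  # already within budget, return unchanged
    # Note: rejoining with spaces discards the original line breaks
    return " ".join(words[:max_tokens])


sample = "alpha beta gamma delta epsilon"
print(count_tokens(sample))
print(truncate_to_tokens(sample, 3))
```

Truncating mid-chunk can cut a sentence in half, so an alternative design is to drop whole chunks from the end of the list until the total fits the budget.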
Choosing the Right Memory Type for Your Agent
The right combination depends on the agent's purpose and the nature of its tasks:
- Use short-term memory for everything in the current session: tool call results, intermediate reasoning steps, conversation turns. Always bounded.
- Use vector / semantic memory when the agent needs to answer questions from a large corpus — policy documents, product manuals, engineering standards, legal contracts. ChromaDB for prototyping; Pinecone or pgvector for production.
- Use episodic memory when you want the agent to personalise its behaviour based on past interactions, or when you need an audit trail of agent actions for compliance.
- Use long-term structured storage (relational database) for user profiles, account data, configuration state, and anything with a fixed schema.
- Avoid storing everything in prompts — this is the most common mistake. It inflates token costs, hits context limits, and mixes retrieval with reasoning in an uncontrollable way.
The most important design principle: memory is not optional for production agents. An agent without memory is not an agent — it is a stateless text transformation. Memory is what enables persistence, goal-tracking, and intelligent behaviour over time.