What is the difference between short-term and long-term memory in AI agents?

Short-term memory (also called working or in-context memory) is everything currently loaded into the LLM's context window — conversation history, tool results, and the current plan. It is fast but ephemeral: it vanishes when the session ends. Long-term memory is stored externally in a database or vector store, persists across sessions, and is retrieved on demand. Both are necessary in production agents.

What vector database should I use for my AI agent?

ChromaDB is the easiest starting point: it runs embedded in Python with no infrastructure. For production with managed SLAs, Pinecone is a popular choice. If you already run PostgreSQL, the pgvector extension lets you skip a separate vector service entirely. The best choice depends on your existing stack, scale requirements, and tolerance for infrastructure management.

What is RAG and how does it relate to agent memory?

Retrieval-Augmented Generation (RAG) is the pattern of retrieving relevant documents from an external knowledge base and injecting them into the LLM context before generating a response. In the context of agent memory, RAG is the retrieval mechanism for semantic long-term memory — it's how an agent accesses its knowledge base without stuffing everything into the context window.

How do I prevent agent memory from consuming too many tokens?

Three strategies: (1) cap short-term memory with a sliding window that discards old messages; (2) compress older conversation history into a compact summary that is prepended to the window; (3) for long-term memory, retrieve only the top-k most relevant chunks rather than injecting the full knowledge base. Set a hard max_memory_tokens budget and truncate retrieved content to fit. Memory cost control is as important as memory correctness.

What is episodic memory in AI agents and why does it matter?

Episodic memory records specific past interactions as retrievable episodes — what task was attempted, what actions were taken, and what outcome resulted. Unlike semantic memory (general knowledge), episodic memory is tied to specific events in time. It enables personalisation, audit trails for compliance, and the ability for agents to learn from past mistakes without model fine-tuning.


AI & Automation·9 min read·June 19, 2026

🧠 AI Agent Memory Systems: Short-Term, Long-Term, Episodic, and Vector Memory

AI agents need more than a reasoning engine — they need memory. This guide breaks down the four core memory types used in production agents, explains vector stores like ChromaDB, Pinecone, and pgvector, and shows Python patterns for managing context without exploding your token budget.

Why Memory Transforms an Agent from a Chatbot into a System

The single most important difference between a chatbot and a production AI agent is memory. Without memory, an agent resets on every request — it cannot track progress, recall prior decisions, or build on previous work. It degrades, as one practical guide on AI agent engineering puts it, into a reactive system rather than a goal-driven entity.

Memory is what lets an agent remember that three steps ago it called a web search tool, got three results, and is now on step four of an eight-step research plan. It is what lets a customer support agent recall that this user contacted support twice last month and escalated both times. It is what lets a coding assistant know which files it already analysed before suggesting a refactor.

This article covers the four memory types that appear in real agent architectures, the storage backends that implement them, retrieval patterns that keep token costs under control, and concrete Python structures you can adapt to your own agents.

The Four Memory Types in AI Agent Architecture

Agent memory is not a single component. Production agents use a combination of four distinct types, each serving a different purpose at a different time horizon.

1. Short-Term / Working Memory (In-Context)

Short-term memory is everything currently loaded into the LLM's context window: the system prompt, the conversation history for this session, tool call results from this run, and the current plan step. It is fast (no I/O), immediately available, and completely ephemeral — when the session ends, it is gone.

The constraint is the context window itself. GPT-4o supports 128k tokens; Claude 3.5 Sonnet supports 200k. Those numbers sound large until you account for the system prompt (often 1–2k tokens), conversation history (grows with each turn), tool schemas (each tool definition costs tokens), and retrieved documents (RAG chunks). A multi-step agent that calls five tools per step can fill a 128k context window faster than you expect.

The canonical implementation is a bounded message buffer:

class ShortTermMemory:
    """Bounded message buffer: keeps only the most recent entries."""

    def __init__(self, max_entries: int = 20):
        self.max_entries = max_entries
        self.entries: list[dict] = []

    def add(self, role: str, content: str) -> None:
        self.entries.append({"role": role, "content": content})
        if len(self.entries) > self.max_entries:
            self.entries.pop(0)  # discard oldest

    def get_messages(self) -> list[dict]:
        # Return a copy so callers cannot mutate the buffer by accident
        return list(self.entries)
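
Counting entries is only a proxy for what the context window actually limits, which is tokens. A minimal sketch of a token-budgeted variant follows. The `TokenBudgetMemory` class, the `estimate_tokens` helper, and its four-characters-per-token ratio are assumptions made for illustration; a real agent would use the tokenizer that matches its model.

```python
from collections import deque


def estimate_tokens(text: str) -> int:
    # Hypothetical heuristic: roughly 4 characters per token for English text.
    # Swap in the model's real tokenizer for accurate counts.
    return max(1, len(text) // 4)


class TokenBudgetMemory:
    """Sliding window bounded by an approximate token budget."""

    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.entries: deque = deque()
        self.total_tokens = 0

    def add(self, role: str, content: str) -> None:
        cost = estimate_tokens(content)
        self.entries.append({"role": role, "content": content, "tokens": cost})
        self.total_tokens += cost
        # Evict oldest messages until the budget holds, always keeping the newest.
        while self.total_tokens > self.max_tokens and len(self.entries) > 1:
            evicted = self.entries.popleft()
            self.total_tokens -= evicted["tokens"]

    def get_messages(self) -> list[dict]:
        # Strip the bookkeeping field before handing messages to the LLM.
        return [{"role": e["role"], "content": e["content"]} for e in self.entries]


memory = TokenBudgetMemory(max_tokens=10)
memory.add("user", "a" * 20)       # about 5 tokens
memory.add("assistant", "b" * 20)  # about 5 tokens, total 10
memory.add("user", "c" * 20)       # total would be 15, so the oldest is evicted
```

The trade-off compared with an entry cap is that one very long tool result can push out several short messages at once, so it may be worth capping individual message size as well.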

2. Long-Term External Memory

Long-term memory lives outside the LLM — in a database, file system, or vector store — and persists indefinitely across sessions. It stores facts, documents, user preferences, historical interactions, and any knowledge base the agent needs to reference.

The key engineering principle is that long-term memory should never live solely in prompts. Stuffing a 500-page policy manual into a system prompt is not long-term memory; it is context abuse. Index the content instead and retrieve only what the current task needs.

Common storage backends: relational databases (PostgreSQL, SQLite) for structured facts and user profiles; document stores (MongoDB) for semi-structured data; and vector databases (ChromaDB, Pinecone, pgvector) for semantic retrieval of unstructured content.

3. Episodic Memory

Episodic memory stores specific past interactions as retrievable episodes: "On 2026-05-10, the user asked about contract clause 4.2 and was unsatisfied with the first answer." It captures what happened at a specific time, as distinct from semantic memory, which captures general knowledge.

In practice, episodic memory is implemented as a structured log of agent runs, stored in a database with timestamps, user IDs, task descriptions, tool calls made, and outcomes. When a new session starts, the agent can retrieve relevant past episodes and use them to inform its behaviour — effectively giving the agent the ability to learn from its own history without fine-tuning.

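import json
from datetime import datetime, timezone


class EpisodicStore:
    def __init__(self, db_connection):
        # Assumes a DB-API connection (e.g. sqlite3) with an `episodes` table:
        # (user_id, task, actions, outcome, timestamp)
        self.db = db_connection

    def record_episode(self, user_id: str, task: str,
                       actions: list, outcome: str) -> None:
        self.db.execute(
            "INSERT INTO episodes "
            "(user_id, task, actions, outcome, timestamp) "
            "VALUES (?, ?, ?, ?, ?)",
            # JSON keeps the action list parseable; UTC ISO strings sort by time
            (user_id, task, json.dumps(actions), outcome,
             datetime.now(timezone.utc).isoformat())
        )
        self.db.commit()

    def recall_recent(self, user_id: str,
                      limit: int = 5) -> list[dict]:
        rows = self.db.execute(
            "SELECT user_id, task, actions, outcome, timestamp "
            "FROM episodes WHERE user_id = ? "
            "ORDER BY timestamp DESC LIMIT ?",
            (user_id, limit)
        ).fetchall()
        cols = ("user_id", "task", "actions", "outcome", "timestamp")
        return [dict(zip(cols, row)) for row in rows]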
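
The class above assumes an `episodes` table already exists. As a rough sketch, here is one possible schema and a round trip against an in-memory SQLite database. The column names, the index, and the `record` helper are assumptions chosen to match the five values the store inserts, not a prescribed layout.

```python
import json
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE episodes (
        user_id   TEXT NOT NULL,
        task      TEXT NOT NULL,
        actions   TEXT NOT NULL,   -- JSON-encoded list of tool calls
        outcome   TEXT NOT NULL,
        timestamp TEXT NOT NULL    -- ISO 8601, UTC
    )
    """
)
# Index the access path used by per-user "most recent first" lookups.
conn.execute(
    "CREATE INDEX idx_episodes_user_time ON episodes (user_id, timestamp)"
)


def record(user_id: str, task: str, actions: list, outcome: str) -> None:
    conn.execute(
        "INSERT INTO episodes VALUES (?, ?, ?, ?, ?)",
        (user_id, task, json.dumps(actions), outcome,
         datetime.now(timezone.utc).isoformat()),
    )


record("u1", "explain clause 4.2", ["search_contract"], "user unsatisfied")
record("u1", "explain clause 4.2 again",
       ["search_contract", "cite_clause"], "resolved")
record("u2", "reset password", ["send_reset_link"], "resolved")

rows = conn.execute(
    "SELECT task, actions, outcome FROM episodes "
    "WHERE user_id = ? ORDER BY timestamp DESC LIMIT ?",
    ("u1", 5),
).fetchall()

# Turn past episodes into a short block of text for the next session's prompt.
episode_context = "\n".join(
    f"- {task} -> {outcome} (actions: {', '.join(json.loads(actions))})"
    for task, actions, outcome in rows
)
```

Storing `actions` as JSON rather than `str(actions)` means the list can be parsed back later, which matters if episodes double as an audit trail.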

4. Semantic / Vector Memory

Semantic memory is the agent's general knowledge base: documents, policies, product manuals, engineering standards — any unstructured text that the agent needs to reason about. Unlike episodic memory, it is not tied to specific past events. Unlike short-term memory, it does not fit in a context window. The solution is embedding: convert text to dense vector representations and store them in a vector database, then retrieve the most semantically relevant chunks at query time.

Vector Stores: The Backbone of Long-Term Semantic Memory

A vector database stores numerical embeddings — typically 1,536-dimensional vectors from OpenAI's text-embedding-3-small, or 768-dimensional vectors from open-source models like nomic-embed-text. When an agent needs to find relevant knowledge, it embeds the query and performs a cosine similarity search, returning the top-k most similar document chunks.
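
To make the retrieval step concrete, here is a brute-force sketch of cosine-similarity top-k search in plain Python. The three-dimensional vectors and chunk labels are made up for illustration; real embeddings come from an embedding model, and production vector stores use approximate indexes rather than scanning every vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed_chunks: list[tuple], k: int = 2) -> list[str]:
    # Score every stored chunk against the query, then keep the k best.
    scored = [
        (cosine_similarity(query_vec, vec), text)
        for text, vec in indexed_chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "embeddings": (chunk text, vector)
chunks = [
    ("refund policy", [0.9, 0.1, 0.0]),
    ("shipping times", [0.1, 0.9, 0.0]),
    ("returns window", [0.8, 0.2, 0.1]),
]
query = [1.0, 0.0, 0.0]
nearest = top_k(query, chunks, k=2)
# nearest == ["refund policy", "returns window"]
```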

The three most common vector stores used in production agents are:

ChromaDB: Open-source, runs embedded in Python with no separate server for development, scales to a persistent server in production. Zero-cost to start, ideal for prototyping and small-to-medium deployments.
Pinecone: Fully managed, serverless vector database with no infrastructure to operate. It offers a free starter tier and usage-based paid plans; tier limits and prices change, so check the current pricing page before budgeting. The right choice when you need SLA guarantees and don't want to run the service yourself.
pgvector: PostgreSQL extension that adds a vector column type plus L2, inner-product, and cosine distance operators. If you already run PostgreSQL, pgvector lets you store embeddings in the same database as your structured data, simplifying your stack. Approximate nearest-neighbour search is available through IVFFlat or HNSW indexes, which support the vector type up to 2,000 dimensions.

A minimal ChromaDB retrieval pattern:

import os

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

client = chromadb.PersistentClient(path="./agent_memory")
embed_fn = OpenAIEmbeddingFunction(
    # Read the key from the environment rather than hard-coding it
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small"
)
collection = client.get_or_create_collection(
    name="knowledge_base",
    embedding_function=embed_fn
)

# Store
collection.add(
    documents=["The transformer uses attention mechanisms..."],
    metadatas=[{"source": "ml_primer.pdf", "page": 12}],
    ids=["doc_001"]
)

# Retrieve top-3 relevant chunks
results = collection.query(
    query_texts=["how does attention work?"],
    n_results=3
)

Conversation History Management: Sliding Window vs Summarisation

For multi-turn agents, the conversation history is the primary form of short-term memory. Two strategies keep it within token budgets:

Sliding window keeps the N most recent messages and discards older ones. Simple and predictable — you know exactly how many tokens the history costs. The downside is that context from earlier in a long conversation is silently dropped, which can cause the agent to forget decisions made many steps ago.

Summarisation periodically compresses older messages into a summary that is prepended to the active window. More complex to implement but preserves the semantic content of early conversation at a fraction of the token cost. A typical implementation summarises after every 10 exchanges, maintaining a rolling summary plus the last 5 exchanges in full.

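def maybe_summarise(memory: ShortTermMemory, llm) -> None:
    # `llm.generate` stands in for whatever completion call your client exposes.
    # Note: this only triggers if the buffer's max_entries is above 30.
    if len(memory.entries) >= 30:
        old_entries = memory.entries[:20]
        summary_prompt = (
            "Summarise the following conversation "
            "in 3-5 sentences, preserving key decisions "
            "and facts:\n\n"
            + "\n".join(
                f"{e['role']}: {e['content']}" for e in old_entries
            )
        )
        summary = llm.generate(summary_prompt)
        # Replace the 20 oldest entries with a single summary message
        memory.entries = (
            [{"role": "system",
              "content": f"[Earlier context summary]: {summary}"}]
            + memory.entries[20:]
        )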

Memory Retrieval Patterns

Not all memory should be sent to the LLM on every request. The standard pattern is selective retrieval: given the current task or query, retrieve only the most relevant long-term memory and inject it into the context alongside short-term history.

Top-k retrieval is the default: embed the current query, search the vector store, return the top 3–10 chunks. Works well when relevant content is clearly identifiable by semantic similarity.

Filtered retrieval adds metadata filters to narrow the search before similarity ranking — for example, filtering by source_type = "policy_document" or project_id = "proj_42" before running the vector search. Dramatically reduces irrelevant results in large knowledge bases.
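
A minimal sketch of the filter-then-rank order of operations, in plain Python. The `filtered_search` function, the record layout, and the two-dimensional vectors are illustrative assumptions. Vector stores expose the same idea through their own query options (ChromaDB, for example, accepts a `where` argument on `query`).

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def filtered_search(query_vec, records, where, k=3):
    # Step 1: keep only records whose metadata matches every filter key.
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    # Step 2: rank the survivors by similarity. Vectors are assumed to be
    # unit-normalised, so the dot product equals cosine similarity.
    candidates.sort(key=lambda r: dot(query_vec, r["vector"]), reverse=True)
    return [r["text"] for r in candidates[:k]]


records = [
    {"text": "Leave policy", "vector": [1.0, 0.0],
     "metadata": {"source_type": "policy_document", "project_id": "proj_42"}},
    {"text": "Sprint notes", "vector": [1.0, 0.0],
     "metadata": {"source_type": "meeting_notes", "project_id": "proj_42"}},
    {"text": "Expense policy", "vector": [0.6, 0.8],
     "metadata": {"source_type": "policy_document", "project_id": "proj_42"}},
]

hits = filtered_search(
    [1.0, 0.0], records, where={"source_type": "policy_document"}
)
# "Sprint notes" matches the query vector exactly but is excluded by the filter.
```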

Hybrid retrieval combines vector similarity search with keyword (BM25) search, then uses Reciprocal Rank Fusion to merge results. Used in production RAG pipelines where domain-specific terminology may not embed well but exact keyword matches matter.
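
Reciprocal Rank Fusion itself is only a few lines: each document scores the sum of 1 / (k + rank) over every ranking it appears in. The sketch below assumes the vector and keyword searches have already produced ranked lists of document IDs; the IDs are invented, and k = 60 is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top hit in a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]   # ranked by embedding similarity
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # ranked by BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == ["doc_a", "doc_c", "doc_b", "doc_d"]
```

Because the method uses ranks rather than raw scores, it needs no normalisation between cosine similarities and BM25 scores, which live on different scales.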

The retrieved memory is then injected into the prompt alongside the current task, keeping the total token count within budget:

def build_agent_prompt(
    task: str,
    short_term: ShortTermMemory,
    vector_store,
    max_memory_tokens: int = 2000
) -> list[dict]:
    # count_tokens, truncate_to_tokens and AGENT_SYSTEM_PROMPT are assumed
    # to be defined elsewhere in your codebase.
    relevant_chunks = vector_store.query(
        query_texts=[task], n_results=5
    )["documents"][0]

    memory_text = "\n\n".join(relevant_chunks)
    if count_tokens(memory_text) > max_memory_tokens:
        # Truncate to budget
        memory_text = truncate_to_tokens(
            memory_text, max_memory_tokens
        )

    system_msg = {
        "role": "system",
        "content": (
            AGENT_SYSTEM_PROMPT
            + "\n\n[Relevant knowledge]:\n" + memory_text
        )
    }
    return [system_msg] + short_term.get_messages()
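
The function above relies on `count_tokens` and `truncate_to_tokens`, which the article does not define. One possible sketch is shown below, using whitespace-separated words as a stand-in for tokens. That approximation is an assumption and undercounts for most real tokenizers. The `pack_chunks` helper is a hypothetical alternative that drops whole chunks instead of cutting one mid-sentence.

```python
def count_tokens(text: str) -> int:
    # Assumption: one whitespace-separated word is about one token. Real
    # tokenizers usually produce more tokens than words, so leave headroom.
    return len(text.split())


def truncate_to_tokens(text: str, max_tokens: int) -> str:
    # Keep the first max_tokens words. This collapses the original whitespace.
    return " ".join(text.split()[:max_tokens])


def pack_chunks(chunks: list[str], max_tokens: int) -> str:
    """Add whole chunks in rank order until the budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += cost
    return "\n\n".join(kept)


chunks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
packed = pack_chunks(chunks, max_tokens=5)
clipped = truncate_to_tokens("one two three four", 2)
# packed keeps the first two chunks; clipped == "one two"
```

Packing whole chunks keeps each retrieved passage intact, at the cost of sometimes leaving part of the budget unused.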

Choosing the Right Memory Type for Your Agent

The right combination depends on the agent's purpose and the nature of its tasks:

Use short-term memory for everything in the current session: tool call results, intermediate reasoning steps, conversation turns. Always bounded.
Use vector / semantic memory when the agent needs to answer questions from a large corpus — policy documents, product manuals, engineering standards, legal contracts. ChromaDB for prototyping; Pinecone or pgvector for production.
Use episodic memory when you want the agent to personalise its behaviour based on past interactions, or when you need an audit trail of agent actions for compliance.
Use long-term structured storage (relational database) for user profiles, account data, configuration state, and anything with a fixed schema.
Avoid storing everything in prompts — this is the most common mistake. It inflates token costs, hits context limits, and mixes retrieval with reasoning in an uncontrollable way.

The most important design principle: memory is not optional for production agents. An agent without memory is not an agent — it is a stateless text transformation. Memory is what enables persistence, goal-tracking, and intelligent behaviour over time.
