Memory is the hardest part of building voice agents. Users expect AI to remember things across conversations, but most implementations start fresh every session. When I built RyBot, I wanted it to feel like talking to someone who actually knows you.
Here's how I built a four-tier memory system using Supermemory: session memory, user profiles, episodic memories, and a shared knowledge base for personality emulation.
The Problem with AI Memory
Most AI assistants have two memory modes:
- No memory - Every conversation starts fresh
- Context window stuffing - Dump everything into the prompt until you hit token limits
Neither works well for production voice agents. No memory feels impersonal. Context stuffing is expensive, slow, and eventually overflows.
The fix is semantic retrieval (RAG). The AI remembers relevant things based on what the user is talking about, not everything it knows.
The Four-Tier Memory Architecture
RyBot uses four distinct memory layers:
| Layer | Scope | Persistence | Use Case |
|---|---|---|---|
| Session Memory | Current conversation | In-memory (Map) | Hot data, immediate context |
| User Profile | Individual user | Supermemory (Profiles) | Static facts: name, location, preferences |
| User Memory | Individual user | Supermemory RAG | Episodic memories, dynamic context |
| RyBot Knowledge | All users | Supermemory RAG | Personality, facts about Ryan |
The key distinction between Profile and User Memory:
- Profile: "User's name is Alex, lives in Austin, works in marketing" — static facts that rarely change, always retrieved
- User Memory: "User mentioned they're stressed about a deadline", "User is planning a trip to Japan" — episodic, semantically searched based on conversation
Hat tip to Dhravya Shah (Supermemory founder) for suggesting this architectural refinement.
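To make the split concrete, here's a rough sketch of how retrieval differs between the two tiers. It uses the Supermemory client set up later in this post, and the `_profile` container tag is my own illustration rather than Supermemory's dedicated Profiles feature:

```js
// Illustrative retrieval split (assumed tag scheme, not the dedicated Profiles API)
async function getUserContext(userId, userMessage) {
  const [profile, episodic] = await Promise.all([
    // Profile: small, static, always pulled in full - a broad query works fine
    supermemory.search.execute({
      q: 'profile facts about this user',
      containerTags: [`user_${userId}_profile`], // assumed tag scheme
      limit: 10,
    }),
    // User Memory: semantically searched against the current message
    supermemory.search.execute({
      q: userMessage,
      containerTags: [`user_${userId}`],
      limit: 5,
      searchMode: 'hybrid',
    }),
  ]);

  return {
    profileFacts: profile.results.map(r => r.content),
    memories: episodic.results.filter(r => r.score > 0.7).map(r => r.content),
  };
}
```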
Why Supermemory?
Supermemory is a managed RAG service that handles the hard parts of memory:
- Automatic chunking and embedding generation
- Semantic search with similarity scoring
- Container tags for organizing memory by user/purpose
- No infrastructure to manage
The documentation is solid and the npm package is straightforward to integrate.
Implementation: User-Specific Memory
Each user gets their own memory container, tagged by session ID. When a user mentions something worth remembering, Claude decides to save it:
```js
import { Supermemory } from 'supermemory';

const supermemory = new Supermemory({
  apiKey: process.env.SUPERMEMORY_API_KEY,
});

// Store a memory for a specific user
async function storeLongTermMemory(content, userId = 'default') {
  await supermemory.memories.add({
    content: content,
    containerTags: [`user_${userId}`],
  });
}

// Retrieve relevant memories before responding
async function getLongTermMemories(query, userId = 'default') {
  const results = await supermemory.search.execute({
    q: query,
    containerTags: [`user_${userId}`],
    limit: 5,
    searchMode: 'hybrid', // Combines semantic + keyword matching
  });

  return results.results
    .filter(r => r.score > 0.7)
    .map(r => r.content);
}
```
One easy win: set searchMode: 'hybrid' on your search queries. It combines semantic search with keyword matching for 10-15% better context retrieval. No migration needed - just add the parameter.
The trick: query Supermemory with the user's message before calling Claude. You get relevant context without stuffing the entire history into the prompt.
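Compressed into a sketch, the per-turn flow looks like this. The persona line and callClaude are stand-ins; the real prompt assembly and Claude call are covered in the sections below:

```js
async function handleUserTurn(userMessage, sessionId) {
  // 1. Pull only the memories that relate to what the user just said
  const longTermMemories = await getLongTermMemories(userMessage, sessionId);

  // 2. Fold them into the system prompt instead of replaying the whole history
  const systemPrompt = [
    'You are RyBot.', // placeholder persona line
    'LONG-TERM MEMORIES:',
    ...longTermMemories.map(m => `- ${m}`),
  ].join('\n');

  // 3. Call Claude with a small, focused context
  return callClaude(systemPrompt, userMessage); // stand-in for the Anthropic call below
}
```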
Memory Storage via Tool Use
I give Claude a save_memory tool using Anthropic's tool use API. When Claude detects something worth remembering, it calls the tool:
```json
{
  "name": "save_memory",
  "description": "Save important information about the user",
  "input_schema": {
    "type": "object",
    "properties": {
      "key": { "type": "string" },
      "value": { "type": "string" }
    },
    "required": ["key", "value"]
  }
}
```
Claude might call this with {"key": "coffee_preference", "value": "Loves pour-over, uses a Chemex"} when the user mentions their coffee setup.
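Handling that on the server looks roughly like this - a sketch using the @anthropic-ai/sdk messages API, where saveMemoryTool is the schema above and storeLongTermMemory is from earlier (a full agent loop would also return a tool_result block to Claude, omitted here):

```js
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

async function respondWithMemory(userMessage, sessionId, systemPrompt) {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514', // whichever Claude model you're running
    max_tokens: 1024,
    system: systemPrompt,
    tools: [saveMemoryTool], // the save_memory schema above
    messages: [{ role: 'user', content: userMessage }],
  });

  // Dispatch any save_memory calls; fire-and-forget so the reply isn't blocked
  for (const block of response.content) {
    if (block.type === 'tool_use' && block.name === 'save_memory') {
      const { key, value } = block.input;
      storeLongTermMemory(`${key}: ${value}`, sessionId)
        .catch(err => console.error('Memory save failed:', err.message));
    }
  }

  return response.content.find(b => b.type === 'text')?.text ?? '';
}
```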
Implementation: Shared Knowledge Base
RyBot needs to emulate me authentically, which means it needs to know facts about Ryan. This is different from user memory - it's shared across all conversations.
I seeded the knowledge base with ~25 facts:
```js
const RYAN_FACTS = [
  "Ryan is from Western NY, specifically the Rochester/Buffalo area.",
  "Ryan now lives in NYC/Jersey City. He loves the TriBeCa neighborhood.",
  "Ryan has two dogs - pitbull mixes he affectionately calls 'the puppies'.",
  "Ryan is a coffee nerd. He uses a Chemex and Kalita Wave pour-over.",
  "Ryan's favorite coffee roaster is Hydrangea Coffee.",
  "Ryan plays D&D - he's DM'd the Curse of Strahd campaign.",
  // ... more facts
];

async function seedKnowledge() {
  for (const fact of RYAN_FACTS) {
    await supermemory.memories.add({
      content: fact,
      containerTags: ['rybot_knowledge'],
    });
  }
}
```
When a user asks something, I query both the user's memories AND the shared knowledge base in parallel:
```js
const [userMemories, rybotKnowledge] = await Promise.all([
  getLongTermMemories(userMessage, sessionId),
  getRybotKnowledge(userMessage),
]);

// Inject both into Claude's context
const systemPrompt = buildPromptWithContext({
  userMemories,
  rybotKnowledge,
});
```
The Knowledge Graph Visualization
Supermemory provides a React component called @supermemory/memory-graph that visualizes the relationships between memories. I exposed this on the Knowledge Graph page.
The visualization uses the same data endpoint as the chat - it's fetching real memories from Supermemory and computing similarity relationships server-side.
Server-Side Graph Computation
Computing relationships between ~100 memories on every request would be slow. Instead, I pre-compute the graph in a background job and cache it:
```js
async function refreshKnowledgeGraphCache() {
  // List all documents from Supermemory
  const documents = await supermemory.documents.list();

  // Compute similarity relationships
  for (const doc of documents) {
    const similar = await supermemory.search.execute({
      q: doc.content,
      limit: 10,
    });

    doc.relations = similar.results
      .filter(r => r.id !== doc.id && r.score > 0.5)
      .map(r => ({ targetId: r.id, weight: r.score }));
  }

  // Cache for 15 minutes
  knowledgeGraphCache.data = documents;
  knowledgeGraphCache.timestamp = Date.now();
}
```
The cache persists to Supermemory itself, so it survives server restarts and Railway redeploys.
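The exact persistence code isn't shown here, but conceptually it's just another tagged entry. A rough sketch, assuming the serialized graph fits in a single memory (the rybot_graph_cache tag and recovery query are placeholders):

```js
// Persist the computed graph as a single tagged entry
async function persistGraphCache(documents) {
  await supermemory.memories.add({
    content: JSON.stringify({ documents, timestamp: Date.now() }),
    containerTags: ['rybot_graph_cache'], // placeholder tag
  });
}

// On startup, try to recover the last cached graph
async function loadGraphCache() {
  const results = await supermemory.search.execute({
    q: 'knowledge graph cache', // placeholder recovery query
    containerTags: ['rybot_graph_cache'],
    limit: 1,
  });
  return results.results.length > 0
    ? JSON.parse(results.results[0].content)
    : null;
}
```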
Memory Injection Into Context
Here's how memories flow into the system prompt:
```js
let memoryContext = '';

// Session memories (hot data from current conversation)
if (Object.keys(sessionMemories).length > 0) {
  memoryContext += 'SESSION MEMORIES:\n';
  for (const [key, value] of Object.entries(sessionMemories)) {
    memoryContext += `- ${key}: ${value}\n`;
  }
}

// Long-term memories (from Supermemory)
if (longTermMemories?.length > 0) {
  memoryContext += '\nLONG-TERM MEMORIES:\n';
  longTermMemories.forEach(m => {
    memoryContext += `- ${m}\n`;
  });
}

// RyBot knowledge (shared personality facts)
if (rybotKnowledge?.length > 0) {
  memoryContext += '\nRYAN FACTS (for authentic emulation):\n';
  rybotKnowledge.forEach(k => {
    memoryContext += `- ${k}\n`;
  });
}
```
This gets prepended to Claude's system prompt. Semantic search means only relevant memories show up, not everything RyBot knows.
Performance Considerations
Voice agents are latency-sensitive. Some things I figured out:
- Supermemory search: ~50-200ms - Fast enough for real-time use
- Parallel queries - Fetch user memories and RyBot knowledge simultaneously
- Cache the knowledge graph - Don't compute relationships on every request
- Non-blocking storage - Fire-and-forget memory saves, don't await
```js
// Non-blocking storage (fire and forget)
storeLongTermMemory(`${key}: ${value}`, sessionId)
  .catch(err => console.error('Memory save failed:', err.message));
```
Zero-Latency Memory Saving with Webhooks
Even fire-and-forget API calls add ~50-200ms of server-side processing. For voice agents where every millisecond matters, I moved to a deferred batch approach using Hume's chat_ended webhook.
The idea: queue memories in-memory during the conversation (instant), then flush everything to Supermemory when the call ends.
```js
// Queue structure per chat session
const pendingMemoriesQueue = new Map();
// Key: chatId, Value: { sessionId, profiles: [], memories: [] }

// During conversation - instant, no API call
function queuePendingMemory(chatId, sessionId, type, key, value) {
  if (!pendingMemoriesQueue.has(chatId)) {
    pendingMemoriesQueue.set(chatId, {
      sessionId,
      profiles: [],
      memories: [],
      createdAt: Date.now(),
    });
  }

  const queue = pendingMemoriesQueue.get(chatId);
  if (type === 'profile') {
    queue.profiles.push({ key, value });
  } else {
    queue.memories.push({ key, value });
  }
}

// When call ends - Hume webhook triggers this
app.post('/webhook/hume', async (req, res) => {
  const { event_type, chat_id } = req.body;

  if (event_type === 'chat_ended' && chat_id) {
    await flushPendingMemories(chat_id);
  }

  res.json({ success: true });
});
```
Result: zero added latency during conversation. Users still get immediate benefit within the same call (session memory works instantly), and everything persists when they hang up.
API Usage Optimization
After deploying memory features, I noticed my Supermemory usage spiking to 6,500+ API calls in a few days. Investigation revealed the culprits:
- Knowledge graph refresh was making ~61 API calls every 30 minutes (44 document fetches + 15 similarity searches)
- No caching on RyBot knowledge queries - same shared data fetched on every message
- Short messages like "hi" and "thanks" were triggering full semantic searches
The fix was a multi-layer caching strategy:
```js
// 1. Cache RyBot knowledge (shared across all users, rarely changes)
const rybotKnowledgeCache = { data: null, timestamp: 0 };
const RYBOT_CACHE_TTL = 1000 * 60 * 10; // 10 minutes

// 2. Cache user profiles per session (static facts)
const userProfileCache = new Map(); // sessionId -> { data, timestamp }
const PROFILE_CACHE_TTL = 1000 * 60 * 30; // 30 minutes

// 3. Skip semantic search for short messages (no value)
const MIN_QUERY_LENGTH = 10;

async function getMemoryContext(message, sessionId) {
  // Skip search for short messages like "hi", "ok", "thanks"
  if (message.length < MIN_QUERY_LENGTH) {
    return { memories: [], knowledge: await getCachedRybotKnowledge() };
  }

  // Parallel fetch with caching
  const [memories, knowledge] = await Promise.all([
    getLongTermMemories(message, sessionId),
    getCachedRybotKnowledge(message),
  ]);

  return { memories, knowledge };
}
```
Results: ~98% reduction in API calls - from ~1,584 calls/day down to ~33 calls/day. The knowledge graph refresh moved from every 30 minutes to every 24 hours (manually triggerable from my dashboard when I add new facts).
Advanced: LLM Filtering
Supermemory has an optional LLM Filtering feature in Advanced Settings that processes content before storage. This is useful if your memories tend to be verbose or contain conversational filler.
To enable it, go to console.supermemory.ai → Advanced Settings → Enable LLM Filtering, then add a filter prompt.
Here's the prompt I use:
```
Extract and store only factual, specific information. Remove:
- Conversational filler ("I think", "maybe", "kind of")
- Redundant context ("As I mentioned before")
- Temporary/time-sensitive info ("today", "right now")

Keep:
- Personal preferences and opinions with specifics
- Facts about people, places, experiences
- Skills, interests, and background info

Format as concise declarative statements.
```
This cleans up memories before they're stored - removing filler while keeping the useful facts. Combined with deduplication, it keeps your memory store lean and relevant.
What I'd Do Differently
- Memory deduplication - Users sometimes repeat things. Need to detect and merge similar memories.
  - Update: feature added Dec 30, 2025 @ 11:45pm EST
- Memory decay - Old memories should fade. Currently everything persists forever.
  - Update: feature added Dec 30, 2025 @ 11:45pm EST
- User-controlled deletion - Let users see and delete their memories.
  - Update: feature added Jan 3, 2026 @ 11:45pm EST. Users can't view stored memories directly, but can ask RyBot to delete all their data at any time.
Resources
- Supermemory - The RAG service powering this
- Supermemory Documentation - API reference and guides
- @supermemory/memory-graph - The React visualization component
- Claude Tool Use Guide - Anthropic's official documentation on tool use
- Hume EVI Documentation - Voice interface powering RyBot
- RyBot Knowledge Graph - Full interactive visualization
- Voice Clone Security Guidelines - Security framework for voice agents
Frequently Asked Questions
What is Supermemory?
Supermemory is a managed RAG (Retrieval-Augmented Generation) service that handles memory storage, embedding, and semantic search for AI applications. It automatically chunks content, generates embeddings, and provides fast similarity search without requiring you to manage vector databases.
How does AI memory differ from context windows?
Context windows include everything in the current prompt, which has token limits and costs money per token. AI memory uses semantic search to pull only relevant info, keeping context small and focused. Way more scalable and cheaper in the long run.
Can I use Supermemory with any AI model?
Yes. Supermemory is model-agnostic - it stores and retrieves memories that you can inject into any LLM's context. I use it with Claude, but it works equally well with GPT-4, Llama, or any other model.
How do you prevent duplicate memories?
Currently, I rely on the similarity threshold during retrieval - if a memory is very similar to an existing one, only the most relevant gets returned. A better approach would be to check for duplicates before storing, using semantic similarity to detect near-matches.
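A sketch of that pre-storage check, reusing the search call from earlier (the 0.9 threshold is an assumption you'd tune):

```js
// Only store a memory if nothing nearly identical already exists for this user
async function storeIfNovel(content, userId = 'default') {
  const existing = await supermemory.search.execute({
    q: content,
    containerTags: [`user_${userId}`],
    limit: 1,
    searchMode: 'hybrid',
  });

  const topScore = existing.results[0]?.score ?? 0;
  if (topScore > 0.9) {
    // Near-duplicate already stored - skip the write
    return false;
  }

  await supermemory.memories.add({
    content,
    containerTags: [`user_${userId}`],
  });
  return true;
}
```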
What is the knowledge graph showing?
The knowledge graph visualizes relationships between RyBot's memories. Each node is a memory or fact, and edges connect semantically related memories. Clusters form around topics like "coffee preferences" or "work history" because those memories have high similarity scores.