Memory is the hardest part of building voice agents. Users expect AI to remember things across conversations, but most implementations start fresh every session. When I built RyBot, I wanted it to feel like talking to someone who actually knows you.
Here's how I built a four-tier memory system using Supermemory: session memory, user profiles, episodic memories, and a shared knowledge base for personality emulation.
The Problem with AI Memory
Most AI assistants have two memory modes:
- No memory - Every conversation starts fresh
- Context window stuffing - Dump everything into the prompt until you hit token limits
Neither works well for production voice agents. No memory feels impersonal. Context stuffing is expensive, slow, and eventually overflows.
The fix is semantic retrieval (RAG). The AI remembers relevant things based on what the user is talking about, not everything it knows.
The Four-Tier Memory Architecture
RyBot uses four distinct memory layers:
| Layer | Scope | Persistence | Use Case |
|---|---|---|---|
| Session Memory | Current conversation | In-memory (Map) | Hot data, immediate context |
| User Profile | Individual user | Supermemory (Profiles) | Static facts: name, location, preferences |
| User Memory | Individual user | Supermemory RAG | Episodic memories, dynamic context |
| RyBot Knowledge | All users | Supermemory RAG | Personality, facts about Ryan |
The key distinction between Profile and User Memory:
- Profile: "User's name is Alex, lives in Austin, works in marketing" — static facts that rarely change, always retrieved
- User Memory: "User mentioned they're stressed about a deadline", "User is planning a trip to Japan" — episodic, semantically searched based on conversation
Hat tip to Dhravya Shah (Supermemory founder) for suggesting this architectural refinement.
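To make the split concrete, here's a rough sketch of how retrieval differs between the two tiers. It uses the Supermemory client set up later in this post, and the `_profile` container tag is my own illustration rather than Supermemory's dedicated Profiles feature:

```js
// Illustrative retrieval split (assumed tag scheme, not the dedicated Profiles API)
async function getUserContext(userId, userMessage) {
  const [profile, episodic] = await Promise.all([
    // Profile: small, static, always pulled in full - a broad query works fine
    supermemory.search.execute({
      q: 'profile facts about this user',
      containerTags: [`user_${userId}_profile`], // assumed tag scheme
      limit: 10,
    }),
    // User Memory: semantically searched against the current message
    supermemory.search.execute({
      q: userMessage,
      containerTags: [`user_${userId}`],
      limit: 5,
      searchMode: 'hybrid',
    }),
  ]);

  return {
    profileFacts: profile.results.map(r => r.content),
    memories: episodic.results.filter(r => r.score > 0.7).map(r => r.content),
  };
}
```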
Why Supermemory?
Supermemory is a managed RAG service that handles the hard parts of memory:
- Automatic chunking and embedding generation
- Semantic search with similarity scoring
- Container tags for organizing memory by user/purpose
- No infrastructure to manage
The documentation is solid and the npm package is straightforward to integrate.
Implementation: User-Specific Memory
Each user gets their own memory container, tagged by session ID. When a user mentions something worth remembering, Claude decides to save it:
```js
import { Supermemory } from 'supermemory';

const supermemory = new Supermemory({
  apiKey: process.env.SUPERMEMORY_API_KEY,
});

// Store a memory for a specific user
async function storeLongTermMemory(content, userId = 'default') {
  await supermemory.memories.add({
    content: content,
    containerTags: [`user_${userId}`],
  });
}

// Retrieve relevant memories before responding
async function getLongTermMemories(query, userId = 'default') {
  const results = await supermemory.search.execute({
    q: query,
    containerTags: [`user_${userId}`],
    limit: 5,
    searchMode: 'hybrid', // Combines semantic + keyword matching
  });

  return results.results
    .filter(r => r.score > 0.7)
    .map(r => r.content);
}
```
One easy win: set searchMode: 'hybrid' on your search queries. It combines semantic search with keyword matching for 10-15% better context retrieval. No migration needed - just add the parameter.
The trick: query Supermemory with the user's message before calling Claude. You get relevant context without stuffing the entire history into the prompt.
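Compressed into a sketch, the per-turn flow looks like this. The persona line and callClaude are stand-ins; the real prompt assembly and Claude call are covered in the sections below:

```js
async function handleUserTurn(userMessage, sessionId) {
  // 1. Pull only the memories that relate to what the user just said
  const longTermMemories = await getLongTermMemories(userMessage, sessionId);

  // 2. Fold them into the system prompt instead of replaying the whole history
  const systemPrompt = [
    'You are RyBot.', // placeholder persona line
    'LONG-TERM MEMORIES:',
    ...longTermMemories.map(m => `- ${m}`),
  ].join('\n');

  // 3. Call Claude with a small, focused context
  return callClaude(systemPrompt, userMessage); // stand-in for the Anthropic call below
}
```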
Memory Storage via Tool Use
I give Claude a save_memory tool using Anthropic's tool use API. When Claude detects something worth remembering, it calls the tool:
```json
{
  "name": "save_memory",
  "description": "Save important information about the user",
  "input_schema": {
    "type": "object",
    "properties": {
      "key": { "type": "string" },
      "value": { "type": "string" }
    },
    "required": ["key", "value"]
  }
}
```
Claude might call this with {"key": "coffee_preference", "value": "Loves pour-over, uses a Chemex"} when the user mentions their coffee setup.
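Handling that on the server looks roughly like this - a sketch using the @anthropic-ai/sdk messages API, where saveMemoryTool is the schema above and storeLongTermMemory is from earlier (a full agent loop would also return a tool_result block to Claude, omitted here):

```js
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

async function respondWithMemory(userMessage, sessionId, systemPrompt) {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514', // whichever Claude model you're running
    max_tokens: 1024,
    system: systemPrompt,
    tools: [saveMemoryTool], // the save_memory schema above
    messages: [{ role: 'user', content: userMessage }],
  });

  // Dispatch any save_memory calls; fire-and-forget so the reply isn't blocked
  for (const block of response.content) {
    if (block.type === 'tool_use' && block.name === 'save_memory') {
      const { key, value } = block.input;
      storeLongTermMemory(`${key}: ${value}`, sessionId)
        .catch(err => console.error('Memory save failed:', err.message));
    }
  }

  return response.content.find(b => b.type === 'text')?.text ?? '';
}
```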
Implementation: Shared Knowledge Base
RyBot needs to emulate me authentically, which means it needs to know facts about Ryan. This is different from user memory - it's shared across all conversations.
I seeded the knowledge base with ~25 facts:
```js
const RYAN_FACTS = [
  "Ryan is from Western NY, specifically the Rochester/Buffalo area.",
  "Ryan now lives in NYC/Jersey City. He loves the TriBeCa neighborhood.",
  "Ryan has two dogs - pitbull mixes he affectionately calls 'the puppies'.",
  "Ryan is a coffee nerd. He uses a Chemex and Kalita Wave pour-over.",
  "Ryan's favorite coffee roaster is Hydrangea Coffee.",
  "Ryan plays D&D - he's DM'd the Curse of Strahd campaign.",
  // ... more facts
];

async function seedKnowledge() {
  for (const fact of RYAN_FACTS) {
    await supermemory.memories.add({
      content: fact,
      containerTags: ['rybot_knowledge'],
    });
  }
}
```
When a user asks something, I query both the user's memories AND the shared knowledge base in parallel:
```js
const [userMemories, rybotKnowledge] = await Promise.all([
  getLongTermMemories(userMessage, sessionId),
  getRybotKnowledge(userMessage),
]);

// Inject both into Claude's context
const systemPrompt = buildPromptWithContext({
  userMemories,
  rybotKnowledge,
});
```
The Knowledge Graph Visualization
Supermemory provides a React component called @supermemory/memory-graph that visualizes the relationships between memories. I exposed this on the Knowledge Graph page.
The visualization uses the same data endpoint as the chat - it's fetching real memories from Supermemory and computing similarity relationships server-side.
Server-Side Graph Computation
Computing relationships between ~100 memories on every request would be slow. Instead, I pre-compute the graph in a background job and cache it:
```js
async function refreshKnowledgeGraphCache() {
  // List all documents from Supermemory
  const documents = await supermemory.documents.list();

  // Compute similarity relationships
  for (const doc of documents) {
    const similar = await supermemory.search.execute({
      q: doc.content,
      limit: 10,
    });

    doc.relations = similar.results
      .filter(r => r.id !== doc.id && r.score > 0.5)
      .map(r => ({ targetId: r.id, weight: r.score }));
  }

  // Cache for 15 minutes
  knowledgeGraphCache.data = documents;
  knowledgeGraphCache.timestamp = Date.now();
}
```
The cache persists to Supermemory itself, so it survives server restarts and Railway redeploys.
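The exact persistence code isn't shown here, but conceptually it's just another tagged entry. A rough sketch, assuming the serialized graph fits in a single memory (the rybot_graph_cache tag and recovery query are placeholders):

```js
// Persist the computed graph as a single tagged entry
async function persistGraphCache(documents) {
  await supermemory.memories.add({
    content: JSON.stringify({ documents, timestamp: Date.now() }),
    containerTags: ['rybot_graph_cache'], // placeholder tag
  });
}

// On startup, try to recover the last cached graph
async function loadGraphCache() {
  const results = await supermemory.search.execute({
    q: 'knowledge graph cache', // placeholder recovery query
    containerTags: ['rybot_graph_cache'],
    limit: 1,
  });
  return results.results.length > 0
    ? JSON.parse(results.results[0].content)
    : null;
}
```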
Memory Injection Into Context
Here's how memories flow into the system prompt:
```js
let memoryContext = '';

// Session memories (hot data from current conversation)
if (Object.keys(sessionMemories).length > 0) {
  memoryContext += 'SESSION MEMORIES:\n';
  for (const [key, value] of Object.entries(sessionMemories)) {
    memoryContext += `- ${key}: ${value}\n`;
  }
}

// Long-term memories (from Supermemory)
if (longTermMemories?.length > 0) {
  memoryContext += '\nLONG-TERM MEMORIES:\n';
  longTermMemories.forEach(m => {
    memoryContext += `- ${m}\n`;
  });
}

// RyBot knowledge (shared personality facts)
if (rybotKnowledge?.length > 0) {
  memoryContext += '\nRYAN FACTS (for authentic emulation):\n';
  rybotKnowledge.forEach(k => {
    memoryContext += `- ${k}\n`;
  });
}
```
This gets prepended to Claude's system prompt. Semantic search means only relevant memories show up, not everything RyBot knows.
Performance Considerations
Voice agents are latency-sensitive. Some things I figured out:
- Supermemory search: ~50-200ms - Fast enough for real-time use
- Parallel queries - Fetch user memories and RyBot knowledge simultaneously
- Cache the knowledge graph - Don't compute relationships on every request
- Non-blocking storage - Fire-and-forget memory saves, don't await
```js
// Non-blocking storage (fire and forget)
storeLongTermMemory(`${key}: ${value}`, sessionId)
  .catch(err => console.error('Memory save failed:', err.message));
```
Zero-Latency Memory Saving with Webhooks
Even fire-and-forget API calls add ~50-200ms of server-side processing. For voice agents where every millisecond matters, I moved to a deferred batch approach using Hume's chat_ended webhook.
The idea: queue memories in-memory during the conversation (instant), then flush everything to Supermemory when the call ends.
```js
// Queue structure per chat session
const pendingMemoriesQueue = new Map();
// Key: chatId, Value: { sessionId, profiles: [], memories: [] }

// During conversation - instant, no API call
function queuePendingMemory(chatId, sessionId, type, key, value) {
  if (!pendingMemoriesQueue.has(chatId)) {
    pendingMemoriesQueue.set(chatId, {
      sessionId,
      profiles: [],
      memories: [],
      createdAt: Date.now(),
    });
  }

  const queue = pendingMemoriesQueue.get(chatId);
  if (type === 'profile') {
    queue.profiles.push({ key, value });
  } else {
    queue.memories.push({ key, value });
  }
}

// When call ends - Hume webhook triggers this
app.post('/webhook/hume', async (req, res) => {
  const { event_type, chat_id } = req.body;

  if (event_type === 'chat_ended' && chat_id) {
    await flushPendingMemories(chat_id);
  }

  res.json({ success: true });
});
```
Result: zero added latency during conversation. Users still get immediate benefit within the same call (session memory works instantly), and everything persists when they hang up.
API Usage Optimization
After deploying memory features, I noticed my Supermemory usage spiking to 6,500+ API calls in a few days. Investigation revealed the culprits:
- Knowledge graph refresh was making ~61 API calls every 30 minutes (44 document fetches + 15 similarity searches)
- No caching on RyBot knowledge queries - same shared data fetched on every message
- Short messages like "hi" and "thanks" were triggering full semantic searches
The fix was a multi-layer caching strategy:
```js
// 1. Cache RyBot knowledge (shared across all users, rarely changes)
const rybotKnowledgeCache = { data: null, timestamp: 0 };
const RYBOT_CACHE_TTL = 1000 * 60 * 10; // 10 minutes

// 2. Cache user profiles per session (static facts)
const userProfileCache = new Map(); // sessionId -> { data, timestamp }
const PROFILE_CACHE_TTL = 1000 * 60 * 30; // 30 minutes

// 3. Skip semantic search for short messages (no value)
const MIN_QUERY_LENGTH = 10;

async function getMemoryContext(message, sessionId) {
  // Skip search for short messages like "hi", "ok", "thanks"
  if (message.length < MIN_QUERY_LENGTH) {
    return { memories: [], knowledge: await getCachedRybotKnowledge() };
  }

  // Parallel fetch with caching
  const [memories, knowledge] = await Promise.all([
    getLongTermMemories(message, sessionId),
    getCachedRybotKnowledge(message),
  ]);

  return { memories, knowledge };
}
```
Results: ~98% reduction in API calls - from ~1,584 calls/day down to ~33 calls/day. The knowledge graph refresh moved from every 30 minutes to every 24 hours (manually triggerable from my dashboard when I add new facts).
Advanced: LLM Filtering
Supermemory has an optional LLM Filtering feature in Advanced Settings that processes content before storage. This is useful if your memories tend to be verbose or contain conversational filler.
To enable it, go to console.supermemory.ai → Advanced Settings → Enable LLM Filtering, then add a filter prompt.
Here's the prompt I use:
```
Extract and store only factual, specific information. Remove:
- Conversational filler ("I think", "maybe", "kind of")
- Redundant context ("As I mentioned before")
- Temporary/time-sensitive info ("today", "right now")

Keep:
- Personal preferences and opinions with specifics
- Facts about people, places, experiences
- Skills, interests, and background info

Format as concise declarative statements.
```
This cleans up memories before they're stored - removing filler while keeping the useful facts. Combined with deduplication, it keeps your memory store lean and relevant.
What I'd Do Differently
- Memory deduplication - Users sometimes repeat things. Need to detect and merge similar memories.
  - Update: feature added Dec 30, 2025 @ 11:45pm EST
- Memory decay - Old memories should fade. Currently everything persists forever.
  - Update: feature added Dec 30, 2025 @ 11:45pm EST
- User-controlled deletion - Let users see and delete their memories.
  - Update: feature added Jan 3, 2026 @ 11:45pm EST. Users can't view stored memories directly, but can ask RyBot to delete all their data at any time.
Resources
- Supermemory - The RAG service powering this
- Supermemory Documentation - API reference and guides
- @supermemory/memory-graph - The React visualization component
- Claude Tool Use Guide - Anthropic's official documentation on tool use
- Hume EVI Documentation - Voice interface powering RyBot
- RyBot Knowledge Graph - Full interactive visualization
- Voice Clone Security Guidelines - Security framework for voice agents
Frequently Asked Questions
What is Supermemory?
Supermemory is a managed RAG (Retrieval-Augmented Generation) service that handles memory storage, embedding, and semantic search for AI applications. It automatically chunks content, generates embeddings, and provides fast similarity search without requiring you to manage vector databases.
How does AI memory differ from context windows?
Context windows include everything in the current prompt, which has token limits and costs money per token. AI memory uses semantic search to pull only relevant info, keeping context small and focused. Way more scalable and cheaper in the long run.
Can I use Supermemory with any AI model?
Yes. Supermemory is model-agnostic - it stores and retrieves memories that you can inject into any LLM's context. I use it with Claude, but it works equally well with GPT-4, Llama, or any other model.
How do you prevent duplicate memories?
Currently, I rely on the similarity threshold during retrieval - if a memory is very similar to an existing one, only the most relevant gets returned. A better approach would be to check for duplicates before storing, using semantic similarity to detect near-matches.
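A sketch of that pre-storage check, reusing the search call from earlier (the 0.9 threshold is an assumption you'd tune):

```js
// Only store a memory if nothing nearly identical already exists for this user
async function storeIfNovel(content, userId = 'default') {
  const existing = await supermemory.search.execute({
    q: content,
    containerTags: [`user_${userId}`],
    limit: 1,
    searchMode: 'hybrid',
  });

  const topScore = existing.results[0]?.score ?? 0;
  if (topScore > 0.9) {
    // Near-duplicate already stored - skip the write
    return false;
  }

  await supermemory.memories.add({
    content,
    containerTags: [`user_${userId}`],
  });
  return true;
}
```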
What is the knowledge graph showing?
The knowledge graph visualizes relationships between RyBot's memories. Each node is a memory or fact, and edges connect semantically related memories. Clusters form around topics like "coffee preferences" or "work history" because those memories have high similarity scores.