V0.11 Development Log — Paper Reproduction

Overview

V0.11 is the bridge between V0.1's conceptual exploration and V0.2's original cognitive enhancements. The goal of this version was straightforward: faithfully reproduce the Generative Agents paper (Park et al., UIST 2023) using modern technology and local LLM inference.

V0.1 revealed that naive prompt-based dialogue with flat memory leads to rigid, repetitive behavior. The fix wasn't parameter tuning — it was architectural. V0.11 implements the complete cognitive architecture described in the paper, replacing cloud-based GPT-3.5 calls (which cost the original authors ~$1000 per 2-day simulation) with free, local inference via Ollama.

Development period: March 30 – April 1, 2026

Repository: github.com/jeffliulab/ALICE_PROJECT


What Was Reproduced

The Complete Cognitive Loop

The paper's architecture centers on a six-stage cognitive loop that each agent executes every simulation step. V0.11 implements all six stages:

  1. Perceive — Each agent detects nearby events within its spatial area. Every perceived event is scored for "poignancy" (emotional significance) by the LLM, which determines how strongly the event is written into memory. This replaces V0.1's simple observation recording with context-aware event filtering.

  2. Retrieve — The three-factor memory retrieval system that solved V0.1's "always retrieving the same memories" problem:

    • Recency: exponential decay — recent memories score higher
    • Relevance: cosine similarity between the current context embedding and stored memory embeddings
    • Importance: the poignancy score assigned at perception time

    Each factor is independently normalized (min-max within the retrieval set) and combined with tunable weights (default: recency 0.5, relevance 3.0, importance 2.0). This ensures retrieval adapts to the agent's current situation rather than always surfacing the same high-importance memories.

  3. Plan — A three-level temporal decomposition:

    • Daily plan: broad goals for the day
    • Hourly schedule: each daily goal decomposed into hour-level actions
    • Micro-tasks: each hourly action further broken into 5–15 minute subtasks

    Most reproductions skip the micro-task level. V0.11 implements it because it's essential for organic social interactions — when an agent is in the middle of a 5-minute subtask and encounters another agent, the plan module decides whether to engage, wait, or ignore.

  4. Reflect — Each new memory's importance score is deducted from a running budget; when that budget is exhausted (reaches zero), the agent enters reflection mode:

    • Generate focal questions from recent high-importance memories
    • Retrieve evidence relevant to each focal question
    • Synthesize higher-order insights stored as "thought" nodes

    This creates a memory hierarchy: raw observations → reflected thoughts → meta-reflections. An agent who observes many small events eventually synthesizes them into a broader understanding.

  5. Execute — Resolve the current plan's target location to a tile coordinate, run A* pathfinding on the collision grid, and produce the next movement step with an action description. The agent's sprite moves one tile per step along the computed path.

  6. Converse — Turn-by-turn dialogue with structured knowledge extraction. When two agents meet and decide to talk:

    • Each turn retrieves relevant memories for the current dialogue context
    • The LLM generates a response in character
    • After the conversation ends, both agents extract key information and store it as new memory nodes
    • Relationship context is maintained across conversations
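The perceive stage above hinges on turning a free-text LLM reply into a usable poignancy score. Here is a minimal sketch of what that prompt-and-parse pair might look like; the function names and the exact prompt wording are illustrative, not the project's actual code:

```python
import re

def poignancy_prompt(persona_name: str, event_description: str) -> str:
    """Build a narrowly scoped prompt asking the LLM to rate an event 1-10."""
    return (
        f"On a scale of 1 to 10, where 1 is mundane (e.g. brushing teeth) "
        f"and 10 is extremely poignant (e.g. a breakup), rate the likely "
        f"poignancy of the following event for {persona_name}.\n"
        f"Event: {event_description}\n"
        f"Respond with a single integer."
    )

def parse_poignancy(raw_response: str, default: int = 4) -> int:
    """Extract the first integer from the LLM reply, clamped to 1-10.
    Falls back to a neutral default when the model ignores the format."""
    match = re.search(r"\d+", raw_response)
    if not match:
        return default
    return max(1, min(10, int(match.group())))
```

This kind of defensive parsing matters more with a local 14B model than with a frontier model: the narrow prompt plus clamping keeps occasional format drift from corrupting memory scores.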

Memory Architecture

V0.11 implements three distinct memory structures, replacing V0.1's flat timestamp-based list:

  • Associative Memory: Long-term storage using concept nodes. Each node stores:

    • Subject-predicate-object triples (e.g., "Viviane — heard — strange sounds from the forest")
    • Keyword indexes for fast lookup
    • Semantic embedding vectors (computed by sentence-transformers MiniLM, running locally)
    • Creation timestamp, last access time, and importance score

    This is essentially a knowledge graph where each memory is a richly indexed node, not just a text string.

  • Spatial Memory: A hierarchical tree — world → sector → arena → objects. Agents know which buildings exist, what rooms are inside them, and what objects are in each room. This enables realistic navigation decisions ("I need to go to the chapel — it's in the Sacred District — I should walk north").

  • Working Memory (Scratch): A dynamic state container capturing the agent's current cognitive context: identity, daily plan, hourly schedule, current action, conversation partner, and various thresholds. This serves as a living "system prompt" that evolves every simulation step.
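To make the associative-memory description concrete, here is a minimal sketch of a concept node and a keyword-indexed store, under the assumptions stated above (SPO triple, keywords, embedding, timestamps, poignancy). The class and field names are illustrative, not the repository's actual identifiers:

```python
from dataclasses import dataclass, field
import time

@dataclass
class ConceptNode:
    # Subject-predicate-object triple, e.g. ("Viviane", "heard", "strange sounds")
    subject: str
    predicate: str
    obj: str
    description: str
    embedding: list            # semantic vector from the local embedding model
    poignancy: int             # 1-10 importance score assigned at perception
    keywords: set = field(default_factory=set)
    created: float = field(default_factory=time.time)
    last_accessed: float = field(default_factory=time.time)

class AssociativeMemory:
    """Keyword-indexed store of ConceptNodes (a toy stand-in for the real one)."""
    def __init__(self):
        self.nodes = []
        self.keyword_index = {}    # lowercased keyword -> list of node positions

    def add(self, node: ConceptNode):
        self.nodes.append(node)
        # Index by explicit keywords plus the triple's own terms for fast lookup
        for kw in node.keywords | {node.subject, node.predicate, node.obj}:
            self.keyword_index.setdefault(kw.lower(), []).append(len(self.nodes) - 1)

    def lookup(self, keyword: str):
        return [self.nodes[i] for i in self.keyword_index.get(keyword.lower(), [])]
```

The point of the keyword index is cheap candidate filtering: embedding similarity is only computed over a keyword-matched subset, not the whole memory stream.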

The World: Smallville

V0.11 reproduces the original paper's world environment:

  • 25 agents with distinct personas, daily routines, and relationship networks
  • 140 × 100 tile grid with multiple map layers:
    • Collision layer (walls, obstacles)
    • Sector layer (named districts)
    • Arena layer (specific rooms/areas within sectors)
    • Object layer (interactive items)
    • Spawn layer (agent starting positions)
  • 285 named location addresses
  • A* pathfinding with collision-aware routing
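The collision-aware routing mentioned above is plain A* over the boolean collision layer. A self-contained sketch (independent of the repository's `path_finder.py`, whose actual interface may differ):

```python
import heapq
import itertools

def astar(grid, start, goal):
    """A* over a 2D collision grid (True = blocked tile).
    Returns the path as a list of (x, y) tiles, or [] if unreachable."""
    h = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])   # Manhattan distance
    counter = itertools.count()                            # heap tiebreaker
    open_heap = [(h(start, goal), 0, next(counter), start, None)]
    came_from, g_cost = {}, {start: 0}
    while open_heap:
        _, g, _, cur, parent = heapq.heappop(open_heap)
        if cur in came_from:          # already expanded via a cheaper route
            continue
        came_from[cur] = parent
        if cur == goal:               # walk parents back to reconstruct the path
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        x, y = cur
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= ny < len(grid) and 0 <= nx < len(grid[0]) and not grid[ny][nx]:
                ng = g + 1
                if ng < g_cost.get((nx, ny), float("inf")):
                    g_cost[(nx, ny)] = ng
                    heapq.heappush(
                        open_heap,
                        (ng + h((nx, ny), goal), ng, next(counter), (nx, ny), cur))
    return []   # no route through the collision layer
```

On a 140 × 100 grid this resolves in well under a millisecond per agent, so pathfinding is never the bottleneck; LLM inference is.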

Technology Stack

The biggest departure from the original paper is the technology stack. The original used cloud APIs and a Django-based monolith; V0.11 uses modern, local-first tools:

| Component  | Original Paper                       | V0.11                                |
|------------|--------------------------------------|--------------------------------------|
| LLM        | GPT-3.5-turbo (cloud, ~$1000/2 days) | Ollama + Qwen 14B (local, free)      |
| Embeddings | OpenAI ada-002 (cloud)               | sentence-transformers MiniLM (local) |
| Backend    | Django + file-based IPC              | FastAPI + REST API                   |
| Frontend   | Phaser.js + Django templates         | React + Phaser 3 + Vite              |
| Simulation | Live-only (slow, coupled to UI)      | Separated simulate/replay            |
| Data       | Custom file formats                  | JSON throughout                      |

Why Local LLM?

Three reasons:

  1. Cost: The original paper spent ~$1000 on GPT-3.5 API calls for a single 2-day simulation. Local inference with Ollama is free after the initial model download.
  2. Reproducibility: Cloud API behavior changes over time as providers update models. A local model with fixed weights and fixed sampling settings produces reproducible results.
  3. Privacy: All agent data — memories, conversations, reflections — stays on the local machine.

The trade-off is inference quality. Qwen 14B is less capable than GPT-3.5 in some areas, but for the structured prompts used in the cognitive loop, it performs well enough to reproduce the paper's key findings.


Architecture: Simulate/Replay Separation

One of V0.11's most important design decisions is separating simulation from visualization.

Simulation Mode (python -m backend.simulate):

  • Runs headlessly in the terminal
  • Each step: all 25 agents execute their cognitive loop
  • Outputs master_movement.json — a complete record of every agent's position, action, and state at every timestep
  • A 100-step simulation takes 30–60 minutes (dominated by LLM inference time)
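The headless loop is conceptually simple: step every persona, record the result, dump one JSON file at the end. A minimal sketch, assuming a `persona.step()` method that runs the cognitive loop and returns the next tile plus an action description (the actual `simulate.py` interface may differ):

```python
import json

def run_headless(personas, maze, num_steps, out_path="master_movement.json"):
    """Minimal sketch of the headless loop: every step, each persona runs its
    cognitive loop and the resulting movement is recorded for later replay."""
    master_movement = {}
    for step in range(num_steps):
        frame = {}
        for persona in personas:
            # persona.step() is assumed to run perceive -> retrieve -> plan
            # -> reflect -> execute and return the next tile + action text
            tile, action = persona.step(maze, step)
            frame[persona.name] = {"movement": tile, "description": action}
        master_movement[step] = frame
    with open(out_path, "w") as f:
        json.dump(master_movement, f, indent=2)
    return master_movement
```

Because the record is complete per timestep, a crash mid-run loses nothing that has already been flushed, and the replay viewer never needs the simulator running.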

Replay Mode (browser-based):

  • FastAPI serves the simulation data via REST API
  • React + Phaser 3 frontend renders the tile map with sprite animations
  • Playback controls: play/pause, seek bar, variable speed (1×–20×)
  • Per-agent state inspection (click an agent to see their current thoughts, plan, and memories)

This separation means:

  • You can run experiments overnight without needing the UI
  • Multiple simulations can be compared by loading different replay files
  • The replay viewer loads instantly — no waiting for LLM inference
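The core of the replay server is just windowed access into the recorded frames, so the viewer can seek without downloading the whole simulation. A sketch of that logic (the endpoint path and parameter names below are illustrative, not the project's actual API):

```python
def get_frames(master_movement: dict, start: int, count: int) -> dict:
    """Return a window of replay frames so the viewer can seek to any point
    without transferring the entire master_movement record."""
    return {
        step: master_movement[step]
        for step in range(start, start + count)
        if step in master_movement
    }

# Exposed via a FastAPI route, this might look like:
#   @app.get("/api/frames")
#   def frames(start: int = 0, count: int = 50):
#       return get_frames(MASTER_MOVEMENT, start, count)
```

Serving windows rather than the full file is what makes 20× playback and instant seeking cheap on the frontend.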

Code Structure

ALICE_PROJECT/
├── backend/
│   ├── simulate.py              # CLI simulation runner
│   ├── main.py                  # FastAPI server for replay
│   ├── world_engine.py          # Core simulation engine + global clock
│   ├── maze.py                  # 140×100 tile-based world map
│   ├── recorder.py              # Saves replay data to JSON
│   ├── path_finder.py           # A* pathfinding algorithm
│   ├── config.py                # LLM & simulation configuration
│   ├── llm/
│   │   ├── llm_client.py        # OpenAI-compatible LLM interface
│   │   └── embedding.py         # Local sentence-transformers embeddings
│   ├── persona/
│   │   ├── persona.py           # Agent cognitive loop orchestrator
│   │   ├── cognitive_modules/
│   │   │   ├── perceive.py      # Event perception + poignancy scoring
│   │   │   ├── retrieve.py      # Three-factor memory retrieval
│   │   │   ├── plan.py          # Three-level plan decomposition
│   │   │   ├── reflect.py       # Importance-triggered reflection
│   │   │   ├── execute.py       # Movement + action execution
│   │   │   └── converse.py      # Turn-by-turn dialogue engine
│   │   └── memory_structures/
│   │       ├── associative_memory.py  # Concept node knowledge graph
│   │       ├── spatial_memory.py      # Hierarchical world model
│   │       └── scratch.py             # Working memory / state
│   └── data/
│       └── the_ville/           # Smallville world data (25 agents)
├── frontend/
│   ├── src/
│   │   ├── App.tsx              # Replay viewer interface
│   │   ├── GameScene.ts         # Phaser 3 map rendering
│   │   └── api.ts               # Backend REST client
│   └── public/assets/           # Tilesets and character sprites
└── paper-generative-agent/
    ├── ANALYSIS.md              # 15-section deep dive into original code
    └── reverie/                 # Original paper source code (reference)

Key Technical Details

Memory Retrieval Formula

For a query context q and a set of candidate memories M, each memory m ∈ M receives a score:

score(m) = w_recency × norm(recency(m)) + w_relevance × norm(relevance(m, q)) + w_importance × norm(importance(m))

Where:

  • recency(m) = λ^(t_now − t_created) — exponential decay with per-step factor λ = 0.995 (e.g. a memory 100 steps old scores 0.995^100 ≈ 0.61)
  • relevance(m, q) = cosine_similarity(embedding(m), embedding(q))
  • importance(m) = LLM-assigned poignancy score (1–10)
  • norm() = min-max normalization within the current retrieval set
  • Default weights: w_recency = 0.5, w_relevance = 3.0, w_importance = 2.0
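Putting the formula together, here is a self-contained sketch of the scorer. The memory representation (a tuple of creation time, embedding, poignancy, text) is simplified for illustration; the real retrieve module operates on concept nodes:

```python
import math

def normalize(xs):
    """Min-max normalize within the retrieval set; constant sets map to 0.5."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.5] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(memories, query_embedding, t_now, top_k=5,
             w_recency=0.5, w_relevance=3.0, w_importance=2.0, decay=0.995):
    """Score each memory by recency, relevance, and importance, then take top-k.
    `memories` is a list of (created_time, embedding, poignancy, text) tuples."""
    recency = normalize([decay ** (t_now - m[0]) for m in memories])
    relevance = normalize([cosine(m[1], query_embedding) for m in memories])
    importance = normalize([m[2] for m in memories])
    scored = sorted(
        zip(memories, recency, relevance, importance),
        key=lambda z: w_recency * z[1] + w_relevance * z[2] + w_importance * z[3],
        reverse=True,
    )
    return [m[3] for m, *_ in scored[:top_k]]
```

Note that because each factor is normalized within the candidate set, the weights compare factors against each other for this query, not against absolute scales; this is what lets retrieval adapt to context instead of always surfacing the same high-importance memories.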

Reflection Trigger

Reflection activates when a running importance budget, decremented by each new memory's poignancy score, reaches zero (conceptually: the agent has accumulated enough significant experiences to warrant synthesis). The system:

  1. Generates 3 focal questions from recent high-importance memories
  2. For each question, retrieves the top-k most relevant memories
  3. Synthesizes each set into a "thought" node with evidence pointers
  4. Stores thought nodes back into associative memory (enabling meta-reflection in future cycles)
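The trigger itself amounts to a countdown budget. A minimal sketch (the threshold value here is illustrative, not the project's default):

```python
class ReflectionTrigger:
    """Countdown-style trigger: each new memory's poignancy is subtracted from
    a budget; when the budget reaches zero, reflection fires and resets."""

    def __init__(self, threshold=150):
        self.threshold = threshold
        self.budget = threshold

    def observe(self, poignancy: int) -> bool:
        self.budget -= poignancy
        if self.budget <= 0:
            self.budget = self.threshold   # reset for the next cycle
            return True                    # caller should run the reflect module
        return False
```

A nice property of this design: eventful periods (many high-poignancy memories) trigger reflection sooner, while uneventful routine lets the agent go longer without synthesizing, which matches the intuition that reflection should follow significance, not wall-clock time.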

Plan Decomposition

The three temporal levels handle different time scales:

  • Macro (daily): "Wake up, eat breakfast, work at the blacksmith, have lunch, visit the market, go home, sleep"
  • Meso (hourly): "9:00–10:00: Heat the forge and prepare materials; 10:00–12:00: Work on the commissioned sword"
  • Micro (5–15 min): "9:00–9:05: Stoke the coals; 9:05–9:15: Arrange the metal stock; 9:15–9:30: Begin heating the forge"

When an unexpected event occurs (another agent arrives, a loud noise), the plan module evaluates whether to:

  • Continue the current task
  • Pause and engage with the event
  • Abandon the current task and replan
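In the actual system this three-way choice is made by prompting the LLM with the current micro-task, the interrupting event, and retrieved relationship memories. The toy rule below only stands in for that call to show the decision's shape; every name and threshold here is hypothetical:

```python
def decide_reaction(current_task, event, relationship_score, task_priority):
    """Toy stand-in for the plan module's LLM-driven reaction decision.
    relationship_score and task_priority are assumed to be in [0, 1]."""
    if (event["kind"] == "agent_nearby"
            and relationship_score > 0.5      # we care about this agent
            and task_priority < 0.8):         # current subtask is interruptible
        return "pause_and_engage"
    if event["kind"] == "emergency":
        return "abandon_and_replan"
    return "continue"                         # ignore and keep working
```

The micro-task level is what makes this decision cheap: pausing a 5-minute subtask is low-stakes, whereas with only hourly plans every interruption would force a costly replan.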

Lessons Learned

What the Paper Got Right

  1. Three-factor retrieval works remarkably well. The combination of recency, relevance, and importance produces natural-feeling memory access patterns. Agents remember recent events, recall related past experiences, and retain emotionally significant moments — just like humans.

  2. Reflection creates genuine depth. After enough observations accumulate, agents synthesize insights that feel meaningful. A blacksmith who has had several failed conversations might reflect: "I think people avoid me because I talk too much about my work."

  3. The plan hierarchy prevents aimless wandering. Without plans, agents would just react to stimuli. The three-level decomposition gives them purpose, routine, and the ability to break routine when something interesting happens.

What Surprised Us

  1. Local LLMs perform better than expected. Qwen 14B handles the structured prompts well. The key insight: the cognitive architecture constrains the LLM's output into manageable, focused tasks. Each module asks a specific question with clear formatting requirements. The LLM doesn't need to be brilliant — it needs to be reliable within narrow bounds.

  2. The simulate/replay split was essential. Early attempts at live simulation with UI were painfully slow (minutes per step with 25 agents). Separating the two made experimentation practical.

  3. Agent conversations are the weakest link. Even with memory retrieval and relationship context, conversations tend toward politeness and repetition after several turns. This is where the static nature of LLM weights becomes most apparent — the agents can't truly learn from conversations, only record them.


Relationship to V0.1 and V0.2

V0.1 asked the question: "Can two LLM-powered agents have a meaningful conversation?"
The answer was: "Not with naive prompting and flat memory."

V0.11 answered: "What if we implement the full cognitive architecture from the paper?"
The answer: "Yes — with perception, structured retrieval, planning, and reflection, agents behave much more believably."

V0.2 asks: "Can we go beyond the paper? Can agents grow, forget, dream, and rebel?"
V0.2 takes the solid foundation of V0.11 and adds biologically-inspired enhancements: short/long-term memory split, dream-based consolidation, ego evolution, ability checks, and the dissent mechanism. It also moves from Smallville to the original world of TANAPOCIA — Uva Village with its medieval fantasy setting, religious tensions, and hidden bloodlines.