Roadmap

What's shipped, what's in progress, and what's planned.


Shipped

| Feature | What landed |
| --- | --- |
| Web Frontend | Next.js chat UI with streaming, thread management, and memory inspection. Lives in apps/web/. |
| API Layer | FastAPI with REST (POST /api/chat) and WebSocket (/api/chat/stream) endpoints. Thread management, memory status, session end. Lives in apps/backend/api/. |
| Voice Chat | Experimental OpenAI Realtime speech preview with a FastAPI WebSocket bridge, standalone test harness, and Next.js voice UI. The current path is speech-only and non-agentic. Lives in apps/backend/voice/. |
| Session Feedback | End-of-session thumbs rating captured at /end, /exit, and POST /threads/{id}/end. SQLite-backed, incognito-safe. |
| Session Trajectory Eval | Unified runner for short (inline) and long (checkpoint) trajectory datasets. 25 long-trajectory cases covering modality, boundary enforcement, crisis arcs, closing, venting, and mode transitions. Concurrent hybrid execution with --concurrency, --case, --verbose. |
| Crisis Gate — LLM-primary | The LLM is the primary crisis classifier; regex is fallback only. Override precedence fix, shadow monitoring, prompt hardening (conversation fencing, anti-injection, adversarial examples), strict truth-table enforcement. |
| Dispatcher — LLM-primary | The LLM handles all mode and modality classification. Context-blind regex fast paths removed. LLM-based mid-exercise exit detection. Exercise modality persistence. |
| Knowledge Overhaul | soul.md expanded with therapeutic grounding, cultural sensitivity, repair patterns, and boundary-setting voice. identity.md rewritten with product philosophy. boundaries.md expanded with redirection patterns and dependency framing. |
| OpenAI Embeddings | text-embedding-3-large as the default provider, with Gemini as fallback. Hybrid RRF retrieval achieves 14/17 recall@5 vs 6/17 token-only. |
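The hybrid retrieval above fuses the token-based and embedding-based rankings with reciprocal rank fusion. A minimal sketch of the scoring rule (the function name, the conventional k=60 constant, and the fact IDs are illustrative, not taken from the codebase):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score each doc as the sum of 1/(k + rank)
    over every ranked list it appears in, then sort by score descending."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Token and embedding retrievers return different orderings; fusion
# rewards facts that rank near the top of either list.
token_hits = ["fact_a", "fact_b", "fact_c"]
vector_hits = ["fact_c", "fact_d", "fact_a"]
fused = rrf_fuse([token_hits, vector_hits])
```

Because RRF only uses ranks, it needs no score calibration between the two retrievers, which is why it is a common default for hybrid search.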

In progress

| Feature | Status | What's left |
| --- | --- | --- |
| Response quality rubric | Designed, not implemented | LLM-as-judge eval runner to test empathy, tone, banned phrases, question stacking, and conciseness. Needs a rubric dataset plus a grading runner. |
| Memory integration eval | Designed, not implemented | Test whether retrieved memory shapes responses: cross-session continuity, procedural rule enforcement, appropriate recall. |
| Session feedback — closing mode | Designed, not wired | Closing detection is now LLM-primary; the feedback prompt needs to fire on natural closings, not just CLI/API end commands. |
| Session feedback — voice | Designed, not wired | Voice disconnect bypasses end_session() — needs to either route through the runtime or gain its own feedback hook. |
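Some rubric dimensions are mechanical and can be pre-checked deterministically before an LLM judge runs, which keeps the judge focused on empathy and tone. A sketch with a hypothetical banned-phrase list and question threshold (neither is from the actual rubric design):

```python
import re

BANNED_PHRASES = ["I understand how you feel", "at least"]  # illustrative list
MAX_QUESTIONS = 1  # more than one question per turn counts as stacking

def rubric_precheck(response: str) -> dict:
    """Deterministic rubric checks: banned phrases and question stacking
    need no LLM judgment, so they run first and gate the LLM pass."""
    lowered = response.lower()
    banned = [p for p in BANNED_PHRASES if p.lower() in lowered]
    questions = len(re.findall(r"\?", response))
    return {
        "banned_phrases": banned,
        "question_stacking": questions > MAX_QUESTIONS,
        "passed": not banned and questions <= MAX_QUESTIONS,
    }

report = rubric_precheck("Why now? And why not earlier? At least you tried.")
# Fails on both counts: a banned phrase and two stacked questions.
```

Splitting the rubric this way also makes the deterministic portion cheap to run on every CI commit, independent of the grading runner.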

Planned

Messaging Channels

Adapters for Telegram, WhatsApp, and Discord. The Channel enum already has slots (Channel.TELEGRAM, Channel.WHATSAPP); the agent graph is channel-agnostic. Each adapter maps platform message formats to AgentInput / AgentOutput. Crisis responses would need channel-specific formatting (inline buttons, embeds).
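The adapter pattern described above can be sketched for one channel. This is a hedged illustration: the AgentInput/AgentOutput field names are assumptions (the real types live in the agent graph), and the crisis formatting shown is one possible approach, not the designed one. The Telegram update shape follows the Bot API's message/from structure:

```python
from dataclasses import dataclass

@dataclass
class AgentInput:
    # Illustrative stand-in for the channel-agnostic agent input type.
    channel: str
    user_id: str
    text: str

@dataclass
class AgentOutput:
    # Illustrative stand-in for the agent's output type.
    text: str
    crisis_level: int = 0

class TelegramAdapter:
    """Maps Telegram update payloads onto the agent's I/O types (sketch)."""

    def to_agent_input(self, update: dict) -> AgentInput:
        msg = update["message"]
        return AgentInput(channel="telegram",
                          user_id=str(msg["from"]["id"]),
                          text=msg.get("text", ""))

    def to_platform_reply(self, out: AgentOutput) -> dict:
        reply = {"text": out.text}
        if out.crisis_level > 0:
            # Channel-specific crisis formatting (illustrative): attach an
            # inline button pointing at a crisis resource.
            reply["reply_markup"] = {"inline_keyboard": [[
                {"text": "988 Lifeline", "url": "https://988lifeline.org"}]]}
        return reply
```

Because the agent graph is channel-agnostic, each new platform should only require one such adapter plus its crisis-formatting rules.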

Acoustic Crisis Detection

Voice mode currently uses transcript-only crisis detection. Real gaps: voice cracking, sobbing, pressured speech, prosodic flatness. A user saying "I'm fine" through tears scores level 0.

Requires either a curated distressed-voice dataset (ethically fraught) or a validated off-the-shelf acoustic classifier (not a solved problem). Calendar-gated on dataset and model maturity.

Graph Memory

Graphiti + Neo4j for entity/relationship extraction from semantic facts. Enables relational reasoning: "you mentioned your sister and your work stress — they tend to co-occur." The wire frame exists (agent/memory/graph_store.py with NullGraphMemoryStore); the graphiti-core dependency is in pyproject.toml but the integration is intentionally disabled pending design.
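The null-object wiring described above lets the agent call graph-memory hooks unconditionally while the real backend stays disabled. A minimal sketch of the pattern (method names are assumptions; the actual interface is in agent/memory/graph_store.py):

```python
class GraphMemoryStore:
    """Interface the agent graph calls for relational memory.
    Method names here are illustrative, not the real signatures."""

    def add_fact(self, fact: str) -> None:
        raise NotImplementedError

    def related(self, entity: str) -> list[str]:
        raise NotImplementedError

class NullGraphMemoryStore(GraphMemoryStore):
    """No-op stand-in: callers never branch on 'is graph memory enabled';
    they always call the store, and the null version simply does nothing."""

    def add_fact(self, fact: str) -> None:
        pass  # silently discard until the Graphiti/Neo4j backend lands

    def related(self, entity: str) -> list[str]:
        return []  # no relationships known
```

Swapping in a Graphiti-backed implementation later should then be a one-line change at the wiring site rather than a sweep through every caller.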

Background Consolidation

Automatic fact merging, dormant marking, and a consolidation_runs log. Schema is defined (ConsolidationProposal, ConsolidationRunRecord in agent/memory/models.py); the implementation node is sketched but not wired into the graph. Adds /memory restore as an undo for destructive operations.

Session Intent, Stage, and Response Guidance

Three state fields (progress.intent, progress.stage, response.guidance) are defined in the schema but not yet populated by any node. When implemented, they enable session-level steering: the agent knows whether to deepen, stabilize, or close based on conversation arc rather than just the current message. The eval runner already supports assertions for all three — just re-add the dataset expectations.
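How those three fields could drive steering can be sketched as follows. The field values and the decision rules are hypothetical examples of "deepen, stabilize, or close" logic, not the designed behavior:

```python
from dataclasses import dataclass

@dataclass
class SessionSteering:
    # Mirrors the three schema fields; the example values are hypothetical.
    intent: str    # progress.intent, e.g. "vent" or "problem_solve"
    stage: str     # progress.stage, e.g. "opening", "deepening", "closing"
    guidance: str  # response.guidance, free-text steering hint

def next_move(s: SessionSteering) -> str:
    """Pick a move from the session arc rather than the latest message alone."""
    if s.stage == "closing":
        return "summarize and wind down"
    if s.intent == "vent":
        return "reflect; hold off on problem-solving"
    return s.guidance or "deepen gently"
```

Even this toy version shows the payoff: two identical user messages get different responses depending on where the session is in its arc.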

Crisis Gate Production Telemetry

Capture model ID, prompt version, raw and normalized levels, confidence values, deterministic shadow results, disagreement rates, timeout/parse-failure counters, and degraded-mode alerts. The shadow-monitoring infrastructure is in place; the production telemetry layer is not.
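A telemetry record covering those signals might look like the following. This is a sketch, not the planned schema: field names are assumptions that mirror the list above, and JSON-lines output is one plausible transport:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class CrisisGateTelemetry:
    # Illustrative field names mirroring the signals listed above.
    model_id: str
    prompt_version: str
    raw_level: int
    normalized_level: int
    confidence: float
    shadow_level: int          # deterministic shadow classifier's verdict
    timed_out: bool = False
    parse_failed: bool = False

def emit(rec: CrisisGateTelemetry) -> str:
    """Serialize one classification event as a JSON log line, deriving the
    disagreement flag that feeds disagreement-rate monitoring."""
    payload = asdict(rec)
    payload["disagreement"] = rec.normalized_level != rec.shadow_level
    payload["ts"] = time.time()
    return json.dumps(payload)
```

Deriving disagreement at emit time (rather than at query time) keeps the downstream alerting query a simple counter over log lines.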

Clinical Review

A trained clinician reviews the knowledge/response_modes/*.md files, the prompt builders in agent/therapeutic/prompts.py, and agent responses across dogfood sessions. This is the gate before "a trusted friend could try it" becomes a defensible claim. Calendar dependency, not engineering.