If you are building agents, you already know the failure mode: long sessions get expensive, slow, and brittle because you keep paying to restate the past.
This persistence engine fixes that by moving memory out of the prompt and into a durable store, then retrieving only what is relevant per turn.
Primary capability
Token usage reduction that improves over time
You set a retrieval budget per turn. Instead of replaying transcripts, the agent retrieves a small, targeted memory slice. Token spend therefore trends down as history grows, and can reach up to 95 percent reduction on long-running workloads.
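As a rough illustration of the per-turn budget idea (all names here are illustrative, not the actual API): rather than concatenating the whole transcript, you greedily pack the top-ranked memory slices into a fixed token budget.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    text: str
    score: float

class MemoryStore:
    """Toy in-memory stand-in for the engine's hybrid retrieval."""
    def __init__(self, items):
        self.items = items

    def retrieve(self, query):
        # best-first ranking (stand-in for lexical + semantic scoring)
        return sorted(self.items, key=lambda h: h.score, reverse=True)

def build_turn_context(store, query, token_budget):
    """Greedily pack the top-ranked memory slices into a fixed token budget."""
    context, spent = [], 0
    for hit in store.retrieve(query):
        cost = max(1, len(hit.text) // 4)  # rough 4-chars-per-token estimate
        if spent + cost > token_budget:
            break
        context.append(hit.text)
        spent += cost
    return context
```

The key property: the cost per turn is bounded by the budget, not by the length of the history.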
Other core capabilities
- Durable session event logs, restart safe
- Blob storage for large artifacts
- Retrieval over history, lexical plus optional semantic
- Multi-tenant support for separating projects or users
- Offline licensing using signed license files, no phone-home
What I want feedback on
- Real-world token reduction numbers in your workflow
- Recall quality, especially false positives and missed memories
- Durability under restarts, crashes, and messy state transitions
- Integration friction in actual agent loops
To join
Reply with your stack, your use case, and your target constraint: token spend, latency, or reliability.
License text is being finalized with Australian counsel. Access starts as soon as that is signed.
Stack: Python/Rust based LLM Security Orchestrator (Firewall).
Use Case: Stateful specialized agents where ‘forgetting’ security constraints is catastrophic.
Target Constraint: Latency & Control.
Critical Question: Does the engine expose the retrieval scores or allow for custom re-ranking logic?
We are implementing an ‘Outcome-Weighted Retrieval’ (penalizing memories that led to failures). If your engine is a black box that just returns text, it breaks our safety loop. If it allows score injection or re-ranking hooks, it’s a perfect fit.
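Concretely, the hook we have in mind looks something like this (a sketch in our terms, not the engine's API; the hit shape and field names are ours):

```python
from typing import Callable

# Assumed hit shape (ours, not the engine's): {"id", "score", "meta"}
Hit = dict

def retrieve_with_hook(raw_hits: list[Hit],
                       rerank: Callable[[list[Hit]], list[Hit]]) -> list[Hit]:
    """The engine returns scored hits; a caller-supplied hook reorders them."""
    return rerank(raw_hits)

def outcome_weighted(hits: list[Hit], penalty: float = 0.2) -> list[Hit]:
    """Downrank memories whose past use led to failures."""
    return sorted(
        hits,
        key=lambda h: h["score"] - penalty * h["meta"].get("failures", 0),
        reverse=True,
    )
```

If the retrieval endpoint returned scores plus metadata, this would be a thin client-side layer; without scores, it cannot be built at all.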
Saw the build log: the Go + Alpine stack with host networking looks incredibly clean. Zero bloat.
I have three specific architectural questions to see if this fits a high-security orchestration layer (we currently run a custom Python/ONNX stack):
- Decoupled Embeddings: The log shows it probing nomic-embed-text via Ollama. Does the API allow ingesting pre-computed vectors (BYO embeddings)?
  - Context: We use specialized multilingual models (intfloat/e5-large) for security classifiers. We need to pass you the vectors, not the raw text, to ensure our specific embedding alignment is preserved.
- Score Visibility & Reranking: Does the retrieval endpoint return the raw similarity scores/distances for the chunks, or just the text blobs?
  - Context: We implement a ‘Safety Penalty’ layer (SRF) where we mathematically degrade the score of a chunk if it previously led to a jailbreak. We need the raw score to apply this delta (R = S − D) before passing it to the agent.
- Metadata Mutability: Can we update the metadata of a stored blob without re-indexing the vector?
  - Context: When a ‘memory’ proves toxic, we need to tag it (e.g., failure_count++) instantly to trigger the penalty logic on the next retrieval.
If we can bring our own vectors and see/modify the scores, this could replace our entire vector backend.
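To make the loop concrete, a toy sketch of what we would run client side, assuming the engine hands back raw similarity S and lets us patch metadata (every name here is hypothetical):

```python
class SafetyPenaltyIndex:
    """Client-side SRF sketch: vectors are never touched; only metadata moves."""

    def __init__(self, delta_per_failure=0.15):
        self.delta = delta_per_failure
        self.meta = {}  # chunk_id -> mutable metadata, no re-embedding needed

    def tag_failure(self, chunk_id):
        m = self.meta.setdefault(chunk_id, {"failure_count": 0})
        m["failure_count"] += 1  # the failure_count++ tagging step

    def rerank(self, hits):
        """hits: [(chunk_id, S)] raw similarity from the engine; R = S - D."""
        def r(item):
            cid, s = item
            d = self.delta * self.meta.get(cid, {}).get("failure_count", 0)
            return s - d
        return sorted(hits, key=r, reverse=True)
```

If the engine exposes S and mutable metadata, everything above stays on our side and the vector index never has to be rebuilt.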
Thanks for the detailed questions. You are describing exactly the kind of safety critical retrieval loop we want to support.
- On score visibility and reranking: the public doc states that retrieval is hybrid and that results return event slices with source metadata. It does not yet specify whether raw lexical and vector scores are returned. If score visibility is a requirement, we can expose the per-hit lexical score, vector similarity, and combined rank score so you can apply SRF penalties client side.
- On decoupled embeddings and bring your own vectors: the current design in the doc uses an embedding worker that computes embeddings when enabled. We have not published an interface for ingesting precomputed vectors yet. If your workflow depends on BYO vectors, tell me your vector dimensions and distance metric and I will align the interface around that requirement.
- On metadata mutability without vector reindex: the storage model is append only JSONL with tombstones. The doc does not yet define a metadata patch event type, but the intent is that state can evolve without rewriting history. If we treat metadata changes as new events, retrieval can filter or downrank immediately without recomputing vectors as long as the underlying text is unchanged.
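As a rough sketch of the metadata-as-events idea (illustrative only, not the shipped format): each metadata change is appended as a new event, and the current view is obtained by folding patches for a blob, last write wins. The underlying text and its vector are never rewritten.

```python
import io
import json

def append_event(log, event):
    """Append-only: state changes are new events; history is never rewritten."""
    log.write(json.dumps(event) + "\n")

def current_metadata(log_text, blob_id):
    """Fold metadata_patch events for one blob; last write wins per key."""
    meta = {}
    for line in log_text.splitlines():
        ev = json.loads(line)
        if ev.get("type") == "metadata_patch" and ev.get("blob") == blob_id:
            meta.update(ev["patch"])
    return meta
```

Retrieval would read the folded metadata at query time to filter or downrank, so a failure_count bump takes effect on the very next retrieval without recomputing any vectors.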