Checkpoints & Rollbacks
StateBase gives your AI agents a superpower that humans don’t have: the ability to undo mistakes. This is the core of StateBase’s reliability guarantee.
The Problem: Non-Deterministic Failures
AI agents fail in unpredictable ways.
How It Works: Automatic State Versioning
Every time you update a session’s state, StateBase creates an immutable snapshot:
- Version number (auto-incrementing)
- State snapshot (full JSON)
- Timestamp (when it was created)
- Reasoning (why this change was made)
- Trace ID (which operation triggered it)
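To make the snapshot fields concrete, here is a minimal in-memory sketch of a version history. It is not the StateBase SDK: `StateVersion`, `VersionedSession`, and their method signatures are hypothetical names chosen for illustration, and only the five fields themselves come from the list above.

```python
from __future__ import annotations

import copy
import time
import uuid
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class StateVersion:
    """One snapshot record (hypothetical shape) with the five listed fields."""
    version: int            # auto-incrementing version number
    state: dict[str, Any]   # full JSON-style snapshot
    created_at: float       # timestamp, seconds since the epoch
    reasoning: str          # why this change was made
    trace_id: str           # which operation triggered it


class VersionedSession:
    """Hypothetical in-memory stand-in for one session's version history."""

    def __init__(self, initial_state: dict[str, Any]) -> None:
        self.history: list[StateVersion] = []
        self._append(initial_state, "Initial state")  # becomes version 0

    def _append(self, state: dict[str, Any], reasoning: str,
                trace_id: str | None = None) -> StateVersion:
        snapshot = StateVersion(
            version=len(self.history),
            # Deep-copy so later edits to the caller's dict cannot alter history.
            state=copy.deepcopy(state),
            created_at=time.time(),
            reasoning=reasoning,
            trace_id=trace_id or uuid.uuid4().hex,
        )
        self.history.append(snapshot)
        return snapshot

    def update_state(self, state: dict[str, Any], reasoning: str,
                     trace_id: str | None = None) -> StateVersion:
        """Every update appends a new version; earlier ones are never edited."""
        return self._append(state, reasoning, trace_id)

    @property
    def current(self) -> StateVersion:
        return self.history[-1]
```

The frozen dataclass blocks reassignment of a snapshot's fields, and the deep copy on the way in keeps stored snapshots independent of the caller's objects.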
Rollback: Undo to a Previous Version
If your agent makes a mistake, you can revert to any previous state version.
What Happens During Rollback?
Suppose a session is at version 5 and you roll it back to version 3:
- StateBase retrieves the state snapshot from version 3
- Creates a new version (e.g., version 6) with the restored state
- Returns the restored state to your agent
- Preserves history: Versions 4 and 5 are still in the database for audit
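The rollback steps above can be sketched as an append-only operation on a version list. This is an illustrative model under stated assumptions, not StateBase's implementation: the `rollback` function and the dictionary layout of each version are hypothetical.

```python
import copy
from typing import Any


def rollback(history: list[dict[str, Any]], target_version: int) -> dict[str, Any]:
    """Restore `target_version` by appending a copy of it as a new version."""
    # Retrieve the snapshot for the requested version.
    target = next((v for v in history if v["version"] == target_version), None)
    if target is None:
        raise ValueError(f"Unknown version: {target_version}")

    # Create a new version that holds the restored state.
    restored = {
        "version": history[-1]["version"] + 1,
        "state": copy.deepcopy(target["state"]),
        "reasoning": f"Rollback to version {target_version}",
    }

    # History is only ever appended to, so later versions stay available for audit.
    history.append(restored)

    # Return the restored state to the caller.
    return restored["state"]


# Versions 0 through 5 exist; rolling back to 3 creates version 6.
history = [{"version": n, "state": {"step": n}, "reasoning": "update"} for n in range(6)]
state = rollback(history, target_version=3)
```

Because the restored state is written as a new version rather than by truncating the list, versions 4 and 5 remain readable afterwards.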
Checkpoint Strategies
Not every state change needs to be checkpointed. Here are common strategies:
Strategy 1: Checkpoint After Tool Calls
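One way to checkpoint after tool calls is to wrap each tool in a decorator that snapshots the state once the call returns. The sketch below uses a plain dictionary as the session; `checkpoint_after`, `search_flights`, and the session layout are hypothetical and do not reflect the StateBase SDK.

```python
import copy
import functools
from typing import Any, Callable


def checkpoint_after(session: dict[str, Any]) -> Callable:
    """Decorator factory: snapshot session["state"] after each successful tool call."""
    def decorator(tool: Callable) -> Callable:
        @functools.wraps(tool)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            result = tool(*args, **kwargs)
            # Reached only if the tool did not raise.
            session["versions"].append({
                "state": copy.deepcopy(session["state"]),
                "reasoning": f"After tool call: {tool.__name__}",
            })
            return result
        return wrapper
    return decorator


session = {"state": {"results": []}, "versions": []}


@checkpoint_after(session)
def search_flights(destination: str) -> list[str]:
    # Stand-in tool: a real one would call an external API.
    flights = [f"{destination}-001", f"{destination}-002"]
    session["state"]["results"] = flights
    return flights


search_flights("LIS")
```

A tool that raises leaves no checkpoint behind, so the most recent version always reflects a completed call.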
Strategy 2: Checkpoint After User Confirmation
Strategy 3: Checkpoint Before Risky Operations
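Checkpointing before a risky operation fits naturally into a context manager: save the state on entry and restore it if the block raises. This is a minimal sketch with a dictionary standing in for the session; `checkpoint_before` is a hypothetical helper, not an SDK call.

```python
import copy
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def checkpoint_before(session: dict[str, Any], reasoning: str) -> Iterator[None]:
    """Snapshot session["state"] on entry; restore it if the body raises."""
    saved = copy.deepcopy(session["state"])
    session["checkpoints"].append({"state": saved, "reasoning": reasoning})
    try:
        yield
    except Exception:
        # Put the pre-operation state back, then let the caller see the error.
        session["state"] = copy.deepcopy(saved)
        raise


session = {"state": {"balance": 100}, "checkpoints": []}

try:
    with checkpoint_before(session, "Before charging the customer"):
        session["state"]["balance"] -= 250
        if session["state"]["balance"] < 0:
            raise RuntimeError("payment would overdraw the account")
except RuntimeError:
    pass  # session["state"] has been restored to the checkpoint
```

Re-raising after the restore keeps the failure visible to the caller, so the reason for the rollback can still be logged and investigated.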
Automatic Checkpointing
StateBase automatically creates checkpoints in these scenarios:
| Event | Checkpoint Created | Reasoning |
|---|---|---|
| sessions.create() | ✅ Version 0 | Initial state |
| sessions.update_state() | ✅ New version | Explicit state change |
| sessions.add_turn() | ⚠️ Optional | Only if state_after differs from state_before |
| memory.add() | ❌ No | Memories don’t affect session state |
Controlling Turn-Based Checkpointing
By default, add_turn() does not create a checkpoint unless you explicitly update state.
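The rule can be modeled in a few lines: a turn is always recorded, but a new version is written only when a state is supplied and it differs from the current one. `record_turn` below is a hypothetical standalone function whose signature is an assumption; it is not the SDK's add_turn().

```python
import copy
from typing import Any, Optional


def record_turn(session: dict[str, Any], user_input: str, agent_output: str,
                state_after: Optional[dict[str, Any]] = None) -> bool:
    """Append a turn; return True only if a new state version was created."""
    session["turns"].append({"input": user_input, "output": agent_output})

    state_before = session["versions"][-1]
    if state_after is None or state_after == state_before:
        return False  # no state supplied, or nothing changed: no checkpoint

    session["versions"].append(copy.deepcopy(state_after))
    return True


session = {"turns": [], "versions": [{"cart": []}]}

record_turn(session, "Hi", "Hello! How can I help?")       # no checkpoint
record_turn(session, "Add a mug", "Added.",
            state_after={"cart": ["mug"]})                 # new version
```

Comparing against the latest version avoids storing identical snapshots when a turn leaves the state unchanged.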
Recovery Patterns
Pattern 1: Retry with Different Prompt
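A retry-with-a-different-prompt loop might look like the sketch below: every attempt starts from the same checkpoint, and an invalid result is discarded before the next prompt is tried. All names here (`run_with_retry`, `fake_agent`, `looks_like_iso_date`) are hypothetical, and `fake_agent` stands in for a real LLM call.

```python
import copy
from typing import Any, Callable, Optional


def run_with_retry(
    state: dict[str, Any],
    prompts: list[str],
    agent_step: Callable[[dict[str, Any], str], dict[str, Any]],
    is_valid: Callable[[dict[str, Any]], bool],
) -> tuple[dict[str, Any], Optional[str]]:
    """Try each prompt in order, starting every attempt from the same checkpoint."""
    checkpoint = copy.deepcopy(state)
    for prompt in prompts:
        candidate = agent_step(copy.deepcopy(checkpoint), prompt)
        if is_valid(candidate):
            return candidate, prompt
        # Invalid result: drop it, which amounts to rolling back to the checkpoint.
    return checkpoint, None  # every prompt failed; state is unchanged


def fake_agent(state: dict[str, Any], prompt: str) -> dict[str, Any]:
    # Stand-in for an LLM call: only the stricter prompt yields a usable date.
    state["date"] = "2025-01-15" if "ISO" in prompt else "next week-ish"
    return state


def looks_like_iso_date(state: dict[str, Any]) -> bool:
    return state["date"][:4].isdigit()


prompts = ["Extract the date.", "Extract the date in ISO 8601 format."]
final_state, used_prompt = run_with_retry({}, prompts, fake_agent, looks_like_iso_date)
```

Returning the prompt that succeeded alongside the state makes it easy to record which variant recovered the session.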
Pattern 2: Fallback to Human
Pattern 3: A/B Testing Recovery
Forking: Branching Conversations
Sometimes you don’t want to replace the current state; you want to explore an alternative timeline. That’s where forking comes in.
What is Forking?
Forking creates a new session that starts from a specific version of an existing session.
When to Fork vs Rollback
| Use Case | Rollback | Fork |
|---|---|---|
| Undo a mistake | ✅ | ❌ |
| Try alternative approach | ❌ | ✅ |
| A/B test prompts | ❌ | ✅ |
| Preserve original conversation | ❌ | ✅ |
| Debug in production | ❌ | ✅ |
Example: Debugging in Production
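As an illustration of debugging through a fork, the sketch below copies one version of a production session into a new session and experiments there. It models sessions as a dictionary of version lists; `fork_session`, the `fork-` ID prefix, and the choice to start the fork at its own version 0 are assumptions, not documented StateBase behavior.

```python
import copy
import uuid
from typing import Any


def fork_session(sessions: dict[str, dict[str, Any]], source_id: str, version: int) -> str:
    """Create a new session whose first version copies `version` of the source."""
    source_versions = sessions[source_id]["versions"]
    if not 0 <= version < len(source_versions):
        raise ValueError(f"Session {source_id} has no version {version}")

    new_id = f"fork-{uuid.uuid4().hex[:8]}"
    sessions[new_id] = {
        "forked_from": {"session_id": source_id, "version": version},
        "versions": [copy.deepcopy(source_versions[version])],
    }
    return new_id


sessions = {"prod-123": {"versions": [{"step": 0}, {"step": 1}, {"step": 2}]}}

# Reproduce the problem from version 1 without touching the production session.
debug_id = fork_session(sessions, "prod-123", version=1)
sessions[debug_id]["versions"].append({"step": 1, "debug": True})
```

Changes made in the fork never reach the source session, which is what makes this safer than rolling back a live conversation.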
Cost vs Safety Trade-offs
Checkpointing has a cost (storage + API calls). Here’s how to balance safety and efficiency:
High-Frequency Checkpointing (Paranoid Mode)
Medium-Frequency Checkpointing (Recommended)
Low-Frequency Checkpointing (Optimized)
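The three levels can be expressed as a small policy table that decides whether a given event triggers a checkpoint. The text does not spell out which events each level covers, so the mapping below is an assumption built from the strategies named earlier (tool calls, user confirmations, risky operations); treat it as a starting point to adjust, not as StateBase's definition.

```python
from enum import Enum


class Frequency(Enum):
    HIGH = "high"      # paranoid mode
    MEDIUM = "medium"  # recommended
    LOW = "low"        # optimized


# Assumed mapping from level to the events that trigger a checkpoint.
CHECKPOINT_EVENTS: dict[Frequency, set[str]] = {
    Frequency.HIGH: {"turn", "tool_call", "user_confirmation", "risky_operation"},
    Frequency.MEDIUM: {"tool_call", "user_confirmation", "risky_operation"},
    Frequency.LOW: {"risky_operation"},
}


def should_checkpoint(level: Frequency, event: str) -> bool:
    """Return True if `event` triggers a checkpoint at the given level."""
    return event in CHECKPOINT_EVENTS[level]
```

Keeping the policy in data rather than scattered conditionals makes it simple to change the trade-off per agent or per environment.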
Monitoring Rollback Frequency
If you’re rolling back frequently, it’s a sign your agent needs improvement:
- Warning threshold: 5%
- Critical threshold: 10%
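A simple health check can turn these thresholds into a status. The sketch assumes the rate is rollbacks divided by total state updates and that a rate exactly at a threshold counts as crossing it; neither detail is specified above, and `rollback_health` is a hypothetical helper.

```python
def rollback_health(rollbacks: int, state_updates: int,
                    warning: float = 0.05, critical: float = 0.10) -> str:
    """Classify the rollback rate as "ok", "warning", or "critical"."""
    if state_updates <= 0:
        return "ok"  # nothing recorded yet, so there is no rate to judge

    # Assumed definition: share of state updates that were rollbacks.
    rate = rollbacks / state_updates
    if rate >= critical:
        return "critical"
    if rate >= warning:
        return "warning"
    return "ok"
```

For example, 7 rollbacks across 100 state updates is a 7% rate, which falls between the two thresholds and yields `"warning"`.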
Best Practices
✅ Do This
- Checkpoint before risky operations (deletions, payments, API calls)
- Include reasoning in every checkpoint (helps with debugging)
- Use forking for debugging (don’t modify production sessions)
- Monitor rollback frequency (it’s a health metric)
❌ Avoid This
- Don’t checkpoint every turn (wasteful unless state actually changes)
- Don’t roll back without understanding why (you’ll repeat the same mistake)
- Don’t delete checkpoint history (it’s your audit trail)
Next Steps
- Replay & Audit: Learn how to replay conversations for debugging
- Failure Modes: Understand common agent failure patterns
- Production Playbook: Advanced checkpointing strategies
Key Takeaway: Checkpoints are your time machine. Use them strategically to make your agents resilient to LLM non-determinism.