Installation
BanditDB has two components: the Rust engine and the Python SDK. Start the engine first, then install the SDK.
1. Start the Engine
Binary (fastest, no Docker required)
curl -fsSL https://raw.githubusercontent.com/dynamicpricing-ai/banditdb/main/scripts/install.sh | sh
banditdb
curl http://localhost:8080/health # {"status":"ok"}
Docker
docker run -d -p 8080:8080 \
-e BANDITDB_API_KEY=your-secret-key \
-v banditdb_data:/data \
simeonlukov/banditdb:latest
Docker Compose (recommended for production)
Includes persistence, RBAC, and auto-checkpointing:
curl -fsSL https://raw.githubusercontent.com/dynamicpricing-ai/banditdb/main/docker-compose.yml -o docker-compose.yml
# Set BANDITDB_API_KEYS=admin-key=admin;app-key=writer in .env
docker compose up -d
Key environment variables:
| Variable | Default | Description |
|---|---|---|
| BANDITDB_API_KEYS | (none) | Multi-key RBAC: key1=admin;key2=writer;key3=reader |
| BANDITDB_API_KEY | (none) | Legacy single admin key (dev only) |
| BANDITDB_CHECKPOINT_INTERVAL | 5000 | Auto-checkpoint after N rewarded events |
| BANDITDB_MAX_WAL_SIZE_MB | 100 | Also checkpoint when WAL exceeds N MB |
| DATA_DIR | /data | WAL, checkpoint, and Parquet export directory |
| PORT | 8080 | HTTP listen port |
| LOG_FORMAT | (none) | Set to json for structured production logs |
| BANDITDB_RATE_LIMIT_PER_SEC | 1000 | Per-key request rate limit |
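A malformed BANDITDB_API_KEYS value is easy to miss until requests start failing. The sketch below validates a value before deployment, assuming the `key=role;key=role` format and the three roles shown in the table (admin, writer, reader). The helper `parse_api_keys` is hypothetical and not part of BanditDB.

```python
VALID_ROLES = {"admin", "writer", "reader"}  # roles named in the table above


def parse_api_keys(value: str) -> dict[str, str]:
    """Parse 'key1=admin;key2=writer' into {key: role}. Hypothetical helper."""
    keys: dict[str, str] = {}
    for entry in filter(None, (part.strip() for part in value.split(";"))):
        # Split on the LAST '=' so that keys containing '=' still parse.
        key, sep, role = entry.rpartition("=")
        if not sep or not key:
            raise ValueError(f"expected key=role, got {entry!r}")
        if role not in VALID_ROLES:
            raise ValueError(f"unknown role {role!r} for key {key!r}")
        if key in keys:
            raise ValueError(f"duplicate key {key!r}")
        keys[key] = role
    return keys


print(parse_api_keys("admin-key=admin;app-key=writer;dash-key=reader"))
```

Running this against the value in your .env file before `docker compose up` should catch typos such as a missing role or a stray separator.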
Kubernetes (Helm)
helm install banditdb ./helm/banditdb \
--set auth.apiKeys="admin-key=admin,app-key=writer" \
--set persistence.size=10Gi \
--set ingress.enabled=true \
--set ingress.host=banditdb.example.com
The Helm chart includes a PersistentVolumeClaim, a Secret for API keys, liveness/readiness probes, and a Recreate deployment strategy, which is required for single-writer WAL safety.
Build from source
git clone https://github.com/dynamicpricing-ai/banditdb
cd banditdb
cargo build --release # standard build
# cargo build --release --features neural # with NeuralLinUCB
./target/release/banditdb
2. Install the SDK
Python
pip install banditdb-python
from banditdb import Client
db = Client("http://localhost:8080", api_key="your-secret-key")
HTTP (no SDK)
Every SDK call is a plain JSON HTTP request. Use X-Api-Key header when auth is enabled:
curl -s http://localhost:8080/health
# {"status":"ok",...}
curl -s -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-H "X-Api-Key: your-secret-key" \
-d '{"campaign_id":"my-campaign","context":[1.0,0.5,0.3]}'
Run
Start the engine, verify it's alive, create your first campaign.
Start
Binary
# Minimal: no auth, in-memory only (dev)
banditdb
# With persistence and a single admin key
DATA_DIR=/var/lib/banditdb BANDITDB_API_KEY=secret banditdb
# Full production flags
DATA_DIR=/var/lib/banditdb \
BANDITDB_API_KEYS="admin-key=admin;app-key=writer;dash-key=reader" \
BANDITDB_CHECKPOINT_INTERVAL=5000 \
BANDITDB_MAX_WAL_SIZE_MB=100 \
LOG_FORMAT=json \
PORT=8080 \
banditdb
Docker Compose
docker compose up -d # start in background
docker compose logs -f # follow logs
docker compose down # stop
Verify
curl http://localhost:8080/health
# {"status":"ok","campaigns":{}}
Status values:
| Status | HTTP | Meaning |
|---|---|---|
| ok | 200 | All campaigns healthy, WAL writing normally. |
| degraded | 200 | One or more campaigns have entropy collapse. Service still running; this is a data quality issue only. |
| degraded: wal unavailable | 503 | WAL writer failed. Predictions still served from memory but nothing is persisted. Treat as a service failure. |
Logs
# Human-readable (default)
banditdb
# Structured JSON for log aggregators (CloudWatch, Datadog, Splunk)
LOG_FORMAT=json banditdb
Key log lines to watch on startup:
INFO banditdb: recovered 3 campaigns from checkpoint (offset=48291)
INFO banditdb: replayed 142 WAL events
INFO banditdb: listening on 0.0.0.0:8080
If checkpoint recovery fails the engine exits with a clear error rather than starting in a broken state. Check DATA_DIR permissions and that checkpoint.json is not corrupt.
First Campaign
# Create a campaign
curl -s -X POST http://localhost:8080/campaign \
-H "Content-Type: application/json" \
-H "X-Api-Key: your-secret-key" \
-d '{
"campaign_id": "homepage-hero",
"arms": ["variant_a", "variant_b", "variant_c"],
"feature_dim": 4,
"alpha": 1.0
}'
# Confirm it exists
curl -s http://localhost:8080/campaigns \
-H "X-Api-Key: your-secret-key"
Checkpoint and Stop
# Flush WAL, export Parquet, rotate. Do this before stopping.
curl -s -X POST http://localhost:8080/checkpoint \
-H "X-Api-Key: your-secret-key"
# Then stop
docker compose down # or kill the process
Always checkpoint before a planned shutdown. The WAL preserves everything since the last checkpoint, but checkpointing first keeps the WAL short and makes the next startup faster.
Quick Start
A complete predict→reward cycle for a sleep improvement campaign.
Sleep Improvement
One-size-fits-all sleep advice ignores individual physiology. A 25-year-old male athlete and a 60-year-old sedentary woman respond differently to the same environmental change. BanditDB learns those differences automatically, routing each participant to the intervention most likely to work for their profile and improving with every reported outcome.
from banditdb import Client
db = Client("http://localhost:8080", api_key="your-secret-key")
# 1. Create the campaign once at startup
db.create_campaign(
"sleep",
arms=["decrease_temperature", "decrease_light", "decrease_noise"],
feature_dim=5,
metadata={
"owner": "wellness-team",
"features": ["sex", "age_norm", "weight_norm", "activity", "bedtime_norm"],
},
)
# 2. A participant is ready for tonight's intervention.
# Context: [sex, age/100, weight_kg/150, activity in 0-1, bedtime_hour/24]
context = [
1.0, # female
0.35, # age 35
0.50, # 75 kg
0.60, # moderately active
0.96, # bedtime 23:00
]
# 3. Ask BanditDB which intervention to apply
arm, interaction_id = db.predict("sleep", context)
print(f"Tonight's intervention: {arm}") # e.g., "decrease_temperature"
# 4. Apply the intervention, then reward the next morning
score_before = 62
score_after = 79
reward = (score_after - score_before) / score_before # ≈ 0.27
reward = max(0.0, min(1.0, reward)) # clamp: a worse score would otherwise give a negative reward
db.reward(interaction_id, reward)
Rewards must be normalised to [0, 1]. Divide your business metric by its maximum possible value, or use a ratio like the example above.
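The two normalisation options mentioned above can be sketched as small helpers. This is an illustration only: the function names are hypothetical, and clamping out-of-range values to the interval ends is an assumption about how you want to treat regressions and outliers.

```python
def clamp01(value: float) -> float:
    """Force a value into the closed interval [0, 1]."""
    return max(0.0, min(1.0, value))


def scaled_reward(metric: float, max_value: float) -> float:
    """Divide a business metric by its maximum possible value."""
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    return clamp01(metric / max_value)


def relative_gain_reward(before: float, after: float) -> float:
    """Relative improvement over a baseline; regressions become 0.0."""
    if before <= 0:
        raise ValueError("before must be positive")
    return clamp01((after - before) / before)


print(round(relative_gain_reward(62, 79), 2))  # 0.27
print(relative_gain_reward(79, 62))            # 0.0 (score got worse)
```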
Native Agent Tool Use (MCP)
Give any Claude-based agent persistent decision memory via the Model Context Protocol.
Standard LLM agents are stateless: if they route a task to the wrong model and fail, they repeat the same mistake tomorrow. BanditDB's built-in MCP server gives the entire agent swarm shared persistent memory.
Add the server to your Claude Desktop config at ~/Library/Application Support/Claude/claude_desktop_config.json:
{
"mcpServers": {
"banditdb": {
"command": "banditdb-mcp",
"env": {
"BANDITDB_URL": "http://localhost:8080",
"BANDITDB_API_KEY": "your-secret-key"
}
}
}
}
The agent now has five tools:
| Tool | What it does |
|---|---|
| create_campaign | Create a new decision campaign. Accepts algorithm, alpha, and an optional free-form metadata object for attaching context (feature names, owner, version, etc.). |
| list_campaigns | List all active campaigns with arm count, alpha, and metadata. |
| campaign_diagnostics | Per-arm theta_norm, prediction count, reward rate, plus selection_entropy, entropy_status, entropy_trend, and a suggested_action when collapse is detected. Use when a campaign isn't learning or you suspect exploration has stopped. |
| get_intuition | Returns the recommended action and an interaction_id to save. |
| record_outcome | Reports success (1.0) or failure (0.0) and updates the shared model. |
Every agent in a swarm shares the same BanditDB instance, so the learned model improves with every interaction across the entire fleet.
Choosing an Algorithm
LinUCB and Linear Thompson Sampling share identical per-arm state. Switching algorithms is a single field at campaign creation.
| Algorithm | Value | How it explores | When to use |
|---|---|---|---|
| LinUCB | "linucb" (default) | Deterministic UCB bonus: θ·x + α·√(x·A⁻¹·x) | Predictable, tunable. Sweep alpha offline to calibrate exploration. |
| Linear Thompson Sampling | "thompson_sampling" | Samples θ̃ ~ N(θ, α²·A⁻¹), scores by θ̃·x | Natural Bayesian exploration. alpha=1.0 is the principled default. Concurrent users automatically diversify arm coverage. |
| NeuralLinUCB | {"neural_lin_ucb": {…}} | Identical UCB formula to LinUCB, but applied to a learned embedding h(x; W) instead of raw context. MLP retrained in batch at checkpoint time once retrain_every rewards accumulate. | Non-linear or high-dimensional contexts where LinUCB has plateaued. Requires the neural feature flag at compile time. |
| Progressive (Tournament) | {"progressive": {"base": "linucb", "challenger": "neural_lin_ucb"}} | Runs both models in parallel. Uses the better-performing model for 90% of traffic. | The recommended default. Handles the cold-start with Linear and upgrades to Neural automatically when data justifies it. |
# Progressive: the internal tournament handled by BanditDB
db.create_campaign("auto_agent", ["model_a", "model_b"], feature_dim=1536,
algorithm={"progressive": {"base": "linucb", "challenger": "neural_lin_ucb"}})
The algorithm field is stored in both the WAL and checkpoint files. Old WAL records and checkpoints without an algorithm field recover as "linucb" automatically.
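To make the LinUCB row of the table concrete, here is a dependency-free sketch of the score θ·x + α·√(x·A⁻¹·x) and the resulting arm choice. It illustrates the formula only; the data layout (`arms` as a dict of `(theta, a_inv)` pairs) and the function names are assumptions, not the engine's internals.

```python
import math


def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linucb_score(theta, a_inv, x, alpha):
    """Expected reward theta·x plus the exploration bonus alpha·sqrt(x·A⁻¹·x)."""
    bonus = math.sqrt(max(0.0, dot(x, mat_vec(a_inv, x))))
    return dot(theta, x) + alpha * bonus


def choose_arm(arms, x, alpha=1.0):
    """arms maps arm name -> (theta, a_inv). Returns the highest-scoring arm."""
    return max(arms, key=lambda name: linucb_score(*arms[name], x, alpha))


arms = {
    # Many observations: small A⁻¹, so a small exploration bonus.
    "well_explored": ([0.5, 0.1], [[0.01, 0.0], [0.0, 0.01]]),
    # No observations yet: A⁻¹ is the identity, so a large bonus.
    "unexplored": ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]),
}
x = [1.0, 0.5]
print(choose_arm(arms, x, alpha=0.0))  # well_explored (pure exploitation)
print(choose_arm(arms, x, alpha=1.0))  # unexplored (bonus dominates)
```

The two calls show why alpha matters: with alpha=0 the arm with the best estimate always wins, while a larger alpha sends traffic to arms whose estimates are still uncertain.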
Model Architecture
Why BanditDB uses the Neural-Linear handoff instead of end-to-end Deep RL.
Why Neural-Linear?
In the literature, this approach is often called the "Golden Middle" of Contextual Bandits. End-to-end Deep Reinforcement Learning (like PPO or Soft Actor-Critic) is notoriously unstable and requires thousands of interactions before it makes even basic sense of its environment. BanditDB is designed for production databases where the first 50 decisions are just as important as the 5,000th.
1. Latency vs. Accuracy
A full forward pass of a deep network for every arm is slow (10ms+). By using a Neural Feature Extractor with a Linear Head, BanditDB achieves sub-millisecond latencies. The heavy neural retraining happens in the background, while the live prediction path is a simple, ultra-fast vector dot product (θᵀh).
2. Statistical Rigor (The Closed-Form Variance)
Deep networks are "black boxes" that don't natively understand their own uncertainty. To explore efficiently, they often rely on heuristics like epsilon-greedy or dropout. Because BanditDB's final layer is linear, it can compute an exact closed-form variance for every prediction. This allows us to use UCB and Thompson Sampling with mathematical certainty, leading to lower cumulative regret.
3. The Internal Tournament
The Progressive algorithm is our unique technical moat. It runs a continuous internal tournament between a Linear model (safe, sample-efficient) and a Neural model (powerful, non-linear).
- Shadow Learning: Every reward received by the database updates both models simultaneously.
- Autonomous Handoff: Using Mean Squared Error (MSE) over the interaction buffer, the engine autonomously detects the exact moment the Neural model has "surpassed" the Linear model's intuition.
- Self-Correction: If the Neural model diverges or its performance drops due to noise, the tournament automatically demotes it back to the safe Linear baseline.
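The tournament logic above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the buffer is a plain list of rewards with each model's predictions for the same interactions, the comparison is a straight MSE, and the 10% of traffic not served by the leader goes to the other model. The function names are hypothetical.

```python
def mse(predictions, rewards):
    """Mean squared error of predicted vs observed rewards."""
    return sum((p - r) ** 2 for p, r in zip(predictions, rewards)) / len(rewards)


def tournament_leader(linear_preds, neural_preds, rewards):
    """Return 'neural' only when it beats the linear model on the buffer."""
    if not rewards:
        return "linear"  # cold start: no evidence yet, stay on the safe baseline
    if mse(neural_preds, rewards) < mse(linear_preds, rewards):
        return "neural"
    return "linear"  # also covers demotion when the neural model degrades


def route(leader, u, leader_share=0.9):
    """Pick the serving model. u is a uniform draw in [0, 1)."""
    other = "linear" if leader == "neural" else "neural"
    return leader if u < leader_share else other


rewards = [1.0, 0.0, 1.0]
leader = tournament_leader([0.5, 0.5, 0.5], [0.9, 0.1, 0.8], rewards)
print(leader, route(leader, u=0.42), route(leader, u=0.95))
```

Because the leader is recomputed from the buffer, promotion and demotion are the same operation: whichever model currently has the lower error takes the larger traffic share.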
NeuralLinUCB
An MLP preprocessing stage that gives LinUCB a learnable, non-linear view of the context.
Standard LinUCB assumes the expected reward is a linear function of the raw context, which breaks down when features interact non-linearly or when the context vector carries redundant dimensions. NeuralLinUCB adds a small MLP that projects the caller's context into a compact fixed-size embedding; LinUCB then operates entirely in that embedding space. The network is retrained in batches at checkpoint time, keeping the hot predict/reward path purely online with no training overhead per request.
Two algorithms run in tandem:
- Algorithm 1 (online, every request): Runs the same Sherman-Morrison LinUCB update used by "linucb", but on the MLP embedding rather than raw context. No added latency: the MLP forward pass is a single matrix multiply chain.
- Algorithm 2 (batch, at checkpoint): Once retrain_every rewards have accumulated, AdamW minimises ½·MSE(θᵀh(x;W), r) + (m·λ/2)·‖W − W₀‖² for retrain_steps steps. The L2 penalty toward the initialisation weights W₀ prevents catastrophic forgetting across retrain cycles. Arm matrices are then re-accumulated (replayed through the new embedding via Sherman-Morrison) before WAL rotation.
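The online step (Algorithm 1) relies on the Sherman-Morrison identity, which updates A⁻¹ after a rank-one change A ← A + h·hᵀ without re-inverting the matrix. The sketch below shows that update on plain Python lists. It demonstrates the identity itself; the function signature and data layout are assumptions rather than the engine's code.

```python
def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sherman_morrison_update(a_inv, b, h, reward):
    """Apply A <- A + h hᵀ and b <- b + reward·h. Returns (a_inv, b, theta)."""
    a_inv_h = mat_vec(a_inv, h)  # A⁻¹h (A⁻¹ is symmetric, so hᵀA⁻¹ is the same vector)
    denom = 1.0 + sum(hi * vi for hi, vi in zip(h, a_inv_h))
    n = len(h)
    new_a_inv = [
        [a_inv[i][j] - a_inv_h[i] * a_inv_h[j] / denom for j in range(n)]
        for i in range(n)
    ]
    new_b = [bi + reward * hi for bi, hi in zip(b, h)]
    theta = mat_vec(new_a_inv, new_b)  # theta = A⁻¹ b
    return new_a_inv, new_b, theta


# Fresh arm: A = I, b = 0. Observe embedding h = [1, 0] with reward 1.
a_inv, b, theta = sherman_morrison_update(
    [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [1.0, 0.0], 1.0
)
print(a_inv)  # [[0.5, 0.0], [0.0, 1.0]]
print(theta)  # [0.5, 0.0]
```

Each update costs O(d²) in the embedding dimension, which is why a smaller embed_dim makes the online path faster.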
Configuration
| Parameter | Default | Meaning |
|---|---|---|
| context_dim | required | Length of the context vector the caller sends at predict time. Fixed at campaign creation. |
| embed_dim | 32 | Embedding (arm matrix) dimension. Arm matrices are embed_dim × embed_dim. Smaller = faster online updates; larger = more expressive. 16-64 covers most use cases. |
| hidden_dim | 128 | Width of each hidden layer. |
| hidden_layers | 2 | Number of hidden layers before the final embedding projection. |
| retrain_every | 200 | Rewards to accumulate before triggering Algorithm 2. Larger = fewer retrains, more stable arm matrices between checkpoints. |
| retrain_steps | 100 | AdamW gradient-descent steps per retrain cycle. |
| learning_rate | 0.001 | AdamW learning rate. |
| lambda | 1.0 | L2 regularization coefficient toward initial weights W₀. Higher = slower adaptation, more stable. Lower = allows faster weight drift between cycles. |
db.create_campaign(
    "content",
    arms=["article_a", "article_b", "article_c"],
    algorithm={
        "neural_lin_ucb": {
            "context_dim": 32, # must match the context vector you send at predict time
            "embed_dim": 16, # arm matrices are 16 × 16
            "hidden_dim": 128,
            "hidden_layers": 2,
            "retrain_every": 500, # first retrain after 500 rewards
            "retrain_steps": 100,
            "learning_rate": 1e-3,
            "lambda": 1.0,
        }
    },
    alpha=1.0,
)
# context must be exactly context_dim=32 floats
arm, interaction_id = db.predict("content", context)
db.reward(interaction_id, reward)
Feature flag: NeuralLinUCB is compiled in only when the neural feature is enabled. The default binary contains no Candle or ML dependencies.
cargo build --release --features neural
cargo build --release --features neural,cuda
cargo build --release --features neural,metal
Hardware Acceleration
The compute device is selected at startup via the BANDITDB_DEVICE environment variable. When unset, BanditDB auto-detects: CUDA → Metal → CPU.
| Value | Requires | Effect |
|---|---|---|
| unset or auto | (none) | CUDA if available, then Metal, then CPU. |
| cuda | --features neural,cuda | CUDA GPU 0. |
| cuda:N | --features neural,cuda | Specific CUDA device ordinal. |
| metal | --features neural,metal | Apple Silicon GPU. |
| cpu | --features neural | Force CPU regardless of available hardware. |
Cold Start and Warm Restart
Cold start: a freshly created campaign uses randomly-initialised MLP weights until the first Algorithm 2 retrain fires at retrain_every rewards. Random projections are a usable baseline (LinUCB adapts even in a random embedding space), but expect lower sample efficiency than a trained network during this initial phase.
Warm restart: after each retrain, arm matrices (A⁻¹, b, θ) are re-accumulated by replaying the reward buffer through the updated network. This preserves most of the learned signal across retrains instead of resetting LinUCB statistics from scratch.
Recovery: MLP weights are saved as a .safetensors sidecar at {DATA_DIR}/neural/{campaign_id}.safetensors during each checkpoint and reloaded automatically on startup. If the sidecar is missing, the campaign recovers gracefully with fresh random weights and relearns from the WAL tail. Model quality degrades temporarily but no data is lost.
The Data Science Escape Hatch
Every interaction is event-sourced. Export to Parquet and evaluate policies offline.
POST /checkpoint compiles completed prediction-reward pairs into Snappy-compressed Apache Parquet files (one file per campaign) for offline analysis in Pandas or Polars.
Every prediction will eventually appear in the Parquet file even if its reward arrives hours later. BanditDB re-emits in-flight interactions at each checkpoint so delayed rewards are always captured in a future cycle.
Each row includes a propensity column: the probability that the logging policy selected the chosen arm given the context. This is the P(a | x) term required by Inverse Propensity Scoring estimators. The method differs by algorithm:
- LinUCB / NeuralLinUCB: softmax-normalised UCB scores across all arms at prediction time.
- Thompson Sampling: adaptive Monte Carlo frequency estimate. N posterior samples are drawn per arm; propensity is the fraction of draws in which the arm produced the highest score. N adapts from 64 (cold start, diffuse posterior) down to 8 (converged, concentrated posterior) based on the maximum A⁻¹ diagonal across arms, so accuracy is highest when it is most needed and cost is lowest when traffic is highest.
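For the LinUCB case, "softmax-normalised UCB scores" can be illustrated as follows. This sketch assumes a plain softmax with no temperature parameter, which the text does not specify, and the function name is hypothetical.

```python
import math


def softmax_propensities(scores):
    """Map {arm: ucb_score} to {arm: probability} with a numerically stable softmax."""
    top = max(scores.values())  # subtract the max so exp() cannot overflow
    exps = {arm: math.exp(score - top) for arm, score in scores.items()}
    total = sum(exps.values())
    return {arm: value / total for arm, value in exps.items()}


ucb_scores = {
    "decrease_temperature": 0.9,
    "decrease_light": 0.4,
    "decrease_noise": 0.4,
}
propensities = softmax_propensities(ucb_scores)
for arm, p in propensities.items():
    print(f"{arm}: {p:.3f}")
```

The propensities always sum to 1, and arms with equal scores receive equal probability, which is what an IPS estimator needs from the P(a | x) column.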
import polars as pl
import requests
HEADERS = {"X-Api-Key": "your-secret-key"}
# Snapshot models, export Parquet, rotate the WAL
requests.post("http://localhost:8080/checkpoint", headers=HEADERS)
# Flat schema: interaction_id | arm_id | reward | predicted_at | rewarded_at | propensity | feature_0 …
df = pl.read_parquet("/data/exports/sleep.parquet")
print(df.head())
Offline Policy Evaluation
The Python SDK ships three OPE estimators in banditdb.eval. Install with:
pip install "banditdb-python[eval]"
| Estimator | Function | When to use |
|---|---|---|
| Replay | replay(df) | Sanity check baseline. Unbiased but low coverage (~1/K interactions used). |
| IPS / SNIPS | ips(df, clip=10.0) | Primary estimator. Uses every interaction with importance weights. |
| Doubly Robust | doubly_robust(df, clip=10.0) | Best statistical efficiency. Use when comparing multiple policies or sweeping alpha. |
import polars as pl
from banditdb.eval import replay, ips, doubly_robust
df = pl.read_parquet("/data/exports/sleep.parquet")
print(replay(df))
# OPEResult(method='replay', estimate=0.4821, std_error=0.0312, coverage=22.1% [33/149])
print(ips(df))
# OPEResult(method='ips', estimate=0.5103, std_error=0.0187, coverage=100.0% [149/149])
print(doubly_robust(df))
# OPEResult(method='doubly_robust', estimate=0.5219, std_error=0.0141, coverage=100.0% [149/149])
# Compare against the observed reward of the logging policy:
print("Observed:", df["reward"].mean())
# If observed >> estimate, the campaign has learned something real.
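To show what the IPS and SNIPS estimators compute, here is a minimal pure-Python sketch over rows shaped like the exported schema (arm_id, reward, propensity). It is not the implementation in banditdb.eval: the `target_policy` callback and the function name are assumptions made for illustration.

```python
def ips_estimates(rows, target_policy, clip=10.0):
    """Return (ips, snips) for a deterministic target policy.

    rows: dicts with 'arm_id', 'reward', 'propensity'.
    target_policy: function(row) -> arm the evaluated policy would choose.
    """
    weights, weighted_rewards = [], []
    for row in rows:
        matches = target_policy(row) == row["arm_id"]
        # Importance weight 1/propensity, clipped to limit variance.
        weight = min(clip, 1.0 / row["propensity"]) if matches else 0.0
        weights.append(weight)
        weighted_rewards.append(weight * row["reward"])
    ips = sum(weighted_rewards) / len(rows)
    total_weight = sum(weights)
    snips = sum(weighted_rewards) / total_weight if total_weight > 0 else float("nan")
    return ips, snips


rows = [
    {"arm_id": "a", "reward": 1.0, "propensity": 0.5},
    {"arm_id": "b", "reward": 0.0, "propensity": 0.5},
    {"arm_id": "a", "reward": 0.0, "propensity": 0.25},
    {"arm_id": "b", "reward": 1.0, "propensity": 0.75},
]
ips_value, snips_value = ips_estimates(rows, target_policy=lambda row: "a")
print(ips_value, round(snips_value, 4))  # 0.5 0.3333
```

SNIPS divides by the sum of weights rather than the row count, trading a small bias for lower variance when a few rows carry large weights.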
Inspecting the WAL
The WAL is plain JSONL: every event is human-readable on disk.
# All campaigns ever created
grep "CampaignCreated" /data/bandit_wal.jsonl | jq '.CampaignCreated.campaign_id'
# Campaigns that have been deleted
grep "CampaignDeleted" /data/bandit_wal.jsonl | jq '.CampaignDeleted.campaign_id'
How Recovery Works
BanditDB survives crashes and restarts automatically. No manual intervention required.
The Two Files
| File | Purpose |
|---|---|
| checkpoint.json | Snapshot of all campaign matrices (A⁻¹, b, θ, counts) at a specific WAL byte offset. |
| bandit_wal.jsonl | Append-only event log: CampaignCreated, Predicted, Rewarded, CampaignDeleted. |
Phase 1: Load the Checkpoint
If checkpoint.json exists, BanditDB reads it and restores all campaign matrices directly into memory. There is no replaying at this stage, just deserialisation. The checkpoint records the WAL byte offset at which it was taken.
If no checkpoint exists, BanditDB starts from an empty state and replays the entire WAL from byte 0.
Phase 2: Replay the WAL Tail
BanditDB opens bandit_wal.jsonl, seeks to the checkpoint's byte offset, and replays every event written after that point. One edge case: after WAL rotation the stored offset may exceed the current file size. BanditDB detects this and seeks to byte 0 instead.
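The seek-and-replay step, including the rotation edge case, can be sketched as below. This is a Python illustration of the described behaviour, not the Rust engine's code; the function names are hypothetical and the field names inside the Rewarded event are invented for the example.

```python
import io
import json


def replay_start(checkpoint_offset, wal_size):
    """An offset past the end of the file means the WAL was rotated: start at 0."""
    return 0 if checkpoint_offset > wal_size else checkpoint_offset


def replay_tail(wal, checkpoint_offset):
    """Yield decoded events written after the checkpoint. wal is a binary file object."""
    wal_size = wal.seek(0, io.SEEK_END)
    wal.seek(replay_start(checkpoint_offset, wal_size))
    for line in wal:
        if line.strip():
            yield json.loads(line)


first = b'{"CampaignCreated": {"campaign_id": "sleep"}}\n'
second = b'{"Rewarded": {"interaction_id": "abc", "reward": 0.27}}\n'
wal = io.BytesIO(first + second)

# Checkpoint taken after the first event: only the second is replayed.
print([next(iter(e)) for e in replay_tail(wal, len(first))])
# Stored offset exceeds the file size (post-rotation): replay from byte 0.
print([next(iter(e)) for e in replay_tail(wal, 10_000)])
```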
Data Loss Window
Everything in the WAL is durable. The WAL writer flushes after every write burst and fsyncs before acknowledging a checkpoint. A crash between checkpoints is fully recovered by replaying the WAL tail.
The only data at risk is in-flight predictions: interactions predicted but not yet rewarded at the moment of a crash. These live in the Moka TTL cache in memory. After a crash those interaction IDs are lost and any reward sent for them will return 404.
Mitigate this by checkpointing frequently. BanditDB re-emits in-flight predictions into the WAL tail at each checkpoint, so rewards arriving before the next crash are captured.
What POST /checkpoint Does
- Flush barrier: drains all pending events and fsyncs to disk, responds with confirmed byte offset.
- Snapshot: serialises all campaign matrices to checkpoint.tmp, atomically renames to checkpoint.json.
- Parquet export: joins Predicted + Rewarded events, writes matched pairs as a timestamped shard per campaign. Unmatched predictions are re-emitted into the WAL tail.
- Neural retrain (NeuralLinUCB campaigns only): if retrain_every rewards have accumulated since the last retrain, runs Algorithm 2, re-accumulates arm matrices in the new embedding space, and saves .safetensors weights to {DATA_DIR}/neural/.
- WAL rotation: truncates WAL to only the tail. Pre-checkpoint history is no longer needed for recovery.
Parquet files are analytics exports only and are not used for recovery. Losing them does not affect model state. Recovery uses only checkpoint.json + bandit_wal.jsonl.
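The snapshot step uses the write-to-temp-then-rename pattern, which guarantees that a reader sees either the old checkpoint or the new one, never a half-written file. The sketch below shows the same pattern in Python. The engine itself is written in Rust, and the state fields here (`wal_offset`, `campaigns`) are placeholders for illustration.

```python
import json
import os
import tempfile


def write_checkpoint(data_dir, state):
    """Write state to checkpoint.tmp, make it durable, then rename into place."""
    tmp_path = os.path.join(data_dir, "checkpoint.tmp")
    final_path = os.path.join(data_dir, "checkpoint.json")
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # bytes must reach disk before the rename
    os.replace(tmp_path, final_path)  # atomic on POSIX within one filesystem
    return final_path


with tempfile.TemporaryDirectory() as data_dir:
    path = write_checkpoint(data_dir, {"wal_offset": 48291, "campaigns": {}})
    with open(path, encoding="utf-8") as f:
        restored = json.load(f)
    leftover_tmp = os.path.exists(os.path.join(data_dir, "checkpoint.tmp"))

print(restored["wal_offset"], leftover_tmp)  # 48291 False
```

A crash before the rename leaves the previous checkpoint.json untouched, which is why recovery can always trust whatever file it finds under that name.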
Recommended Production Setup
# Auto-checkpoint every 10,000 rewards
BANDITDB_CHECKPOINT_INTERVAL=10000
# Or cap WAL size (useful on edge deployments)
BANDITDB_MAX_WAL_SIZE_MB=50
# Back up the two recovery files on a schedule
cp /data/checkpoint.json /backup/checkpoint-$(date +%s).json
cp /data/bandit_wal.jsonl /backup/wal-$(date +%s).jsonl
To move BanditDB to a new host: copy checkpoint.json and bandit_wal.jsonl to the same DATA_DIR on the new machine and start. Recovery is automatic.
Observability & Production Monitoring
Know when the system is working, not just when it's running.
The Silent Failure Problem
A contextual bandit has a failure mode that produces no errors, no latency spikes, and no obvious operational signal: exploration collapse. The engine is healthy. Every prediction succeeds. Every reward matches. The WAL is flushing. But the model stopped learning days ago: one arm has absorbed all traffic and the system is silently serving a static policy.
For enterprise deployments this is a contractual and audit risk. You are paying for an adaptive system. Without explicit observability, you have no way to distinguish "the bandit has correctly converged to a winning arm" from "the bandit collapsed to a wrong arm two weeks ago and nobody noticed."
BanditDB surfaces this through three layered signals: live selection entropy per campaign, an aggregated health endpoint, and a causal validation pipeline.
Selection Entropy
Every call to GET /campaign/:id/diagnostics computes the normalised Shannon entropy of the arm selection distribution:
H = −Σᵢ pᵢ·log(pᵢ) / log(n_arms), where pᵢ is the share of predictions routed to arm i.
H = 1.0 means perfectly uniform selection across all arms. H = 0.0 means one arm receives all traffic. The normalisation by log(n_arms) makes the scale consistent regardless of how many arms a campaign has.
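The normalised entropy described above can be computed from per-arm prediction counts in a few lines. This is a sketch of the standard formula; returning 0.0 for an empty or single-arm campaign is an assumption, and whether the engine uses lifetime counts or a recent window is not stated here.

```python
import math


def selection_entropy(counts):
    """Normalised Shannon entropy of arm selection counts, in [0, 1]."""
    total = sum(counts)
    n_arms = len(counts)
    if total == 0 or n_arms < 2:
        return 0.0  # assumption: no meaningful distribution to measure
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return max(0.0, h / math.log(n_arms))


print(selection_entropy([100, 100, 100]))  # uniform selection
print(selection_entropy([300, 0, 0]))      # one arm takes everything
print(round(selection_entropy([280, 15, 5]), 3))
```

Dividing by log(n_arms) is what makes a three-arm and a ten-arm campaign comparable on the same 0 to 1 scale.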
Entropy Status and Guards
Raw entropy is not sufficient for alerting. A campaign with H = 0.12 could be healthy (one arm has genuinely won) or broken (the model collapsed before accumulating enough data). Two guard conditions prevent false positives before a status is computed:
| Guard | Condition | Effect |
|---|---|---|
| Guard 1: Convergence | Leading arm's 95% Wilson-score CI lower bound exceeds the second arm's upper bound (requires ≥ 30 rewards on both arms) | Status is forced to ok. Low entropy is the correct outcome when one arm has statistically won. |
| Guard 2: Minimum observations | Total predictions < 500 | Status is forced to ok. Early in a campaign, random variance naturally concentrates traffic; alerting here would be noise. |
When neither guard fires, the status is set by thresholds on H:
| entropy_status | Threshold | Meaning |
|---|---|---|
| ok | H ≥ 0.4, or either guard is active | Healthy exploration or confirmed convergence. |
| warning | 0.2 ≤ H < 0.4 | One arm is absorbing most traffic without a convergence signal. Investigate. |
| critical | H < 0.2 | Near-total collapse. likely_cause and suggested_action are always present. |
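The guards and thresholds combine into a small decision function. The sketch below follows the two tables directly. It assumes binary rewards for the Wilson interval (successes out of reward count), which may not match how the engine treats fractional rewards, and all function names are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a success rate (z = 1.96)."""
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - margin, centre + margin


def is_converged(leader, runner_up, min_rewards=30):
    """Guard 1. Each argument is a (successes, reward_count) pair."""
    if leader[1] < min_rewards or runner_up[1] < min_rewards:
        return False
    return wilson_interval(*leader)[0] > wilson_interval(*runner_up)[1]


def entropy_status(h, total_predictions, converged):
    """Apply Guard 1, Guard 2, then the thresholds on H."""
    if converged or total_predictions < 500:
        return "ok"
    if h >= 0.4:
        return "ok"
    return "warning" if h >= 0.2 else "critical"


print(is_converged((90, 100), (10, 100)))                  # True
print(entropy_status(0.09, 4821, converged=False))         # critical
print(entropy_status(0.09, 120, converged=False))          # ok (Guard 2)
```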
Entropy Trend
BanditDB stores an entropy snapshot at every checkpoint and compares it to the current value. This distinguishes a campaign that has been collapsed for months (data quality issue) from one that collapsed last night (operational incident).
| entropy_trend | Meaning |
|---|---|
| stable | Entropy changed by less than 0.1 since the last checkpoint. |
| falling | Dropped by more than 0.1, indicating a recent collapse. Likely cause is a pipeline event, deploy, or new cohort. Correlate with your deployment timeline. |
| recovering | Increased by more than 0.1: entropy is returning after a collapse. Monitor until stable. |
| unknown | No checkpoint has been written yet for this campaign. Run POST /checkpoint to establish a baseline. |
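The trend classification is a comparison between the current entropy and the snapshot stored at the last checkpoint. A minimal sketch follows. The table does not say how a change of exactly 0.1 is classified, so treating it as stable is an assumption, as is the function name.

```python
def entropy_trend(current, last_checkpoint, threshold=0.1):
    """Classify the change in entropy since the last checkpoint snapshot."""
    if last_checkpoint is None:
        return "unknown"  # no baseline yet: run POST /checkpoint first
    delta = current - last_checkpoint
    if delta < -threshold:
        return "falling"
    if delta > threshold:
        return "recovering"
    return "stable"


print(entropy_trend(0.09, 0.55))   # falling
print(entropy_trend(0.60, 0.30))   # recovering
print(entropy_trend(0.50, 0.52))   # stable
print(entropy_trend(0.50, None))   # unknown
```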
Full Diagnostics Response
A campaign with critical entropy returns a self-explanatory triage payload alongside all existing diagnostics:
curl http://localhost:8080/campaign/prices/diagnostics \
-H "X-Api-Key: your-key"
{
"campaign_id": "prices",
"selection_entropy": 0.09,
"entropy_status": "critical",
"entropy_trend": "falling",
"converged": false,
"likely_cause": "recent_collapse",
"suggested_action": "Entropy dropped since last checkpoint. Check reward pipeline for bugs or recent config changes.",
"total_predictions": 4821,
"total_rewards": 312,
"arm_stats": {
"price_10": { "predictions": 4698, "rewards": 301, "avg_reward": 0.61 },
"price_15": { "predictions": 89, "rewards": 8, "avg_reward": 0.58 },
"price_20": { "predictions": 34, "rewards": 3, "avg_reward": 0.55 }
}
}
likely_cause and suggested_action are only present when entropy_status is warning or critical. They are omitted entirely when the campaign is healthy.
Operational Recipes
When an alert fires, the first step is always to look at the diagnostics: per-arm predictions, rewards, avg_reward, converged, and entropy_trend together identify which of five scenarios is occurring.
| Scenario | Signals | Action |
|---|---|---|
| Legitimate convergence | converged: true, all arms have substantial reward counts, losing arms had a fair chance (predictions > 300 each) | No action. Guard 1 should have suppressed the alert. If it fired, verify reward count thresholds are met. Consider archiving the campaign. |
| Early lock-in | Total predictions < 2000, one or more arms have fewer than 50 predictions, converged: null | The winning arm got a lucky early lead before others had enough data. Increase alpha to boost the UCB exploration bonus. If the campaign is too young to reset, wait: entropy may recover naturally. Reset and restart if data is cheap to regenerate. |
| Reward pipeline event | entropy_trend: "falling", collapse is recent, often correlates with a deploy or config change. One arm's reward rate is anomalously high relative to its base rate. | Fix the pipeline before touching the campaign. Resetting while the bug is live just restarts the corruption. Once the pipeline is clean, assess whether the corrupted observations can be discarded or whether a full campaign reset is warranted. |
| New cohort after collapse | Long-running campaign, entropy was healthy for months then gradually declined, correlates with a new user segment or product change. The collapsed arm may be contextually correct for the original cohort but wrong for the new one. | Do not reset, because that discards valid learning for the original cohort. Create a separate campaign for the new segment, or add new arms representing the new segment's hypotheses. |
| Alpha misconfiguration | Collapse occurs within the first 200-500 predictions, entropy_trend: "falling" from the very beginning, all arms except one have near-zero observations. Diagnosed by reviewing campaign creation parameters. | The UCB exploration bonus was too small from the start. Recreate the campaign with a higher alpha. The cost is low because the campaign has few observations. |
Health Endpoint Integration
GET /health aggregates entropy status across all active campaigns and is designed for integration with load balancers, uptime monitors, and Kubernetes probes. It requires no authentication.
curl http://localhost:8080/health
{
"status": "degraded",
"campaigns": {
"prices": { "entropy": 0.09, "status": "critical" },
"recommendations": { "entropy": 0.71, "status": "ok" },
"onboarding": { "entropy": 0.34, "status": "warning" }
}
}
| HTTP status | Overall status | Meaning |
|---|---|---|
| 200 | ok | All active campaigns have healthy entropy (or are statistically converged). |
| 200 | degraded | One or more campaigns have warning or critical entropy. The service is still serving correctly: this is a data quality signal, not a service failure. Do not remove from load balancer rotation. |
| 503 | degraded: wal unavailable | The WAL writer has encountered an unrecoverable I/O error. Predictions will continue to be served from memory but no new events are being persisted. Treat as a service failure. |
The deliberate choice of HTTP 200 for entropy degradation, rather than 503, means a k8s readiness probe or load balancer health check will not remove the instance from rotation just because a campaign's model quality has degraded. Service availability and model quality are separate failure modes and should be monitored separately.
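A monitoring script can keep the two failure modes apart by reading the HTTP status for availability and the response body for model quality. The sketch below works on an already-fetched status code and parsed JSON body shaped like the sample response; the function name and return shape are illustrative choices.

```python
def interpret_health(http_status, body):
    """Return (keep_in_rotation, campaigns_needing_attention)."""
    if http_status != 200:
        # 503 means the WAL writer failed: a service-level problem.
        return False, []
    flagged = sorted(
        name
        for name, info in body.get("campaigns", {}).items()
        if info.get("status") in ("warning", "critical")
    )
    return True, flagged


sample = {
    "status": "degraded",
    "campaigns": {
        "prices": {"entropy": 0.09, "status": "critical"},
        "recommendations": {"entropy": 0.71, "status": "ok"},
        "onboarding": {"entropy": 0.34, "status": "warning"},
    },
}
print(interpret_health(200, sample))  # (True, ['onboarding', 'prices'])
print(interpret_health(503, {"status": "degraded: wal unavailable"}))  # (False, [])
```

The first element would drive paging or routing decisions, while the second would feed a lower-urgency data quality queue.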
Kubernetes
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
The k8s probe will mark the pod not-ready only when the WAL writer fails (503). Campaign-level entropy degradation returns 200 and does not affect routing, which is the correct behaviour: a degraded model is still serving predictions, and removing the pod from rotation does not fix the underlying cause.
Structured logging
When entropy_status is warning or critical, BanditDB emits a structured WARN log line at diagnostics time. Set LOG_FORMAT=json to receive machine-parseable output compatible with CloudWatch, Datadog, Splunk, and most log aggregation pipelines:
{
"level": "WARN",
"campaign": "prices",
"entropy": "0.091",
"trend": "Falling",
"status": "Critical",
"likely_cause": "recent_collapse",
"message": "entropy: low selection entropy detected"
}
An alert rule on message = "entropy: low selection entropy detected" and level = WARN is sufficient to wire this into any log-based alerting system without code changes.
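If your log pipeline lacks a rule engine, the same filter is easy to express in code. The sketch below scans JSON log lines for the alert described above, using the field names from the sample log line. Lines that are not valid JSON are skipped, which is an assumption about how mixed log streams should be handled.

```python
import json

ALERT_MESSAGE = "entropy: low selection entropy detected"


def entropy_alerts(log_lines):
    """Yield (campaign, status) for every entropy WARN line in the input."""
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # human-readable or truncated line
        if not isinstance(record, dict):
            continue
        if record.get("level") == "WARN" and record.get("message") == ALERT_MESSAGE:
            yield record.get("campaign"), record.get("status")


lines = [
    '{"level": "INFO", "message": "listening on 0.0.0.0:8080"}',
    '{"level": "WARN", "campaign": "prices", "status": "Critical", '
    '"message": "entropy: low selection entropy detected"}',
    "not json at all",
]
print(list(entropy_alerts(lines)))  # [('prices', 'Critical')]
```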
Causal Validation
Entropy alerting tells you whether the bandit is still exploring. Causal analysis tells you whether the exploration it did was causally correct: whether the arm receiving the most traffic is actually causing better outcomes, or whether it is merely correlated with outcomes that were already good for that user segment.
Standard OPE estimators (IPS, SNIPS, Doubly Robust) answer the question: "what reward would policy π achieve?" Causal analysis answers a different question: "what is the causal effect of each arm, controlling for the fact that the bandit was selecting them non-randomly based on context?"
The distinction matters in production. A bandit can correctly route high-converting users to arm A, not because arm A causes conversions, but because arm A was selected for users who already convert well. IPS estimators will confirm arm A is performing well. Causal analysis will correctly identify that the arm's observed advantage is partially or fully explained by user selection bias.
Running the analysis
# Checkpoint first to flush the latest data to Parquet
curl -X POST http://localhost:8080/checkpoint -H "X-Api-Key: your-key"
# Run causal analysis
pip install econml scikit-learn polars pandas numpy
python scripts/causal_analysis.py \
--parquet /data/exports/prices.parquet \
--features price_sensitivity recency_days cart_value_norm segment_score
The script works with LinUCB, Thompson Sampling, and Progressive campaigns. Thompson Sampling campaigns now log per-prediction propensities via adaptive Monte Carlo sampling (N = 8–64 draws per arm, driven by posterior spread). The causal estimator (CausalForestDML) does not use these logged propensities. Instead, it learns the selection propensity internally from observed (arm, context) pairs using Double Machine Learning, which is more appropriate for aggregated bandit data than time-varying logged propensities from a non-stationary policy. The logged TS propensities are used by the IPS/SNIPS offline evaluator in banditdb.eval.
Section 0: Positivity & Confounding Diagnostics
This section runs before the expensive causal forests and checks whether the DML validity assumptions hold. It is the causal analysis equivalent of the entropy alert: a positivity violation means the bandit never selected a given arm for some region of the feature space, making causal effect estimates for that arm in that region extrapolation rather than inference.
═════════════════════════════════════════════════════════════════
0. POSITIVITY & CONFOUNDING DIAGNOSTICS
(pre-flight check: run before fitting causal forests)
═════════════════════════════════════════════════════════════════
Arm                                AUC  P<0.05  P>0.95
──────────────────────────────── ───── ─────── ───────
price_10                         0.812    2.1%    1.8%
price_15                         0.831    1.4%    0.9%
price_20                         0.744   34.2%    0.3%  ⚠ positivity violation
⚠ Positivity violations often indicate exploration collapse.
Check GET /campaign/:id/diagnostics for entropy_status.
The AUC of the internal propensity model (model_t) quantifies how strongly context predicts arm selection. AUC ≈ 0.5 means near-random selection: the bandit was exploring freely and causal estimates are highly reliable. AUC > 0.8 means strong confounding: the DML is doing important correction work, but estimates carry more uncertainty because few users exist in the "wrong" arm condition.
P<0.05 is the fraction of observations where the model predicts near-zero probability of this arm being selected. When this exceeds 20%, the arm was almost never chosen in that feature region and the CATE estimate there is unreliable.
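The three diagnostic columns can be computed from per-row propensity predictions. The sketch below assumes you already have, for one arm, the propensity model's predicted probability for each observation and a 0/1 flag for whether the arm was actually selected; `auc` and `positivity_report` are hypothetical helper names, not functions from the analysis script. The 20% threshold is the one stated above.

```python
def auc(scores, labels):
    """Rank-based AUC: the probability that a selected row receives a
    higher score than an unselected row (ties count as one half).
    O(n^2), which is fine for a sketch but slow on large exports."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both selected and unselected rows")
    wins = 0.0
    for p in pos:
        for q in neg:
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(pos) * len(neg))

def positivity_report(scores, labels, low=0.05, high=0.95, max_violation=0.20):
    """Summarise AUC and the share of near-zero / near-one propensities."""
    n = len(scores)
    p_below = sum(s < low for s in scores) / n
    p_above = sum(s > high for s in scores) / n
    return {
        "auc": auc(scores, labels),
        "p_below": p_below,
        "p_above": p_above,
        "violation": p_below > max_violation,
    }

# Hypothetical predictions for one arm.
scores = [0.01, 0.02, 0.03, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 0, 1, 1]
print(positivity_report(scores, labels))
```

For real exports, a library AUC implementation (for example scikit-learn's `roc_auc_score`) would replace the quadratic loop.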
Section 2: Causal Assignment vs Bandit Selection
This is the most operationally important output. The causal forest identifies, for each user in the dataset, which arm would causally produce the best outcome. This is compared to what the bandit actually selected.
═════════════════════════════════════════════════════════════════
2. CAUSAL ASSIGNMENT vs BANDIT SELECTION
═════════════════════════════════════════════════════════════════
Arm                               Causal  Bandit     Gap
──────────────────────────────── ─────── ─────── ───────
price_10                           42.1%   14.2%  +27.9%
price_15                           51.3%   79.4%  -28.1%
price_20                            6.6%    6.4%   +0.2%
| Gap | Interpretation | Action |
|---|---|---|
| Gap ≈ 0 | The bandit's traffic distribution matches the causal structure. The model has converged to the correct policy. | No action. If entropy is also healthy, the campaign is operating correctly. |
| Gap > 0 (arm underserved) | A user group that would causally benefit from this arm is not being routed to it. The bandit is leaving value on the table for that segment. | Check whether the arm had enough early observations. Consider raising alpha or adding context features that distinguish this group. |
| Gap < 0 (arm overserved) | The bandit has over-converged to this arm. It is receiving more traffic than the causal structure warrants, often driven by user selection bias rather than genuine causal advantage. | The arm's observed reward rate is inflated by non-random selection. Do not interpret high reward rate as causal effectiveness without this analysis. |
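The Gap column is the causal share minus the bandit share for each arm (for example, 42.1% − 14.2% = +27.9% for price_10). A minimal sketch of that arithmetic follows, assuming you have two equal-length lists: the causally best arm per user and the arm the bandit actually selected. The function name `selection_gaps` and the sample data are hypothetical.

```python
from collections import Counter

def selection_gaps(causal_best, bandit_selected):
    """Return {arm: (causal_share, bandit_share, gap)} with gap = causal - bandit."""
    if not causal_best or len(causal_best) != len(bandit_selected):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(causal_best)
    causal = Counter(causal_best)
    bandit = Counter(bandit_selected)
    arms = sorted(set(causal) | set(bandit))
    return {
        arm: (causal[arm] / n, bandit[arm] / n, (causal[arm] - bandit[arm]) / n)
        for arm in arms
    }

# Hypothetical per-user assignments for ten users.
causal_best = ["price_10"] * 4 + ["price_15"] * 5 + ["price_20"] * 1
bandit_selected = ["price_10"] * 1 + ["price_15"] * 8 + ["price_20"] * 1

for arm, (c, b, gap) in selection_gaps(causal_best, bandit_selected).items():
    print(f"{arm:<10} causal={c:6.1%} bandit={b:6.1%} gap={gap:+7.1%}")
```

A positive gap marks an underserved arm and a negative gap an overserved one, matching the table above.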
Section 5: Selection Stability Over Time
This section splits the campaign timeline into five equal-sized buckets (by prediction timestamp) and shows the per-arm selection rate across the full campaign history. This surfaces convergence, collapse, and pipeline events directly from the Parquet export without requiring access to the live diagnostics endpoint.
═════════════════════════════════════════════════════════════════
5. SELECTION STABILITY OVER TIME
═════════════════════════════════════════════════════════════════
           t1(961)  t2(961)  t3(961)  t4(961)  t5(961)
price_10  [  18.2%    22.1%    28.4%    31.7%    35.9% ]
price_15  [  64.1%    59.3%    55.2%    52.8%    47.3% ]
price_20  [  17.7%    18.6%    16.4%    15.5%    16.8% ]
Stable across buckets → IID assumption holds, DML estimates reliable
Monotonic increase → bandit converging or collapsing to this arm
Sudden jump in t4/t5 → possible reward pipeline event or config change
The example above shows healthy convergence in progress: price_15 is steadily losing traffic to price_10 as the model learns. The DML IID assumption is not perfectly satisfied (the policy is non-stationary), but the gradual shift means causal estimates over the full history are a reasonable approximation of the average policy. A sudden jump in t4 or t5, rather than a gradual drift, would indicate a pipeline or configuration event and should prompt investigation before trusting the causal estimates.
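The bucketing itself is straightforward to reproduce. The sketch below assumes the export can be reduced to `(timestamp, arm)` pairs; it sorts by timestamp, cuts the events into equal-count buckets, and returns the per-arm selection share in each. `selection_by_bucket` and `largest_jump` are hypothetical names, and the script's actual implementation may differ.

```python
from collections import Counter

def selection_by_bucket(events, n_buckets=5):
    """events: iterable of (timestamp, arm). Returns one {arm: share} dict per bucket."""
    ordered = sorted(events, key=lambda e: e[0])
    n = len(ordered)
    if n < n_buckets:
        raise ValueError("need at least one event per bucket")
    arms = sorted({arm for _, arm in ordered})
    shares = []
    for i in range(n_buckets):
        # Integer arithmetic keeps bucket sizes within one event of each other.
        chunk = ordered[i * n // n_buckets:(i + 1) * n // n_buckets]
        counts = Counter(arm for _, arm in chunk)
        shares.append({arm: counts[arm] / len(chunk) for arm in arms})
    return shares

def largest_jump(shares, arm):
    """Largest absolute bucket-to-bucket change in selection share for one arm."""
    series = [bucket[arm] for bucket in shares]
    return max(abs(b - a) for a, b in zip(series, series[1:]))

# Hypothetical events: arm "a" for the first six timestamps, then arm "b".
events = [(t, "a" if t < 6 else "b") for t in range(10)]
shares = selection_by_bucket(events)
print([bucket["a"] for bucket in shares])
print(largest_jump(shares, "a"))
```

A large value from `largest_jump` concentrated in the last buckets corresponds to the "sudden jump" pattern; what counts as large is a judgment call that depends on the campaign.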
Interpreting the full output
| Signal | What it means | What to do |
|---|---|---|
| ATE significant & positive | The arm causally increases reward above baseline. | This arm is genuinely effective. Its observed advantage is not explained by selection bias. |
| ATE near zero or inconclusive | The arm's observed reward advantage is not causally real: it is explained by the context features it tends to be selected for. | Audit whether context features are capturing the relevant signal. The arm may be receiving credit for conversions that would have happened anyway. |
| ATE significant & negative | The arm is causally hurting outcomes. | Remove or replace the arm. Its selection is suppressing reward relative to what users would have achieved under a different policy. |
| CATE p25–p75 wide | The treatment effect is heterogeneous: some users respond strongly, others do not. Personalisation is the mechanism driving value. | The bandit's context-aware routing is doing important work. Inspect the winning segments output for which feature dimensions drive the heterogeneity. |
| CATE p25–p75 narrow | The effect is homogeneous: the arm is roughly equally good or bad for all users. | The bandit is correct to converge uniformly. A simpler rule-based policy would achieve similar results. |
| Positivity violation > 20% | The bandit collapsed exploration for this arm. CATE estimates are not trustworthy for the affected feature region. | Check entropy_status in /diagnostics. Determine which operational recipe applies and address the root cause before re-running causal analysis. |
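The three ATE rows can be read mechanically from a confidence interval: the effect is significant when the interval excludes zero. The sketch below assumes you have the lower and upper bounds of the ATE interval for an arm; `classify_ate` and its return labels are hypothetical and not produced by the script.

```python
def classify_ate(ci_lower, ci_upper):
    """Map an ATE confidence interval to one of the three ATE signals."""
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    if ci_lower > 0:
        return "significant_positive"   # whole interval above zero
    if ci_upper < 0:
        return "significant_negative"   # whole interval below zero
    return "inconclusive"               # interval contains zero

print(classify_ate(0.02, 0.09))
print(classify_ate(-0.03, 0.04))
print(classify_ate(-0.11, -0.01))
```

The CATE spread rows are not covered here because the table gives no numeric cutoff for "wide" versus "narrow".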
Causal analysis requires at least 200–300 reward observations per arm for reliable CATE estimates. Run POST /checkpoint to export the latest data before each analysis run. The causal forest fits one model per arm, so runtime scales linearly with the number of arms: allow 2–5 minutes for campaigns with 5+ arms and 10,000+ observations.
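A quick pre-check of the per-arm sample size can save a wasted run. This sketch assumes you can list the arm of every rewarded observation; `arms_below_minimum` is a hypothetical helper, and the default of 200 is the lower end of the range given above.

```python
from collections import Counter

def arms_below_minimum(rewarded_arms, all_arms, minimum=200):
    """Return {arm: count} for arms with fewer rewarded observations than `minimum`."""
    counts = Counter(rewarded_arms)
    return {arm: counts[arm] for arm in all_arms if counts[arm] < minimum}

# Hypothetical reward log: price_20 has too few observations.
rewarded = ["price_10"] * 250 + ["price_15"] * 400 + ["price_20"] * 90
print(arms_below_minimum(rewarded, ["price_10", "price_15", "price_20"]))
```

Passing the full arm list separately means an arm with no rewards at all is still reported, with a count of zero.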
Use Cases
Any decision that has a context, a finite set of choices, and a measurable outcome is a candidate for BanditDB.
| Domain | Arms (choices) | Context | Reward signal |
|---|---|---|---|
| LLM routing | GPT-4o, Claude Sonnet, Gemini Flash, … | Task type, input length, session turn, user expertise | LLM-as-judge score, user thumbs up, task completion |
| Prompt strategy | Zero-shot, chain-of-thought, few-shot, structured output | Task complexity, domain, input length, session context | LLM-as-judge quality score (0–1) |
| Agent tool selection | Which tool or sub-agent to invoke for a given step | Task type, prior tool results, cost budget | Task success, latency, cost |
| Dynamic pricing | Price tiers or discount levels | Inventory, seasonality, competitor pricing, customer segment | Revenue per unit or sell-through rate |
| Checkout optimisation | Upsell offer, free shipping, no offer | Cart value, customer history, device type | Conversion (binary) or order value lift |
| Content personalisation | Article, offer, layout variant | User demographics, history, session signals | Click, time-on-page, downstream conversion |
| Legal intake routing | Consult, intake form, refer, decline | Case value, matter complexity, conflict risk, capacity | Matter opened, revenue collected |
| Adaptive clinical trials | Treatment arms | Patient demographics, comorbidities, baseline score | Outcome score normalised to [0, 1] |
| Sleep / wellness | Temperature, light, noise reduction | Sex, age, weight, activity level, bedtime | PSQI score improvement ratio |
The reward must be a scalar in [0, 1]. If your natural metric has a different range, divide by its maximum (e.g. revenue / max_possible_revenue) or use a ratio like (after - before) / before, clipped to [0, 1].
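Both normalisations mentioned above fit in a few lines. This is a minimal sketch; the helper names are hypothetical and not part of the BanditDB SDK.

```python
def clip01(x):
    """Clamp a number into the closed interval [0, 1]."""
    return max(0.0, min(1.0, float(x)))

def scale_by_max(value, max_value):
    """Reward as a fraction of the maximum possible value."""
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    return clip01(value / max_value)

def relative_improvement(before, after):
    """Reward as (after - before) / before, clipped to [0, 1]."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return clip01((after - before) / before)

print(scale_by_max(30.0, 120.0))        # revenue / max_possible_revenue
print(relative_improvement(10.0, 13.0)) # a 30% improvement
print(relative_improvement(10.0, 8.0))  # a decline is clipped to zero
```

Clipping discards information about how bad a negative outcome was; if that matters for your metric, rescale so that the worst realistic outcome maps to 0 instead.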
When BanditDB is not the right tool
- Pure exploration / discovery: if you have no feedback signal yet and are building a dataset from scratch, start with random assignment and switch to BanditDB once you have ~100 outcomes per arm.
- Very high-dimensional action spaces (thousands of arms): LinUCB scales with arms × feature_dim² in memory. For catalogue-scale recommendation, consider embedding-based retrieval first and use BanditDB for the final re-ranking stage.
- Non-stationary rewards with hard concept drift: LinUCB assumes rewards are stationary. If the distribution shifts sharply (e.g. a major product change), delete and recreate the campaign to reset the priors.
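To get a feel for the arms × feature_dim² scaling mentioned in the list above, a back-of-envelope estimate helps. The sketch assumes the textbook disjoint LinUCB layout (one d × d matrix and one length-d vector per arm) stored as 64-bit floats; BanditDB's actual in-memory layout is not documented here and may differ, so treat the result as an order-of-magnitude figure only.

```python
def linucb_memory_bytes(n_arms, feature_dim, bytes_per_float=8):
    """Rough LinUCB state size, assuming one d x d matrix and one
    length-d vector per arm (an assumption, not BanditDB's layout)."""
    per_arm_floats = feature_dim * feature_dim + feature_dim
    return n_arms * per_arm_floats * bytes_per_float

for arms, dim in [(5, 32), (5000, 128)]:
    mb = linucb_memory_bytes(arms, dim) / 1e6
    print(f"{arms} arms x {dim} features: about {mb:.2f} MB")
```

Under these assumptions a handful of arms costs kilobytes, while thousands of arms with a wide feature vector reaches hundreds of megabytes, which is why a retrieval stage in front of the bandit is suggested for catalogue-scale problems.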