Installation

BanditDB has two components: the Rust engine and the Python SDK. Start the engine first, then install the SDK.

1. Start the Engine

Binary (fastest, no Docker required)

curl -fsSL https://raw.githubusercontent.com/dynamicpricing-ai/banditdb/main/scripts/install.sh | sh
banditdb
curl http://localhost:8080/health   # {"status":"ok"}

Docker

docker run -d -p 8080:8080 \
  -e BANDITDB_API_KEY=your-secret-key \
  -v banditdb_data:/data \
  simeonlukov/banditdb:latest

Docker Compose (recommended for production)

Includes persistence, RBAC, and auto-checkpointing:

curl -fsSL https://raw.githubusercontent.com/dynamicpricing-ai/banditdb/main/docker-compose.yml -o docker-compose.yml
# Set BANDITDB_API_KEYS=admin-key=admin,app-key=writer in .env
docker compose up -d

Key environment variables:

| Variable | Default | Description |
|---|---|---|
| BANDITDB_API_KEYS | (none) | Multi-key RBAC: key1=admin;key2=writer;key3=reader |
| BANDITDB_API_KEY | (none) | Legacy single admin key (dev only) |
| BANDITDB_CHECKPOINT_INTERVAL | 5000 | Auto-checkpoint after N rewarded events |
| BANDITDB_MAX_WAL_SIZE_MB | 100 | Also checkpoint when WAL exceeds N MB |
| DATA_DIR | /data | WAL, checkpoint, and Parquet export directory |
| PORT | 8080 | HTTP listen port |
| LOG_FORMAT | (none) | Set to json for structured production logs |
| BANDITDB_RATE_LIMIT_PER_SEC | 1000 | Per-key request rate limit |
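
The BANDITDB_API_KEYS format in the table can be sketched as a small parser. This is an illustration only, not the engine's code: `parse_api_keys` is a hypothetical helper, and accepting both `;` and `,` as separators is an assumption made because both forms appear in the examples on this page.

```python
import re

# Roles named in this document's examples.
VALID_ROLES = {"admin", "writer", "reader"}

def parse_api_keys(raw: str) -> dict:
    """Parse 'key1=admin;key2=writer' into a {key: role} mapping."""
    keys = {}
    for entry in re.split(r"[;,]", raw):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the last '=' so a key may itself contain '='.
        key, sep, role = entry.rpartition("=")
        if not sep or not key or role not in VALID_ROLES:
            raise ValueError(f"malformed entry: {entry!r}")
        keys[key] = role
    return keys

print(parse_api_keys("admin-key=admin;app-key=writer;dash-key=reader"))
```

Validating the string this way before deployment may catch a misspelled role before the engine ever sees it.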

Kubernetes (Helm)

helm install banditdb ./helm/banditdb \
  --set auth.apiKeys="admin-key=admin,app-key=writer" \
  --set persistence.size=10Gi \
  --set ingress.enabled=true \
  --set ingress.host=banditdb.example.com

The Helm chart includes a PersistentVolumeClaim, a Secret for API keys, liveness/readiness probes, and a Recreate deployment strategy, which is required for single-writer WAL safety.

Build from source

git clone https://github.com/dynamicpricing-ai/banditdb
cd banditdb
cargo build --release                        # standard build
# cargo build --release --features neural   # with NeuralLinUCB
./target/release/banditdb

2. Install the SDK

Python

pip install banditdb-python
from banditdb import Client
db = Client("http://localhost:8080", api_key="your-secret-key")

HTTP (no SDK)

Every SDK call is a plain JSON HTTP request. Use X-Api-Key header when auth is enabled:

curl -s http://localhost:8080/health
# {"status":"ok",...}

curl -s -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: your-secret-key" \
  -d '{"campaign_id":"my-campaign","context":[1.0,0.5,0.3]}'
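
The same request can be built from Python's standard library when the SDK is not available. This is a minimal sketch: `build_predict_request` is a hypothetical helper, and the endpoint, header, and body fields are copied from the curl example above. The response shape is not assumed, so the sending step is left commented out.

```python
import json
import urllib.request

def build_predict_request(base_url, api_key, campaign_id, context):
    """Build (but do not send) a POST /predict request."""
    body = json.dumps({"campaign_id": campaign_id, "context": context}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predict",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "X-Api-Key": api_key},
    )

req = build_predict_request(
    "http://localhost:8080", "your-secret-key", "my-campaign", [1.0, 0.5, 0.3]
)
print(req.get_method(), req.full_url)

# To send it against a running engine:
# with urllib.request.urlopen(req, timeout=5) as resp:
#     print(json.load(resp))
```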

Run

Start the engine, verify it's alive, create your first campaign.

Start

Binary

# Minimal: no auth, in-memory only (dev)
banditdb

# With persistence and a single admin key
DATA_DIR=/var/lib/banditdb BANDITDB_API_KEY=secret banditdb

# Full production flags
DATA_DIR=/var/lib/banditdb \
BANDITDB_API_KEYS="admin-key=admin;app-key=writer;dash-key=reader" \
BANDITDB_CHECKPOINT_INTERVAL=5000 \
BANDITDB_MAX_WAL_SIZE_MB=100 \
LOG_FORMAT=json \
PORT=8080 \
banditdb

Docker Compose

docker compose up -d          # start in background
docker compose logs -f        # follow logs
docker compose down           # stop

Verify

curl http://localhost:8080/health
# {"status":"ok","campaigns":{}}

Status values:

| Status | HTTP | Meaning |
|---|---|---|
| ok | 200 | All campaigns healthy, WAL writing normally. |
| degraded | 200 | One or more campaigns have entropy collapse. Service still running; this is a data quality issue only. |
| degraded: wal unavailable | 503 | WAL writer failed. Predictions still served from memory but nothing is persisted. Treat as a service failure. |
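
One way a monitoring script might act on these status values is sketched below. The function name and the "healthy" / "ticket" / "page" labels are hypothetical; only the HTTP codes and status strings come from the table.

```python
def classify_health(http_status: int, body: dict) -> str:
    """Map a /health response to an operator action."""
    status = body.get("status", "")
    if http_status == 503 or status.startswith("degraded: wal"):
        return "page"      # WAL writer failed: treat as a service failure
    if http_status == 200 and status == "degraded":
        return "ticket"    # entropy collapse: data quality issue only
    if http_status == 200 and status == "ok":
        return "healthy"
    return "page"          # anything unexpected is treated as a failure

print(classify_health(200, {"status": "ok", "campaigns": {}}))
```

Keeping the "degraded" case out of the paging path mirrors the table's point that entropy collapse does not interrupt service.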

Logs

# Human-readable (default)
banditdb

# Structured JSON for log aggregators (CloudWatch, Datadog, Splunk)
LOG_FORMAT=json banditdb

Key log lines to watch on startup:

INFO banditdb: recovered 3 campaigns from checkpoint (offset=48291)
INFO banditdb: replayed 142 WAL events
INFO banditdb: listening on 0.0.0.0:8080

If checkpoint recovery fails the engine exits with a clear error rather than starting in a broken state. Check DATA_DIR permissions and that checkpoint.json is not corrupt.

First Campaign

# Create a campaign
curl -s -X POST http://localhost:8080/campaign \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: your-secret-key" \
  -d '{
    "campaign_id": "homepage-hero",
    "arms": ["variant_a", "variant_b", "variant_c"],
    "feature_dim": 4,
    "alpha": 1.0
  }'

# Confirm it exists
curl -s http://localhost:8080/campaigns \
  -H "X-Api-Key: your-secret-key"

Checkpoint and Stop

# Flush WAL, export Parquet, rotate: do this before stopping
curl -s -X POST http://localhost:8080/checkpoint \
  -H "X-Api-Key: your-secret-key"

# Then stop
docker compose down   # or kill the process

Always checkpoint before a planned shutdown. The WAL preserves everything since the last checkpoint, but checkpointing first keeps the WAL short and makes the next startup faster.

Quick Start

A complete predict→reward cycle for a sleep improvement campaign.

Sleep Improvement

One-size-fits-all sleep advice ignores individual physiology. A 25-year-old male athlete and a 60-year-old sedentary woman respond differently to the same environmental change. BanditDB learns those differences automatically, routing each participant to the intervention most likely to work for their profile and improving with every reported outcome.

from banditdb import Client

db = Client("http://localhost:8080", api_key="your-secret-key")

# 1. Create the campaign once at startup
db.create_campaign(
    "sleep",
    arms=["decrease_temperature", "decrease_light", "decrease_noise"],
    feature_dim=5,
    metadata={
        "owner": "wellness-team",
        "features": ["sex", "age_norm", "weight_norm", "activity", "bedtime_norm"],
    },
)

# 2. A participant is ready for tonight's intervention.
# Context: [sex, age/100, weight_kg/150, activity_0–1, bedtime_hour/24]
context = [
    1.0,   # female
    0.35,  # age 35
    0.50,  # 75 kg
    0.60,  # moderately active
    0.96,  # bedtime 23:00
]

# 3. Ask BanditDB which intervention to apply
arm, interaction_id = db.predict("sleep", context)
print(f"Tonight's intervention: {arm}")  # e.g., "decrease_temperature"

# 4. Apply the intervention, then reward the next morning
score_before = 62
score_after  = 79
reward = (score_after - score_before) / score_before  # ≈ 0.27

db.reward(interaction_id, reward)

Rewards must be normalised to [0, 1]. Divide your business metric by its maximum possible value, or use a ratio like the example above and clamp it to [0, 1], since a relative change can be negative or exceed 1.
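
Because a relative change can fall outside [0, 1], a small clamping helper may be useful before calling db.reward. The sketch below is an assumption about how you might do this on the client side; `relative_improvement_reward` is a hypothetical name and is not part of the SDK.

```python
def relative_improvement_reward(score_before: float, score_after: float) -> float:
    """Relative improvement, clamped into the [0, 1] reward range."""
    if score_before <= 0:
        raise ValueError("score_before must be positive")
    ratio = (score_after - score_before) / score_before
    # A worse score would give a negative ratio; a large gain could exceed 1.
    return min(1.0, max(0.0, ratio))

print(round(relative_improvement_reward(62, 79), 2))  # 0.27
print(relative_improvement_reward(79, 62))            # 0.0
```

Clamping discards the size of a negative outcome. If that matters for your campaign, rescaling the metric to [0, 1] by its known minimum and maximum may be a better fit.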

Native Agent Tool Use (MCP)

Give any Claude-based agent persistent decision memory via the Model Context Protocol.

Standard LLM agents are stateless: if they route a task to the wrong model and fail, they repeat the same mistake tomorrow. BanditDB's built-in MCP server gives the entire agent swarm shared persistent memory.

Add the server to your Claude Desktop config at ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "banditdb": {
      "command": "banditdb-mcp",
      "env": {
        "BANDITDB_URL": "http://localhost:8080",
        "BANDITDB_API_KEY": "your-secret-key"
      }
    }
  }
}

The agent now has five tools:

| Tool | What it does |
|---|---|
| create_campaign | Create a new decision campaign. Accepts algorithm, alpha, and an optional free-form metadata object for attaching context (feature names, owner, version, etc.). |
| list_campaigns | List all active campaigns with arm count, alpha, and metadata. |
| campaign_diagnostics | Per-arm theta_norm, prediction count, reward rate, plus selection_entropy, entropy_status, entropy_trend, and a suggested_action when collapse is detected. Use when a campaign isn't learning or you suspect exploration has stopped. |
| get_intuition | Returns the recommended action and an interaction_id to save. |
| record_outcome | Reports success (1.0) or failure (0.0) and updates the shared model. |

Every agent in a swarm shares the same BanditDB instance, so the learned model improves with every interaction across the entire fleet.

Choosing an Algorithm

Both algorithms share identical per-arm state. Switching is a single field at campaign creation.

| Algorithm | Value | How it explores | When to use |
|---|---|---|---|
| LinUCB | "linucb" (default) | Deterministic UCB bonus: θ·x + α·√(x·A⁻¹·x) | Predictable, tunable. Sweep alpha offline to calibrate exploration. |
| Linear Thompson Sampling | "thompson_sampling" | Samples θ̃ ~ N(θ, α²·A⁻¹), scores by θ̃·x | Natural Bayesian exploration. alpha=1.0 is the principled default. Concurrent users automatically diversify arm coverage. |
| NeuralLinUCB | {"neural_lin_ucb": {…}} | Identical UCB formula to LinUCB, but applied to a learned embedding h(x; W) instead of raw context. MLP retrained in batch at checkpoint time once retrain_every rewards accumulate. | Non-linear or high-dimensional contexts where LinUCB has plateaued. Requires the neural feature flag at compile time. |
| Progressive (Tournament) | {"progressive": {"base": "linucb", "challenger": "neural_lin_ucb"}} | Runs both models in parallel. Uses the better-performing model for 90% of traffic. | The recommended default. Handles the cold-start with Linear and upgrades to Neural automatically when data justifies it. |

# Progressive: the internal tournament handled by BanditDB
db.create_campaign("auto_agent", ["model_a", "model_b"], feature_dim=1536,
                   algorithm={"progressive": {"base": "linucb", "challenger": "neural_lin_ucb"}})

The algorithm field is stored in both the WAL and checkpoint files. Old WAL records and checkpoints without an algorithm field recover as "linucb" automatically.
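
The LinUCB scoring rule from the table, θ·x + α·√(x·A⁻¹·x), can be sketched in plain Python to show how alpha trades exploitation against exploration. This is a textbook illustration, not the engine's Rust implementation; `ucb_score`, `select_arm`, and the per-arm dictionary layout are hypothetical.

```python
import math

def ucb_score(theta, a_inv, x, alpha):
    """theta·x + alpha·sqrt(x·A⁻¹·x) for a single arm."""
    mean = sum(t * xi for t, xi in zip(theta, x))
    a_inv_x = [sum(row[j] * x[j] for j in range(len(x))) for row in a_inv]
    variance = sum(xi * v for xi, v in zip(x, a_inv_x))
    return mean + alpha * math.sqrt(max(variance, 0.0))

def select_arm(arms, x, alpha=1.0):
    """Pick the arm with the highest upper confidence bound."""
    return max(
        arms,
        key=lambda name: ucb_score(arms[name]["theta"], arms[name]["a_inv"], x, alpha),
    )

arms = {
    # Well-explored arm: small A⁻¹, so a small exploration bonus.
    "a": {"theta": [0.5, 0.0], "a_inv": [[0.01, 0.0], [0.0, 0.01]]},
    # Fresh arm: A⁻¹ is still the identity, so a large bonus.
    "b": {"theta": [0.2, 0.0], "a_inv": [[1.0, 0.0], [0.0, 1.0]]},
}
x = [1.0, 0.0]
print(select_arm(arms, x, alpha=1.0))  # b (bonus outweighs the lower mean)
print(select_arm(arms, x, alpha=0.0))  # a (pure exploitation)
```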

Model Architecture

Why BanditDB uses the Neural-Linear handoff instead of end-to-end Deep RL.

Why Neural-Linear?

In the literature, this approach is often called the "Golden Middle" of Contextual Bandits. End-to-end Deep Reinforcement Learning (like PPO or Soft Actor-Critic) is notoriously unstable and requires thousands of interactions before it makes even basic sense of its environment. BanditDB is designed for production databases where the first 50 decisions are just as important as the 5,000th.

1. Latency vs. Accuracy

A full forward pass of a deep network for every arm is slow (10ms+). By using a Neural Feature Extractor with a Linear Head, BanditDB achieves sub-millisecond latencies. The heavy neural retraining happens in the background, while the live prediction path is a simple, ultra-fast vector dot product (θᵀh).

2. Statistical Rigor (The Closed-Form Variance)

Deep networks are "black boxes" that don't natively understand their own uncertainty. To explore efficiently, they often rely on heuristics like epsilon-greedy or dropout. Because BanditDB's final layer is linear, it can compute an exact closed-form variance for every prediction. This allows us to use UCB and Thompson Sampling with mathematical certainty, leading to lower cumulative regret.

3. The Internal Tournament

The Progressive algorithm is our unique technical moat. It runs a continuous internal tournament between a Linear model (safe, sample-efficient) and a Neural model (powerful, non-linear).

  • Shadow Learning: Every reward received by the database updates both models simultaneously.
  • Autonomous Handoff: Using Mean Squared Error (MSE) over the interaction buffer, the engine autonomously detects the exact moment the Neural model has "surpassed" the Linear model's intuition.
  • Self-Correction: If the Neural model diverges or its performance drops due to noise, the tournament automatically demotes it back to the safe Linear baseline.

NeuralLinUCB

An MLP preprocessing stage that gives LinUCB a learnable, non-linear view of the context.

Standard LinUCB assumes the expected reward is a linear function of the raw context, which breaks down when features interact non-linearly or when the context vector carries redundant dimensions. NeuralLinUCB adds a small MLP that projects the caller's context into a compact fixed-size embedding; LinUCB then operates entirely in that embedding space. The network is retrained in batches at checkpoint time, keeping the hot predict/reward path purely online with no ML overhead per request.

context x  (context_dim floats)
     │
┌────┴─────────────────────────────┐
│ hidden_layers × hidden_dim × ReLU│   MLP h(x ; W)
└────┬─────────────────────────────┘   retrained every retrain_every rewards
     │  L2-normalise
     ▼
embedding h  (embed_dim floats)
     │
θᵀh + α · √(hᵀ A⁻¹ h)   ← LinUCB (Sherman-Morrison online update)

Two algorithms run in tandem:

  • Algorithm 1 (online, every request): Runs the same Sherman-Morrison LinUCB update used by "linucb", but on the MLP embedding rather than raw context. No added latency: the MLP forward pass is a single matrix multiply chain.
  • Algorithm 2 (batch, at checkpoint): Once retrain_every rewards have accumulated, AdamW minimises ½·MSE(θᵀh(x;W), r) + (m·λ/2)·‖W − W₀‖² for retrain_steps steps. The L2 penalty toward the initialisation weights W₀ prevents catastrophic forgetting across retrain cycles. Arm matrices are then re-accumulated (replayed through the new embedding via Sherman-Morrison) before WAL rotation.
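
The Sherman-Morrison step named in Algorithm 1 updates A⁻¹ directly after each reward, so no matrix inversion is needed on the hot path. The sketch below is the standard textbook form in plain Python, under the assumption that the engine performs an equivalent rank-one update; `sherman_morrison_update` is a hypothetical name, and in NeuralLinUCB the vector x would be the embedding h.

```python
def sherman_morrison_update(a_inv, b, x, reward):
    """Rank-one update for A <- A + x·xᵀ and b <- b + reward·x.

    Returns the new A⁻¹, the new b, and theta = A⁻¹·b.
    """
    n = len(x)
    a_inv_x = [sum(a_inv[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(x[i] * a_inv_x[i] for i in range(n))
    # A⁻¹ is symmetric, so xᵀ·A⁻¹ equals (A⁻¹·x)ᵀ.
    new_a_inv = [
        [a_inv[i][j] - a_inv_x[i] * a_inv_x[j] / denom for j in range(n)]
        for i in range(n)
    ]
    new_b = [b[i] + reward * x[i] for i in range(n)]
    theta = [sum(new_a_inv[i][j] * new_b[j] for j in range(n)) for i in range(n)]
    return new_a_inv, new_b, theta

a_inv, b, theta = sherman_morrison_update(
    [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], x=[1.0, 0.0], reward=1.0
)
print(a_inv)  # [[0.5, 0.0], [0.0, 1.0]]
print(theta)  # [0.5, 0.0]
```

The update costs O(d²) per reward, where d is the vector length. That is one reason a small embed_dim keeps online updates fast.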

Configuration

| Parameter | Default | Meaning |
|---|---|---|
| context_dim | required | Length of the context vector the caller sends at predict time. Fixed at campaign creation. |
| embed_dim | 32 | Embedding (arm matrix) dimension. Arm matrices are embed_dim × embed_dim. Smaller = faster online updates; larger = more expressive. 16–64 covers most use cases. |
| hidden_dim | 128 | Width of each hidden layer. |
| hidden_layers | 2 | Number of hidden layers before the final embedding projection. |
| retrain_every | 200 | Rewards to accumulate before triggering Algorithm 2. Larger = fewer retrains, more stable arm matrices between checkpoints. |
| retrain_steps | 100 | AdamW gradient-descent steps per retrain cycle. |
| learning_rate | 0.001 | AdamW learning rate. |
| lambda | 1.0 | L2 regularization coefficient toward initial weights W₀. Higher = slower adaptation, more stable. Lower = allows faster weight drift between cycles. |

db.create_campaign(
    "content",
    arms=["article_a", "article_b", "article_c"],
    algorithm={
        "neural_lin_ucb": {
            "context_dim": 32,    # must match the context vector you send at predict time
            "embed_dim": 16,      # arm matrices are 16 × 16
            "hidden_dim": 128,
            "hidden_layers": 2,
            "retrain_every": 500, # first retrain after 500 rewards
            "retrain_steps": 100,
            "learning_rate": 1e-3,
            "lambda": 1.0,
        }
    },
    alpha=1.0,
)

# context must be exactly context_dim=32 floats
arm, interaction_id = db.predict("content", context)
db.reward(interaction_id, reward)

Feature flag: NeuralLinUCB is compiled in only when the neural feature is enabled. The default binary contains no Candle or ML dependencies.

cargo build --release --features neural
cargo build --release --features neural,cuda
cargo build --release --features neural,metal

Hardware Acceleration

The compute device is selected at startup via the BANDITDB_DEVICE environment variable. When unset, BanditDB auto-detects: CUDA → Metal → CPU.

| Value | Requires | Effect |
|---|---|---|
| unset or auto | (none) | CUDA if available, then Metal, then CPU. |
| cuda | --features neural,cuda | CUDA GPU 0. |
| cuda:N | --features neural,cuda | Specific CUDA device ordinal. |
| metal | --features neural,metal | Apple Silicon GPU. |
| cpu | --features neural | Force CPU regardless of available hardware. |

Cold Start and Warm Restart

Cold start: a freshly created campaign uses randomly-initialised MLP weights until the first Algorithm 2 retrain fires at retrain_every rewards. Random projections are a usable baseline (LinUCB adapts even in a random embedding space), but expect lower sample efficiency than a trained network during this initial phase.

Warm restart: after each retrain, arm matrices (A⁻¹, b, θ) are re-accumulated by replaying the reward buffer through the updated network. This preserves most of the learned signal across retrains instead of resetting LinUCB statistics from scratch.

Recovery: MLP weights are saved as a .safetensors sidecar at {DATA_DIR}/neural/{campaign_id}.safetensors during each checkpoint and reloaded automatically on startup. If the sidecar is missing, the campaign recovers gracefully with fresh random weights and relearns from the WAL tail. Model quality degrades temporarily but no data is lost.

The Data Science Escape Hatch

Every interaction is event-sourced. Export to Parquet and evaluate policies offline.

POST /checkpoint compiles completed prediction→reward pairs into Snappy-compressed Apache Parquet files — one file per campaign — for offline analysis in Pandas or Polars.

Every prediction will eventually appear in the Parquet file even if its reward arrives hours later. BanditDB re-emits in-flight interactions at each checkpoint so delayed rewards are always captured in a future cycle.
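
The pairing behaviour described above can be sketched as a join over the event stream: rewarded predictions become export rows, and unrewarded ones are carried forward. The flat event shape used here (a "type" field plus interaction_id, arm_id, and reward) is a simplification invented for the example and is not the on-disk WAL schema. `match_interactions` is a hypothetical name.

```python
def match_interactions(events):
    """Split events into completed (prediction, reward) rows and in-flight predictions."""
    pending = {}
    matched = []
    for event in events:
        if event["type"] == "Predicted":
            pending[event["interaction_id"]] = event
        elif event["type"] == "Rewarded":
            prediction = pending.pop(event["interaction_id"], None)
            if prediction is not None:
                matched.append({
                    "interaction_id": prediction["interaction_id"],
                    "arm_id": prediction["arm_id"],
                    "reward": event["reward"],
                })
    # Whatever is still pending would be re-emitted for a later checkpoint.
    return matched, list(pending.values())

events = [
    {"type": "Predicted", "interaction_id": "i1", "arm_id": "decrease_light"},
    {"type": "Predicted", "interaction_id": "i2", "arm_id": "decrease_noise"},
    {"type": "Rewarded", "interaction_id": "i1", "reward": 0.27},
]
rows, in_flight = match_interactions(events)
print(rows)
print([e["interaction_id"] for e in in_flight])  # ['i2']
```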

Each row includes a propensity column: the probability that the logging policy selected the chosen arm given the context. This is the P(a | x) term required by Inverse Propensity Scoring estimators. The method differs by algorithm:

  • LinUCB / NeuralLinUCB: softmax-normalised UCB scores across all arms at prediction time.
  • Thompson Sampling: adaptive Monte Carlo frequency estimate. N posterior samples are drawn per arm; propensity is the fraction of draws in which the arm produced the highest score. N adapts from 64 (cold start, diffuse posterior) down to 8 (converged, concentrated posterior) based on the maximum A⁻¹ diagonal across arms, so accuracy is highest when it is most needed and cost is lowest when traffic is highest.
import polars as pl
import requests

HEADERS = {"X-Api-Key": "your-secret-key"}

# Snapshot models, export Parquet, rotate the WAL
requests.post("http://localhost:8080/checkpoint", headers=HEADERS)

# Flat schema: interaction_id | arm_id | reward | predicted_at | rewarded_at | propensity | feature_0 …
df = pl.read_parquet("/data/exports/sleep.parquet")
print(df.head())
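
For LinUCB and NeuralLinUCB campaigns, the propensity column is described above as a softmax over the arms' UCB scores. A minimal sketch of that normalisation follows. It assumes a plain softmax with temperature 1, and the engine may scale the scores differently. `softmax_propensities` is a hypothetical name.

```python
import math

def softmax_propensities(ucb_scores: dict) -> dict:
    """Turn per-arm UCB scores into probabilities that sum to 1."""
    top = max(ucb_scores.values())
    # Subtracting the maximum keeps exp() numerically stable.
    exps = {arm: math.exp(score - top) for arm, score in ucb_scores.items()}
    total = sum(exps.values())
    return {arm: value / total for arm, value in exps.items()}

props = softmax_propensities(
    {"decrease_temperature": 1.2, "decrease_light": 0.9, "decrease_noise": 0.4}
)
print({arm: round(p, 3) for arm, p in props.items()})
```

Every arm receives a non-zero propensity under this scheme, which keeps importance weights finite for IPS-style estimators.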

Offline Policy Evaluation

The Python SDK ships three OPE estimators in banditdb.eval. Install with:

pip install "banditdb-python[eval]"
| Estimator | Function | When to use |
|---|---|---|
| Replay | replay(df) | Sanity check baseline. Unbiased but low coverage (~1/K interactions used). |
| IPS / SNIPS | ips(df, clip=10.0) | Primary estimator. Uses every interaction with importance weights. |
| Doubly Robust | doubly_robust(df, clip=10.0) | Best statistical efficiency. Use when comparing multiple policies or sweeping alpha. |

import polars as pl
from banditdb.eval import replay, ips, doubly_robust

df = pl.read_parquet("/data/exports/sleep.parquet")

print(replay(df))
# OPEResult(method='replay', estimate=0.4821, std_error=0.0312, coverage=22.1% [33/149])

print(ips(df))
# OPEResult(method='ips', estimate=0.5103, std_error=0.0187, coverage=100.0% [149/149])

print(doubly_robust(df))
# OPEResult(method='doubly_robust', estimate=0.5219, std_error=0.0141, coverage=100.0% [149/149])

# Compare against the observed reward of the logging policy:
print("Observed:", df["reward"].mean())
# If observed >> estimate, the campaign has learned something real.
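
To make the importance-weighting idea behind IPS and SNIPS concrete, here is a generic textbook sketch in plain Python. It is not the implementation in banditdb.eval and does not share its signature: `ips_estimates`, the row dictionaries, and the explicit `target_policy` argument are all hypothetical, and treating `clip` as a cap on the importance weight is an assumption.

```python
def ips_estimates(rows, target_policy, clip=10.0):
    """Return (IPS, SNIPS) value estimates for a deterministic target policy."""
    weights = []
    weighted_rewards = []
    for row in rows:
        agrees = target_policy(row["context"]) == row["arm_id"]
        # Weight = 1 / P(logged arm | context), capped at `clip`; 0 if the policies disagree.
        w = min(1.0 / row["propensity"], clip) if agrees else 0.0
        weights.append(w)
        weighted_rewards.append(w * row["reward"])
    ips = sum(weighted_rewards) / len(rows)
    total_weight = sum(weights)
    snips = sum(weighted_rewards) / total_weight if total_weight > 0 else float("nan")
    return ips, snips

rows = [
    {"context": [1.0], "arm_id": "a", "reward": 1.0, "propensity": 0.5},
    {"context": [1.0], "arm_id": "b", "reward": 0.0, "propensity": 0.5},
    {"context": [0.0], "arm_id": "a", "reward": 0.0, "propensity": 0.5},
    {"context": [0.0], "arm_id": "b", "reward": 1.0, "propensity": 0.5},
]
print(ips_estimates(rows, target_policy=lambda ctx: "a"))  # (0.5, 0.5)
```

SNIPS divides by the sum of weights rather than the row count, which usually lowers variance at the cost of a small bias.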

Inspecting the WAL

The WAL is plain JSONL, so every event is human-readable on disk.

# All campaigns ever created
grep "CampaignCreated" /data/bandit_wal.jsonl | jq '.CampaignCreated.campaign_id'

# Campaigns that have been deleted
grep "CampaignDeleted" /data/bandit_wal.jsonl | jq '.CampaignDeleted.campaign_id'
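
The two grep commands can be combined in Python to list campaigns that were created and not later deleted. The sketch assumes only the event shape implied by the jq paths above (an object keyed by event name with a campaign_id inside). Other event types are ignored, and `live_campaigns` is a hypothetical name.

```python
import json

def live_campaigns(wal_lines):
    """Return the set of campaign ids created and not subsequently deleted."""
    live = set()
    for line in wal_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if "CampaignCreated" in event:
            live.add(event["CampaignCreated"]["campaign_id"])
        elif "CampaignDeleted" in event:
            live.discard(event["CampaignDeleted"]["campaign_id"])
    return live

sample = [
    '{"CampaignCreated": {"campaign_id": "sleep"}}',
    '{"CampaignCreated": {"campaign_id": "prices"}}',
    '{"CampaignDeleted": {"campaign_id": "prices"}}',
]
print(sorted(live_campaigns(sample)))  # ['sleep']

# Against a real WAL file:
# with open("/data/bandit_wal.jsonl", encoding="utf-8") as f:
#     print(sorted(live_campaigns(f)))
```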

How Recovery Works

BanditDB survives crashes and restarts automatically. No manual intervention required.

The Two Files

| File | Purpose |
|---|---|
| checkpoint.json | Snapshot of all campaign matrices (A⁻¹, b, θ, counts) at a specific WAL byte offset. |
| bandit_wal.jsonl | Append-only event log: CampaignCreated, Predicted, Rewarded, CampaignDeleted. |

Phase 1: Load the Checkpoint

If checkpoint.json exists, BanditDB reads it and restores all campaign matrices directly into memory: no replaying, just deserialisation. The checkpoint records the WAL byte offset at which it was taken.

If no checkpoint exists, BanditDB starts from an empty state and replays the entire WAL from byte 0.

Phase 2: Replay the WAL Tail

BanditDB opens bandit_wal.jsonl, seeks to the checkpoint's byte offset, and replays every event written after that point. One edge case: after WAL rotation the stored offset may exceed the current file size. BanditDB detects this and seeks to byte 0 instead.

checkpoint.json found?
├── YES → restore all matrices from snapshot
│         → open WAL, seek to checkpoint.wal_offset
│         → if offset > file size (post-rotation): seek to 0
│         → replay events from that position
└── NO  → open WAL, replay from byte 0
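
The seek decision in the flow above reduces to a small pure function. This is a sketch of the rule as described, not the engine's code; `replay_start_offset` is a hypothetical name, and `None` stands in for "no checkpoint found".

```python
def replay_start_offset(checkpoint_offset, wal_size):
    """Byte position in the WAL at which replay should begin."""
    if checkpoint_offset is None:
        return 0                 # no checkpoint: replay the whole WAL
    if checkpoint_offset > wal_size:
        return 0                 # WAL was rotated after the checkpoint
    return checkpoint_offset     # normal case: replay only the tail

print(replay_start_offset(None, 10_000))    # 0
print(replay_start_offset(48_291, 60_000))  # 48291
print(replay_start_offset(48_291, 1_200))   # 0
```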

Data Loss Window

Everything in the WAL is durable. The WAL writer flushes after every write burst and fsyncs before acknowledging a checkpoint. A crash between checkpoints is fully recovered by replaying the WAL tail.

The only data at risk is in-flight predictions: interactions predicted but not yet rewarded at the moment of a crash. These live in the Moka TTL cache in memory. After a crash those interaction IDs are lost and any reward sent for them will return 404.

Mitigate this by checkpointing frequently. BanditDB re-emits in-flight predictions into the WAL tail at each checkpoint, so rewards arriving before the next crash are captured.

What POST /checkpoint Does

  1. Flush barrier: drains all pending events and fsyncs to disk, then responds with the confirmed byte offset.
  2. Snapshot: serialises all campaign matrices to checkpoint.tmp, then atomically renames it to checkpoint.json.
  3. Parquet export: joins Predicted + Rewarded events and writes matched pairs as a timestamped shard per campaign. Unmatched predictions are re-emitted into the WAL tail.
  4. Neural retrain (NeuralLinUCB campaigns only): if retrain_every rewards have accumulated since the last retrain, runs Algorithm 2, re-accumulates arm matrices in the new embedding space, and saves .safetensors weights to {DATA_DIR}/neural/.
  5. WAL rotation: truncates the WAL to only the tail. Pre-checkpoint history is no longer needed for recovery.
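
The write-then-rename pattern in step 2 is what keeps a crash from leaving a half-written checkpoint.json behind. Below is a Python illustration of the same pattern. The engine itself is written in Rust, so this is an analogy rather than its code, and `write_checkpoint_atomically` and the state dictionary are hypothetical.

```python
import json
import os
import tempfile

def write_checkpoint_atomically(state: dict, data_dir: str) -> str:
    """Write state to checkpoint.tmp, fsync it, then rename over checkpoint.json."""
    final_path = os.path.join(data_dir, "checkpoint.json")
    tmp_path = os.path.join(data_dir, "checkpoint.tmp")
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())   # make sure the bytes are on disk before renaming
    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(tmp_path, final_path)
    return final_path

with tempfile.TemporaryDirectory() as data_dir:
    path = write_checkpoint_atomically({"wal_offset": 48291}, data_dir)
    with open(path, encoding="utf-8") as f:
        print(json.load(f))  # {'wal_offset': 48291}
```

A reader sees either the old snapshot or the new one, never a partial file, which is why recovery can trust checkpoint.json whenever it exists.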

Parquet files are analytics exports only and are not used for recovery. Losing them does not affect model state. Recovery uses only checkpoint.json + bandit_wal.jsonl.

Recommended Production Setup

# Auto-checkpoint every 10,000 rewards
BANDITDB_CHECKPOINT_INTERVAL=10000

# Or cap WAL size (useful on edge deployments)
BANDITDB_MAX_WAL_SIZE_MB=50

# Back up the two recovery files on a schedule
cp /data/checkpoint.json  /backup/checkpoint-$(date +%s).json
cp /data/bandit_wal.jsonl /backup/wal-$(date +%s).jsonl

To move BanditDB to a new host: copy checkpoint.json and bandit_wal.jsonl to the same DATA_DIR on the new machine and start. Recovery is automatic.

Observability & Production Monitoring

Know when the system is working, not just when it's running.

The Silent Failure Problem

A contextual bandit has a failure mode that produces no errors, no latency spikes, and no obvious operational signal: exploration collapse. The engine is healthy. Every prediction succeeds. Every reward matches. The WAL is flushing. But the model stopped learning days ago: one arm has absorbed all traffic and the system is silently serving a static policy.

For enterprise deployments this is a contractual and audit risk. You are paying for an adaptive system. Without explicit observability, you have no way to distinguish "the bandit has correctly converged to a winning arm" from "the bandit collapsed to a wrong arm two weeks ago and nobody noticed."

BanditDB surfaces this through three layered signals: live selection entropy per campaign, an aggregated health endpoint, and a causal validation pipeline.

Selection Entropy

Every call to GET /campaign/:id/diagnostics computes the normalised Shannon entropy of the arm selection distribution:

H = −( Σ p_i · log(p_i) ) / log(n_arms)

n_arms = 3, uniform exploration:         p = [0.33, 0.33, 0.33] → H = 1.00
n_arms = 3, healthy convergence:         p = [0.70, 0.16, 0.14] → H ≈ 0.74
n_arms = 3, near the warning threshold:  p = [0.88, 0.07, 0.05] → H ≈ 0.41
n_arms = 3, critical (collapsed):        p = [0.97, 0.02, 0.01] → H ≈ 0.14

H = 1.0 means perfectly uniform selection across all arms. H = 0.0 means one arm receives all traffic. The normalisation by log(n_arms) makes the scale consistent regardless of how many arms a campaign has.
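
The normalised entropy can be computed from per-arm selection counts in a few lines. This is a sketch of the formula as stated, with the usual convention that an arm with zero selections contributes nothing to the sum. `normalised_entropy` is a hypothetical name.

```python
import math

def normalised_entropy(counts):
    """Shannon entropy of the selection distribution, divided by log(n_arms)."""
    total = sum(counts)
    n_arms = len(counts)
    if total == 0 or n_arms < 2:
        return 0.0
    h = 0.0
    for count in counts:
        if count > 0:             # 0 · log(0) is taken as 0
            p = count / total
            h -= p * math.log(p)
    return h / math.log(n_arms)

print(round(normalised_entropy([100, 100, 100]), 2))  # 1.0
print(round(normalised_entropy([300, 0, 0]), 2))      # 0.0
```

Because the result is a ratio of two logarithms, the choice of log base does not affect it.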

Entropy Status and Guards

Raw entropy is not sufficient for alerting. A campaign with H = 0.12 could be healthy (one arm has genuinely won) or broken (the model collapsed before accumulating enough data). Two guard conditions prevent false positives before a status is computed:

| Guard | Condition | Effect |
|---|---|---|
| Guard 1: Convergence | Leading arm's 95% Wilson-score CI lower bound exceeds the second arm's upper bound (requires ≥ 30 rewards on both arms) | Status is forced to ok. Low entropy is the correct outcome when one arm has statistically won. |
| Guard 2: Minimum observations | Total predictions < 500 | Status is forced to ok. Early in a campaign, random variance naturally concentrates traffic; alerting here would be noise. |

When neither guard fires, the status is set by thresholds on H:

| entropy_status | Threshold | Meaning |
|---|---|---|
| ok | H ≥ 0.4, or either guard is active | Healthy exploration or confirmed convergence. |
| warning | 0.2 ≤ H < 0.4 | One arm is absorbing most traffic without a convergence signal. Investigate. |
| critical | H < 0.2 | Near-total collapse. likely_cause and suggested_action are always present. |
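
The guards and thresholds above can be sketched as three small functions. This is an illustration under stated assumptions: rewards are treated as binary successes so that a Wilson score interval applies, z = 1.96 is used for the 95% level, and the caller decides which arms are "leading" and "second". All function names are hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (centre - margin, centre + margin)

def has_converged(leader, runner_up, min_rewards=30):
    """Guard 1: each argument is (successes, rewards) for one arm."""
    if leader[1] < min_rewards or runner_up[1] < min_rewards:
        return False
    return wilson_interval(*leader)[0] > wilson_interval(*runner_up)[1]

def entropy_status(h, total_predictions, converged):
    """Apply Guard 1, Guard 2, then the thresholds on H."""
    if converged or total_predictions < 500:
        return "ok"
    if h < 0.2:
        return "critical"
    if h < 0.4:
        return "warning"
    return "ok"

print(has_converged((90, 100), (40, 100)))                  # True
print(entropy_status(0.09, 4821, converged=False))          # critical
print(entropy_status(0.09, 4821, converged=True))           # ok
```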

Entropy Trend

BanditDB stores an entropy snapshot at every checkpoint and compares it to the current value. This distinguishes a campaign that has been collapsed for months (data quality issue) from one that collapsed last night (operational incident).

| entropy_trend | Meaning |
|---|---|
| stable | Entropy changed by less than 0.1 since the last checkpoint. |
| falling | Dropped by more than 0.1, indicating a recent collapse. Likely cause is a pipeline event, deploy, or new cohort. Correlate with your deployment timeline. |
| recovering | Increased by more than 0.1: entropy is returning after a collapse. Monitor until stable. |
| unknown | No checkpoint has been written yet for this campaign. Run POST /checkpoint to establish a baseline. |
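
The trend classification compares the entropy stored at the last checkpoint with the current value. A minimal sketch follows. `entropy_trend` is a hypothetical name, and treating a change of exactly 0.1 as stable is an assumption, since the table does not specify the boundary.

```python
def entropy_trend(previous, current, threshold=0.1):
    """Classify the change in entropy since the last checkpoint."""
    if previous is None:
        return "unknown"        # no checkpoint snapshot yet
    delta = current - previous
    if delta < -threshold:
        return "falling"
    if delta > threshold:
        return "recovering"
    return "stable"

print(entropy_trend(None, 0.50))  # unknown
print(entropy_trend(0.70, 0.09))  # falling
print(entropy_trend(0.20, 0.50))  # recovering
print(entropy_trend(0.50, 0.55))  # stable
```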

Full Diagnostics Response

A campaign with critical entropy returns a self-explanatory triage payload alongside all existing diagnostics:

curl http://localhost:8080/campaign/prices/diagnostics \
  -H "X-Api-Key: your-key"
{
  "campaign_id": "prices",
  "selection_entropy": 0.09,
  "entropy_status": "critical",
  "entropy_trend": "falling",
  "converged": false,
  "likely_cause": "recent_collapse",
  "suggested_action": "Entropy dropped since last checkpoint. Check reward pipeline for bugs or recent config changes.",
  "total_predictions": 4821,
  "total_rewards": 312,
  "arm_stats": {
    "price_10": { "predictions": 4698, "rewards": 301, "avg_reward": 0.61 },
    "price_15": { "predictions": 89,   "rewards": 8,   "avg_reward": 0.58 },
    "price_20": { "predictions": 34,   "rewards": 3,   "avg_reward": 0.55 }
  }
}

likely_cause and suggested_action are only present when entropy_status is warning or critical; they are omitted entirely when the campaign is healthy.

Operational Recipes

When an alert fires, the first step is always to look at the diagnostics: per-arm predictions, rewards, avg_reward, converged, and entropy_trend together identify which of five scenarios is occurring.

| Scenario | Signals | Action |
|---|---|---|
| Legitimate convergence | converged: true, all arms have substantial reward counts, losing arms had a fair chance (predictions > 300 each) | No action. Guard 1 should have suppressed the alert. If it fired, verify reward count thresholds are met. Consider archiving the campaign. |
| Early lock-in | Total predictions < 2000, one or more arms have fewer than 50 predictions, converged: null | The winning arm got a lucky early lead before others had enough data. Increase alpha to boost the UCB exploration bonus. If the campaign is too young to reset, wait; entropy may recover naturally. Reset and restart if data is cheap to regenerate. |
| Reward pipeline event | entropy_trend: "falling", collapse is recent, often correlates with a deploy or config change. One arm's reward rate is anomalously high relative to its base rate. | Fix the pipeline before touching the campaign. Resetting while the bug is live just restarts the corruption. Once the pipeline is clean, assess whether the corrupted observations can be discarded or whether a full campaign reset is warranted. |
| New cohort after collapse | Long-running campaign, entropy was healthy for months then gradually declined, correlates with a new user segment or product change. | The collapsed arm may be contextually correct for the original cohort but wrong for the new one. Do not reset, because that discards valid learning for the original cohort. Create a separate campaign for the new segment, or add new arms representing the new segment's hypotheses. |
| Alpha misconfiguration | Collapse occurs within the first 200–500 predictions, entropy_trend: "falling" from the very beginning, all arms except one have near-zero observations. Diagnosed by reviewing campaign creation parameters. | The UCB exploration bonus was too small from the start. Recreate the campaign with a higher alpha. The cost is low because the campaign has few observations. |

Health Endpoint Integration

GET /health aggregates entropy status across all active campaigns and is designed for integration with load balancers, uptime monitors, and Kubernetes probes. It requires no authentication.

curl http://localhost:8080/health
{
  "status": "degraded",
  "campaigns": {
    "prices":        { "entropy": 0.09, "status": "critical" },
    "recommendations": { "entropy": 0.71, "status": "ok" },
    "onboarding":    { "entropy": 0.34, "status": "warning" }
  }
}
| HTTP status | Overall status | Meaning |
|---|---|---|
| 200 | ok | All active campaigns have healthy entropy (or are statistically converged). |
| 200 | degraded | One or more campaigns have warning or critical entropy. The service is still serving correctly; this is a data quality signal, not a service failure. Do not remove from load balancer rotation. |
| 503 | degraded: wal unavailable | The WAL writer has encountered an unrecoverable I/O error. Predictions will continue to be served from memory but no new events are being persisted. Treat as a service failure. |

The deliberate choice of HTTP 200 for entropy degradation, rather than 503, means a k8s readiness probe or load balancer health check will not remove the instance from rotation just because a campaign's model quality has degraded. Service availability and model quality are separate failure modes and should be monitored separately.

Kubernetes

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10

The k8s probe will mark the pod not-ready only when the WAL writer fails (503). Campaign-level entropy degradation returns 200 and does not affect routing, which is the correct behaviour: a degraded model is still serving predictions, and removing the pod from rotation does not fix the underlying cause.

Structured logging

When entropy_status is warning or critical, BanditDB emits a structured WARN log line at diagnostics time. Set LOG_FORMAT=json to receive machine-parseable output compatible with CloudWatch, Datadog, Splunk, and most log aggregation pipelines:

{
  "level": "WARN",
  "campaign": "prices",
  "entropy": "0.091",
  "trend": "Falling",
  "status": "Critical",
  "likely_cause": "recent_collapse",
  "message": "entropy: low selection entropy detected"
}

An alert rule on message = "entropy: low selection entropy detected" and level = WARN is sufficient to wire this into any log-based alerting system without code changes.

Causal Validation

Entropy alerting tells you whether the bandit is still exploring. Causal analysis tells you whether the exploration it did was causally correct β€” whether the arm receiving the most traffic is actually causing better outcomes, or whether it is merely correlated with outcomes that were already good for that user segment.

Standard OPE estimators (IPS, SNIPS, Doubly Robust) answer the question: "what reward would policy π achieve?" Causal analysis answers a different question: "what is the causal effect of each arm, controlling for the fact that the bandit was selecting them non-randomly based on context?"

The distinction matters in production. A bandit can correctly route high-converting users to arm A, not because arm A causes conversions, but because arm A was selected for users who already convert well. IPS estimators will confirm arm A is performing well. Causal analysis will correctly identify that the arm's observed advantage is partially or fully explained by user selection bias.
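A small worked example makes the selection-bias effect concrete. The numbers below are entirely hypothetical: two equally sized user segments, an arm A with no causal advantage over arm B in either segment, and a policy that sends most high-intent users to A. The sketch compares the per-arm reward rate you would observe with the segment-weighted causal difference.

```python
# Hypothetical two-segment population. Conversion depends only on the
# segment: arm A and arm B convert identically within each segment.
SEGMENTS = [
    # share of users, P(policy picks A), conversion rate under each arm
    {"share": 0.5, "p_arm_a": 0.9, "conv": {"A": 0.30, "B": 0.30}},  # high intent
    {"share": 0.5, "p_arm_a": 0.1, "conv": {"A": 0.05, "B": 0.05}},  # low intent
]


def observed_rate(arm):
    """Expected reward rate among the users the policy routed to `arm`."""
    traffic = 0.0
    reward = 0.0
    for seg in SEGMENTS:
        p_arm = seg["p_arm_a"] if arm == "A" else 1.0 - seg["p_arm_a"]
        traffic += seg["share"] * p_arm
        reward += seg["share"] * p_arm * seg["conv"][arm]
    return reward / traffic


# What a dashboard of per-arm reward rates would show.
naive_gap = observed_rate("A") - observed_rate("B")

# Segment-weighted difference between arms: the average causal effect of A vs B.
causal_gap = sum(s["share"] * (s["conv"]["A"] - s["conv"]["B"]) for s in SEGMENTS)

print(f"naive gap:  {naive_gap:+.3f}")
print(f"causal gap: {causal_gap:+.3f}")
```

In this toy setup arm A appears 20 percentage points better (27.5% vs 7.5%) even though its causal effect is zero; the whole gap comes from which users were routed to it.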

Running the analysis

# Checkpoint first to flush the latest data to Parquet
curl -X POST http://localhost:8080/checkpoint -H "X-Api-Key: your-key"

# Install the analysis dependencies
pip install econml scikit-learn polars pandas numpy

# Run causal analysis
python scripts/causal_analysis.py \
  --parquet /data/exports/prices.parquet \
  --features price_sensitivity recency_days cart_value_norm segment_score

The script works with LinUCB, Thompson Sampling, and Progressive campaigns. Thompson Sampling campaigns now log per-prediction propensities via adaptive Monte Carlo sampling (N=8–64 draws per arm, driven by posterior spread). The causal estimator (CausalForestDML) does not use these logged propensities. Instead, it learns the selection propensity internally from observed (arm, context) pairs using Double Machine Learning, which is more appropriate for aggregated bandit data than using time-varying logged propensities from a non-stationary policy. The logged TS propensities are used by the IPS/SNIPS offline evaluator in banditdb.eval.

Section 0: Positivity & Confounding Diagnostics

This section runs before the expensive causal forests and checks whether the DML validity assumptions hold. It is the causal analysis equivalent of the entropy alert: a positivity violation means the bandit never selected a given arm for some region of the feature space, making causal effect estimates for that arm in that region extrapolation rather than inference.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
0. POSITIVITY & CONFOUNDING DIAGNOSTICS
   (pre-flight check - run before fitting causal forests)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Arm                               AUC    P<0.05   P>0.95
  ────────────────────────────────  ─────  ───────  ───────
  price_10                          0.812    2.1%     1.8%
  price_15                          0.831    1.4%     0.9%
  price_20                          0.744   34.2%     0.3%   ⚠ positivity violation

  ⚠  Positivity violations often indicate exploration collapse.
     Check GET /campaign/:id/diagnostics for entropy_status.

The AUC of the internal propensity model (model_t) quantifies how strongly context predicts arm selection. AUC ≈ 0.5 means near-random selection: the bandit was exploring freely and causal estimates are highly reliable. AUC > 0.8 means strong confounding: the DML is doing important correction work, but estimates carry more uncertainty because few users in the "wrong" arm condition exist.

P<0.05 is the fraction of observations where the model predicts near-zero probability of this arm being selected. When this exceeds 20%, the arm was almost never chosen in that feature region and the CATE estimate there is unreliable.
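To make the three columns concrete, the sketch below computes them for one arm from a list of predicted selection probabilities and the observed "was this arm selected" labels. It is an illustration under stated assumptions, not the code in scripts/causal_analysis.py: the function names are hypothetical, the AUC uses a simple pairwise definition (quadratic in the number of rows, fine for a toy example), and the 20% flag follows the threshold described above.

```python
def auc(scores, labels):
    """Pairwise AUC: P(selected row scores higher than unselected row); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one selected and one unselected row")
    wins = sum((1.0 if p > n else (0.5 if p == n else 0.0)) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def positivity_report(scores, labels, low=0.05, high=0.95, max_violation=0.20):
    """Summarise propensity overlap for a single arm."""
    n = len(scores)
    frac_low = sum(s < low for s in scores) / n    # the "P<0.05" column
    frac_high = sum(s > high for s in scores) / n  # the "P>0.95" column
    return {
        "auc": auc(scores, labels),
        "p_lt_low": frac_low,
        "p_gt_high": frac_high,
        "violation": frac_low > max_violation,
    }


# Toy data: predicted P(arm selected | context) and whether the arm was selected.
scores = [0.02, 0.03, 0.04, 0.40, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 0, 1, 0, 1, 1]
report = positivity_report(scores, labels)
print(report)
```

Here three of the eight rows fall below 0.05, so the arm would be flagged: for those contexts there is effectively no data on what happens when the arm is shown.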

Section 2: Causal Assignment vs Bandit Selection

The most operationally important output. The causal forest identifies, for each user in the dataset, which arm would causally produce the best outcome. This is compared to what the bandit actually selected.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
2. CAUSAL ASSIGNMENT vs BANDIT SELECTION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Arm                              Causal   Bandit     Gap
  ────────────────────────────────  ───────  ───────  ───────
  price_10                          42.1%    14.2%   +27.9%  ████████████████████
  price_15                          51.3%    79.4%   -28.1%  █████████████████████████
  price_20                           6.6%     6.4%    +0.2%  ███

| Gap | Interpretation | Action |
| --- | --- | --- |
| Gap ≈ 0 | The bandit's traffic distribution matches the causal structure. The model has converged to the correct policy. | No action. If entropy is also healthy, the campaign is operating correctly. |
| Gap > 0 (arm underserved) | A user group that would causally benefit from this arm is not being routed to it. The bandit is leaving value on the table for that segment. | Check whether the arm had enough early observations. Consider raising alpha or adding context features that distinguish this group. |
| Gap < 0 (arm overserved) | The bandit has over-converged to this arm. It is receiving more traffic than the causal structure warrants, often driven by user selection bias rather than genuine causal advantage. | The arm's observed reward rate is inflated by non-random selection. Do not interpret high reward rate as causal effectiveness without this analysis. |
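The Gap column is the arm's share under the causal assignment minus its share under the bandit (for price_10 above, 42.1% - 14.2% = +27.9%). A minimal sketch of that arithmetic, assuming you have one causally best arm and one bandit-selected arm per user; selection_gap is a hypothetical helper, not the script's own function:

```python
from collections import Counter


def selection_gap(causal_best, bandit_selected):
    """Per-arm gap in percentage points: causal share minus bandit share."""
    if len(causal_best) != len(bandit_selected):
        raise ValueError("both lists must have one entry per user")
    n = len(causal_best)
    causal = Counter(causal_best)
    bandit = Counter(bandit_selected)
    arms = sorted(set(causal) | set(bandit))
    return {arm: 100.0 * (causal[arm] - bandit[arm]) / n for arm in arms}


# Toy data for ten users.
causal_best = ["price_10"] * 4 + ["price_15"] * 5 + ["price_20"] * 1
bandit_selected = ["price_10"] * 1 + ["price_15"] * 8 + ["price_20"] * 1

gaps = selection_gap(causal_best, bandit_selected)
for arm, gap in gaps.items():
    print(f"{arm}: {gap:+.1f}")
```

In the toy data price_10 is underserved by 30 points and price_15 overserved by the same amount, the same pattern as the sample output above.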

Section 5: Selection Stability Over Time

Splits the campaign timeline into five equal-sized buckets (by prediction timestamp) and shows per-arm selection rate across the full campaign history. This surfaces convergence, collapse, and pipeline events directly from the Parquet export without requiring access to the live diagnostics endpoint.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
5. SELECTION STABILITY OVER TIME
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  t1(961)  t2(961)  t3(961)  t4(961)  t5(961)

  price_10   ░░▒▒▒  [ 18.2%  22.1%  28.4%  31.7%  35.9%]
  price_15   ████▓  [ 64.1%  59.3%  55.2%  52.8%  47.3%]
  price_20   ▒░░░·  [ 17.7%  18.6%  16.4%  15.5%  16.8%]

  Stable across buckets  → IID assumption holds, DML estimates reliable
  Monotonic increase     → bandit converging or collapsing to this arm
  Sudden jump in t4/t5   → possible reward pipeline event or config change

The example above shows healthy convergence in progress: price_15 is steadily losing traffic to price_10 as the model learns. The DML IID assumption is not perfectly satisfied (the policy is non-stationary) but the gradual shift means causal estimates over the full history are a reasonable approximation of the average policy. A sudden jump in t4 or t5, rather than a gradual drift, would indicate a pipeline or configuration event and should prompt investigation before trusting the causal estimates.
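The bucketing itself is straightforward to reproduce on your own export. The sketch below is an approximation of the idea rather than the script's implementation: it assumes the events are available as (timestamp, arm) pairs, sorts them by timestamp, splits them into equal-count buckets, and reports each arm's selection rate per bucket. The name selection_by_bucket is hypothetical.

```python
from collections import Counter


def selection_by_bucket(events, n_buckets=5):
    """events: iterable of (timestamp, arm). Returns one {arm: rate} dict per bucket, oldest first."""
    ordered = sorted(events, key=lambda e: e[0])
    n = len(ordered)
    arms = sorted({arm for _, arm in ordered})
    rates = []
    for b in range(n_buckets):
        # Equal-count slices by position in the time-ordered list.
        chunk = ordered[b * n // n_buckets:(b + 1) * n // n_buckets]
        counts = Counter(arm for _, arm in chunk)
        rates.append({arm: (counts[arm] / len(chunk) if chunk else 0.0) for arm in arms})
    return rates


# Toy history: arm "a" dominates early, arm "b" takes over later.
arms_in_order = ["a", "a", "a", "b", "a", "b", "b", "a", "b", "b"]
events = list(enumerate(arms_in_order))  # timestamps 0..9

rates = selection_by_bucket(events)
for i, bucket in enumerate(rates, start=1):
    print(f"t{i}: {bucket}")
```

Reading the per-bucket rates left to right gives the same stable / monotonic / sudden-jump signals described in the legend.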

Interpreting the full output

| Signal | What it means | What to do |
| --- | --- | --- |
| ATE significant & positive | The arm causally increases reward above baseline. | This arm is genuinely effective. Its observed advantage is not explained by selection bias. |
| ATE near zero or inconclusive | The arm's observed reward advantage is not causally real; it is explained by the context features it tends to be selected for. | Audit whether context features are capturing the relevant signal. The arm may be receiving credit for conversions that would have happened anyway. |
| ATE significant & negative | The arm is causally hurting outcomes. | Remove or replace the arm. Its selection is suppressing reward relative to what users would have achieved under a different policy. |
| CATE p25–p75 wide | The treatment effect is heterogeneous: some users respond strongly, others do not. Personalisation is the mechanism driving value. | The bandit's context-aware routing is doing important work. Inspect the winning segments output for which feature dimensions drive the heterogeneity. |
| CATE p25–p75 narrow | The effect is homogeneous: the arm is roughly equally good or bad for all users. | The bandit is correct to converge uniformly. A simpler rule-based policy would achieve similar results. |
| Positivity violation > 20% | The bandit collapsed exploration for this arm. CATE estimates are not trustworthy for the affected feature region. | Check entropy_status in /diagnostics. Determine which operational recipe applies and address the root cause before re-running causal analysis. |

Causal analysis requires at least 200–300 reward observations per arm for reliable CATE estimates. Run POST /checkpoint to export the latest data before each analysis run. The causal forest fits one model per arm, so runtime scales linearly with the number of arms; allow 2–5 minutes for campaigns with 5+ arms and 10,000+ observations.

Use Cases

Any decision that has a context, a finite set of choices, and a measurable outcome is a candidate for BanditDB.

| Domain | Arms (choices) | Context | Reward signal |
| --- | --- | --- | --- |
| LLM routing | GPT-4o, Claude Sonnet, Gemini Flash, … | Task type, input length, session turn, user expertise | LLM-as-judge score, user thumbs up, task completion |
| Prompt strategy | Zero-shot, chain-of-thought, few-shot, structured output | Task complexity, domain, input length, session context | LLM-as-judge quality score (0–1) |
| Agent tool selection | Which tool or sub-agent to invoke for a given step | Task type, prior tool results, cost budget | Task success, latency, cost |
| Dynamic pricing | Price tiers or discount levels | Inventory, seasonality, competitor pricing, customer segment | Revenue per unit or sell-through rate |
| Checkout optimisation | Upsell offer, free shipping, no offer | Cart value, customer history, device type | Conversion (binary) or order value lift |
| Content personalisation | Article, offer, layout variant | User demographics, history, session signals | Click, time-on-page, downstream conversion |
| Legal intake routing | Consult, intake form, refer, decline | Case value, matter complexity, conflict risk, capacity | Matter opened, revenue collected |
| Adaptive clinical trials | Treatment arms | Patient demographics, comorbidities, baseline score | Outcome score normalised to [0, 1] |
| Sleep / wellness | Temperature, light, noise reduction | Sex, age, weight, activity level, bedtime | PSQI score improvement ratio |

The reward must be a scalar in [0, 1]. If your natural metric has a different range, divide by its maximum (e.g. revenue / max_possible_revenue) or use a ratio like (after - before) / before, clipped to [0, 1].
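Both normalisations described above fit in a few lines. This is a sketch with hypothetical helper names (they are not part of the banditdb SDK); it rejects a zero or negative denominator rather than guessing what reward to send.

```python
def clip01(x):
    """Clamp a number into the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def reward_from_max(value, max_value):
    """Scale a metric by its maximum possible value, e.g. revenue / max_possible_revenue."""
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    return clip01(value / max_value)


def reward_from_improvement(before, after):
    """Relative change (after - before) / before, clipped to [0, 1]."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return clip01((after - before) / before)


print(reward_from_max(30.0, 120.0))         # a sale worth a quarter of the maximum
print(reward_from_improvement(10.0, 15.0))  # a 50% improvement
print(reward_from_improvement(10.0, 8.0))   # a decline, clipped to the floor
```

Note that clipping discards information: every decline maps to 0 and every improvement above 100% maps to 1, so choose the denominator with the expected range of outcomes in mind.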

When BanditDB is not the right tool

  • Pure exploration / discovery: if you have no feedback signal yet and are building a dataset from scratch, start with random assignment and switch to BanditDB once you have ~100 outcomes per arm.
  • Very high-dimensional action spaces (thousands of arms): LinUCB scales with arms × feature_dim² in memory. For catalogue-scale recommendation, consider embedding-based retrieval first and use BanditDB for the final re-ranking stage.
  • Non-stationary rewards with hard concept drift: LinUCB assumes rewards are stationary. If the distribution shifts sharply (e.g. a major product change), delete and recreate the campaign to reset the priors.
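The arms × feature_dim² scaling from the second bullet can be turned into a rough back-of-the-envelope estimate. The sketch below assumes one feature_dim × feature_dim matrix of 8-byte floats per arm; the engine's actual storage layout and numeric precision are not stated here, so treat the result as an order-of-magnitude figure only.

```python
def linucb_matrix_bytes(n_arms, feature_dim, bytes_per_value=8):
    """Approximate bytes for one feature_dim x feature_dim matrix per arm.

    bytes_per_value=8 assumes 64-bit floats, which is an assumption about
    the engine rather than a documented fact.
    """
    return n_arms * feature_dim ** 2 * bytes_per_value


small = linucb_matrix_bytes(n_arms=3, feature_dim=32)
large = linucb_matrix_bytes(n_arms=5000, feature_dim=128)

print(f"3 arms x 32 features:     {small / 1e6:.3f} MB")
print(f"5000 arms x 128 features: {large / 1e6:.1f} MB")
```

Under these assumptions a three-arm pricing campaign needs about 25 KB of matrix state, while 5,000 arms with 128 features needs roughly 655 MB, which is why a retrieval stage in front of the bandit is suggested at catalogue scale.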