ClaudeJr
The layer that sits in front of every Claude & GPT call and cuts your bill up to 90% — using prompt caching, cross-provider model routing, a self-verifying speculative cascade, the Batch API, response & semantic caching and prompt compression. All stacked automatically. Built to protect your output quality, not trade it for savings.
Measured, not asserted.
Every figure here is reproducible. ClaudeJr ships with a deterministic benchmark that drives the real engine over a fixed 1,500-call workload, prices each call at published provider rates, and decomposes the saving per technique — gated by sixteen honesty invariants that fail the build if a single dollar can't be traced to provider pricing.
$ npm run claudejr:bench # reproduce every number above
the real, verbatim output — not a screenshot, not a claim
────────────────────────────────────────────────────────────────────────── ClaudeJr savings benchmark — MEASURED against real catalog pricing ────────────────────────────────────────────────────────────────────────── workload : 1500 calls · seed 0x5eed1009 · reproducible engine sees : 45.4% simple · 39.3% standard · 15.3% complex (classifyComplexity) baseline : claude-opus for EVERY call (flagship-for-everything) system prefix : 74 tokens, shared by all calls ────────────────────────────────────────────────────────────────────────── ENGINE BEHAVIOR (what the real engine actually did) cache hit rate : 28.9% (310 exact · 124 semantic) live model calls : 1066 of 1500 routed below opus : 902 of 1066 live (84.6%) input tokens saved : 4,168 via compression volatile bypasses : 25 paraphrase probes, 0 served stale ────────────────────────────────────────────────────────────────────────── MEASURED COST (priced at published per-token rates) baseline spend : $62.55 ClaudeJr spend : $21.16 saved : $41.39 → 66.2% reduction ────────────────────────────────────────────────────────────────────────── WHERE THE SAVING COMES FROM (telescopes exactly to total) response+sem cache $17.77 28.4% smart routing $21.69 34.7% prompt compression $0.06 0.1% output right-sizing $1.87 3.0% sum of buckets $41.39 66.2% ────────────────────────────────────────────────────────────────────────── SPECULATIVE CASCADE (runVerified over 400 live calls) draft accept rate : 67.5% (143/212 cheap drafts shipped) cascade spend : $9.02 vs $17.88 baseline → 49.5% reduction ────────────────────────────────────────────────────────────────────────── SINGLE-FLIGHT COALESCING (burst of 50 identical concurrent calls) upstream calls : 1 of 50 (49 duplicates coalesced for $0) saved on the burst : $1.99 vs $2.03 had every duplicate hit the provider ────────────────────────────────────────────────────────────────────────── VERIFIED GRAY-ZONE CACHE (correctness-gated extra hits, opt-in verifier) no verifier : near-match NOT served — the gray zone is never gambled rejecting verifier : wrong-region answer NOT served — a real call is made accepting verifier : paraphrase served for ~$0, netted of the verify cost → safely LOWER the match threshold for more hits with zero wrong answers ────────────────────────────────────────────────────────────────────────── MODELED (website ROI calculator on the equivalent workload) projected reduction: 89% (measured run() path: 66.2%) note: measured path excludes Batch API + prompt-cache breakpoints + the cascade, so the measured number is deliberately CONSERVATIVE. ────────────────────────────────────────────────────────────────────────── HONESTY INVARIANTS [PASS] exact repeats cost $0 (cache served them) [PASS] compression NEVER inflated input tokens [PASS] routing NEVER dropped below the task quality floor [PASS] volatile asks NEVER served from the semantic cache (freshness) [PASS] semantic cache caught paraphrased repeats [PASS] a big shared system prompt never serves a wrong cached answer [PASS] per-technique buckets telescope to total saved (±$0.01) [PASS] measured reduction is a real fraction (0 < r < 1, no free lunch) [PASS] workload is realistic, not cache-saturated (live ≥ 55%) [PASS] the ROI model never projects past its 92% honesty cap [PASS] right-sizing never lowers an explicit caller cap [PASS] single-flight collapsed a concurrent identical burst to ONE upstream call [PASS] single-flight coalesced exactly N−1 duplicates, ledgered honestly [PASS] verified gray-zone: with NO verifier a near-match is never served (correctness-first) [PASS] verified gray-zone: a rejecting verifier never serves a wrong-region answer (LO≪HI) [PASS] verified gray-zone: an accepting verifier serves a paraphrase, NET of verify cost ────────────────────────────────────────────────────────────────────────── RESULT: ALL INVARIANTS HOLD — numbers are real and bounded. ──────────────────────────────────────────────────────────────────────────
Baseline = the flagship model (Claude Opus) on every call — a deliberately unfavorable yardstick, so savings against a sane setup run higher, not lower. The harness fails CI if routing ever drops below a task's quality floor, if a cached answer is ever wrong or stale, or if the modeled projection exceeds its 92% honesty cap.
Twelve real levers. Stacked on every call.
These aren't tricks — they're the actual pricing mechanics of Claude and GPT, applied automatically so you don't have to think about them. Every dollar traces back to published provider pricing.
Stop paying for the same context twice
Your system prompt, instructions and pinned files are identical on every call. ClaudeJr marks that stable prefix with a cache breakpoint, so the provider reads it at ~10% of input price instead of re-billing it in full each turn.
biggest single lever · up to 90% off inputThe cheapest engine that still clears the bar
A fast classifier scores each task and routes it — across Claude and GPT — to the cheapest model whose reasoning tier meets the task's quality floor. Trivial calls land on Haiku / GPT-4o-mini; hard ones escalate to Sonnet or Opus. Never below the floor.
routes down, never below qualityIdentical requests cost zero
A SHA-256 fingerprint of the normalized request keys a TTL + LRU cache. Ask the same thing twice and the second answer returns instantly for 0 credits — no network, no model.
100% off exact repeatsHalf price for anything async
Work that isn't blocking a user — evals, backfills, agent fan-out — is dispatched through the Batch API at a flat 50% discount, then reconciled back transparently.
50% off non-urgent workFewer input tokens, same meaning
Collapses dead whitespace, blank-line runs and verbatim-duplicate context blocks before billing. The model sees identical meaning — you pay for fewer tokens on the variable part of every prompt.
trims the part caching can'tCatch the paraphrased repeats too
"List the steps" and "what are the steps" are the same question. ClaudeJr matches on meaning (Jaccard similarity over text shingles), not just exact bytes — so near-duplicate prompts hit cache and cost $0. Only for deterministic calls, so variety is never lost. And a borderline match is never guessed: below the safe-serve threshold sits an optional verifier-gated gray zone — a near-match there is served only if a cheap check you supply confirms it genuinely answers the new prompt, otherwise it falls through to a real call. That lets you lower the threshold for more hits with provably zero wrong answers, and the verify cost is netted out of the reported saving.
more hits · never a wrong answerA hard cap so you can't get burned
Set a monthly ceiling. ClaudeJr tracks real spend and blocks calls before they blow the budget — the safety net every broke founder and every Fortune 500 finance team actually wants.
no surprise billsProof, not promises
Every call records tokens and dollars saved, exact vs semantic hit rate, spend-to-date and routed model. You get a running total you can put in front of finance — or an investor.
auditable, per callDraft cheap, verify, escalate only if needed
A fast model writes a draft; a verifier checks it for refusals, truncation and invalid output; only when the draft misses the bar does ClaudeJr escalate to a flagship. You get flagship quality on the hard minority, cheap-model price on the easy majority — and a draft-accept rate in the ledger that proves it.
flagship quality, cheap-model priceNever serve a stale answer to save a buck
Caching repeats is free money — except for time-sensitive asks ("today", "latest", "price of…"). ClaudeJr flags those and bypasses the paraphrase cache for them, so you save on safe repeats without ever returning a stale reply where freshness matters.
savings that can't go staleA trivial ask can't run up a giant bill
Output tokens are the most expensive tokens. When a caller sets no completion limit, ClaudeJr fills in a sane per-task ceiling — small for a one-liner, generous for hard work. It never lowers a limit you set, so it can't truncate intended work.
caps the priciest tokensA concurrent burst becomes one upstream call
The newest lever. When identical requests arrive at the same moment, the response cache can't help — none has finished yet, so each would hit the provider. ClaudeJr collapses the burst with single-flight: one call goes upstream and the duplicates ride along on its exact result for $0. On any failure each caller falls back to its own attempt, so it only ever removes duplicate work — never changes an answer.
one call for the whole burstOne router. Claude and GPT.
ClaudeJr is provider-agnostic. It knows each model's price and reasoning tier, then sends every task to the cheapest option that still meets the quality floor — so you're never locked in and never overpaying.
| Model | Provider | Input /1M | Output /1M | Cache read | Tier |
|---|
Published list prices, USD per million tokens. Cache-read price is what you pay for the stable prefix after the first call — the heart of the savings.
See your savings
Match the sliders to your workload. The math below is exactly what ships in the engine (projectSavings()) — same pricing, same formulas.
🛡️ The "never alter quality" design principle
Every lever removes waste, not intelligence. Caching returns the same bytes. Compression preserves meaning. Routing only ever moves a task down to a model that still clears its quality floor — and escalates the moment a task looks hard. The speculative cascade goes further: it verifies a cheap draft and falls back to a flagship the instant the draft misses the bar, so it can only ever improve an answer, never ship a worse one to save money. When in doubt, ClaudeJr spends more, not less. You can verify it: the ledger logs the model, tier and draft-accept rate for every single call.
Drop-in. One wrap.
ClaudeJr already runs inside APxAI's LLM path. Download the two files below, drop them in, and wrap the SDK client you already have — Claude or GPT — in one line. The adapter mirrors the exact translation APxAI runs in production.
// two files you download below — 0 runtime deps (node:crypto only) import { claudeJr } from "./claudejr"; import { makeAnthropicExecutor } from "./claudejr-adapters"; // Wrap the SDK client you already have — that is the whole integration. const exec = makeAnthropicExecutor(new Anthropic()); // or makeOpenAIExecutor(new OpenAI()) // Cheapest path: cache, compression, cross-provider routing, coalescing & the ledger — free const result = await claudeJr.run(req, exec); // Or cheap-first-then-verify: drafts on a fast model, auto-verifies, and // only escalates to a flagship the instant the draft misses the bar const best = await claudeJr.runVerified(req, exec); console.log(result.content, result.savings.estimatedUsdSaved); // real answer + per-call saving console.log(claudeJr.savingsReport()); // cumulative ledger + draftAcceptRate
Download & install
Free. Self-contained. Runs anywhere Node runs — your laptop, your CI, your VS Code. Drop it in once and every Claude / GPT call gets cheaper automatically.
Engine + adapters + shield · the real thing
The actual engine that runs inside APxAI — zero runtime dependencies (only Node's built-in crypto) — plus the one-line SDK adapters. Drop both into any TypeScript or Node project.
import { claudeJr } from "./claudejr"; import { makeAnthropicExecutor } from "./claudejr-adapters";Download claudejr.ts ↓ claudejr.js ↓
Privacy Shield — redacts your API keys, secrets & PII before any prompt reaches OpenAI / Anthropic / xAI, then restores them in the answer. Your keys never leave your box.
Verify integrity: SHA256SUMS.txt — shasum -a 256 -c SHA256SUMS.txt
VS Code extension
Mission Control + ClaudeJr savings, right in your editor.
Bundled with the APxAI extension
Get the extension →Proxy gateway · works with anything
The universal path. Start the proxy, then point any SDK's base URL at it — Python, Go, Ruby, anything — and every call gets cached, routed and compressed. Zero code changes.
# 1. build & start it from source (Node 20+) git clone https://github.com/caiocarvalho93/APxAI && cd APxAI npm install && npm run build OPENAI_API_KEY=… ANTHROPIC_API_KEY=… npm run claudejr -- --port 8787
# 2. point your client at it base_url="http://localhost:8787/v1"Self-host the proxy ↗
Fully OpenAI-compatible: streaming (stream: true → SSE), /v1/models, and optional bearer-token auth — so existing clients work unchanged. Dependency-free at runtime (node:crypto only) — nothing phones home, your prompts never leave your machine. The proxy is the same engine APxAI runs in production: drop it in front of OpenAI, Anthropic, or both, in any language.