
How to Implement Real-Time Credit Gating with Flexprice

Q: What's the difference between credit gating and rate limiting?

Rate limiting controls request frequency: requests per second, per minute, per hour. Credit gating controls consumption against a prepaid balance. They often coexist: a customer might be rate-limited to 100 requests per second and credit-gated against a $500 prepaid balance. Rate limiting sits at the network layer, before business logic. Credit gating sits at the application layer, after authentication. In Flexprice, rate limiting is out of scope (typically handled by a gateway like Kong or Nginx), while the wallet and entitlement system handles credit gating.

Q: What should happen when a customer's balance hits zero mid-request?

This depends on whether the entitlement has is_soft_limit: true or false. With a hard limit, the pre-request balance check should have blocked the request. If it didn't (because the check read a stale balance), the recommendation is to serve the request, record the overage, and notify the customer. Cutting off a request mid-execution is usually worse than a small amount of over-service. With a soft limit, the system serves the request regardless and bills the overage at the plan's overage rate at invoice time.

Q: How do you handle credit expiry without surprising customers?

The threshold alert system fires webhook events at configurable thresholds (critical, warning, info) before the balance reaches zero. For credits with a specific expiry date, each credit tranche stores its own expiry and the feature.balance.threshold.alert webhook fires as expiry approaches. The practical implementation: subscribe to this webhook, send a transactional email 7 days before expiry, and again at 1 day. Flexprice logs the alert state transitions (OK to info to warning to in_alarm) so you have a full audit trail of when notifications should have fired.

Q: How do you check a customer's current usage against their entitlement limit?

For metered entitlements, the entitlement stores the `usage_limit` and `usage_reset_period`. Current consumption is tracked separately in the usage events system. Flexprice gives you a single call that combines both:

```bash
GET /api/v1/customers/usage?customer_id={customer_id}
```

You can also pass `customer_lookup_key` instead of `customer_id` if that's how you reference customers.

The response includes a `features` array. Each entry in that array contains `total_limit`, `current_usage`, `usage_percent`, `is_soft_limit`, and `is_unlimited` for that feature, plus a `sources` array that breaks the numbers down by subscription. If a customer is on a plan with multiple metered features, all of them come back in the same call.

Use this endpoint for access control decisions. The [Get Customer Usage Summary](https://docs.flexprice.io/api-reference/customers/get-customer-usage-summary) docs cover the full response shape. The `POST /api/v1/meter-usage/analytics` endpoint ([Get Meter Usage Analytics](https://docs.flexprice.io/api-reference/events/get-usage-analytics)) is built for reporting across multiple meters, not for real-time gating.
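As a rough sketch of how the per-feature fields named in this answer could drive an access decision, the Go snippet below decodes a usage summary and lists the features that should block further requests. The field names come from the answer above; the JSON value types, the `feature_id` key, and the sample payload are assumptions made for illustration, so check the linked docs for the real response shape.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FeatureUsage mirrors the per-feature fields named in the answer above.
// The value types and the feature_id key are assumptions, not the documented shape.
type FeatureUsage struct {
	FeatureID    string  `json:"feature_id"` // hypothetical key, for illustration only
	TotalLimit   float64 `json:"total_limit"`
	CurrentUsage float64 `json:"current_usage"`
	UsagePercent float64 `json:"usage_percent"`
	IsSoftLimit  bool    `json:"is_soft_limit"`
	IsUnlimited  bool    `json:"is_unlimited"`
}

// UsageSummary holds the features array from the response.
type UsageSummary struct {
	Features []FeatureUsage `json:"features"`
}

// OverHardLimit returns the IDs of features that are limited, hard-limited,
// and already at or above their total limit.
func OverHardLimit(body []byte) ([]string, error) {
	var s UsageSummary
	if err := json.Unmarshal(body, &s); err != nil {
		return nil, err
	}
	var blocked []string
	for _, f := range s.Features {
		if !f.IsUnlimited && !f.IsSoftLimit && f.CurrentUsage >= f.TotalLimit {
			blocked = append(blocked, f.FeatureID)
		}
	}
	return blocked, nil
}

// sampleBody is an illustrative payload, not a captured API response.
const sampleBody = `{"features":[
 {"feature_id":"feat_api_calls_01","total_limit":100000,"current_usage":100000,"usage_percent":100,"is_soft_limit":false,"is_unlimited":false},
 {"feature_id":"feat_storage_01","total_limit":50,"current_usage":60,"usage_percent":120,"is_soft_limit":true,"is_unlimited":false}
]}`

func main() {
	blocked, err := OverHardLimit([]byte(sampleBody))
	if err != nil {
		panic(err)
	}
	fmt.Println(blocked)
}
```

In the sample, the soft-limited feature is over its limit but is not reported as blocked, which is the behavior the `is_soft_limit` flag is meant to express.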

Q: What's the right architecture for credit gating at very high request volumes?

At low volume (under 1,000 requests per second), synchronous balance checks before every request are fine. At higher volumes, the recommended pattern shifts: check balance at session start or at periodic intervals, batch usage events with the async client, and use the alert threshold system to catch balance exhaustion before it happens rather than checking before every single request. The aggregate table architecture (agg_usage_period_totals in ClickHouse) is specifically designed for sub-100ms balance reads at scale, but calling it thousands of times per second on hot paths still has a cost. The right answer depends on your traffic pattern and your tolerance for over-service.

Apr 18, 2026

• 13 min read

Aanchal Parmar

Product Marketing Manager, Flexprice

The support ticket came in at 11 am on a Tuesday.

An AI API company we work with was describing a problem that had surfaced overnight. One of their users had burned through $400 in prepaid API credits and then kept going for another $180 before the system caught it.

The gating check was reading from a cached balance that was five minutes stale. Five minutes is a long time when a user is running a batch job at 3,000 requests per minute.

The team had built credit gating into their product. It was just the wrong kind.

Credit gating is one of those things that sounds straightforward until you try to implement it at production scale.

The basic idea is simple: before you serve a request, check if the user has enough credit. If they do, serve it and deduct. If they don't, block it. Three steps.

The hard part isn't the logic. It's the timing, the concurrency, the expiry ordering, and the question of what "current balance" actually means for a user who is sending hundreds of requests per second.

Here's what a production credit gating implementation needs to handle, and how Flexprice's architecture addresses each piece.

What credit gating actually means

The phrase "credit gating" covers two distinct checks that most implementations collapse into one.

The first check is entitlement-based: does this customer have access to this feature at all? This is plan-level access control. A customer on the free tier doesn't have access to certain features regardless of their credit balance.

The second check is balance-based: does this customer have enough credits to pay for this specific request? This is consumption-based access control. The customer has the feature, but they may have run out of prepaid units.

Collapsing these into a single check works fine at low volume. At scale, they fail in different ways at different moments, and you need to handle them independently.

The entitlement layer

In Flexprice, entitlements represent "the benefits a customer gets from a subscription plan." When a customer subscribes to a plan, Flexprice assigns the associated entitlements automatically based on the plan's configuration.

Three types of features can be attached to an entitlement:

Boolean features

This is the simplest form of gating. The customer either has access or they don't. When you check this entitlement, you're reading a single IsEnabled flag. There's no balance involved. A customer on the free plan has IsEnabled: false for the premium API. A customer on the growth plan has IsEnabled: true.

Metered features

This is where it gets interesting. Each metered entitlement has a UsageLimit (nil means unlimited), a UsageResetPeriod (monthly, yearly, one-time), and an IsSoftLimit boolean. The IsSoftLimit field is the one most implementations ignore: if it's true, the system allows usage beyond the limit rather than blocking it. You still track and bill for the overage. You just don't deny the request.

Config features

These return a static value. Useful for things like "max file size" or "rate limit tier" that vary by plan but don't involve consumption.

Creating an entitlement that attaches a metered feature to a plan:

POST /v1/entitlements
{
  "entity_type": "plan",
  "entity_id": "plan_growth_01",
  "feature_id": "feat_api_calls_01",
  "feature_type": "metered",
  "usage_limit": 100000,
  "usage_reset_period": "monthly",
  "is_soft_limit": false
}

With is_soft_limit: false, the system tracks usage against usage_limit: 100000. When a customer's metered usage for the current period hits that ceiling, the gate blocks access for the rest of the period.

With is_soft_limit: true, the system continues serving beyond the limit and bills for the overage. This is the right choice for most production AI API products. Cutting off a customer mid-request because they hit a soft threshold is usually worse than letting them continue and having a conversation about their plan.
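The hard-versus-soft distinction can be sketched in a few lines of Go. This is a minimal illustration, not Flexprice's implementation: the `MeteredEntitlement` struct is a local stand-in that borrows the `UsageLimit` and `IsSoftLimit` names from the text, and the choice to count only the units beyond the limit as overage is an assumption.

```go
package main

import "fmt"

// MeteredEntitlement is a local stand-in using the field names from the text.
type MeteredEntitlement struct {
	UsageLimit  *int64 // nil means unlimited
	IsSoftLimit bool
}

// Check reports whether a request for `units` more usage is allowed and how
// many of those units fall beyond the limit (the overage to bill under a soft
// limit). A hard limit blocks any request that would push usage past the
// limit; for single-unit requests that is the same as "usage at or above the limit".
func (e MeteredEntitlement) Check(used, units int64) (allowed bool, overage int64) {
	if e.UsageLimit == nil {
		return true, 0
	}
	over := used + units - *e.UsageLimit
	if over <= 0 {
		return true, 0
	}
	if !e.IsSoftLimit {
		return false, 0
	}
	if over > units {
		over = units // usage was already past the limit: the whole request is overage
	}
	return true, over
}

func main() {
	limit := int64(100000)
	hard := MeteredEntitlement{UsageLimit: &limit}
	soft := MeteredEntitlement{UsageLimit: &limit, IsSoftLimit: true}

	allowed, overage := hard.Check(100000, 1)
	fmt.Println("hard limit at ceiling:", allowed, overage)

	allowed, overage = soft.Check(99998, 5)
	fmt.Println("soft limit crossing ceiling:", allowed, overage)
}
```

The soft-limit case serves the request and reports three units of overage, which is the number you would bill at the plan's overage rate.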

The balance layer

Below the entitlement check sits the wallet. Wallets in Flexprice hold prepaid credit balances for a customer. Every wallet tracks two numbers: CreditBalance (in credit units) and Balance (in currency). The relationship between them is Balance = CreditBalance × ConversionRate. A separate TopupConversionRate lets you sell credits in bulk at a different rate than the standard conversion: a 20% discount on a large credit purchase is just a different TopupConversionRate.
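A small worked example may make the two rates concrete. The sketch below assumes a standard conversion of 1 credit = $1.00 and reads "a 20% discount" as a top-up rate of $0.80 per credit; how Flexprice actually applies TopupConversionRate is not spelled out here, so treat that interpretation, the integer cents-per-credit representation, and the function names as assumptions.

```go
package main

import "fmt"

// Rates are expressed as integer cents per credit to keep the arithmetic exact.
const (
	conversionCentsPerCredit = 100 // assumed standard rate: 1 credit = $1.00
	topupCentsPerCredit      = 80  // hypothetical bulk rate: 20% cheaper per credit
)

// BalanceCents applies Balance = CreditBalance x ConversionRate.
func BalanceCents(credits int64) int64 {
	return credits * conversionCentsPerCredit
}

// CreditsForTopup returns the credits granted for a payment at the top-up rate.
func CreditsForTopup(paymentCents int64) int64 {
	return paymentCents / topupCentsPerCredit
}

func main() {
	credits := CreditsForTopup(40000) // customer pays $400.00
	balance := BalanceCents(credits)
	fmt.Printf("credits=%d balance=$%d.%02d\n", credits, balance/100, balance%100)
	// prints: credits=500 balance=$500.00
}
```

Under these assumptions a $400 payment buys 500 credits, which the wallet then values at $500 when it converts credits to currency.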

Creating a wallet to hold the customer's credits:

POST /v1/wallets
{
  "customer_id": "cust_abc123",
  "currency": "USD",
  "conversion_rate": 1.0
}

For recurring grants (the kind that replenish automatically each billing period), use cadence: "RECURRING" with a period value:

POST /v1/creditgrants
{
  "name": "Monthly Credit Allowance",
  "scope": "PLAN",
  "subscription_id": "sub_xyz789",
  "plan_id": "plan_growth_01",
  "credit_amount": 200,
  "currency": "USD",
  "cadence": "RECURRING",
  "period": "monthly",
  "expire_in_days": 60
}

The expire_in_days field sets how long each credit tranche lasts. When credits expire, the system generates a corresponding debit transaction automatically so the ledger stays accurate.

Checking balance in real time

The balance check before serving a request uses a single endpoint:

GET /v1/wallets/{wallet_id}/balance/real-time

The response includes the current balance with pending usage factored in. For prepaid wallets, the real-time calculation subtracts pending usage charges and unpaid invoices from the stored balance. This is what separates a real-time balance check from a cached one: the number reflects what's actually been consumed, not what was committed to the database at the last reconciliation.

For latency-sensitive paths, the balance API is designed for a p95 target under 100ms. The architecture that makes this possible: usage events flow through Kafka into a ClickHouse aggregate table (agg_usage_period_totals) via a materialized view. The balance query reads from that aggregate table, not from raw events. No full event scan on the hot path.

For the highest-volume gating paths, the endpoint also supports a get_from_cache parameter that trades some freshness for lower latency. The cache has a configurable staleness window via max_live_seconds. The right value depends on your risk tolerance: a 10-second stale cache is fine for a dashboard, but not for gating at 3,000 requests per minute.
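To put a number on "risk tolerance", it helps to estimate how many requests can slip through a gate that reads a stale balance. The sketch below is back-of-the-envelope arithmetic only. The per-request cost of $0.012 is hypothetical: it is what you get if you assume the $180 overage from the opening story accrued during a single five-minute window at a steady 3,000 requests per minute.

```go
package main

import "fmt"

// StaleRequests estimates how many requests can pass a gate that reads a
// balance up to staleSeconds old, at a steady reqPerMin request rate.
func StaleRequests(reqPerMin, staleSeconds int64) int64 {
	return reqPerMin * staleSeconds / 60
}

// ExposureMicros converts that request count into worst-case over-service,
// in millionths of a dollar, for a given per-request cost.
func ExposureMicros(reqPerMin, staleSeconds, costMicros int64) int64 {
	return StaleRequests(reqPerMin, staleSeconds) * costMicros
}

func main() {
	const costMicros = 12_000 // hypothetical: $0.012 per request
	for _, s := range []int64{10, 300} {
		fmt.Printf("%ds stale: %d requests, $%d at risk\n",
			s, StaleRequests(3000, s), ExposureMicros(3000, s, costMicros)/1_000_000)
	}
	// prints:
	// 10s stale: 500 requests, $6 at risk
	// 300s stale: 15000 requests, $180 at risk
}
```

Plugging in your own request rate and unit cost gives a concrete ceiling to weigh against the latency saved by reading from cache.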

If you're resolving a customer by their external ID rather than their internal Flexprice ID:

GET /v1/customers/wallets?external_customer_id=cust_external_001

This returns all wallets for that customer, which you can then check individually or in aggregate.

The Gating Pattern in Practice

Here's the sequence a properly implemented credit gate runs on every incoming request:

Step 1: Check entitlement.

GET /v1/entitlements?entity_type=subscription&entity_id=sub_xyz789&feature_id=feat_api_calls_01

If the entitlement doesn't exist, the customer doesn't have access to this feature. Return 403.

If the entitlement exists and feature_type is boolean, check is_enabled. If false, return 403.

If feature_type is metered and is_soft_limit is false, you'll need to compare current usage against usage_limit. If current usage is at or above the limit, return 429 with a clear message about the period reset date.

Step 2: Check wallet balance.

GET /v1/wallets/{wallet_id}/balance/real-time

If balance is 0 or below, return 402. The customer needs to top up.

Step 3: Serve the request.

Step 4: Ingest the usage event.

POST /v1/events
{
  "event_name": "api_call",
  "external_customer_id": "cust_external_001",
  "timestamp": "2026-04-09T14:32:00Z",
  "properties": {
    "model": "gpt-4o",
    "tokens_used": "1200"
  }
}

For high-volume event ingestion, the Go SDK's async client batches events and sends them in the background:

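// client is an already-configured Flexprice API client;
// externalCustomerID is the caller's own ID for the customer.
asyncConfig := flexprice.DefaultAsyncConfig()
asyncClient := client.NewAsyncClientWithConfig(asyncConfig)
defer asyncClient.Close()

// Enqueue adds the event to the background batch instead of sending it inline.
asyncClient.Enqueue("api_call", externalCustomerID, map[string]interface{}{
	"model":       "gpt-4o",
	"tokens_used": 1200,
})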

This keeps the post-request path non-blocking. The batch flushes on a configurable interval rather than per-event.
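The pre-serve part of the sequence above (steps 1 and 2) boils down to a function from already-fetched state to an HTTP status. The sketch below shows that mapping in Go. The `Snapshot` struct is a local stand-in and does not mirror Flexprice response shapes; fetching the entitlement, usage, and balance is left out on purpose.

```go
package main

import (
	"fmt"
	"net/http"
)

// Snapshot holds values the gate has already fetched. It is a local stand-in
// for illustration and does not mirror Flexprice response shapes.
type Snapshot struct {
	HasEntitlement bool
	FeatureType    string // "boolean", "metered", or "config"
	IsEnabled      bool
	UsageLimit     *int64 // nil means unlimited
	IsSoftLimit    bool
	CurrentUsage   int64
	BalanceCents   int64 // real-time wallet balance
}

// GateStatus returns the HTTP status the gate should answer with before serving.
func GateStatus(s Snapshot) int {
	// Step 1: entitlement check.
	if !s.HasEntitlement {
		return http.StatusForbidden // 403: no access to this feature
	}
	switch s.FeatureType {
	case "boolean":
		if !s.IsEnabled {
			return http.StatusForbidden // 403: feature switched off on this plan
		}
	case "metered":
		if !s.IsSoftLimit && s.UsageLimit != nil && s.CurrentUsage >= *s.UsageLimit {
			return http.StatusTooManyRequests // 429: hard limit reached for the period
		}
	}
	// Step 2: wallet balance check.
	if s.BalanceCents <= 0 {
		return http.StatusPaymentRequired // 402: customer needs to top up
	}
	return http.StatusOK // safe to serve, then ingest the usage event
}

func main() {
	limit := int64(100000)
	s := Snapshot{
		HasEntitlement: true,
		FeatureType:    "metered",
		UsageLimit:     &limit,
		CurrentUsage:   42000,
		BalanceCents:   0,
	}
	fmt.Println(GateStatus(s)) // prints: 402
}
```

Keeping the decision pure like this makes it easy to unit test every branch without standing up the billing backend.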


The Problems You Don't Think About Until They Hit You

The concurrency problem. Two requests arrive 50 milliseconds apart, each costing $1.20. Both read the same balance: $1.20. Both pass the balance check. Both get served. The customer now owes $2.40 against a $1.20 balance.

Flexprice handles this at the wallet operation level with an advisory lock on the wallet before any credit or debit. The lock is acquired before reading the wallet state within a transaction, so concurrent modifications serialize. The wallet's debit flow goes: acquire lock, read state, validate amounts, consume eligible credits in expiry order, create the transaction record, update balance atomically, release lock, publish webhook event. An IdempotencyKey on wallet operations means retries don't double-debit.
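The lock-then-read ordering and the idempotency key can be illustrated with an in-memory toy. This is not Flexprice's code: a `sync.Mutex` stands in for the database advisory lock, a map stands in for stored idempotency keys, and all type and function names are made up for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrInsufficient is returned when a debit exceeds the available balance.
var ErrInsufficient = errors.New("insufficient balance")

// Wallet is an in-memory stand-in. The mutex plays the role the advisory lock
// plays in the text; a real system would take the lock in the database.
type Wallet struct {
	mu      sync.Mutex
	cents   int64
	applied map[string]bool // idempotency keys of debits already applied
}

// NewWallet creates a wallet with an opening balance in cents.
func NewWallet(cents int64) *Wallet {
	return &Wallet{cents: cents, applied: make(map[string]bool)}
}

// Debit acquires the lock before reading state, so concurrent debits
// serialize. A retry with an already-applied key succeeds without debiting again.
func (w *Wallet) Debit(key string, amount int64) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.applied[key] {
		return nil
	}
	if amount > w.cents {
		return ErrInsufficient
	}
	w.cents -= amount
	w.applied[key] = true
	return nil
}

// Balance returns the current balance in cents.
func (w *Wallet) Balance() int64 {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.cents
}

func main() {
	w := NewWallet(120) // $1.20
	results := make([]error, 2)
	var wg sync.WaitGroup
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = w.Debit(fmt.Sprintf("req-%d", i), 120) // two $1.20 debits race
		}(i)
	}
	wg.Wait()
	fmt.Println(results, w.Balance())
}
```

Whichever goroutine wins the lock succeeds; the other sees the updated balance and is rejected, so the wallet ends at zero rather than at minus $1.20.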

The expiry ordering problem. A customer has three credit tranches: 500 credits expiring in 5 days, 1,000 credits expiring in 30 days, and 2,000 credits expiring in 90 days. Your system consumes them FIFO: oldest first. The 2,000 credits, issued first, get consumed. The 500 credits expire five days later, unused. The customer lost credits they legitimately purchased.

Flexprice's debit algorithm uses expiry-first consumption: soonest-to-expire credits are always consumed before longer-lived ones. Within credits sharing the same expiry, priority ordering applies. This is the correct default for nearly every use case, and it's one of those things that's easy to specify and surprisingly hard to implement correctly under load.
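Expiry-first consumption is easy to state in code when there is no concurrency involved. The sketch below runs the three-tranche example from above through a simple sort-then-consume loop. The `Tranche` type, the days-until-expiry field, and the "lower priority value goes first" tie-break are assumptions for illustration, not Flexprice's data model.

```go
package main

import (
	"fmt"
	"sort"
)

// Tranche is a hypothetical credit tranche for this example.
type Tranche struct {
	ID            string
	Credits       int64
	ExpiresInDays int
	Priority      int // lower value is consumed first among equal expiries (assumed convention)
}

// Consume debits amount across tranches, soonest-to-expire first, and returns
// how much was taken from each tranche ID. It returns false and leaves the
// tranches untouched if the total credits are insufficient.
func Consume(tranches []Tranche, amount int64) (map[string]int64, bool) {
	var total int64
	for _, t := range tranches {
		total += t.Credits
	}
	if amount > total {
		return nil, false
	}

	// Sort indices rather than the slice so the caller's order is preserved.
	order := make([]int, len(tranches))
	for i := range order {
		order[i] = i
	}
	sort.SliceStable(order, func(a, b int) bool {
		ta, tb := tranches[order[a]], tranches[order[b]]
		if ta.ExpiresInDays != tb.ExpiresInDays {
			return ta.ExpiresInDays < tb.ExpiresInDays
		}
		return ta.Priority < tb.Priority
	})

	taken := make(map[string]int64)
	for _, idx := range order {
		if amount == 0 {
			break
		}
		take := tranches[idx].Credits
		if take > amount {
			take = amount
		}
		if take == 0 {
			continue
		}
		tranches[idx].Credits -= take
		taken[tranches[idx].ID] = take
		amount -= take
	}
	return taken, true
}

func main() {
	// Listed in issue order: the 90-day tranche was issued first.
	tranches := []Tranche{
		{ID: "t90", Credits: 2000, ExpiresInDays: 90},
		{ID: "t30", Credits: 1000, ExpiresInDays: 30},
		{ID: "t5", Credits: 500, ExpiresInDays: 5},
	}
	taken, ok := Consume(tranches, 700)
	fmt.Println(taken, ok) // prints: map[t30:200 t5:500] true
}
```

A 700-credit debit drains the 5-day tranche and dips into the 30-day one, leaving the 90-day tranche untouched, which is the opposite of what the FIFO ordering in the example would do.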

The stale cache problem. This is the one from the opening. Real-time balance checks are more expensive than cached reads. But for credit gating, the decision to use a cached balance needs to be explicit. It shouldn't be something that happens because someone added caching at the middleware layer without realizing it was being used for access control.

The Flexprice balance endpoint's get_from_cache parameter makes the caching decision explicit per call. For a credit gate check, you want get_from_cache=false unless you've done the math on what stale reads cost you at your traffic volume.

The threshold alert problem. A customer's balance is depleting but hasn't hit zero yet. Their batch job will run in 6 hours. Their current burn rate will exhaust the account in 4 hours. Without proactive notification, the job fails mid-run.

Wallet balance alerts fire via Kafka when the balance crosses configured critical, warning, and info thresholds. The CheckWalletBalanceAlert method evaluates the balance and triggers state transitions from OK through info, warning, and into in_alarm. The system throttles alerts at the customer level via in-memory cache to prevent spam during rapid consumption, but ForceCalculateBalance: true bypasses the throttle for critical path checks.

The auto top-up problem. Some customers want credits to replenish automatically when the balance falls below a threshold. The AutoTopup field on a wallet, combined with a configured threshold, handles this. When the balance alert check detects a balance below the auto top-up threshold, triggerAutoTopup() fires. It either creates a new invoice-backed credit or directly credits the wallet depending on the AutoCompletePurchasedCreditTransaction setting.

Where This Breaks If You Build It Yourself

The actual gating check (two API calls before serving, one event after) is not complex code. You could write it in an afternoon.

What takes months to get right is everything underneath the check itself.

Selecting credits by expiry date under concurrent load requires advisory locking at the wallet level, not application-level checks. Making balance reads sub-100ms requires an aggregate table architecture, not raw event scans. Threshold alerts that fire before customers hit zero require a Kafka pipeline with per-customer throttling. Idempotency on retries requires explicit keys on every wallet operation. Soft limits that don't cut off customers mid-request require those limits to be configured per-entitlement, not as a global billing setting.

Each piece is independently solvable. The problem is they need to work together, and they interact. A change to expiry ordering logic can break concurrent debit behavior. Adding a cache to the balance check invalidates the alert threshold logic. None of this shows up in testing.

Teams that build credit gating in-house typically get the happy path right in sprint one. They find the edge cases in production, over the next 18 months, one customer incident at a time.

If you're pressure-testing a credit-gating implementation before it ships, we can walk through the specific scenarios your architecture needs to handle. No pitch, just a look at where the gaps usually appear.


Aanchal Parmar

Aanchal Parmar heads content marketing at Flexprice.io. She’s been in content for seven years across SaaS, Web3, and now AI infra. When she’s not writing about monetization, she’s either signing up for a new dance class or testing a recipe that’s definitely too ambitious for a weeknight.
