Multi-Country Payment Engine

The Problem

Every Monday morning, the finance analyst opened her spreadsheet and started over. Payments had processed across the UK, Germany, France, and the US over the previous week, but the settlement records didn't match the order records, so she spent six hours manually tracing which charge corresponded to which order and which currency conversion had applied at which moment. Six hours. Every week. That was the human cost before anything technical went wrong.

Technical things went wrong constantly. At peak hours, 3-4% of webhook deliveries failed. The system had no retry logic, so those failures were permanent: an order's payment would show as pending in the database while Stripe had already confirmed the charge. Worse, Stripe would retry the webhook, and without idempotency protection, some of those retries re-triggered the charge flow entirely. Duplicate charges. Not often, but often enough that customer support was fielding complaints every week.

The marketplace had grown into a multi-party payments problem and was still running infrastructure designed for a single merchant. Four countries meant four currencies, separate tax rules, and merchants in different jurisdictions each expecting payouts in their local currency. There was no infrastructure for splitting platform fees from merchant payouts, no system for routing settlements to connected accounts. The team knew they needed something different. They just didn't know how far the existing system was from being able to get there.

The Approach

Stripe Connect was the answer to the multi-party problem. Not a payment widget, not a hosted form: a platform product built specifically for marketplaces that move money between buyers, sellers, and the platform across multiple currencies. Stripe Connect handled connected account KYC/KYB and PSD2 Strong Customer Authentication for EU transactions. Rebuilding those pieces from scratch would have been months of compliance work and ongoing regulatory maintenance. Choosing Stripe Connect meant buying out of that problem entirely and focusing engineering effort on the payment reliability layer instead. Tax calculation was managed by the client's finance team using Stripe Tax. My scope was the payment processing core: the state machine, idempotency layer, webhook pipeline, and financial ledger.

The reliability layer had three components. BullMQ backed by Redis handled all async payment operations: webhook processing, settlement triggers, retry scheduling, dead-letter queue routing. Every payment job type got its own retry strategy. Redis also backed idempotency: every payment attempt generated a key from the order ID and attempt number, stored with a 72-hour TTL. A retry hitting an existing key returned the cached result immediately. No reprocessing. No duplicate charges. PostgreSQL served as the financial ledger, append-only, with advisory locks preventing race conditions on the same payment.

Architecture

Payment Flow

A payment starts when the client calls the payment API. The first thing the API does is check Redis for an idempotency key derived from the order ID and attempt number. If the key exists, return the cached result. No Stripe call, no database write. The request is done.

If the key doesn't exist, the API creates a Stripe Connect PaymentIntent, writes a pending record to the PostgreSQL ledger, stores the idempotency key in Redis, and returns to the client. The client has a confirmed payment intent. Settlement happens asynchronously.
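The check-then-create flow above can be sketched as follows. This is a minimal illustration, not the production code: the IdempotencyStore interface, the initiatePayment function, and the two injected callbacks (standing in for the Stripe call and the ledger insert) are all hypothetical names.

```typescript
// Hypothetical stand-in for the Redis idempotency store.
interface IdempotencyStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

interface PaymentResult {
  paymentIntentId: string;
  status: "pending";
}

// 72-hour TTL, as described in the text.
const IDEMPOTENCY_TTL_SECONDS = 72 * 60 * 60;

async function initiatePayment(
  key: string,
  store: IdempotencyStore,
  createIntent: () => Promise<string>, // assumed wrapper around the Stripe call
  writePendingRow: (paymentIntentId: string) => Promise<void>, // assumed ledger insert
): Promise<PaymentResult> {
  // 1. Key already present: return the cached result, no Stripe call, no write.
  const cached = await store.get(key);
  if (cached !== null) {
    return JSON.parse(cached) as PaymentResult;
  }

  // 2. New request: create the intent, record a pending row, cache the result.
  const paymentIntentId = await createIntent();
  await writePendingRow(paymentIntentId);
  const result: PaymentResult = { paymentIntentId, status: "pending" };
  await store.set(key, JSON.stringify(result), IDEMPOTENCY_TTL_SECONDS);
  return result;
}
```

The sketch ignores two requests racing between the get and the set; a real implementation would need an atomic claim on the key or a lock around the whole sequence.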

When the webhook arrives from Stripe: verify the signature using constructEvent(). Invalid signatures are hard rejections. Check the idempotency key again. If the key has a settled result, return it. Otherwise, update the ledger to the new state, enqueue a settlement job in BullMQ, and return 200.
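For readers unfamiliar with what the signature check involves, here is a rough sketch of the scheme Stripe documents: the Stripe-Signature header carries a timestamp (t) and one or more v1 signatures, each an HMAC-SHA256 of "timestamp.payload" under the endpoint's signing secret. In production the SDK's constructEvent() does this work; the function name and the injected clock below are assumptions for illustration only.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Illustrative only: mirrors the documented scheme, not a replacement for the SDK.
function verifySignatureSketch(
  payload: string, // raw request body, exactly as received
  header: string, // value of the Stripe-Signature header
  secret: string, // endpoint signing secret
  nowSeconds: number, // injected clock, to keep the sketch testable
  toleranceSeconds = 300, // five minutes, assumed to match the SDK default
): boolean {
  let timestamp: string | undefined;
  const signatures: string[] = [];
  for (const part of header.split(",")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue;
    const name = part.slice(0, idx);
    const value = part.slice(idx + 1);
    if (name === "t") timestamp = value;
    else if (name === "v1") signatures.push(value);
  }
  if (timestamp === undefined || signatures.length === 0) return false;

  // Reject stale or malformed timestamps to limit replay of old events.
  const ts = Number(timestamp);
  if (!Number.isFinite(ts) || Math.abs(nowSeconds - ts) > toleranceSeconds) return false;

  const expected = Buffer.from(
    createHmac("sha256", secret).update(`${timestamp}.${payload}`, "utf8").digest("hex"),
    "utf8",
  );
  // Constant-time comparison; a length check first because timingSafeEqual throws on mismatch.
  return signatures.some((sig) => {
    const candidate = Buffer.from(sig, "utf8");
    return candidate.length === expected.length && timingSafeEqual(candidate, expected);
  });
}
```

The important operational detail is that the payload must be the raw body: re-serialized JSON will not produce the same signature.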

Every payment moves through seven states: created, authorized, captured, settled, refunded, disputed, failed. Each state transition is a new row in the ledger. The current state is always the most recent row for a given payment ID. In hindsight, four states cover 95% of production cases: pending, settled, refunded, failed. I added the authorized, captured, and disputed states in anticipation of edge cases that turned out to be too rare to justify the complexity. A smaller state machine would have been easier to reason about.
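A compact way to picture the state machine is a transition table plus a ledger that only ever grows. The seven states come from the text; the table of legal transitions is my assumption for illustration (the write-up names the states but not the edges), as are the LedgerRow shape and function names.

```typescript
type PaymentState =
  | "created" | "authorized" | "captured" | "settled"
  | "refunded" | "disputed" | "failed";

// Assumed transition table; the real system's edges may differ.
const LEGAL_TRANSITIONS: Record<PaymentState, readonly PaymentState[]> = {
  created: ["authorized", "captured", "failed"],
  authorized: ["captured", "failed"],
  captured: ["settled", "refunded", "failed"],
  settled: ["refunded", "disputed"],
  refunded: [],
  disputed: ["settled", "refunded"],
  failed: [],
};

interface LedgerRow {
  paymentId: string;
  state: PaymentState;
  event: string; // the triggering event, e.g. a webhook type
  at: Date;
}

// Current state = state of the most recent row for this payment.
function currentState(rows: readonly LedgerRow[], paymentId: string): PaymentState | undefined {
  let latest: LedgerRow | undefined;
  for (const row of rows) {
    if (row.paymentId !== paymentId) continue;
    if (latest === undefined || row.at.getTime() >= latest.at.getTime()) latest = row;
  }
  return latest?.state;
}

// Appends a new row after validating the transition; existing rows are never modified.
function appendTransition(
  rows: LedgerRow[],
  paymentId: string,
  next: PaymentState,
  event: string,
  at: Date = new Date(),
): LedgerRow {
  const current = currentState(rows, paymentId);
  if (current === undefined) {
    if (next !== "created") throw new Error(`payment ${paymentId} must start in 'created'`);
  } else if (!LEGAL_TRANSITIONS[current].includes(next)) {
    throw new Error(`illegal transition ${current} -> ${next}`);
  }
  const row: LedgerRow = { paymentId, state: next, event, at };
  rows.push(row);
  return row;
}
```

Shrinking to the four-state version would mean shrinking the table, not changing the append logic.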

Webhook Reliability

Idempotency key design: sha256(orderId + attemptNumber), stored in Redis with a 72-hour TTL. The TTL matches Stripe's maximum webhook retry window of three days. If a retry arrives late in that window and the key is still live, the response is instant and correct.
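A minimal sketch of the key derivation, using Node's built-in crypto module. The ":" separator is my addition and not part of the formula above: with plain concatenation, order "12" attempt 3 and order "1" attempt 23 would hash the same input. The function name is hypothetical.

```typescript
import { createHash } from "node:crypto";

// Deterministic key: the same order and attempt always map to the same digest.
function idempotencyKey(orderId: string, attemptNumber: number): string {
  if (!Number.isInteger(attemptNumber) || attemptNumber < 1) {
    throw new Error("attemptNumber must be a positive integer");
  }
  // Separator added (assumption) so that ("12", 3) and ("1", 23) cannot collide.
  return createHash("sha256")
    .update(`${orderId}:${attemptNumber}`, "utf8")
    .digest("hex");
}
```

A retry of the same attempt reuses the key and hits the cache; a deliberate new attempt increments the number and gets a fresh key.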

BullMQ job lifecycle: a webhook job that fails retries with exponential backoff at 5s, 25s, 125s, up to 5 attempts. Beyond 5 failures, the job moves to the dead-letter queue and fires an alert. Engineers see it; it doesn't silently disappear.

Webhook jobs and settlement jobs have different retry schedules. Webhook processing retries fast because those jobs are time-sensitive and failures are usually transient network issues. Settlement jobs retry with longer windows because hammering Stripe's API on a rate-limit error makes things worse. BullMQ's per-job-class configuration made this clean to express.
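The per-class retry behavior can be sketched as a small policy table and one function. The webhook numbers (5s base, 5x growth, 5 attempts) come from the schedule quoted above; the settlement numbers are placeholders I made up to show a slower curve, and all names here are hypothetical.

```typescript
type JobClass = "webhook" | "settlement";

interface RetryPolicy {
  baseDelayMs: number;
  factor: number;
  maxAttempts: number;
}

const RETRY_POLICIES: Record<JobClass, RetryPolicy> = {
  // From the text: 5s, 25s, 125s, ... up to 5 attempts.
  webhook: { baseDelayMs: 5000, factor: 5, maxAttempts: 5 },
  // Hypothetical values: longer windows to back off from rate limits.
  settlement: { baseDelayMs: 60000, factor: 4, maxAttempts: 5 },
};

// Returns the delay before the next retry, or null when the job has
// exhausted its attempts and should move to the dead-letter queue.
function nextRetryDelayMs(jobClass: JobClass, failedAttempts: number): number | null {
  if (!Number.isInteger(failedAttempts) || failedAttempts < 1) {
    throw new Error("failedAttempts must be a positive integer");
  }
  const policy = RETRY_POLICIES[jobClass];
  if (failedAttempts >= policy.maxAttempts) return null;
  return policy.baseDelayMs * policy.factor ** (failedAttempts - 1);
}
```

BullMQ's built-in exponential backoff doubles the delay on each attempt, so a 5x curve like this one would need a custom backoff strategy; the function above is the kind of logic such a strategy would contain.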

Webhook P99 latency is measured from webhook receipt to ledger write; it excludes downstream side effects such as email notifications.

Result: the idempotency layer intercepted 340+ duplicate webhook deliveries over the first 8 months, and zero duplicate charges reached customers. We verified this through monthly reconciliation of our ledger against Stripe's payout reports.

Financial State Machine

The ledger is append-only. No UPDATE statements on payment records. Ever.

Every state transition is a new row with the new state, the timestamp, and the triggering event. When a dispute arrives three months after a settlement, the full history is there: exactly when each state changed, what triggered it, what the Stripe response was. An audit trail that's complete by construction, not by discipline.

PostgreSQL advisory locks handle race conditions. When two webhook retries arrive simultaneously for the same payment, both try to acquire an advisory lock on the payment ID. One wins; one blocks until the first completes and then reads the result the winner wrote. No double-processing. The lock acquisition pattern: acquire lock on payment_id, read current state, validate the transition is legal, write the new state row, release lock.
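The lock acquisition pattern might look like the sketch below. It assumes a transaction-scoped advisory lock (pg_advisory_xact_lock, released automatically at commit or rollback) keyed by a hash of the payment ID; the table and column names, the Queryable interface, and the isLegal callback are all hypothetical. The interface mirrors the query(text, values) shape of common PostgreSQL clients so the example stays self-contained.

```typescript
// Minimal client shape; must be a single connection, not a pool, so that
// BEGIN, the lock, and COMMIT all run in the same session.
interface Queryable {
  query(text: string, values?: unknown[]): Promise<{ rows: Array<Record<string, unknown>> }>;
}

async function recordTransition(
  db: Queryable,
  paymentId: string,
  nextState: string,
  event: string,
  isLegal: (current: string | null, next: string) => boolean,
): Promise<"written" | "skipped"> {
  await db.query("BEGIN");
  try {
    // Acquire: a concurrent transaction for the same payment blocks here.
    await db.query("SELECT pg_advisory_xact_lock(hashtext($1))", [paymentId]);

    // Read the current state: the most recent ledger row.
    const { rows } = await db.query(
      "SELECT state FROM payment_ledger WHERE payment_id = $1 ORDER BY id DESC LIMIT 1",
      [paymentId],
    );
    const latest = rows[0];
    const current = latest ? String(latest.state) : null;

    // Validate: the loser of the race sees the winner's row and skips.
    if (!isLegal(current, nextState)) {
      await db.query("ROLLBACK");
      return "skipped";
    }

    // Write a new row; never update an existing one.
    await db.query(
      "INSERT INTO payment_ledger (payment_id, state, event) VALUES ($1, $2, $3)",
      [paymentId, nextState, event],
    );
    await db.query("COMMIT"); // releases the advisory lock
    return "written";
  } catch (err) {
    await db.query("ROLLBACK");
    throw err;
  }
}
```

Because the second caller reads the ledger only after the first has committed, its legality check runs against the winner's row rather than a stale snapshot.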

Why not eventual consistency? Because eventual consistency for financial state means two concurrent requests can both believe they're the first to process a charge. One must lose. Advisory locks make that deterministic.

Key Technical Decisions

Stripe Connect vs Direct Integration

A later project (a DTC fashion storefront) used a different Stripe product entirely: a hosted payment page for a single merchant, simple checkout, no multi-party complexity. That's a solved problem with off-the-shelf tooling.

This was different. A marketplace with merchants in four countries isn't a single-merchant checkout problem. It's a money-movement platform problem. Stripe Connect is a platform product that manages connected accounts as first-class entities: each merchant has a Stripe account linked to the platform, with their own KYC verification, their own payout schedule, their own currency. The platform takes a fee on each transaction before the remainder routes to the merchant.
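The fee split itself is simple integer arithmetic in the currency's minor unit (pence, cents). The sketch below uses a made-up fee rate in basis points; the actual fee schedule isn't stated in this write-up, and the function name is hypothetical. On Connect charges, Stripe's application_fee_amount parameter is the field that carries a platform fee like this.

```typescript
interface ChargeSplit {
  platformFee: number; // minor units
  merchantAmount: number; // minor units
}

// Splits a charge between platform and merchant without floating-point money.
function splitCharge(amountMinor: number, feeBasisPoints: number): ChargeSplit {
  if (!Number.isInteger(amountMinor) || amountMinor < 0) {
    throw new Error("amountMinor must be a non-negative integer");
  }
  if (!Number.isInteger(feeBasisPoints) || feeBasisPoints < 0 || feeBasisPoints > 10000) {
    throw new Error("feeBasisPoints must be an integer between 0 and 10000");
  }
  // Round the fee once; the merchant gets the exact remainder so the
  // two parts always sum to the original amount.
  const platformFee = Math.round((amountMinor * feeBasisPoints) / 10000);
  return { platformFee, merchantAmount: amountMinor - platformFee };
}
```

Computing the merchant share as a remainder, rather than rounding both sides independently, keeps the ledger balanced to the penny. Stripe's own processing fees are left out of this sketch.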

The alternatives were: build it ourselves (months of compliance work, ongoing regulatory updates across four jurisdictions), use a competing platform product (Adyen, Checkout.com, each with their own integration complexity), or stay on the current single-account setup (not viable for multi-party payouts). Stripe Connect was the right call.

The tradeoffs are real. Connect fees are higher than direct charges. Onboarding connected accounts adds friction for merchants. I chose Connect Standard accounts for the existing merchant base since they already managed their own payment setups and didn't need the platform controlling their Stripe dashboards. The fee premium was worth the compliance coverage.

The marketplace's product team built the merchant onboarding UI and payout dashboard. I built the payment processing backend that powered both.

BullMQ vs AWS SQS

This came down to control. SQS removes the operational burden of running a queue, but it gives up three things we needed: job priority (SQS has no native message priority, and FIFO queues order messages only within a message group), per-job-type retry strategies with custom backoff curves, and rate limiting per job class.

We needed different throughput caps for different job types. Webhook processing needs to run fast and at high concurrency. Settlement jobs need rate limiting to respect Stripe's API rate limits. With SQS, that control requires multiple queues with separate consumers and custom logic to coordinate between them. With BullMQ, it's configuration on a single Redis instance.

The risk is Redis availability. When Redis goes down, the queue goes down with it. Mitigated with Redis Sentinel for high availability: automatic failover, no manual intervention. The operational overhead of running Sentinel was less than the overhead of managing multiple SQS queues and the tooling to observe them.

The bull-board dashboard for BullMQ gave us real-time visibility into job states, retry counts, and dead-letter queue depth. That observability was worth the choice on its own.

Results

Before: 3-4% webhook failure rate at peak hours, 6 hours of Monday morning reconciliation every week, duplicate charges reaching customer support. The team knew the failure rate. They didn't know the idempotency gap was causing the duplicates.

After 8 months in production: zero duplicate charges, webhook P99 under 200ms, reconciliation runs automatically. The finance analyst's Monday morning routine is now a 10-minute review of the automated settlement report. Settlement rate is defined as the percentage of captured charges successfully paid out to connected accounts within the expected settlement window, excluding intentional holds.

[Chart: 8-Month Payment Volume]

[Chart: Settlement Rate by Country]

What Worked

Idempotency-first design. Making every payment operation idempotent from day one meant retry logic was safe to add anywhere. The cost was discipline in key design: every operation needs a stable, deterministic key before it touches Stripe. The payoff was never writing a compensating transaction.

BullMQ's per-job backoff. Different job types needed different retry windows. Expressing that as configuration per job class rather than a single retry policy for the entire queue was the right abstraction. When we added a new job type for dispute handling in week 8, it got its own backoff curve without touching any existing logic.

Append-only ledger. Never updating a financial row was the right call from the start. When our first dispute arrived six weeks after settlement, the full state history was there. We resolved it in 20 minutes instead of digging through logs.

Dispute alerting. The first dispute arrived six weeks after launch, and the state machine handled it correctly. What it exposed was a gap in the alerting: nobody was notified when the dispute arrived. The dead-letter queue had alerts, but the dispute webhook processed successfully (it just wrote a new state row), so no alert ever fired. We added a separate alert channel for state transitions into 'disputed' and 'failed', states that require human attention regardless of processing success.
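The fix amounts to alerting on where a payment lands rather than on whether the job succeeded. A minimal sketch, with hypothetical names and message format:

```typescript
// States that need a human regardless of whether processing succeeded.
const ATTENTION_STATES: ReadonlySet<string> = new Set(["disputed", "failed"]);

interface StateTransition {
  paymentId: string;
  from: string | null;
  to: string;
}

// Returns the alert messages to send for a transition (empty if none).
// Called after the ledger row is written, independent of job outcome.
function alertsFor(transition: StateTransition): string[] {
  if (!ATTENTION_STATES.has(transition.to)) return [];
  const origin = transition.from ?? "none";
  return [`payment ${transition.paymentId} moved ${origin} -> ${transition.to}`];
}
```

Keeping this separate from dead-letter alerting is the point: a dispute webhook that processes cleanly never reaches the dead-letter queue, but it still produces a transition.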

Stripe Connect's sub-account model. Having each merchant as a connected account with their own Stripe dashboard meant they could see their own payouts without needing access to the platform's Stripe account. Reduced support burden, cleaner separation of concerns.

What I'd Reconsider

The 7-state machine was over-engineered. In 8 months, the 'authorized' state was used exactly twice (for manual-capture merchants we never onboarded), and the 'captured' state was always immediately followed by settlement. Start with 4 states; add when production demands them.

When Redis is unavailable, the idempotency check can't complete, so the operation proceeds as if it's a new request. Fail-open at the application layer. For most payment operations that's acceptable. For charge initiation, it isn't. I addressed this in a later project (the crypto wallet engine) with a two-layer approach: Redis for speed, PostgreSQL as the always-available fallback.
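The two-layer idea can be sketched as a lookup that treats the fast store as an optimization and the durable store as the source of truth. This is an illustration of the approach described above under my own assumptions, not the code from the later project; the KeyLookup interface and function name are hypothetical.

```typescript
interface KeyLookup {
  get(key: string): Promise<string | null>;
}

// Fast layer (e.g. Redis) first; durable layer (e.g. PostgreSQL) on a miss
// or when the fast layer is unavailable.
async function lookupIdempotent(
  key: string,
  fast: KeyLookup,
  durable: KeyLookup,
): Promise<string | null> {
  try {
    const hit = await fast.get(key);
    if (hit !== null) return hit;
  } catch {
    // Fast layer is down: do not treat this as "new request".
    // Fall through to the durable layer instead.
  }
  // A durable-layer failure propagates to the caller, so charge
  // initiation fails closed rather than proceeding unchecked.
  return durable.get(key);
}
```

Checking the durable layer on a plain miss as well covers the case where the fast store has lost keys, for example after a failover or a shorter TTL.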

The 72-hour idempotency TTL was chosen to match Stripe's maximum webhook retry window. In practice, 99% of retries arrive within 2 hours. A tiered approach, short TTL in Redis with a longer-lived fallback in Postgres, would have reduced Redis memory pressure without sacrificing correctness on the long tail.

Built with: Node.js, TypeScript, Stripe Connect, Redis, BullMQ, PostgreSQL