Every operations team eventually hits the same wall. The automation that worked beautifully at 200 events a day starts dropping, duplicating, or silently stalling at 20,000. The executive response is almost always to blame the tool — Zapier, n8n, Make, the integration platform of the month. The real cause is usually further down the stack.
Automation is not a feature you install. It is infrastructure. And like all infrastructure, it breaks the same way — for the same reasons — regardless of which vendor's logo is on the dashboard.
1. Idempotency — the missing contract
A step is idempotent if running it twice has the same effect as running it once. Billing systems understand this. Operations automations almost never do.
When a webhook retries — and they all do — you get two shipments, two invoices, two "thank you" emails. The fix is not "turn off retries." The fix is to make every write operation carry a stable key the downstream system can deduplicate against.
```python
# Not idempotent: every retry creates another shipment
create_shipment(order)

# Idempotent: the key is stable across retries, so the downstream
# system can deduplicate repeated calls (note: including the attempt
# number in the key would break this — each retry must send the same key)
create_shipment(order, idempotency_key=order.id + ".create-shipment")
```

2. Backpressure — the missing valve
A healthy pipeline has a way to say "slow down." An unhealthy one doesn't. The first 60% of scale happens inside whatever queue, retry buffer, or polling loop is already there. The last 40% happens as those buffers overflow silently into lost work.
Backpressure is the engineering term for a pipeline's ability to tell its producer to wait. Without it, a carrier API going slow during peak season doesn't just slow your shipping — it corrupts the upstream record, because the caller keeps writing while the callee is still processing work from ten minutes ago.
- Use durable queues between stages (Kafka, NATS, SQS, Redis Streams).
- Set explicit concurrency limits on every worker pool.
- Measure queue depth as a first-class KPI, not an afterthought.
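All three points fit in a short sketch using the standard library. This is a minimal illustration, not a production design — the `process` function is a stand-in for a real stage, and the capacity and worker counts are arbitrary:

```python
import queue
import threading

JOBS: queue.Queue = queue.Queue(maxsize=100)  # bounded: put() blocks when full
MAX_WORKERS = 4                               # explicit concurrency limit
RESULTS: list = []

def process(job) -> None:
    RESULTS.append(job)                       # stand-in for a real stage

def queue_depth() -> int:
    """Expose depth as a first-class metric, not an afterthought."""
    return JOBS.qsize()

def worker() -> None:
    while True:
        job = JOBS.get()
        try:
            if job is not None:
                process(job)
        finally:
            JOBS.task_done()
        if job is None:                       # sentinel: shut down cleanly
            return

def run(jobs) -> None:
    threads = [threading.Thread(target=worker) for _ in range(MAX_WORKERS)]
    for t in threads:
        t.start()
    for j in jobs:
        JOBS.put(j)            # backpressure: blocks if downstream is saturated
    for _ in threads:
        JOBS.put(None)         # one sentinel per worker
    for t in threads:
        t.join()
```

The key line is the bounded `Queue(maxsize=100)`: when workers fall behind, `put()` blocks the producer instead of letting work pile up invisibly.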
3. Observability — the missing nervous system
Most automation platforms give you two things: a log pane, and a per-execution history view. Neither is observability. Observability is the ability to answer a question you didn't predict, about a production event you didn't anticipate, on the timescale of the incident.
If your automation cannot answer "which stage slowed down this morning?" — or worse, cannot answer "is anything slowing down right now?" — you don't have a system. You have a black box that happens to still be running.
- Logs: every step emits structured events tagged with workflow id, attempt, and input hash.
- Metrics: latency, throughput, error rate per stage — queryable over a time window, not per-run.
- Traces: a single request id follows the work through every stage and every downstream call.
- Alerts: defined on symptoms (error rate, queue depth), not causes (specific integrations).
4. Recovery — the missing rehearsal
Recovery is not "does it work when things go right." It is "what happens on the third attempt, after a partial write, when the upstream provider returned the wrong 200?"
Mature automation treats every step as something that will eventually fail, resume, and replay. Dead-letter queues, compensating actions, and explicit retry policies move from nice-to-have to load-bearing. The cost of adding them on day one is a fraction of the cost of untangling silent corruption six months in.
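The retry-then-dead-letter pattern is small enough to sketch in full. This is a simplified illustration — the `step` callable, the list-backed dead-letter queue, and the attempt counts are all placeholders for real infrastructure:

```python
import time

def run_with_retry(step, payload, dead_letter: list,
                   max_attempts: int = 3, base_delay: float = 0.01):
    """Run a step with exponential backoff; park exhausted failures
    in a dead-letter queue so they can be replayed later."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step(payload, attempt)
        except Exception as exc:
            if attempt == max_attempts:
                # Out of retries: record enough context to replay the work.
                dead_letter.append({"payload": payload, "error": str(exc)})
                return None
            time.sleep(base_delay * 2 ** (attempt - 1))  # back off, then retry
```

The dead-letter queue is what makes replay possible: every failure that exhausts its retries is preserved with its original payload, instead of vanishing into a log line.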
“If you cannot replay yesterday's events through the pipeline, you don't have a pipeline. You have a sequence of lucky runs.”
The shift: treat workflows like software
Every team we audit that has solved automation at scale has stopped thinking about workflows as ops tools and started thinking about them as software. That means version control, staging environments, review, rollback, monitoring, and on-call. It is a shift in category, not in vendor.
The tools you already have — n8n, Temporal, your bespoke worker, a handful of SaaS APIs — are almost always capable of running infrastructure-grade pipelines. They just have to be wired with the four primitives above. Skip them, and it doesn't matter how fast or cheap the underlying platform is. It will fail. The only question is whether it fails loud on day one, or quiet for a quarter.
At Forgequbit, these four primitives are the first thing we instrument on any new build. They are also the first thing we look for on any audit — because they are the fastest read on whether a system is genuinely engineered, or simply assembled and waiting.