System Design for Operations, Not Software

Software patterns don't map cleanly to operations. The teams that try to use them unchanged ship the wrong system.

/published 08 Mar 2026

/read-time 13 min read

/by Forgequbit Engineering

Operations engineering sits in an awkward place. Most of the people capable of building it come from software backgrounds, so they reach for the patterns they already know. Most of the businesses commissioning it describe it as "just automation", so they expect it to look like anything Zapier can stitch together. Neither framing is right.

An operations system shares a surprising amount of DNA with a backend — event buses, workers, data stores, observability — but the three places it diverges are exactly the places software patterns break down.

1. Humans are a legitimate system node

In software, a human is either a user (outside the system) or an administrator (edge case). In operations, a human is a first-class node in the graph. Approvals, exception handling, compliance sign-offs — these are not edge cases. They are the steady-state workflow.

Designing for this means accepting that some steps must pause, wait for a human signal, and resume. Temporal, Inngest, and bespoke state machines all solve for this well. Zaps, cron jobs, and imperative scripts don't — they tend to either hang or force a premature decision.

order.received
  → policy.check       (machine)
  → manual.review      (human · SLA 2h · escalate to supervisor)
  → dispatch.execute   (machine)
  → pod.capture        (machine)
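
The flow above can be sketched as a small state machine that pauses at the human step and resumes when a signal arrives. This is a minimal illustration in plain Python, not the Temporal or Inngest API; the `Order` class, the `advance` function, and the `reviewer` role name are assumptions made for the example, while the 2-hour SLA and the supervisor escalation come from the diagram.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

# Steps mirror the flow above; the kind says who advances each one.
STEPS = [
    ("policy.check", "machine"),
    ("manual.review", "human"),
    ("dispatch.execute", "machine"),
    ("pod.capture", "machine"),
]
REVIEW_SLA = timedelta(hours=2)  # from the diagram: SLA 2h


@dataclass
class Order:
    order_id: str
    step: int = 0                          # index into STEPS
    waiting_since: Optional[datetime] = None
    assignee: str = "reviewer"             # hypothetical role name
    history: list = field(default_factory=list)

    @property
    def done(self) -> bool:
        return self.step >= len(STEPS)


def advance(order: Order, now: datetime, approved: Optional[bool] = None) -> Order:
    """Run machine steps until the order finishes or reaches a human step."""
    while not order.done:
        name, kind = STEPS[order.step]
        if kind == "human":
            if approved is None:
                # Pause: remember when the wait began, then hand control back.
                if order.waiting_since is None:
                    order.waiting_since = now
                elif now - order.waiting_since > REVIEW_SLA:
                    order.assignee = "supervisor"  # escalate, keep waiting
                return order
            if not approved:
                order.history.append(f"{name}:rejected")
                order.step = len(STEPS)  # terminal state
                return order
            order.waiting_since = None
            approved = None  # the signal is consumed by this step only
        order.history.append(f"{name}:ok")
        order.step += 1
    return order
```

Calling `advance(order, now)` with no decision is safe to repeat from a timer: it never forces a choice, it only checks the SLA. Calling it with `approved=True` resumes the remaining machine steps. A real system would persist `Order` between calls so the wait survives a restart.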

2. Replay safety is not optional

A software backend often tolerates ambiguity about what ran. An operations system cannot. Money moved, shipments left the warehouse, invoices were emitted — none of those are reversible. If you cannot reason about what happened, you cannot safely retry, backfill, or correct.

Replay safety means three things: every action is deterministic given its inputs and idempotency key; every state transition is logged before it takes effect; and every downstream side-effect is keyed so it can be safely re-issued.

▸ Side-effects are keyed (idempotency_key, request_id, or equivalent).
▸ State transitions are recorded before execution (append-only event log).
▸ Retries are safe by design, not by convention.
▸ Replay is a first-class operation, not a recovery hack.
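
A rough sketch of how these properties might fit together, using an in-memory list as a stand-in for a durable append-only log. The names `EventLog`, `execute`, `replay`, and the `payment.send` action are hypothetical, and deriving the key from a hash of the inputs is one possible choice rather than the only one.

```python
import hashlib
import json


class EventLog:
    """Append-only log; a real system would back this with durable storage."""

    def __init__(self):
        self._events = []

    def append(self, kind, key, payload=None):
        self._events.append({"kind": kind, "key": key, "payload": payload})

    def completed(self, key):
        return any(e["kind"] == "completed" and e["key"] == key for e in self._events)

    def __iter__(self):
        return iter(list(self._events))  # iterate over a copy; appends are safe


def idempotency_key(action, inputs):
    """Same action and same inputs give the same key, so a retry is recognisable."""
    blob = json.dumps({"action": action, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()[:16]


def execute(log, action, inputs, side_effect):
    key = idempotency_key(action, inputs)
    if log.completed(key):
        return key  # already done: retrying is a no-op
    log.append("intended", key, {"action": action, "inputs": inputs})  # log first
    side_effect(key, inputs)  # the downstream call carries the key for its own dedup
    log.append("completed", key)
    return key


def replay(log, handlers):
    """Re-drive every intended action that never reached 'completed'."""
    for e in log:
        if e["kind"] == "intended" and not log.completed(e["key"]):
            payload = e["payload"]
            handlers[payload["action"]](e["key"], payload["inputs"])
            log.append("completed", e["key"])
```

If the process dies between the side-effect and the `completed` entry, `replay` will issue the side-effect again with the same key. That is only safe when the downstream system deduplicates on that key, which is why the key travels with every call.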

3. Audit-first design beats audit-later instrumentation

Most software gets observability bolted on after the first incident. Operations systems don't have that luxury. The data they move is frequently regulated, contested, or directly tied to revenue. The first incident is often the first audit.

An audit-first design treats the audit trail as a primary artifact — not a log — and builds the rest of the system around it. The event stream is the source of truth; the database is a projection; reports are queries over the event log, not snapshots from whatever SaaS happens to have the latest state.

“When the audit trail is the source of truth, every other table in the system is an opinion about it. That's the right way round.”
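
To make the "database is a projection" idea concrete, here is a small sketch under assumed event shapes. The invoice events, field names, and the `project` and `report_by_type` helpers are invented for illustration; the point is only that the table is derived from the log and can be rebuilt for any past moment.

```python
from collections import Counter

# Hypothetical event shapes; field names are illustrative only.
EVENTS = [
    {"seq": 1, "type": "invoice.emitted", "invoice": "INV-1", "amount": 400},
    {"seq": 2, "type": "invoice.emitted", "invoice": "INV-2", "amount": 250},
    {"seq": 3, "type": "invoice.paid", "invoice": "INV-1", "amount": 400},
    {"seq": 4, "type": "invoice.voided", "invoice": "INV-2", "amount": 250},
]

STATUS = {"invoice.emitted": "open", "invoice.paid": "paid", "invoice.voided": "void"}


def project(events, up_to=None):
    """Rebuild the 'database' view by folding events in sequence order."""
    table = {}
    for e in sorted(events, key=lambda e: e["seq"]):
        if up_to is not None and e["seq"] > up_to:
            break
        table[e["invoice"]] = {"status": STATUS[e["type"]], "amount": e["amount"]}
    return table


def report_by_type(events):
    """A report is a query over the log, not a snapshot of current state."""
    return Counter(e["type"] for e in events)
```

`project(EVENTS)` gives today's view, while `project(EVENTS, up_to=2)` answers "what did we believe after the second event?", a question a current-state table cannot answer on its own.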

Why familiar software patterns underfit

If you come from a software background, the temptation is to reach for request/response architectures, RESTful CRUD, and a relational store as the spine. All three work fine until the business logic requires a human, a regulated side-effect, or an incident you don't yet have the query for.

Request/response: Can't cleanly express waits that last hours or days. Turns into cron + state tables — an event system pretending not to be one.
CRUD: Throws away intent. You know a shipment is in state 'booked', but not that it was re-booked twice after a policy change.
Relational-first: Optimises for the current view, not the history. Forces you to choose between auditability and query performance at exactly the moment you need both.
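
The CRUD point can be shown with the shipment example from the list above. This is a toy comparison with made-up identifiers (`SHP-9`, `crud_update`, `record`): the same three changes are applied once as in-place updates and once as appended events.

```python
# Hypothetical shipment record; identifiers and reasons are illustrative.

def crud_update(row, state):
    """CRUD: overwrite in place. The previous value and the reason are gone."""
    row["state"] = state
    return row


def record(history, event_type, reason):
    """Event-sourced: append what happened and why."""
    history.append({"type": event_type, "reason": reason})
    return history


def current_state(history):
    """Both 'booked' and 're-booked' events leave the shipment booked."""
    if history and history[-1]["type"] in ("booked", "re-booked"):
        return "booked"
    return "unknown"


row = {"shipment": "SHP-9", "state": "new"}
history = []
for event_type, reason in [
    ("booked", "initial booking"),
    ("re-booked", "policy change"),
    ("re-booked", "policy change"),
]:
    crud_update(row, "booked")
    record(history, event_type, reason)

# Only the event history can answer "how often was this re-booked, and why?"
rebookings = [e for e in history if e["type"] == "re-booked"]
```

Both representations agree that the shipment is currently booked. Only `history` still holds the two re-bookings and their reason.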

The alternative (event-sourced, state-machine-driven, observability-first) is more work on day one and dramatically less work on day 400. The break-even point usually arrives sooner than either the engineering team or the business expects.

⧫

"We'll build it like software" is the right instinct. "We'll build it exactly like software" is the wrong conclusion. Operations is adjacent to software, not a subset of it, and the category matters enough to pick patterns accordingly.

/filed-under Systems Thinking · FQ-03
