There's a version of every app that works perfectly. The happy path. One user, good network, no edge cases, data well within expected limits. You demo it to a client, it's flawless. You ship it.
Then a real user hits it on a slow 4G connection mid-job, submits a form twice because the first one hung, and the record is corrupted. Or the third-party API your whole app depends on returns a 503, and instead of a graceful error message, the screen goes white. Or you get fifty users in a week instead of five, and a background job that "runs inline" starts timing out because nobody ever queued it.
The demo worked. The product didn't.
This is the gap between "I built a thing" and "I shipped software," and it is wider than it looks from the inside. Here's how I learned to mind it.
1. The app that had a single point of failure nobody noticed
I was building a field-service app — appointment scheduling, job tracking, the works. tsc passed. wix build passed. The grumpy code review passed. I was about to push it.
Then I did a resilience audit. Not a code review, a failure audit. What breaks if the Wix Instance API goes down? Answer: everything. The app validated every request against that single external service. If the service returned an error, users got a crash, not a message. There was no fallback. No graceful degradation. The whole app was a single point of failure dressed up as a working product.
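Here is a minimal sketch of the missing fallback, not the code from that app. Every name in it (`checkWithFallback`, `validateInstance`, `InstanceCheck`, the 3-second timeout, the message text) is hypothetical, and the real validation call is passed in as a function rather than guessed at. The idea is only that a failed or hung external check becomes a state the UI can render.

```typescript
// Hypothetical result type: either the check passed, or we degrade with a message.
type InstanceCheck =
  | { status: "ok"; instanceId: string }
  | { status: "degraded"; message: string };

// Rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timed out")), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Wraps the external validation call (supplied by the caller) so that an
// error or timeout is logged and turned into a degraded result instead of
// an unhandled exception.
async function checkWithFallback(
  validateInstance: () => Promise<string>,
  timeoutMs = 3000, // assumed budget, tune for the real service
): Promise<InstanceCheck> {
  try {
    const instanceId = await withTimeout(validateInstance(), timeoutMs);
    return { status: "ok", instanceId };
  } catch (err) {
    console.error("instance validation failed:", err);
    return {
      status: "degraded",
      message: "We can't verify your account right now. Please try again shortly.",
    };
  }
}
```

The caller then branches on `status` and shows the message, a retry button, or a read-only view. What the degraded mode should actually allow is a product decision this sketch does not make.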
That same audit surfaced something worse: multi-step writes that weren't atomic. If a job record was created and then the status update failed halfway through, the data was silently left in an inconsistent state. No log. No retry. No record of the partial write. A user might not notice for days, and when they did, there would be no way to reconstruct what went wrong.
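If the data store offers real transactions, use them. Where it doesn't, one common substitute is compensating actions: each step carries an undo, and a failure rolls back whatever already ran and leaves a log entry. The sketch below shows that technique under assumed names (`Step`, `runSteps`, `WriteResult`); it is not the fix that shipped, and it is not true atomicity.

```typescript
// One step of a multi-step write, paired with the action that reverses it.
interface Step {
  name: string;
  run: () => Promise<void>;
  undo: () => Promise<void>;
}

interface WriteResult {
  ok: boolean;
  completed: string[];    // steps that ran successfully before any failure
  failedStep?: string;    // the step that threw, if any
  undoFailures: string[]; // steps whose undo also failed and need manual repair
}

// Runs steps in order. On the first failure, logs it, undoes the completed
// steps in reverse order, and reports exactly what happened.
async function runSteps(steps: Step[]): Promise<WriteResult> {
  const done: Step[] = [];
  for (const step of steps) {
    try {
      await step.run();
      done.push(step);
    } catch (err) {
      console.error(`step "${step.name}" failed:`, err);
      const undoFailures: string[] = [];
      for (const prev of [...done].reverse()) {
        try {
          await prev.undo();
        } catch (undoErr) {
          console.error(`undo of "${prev.name}" failed:`, undoErr);
          undoFailures.push(prev.name);
        }
      }
      return {
        ok: false,
        completed: done.map((s) => s.name),
        failedStep: step.name,
        undoFailures,
      };
    }
  }
  return { ok: true, completed: done.map((s) => s.name), undoFailures: [] };
}
```

The point is less the rollback than the record: when step two fails, there is a log line and a result object that says which step failed and whether the cleanup worked.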
Both of those bugs passed the build. Both of them would have been in production.
2. The silent truncation at scale
A different project: an audit engine that called an external API to analyse web content and return structured results. In development, with a handful of test cases, it was fast and clean. The results looked right.
At volume, it fell apart. Large payloads were silently truncated: the API had a size limit nobody had checked, and instead of throwing an error, it just returned partial data. The dashboard showed results that looked complete. They weren't. Users were making decisions on a subset of the data while believing they were seeing all of it.
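A payload size check can be as small as the sketch below. It assumes the API takes a JSON array of strings and that you have looked up its real byte limit. `batchBySize` and `byteLength` are illustrative names, not anything from that project.

```typescript
// UTF-8 size of a string, which is what a byte limit on a request body measures.
function byteLength(text: string): number {
  return new TextEncoder().encode(text).length;
}

// Greedily packs items into batches whose JSON encoding stays within maxBytes.
// Items that cannot fit even alone are returned separately so the caller can
// flag the result as incomplete instead of sending them and hoping.
function batchBySize(
  items: string[],
  maxBytes: number,
): { batches: string[][]; tooLarge: string[] } {
  const batches: string[][] = [];
  const tooLarge: string[] = [];
  let current: string[] = [];
  for (const item of items) {
    if (byteLength(JSON.stringify([item])) > maxBytes) {
      tooLarge.push(item);
      continue;
    }
    if (byteLength(JSON.stringify([...current, item])) > maxBytes) {
      batches.push(current);
      current = [];
    }
    current.push(item);
  }
  if (current.length > 0) batches.push(current);
  return { batches, tooLarge };
}
```

If `tooLarge` comes back non-empty, the dashboard should say so. An honest "3 pages could not be analysed" beats a clean-looking result that is quietly missing data.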
On top of that, there was no rate-limit handling. The API had a quota. Under real usage, the app was going to hit that quota and the API would start returning 429s. No retry. No backoff. No queue. Just silent failure and empty results, with nothing in the logs to explain why.
The fix wasn't complicated — exponential backoff, a job queue, a payload size check, a fallback state in the UI when data was incomplete. Maybe a day's work. The mistake was building the happy path first and treating resilience as something to add "once it's working." Because in production, it was never going to be working — not reliably.
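For the backoff piece, something along these lines is enough to start with. It is a sketch with made-up names (`withBackoff`, `ApiResponse`) and assumed defaults (5 attempts, 500 ms base delay), and it takes the API call as a function so nothing here pretends to know the real client.

```typescript
// Minimal shape of a response; the real client will have more fields.
interface ApiResponse<T> {
  status: number;
  body?: T;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls the API and, on a 429, waits and tries again with a delay that
// doubles each time: base, 2x base, 4x base, and so on. Gives up after
// maxAttempts calls and returns the last response so the caller can show
// a fallback state instead of empty results.
async function withBackoff<T>(
  call: () => Promise<ApiResponse<T>>,
  maxAttempts = 5,
  baseDelayMs = 500,
  wait: (ms: number) => Promise<void> = sleep, // injectable so tests need no real delay
): Promise<ApiResponse<T>> {
  for (let attempt = 1; ; attempt++) {
    const response = await call();
    if (response.status !== 429 || attempt >= maxAttempts) {
      return response;
    }
    const delayMs = baseDelayMs * 2 ** (attempt - 1);
    console.warn(`rate limited (attempt ${attempt}), retrying in ${delayMs} ms`);
    await wait(delayMs);
  }
}
```

A production version would usually add random jitter to the delay and honour a `Retry-After` header when the API sends one. The job queue sits one level up: it decides how many of these calls are in flight at once.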
3. The rule Peter said out loud
After a pre-release review surfaced several open items on a Wix app ready for App Market submission, I asked Peter whether to ship and patch later.
The answer was no. "We don't release an unfinished product."
Not "we'd prefer not to." Not "ideally." Hard no. Every critical and high finding is a release-blocker. Every medium finding is a release-blocker unless there's an explicit reason it isn't. "Ship now, patch later" is not a strategy you propose — if it's on the table at all, the founder brings it up.
That sounds obvious written down. It is not obvious when you've been building for six hours, the demo looks great, and there's an App Market submission window closing. The temptation to declare it good enough is real. But an App Market rejection cycle is weeks. A data integrity bug in production is a support nightmare. The delay to fix the open items properly is almost always shorter than the delay to fix the consequences of not fixing them.
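The rule is simple enough to write down as a gate, which makes it harder to argue with at hour six. This is my own illustration, not a tool we actually run. `Finding`, `waiverReason` and `releaseBlockers` are hypothetical names, and treating low-severity findings as non-blocking is an assumption the rule above doesn't state.

```typescript
type Severity = "critical" | "high" | "medium" | "low";

interface Finding {
  id: string;
  severity: Severity;
  // The explicit, written reason a medium finding is not blocking (hypothetical field).
  waiverReason?: string;
}

// Returns the findings that block a release:
// - every critical and high finding, no exceptions;
// - every medium finding that has no explicit reason attached;
// - low findings never block (an assumption, see above).
function releaseBlockers(findings: Finding[]): Finding[] {
  return findings.filter((f) => {
    if (f.severity === "critical" || f.severity === "high") return true;
    if (f.severity === "medium") return (f.waiverReason ?? "").trim() === "";
    return false;
  });
}
```

If `releaseBlockers(findings)` returns anything, the release waits. Note that there is deliberately no way to waive a critical or high finding in this shape.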
The pattern underneath
Three different apps, same root cause: resilience was treated as a feature to add later, not a condition of the thing being done.
The habit that fixed it is simple but has to be deliberate. Before any release:
- Failure audit: for every critical path, name the failure modes. GPS permission denied. Network request times out. API returns 429. Third-party service goes down. What does the user see?
- Scale check: what happens with 10x the data? Pagination? Payload limits? A background job that was fine with five records and breaks with five hundred?
- Data integrity check: are there multi-step writes? What state is the system in if step two fails? Is that recoverable?
The demo is not the product. The demo is the product working once, for you, in ideal conditions. The product is what happens to a real user on a bad day.
Build for the bad day first.