The difference between good automation and bad automation isn't the code. It's how you handle failure.
Bad automation assumes everything works perfectly. APIs are always available. Data is always formatted correctly. Networks never fail. Systems never change.
Good automation assumes everything will eventually break. And plans for it.
Here's how to build automation that keeps running even when things go wrong.
Principle 1: Expect APIs to fail.
You're integrating with an external service. Their API documentation says it's available 99.9% of the time. Great. That means it's down 0.1% of the time.
If your automation runs 1,000 times a month, it'll hit that downtime about once a month on average. If it crashes whenever the API is down, that's roughly one failure every month. Not a rare edge case.
Add error handling. If the API call fails, retry it. Try three times with delays. If it still fails, log the failure and alert someone. Don't just crash.
Built an automation for a client that pulled data from their e-commerce platform's API. Platform was normally reliable. But occasionally their API went down for maintenance. Automation crashed. Orders didn't process. Client was frustrated.
Added retry logic. If the API fails, wait 30 seconds. Try again. Still fails? Wait 60 seconds. Try again. Still fails? Send an alert and queue the request to retry later.
Now when the API goes down, automation handles it gracefully. Retries automatically. Usually succeeds on the second or third attempt. No manual intervention needed.
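A rough sketch of that pattern in Python. The `send_alert` and `queue_for_retry` helpers are placeholders for whatever alerting and queueing you already use, and the HTTP client could be anything:

```python
import time
import requests  # assumes an HTTP API; swap in whatever client you use

def send_alert(message):
    # Placeholder: wire this to email, Slack, PagerDuty, etc.
    print(f"ALERT: {message}")

def queue_for_retry(url):
    # Placeholder: persist the request somewhere so it can be retried later.
    print(f"QUEUED: {url}")

def fetch_orders(url, delays=(0, 30, 60)):
    """Try the API up to three times, waiting longer between attempts."""
    last_error = None
    for delay in delays:
        if delay:
            time.sleep(delay)
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as exc:
            last_error = exc  # remember the failure, fall through to the next attempt
    # Every attempt failed: tell a human and queue the work for a later retry.
    send_alert(f"Order fetch failed after {len(delays)} attempts: {last_error}")
    queue_for_retry(url)
    return None
```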
Principle 2: Expect data to be messy.
You test automation with clean sample data. Perfect formatting. All required fields filled in. Everything works.
Then it runs in production. Real data is messy. Missing fields. Unexpected formats. Extra spaces. Special characters. Edge cases you never thought of.
Don't assume data is clean. Validate it. Check that required fields exist. Check that formats are correct. Clean up common issues automatically.
Had automation that processed invoices. Assumed invoice numbers were always numeric. Then client sent an invoice with a letter in the number. Automation crashed.
Updated it to handle any format. If invoice number doesn't match expected format, flag it for review instead of crashing. Log what happened so we can see patterns and adjust if needed.
Now it handles whatever data comes in. Unexpected formats get flagged. But automation keeps running.
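A minimal sketch of that validate-then-flag approach, assuming invoices arrive as dictionaries. The field name and format rule are just for illustration:

```python
import logging
import re

logger = logging.getLogger("invoices")
review_queue = []  # items a human needs to look at

def process_invoice(invoice):
    """Validate before processing; flag anything unexpected instead of crashing."""
    number = str(invoice.get("number", "")).strip()  # clean up stray whitespace

    # Missing or oddly formatted? Flag it for review and keep the run going.
    if not number or not re.fullmatch(r"[A-Za-z0-9-]+", number):
        logger.warning("Invoice flagged for review, unexpected number: %r", invoice)
        review_queue.append(invoice)
        return False

    # Data looks sane: process it normally here.
    return True
```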
Principle 3: Log everything important.
When automation breaks, you need to know why. Can't fix problems you can't diagnose.
Log what happens. Not everything. Don't need to log every successful step. But log failures. Log unexpected conditions. Log anything unusual.
Include enough context to diagnose. Timestamp. What step failed. What the data was. What the error was.
Client's automation mysteriously stopped working. No obvious errors. Just stopped processing orders. Checked the logs. None. Automation wasn't logging anything.
Had to rebuild parts of it with logging to figure out what was happening. Turns out the source system changed their data format slightly. Automation silently failed because it couldn't parse the data.
If it had logged "failed to parse order data: unexpected format in field X," would have diagnosed it in 5 minutes instead of 2 hours.
Now every automation I build logs failures. Saves enormous debugging time.
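A sketch of what that looks like with Python's standard logging module. The field names are made up, but the idea is the same: timestamp, what failed, what the data was, what the error was.

```python
import logging

logging.basicConfig(
    filename="automation.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",  # timestamp on every entry
)
logger = logging.getLogger("orders")

def parse_order(raw):
    try:
        return {
            "id": raw["order_id"],
            "total": float(raw["total"]),
        }
    except (KeyError, ValueError) as exc:
        # Log which step failed, what the data was, and what the error was.
        logger.error("Failed to parse order data: %s | payload=%r", exc, raw)
        return None
```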
Principle 4: Alert on failures, not successes.
Don't send notifications every time automation succeeds. That's noise. People stop paying attention.
Send alerts when something goes wrong. When automation fails. When data looks suspicious. When error rates spike.
Make alerts actionable. "Automation failed" isn't helpful. "Automation failed to process 10 orders due to API timeout. Retrying in 5 minutes" is helpful.
Include enough information to decide if manual intervention is needed or if automatic retry will handle it.
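One way to make that concrete: build the alert text from the facts instead of a generic "something failed." A small sketch, with made-up parameters:

```python
def build_alert(failed_count, reason, retry_minutes):
    """An actionable alert says what broke, why, and what happens next."""
    # "Automation failed" tells you nothing; this tells you whether to act.
    return (
        f"Automation failed to process {failed_count} orders due to {reason}. "
        f"Retrying automatically in {retry_minutes} minutes. "
        "No action needed unless the retry also fails."
    )

# Example: build_alert(10, "API timeout", 5)
```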
Principle 5: Build in manual override.
Sometimes automation gets it wrong. Edge cases come up that need human judgment. Don't make manual intervention impossible.
Include a way to manually trigger specific steps. A way to manually fix data. A way to force a retry. A way to skip problematic items and flag them for review.
Client's automation approved expense reports automatically under certain conditions. Worked great 99% of the time. But 1% needed human judgment. Unusual expenses. Weird categories. Borderline policy violations.
Built in a manual review queue. Automation flags anything uncertain. Human reviews those cases. Approves or denies manually. Rest processes automatically.
Best of both worlds. Automation handles the bulk. Humans handle exceptions. Neither gets in the way of the other.
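Here's a rough sketch of that routing logic. The dollar limit and categories are invented for illustration; the real rules would come from the client's policy:

```python
def route_expense(report, auto_approved, review_queue):
    """Approve the obvious cases automatically; queue everything else for a human."""
    # Assumed thresholds for illustration only -- real policy rules go here.
    within_limit = report["amount"] <= 200
    known_category = report["category"] in {"travel", "meals", "software"}

    if within_limit and known_category:
        auto_approved.append(report)          # the bulk: no human needed
    else:
        report["flag_reason"] = "needs manual review"
        review_queue.append(report)           # the exceptions: human judgment
```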
Principle 6: Test with bad data.
Don't just test the happy path. Test failure cases. What happens if a field is missing? What if it's the wrong type? What if the value is negative when you expect positive?
What happens if the API times out? What if it returns an error? What if the network drops mid-request?
What happens if input is 10× larger than expected? What if it's empty?
Every one of those scenarios will happen eventually. Better to find out how your automation handles them during testing than during production.
I keep a collection of "bad data" test cases. Deliberately malformed inputs. Use them to test every automation I build. Find and fix issues before they hit clients.
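Something like this, as a sketch. It reuses the `parse_order` sketch from Principle 3; in practice you'd point it at whatever entry point your automation exposes:

```python
# A few deliberately malformed inputs, assuming an order-shaped payload.
BAD_DATA_CASES = [
    {},                                         # empty input
    {"order_id": None, "total": "19.99"},       # missing/None required field
    {"order_id": "A-1", "total": "not a num"},  # wrong type
    {"order_id": "A-2", "total": "-50"},        # negative where positive expected
    {"order_id": "A-3 ", "total": " 19.99 "},   # stray whitespace
]

def test_bad_data():
    """The automation should flag or skip these -- never crash."""
    for case in BAD_DATA_CASES:
        try:
            parse_order(case)  # or whatever function your automation starts with
        except Exception as exc:
            raise AssertionError(f"Crashed on {case!r}: {exc}")
```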
Principle 7: Don't trust external systems to stay the same.
You integrate with a platform. It works perfectly. Then they update their API. Or change their data format. Or deprecate a feature you were using.
Suddenly your automation breaks. Not because you did anything wrong. Because they changed something.
Can't prevent this. But can make it less painful. Don't tightly couple your automation to specific API versions or data structures. Add abstraction layers that can adapt to changes.
Monitor for changes. If API responses start looking different, alert even if automation still works. Might be a sign of upcoming breaking changes.
Have fallback options when possible. If the primary integration breaks, can you temporarily fall back to a manual process or alternative source?
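A sketch of a thin adapter layer. The rest of the automation only ever talks to `normalize_order`, so if the platform renames a field there's one place to fix, and the shape check warns you early even when nothing has broken yet. The field names are assumptions:

```python
import logging

logger = logging.getLogger("integration")

EXPECTED_FIELDS = {"id", "status", "total"}  # what we currently rely on

def normalize_order(api_response):
    """Thin adapter: the rest of the automation only sees this shape."""
    # Warn if the response starts looking different, even if nothing broke yet.
    unexpected = set(api_response) - EXPECTED_FIELDS
    missing = EXPECTED_FIELDS - set(api_response)
    if unexpected or missing:
        logger.warning("API response shape changed: new=%s missing=%s",
                       sorted(unexpected), sorted(missing))

    # Only the adapter knows the external field names; update them here if they change.
    return {
        "id": api_response.get("id"),
        "status": api_response.get("status", "unknown"),
        "total": api_response.get("total"),
    }
```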
Principle 8: Keep it simple.
Complex automation has more failure points. More dependencies. More things that can break. More edge cases.
Simple automation is easier to maintain. Easier to debug. Fewer failure modes.
When tempted to add complexity, ask: is this necessary? Or is it nice-to-have? If it's not essential, skip it.
I've built complex automation that did 10 things. Broke constantly. Spent more time fixing it than it saved.
I've built simple automation that does 3 things. Runs for months without issues. That's the goal.
The real test: Can it run unsupervised?
Good automation runs without constant babysitting. You check on it occasionally. But it handles problems itself.
Bad automation requires constant attention. Something breaks. You fix it. Something else breaks. You fix that too. Never ends.
If you're fixing your automation weekly, it's not reliable enough. Either the underlying process is too complex, or you didn't build in enough error handling.
Go back. Add retry logic. Add better error handling. Add logging. Add alerts. Make it resilient.
The best automation is boring. It just works. Quietly. In the background. You forget it exists until you notice time being saved.
That's the goal. Build boring automation that handles failures gracefully and keeps running.