Deployment and Rollback: Shipping Software Safely
You can write perfect code, pass every test, and get three approvals on your pull request. None of that matters if the deployment breaks production. Deployment is where your software leaves the safety of your development environment and meets the real world. Things go wrong. The question is not whether they will, but how quickly you can recover when they do.
That is what rollback is for. A rollback is the ability to revert your production system to the previous working state when a deployment causes problems. Without a rollback plan, a bad deployment means downtime, data corruption, or angry users while you scramble to fix the issue. With a rollback plan, it means a 30-second revert and a calm investigation afterward.
Why deployments fail
Even thoroughly tested code can fail in production. Here are the most common reasons:
Environment differences. Your staging environment is not identical to production. Different database sizes, different traffic patterns, different configurations. A query that takes 10ms on staging might take 10 seconds on production because the production database has 100x more data.
Configuration errors. A missing environment variable, a wrong API key, a typo in a connection string. The code is correct, but the configuration it depends on is wrong.
Dependency issues. A third-party service changed its API. A library update introduced a subtle incompatibility. The CDN is having an outage. Your code is fine, but something it depends on is not.
Data migration problems. A database migration that works on an empty test database fails on production because it locks a table with 50 million rows for 20 minutes.
Traffic spikes. The new code handles normal traffic fine, but it falls over under peak load because the new feature allocates more memory per request.
Race conditions. Bugs that only appear under concurrent load from thousands of real users, which your test suite does not simulate.
The common thread: production is different from every other environment, and no amount of testing can perfectly replicate it.
Deployment strategies
Different strategies manage the risk of deploying to production in different ways.
Direct deployment (big bang)
Push the new version to all servers at once. Replace everything simultaneously.
This is the simplest approach and the most dangerous. If the new version has a problem, every user is affected immediately. There is no gradual rollout, no safety net, no partial failure. It is all or nothing.
Direct deployment works for small applications with low traffic. For anything with real users and real consequences, you need something better.
Rolling deployment
Update servers one at a time (or in small groups). While the update is happening, some servers run the old version and some run the new version. Once all servers are updated, the rollout is complete.
Server 1: v2 (updated)
Server 2: v1 -> v2 (updating)
Server 3: v1 (waiting)
Server 4: v1 (waiting)
If a problem is detected on the first few servers, you stop the rollout and roll back just those servers. Most users never saw the broken version.
The tricky part: during the rollout, both versions are running simultaneously. Your code needs to be backward-compatible. If v2 changes the database schema in a way that v1 cannot handle, the servers running v1 will break.
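To make the mechanics concrete, here is a minimal sketch of the rolling-update loop in Python. The deploy_to and is_healthy hooks are hypothetical stand-ins for whatever your platform actually provides:

import time

def rolling_deploy(servers, new_version, old_version,
                   deploy_to, is_healthy, soak_seconds=60):
    """Update servers one at a time; halt and revert on the first failure."""
    updated = []
    for server in servers:
        deploy_to(server, new_version)    # push v2 to a single server
        time.sleep(soak_seconds)          # let it take real traffic briefly
        if not is_healthy(server):        # e.g. error-rate and latency checks
            for s in updated + [server]:  # revert only the servers we touched
                deploy_to(s, old_version)
            return False                  # most users never saw the bad version
        updated.append(server)
    return True                           # all servers now run the new version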
Blue-green deployment
Maintain two identical production environments: blue and green. One serves live traffic (say, blue). Deploy the new version to the other (green). Run smoke tests against green. When you are confident, switch the load balancer to point at green.
Before: Load Balancer -> Blue (v1, live)
        Green (v2, testing)
Switch: Load Balancer -> Green (v2, live)
        Blue (v1, standby)
Rollback is instant: flip the load balancer back to blue. No redeployment, no waiting, just a configuration change.
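As a sketch, the whole mechanism can be modeled as a pointer flip. The LoadBalancer class and smoke_tests_pass hook below are toy stand-ins for illustration, not a real load-balancer API:

class LoadBalancer:
    """Toy model: 'live' is just a pointer to one of two environments."""
    def __init__(self):
        self.versions = {"blue": "v1", "green": "v1"}
        self.live = "blue"

    def standby(self):
        return "green" if self.live == "blue" else "blue"

def blue_green_deploy(lb, new_version, smoke_tests_pass):
    target = lb.standby()
    lb.versions[target] = new_version  # deploy to the idle environment
    if smoke_tests_pass(target):
        lb.live = target               # the cutover: one pointer flip
        return True
    return False                       # live traffic never touched the new version

# Rollback after a bad cutover is the same one-line flip:
# lb.live = lb.standby()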
The downside: you need double the infrastructure. Two sets of servers, both production-grade, running at all times. For large systems, this can be expensive.
Canary deployment
Route a small percentage of traffic (say, 5%) to the new version while 95% continues on the old version. Monitor error rates, latency, and business metrics. If everything looks good, gradually increase the canary percentage: 10%, 25%, 50%, 100%.
5% traffic -> v2 (canary)
95% traffic -> v1 (stable)
If the canary shows problems, you route 100% back to v1. Only 5% of users were affected, and the damage is contained.
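A minimal sketch of the routing logic, with stub handlers standing in for the two versions:

import random

CANARY_PERCENT = 5  # raise to 10, 25, 50, 100 as metrics stay healthy; 0 rolls back

def handle_with_v1(request):   # stable version (stub for illustration)
    return f"v1 handled {request}"

def handle_with_v2(request):   # canary version (stub for illustration)
    return f"v2 handled {request}"

def route_request(request):
    """Send CANARY_PERCENT of requests to v2 and the rest to v1."""
    if random.uniform(0, 100) < CANARY_PERCENT:
        return handle_with_v2(request)
    return handle_with_v1(request)

In practice, routers usually bucket by user ID rather than rolling the dice per request, so each user sees a consistent version throughout the rollout.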
Canary deployments are the gold standard for large-scale systems. They combine the safety of gradual rollout with real production traffic testing. The new code proves itself against real users before it reaches everyone.
Feature flags
Deploy the new code to production, but keep the new feature turned off behind a flag. The code is in production, but users do not see it until you flip the flag.
if feature_flags.is_enabled("new_search_algorithm", user_id):
    return new_search(query)
else:
    return old_search(query)
You can enable the flag for internal users first, then a small percentage of real users, then everyone. If something goes wrong, disable the flag. No redeployment needed.
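Under the hood, percentage rollouts are often implemented by hashing each user into a stable bucket. Here is a sketch of the idea, not any particular flag library's implementation:

import hashlib

ROLLOUT_PERCENT = {"new_search_algorithm": 5}  # per-flag rollout percentage

def is_enabled(flag, user_id):
    """Deterministic rollout: hash the user into a stable 0-99 bucket.

    The same user always lands in the same bucket, so their experience
    stays consistent as the percentage grows.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < ROLLOUT_PERCENT.get(flag, 0)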
Feature flags decouple deployment from release. You can deploy code at any time without exposing it to users. This is powerful because it means deployment becomes routine and boring, which is exactly what you want it to be.
Rollback strategies
When something goes wrong, how do you get back to a working state?
Revert and redeploy
The simplest rollback: revert the git commit and deploy the previous version.
git revert HEAD
git push origin main
# CI/CD pipeline deploys the reverted version
This works but takes time. You have to wait for the CI/CD pipeline to build, test, and deploy. For a system where every minute of downtime costs money, this might be too slow.
Keep the previous version ready
In blue-green deployments, the old version is already running on standby. Rollback means switching the load balancer back. This takes seconds.
In container-based deployments (Kubernetes, Docker), you can keep the previous container image tagged and ready. Rolling back means telling the orchestrator to use the old image.
kubectl rollout undo deployment/my-app
One command, and the previous version is running again.
Database rollback
Code rollback is relatively easy. Database rollback is hard.
If your deployment included a database migration (adding a column, changing a schema, migrating data), reverting the code does not revert the database. You need a separate migration that undoes the schema change, and that migration needs to work even if the forward migration only partially completed.
This is why experienced teams follow two rules:
- Make migrations backward-compatible. Add new columns, do not rename or delete old ones. The old code should still work with the new schema. Once you are confident the new code is stable, clean up the old columns in a separate deployment.
- Separate schema changes from code changes. Deploy the migration first. Wait. Then deploy the code that uses the new schema. If the code deployment fails, the schema is still compatible with the old code.
Never deploy a database migration and a code change that depends on it in the same deployment. If the code breaks and you need to roll back, the old code will be running against the new schema, and nothing will work.
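Taken together, these two rules are often called the expand-and-contract pattern. Here is what the sequence looks like for a hypothetical rename of users.name to users.full_name, spread across three deployments:

# Deploy 1: expand. Purely additive, so the currently running code is unaffected.
EXPAND = """
ALTER TABLE users ADD COLUMN full_name TEXT;
UPDATE users SET full_name = name;
"""

# Deploy 2: ship code that writes both columns and reads full_name.
# If this deploy fails, rolling back the code is safe: the old column
# is still there and still correct.

# Deploy 3: contract. Only after the new code has proven stable.
CONTRACT = """
ALTER TABLE users DROP COLUMN name;
"""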
Data rollback
The hardest rollback scenario: a bug corrupted user data. Reverting the code does not fix corrupted data. You need database backups, and you need to know exactly what was corrupted so you can restore it without losing legitimate changes that happened after the corruption.
This is why production databases have automated backups, point-in-time recovery, and audit logs. Hope you never need them. Make sure they exist.
What makes deployment safe
The common thread across all these strategies is reducing blast radius and recovery time.
Blast radius is how many users are affected when something goes wrong. Direct deployment has maximum blast radius (everyone). Canary deployment has minimum blast radius (5% or whatever you set).
Recovery time is how long it takes to get back to a working state. A revert-and-redeploy rollback takes 10 to 30 minutes. A blue-green switch takes seconds. A feature flag toggle takes seconds.
The goal is to minimize both: affect the fewest users possible, and recover as fast as possible. Every deployment strategy is a different point on this tradeoff curve.
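A back-of-the-envelope way to see the tradeoff is to multiply the two numbers. The figures below are illustrative, not measurements:

# Rough impact score: fraction of users affected x minutes until recovery.
strategies = {
    "big bang + revert/redeploy": (1.00, 20.0),  # everyone, ~20 min pipeline run
    "blue-green + switch back":   (1.00, 0.5),   # everyone, but seconds to flip
    "canary + route back":        (0.05, 5.0),   # 5% of users, minutes to detect
}

for name, (blast_radius, recovery_minutes) in strategies.items():
    print(f"{name}: impact score {blast_radius * recovery_minutes:.2f}")

Canary and blue-green win not because failures are rarer, but because each failure touches fewer users for less time.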
Monitoring after deployment
Deploying is not the last step. Monitoring is. Every deployment should be followed by active monitoring of key metrics:
- Error rates. Are 500 errors spiking? Are new error types appearing?
- Latency. Is response time increasing? Are specific endpoints slower?
- Business metrics. Are conversion rates dropping? Are users abandoning flows?
- Resource usage. Is CPU or memory spiking? Is the database connection pool being exhausted?
If any metric degrades after deployment, that is your signal to roll back first and investigate later. Do not debug in production while users are suffering. Roll back, restore service, and then figure out what went wrong in a calm, non-emergency investigation.
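The "roll back first" rule can be automated with a simple threshold check against a pre-deploy baseline. The thresholds here are illustrative; real systems tune them per service:

def should_roll_back(baseline, current,
                     max_error_ratio=2.0, max_latency_ratio=1.5):
    """Compare post-deploy metrics against the pre-deploy baseline."""
    if current["error_rate"] > baseline["error_rate"] * max_error_ratio:
        return True   # errors doubled: roll back first, debug later
    if current["p99_latency_ms"] > baseline["p99_latency_ms"] * max_latency_ratio:
        return True   # p99 latency up 50%: same rule
    return False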
The best deployment processes make rollback a one-click operation. If rolling back requires a meeting, an approval chain, or a 15-step manual process, it will not happen fast enough when it matters.
Deployment in practice
Here is what a mature deployment process looks like:
- Code is merged to main.
- CI/CD pipeline runs tests, builds the artifact, and deploys to staging.
- Smoke tests run automatically against staging.
- Canary deployment sends 5% of production traffic to the new version.
- Automated monitoring watches error rates and latency for 15 minutes.
- If metrics are healthy, traffic gradually shifts to 100%.
- If metrics degrade, automatic rollback to the previous version.
- The team is notified either way.
The entire process is automated. No human needs to click a button, watch a dashboard, or make a judgment call. The system watches itself and rolls back if anything looks wrong.
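Condensed into code, that control loop might look something like this sketch, where set_traffic_percent, metrics_are_healthy, and notify are hypothetical hooks into your infrastructure:

import time

STAGES = [5, 10, 25, 50, 100]   # canary percentages from the walkthrough above

def automated_rollout(set_traffic_percent, metrics_are_healthy, notify,
                      watch_minutes=15):
    """Shift traffic in stages; roll back automatically on any degradation."""
    for percent in STAGES:
        set_traffic_percent(percent)    # e.g. update load-balancer weights
        time.sleep(watch_minutes * 60)  # automated monitoring window
        if not metrics_are_healthy():
            set_traffic_percent(0)      # automatic rollback to the old version
            notify(f"deploy rolled back at {percent}% traffic")
            return False
    notify("deploy promoted to 100% traffic")
    return True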
This level of automation takes time to build, but it pays for itself many times over. Every manual step is a potential delay, a potential mistake, and a potential source of anxiety during a deployment.
The takeaway
Deployment is not just "pushing code." It is a deliberate process of getting new software into production while managing the risk that something goes wrong. The strategies differ in how much risk they accept and how fast they can recover:
- Direct deployment is simple but risky.
- Rolling deployment limits blast radius by updating servers gradually.
- Blue-green gives you instant rollback by keeping the old version ready.
- Canary proves the new version against real traffic before rolling it out widely.
- Feature flags decouple deployment from release entirely.
And for every strategy, the rollback plan matters as much as the deployment plan. Things will go wrong. The teams that deploy safely are not the ones that never have failures. They are the ones that detect failures in seconds and recover in minutes.
Related posts
- Verification vs. Validation covers how testing ensures you built the right thing correctly, before it reaches deployment.
- Software Testing Types covers unit, integration, and other testing strategies that catch bugs before deployment.
- Implementation in Software Engineering and Coding Interviews covers technical debt and coding principles that affect deployment reliability.