Sandip Das
6 min read · May 30, 2022


After waking up in the morning, it felt like an awesome day… throughout the day everything went well… so happy… then at 3 AM at night the calls start coming from everyone on the team, especially the senior technical managers / CTO / CEO / Founder. Forget sleep: turn on the laptop, start digging into what could have caused this, and try to fix it ASAP.

— This is what a typical technical disaster sounds like (been there, done that!)

So, by definition, what is a Technical Disaster?

“A Technical Disaster is an event caused by the malfunction of a technological system and/or an error by the humans in charge of controlling or handling that technology.”

Now, let’s learn about different technical disasters and how to handle them!

1) Single Point of Failure

This happens when a single resource (server/instance) supports dev/staging, or sometimes even critical infrastructure (in the name of cost savings).

If there is a sudden spike in workload, or any direct/indirect DDoS attack, the resource gets overloaded and normal service functioning gets disrupted.

How to handle it?

Make sure the infrastructure is designed with high availability and fault tolerance in mind, and that autoscaling is enabled, so servers scale up whenever required and scale down when traffic is low.
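On AWS, for example, this could look roughly like the boto3 sketch below: an Auto Scaling group spread across two Availability Zones, plus a target-tracking policy so capacity grows under load and shrinks when traffic drops. The launch template name, subnet IDs, and group name are hypothetical placeholders.

```python
# Minimal sketch: an Auto Scaling group across two Availability Zones with a
# target-tracking policy, so there is never a single instance serving traffic.
# The launch template, subnets, and group name below are hypothetical.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
    MinSize=2,                                   # never drop to a single instance
    MaxSize=10,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb",   # two AZs = no single point of failure
)

# Scale on average CPU: add instances above ~60% utilisation, remove below it.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 60.0,
    },
)
```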

2) Database Destroyed by Mistake / Intentionally:

There is always a chance that the database gets destroyed by mistake; the reasons could be:

  • Mishandling of the database by less experienced staff
  • Executing an unverified / unreviewed script that contains unintended delete or update commands, which might remove records or corrupt the data
  • A hacker breaking into the DB instance and removing records, encrypting them, or corrupting the data

How to handle it?

Make sure to take and store daily production database backups. These days most standard databases also support “Point-in-time recovery”, which helps to restore the database to a point in time just before the data deletion/corruption.
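With AWS RDS, for instance, a point-in-time restore could look roughly like this (the instance identifiers and timestamp below are hypothetical):

```python
# Minimal sketch: restore an RDS database to a point in time just before the
# bad delete/corruption happened. Identifiers and the timestamp are hypothetical.
from datetime import datetime, timezone

import boto3

rds = boto3.client("rds")

rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="prod-db",
    TargetDBInstanceIdentifier="prod-db-restored",
    RestoreTime=datetime(2022, 5, 30, 2, 55, tzinfo=timezone.utc),  # just before the incident
)
# After verifying the restored data, point the application at the new instance
# (or rename it to take the place of the damaged one).
```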

3) DNS Failure:

A DNS failure occurs when users are unable to connect to an IP address via a domain name. A message may pop up saying “DNS server not available” or “Server DNS Address could not be found.”

This often happens when we use the domain “A” record type and point a domain/sub-domain to a particular IP; in the event of a server/host failure, that IP no longer serves requests, which disrupts the service.

How to solve it?

Make sure to utilize load balancers along with auto-scaling groups, and use the domain “CNAME” record type to point the domain/subdomain to the load balancer host address.

In this case, the load balancer resolves to healthy IPs, and in the event of a host failure it will automatically detect it and redirect requests to the next available healthy server.
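A rough sketch of that record change with boto3 and Route 53 could look like this (the hosted zone ID, domain, and load balancer address are hypothetical placeholders):

```python
# Minimal sketch: point a subdomain at the load balancer's DNS name with a
# CNAME record instead of hard-coding a single server IP in an "A" record.
# Hosted zone ID, domain, and load balancer address are hypothetical.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z123EXAMPLE",
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "CNAME",
                    "TTL": 60,
                    "ResourceRecords": [
                        {"Value": "my-alb-1234567890.us-east-1.elb.amazonaws.com"}
                    ],
                },
            }
        ]
    },
)
```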

4) Cloud Service Provider (CSP) Region Failure:

Well, this just happens, and the worst part is that it can happen for reasons we can think of and reasons we can't even imagine. The absolute truth is that it can happen at any time, without warning, for any cloud service provider 🤯

So it's better to prepare for it. Just for the sake of reasoning, these are the most common reasons so far:

  • All resources in a particular region get used up, and nothing is left to provision when needed
  • DNS misconfiguration or hardware issues on the Cloud Service Provider's side
  • A coding issue on the Cloud Service Provider's side, which might affect its users

How to solve it?

Make sure to utilize Infrastructure as Code (such as Terraform / Pulumi, etc.), so that if one region is down, we can quickly provision and configure all cloud resources in another region.

Well, multi-cloud is a thing, and it's real. If it's critical infrastructure for which any kind of disruption is not an option, it's better to opt for multi-cloud and provision resources across multiple clouds; in the event of one cloud's region failure (or an all-region failure), just provision resources on another cloud for that specific region, and of course enjoy the price competition. Kubernetes can be really helpful in achieving this feat (but of course complexity will grow based on the project).
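As a rough illustration of the "parameterise everything by region" idea (not a substitute for real Terraform/Pulumi code), the same provisioning logic can simply take the target region as an input. The region names and bucket names below are hypothetical:

```python
# Minimal sketch: the same provisioning code, parameterised by region, so a
# failover region can be stood up quickly if the primary is unavailable.
# Region and bucket names are hypothetical; real setups would drive this
# from Terraform/Pulumi rather than an ad-hoc script.
import boto3

PRIMARY_REGION = "us-east-1"
FAILOVER_REGION = "eu-west-1"


def provision(region: str) -> None:
    """Provision the (example) resources for one region."""
    s3 = boto3.client("s3", region_name=region)
    kwargs = {}
    if region != "us-east-1":  # us-east-1 must not pass a LocationConstraint
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    s3.create_bucket(Bucket=f"my-app-assets-{region}", **kwargs)
    # ... EC2 / RDS / DNS resources would follow the same region-parameterised pattern.


try:
    provision(PRIMARY_REGION)
except Exception:
    # If the primary region is unavailable, fall back to the secondary one.
    provision(FAILOVER_REGION)
```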

5) Coding / Programming Failure:

If poor-quality code gets pushed to servers/services, it's a surefire way to fail. Of course, the reason for this could be anything, like:

  • Less experienced engineers
  • The constant demand for fast feature delivery
  • Stressful working environment
  • Programmers not getting appreciated, and more

How to solve it?

  • Make sure to use test libraries and add relevant test cases so code can be automatically tested (a minimal example is shown after this list).
  • Have CI/CD in place to ensure a proper integration, testing, and deployment pipeline. In the event of a test-case failure, engineers should get notified and fix the issues; in case of any failure in deployment, engineers should be notified and production servers/services should automatically roll back to the previous working state.
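As a tiny illustration of the first point, here is a minimal pytest-style test that a CI pipeline can run on every push; the calculate_invoice_total function is a made-up example, not from any real codebase:

```python
# Minimal sketch: a pytest-style unit test that CI runs on every push,
# failing the build (and blocking deployment) when behaviour regresses.
# calculate_invoice_total is a hypothetical example function.
def calculate_invoice_total(items: list[dict], tax_rate: float) -> float:
    """Sum item prices times quantities and apply tax."""
    subtotal = sum(item["price"] * item["quantity"] for item in items)
    return round(subtotal * (1 + tax_rate), 2)


def test_invoice_total_applies_tax():
    items = [{"price": 10.0, "quantity": 2}, {"price": 5.0, "quantity": 1}]
    assert calculate_invoice_total(items, tax_rate=0.10) == 27.5


def test_empty_invoice_is_zero():
    assert calculate_invoice_total([], tax_rate=0.10) == 0.0
```

A CI job then simply runs pytest and fails the build if any of these assertions break, so broken code never reaches the deployment stage.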

6) Security Failure:

This happens when firms neglect security measures or allocate a tiny budget to them.

Common reasons are:

  • Storing passwords in plain text
  • Not encrypting data/files/critical credentials
  • Making confidential documents publicly accessible
  • Using easy-to-guess / not-so-strong passwords, and more

How to handle it?

  • Make sure to use strong passwords, and always store them in a standard hashed/encrypted form (a minimal sketch follows this list)
  • Store critical data/files/documents so they are only accessible to authorized people; nothing should be public by default
  • If it's a medium / large / enterprise company and the budget allows, have a separate security team
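As a tiny illustration of the first point, here is a minimal password-hashing sketch using only the Python standard library (PBKDF2-HMAC-SHA256). In practice you would likely reach for a dedicated library such as bcrypt or argon2; this just shows the idea of never storing the raw password:

```python
# Minimal sketch: never store passwords in plain text; store a salted hash.
# Uses only the Python standard library (PBKDF2-HMAC-SHA256).
import hashlib
import hmac
import os


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest) to store instead of the raw password."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt, digest


def verify_password(password: str, salt: bytes, stored_digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return hmac.compare_digest(candidate, stored_digest)


salt, digest = hash_password("correct horse battery staple")
assert verify_password("correct horse battery staple", salt, digest)
assert not verify_password("guess123", salt, digest)
```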

Well, there are many more, but I found the above to be the most common ones. While talking about disasters & disaster recovery, you will also need to know about these two key metrics:

RTO:

RTO (Recovery Time Objective) is the maximum amount of time your application can be offline. It depends on the SLAs you offer your customers: the promise you make to them regarding the availability of your service, and the consequences of failing to deliver.

RPO:

RPO (Recovery Point Objective) is the maximum amount of data loss, measured in time, that is acceptable during an interruption.

RTO and RPO values should be as low as possible, which means the application recovers from an interruption more quickly and loses less data. For example, an RTO of 1 hour and an RPO of 15 minutes means the service must be back within an hour and may lose at most the last 15 minutes of data.
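As a quick back-of-the-envelope sketch (the numbers below are hypothetical), you can sanity-check a backup schedule and a recovery drill against those targets:

```python
# Minimal sketch: check that the backup interval meets the promised RPO and
# that a measured recovery drill meets the RTO. All numbers are hypothetical.
from datetime import timedelta

rpo_target = timedelta(minutes=15)   # at most 15 minutes of data may be lost
rto_target = timedelta(hours=1)      # service must be back within 1 hour

backup_interval = timedelta(minutes=10)      # how often backups/snapshots run
last_recovery_drill = timedelta(minutes=42)  # measured time to restore service

assert backup_interval <= rpo_target, "Backups are too infrequent for the promised RPO"
assert last_recovery_drill <= rto_target, "Recovery takes longer than the promised RTO"
```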

Did I miss any important points here? If so, let me and others know in the comments section!

About the Author:

Sandip Das works as a Sr. Cloud Solutions Architect & DevOps Engineer for multiple tech product companies/start-ups, holds the AWS DevOps Engineer Professional certification, and also holds the title of “AWS Container Hero”.

He is always in “keep on learning” mode, enjoys sharing knowledge with others, and currently holds 5 AWS Certifications. Sandip finds blogging a great way to share knowledge: he writes articles on LinkedIn about Cloud, DevOps, Programming, and more. He also creates video tutorials on his YouTube channel.
