What is resilience testing with real-life examples

Resilience testing belongs to the category of “non-functional testing” and examines how an application behaves under stress. With consumer demands increasing, resilience testing is more important than ever.

That’s why companies like Cisco are taking resilience testing very seriously, with 75% of all of Cisco’s applications tested for resilience as of mid-2016.

What is software resilience testing?

Software testing, in general, involves many different techniques and methodologies to test every aspect of the software regarding functionality, performance, and bugs.

Resilience testing, in particular, is a crucial step in ensuring applications perform well in real-life conditions. It belongs to the non-functional category of software testing, which also includes compliance testing, endurance testing, load testing, recovery testing and others.

As the term indicates, resilience in software describes its ability to withstand stress and other challenging factors to continue performing its core functions and avoid loss of data.

Or as defined by IBM: “Software solution resiliency refers to the ability of a solution to absorb the impact of a problem in one or more parts of a system, while continuing to provide an acceptable service level to the business.”

Since no software can be guaranteed never to fail, you should build functions for recovering from disruptions into your software. With fail-safe mechanisms in place, it is possible to largely avoid data loss in case of a crash and to restore the application to the last working state before the crash with minimal impact on the user.
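As a minimal sketch of restoring “the last working state”, the Python snippet below saves application state to a checkpoint file and reloads it after a crash. The function names, the JSON format and the file layout are assumptions made for illustration, not something described in the sources above.

```python
import json
import os
import tempfile


def save_checkpoint(state: dict, path: str) -> None:
    """Write state atomically so a crash mid-write keeps the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory first.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    # os.replace swaps the file in a single step on the same filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the last saved state, or a copy of default if none is usable."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return dict(default)
```

A resilience test for this mechanism could kill the process at random points and then check that load_checkpoint still returns a complete, earlier state.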

One way to improve the resilience of software is to host it on cloud servers, which reduces exposure to failures in your own infrastructure in favor of a much more resilient cloud architecture. Disruptions do occur at the cloud level as well, but cloud operators usually have sophisticated resilience and recovery systems in place.

Examples of how software resilience testing is done

Resilience testing at Netflix

A great example of how resilience testing can be done successfully at the cloud level is Netflix and its so-called Simian Army. Even though all of the Netflix services are hosted on Amazon Web Services’ state-of-the-art cloud servers with cutting-edge hardware, the company realized that the sheer scale of its operations makes failures unavoidable.

To prepare for these failures, Netflix developed its own tool that creates random disruptions in the system in order to test its resilience. The tool was designed to simulate “unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables” and was aptly called Chaos Monkey.

By identifying weaknesses in their systems, Netflix can then build automated recovery mechanisms to deal with them should they occur again in the future.

The tool is run while Netflix continues to operate its services, although in a controlled environment and in ideal time frames. By only running Chaos Monkey during US business hours on weekdays, the company ensures that their engineers will have the maximum capacity for dealing with the disruptions and that server loads are minimal compared to peak consumer usage times.
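The scheduling idea can be illustrated with a short Python sketch. This is not Netflix’s implementation: the function names are hypothetical, the 9:00 to 17:00 window is an assumed definition of business hours, and the instance list stands in for whatever the cloud provider’s API would return.

```python
import random
from datetime import datetime


def in_business_hours(now: datetime) -> bool:
    """True on weekdays between 09:00 and 17:00 (assumed window)."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def pick_victim(instances, now: datetime, rng=random):
    """Choose one instance to terminate, or None outside the allowed window."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(list(instances))
```

In a real setup, the returned instance would be handed to the provider’s terminate call, and engineers would watch whether the service recovers without intervention.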

After early successes, Netflix quickly developed additional tools to test other kinds of failures and conditions. Among these tools were Latency Monkey, Conformity Monkey, Doctor Monkey and others, collectively known as the Netflix Simian Army.

Resilience testing with the Simian Army has since become a popular approach for many companies, and in 2016 Netflix released Chaos Monkey 2.0 with improved UX and integration for Spinnaker.

Resilience testing at IBM

To get an idea of how companies react to different kinds of failures, we can look at how resilience testing is done at IBM. The team at IBM has identified two significant components of resiliency: the impact of the problem and the service level that is considered acceptable once the problem occurs.

Ideally, any failure would have no impact at all on the consumer. Since that is impossible to achieve, IBM focuses on minimizing that impact as much as possible. If a machine that is hosting the system or one of its components crashes, for instance, the requests on their way to that machine get redirected to another machine instantly and as transparently as possible to the users.
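The redirect behavior described above can be sketched in a few lines of Python. This is an illustration only, not IBM’s mechanism: backends are modeled as plain callables, and send_with_failover and AllBackendsFailed are hypothetical names.

```python
from typing import Callable, Sequence


class AllBackendsFailed(Exception):
    """Raised when no machine could handle the request."""


def send_with_failover(request: str,
                       backends: Sequence[Callable[[str], str]]) -> str:
    """Try each backend in order and return the first successful response."""
    errors = []
    for backend in backends:
        try:
            return backend(request)
        except ConnectionError as exc:
            # This machine is down: remember the error, move to the next one.
            errors.append(exc)
    raise AllBackendsFailed(f"{len(errors)} backend(s) failed")
```

A resilience test would deliberately make the first backend raise ConnectionError and assert that the caller still receives a valid response.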

A more dramatic event would be the failure of an entire data center, in which case “all the work that was being processed by that data center is continued by another data center – again as transparently as possible to the users, although in the event of a catastrophic outage you should be prepared for a significant impact.”

The goal at IBM is to minimize the impact and duration of failures. For a machine failure, this duration is usually measured in minutes, while a failure in a data center could cause disruptions of several hours.

To come up with meaningful resiliency test cases, IBM uses the solution operational model, which identifies all the components of the solution as well as their interactions. The team then reviews the solution’s non-functional requirements to compile a list of requirements such as response time, throughput and availability.
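To show how such requirements might become a pass/fail check, here is a hedged Python sketch. The metric definitions are assumptions chosen for illustration (95th-percentile latency, successful requests per second, and share of successful requests as availability), not IBM’s definitions, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Requirements:
    max_response_ms: float     # limit for 95th-percentile latency
    min_throughput_rps: float  # successful requests per second
    min_availability: float    # share of requests that succeeded, 0..1


def check_run(latencies_ms: Sequence[float], failures: int,
              duration_s: float, req: Requirements) -> Dict[str, bool]:
    """Compare one test run (e.g. during an induced failure) to the limits."""
    successes = len(latencies_ms)
    total = successes + failures
    if successes:
        ordered = sorted(latencies_ms)
        # Nearest-rank 95th percentile using integer arithmetic.
        rank = (95 * successes + 99) // 100
        response_ok = ordered[rank - 1] <= req.max_response_ms
    else:
        response_ok = False
    availability = successes / total if total else 0.0
    return {
        "response_time": response_ok,
        "throughput": successes / duration_s >= req.min_throughput_rps,
        "availability": availability >= req.min_availability,
    }
```

Running the same check before, during and after an induced failure shows whether the service level stays within the agreed limits.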

Wrapping it up.

With consumer expectations increasing, it is vital to ensure minimal disruptions to any service or software that enters the market these days. While cloud hosting can go a long way in minimizing failures, resilience testing should still make up a significant part of overall software testing.

There are many different approaches to resilience testing. Chaos engineering with the Netflix Simian Army can help uncover unusual sources of problems and potential weaknesses in the system’s architecture. It does require the capacity for controlled testing, though, and for many companies a more structured and theoretical approach like the one used by IBM makes sense.

Rebecca Vogels

8 years ago
