AWS Fault Injection Simulator Review

28 December 2021

AWS Fault Injection Simulator: Controlled Chaos Experiments

AWS Fault Injection Simulator Features

Netflix Vs AWS Chaos Engineering: Who's first?

AWS FIS Benefits

Takeaway Message

The last thing you want to see when debugging your code is chaos. But what if this chaos is controlled and launched by the developers themselves? Why deliberately create turbulence in the well-coordinated operation of your application? How can you achieve peace of mind when releasing important features? When exactly does the practice of chaos engineering come in handy? Let’s find out!

AWS Fault Injection Simulator: Controlled Chaos Experiments

When digital systems suddenly collapse, the consequences for businesses are huge. Downtime can cost big businesses thousands of dollars, but the impact goes well past lost revenue. Sometimes, IT disruptions are just a sliver of the full scope of problems that lie in wait. Falling stock prices, a damaged reputation, and lower customer satisfaction can all follow a digital systems crash. On top of that, IT system downtime indicates internal, unresolved vulnerabilities in the company’s operations.

As modern shared IT ecosystems become more complex, it is clear why failures occur, yet they remain hard to fix and prevent. Most of the time, mutual interdependence (cloud computing, microservice architecture, etc.) generates many points of failure that cannot be predicted within the working environment. Hence, introducing a certain level of chaos into a system can be a good way to find and fix damage before it becomes irreparable. AWS Fault Injection Simulator provides a complete tool for that.

AWS Fault Injection Simulator is a fully managed service that allows you to run experiments on AWS to improve your app’s performance, observability, and resilience. Fault injection experimentation is one approach within chaos engineering. Chaos engineering is a cost-efficient testing practice that involves conducting a series of controlled experiments to determine the state of your IT system in a working environment. As a result of these experiments, you get useful insights into how the system behaves under unstable conditions, such as an unreliable network.

For developers, introducing chaos allows them to get an extra dose of confidence that the platform or app they are developing will be released without bugs or failures. For instance, it can help answer functionality-related questions such as:

- What happens when a service is unavailable?
- What happens when an application receives too much traffic or becomes unavailable?
- Will we run into cascading errors when an application crashes due to a single point of failure?
- What happens when our application crashes?
- What happens when something goes wrong with the network?
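To make the first question concrete, here is a minimal, in-process sketch of a chaos experiment. It does not use AWS at all: `PriceService`, `checkout_total`, and `run_experiment` are hypothetical names invented for this illustration. The experiment records the steady state, injects a fault (the dependency becomes unavailable), observes the behavior, and always rolls the fault back.

```python
class PriceService:
    """Hypothetical downstream dependency that can be made unavailable."""

    def __init__(self):
        self.available = True

    def get_price(self, item):
        if not self.available:
            raise ConnectionError("price service unavailable")
        return {"book": 12.5}.get(item, 0.0)


def checkout_total(service, items, fallback_price=0.0):
    """Sum item prices, falling back to a default when the dependency fails."""
    total = 0.0
    degraded = False
    for item in items:
        try:
            total += service.get_price(item)
        except ConnectionError:
            total += fallback_price
            degraded = True
    return total, degraded


def run_experiment(service, items):
    """Record the steady state, inject the fault, observe, then restore."""
    baseline = checkout_total(service, items)
    service.available = False  # inject the fault
    try:
        under_fault = checkout_total(service, items)
    finally:
        service.available = True  # always roll back, even on errors
    return baseline, under_fault


baseline, under_fault = run_experiment(PriceService(), ["book", "book"])
print("steady state:", baseline)
print("under fault:", under_fault)
```

In this sketch the checkout degrades gracefully instead of crashing, which is exactly the kind of answer a chaos experiment is meant to surface.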

Note: Dark debt is hidden debt that inevitably arises in complex systems. Unlike technical debt, dark debt is invisible while the system is being built. It arises at the junction of components, or of hardware and software, and can lead to a cascade of problems: a fault in one component compounds a fault in another and, as a result, the whole system shuts down. For example, in 2010, a cascading database failure took Facebook down for about 2.5 hours. The system that checked the validity of configuration values began to delete them by mistake, not only in the caching subsystem but also in the database that was the primary source.

AWS Fault Injection Simulator Features

AWS Fault Injection Simulator is a fully managed fault injection service that makes it easier to discover an app’s vulnerabilities and improve its performance, observability, and resiliency.

Simple Setup

AWS Fault Injection Simulator incorporates established chaos engineering practices, making it simple to build and run fault injection experiments without installing any agents. To get started, teams can use the sample experiments. Experiments are assembled from predefined fault injection actions, such as stopping an instance, throttling an API, or stopping a cron process. The tool integrates with Amazon CloudWatch, so you can use your existing metrics to monitor experiments.

Real-world Scenarios

Simplistic scenarios can provide insufficient coverage of the real-life conditions that provoke unforeseen failures. That is why the Simulator supports impairing resources of different types simultaneously. The resources affected by a fault can be selected at random, and custom fault types can be created and managed using AWS Systems Manager.
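As a rough sketch of what combining faults across resource types might look like, the snippet below builds the `targets` and `actions` sections of an experiment template as plain Python dictionaries. The target and action names, the `environment: test` tag, and the durations are assumptions chosen for illustration; the helper `first_wave` is hypothetical and only inspects the dictionary. Actions without `startAfter` begin together, and `COUNT(n)` picks that many matching resources at random.

```python
def build_multi_fault_fragment():
    """Return (targets, actions) for a hypothetical multi-resource experiment."""
    targets = {
        "randomInstances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"environment": "test"},  # assumed tag
            "selectionMode": "COUNT(2)",  # two matching instances, chosen at random
        },
        "oneDatabase": {
            "resourceType": "aws:rds:db",
            "resourceTags": {"environment": "test"},  # assumed tag
            "selectionMode": "COUNT(1)",
        },
    }
    actions = {
        # These two actions have no startAfter, so they start simultaneously.
        "stopInstances": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT5M"},
            "targets": {"Instances": "randomInstances"},
        },
        "rebootDatabase": {
            "actionId": "aws:rds:reboot-db-instances",
            "targets": {"DBInstances": "oneDatabase"},
        },
        # This action waits for both earlier actions, adding impairment gradually.
        "failoverDatabase": {
            "actionId": "aws:rds:reboot-db-instances",
            "parameters": {"forceFailover": "true"},
            "targets": {"DBInstances": "oneDatabase"},
            "startAfter": ["stopInstances", "rebootDatabase"],
        },
    }
    return targets, actions


def first_wave(actions):
    """Names of the actions that start as soon as the experiment begins."""
    return sorted(name for name, spec in actions.items() if not spec.get("startAfter"))


targets, actions = build_multi_fault_fragment()
print(first_wave(actions))
```

A complete template would also need a role and, ideally, stop conditions; this fragment only covers the scenario design.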

Safety Controls

Experiments in live environments carry the risk of unintended impact. To keep your fault injection experiments scoped and guarded, AWS lets you target resources by environment, application, and other dimensions using tags. For instance, you can increase CPU usage on 10% of your instances with the tag “environment”:“prod”. In addition, you can limit or stop an experiment by setting rules in the Simulator based on CloudWatch alarms. For instance, an experiment running against a website will come to a complete halt if a web page’s response time exceeds an acceptable threshold.
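The CPU example can be sketched as an experiment template built in Python. This is a minimal illustration under stated assumptions, not a reference configuration: the function name, the ARNs passed in, the region, and the two-minute duration are all placeholders. The CPU stress itself is delivered through the `aws:ssm:send-command` action with the `AWSFIS-Run-CPU-Stress` Systems Manager document, and the CloudWatch alarm is attached as a stop condition.

```python
import json


def build_cpu_stress_template(alarm_arn, role_arn, region="us-east-1",
                              percent=10, duration_minutes=2):
    """Build a template dict: CPU stress on a share of prod-tagged instances."""
    document_arn = f"arn:aws:ssm:{region}::document/AWSFIS-Run-CPU-Stress"
    return {
        "description": f"CPU stress on {percent}% of prod instances",
        "targets": {
            "prodInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"environment": "prod"},
                "selectionMode": f"PERCENT({percent})",
            }
        },
        "actions": {
            "cpuStress": {
                "actionId": "aws:ssm:send-command",
                "parameters": {
                    "documentArn": document_arn,
                    # Document parameters are passed as a JSON string.
                    "documentParameters": json.dumps(
                        {"DurationSeconds": str(duration_minutes * 60)}
                    ),
                    "duration": f"PT{duration_minutes}M",
                },
                "targets": {"Instances": "prodInstances"},
            }
        },
        # The experiment halts if this alarm goes into the ALARM state.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        "roleArn": role_arn,
    }


# Placeholder ARNs, for illustration only.
template = build_cpu_stress_template(
    alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:page-latency-high",
    role_arn="arn:aws:iam::111122223333:role/fis-experiment-role",
)
print(json.dumps(template, indent=2))

# To register it (requires boto3 and AWS credentials):
#   import boto3, uuid
#   boto3.client("fis").create_experiment_template(
#       clientToken=str(uuid.uuid4()), **template)
```

Keeping the template as a pure function makes it easy to review and unit test before anything touches a live environment.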

Security Model

AWS Fault Injection Simulator integrates with AWS Identity and Access Management (IAM) for better control over which users and roles are allowed to access and run experiments, and which resources and services an experiment can affect.
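As a hedged sketch of how that split might be expressed, the snippet below writes three IAM policy documents as Python dictionaries: a trust policy that lets the FIS service assume an experiment role, a permissions policy that limits what that role may touch, and a policy for the people who run experiments. The variable names, the `environment: test` tag condition, and the wildcard resources are assumptions for illustration; a real setup would scope resources more tightly.

```python
import json

# Lets the FIS service assume the experiment role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "fis.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# What the experiment role may affect: only instances tagged environment=test (assumed tag).
experiment_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "ec2:DescribeInstances", "Resource": "*"},
        {
            "Effect": "Allow",
            "Action": ["ec2:StopInstances", "ec2:StartInstances"],
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {"StringEquals": {"aws:ResourceTag/environment": "test"}},
        },
    ],
}

# Who may run experiments: attach to the users or roles that operate FIS.
operator_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "fis:StartExperiment",
            "fis:StopExperiment",
            "fis:GetExperiment",
            "fis:ListExperiments",
        ],
        "Resource": "*",  # simplification; narrow this in practice
    }],
}

for name, policy in [("trust", trust_policy),
                     ("experiment role", experiment_role_policy),
                     ("operator", operator_policy)]:
    print(name, json.dumps(policy))
```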

Experiment Visibility

The Simulator’s console and APIs provide visibility during each stage of an experiment. While the experiment runs, you can observe which actions have been performed. Once it is finished, you can view detailed information about what actions were taken, whether the stop conditions were triggered, how metrics compared with the expected steady state, and so on. To keep operational metrics accurate and troubleshooting efficient, you can also determine which resources and APIs were affected by the experiment.
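A small helper can condense an experiment description into something readable. The sketch below assumes the response shape returned by the `GetExperiment` API (an overall `state` plus a per-action `state`); the function name `summarize_experiment` and the sample data are hypothetical, written by hand for this illustration.

```python
def summarize_experiment(experiment):
    """Reduce an experiment description to overall and per-action status."""
    state = experiment.get("state", {})
    status = state.get("status", "unknown")
    return {
        "status": status,
        "reason": state.get("reason", ""),
        "actions": {
            name: action.get("state", {}).get("status", "unknown")
            for name, action in experiment.get("actions", {}).items()
        },
        # True when the run ended early, e.g. by a stop condition or a manual stop.
        "halted_early": status in ("stopping", "stopped"),
    }


# Hand-written sample data, not real service output.
sample = {
    "id": "EXP-example",
    "state": {"status": "stopped", "reason": "Experiment halted by stop condition."},
    "actions": {
        "cpuStress": {"actionId": "aws:ssm:send-command",
                      "state": {"status": "stopped"}},
    },
}
summary = summarize_experiment(sample)
print(summary)

# With boto3 and credentials, the input would come from:
#   boto3.client("fis").get_experiment(id=experiment_id)["experiment"]
```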

Console and Programmatic Access

AWS Fault Injection Simulator can be used with the AWS Management Console, AWS CLI, and AWS SDKs. The APIs provide programmatic access to the service, so fault injection testing can be integrated into CI/CD pipelines and custom tooling.
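For a pipeline step, one possible approach is to start an experiment from an existing template and poll until it reaches a final state. The sketch below takes the client as a parameter, so it works with `boto3.client("fis")` or with a stub in tests; `run_and_wait`, the polling interval, and the timeout are assumptions made for this example.

```python
import time
import uuid

TERMINAL_STATES = {"completed", "stopped", "failed"}


def run_and_wait(fis_client, template_id, poll_seconds=15.0,
                 timeout_seconds=1800.0, sleep=time.sleep):
    """Start an experiment and block until it finishes; return the final status."""
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    waited = 0.0
    while True:
        state = fis_client.get_experiment(id=experiment_id)["experiment"]["state"]
        if state["status"] in TERMINAL_STATES:
            return state["status"]
        if waited >= timeout_seconds:
            # Do not leave a fault running if the pipeline gives up.
            fis_client.stop_experiment(id=experiment_id)
            raise TimeoutError(f"experiment {experiment_id} did not finish in time")
        sleep(poll_seconds)
        waited += poll_seconds


# In a pipeline (requires boto3 and AWS credentials; the template ID is a placeholder):
#   import boto3
#   status = run_and_wait(boto3.client("fis"), "EXT-your-template-id")
#   if status != "completed":
#       raise SystemExit(f"chaos experiment ended with status {status}")
```

Failing the build on any status other than `completed` is a design choice; some teams may prefer to treat a triggered stop condition as a finding to review rather than a hard failure.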

Netflix Vs AWS Chaos Engineering: Who's first?

Chaos engineering is a growing trend for DevOps and IT teams. Even companies like Netflix and Amazon use these principles when developing products. Netflix is the cradle of chaos engineering, an increasingly important approach to developing complex modern technological architectures.

Chaos Engineering was first developed by Netflix in 2008 when their subscription streaming service was migrated to the public cloud. Netflix engineers noted that they needed new ways to test their system for fault tolerance.

For this, Chaos Monkey was created in 2010. Since then, chaos engineering has grown, and companies like Google, Facebook, Amazon, and Microsoft have adopted similar testing models.

AWS FIS Benefits

AWS FIS is a comprehensive tool for:

- Enhancing your app’s performance, observability, and resilience: AWS makes it easy to launch, monitor, and share your experiments throughout the entire process, which includes launching your app, handling multiple users, logging usage, and reporting on the results achieved. This helps uncover weak points in your software’s performance, along with other unpredictable weaknesses that traditional testing would miss.
- Testing the performance of your application on AWS: AWS Fault Injection Simulator helps you run experiments across various AWS services (Amazon EC2, Amazon EKS, Amazon ECS, Amazon RDS). You can conduct experiments to examine your app’s operation on AWS at scale and make sure it functions correctly.
- Ensuring the safety of fault injection stress tests: AWS FIS grants full control over the experiments that users run. You can suspend the test or roll back the changes at any time.
- Starting fault injection stress tests quickly and simply: AWS FIS has out-of-the-box templates that allow you to set up and run experiments rapidly. The Simulator arranges the experiments in a way that enables teams to test their app promptly and select from a predefined list of further steps.
- Obtaining quality results from realistic failure conditions: AWS FIS is built to enable failure scenarios that are impossible or challenging for teams to create on their own. With FIS, you can gradually or simultaneously impair different layers of the app in a production environment. This enables proper validation of the app’s performance.

Advantages for Development

One important benefit of chaos engineering for developers is the refinement of crash tests. During crash tests, developers often face unexpected system failures that share the same root cause. For example, an app might crash only because the user happened to be flipping through the photo gallery at that moment. Since every situation is different, it is impossible to test every combination of conditions by hand. To automate this process, developers can adopt chaos engineering.

Chaos engineering incorporates mutation testing. In this practice, developers change small pieces of code and see how this affects the tests. If the tests still pass after the change, then those pieces of code are not covered by enough tests.
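The idea can be shown with a toy example. Everything below is hypothetical and hand-rolled for illustration (real projects would normally use a mutation testing tool): `is_adult_mutant` is the original function with `>=` changed to `>`, and each "suite" is just a function that returns whether its checks pass.

```python
def is_adult(age):
    """Original code under test."""
    return age >= 18


def is_adult_mutant(age):
    """Mutant: the comparison >= was changed to >."""
    return age > 18


def weak_suite(fn):
    """Checks only values far from the boundary."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    """Adds the boundary case that distinguishes >= from >."""
    return weak_suite(fn) and fn(18) is True


def mutant_survives(suite):
    """A mutant survives when the suite passes on both original and mutant."""
    return suite(is_adult) and suite(is_adult_mutant)


print("weak suite lets the mutant survive:", mutant_survives(weak_suite))
print("strong suite lets the mutant survive:", mutant_survives(strong_suite))
```

A surviving mutant points to a gap in the tests; here the missing piece is the boundary value 18.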

Advantages for Business

Most companies resort to chaos engineering because they need to reduce the cost of testing. Chaos engineering frees up testers’ time as tests become automated, and it shortens the release cycle by providing quick feedback.

Chaos engineering offers many benefits that are not available with other forms of software testing or failure testing. Failure tests can only test one condition in a binary structure, which prevents us from testing the system under unprecedented or unexpected loads.

Chaos engineering, on the other hand, can account for complex, varied, real-world problems and downtime. With its help, we can fix issues and gain new insights into the application for future improvement.

Chaos experiments help reduce failures and outages while improving our understanding of our system design. Chaos engineering improves service availability and reliability, so customers experience less downtime. At the business level, it can also help prevent revenue loss and reduce maintenance costs.

Takeaway Message

In the fast-paced world of technology, getting your business online is vital to its survival in a highly competitive environment. It takes a customer only a few minutes to switch to another service, while building a lasting solution takes a long time. Chaos engineering is one way to shorten development time and still be confident in what you create. There are other ways to develop reliable software. Find out more in our blog article.