I believe that reading Amazon's white paper, AWS Well-Architected Framework, is essential to building software systems successfully on Amazon Web Services (AWS). The paper puts into perspective best practices for designing and running systems that are efficient, reliable, secure and cost-effective. It lists five concepts as the pillars that must support a well-architected system on AWS and, I believe, on other cloud services too.
The pillars are: operational excellence, security, reliability, performance efficiency and cost optimization.
Operational excellence is: "The ability to support development and run workloads effectively, gain insight into their operations, and to continuously improve supporting processes and procedures to deliver business value."
In this article, I summarise what I consider the noteworthy points on operational excellence from the paper.
To achieve operational excellence, these five design principles must be applied:
1. Perform operations as code. Manual human actions are prone to error, so eliminate them as much as possible. Procedures that people are expected to carry out before or in response to events should be automated wherever possible to achieve consistency and reduce errors.
2. Make small, frequent and reversible changes. Code, configurations and AWS resources must be separable into components, and these components must allow regular, small, incremental updates. An update must also be reversible in case it introduces a failure into the system. Small, frequent updates let beneficial changes reach the workload (the system of components) steadily.
3. Review and improve operations procedures as frequently as possible. Always search for opportunities to improve operations procedures. Also, make sure to adapt your operations procedures as your components change and evolve.
4. Always expect failures and design procedures to respond to those failures in an efficient manner. Simulate failures and test designed response procedures to determine their effectiveness.
5. Do not let any operational failures pass without learning from them. Improve operations procedures with knowledge and understanding gained from previous failures and events.
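To make the first, second and fourth principles concrete, here is a minimal sketch of a runbook written as code. It is my own illustration, not something taken from the paper: the `Step` class, the `run_runbook` function and the toy `state` dictionary are all hypothetical names, and a real runbook would call your deployment tooling instead of mutating a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One small, reversible change in a runbook (hypothetical structure)."""
    name: str
    apply: Callable[[], None]
    rollback: Callable[[], None]


def run_runbook(steps: List[Step]) -> List[str]:
    """Apply steps in order; on failure, undo completed steps in reverse.

    Returns a log of what happened so the run can be reviewed afterwards.
    """
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        try:
            step.apply()
        except Exception as exc:
            log.append(f"failed: {step.name} ({exc})")
            # Reverse order, so the most recent change is undone first.
            for finished in reversed(done):
                finished.rollback()
                log.append(f"rolled back: {finished.name}")
            return log
        done.append(step)
        log.append(f"applied: {step.name}")
    return log


# Toy system state standing in for real resources (an assumption).
state = {"version": 1, "flag": False}


def bump_version() -> None:
    state["version"] = 2


def restore_version() -> None:
    state["version"] = 1


def enable_flag() -> None:
    # Simulated failure, used to test the response procedure.
    raise RuntimeError("simulated outage")


def disable_flag() -> None:
    state["flag"] = False


result = run_runbook([
    Step("bump version", bump_version, restore_version),
    Step("enable flag", enable_flag, disable_flag),
])
for line in result:
    print(line)
```

Because every step carries its own rollback, a failed change leaves the system where it started, and the returned log gives the team something to review and learn from afterwards.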
Beyond these design principles, one must understand four main areas that are necessary to achieve operational excellence in the cloud. These areas are:
1. Organization: The organizational priorities and culture, and operating model must be structured in such a way to:
To support your organization on AWS, use tools and services such as AWS Trusted Advisor, AWS Well-Architected Tool, AWS Organizations, AWS Control Tower, AWS Support Center and the AWS Managed Services and Providers.
2. Prepare: Your workload must be designed to give your teams the information they need to understand its internal state, so that issues can be observed and investigated. All components of the workload must provide the metrics, logs, events and traces necessary to monitor system health, identify risks and respond effectively to events. Design your workload so that changes flow easily into production and so that refactoring and issue fixing are effective. Fast feedback on the quality of changes is of utmost importance, because it enables quick recovery when a change does not produce the desired outcome.
To know that you are prepared to support a workload, you must be able to evaluate processes, procedures and personnel to reveal operational risks to the workload.
3. Operate: To achieve business goals and outcomes, one must be able to operate the workload successfully. To measure that success, one must define the expected outcomes and how they are measured. The status of workloads can be communicated via dashboards and notifications customized for the right audience, so that the right actions and responses can be carried out.
4. Evolve: Evolving your operations implies learning, sharing and continuously improving towards operational excellence. The factors that contributed to issues, the actions performed to resolve them and the future actions that will prevent them from repeating must be documented and shared across teams to improve operations. Provide a means to encourage feedback and to identify areas that need improvement in the execution of operations.
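The Prepare and Operate areas can be sketched in a few lines: components emit structured records, and the latest values are compared with expected outcomes that were defined up front. This is only an illustration under my own assumptions. The `make_event`, `Target` and `evaluate` names, the metric names and the thresholds are hypothetical, and on AWS a monitoring service would normally collect and evaluate the metrics for you.

```python
import json
import time
from dataclasses import dataclass
from typing import Dict, List


def make_event(component: str, metric: str, value: float) -> str:
    """Serialize one telemetry record as a JSON line (hypothetical format)."""
    return json.dumps({
        "ts": time.time(),
        "component": component,
        "metric": metric,
        "value": value,
    })


@dataclass
class Target:
    """An expected outcome: the metric should stay at or below max_value."""
    metric: str
    max_value: float


def evaluate(events: List[str], targets: List[Target]) -> Dict[str, str]:
    """Compare the latest value of each metric with its target."""
    latest: Dict[str, float] = {}
    for line in events:
        record = json.loads(line)
        latest[record["metric"]] = record["value"]  # later records win

    status: Dict[str, str] = {}
    for target in targets:
        if target.metric not in latest:
            status[target.metric] = "NO_DATA"
        elif latest[target.metric] <= target.max_value:
            status[target.metric] = "OK"
        else:
            status[target.metric] = "ALERT"
    return status


# Example records from two made-up components.
events = [
    make_event("checkout", "p95_latency_ms", 180.0),
    make_event("checkout", "error_rate", 0.002),
    make_event("checkout", "p95_latency_ms", 420.0),
]

# Made-up expected outcomes for the workload.
targets = [
    Target("p95_latency_ms", 300.0),
    Target("error_rate", 0.01),
    Target("queue_depth", 100.0),
]

status = evaluate(events, targets)
print(status)
```

The resulting status dictionary is the kind of summary a dashboard or notification could present to its audience, and a "NO_DATA" entry points at a component that is not yet emitting the telemetry the team needs.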
I hope this article has given you a clear summary of how to achieve operational excellence on AWS. If you have any questions or additions, please leave them in the comments below.