Designing resilient software architecture


Written by Anders Marzi Tornblad

This is part 2 of the Getting into software architecture series. If you haven't read the first part, here it is: A primer for emerging software architects

As we navigate the complexities of software architecture, three key aspects continually surface: system reliability, maintainability, and scalability. These are vital for the long-term success of any software system. But how do we architect systems so that each of these aspects is properly addressed? In this article, we look more closely at each of them, understand why they matter, and explore how to shape our software architecture around them.

System reliability

System reliability is an important quality of robust software architecture. It's the measure of the system's ability to perform its intended function, consistently, under both normal and unexpected conditions. The challenge of maintaining high reliability becomes particularly pronounced when dealing with distributed systems like microservices architectures. In such scenarios, practices like distributed logging, effective exception handling, data replication, planned redundancy, and adept failover procedures become crucial.

Distributed logging

A reliable system requires a comprehensive understanding of its operational status, performance issues, and potential errors. This is where distributed logging comes in. Given the distributed nature of modern software systems, it's important to have a centralized log management system that aggregates logs from all services for easier monitoring and debugging.

Distributed logging provides visibility into the behavior of your applications and aids in debugging problems that could affect system reliability. For the closely related practice of following a single request as it passes through several services, the article What is distributed tracing? at the Splunk blog is an insightful source.
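
To make the idea of aggregated logs more concrete, here is a minimal sketch using Python's standard logging module. It assumes that every service writes one JSON object per line and tags each entry with a correlation ID, so that entries belonging to the same request can be found again after aggregation. The names make_logger, log_event, and records_for are hypothetical helpers invented for this illustration, and the in-memory central_sink merely stands in for a real log aggregation backend.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        })


def make_logger(service, stream):
    """Create a logger for one service that writes JSON lines to `stream`."""
    logger = logging.getLogger(f"demo.{service}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def log_event(logger, service, correlation_id, message):
    """Log a message tagged with the service name and the request's correlation ID."""
    logger.info(message, extra={"service": service, "correlation_id": correlation_id})


def records_for(log_text, correlation_id):
    """Return all aggregated entries for one request, in the order they were written."""
    records = [json.loads(line) for line in log_text.splitlines()]
    return [r for r in records if r["correlation_id"] == correlation_id]


# Assumption: one shared in-memory stream plays the role of the central log store.
central_sink = io.StringIO()
orders = make_logger("orders", central_sink)
payments = make_logger("payments", central_sink)

log_event(orders, "orders", "req-1", "order received")
log_event(payments, "payments", "req-1", "card charged")
log_event(orders, "orders", "req-2", "order received")

request_trail = records_for(central_sink.getvalue(), "req-1")
```

Here request_trail holds the two entries for request req-1, one from each service. In a real deployment the filtering would typically happen in the log management system's query language rather than in application code.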

Exception handling

Proper exception handling is fundamental for maintaining system reliability. Without it, unexpected errors can cause cascading failures across the system. For distributed systems, exception handling becomes even more critical. It's important to capture and handle exceptions appropriately, ensuring they don't lead to system-wide crashes or instability.
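
One possible way to keep a failing dependency from taking down its callers is to retry a few times and then degrade gracefully. The sketch below assumes a hypothetical ServiceUnavailable exception raised by some remote client; call_with_fallback is likewise an illustrative helper, not part of any library. Only the expected failure type is caught, so programming errors still surface instead of being silently swallowed.

```python
import logging

logger = logging.getLogger("demo.exception_handling")


class ServiceUnavailable(Exception):
    """Hypothetical error raised when a remote dependency cannot be reached."""


def call_with_fallback(operation, fallback, retries=2):
    """Run `operation`; on ServiceUnavailable retry, then return `fallback`.

    The operation is attempted at most `retries + 1` times. Any other
    exception type propagates to the caller unchanged.
    """
    for attempt in range(1, retries + 2):
        try:
            return operation()
        except ServiceUnavailable as exc:
            logger.warning("attempt %d failed: %s", attempt, exc)
    return fallback


def fetch_recommendations():
    """Stand-in for a call to a recommendation service that is currently down."""
    raise ServiceUnavailable("recommendation service timed out")


# The page can still render, just without recommendations.
recommendations = call_with_fallback(fetch_recommendations, fallback=[])
```

A production version would usually add a delay between attempts and a limit on how long the caller is willing to wait; both are left out here to keep the sketch short.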

Data replication

Data replication is another key component of system reliability in distributed systems. It involves storing redundant copies of data across different nodes, ensuring that the system remains operational even in the face of individual node failures. Moreover, data replication allows for load balancing of read requests, enhancing system performance.
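
As a toy illustration of both points, the following sketch keeps full copies of the data on several in-memory replicas and spreads reads across whichever replicas are alive. ReplicatedStore is a hypothetical class written for this article, and it assumes the simplest possible scheme: every write is copied synchronously to every live replica.

```python
import itertools


class ReplicatedStore:
    """Toy key-value store that keeps a full copy of the data on every replica."""

    def __init__(self, replica_count=3):
        self.replicas = [dict() for _ in range(replica_count)]
        self.alive = [True] * replica_count
        self._next_index = itertools.cycle(range(replica_count))

    def write(self, key, value):
        """Copy the write to every live replica."""
        for index, replica in enumerate(self.replicas):
            if self.alive[index]:
                replica[key] = value

    def read(self, key):
        """Serve the read from the next live replica, round-robin.

        Returns a (replica_index, value) pair so the caller can see
        which replica answered.
        """
        for _ in range(len(self.replicas)):
            index = next(self._next_index)
            if self.alive[index]:
                return index, self.replicas[index][key]
        raise RuntimeError("no live replica available")


store = ReplicatedStore(replica_count=3)
store.write("cart:42", ["book"])

store.alive[0] = False  # simulate the failure of one node
served_by = [store.read("cart:42")[0] for _ in range(4)]
```

After replica 0 fails, the reads are served alternately by replicas 1 and 2. Note what the sketch leaves out: a replica that was down during a write would return stale or missing data once it comes back, so real systems need a catch-up mechanism and a policy for conflicting writes.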

Redundancy and failover procedures

Planning for redundancy is a strategy to ensure high reliability and availability in distributed systems. Redundancy refers to the duplication of critical components of the system to increase its reliability. These can include servers, databases, or network connections.

Redundancy goes hand in hand with failover procedures, which are pre-planned strategies to smoothly switch over to a redundant or standby system component when the primary one fails. A well-designed failover procedure minimizes the downtime and prevents data loss, ensuring a high level of system reliability.
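
A failover procedure can be sketched as an ordered list of endpoints that are tried in turn. The example below is a simplified illustration under stated assumptions: EndpointDown and call_with_failover are hypothetical names, and each endpoint is modelled as a plain function rather than a real network client.

```python
class EndpointDown(Exception):
    """Hypothetical error raised when an endpoint cannot serve a request."""


def call_with_failover(endpoints, request):
    """Try each (name, handler) pair in priority order.

    Returns (name, response) from the first endpoint that succeeds, and
    raises RuntimeError if every endpoint is down.
    """
    errors = []
    for name, handler in endpoints:
        try:
            return name, handler(request)
        except EndpointDown as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all endpoints failed: " + "; ".join(errors))


def primary(request):
    """Stand-in for a primary server that has just failed."""
    raise EndpointDown("connection refused")


def standby(request):
    """Stand-in for a healthy standby server."""
    return f"handled {request}"


served_by, response = call_with_failover(
    [("primary", primary), ("standby", standby)],
    "GET /health",
)
```

Here the request ends up being served by the standby. Real failover mechanisms usually combine this kind of switching with health checks, so that a known-bad primary is skipped up front instead of being retried on every request.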

For those interested in delving deeper, Designing Distributed Systems: Patterns and Paradigms for Scalable, Reliable Services by Brendan Burns is a remarkable resource.

Maintainability


Maintainability, in the software architecture realm, is a measure of how easy it is to support, enhance, modify, and understand a system. A highly maintainable system ensures speedy resolution of bugs, facilitates smooth upgrades, and allows for seamless integration of new features. But to reach this level of maintainability, there's a secret weapon we must employ: simplicity.

Simplicity in software architecture stems from keeping systems as straightforward and uncomplicated as possible. This is where the KISS principle (Keep It Simple, Stupid) comes into play. It's a design principle that encourages simplicity and discourages unnecessary complexity. By following the KISS principle, we can produce code that is easier to read and modify, which in turn makes the system more maintainable, flexible, and adaptable to change.

A related principle is YAGNI (You Ain't Gonna Need It). This agile programming maxim asserts that programmers should not add functionality until it is actually needed. It can be tempting to build features or complex systems thinking that we might need them in the future. However, this often leads to unnecessary complexity and can adversely impact system maintainability. We should instead focus on what the system needs now and adapt as requirements evolve. This makes the system lighter, easier to understand, and consequently easier to maintain.

Achieving maintainability

Firstly, a modular design is crucial. By breaking down a system into smaller, self-contained modules, we create components that can be understood, modified, and tested independently. This not only enhances maintainability but also promotes simplicity, as each module has a single responsibility.

Adherence to the SOLID principles is another crucial factor. SOLID is an acronym for five design principles: single responsibility, open-closed, Liskov substitution, interface segregation, and dependency inversion. When applied together, they make it more likely that a system will be easy to maintain and manage.
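
As one small illustration, the sketch below shows dependency inversion, the "D" in SOLID, in Python. All names in it (Notifier, InMemoryNotifier, OrderService) are hypothetical and chosen only for this example. The order logic depends on an abstraction rather than on a concrete delivery channel, so the channel can be swapped or faked in tests without touching the order code.

```python
from typing import Protocol


class Notifier(Protocol):
    """Abstraction that the order logic depends on."""

    def send(self, recipient: str, text: str) -> None: ...


class InMemoryNotifier:
    """Concrete notifier that only records messages; handy for tests."""

    def __init__(self):
        self.outbox = []

    def send(self, recipient, text):
        self.outbox.append((recipient, text))


class OrderService:
    """Knows that a notification must be sent, but not how it is delivered."""

    def __init__(self, notifier: Notifier):
        self.notifier = notifier

    def place_order(self, customer, item):
        self.notifier.send(customer, f"Order confirmed: {item}")
        return f"{customer}:{item}"


notifier = InMemoryNotifier()
service = OrderService(notifier)
order_id = service.place_order("ada@example.com", "keyboard")
```

An email or SMS notifier with the same send method could be passed to OrderService instead, and the order logic would not need to change.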

Efficient documentation is another critical aspect of maintainability. Clear, concise, and up-to-date documentation helps any new team member or even future you understand the system faster and contributes to the system's overall simplicity.

One should also consider practices such as code refactoring, which is the process of restructuring existing code without changing its external behavior. It's a way of keeping the codebase clean, simple, and understandable.
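
A small before-and-after example may help. The function names, the price_cents field, and the discount rule below are all invented for illustration. Both versions return the same totals; the refactored one replaces nested conditions and unexplained numbers with named helpers and constants.

```python
# Before: nested conditions and unexplained numbers.
def total_before(items):
    t = 0
    for i in items:
        if i["qty"] > 0:
            if i["price_cents"] > 0:
                t = t + i["qty"] * i["price_cents"]
    if t > 10000:
        t = t - t * 10 // 100
    return t


# After: same external behavior, expressed through named pieces.
BULK_THRESHOLD_CENTS = 10_000
BULK_DISCOUNT_PERCENT = 10


def is_billable(item):
    """An item counts only if it has a positive quantity and price."""
    return item["qty"] > 0 and item["price_cents"] > 0


def line_total(item):
    return item["qty"] * item["price_cents"]


def apply_bulk_discount(subtotal):
    """Take the bulk discount off subtotals above the threshold."""
    if subtotal > BULK_THRESHOLD_CENTS:
        return subtotal - subtotal * BULK_DISCOUNT_PERCENT // 100
    return subtotal


def total_after(items):
    subtotal = sum(line_total(item) for item in items if is_billable(item))
    return apply_bulk_discount(subtotal)


cart = [
    {"qty": 2, "price_cents": 3000},
    {"qty": 1, "price_cents": 5000},
    {"qty": 0, "price_cents": 999},
]
```

Calling total_before(cart) and total_after(cart) gives the same amount. Checking that kind of equivalence with automated tests is what makes refactoring safe to do regularly.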

For those keen on expanding their knowledge on maintainable and clean code, Clean Code: A Handbook of Agile Software Craftsmanship by Robert C. Martin is an invaluable resource. It offers numerous techniques and practices to keep your codebase simple, clean, and, most importantly, maintainable.

Scalability


Scalability refers to the capability of a system to handle increasing workloads without compromising performance or effectiveness. A scalable system can grow over time, adapting to the surge in users, data volume, or transaction load. It's a vital attribute for software architects to consider, especially given the unpredictability of user growth and data explosion. To grasp scalability better, we will delve into key areas such as load balancing, data partitioning, and distributed systems.

Load balancing

Load balancing is a method of distributing workloads across multiple computing resources, thereby improving responsiveness and availability of applications. It is crucial for ensuring a scalable, highly available system. Load balancers can help distribute network traffic to multiple servers, minimizing the risk of overloading a single server, which could lead to service degradation or outage. They can also help in failover, automatically redirecting traffic to available servers if one goes down.

There are different types of load balancing algorithms: round-robin, least connections, and IP hash, to name a few. The choice depends on your specific system needs. It is also crucial to monitor the load balancer's performance continuously to ensure that it effectively serves its purpose.
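
The three algorithms named above can each be sketched in a few lines. The classes below are simplified illustrations written for this article (RoundRobin, LeastConnections, and IpHash are hypothetical names), and they assume a fixed list of healthy servers. Real load balancers add health checks, weights, and connection tracking.

```python
import hashlib
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotation."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self, client_ip=None):
        return next(self._cycle)


class LeastConnections:
    """Pick the server that currently has the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self, client_ip=None):
        # Ties go to the server listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Call when a connection to `server` has finished."""
        self.active[server] -= 1


class IpHash:
    """Map each client IP to the same server every time."""

    def __init__(self, servers):
        self.servers = list(servers)

    def pick(self, client_ip):
        digest = hashlib.sha256(client_ip.encode()).digest()
        return self.servers[int.from_bytes(digest[:8], "big") % len(self.servers)]


servers = ["app-1", "app-2", "app-3"]

rotation = RoundRobin(servers)
first_four = [rotation.pick() for _ in range(4)]

balancer = LeastConnections(servers)
for _ in range(3):
    balancer.pick()            # every server now has one active connection
balancer.release("app-2")      # app-2 finishes its request
next_server = balancer.pick()  # so app-2 is the least loaded
```

Round-robin is the simplest choice when requests cost roughly the same, least connections adapts better when some requests run much longer than others, and IP hash is useful when a client should keep reaching the same server, for example because of server-side session state.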

Data partitioning

As systems scale, managing the growing volume of data can become quite challenging. This is where data partitioning comes into play. Partitioning divides the data into smaller, more manageable pieces; when each piece is stored on a separate database server, the technique is commonly called sharding. This improves performance by reducing the amount of data that any one server must search through when answering a query.

Data partitioning can be performed in several ways: range partitioning, list partitioning, hash partitioning, and composite partitioning. The choice again depends on the specifics of your system.
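
To show how a key is mapped to a partition, here is a minimal sketch of two of these strategies, hash partitioning and range partitioning. The function names and the example boundaries are assumptions made for illustration only.

```python
import bisect
import hashlib


def hash_partition(key, partition_count):
    """Spread keys evenly by hashing them; returns an index in [0, partition_count)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def range_partition(key, upper_bounds):
    """Place a key by sorted, exclusive upper bounds.

    With bounds ["h", "q"], keys below "h" go to partition 0, keys from "h"
    up to (but not including) "q" go to partition 1, and all remaining keys
    go to partition 2.
    """
    return bisect.bisect_right(upper_bounds, key)


bounds = ["h", "q"]
placement = {name: range_partition(name, bounds) for name in ["alice", "maria", "zoe"]}
# placement == {"alice": 0, "maria": 1, "zoe": 2}
```

Range partitioning keeps neighbouring keys together, which suits range queries but can create hot spots if the keys are unevenly distributed. Hash partitioning spreads keys more evenly at the cost of scattering ranges. Note also that the simple modulo scheme shown here moves most keys when partition_count changes, which is why larger systems often use consistent hashing or a fixed number of logical partitions instead.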

Distributed systems

Distributed systems involve multiple independent computers (nodes) communicating with each other over a network to accomplish a common objective. This architecture offers an excellent way to scale out systems, allowing them to handle increased loads effectively.

Distributed systems can manage bigger workloads, provide redundancy, and ensure high availability. However, they also bring complexity in terms of data consistency, fault tolerance, and system orchestration. Familiarity with concepts like the CAP theorem, consensus algorithms, and distributed databases becomes indispensable when working with such systems.

Some examples of software designed to run as a distributed system are Apache Kafka (an event streaming platform), RabbitMQ (a message broker), Redis (an in-memory data store, often used as a cache, that can be clustered across multiple nodes), and Elasticsearch (a widely used distributed search and analytics engine).

A seminal reference book on this topic is Designing Data-Intensive Applications by Martin Kleppmann, which is highly recommended for those who want to delve deeper into these concepts.

Identifying architectural drivers

Decoding architectural drivers is similar to solving a complex puzzle – it demands a deep understanding of functional and non-functional requirements, quality attributes, and constraints. These drivers form the backbone of our architectural decisions and influence the direction of the software system's architecture.

Functional and non-functional requirements

Functional requirements define the fundamental actions that a system must perform. They form the basic functionality of the system: the 'what' part. For instance, if we were building an e-commerce platform, a functional requirement might be that the platform should allow users to add products to their shopping cart.

Non-functional requirements, on the other hand, describe how well the system does what it does: the 'how' part. These requirements often pertain to the system's performance, security, usability, and reliability. In the same e-commerce platform scenario, a non-functional requirement could be that the system should be able to handle a certain number of concurrent users without performance degradation.

Recognizing quality attributes and constraints

Quality attributes define the characteristics of the system that give it its personality. They represent the 'ilities' of the system, such as scalability, maintainability, reliability, usability, and so on. Quality attributes are an extension of non-functional requirements, and they tend to have a significant impact on the software architecture.

Constraints, meanwhile, are the limiting factors that architects must operate within. They could be technological constraints (e.g., the requirement to use a specific programming language), organizational (e.g., budget or personnel constraints), or business-related (e.g., regulatory or market-related constraints).

Deciding on the most important drivers

Not all architectural drivers carry equal weight, and as an architect, you need to determine which drivers are most crucial to your system's success. These primary drivers will heavily influence the architectural decisions you make.

To identify these critical drivers, start by classifying your functional and non-functional requirements, quality attributes, and constraints in terms of their importance to the stakeholders and their impact on the system. This classification process often involves discussions and negotiations with stakeholders to align their desires with the technical realities.

Once you have this classification, the drivers that are both high-impact and high-importance emerge as the primary architectural drivers. These are the elements that should receive the most focus during the architectural design process, and they will often influence the trade-offs you need to make.
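
The classification can be as simple as a two-column scoring table. The sketch below assumes a 1 to 5 scale for both stakeholder importance and architectural impact and a cut-off of 4; the scale, the threshold, and the example drivers and their scores are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Driver:
    name: str
    importance: int  # importance to stakeholders, 1 (low) to 5 (high)
    impact: int      # impact on the architecture, 1 (low) to 5 (high)


def primary_drivers(drivers, threshold=4):
    """Return the drivers that score at least `threshold` on both scales, strongest first."""
    selected = [
        d for d in drivers
        if d.importance >= threshold and d.impact >= threshold
    ]
    return sorted(selected, key=lambda d: d.importance * d.impact, reverse=True)


candidates = [
    Driver("Add products to shopping cart", importance=5, impact=2),
    Driver("Handle peak concurrent users", importance=5, impact=5),
    Driver("Must use the existing Java stack", importance=3, impact=4),
    Driver("Regulatory compliance for customer data", importance=5, impact=4),
]

top = [driver.name for driver in primary_drivers(candidates)]
```

With these example scores, the concurrency requirement and regulatory compliance come out as the primary drivers. The numbers themselves matter less than the conversation with stakeholders that produces them.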

Remember, while the technical aspects of software architecture are critical, equally important is the understanding of business needs and stakeholder expectations. After all, software architecture is not just about building software; it's about building solutions that meet the needs of the organization and its users.


In summary, to ensure system reliability, especially in a distributed context, architects must master many different tools and techniques, such as distributed logging, robust exception handling, data replication, and planned redundancy, coupled with efficient failover procedures. These are not just nice-to-have attributes but fundamental requirements for any reliable system.
