How can resilience be improved with ExpressRoute in Azure connectivity – Part 2?

This is the second part of my blog post series on ExpressRoute resilience. The first part covered the basics, and this part builds on them. Please visit the first part of the series to read about the Azure ExpressRoute foundation.

The foundation of a well-structured system relies on network connections that are highly reliable, resilient, and available. Reliability, the core principle, is based on two essential components: resiliency and availability. Resiliency aims to prevent failures and, when they do occur, to restore applications to full operational status. Availability focuses on ensuring consistent access to applications or workloads. Reliability therefore has to be planned proactively, based on the specific business needs and application requirements.

ExpressRoute is designed for high availability and provides carrier-grade private network connectivity to Microsoft resources. There is no single point of failure in the ExpressRoute path within the Microsoft network. To maximize availability end to end, the customer and service provider segments of the ExpressRoute circuit must also be designed for high availability.

High availability requires redundancy across the entire network path of the ExpressRoute circuit. This means maintaining redundancy within the on-premises network and not compromising redundancy within the service provider network. At a minimum, preserving redundancy means avoiding single points of network failure. Redundant power and cooling for network devices further improve high availability.
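
To see why redundancy has to cover the whole path, here is a minimal Python sketch of how availability composes. It assumes component failures are independent, and the availability figures are hypothetical values chosen only for illustration; they are not published SLAs for ExpressRoute or for any provider.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Every component must be up, so the availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """At least one component must be up, so the unavailabilities multiply."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical figures for illustration only (assumption, not an SLA).
ROUTER = 0.999     # one on-premises edge router
LAST_MILE = 0.999  # one service provider link

# One router and one link: no redundancy anywhere.
single_path = series(ROUTER, LAST_MILE)

# Two links that terminate on the same router: the router stays a
# single point of failure and caps the result below ROUTER.
shared_router = series(ROUTER, parallel(LAST_MILE, LAST_MILE))

# Two fully separate router + link paths.
fully_redundant = parallel(single_path, single_path)

print(f"single path:     {single_path:.6f}")
print(f"shared router:   {shared_router:.6f}")
print(f"fully redundant: {fully_redundant:.6f}")
```

In this simplified model, doubling the links without doubling the router barely helps, which is the reason for keeping redundancy intact in every segment.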

  • Each ExpressRoute circuit has a primary and a secondary connection. If both connections terminate on the same customer network device, the high availability of the customer’s on-premises network is compromised.
  • Furthermore, configuring both the primary and secondary connections on the same port of a customer network device forces the partner to compromise high availability on their network segment.
  • On the other hand, terminating the primary and secondary connections of an ExpressRoute circuit in different geographical locations could compromise the network performance of the connectivity.
  • The Microsoft network is set to use the primary and secondary connections of ExpressRoute circuits simultaneously. However, you can control the route advertisements to make the redundant connections of an ExpressRoute circuit work in a backup mode.
  • Common techniques for making one path preferred are advertising more specific routes and BGP AS path prepending.
  • To improve high availability, it’s best to use both connections of an ExpressRoute circuit simultaneously. When both connections are active, the Microsoft network balances the traffic across them on a per-flow basis.
  • If the primary and secondary connections of an ExpressRoute circuit are run in backup (active-passive) mode, there’s a risk that both connections fail following a failure in the active path. Common causes are a lack of active management of the passive connection and the passive connection advertising stale routes.
  • To achieve the highest resiliency and availability, it’s recommended that a zone-redundant ExpressRoute virtual network gateway be configured. This deployment physically and logically separates gateways within a region across availability zones, safeguarding your on-premises network connectivity to Azure from zone-level failures.
  • Zone-redundant virtual network gateways can be deployed across availability zones. By utilising zone-redundant gateways, you can leverage zone resiliency to access your mission-critical, scalable services on Azure.
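
The route-advertisement points above (active-active versus backup mode, more specific routes, AS path prepending) can be illustrated with a small Python sketch. It is a deliberately simplified model of path selection that only compares prefix length and then AS path length; the real BGP decision process has more steps. The `Route` class, the `preferred_connections` helper, the prefixes, and the private AS number 65010 are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Route:
    connection: str  # "primary" or "secondary" (hypothetical labels)
    prefix: str      # on-premises prefix advertised towards Microsoft
    as_path: tuple   # AS numbers as seen by the receiving side


def preferred_connections(routes, destination):
    """Return the connection(s) chosen for a destination.

    Simplified rules: longest prefix match wins, then the shortest
    AS path. Connections that still tie are all returned.
    """
    dest = ip_address(destination)
    matching = [r for r in routes if dest in ip_network(r.prefix)]
    if not matching:
        return []
    longest = max(ip_network(r.prefix).prefixlen for r in matching)
    candidates = [r for r in matching
                  if ip_network(r.prefix).prefixlen == longest]
    shortest = min(len(r.as_path) for r in candidates)
    return sorted({r.connection for r in candidates
                   if len(r.as_path) == shortest})


AS = 65010  # hypothetical private AS number

# Identical advertisements on both connections: both are used.
active_active = [
    Route("primary", "10.1.0.0/16", (AS,)),
    Route("secondary", "10.1.0.0/16", (AS,)),
]

# The secondary prepends its own AS twice: it becomes the backup.
prepended = [
    Route("primary", "10.1.0.0/16", (AS,)),
    Route("secondary", "10.1.0.0/16", (AS, AS, AS)),
]

# The primary advertises two more specific /17s covering the /16.
more_specific = [
    Route("primary", "10.1.0.0/17", (AS,)),
    Route("primary", "10.1.128.0/17", (AS,)),
    Route("secondary", "10.1.0.0/16", (AS,)),
]

for name, table in [("active-active", active_active),
                    ("prepended", prepended),
                    ("more specific", more_specific)]:
    print(name, preferred_connections(table, "10.1.200.5"))
```

In the two backup-mode tables the secondary carries traffic only once the primary routes are withdrawn, which is why a passive path that is not actively monitored can hide stale routes until the moment it is needed.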

Three different ExpressRoute resiliency architectures can be used to ensure high availability and resiliency in your network connections between on-premises and Azure. These architectural designs are Standard, High, and Maximum resiliency.

ExpressRoute’s standard resiliency is a single circuit with two connections configured at a single site. Built-in redundancy (Active-Active) facilitates failover across the circuit’s two connections.

Standard resiliency (Source: Microsoft)

A high-level overview of the standard resiliency ExpressRoute architecture:

  • Single circuit with two connections at a single site, with built-in redundancy for failover across the two connections.
  • The setup lacks site resiliency, so a site failure can cause connectivity issues.
  • Not recommended for business or mission-critical workloads.
  • Low cost.

High resiliency, also known as multi-site or site resiliency, allows multiple sites within the same metropolitan area to connect your on-premises network to Azure through ExpressRoute. It provides site diversity by splitting a single circuit across two sites.

High resiliency (Source: Microsoft)

A high-level overview of the high-resiliency ExpressRoute architecture:

  • Single circuit with two connections split across two sites in the same metropolitan area, with built-in redundancy for failover across the two connections.
  • The setup provides a higher level of site resiliency than standard resiliency architecture.
  • Mitigating edge-site isolation and failures.
  • Achieving site diversity within a metropolitan city.
  • Providing resiliency to failures between edge and region.
  • Applicable to business and mission-critical workloads within a region.
  • Medium cost.

Circuits configured for maximum resiliency provide both site (peering location) and intra-site redundancy. After deploying multi-site redundant ExpressRoute circuits, it is essential to ensure that on-premises routes are advertised over all of the redundant circuits so that the benefits of multi-site redundancy are fully realized.
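
One way to check that on-premises routes really are advertised over every redundant circuit is to compare the prefix sets seen on each of them. The following Python sketch assumes you have already exported the advertised prefixes per circuit by whatever means you use; the `missing_advertisements` helper, the circuit names, and the prefixes are hypothetical. It compares prefixes for exact equality only and does not recognize a covering summary route as equivalent.

```python
from ipaddress import ip_network


def missing_advertisements(advertised_by_circuit):
    """Map each circuit to the prefixes other circuits advertise but it does not."""
    normalised = {
        circuit: {ip_network(prefix) for prefix in prefixes}
        for circuit, prefixes in advertised_by_circuit.items()
    }
    # Union of every prefix seen on any circuit.
    everything = set().union(*normalised.values())
    gaps = {}
    for circuit, own in normalised.items():
        missing = everything - own
        if missing:
            gaps[circuit] = sorted(str(prefix) for prefix in missing)
    return gaps


# Hypothetical export: site B is missing one on-premises prefix.
advertised = {
    "circuit-site-a": ["10.1.0.0/16", "10.2.0.0/16"],
    "circuit-site-b": ["10.1.0.0/16"],
}

print(missing_advertisements(advertised))
```

An empty result means every circuit advertises the same set of prefixes, so a site failure would not leave any on-premises network unreachable from Azure.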

Maximum resiliency (Source: Microsoft)

A high-level overview of the maximum resiliency ExpressRoute architecture:

  • Configuring a pair of circuits across two locations for site diversity.
  • The setup provides a higher level of site resiliency than high-resiliency architecture.
  • Eliminating single points of failure within the Microsoft network path.
  • Improving reliability, resiliency, and availability.
  • Ensuring the highest level of resilience for business and mission-critical workloads.
  • Providing resiliency to failures between edge and region.
  • Recommended architecture in the reliability pillar of the Well-Architected Framework.
  • Higher cost.
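
The difference between the three architectures can be sketched as a small failure model in Python. Each architecture is reduced to a list of paths, and each path to the components it depends on. The site and link names are hypothetical labels, and the model ignores everything else in the path (on-premises devices, gateways, the Microsoft backbone), so treat it as an illustration of the comparison above and not as a sizing tool.

```python
# Each architecture is a list of paths; each path is the set of
# components it needs. All names are hypothetical labels.
ARCHITECTURES = {
    # One circuit, both connections at the same site.
    "standard": [{"site-A", "A-link-1"}, {"site-A", "A-link-2"}],
    # One circuit, its two connections split across two sites.
    "high": [{"site-A", "A-link-1"}, {"site-B", "B-link-1"}],
    # Two circuits at two sites, each with two connections.
    "maximum": [
        {"site-A", "A-link-1"}, {"site-A", "A-link-2"},
        {"site-B", "B-link-1"}, {"site-B", "B-link-2"},
    ],
}


def survives(paths, failed):
    """Connectivity survives if at least one path uses no failed component."""
    return any(path.isdisjoint(failed) for path in paths)


SCENARIOS = {
    "one link fails": {"A-link-1"},
    "one site fails": {"site-A"},
    "one site fails plus a link at the other site": {"site-A", "B-link-1"},
}

for scenario, failed in SCENARIOS.items():
    outcome = {name: survives(paths, failed)
               for name, paths in ARCHITECTURES.items()}
    print(f"{scenario}: {outcome}")
```

Under these assumptions, only the maximum resiliency layout keeps a working path when a site failure coincides with a link failure at the remaining site, which is what combining site redundancy with intra-site redundancy buys.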

ExpressRoute users depend on the availability and performance of edge sites, the WAN, and availability zones to sustain their connectivity to Azure. However, these components or sites may fail for various reasons, such as equipment malfunction, network disruptions, weather conditions, or natural disasters. Therefore, planning for reliability, resiliency, and availability is a shared responsibility between customers and their cloud providers.

Azure has introduced a guided portal experience to assist in configuring ExpressRoute circuits for maximum resiliency. The choice of a resiliency architecture for ExpressRoute depends on the specific needs of the customer and the business. It is essential to plan for failures that a single ExpressRoute circuit cannot absorb, keeping in mind the well-known adage: “If anything can go wrong, it will.”