How to route DNS traffic?

In this internet era, businesses are going global and using the internet as their growth engine. However, the internet was originally designed for connectivity, not for performance, and this gap has a direct impact on businesses. As the number of online users increases, the internet can become overwhelmed, slowing down or even interrupting online transactions. Such disruptions ruin the end-user experience – a key factor in whether a business succeeds. Hence, there is a need for systems that address growing end-user expectations.

To address this issue, traffic steering mechanisms were formulated. Traffic steering mechanisms are sets of traffic routing rules for serving responses to DNS queries. In other words, different endpoints may serve a DNS query depending on the adopted rule, ensuring better availability and scalability. DNS traffic steering mechanisms are simple yet powerful techniques that efficiently steer DNS traffic to address several routing needs: providing failover capabilities, load balancing traffic across multiple resources, and even accounting for the location where the query originated.

A. Static DNS Steering

Static steering mechanisms are fixed sets of rules that determine how DNS requests are served.

  1. Failover

Failover allows you to prioritize the order of endpoints to which you want to direct DNS traffic. For example, suppose you have primary and secondary servers located in two different regions. In a failover mechanism, all DNS traffic is, by default, directed to your primary server. If for some reason the primary server fails or becomes unresponsive, failover automatically steers the DNS traffic to your secondary (failover) server.
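The priority logic above can be sketched as a small Python helper. The addresses (from documentation ranges) and the health lookup are hypothetical stand-ins for a real health-monitoring feed:

```python
def resolve_failover(endpoints, is_healthy):
    """Return the highest-priority endpoint that passes the health check."""
    for ep in endpoints:
        if is_healthy(ep):
            return ep
    return None  # no healthy endpoint: caller may serve a last-resort record

# Hypothetical primary/secondary addresses (documentation IP ranges).
priority_order = ["198.51.100.10", "203.0.113.20"]
health = {"198.51.100.10": False, "203.0.113.20": True}  # primary is down
answer = resolve_failover(priority_order, lambda ep: health[ep])
```

Because the primary is marked unhealthy, the query is answered with the secondary address.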

  2. Load Balancer

A load balancer distributes DNS traffic across multiple endpoints. Endpoints can be assigned equal weights to distribute traffic evenly (round-robin), or custom weights for ratio load balancing (weighted round-robin). For example, 80% of the DNS traffic is directed to your on-premises server, while 20% is directed to a public cloud.
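One way to sketch weighted round-robin is to expand each endpoint by its integer weight and cycle through the result. The hostnames and the 4:1 (i.e., 80/20) split below are illustrative:

```python
from itertools import cycle

def weighted_rotation(endpoints):
    """Cycle through endpoints in proportion to their integer weights.

    Naive expansion: an endpoint with weight 4 appears 4 times per cycle.
    """
    expanded = [addr for addr, weight in endpoints for _ in range(weight)]
    return cycle(expanded)

# Hypothetical 80/20 split between an on-premises server and a cloud endpoint.
rotation = weighted_rotation([("onprem.example.com", 4), ("cloud.example.com", 1)])
first_five = [next(rotation) for _ in range(5)]
```

Each full pass through the rotation sends four answers to the on-premises server for every one sent to the cloud endpoint.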

  3. Geolocation Steering

Geolocation steering distributes DNS traffic to different endpoints based on the geographic location of the end-user (the source of the query). Customers can assign a specific endpoint (or set of endpoints) to serve end-users located in a particular geographic region (e.g., continents, countries, or states/provinces). For example, all requests from end-users originating in the US and Canada are directed to the North American server (or pool of servers), requests originating in Asia are directed to a Singapore server, and requests originating from the rest of the world are routed to the North American server as well.
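A minimal sketch of such a rule table, assuming the resolver can supply a continent code for the query source (the continent codes and hostnames here are hypothetical):

```python
# Hypothetical continent-code rules; anything unmatched falls back to North America.
GEO_RULES = {
    "NA": "na-pool.example.com",    # US/Canada -> North American pool
    "AS": "sg-server.example.com",  # Asia -> Singapore server
}
DEFAULT_ENDPOINT = "na-pool.example.com"  # rest of the world

def steer_by_geo(continent_code):
    """Return the endpoint assigned to the end-user's continent."""
    return GEO_RULES.get(continent_code, DEFAULT_ENDPOINT)
```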

  4. ASN Steering

ASN steering enables you to route DNS traffic based on Autonomous System Numbers (ASN). DNS queries originating from a specific ASN or set of ASNs can be steered to a specified endpoint.
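ASN steering can be sketched as a simple lookup table. The ASNs below come from the documentation range reserved by RFC 5398, and the hostnames are hypothetical:

```python
# Hypothetical ASN-to-endpoint rules (documentation-range ASNs, illustrative names).
ASN_RULES = {
    64500: "edge-a.example.net",
    64501: "edge-b.example.net",
}

def steer_by_asn(asn, default="origin.example.net"):
    """Return the endpoint assigned to the query's origin ASN."""
    return ASN_RULES.get(asn, default)
```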

  5. ISP Steering

ISP steering enables customers to direct DNS traffic based on the end-user's Internet Service Provider (ISP).

  6. IP Prefix Steering

IP prefix steering enables customers to direct DNS traffic based on the IP prefix of the originating query.
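IP prefix steering can be sketched with Python's `ipaddress` module, using longest-prefix matching so the most specific rule wins (the same tie-breaking rule routers use). The documentation-range prefixes and hostnames are illustrative:

```python
import ipaddress

# Hypothetical prefix rules using a documentation address range.
PREFIX_RULES = [
    (ipaddress.ip_network("203.0.113.0/24"), "apac.example.com"),
    (ipaddress.ip_network("203.0.113.0/26"), "apac-premium.example.com"),
]

def steer_by_prefix(source_ip, default="global.example.com"):
    """Route by the most specific (longest) matching prefix."""
    addr = ipaddress.ip_address(source_ip)
    matches = [(net, ep) for net, ep in PREFIX_RULES if addr in net]
    if not matches:
        return default
    return max(matches, key=lambda m: m[0].prefixlen)[1]
```

A source in the narrower /26 matches both rules, but the more specific one decides the answer.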

B. Performance-based DNS Steering

Most of today’s DNS traffic management is based on ratio distribution (load balancing) and the geographic proximity of the end-user to an endpoint. However, ratio distribution and geographic proximity (i.e., static DNS routing) alone do not necessarily correlate with better performance. In fact, performance metrics such as latency, availability, and throughput are not taken into account in static DNS routing. Given the complexity of the internet, static DNS routing cannot always provide the best routing decisions.

With the growing number of end-users and rising user expectations, performance is becoming more and more critical, and static DNS routing is no longer sufficient. While static traffic steering such as load balancing and geolocation may work most of the time, “most of the time” in the context of hundreds of millions of DNS queries can translate to hundreds of thousands of suboptimal transactions. From a business standpoint, ignoring performance factors degrades the overall user experience, which may lead to loss of revenue or, worse, business failure.

Hence, there is a need to use performance metrics when steering DNS traffic. Traffic routing that uses performance metrics is called a “performance-based steering mechanism”. Performance-based mechanisms address the dynamic and complex nature of the internet and ensure that DNS queries are consistently steered to the fastest and most reliable route, improving user experience and the prospects of business success.

There is a wide range of metrics that can be used in a performance-based traffic steering mechanism: availability, reliability, latency, throughput, packet loss, and jitter. Each of these metrics, or a combination of them, can be used as the basis for steering DNS traffic toward specific business goals.

  1. Availability

Availability, or uptime, refers to the amount of time a server or network is “up and running” over a specific period. Network availability is measured as a percentage and is monitored to ensure the service is consistently available to end-users. The percentage is usually expressed in “nines”. It is an important metric for end-users, serving as a measure of the health and performance of a network, which is why availability should be treated as a priority. As a standard, a healthy network is considered to have very high availability (uptime), with a value of more than 99% (more than “two nines”).
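The percentage calculation is straightforward; the 30-day window and downtime figure below are illustrative:

```python
def availability_pct(total_seconds, downtime_seconds):
    """Uptime as a percentage of the observation window."""
    uptime = total_seconds - downtime_seconds
    return 100.0 * uptime / total_seconds

# ~43 minutes of downtime in a 30-day month works out to 99.9% ("three nines").
month = 30 * 24 * 3600               # 2,592,000 s observation window
pct = availability_pct(month, 2592)  # 2,592 s ≈ 43.2 min of downtime
```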

  2. Reliability

Reliability is a measure of the likelihood that a server or system will experience failure (i.e., downtime) within a period of time. Reliability is usually expressed as mean time between failures (MTBF), which is the total time the server is up and running divided by the number of failures. Sometimes the failure rate is used instead, which is the reciprocal of MTBF (i.e., the number of failures divided by total uptime). Reliability essentially tracks how long a server or network infrastructure functions without interruption.
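The MTBF and failure-rate definitions above translate directly into code; the figures below are illustrative:

```python
def mtbf_hours(total_uptime_hours, failure_count):
    """Mean time between failures: total uptime divided by failure count."""
    return total_uptime_hours / failure_count

def failure_rate(total_uptime_hours, failure_count):
    """Reciprocal of MTBF: failures per hour of uptime."""
    return failure_count / total_uptime_hours

# Example: 4 failures over 8,760 hours of uptime (about a year).
mtbf = mtbf_hours(8760, 4)  # 2190.0 hours between failures on average
```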

  3. Latency

Latency measures the time it takes for data to travel from the source of a request to its destination across the network, and back again. It is measured as round-trip time (RTT), usually in milliseconds (ms). Generally, latency within a data centre is measured in microseconds, while latency between locations is measured in tens of milliseconds. Latency (together with throughput) essentially determines the speed of the network and hence has a direct impact on its performance. Longer distances directly correlate with higher RTT and, combined with low throughput, can result in network congestion. WAN acceleration and compression products sometimes help, but in some situations they increase latency instead. The surest way to reduce latency is to reduce the distance between communicating endpoints. Caching also works well, but mostly for static content.

  4. Throughput

Throughput is the rate of successful message or data delivery/processing over a communication channel. It is usually measured in bits per second (bit/s or bps), or sometimes in data packets per second (p/s or pps). Throughput is essentially synonymous with digital bandwidth consumption. High-bandwidth applications such as online gaming, video streaming, and large file transfers require high throughput.

  5. Packet Loss

Packet loss occurs when one or more data packets traveling across a computer network fail to reach their intended destination. Packet loss is caused by errors in data transmission or network congestion, and sometimes by outages resulting from human error, routing black holes, or even power and hardware failures. Packet loss is measured as a percentage of data packets lost with respect to total data packets sent. The Transmission Control Protocol (TCP) detects packet loss and performs retransmissions to ensure reliable messaging. However, if loss is too high, TCP can eventually run out of buffer space, affecting overall network stability. In real-time applications such as online games or media streaming, packet loss can seriously affect a user’s quality of experience (QoE).
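The percentage definition above is a one-liner; the packet counts are illustrative:

```python
def packet_loss_pct(sent, received):
    """Lost packets as a percentage of packets sent."""
    return 100.0 * (sent - received) / sent

loss = packet_loss_pct(1000, 990)  # 10 lost out of 1000 sent
```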

  6. Jitter

Jitter is the variation in the time between data packets arriving at their destination. These variations are usually caused by network congestion or route changes, which make a steady packet flow erratic and can cause some packets to be discarded. Jitter is typically measured in milliseconds (ms). It is a major problem for real-time communications such as VoIP and video streaming. Typically, received jitter above 20 ms increases latency and packet loss, causing quality degradation and reduced performance. Some network appliances are equipped with jitter buffers, but the best way to combat jitter is to select the most stable path between two endpoints.
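A simplified jitter calculation (loosely inspired by the interarrival-jitter idea in RFC 3550) averages the change between consecutive inter-arrival gaps; the arrival times below are illustrative:

```python
def mean_jitter_ms(arrival_times_ms):
    """Average absolute change between consecutive inter-arrival gaps."""
    gaps = [b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]
    deltas = [abs(b - a) for a, b in zip(gaps, gaps[1:])]
    return sum(deltas) / len(deltas)

# Packets arrive at 0, 20, 45, and 60 ms: gaps of 20, 25, and 15 ms.
jitter = mean_jitter_ms([0, 20, 45, 60])  # mean of |25-20| and |15-25| = 7.5 ms
```

A perfectly steady stream (constant gaps) yields zero jitter under this measure.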

C. Intelligent DNS Traffic Steering

Different web applications have different priorities and performance requirements. Depending on the type of application, businesses can use one metric or a combination of metrics in their routing decisions to address their specific needs. Because the internet is dynamic, the performance of each server and other components within the network changes over time, and it would be tedious to measure metrics and manually change traffic routes again and again. To address this, businesses can adopt intelligent DNS traffic steering.

Intelligent DNS traffic steering allows real-time monitoring of server and network performance metrics and automatically applies those metrics to routing decisions. In other words, when an end-user queries a particular website, intelligent DNS traffic steering dynamically updates the DNS route based on the best real-time performance data. Not only does this save businesses time, but more importantly, it offers the optimal DNS route automatically at any point in time, improving user experience.
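A minimal sketch of such a routing decision, assuming a monitoring pipeline supplies per-endpoint availability and latency (the endpoint names and figures are hypothetical):

```python
def best_endpoint(metrics):
    """Pick the lowest-latency endpoint among those currently available."""
    candidates = {ep: m for ep, m in metrics.items() if m["available"]}
    if not candidates:
        return None
    return min(candidates, key=lambda ep: candidates[ep]["latency_ms"])

# Hypothetical real-time metrics, e.g. fed by SUM/RUM monitoring.
live_metrics = {
    "us-east.example.com": {"available": True, "latency_ms": 42.0},
    "eu-west.example.com": {"available": True, "latency_ms": 18.0},
    "ap-south.example.com": {"available": False, "latency_ms": 9.0},
}
choice = best_endpoint(live_metrics)  # fastest *available* endpoint wins
```

Note that the nominally fastest endpoint is skipped because it is currently unavailable; availability gates the decision before latency is compared.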

Data Monitoring 

A very important component of intelligent DNS traffic steering is real-time data monitoring. There are two commonly used tools for real-time data monitoring: synthetic user monitoring (SUM) and real user monitoring (RUM).

Synthetic user monitoring (SUM) is an active real-time monitoring technique that uses scripts created to simulate an action or, in the case of DNS traffic, a path. These paths are then continuously monitored for performance at specified intervals. Because SUM is a simulation, it is often best used to monitor commonly trafficked paths.

Real user monitoring (RUM) is a passive real-time monitoring technique that monitors DNS traffic generated by real users. RUM uses a script called a beacon code, usually inserted into the web application, which collects performance data every time a real user visits the website. Since RUM is initiated by a real user, it can monitor the traffic path from the end-user’s location all the way to its destination.

Both SUM and RUM give different views of performance and are useful in different ways. SUM creates a consistent testing environment by eliminating user-related variables, while RUM captures performance across the diversity of end-users who actually visit a page and use the network. Both provide feedback about DNS traffic performance, but they differ in the amount of traffic they generate. SUM helps diagnose and solve short-term performance problems, while RUM helps with understanding long-term trends. Each method has its benefits and drawbacks; using the data from both provides better visibility and allows you to look into specific types of problems and how to resolve them.