This is the second in a series of posts about the 8 Fallacies of Distributed Computing, a set of common misconceptions or false assumptions that are often made when building distributed systems. Keeping these fallacies in mind when building your cloud services, as well as your front-end clients, can drastically improve your user experience and your developer experience.

You can read the rest of the series here

When a user or client interacts with our distributed system we often want them to feel as though they are interacting with it in “real-time”. When we’re developing applications locally on our machines, or running in the cloud from our fast office network (possibly even with a VPN connection) it often feels like everything is responding instantaneously. Then when we deploy the application we find things are not as speedy as we originally thought. Users complain about slow loading times, about videos taking ages to download, or constant timeouts trying to upload large files. Sometimes this is caused by building systems that don’t scale, or that haven’t been scaled correctly, but other times even though your system is running fine the user experience can still be terrible.

Every external request that your system makes adds latency. How much latency is added is dependent on the type of request. Making a request to disk is certainly faster than making a request to a web server, but if you have lots of requests to disk they will still add up. Fortunately for us engineers, the way humans perceive the world is not in “real-time”. In fact, it takes time for our brains to process the world around us. In 1968, Robert Miller published his paper on Response time in man-computer conversational transactions and identified three response-time thresholds, each an order of magnitude apart, that shape how a user experiences a system.

  1. If the system responds within 100ms, the user perceives it as instantaneous
  2. If the system responds in 1 second or less, the user typically feels they are interacting with the system freely
  3. If the system takes more than 10 seconds to respond, the user’s attention is completely lost

More recently, a study by Microsoft in 2015 determined that the average global attention span has decreased from 12 seconds to 8 seconds. Eight seconds seems like a long time, but if we remember that the network is not reliable, all it takes is a couple of failures, and subsequent retries, for a user to lose interest.

So with the clock ticking on our request, how do we ensure our users stay focussed?

Finding Bottlenecks

The first step in decreasing latency is to identify the bottlenecks. Every system is different, so you need real data to identify where the issues are, and which ones are the highest priority. When building our systems, we need to make sure they are observable using the Three Pillars of Observability (Metrics, Logs, and Traces). With a Distributed Tracing System, such as OpenTracing or AWS X-Ray, a subset of requests is “sampled”, showing the latency as each request moves through the system. Once we can see where the bottlenecks are, we can set about fixing them.
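As a rough illustration, instrumenting a single outbound call with the OpenTracing JavaScript API might look something like the sketch below; the tracer implementation (a Jaeger client, for example) would be registered at startup, and the internal service URL here is made up.

```typescript
import { globalTracer } from "opentracing"; // assuming the OpenTracing JS API; a concrete tracer
                                            // (e.g. a Jaeger client) would be registered at startup

// Wrap an outbound call in a span so the time it takes shows up in the trace.
async function getUserProfile(userId: string): Promise<unknown> {
  const span = globalTracer().startSpan("get_user_profile");
  span.setTag("user.id", userId);
  try {
    const res = await fetch(`https://users.internal/profiles/${userId}`); // hypothetical internal service
    return await res.json();
  } finally {
    span.finish(); // the span's duration is the latency we want to observe
  }
}
```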

Caching

One of the simplest ways to reduce latency is through caching. Caching can occur at multiple levels, but it can introduce its own challenges. One of the key things to work out with caching is how long data can be cached for, and how to invalidate that data in the cache prematurely if you need to. Most caching systems use a Time to Live (TTL) to define how long to store that data, but sometimes things change so we need to be able to delete items from the cache.
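As a sketch of the idea, a minimal in-memory cache with a TTL and explicit invalidation might look like this (illustrative TypeScript, not production code):

```typescript
// A minimal in-memory cache with a Time to Live and explicit invalidation.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(private ttlMs: number) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.entries.delete(key); // the TTL has passed, treat it as a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }

  invalidate(key: string): void {
    this.entries.delete(key); // delete early because the underlying data changed
  }
}

// Cache exchange rates for five minutes, but drop an entry as soon as we know it's stale.
const rates = new TtlCache<number>(5 * 60 * 1000);
rates.set("GBP->USD", 1.27);
rates.invalidate("GBP->USD");
```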

The first things you should cache are the shared assets distributed to your users: images, videos, static HTML/CSS/JS, or any other asset that is the same for every user who downloads it. This level of caching can be done in a Content Delivery Network (CDN) such as Amazon CloudFront. A CDN works by hosting the content closer to where your users are. That way, instead of having to go back to your datacenter to retrieve the data, the system only needs to go to the nearest Point of Presence (PoP), which is typically a much shorter and faster journey.

The next layer to consider is your dynamically generated content. Does every user receive unique, personalised content, or is it grouped by some sort of cohort, such as the subscription the user has or the location they are in? Do your users request the same data several times? When setting up your CDN you define a cache key, a set of parameters in the request that make the request unique, which allows you to cache dynamic content. Again, this allows the data to sit in the CDN and results in a much lower latency when fetching it.
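To make that concrete, assuming an Express-style service sitting behind a CDN, the static assets and the cohort-based dynamic responses might set their caching headers like this; the x-plan header and loadPricesFor lookup are made up for the example.

```typescript
import express from "express"; // assuming an Express-style service sitting behind a CDN

const app = express();

// Shared assets are identical for every user, so let the CDN and browser cache them for a long time.
app.use("/assets", express.static("public", { maxAge: "365d", immutable: true }));

// Dynamic content that only varies by cohort (here a made-up "x-plan" header) can still be
// cached at the edge for a short time; the CDN's cache key would include the plan so each
// cohort gets its own cached copy.
app.get("/pricing", (req, res) => {
  const plan = req.header("x-plan") ?? "free";
  res.set("Cache-Control", "public, s-maxage=300"); // let the CDN hold it for 5 minutes
  res.json({ plan, prices: loadPricesFor(plan) });
});

// Hypothetical lookup standing in for the real pricing logic.
function loadPricesFor(plan: string): number[] {
  return plan === "free" ? [0] : [9, 19, 49];
}

app.listen(3000);
```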

After we’ve cached everything we can at the edge, we start to look internally at requests to see if we can reduce the latency there. Calls to a database or to another service can be costly, as we’re not only introducing network latency, but also the latency of the database or other system processing our request. Again, we need to look at these requests and determine if they are dynamically generated, if the response changes quickly between requests, or if we can use stale data. If we’re working in a payment system and we need to know the user’s balance, then we need the latest data and shouldn’t be using caching, but if the data we’re retrieving only changes once per day then it’s a prime candidate.

To cache internal calls we have two options. The first is to introduce a component purely to cache data, such as Redis or Memcached (Amazon ElastiCache offers managed versions of both). Keep in mind that while this will undoubtedly reduce latency, it still requires a network call, which can fail, and the data may not be in the cache. It also requires you to build a way of populating the cache, which can be complicated.
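A cache-aside sketch, assuming the ioredis client, might look like the following. Note that both the cache read and the cache write are network calls that can fail, so the code always falls back to the source of truth.

```typescript
import Redis from "ioredis"; // assuming the ioredis client; any Redis client would do

const redis = new Redis(); // connects to localhost:6379 by default

// Cache-aside: try the cache, fall back to the real lookup, then populate the cache.
async function getDailyReport(date: string): Promise<string> {
  const key = `report:${date}`;

  try {
    const cached = await redis.get(key);
    if (cached !== null) return cached; // cache hit: no call to the slow backend
  } catch {
    // The cache is itself a network call and can fail; fall through to the source of truth.
  }

  const report = await fetchReportFromDatabase(date); // the slow, authoritative lookup
  await redis
    .set(key, report, "EX", 60 * 60 * 24) // TTL of one day, as the data changes daily
    .catch(() => { /* a failed cache write shouldn't fail the request */ });
  return report;
}

// Hypothetical stand-in for the expensive database query.
async function fetchReportFromDatabase(date: string): Promise<string> {
  return `report for ${date}`;
}
```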

Another option is to introduce memoization. This is a technique available in many programming languages where the service builds an internal cache of results for function/parameter combinations. Essentially, this means that when a function in your code is called it will first check whether that function has been called with those arguments before and, if it hasn’t, it will run the function and store the result in a local cache. The next time that function is called with those arguments the cached response is returned. The downside here is that the cache takes up memory in your service, and each service will have its own cache, but it’s much simpler to implement. Just remember that you can’t and shouldn’t memoize everything. Sometimes a cache lookup is more expensive than running the function, but if your function makes external calls or does heavy processing it may be worth looking into.
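A minimal memoization helper in TypeScript might look like this; it assumes the arguments are JSON-serialisable and that the wrapped function is pure.

```typescript
// A minimal memoization helper: results are cached against the serialised arguments.
function memoize<Args extends unknown[], R>(fn: (...args: Args) => R): (...args: Args) => R {
  const cache = new Map<string, R>();
  return (...args: Args): R => {
    const key = JSON.stringify(args); // assumes the arguments are JSON-serialisable
    if (cache.has(key)) {
      return cache.get(key)!; // cached result: skip the expensive call
    }
    const result = fn(...args);
    cache.set(key, result);
    return result;
  };
}

// Example usage: wrap an expensive, pure lookup so repeated calls are served from memory.
const shippingCostFor = memoize((country: string, weightKg: number): number => {
  // ...imagine a heavy calculation or an external call here...
  return weightKg * (country === "GB" ? 2.5 : 5);
});

shippingCostFor("GB", 1.2); // computed
shippingCostFor("GB", 1.2); // returned from the local cache
```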

Event Driven Architecture

When building a distributed system we can reduce latency by reducing the number of hops required to fulfil a request. This means we need to identify which requests need to happen synchronously, such as ensuring a card payment goes through, and which can happen asynchronously, like informing the warehouse that they need to package and ship the order. By only making the requests that are strictly necessary, we only introduce the latency that is strictly necessary. If a request can be done asynchronously then, instead of making a call directly to another service, simply fire off an event and allow that other system to pick it up.
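For example, assuming Amazon SNS as the event bus (any broker or queue would work just as well), the order flow might charge the card synchronously and then simply publish an event for the warehouse to pick up.

```typescript
import { SNSClient, PublishCommand } from "@aws-sdk/client-sns"; // assuming Amazon SNS as the event bus

const sns = new SNSClient({});

// The card payment must succeed synchronously, but the warehouse doesn't need an answer right now:
// publish an event and let the warehouse service consume it in its own time.
async function completeOrder(orderId: string): Promise<void> {
  await chargeCard(orderId);

  await sns.send(
    new PublishCommand({
      TopicArn: process.env.ORDER_PLACED_TOPIC_ARN, // hypothetical topic for "order placed" events
      Message: JSON.stringify({ type: "OrderPlaced", orderId }),
    })
  );
}

// Hypothetical synchronous call to the payment provider.
async function chargeCard(orderId: string): Promise<void> {
  // ...
}
```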

Server Performance

Keep an eye on your servers. Through the observability of the system you should have metrics that let you know how your servers are performing. Ensure that they are scaled correctly, that they can scale both horizontally and vertically, and that they allow for hardware failures and network outages.

Client UX

Lastly, when looking at how latency affects our users, it’s important to remember that it’s all about reducing how long the system appears to take to respond. This can be difficult, especially if your users are on their phones or tablets on a train with spotty internet. A technique called Optimistic UI can be used to make the response feel instantaneous. When a user interacts with your UI, the UI optimistically updates itself, assuming that the request to the server will succeed. Then, in the background, the request is sent to the server. If the request does succeed then great: your users have a lightning-fast experience and are none the wiser. If the request fails after multiple retries, let the user know, and allow them to try again. Other techniques include pre-emptive loading of assets, using animations to hide request times and page loads, and using contextual elements to show that the user’s interaction is being processed.
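As a rough sketch, with hypothetical render, save, and error helpers standing in for the real UI and API layers, an optimistic update with rollback might look like this:

```typescript
// Optimistic UI sketch: update local state immediately, reconcile with the server in the background.
interface TodoState {
  items: string[];
}

async function addTodoOptimistically(state: TodoState, text: string): Promise<void> {
  state.items.push(text); // show the new item straight away
  render(state);

  try {
    await saveTodoWithRetries(text, 3);
  } catch {
    state.items = state.items.filter((item) => item !== text); // roll back the optimistic update
    render(state);
    showError("We couldn't save your item, please try again.");
  }
}

// Hypothetical helpers standing in for the real UI and API layers.
function render(state: TodoState): void {
  /* re-render the list */
}

function showError(message: string): void {
  /* surface the failure to the user */
}

async function saveTodoWithRetries(text: string, attempts: number): Promise<void> {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch("/api/todos", { method: "POST", body: JSON.stringify({ text }) });
      if (res.ok) return;
    } catch {
      // network error, try again
    }
  }
  throw new Error("save failed after retries");
}
```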

Conclusion

Latency is inevitable, but we can manage that latency through a number of techniques. Ultimately, it’s not just about reducing how long a call takes, but making sure the user feels like the system is responsive. Remember, you have one second before the user starts to get annoyed, so make the most of it.