All applications that communicate with remote services and resources must be sensitive to transient faults. Transient faults include the momentary loss of network connectivity to services, the temporary unavailability of a service, or timeouts that arise when a service is busy. These faults are often self-correcting, and if the remote access request is repeated after a suitable delay it’s likely to succeed.
Transient faults can have a huge impact on the perceived quality of an application, even if it has been thoroughly tested under all foreseeable circumstances. To ensure that an application that communicates with remote services operates reliably, it must be able to:
- Detect faults when they occur, and determine if the faults are likely to be transient.
- Retry the operation if it’s determined that the fault is likely to be transient, and keep track of the number of times the operation is retried.
- Use an appropriate retry strategy, which specifies the number of retries, the delay between each attempt, and the actions to take after a failed attempt.
This transient fault handling can be achieved by wrapping all attempts to access a remote service in code that implements the retry pattern.
If an application detects a failure when it tries to send a request to a remote service, it can handle the failure by:
- Retrying the operation – the application could retry the failing request immediately.
- Retrying the operation after a delay – the application could wait for a suitable amount of time before retrying the request.
- Cancelling the operation – the application should cancel the operation and report an exception.
The retry strategy should be tuned to match the business requirements of the application. For example, it’s important to optimize the retry count and retry interval for the operation being attempted. If the operation is part of a user interaction, the retry interval should be short and only a few retries attempted, to avoid making users wait for a response. If the operation is part of a long-running workflow, where cancelling or restarting the workflow is expensive or time-consuming, it’s appropriate to wait longer between attempts and to retry more times.
If a request still fails after a number of retries, it’s better for the app to prevent further requests from going to the same resource and to report a failure. Then, after a set period, the application can make one or more requests to the resource to see if they’re successful. I’ll return to this topic in a future blog post.
Retrying after an Exponential Delay
If transient faults are occurring because a remote service is overloaded, or being throttled at the service end, the service could reject new requests. While this scenario can be handled by the retry pattern, it’s possible that retry requests could add to the overloading of the service, which means that the service could take longer to recover from its overloaded state.
Exponential backoff attempts to deal with this problem by exponentially increasing the delay between retries, rather than retrying after a fixed delay. The purpose of this approach is to give the service time to recover, in case the transient fault is due to a service overload. For example, when the initial request fails it can be retried after 1 second. If the first retry fails, wait 2 seconds before the next retry. Then if the second retry fails, wait 4 seconds before the next retry.
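As a rough sketch (this isn’t code from the sample), the idealized schedule above can be expressed as a one-line calculation, where attempt is the 1-based retry number and the delay is capped at a maximum:

```csharp
using System;

static class BackoffSchedule
{
    // Idealized exponential backoff: initialDelayMs * 2^(attempt - 1), capped at maxDelayMs.
    public static int DelayMilliseconds(int attempt, int initialDelayMs, int maxDelayMs) =>
        Math.Min(initialDelayMs * (1 << (attempt - 1)), maxDelayMs);
}
```

With an initial delay of 1 second and a cap of 8 seconds, this yields delays of 1, 2, 4 and then 8 seconds for successive retries.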
Implementing the Retry Pattern
In this blog post I’ll explain how I implemented the retry pattern, with exponential backoff. The advantage of the approach presented here is that the retry pattern is implemented without requiring any library code, for those sensitive to bloating their application package size.
My implementation of the retry pattern adds to Xamarin’s TodoREST sample. This sample demonstrates a Todo list application where the data is stored in, and accessed from, a RESTful web service hosted by Xamarin. However, I’ve modified the original implementation so that the RestService class moves some of its responsibilities to the RequestProvider class, which handles all REST requests. This ensures that all REST requests are made by a single class, which has a single responsibility. The following code example shows the GetAsync method from the RequestProvider class, which makes GET requests to a specified URI:
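Its shape is broadly as follows (a sketch rather than the sample’s exact code; System.Text.Json stands in for the sample’s JSON library to keep the snippet dependency-free):

```csharp
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public class RequestProvider
{
    readonly HttpClient httpClient = new HttpClient();

    public async Task<TResult> GetAsync<TResult>(string uri)
    {
        // No fault handling here: a transient HttpRequestException or
        // timeout propagates straight to the caller, with no retries.
        string serialized = await httpClient.GetStringAsync(uri);
        return JsonSerializer.Deserialize<TResult>(serialized);
    }
}
```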
Note, however, that the sample application, which can be found on GitHub, doesn’t use the RequestProvider class. It’s included purely for comparison with the ResilientRequestProvider class, which the application uses, and which implements the retry pattern.
The App class in the sample application initializes the classes that are responsible for communicating with the REST service:
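In outline, the wiring looks something like this (the exact constructor chain is an assumption based on the class responsibilities described in this post):

```csharp
public partial class App : Application
{
    public static TodoItemManager TodoManager { get; private set; }

    public App()
    {
        InitializeComponent();
        // The argument-free construction uses the default retry settings.
        TodoManager = new TodoItemManager(
            new RestService(
                new ResilientRequestProvider()));
        MainPage = new NavigationPage(new TodoListPage());
    }
}
```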
The RestService class provides data to the TodoItemManager class, with the RestService class making REST calls using the ResilientRequestProvider class, which uses the RetryWithExponentialBackoff class to implement the retry pattern, using exponential backoff.
The following code example shows the GetAsync method from the ResilientRequestProvider class, which makes GET requests to a specified URI:
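In sketch form (again with System.Text.Json standing in for the sample’s JSON library, and with the HttpInvoker helper shown inline):

```csharp
public class ResilientRequestProvider
{
    readonly HttpClient httpClient = new HttpClient();
    readonly RetryWithExponentialBackoff retry = new RetryWithExponentialBackoff();

    async Task<HttpResponseMessage> HttpInvoker(Func<Task<HttpResponseMessage>> operation)
    {
        // Hand the operation to the retry class, which re-executes it on transient faults.
        return await retry.RetryOnExceptionAsync(operation);
    }

    public async Task<TResult> GetAsync<TResult>(string uri)
    {
        string serialized = null;
        // The lambda expression below is the code that will be retried
        // if the GET request fails with a transient fault.
        await HttpInvoker(async () =>
        {
            var response = await httpClient.GetAsync(uri);
            response.EnsureSuccessStatusCode();
            serialized = await response.Content.ReadAsStringAsync();
            return response;
        });
        return JsonSerializer.Deserialize<TResult>(serialized);
    }
}
```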
Notice that the code from the GetAsync method in the RequestProvider class is still present, but is now specified as a lambda expression. This lambda expression is passed to the HttpInvoker method, which in turn passes it to the RetryOnExceptionAsync method of the RetryWithExponentialBackoff class. Therefore, the code in the lambda expression is what will be retried if the GET request fails.
The RetryWithExponentialBackoff class has a constructor with three arguments, as shown in the following code example:
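The constructor and its backing fields look like this (the specific default values shown are assumptions, chosen to match the 10 retries and 2000 millisecond cap seen when running the sample):

```csharp
public class RetryWithExponentialBackoff
{
    readonly int maxRetries;
    readonly int delayMilliseconds;
    readonly int maxDelayMilliseconds;

    public RetryWithExponentialBackoff(
        int maxRetries = 10, int delayMilliseconds = 200, int maxDelayMilliseconds = 2000)
    {
        this.maxRetries = maxRetries;
        this.delayMilliseconds = delayMilliseconds;
        this.maxDelayMilliseconds = maxDelayMilliseconds;
    }

    // The RetryOnExceptionAsync method is elided here.
}
```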
The constructor arguments specify the maximum number of retries, the initial delay between retries (in milliseconds), and the maximum delay between retries (in milliseconds). However, the three arguments also have default values, allowing the class to be instantiated without any arguments when it’s initialized in the App class.
The RetryOnExceptionAsync method in the RetryWithExponentialBackoff class is shown in the following code example:
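A sketch of the method (the field names and the Debug log message are assumptions; the method lives inside the RetryWithExponentialBackoff class and reads the values stored by its constructor):

```csharp
public async Task<HttpResponseMessage> RetryOnExceptionAsync(
    Func<Task<HttpResponseMessage>> operation)
{
    var backoff = new ExponentialBackoff(
        maxRetries, delayMilliseconds, maxDelayMilliseconds);
    while (true)
    {
        try
        {
            // Execute the lambda expression passed in from GetAsync.
            // On success, returning the response exits the infinite loop.
            return await operation();
        }
        catch (Exception ex) when (ex is TimeoutException || ex is HttpRequestException)
        {
            Debug.WriteLine($"Transient fault: {ex.Message}");
            // Delay throws a TimeoutException once the maximum
            // number of retries has been exceeded.
            await backoff.Delay();
        }
    }
}
```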
This method is responsible for implementing the retry pattern: retrying failed operations with an exponential backoff, up to a maximum number of retries. It uses an infinite loop to execute the operation that was specified as a lambda expression in the GetAsync method of the ResilientRequestProvider class. If the operation succeeds, the loop is exited and the response received from the web service is returned. However, if the operation fails due to a transient fault, that is a TimeoutException or an HttpRequestException, the operation is retried after a delay controlled by the ExponentialBackoff class.
Obviously, the implementation of the RetryWithExponentialBackoff class could be tidied up so that it takes a dependency on an IBackoff type, which the ExponentialBackoff struct would implement. This would allow different backoff strategy implementations to be swapped in and out easily. However, the current implementation adequately demonstrates what I’m trying to show: retrying requests that failed due to transient faults.
The ExponentialBackoff struct has a constructor requiring three arguments:
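Sketched out, the struct and its constructor might look like this (the field names are assumptions):

```csharp
public struct ExponentialBackoff
{
    readonly int maxRetries;
    readonly int delayMilliseconds;
    readonly int maxDelayMilliseconds;
    int retries;
    int pow;

    public ExponentialBackoff(int maxRetries, int delayMilliseconds, int maxDelayMilliseconds)
    {
        this.maxRetries = maxRetries;
        this.delayMilliseconds = delayMilliseconds;
        this.maxDelayMilliseconds = maxDelayMilliseconds;
        retries = 0;
        pow = 1; // pow tracks 2 raised to the power of retries
    }

    // The Delay method is elided here.
}
```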
All three arguments must be specified when creating an instance of the struct, and they should be identical to the values of the arguments in the RetryWithExponentialBackoff class constructor.
The RetryOnExceptionAsync method in the RetryWithExponentialBackoff class invokes the Delay method in the ExponentialBackoff class if a transient fault has occurred. The Delay method is shown in the following code example:
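A sketch of the method (the exact delay formula is an assumption; the (pow - 1) / 2 shaping is one common way to produce a roughly exponential series from a doubling counter):

```csharp
public Task Delay()
{
    if (retries == maxRetries)
    {
        throw new TimeoutException("Maximum retry attempts exceeded.");
    }
    ++retries;
    if (retries < 31)
    {
        pow = pow << 1; // keep pow equal to 2^retries, guarding against int overflow
    }
    // Roughly exponential: delayMilliseconds * (2^retries - 1) / 2,
    // capped at maxDelayMilliseconds.
    int delay = Math.Min(delayMilliseconds * (pow - 1) / 2, maxDelayMilliseconds);
    return Task.Delay(delay);
}
```

With a 200 millisecond initial delay and a 2000 millisecond cap, for example, the successive delays would be 100, 300, 700 and 1500 milliseconds, after which every delay is capped at 2000 milliseconds.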
This method implements a roughly exponential delay, up to the maximum number of milliseconds specified by the maxDelayMilliseconds variable. It also ensures that the maximum number of retry attempts isn’t exceeded, by throwing a TimeoutException when the number of retries reaches the maximum allowed.
Running the Sample Application
The sample application, which can be found on GitHub, connects to a read-only REST service hosted by Xamarin, and it’s most likely that when running the sample the GET operation will succeed at the first attempt. To observe the retry pattern in operation, change the RestUrl property in the Constants class to an address that doesn’t exist – this can be accomplished by adding a random character to the end of the existing string. Then run the application and observe the output in the Visual Studio output window. You should see something like:
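The listing below is illustrative rather than captured verbatim; the exact messages depend on the logging in the retry code and on the platform:

```
Transient fault: No such host is known. Retrying...
Transient fault: No such host is known. Retrying...
Transient fault: No such host is known. Retrying...
...
Exception: System.TimeoutException: Maximum retry attempts exceeded.
```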
This shows the GET operation being retried 10 times, with a roughly exponentially increasing delay, up to a maximum of 2000 milliseconds. Remember that the number of retries, initial delay (in milliseconds), and maximum delay (in milliseconds) can be specified when creating the RetryWithExponentialBackoff instance. This allows the retry pattern to be customized to fit individual application requirements.
The retry pattern allows applications to retry a failing request to a remote service, after a suitable delay. Remote access requests that are repeated after a suitable delay are likely to succeed, if the fault in the remote service is transient.
This blog post has explained how to implement the retry pattern, with exponential backoff. The number of retries, initial delay, and maximum delay can all be specified, allowing the retry pattern to be customized to fit individual application requirements. The advantage of the approach presented here is that the retry pattern is implemented without requiring any library code, for those sensitive to bloating their application package size.
In my next blog post I’ll show how to re-implement the retry pattern using Polly.