Tuesday, 14 November 2017

Fault Handling in Xamarin.Forms: Circuit Breaker

Calls to remote services can fail due to transient faults, such as the momentary loss of network connectivity to services, the temporary unavailability of a service, or timeouts that arise when the service is busy. These faults are often self-correcting, and if the remote access request is repeated after a suitable delay, it’s likely to succeed.

Earlier in the year I wrote about transient fault handling in Xamarin.Forms using the retry pattern. The idea being that all attempts to access a remote service can be wrapped in code that retries the operation if it fails. However, there can also be situations where faults are not transient. Instead, they are due to unanticipated events, and might take longer to fix. Such faults can range from a partial loss of connectivity to a complete service failure. In such circumstances it’s pointless for an application to continually retry an operation that’s unlikely to succeed. Instead, the app should accept that the operation has failed and act accordingly.

Faults that take a variable amount of time to recover from can be handled by the circuit breaker pattern, improving the stability and resiliency of an application.

Circuit Breaker Pattern

The circuit breaker pattern prevents an application from repeatedly trying to execute an operation that’s likely to fail. It monitors the number of recent failures that have occurred, and uses this information to decide whether to allow the operation to proceed, or whether to return an exception immediately. It also enables an application to detect whether the fault has been resolved, allowing the operation to be invoked again.

The pattern is so named because it sets states that mimic the functionality of an electrical circuit breaker:

  • Closed. The remote access request is attempted. If the request fails, a count of the number of recent failures is incremented. If this count exceeds a threshold within a time period, the circuit breaker is placed into the Open state. A timeout timer then starts, to allow the problem that caused the failure to be fixed. When the timer expires, the circuit breaker is placed into the Half-open state.
  • Open. The remote access request fails immediately and an exception is returned to the application.
  • Half-open. A limited number of remote access requests are attempted. If the requests are successful, it’s assumed that the fault that was causing the failure has been fixed, and the circuit breaker is placed into the Closed state (while zeroing the failure counter). If any request fails, it’s assumed that the fault is still present, and the circuit breaker is placed back into the Open state, where the timer restarts to allow the problem that caused the failure to be fixed.
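
The three states above can be sketched as a small state machine. This is only an illustration, not the implementation described later in this post: the SimpleCircuitBreaker and CircuitState names are hypothetical, and for brevity it counts consecutive failures rather than failures within a time period.

```csharp
using System;

public enum CircuitState { Closed, Open, HalfOpen }

// Hypothetical sketch: names and thresholds are illustrative only.
public class SimpleCircuitBreaker
{
    readonly int _failureThreshold;
    readonly TimeSpan _openDuration;
    int _failureCount;
    DateTime _openedAtUtc;

    public CircuitState State { get; private set; } = CircuitState.Closed;

    public SimpleCircuitBreaker(int failureThreshold, TimeSpan openDuration)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
    }

    // Returns true if a request may be attempted at the given time.
    public bool AllowRequest(DateTime utcNow)
    {
        if (State == CircuitState.Open && utcNow - _openedAtUtc >= _openDuration)
        {
            // The open timer has expired: allow a trial request.
            State = CircuitState.HalfOpen;
        }
        return State != CircuitState.Open;
    }

    // A successful request closes the breaker and zeroes the failure counter.
    public void RecordSuccess()
    {
        _failureCount = 0;
        State = CircuitState.Closed;
    }

    // Any failure while Half-open, or reaching the threshold while Closed,
    // opens the circuit and restarts the open timer.
    public void RecordFailure(DateTime utcNow)
    {
        _failureCount++;
        if (State == CircuitState.HalfOpen || _failureCount >= _failureThreshold)
        {
            State = CircuitState.Open;
            _openedAtUtc = utcNow;
        }
    }
}
```

A caller would check AllowRequest before each remote call, and report the outcome with RecordSuccess or RecordFailure. Passing the current time in as an argument keeps the sketch easy to test.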

Implementation

In this blog post I’ll explain how I implemented the circuit breaker pattern. Patterns & Practices describe an implementation of the pattern here, but don’t provide fully working code. My implementation is an attempt to flesh out the code they provided into something that runs end to end. The advantage of the approach presented here is that the circuit breaker pattern is implemented without requiring any library code, for those sensitive to bloating their application package size.

My implementation of the circuit breaker pattern adds to Xamarin’s TodoREST sample. This sample demonstrates a Todo list application where the data is stored and accessed from a RESTful web service, hosted by Xamarin. However, I’ve modified the original implementation so that the RestService class moves some of its responsibilities to the RequestProvider class, which handles all REST requests. This ensures that all REST requests are made by a single class, which has a single responsibility. The following code example shows the GetAsync method from the RequestProvider class, which makes GET requests to a specified URI:

public async Task<TResult> GetAsync<TResult>(string uri)
{
    var response = await client.GetAsync(uri);
    response.EnsureSuccessStatusCode();
    string serialized = await response.Content.ReadAsStringAsync();
    return JsonConvert.DeserializeObject<TResult>(serialized);
}

Note, however, that the sample application, which can be found on GitHub, doesn’t use the RequestProvider class. It’s included purely for comparison with the ResilientRequestProvider class, which the application uses, and which implements the circuit breaker pattern.

Initialization

The App class in the sample application initializes the classes that are responsible for communicating with the REST service:

TodoManager = new TodoItemManager(
    new RestService(
        new ResilientRequestProvider(
            new CircuitBreakerService(typeof(RestService).FullName, 1000))));

The RestService class provides data to the TodoItemManager, with the RestService class making REST calls using the ResilientRequestProvider class, which uses the CircuitBreakerService class to implement the circuit breaker pattern.

ResilientRequestProvider

The following code example shows the GetAsync method from the ResilientRequestProvider class, which makes GET requests to a specified URI:

async Task<HttpResponseMessage> HttpInvoker(Func<Task<HttpResponseMessage>> operation)
{
    return await circuitBreakerService.InvokeAsync(
        operation,
        // Perform a different operation when the breaker is open
        (circuitBreakerOpenException) =>
            Debug.WriteLine($"Circuit is open. Exception: {circuitBreakerOpenException.InnerException}"),
        // Different exception thrown
        (exception) =>
            Debug.WriteLine($"Operation failed. Exception: {exception.Message}")
    );
}

public async Task<TResult> GetAsync<TResult>(string uri)
{
    string serialized = null;
    var httpResponse = await HttpInvoker(async () =>
    {
        var response = await client.GetAsync(uri);
        response.EnsureSuccessStatusCode();
        serialized = await response.Content.ReadAsStringAsync();
        return response;
    });
    return JsonConvert.DeserializeObject<TResult>(serialized);
}

Notice that the code from the GetAsync method in the RequestProvider class is still present, but is now specified as a lambda expression. This lambda expression is passed to the HttpInvoker method, which in turn passes it to the InvokeAsync method of the CircuitBreakerService class. Therefore, the code in the lambda expression is what will be executed by the circuit breaker, provided it’s in a state that allows that. In addition, the HttpInvoker method supplies the action that handles the CircuitBreakerOpenException if the operation fails because the circuit breaker is open, and a second action that handles any other exception.

CircuitBreakerService

The CircuitBreakerService class has a constructor with two arguments, as shown in the following code example:

public CircuitBreakerService(string resource, int openToHalfOpenWaitTime)
{
    _stateStore = CircuitBreakerStateStoreFactory.GetCircuitBreakerStateStore(resource);
    _resourceName = resource;
    _openToHalfOpenWaitTime = new TimeSpan(0, 0, 0, 0, openToHalfOpenWaitTime);
}

The constructor arguments specify a resource name that the circuit breaker is attempting to protect, and the time in milliseconds to wait when switching from the Open to Half-open state (the length of time the circuit breaker waits for the fault to be fixed).

The InvokeAsync method in the CircuitBreakerService class is shown in the following code example:

public async Task<HttpResponseMessage> InvokeAsync(Func<Task<HttpResponseMessage>> operation)
{
    HttpResponseMessage response = null;

    if (IsOpen)
    {
        // Circuit breaker is open
        return await WhenCircuitIsOpenAsync(operation);
    }
    else
    {
        // Circuit breaker is closed - execute the operation
        try
        {
            response = await operation();
        }
        catch (Exception ex)
        {
            // Retrip the breaker immediately and throw the exception so that
            // the caller can tell the type of exception that was thrown
            TrackException(ex);
            throw;
        }
    }
    return response;
}

public async Task<HttpResponseMessage> InvokeAsync(
    Func<Task<HttpResponseMessage>> operation,
    Action<CircuitBreakerOpenException> circuitBreakerOpenAction,
    Action<Exception> anyOtherExceptionAction)
{
    HttpResponseMessage response = null;

    try
    {
        response = await InvokeAsync(operation);
    }
    catch (CircuitBreakerOpenException ex)
    {
        // Perform a different operation when the circuit breaker is open
        circuitBreakerOpenAction(ex);
    }
    catch (Exception ex)
    {
        anyOtherExceptionAction(ex);
    }
    return response;
}

The first InvokeAsync method wraps an operation, specified as a Func. If the circuit breaker is closed, it invokes the Func. If the operation fails, an exception handler calls the TrackException method, which sets the circuit breaker state to Open. The second InvokeAsync method wraps an operation, specified as a Func, and an Action to be performed if a CircuitBreakerOpenException is thrown, and an Action to be performed when any other exception is thrown.

If the circuit breaker is in an Open state, the InvokeAsync method calls the WhenCircuitIsOpenAsync method, which is shown in the following code example:

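async Task<HttpResponseMessage> WhenCircuitIsOpenAsync(Func<Task<HttpResponseMessage>> operation)
{
    HttpResponseMessage response = null;

    // Has the open timeout expired?
    if (_stateStore.LastStateChangedDate + _openToHalfOpenWaitTime < DateTime.UtcNow)
    {
        bool lockTaken = false;
        try
        {
            // Only one caller at a time may attempt the trial operation
            Monitor.TryEnter(_halfOpenSyncObject, ref lockTaken);
            if (lockTaken)
            {
                _stateStore.HalfOpen();
                response = await operation();
                _stateStore.Reset();
                return response;
            }
        }
        catch (Exception ex)
        {
            // The trial operation failed - trip the breaker back to Open
            _stateStore.Trip(ex);
            throw;
        }
        finally
        {
            if (lockTaken)
            {
                Monitor.Exit(_halfOpenSyncObject);
            }
        }
    }

    // The open timeout hasn't expired, or another caller holds the Half-open lock.
    // Assumption: the state store exposes the last recorded exception as LastException.
    throw new CircuitBreakerOpenException(_stateStore.LastException);
}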

This method first checks if the circuit breaker open timeout has expired. If this is the case, the circuit breaker is set to a Half-open state, and then the operation specified by the Func is performed. If the operation is successful, the circuit breaker is reset to the Closed state. If the operation fails, it is tripped back to the Open state and the time the exception occurred is updated so that the circuit breaker will wait for a further period before trying to perform the operation again.

If the circuit breaker has only been in an Open state for a short time (the open timeout hasn’t expired), the method throws a CircuitBreakerOpenException that wraps the error that caused the circuit breaker to transition to the Open state.

Note that the WhenCircuitIsOpenAsync method uses a lock to prevent the circuit breaker from trying to perform concurrent calls to the operation while it’s Half-open. A concurrent attempt to invoke the operation will be handled as if the circuit breaker was Open, and it’ll fail with a CircuitBreakerOpenException.
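
One caveat with a Monitor-based lock is that Monitor.Exit must be called on the thread that acquired the lock, and code after an await may resume on a different thread when no synchronization context is captured. As a hedged alternative, the same "one trial at a time" guard can be sketched with a SemaphoreSlim, which has no thread affinity. The HalfOpenGate class below is hypothetical and isn't part of the sample application.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of the sample application.
public class HalfOpenGate
{
    // One permit: only one trial operation may run at a time.
    readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);

    // Runs the trial operation if no other trial is in flight.
    // Returns false, without running it, when the gate is already taken.
    public async Task<bool> TryRunAsync(Func<Task> trialOperation)
    {
        if (!_gate.Wait(0)) // non-blocking attempt to take the permit
        {
            return false;
        }
        try
        {
            await trialOperation();
            return true;
        }
        finally
        {
            _gate.Release(); // can be called from any thread
        }
    }
}
```

A caller that receives false would treat the circuit breaker as Open, which matches the behaviour described above for concurrent attempts.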

Running the Sample Application

The sample application, which can be found on GitHub, connects to a read-only REST service hosted by Xamarin, and it’s most likely that when running the sample the GET operation will succeed on first attempt. To observe the circuit breaker pattern in operation, change the RestUrl property in the Constants class to an address that doesn’t exist – this can be accomplished by adding a random character to the existing string. Then run the application and observe the output in the output window in Visual Studio. You should see something like:

Circuit is closed. Executing operation.
Tripping the circuit breaker.
Operation failed. Exception: 404 (Not Found)
ERROR: Value cannot be null. Parameter name: value

This shows attempted execution of the GET operation through the circuit breaker. Initially the circuit is Closed and so an attempt is made to execute the GET operation. The operation fails and so the circuit breaker is tripped, and the error message is presented to the user. Remember that the length of the circuit breaker open timeout can be specified through the CircuitBreakerService constructor. This allows the circuit breaker pattern to be customized to fit individual application requirements.

Summary

The circuit breaker pattern prevents an application from repeatedly trying to execute an operation that’s likely to fail. The pattern also enables an application to detect whether the fault has been resolved, allowing the operation to be invoked again. The pattern monitors the number of recent failures that have occurred, and uses this information to decide whether to allow the operation to proceed, or whether to return an exception immediately. It does this by setting states that mimic the functionality of an electrical circuit breaker.

This blog post has described an implementation of the circuit breaker pattern, based on descriptions provided by Patterns & Practices. The length of the circuit breaker open timeout can be specified, allowing the pattern to be customized to fit individual application requirements. The advantage of the approach presented here is that the circuit breaker pattern is implemented without requiring any library code, for those sensitive to bloating their application package size.

In my next blog post I’ll show how to re-implement the circuit breaker pattern using Polly.

Wednesday, 9 August 2017

Using PKCE with IdentityServer from a Xamarin Client

The OpenID Connect and OAuth 2.0 specifications define a number of authentication flows between clients and authentication providers. These include:

  • Implicit. This authentication flow is optimized for browser-based apps. All tokens are transmitted via the browser.
  • Authorization code. This authentication flow provides the ability to retrieve tokens on a back channel, as opposed to the browser front channel, while also supporting client authentication.
  • Hybrid. This authentication flow is a combination of the implicit and authorization code flows. The identity token is transmitted via the browser channel and contains the signed protocol response along with other artifacts such as the authorization code. After successful validation of the response, the back channel is used to retrieve the access and refresh tokens.

The eShopOnContainers mobile app communicates with an identity microservice, which uses IdentityServer 4 to perform authentication, and access control for APIs. The app uses the hybrid authentication flow to retrieve access tokens, as this flow mitigates a number of attacks that apply to the browser channel, and this approach is explained in the guidance documentation.

However, OAuth 2.0 clients that utilize authorization codes are susceptible to an authorization code interception attack. In this attack, the authorization code returned from an authorization endpoint is intercepted within a communication path not protected by Transport Layer Security (TLS), such as inter-application communication within the client’s operating system. Once the attacker has gained access to the authorization code, it can be used to obtain the access token. While a number of pre-conditions must hold for the authorization code interception attack to work, it has been observed in the wild.

To mitigate this attack, the Proof Key for Code Exchange (PKCE) extension to OAuth 2.0 adds additional parameters to the OAuth 2.0 authorization and access token requests:

  1. The client creates a cryptographically random key called a code verifier, and derives a transformed value, called a code challenge, which is sent in the OAuth 2.0 authorization request along with the transformation method.
  2. The authorization endpoint responds as usual but records the code challenge and transformation method.
  3. The client sends the authorization code in the access token request, and also includes the code verifier.
  4. The authorization server transforms the code verifier and compares it to the code challenge. Access is denied if they are not equal.

This works as a mitigation for native apps because if an attacker intercepts the authorization code in step (2), it can’t redeem it for an access token as the attacker is not in possession of the code verifier. In addition, the code verifier can’t be intercepted since it’s sent over TLS.

For detailed information about PKCE, see Proof Key for Code Exchange by OAuth Public Clients.

Implementing PKCE

Following the guidance in the OAuth 2.0 for Native Apps specification, that PKCE should be used in authorization code based authentication flows, I’ve recently updated the eShopOnContainers mobile app to use PKCE when communicating with IdentityServer.

Server Side

IdentityServer must be configured to require the use of PKCE. This is achieved by modifying the configuration of the IdentityServer Client object for the Xamarin client:

public static IEnumerable<Client> GetClients(Dictionary<string,string> clientsUrl)
{
    return new List<Client>
    {
        ...
        new Client
        {
            ClientId = "xamarin",
            ClientName = "eShop Xamarin OpenId Client",
            AllowedGrantTypes = GrantTypes.Hybrid,
            ClientSecrets =
            {
                new Secret("secret".Sha256())
            },
            RedirectUris = { clientsUrl["Xamarin"] },
            RequireConsent = false,
            RequirePkce = true,
            PostLogoutRedirectUris = { $"{clientsUrl["Xamarin"]}/Account/Redirecting" },
            AllowedCorsOrigins = { "http://eshopxamarin" },
            AllowedScopes = new List<string>
            {
                IdentityServerConstants.StandardScopes.OpenId,
                IdentityServerConstants.StandardScopes.Profile,
                IdentityServerConstants.StandardScopes.OfflineAccess,
                "orders",
                "basket"
            },
            AllowOfflineAccess = true,
            AllowAccessTokensViaBrowser = true
        },
        ...
    };
}

This configuration adds the RequirePkce property to the Client object. The RequirePkce property specifies whether clients using an authorization code must send a proof key.

Client Side

The CreateAuthorizationRequest method in the IdentityService class creates the URI for IdentityServer’s authorization endpoint, and the URI must be modified to include additional query parameters. The following code example shows the modified method:

public string CreateAuthorizationRequest()
{
    // Create URI to authorization endpoint
    var authorizeRequest = new AuthorizeRequest(GlobalSetting.Instance.IdentityEndpoint);

    // Dictionary with values for the authorize request
    var dic = new Dictionary<string, string>();
    dic.Add("client_id", GlobalSetting.Instance.ClientId);
    dic.Add("client_secret", GlobalSetting.Instance.ClientSecret);
    dic.Add("response_type", "code id_token");
    dic.Add("scope", "openid profile basket orders locations marketing offline_access");
    dic.Add("redirect_uri", GlobalSetting.Instance.IdentityCallback);
    dic.Add("nonce", Guid.NewGuid().ToString("N"));
    dic.Add("code_challenge", CreateCodeChallenge());
    dic.Add("code_challenge_method", "S256");

    // Add CSRF token to protect against cross-site request forgery attacks.
    var currentCSRFToken = Guid.NewGuid().ToString("N");
    dic.Add("state", currentCSRFToken);

    var authorizeUri = authorizeRequest.Create(dic);
    return authorizeUri;
}

The client first creates a code verifier for the authorization request, with the CreateCodeChallenge method:

private string CreateCodeChallenge()
{
    _codeVerifier = RandomNumberGenerator.CreateUniqueId();
    var sha256 = HashAlgorithmProvider.OpenAlgorithm(HashAlgorithm.Sha256);
    var challengeBuffer = sha256.HashData(
        CryptographicBuffer.CreateFromByteArray(Encoding.UTF8.GetBytes(_codeVerifier)));
    byte[] challengeBytes;
    CryptographicBuffer.CopyToByteArray(challengeBuffer, out challengeBytes);
    return Base64Url.Encode(challengeBytes);
}

The CreateUniqueId method in the RandomNumberGenerator class creates a high-entropy cryptographic random string using the PCLCrypto library. Note that the PKCE specification requires that the code verifier is base64 URL-encoded to produce a URL safe string. However, the code verifier here is already URL safe, and so this additional operation isn’t required.

The CreateCodeChallenge method then creates a code challenge derived from the code verifier. This can be achieved by using one of the following transformations:

  • code challenge = code verifier (known as the plain transformation)

OR

  • code challenge = base64urlencode(Sha256(code_verifier)) (known as the S256 transformation)

If the client is capable of using the S256 transformation, it must do so, as this transformation is mandatory to implement on compliant servers. The CreateAuthorizationRequest method uses the S256 transformation, which hashes the code verifier with SHA-256, and then base64 URL-encodes the hash.
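
As a rough illustration of the S256 transformation, the following sketch uses the System.Security.Cryptography types rather than the PCLCrypto library used by the app. The Pkce class and its method names are hypothetical.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration; the app itself uses PCLCrypto.
public static class Pkce
{
    // Base64 URL-encoding without padding: '+' becomes '-', '/' becomes '_'.
    public static string Base64UrlEncode(byte[] bytes)
    {
        return Convert.ToBase64String(bytes)
            .TrimEnd('=')
            .Replace('+', '-')
            .Replace('/', '_');
    }

    // S256: code_challenge = base64urlencode(SHA256(ASCII(code_verifier)))
    public static string CreateS256Challenge(string codeVerifier)
    {
        using (var sha256 = SHA256.Create())
        {
            byte[] hash = sha256.ComputeHash(Encoding.ASCII.GetBytes(codeVerifier));
            return Base64UrlEncode(hash);
        }
    }
}
```

With the example code verifier from Appendix B of the PKCE specification, dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk, this should produce the code challenge E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM.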

The client then sends the code challenge as part of the OAuth 2.0 authorization request, using the following additional parameters:

  • code_challenge – the derived code challenge
  • code_challenge_method – S256 (or plain)

When IdentityServer issues the authorization code in the authorization response, it associates the code challenge and code challenge method values with the authorization code so that it can be verified later. Note that if IdentityServer is configured to use PKCE, and the client does not send the code challenge, the authorization endpoint responds with an error response set to invalid_request.

Upon receipt of the authorization code, the client sends the access token request to the token endpoint. In addition to the existing parameters, it also sends the following parameter:

  • code_verifier – the code verifier

In the eShopOnContainers app, this is achieved with the GetTokenAsync method in the IdentityService class:

public async Task<UserToken> GetTokenAsync(string code)
{
    var data = string.Format(
        "grant_type=authorization_code&code={0}&redirect_uri={1}&code_verifier={2}",
        code,
        WebUtility.UrlEncode(GlobalSetting.Instance.IdentityCallback),
        _codeVerifier);
    var token = await _requestProvider.PostAsync<UserToken>(
        GlobalSetting.Instance.TokenEndpoint,
        data,
        GlobalSetting.Instance.ClientId,
        GlobalSetting.Instance.ClientSecret);
    return token;
}

Upon receipt of the request at the token endpoint, IdentityServer verifies it by transforming the received code verifier according to the code challenge method specified by the client, and comparing the result with the code challenge previously associated with the authorization code. If the values are not equal, an error response indicating invalid_grant is returned. If the values are equal, the token endpoint continues processing as normal and responds with an access token, identity token, and refresh token.
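
The check made at the token endpoint can be sketched as follows. This is an assumption about the general shape of the verification, not IdentityServer’s actual code, and the PkceVerifier class is hypothetical. A real server would also likely use a constant-time comparison.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical server-side check; IdentityServer's own implementation may differ.
public static class PkceVerifier
{
    // Returns true when the verifier matches the challenge stored with the
    // authorization code. A false result corresponds to an invalid_grant error.
    public static bool Verify(string codeVerifier, string storedChallenge, string method)
    {
        string computed;
        if (method == "S256")
        {
            using (var sha256 = SHA256.Create())
            {
                byte[] hash = sha256.ComputeHash(Encoding.ASCII.GetBytes(codeVerifier));
                computed = Convert.ToBase64String(hash)
                    .TrimEnd('=').Replace('+', '-').Replace('/', '_');
            }
        }
        else if (method == "plain")
        {
            // The plain transformation uses the verifier unchanged.
            computed = codeVerifier;
        }
        else
        {
            return false; // unknown transformation method
        }
        return string.Equals(computed, storedChallenge, StringComparison.Ordinal);
    }
}
```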

Summary

OAuth 2.0 clients that utilize authorization codes are susceptible to an authorization code interception attack. To mitigate this attack, the PKCE extension to OAuth 2.0 adds additional parameters to the OAuth 2.0 authorization and access token requests.

Following the guidance in the OAuth 2.0 for Native Apps specification, that PKCE should be used in authorization code based authentication flows, I’ve recently updated the eShopOnContainers mobile app to use PKCE when communicating with IdentityServer.

For detailed information about PKCE, see Proof Key for Code Exchange by OAuth Public Clients.

Monday, 31 July 2017

Transient Fault Handling in Xamarin.Forms using Polly

Previously I wrote about transient fault handling in Xamarin.Forms, and discussed an implementation of the retry pattern that uses exponential backoff. The advantage of the implementation was that the retry pattern was implemented without requiring any library code, for those sensitive to bloating their application package size.

There are, however, transient fault handling libraries available and the go-to library for .NET is Polly, which includes fluent support for the retry pattern, circuit breaker pattern, bulkhead isolation, and more. In Polly, these patterns are implemented via fault handling policies, which handle specific exceptions thrown by, or results returned by, the delegates that are executed through the policy.

This blog post will discuss using Polly’s RetryPolicy, which implements the retry pattern.

Implementation

The sample application, which can be found on GitHub, is similar to the sample application from my previous blog post, with the custom implementation of the retry pattern replaced with Polly’s.

Initialization

The App class in the sample application initializes the classes that are responsible for communicating with the REST service:

TodoManager = new TodoItemManager(
    new RestService(
        new ResilientRequestProvider()));

The RestService class provides data to the TodoItemManager class, with the RestService class making REST calls using the ResilientRequestProvider class, which uses Polly to implement the retry pattern, using exponential backoff.

ResilientRequestProvider

The following code example shows the GetAsync method from the ResilientRequestProvider class, which makes GET requests to a specified URI:

async Task<HttpResponseMessage> HttpInvoker(Func<Task<HttpResponseMessage>> operation)
{
    return await retryPolicy.ExecuteAsync(operation);
}

public async Task<TResult> GetAsync<TResult>(string uri)
{
    string serialized = null;
    var httpResponse = await HttpInvoker(async () =>
    {
        var response = await client.GetAsync(uri);
        response.EnsureSuccessStatusCode();
        serialized = await response.Content.ReadAsStringAsync();
        return response;
    });
    return JsonConvert.DeserializeObject<TResult>(serialized);
}

The GetAsync method code is identical to my previous blog post. The lambda expression is passed to the HttpInvoker method, which in turn passes it to the ExecuteAsync method of the RetryPolicy instance. Therefore, the code in the lambda expression is what will be retried if the GET request fails.

The RetryPolicy<T> type is a Polly type, which represents a retry policy that can be applied to delegates that return a value of type T. The following code example shows the RetryPolicy<T> declaration from the sample application:

RetryPolicy<HttpResponseMessage> retryPolicy;

This declares a RetryPolicy that can be applied to delegates that return a HttpResponseMessage.

There are 3 steps to using a fault handling policy, including the RetryPolicy<T> type, in Polly:

  1. Specify the exceptions you want the policy to handle.
  2. Optionally specify the returned results you want the policy to handle.
  3. Specify how the policy should handle any faults.

The following code example shows all three steps for defining the operation of the RetryPolicy<T> instance:

HttpStatusCode[] httpStatusCodesToRetry =
{
    HttpStatusCode.RequestTimeout, // 408
    HttpStatusCode.InternalServerError, // 500
    HttpStatusCode.BadGateway, // 502
    HttpStatusCode.ServiceUnavailable, // 503
    HttpStatusCode.GatewayTimeout // 504
};

retryPolicy = Policy
    .Handle<TimeoutException>()
    .Or<HttpRequestException>()
    .OrResult<HttpResponseMessage>(r => httpStatusCodesToRetry.Contains(r.StatusCode))
    .WaitAndRetryAsync(3,
        retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)),
        (response, delay, retryCount, context) =>
        {
            // Exception is null when the retry was triggered by a handled status code
            var reason = response.Exception?.Message ?? response.Result?.StatusCode.ToString();
            Debug.WriteLine($"Retry {retryCount} after {delay.Seconds} seconds delay due to {reason}");
        });

The Policy.Handle method is used to specify the exceptions and results you want the policy to handle. Here it specifies that the delegate should be retried if a TimeoutException or HttpRequestException occurs, or if the resulting HttpResponseMessage includes any of the HTTP status codes contained in the httpStatusCodesToRetry array. Therefore, the RetryPolicy<T> instance handles both exceptions and return values in a single policy.

After specifying the exceptions and results you want the policy to handle, you must specify how the policy should handle any faults. Several of Polly’s methods can be used here, including Retry, RetryForever, WaitAndRetry, or WaitAndRetryForever (along with their async variants). I chose to use one of the WaitAndRetryAsync overloads, for which three arguments must be specified:

  1. The maximum number of retries to make. Note that the overall number of attempts that will be made is one plus the number of retries configured. Therefore, four attempts can be made with this code: the initial attempt, plus up to three retries.
  2. A delegate, expressed here as a lambda expression, that calculates the duration to wait between retries based on the current retry attempt.
  3. An action to be called on each retry, that provides the current exception, duration, retry count, and context.

The advantage of using this overload is that it allows an exponential backoff strategy to be specified through calculation, and to output messages as retries are attempted.
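
To make the backoff calculation concrete, the sleep duration lambda can be pulled out into a small standalone method. The Backoff class is a hypothetical name used only for this illustration.

```csharp
using System;

// Hypothetical helper that mirrors the sleep duration lambda shown earlier.
public static class Backoff
{
    // 2^retryAttempt seconds: the delay doubles with each retry.
    public static TimeSpan DelayFor(int retryAttempt)
    {
        return TimeSpan.FromSeconds(Math.Pow(2, retryAttempt));
    }
}
```

For the three configured retries this gives delays of 2, 4, and 8 seconds, so a request that fails on every attempt waits 14 seconds in total before the final failure is reported.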

Executing an Action through the Policy

The overall operation is that the GetAsync method in the ResilientRequestProvider class invokes the HttpInvoker method, passing an action that represents the GET request. The HttpInvoker method invokes Polly’s RetryPolicy.ExecuteAsync method, passing the received action as an argument.

The retry policy then attempts the action passed in via the ExecuteAsync method. If the action executes successfully, the return value is returned and the policy exits. If the action throws an unhandled exception, it’s rethrown and the policy exits – no further retries are made. However, if the action throws a handled exception, the policy performs the following operations:

  • Counts the exception.
  • Checks whether another retry is permitted:
    • If not, the exception is rethrown and the policy terminates.
    • If another try is permitted, the policy calculates the duration to wait from the supplied sleep duration configuration, waits for the calculated duration, and returns to the beginning of the cycle to retry the action.
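
The cycle described above can be approximated with a plain loop. This is a simplified sketch to show the control flow, not Polly’s implementation. The SimpleRetry class and its parameter names are hypothetical, and it only handles exceptions, not returned results.

```csharp
using System;
using System.Threading.Tasks;

// Illustrative only: a simplified retry loop, not Polly's implementation.
public static class SimpleRetry
{
    public static async Task<T> ExecuteAsync<T>(
        Func<Task<T>> action,
        int maxRetries,
        Func<Exception, bool> isHandled,
        Func<int, TimeSpan> sleepDuration)
    {
        int retryCount = 0;
        while (true)
        {
            try
            {
                // Success: return the value and exit.
                return await action();
            }
            catch (Exception ex) when (isHandled(ex) && retryCount < maxRetries)
            {
                // Handled exception and another retry is permitted:
                // count it, wait, then go around the loop again.
                retryCount++;
                await Task.Delay(sleepDuration(retryCount));
            }
            // Unhandled exceptions, and handled ones once the retries are
            // exhausted, are not caught above and propagate to the caller.
        }
    }
}
```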

Running the Sample Application

The sample application, which can be found on GitHub, connects to a read-only REST service hosted by Xamarin, and it’s most likely that when running the sample the GET operation will succeed on first attempt. To observe the retry pattern in operation, change the RestUrl property in the Constants class to an address that doesn’t exist – this can be accomplished by adding a random character to the end of the existing string. Then run the application and observe the output window in Visual Studio. You should see something like:

Retry 1 after 2 seconds delay due to 404 (Not Found)
Thread started: <Thread Pool> #10
Thread started: <Thread Pool> #11
Retry 2 after 4 seconds delay due to 404 (Not Found)
Retry 3 after 8 seconds delay due to 404 (Not Found)
ERROR: 404 (Not Found)

This shows the GET operation being retried 3 times, after an exponentially increasing delay. Remember that the number of retries, and backoff strategy can be specified with Polly’s WaitAndRetry method. This allows the RetryPolicy to be customized to fit individual application requirements.

The advantage of using Polly over implementing your own retry pattern is that Polly includes multiple transient fault handling patterns that can easily be combined for additional resilience when handling transient faults.

Summary

Polly is a .NET transient fault handling library, which includes fluent support for the retry pattern. In Polly, the retry pattern is implemented by the RetryPolicy type, which handles specific exceptions thrown by, or results returned by, the delegates that are executed through the policy.

The RetryPolicy type is highly configurable, allowing you to specify the exceptions to be handled, the return results to be handled, and how the policy should handle any faults.

The advantage of using Polly over implementing your own retry pattern is that Polly includes multiple transient fault handling patterns that can easily be combined for additional resilience when handling transient faults.

Wednesday, 26 July 2017

Transient Fault Handling in Xamarin.Forms

All applications that communicate with remote services and resources must be sensitive to transient faults. Transient faults include the momentary loss of network connectivity to services, the temporary unavailability of a service, or timeouts that arise when a service is busy. These faults are often self-correcting, and if the remote access request is repeated after a suitable delay it’s likely to succeed.

Transient faults can have a huge impact on the perceived quality of an application, even if it has been thoroughly tested under all foreseeable circumstances. To ensure that an application that communicates with remote services operates reliably, it must be able to:

  1. Detect faults when they occur, and determine if the faults are likely to be transient.
  2. Retry the operation if it’s determined that the fault is likely to be transient, and keep track of the number of times the operation is retried.
  3. Use an appropriate retry strategy, which specifies the number of retries, the delay between each attempt, and the actions to take after a failed attempt.

This transient fault handling can be achieved by wrapping all attempts to access a remote service in code that implements the retry pattern.

Retry Pattern

If an application detects a failure when it tries to send a request to a remote service, it can handle the failure by:

  • Retrying the operation – the application could retry the failing request immediately.
  • Retrying the operation after a delay – the application could wait for a suitable amount of time before retrying the request.
  • Cancelling the operation – the application should cancel the operation and report an exception.

The retry strategy should be tuned to match the business requirements of the application. For example, it’s important to optimize the retry count and retry interval to the operation being attempted. If the operation is part of a user interaction, the retry interval should be short and only a few retries attempted, to avoid making users wait for a response. If the operation is part of a long running workflow, where cancelling or restarting the workflow is expensive or time-consuming, it’s appropriate to wait longer between attempts and retry more times.

If a request still fails after a number of retries, it’s better for the app to prevent further requests going to the same resource and to report a failure. Then, after a set period, the application can make one or more requests to the resource to see if they’re successful. I’ll return to this topic in a future blog post.

Retrying after an Exponential Delay

If transient faults are occurring because a remote service is overloaded, or being throttled at the service end, the service could reject new requests. While this scenario can be handled by the retry pattern, it’s possible that retry requests could add to the overloading of the service, which means that the service could take longer to recover from its overloaded state.

Exponential backoff attempts to deal with this problem by exponentially increasing the delay between retries, rather than retrying after a fixed delay. The purpose of this approach is to give the service time to recover, in case the transient fault is due to a service overload. For example, when the initial request fails it can be retried after 1 second. If it fails for a second time, wait 2 seconds before the next retry. Then if the second retry fails, wait for 4 seconds before the next retry.
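The 1, 2, 4 second progression described above can be sketched as a small helper. This is only an illustration of the doubling idea: BackoffSchedule and Doubling are hypothetical names, and they are not part of the sample application.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class BackoffSchedule
{
    // Returns the wait before each retry: initialDelay, then double
    // the previous wait for every subsequent retry.
    public static IReadOnlyList<TimeSpan> Doubling(TimeSpan initialDelay, int retries)
    {
        var delays = new List<TimeSpan>(retries);
        var current = initialDelay;
        for (int i = 0; i < retries; i++)
        {
            delays.Add(current);
            current = TimeSpan.FromTicks(current.Ticks * 2);
        }
        return delays;
    }
}
```

Calling BackoffSchedule.Doubling(TimeSpan.FromSeconds(1), 3) yields waits of 1, 2 and 4 seconds. A real implementation would normally also cap the delay, as the implementation in this post does.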

Implementing the Retry Pattern

In this blog post I’ll explain how I implemented the retry pattern, with exponential backoff. The advantage of the approach presented here is that the retry pattern is implemented without requiring any library code, for those sensitive to bloating their application package size.

My implementation of the retry pattern adds to Xamarin’s TodoREST sample. This sample demonstrates a Todo list application where the data is stored and accessed from a RESTful web service, hosted by Xamarin. However, I’ve modified the original implementation so that the RestService class moves some of its responsibilities to the RequestProvider class, which handles all REST requests. This ensures that all REST requests are made by a single class, which has a single responsibility. The following code example shows the GetAsync method from the RequestProvider class, which makes GET requests to a specified URI:

    public async Task<TResult> GetAsync<TResult>(string uri)
    {
        var response = await client.GetAsync(uri);
        response.EnsureSuccessStatusCode();
        string serialized = await response.Content.ReadAsStringAsync();
        return JsonConvert.DeserializeObject<TResult>(serialized);
    }

Note, however, that the sample application, which can be found on GitHub, doesn’t use the RequestProvider class. It’s included purely for comparison with the ResilientRequestProvider class, which the application uses, and which implements the retry pattern.

Initialization

The App class in the sample application initializes the classes that are responsible for communicating with the REST service:

    TodoManager = new TodoItemManager(
        new RestService(
            new ResilientRequestProvider(
                new RetryWithExponentialBackoff())));

The RestService class provides data to the TodoItemManager class, with the RestService class making REST calls using the ResilientRequestProvider class, which uses the RetryWithExponentialBackoff class to implement the retry pattern, using exponential backoff.

ResilientRequestProvider

The following code example shows the GetAsync method from the ResilientRequestProvider class, which makes GET requests to a specified URI:

    async Task<HttpResponseMessage> HttpInvoker(Func<Task<HttpResponseMessage>> operation)
    {
        return await retry.RetryOnExceptionAsync<HttpRequestException>(operation);
    }

    public async Task<TResult> GetAsync<TResult>(string uri)
    {
        string serialized = null;
        var httpResponse = await HttpInvoker(async () =>
        {
            var response = await client.GetAsync(uri);
            response.EnsureSuccessStatusCode();
            serialized = await response.Content.ReadAsStringAsync();
            return response;
        });
        return JsonConvert.DeserializeObject<TResult>(serialized);
    }

Notice that the code from the GetAsync method in the RequestProvider class is still present, but is now specified as a lambda expression. This lambda expression is passed to the HttpInvoker method, which in turn passes it to the RetryOnExceptionAsync method of the RetryWithExponentialBackoff class. Therefore, the code in the lambda expression is what will be retried if the GET request fails.

RetryWithExponentialBackoff

The RetryWithExponentialBackoff class has a constructor with three arguments, as shown in the following code example:

    public RetryWithExponentialBackoff(int retries = 10, int delay = 200, int maxDelay = 2000)
    {
        maxRetries = retries;
        delayMilliseconds = delay;
        maxDelayMilliseconds = maxDelay;
    }

The constructor arguments specify the maximum number of retries, the initial delay between retries (in milliseconds), and the maximum delay between retries (in milliseconds). All three arguments have default values, which allows the constructor to be called without arguments when the class is initialized in the App class.

The RetryOnExceptionAsync method in the RetryWithExponentialBackoff class is shown in the following code example:

    public async Task<HttpResponseMessage> RetryOnExceptionAsync<TException>(Func<Task<HttpResponseMessage>> operation) where TException : Exception
    {
        HttpResponseMessage response;
        var backoff = new ExponentialBackoff(maxRetries, delayMilliseconds, maxDelayMilliseconds);
        while (true)
        {
            try
            {
                response = await operation();
                break;
            }
            catch (Exception ex) when (ex is TimeoutException || ex is TException)
            {
                Debug.WriteLine("Exception: " + ex.Message);
                await backoff.Delay();
            }
        }
        return response;
    }

This method is responsible for implementing the retry pattern - retrying failed operations with an exponential backoff up to a maximum number of retries. It uses an infinite loop to execute the operation that was specified as a lambda expression in the GetAsync method in the ResilientRequestProvider class. If the operation succeeds, the loop is exited and the response received from the web service is returned. However, if the operation fails due to a transient fault, that is a TimeoutException or an HttpRequestException, the operation is retried after a delay controlled by the ExponentialBackoff struct.

Obviously, the implementation of the RetryWithExponentialBackoff class could be tidied up so that it takes a dependency on an IBackoff type, which the ExponentialBackoff struct would implement. This would allow different backoff strategy implementations to easily be swapped in and out. However, the current implementation adequately demonstrates what I’m trying to show – retrying requests that failed due to transient faults.
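As a rough sketch of that tidy-up, the dependency on a backoff abstraction might look something like the following. Everything here is an assumption made for illustration: the shape of IBackoff, the FixedBackoff strategy, and the Retrier class are hypothetical and do not appear in the sample application.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical abstraction over a backoff strategy.
public interface IBackoff
{
    // Waits before the next retry, or throws when retries are exhausted.
    Task Delay();
}

// Hypothetical alternative strategy: the same delay before every retry.
public sealed class FixedBackoff : IBackoff
{
    readonly int maxRetries;
    readonly int delayMilliseconds;
    int retries;

    public FixedBackoff(int maxRetries, int delayMilliseconds)
    {
        this.maxRetries = maxRetries;
        this.delayMilliseconds = delayMilliseconds;
    }

    public Task Delay()
    {
        if (retries >= maxRetries)
            throw new TimeoutException($"{maxRetries} retry attempts made. Retries failed.");
        retries++;
        return Task.Delay(delayMilliseconds);
    }
}

// Hypothetical retry class that depends only on IBackoff.
public sealed class Retrier
{
    // A factory is used because each operation needs its own retry counter.
    readonly Func<IBackoff> backoffFactory;

    public Retrier(Func<IBackoff> backoffFactory)
    {
        this.backoffFactory = backoffFactory;
    }

    public async Task<TResult> RetryOnExceptionAsync<TException, TResult>(Func<Task<TResult>> operation)
        where TException : Exception
    {
        var backoff = backoffFactory();
        while (true)
        {
            try
            {
                return await operation();
            }
            catch (Exception ex) when (ex is TimeoutException || ex is TException)
            {
                // Throws TimeoutException once the strategy runs out of retries.
                await backoff.Delay();
            }
        }
    }
}
```

With this shape, swapping strategies would only mean passing a different factory, for example new Retrier(() => new FixedBackoff(3, 500)).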

ExponentialBackoff

The ExponentialBackoff struct has a constructor requiring three arguments:

    public ExponentialBackoff(int noOfRetries, int delay, int maxDelay)
    {
        maxRetries = noOfRetries;
        delayMilliseconds = delay;
        maxDelayMilliseconds = maxDelay;
        retries = 0;
        pow = 1;
    }

All three arguments must be specified when creating an instance of the struct, and they should be identical to the values of the arguments in the RetryWithExponentialBackoff class constructor.

The RetryOnExceptionAsync method in the RetryWithExponentialBackoff class invokes the Delay method in the ExponentialBackoff struct if a transient fault has occurred. The Delay method is shown in the following code example:

    public Task Delay()
    {
        if (retries < maxRetries)
        {
            retries++;
            pow = pow << 1;
        }
        else
        {
            throw new TimeoutException($"{maxRetries} retry attempts made. Retries failed.");
        }
        int delay = Math.Min(delayMilliseconds * (pow - 1) / 2, maxDelayMilliseconds);
        Debug.WriteLine($"Retry {retries} after {delay} milliseconds delay. Maximum delay is {maxDelayMilliseconds} milliseconds.");
        return Task.Delay(delay);
    }

This method implements a roughly exponential delay, up to a maximum number of milliseconds specified by the maxDelayMilliseconds variable, while ensuring that the maximum number of retry attempts isn’t exceeded by throwing a TimeoutException when the number of actual retries is equal to the maximum number of retries allowed.
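To make the arithmetic concrete, the delay calculation can be pulled out into a pure function. This is only a sketch that mirrors the expression in the Delay method: DelaySchedule and DelayForRetry are hypothetical names, and the intermediate long is an addition to avoid integer overflow.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class DelaySchedule
{
    // After the n-th retry, pow has been shifted left n times, so pow == 2^n
    // and the delay is delay * (2^n - 1) / 2, capped at the maximum delay.
    // Assumes retry is between 1 and 30 so that the shift fits in an int.
    public static int DelayForRetry(int retry, int delayMilliseconds, int maxDelayMilliseconds)
    {
        int pow = 1 << retry;
        long uncapped = (long)delayMilliseconds * (pow - 1) / 2;
        return (int)Math.Min(uncapped, maxDelayMilliseconds);
    }
}
```

With the default values of 200 and 2000 milliseconds, retries 1 to 5 give delays of 100, 300, 700, 1500 and 2000 milliseconds. Each delay is roughly double the previous one until the cap is reached, which is why the delay is described as roughly exponential.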

Running the Sample Application

The sample application, which can be found on GitHub, connects to a read-only REST service hosted by Xamarin, and it’s most likely that when running the sample the GET operation will succeed on first attempt. To observe the retry pattern in operation, change the RestUrl property in the Constants class to an address that doesn’t exist – this can be accomplished by adding a random character to the end of the existing string. Then run the application and observe the output in the output window in Visual Studio. You should see something like:

    Exception: 404 (Not Found)
    Retry 1 after 100 milliseconds delay. Maximum delay is 2000 milliseconds.
    Exception: 404 (Not Found)
    Retry 2 after 300 milliseconds delay. Maximum delay is 2000 milliseconds.
    Thread started: <Thread Pool> #10
    Exception: 404 (Not Found)
    Retry 3 after 700 milliseconds delay. Maximum delay is 2000 milliseconds.
    Exception: 404 (Not Found)
    Retry 4 after 1500 milliseconds delay. Maximum delay is 2000 milliseconds.
    Exception: 404 (Not Found)
    Retry 5 after 2000 milliseconds delay. Maximum delay is 2000 milliseconds.
    Exception: 404 (Not Found)
    Retry 6 after 2000 milliseconds delay. Maximum delay is 2000 milliseconds.
    Exception: 404 (Not Found)
    Retry 7 after 2000 milliseconds delay. Maximum delay is 2000 milliseconds.
    Exception: 404 (Not Found)
    Retry 8 after 2000 milliseconds delay. Maximum delay is 2000 milliseconds.
    Thread started: <Thread Pool> #11
    Thread started: <Thread Pool> #12
    Exception: 404 (Not Found)
    Retry 9 after 2000 milliseconds delay. Maximum delay is 2000 milliseconds.
    Exception: 404 (Not Found)
    Retry 10 after 2000 milliseconds delay. Maximum delay is 2000 milliseconds.
    Exception: 404 (Not Found)
    ERROR: 10 retry attempts made. Retries failed.

This shows the GET operation being retried 10 times, after a roughly exponentially increasing delay, up to a maximum of 2000 milliseconds. Remember that the number of retries, initial delay (in milliseconds), and maximum delay (in milliseconds) can be specified when creating the RetryWithExponentialBackoff instance. This allows the retry pattern to be customized to fit individual application requirements.

Summary

The retry pattern allows applications to retry a failing request to a remote service, after a suitable delay. Remote access requests that are repeated after a suitable delay are likely to succeed, if the fault in the remote service is transient.

This blog post has explained how to implement the retry pattern, with exponential backoff. The number of retries, initial delay, and maximum delay can all be specified, allowing the retry pattern to be customized to fit individual application requirements. The advantage of the approach presented here is that the retry pattern is implemented without requiring any library code, for those sensitive to bloating their application package size.

In my next blog post I’ll show how to re-implement the retry pattern using Polly.