A Primer to Building Retry in Automated Tests

Dec 5, 2023

Ways to write resilient and fail-safe tests — For QA engineers

Introduction —

This blog is about writing tests that are resilient and fail-safe. If I had to name one lesson from my experience in testing software products, it’s that many things can go wrong with the product under test while the tests are running. Among the most frequent are transient errors: network glitches, operation time-outs, and heavy load on downstream services, to name a few. Such errors need to be handled systematically, because they typically resolve by themselves when the same operation is attempted again.

Flip sides —

It’s worth mentioning that every testing framework I have come across primarily emphasises re-running failed and skipped tests in their entirety, in the expectation that every test that failed due to product or test flakiness will pass in a subsequent run. To say the least, that reasoning is a bit flawed.

If I pause and rethink, it’s no different from flipping a coin: it ends up either heads or tails. The number of failed tests may come down, or even go up, with each re-run, but there is no guarantee that the flaky tests will pass every time. I realised after a while that it was mere coincidence when they succeeded, and that they remained susceptible to failure; what these tests lacked was resilience.

In other words, I still noticed flakiness with suite-level retry in place. There are other, less tangible drawbacks as well: this method consumed quite a lot of test infrastructure resources and delayed the feedback loop whenever multiple re-runs were involved, not to mention the adverse impact it has on confidence in the test results.

On a side note, CucumberJS is one framework where this kind of retry can be configured: its --retry option re-runs a failing scenario up to the given number of times.

Fix the weak spot —

Let’s take a deep dive into how the aforementioned transient exceptions and failures can be handled by placing configurable retry blocks at the test step level instead of at the suite level. This is beneficial because it immediately retries only the failed step at the moment it fails, which avoids re-running whole scenarios at a later point in time. In addition, since the retry is configurable per step, each step can be handled according to its nature, for example by choosing how many times it is retried or by placing a time delay before each retry.
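Before bringing in a library, the idea can be sketched in plain Java. The following is only a minimal, hand-rolled illustration of a step-level retry with a retry count, a delay, and a list of exception types treated as transient. The names StepRetrySketch, Step and retryStep are hypothetical and exist only for this sketch.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.TimeoutException;

/** Hand-rolled, library-free illustration of a step-level retry (hypothetical names). */
class StepRetrySketch {

  /** A step body that may throw a checked exception. */
  interface Step {
    void run() throws Exception;
  }

  /**
   * Runs the step once. If it throws one of the retryable exception types, the step
   * is retried up to maxRetries more times, sleeping for delay before each retry.
   * Returns the total number of attempts that were made.
   */
  static int retryStep(Step step, int maxRetries, Duration delay,
      List<Class<? extends Exception>> retryable) throws Exception {
    int attempt = 0;
    while (true) {
      attempt++;
      try {
        step.run();
        return attempt;
      } catch (Exception e) {
        boolean transientError = retryable.stream().anyMatch(type -> type.isInstance(e));
        // Give up on non-transient errors, or once the retry budget is spent.
        if (!transientError || attempt > maxRetries) {
          throw e;
        }
        Thread.sleep(delay.toMillis());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Simulated flaky step: times out twice, then succeeds.
    Step flaky = () -> {
      if (++calls[0] < 3) {
        throw new TimeoutException("simulated time-out");
      }
    };
    int attempts = retryStep(flaky, 3, Duration.ofMillis(10),
        List.of(TimeoutException.class));
    System.out.println("step passed after " + attempts + " attempts");
  }
}
```

Only the failing step is repeated here, and an exception that is not in the retryable list fails the step immediately instead of being retried.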

In this blog, I use the Java-based Failsafe library along with Selenium for illustration purposes; however, the principles remain the same when implemented in other languages and frameworks.

implementation 'dev.failsafe:failsafe:3.3.2'

Firstly, there is a type of step that merely performs a sequence of actions and causes some state change in the product being tested; it doesn’t verify the results.

User navigates to create new customer account

Secondly, there is another type of step that may or may not cause a state change in the product, but it verifies that the results of the state change match the expected ones.

User creates an account and verifies the message: ‘Thank you for …’

To implement these cases in Java, I need two functional interfaces: a Runnable for the first and a Supplier for the second. The code below uses Failsafe’s CheckedRunnable and CheckedSupplier, which additionally allow the lambda to throw checked exceptions.

In the following code illustration, the method retryStep accepts a CheckedRunnable along with two other parameters:

runnable: the lambda expression to be retried in the context of a scenario step. It doesn’t return a value to the calling step.
maxRetry: the number of times a specific step is retried. Note that this excludes the first attempt.
delaySeconds: the wait time, in seconds, before each retry.

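// Retries a step that returns nothing.
// policyBuilder, exceptionsList, log and sf are defined elsewhere and not shown here.
default void retryStep(CheckedRunnable runnable, int maxRetry, int delaySeconds) {
  RetryPolicy<Object> policy = policyBuilder.withMaxRetries(maxRetry)
      .withDelay(Duration.ofSeconds(delaySeconds))
      // Retry only when one of the listed exception types is thrown.
      .handle(exceptionsList)
      .onRetry(e -> log.warn(sf("attempting retry#: %d", e.getAttemptCount())))
      // Logged once, after the step has failed and no retries remain.
      .onFailure(
          e -> log.error(sf("attempts: %d have failed", e.getExecutionCount())))
      .build();
  Failsafe.with(policy).run(runnable);
}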

Furthermore, in the following example, the method getWithRetryStep accepts a CheckedSupplier in place of a CheckedRunnable, along with the same two parameters:

supplier: the lambda expression to be retried in the context of a scenario step. The difference is that it returns a value to the calling step.
maxRetry: the number of times a specific step is retried. Note that this excludes the first attempt.
delaySeconds: the wait time, in seconds, before each retry.

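// Retries a step that returns a value to the caller.
// policyBuilder, exceptionsList, log and sf are defined elsewhere and not shown here.
default Object getWithRetryStep(CheckedSupplier<Object> supplier, int maxRetry,
    int delaySeconds) {
  RetryPolicy<Object> policy = policyBuilder.withMaxRetries(maxRetry)
      .withDelay(Duration.ofSeconds(delaySeconds))
      // Retry when one of the listed exception types is thrown...
      .handle(exceptionsList)
      // ...or when the supplier returns null.
      .handleResult(null)
      .onRetry(e -> log.warn(sf("attempting retry#: %d", e.getAttemptCount())))
      .onFailure(
          e -> log.error(sf("attempts: %d have failed", e.getExecutionCount())))
      .build();
  return Failsafe.with(policy).get(supplier);
}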
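To make the two retry conditions in this method concrete (a thrown exception, or an unwanted result such as null), here is a library-free sketch of the same idea. It is not how Failsafe is implemented, and for brevity it retries on any exception rather than on a specific list. The names ResultRetrySketch, getWithRetry and shouldRetry are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Library-free illustration of retrying a value-returning step (hypothetical names). */
class ResultRetrySketch {

  /**
   * Calls the supplier and retries when it throws, or when its result is rejected by
   * shouldRetry, for up to maxRetries additional attempts with a fixed delay between
   * them. When the retries are exhausted, the last exception is rethrown or, if the
   * last attempt produced a value, that rejected value is returned to the caller.
   */
  static <T> T getWithRetry(Callable<T> supplier, int maxRetries, long delayMillis,
      Predicate<T> shouldRetry) throws Exception {
    Exception lastError = null;
    T lastResult = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      if (attempt > 0) {
        Thread.sleep(delayMillis); // wait before each retry, not before the first attempt
      }
      try {
        lastResult = supplier.call();
        lastError = null;
        if (!shouldRetry.test(lastResult)) {
          return lastResult; // acceptable result, stop retrying
        }
      } catch (Exception e) {
        lastError = e;
      }
    }
    if (lastError != null) {
      throw lastError;
    }
    return lastResult;
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Simulated step result: null on the first two calls, a message afterwards.
    String message = getWithRetry(
        () -> ++calls[0] < 3 ? null : "Thank you",
        3, 10, result -> result == null);
    System.out.println(message + " (after " + calls[0] + " calls)");
  }
}
```

In this sketch, a predicate such as result -> !result would also retry a step that returns false, which a null check alone does not cover.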

Now, I wrap the test step implementation in a lambda and pass it to the method retryStep, along with the two retry configuration parameters.

@When("User navigates to create new customer account")
public void userNavigatesToCreateNewCustomerAccount() {
  retryStep(() -> homePage.navigateToNewAccountPage(),
      MAX_RETRIES_DEFAULT,
      MAX_DELAY_SECONDS_DEFAULT);
}

The following case is slightly different: the result returned by the method getWithRetryStep is asserted against the expected value.

@Then("User creates an account and verifies the message: {string}")
public void userCreatesAnAccountAndVerifiesTheMessage(String alertText) {
  // The whole lambda, including the account creation, is re-run on each retry.
  boolean isAlertMessageDisplayed = (boolean) getWithRetryStep(() -> {
        createNewAccountPage.createAnAccount();
        return createNewAccountPage.IsAlertMessageDisplayed(alertText);
      },
      MAX_RETRIES_DEFAULT,
      MAX_DELAY_SECONDS_DEFAULT
  );
  Assertions.assertThat(isAlertMessageDisplayed)
      .as("Assert that alert is displayed")
      .isTrue();
}

In a nutshell, it’s better to have two variations of the retry method: one that doesn’t return a value and another that returns a value from the lambda to the calling method. Besides Failsafe, other libraries worth considering are resilience4j and spring-retry.

I hope you find this useful. All the code examples used in this blog are available in the accompanying repository.

Thank you for reading.

Veera.

Written by Veera.