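REST calls and other synchronous requests are the usual way servers talk to each other. Without a distributed transaction mechanism to rely on, developers have to retry by hand and roll back the business operation after several failed attempts. Retrying matters, but done carelessly it can congest the network, and bringing in a circuit breaker can be too heavyweight. A better retry algorithm may be one way out:
When working on our applications we inevitably depend on remote resources, and those resources are prone to failure. Failures of this kind that last only a short while are called transient (temporary) failures. How can retries help us ride out some of them?
Transient failures
A transient failure is one that arises temporarily and clears up within a short time. Because of network congestion or high system load, a remote resource may be only briefly unavailable, and the situation may well correct itself quickly. In that case we should wait a while and fetch the resource again; only if the retries also fail do we treat it as a permanent failure.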
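The normal procedure might look like this:
1. We fetch the resource.
2. If we get it, we carry on and process it.
3. If not, we wait a while and fetch it again, and keep doing so until we have the resource.
4. We obviously cannot wait forever, so after a certain number of retries we consider it a permanent failure and handle it appropriately in our code.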
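The four steps above can be sketched as a small helper. This is only a minimal illustration: the names fetch_with_retry and PermanentFailure, the convention that a missing resource is signalled by None, and the default values are all assumptions made for the example, not part of the original article.

```python
import time


class PermanentFailure(Exception):
    """Hypothetical error raised once every retry has been used up."""


def fetch_with_retry(fetch, max_retry=3, wait_seconds=1.0, sleep=time.sleep):
    """Call fetch() until it returns something other than None.

    Assumption: fetch() returns None when the resource is not available yet.
    """
    for attempt in range(max_retry):
        resource = fetch()               # step 1: try to get the resource
        if resource is not None:
            return resource              # step 2: we have it, carry on
        if attempt < max_retry - 1:
            sleep(wait_seconds)          # step 3: wait, then try again
    # step 4: stop waiting and let the caller handle the permanent failure
    raise PermanentFailure("resource unavailable after %d attempts" % max_retry)
```

Passing sleep as a parameter is a design choice for the sketch: it lets a test replace the real wait with a no-op.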
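One example of a transient failure is API throttling. Many third-party APIs (including the Amazon Web Services APIs) limit how many requests per second can be made for a given task. Route 53, for example, has a limit of 5 requests per second, although it lets you batch multiple operations into one request. When you hit the limit you will most likely receive some kind of Rate Limit Exceeded error, indicating that the allowed request threshold has been exceeded. What do you do then? Treat it as a failure and handle it in your application? In practice you can wait a while and retry the operation. There are many strategies for "backing off" for some time and retrying after the delay.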
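As a rough sketch of that idea, the helper below retries only when the call is throttled and lets every other error through. RateLimitExceeded here is a hypothetical exception class defined for the example; a real client library will raise its own error type, which you would catch instead.

```python
import time


class RateLimitExceeded(Exception):
    """Hypothetical stand-in for the throttling error a real client raises."""


def call_with_rate_limit_retry(operation, max_retry=5, delay_seconds=1.0,
                               sleep=time.sleep):
    """Retry operation() when it is throttled; other errors propagate at once."""
    for attempt in range(max_retry):
        try:
            return operation()
        except RateLimitExceeded:
            if attempt == max_retry - 1:
                raise                    # out of retries: surface the failure
            sleep(delay_seconds)         # wait before the next attempt
```

Catching only the throttling error keeps genuinely permanent problems, such as a malformed request, from being retried pointlessly.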
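Backoff strategies
Consider a function do_something() whose code performs some operation that is prone to transient failures. We will use it in the examples to walk through the strategies. The source code is on GitHub.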
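The article does not show do_something() itself. To run the examples locally, one possible stand-in is a function that fails at random; the failure_rate parameter and its default are assumptions made purely for experimentation.

```python
import random


def do_something(failure_rate: float = 0.7) -> bool:
    """Hypothetical stand-in for a flaky remote call.

    Returns True on success and False on a simulated transient failure.
    random.random() is in [0.0, 1.0), so failure_rate=1.0 always fails
    and failure_rate=0.0 always succeeds.
    """
    return random.random() >= failure_rate
```

Setting the default failure_rate to 1.0 forces every attempt to fail, which is handy for measuring the worst-case delay of a strategy.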
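1. No backoff
This is the default approach: if the call fails, retry immediately without waiting. Remember to cap the number of attempts with a maximum retry count, otherwise this could go on forever. The code might look like this:

    def no_backoff(max_retry: int):
        """ No backoff. If things fail, retry immediately. """
        counter = 0
        while counter < max_retry:
            response = do_something()
            if response:
                return True
            else:
                print('> No Backoff > Retry number: %s' % counter)
            counter += 1
        return False  # every attempt failed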
2. Constant backoff
Constant backoff adds a fixed delay after each attempt. In this case we wait 1 second before trying again.

    from time import sleep  # used by this and the following strategies

    def constant_backoff(max_retry: int):
        """ Constant backoff. If things fail, retry after a fixed amount of delay. """
        total_delay = 0
        counter = 0
        while counter < max_retry:
            response = do_something()
            if response:
                return True
            else:
                sleepy_time = 1  # same delay after every failure
                print('> Constant Backoff > Retry number: %s, sleeping for %s seconds.' % (counter, sleepy_time))
                total_delay += sleepy_time
                sleep(sleepy_time)
            counter += 1
        print('> Constant Backoff > Total delay: %s seconds.' % total_delay)
        return False  # every attempt failed
3. Linear backoff
With linear backoff the delay grows with each attempt, following a straight line. Here the delay equals the retry counter, so the first retry happens with no wait.

    def linear_backoff(max_retry: int):
        """ Linear backoff. If things fail, retry with linearly increasing delays. """
        total_delay = 0
        counter = 0
        while counter < max_retry:
            response = do_something()
            if response:
                return True
            else:
                sleepy_time = counter  # 0, 1, 2, 3, ...
                print('> Linear Backoff > Retry number: %s, sleeping for %s seconds.' % (counter, sleepy_time))
                total_delay += sleepy_time
                sleep(sleepy_time)
            counter += 1
        print('> Linear Backoff > Total delay: %s seconds.' % total_delay)
        return False  # every attempt failed
4. Fibonacci backoff
We can also take the delay from the Fibonacci series: the wait after each failure is the Fibonacci number at the index of the retry counter.

    def fibonacci_backoff(max_retry: int):
        """ Fibonacci backoff. If things fail, retry with delays increasing by fibonacci numbers. """
        total_delay = 0
        counter = 0
        while counter < max_retry:
            response = do_something()
            if response:
                return True
            else:
                # get_fib(n) is expected to return the n-th Fibonacci number: 0, 1, 1, 2, 3, ...
                sleepy_time = get_fib(counter)
                print('> Fibonacci Backoff > Retry number: %s, sleeping for %s seconds.' % (counter, sleepy_time))
                total_delay += sleepy_time
                sleep(sleepy_time)
            counter += 1
        print('> Fibonacci Backoff > Total delay: %s seconds.' % total_delay)
        return False  # every attempt failed
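The helper get_fib() is not shown in the article. A minimal iterative version could look like the sketch below. The indexing, with get_fib(0) equal to 0 and get_fib(1) equal to 1, is an assumption, chosen because it reproduces the Fibonacci totals reported in the experiment output later in the article (for example 88 seconds for 10 retries).

```python
def get_fib(n: int) -> int:
    """Return the n-th Fibonacci number, with get_fib(0) == 0 and get_fib(1) == 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b   # slide the window one step along the series
    return a
```

An iterative loop is used rather than naive recursion so that larger retry counters stay cheap to compute.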
5. Quadratic backoff
The delay can also follow a quadratic curve.

    def quadratic_backoff(max_retry: int):
        """ Quadratic backoff. If things fail, retry with quadratically increasing delays. """
        total_delay = 0
        counter = 0
        while counter < max_retry:
            response = do_something()
            if response:
                return True
            else:
                sleepy_time = counter ** 2  # 0, 1, 4, 9, ...
                print('> Quadratic Backoff > Retry number: %s, sleeping for %s seconds.' % (counter, sleepy_time))
                total_delay += sleepy_time
                sleep(sleepy_time)
            counter += 1
        print('> Quadratic Backoff > Total delay: %s seconds.' % total_delay)
        return False  # every attempt failed
6. Exponential backoff
The delay can also follow an exponential curve, as in the example below.

    def exponential_backoff(max_retry: int):
        """ Exponential backoff. If things fail, retry with exponentially increasing delays. """
        total_delay = 0
        counter = 0
        while counter < max_retry:
            response = do_something()
            if response:
                return True
            else:
                sleepy_time = 2 ** counter  # 1, 2, 4, 8, ...
                print('> Exponential Backoff > Retry number: %s, sleeping for %s seconds.' % (counter, sleepy_time))
                total_delay += sleepy_time
                sleep(sleepy_time)
            counter += 1
        print('> Exponential Backoff > Total delay: %s seconds.' % total_delay)
        return False  # every attempt failed
7. Polynomial backoff
The delay can also follow a higher-degree polynomial. The example uses the cube of the retry counter.

    def polynomial_backoff(max_retry: int):
        """ Polynomial backoff. If things fail, retry with polynomially increasing delays. """
        total_delay = 0
        counter = 0
        while counter < max_retry:
            response = do_something()
            if response:
                return True
            else:
                sleepy_time = counter ** 3  # 0, 1, 8, 27, ...
                print('> Polynomial Backoff > Retry number: %s, sleeping for %s seconds.' % (counter, sleepy_time))
                total_delay += sleepy_time
                sleep(sleepy_time)
            counter += 1
        print('> Polynomial Backoff > Total delay: %s seconds.' % total_delay)
        return False  # every attempt failed
8. Total delay
Which strategy to use depends on the kind of delay you want. Below are runs of the algorithms above and the total delay, in seconds, that each one causes for various maximum retry counts. The totals correspond to runs in which every attempt fails.

    Starting experiments with maximum 1 retries.
    > Constant Backoff > Total delay: 1 seconds.
    > Linear Backoff > Total delay: 0 seconds.
    > Fibonacci Backoff > Total delay: 0 seconds.
    > Quadratic Backoff > Total delay: 0 seconds.
    > Exponential Backoff > Total delay: 1 seconds.
    > Polynomial Backoff > Total delay: 0 seconds.

    Starting experiments with maximum 3 retries.
    > Constant Backoff > Total delay: 3 seconds.
    > Linear Backoff > Total delay: 3 seconds.
    > Fibonacci Backoff > Total delay: 2 seconds.
    > Quadratic Backoff > Total delay: 5 seconds.
    > Exponential Backoff > Total delay: 7 seconds.
    > Polynomial Backoff > Total delay: 9 seconds.

    Starting experiments with maximum 5 retries.
    > Constant Backoff > Total delay: 5 seconds.
    > Linear Backoff > Total delay: 10 seconds.
    > Fibonacci Backoff > Total delay: 7 seconds.
    > Quadratic Backoff > Total delay: 30 seconds.
    > Exponential Backoff > Total delay: 31 seconds.
    > Polynomial Backoff > Total delay: 100 seconds.

    Starting experiments with maximum 10 retries.
    > Constant Backoff > Total delay: 10 seconds.
    > Linear Backoff > Total delay: 45 seconds.
    > Fibonacci Backoff > Total delay: 88 seconds.
    > Quadratic Backoff > Total delay: 285 seconds.
    > Exponential Backoff > Total delay: 1023 seconds.
    > Polynomial Backoff > Total delay: 2025 seconds.

    Starting experiments with maximum 20 retries.
    > Constant Backoff > Total delay: 20 seconds.
    > Linear Backoff > Total delay: 190 seconds.
    > Fibonacci Backoff > Total delay: 10945 seconds.
    > Quadratic Backoff > Total delay: 2470 seconds.
    > Exponential Backoff > Total delay: 1048575 seconds.
    > Polynomial Backoff > Total delay: 36100 seconds.
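These totals can be cross-checked without actually sleeping, by summing each strategy's per-retry delay over the retry counter. The sketch below assumes every attempt fails and that the time taken by do_something() itself is ignored. The names STRATEGIES and total_delay are made up for this example.

```python
def get_fib(n):
    """n-th Fibonacci number with get_fib(0) == 0 (assumed indexing)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


# Delay in seconds after the failure at retry counter k, per strategy.
STRATEGIES = {
    "Constant": lambda k: 1,
    "Linear": lambda k: k,
    "Fibonacci": get_fib,
    "Quadratic": lambda k: k ** 2,
    "Exponential": lambda k: 2 ** k,
    "Polynomial": lambda k: k ** 3,
}


def total_delay(delay_for, max_retry):
    """Worst-case total sleep time when all max_retry attempts fail."""
    return sum(delay_for(k) for k in range(max_retry))


for max_retry in (1, 3, 5, 10, 20):
    print("Maximum %d retries:" % max_retry)
    for name, delay_for in STRATEGIES.items():
        print("  %s: %d seconds" % (name, total_delay(delay_for, max_retry)))
```

Computing the sums this way makes it cheap to compare strategies for any retry count before committing to one.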
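9. Capped/truncated delays
When you use an algorithm to add delays to retry logic, remember to bound those delays. In our examples we limited the number of retries, for instance to 10, which may be enough. Keep in mind that the total delay we end up adding depends on how long do_something() itself takes to run and reach the next iteration, as well as on the algorithm we use. With linear backoff, 10 retries add 45 seconds, but with the polynomial strategy they add more than 30 minutes (2025 seconds). It is important to run the numbers on actual delays before choosing an approach.
Also, rather than limiting only the number of retries, it is better to cap each delay at a maximum allowed value, so that the delays level off after a while. You can check the sleepy_time variable and make sure it is never set above a threshold you specify.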
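A capped delay could be sketched as follows, using exponential growth as the underlying curve. The 30-second ceiling and the names MAX_DELAY_SECONDS and capped_exponential_delay are assumptions for illustration; choose a threshold that suits your application.

```python
MAX_DELAY_SECONDS = 30  # assumed ceiling, not a value from the article


def capped_exponential_delay(counter: int, max_delay: int = MAX_DELAY_SECONDS) -> int:
    """Exponential delay that levels off once it reaches max_delay."""
    return min(max_delay, 2 ** counter)


# Delays for the first seven retries: they double until the cap takes over.
print([capped_exponential_delay(k) for k in range(7)])
```

In the backoff functions above, the same effect comes from wrapping the sleepy_time computation in min(...) before sleeping.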
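The case for jitter/randomness
If, even after applying the approaches above, you find multiple clients contending for the same resource and running into transient failures, consider adding some randomness to sleepy_time to spread the calls out over time.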
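One way to add that randomness is to pick the delay uniformly between zero and the capped backoff value, a variant often called "full jitter". The sketch below is only an illustration: the function name, the exponential base curve, and the 30-second cap are assumptions, and other jitter schemes exist.

```python
import random


def jittered_delay(counter: int, max_delay: float = 30.0, rng=random) -> float:
    """Random delay in [0, min(max_delay, 2 ** counter)] seconds.

    rng can be the random module or a random.Random instance (useful for
    reproducible tests).
    """
    ceiling = min(max_delay, 2 ** counter)
    return rng.uniform(0, ceiling)
```

Because each client draws its own random delay, clients that failed at the same moment are unlikely to retry at the same moment.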
Retry Strategies for Transient Failures - DEV Comm