Version: 2.6.0

Service Protection with Average Latency Feedback

Introduction

This policy detects traffic overloads and cascading failure build-up by comparing the real-time latency with its exponential moving average. A gradient controller calculates a proportional response to limit the accepted concurrency. The concurrency is reduced by a multiplicative factor when the service is overloaded, and increased by an additive factor while the service is not overloaded.
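
A rough sketch of the detection and the multiplicative step, assuming the default slope of -1 shown in the parameters below (the authoritative formulas are defined by the Gradient Controller and AdaptiveLoadScheduler references):

$$
\text{setpoint} = \text{EMA}(\text{latency}) \times \text{latency\_tolerance\_multiplier}
$$

$$
\text{gradient} = \operatorname{clamp}\!\left(\frac{\text{setpoint}}{\text{latency}},\ \text{min\_gradient},\ \text{max\_gradient}\right)
$$

When the current latency exceeds the setpoint, the gradient falls below 1 and the accepted concurrency is scaled down; while latency stays at or below the setpoint, the integral optimizer instead raises the allowed load additively on each evaluation cycle (the load_multiplier_linear_increment default of 0.025 below).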

At a high level, this policy works as follows:

  • Latency EMA-based overload detection: A Flux Meter is used to gather latency metrics from a service control point. The latency signal gets fed into an Exponential Moving Average (EMA) component to establish a long-term trend that can be compared to the current latency to detect overloads.
  • Gradient Controller: Set point latency and current latency signals are fed to the gradient controller that calculates the proportional response to adjust the accepted concurrency (Control Variable).
  • Integral Optimizer: When the service is detected to be in the normal state, an integral optimizer is used to additively increase the concurrency of the service in each execution cycle of the circuit. This design allows warming up a service from an initial inactive state. It also protects applications from sudden spikes in traffic, as it sets an upper bound on the concurrency allowed on a service in each execution cycle of the circuit based on the observed incoming concurrency.
  • Load Scheduler and Actuator: The Accepted Concurrency at the service is throttled by a weighted-fair queuing scheduler. The adjustments to accepted concurrency made by the gradient controller and optimizer logic are translated into a load multiplier that is synchronized with Aperture Agents through etcd. The load multiplier adjusts (increases or decreases) the token bucket fill rates based on the incoming concurrency observed at each Agent, as sketched below.
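
Reading that last point as a formula (an approximation for intuition, not the exact Agent implementation):

$$
\text{token\_bucket\_fill\_rate} \approx \text{load\_multiplier} \times \text{observed\_incoming\_rate}
$$

with max_load_multiplier (default 2) bounding how far the fill rate can rise above the observed incoming rate.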
info

See the reference documentation for the AdaptiveLoadScheduler component that is used within this blueprint.

info

See the tutorials on Basic Service Protection and Workload Prioritization for examples of this blueprint in use.

Configuration

Blueprint name: policies/service-protection/average-latency

Parameters

policy

Parameter: policy.components
Description: List of additional circuit components.
Type: Array of Object (aperture.spec.v1.Component)
Default Value: []

Parameter: policy.evaluation_interval
Description: The interval between successive evaluations of the Circuit.
Type: string
Default Value: 10s

Parameter: policy.policy_name
Description: Name of the policy.
Type: string
Default Value: __REQUIRED_FIELD__

Parameter: policy.resources
Description: Additional resources.
Type: Object (aperture.spec.v1.Resources)
Default Value:
  flow_control:
    classifiers: []
policy.service_protection_core

Parameter: policy.service_protection_core.adaptive_load_scheduler
Description: Parameters for the Adaptive Load Scheduler.
Type: Object (aperture.spec.v1.AdaptiveLoadSchedulerParameters)
Default Value:
  alerter:
    alert_name: Load Throttling Event
  gradient:
    max_gradient: 1
    min_gradient: 0.1
    slope: -1
  load_multiplier_linear_increment: 0.025
  load_scheduler:
    selectors:
      - control_point: __REQUIRED_FIELD__
        service: __REQUIRED_FIELD__
  max_load_multiplier: 2

Parameter: policy.service_protection_core.dry_run
Description: Default configuration for setting dry run mode on the Load Scheduler. In dry run mode, the Load Scheduler acts as a passthrough and does not throttle flows. This config can be updated at runtime without restarting the policy.
Type: Boolean
Default Value: false

Parameter: policy.service_protection_core.overload_confirmations
Description: List of overload confirmation criteria. The Load Scheduler can throttle flows only when all of the specified overload confirmation criteria are met.
Type: Array of Object (overload_confirmation)
Default Value: []
policy.latency_baseliner

Parameter: policy.latency_baseliner.flux_meter
Description: Flux Meter defines the scope of latency measurements.
Type: Object (aperture.spec.v1.FluxMeter)
Default Value:
  selectors:
    - control_point: __REQUIRED_FIELD__
      service: __REQUIRED_FIELD__

Parameter: policy.latency_baseliner.ema
Description: EMA parameters.
Type: Object (aperture.spec.v1.EMAParameters)
Default Value:
  correction_factor_on_max_envelope_violation: 0.95
  ema_window: 1500s
  warmup_window: 60s

Parameter: policy.latency_baseliner.latency_ema_limit_multiplier
Description: The current latency value is multiplied by this factor to calculate the maximum envelope of the latency EMA.
Type: Number (double)
Default Value: 2

Parameter: policy.latency_baseliner.latency_tolerance_multiplier
Description: Tolerance factor beyond which the service is considered to be in an overloaded state. For example, if the latency EMA is 50 ms and the tolerance multiplier is 1.1, the service is considered overloaded when the current latency exceeds 55 ms.
Type: Number (double)
Default Value: __REQUIRED_FIELD__
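
Putting the required policy fields above together, a minimal values file for this blueprint might look like the following sketch. The policy name, control point, and service names are hypothetical placeholders; adjust them to your own deployment.

```yaml
policy:
  policy_name: checkout-service-protection          # hypothetical policy name
  latency_baseliner:
    # Overload is confirmed when current latency exceeds EMA * 1.1.
    latency_tolerance_multiplier: 1.1
    flux_meter:
      selectors:
        - control_point: ingress                     # hypothetical control point
          service: checkout.default.svc.cluster.local  # hypothetical service
  service_protection_core:
    adaptive_load_scheduler:
      load_scheduler:
        selectors:
          - control_point: ingress
            service: checkout.default.svc.cluster.local
```

All other parameters keep the defaults documented above unless they are overridden in the same file.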

dashboard

Parameter: dashboard.extra_filters
Description: Additional filters to pass to each query to the Grafana datasource.
Type: Object (map[string]string)
Default Value: {}

Parameter: dashboard.refresh_interval
Description: Refresh interval for dashboard panels.
Type: string
Default Value: 15s

Parameter: dashboard.time_from
Description: Start ("from") of the dashboard's default time range.
Type: string
Default Value: now-15m

Parameter: dashboard.time_to
Description: End ("to") of the dashboard's default time range.
Type: string
Default Value: now

Parameter: dashboard.title
Description: Name of the main dashboard.
Type: string
Default Value: Aperture Service Protection

dashboard.datasource

Parameter: dashboard.datasource.filter_regex
Description: Datasource filter regex.
Type: string
Default Value: (empty)

Parameter: dashboard.datasource.name
Description: Datasource name.
Type: string
Default Value: $datasource

Schemas

overload_confirmation

Parameter: operator
Description: The operator for the overload confirmation criteria. One of: gt, lt, gte, lte, eq, neq.
Type: string
Default Value: (empty)

Parameter: query_string
Description: The Prometheus query to be run. Must return a scalar or a vector with a single element.
Type: string
Default Value: (empty)

Parameter: threshold
Description: The threshold for the overload confirmation criteria.
Type: Number (double)
Default Value: (empty)
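
For illustration, an overload confirmation entry in the values file could look like the sketch below. The PromQL query, metric name, and threshold are hypothetical; any query that returns a scalar or a single-element vector works.

```yaml
policy:
  service_protection_core:
    overload_confirmations:
      # Hypothetical criterion: only allow throttling while average CPU
      # utilization of the protected service is at or above 90%.
      - query_string: avg(cpu_utilization_ratio{service="checkout"})
        threshold: 0.9
        operator: gte
```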

Dynamic Configuration

note

The following configuration parameters can be dynamically configured at runtime, without reloading the policy.

Parameters

Parameter: dry_run
Description: Dynamic configuration for setting dry run mode at runtime without restarting this policy. In dry run mode, the scheduler acts as a passthrough for all flows and does not queue them. It is useful for observing the behavior of the Load Scheduler without disrupting any real traffic.
Type: Boolean
Default Value: __REQUIRED_FIELD__
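
Since dry_run is the only dynamic parameter documented here, a dynamic configuration payload for this policy can be as small as the sketch below. How the payload is applied (for example, through the Aperture controller or CLI) depends on your setup and is not covered on this page.

```yaml
# Dynamic configuration for the policy: enable dry run mode so the
# Load Scheduler observes traffic without throttling it.
dry_run: true
```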