Version: 2.7.0

Detecting Overload

Overview

Monitoring the health of a service is a critical aspect of ensuring reliable operations. This policy provides a mechanism for detecting an overload state of a service and sending alerts using Aperture's declarative policy language. The policy creates a circuit that models the typical latency behavior of the service using an exponential moving average (EMA). This automated learning of the normal latency threshold for each service reduces the need for manual tuning of alert policies.

One reliable metric for detecting overload is the latency of service requests. In Aperture, latency can be reported using a Flux Meter.
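The PromQL query used later in this policy derives average latency from two Flux Meter counters: `flux_meter_sum` (total request duration) and `flux_meter_count` (number of requests). Dividing the increase of the first by the increase of the second over a window yields the mean latency for that window. A minimal Python sketch of that calculation (the counter names come from the policy below; the sample readings are hypothetical):

```python
def average_latency(sum_start, sum_end, count_start, count_end):
    """Mean latency over a window, computed from two monotonically
    increasing counters: total request duration (flux_meter_sum) and
    request count (flux_meter_count)."""
    duration_increase = sum_end - sum_start   # increase(flux_meter_sum[window])
    count_increase = count_end - count_start  # increase(flux_meter_count[window])
    if count_increase == 0:
        return None  # no traffic in the window; PromQL would return no data
    return duration_increase / count_increase

# Hypothetical counter readings taken 30s apart:
# 12s of additional request duration spread over 240 requests.
print(average_latency(100.0, 112.0, 2000, 2240))  # 0.05 -> 50ms mean latency
```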

tip

To prevent the mixing of latency measurements across different workloads, it's recommended to apply the Flux Meter to a single type of workload. For instance, if a service has both Select and Insert API calls, it is advised to measure the latency of only one of these workloads using a Flux Meter. Refer to the Selector documentation for guidance on applying the Flux Meter to a subset of API calls for a service.

Configuration

In this example, the EMA of latency of service1-demo-app.demoapp.svc.cluster.local is computed from metrics reported by the Flux Meter, which are obtained periodically through a PromQL query. The EMA of latency is then multiplied by a tolerance factor to calculate the setpoint latency, which serves as the threshold for detecting an overloaded state: if the real-time latency of the service exceeds this setpoint (which is based on the long-term EMA), the service is considered overloaded.

apiVersion: fluxninja.com/v1alpha1
kind: Policy
metadata:
  labels:
    fluxninja.com/validate: "true"
  name: signal-processing
spec:
  circuit:
    components:
      - query:
          promql:
            evaluation_interval: 10s
            out_ports:
              output:
                signal_name: LATENCY
            query_string:
              sum(increase(flux_meter_sum{decision_type!="DECISION_TYPE_REJECTED",
              flow_status="OK",
              flux_meter_name="test"}[30s]))/sum(increase(flux_meter_count{decision_type!="DECISION_TYPE_REJECTED",
              flow_status="OK", flux_meter_name="test"}[30s]))
      - ema:
          in_ports:
            input:
              signal_name: LATENCY
          out_ports:
            output:
              signal_name: LATENCY_EMA
          parameters:
            ema_window: 1500s
            warmup_window: 10s
      - arithmetic_combinator:
          in_ports:
            lhs:
              signal_name: LATENCY_EMA
            rhs:
              constant_signal:
                value: 1.1
          operator: mul
          out_ports:
            output:
              signal_name: LATENCY_SETPOINT
      - decider:
          in_ports:
            lhs:
              signal_name: LATENCY
            rhs:
              signal_name: LATENCY_SETPOINT
          operator: gt
          out_ports:
            output:
              signal_name: IS_OVERLOAD_SWITCH
      - alerter:
          in_ports:
            signal:
              signal_name: IS_OVERLOAD_SWITCH
          parameters:
            alert_name: overload
            severity: crit
    evaluation_interval: 10s
  resources:
    flow_control:
      flux_meters:
        test:
          selectors:
            - agent_group: default
              control_point: ingress
              service: service1-demo-app.demoapp.svc.cluster.local
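The behavior of the circuit above can be approximated outside Aperture. The sketch below is plain Python, not Aperture's implementation: each latency sample stands in for one evaluation tick, the EMA is updated and multiplied by the 1.1 tolerance to form the setpoint, and the overload switch trips when the current latency exceeds it. The smoothing-factor formula and the EMA seeding are assumptions for illustration; `window_samples=150` mirrors the policy's 1500s EMA window sampled every 10s.

```python
def detect_overload(latencies, window_samples=150, tolerance=1.1):
    """Yield (latency, ema, setpoint, is_overload) for each latency sample.

    The alpha below is the standard EMA smoothing factor; Aperture's exact
    formula and its warmup handling may differ.
    """
    alpha = 2 / (window_samples + 1)
    ema = None
    for latency in latencies:
        # Seed the EMA with the first observation, then smooth.
        ema = latency if ema is None else alpha * latency + (1 - alpha) * ema
        setpoint = ema * tolerance  # LATENCY_SETPOINT = LATENCY_EMA * 1.1
        yield latency, ema, setpoint, latency > setpoint  # IS_OVERLOAD_SWITCH

# Steady 50ms latency followed by a 200ms spike: because the EMA moves
# slowly, the spike crosses the setpoint and flips the overload switch.
samples = [0.05] * 20 + [0.20]
*_, (latency, ema, setpoint, overloaded) = detect_overload(samples)
print(overloaded)  # True
```

Because the setpoint tracks the long-term EMA, a gradual latency drift raises the threshold along with the signal, while a sudden departure from recent behavior is what triggers the alert.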

Circuit Diagram

flowchart LR
  subgraph root.0[<center>PromQL<br/>every 10s</center>]
    subgraph root.0_outports[ ]
      style root.0_outports fill:none,stroke:none
      root.0output[output]
    end
  end
  subgraph root.1[<center>EMA<br/>win: 150</center>]
    subgraph root.1_inports[ ]
      style root.1_inports fill:none,stroke:none
      root.1input[input]
    end
    subgraph root.1_outports[ ]
      style root.1_outports fill:none,stroke:none
      root.1output[output]
    end
  end
  subgraph root.2[<center>ArithmeticCombinator<br/>mul</center>]
    subgraph root.2_inports[ ]
      style root.2_inports fill:none,stroke:none
      root.2lhs[lhs]
      root.2rhs[rhs]
    end
    subgraph root.2_outports[ ]
      style root.2_outports fill:none,stroke:none
      root.2output[output]
    end
  end
  root.2_rhs_FakeConstantout((1.10))
  subgraph root.3[<center>Decider<br/>gt for 0s</center>]
    subgraph root.3_inports[ ]
      style root.3_inports fill:none,stroke:none
      root.3lhs[lhs]
      root.3rhs[rhs]
    end
    subgraph root.3_outports[ ]
      style root.3_outports fill:none,stroke:none
      root.3output[output]
    end
  end
  subgraph root.4[<center>Alerter<br/>overload/crit</center>]
    subgraph root.4_inports[ ]
      style root.4_inports fill:none,stroke:none
      root.4signal[signal]
    end
  end
  root.0output --> |LATENCY| root.1input
  root.0output --> |LATENCY| root.3lhs
  root.1output --> |LATENCY_EMA| root.2lhs
  root.2output --> |LATENCY_SETPOINT| root.3rhs
  root.2_rhs_FakeConstantout --> root.2rhs
  root.3output --> |IS_OVERLOAD_SWITCH| root.4signal

Policy in Action

As the service processes traffic, various signal metrics collected from the execution of the policy can be visualized:

- LATENCY: Signal gathered from the periodic execution of the PromQL query on Flux Meter metrics.

- LATENCY_EMA: Exponential Moving Average of the LATENCY signal.

- LATENCY_SETPOINT: Latency above which the service is considered to be overloaded. This is calculated by multiplying the exponential moving average by a tolerance factor (LATENCY_EMA * 1.1).

- IS_OVERLOAD_SWITCH: Signal that represents whether the service is in an overloaded state. It is derived by comparing LATENCY with LATENCY_SETPOINT. A value of 0 indicates no overload, while a value of 1 signals an overload.