# Protection
The following policy is based on the Service Protection with Average Latency Feedback blueprint.
## Overview
The response times of a service start to deteriorate when the service's underlying concurrency limit is surpassed. A degradation in response latency is therefore a reliable signal of service overload. This policy detects overload situations based on latency deterioration; during overload, the request rate is throttled so that latency is restored to an acceptable range.
## Configuration
This policy monitors the latency of requests processed by the `cart-service.prod.svc.cluster.local` service. It calculates the deviation of the current latency from a baseline historical latency, which serves as an indicator of service overload. A current latency of more than 1.1 times the baseline is considered a signal of service overload.
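Concretely, the circuit multiplies the latency baseline (an exponential moving average of recent latency) by this tolerance factor to obtain the setpoint. With an illustrative baseline of 50 ms (a value assumed here only for the arithmetic):

$$
\text{SETPOINT} = \text{EMA(latency)} \times 1.1 = 50\,\text{ms} \times 1.1 = 55\,\text{ms}
$$

A current latency above 55 ms would then be treated as a sign of overload.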
To mitigate service overload, requests to the `cart-service.prod.svc.cluster.local` service are passed through a load scheduler. The load scheduler reduces the request rate in overload scenarios, temporarily placing excess requests in a queue. As service latency improves, indicating a return to the normal operational state, the request rate is incrementally increased until it matches the incoming request rate. This responsive mechanism helps ensure that service performance is optimized while mitigating the risk of service disruptions due to overload.
`values.yaml` for use with `aperturectl`:

```yaml
# yaml-language-server: $schema=../../../../../../blueprints/policies/service-protection/average-latency/gen/definitions.json
# Generated values file for policies/service-protection/average-latency blueprint
# Documentation/Reference for objects and parameters can be found at:
# https://docs.fluxninja.com/reference/blueprints/policies/service-protection/average-latency
policy:
  # Name of the policy.
  # Type: string
  # Required: True
  policy_name: basic-service-protection
  service_protection_core:
    adaptive_load_scheduler:
      load_scheduler:
        # The selectors determine the flows that are protected by this policy.
        # Type: []aperture.spec.v1.Selector
        # Required: True
        selectors:
          - control_point: ingress
            service: cart-service.prod.svc.cluster.local
  latency_baseliner:
    # Tolerance factor beyond which the service is considered to be in overloaded state. E.g. if EMA of latency is 50ms and if Tolerance is 1.1, then service is considered to be in overloaded state if current latency is more than 55ms.
    # Type: float64
    latency_tolerance_multiplier: 1.1
    # Flux Meter defines the scope of latency measurements.
    # Type: aperture.spec.v1.FluxMeter
    # Required: True
    flux_meter:
      selectors:
        - control_point: ingress
          service: cart-service.prod.svc.cluster.local
```
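The policy below can be generated from this values file with `aperturectl`. A sketch of the invocation (flag spellings vary between aperturectl releases, so verify against `aperturectl blueprints generate --help`):

```sh
# Generate the policy from the average-latency blueprint using the values file above.
aperturectl blueprints generate \
  --name=policies/service-protection/average-latency \
  --values-file=values.yaml \
  --output-dir=policy-gen
```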
## Generated Policy
```yaml
apiVersion: fluxninja.com/v1alpha1
kind: Policy
metadata:
  annotations:
    fluxninja.com/blueprint-name: policies/service-protection/average-latency
    fluxninja.com/blueprints-uri: local
    fluxninja.com/values:
      '{"policy": {"latency_baseliner": {"flux_meter": {"selectors":
      [{"control_point": "ingress", "service": "cart-service.prod.svc.cluster.local"}]},
      "latency_tolerance_multiplier": 1.1000000000000001}, "policy_name": "basic-service-protection",
      "service_protection_core": {"adaptive_load_scheduler": {"load_scheduler": {"selectors":
      [{"control_point": "ingress", "service": "cart-service.prod.svc.cluster.local"}]}}}}}'
  labels:
    fluxninja.com/validate: "true"
  name: basic-service-protection
spec:
  circuit:
    components:
      - flow_control:
          adaptive_load_scheduler:
            dry_run: false
            dry_run_config_key: dry_run
            in_ports:
              overload_confirmation:
                constant_signal:
                  value: 1
              setpoint:
                signal_name: SETPOINT
              signal:
                signal_name: SIGNAL
            out_ports:
              desired_load_multiplier:
                signal_name: DESIRED_LOAD_MULTIPLIER
              observed_load_multiplier:
                signal_name: OBSERVED_LOAD_MULTIPLIER
            parameters:
              alerter:
                alert_name: Load Throttling Event
              gradient:
                max_gradient: 1
                min_gradient: 0.1
                slope: -1
              load_multiplier_linear_increment: 0.0025
              load_scheduler:
                selectors:
                  - control_point: ingress
                    service: cart-service.prod.svc.cluster.local
              max_load_multiplier: 2
      - query:
          promql:
            evaluation_interval: 10s
            out_ports:
              output:
                signal_name: SIGNAL
            query_string:
              sum(increase(flux_meter_sum{flow_status="OK", flux_meter_name="basic-service-protection"}[30s]))/sum(increase(flux_meter_count{flow_status="OK",
              flux_meter_name="basic-service-protection"}[30s]))
      - arithmetic_combinator:
          in_ports:
            lhs:
              signal_name: SIGNAL
            rhs:
              constant_signal:
                value: 2 # latency_ema_limit_multiplier (blueprint default)
          operator: mul
          out_ports:
            output:
              signal_name: MAX_EMA
      - ema:
          in_ports:
            input:
              signal_name: SIGNAL
            max_envelope:
              signal_name: MAX_EMA
          out_ports:
            output:
              signal_name: SIGNAL_EMA
          parameters:
            correction_factor_on_max_envelope_violation: 0.95
            ema_window: 1500s
            warmup_window: 60s
      - arithmetic_combinator:
          in_ports:
            lhs:
              signal_name: SIGNAL_EMA
            rhs:
              constant_signal:
                value: 1.1
          operator: mul
          out_ports:
            output:
              signal_name: SETPOINT
    evaluation_interval: 10s
  resources:
    flow_control:
      classifiers: []
      flux_meters:
        basic-service-protection:
          selectors:
            - control_point: ingress
              service: cart-service.prod.svc.cluster.local
```
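To read the circuit: the PromQL query measures the average latency reported by the Flux Meter, the `ema` component smooths it into a long-running baseline, and the final `arithmetic_combinator` scales that baseline by the tolerance factor to produce the setpoint tracked by the adaptive load scheduler. In summary (our annotation, not part of the generated file):

```yaml
# SIGNAL:     average latency over 30s (flux_meter_sum / flux_meter_count)
# MAX_EMA:    envelope capping how fast the baseline may grow
# SIGNAL_EMA: baseline latency, an EMA of SIGNAL over a 1500s window
# SETPOINT:   SIGNAL_EMA * 1.1 (the latency_tolerance_multiplier)
# The load scheduler throttles when SIGNAL rises above SETPOINT.
```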
*Circuit diagram for this policy.*
## Policy in Action
To see the policy in action, traffic is generated such that it starts within the service's capacity and then exceeds the capacity after some time. This traffic pattern is repeated periodically. The dashboard below demonstrates that when latency spikes because of high traffic at `cart-service.prod.svc.cluster.local`, the controller throttles the rate of requests admitted into the service. This approach helps protect the service from becoming unresponsive and keeps the current latency within the tolerance limit (1.1) of the historical latency.
## Dry Run Mode
You can run this policy in the Dry Run mode by setting the `default_config.dry_run` option to `true`. In the Dry Run mode, the policy does not throttle the request rate, while still evaluating the decisions it would take in each cycle. This is useful for evaluating the policy without impacting the service.

The Dry Run mode can also be toggled dynamically at runtime, without reloading the policy.
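Because the generated policy sets `dry_run_config_key: dry_run`, toggling the mode at runtime amounts to shipping a dynamic configuration document for this policy. A minimal sketch (the file name is hypothetical; how it is applied, for example through `aperturectl`, depends on your deployment):

```yaml
# dynamic-config.yaml (hypothetical file name)
# The key matches the policy's dry_run_config_key; setting it to true
# switches the load scheduler to evaluate-only mode without throttling.
dry_run: true
```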
## Demo Video
The demo video below shows the basic service protection and workload prioritization policies in action within the Aperture Playground.