Integrating FluxNinja Aperture with Nginx for Effective Load Management

Sudhanshu Prajapati · 13 min read

Today, everything is available online, and people turn to the internet for even the smallest things. New products and websites pop up every day, catering to specific needs, from groceries to online studying. As more users come online to use these services, websites and web applications see a corresponding surge in traffic.

When launching a product or website, we usually have traffic estimates, but sometimes these estimates are exceeded, leading to overload scenarios. For instance, after the announcement of ChatGPT (GPT-3.5), there was a massive influx of traffic and interest from people all around the world. The sudden surge of visitors exceeded the service's capacity and buffers, leading to website downtime. In such situations, it is essential to have load management in place to avoid potential business loss.

As businesses grow and expand, it becomes increasingly important to ensure that their web applications and websites can handle the influx of traffic and demand without sacrificing performance or reliability. Even large organizations still suffer from downtime, which highlights the need for an intelligent load management platform.

That's where FluxNinja Aperture comes in. In this blog post, we'll explore how Aperture can help manage high traffic on e-commerce websites by integrating with Nginx Gateway. With Aperture's advanced load management techniques, such as dynamic rate limiting and prioritized load shedding, you can ensure the reliability and stability of your web application, even during periods of high traffic.

The Challenge: Managing a High-Traffic E-commerce Website with Nginx

  • Unpredictable traffic spikes: E-commerce websites often experience sudden surges in demand, especially during holidays or special events. These traffic spikes can put a strain on the website, leading to slow response times, server crashes, or downtime.
  • Limited resources: E-commerce websites usually have limited resources to handle traffic spikes and sudden surges in demand. This can lead to a lack of scalability and can make it challenging to provide a consistent user experience.
  • Dynamic traffic patterns: E-commerce websites can experience unpredictable traffic patterns that vary significantly throughout the day, week, or month. Managing this dynamic traffic can be challenging, especially during peak periods.
  • Performance and reliability: E-commerce websites need to maintain a high level of performance and reliability to provide a good user experience. Slow response times, errors, or downtime can lead to lost revenue and damage to the brand's reputation.

If these challenges are not managed effectively, they can lead to significant consequences such as downtime, lost revenue, and a negative impact on the brand's reputation.

To overcome these challenges effectively, let's examine a concrete setup that can be implemented.

Solving the Load Management Challenge

Having seen the major challenges and their consequences, let's look at how FluxNinja Aperture resolves them. But first, a brief overview of what Aperture is.

FluxNinja Aperture is a load management platform that integrates with Nginx Gateway to provide advanced load management techniques such as load shedding, rate limiting, dynamic rate limiting, concurrency limiting, prioritized load shedding, and workload prioritization. With Aperture, high-traffic websites can effectively manage the challenges that come with heavy and unpredictable traffic.

These are some ways Aperture can help you solve load management challenges:

  • Load shedding: Aperture can identify and shed traffic that is not critical or high-priority, allowing critical traffic to pass through smoothly.
  • Rate limiting: Aperture can limit the number of requests from a particular IP address or API, preventing excessive traffic from overwhelming the website.
  • Dynamic rate limiting: Aperture can adjust the rate limiting based on the current traffic patterns to provide a balance between maintaining website availability and preventing overload.
  • Concurrency limiting: Aperture can limit the number of concurrent connections to prevent resource exhaustion and maintain website stability.
  • Prioritized load shedding: Aperture can prioritize the shedding of non-critical traffic based on pre-defined criteria, ensuring critical traffic can pass through without disruption.
  • Workload prioritization: Aperture can prioritize certain workloads, ensuring that critical processes are prioritized over less important ones.

Let's explore the steps involved in configuring one of these strategies with Nginx Gateway and Aperture.

Integrating Aperture with Nginx Gateway

To begin integrating Aperture, verify that the prerequisites are satisfied. Here, I’m assuming you have already installed the Aperture Controller and Agent in your setup before jumping into the Flow Control Gateway Integration prerequisites.

At a high level, you need an Nginx server with the lua-nginx-module enabled and LuaRocks configured. If you are not sure how to configure these, head over to the documentation mentioned earlier.
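A quick sanity check on a Debian/Ubuntu-style host might look like the sketch below; the package names are assumptions and differ by distribution, and OpenResty ships the Lua module out of the box.

    # Check whether Nginx was built with the Lua module (or use OpenResty, which bundles it)
    nginx -V 2>&1 | grep -i lua

    # Check that LuaRocks is available
    luarocks --version

    # Example installation on Debian/Ubuntu (package names are assumptions)
    sudo apt-get install libnginx-mod-http-lua luarocks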

To integrate Aperture with Nginx Gateway, these are the high-level steps to perform:

  1. Install the Aperture Lua module: The opentelemetry-lua SDK has to be installed beforehand to install the Aperture Lua module. Check out the Nginx Integration doc for detailed steps.
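    A heavily hedged sketch of the installation is shown below; the rock name and rockspec path are assumptions, so follow the Nginx Integration doc for the authoritative commands.

      # Install the opentelemetry-lua SDK via LuaRocks (rock name is an assumption)
      luarocks install opentelemetry-lua

      # Fetch the Aperture repository and build the Aperture Lua module from its rockspec
      git clone https://github.com/fluxninja/aperture.git
      luarocks make <path-to-aperture-lua-module-rockspec>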

  2. Configure Nginx: This involves adding certain blocks of code to the Nginx configuration file to initialize and execute the Aperture Lua module. Below is an example of these blocks —

    http {
      ...
      init_by_lua_block {
        access = require "aperture-plugin.access"
        log = require "aperture-plugin.log"
      }

      access_by_lua_block {
        local authorized_status = access(ngx.var.destination_hostname, ngx.var.destination_port)

        if authorized_status ~= ngx.HTTP_OK then
          return ngx.exit(authorized_status)
        end
      }

      log_by_lua_block {
        log()
      }

      server {
        listen 80;
        proxy_http_version 1.1;

        location /service1 {
          set $destination_hostname "service1-demo-app.demoapp.svc.cluster.local";
          set $destination_port "80";
          proxy_pass http://$destination_hostname:$destination_port/request;
        }
        ...
      }
      ...
    }
    • init_by_lua_block initializes the Aperture Lua module.
    • access_by_lua_block executes the Aperture check for all servers and locations before the request is forwarded upstream.
    • log_by_lua_block forwards the OpenTelemetry logs to Aperture for all servers and locations after the response is received from upstream.

    Additionally, the Aperture Lua module needs the upstream address of the server, provided via the destination_hostname and destination_port variables, which must be set in the Nginx location block.

  3. Set environment variables (a hedged example follows below):

    1. APERTURE_AGENT_ENDPOINT - to connect to the Aperture Agent.
    2. APERTURE_CHECK_TIMEOUT - to specify the timeout for execution of the Aperture check.
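    For example, when running Nginx directly on a host, you might export these before starting the server. The values below are assumptions for illustration; also note that Nginx typically only exposes environment variables to worker processes (and hence to the Lua module) if they are declared with the env directive in the main context of nginx.conf.

      # Assumed values for illustration only
      export APERTURE_AGENT_ENDPOINT="aperture-agent.aperture-agent.svc.cluster.local:8080"
      export APERTURE_CHECK_TIMEOUT="500ms"

      # In nginx.conf (main context), expose the variables to worker processes:
      #   env APERTURE_AGENT_ENDPOINT;
      #   env APERTURE_CHECK_TIMEOUT;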

Demo

In this demonstration, we will explore how rate-limiting escalation can assist an e-commerce website during unexpected high traffic. E-commerce websites typically have three types of users: crawlers, guests, and subscribed members. During periods of high traffic, the website may struggle to respond to each request without prioritization, leading to frustration among paying users.

To demonstrate this scenario, we have set up a playground environment with an Nginx server forming the topology depicted below. To see it in live action, you can run the playground on your local machine by following the instructions in the Try Local Playground guide.

This playground is a Kubernetes-based environment that includes the necessary components, such as the Aperture Controller and Agent, already installed in the cluster.

flowchart LR
  subgraph loadgen [Load Generator]
    direction LR
    k6([k6])
  end
  subgraph nginx-server [Nginx Server Gateway]
    direction LR
    nginx[[Nginx]]
  end
  subgraph demoapp [Demo Application]
    direction LR
    s1[[service1]]
    s2[[service2]]
    s3[[service3]]
    s1 ==> s2 ==> s3
  end
  subgraph agent [Aperture Agent]
    direction TB
    f1[Flux Meter]
    r1[Rate<br/>Limiter]
    c1[Concurrency<br/>Limiter]
  end
  k6 ==> nginx
  nginx --> s1
  s3 --> f1
  s1 --> r1 --> c1
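If you want to confirm that the Aperture components are running before proceeding, a quick check like the following should suffice; the namespace names are assumptions that depend on how Aperture was installed in your cluster.

    kubectl get pods -n aperture-controller
    kubectl get pods -n aperture-agent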

Traffic Generator

The playground also has a load generator named wavepool-generator, which will help us mimic the high-traffic scenario for our use case.

info

👉 For your information, the load generator is configured to generate the following traffic pattern for subscriber, guest, and crawler traffic types:

  • Ramp up to 5 concurrent users in 10s.
  • Hold at 5 concurrent users for 2m.
  • Ramp up to 30 concurrent users in 1m (overloads service3).
  • Hold at 30 concurrent users for 2m (overloads service3).
  • Ramp down to 5 concurrent users in 10s.
  • Hold at 5 concurrent users for 2m.

Nginx Configuration

I’ve configured Nginx in the demo by assigning a URL to each service and defining their respective locations. The Nginx deployment and configuration file can be found in the Aperture repo Playground Resources.

Below is a snippet of the Nginx configuration file, which includes the definition of worker processes, events, and the HTTP server. The server block defines the listening port and the proxy_pass directive, which is used to pass the requests to the respective services.

worker_processes auto;
pid /run/nginx.pid;

events {
  worker_connections 4096;
}

http {
  default_type application/octet-stream;
  resolver 10.96.0.10;

  sendfile on;
  keepalive_timeout 65;

  init_by_lua_block {
    access = require "aperture-plugin.access"
    log = require "aperture-plugin.log"
  }

  access_by_lua_block {
    local authorized_status = access(ngx.var.destination_hostname, ngx.var.destination_port)

    if authorized_status ~= ngx.HTTP_OK then
      return ngx.exit(authorized_status)
    end
  }

  log_by_lua_block {
    log()
  }

  server {
    listen 80;
    proxy_http_version 1.1;

    location /service1 {
      set $destination_hostname "service1-demo-app.demoapp.svc.cluster.local";
      set $destination_port "80";
      proxy_pass http://$destination_hostname:$destination_port/request;
    }

    location /service2 {
      set $destination_hostname "service2-demo-app.demoapp.svc.cluster.local";
      set $destination_port "80";
      proxy_pass http://$destination_hostname:$destination_port/request;
    }

    location /service3 {
      set $destination_hostname "service3-demo-app.demoapp.svc.cluster.local";
      set $destination_port "80";
      proxy_pass http://$destination_hostname:$destination_port/request;
    }
  }
}
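As a quick sanity check, assuming you have shell access to the Nginx container (or run an Nginx instance with this configuration locally), you can validate the configuration and exercise one of the routes; the header value below is simply one of the user types used in the demo.

    # Validate the configuration syntax
    nginx -t

    # Exercise one of the routes through the gateway
    curl -H "User-Type: guest" http://localhost:80/service1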

Aperture Policy

Aperture uses declarative policies that let you customize how your system should react to a given situation. In this policy, we have used rate limiter and concurrency controller components.

You don’t need to wire everything from scratch. Aperture follows a blueprint pattern, where you can build on existing blueprints. For example, this policy uses the latency-AIMD concurrency-limiting blueprint, configured to our needs.
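If you prefer to generate the policy yourself from the values file shown further below, a sketch with aperturectl might look like this; the subcommand and flag names are assumptions that may differ between Aperture versions, so check the blueprints documentation for the exact invocation.

    aperturectl blueprints generate \
      --name=policies/latency-aimd-concurrency-limiting \
      --values-file=values.yaml \
      --output-dir=policies/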

This policy contains the following configurations:

  • common: common parameters for generating the policy and the dashboard. In this case, it specifies the name of the policy as "service1-demo-app".
  • policy: Configurations specific to this policy. It includes:
    • flux_meter: This configuration specifies the Flux Meter that will be used to measure the flow of traffic for the service. It uses a flow selector that matches traffic for a specific service and control point.
    • classifiers: This configuration specifies a list of classification rules that will be used to classify traffic. It includes a flow selector that matches traffic for a specific service and control point, and a rule that extracts a user_type value from the request headers.
    • components: This configuration specifies a list of additional circuit components that will be used for this policy. It includes:
      • decider: This component sets a signal to true for 30 seconds if the observed load multiplier is less than 1.0. This signal indicates whether crawler traffic should be escalated, that is, rate limited more aggressively, while the service is overloaded.
      • switcher: This component switches between two signals based on the value of the IS_CRAWLER_ESCALATION signal set by the decider component. If the signal is true, the output signal is set to 0.0; if it is false, the output signal is set to 10.0.
      • flow_control: This component applies rate limiting to traffic that matches a specific label. It uses a flow selector to match traffic for a specific service and control point based on the label http.request.header.user_type with value crawler.
    • concurrency_controller: This configuration specifies the concurrency limiter flow selector and scheduler parameters. It uses the flow selector from the classifiers configuration and includes a scheduler that prioritizes traffic based on the user_type label or the http.request.header.user_type header value. It also includes a load multiplier linear increment that is applied when the system is not in an overloaded state.
# yaml-language-server: $schema=../../../../blueprints/policies/latency-aimd-concurrency-limiting/gen/definitions.json
# Generated values file for policies/latency-aimd-concurrency-limiting blueprint
# Documentation/Reference for objects and parameters can be found at:
# https://docs.fluxninja.com/reference/policies/bundled-blueprints/policies/latency-aimd-concurrency-limiting

common:
  # Name of the policy.
  # Type: string
  # Required: True
  policy_name: "service1-demo-app"

policy:
  # Flux Meter.
  # Type: aperture.spec.v1.FluxMeter
  flux_meter:
    selectors:
      - service: service3-demo-app.demoapp.svc.cluster.local
        control_point: ingress
  # List of classification rules.
  # Type: []aperture.spec.v1.Classifier
  classifiers:
    - selectors: &concurrency_flow_selector
        - service: service1-demo-app.demoapp.svc.cluster.local
          control_point: ingress
      rules:
        user_type:
          extractor:
            from: request.http.headers.user-type
  # List of additional circuit components.
  # Type: []aperture.spec.v1.Component
  components:
    - decider:
        in_ports:
          lhs:
            signal_name: OBSERVED_LOAD_MULTIPLIER
          rhs:
            constant_signal:
              value: 1.0
        out_ports:
          output:
            signal_name: IS_CRAWLER_ESCALATION
        operator: lt
        true_for: 30s
    - switcher:
        in_ports:
          switch:
            signal_name: IS_CRAWLER_ESCALATION
          on_signal:
            constant_signal:
              value: 0.0
          off_signal:
            constant_signal:
              value: 10.0
        out_ports:
          output:
            signal_name: RATE_LIMIT
    - flow_control:
        rate_limiter:
          selectors:
            - service: service1-demo-app.demoapp.svc.cluster.local
              label_matcher:
                match_labels:
                  "http.request.header.user_type": "crawler"
              control_point: ingress
          in_ports:
            limit:
              signal_name: RATE_LIMIT
          parameters:
            limit_reset_interval: 1s
            label_key: http.request.header.user_id
          dynamic_config_key: rate_limiter
  concurrency_controller:
    selectors: *concurrency_flow_selector
    # Scheduler parameters.
    # Type: aperture.spec.v1.SchedulerParameters
    scheduler:
      workloads:
        - parameters:
            priority: 50
          label_matcher:
            match_labels:
              user_type: guest
        - parameters:
            priority: 200
          label_matcher:
            match_labels:
              http.request.header.user_type: subscriber
    # Linear increment to load multiplier in each execution tick (0.5s) when the system is not in overloaded state.
    # Type: float64
    load_multiplier_linear_increment: 0.0025

Start the Playground

Once you understand the Nginx configuration and have completed the playground prerequisites, run the following commands from the directory where you cloned the aperture repository.


# change directory to the playground
cd aperture/playground
tilt up -- --scenario=./scenarios/rate_limiting_escalation_nginx

This will bring up all the services and resources. Now, head over to Grafana at localhost:3000.
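To confirm that everything came up, you can check the pods in the demo namespace; the namespace name below matches the service addresses used in the Nginx configuration, but is otherwise an assumption about the playground setup.

    kubectl get pods -n demoapp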

Demo Application

What are the consequences of not implementing load management?

In the snapshot of the Grafana dashboard below, you can see that the latency increases to 400ms as the number of users increases. If the number of users continues to increase, it can overload the service, leading to fatal errors and cascading failure scenarios. Additionally, there is no prioritization between subscribed and guest users, and crawler traffic is allowed at all times, contributing to a rise in the overall latency of the service.

These are some of the key consequences —

  • Increased latency and slower response times for users
  • Increased server load and potential overload, leading to fatal errors and crashes
  • Cascading failure scenarios that can affect other parts of the system
  • No prioritization between different types of users, leading to poor user experience for some users
  • Allowing crawler traffic at all times, contributing to a rise in overall latency and server load

No load management is set up

When Aperture is integrated with Nginx

Aperture provides intelligent load management; it doesn’t act on just one signal, but on the overall performance of the system, observing the golden signals. Once Aperture is in the picture, system performance improves significantly, and service latency hovers around 150ms.

The rate limiter restricts all crawler traffic based on the label key configured in the policy. We can also see how well the system prioritizes the workload for each user type, i.e., subscriber and guest.

Latency for individual workloads drops significantly, from roughly 400ms to 200ms.

When Aperture is Active

System Overview

The impact of Aperture on reliability can be better understood by examining the graph below. Before integrating Aperture, the latency peak was around 400ms. However, after integrating Aperture, the latency significantly decreased to less than 150ms, which is a remarkable difference. This is particularly significant for organizations that optimize to reduce latency by just 10ms.

Across-the-board overview

This example clearly demonstrates how Aperture's techniques improved the performance and reliability of the e-commerce website. By implementing Aperture's load management strategy, the service was able to handle high traffic loads and effectively prevent downtime.

Conclusion

In conclusion, managing high-traffic e-commerce websites with Nginx can be a daunting task, but integrating FluxNinja Aperture can make it easier. We have discussed the benefits of using Aperture to manage load and prevent server crashes, as well as the various techniques that Aperture offers to help manage high-traffic loads. By implementing Aperture's techniques, websites can handle high-traffic loads, prevent downtime, and ensure a consistent and reliable user experience even during peak periods. With Aperture, load management becomes more efficient, allowing websites to focus on providing high-quality service to their customers.

To learn more about Aperture, please visit our GitHub repository and documentation site. You can also join our Slack community to discuss best practices, ask questions, and engage in discussions on reliability management.