
Performance degradation when upgrading from 1.35.0 to 1.41.0 under high load #48288

@xavierhamel

Description

Summary

We're experiencing severe performance degradation and message handling issues when upgrading from Pushpin 1.35.0 to 1.41.0 in our production environment. The issue appears under high load (100k requests/minute) but not in our staging environment (100 requests/minute). Pushpin 1.35.0 has been running flawlessly for 2 years with the same configuration.

Environment

  • Platform: AWS Fargate (Linux/x86_64)
  • Installation: Docker images from Docker Hub
    • Working version: fanout/pushpin:1.35.0-1
    • Problem version: fanout/pushpin:1.41.0-1
  • Production load: ~100k requests/minute (proxy), ~4k messages/instance/minute received (handler), ~4k messages/instance/minute sent (handler)
  • Staging load: ~100 requests/minute
  • CPU/Memory usage: ~10% (no unusual spikes during incidents)

Setup

Our setup consists of 2 instances of Pushpin, each receiving all messages and dispatching them to the correct channels/clients.
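
To make the fan-out concrete, here is a minimal sketch of the pattern written against Pushpin's HTTP publish endpoint (default port 5561). The hostnames, channel name, and payload are placeholders, and our real publisher may use a different transport; treat this as an illustration of "every message goes to both instances", not our actual code.

    # Publish each message to both Pushpin instances; each instance then
    # delivers it to whichever connected clients are subscribed to the channel.
    import json
    import urllib.request

    PUSHPIN_INSTANCES = [
        "http://pushpin-1.internal:5561",  # placeholder hostnames
        "http://pushpin-2.internal:5561",
    ]

    def publish(channel, data):
        body = json.dumps({
            "items": [
                {"channel": channel,
                 "formats": {"http-stream": {"content": data + "\n"}}}
            ]
        }).encode()
        for base in PUSHPIN_INSTANCES:
            req = urllib.request.Request(
                base + "/publish/",
                data=body,
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(req, timeout=5).read()

    publish("conversation-123", "example payload")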

Configuration

Note: We used near-default configurations for both versions. The default config did change between versions.
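
For reference, a quick sketch of how the shipped defaults can be compared once pushpin.conf has been copied out of each image (the file names below are placeholders; pushpin.conf is INI-style, so configparser can read it):

    # Diff the default pushpin.conf shipped with each image, key by key.
    import configparser

    def load(path):
        cp = configparser.ConfigParser(interpolation=None)
        cp.read(path)
        return cp

    old = load("pushpin-1.35.0.conf")  # placeholder file names
    new = load("pushpin-1.41.0.conf")

    for section in sorted(set(old.sections()) | set(new.sections())):
        old_items = dict(old.items(section)) if old.has_section(section) else {}
        new_items = dict(new.items(section)) if new.has_section(section) else {}
        for key in sorted(set(old_items) | set(new_items)):
            if old_items.get(key) != new_items.get(key):
                print(f"[{section}] {key}: {old_items.get(key)!r} -> {new_items.get(key)!r}")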

Incident 1: Direct upgrade

Setup: Direct upgrade from 1.35.0 to 1.41.0

Observations:

  • Immediately after booting 1.41.0 and replacing 1.35.0, message throughput dropped dramatically
  • Requests received (proxy): Remained at ~100k/minute ✓
  • Messages received (handler): Dropped from 4k to 1k/minute ✗
  • Messages sent (handler): Dropped from 4k to 100/minute ✗
  • Proxied requests became very slow (10s to 1 minute response times)
  • Everything else in our system was functioning correctly
  • log_level=1 was active, so only limited logging information was available

Resolution: Reverted to 1.35.0, issue immediately resolved.

Incident 2: Rolling rollout with load balancer

After a few weeks of investigation without finding the root cause, and being unable to reproduce the issue in staging, we changed our setup:

New setup:

  • 2 instances of Pushpin 1.41.0
  • 2 instances of Pushpin 1.35.0
  • Load balancer in front to route traffic between versions
  • Initially, all traffic was routed to the 1.35.0 instances

Deployment:

  • Rolling rollout from 0% to 100% traffic to 1.41.0 over 24 hours
  • System worked correctly for 48 hours after reaching 100% on version 1.41.0

Trigger event (possibly coincidental):
A user made 1200 requests in 1 second. All returned errors with the following log entry (repeated 1200x):

GET http://webchat.botpress.cloud/{webhook_id}/conversations/{conv_id}/listen -> webhook.botpress.cloud:443 error

Note: webhook.botpress.cloud:443 is our backend server, which continued responding normally to direct requests.

This was identified as the start of the incident. We don't know whether this burst actually caused the incident or was just a coincidence.
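
In case it helps with reproduction, here is a rough sketch of how that burst could be replayed against a non-production instance (the URL is a placeholder modeled on the logged request, and the concurrency is approximate):

    # Fire ~1200 near-simultaneous GETs at a single listen URL and tally the outcomes.
    import concurrent.futures
    import urllib.request

    URL = "http://pushpin-staging.internal/WEBHOOK_ID/conversations/CONV_ID/listen"  # placeholder

    def hit(_):
        try:
            with urllib.request.urlopen(URL, timeout=10) as resp:
                return str(resp.status)
        except Exception as exc:
            return type(exc).__name__

    with concurrent.futures.ThreadPoolExecutor(max_workers=300) as pool:
        results = list(pool.map(hit, range(1200)))

    for outcome in sorted(set(results)):
        print(outcome, results.count(outcome))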

Observations after trigger:

  • Requests received (proxy): Dropped from 100k to 80k/minute
  • Messages received (handler): Dropped from 4k to 1k/minute
  • Messages sent (handler): Dropped from 4k to 1k/minute
  • Proxy error rate: Jumped from ~0.5% to ~30% of requests
  • Proxied requests became very slow (10s to 1 minute response times; a timing sketch comparing proxied vs. direct requests follows this list)
    • Even requests not using SSE/WebSocket channels were significantly slowed
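
A minimal sketch of the kind of timing probe that shows this, comparing a request through Pushpin against the same backend hit directly (both paths are placeholders):

    # Time a simple GET through the proxy and directly against the backend.
    import time
    import urllib.request

    PROXIED_URL = "https://webchat.botpress.cloud/SOME_PATH"   # via Pushpin (placeholder path)
    DIRECT_URL = "https://webhook.botpress.cloud/SOME_PATH"    # backend directly (placeholder path)

    def timed_get(url):
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=120) as resp:
            resp.read()
        return time.monotonic() - start

    for label, url in (("proxied", PROXIED_URL), ("direct", DIRECT_URL)):
        samples = [timed_get(url) for _ in range(10)]
        print(f"{label}: min={min(samples):.2f}s max={max(samples):.2f}s")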

Resolution:

  • Switched traffic back to 1.35.0 via load balancer, issue immediately resolved
  • Did not manually restart the 1.41.0 containers
  • When tested the next day, the same 1.41.0 containers (still running, no restart) were working correctly again

Questions

  1. Are there known performance or scalability issues in 1.41.0 under high load?
  2. Could the default configuration changes between versions be causing this behavior?
  3. Are there any known issues with message handling or connection management in 1.41.0?
  4. Are there any recommended configuration changes when upgrading to 1.41.0 for high-throughput scenarios?

Additional Context

  • We tested 1.41.0 in staging for several weeks without issues, but staging has 1000x less load
  • The issue appears to be specific to high-load scenarios
  • No unusual CPU or memory usage was observed during incidents
  • Backend server health was confirmed throughout both incidents

Any guidance on debugging this issue or configuration recommendations for high-throughput deployments would be greatly appreciated.
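
If it would be useful, we can capture the handler's stats output during the next rollout attempt. A minimal sketch of a watcher using pyzmq, assuming stats_spec in pushpin.conf has been pointed at a reachable TCP address (the address below is a placeholder):

    # Subscribe to every topic on the handler's stats socket and print raw
    # frames, so connection/subscription/report activity can be compared
    # between 1.35.0 and 1.41.0 while under load.
    import zmq

    ctx = zmq.Context()
    sock = ctx.socket(zmq.SUB)
    sock.connect("tcp://pushpin-1.internal:5566")  # placeholder stats_spec address
    sock.setsockopt_string(zmq.SUBSCRIBE, "")

    while True:
        print(sock.recv())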
