Description
Summary
We're experiencing severe performance degradation and message-handling issues after upgrading from Pushpin 1.35.0 to 1.41.0 in our production environment. The issue appears under production load (~100k requests/minute) but not in our staging environment (~100 requests/minute). Pushpin 1.35.0 has been running flawlessly for 2 years with the same configuration.
Environment
- Platform: AWS Fargate (Linux/x86_64)
- Installation: Docker images from Docker Hub
- Working version: fanout/pushpin:1.35.0-1
- Problem version: fanout/pushpin:1.41.0-1
- Production load: ~100k requests/minute (proxy), ~4k messages/instance/minute received (handler), ~4k messages/instance/minute sent (handler)
- Staging load: ~100 requests/minute
- CPU/Memory usage: ~10% (no unusual spikes during incidents)
Setup
Our setup consists of 2 instances of Pushpin. Every message is published to both instances, and each instance dispatches it to the channels/clients connected to it.
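For context, a minimal sketch of how our publisher side pushes one message to both instances, assuming the default HTTP publish endpoint on port 5561 (instance hostnames and the channel name are placeholders, not our real values):

```python
import json
import urllib.request

# Hypothetical instance addresses; every message is sent to both Pushpin
# instances so each can deliver to its own connected clients.
PUSHPIN_INSTANCES = ["http://pushpin-1:5561", "http://pushpin-2:5561"]

def publish(channel: str, data: str) -> None:
    # Standard Pushpin publish format: one item per channel, with a
    # per-transport payload (http-stream shown here).
    body = json.dumps({
        "items": [{
            "channel": channel,
            "formats": {"http-stream": {"content": data + "\n"}},
        }]
    }).encode()
    for base in PUSHPIN_INSTANCES:
        req = urllib.request.Request(
            base + "/publish",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=5).read()

publish("conversation-123", "hello")
```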
Configuration
- 1.35.0 configuration: https://gist.github.com/xavierhamel/c21bc864b3c90726ee082814e0c7d18d
- 1.35.0 internal configuration: https://gist.github.com/xavierhamel/2baabc3c7421ffd34a0a695e1f2d049c
- 1.41.0 configuration: https://gist.github.com/xavierhamel/efacafc2c52bc07dfea65137f75f702e
Note: We used near-default configurations for both versions. The default config did change between versions.
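Since the defaults changed between versions, these are the runner/handler settings we plan to diff first. A sketch only, assuming the option names are unchanged in 1.41.0; the values shown are illustrative, not recommendations:

```ini
# pushpin.conf -- throughput-related options we intend to compare between
# the 1.35.0 and 1.41.0 defaults (values illustrative, not verified)

[runner]
# overall log verbosity; was 1 during the incidents
log_level=2
# per-client buffering and connection limit
client_buffer_size=8192
client_maxconn=50000

[handler]
# publish-side rate limiting and high-water mark
message_rate=2500
message_hwm=25000
message_wait=5000
```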
Incident 1: Direct upgrade
Setup: Direct upgrade from 1.35.0 to 1.41.0
Observations:
- Immediately after booting 1.41.0 and replacing 1.35.0, message throughput dropped dramatically
- Requests received (proxy): Remained at ~100k/minute ✓
- Messages received (handler): Dropped from 4k to 1k/minute ✗
- Messages sent (handler): Dropped from 4k to 100/minute ✗
- Proxied requests became very slow (10s to 1 minute response times)
- Everything else in our system was functioning correctly
- log_level=1 was active, so only limited logging information is available (see the sketch after this list for how we plan to raise it before the next attempt)
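Before retrying 1.41.0 we intend to run with more verbose logging; a minimal sketch, assuming the [runner] log_level option and its numeric scale (higher = more verbose) are unchanged from 1.35.0:

```ini
# pushpin.conf -- raise verbosity before the next 1.41.0 test
[runner]
log_level=3
```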
Resolution: Reverted to 1.35.0, issue immediately resolved.
Incident 2: Rolling rollout with load balancer
After a few weeks of investigation without finding the root cause, and being unable to reproduce in staging, we changed our setup:
New setup:
- 2 instances of Pushpin 1.41.0
- 2 instances of Pushpin 1.35.0
- Load balancer in front to route traffic between versions
- Initially, all traffic was routed to the 1.35.0 instances
Deployment:
- Rolling rollout from 0% to 100% traffic to 1.41.0 over 24 hours
- System worked correctly for 48 hours after reaching 100% on version 1.41.0
Trigger event (possibly coincidental):
A user made 1200 requests in 1 second. All returned errors with the following log entry (repeated 1200x):
GET http://webchat.botpress.cloud/{webhook_id}/conversations/{conv_id}/listen -> webhook.botpress.cloud:443 error
Note: webhook.botpress.cloud:443 is our backend server, which continued responding normally to direct requests.
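For reference, our proxy routing is essentially a single catch-all route to that backend; a simplified sketch of the routes file, assuming the same route syntax applies in both versions:

```
# routes -- simplified; all requests are proxied to the backend over TLS
* webhook.botpress.cloud:443,ssl
```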
This burst of errors was identified as the start of the incident. We don't know whether it actually caused the incident or was a coincidence.
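We have not yet been able to reproduce this in staging. Below is a sketch of the kind of burst test we intend to run there; the URL, path placeholders, and concurrency are hypothetical, not the real user's values:

```python
import concurrent.futures
import urllib.request

# Hypothetical staging endpoint; {webhook_id} and {conv_id} stand in for
# real identifiers and are deliberately left unfilled.
URL = "http://staging-pushpin.example.internal/{webhook_id}/conversations/{conv_id}/listen"

def one_request(_: int) -> int:
    # The real traffic was long-lived "listen" requests; a short timeout is
    # enough to observe whether the proxy starts returning errors.
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            return resp.status
    except Exception:
        return -1

# Fire roughly the same burst the user produced: ~1200 requests as fast as
# possible from a single client.
with concurrent.futures.ThreadPoolExecutor(max_workers=200) as pool:
    results = list(pool.map(one_request, range(1200)))

print("errors:", sum(1 for r in results if r != 200))
```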
Observations after trigger:
- Requests received (proxy): Dropped from 100k to 80k/minute
- Messages received (handler): Dropped from 4k to 1k/minute
- Messages sent (handler): Dropped from 4k to 1k/minute
- Proxy error rate: Jumped from ~0.5% to ~30% of requests
- Proxied requests became very slow (10s to 1 minute response times)
- Even requests not using SSE/WebSocket channels were significantly slowed
Resolution:
- Switched traffic back to 1.35.0 via load balancer, issue immediately resolved
- Did not manually restart the 1.41.0 containers
- When we tested again the next day, the same 1.41.0 containers (still running, never restarted) were working correctly
Questions
- Are there known performance or scalability issues in 1.41.0 under high load?
- Could the default configuration changes between versions be causing this behavior?
- Are there any known issues with message handling or connection management in 1.41.0?
- Are there any recommended configuration changes when upgrading to 1.41.0 for high-throughput scenarios?
Additional Context
- We tested 1.41.0 in staging for several weeks without issues, but staging has 1000x less load
- The issue appears to be specific to high-load scenarios
- No unusual CPU or memory usage was observed during incidents
- Backend server health was confirmed throughout both incidents
Any guidance on debugging this issue or configuration recommendations for high-throughput deployments would be greatly appreciated.