Description
Summary
We're experiencing severe performance degradation and message-handling issues after upgrading from Pushpin 1.35.0 to 1.41.0 in our production environment. The issue appears under production load (~100k requests/minute) but not in our staging environment (~100 requests/minute). Pushpin 1.35.0 has been running flawlessly for 2 years with the same configuration.
Environment
- Platform: AWS Fargate (Linux/x86_64)
- Installation: Docker images from Docker Hub
- Working version: fanout/pushpin:1.35.0-1
- Problem version: fanout/pushpin:1.41.0-1
- Production load: ~100k requests/minute (proxy), ~4k messages/instance/minute received (handler), ~4k messages/instance/minute sent (handler)
- Staging load: ~100 requests/minute
- CPU/Memory usage: ~10% (no unusual spikes during incidents)
Setup
Our setup consists of 2 instances of Pushpin. Every message is published to both instances, and each instance dispatches it to the channels/clients connected to it.
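For context, a minimal sketch of how our publisher side pushes one message to both instances, assuming the default HTTP publish endpoint on port 5561 (instance hostnames and the channel name are placeholders, not our real values):

```python
import json
import urllib.request

# Hypothetical instance addresses; every message is sent to both Pushpin
# instances so each can deliver to its own connected clients.
PUSHPIN_INSTANCES = ["http://pushpin-1:5561", "http://pushpin-2:5561"]

def publish(channel: str, data: str) -> None:
    # Standard Pushpin publish format: one item per channel, with a
    # per-transport payload (http-stream shown here).
    body = json.dumps({
        "items": [{
            "channel": channel,
            "formats": {"http-stream": {"content": data + "\n"}},
        }]
    }).encode()
    for base in PUSHPIN_INSTANCES:
        req = urllib.request.Request(
            base + "/publish",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=5).read()

publish("conversation-123", "hello")
```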
Configuration
- 1.35.0 configuration: https://gist.github.com/xavierhamel/c21bc864b3c90726ee082814e0c7d18d
- 1.35.0 internal configuration: https://gist.github.com/xavierhamel/2baabc3c7421ffd34a0a695e1f2d049c
- 1.41.0 configuration: https://gist.github.com/xavierhamel/efacafc2c52bc07dfea65137f75f702e
Note: We used near-default configurations for both versions. The default config did change between versions.
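Since the defaults changed between versions, these are the runner/handler settings we plan to diff first. A sketch only, assuming the option names are unchanged in 1.41.0; the values shown are illustrative, not recommendations:

```ini
# pushpin.conf -- throughput-related options we intend to compare between
# the 1.35.0 and 1.41.0 defaults (values illustrative, not verified)

[runner]
# overall log verbosity; was 1 during the incidents
log_level=2
# per-client buffering and connection limit
client_buffer_size=8192
client_maxconn=50000

[handler]
# publish-side rate limiting and high-water mark
message_rate=2500
message_hwm=25000
message_wait=5000
```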
Incident 1: Direct upgrade
Setup: Direct upgrade from 1.35.0 to 1.41.0
Observations:
- Immediately after booting 1.41.0 and replacing 1.35.0, message throughput dropped dramatically
- Requests received (proxy): Remained at ~100k/minute ✓
- Messages received (handler): Dropped from 4k to 1k/minute ✗
- Messages sent (handler): Dropped from 4k to 100/minute ✗
- Proxied requests became very slow (10s to 1 minute response times)
- Everything else in our system was functioning correctly
- log_level=1 was active, so only limited logging information is available (see the sketch after this list for how we plan to raise it before the next attempt)
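Before retrying 1.41.0 we intend to run with more verbose logging; a minimal sketch, assuming the [runner] log_level option and its numeric scale (higher = more verbose) are unchanged from 1.35.0:

```ini
# pushpin.conf -- raise verbosity before the next 1.41.0 test
[runner]
log_level=3
```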
Resolution: Reverted to 1.35.0, issue immediately resolved.
Incident 2: Rolling rollout with load balancer
After a few weeks of investigation without finding the root cause, and being unable to reproduce in staging, we changed our setup:
New setup:
- 2 instances of Pushpin 1.41.0
- 2 instances of Pushpin 1.35.0
- Load balancer in front to route traffic between versions
- Initially, all traffic was routed to the 1.35.0 instances
Deployment:
- Rolling rollout from 0% to 100% traffic to 1.41.0 over 24 hours
- System worked correctly for 48 hours after reaching 100% on version 1.41.0
Trigger event (possibly coincidental):
A user made 1200 requests in 1 second. All returned errors with the following log entry (repeated 1200x):
GET http://webchat.botpress.cloud/{webhook_id}/conversations/{conv_id}/listen -> webhook.botpress.cloud:443 error
Note: webhook.botpress.cloud:443 is our backend server, which continued responding normally to direct requests.
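For reference, our proxy routing is essentially a single catch-all route to that backend; a simplified sketch of the routes file, assuming the same route syntax applies in both versions:

```
# routes -- simplified; all requests are proxied to the backend over TLS
* webhook.botpress.cloud:443,ssl
```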
This burst of errors was identified as the start of the incident. We don't know whether it actually caused the incident or was a coincidence.
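We have not yet been able to reproduce this in staging. Below is a sketch of the kind of burst test we intend to run there; the URL, path placeholders, and concurrency are hypothetical, not the real user's values:

```python
import concurrent.futures
import urllib.request

# Hypothetical staging endpoint; {webhook_id} and {conv_id} stand in for
# real identifiers and are deliberately left unfilled.
URL = "http://staging-pushpin.example.internal/{webhook_id}/conversations/{conv_id}/listen"

def one_request(_: int) -> int:
    # The real traffic was long-lived "listen" requests; a short timeout is
    # enough to observe whether the proxy starts returning errors.
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            return resp.status
    except Exception:
        return -1

# Fire roughly the same burst the user produced: ~1200 requests as fast as
# possible from a single client.
with concurrent.futures.ThreadPoolExecutor(max_workers=200) as pool:
    results = list(pool.map(one_request, range(1200)))

print("errors:", sum(1 for r in results if r != 200))
```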
Observations after trigger:
- Requests received (proxy): Dropped from 100k to 80k/minute
- Messages received (handler): Dropped from 4k to 1k/minute
- Messages sent (handler): Dropped from 4k to 1k/minute
- Proxy error rate: Jumped from ~0.5% to ~30% of requests
- Proxied requests became very slow (10s to 1 minute response times)
- Even requests not using SSE/WebSocket channels were significantly slowed
Resolution:
- Switched traffic back to 1.35.0 via load balancer, issue immediately resolved
- Did not manually restart the 1.41.0 containers
- When we tested again the next day, the same 1.41.0 containers (still running, never restarted) were working correctly
Questions
- Are there known performance or scalability issues in 1.41.0 under high load?
- Could the default configuration changes between versions be causing this behavior?
- Are there any known issues with message handling or connection management in 1.41.0?
- Are there any recommended configuration changes when upgrading to 1.41.0 for high-throughput scenarios?
Additional Context
- We tested 1.41.0 in staging for several weeks without issues, but staging has 1000x less load
- The issue appears to be specific to high-load scenarios
- No unusual CPU or memory usage was observed during incidents
- Backend server health was confirmed throughout both incidents
Any guidance on debugging this issue or configuration recommendations for high-throughput deployments would be greatly appreciated.