Description
Hi Temporal team and community,
I’m trying to understand whether it’s realistically possible to run Temporal on Kubernetes in a way that keeps the history service’s memory usage stable and within its limits, without the pod eventually being OOM-killed.
Environment
• Temporal version: 1.29.1
• Persistence: PostgreSQL
• Deployment: Kubernetes
• History service replicas: 1
• Shards: tried multiple values
• Monitoring: Grafana community dashboard for Temporal server
Load model
• Execute 100 workflows per minute, evenly distributed
• After each minute of execution: a 1-minute sleep
• After 15 cycles, there is a 30-minute idle period
• The pattern then repeats (a rough driver sketch is shown below)
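For reference, this is a minimal sketch (Go SDK) of the kind of driver that produces this load pattern. The connection defaults, the task queue name `bench-tq`, and the workflow type `SampleWorkflow` are placeholders, not my actual setup:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.temporal.io/sdk/client"
)

func main() {
	// Placeholder connection: defaults to localhost:7233 and the "default" namespace.
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalf("unable to create Temporal client: %v", err)
	}
	defer c.Close()

	for run := 0; ; run++ {
		// 15 cycles of: 1 minute of starts (100 workflows, evenly spaced) + 1 minute of sleep.
		for cycle := 0; cycle < 15; cycle++ {
			ticker := time.NewTicker(600 * time.Millisecond) // 100 starts spread over ~60s
			for i := 0; i < 100; i++ {
				<-ticker.C
				opts := client.StartWorkflowOptions{
					ID:        fmt.Sprintf("bench-%d-%d-%d", run, cycle, i),
					TaskQueue: "bench-tq", // placeholder task queue name
				}
				// "SampleWorkflow" stands in for the real workflow type.
				if _, err := c.ExecuteWorkflow(context.Background(), opts, "SampleWorkflow"); err != nil {
					log.Printf("failed to start workflow: %v", err)
				}
			}
			ticker.Stop()
			time.Sleep(time.Minute) // idle minute between cycles
		}
		// 30-minute idle period, then the pattern repeats.
		time.Sleep(30 * time.Minute)
	}
}
```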
What I’ve tried
• Adjusting the number of history shards
• Tuning the history and events cache settings
• Adjusting database connection pool sizes
• Verifying via Grafana that traffic drops during idle periods
Observed behavior
• History pod memory usage continuously accumulates over time
• Memory does not stabilize or decrease, even during idle periods
• Eventually the history pod is OOM-killed
• I don’t see a clear plateau in memory usage in Grafana
Main question
Has anyone successfully deployed Temporal on Kubernetes such that the history service memory remains stable over time (especially under cyclical or bursty workloads), without relying on restarts or OOM recovery?
If so:
• What configuration patterns made the difference?
• Are multiple history replicas required for memory stability?
• Are there known limitations or expected behavior around cache eviction / GC in this scenario?
I’d appreciate any guidance, references, or confirmation of whether this is expected behavior or a misconfiguration on my side.
Thanks in advance!