Skip to content

Conversation

@prathyushpv
Copy link
Contributor

@prathyushpv prathyushpv commented Jan 26, 2026

What changed?

Add a WorkflowTaskQueueScheduler to serialize tasks from a busy workflow execution. Currently, all workflow tasks go through FIFO scheduler. We have a per-workflow-execution lock that must be aquired in each history task. But if a single execution has large number of tasks, each of these tasks compete for this lock and create large number of retries. By Adding this new scheduler, we can serialize task processing from such busy workflows. These tasks are routed to a new WorkflowQueueScheduler when lock contention is detected in a workflow. This will create a new queue for that workflow. Additional tasks for this workflow will be then routed to this new queue. This queue will be cleaned up after a few seconds of inactivity from that workflow execution. We have added a WorkflowAwareScheduler which will manage this routing of workflow tasks to either FIFO scheduler or this new WQ Scheduler.

Task → InterleavedWeightedRoundRobinScheduler
            ↓
      WorkflowAwareScheduler
            ↓
      ┌─────┴─────┐
      ↓           ↓
  FIFOScheduler   WorkflowQueueScheduler
  (normal path)   (contended workflows)

This new scheduler is only enabled when history.taskSchedulerEnableWorkflowQueueScheduler is enabled. The number of workflow queues created in this scheduler will be controlled by config history.taskSchedulerWorkflowQueueSchedulerQueueSize.
Tasks are routed to FIFOScheduler(Like the way it was before this change) if number of queues reaches this value.

A new goroutine is spawned for each queue in this new scheduler. This is fine here as we don’t expect more than a few hundred hot workflows per history host. This simplifies the design for this scheduler.

Why?

To reduce workflow lock contention and wasted history CPU when tasks are competing for workflow lock.

Benchmark results

This result is collected by running the new functional test for this scheduler.

Task Routing & Failures

Metric WQS ENABLED WQS DISABLED Improvement
WQS Tasks Submitted 1,964 0 -
WQS Tasks Completed 1,907 0 -
WQS Tasks Failed 57 0 -
WQS Submit Rejected 78 0 -
FIFO Tasks Completed 194 2,084 -
FIFO Tasks Failed 70 7,123 102x fewer
Total Failures 127 7,123 56x fewer failures

End-to-End Task Latency

Percentile WQS ENABLED WQS DISABLED Improvement
Count 2,018 2,077 -
Average 938ms 1,281ms 1.4x faster
P90 1,756ms 3,963ms 2.3x faster
P99 1,840ms 6,621ms 3.6x faster
Max 1,848ms 7,051ms 3.8x faster

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants