Add per-workflow scheduler for history task processing #9141
What changed?
Add a WorkflowTaskQueueScheduler to serialize tasks from a busy workflow execution. Currently, all workflow tasks go through the FIFO scheduler. Each history task must acquire a per-workflow-execution lock, so when a single execution has a large number of tasks, those tasks compete for the lock and cause a large number of retries. The new scheduler serializes task processing for such busy workflows. When lock contention is detected in a workflow, its tasks are routed to the new WorkflowQueueScheduler, which creates a queue for that workflow; subsequent tasks for the workflow are then routed to the same queue. The queue is cleaned up after a few seconds of inactivity from that workflow execution. A new WorkflowAwareScheduler manages the routing of workflow tasks to either the FIFO scheduler or the new WorkflowQueueScheduler.
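The routing described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `task`, `router`, `submit`, and `onLockContention` are hypothetical names, and the queues are plain slices so that only the routing decision is visible.

```go
package main

import "sync"

// task is a hypothetical stand-in for a history task.
type task struct {
	workflowKey string // identifies the workflow execution
	id          int
}

// router is an illustrative stand-in for the routing layer; it is not
// the WorkflowAwareScheduler from this PR.
type router struct {
	mu     sync.Mutex
	queues map[string][]task // one queue per contended workflow
	fifo   []task            // tasks sent to the shared FIFO scheduler
}

func newRouter() *router {
	return &router{queues: make(map[string][]task)}
}

// submit sends a task to its workflow queue if one already exists,
// otherwise to the FIFO scheduler. It returns the destination name.
func (r *router) submit(t task) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	if q, ok := r.queues[t.workflowKey]; ok {
		r.queues[t.workflowKey] = append(q, t)
		return "workflow"
	}
	r.fifo = append(r.fifo, t)
	return "fifo"
}

// onLockContention models a task that failed to acquire the workflow
// lock: it is re-routed to that workflow's queue, created on first use.
func (r *router) onLockContention(t task) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.queues[t.workflowKey] = append(r.queues[t.workflowKey], t)
}
```

In this sketch a workflow only gets a queue after contention has been reported for it, so workflows without contention keep going through the FIFO path.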
The new scheduler is used only when history.taskSchedulerEnableWorkflowQueueScheduler is enabled. The maximum number of workflow queues it creates is controlled by the config history.taskSchedulerWorkflowQueueSchedulerQueueSize.
Once the number of queues reaches this value, tasks are routed to the FIFO scheduler, as they were before this change.
Each queue in the new scheduler gets its own goroutine. This is acceptable because we don't expect more than a few hundred hot workflows per history host, and it keeps the scheduler design simple.
Why?
To reduce workflow lock contention and the CPU wasted on history hosts when tasks compete for the workflow lock.
Benchmark results
These results were collected by running the new functional test for this scheduler.
[Figure: Task Routing & Failures]
[Figure: End-to-End Task Latency]
How did you test it?