
Introduce logger on dask workers #170

Open
sbrandstaeter wants to merge 8 commits into queens-py:main from sbrandstaeter:introduce-logger-on-dask-worker

Conversation

@sbrandstaeter
Member

Description and Context:
What and Why?

Summary

This PR introduces logging support for code running inside Dask workers, specifically within the driver.run and dataprocessor.get_data_from_file methods, as discussed in issue #165. This should enable debugging these methods even on an HPC cluster.

Motivation

Currently, there's no logging when these methods are executed on a Dask worker, making it difficult to debug or monitor behavior during distributed execution.

What's Proposed

  • A dedicated logger (self.logger_on_dask_worker) is initialized inside the above methods if it doesn't already exist (see the sketch after this list).
  • This allows logging within Dask workers to be captured reliably.
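
A minimal sketch of this lazy-initialization pattern (only the attribute name self.logger_on_dask_worker comes from this PR; the class and method shown are hypothetical):

```python
import logging


class MyDriver:
    """Hypothetical driver used only to illustrate the pattern."""

    def run(self, sample, job_id):
        # Create the worker-side logger on first use; subsequent calls on the
        # same worker process reuse the existing instance.
        if getattr(self, "logger_on_dask_worker", None) is None:
            self.logger_on_dask_worker = logging.getLogger(f"{__name__}.worker")
        self.logger_on_dask_worker.info("Running job %s", job_id)
```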

Logging Setup

  • Logs are written both to file and to the standard stream (a setup sketch follows this list).
  • One log file per worker is used. If a worker restarts and gets the same ID (e.g. due to restart_worker=True), it reuses the same log file.
  • Log files are stored under <experiment_dir>/dask_workers_logs.
  • The stream output might also be captured by job schedulers like SLURM, depending on the environment.
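
A minimal sketch of such a setup, assuming the worker ID is obtained via dask.distributed.get_worker() (the directory name and the dual file/stream handlers are from this PR; everything else, including the helper name, is an assumption):

```python
import logging
from pathlib import Path

from dask.distributed import get_worker  # raises ValueError outside a worker


def setup_worker_logger(experiment_dir: Path) -> logging.Logger:
    """Configure one log file per Dask worker plus output to the standard stream."""
    worker_name = get_worker().name  # stable if a restarted worker gets the same ID
    log_dir = experiment_dir / "dask_workers_logs"
    log_dir.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(f"dask_worker.{worker_name}")
    if not logger.handlers:  # avoid stacking duplicate handlers on repeated calls
        logger.addHandler(logging.FileHandler(log_dir / f"worker_{worker_name}.log"))
        logger.addHandler(logging.StreamHandler())  # captured by e.g. SLURM
        logger.setLevel(logging.INFO)
    return logger
```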

Notes

This PR is meant to establish a solid initial foundation for Dask-side logging. While the current solution may not cover all edge cases or follow every logging best practice, it's a step in the right direction and open to future improvements. Feedback on the design and setup is welcome.

Related Issues and Pull Requests

Interested Parties

@sbrandstaeter sbrandstaeter force-pushed the introduce-logger-on-dask-worker branch from 143eff2 to 01c9f02 on June 7, 2025 22:06
Member

@gilrrei gilrrei left a comment

@sbrandstaeter I like this feature and I think we should merge this :)

Args:
experiment_dir (Path): Directory for data of a specific experiment on the computing machine.
"""
log_dir = experiment_dir / "dask_workers_logs"
Member

I must admit I find this a bit unintuitive. I would look for the logs of a job in that job's job_dir. Why go for this? Is there a particular reason?

Member Author

@sbrandstaeter sbrandstaeter Aug 23, 2025

That is a very good point and should be clearly stated. If you do not restart the worker after each job, a specific worker can take on several jobs, and there is no way to know in advance which ones these will be. Of course, the information is available once the job starts, and we could just change the log file on a per-job basis.
However, the logger also writes more general logging information of the worker (the same information that is written to stdout/stderr on a cluster and thus captured in the respective files written by, e.g., SLURM). This information is particularly useful when debugging Dask job states (i.e., Dask-internal matters). If it is scattered across several per-job files, it will be difficult to track. I admit that I might be biased here, because this was one of the main scenarios I used the logging for while developing this feature.
For this reason, I opted to collect all the logging of one worker in one file. I introduced the new folder so as not to clutter the experiment dir even more.

The information that is still missing and has to be captured somewhere, perhaps in the metadata: job x was run on worker y. We could also write the log-file path to the metadata for easier access.

My statement above was incorrect. The place where all the information is collected is, of course, in the SLURM log files that capture stdout and stderr.

It should be possible to adjust the log file on a per-job basis, and I agree that this makes perfect sense.
It will be the loggers at the driver level (and below, like the data processor) that are affected, so this really always operates at the single-job level.
We will still be able to collect the entire log of a worker chronologically via the SLURM log (if we also use the StreamHandler in the driver logger).
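
One way to realize such a per-job log file could be to swap the file handler whenever a new job starts (a sketch under assumed names; this approach is not taken verbatim from the PR):

```python
import logging
from pathlib import Path


def switch_to_job_log_file(logger: logging.Logger, log_file: Path) -> None:
    """Redirect the logger's file output to the log file of the current job."""
    # Drop the previous per-job file handler but keep plain stream handlers,
    # so the chronological worker log in the SLURM output stays complete.
    for handler in list(logger.handlers):
        if isinstance(handler, logging.FileHandler):
            logger.removeHandler(handler)
            handler.close()
    log_file.parent.mkdir(parents=True, exist_ok=True)
    logger.addHandler(logging.FileHandler(log_file))
```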

Member Author

I have implemented a version that creates a log file per job.

This required exposing the _manage_paths method to the Driver parent class.
I don't see this as a negative, though, as the Function driver might make use of the predefined QUEENS experiment dir structure.

Member Author

Eventually, I decided to further split the _manage_paths method into a _manage_paths and a _manage_output_files method.
This allows finer control over the directory structure (see the sketch below).
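
For illustration, such a split might look roughly like this (hypothetical signatures and directory layout; only the two method names come from this discussion):

```python
from pathlib import Path


class Driver:
    """Hypothetical parent class used only to illustrate the split."""

    def _manage_paths(self, experiment_dir: Path, job_id: int):
        # Derive the per-job directory layout without touching the file system.
        job_dir = experiment_dir / str(job_id)
        output_dir = job_dir / "output"
        return job_dir, output_dir

    def _manage_output_files(self, output_dir: Path, experiment_name: str, job_id: int):
        # Create the output directory and derive the output and log file paths.
        output_dir.mkdir(parents=True, exist_ok=True)
        output_file = output_dir / f"{experiment_name}_{job_id}"
        log_file = output_dir / f"{experiment_name}_{job_id}.log"
        return output_file, log_file
```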

@sbrandstaeter sbrandstaeter force-pushed the introduce-logger-on-dask-worker branch from 01c9f02 to de5bd40 on August 23, 2025 10:56
Comment on lines 131 to 134
job_dir,
output_dir,
output_file,
log_file,
Member Author

While these arguments are not used here, they might be used by a user-defined Python function.
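
For instance, a user-defined function passed to the Function driver might accept these paths as optional keyword arguments (a hypothetical example; the exact calling convention is an assumption):

```python
def my_simulation(x1, x2, job_dir=None, output_dir=None, output_file=None, log_file=None):
    """Hypothetical user function that can optionally use the per-job paths."""
    result = x1**2 + x2**2
    if output_file is not None:
        # Persist the result next to the other per-job output files.
        with open(f"{output_file}.txt", "w", encoding="utf-8") as f:
            f.write(str(result))
    return result
```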

Member

I agree that this is useful. However, experiment_dir, job_dir, and output_dir do not exist when using the Function driver. I'm not saying this is a bad thing, just that it might give users the wrong impression.

Member

I am also not sure about passing all of these parameters. I would prefer to pass only the ones that are actually needed in the current implementation of all drivers. For the jobscript driver, we could simply call self._manage_paths() again instead of passing job_dir, output_dir, output_file, and log_file.

Member Author

I reverted the signature of the _run function.

Note that the experiment_dir seems to always be created (especially if we merge #230).
This might be unwanted behaviour, e.g., for a Function driver that does not need any directories.
On the plus side, the implementation proposed in this PR only creates the job_dir/output_dir folders when the worker log files are actually written.

Member

That's a very good point about the experiment_dir always being created. I will add a check to #230 that deletes the experiment_dir after the QUEENS run is over if it is empty 👍

Member Author

That is an interesting idea. If we do it this way, we might as well also create the whole folder structure experiment_dir/job_dir/output_dir and delete it if no files exist inside.
It is not a particularly elegant solution, but it should work fine, and we could remove the flag write_worker_log_files after all (see the sketch below).
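
A sketch of that cleanup idea (an assumed implementation, not code from this PR or from #230):

```python
import os
from pathlib import Path


def remove_empty_dirs(experiment_dir: Path) -> None:
    """Delete the experiment directory tree if no files were written into it."""
    # Walk bottom-up so empty subdirectories vanish before their parents are checked.
    for dirpath, _dirnames, _filenames in os.walk(experiment_dir, topdown=False):
        if not os.listdir(dirpath):
            os.rmdir(dirpath)
```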

@sbrandstaeter sbrandstaeter force-pushed the introduce-logger-on-dask-worker branch from e687545 to 6cf1c78 on October 8, 2025 10:58
Member

@leahaeusel leahaeusel left a comment

Since Sebastian asked us to have a look at this PR in a past developer meeting, I have added my point of view on the open conversations. Hopefully, this will help to get this merged soon 🤞

@sbrandstaeter sbrandstaeter force-pushed the introduce-logger-on-dask-worker branch from 6cf1c78 to 86ac5d0 on November 6, 2025 15:13

Development

Successfully merging this pull request may close these issues.

Logging for dask cluster and workers, i.e., drivers, is not working
