
Introduce logger on dask workers #170

Open
sbrandstaeter wants to merge 8 commits into queens-py:main from sbrandstaeter:introduce-logger-on-dask-worker

Conversation

@sbrandstaeter
Member

Description and Context:
What and Why?

Summary

This PR introduces logging support for code running inside Dask workers, specifically within the driver.run and dataprocessor.get_data_from_file methods, as discussed in issue #165. This should enable debugging these methods even on an HPC cluster.

Motivation

Currently, there's no logging when these methods are executed on a Dask worker, making it difficult to debug or monitor behavior during distributed execution.

What's Proposed

  • A dedicated logger (self.logger_on_dask_worker) is initialized inside the above methods if it doesn't already exist (see the sketch after this list).
  • This allows logging within Dask workers to be captured reliably.
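
A minimal sketch of this lazy-initialization pattern (only the attribute name self.logger_on_dask_worker comes from this PR; the class and method shown are hypothetical):

```python
import logging


class MyDriver:
    """Hypothetical driver used only to illustrate the pattern."""

    def run(self, sample, job_id):
        # Create the worker-side logger on first use; subsequent calls on the
        # same worker process reuse the existing instance.
        if getattr(self, "logger_on_dask_worker", None) is None:
            self.logger_on_dask_worker = logging.getLogger(f"{__name__}.worker")
        self.logger_on_dask_worker.info("Running job %s", job_id)
```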

Logging Setup

  • Logs are written both to file and to the standard stream (a setup sketch follows this list).
  • One log file per worker is used. If a worker restarts and gets the same ID (e.g. due to restart_worker=True), it reuses the same log file.
  • Log files are stored under <experiment_dir>/dask_workers_logs.
  • The stream output might also be captured by job schedulers like SLURM, depending on the environment.
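
A minimal sketch of such a setup, assuming the worker ID is obtained via dask.distributed.get_worker() (the directory name and the dual file/stream handlers are from this PR; everything else, including the helper name, is an assumption):

```python
import logging
from pathlib import Path

from dask.distributed import get_worker  # raises ValueError outside a worker


def setup_worker_logger(experiment_dir: Path) -> logging.Logger:
    """Configure one log file per Dask worker plus output to the standard stream."""
    worker_name = get_worker().name  # stable if a restarted worker gets the same ID
    log_dir = experiment_dir / "dask_workers_logs"
    log_dir.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(f"dask_worker.{worker_name}")
    if not logger.handlers:  # avoid stacking duplicate handlers on repeated calls
        logger.addHandler(logging.FileHandler(log_dir / f"worker_{worker_name}.log"))
        logger.addHandler(logging.StreamHandler())  # captured by e.g. SLURM
        logger.setLevel(logging.INFO)
    return logger
```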

Notes

This PR is meant to establish a solid initial foundation for Dask-side logging. While the current solution may not cover all edge cases or follow every logging best practice, it's a step in the right direction and open to future improvements. Feedback on the design and setup is welcome.

Related Issues and Pull Requests

Interested Parties

@sbrandstaeter sbrandstaeter force-pushed the introduce-logger-on-dask-worker branch from 143eff2 to 01c9f02 on June 7, 2025 22:06
Member

@gilrrei gilrrei left a comment

@sbrandstaeter I like this feature and I think we should merge this :)

Args:
experiment_dir (Path): Directory for data of a specific experiment on the computing machine.
"""
log_dir = experiment_dir / "dask_workers_logs"
Member

I must admit I find this a bit unintuitive. I would look for the logs of a job in that job's job_dir. Why go for this? Is there a particular reason?

Member Author

@sbrandstaeter sbrandstaeter Aug 23, 2025

That is a very good point and should be clearly stated. If you do not restart the worker after each job, a specific worker can take on several jobs, and there is no way to know in advance which ones these will be. Of course, the information is available once the job starts, and we could just change the log file on a per-job basis.
However, the logger also writes more general logging information of the worker (the same information that is written to stdout/stderr on a cluster and thus captured in the respective files written by, e.g., SLURM). This information is particularly useful when debugging Dask job states (i.e., Dask-internal matters). If it is scattered across several per-job files, it will be difficult to track. I admit that I might be biased here, because this was one of the main scenarios I used the logging for while developing this feature.
For this reason, I opted to collect all the logging of one worker in one file. I introduced the new folder so as not to clutter the experiment dir even more.

The information that is still missing and has to be captured somewhere, perhaps in the metadata: job x was run on worker y. We could also write the log-file path to the metadata for easier access.

My statement above was incorrect. The place where all the information is collected is, of course, in the SLURM log files that capture stdout and stderr.

It should be possible to adjust the log file on a per-job basis, and I agree that this makes perfect sense.
It will be the loggers at the driver level (and below, like the data processor) that are affected, so this really always operates at the single-job level.
We will still be able to collect the entire log of a worker chronologically via the SLURM log (if we also use the StreamHandler in the driver logger).
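
One way to realize such a per-job log file could be to swap the file handler whenever a new job starts (a sketch under assumed names; this approach is not taken verbatim from the PR):

```python
import logging
from pathlib import Path


def switch_to_job_log_file(logger: logging.Logger, log_file: Path) -> None:
    """Redirect the logger's file output to the log file of the current job."""
    # Drop the previous per-job file handler but keep plain stream handlers,
    # so the chronological worker log in the SLURM output stays complete.
    for handler in list(logger.handlers):
        if isinstance(handler, logging.FileHandler):
            logger.removeHandler(handler)
            handler.close()
    log_file.parent.mkdir(parents=True, exist_ok=True)
    logger.addHandler(logging.FileHandler(log_file))
```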

Member Author

I have implemented a version that creates a log file per job.

This required exposing the _manage_paths method to the Driver parent class.
I don't see this as a negative, though, as the Function driver might make use of the predefined QUEENS experiment dir structure.

Member Author

Eventually, I decided to further split the _manage_paths method into a _manage_paths and a _manage_output_files method.
This allows finer control over the directory structure (see the sketch below).
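
For illustration, such a split might look roughly like this (hypothetical signatures and directory layout; only the two method names come from this discussion):

```python
from pathlib import Path


class Driver:
    """Hypothetical parent class used only to illustrate the split."""

    def _manage_paths(self, experiment_dir: Path, job_id: int):
        # Derive the per-job directory layout without touching the file system.
        job_dir = experiment_dir / str(job_id)
        output_dir = job_dir / "output"
        return job_dir, output_dir

    def _manage_output_files(self, output_dir: Path, experiment_name: str, job_id: int):
        # Create the output directory and derive the output and log file paths.
        output_dir.mkdir(parents=True, exist_ok=True)
        output_file = output_dir / f"{experiment_name}_{job_id}"
        log_file = output_dir / f"{experiment_name}_{job_id}.log"
        return output_file, log_file
```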

@sbrandstaeter sbrandstaeter force-pushed the introduce-logger-on-dask-worker branch from 01c9f02 to de5bd40 on August 23, 2025 10:56
Comment on lines 131 to 134
job_dir,
output_dir,
output_file,
log_file,
Member Author

While these arguments are not used here, they might be used by a user-defined Python function.
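
For instance, a user-defined function passed to the Function driver might accept these paths as optional keyword arguments (a hypothetical example; the exact calling convention is an assumption):

```python
def my_simulation(x1, x2, job_dir=None, output_dir=None, output_file=None, log_file=None):
    """Hypothetical user function that can optionally use the per-job paths."""
    result = x1**2 + x2**2
    if output_file is not None:
        # Persist the result next to the other per-job output files.
        with open(f"{output_file}.txt", "w", encoding="utf-8") as f:
            f.write(str(result))
    return result
```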

Member

I agree that this is useful. However, experiment_dir, job_dir, and output_dir do not exist when using the Function driver. I'm not saying this is a bad thing, just that it might give users the wrong impression.

Member

I am also not sure about passing all of these parameters. I would prefer to pass only the ones that are actually needed in the current implementation of all drivers. For the jobscript driver, we could simply call self._manage_paths() again instead of passing job_dir, output_dir, output_file, and log_file.

Member Author

I reverted the signature of the _run function.

Note that the experiment_dir seems to always be created (especially if we merge #230).
This might be unwanted behaviour, e.g., for a Function driver that does not need any directories.
On the plus side, the implementation proposed in this PR only creates the job_dir/output_dir folders when the worker log files are actually written.

Member

That's a very good point about the experiment_dir always being created. I will add a check to #230 that deletes the experiment_dir after the QUEENS run is over if it is empty 👍

Member Author

That is an interesting idea. If we do it this way, we might as well also create the whole folder structure experiment_dir/job_dir/output_dir and delete it if no files exist inside.
It is not a particularly elegant solution, but it should work fine, and we could remove the flag write_worker_log_files after all (see the sketch below).
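
A sketch of that cleanup idea (an assumed implementation, not code from this PR or from #230):

```python
import os
from pathlib import Path


def remove_empty_dirs(experiment_dir: Path) -> None:
    """Delete the experiment directory tree if no files were written into it."""
    # Walk bottom-up so empty subdirectories vanish before their parents are checked.
    for dirpath, _dirnames, _filenames in os.walk(experiment_dir, topdown=False):
        if not os.listdir(dirpath):
            os.rmdir(dirpath)
```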

@sbrandstaeter sbrandstaeter force-pushed the introduce-logger-on-dask-worker branch from e687545 to 6cf1c78 on October 8, 2025 10:58
Member

@leahaeusel leahaeusel left a comment

Since Sebastian asked us to have a look at this PR in a past developer meeting, I have added my point of view on the open conversations. Hopefully, this will help to get this merged soon 🤞

@sbrandstaeter sbrandstaeter force-pushed the introduce-logger-on-dask-worker branch from 6cf1c78 to 86ac5d0 on November 6, 2025 15:13

Development

Successfully merging this pull request may close these issues.

Logging for dask cluster and workers, i.e., drivers, is not working
