Introduce logger on dask workers #170
Conversation
143eff2 was force-pushed to 01c9f02.
gilrrei left a comment:
@sbrandstaeter I like this feature and I think we should merge this :)
queens/utils/config_directories.py (outdated):

    Args:
        experiment_dir (Path): Directory for data of a specific experiment on the computing machine.
    """
    log_dir = experiment_dir / "dask_workers_logs"
I must admit I find this a bit unintuitive. I would look for the logging of a job in the job_dir of that job. Why go for this approach? Is there a particular reason?
That is a very good point and should be clearly stated. If you do not restart the worker after each job, a specific worker can take on several jobs, and there is no way to know in advance which ones these are. Of course, the information is available once the job starts, and we could change the log file on a per-job basis.
However, the logger also writes more general logging information of the worker (the same info that is written to stdout/stderr on a cluster and thus captured in the respective files written by, e.g., SLURM). This information is particularly useful when debugging dask job states (i.e., dask-internal behavior). If this information is scattered across several files (one per job), it becomes difficult to track. I admit I might be biased here, because that was one of the main scenarios I used the logging for while developing this feature.
For this reason, I opted to collect all the logging of one worker in one file. I introduced the new folder to avoid cluttering the experiment dir even more.
The information that is still missing and has to be captured somewhere, perhaps in the metadata: job x was run on worker y. We could also write the log-file path to the metadata for easier access.
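As an aside, the per-worker, single-file setup argued for above could be sketched roughly like this. The helper name, log format, and initialization logic are my assumptions for illustration; only the `dask_workers_logs` directory layout comes from this thread, and this is not the actual QUEENS implementation:

```python
import logging
from pathlib import Path


def get_worker_logger(experiment_dir: Path, worker_name: str) -> logging.Logger:
    """Return a logger that appends all records of one worker to a single file.

    Hypothetical helper: all jobs executed by the same worker end up in
    <experiment_dir>/dask_workers_logs/<worker_name>.log.
    """
    log_dir = experiment_dir / "dask_workers_logs"
    log_dir.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(f"dask_worker.{worker_name}")
    if not logger.handlers:  # initialize only once per worker process
        handler = logging.FileHandler(log_dir / f"{worker_name}.log")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

Because `logging.getLogger` returns the same logger object for the same name, repeated calls from several jobs on one worker keep appending to the same file.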
My statement above was incorrect. The place where all the information is collected is, of course, the SLURM log files that capture stdout and stderr.
It should be possible to adjust the log file on a per-job basis, and I agree that it makes perfect sense.
It will be the loggers on the driver level (and below, like the data processor) that are affected by this, so this really always happens on the single-job level.
We will still be able to collect the entire log of a worker chronologically via the SLURM log (if we also use the StreamHandler in the driver logger).
I have implemented a version that creates a log file per job.
This required exposing the _manage_paths method to the Driver parent class.
I don't see this as a negative, though, as the Function driver might make use of the predefined QUEENS experiment dir structure.
Eventually, I decided to further split the _manage_paths method into a _manage_paths and a _manage_output_files method.
This allows better control of the structure.
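Just to illustrate, the split might look something like this. The class skeleton, method signatures, and path layout are hypothetical stand-ins based on the names mentioned in this thread, not the real QUEENS code:

```python
from pathlib import Path


class Driver:
    """Hypothetical stand-in for the QUEENS Driver parent class."""

    def __init__(self, experiment_dir: Path, experiment_name: str):
        self.experiment_dir = experiment_dir
        self.experiment_name = experiment_name

    def _manage_paths(self, job_id: int):
        """Create and return the per-job directory structure."""
        job_dir = self.experiment_dir / str(job_id)
        output_dir = job_dir / "output"
        output_dir.mkdir(parents=True, exist_ok=True)
        return job_dir, output_dir

    def _manage_output_files(self, job_dir: Path, output_dir: Path):
        """Derive the per-job output and log file paths (no side effects)."""
        stem = f"{self.experiment_name}_{job_dir.name}"
        output_file = output_dir / f"{stem}.out"
        log_file = job_dir / f"{stem}.log"
        return output_file, log_file
```

Splitting directory creation from file-path derivation means a driver can compute file names without forcing the directories into existence.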
01c9f02 was force-pushed to de5bd40.
src/queens/drivers/function.py (outdated):

    job_dir,
    output_dir,
    output_file,
    log_file,
While these arguments are not used here, they might be used by a user-defined Python function.
I agree that this is useful. However, experiment_dir, job_dir and output_dir do not exist when using the Function driver. I am not saying this is a bad thing, just that it might give users a wrong impression.
I am also not sure about passing all of these parameters. I would prefer to only pass the ones that are actually needed in the current implementation of all drivers. For the jobscript driver, we could just simply call self._manage_paths() again instead of passing job_dir, output_dir, output_file, log_file.
I reverted the signature of the _run function.
Note that the experiment_dir always seems to be created (especially if we merge #230).
This might be unwanted behaviour, e.g., for a Function driver that does not need any directories.
On the plus side, the implementation proposed in this PR only creates the job_dir/output_dir folders if the worker log files are written.
That's a very good point about the experiment_dir always being created. I will add a check to #230 after the QUEENS run is over to delete the experiment_dir if it's empty 👍
That is an interesting idea. If we do it this way, we might as well create the whole folder structure experiment_dir/job_dir/output_dir and delete it if no files exist inside.
It is not a particularly elegant solution, but it should work fine, and we could then remove the write_worker_log_files flag altogether.
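The cleanup idea could be sketched like this. This is purely illustrative; the function name and the bottom-up recursion are my own choices, not what #230 actually implements:

```python
from pathlib import Path


def remove_if_empty(directory: Path) -> bool:
    """Remove a directory tree bottom-up if it contains no files.

    Returns True if the directory was removed. Sketch of deleting
    experiment_dir/job_dir/output_dir after a run when no files were written.
    """
    if not directory.is_dir():
        return False
    # Recurse into subdirectories first so empty nested dirs vanish.
    for child in list(directory.iterdir()):
        if child.is_dir():
            remove_if_empty(child)
    if not any(directory.iterdir()):
        directory.rmdir()
        return True
    return False
```

Any file anywhere in the tree keeps the directories above it alive, so a run that produced output is left untouched.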
e687545 was force-pushed to 6cf1c78.
leahaeusel left a comment:
Since Sebastian asked us to have a look at this PR in a past developer meeting, I have added my point of view on the open conversations. Hopefully, this will help to get this merged soon 🤞
6cf1c78 was force-pushed to 86ac5d0.
Description and Context:
What and Why?
Summary
This PR introduces logging support for code running inside Dask workers, specifically within the driver.run and data_processor.get_data_from_file methods, as discussed in issue #165. This should enable debugging these methods even on an HPC cluster.

Motivation
Currently, there's no logging when these methods are executed on a Dask worker, making it difficult to debug or monitor behavior during distributed execution.
What's Proposed
A logger (self.logger_on_dask_worker) is initialized inside the above methods if it doesn't already exist.

Logging Setup
If a worker is restarted (restart_worker=True), it reuses the same log file. Log files are collected in <experiment_dir>/dask_workers_logs.

Notes
This PR is meant to establish a solid initial foundation for Dask-side logging. While the current solution may not cover all edge cases or logging best practices, it's a step in the right direction and open for future improvements. Feedback on design and setup is welcome.
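The lazy, idempotent initialization described above could look roughly as follows. Apart from the attribute name self.logger_on_dask_worker (which is from this PR), the class, method, and logger names are my assumptions; this is a simplified sketch, not the actual implementation:

```python
import logging
import sys
from pathlib import Path


class WorkerLoggingMixin:
    """Sketch: create the worker logger on first use inside driver.run."""

    logger_on_dask_worker = None

    def _ensure_worker_logger(self, log_dir: Path, worker_name: str) -> logging.Logger:
        if self.logger_on_dask_worker is None:
            log_dir.mkdir(parents=True, exist_ok=True)
            logger = logging.getLogger(f"queens.worker.{worker_name}")
            if not logger.handlers:
                # File handler: one persistent file per worker.
                logger.addHandler(logging.FileHandler(log_dir / f"{worker_name}.log"))
                # Stream handler: keeps stdout (and thus the SLURM log) chronological.
                logger.addHandler(logging.StreamHandler(sys.stdout))
                logger.setLevel(logging.INFO)
            self.logger_on_dask_worker = logger
        return self.logger_on_dask_worker
```

Subsequent calls on the same object return the cached logger, so repeated jobs on one worker neither duplicate handlers nor open new files.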
Related Issues and Pull Requests
Interested Parties