Nexus: Add check for failure to properly call qmcpack executable #5836
brockdyer03 wants to merge 2 commits into QMCPACK:develop
… checking for `'QMCPACK'` in output
|
Have you thought about a more general way to handle these kinds of problems, rather than ad hoc, one-code-at-a-time fixes? For example, this fix won't catch a missing pw.x. The problem is very general (#5833). Some possibilities involve noticing that the process is no longer present (workstation mode) or that the queued job is no longer present (Slurm). I thought (perhaps incorrectly) that Nexus historically had logic to catch this, so I wonder if it was broken at some point. |
|
How does Nexus track the completion of a job or command? |
|
This requires quite a bit more explanation |
|
Nexus always knows whether a process is complete (workstation) or out of the queue (cluster). Nexus does not check executable paths. One reason is that the exe might be accessed via PATH or an alias. Also, the path may exist on your workstation (where you ran scf) but not on Frontier (where you run your qmc). |
|
Exit codes from |
|
To fully communicate the reasons behind Nexus' design, I suggest we have an offline meeting, if that is indeed necessary. The general intended route is via |
Relying solely on output is just not enough. |
|
I am pretty sure that if I think the main problem though is that it isn't clear what the errors would be in most cases. If there's a place you can recommend where the various errors and exit codes are listed, that'd make it easier to write logic for it. |
|
@ye-luo many QMCPACK optimization runs produce usable output even if the code calls ABORT. Many tools/codes don't produce correct exit codes. Correct output is both necessary and sufficient to assess success/failure. Full stop. |
|
Migrating this discussion back to #5833. The PR here increases Nexus' robustness, and although it doesn't come with a pony, it should get merged. |
|
I tried my reproducer and got The output is quite confusing and doesn't give sufficient info about the failure. I do have a question: does this happen when Nexus detects that the process has returned and analyzes the output, or is it discovered through a periodic check?
results from a post check on the QMCPACK output file. The check is whether the word "QMCPACK" is found in the output file, since it appears in the splash text right at the beginning. The root cause can be all kinds of things, and it is basically impossible to catch and differentiate them all. It is a warning because Nexus errors halt the execution of the main Nexus process. Nexus should not abort in these cases since, e.g., a user who has 100 workflows going doesn't want Nexus to stop working when one of them has a problem. The message could be clearer in saying what is wrong and what to do about it. |
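The post check described above can be sketched roughly as follows. This is a minimal illustration, not the Nexus implementation: the names `BANNER`, `output_has_banner`, and `post_check` are hypothetical, and the only behavior taken from the discussion is "look for the word QMCPACK in the output file, and warn rather than abort".

```python
import warnings
from pathlib import Path

# Word expected in the splash text of a real run (from the discussion above).
BANNER = "QMCPACK"


def output_has_banner(path):
    """Return True if the output file exists and contains the banner word."""
    p = Path(path)
    if not p.is_file():
        return False
    return BANNER in p.read_text(errors="replace")


def post_check(outputs):
    """Check every output file; warn about failures instead of raising.

    Returning the failed paths (rather than raising) lets the remaining
    workflows keep going when one of them has a problem.
    """
    failed = []
    for path in outputs:
        if not output_has_banner(path):
            warnings.warn(
                f"{path}: '{BANNER}' not found; the executable may not have run"
            )
            failed.append(path)
    return failed
```

Calling `post_check` on a list of output files returns only the paths that lack the banner, so a caller can mark those runs as failed and continue with the rest.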
…analyzed` for `QmcpackAnalyzer`
|
@ye-luo can you give the latest commit a test? I believe the problem was that, unlike the less-developed analyzers, the I updated the code in Based on the error message you provided, I think this should fix that. |
@jtkrogel I was wondering when this "post check" happens |
|
@brockdyer03 I don't know what to expect from your last change. Here is the output I didn't see any noticeable change. |
|
Yeah, after some conversation with @jtkrogel this morning I think we realized that there's a fair bit more at play than we originally expected. I think fixing the related bugs/unintended behavior will take some time. |
Proposed changes
This fixes a bug where Nexus wouldn't be able to detect a run failure if the `qmcpack` executable was not called at all. In my case I forgot to set my `$PATH` to point to `qmcpack/build/bin`, so `mpirun` never got anywhere. To fix this, a check for `'QMCPACK'` in the output file was added, and some logic in the analyzer was added to inform the user of the problem and skip the analysis.

It's a very barebones and simple check, but without sufficient information about other possible failure modes it's the best available that most likely won't get any false positives.
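As a rough illustration of the "inform the user and skip the analysis" behavior, the sketch below uses a hypothetical `SketchAnalyzer` class. It is not `QmcpackAnalyzer` and does not reflect its real attributes; the `analyzed` and `failure_reason` fields and the placeholder "analysis" step are assumptions made for the example.

```python
class SketchAnalyzer:
    """Hypothetical stand-in for an output analyzer; not the Nexus class."""

    def __init__(self, output_text):
        self.output_text = output_text
        self.analyzed = False       # becomes True only after a real analysis
        self.failure_reason = None  # short message for the user on failure
        self.results = {}

    def analyze(self):
        # Guard: without the banner word the executable probably never ran,
        # so record why and return early instead of parsing the output.
        if "QMCPACK" not in self.output_text:
            self.failure_reason = (
                "'QMCPACK' not found in output; was the executable called?"
            )
            return self
        # Placeholder for the real analysis step.
        self.results["n_lines"] = len(self.output_text.splitlines())
        self.analyzed = True
        return self


# Output with no banner: the analysis is skipped and a reason is recorded.
missing = SketchAnalyzer("(no program output)\n").analyze()
print(missing.analyzed, missing.failure_reason)
```

Returning early with a recorded reason, instead of raising, keeps one bad run from stopping whatever is driving the other analyses.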
What type(s) of changes does this code introduce?
Does this introduce a breaking change?
What systems has this change been tested on?
Laptop, Fedora 43, Python 3.14.2, Numpy 2.4.2
Checklist