Nexus: Add check for failure to properly call `qmcpack` executable by brockdyer03 · Pull Request #5836 · QMCPACK/qmcpack

brockdyer03 · 2026-03-03T16:26:49Z

Proposed changes

This fixes a bug where Nexus wouldn't be able to detect a run failure if the qmcpack executable was not called at all. In my case I forgot to set my $PATH to point to qmcpack/build/bin, so mpirun never got anywhere. To fix this, a check for 'QMCPACK' in the output file was added, and some logic in the analyzer was added to inform the user of the problem and skip the analysis.

It's a very bare-bones check, but without more information about other possible failure modes it's the best option available that is unlikely to produce false positives.
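A minimal sketch of the kind of check described above, for illustration only. It is not the code in this PR: the function name `qmcpack_started`, the file names, and the file contents are hypothetical, and the only thing taken from the description is the idea of looking for the string 'QMCPACK' in the output file.

```python
import tempfile
from pathlib import Path


def qmcpack_started(outfile, marker="QMCPACK"):
    """Return True if `marker` appears anywhere in the output file.

    A missing file is treated the same as a file without the marker.
    """
    path = Path(outfile)
    if not path.is_file():
        return False
    with path.open(errors="replace") as f:
        return any(marker in line for line in f)


# Demo with made-up file contents (not real QMCPACK or mpirun output).
with tempfile.TemporaryDirectory() as d:
    good = Path(d) / "good.out"
    bad = Path(d) / "bad.out"
    good.write_text("  QMCPACK splash text placeholder\n")
    bad.write_text("launcher could not find the executable\n")
    results = (
        qmcpack_started(good),
        qmcpack_started(bad),
        qmcpack_started(Path(d) / "missing.out"),
    )

print(results)  # (True, False, False)
```

The check only answers "did the executable print anything recognizable", which is why it says nothing about failures that happen later in a run.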

What type(s) of changes does this code introduce?

Bugfix
New feature
Testing changes (e.g. new unit/integration/performance tests)

Does this introduce a breaking change?

No

What systems has this change been tested on?

Laptop, Fedora 43, Python 3.14.2, Numpy 2.4.2

Checklist

- I have read the pull request guidance and develop docs
- This PR is up to date with the current state of 'develop'


prckent · 2026-03-03T16:43:04Z

Have you thought about a more general way to handle these kinds of problems, rather than ad hoc, one-code-at-a-time fixes? E.g., this fix won't catch a missing pw.x.

This problem is very general ( #5833 ). Some possibilities involve noticing that the process is no longer present (workstation mode) or that the queued job is no longer present (slurm). I thought (perhaps incorrectly) that Nexus had logic to catch this historically, so I wonder if it was broken at some point.
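For the workstation case, "noticing that the process is no longer present" can be sketched with the standard library alone. This is a generic illustration, not a description of how Nexus tracks processes; the helper `is_finished` and the stand-in command are assumptions.

```python
import subprocess
import sys


def is_finished(proc):
    """True once the child process has exited; poll() returns None while it runs."""
    return proc.poll() is not None


# Stand-in command: a short Python one-liner instead of a real simulation.
proc = subprocess.Popen([sys.executable, "-c", "pass"])
proc.wait()                    # block until the child exits
finished = is_finished(proc)   # True after wait() returns
exit_code = proc.returncode    # 0 for this stand-in command
```

Knowing that the process is gone does not by itself say whether the run did anything useful, which is the gap this PR is trying to narrow.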

ye-luo · 2026-03-03T16:59:23Z

How does Nexus track the completion of a job or command?
I'm not confident about relying on file printout.
Technically, commands and queued jobs have exit codes that can be used.
Does Nexus track individual commands within a job, or does it just use one mpirun per job?

jtkrogel · 2026-03-03T18:02:40Z

This requires quite a bit more explanation

jtkrogel · 2026-03-03T18:04:59Z

Nexus always knows whether a process is complete (workstation) or out of the queue (cluster).

Nexus does not check executable paths. One reason is that the exe might be accessed via PATH or an alias. Also, the path may exist on your workstation (where you ran scf) but not on Frontier (where you run your qmc).

jtkrogel · 2026-03-03T18:08:52Z

Exit codes from sbatch, srun/mpirun or the simulation executable itself cannot be relied on to reflect the actual success or failure of a simulation run. Other means are necessary.

jtkrogel · 2026-03-03T18:10:52Z

To fully communicate the reasons behind Nexus' design, I suggest we have an offline meeting, if that is indeed necessary.

The general intended route is via check_sim_status in the classes derived from Simulation, as Brock is doing here.

ye-luo · 2026-03-03T19:30:02Z

> Exit codes from sbatch, srun/mpirun or the simulation executable itself cannot be relied on to reflect the actual success or failure of a simulation run. Other means are necessary.

Relying solely on output is just not enough. The srun/mpirun exit code is also very important for validating a successful run. If the exit code is bad, I will never consider a run successful, even if the printout seems complete.
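To make the two signals under discussion concrete, here is a small sketch that records the exit code and an output check side by side, without deciding which one should win. The helper `run_and_assess` and the stand-in commands are hypothetical; the printed strings are placeholders, not real QMCPACK output.

```python
import subprocess
import sys


def run_and_assess(cmd, marker="QMCPACK"):
    """Run `cmd` and report both signals separately: exit code and output marker."""
    done = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "exit_ok": done.returncode == 0,
        "output_ok": marker in done.stdout,
    }


# Stand-in commands (Python one-liners, not real simulation runs).
clean = [sys.executable, "-c", "print('QMCPACK placeholder: finished')"]
aborted = [sys.executable, "-c",
           "import sys; print('QMCPACK placeholder: partial results'); sys.exit(1)"]
not_started = [sys.executable, "-c", "import sys; sys.exit(127)"]

report = {
    "clean": run_and_assess(clean),
    "aborted": run_and_assess(aborted),
    "not_started": run_and_assess(not_started),
}
```

In this toy setup the `aborted` case has a bad exit code but output that contains the marker, which is the situation where the two views in this thread diverge.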

brockdyer03 · 2026-03-03T19:36:05Z

I am pretty sure that if srun or mpirun doesn't exit properly that will get echoed to stderr, which should be picked up in the .err file associated with the calculation. It's possible that it could be caught there and assessed.

I think the main problem though is that it isn't clear what the errors would be in most cases. If there's a place you can recommend where the various errors and exit codes are listed, that'd make it easier to write logic for it.

jtkrogel · 2026-03-03T20:13:59Z

@ye-luo many QMCPACK optimization runs produce usable output even if the code calls ABORT.

Many tools/codes don't produce correct exit codes.

Correct output is both necessary and sufficient to assess success/failure. Full stop.

jtkrogel · 2026-03-03T20:18:05Z

Migrating this discussion back to #5833.

The PR here increases Nexus' robustness, and although it doesn't come with a pony, it should get merged.

jtkrogel

LGTM

ye-luo · 2026-03-03T21:21:33Z

I tried my reproducer and got

    Entering ./scale_1.0 2 
      writing input files  2 opt 
    Entering ./scale_1.0 2 
      sending required files  2 opt 
      submitting job  2 opt 
    Entering ./scale_1.0 2 
      Executing:  
        export OMP_NUM_THREADS=1
        mpirun -np 4 qmcpack opt.in.xml 

  elapsed time 141.2 s  memory 105.79 MB 

  Qmcpack warning:
    QMCPACK did not start properly!
    Entering ./scale_1.0 2 
      copying results  2 opt 
        warning: the following files were missing 
          opt.s000.scalar.dat 
          opt.s000.stat.h5 
          opt.s000.opt.xml 
          opt.s001.scalar.dat 
          opt.s001.stat.h5 
          opt.s001.opt.xml 
          opt.s002.scalar.dat 
          opt.s002.stat.h5 
          opt.s002.opt.xml 
          opt.s003.scalar.dat 
          opt.s003.stat.h5 
          opt.s003.opt.xml 
          opt.s004.scalar.dat 
          opt.s004.stat.h5 
          opt.s004.opt.xml 
          opt.s005.scalar.dat 
          opt.s005.stat.h5 
          opt.s005.opt.xml 
          opt.s006.scalar.dat 
          opt.s006.stat.h5 
          opt.s006.opt.xml 
          opt.s007.scalar.dat 
          opt.s007.stat.h5 
          opt.s007.opt.xml 
          opt.s008.scalar.dat 
          opt.s008.stat.h5 
          opt.s008.opt.xml 
          opt.s009.scalar.dat 
          opt.s009.stat.h5 
          opt.s009.opt.xml 
          opt.s010.scalar.dat 
          opt.s010.stat.h5 
          opt.s010.opt.xml 
          opt.s011.scalar.dat 
          opt.s011.stat.h5 
          opt.s011.opt.xml 
    Entering ./scale_1.0 2 
      analyzing  2 opt 

  QmcpackAnalyzer warning:
    Simulation failed, skipping analysis!
Traceback (most recent call last):
  File "/home/yeluo/opt/qmcpack/labs/lab2_qmc_basics/oxygen_dimer/O_dimer.py", line 163, in <module>
    run_project(sims) 
  File "/home/yeluo/opt/qmcpack/nexus/nexus/__init__.py", line 84, in run_project
    pm.run_project()
  File "/home/yeluo/opt/qmcpack/nexus/nexus/project_manager.py", line 99, in run_project
    self.progress_cascades()
  File "/home/yeluo/opt/qmcpack/nexus/nexus/project_manager.py", line 332, in progress_cascades
    cascade.progress()
  File "/home/yeluo/opt/qmcpack/nexus/nexus/simulation.py", line 1290, in progress
    sim.progress(self.simid)
  File "/home/yeluo/opt/qmcpack/nexus/nexus/simulation.py", line 1290, in progress
    sim.progress(self.simid)
  File "/home/yeluo/opt/qmcpack/nexus/nexus/simulation.py", line 1266, in progress
    self.analyze()
  File "/home/yeluo/opt/qmcpack/nexus/nexus/simulation.py", line 1175, in analyze
    self.post_analyze(analyzer)
  File "/home/yeluo/opt/qmcpack/nexus/nexus/qmcpack.py", line 1228, in post_analyze
    opt_file = analyzer.results.optimization.optimal_file
AttributeError: 'OptimizationAnalyzer' object has no attribute 'optimal_file'

The output is quite confusing and doesn't give sufficient information about the failure.
Missing files can have more than one cause: 1) qmcpack doesn't run, or 2) files cannot be created due to disk quota.

I do have questions

  Qmcpack warning:
    QMCPACK did not start properly!

Does this happen when Nexus detects that the process has returned and analyzes the output, or is it discovered through a periodic check?
In the former case, there is no concern. The message should say this is an error, not a warning.
In the latter case, what happens if qmcpack just starts slowly due to delays in mpirun? Does Nexus stop, or does it keep running? It may cause a false failure if the command/job starts slowly.

jtkrogel · 2026-03-03T22:12:34Z

  Qmcpack warning:
    QMCPACK did not start properly!

results from a post check on the QMCPACK output file. The check is whether the word "QMCPACK" is found in the output file, since it appears in the splash text right at the beginning.

The root cause for this can be all kinds of things. It is basically impossible to catch and differentiate them all.

The reason why it is a warning is that Nexus errors halt the execution of the main Nexus process. Nexus should not abort in these cases since, e.g. a user that has 100 workflows going doesn't want Nexus to stop working when one of them has some problem.
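The warn-and-continue behavior described here can be illustrated with a toy loop. This is not Nexus code: `progress_workflows`, the workflow names, and the boolean "started" flags are all made up for the example.

```python
def progress_workflows(workflows):
    """Process every workflow; log a warning for a failed one instead of aborting.

    `workflows` maps a workflow name to a flag saying whether its run started.
    """
    completed = []
    log = []
    for name, started in workflows.items():
        if not started:
            # A warning is recorded and the loop moves on to the next workflow.
            log.append(f"warning: {name} did not start properly, skipping analysis")
            continue
        completed.append(name)
    return completed, log


workflows = {"wf1": True, "wf2": False, "wf3": True}
completed, log = progress_workflows(workflows)
print(completed)  # ['wf1', 'wf3']
print(log)        # ['warning: wf2 did not start properly, skipping analysis']
```

Raising an exception at the warning site would instead stop `wf3` from being processed, which is the outcome the comment above wants to avoid.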

The message could be clearer in saying what is wrong and what to do about it.


brockdyer03 · 2026-03-04T13:19:13Z

@ye-luo can you give the latest commit a test? I believe the problem was that, unlike the less-developed analyzers, the QmcpackAnalyzer class stores progress info in self.info, which is an instance of QAinformation.

I updated the code in Simulation.progress() to check if that attribute exists, which I don't think exists in other analyzer classes, and then appropriately set the analysis status.

Based on the error message you provided, I think this should fix that.
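A rough sketch of the attribute check described above, under stated assumptions: `analysis_completed` is a hypothetical helper, and the stand-in objects only mimic the shape mentioned in this comment (an analyzer with an `info` object carrying an `analyzed` flag versus an analyzer with a plain `analyzed` flag). The actual change lives in `Simulation.progress()` and may look different.

```python
from types import SimpleNamespace


def analysis_completed(analyzer):
    """Prefer `analyzer.info.analyzed` when present, else fall back to `analyzer.analyzed`."""
    info = getattr(analyzer, "info", None)
    if info is not None and hasattr(info, "analyzed"):
        return bool(info.analyzed)
    return bool(getattr(analyzer, "analyzed", False))


# Stand-ins for the two kinds of analyzer (hypothetical, not Nexus classes).
with_info = SimpleNamespace(info=SimpleNamespace(analyzed=False))
plain = SimpleNamespace(analyzed=True)
bare = SimpleNamespace()

states = (
    analysis_completed(with_info),
    analysis_completed(plain),
    analysis_completed(bare),
)
print(states)  # (False, True, False)
```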

ye-luo · 2026-03-04T16:08:26Z

  Qmcpack warning:
    QMCPACK did not start properly!
results from a post check on the QMCPACK output file. The check is whether the word "QMCPACK" is found in the output file, since it appears in the splash text right at the beginning.

@jtkrogel I was wondering when this "post check" happens.
I would prefer a direct answer to my previous question:

Does this happen when Nexus detects that the command execution process has returned and analyzes the output, or is it discovered through a periodic check?

ye-luo · 2026-03-04T16:22:59Z

@brockdyer03 I don't know what to expect from your last change. Here is the output:

    Entering ./scale_1.0 2 
      writing input files  2 opt 
    Entering ./scale_1.0 2 
      sending required files  2 opt 
      submitting job  2 opt 
    Entering ./scale_1.0 2 
      Executing:  
        export OMP_NUM_THREADS=1
        mpirun -np 4 qmcpack opt.in.xml 

  elapsed time 147.2 s  memory 105.92 MB 

  Qmcpack warning:
    QMCPACK did not start properly!
    Entering ./scale_1.0 2 
      copying results  2 opt 
        warning: the following files were missing 
          opt.s000.scalar.dat 
          opt.s000.stat.h5 
          opt.s000.opt.xml 
          opt.s001.scalar.dat 
          opt.s001.stat.h5 
          opt.s001.opt.xml 
          opt.s002.scalar.dat 
          opt.s002.stat.h5 
          opt.s002.opt.xml 
          opt.s003.scalar.dat 
          opt.s003.stat.h5 
          opt.s003.opt.xml 
          opt.s004.scalar.dat 
          opt.s004.stat.h5 
          opt.s004.opt.xml 
          opt.s005.scalar.dat 
          opt.s005.stat.h5 
          opt.s005.opt.xml 
          opt.s006.scalar.dat 
          opt.s006.stat.h5 
          opt.s006.opt.xml 
          opt.s007.scalar.dat 
          opt.s007.stat.h5 
          opt.s007.opt.xml 
          opt.s008.scalar.dat 
          opt.s008.stat.h5 
          opt.s008.opt.xml 
          opt.s009.scalar.dat 
          opt.s009.stat.h5 
          opt.s009.opt.xml 
          opt.s010.scalar.dat 
          opt.s010.stat.h5 
          opt.s010.opt.xml 
          opt.s011.scalar.dat 
          opt.s011.stat.h5 
          opt.s011.opt.xml 
    Entering ./scale_1.0 2 
      analyzing  2 opt 

  QmcpackAnalyzer warning:
    Simulation failed, skipping analysis!
Traceback (most recent call last):
  File "/home/yeluo/opt/qmcpack/labs/lab2_qmc_basics/oxygen_dimer/O_dimer.py", line 163, in <module>
    run_project(sims) 
  File "/home/yeluo/opt/qmcpack/nexus/nexus/__init__.py", line 84, in run_project
    pm.run_project()
  File "/home/yeluo/opt/qmcpack/nexus/nexus/project_manager.py", line 99, in run_project
    self.progress_cascades()
  File "/home/yeluo/opt/qmcpack/nexus/nexus/project_manager.py", line 332, in progress_cascades
    cascade.progress()
  File "/home/yeluo/opt/qmcpack/nexus/nexus/simulation.py", line 1296, in progress
    sim.progress(self.simid)
  File "/home/yeluo/opt/qmcpack/nexus/nexus/simulation.py", line 1296, in progress
    sim.progress(self.simid)
  File "/home/yeluo/opt/qmcpack/nexus/nexus/simulation.py", line 1272, in progress
    self.analyze()
  File "/home/yeluo/opt/qmcpack/nexus/nexus/simulation.py", line 1175, in analyze
    self.post_analyze(analyzer)
  File "/home/yeluo/opt/qmcpack/nexus/nexus/qmcpack.py", line 1228, in post_analyze
    opt_file = analyzer.results.optimization.optimal_file
AttributeError: 'OptimizationAnalyzer' object has no attribute 'optimal_file'

I didn't see any noticeable change.

brockdyer03 · 2026-03-04T16:47:47Z

Yeah, after some conversation with @jtkrogel this morning, I think we realized that there's a fair bit more at play than we originally expected. I think fixing the related bugs/unintended behavior will take some time.

Nexus: Add check for failure to properly call qmcpack executable by…

8d25098

… checking for `'QMCPACK'` in output

brockdyer03 requested review from jtkrogel and ye-luo March 3, 2026 16:26

brockdyer03 added the nexus and python labels Mar 3, 2026

jtkrogel mentioned this pull request Mar 3, 2026

Nexus doesn't detect run failure #5833

Open

jtkrogel previously approved these changes Mar 3, 2026

Nexus: Add extra check for self.info.analyzed in addition to `self.…

d91ec52

…analyzed` for `QmcpackAnalyzer`

brockdyer03 dismissed jtkrogel’s stale review via d91ec52 March 4, 2026 13:16
