
Nexus: Add check for failure to properly call qmcpack executable#5836

Open
brockdyer03 wants to merge 2 commits intoQMCPACK:developfrom
brockdyer03:update_qmc_fail_check

Conversation

@brockdyer03
Contributor

Proposed changes

This fixes a bug where Nexus could not detect a run failure if the qmcpack executable was never called at all. In my case I forgot to set my $PATH to point to qmcpack/build/bin, so mpirun never got anywhere. To fix this, a check for 'QMCPACK' in the output file was added, along with some logic in the analyzer to inform the user of the problem and skip the analysis.

It's a very barebones check, but without more information about other possible failure modes it's the best available option that is unlikely to produce false positives.
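
As an illustration of the kind of check described above, here is a minimal sketch. It is not the code in this PR: the helper name `qmcpack_started` and its treatment of a missing file are assumptions. The only detail taken from the description is that the string 'QMCPACK' is searched for in the output file.

```python
from pathlib import Path


def qmcpack_started(outfile, marker="QMCPACK"):
    """Return True if `marker` appears anywhere in the output file.

    Hypothetical helper, not part of Nexus. A missing output file is
    treated as "did not start".
    """
    path = Path(outfile)
    if not path.is_file():
        return False
    # Decode leniently so stray non-UTF-8 bytes do not raise an exception.
    text = path.read_text(errors="replace")
    return marker in text
```

The match is case-sensitive, so a launcher message that only mentions the lowercase executable name `qmcpack` would not be mistaken for the splash text.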

What type(s) of changes does this code introduce?

  • Bugfix
  • New feature
  • Testing changes (e.g. new unit/integration/performance tests)

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

Laptop, Fedora 43, Python 3.14.2, Numpy 2.4.2

Checklist

    • I have read the pull request guidance and develop docs
    • This PR is up to date with the current state of 'develop'

brockdyer03 requested review from jtkrogel and ye-luo Mar 3, 2026
brockdyer03 added the nexus and python labels Mar 3, 2026
@prckent
Contributor

prckent commented Mar 3, 2026

Have you thought about a more general way to handle these kinds of problems, rather than ad hoc, one-code-at-a-time fixes? For example, this fix won't catch a missing pw.x.

This problem is very general ( #5833 ). Some possibilities involve noticing that the process is no longer present (workstation mode) or that the queued job is no longer present (slurm). I thought (perhaps incorrectly) that Nexus historically had logic to catch this, so I wonder if it was broken at some point.
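
For the workstation case, "noticing that the process is no longer present" could look roughly like the sketch below. This is only an illustration using Python's standard `subprocess` module; the helper name `wait_until_gone` and its parameters are hypothetical, and it does not describe how Nexus actually tracks processes or queued jobs.

```python
import subprocess
import sys
import time


def wait_until_gone(process, poll_interval=0.1, timeout=10.0):
    """Poll a subprocess.Popen object until it exits or the timeout elapses.

    Hypothetical helper. Returns the exit code, or None if the process
    is still running when the timeout is reached.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        code = process.poll()  # None while the process is still running
        if code is not None:
            return code
        time.sleep(poll_interval)
    return process.poll()


if __name__ == "__main__":
    # Stand-in for a real simulation command.
    proc = subprocess.Popen([sys.executable, "-c", "pass"])
    print("exit code:", wait_until_gone(proc))
```

Knowing that the process has gone says nothing by itself about whether the run did useful work, which is the question the rest of this thread discusses.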

@ye-luo
Contributor

ye-luo commented Mar 3, 2026

How does Nexus track the completion of a job or command?
I'm not confident about relying on file printout.
Technically, commands and queued jobs have exit codes that could be used.
Does Nexus track individual commands within a job, or just use one mpirun per job?

@jtkrogel
Contributor

jtkrogel commented Mar 3, 2026

This requires quite a bit more explanation

@jtkrogel
Contributor

jtkrogel commented Mar 3, 2026

Nexus always knows whether a process is complete (workstation) or out of the queue (cluster).

Nexus does not check executable paths. One reason is that the exe might be accessed via PATH or an alias. Also, the path may exist on your workstation (where you ran scf) but not on Frontier (where you run your qmc).

@jtkrogel
Contributor

jtkrogel commented Mar 3, 2026

Exit codes from sbatch, srun/mpirun or the simulation executable itself cannot be relied on to reflect the actual success or failure of a simulation run. Other means are necessary.

@jtkrogel
Contributor

jtkrogel commented Mar 3, 2026

To fully communicate the reasons behind Nexus' design, I suggest we have an offline meeting, if that is indeed necessary.

The general intended route is via check_sim_status in the classes derived from Simulation, as Brock is doing here.

@ye-luo
Contributor

ye-luo commented Mar 3, 2026

> Exit codes from sbatch, srun/mpirun or the simulation executable itself cannot be relied on to reflect the actual success or failure of a simulation run. Other means are necessary.

Relying solely on output is just not enough. The srun/mpirun exit code is also very important for validating a successful run. If the exit code is bad, I will never consider a run successful, even if the printout seems complete.

@brockdyer03
Contributor Author

I am pretty sure that if srun or mpirun doesn't exit properly, that will get echoed to stderr, which should be picked up in the .err file associated with the calculation. It could possibly be caught and assessed there.

I think the main problem though is that it isn't clear what the errors would be in most cases. If there's a place you can recommend where the various errors and exit codes are listed, that'd make it easier to write logic for it.
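
To make the signals under discussion concrete, here is a minimal sketch that collects a command's exit code together with its stdout and stderr using the standard `subprocess.run`. The helper name `run_and_collect` is hypothetical and this is not how Nexus launches jobs. Which of these signals should decide success is exactly what is being debated here, so the sketch only gathers them.

```python
import subprocess
import sys


def run_and_collect(command):
    """Run a command to completion and return its exit code and output streams.

    Hypothetical helper for illustration only.
    """
    done = subprocess.run(command, capture_output=True, text=True)
    return {
        "returncode": done.returncode,
        "stdout": done.stdout,
        "stderr": done.stderr,
    }


if __name__ == "__main__":
    # Stand-in command that writes to stderr and exits with a nonzero code.
    result = run_and_collect(
        [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
    )
    print(result["returncode"], repr(result["stderr"]))
```

How a launcher such as mpirun or srun reports a missing executable, and which exit code it returns, depends on the launcher, so any logic built on these values would need to be checked against the launchers actually in use.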

@jtkrogel
Contributor

jtkrogel commented Mar 3, 2026

@ye-luo many QMCPACK optimization runs produce usable output even if the code calls ABORT.

Many tools/codes don't produce correct exit codes.

Correct output is both necessary and sufficient to assess success/failure. Full stop.

@jtkrogel
Contributor

jtkrogel commented Mar 3, 2026

Migrating this discussion back to #5833.

The PR here increases Nexus' robustness, and although it doesn't come with a pony, it should get merged.

jtkrogel previously approved these changes Mar 3, 2026
Contributor

@jtkrogel jtkrogel left a comment


LGTM

@ye-luo
Contributor

ye-luo commented Mar 3, 2026

I tried my reproducer and got

    Entering ./scale_1.0 2 
      writing input files  2 opt 
    Entering ./scale_1.0 2 
      sending required files  2 opt 
      submitting job  2 opt 
    Entering ./scale_1.0 2 
      Executing:  
        export OMP_NUM_THREADS=1
        mpirun -np 4 qmcpack opt.in.xml 

  elapsed time 141.2 s  memory 105.79 MB 

  Qmcpack warning:
    QMCPACK did not start properly!
    Entering ./scale_1.0 2 
      copying results  2 opt 
        warning: the following files were missing 
          opt.s000.scalar.dat 
          opt.s000.stat.h5 
          opt.s000.opt.xml 
          opt.s001.scalar.dat 
          opt.s001.stat.h5 
          opt.s001.opt.xml 
          opt.s002.scalar.dat 
          opt.s002.stat.h5 
          opt.s002.opt.xml 
          opt.s003.scalar.dat 
          opt.s003.stat.h5 
          opt.s003.opt.xml 
          opt.s004.scalar.dat 
          opt.s004.stat.h5 
          opt.s004.opt.xml 
          opt.s005.scalar.dat 
          opt.s005.stat.h5 
          opt.s005.opt.xml 
          opt.s006.scalar.dat 
          opt.s006.stat.h5 
          opt.s006.opt.xml 
          opt.s007.scalar.dat 
          opt.s007.stat.h5 
          opt.s007.opt.xml 
          opt.s008.scalar.dat 
          opt.s008.stat.h5 
          opt.s008.opt.xml 
          opt.s009.scalar.dat 
          opt.s009.stat.h5 
          opt.s009.opt.xml 
          opt.s010.scalar.dat 
          opt.s010.stat.h5 
          opt.s010.opt.xml 
          opt.s011.scalar.dat 
          opt.s011.stat.h5 
          opt.s011.opt.xml 
    Entering ./scale_1.0 2 
      analyzing  2 opt 

  QmcpackAnalyzer warning:
    Simulation failed, skipping analysis!
Traceback (most recent call last):
  File "/home/yeluo/opt/qmcpack/labs/lab2_qmc_basics/oxygen_dimer/O_dimer.py", line 163, in <module>
    run_project(sims) 
  File "/home/yeluo/opt/qmcpack/nexus/nexus/__init__.py", line 84, in run_project
    pm.run_project()
  File "/home/yeluo/opt/qmcpack/nexus/nexus/project_manager.py", line 99, in run_project
    self.progress_cascades()
  File "/home/yeluo/opt/qmcpack/nexus/nexus/project_manager.py", line 332, in progress_cascades
    cascade.progress()
  File "/home/yeluo/opt/qmcpack/nexus/nexus/simulation.py", line 1290, in progress
    sim.progress(self.simid)
  File "/home/yeluo/opt/qmcpack/nexus/nexus/simulation.py", line 1290, in progress
    sim.progress(self.simid)
  File "/home/yeluo/opt/qmcpack/nexus/nexus/simulation.py", line 1266, in progress
    self.analyze()
  File "/home/yeluo/opt/qmcpack/nexus/nexus/simulation.py", line 1175, in analyze
    self.post_analyze(analyzer)
  File "/home/yeluo/opt/qmcpack/nexus/nexus/qmcpack.py", line 1228, in post_analyze
    opt_file = analyzer.results.optimization.optimal_file
AttributeError: 'OptimizationAnalyzer' object has no attribute 'optimal_file'

The output is quite confusing and doesn't give sufficient information about the failure.
Missing files can be caused by 1) qmcpack not running or 2) files not being created due to disk quota.

I do have questions

  Qmcpack warning:
    QMCPACK did not start properly!

Does this happen when Nexus detects that the process has returned and then analyzes the output, or is it discovered through a periodic check?
In the former case, there is no concern, but the message should say this is an error, not a warning.
In the latter case, what happens if qmcpack just starts slowly due to delays in mpirun? Does Nexus stop, or does it keep running? It may report a false failure if the command or job starts slowly.

@jtkrogel
Contributor

jtkrogel commented Mar 3, 2026

  Qmcpack warning:
    QMCPACK did not start properly!

results from a post check on the QMCPACK output file. The check is whether the word "QMCPACK" is found in the output file, since it appears in the splash text right at the beginning.

The root cause for this can be all kinds of things. It is basically impossible to catch and differentiate them all.

The reason why it is a warning is that Nexus errors halt the execution of the main Nexus process. Nexus should not abort in these cases since, e.g. a user that has 100 workflows going doesn't want Nexus to stop working when one of them has some problem.

The message could be clearer in saying what is wrong and what to do about it.
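
The warning-versus-error reasoning above can be sketched as follows. This is an illustration only, using Python's standard `warnings` module rather than Nexus' own warning machinery; the function `check_all`, the `workflows` mapping, and the `check` callable are all hypothetical.

```python
import warnings


def check_all(workflows, check):
    """Apply `check` to every workflow's output file, warning instead of raising.

    `workflows` maps a workflow name to its output file (hypothetical layout).
    Returns the names of the workflows that failed the check, so the
    remaining workflows keep being processed.
    """
    failed = []
    for name, outfile in workflows.items():
        if not check(outfile):
            warnings.warn(f"{name}: QMCPACK did not start properly!")
            failed.append(name)
    return failed
```

Raising an exception inside the loop would stop at the first failing workflow, which is the behavior the comment above argues against when many workflows are running at once.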

@brockdyer03
Contributor Author

@ye-luo can you give the latest commit a test? I believe the problem was that, unlike the less-developed analyzers, the QmcpackAnalyzer class stores progress info in self.info, which is an instance of QAinformation.

I updated the code in Simulation.progress() to check whether that attribute exists (I don't think it does in other analyzer classes) and then set the analysis status appropriately.

Based on the error message you provided, I think this should fix that.
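
The attribute check described above could look roughly like this sketch. It is not the code in the commit: the helper `analysis_failed` and the `failed` flag on `info` are hypothetical, and the real fields of QAinformation are not shown in this conversation.

```python
def analysis_failed(analyzer):
    """Return True if the analyzer carries an `info` object flagged as failed.

    Hypothetical helper. Analyzers without an `info` attribute are treated
    as not reporting a failure, so they do not raise AttributeError here.
    """
    info = getattr(analyzer, "info", None)
    if info is None:
        return False
    # `failed` is an assumed flag name used only for this illustration.
    return bool(getattr(info, "failed", False))
```

Using `getattr` with a default keeps the check safe for analyzer classes that do not define the attribute at all.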

@ye-luo
Contributor

ye-luo commented Mar 4, 2026

> Qmcpack warning:
>   QMCPACK did not start properly!
>
> results from a post check on the QMCPACK output file. The check is whether the word "QMCPACK" is found in the output file, since it appears in the splash text right at the beginning.

@jtkrogel I was wondering when this "post check" happens.
Preferably, give a direct answer to my previous question:

Does this happen when Nexus detects that the command execution process has returned and then analyzes the output, or is it discovered through a periodic check?

@ye-luo
Contributor

ye-luo commented Mar 4, 2026

@brockdyer03 I don't know what to expect from your last change. Here is the output:

    Entering ./scale_1.0 2 
      writing input files  2 opt 
    Entering ./scale_1.0 2 
      sending required files  2 opt 
      submitting job  2 opt 
    Entering ./scale_1.0 2 
      Executing:  
        export OMP_NUM_THREADS=1
        mpirun -np 4 qmcpack opt.in.xml 

  elapsed time 147.2 s  memory 105.92 MB 

  Qmcpack warning:
    QMCPACK did not start properly!
    Entering ./scale_1.0 2 
      copying results  2 opt 
        warning: the following files were missing 
          opt.s000.scalar.dat 
          opt.s000.stat.h5 
          opt.s000.opt.xml 
          opt.s001.scalar.dat 
          opt.s001.stat.h5 
          opt.s001.opt.xml 
          opt.s002.scalar.dat 
          opt.s002.stat.h5 
          opt.s002.opt.xml 
          opt.s003.scalar.dat 
          opt.s003.stat.h5 
          opt.s003.opt.xml 
          opt.s004.scalar.dat 
          opt.s004.stat.h5 
          opt.s004.opt.xml 
          opt.s005.scalar.dat 
          opt.s005.stat.h5 
          opt.s005.opt.xml 
          opt.s006.scalar.dat 
          opt.s006.stat.h5 
          opt.s006.opt.xml 
          opt.s007.scalar.dat 
          opt.s007.stat.h5 
          opt.s007.opt.xml 
          opt.s008.scalar.dat 
          opt.s008.stat.h5 
          opt.s008.opt.xml 
          opt.s009.scalar.dat 
          opt.s009.stat.h5 
          opt.s009.opt.xml 
          opt.s010.scalar.dat 
          opt.s010.stat.h5 
          opt.s010.opt.xml 
          opt.s011.scalar.dat 
          opt.s011.stat.h5 
          opt.s011.opt.xml 
    Entering ./scale_1.0 2 
      analyzing  2 opt 

  QmcpackAnalyzer warning:
    Simulation failed, skipping analysis!
Traceback (most recent call last):
  File "/home/yeluo/opt/qmcpack/labs/lab2_qmc_basics/oxygen_dimer/O_dimer.py", line 163, in <module>
    run_project(sims) 
  File "/home/yeluo/opt/qmcpack/nexus/nexus/__init__.py", line 84, in run_project
    pm.run_project()
  File "/home/yeluo/opt/qmcpack/nexus/nexus/project_manager.py", line 99, in run_project
    self.progress_cascades()
  File "/home/yeluo/opt/qmcpack/nexus/nexus/project_manager.py", line 332, in progress_cascades
    cascade.progress()
  File "/home/yeluo/opt/qmcpack/nexus/nexus/simulation.py", line 1296, in progress
    sim.progress(self.simid)
  File "/home/yeluo/opt/qmcpack/nexus/nexus/simulation.py", line 1296, in progress
    sim.progress(self.simid)
  File "/home/yeluo/opt/qmcpack/nexus/nexus/simulation.py", line 1272, in progress
    self.analyze()
  File "/home/yeluo/opt/qmcpack/nexus/nexus/simulation.py", line 1175, in analyze
    self.post_analyze(analyzer)
  File "/home/yeluo/opt/qmcpack/nexus/nexus/qmcpack.py", line 1228, in post_analyze
    opt_file = analyzer.results.optimization.optimal_file
AttributeError: 'OptimizationAnalyzer' object has no attribute 'optimal_file'

I didn't see any noticeable change.

@brockdyer03
Contributor Author

Yeah, after some conversation with @jtkrogel this morning, we realized that there's a fair bit more at play than we originally expected. I think fixing the related bugs and unintended behavior will take some time.


Labels

nexus python Pull requests that update python code


4 participants