
Add utility functions for recentering data #23

Merged
JamesMcClung merged 22 commits into psc-code:main from JamesMcClung:recenter on Mar 24, 2025
Conversation

@JamesMcClung (Collaborator) commented Mar 20, 2025

Adds two new utility functions:

  • get_recentered: creates a new DataArray by recentering a given array along a given dimension
  • auto_recenter: automatically recenters variables in a given Dataset in-place according to given parameters

The intended workflow (for now) is for the user to call decode_psc and then, if desired, call auto_recenter. The get_recentered function might be useful when auto_recenter fails, and is thus also made available.

Both functions are also tested.

Optimization was not a consideration for this first implementation.
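
A minimal sketch of that workflow, assuming decode_psc takes and returns a Dataset, both functions are exposed at the package level, and that the file name, dimension name, and "periodic" boundary value are all illustrative:

    import xarray as xr

    import pscpy

    ds = xr.open_dataset("pfd.000000400.bp")  # hypothetical PSC output file
    ds = pscpy.decode_psc(ds)                 # decode as usual
    # then, if desired, recenter in-place ("periodic" stands in for whatever
    # BoundaryInterpMethod values actually exist)
    pscpy.auto_recenter(ds, to_centering="nc", x="periodic")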

@germasch (Contributor) left a comment

Looks good to me (as usual, without going through all of the details), except for some minor comments.

Not something for this PR, but more for the usual long to-do list: I suspect there's a problem with not even writing all of the actually needed data in the case of non-periodic boundary conditions. That is, in a situation with $n$ cells, certain quantities should have $n+1$ data points in certain dimensions because of these staggering issues. I know I've thought about this before, but I kinda doubt it's a solved issue... (I think that data may be available in some variation of the xdmf writer, where the arrays are written separately for each patch, including ghost points, but that of course doesn't help unless one is doing just that.)

Comment on lines +36 to +38
def _rename_var(ds: xr.Dataset, old_name: str, new_name: str) -> None:
ds[new_name] = ds[old_name].rename(new_name)
del ds[old_name]
@germasch (Contributor)

Why don't you use Dataset.rename()? I suppose that's what it's there for -- it does return a new Dataset, but I think that's actually not bad, since it kinda treats Datasets as immutable, so one doesn't have to worry about changing an existing object and possibly messing something up.

I suppose all of that xarray stuff is implemented efficiently, i.e., doesn't make copies and waste memory.
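
For reference, a minimal sketch of the rename-based alternative (the variable names here are illustrative):

    # Dataset.rename returns a new Dataset; the underlying arrays are shared, not copied.
    ds = ds.rename({"jx_ec": "jx_nc"})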

@JamesMcClung (Collaborator, Author)

I didn't use rename because I wanted to do it in-place. I personally just don't like the pattern of ds = ds.do_something() in Python, but I don't feel that strongly, and wouldn't mind changing the API to that.

However, I don't trust xarray to do anything efficiently anymore, and even if rename uses views, the recentering itself can't. Using the built-in rename would necessitate that all the changes happen to a copy of the original ds. At best, the original ds would then be garbage-collected, but that all still seems wasteful when these datasets can become so huge.

I'd actually like to make get_recentered work in-place, too, or at least to not make multiple copies (or views? who knows). You can always pass a copy to an in-place function, but you can't do something in-place with a function that makes a copy.

@germasch (Contributor)

I agree that ds = ds.do_something() isn't great; in particular, it's iffy if used in notebooks. But if used as ds_cc = ds.psc.recenter('cc'), I think it's a pattern that's more flexible, in that it allows keeping the original data around unchanged, and that again is something that's useful for not breaking other cells in a notebook.

This now actually reminds me of the marimo notebooks I mentioned before -- which IIRC does have restrictions on mutating objects since that makes it hard to automatically figure out a dependency graph.

It is foreign to me to not mutate objects, since that's what one usually does for efficiency reasons, rather than copying and then modifying. But it seems to be xarray's model to not mutate objects, and so I'm kinda inclined to give that approach the benefit of the doubt.

In any case, as far as this PR is concerned, it's fine as-is, it's not worth holding up progress for questions that can be resolved later -- it's not like this defines an interface we expect to be set in stone.
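
For context, xarray does support the ds.psc.recenter(...) style via registered accessors; a minimal sketch of that hypothetical pattern (none of this exists in pscpy yet):

    import xarray as xr

    @xr.register_dataset_accessor("psc")
    class PscAccessor:
        """Hypothetical accessor; recenter is just a stub."""

        def __init__(self, ds: xr.Dataset) -> None:
            self._ds = ds

        def recenter(self, to_centering: str) -> xr.Dataset:
            # would return a new, recentered Dataset, leaving self._ds unchanged,
            # used as e.g. ds_cc = ds.psc.recenter("cc")
            raise NotImplementedError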

Comment on lines +41 to +45
def auto_recenter(
ds: xr.Dataset,
to_centering: Literal["nc", "cc"],
**boundaries: BoundaryInterpMethod,
) -> None:
@germasch (Contributor)

I think my preference would be to just call this recenter() (also in that I think eventually this could become ds2 = ds.psc.recenter()). It's automatic, I guess, in that it uses knowledge about the existing staggering, but well, that's more or less an implementation detail.

@JamesMcClung (Collaborator, Author)

It's also automatic in the sense that it can fail to do the correct thing. It detects which variables need to be recentered based on their names and the passed dimension names (the keys of boundaries), but that's a heuristic. For example, it fails on the gauss data, since neither rho nor dive ends with _nc or _cc. It similarly fails on pfd_moments (although that could potentially be fixed by appending _nc or _cc to variable names during the decoding step, based on the all_1st_*c super-variable name).

If recenter existed, I would expect it to take a map of variable names to their current centerings instead of guessing. That actually seems like a good idea, and I'd be happy to make that so (and auto_recenter would just call recenter with the guessed map). There would be the question of whether/how to rename recentered variables, however.
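
To make the idea concrete, a hypothetical signature for that explicit variant (nothing here is implemented, and the example map is illustrative):

    def recenter(
        ds: xr.Dataset,
        centerings: dict[str, Literal["nc", "cc"]],  # e.g. {"rho": "nc", "dive": "nc"}
        to_centering: Literal["nc", "cc"],
        **boundaries: BoundaryInterpMethod,
    ) -> None:
        """Recenter exactly the variables listed in centerings, in-place.

        auto_recenter would build the map heuristically and delegate here.
        """
        ...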

@germasch (Contributor)

Again, not worth the argument at this point. In the longer run, it'd certainly be nice to store the centering information in the output directly, rather than relying on field names / heuristics, making this mostly moot.

Also, IMHO there's nothing wrong with having recenter() try to do the right thing based on heuristics for now, and later extending it to follow a map of centerings if one is passed.

@germasch (Contributor)

You will have to fix that CI error, though (it looks pretty trivial).

FWIW, if you run pre-commit install (IIRC), it'll install a git pre-commit hook that won't let you commit unless you fix issues like this first.

pre-commit run -a should help to reproduce the problem locally. (Though I recurrently have a heck of a time with something getting into an inconsistent state with pre-commit, where I still haven't figured out what the underlying problem is...)

@JamesMcClung (Collaborator, Author)

The CI error is weird. CI passes on my machine, and I didn't even change the line of code that's failing. I was going to check whether there's some discrepancy between my CI and this CI, or whether the CI somehow updated since the last PR (I'm not yet knowledgeable about how the CI works). I suppose for now I can just fix it manually.

@germasch (Contributor)

Looks like you're right that this error happens independently of your PR -- on the main branch I get locally:

    mypy.....................................................................Failed
    - hook id: mypy
    - exit code: 1

    src/pscpy/psc.py:35: error: Unused "type: ignore" comment  [unused-ignore]
    Found 1 error in 1 file (checked 3 source files)

Now, this would make more sense to me if there actually was a type: ignore comment there... 🤷

@germasch (Contributor)

I take the comment above back; I had local unsaved changes. So that comment is there, and it may now be unneeded, because maybe numpy got updated.

        return np.linspace(  # type: ignore[no-any-return]
            start=self.corner[coord_idx],
            stop=self.corner[coord_idx] + self.length[coord_idx],
            num=self.gdims[coord_idx],
            endpoint=False,
        )
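
The [unused-ignore] error comes from mypy's warn_unused_ignores setting; if newer numpy stubs now give np.linspace a precise return type, the fix is presumably just to drop the stale comment:

        return np.linspace(
            start=self.corner[coord_idx],
            stop=self.corner[coord_idx] + self.length[coord_idx],
            num=self.gdims[coord_idx],
            endpoint=False,
        )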

@codecov bot commented Mar 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

    Files with missing lines       Coverage  Δ
    src/pscpy/__init__.py          100.00%   <100.00%> (ø)
    src/pscpy/postprocessing.py    100.00%   <100.00%> (ø)
    src/pscpy/psc.py                98.07%   <100.00%> (ø)

@JamesMcClung (Collaborator, Author)

Thanks. That change still fails on my machine, though, and updating numpy didn't seem to help. How can I make it work?

Also, that 1 line of code not covered is pretty irrelevant and not worth testing; I think it can be ignored.

@germasch (Contributor)

Since somehow my virtual env had disappeared, I think all I did was

    uv venv
    source .venv/bin/activate
    uv pip install pre-commit
    pre-commit run -a

The numpy thing is just a guess, and in fact it's not all that easy to figure out what numpy was used by pre-commit since its mypy venv is hiding in some cache directory somewhere.

When I manually installed pscpy, I ended up with numpy==2.2.4, and when I manually ran mypy, I ended up with a bunch of complaints about test_postprocessing.py -- that might be because the rules for tests/ are relaxed in pyproject.toml, but it may not have picked that up. I also got complaints about adios2.

Running nox passed, though -- that's I think pretty close to running the actual CI.

@germasch (Contributor)

On the codecov -- it's not a big deal, but I wouldn't mind keeping it at 100%. Do you ever get something that's not a str? (I think strictly speaking, those keys are Hashable, so it's possible, but I'm not sure it's worth worrying about. They could be something else that's not a str but still understands .endswith(), in theory...)

In any case, I wouldn't mind just removing that continue, and if it ever triggers, then well, at least you'll have a test to add ;)
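
For reference, a sketch of the name-based heuristic in question (illustrative, not the actual implementation); str.endswith accepts a tuple, so both suffixes can be checked at once:

    for name in ds.data_vars:
        if not isinstance(name, str):
            continue  # the uncovered line: keys are Hashable, so non-str is possible
        if name.endswith(("_nc", "_cc")):
            ...  # candidate for recentering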

JamesMcClung added a commit: this seemed like the "correct" way of handling the coverage problem

@JamesMcClung (Collaborator, Author)

For future reference/posterity's sake: removing my ~/.cache/pre-commit/ directory resolved the CI discrepancy.

@germasch (Contributor)

Do you have permissions to merge this yourself?

@JamesMcClung merged commit a2f40bc into psc-code:main on Mar 24, 2025 (4 checks passed)
@JamesMcClung deleted the recenter branch on Mar 24, 2025, 20:17