Add utility functions for recentering data#23
Conversation
just the interface for now
just ec to nc for now
this is why i can't stand line length limits
germasch
left a comment
There was a problem hiding this comment.
Looks good to me (as usual, without going through all of the details), except some minor comments.
Not something for this PR, but more for the usual long to-do list: I suspect there's a problem with not even writing all of the actually needed data in the case of non-periodic b.c.s. That is, in a situation with
| def _rename_var(ds: xr.Dataset, old_name: str, new_name: str) -> None: | ||
| ds[new_name] = ds[old_name].rename(new_name) | ||
| del ds[old_name] |
There was a problem hiding this comment.
Why don't you use Dataset.rename()? I suppose that's what it's there for -- it does return a new Dataset, but I think that's actually not bad, since it kinda treats Datasets as immutable, so one doesn't have to worry about changing an existing object and possibly messing something up.
I suppose that all of that xarray stuff is implemented efficiently, ie., doesn't make copies and wastes memory.
There was a problem hiding this comment.
I didn't use rename because I wanted to do it in-place. I personally just don't like the pattern of ds = ds.do_something() in python, but I don't feel that strongly, and wouldn't mind changing the api to that.
However, I don't trust xarray to do anything efficiently anymore, and even if rename uses views, the recentering itself can't. Using the builtin rename would necessitate that all the changes happen to a copy of the original ds. At best, the original ds would then be garbage-collected, but that all still seems wasteful when these datasets can become so huge.
I'd actually like to make get_recentered work in-place, too, or at least to not make multiple copies (or views? who knows). You can always pass a copy to an in-place function, but you can't do something in-place with a function that makes a copy.
There was a problem hiding this comment.
I agree that in particular ds = ds.do_something() isn't great, in particular it's iffy if used in notebooks. But if used as ds_cc = ds.psc.recenter('cc'), I think it is a pattern that's more flexible in that it allows for keeping the original data around unchanged, and that again is something that's useful for not breaking other cells in a notebook.
This now actually reminds me of the marimo notebooks I mentioned before -- which IIRC does have restrictions on mutating objects since that makes it hard to automatically figure out a dependency graph.
It is foreign to me to not mutate objects, since that's what one usually does for efficiency reasons, rather than copying and then modifying. But it seems to be xarray's model to not mutate objects, and so I'm kinda inclined to give that approach the benefit of the doubt.
In any case, as far as this PR is concerned, it's fine as-is, it's not worth holding up progress for questions that can be resolved later -- it's not like this defines an interface we expect to be set in stone.
| def auto_recenter( | ||
| ds: xr.Dataset, | ||
| to_centering: Literal["nc", "cc"], | ||
| **boundaries: BoundaryInterpMethod, | ||
| ) -> None: |
There was a problem hiding this comment.
I think my preference would be to just call this recenter() (also in that I think eventually this could become ds2 = ds.psc.recenter(). It's automatic, I guess, in that it uses knowledge about the existing staggering, but well, that's more or less an implementation detail.
There was a problem hiding this comment.
It's also automatic in the sense that it can fail to do the correct thing. It detects which variables need to be recentered based on their names and the passed dimension names (the keys to boundares), but that's a heuristic. For example, it fails on the gauss data, since neither rho nor dive end with _nc or _cc. It similarly fails on pfd_moments (although that could potentially be fixed by appending _nc or _cc to variable names during the decoding step based on the all_1st_*c super-variable name).
If recenter existed, I would expect it to take a map of variable names to their current centerings instead of guessing. That actually seems like a good idea, and I'd be happy to make that so (and auto_recenter would just call recenter with the guessed map). There would be the question of whether/how to rename recentered variables, however.
There was a problem hiding this comment.
Again, not worth the argument at this point. In the longer run, it'd certainly be nice to store the centering information in the output directly, rather than relying on field names / heuristics, making this mostly moot.
Also, IMHO nothing wrong with having recenter() try to do the right thing based on heuristics for now, but later extending it be follow a map of centerings if one is passed.
|
You will have to fix that CI error, though (it looks pretty trivial). FWIW, if you run
|
|
The CI error is weird. CI passes on my machine, and I didn't even change the line of code that is failing. I was going to check if there was some discrepancy between my CI and this CI, or if the CI updated since the last PR somehow (I'm not yet knowledgeable about how the CI works). I suppose for now, I can just fix it manually. |
|
Looks like you're right that this error happens independently of your PR -- on the Now, this would make more sense to me if there actually was a |
|
I take the comment above back, I had local unsaved changes. So that comment is there, and it may now be unneeded because maybe numpy got updated. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
🚀 New features to boost your workflow:
|
|
Thanks. My machine fails that change, though, and updating numpy didn't seem to help. How can I make it work? Also, that 1 line of code not covered is pretty irrelevant and not worth testing; I think it can be ignored. |
|
Since somehow my virtual env had disappeared, I think all I did was The numpy thing is just a guess, and in fact it's not all that easy to figure out what numpy was used by pre-commit since its mypy venv is hiding in some cache directory somewhere. When I manually installed pscpy, I ended up with Running |
|
On the codecov -- it's not a big deal, but I wouldn't mind keeping it at 100%. Do you ever get something that's not a In any case, I wouldn't mind to just remove that |
this seemed like the "correct" way of handling the coverage problem
|
for future reference/posterity's sake: removing my |
|
Do you have permissions to merge this yourself? |
Adds two new utility functions:
get_recentered: creates a newDataArrayby recentering a given array along a given dimensionauto_recenter: automatically recenters variables in a givenDatasetin-place according to given parametersThe intended workflow (for now) is for the user to call
decode_pscand then, if desired, callauto_recenter. Theget_recenteredfunction might be useful whenauto_recenterfails, and is thus also made available.Both functions are also tested.
Optimization was not a consideration for this first implementation.