
Proposed code cleanup #1

Merged
mblesac merged 12 commits into master from 5ttgen_neonatal_rs
Dec 3, 2021

Conversation

@Lestropie
Collaborator

Hi @mblesac @paola-g,

I had some fun cleaning up and Pythonifying the code and so now have a bit of a clearer picture as to what's going on. Given I don't yet have any exemplar data I've not made any attempt at all to run it, so there may well be outright syntax errors in there let alone runtime issues; but depending on their severity / how cryptic the errors are you might be able to address some stuff at your end (feel free to commit directly to this branch if you wish). My changes should also clean up the terminal output quite a bit.

This PR proposes merging my new branch on your fork into the master branch of your fork. Therefore, once this PR is merged, the corresponding changes will appear in MRtrix3#2393. So we will use this PR to discuss the cleanup / refactoring between us, and then MRtrix3#2393 can be used for final discussion with a wider audience before merge into the MRtrix3 dev branch.


I'm tempted to suggest that the dHCP pipeline should be a separate 5ttgen algorithm.
While it's true that the joint label fusion with the M-CRIB data is executed in both scenarios:

  • In the absence of the dHCP pipeline, that forms the entirety of the output data;
  • In the presence of the dHCP pipeline, it forms the vast minority of the output data, only being used to insert additional sub-cortical grey matter structures into the dHCP segmentation.

I'm unsure of how much utility the output of the -parcellation option would be if the dHCP pipeline data are used. Also, if the dHCP version were standalone, it could omit the modality command-line argument, which is non-functional in that use case.


Minor points:

  1. Given other work in BIDS & quantitative imaging I prefer to use e.g. "T1w" rather than "T1" wherever possible.
    (Though admittedly now that I think of it this is a problem in 5ttgen fsl...)

  2. Config file entries are camel case, so I changed "MCRIBpath" to "MCRIBPath".

  3. I'm not sure if the histogram matching is requisite here; I'd have expected the ANTs scripts to be doing their own intensity matching (or using a registration metric that doesn't necessitate such)?

  4. I don't think it is necessarily guaranteed that the sGM segmentations provided by M-CRIB will always lie within voxels that are classified by the dHCP pipeline as containing 100% WM. So greater care might be required regarding how those sGM segmentations are inserted; i.e. it might not be only the WM image that needs to be decreased to accommodate those extra parcels.

  5. I'm strongly considering changing the -sgm_amyg_hipp option to be specific to the 5ttgen fsl algorithm, and to always assign the amygdalae and hippocampi to the sGM volume in other algorithms. The option was put there because in 5ttgen fsl the segmentations of those structures by FIRST are on occasion less extensive than the GM fraction estimated by FAST, and so streamlines are not able to reach those structures for the sGM priors to kick in; but it's a problem specific to that algorithm, and I suspect that it should not only be the default behaviour for other algorithms but also that the converse behaviour doesn't even need to be provided. Do you have any thoughts on this based on your own experiences?

  6. There's a lot of importing of data into the scratch directory that can probably be omitted. Not only the import as .mif then conversion to .nii; much of those data could likely be used in-place without having to convert anything into the scratch directory. You just need to make sure to use absolute filesystem paths (since the user may specify a path relative to the working directory, but the relevant command is executed from inside the scratch directory).

  7. Many scratch file names could possibly be shortened in order to reduce the length of terminal outputs.

  8. There were some errors in what I think was the M-CRIB GM parcel list: a couple of nodes missing and a couple of nodes duplicated. I hope you can see from the code restructuring how it facilitates diagnosing such things. But it's worth (once it's in an operational state) comparing the output of this version to your own code to confirm.
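Point 6 can be sketched concretely; a minimal illustration (the file name and directory layout here are hypothetical, not the script's actual interface):

```python
import os
import tempfile

# Hypothetical input path, as a user might supply it relative to their
# working directory
user_path = os.path.join('data', 'T2w.nii.gz')

# Resolve to an absolute path BEFORE the script changes into its scratch
# directory; after os.chdir(), the relative path would no longer resolve
abs_path = os.path.abspath(user_path)

scratch_dir = tempfile.mkdtemp()
os.chdir(scratch_dir)

# abs_path remains valid from inside the scratch directory, so the data
# can be referenced in-place without importing a converted copy
print(os.path.isabs(abs_path))  # True
```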

Lestropie and others added 4 commits November 15, 2021 19:11
Co-authored-by: Manuel Blesa <mblesac@gmail.com>
Co-authored-by: Paola Galdi <paola.galdi@gmail.com>
@mblesac
Owner

mblesac commented Nov 16, 2021

Hi @Lestropie,

We've been working on this following your advice. You'll see three commits:

  • Fixing some syntax errors
  • Removing the histogram matching
  • Splitting the command into two different algorithms: neonatal_mcrib and neonatal_dhcp

The results are very similar to the original implementation (note that the results are never identical). Let us know what you think about the changes. In the meantime, I'm doing the paperwork to be able to send you a dataset so that you can do the actual testing.

I think your points have been addressed in the different modifications to the code, but there are three I would like to comment on:

I'm unsure of how much utility the output of the -parcellation option would be if the dHCP pipeline data are used. Also, if the dHCP version were standalone, it could omit the modality command-line argument, which is non-functional in that use case.

I think it is still an interesting option to have. The cortical parcellation provided by the dHCP has only 32 nodes, so I don't think it is going to be used for connectomics. The M-CRIB atlas is the translation of the FreeSurfer parcellation to the neonatal brain, so I think it will become one of the most used parcellations (alongside the neonatal AAL-derived or the JHU parcellations). My opinion is to leave it as an additional output: people can always use another parcellation, but in this way you can save a lot of time.

I don't think it is necessarily guaranteed that the sGM segmentations provided by M-CRIB will always lie within voxels that are classified by the dHCP pipeline as containing 100% WM. So greater care might be required regarding how those sGM segmentations are inserted; i.e. it might not be only the WM image that needs to be decreased to accommodate those extra parcels.

The initial WM segmentation is created by combining the WM and the sGM of the dHCP outputs (we call it WM, which can be misleading). So in fact we are inserting the sGM of the M-CRIB atlas into the combination of WM+sGM of the dHCP. I found this approach to provide better delineation of the subcortical structures.

I'm strongly considering changing the -sgm_amyg_hipp option to be specific to the 5ttgen fsl algorithm, and to always assign the amygdalae and hippocampi to the sGM volume in other algorithms. The option was put there because in 5ttgen fsl the segmentations of those structures by FIRST are on occasion less extensive than the GM fraction estimated by FAST, and so streamlines are not able to reach those structures for the sGM priors to kick in; but it's a problem specific to that algorithm, and I suspect that it should not only be the default behavior for other algorithms but also that the converse behavior doesn't even need to be provided. Do you have any thoughts on this based on your own experiences?

I have no answer for that. These structures are very difficult to segment in this population due to the size/resolution and the low contrast. I think it makes sense to leave it as it is for these algorithms too and let the users decide, but again, I have no definitive answer.

@Lestropie
Collaborator Author

Splitting the command into two different algorithms: neonatal_mcrib and neonatal_dhcp

  • Would simply naming them as "dhcp" and "mcrib" be a problem in any way?
    (e.g. is there dHCP data to which this method would not be applicable)

  • One advantage of having separate algorithms is that they can have independent command-line interfaces. So for instance, with the dHCP algorithm, the path to the dHCP processed data must always be specified. Currently it's a mandatory command-line option, which we use very sparingly in MRtrix3: we try to have anything compulsory as a positional argument. So for instance:

    • Is there a mask provided by the dHCP pipeline that one would expect to be used (if not always, then at least in the vast majority of cases)?
    • Is there any need to provide separately an input structural image and the path to the dHCP pipeline output, or would it make more sense to instead only provide the dHCP path, and the structural image would be pulled from that structure?

The results are very similar to the original implementation (note that the results are never identical).

My question here was not about any potential stochasticity within the relevant ANTs commands (which is what I suspect you're referring to here), but more so the fact that I observed potential errors in your code relating to duplicate / absent indices, and so hoped that you would spot the differences in script outcomes from those and confirm whether or not their resolution is correct.

The initial WM segmentation created is done combining the WM and the sGM of the dHCP outputs (we call it WM, which can be misleading). So in fact we are inserting the sGM of the M-CRIB atlas in the combination of WM+sGM of the dHCP. I found this approach to provide a better delineation of the subcortical structures.

While true, it doesn't quite address the original question. Let's say you have a voxel that is outside of the dHCP combined WM & sGM mask; it could be cGM or CSF according to dHCP. Now that voxel is classified as sGM by M-CRIB. What will be the outcome of your script in that voxel?
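To make the question concrete, here is a toy numpy sketch (the tissue ordering and values are purely illustrative, not the script's actual data layout): subtracting the inserted sGM from WM alone can drive the WM fraction negative in a voxel that dHCP classified as cGM/CSF, whereas clearing all tissue fractions within the parcel keeps every voxel summing to one.

```python
import numpy as np

# Toy per-voxel tissue fractions [cGM, sGM, WM, CSF], each summing to 1
fractions = np.array([[0.7, 0.0, 0.0, 0.3],   # outside the dHCP WM+sGM mask
                      [0.0, 0.0, 1.0, 0.0]])  # pure WM voxel

sgm_parcel = np.array([True, True])  # M-CRIB labels both voxels as sGM

# Naive approach: subtract the inserted sGM from the WM fraction only
naive = fractions.copy()
naive[sgm_parcel, 1] += 1.0
naive[sgm_parcel, 2] -= 1.0
# In the first voxel the WM fraction has become negative:
assert naive[0, 2] == -1.0

# Safer approach: clear *all* tissue fractions within the parcel first
safe = fractions.copy()
safe[sgm_parcel] = 0.0
safe[sgm_parcel, 1] = 1.0
assert np.allclose(safe.sum(axis=1), 1.0)
```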

I have no answer for that.

No worries. I'll pose that question over on the main MRtrix3 repo.

DHCP_AMYG_HIPP = list(range(1, 5))

- MCRIB_CGM = list(range(1000, 1036) + range(2000, 2036))
+ MCRIB_CGM = list(range(1000, 1004)) + list(range(1005, 1036)) + list(range(2000, 2004)) + list(range(2005, 2036))
Collaborator Author

This relates to my question RE: absent / duplicated indices. Looks like the elements I thought may have been accidentally omitted you intended to omit; but I have some recollection that there were additionally indices that were duplicated in your original code. This is the better way to do it, but I was partly curious whether or not you would spot the influence of those duplications in a change in output of the script.

(P.S. Given it looks like an unusual omission, it might be worth adding a code comment indicating the structures omitted and why)
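As an aside on the replaced line: concatenating two range() calls directly is additionally a Python 3 failure mode, since range() now returns a lazy object rather than a list. A minimal demonstration (labels 1004 and 2004 deliberately omitted, per the M-CRIB list above):

```python
# In Python 3, range() returns a lazy range object, so the old construction
# raises a TypeError rather than concatenating:
try:
    combined = range(1000, 1036) + range(2000, 2036)
except TypeError:
    combined = None
assert combined is None

# Converting each range to a list first makes concatenation valid; labels
# 1004 and 2004 are deliberately absent
MCRIB_CGM = (list(range(1000, 1004)) + list(range(1005, 1036))
             + list(range(2000, 2004)) + list(range(2005, 2036)))
assert 1004 not in MCRIB_CGM and 2004 not in MCRIB_CGM
assert len(MCRIB_CGM) == 70
```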

Owner

Hi,

We have been working on this, these are the new changes:

  • Now the algorithms are called mcrib and dhcp.
  • The dHCP algorithm has been modified: now you just need to input the path of the derivatives (/anat) folder and that's all; no mask is required anymore. Note that there are some versions of the dHCP pipeline around with different file names; I think it now covers all the possible cases.
  • In the dHCP algorithm we added a filter to ensure that the merging of the sGM doesn't overlap with the CSF.
  • A note has been added about the missing labels: the labels 1004 and 2004 don't exist in the original M-CRIB atlas.

As it is now, the dHCP algorithm is the same as in my original paper (with the addition of this CSF filter to ensure that the sGM doesn't enter the CSF). Basically, it combines the tissue probability maps of the dHCP pipeline with the binary subcortical grey matter parcellations of the M-CRIB atlas. However, I've been thinking of modifying it to offer two possibilities: using the probability maps or not. This would translate into the addition of a new flag (-posteriors): if specified, the algorithm would work as it does now, the only difference being the use of tissue probability maps for the subcortical GM as well, so that all of the tissues would have partial volumes. If not specified, the algorithm would use only the binary segmentations for everything. I was thinking about this because probably a lot of the people who run the dHCP pipeline didn't use the -additional flag (for example in the dHCP data itself, this data is not there). What do you think? Does this make sense?

…put)

Co-authored-by: Manuel Blesa <mblesac@gmail.com>
Co-authored-by: Paola Galdi <paola.galdi@gmail.com>
@Lestropie
Collaborator Author

Diverting conversation from #1 (comment) to main thread as it relates to more items than were highlighted in the code for that specific comment.

now you just need to input the path of the derivatives (/anat) folder

If you wanted to be really fancy, you could make parsing of the user-specified input path more robust, so that as long as the user provides a path that can be reasonably interpreted the script will still proceed. So e.g. if the user provides the path to the subject directory, the anat/ sub-directory within that, or a file within that sub-directory, the script will figure out the location of the anat/ directory and then proceed from there. Just makes it less likely to throw errors at users. I do this with some command interfaces. But it's most definitely not compulsory.

However, I've been thinking to modify it and add two possibilities: the use of the probability maps or not.

The description of the situation is still a bit too ambiguous for me to be able to immediately advise here, as I'm not familiar with the data provided by the pipeline. Let's see if I understand the situation correctly:

  • Everything other than sub-cortical GM:
    • dHCP provides binary tissue classifications
    • dHCP provides tissue probability priors, only if -additional is specified (and there is distributed pre-processed data where this flag was not used)
  • Sub-cortical GM:
    • dHCP provides minimal segmentation, which is immediately ignored in the 5ttgen dhcp algorithm here
    • M-CRIB + ANTs provides binary segmentations only (?)

If this is the case, then my advice would be that the script should check for presence of tissue probability priors in dHCP data.

  • If present, utilise those priors;
  • If absent, or if the user specifies some command-line option e.g. -binary, then utilise the binary segmentations.

However I'm concerned my interpretation is incorrect because you then talk about use of tissue probability priors for sub-cortical GM:

the only difference will be the use of tissue probability maps for the subcortical GM

So please correct my dot points above as necessary and I can revise.

@mblesac
Owner

mblesac commented Nov 24, 2021

The description of the situation is still a bit too ambiguous for me to be able to immediately advise here, as I'm not familiar with the data provided by the pipeline. Let's see if I understand the situation correctly:

  • Everything other than sub-cortical GM:

    • dHCP provides binary tissue classifications
    • dHCP provides tissue probability priors, only if -additional is specified (and there is distributed pre-processed data where this flag was not used)

Yes, that's right.

Sub-cortical GM:

  • dHCP provides minimal segmentation, which is immediately ignored in the 5ttgen dhcp algorithm here
  • M-CRIB + ANTs provides binary segmentations only (?)

dHCP provides a segmentation of the sGM (also tissue probability priors, only if -additional is specified), but I was not happy with how this segmentation looks for this purpose. See an example:

[image]

M-CRIB + ANTs provides both binary masks and tissue probability priors, but for the original implementation I used the binary masks. Please take a look at an example (ignore the cerebellum):

[image]

My reason for choosing the M-CRIB subcortical parcellation was that it was not as "bulky" as the dHCP one and it could allow the WM tracts to better pass through the middle. This was a purely visual judgement based on my opinion; I didn't do any quantitative experiment.

If this is the case, then my advice would be that the script should check for presence of tissue probability priors in dHCP data.

  • If present, utilise those priors;
  • If absent, or if the user specifies some command-line option e.g. -binary, then utilise the binary segmentations.

However I'm concerned my interpretation is incorrect because you then talk about use of tissue probability priors for sub-cortical GM:

the only difference will be the use of tissue probability maps for the subcortical GM

So please correct my dot points above as necessary and I can revise.

What I was thinking is:

  • If the -binary flag is selected (or if the folder of the probability maps doesn't exist): raise a warning and use the binary segmentation for both cGM and sGM.
  • By default, use the posteriors to create the 5TT file, using the tissue probability maps for both tissues, cGM and sGM.

Does that make sense?

@Lestropie
Collaborator Author

OK cool. So an important detail here is that both dHCP and M-CRIB (may) have non-binary segmentations. I personally don't think there's a huge benefit in offering independent control of the two.

If the -binary flag is selected (or if the folder of the probability maps doesn't exist): raise a warning and use the binary segmentation for both cGM and sGM

That logic doesn't quite work as it results in issuing a warning if the user specifies that command-line option. It should instead be something like:

use_hard_segmentation = False
if app.ARGS.hard:
    use_hard_segmentation = True
elif not os.path.isfile(soft_parcellation_path):
    app.warn('No tissue posteriors found (-additional flag not used); output will be hard segmentation')
    use_hard_segmentation = True

Variable use_hard_segmentation would then apply to both dHCP and M-CRIB data utilisation.

Note I've changed to "hard segmentation" rather than "binary"; I think that's more faithful to the fact that the output 5TT image is 4D rather than 3D.
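The 4D point can be illustrated with a toy numpy sketch (tissue ordering illustrative): a "hard" 5TT-style segmentation is still stored as one volume per tissue, with a single 1 per voxel, e.g. by taking the argmax over the soft posteriors.

```python
import numpy as np

# Toy soft segmentation: 2 voxels x 4 tissues (fractions sum to 1)
soft = np.array([[0.6, 0.3, 0.1, 0.0],
                 [0.1, 0.2, 0.6, 0.1]])

# A "hard" segmentation remains 4D: a one-hot volume per tissue
hard = np.zeros_like(soft)
hard[np.arange(soft.shape[0]), soft.argmax(axis=1)] = 1.0

assert np.array_equal(hard.sum(axis=1), np.ones(2))
assert hard[0, 0] == 1.0 and hard[1, 2] == 1.0
```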

…atal algorithms

Co-authored-by: Manuel Blesa <mblesac@gmail.com>
Co-authored-by: Paola Galdi <paola.galdi@gmail.com>
@mblesac
Owner

mblesac commented Nov 26, 2021

Hi,

I added the option to use the hard or the soft segmentation (by default it uses the soft).

I tested it on a dHCP subject and it looks like this:

[images]

Now you can test it with any subject of the dHCP and it should work. I'm still waiting for confirmation from the HR office to be able to send you another subject from our cohort (with all the posteriors) so you can try it there as well.

- Ensure that input files are unique and unambiguous.
- Remove redundant logic across get_inputs() and execute() regarding use of hard segmentation.
- Simplify logic of primary conversion loop.
- Generate sub-cortical grey matter image from M-CRIB only once, rather than once per tissue; this should facilitate code execution without using the -force option.
@Lestropie
Collaborator Author

Also, at some point run the docs/generate_user_docs.sh script and commit the resulting changes to docs/reference/commands/5ttgen.rst and docs/reference/commands_list.rst under your accounts.

@mblesac mblesac merged commit d7475aa into master Dec 3, 2021