
Conversation

@Jamie001129 (Contributor) commented on Oct 24, 2025

Implementation Details
(1) 4 variants implemented: none, sparsity, proximity, plausibility
(2) Source: Adapted from official NICE repository (https://github.com/DBrughmans/NICE)
(3) Paper: Brughmans et al. (2024) "NICE: an algorithm for nearest instance counterfactual explanations" Data Mining and Knowledge Discovery
(4) Dataset: Adult
(5) Predictive models: Random Forest and MLP

Potential Differences
(1) The original autoencoder architecture was not provided, so we built our own; a minimal sketch of the kind of model we mean is shown after this list.
(2) The 200 test samples were originally chosen at random, so our samples may differ from the paper's.
(3) Runtimes differ across machines, but the relative ranking of the four variants is the same.
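
A minimal sketch of such an autoencoder (layer sizes and the PyTorch formulation are illustrative choices on our part, not taken from the paper and not necessarily identical to `library/autoencoder.py`):

```python
import torch
import torch.nn as nn


class TabularAutoencoder(nn.Module):
    """Small fully connected autoencoder; the per-instance reconstruction
    error is used as the plausibility (AE error) score."""

    def __init__(self, n_features: int, hidden: int = 16, latent: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, latent), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent, hidden), nn.ReLU(),
            nn.Linear(hidden, n_features),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

    def reconstruction_error(self, x: torch.Tensor) -> torch.Tensor:
        # Mean squared reconstruction error per instance.
        with torch.no_grad():
            return ((self.forward(x) - x) ** 2).mean(dim=1)
```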

Reproduced Results
(1) RF as predictive model (updated on 11/13/2025)

| Variant | Coverage | CPU (ms) | Sparsity | Proximity (L1) | Plausibility |
|---|---|---|---|---|---|
| none | 200/200 | 20.11 | 3.19 ± 1.07 | 0.52 ± 0.52 | 0.0885 ± 0.0209 |
| sparsity | 200/200 | 53.58 | 1.61 ± 0.91 | 0.34 ± 0.42 | 0.0886 ± 0.0220 |
| proximity | 200/200 | 56.70 | 1.77 ± 1.01 | 0.32 ± 0.41 | 0.0887 ± 0.0224 |
| plausibility | 200/200 | 70.33 | 2.15 ± 1.08 | 0.38 ± 0.43 | 0.0892 ± 0.0227 |

(2) MLP as predictive model

| Variant | Coverage | CPU (ms) | Sparsity | Proximity (HEOM) | Plausibility (AE error) |
|---|---|---|---|---|---|
| none | 200/200 | 6.70 | 3.89 ± 1.37 | 3.14 ± 1.34 | 0.2044 ± 0.0319 |
| sparsity | 200/200 | 10.80 | 1.22 ± 0.50 | 1.06 ± 0.52 | 0.2008 ± 0.0360 |
| proximity | 200/200 | 11.34 | 1.41 ± 0.69 | 1.09 ± 0.67 | 0.2034 ± 0.0346 |
| plausibility | 200/200 | 19.68 | 2.37 ± 1.37 | 2.04 ± 1.24 | 0.2014 ± 0.0342 |

Files Added/Modified
Main implementation:
methods/catalog/nice/model.py - Main NICE wrapper class implementing RecourseMethod interface
methods/catalog/nice/reproduce.py - Comprehensive test reproducing the paper's results (part of Table 6)

Library components:
methods/catalog/nice/library/__init__.py - Library exports
methods/catalog/nice/library/autoencoder.py - Autoencoder for plausibility measurement
methods/catalog/nice/library/data.py - Data handling and candidate filtering
methods/catalog/nice/library/distance.py - HEOM distance metric implementation
methods/catalog/nice/library/heuristic.py - Best-first greedy search
methods/catalog/nice/library/reward.py - Three reward functions (sparsity, proximity, plausibility)

Integration:
Updated methods/__init__.py to export NICE
Updated methods/catalog/__init__.py to include NICE
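
For reference, a quick usage sketch of the wrapper. The import paths for the catalogs, the ModelCatalog constructor arguments, and the `df_test` attribute are assumptions; the DataCatalog and NICE calls mirror the ones in reproduce.py:

```python
# Import paths for the catalogs are assumptions -- adjust to the actual package layout.
from data.catalog import DataCatalog
from models.catalog import ModelCatalog
from methods import NICE

data = DataCatalog("adult", model_type="mlp", train_split=0.7)  # same call as in reproduce.py
model = ModelCatalog(data, model_type="mlp")                    # constructor arguments are an assumption

nice = NICE(mlmodel=model, hyperparams={"optimization": "sparsity"})
factuals = data.df_test.head(200)                      # attribute name is an assumption
counterfactuals = nice.get_counterfactuals(factuals)   # RecourseMethod interface
```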

std_ae_error = ae_errors.std()

# ============================================
# PRINT ALL FOUR METRICS (like Table 5 in paper)
Collaborator:

I ran these tests locally, but the printed results don’t match the values reported in the paper. Please turn these print statements into assertions using the numbers from the table (a small tolerance is acceptable).
Also, these tests use the Random Forest model, so the correct reference is Table 6.
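
For example, something along these lines, using the reproduced RF numbers above as reference values (the `rf_metrics` fixture is a placeholder for whatever computes the per-variant means; the tolerance is illustrative):

```python
import pytest

# Reference values taken from the reproduced RF table above.
EXPECTED_RF = {
    "none":         {"sparsity": 3.19, "proximity_l1": 0.52},
    "sparsity":     {"sparsity": 1.61, "proximity_l1": 0.34},
    "proximity":    {"sparsity": 1.77, "proximity_l1": 0.32},
    "plausibility": {"sparsity": 2.15, "proximity_l1": 0.38},
}


@pytest.mark.parametrize("variant", list(EXPECTED_RF))
def test_rf_sparsity_matches_table6(variant, rf_metrics):
    # rf_metrics is a placeholder fixture returning per-variant mean metrics.
    assert rf_metrics[variant]["sparsity"] == pytest.approx(
        EXPECTED_RF[variant]["sparsity"], abs=0.15
    )
```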

elif optimization == "none":
# None should be very plausible (it's an actual instance!)
# But we allow some tolerance since we measure on test set
assert avg_ae_error <= 0.02, \
Collaborator:

I couldn’t find where the paper reports the average error rate. The only place I see something similar is in Table 7, but that value seems different from what’s being checked here. Could you point me to the exact reference?

Contributor Author (@Jamie001129), Nov 1, 2025:

There is an online_appendix.xlsx table in the NICE_experiments repo (https://github.com/DBrughmans/NICE_experiments/online_appendix.xlsx) that contains raw results instead of ranks. The dataset I access through DataCatalog is normalized, while the author uses a different preprocessing workflow, so my AE error comes out much smaller. I'm still working on it; do we need to match the author's preprocessing in our implementation?

for opt in ["none", "sparsity", "proximity", "plausibility"]:
nice = NICE(mlmodel=model, hyperparams={"optimization": opt})

# Measure CPU time
Collaborator:

CPU time isn’t a reliable metric for unit tests, since it depends on the hardware and environment where the code is executed.
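
If the timing is still worth reporting, it could be logged without asserting on it, e.g. (sketch; `nice` and `factuals` are placeholder fixtures, and it assumes counterfactuals come back as a DataFrame with NaN rows for failures):

```python
import logging
import time

logger = logging.getLogger(__name__)


def test_nice_full_coverage(nice, factuals):
    start = time.perf_counter()
    counterfactuals = nice.get_counterfactuals(factuals)
    elapsed_ms = (time.perf_counter() - start) * 1000.0

    # Timing is logged for information only; no assertion on hardware-dependent values.
    logger.info("NICE returned %d rows in %.2f ms", len(counterfactuals), elapsed_ms)

    # Assert only on coverage, which is hardware-independent.
    assert len(counterfactuals.dropna()) == len(factuals)
```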

Contributor Author:

I have removed the CPU time assertions.

print(f" NICE({opt:<12}): {metrics['cpu_time_total_ms']:>8.2f} ms total "
f"({metrics['cpu_time_avg_ms']:>6.2f} ms per instance)")

# Verify expectations
Collaborator:

I’d suggest making each of these assertions a separate unit test for better clarity.
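
For example (the `metrics` fixture is a placeholder keyed by variant, with metric names following the tables above):

```python
def test_sparsity_variant_reduces_sparsity(metrics):
    # The sparsity-optimised variant should change fewer features than plain NICE.
    assert metrics["sparsity"]["sparsity"] < metrics["none"]["sparsity"]


def test_proximity_variant_reduces_l1_distance(metrics):
    # The proximity-optimised variant should stay closer to the factual than plain NICE.
    assert metrics["proximity"]["proximity_l1"] < metrics["none"]["proximity_l1"]
```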

print(f"✓ NICE integrates correctly with {dataset_name} dataset")


if __name__ == "__main__":
Collaborator:

This script runs tests manually with print statements, but we should convert it into proper unit tests (e.g., using pytest) instead of using "print" outputs.
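
A possible shape for that conversion, reusing the existing loop over the four variants (the `model` and `factuals` fixtures are placeholders):

```python
import pytest

from methods import NICE


@pytest.mark.parametrize("optimization", ["none", "sparsity", "proximity", "plausibility"])
def test_nice_quality(optimization, model, factuals):
    nice = NICE(mlmodel=model, hyperparams={"optimization": optimization})
    counterfactuals = nice.get_counterfactuals(factuals)

    # An assertion replaces the old print, e.g. full coverage as reported in the tables above.
    assert len(counterfactuals.dropna()) == len(factuals)
```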

Collaborator:

These three tests failed when I tried to run them; please fix them so they pass:
test_nice_quality[mlp-proximity]
test_nice_quality[mlp-plausibility]
nice_variants_comparison[mlp]

Collaborator:

Please avoid having multiple assertions in a single unit test. I’d suggest keeping one assertion per test to make it clearer and easier to debug later.

"""
Test that NICE produces quality counterfactuals with all metrics in expected ranges.
"""
data = DataCatalog("adult", model_type=model_type, train_split=0.7)
Collaborator:

Please use pytest fixtures to build the DataCatalog/ModelCatalog and the AutoEncoder once per dataset/model, then reuse them across tests. Each test can still create a fresh NICE instance and slice the same factuals for isolation.
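
Something like this (the DataCatalog call matches the diff; the ModelCatalog and Autoencoder constructors, the `df_test` attribute, and the "forest" identifier are assumptions):

```python
import pytest

# Import paths and constructor signatures are assumptions -- adjust to the repo layout.
from data.catalog import DataCatalog
from models.catalog import ModelCatalog
from methods import NICE
from methods.catalog.nice.library.autoencoder import Autoencoder


@pytest.fixture(scope="module", params=["forest", "mlp"])  # the RF identifier is a guess
def setup(request):
    """Build DataCatalog, ModelCatalog and the autoencoder once per model type."""
    data = DataCatalog("adult", model_type=request.param, train_split=0.7)
    model = ModelCatalog(data, model_type=request.param)  # signature is an assumption
    autoencoder = Autoencoder(data)                       # signature is an assumption
    factuals = data.df_test.head(200)                     # attribute name is an assumption
    return model, autoencoder, factuals


def test_nice_returns_counterfactuals(setup):
    model, _autoencoder, factuals = setup
    # Each test still builds a fresh NICE instance for isolation.
    nice = NICE(mlmodel=model, hyperparams={"optimization": "sparsity"})
    assert nice.get_counterfactuals(factuals) is not None
```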

Collaborator (@zkhotanlou) left a comment:

Also, please fetch the changes from the main branch so that the pre-commit hooks can run successfully.
