fix(pre-launch science review): changing config, default behaviour fo…#138
Conversation
…r aggregation strategy, introducing unknown ratio measure, splitting also considers \n BREAKING CHANGES: ClassificationAccuracyConfig changing fields, introducing 2 more fields
BALANCED_ACCURACY_SCORE = "balanced_accuracy_score"
PRECISION_SCORE = "precision_score"
RECALL_SCORE = "recall_score"
UNKNOWN_RATIO = "unknown_ratio"
We can't add new scores right now. We are very close to launch, and this could break integration tests for us and for clients.
Can we please stick to bug fixes and refactoring only? New scores should be added post-launch.
Sure, we can add this one later on.
PREDICTED_CLASS_COLUMN_NAME = "predicted_class"

def unknown_ratio(y_pred) -> float:
Please add type annotations
I think type annotations are tricky in these kinds of use cases, where the argument is best described as a "1D array-like" (see the sklearn docs, for example).
I suppose we could do something like Union[pandas.DataFrame, pandas.Series, numpy.ndarray, List[Any]], but that would be quite clunky, and not even necessarily exhaustive.
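For illustration only (not part of this PR): numpy ships an ArrayLike alias meant for exactly this "1D array-like" situation, which avoids hand-rolling a Union. The sketch below assumes numpy >= 1.20; the body is only a guess at the intent of unknown_ratio (fraction of predictions equal to a hypothetical "unknown" label), and the real implementation may differ.

import numpy as np
from numpy.typing import ArrayLike

UNKNOWN_LABEL = "unknown"  # hypothetical constant; the PR's actual name may differ


def unknown_ratio(y_pred: ArrayLike) -> float:
    """Fraction of predictions that fell back to the unknown label (sketch)."""
    y_pred = np.asarray(y_pred, dtype=object)
    if y_pred.size == 0:
        return 0.0
    return float(np.sum(y_pred == UNKNOWN_LABEL) / y_pred.size)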
valid_labels: Optional[List[str]] = None
converter_fn: Callable[[str, List[str]], str] = convert_model_output_to_label
multiclass_average_strategy: Optional[str] = "micro"
binary_average_strategy: Optional[str] = "binary"
If you update this, you have to update ClassificationAccuracySemanticRobustnessConfig as well.
Do we need a variable for binary_average_strategy="binary"? It can be hardcoded, since there are no other options for the binary case.
@keerthanvasist OK, will do.
@polaschwoebel No, all the other options are valid for the binary case as well, and they result in different computations, so they still make sense.
I see. I doubt that others will be used much -- "binary" is very much the standard behavior as we discussed offline -- but if we can add this new parameter to the config without too much trouble it's of course more flexible.
I very much agree with Pola; "binary" is pretty much the only option that makes sense here. I suppose giving the user freedom to choose other options is fine, since we've configured the default appropriately. If they choose something like "micro" and are confused why precision and recall are always the same, that's a self-inflicted customer error.
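For reference, a small illustration (not from this PR; the toy labels are made up) of why a user who picks "micro" would see identical numbers: with sklearn, average="micro" makes precision and recall both collapse to overall accuracy.

from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up toy labels for illustration.
y_true = ["1", "1", "1", "0", "0"]
y_pred = ["1", "0", "0", "1", "0"]

# With average="micro", precision and recall both reduce to overall accuracy,
# so they are always equal to each other.
print(accuracy_score(y_true, y_pred))                    # 0.4
print(precision_score(y_true, y_pred, average="micro"))  # 0.4
print(recall_score(y_true, y_pred, average="micro"))     # 0.4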
converter_fn: Callable[[str, List[str]], str] = convert_model_output_to_label
multiclass_average_strategy: Optional[str] = "micro"
binary_average_strategy: Optional[str] = "binary"
positive_label: str = "1"
I am also concerned about this default. I am worried this won't play well with valid_labels. Let's discuss on Monday.
Ok. What are you concerned about?
I agree with Keerthan: positive_label should be one of the valid_labels.
Yes, true, but the problem is that if we autodetect the valid labels, I don't think the order of the resulting list is going to be deterministic (especially when we do downsampling). So if we pick, say, the first one, we might end up with a label that depends on the random seed, which is also not great.
What I can do is leave positive_label here, and if positive_label is not in valid_labels, pick the first of valid_labels and print a warning. Would that work?
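A minimal sketch of the fallback being proposed here, with assumed names (positive_label and valid_labels match the config fields above; the helper name and warning wording are made up):

import logging
from typing import List

logger = logging.getLogger(__name__)


def resolve_positive_label(positive_label: str, valid_labels: List[str]) -> str:
    """Return positive_label if it is valid, otherwise fall back to the first valid label."""
    if positive_label in valid_labels:
        return positive_label
    fallback = valid_labels[0]
    logger.warning(
        "positive_label %r is not in valid_labels %s; falling back to %r.",
        positive_label, valid_labels, fallback,
    )
    return fallback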
keerthanvasist left a comment
More comments to come. We should discuss the changes, and we should not merge this new score now.
valid_labels = [label.lower().strip() for label in valid_labels]

- response_words = model_output.split(" ")
+ response_words = re.split(r"[\n\s+]", model_output)
This is not tested as far as I can tell, and I am in doubt about what it does and how it relates to batching. Can you add a unit test please?
This is a standard regular expression. It matches new lines \n or one or more spaces \s+. The [...] indicates set matching (which is the same as a logical OR among characters).
So we are anticipating new lines within a single answer? Are you imagining a case where the model replies something like this?
The answer is
2
I still think a unit test would be a good way to show what we are matching here; it could just be one more test case in test_convert_model_output_to_label.
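For illustration, a hedged sketch of one such test case. It calls re.split directly so the sketch is self-contained; the real test would go through convert_model_output_to_label. One note: inside a character class the + is a literal character and \s already covers \n, so [\n\s]+ may be closer to the stated intent of splitting on runs of whitespace.

import re


def test_split_handles_newline():
    # Hypothetical extra case for test_convert_model_output_to_label.
    model_output = "The answer is\n2"
    response_words = re.split(r"[\n\s+]", model_output)
    assert response_words == ["The", "answer", "is", "2"]
    # The previous " "-based split keeps "is\n2" as a single token, missing the label.
    assert "is\n2" in model_output.split(" ")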
Hi all, we need to converge on this. We can:
Let me know what you prefer.
PREDICTED_CLASS_COLUMN_NAME = "predicted_class"

def unknown_ratio(y_pred) -> float:
Perhaps rename this to unknown_fraction instead?
@@ -274,14 +297,28 @@ def _generate_columns(row: Dict[str, Any]) -> Dict[str, Any]:  # pragma: no cover

def _get_score(self, y_true, y_pred, eval_fn: Callable[..., float]) -> float:
Please add unit tests for _get_score for the various cases (binary recall, binary precision, multiclass recall, multiclass precision, etc).
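A hedged sketch of how those cases might be parametrized. The fixtures are made up and hand-checked; the real tests would exercise _get_score on the algorithm class (its constructor and exact call pattern are not shown in this diff), whereas this sketch asserts against sklearn directly to keep it self-contained.

import pytest
from sklearn.metrics import precision_score, recall_score

# Tiny hand-checked fixtures (made up for illustration).
Y_TRUE_BINARY = ["1", "1", "1", "0", "0"]
Y_PRED_BINARY = ["1", "0", "0", "1", "0"]
Y_TRUE_MULTI = ["a", "a", "b", "c", "c"]
Y_PRED_MULTI = ["a", "b", "b", "c", "b"]


@pytest.mark.parametrize(
    "y_true, y_pred, eval_fn, kwargs, expected",
    [
        # Binary case: precision and recall deliberately differ (TP=1, FP=1, FN=2).
        (Y_TRUE_BINARY, Y_PRED_BINARY, precision_score, {"average": "binary", "pos_label": "1"}, 0.5),
        (Y_TRUE_BINARY, Y_PRED_BINARY, recall_score, {"average": "binary", "pos_label": "1"}, 1 / 3),
        # Multiclass case with macro averaging: per-class scores averaged equally.
        (Y_TRUE_MULTI, Y_PRED_MULTI, precision_score, {"average": "macro"}, 7 / 9),
        (Y_TRUE_MULTI, Y_PRED_MULTI, recall_score, {"average": "macro"}, 2 / 3),
    ],
)
def test_precision_and_recall_cases(y_true, y_pred, eval_fn, kwargs, expected):
    assert eval_fn(y_true, y_pred, **kwargs) == pytest.approx(expected)

The binary fixture also gives a case where the expected precision and recall differ, which the comment below asks about as well.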
EvalScore(name=UNKNOWN_RATIO, value=0.0),
]

DATASET_SCORES_WO_CONFIG = [
Can you explain what this represents? I see that you replaced all instances of DATASET_SCORES with DATASET_SCORES_WO_CONFIG. Shouldn't we still have some tests that use DATASET_SCORES? Also, could you please come up with a case where the expected precision and recall scores are different?
changing config, default behavior for aggregation strategy, introducing unknown ratio measure, splitting also considers \n
BREAKING CHANGES: ClassificationAccuracyConfig changing fields, introducing 2 more fields and renaming one
Description of changes:
- Use re.split for the default converter function. Reason: some models do not put white spaces before/after a new line, causing parsing to miss many valid outputs.
- Added binary_average_strategy and positive_label to handle the binary classification case. I changed the default of multiclass_average_strategy, as micro is non-standard.
- Introduced the unknown ratio measure, based on predictions mapped to unknown.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
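For completeness, a hedged usage sketch of the updated config; the field names follow the diff hunks above, but the import path is an assumption and may differ.

# Import path is a guess; use the package's classification accuracy module.
from amazon_fmeval.eval_algorithms.classification_accuracy import ClassificationAccuracyConfig

config = ClassificationAccuracyConfig(
    valid_labels=["0", "1"],             # can also be auto-detected, per the thread above
    multiclass_average_strategy="micro",
    binary_average_strategy="binary",    # new field introduced in this PR
    positive_label="1",                  # new field; should be one of valid_labels (see discussion)
)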