fix(pre-launch science review): changing config, default behaviour fo…#138
Conversation
…r aggregation strategy, introducing unknown ratio measure, splitting also considers \n BREAKING CHANGES: ClassificationAccuracyConfig changing fields, introducing 2 more fields
BALANCED_ACCURACY_SCORE = "balanced_accuracy_score"
PRECISION_SCORE = "precision_score"
RECALL_SCORE = "recall_score"
UNKNOWN_RATIO = "unknown_ratio"
We can't add new scores right now. We are very close to launch, and this could break integration tests for us and for clients.
Can we please stick to bug fixes and refactoring only? New scores should be added post-launch.
Sure, we can add this one later on.
PREDICTED_CLASS_COLUMN_NAME = "predicted_class"

def unknown_ratio(y_pred) -> float:
Please add type annotations
I think type annotations are tricky in these kinds of use cases, where the argument is best described as a "1D array-like" (see the sklearn docs, for example).
I suppose we could do something like Union[pandas.DataFrame, pandas.Series, numpy.ndarray, List[Any]], but that would be quite clunky, and not even necessarily exhaustive.
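For illustration only (not part of this PR): numpy ships an ArrayLike alias meant for exactly this "1D array-like" situation, which avoids hand-rolling a Union. The sketch below assumes numpy >= 1.20; the body is only a guess at the intent of unknown_ratio (fraction of predictions equal to a hypothetical "unknown" label), and the real implementation may differ.

import numpy as np
from numpy.typing import ArrayLike

UNKNOWN_LABEL = "unknown"  # hypothetical constant; the PR's actual name may differ


def unknown_ratio(y_pred: ArrayLike) -> float:
    """Fraction of predictions that fell back to the unknown label (sketch)."""
    y_pred = np.asarray(y_pred, dtype=object)
    if y_pred.size == 0:
        return 0.0
    return float(np.sum(y_pred == UNKNOWN_LABEL) / y_pred.size)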
valid_labels: Optional[List[str]] = None
converter_fn: Callable[[str, List[str]], str] = convert_model_output_to_label
multiclass_average_strategy: Optional[str] = "micro"
binary_average_strategy: Optional[str] = "binary"
If you update this, you have to update ClassificationAccuracySemanticRobustnessConfig as well.
Do we need a variable for binary_average_strategy="binary"? It can be hardcoded, since there are no other options for the binary case.
@keerthanvasist OK, will do.
@polaschwoebel No, all the other options are valid for the binary case as well, and they result in different computations, so they still make sense.
I see. I doubt that others will be used much -- "binary" is very much the standard behavior as we discussed offline -- but if we can add this new parameter to the config without too much trouble it's of course more flexible.
I very much agree with Pola; "binary" is pretty much the only option that makes sense here. I suppose giving the user freedom to choose other options is fine, since we've configured the default appropriately. If they choose something like "micro" and are confused why precision and recall are always the same, that's a self-inflicted customer error.
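For reference, a small illustration (not from this PR; the toy labels are made up) of why a user who picks "micro" would see identical numbers: with sklearn, average="micro" makes precision and recall both collapse to overall accuracy.

from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up toy labels for illustration.
y_true = ["1", "1", "1", "0", "0"]
y_pred = ["1", "0", "0", "1", "0"]

# With average="micro", precision and recall both reduce to overall accuracy,
# so they are always equal to each other.
print(accuracy_score(y_true, y_pred))                    # 0.4
print(precision_score(y_true, y_pred, average="micro"))  # 0.4
print(recall_score(y_true, y_pred, average="micro"))     # 0.4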
converter_fn: Callable[[str, List[str]], str] = convert_model_output_to_label
multiclass_average_strategy: Optional[str] = "micro"
binary_average_strategy: Optional[str] = "binary"
positive_label: str = "1"
I am also concerned about this default. I am worried this won't play well with valid_labels. Let's discuss on Monday.
Ok. What are you concerned about?
I agree with Keerthan: positive_label should be one of the valid_labels.
Yes, true, but the problem is that if we autodetect the valid labels, I don't think the order of the resulting list is going to be deterministic (especially when we do downsampling). So if we pick, say, the first one, we might end up with a label that depends on the random seed, which is also not great.
What I can do is leave positive_label here, and if positive_label is not in valid_labels, pick the first of valid_labels and print a warning. Would that work?
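A minimal sketch of the fallback being proposed here, with assumed names (positive_label and valid_labels match the config fields above; the helper name and warning wording are made up):

import logging
from typing import List

logger = logging.getLogger(__name__)


def resolve_positive_label(positive_label: str, valid_labels: List[str]) -> str:
    """Return positive_label if it is valid, otherwise fall back to the first valid label."""
    if positive_label in valid_labels:
        return positive_label
    fallback = valid_labels[0]
    logger.warning(
        "positive_label %r is not in valid_labels %s; falling back to %r.",
        positive_label, valid_labels, fallback,
    )
    return fallback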
keerthanvasist left a comment
More comments to come. We should discuss the changes, and we should not merge this new score now.
valid_labels = [label.lower().strip() for label in valid_labels]

- response_words = model_output.split(" ")
+ response_words = re.split(r"[\n\s+]", model_output)
This is not tested as far as I can tell, and I am in doubt about what it does and how it relates to batching. Can you add a unit test please?
This is a standard regular expression. It matches new lines \n or one or more spaces \s+. The [...] indicates set matching (which is the same as a logical OR among characters).
So we are anticipating new lines within a single answer? Are you imagining a case where the model replies something like this?
The answer is
2
I still think a unit test would be a good way to show what we are matching here; it could just be one more test case in test_convert_model_output_to_label.
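For illustration, a hedged sketch of one such test case. It calls re.split directly so the sketch is self-contained; the real test would go through convert_model_output_to_label. One note: inside a character class the + is a literal character and \s already covers \n, so [\n\s]+ may be closer to the stated intent of splitting on runs of whitespace.

import re


def test_split_handles_newline():
    # Hypothetical extra case for test_convert_model_output_to_label.
    model_output = "The answer is\n2"
    response_words = re.split(r"[\n\s+]", model_output)
    assert response_words == ["The", "answer", "is", "2"]
    # The previous " "-based split keeps "is\n2" as a single token, missing the label.
    assert "is\n2" in model_output.split(" ")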
Hi all, we need to converge on this. We can:
Let me know what you prefer.
PREDICTED_CLASS_COLUMN_NAME = "predicted_class"

def unknown_ratio(y_pred) -> float:
Perhaps rename this to unknown_fraction instead?
@@ -274,14 +297,28 @@ def _generate_columns(row: Dict[str, Any]) -> Dict[str, Any]:  # pragma: no cover

def _get_score(self, y_true, y_pred, eval_fn: Callable[..., float]) -> float:
Please add unit tests for _get_score for the various cases (binary recall, binary precision, multiclass recall, multiclass precision, etc).
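A hedged sketch of how those cases might be parametrized. The fixtures are made up and hand-checked; the real tests would exercise _get_score on the algorithm class (its constructor and exact call pattern are not shown in this diff), whereas this sketch asserts against sklearn directly to keep it self-contained.

import pytest
from sklearn.metrics import precision_score, recall_score

# Tiny hand-checked fixtures (made up for illustration).
Y_TRUE_BINARY = ["1", "1", "1", "0", "0"]
Y_PRED_BINARY = ["1", "0", "0", "1", "0"]
Y_TRUE_MULTI = ["a", "a", "b", "c", "c"]
Y_PRED_MULTI = ["a", "b", "b", "c", "b"]


@pytest.mark.parametrize(
    "y_true, y_pred, eval_fn, kwargs, expected",
    [
        # Binary case: precision and recall deliberately differ (TP=1, FP=1, FN=2).
        (Y_TRUE_BINARY, Y_PRED_BINARY, precision_score, {"average": "binary", "pos_label": "1"}, 0.5),
        (Y_TRUE_BINARY, Y_PRED_BINARY, recall_score, {"average": "binary", "pos_label": "1"}, 1 / 3),
        # Multiclass case with macro averaging: per-class scores averaged equally.
        (Y_TRUE_MULTI, Y_PRED_MULTI, precision_score, {"average": "macro"}, 7 / 9),
        (Y_TRUE_MULTI, Y_PRED_MULTI, recall_score, {"average": "macro"}, 2 / 3),
    ],
)
def test_precision_and_recall_cases(y_true, y_pred, eval_fn, kwargs, expected):
    assert eval_fn(y_true, y_pred, **kwargs) == pytest.approx(expected)

The binary fixture also gives a case where the expected precision and recall differ, which the comment below asks about as well.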
EvalScore(name=UNKNOWN_RATIO, value=0.0),
]

DATASET_SCORES_WO_CONFIG = [
Can you explain what this represents? I see that you replaced all instances of DATASET_SCORES with DATASET_SCORES_WO_CONFIG. Shouldn't we still have some tests that use DATASET_SCORES? Also, could you please come up with a case where the expected precision and recall scores are different?
changing config, default behavior for aggregation strategy, introducing unknown ratio measure, splitting also considers \n
BREAKING CHANGES: ClassificationAccuracyConfig changing fields, introducing 2 more fields and renaming one
Description of changes:
- Use re.split for the default converter function. Reason: some models do not put white spaces before/after a new line, causing parsing to miss many valid outputs.
- Added binary_average_strategy and positive_label to handle the binary classification case. I changed the default of multiclass_average_strategy, as micro is non-standard.
- Introduced the unknown ratio measure, based on predictions mapped to unknown.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
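For completeness, a hedged usage sketch of the updated config; the field names follow the diff hunks above, but the import path is an assumption and may differ.

# Import path is a guess; use the package's classification accuracy module.
from amazon_fmeval.eval_algorithms.classification_accuracy import ClassificationAccuracyConfig

config = ClassificationAccuracyConfig(
    valid_labels=["0", "1"],             # can also be auto-detected, per the thread above
    multiclass_average_strategy="micro",
    binary_average_strategy="binary",    # new field introduced in this PR
    positive_label="1",                  # new field; should be one of valid_labels (see discussion)
)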