setup basic FR preprocessing #87
Conversation
kdavis-mozilla
left a comment
Thanks for the commit!
However, before I review the code proper, could you change all the packaging code to adhere to the conventions of this repo? In other words, using setup.cfg as described here, for example.
Thanks @kdavis-mozilla for the review. I've made the requested changes.
```python
from corporacreator.utils import maybe_normalize, replace_numbers, FIND_PUNCTUATIONS_REG, FIND_MULTIPLE_SPACES_REG

FIND_ORDINAL_REG = re.compile(r"(\d+)([ème|éme|ieme|ier|iere]+)")
```
@nicolaspanel Maybe we should include potential spaces? I saw data like "1 er".

@lissyx since there is no such case in clips.tsv, I suggest we wait.
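For reference, a minimal sketch of how the ordinal pattern above behaves (plain `re` semantics; the example strings are mine). Note that the bracketed part is a character class, so the `|` characters are literal members of the class rather than alternation between whole suffixes:

```python
import re

# Pattern quoted from the diff above. The brackets form a character class,
# so it matches runs of the letters è, m, e, é, i, r (and a literal '|'),
# not the intended alternation of complete suffixes.
FIND_ORDINAL_REG = re.compile(r"(\d+)([ème|éme|ieme|ier|iere]+)")

m = FIND_ORDINAL_REG.search("le 1er janvier")
print(m.groups())  # ('1', 'er')

# With a space between digit and suffix ("1 er"), the pattern does not
# match at all, which is the case raised in the comment above.
print(FIND_ORDINAL_REG.search("le 1 er janvier"))  # None
```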
kdavis-mozilla
left a comment
There are several issues (see the comments) that I have questions on and/or that need to be addressed.
```python
"""
# TODO: Clean up fr data
return sentence
text = maybe_normalize(sentence, mapping=FR_NORMALIZATIONS)
```
Do these always make sense? (See comments above on FR_NORMALIZATIONS.)
I think so.
If not, then special cases should be handled using client_id.
@kdavis-mozilla can you think of an example?
```python
FR_NORMALIZATIONS = [
    [re.compile(r'(^|\s)(\d+)\s(0{3})(\s|\.|,|\?|!|$)'), r'\1\2\3\4'],  # "123 000 …" => "123000 …"
```
We ideally want to not have digits. That said, I'm not sure I understand the motivation for this change.
For example, "123 000" might have been read "one hundred twenty three zero zero zero". However, now it's changed to "123000", which I doubt would be read as "one hundred twenty three zero zero zero". So we'd introduce a mismatch between the audio and the text.
Are you assuming that replace_numbers() fixes this?
If so, how can replace_numbers() do this accurately, as it does not know about splits like "123 000"? All it would see is "123000", which, due to the split, may have been pronounced "one hundred twenty three zero zero zero".
Part of that is my code from the CommonVoice-fr repo, so it was accurate enough on a dataset like the one from the French parliament.
> Are you assuming that replace_numbers() fixes this?

Yes.

> For example "123 000" might have been read "one hundred twenty three zero zero zero".

I assume that is not the case and the user said "cent vingt trois mille".
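For what it's worth, here is a quick check of what the thousands-separator rule does (pattern copied from the diff above; the example strings are mine):

```python
import re

# Thousands-separator rule from FR_NORMALIZATIONS: "123 000" => "123000"
pattern = re.compile(r'(^|\s)(\d+)\s(0{3})(\s|\.|,|\?|!|$)')

print(pattern.sub(r'\1\2\3\4', "ils étaient 123 000 personnes"))
# ils étaient 123000 personnes

# It only fires when the second chunk is exactly "000", so a split like
# "123 456" is left untouched.
print(pattern.sub(r'\1\2\3\4', "123 456"))  # 123 456
```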
> Part of that is my code from CommonVoice-fr repo, so it was accurate enough on a dataset like the one from French parliament.

Thanks @lissyx.
It seems that some cases were still not properly handled. For example, clips.tsv contains sentences like:
les trois,000 valeur du trésor de Lorettoà ma fille, et dix.000 fr.Loretto contenait un trésor à peu près de trois,000 liv.
@nicolaspanel Yep, those are actually part of another dataset that was much less well formatted, and some errors slipped through :/
I know this is not the perfect place to discuss this, but....
I'm wondering if we could save a lot of time by simply having common.py mark as invalid any sentences with digits in them.
It's Draconian, but I think there are many problems like the ones we are thinking about here in multiple languages and I don't think they will be solved soon in all the languages, and we want to get the data out the door as soon as possible.
I'd be interested in your opinions.
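The Draconian option is at least cheap to implement; a hypothetical sketch of what such a common.py filter might look like (the function name and the return-None convention are mine, not from the actual code):

```python
import re

HAS_DIGIT = re.compile(r"\d")

def invalidate_digits(client_id, sentence):
    """Hypothetical common.py-style filter: mark any sentence
    containing a digit as invalid (returned as None)."""
    return None if HAS_DIGIT.search(sentence) else sentence

print(invalidate_digits('*', "les trois,000 valeur"))  # None (invalid)
print(invalidate_digits('*', "Jean-Paul deux."))       # Jean-Paul deux.
```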
```shell
$ cut -f3 source/fr/validated.tsv | grep '[[:digit:]]' | wc -l
366
```
I guess we can skip numbers for now, fix it in the CV dataset, and hand-craft the current recordings; 366 is not impossible.
Looks like there are a few leftover digits being searched for here.
src/corporacreator/utils.py
Outdated
```python
try:
    ee = ''.join(e.split())
    if int(e) >= 0:
        newinp = num2words(int(ee), lang=locale)
```
How can this work in all cases?
For example, "Room 456" can validly be read as "Room four five six" or as "Room four hundred and fifty six". This code can't catch that.
It is for reasons exactly like this that the client_id is passed to fr() so you can hear what the person said and provide the correct transcript.
Here we assume the value is not ambiguous. Situations like "Room four five six" should have already been handled by the maybe_normalize step to produce "Room 4 5 6" instead of the original "Room 456".
Are we allowed to assume we are in a non-ambiguous case?
I don't see how we can assume such without hearing the audio.
@kdavis-mozilla I think in French we should be fine regarding ambiguous cases, except for numbers > 1099 and <= 9999. Those might be (and, in the case of dates, often are) spelled by hundreds. But as I said to @nicolaspanel, if it's too much work for him and too tricky (edge cases risk polluting the dataset), then I can dig into the clips and listen later.
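For context, the maybe_normalize contract this thread relies on is just an ordered list of [pattern, replacement] pairs applied before number expansion. A rough sketch of that assumed behavior (this is my reading of the helper, not its actual code):

```python
import re

def maybe_normalize(value, mapping):
    # Assumed behavior: apply each [pattern, replacement] pair in order.
    # Plain strings are literal replacements; compiled patterns use sub().
    for pattern, replacement in mapping:
        if isinstance(pattern, str):
            value = value.replace(pattern, replacement)
        else:
            value = pattern.sub(replacement, value)
    return value

# An ambiguous reading has to be disambiguated by an explicit mapping
# entry before any generic number expansion runs:
print(maybe_normalize("Room 456", [["Room 456", "Room 4 5 6"]]))  # Room 4 5 6
```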
@lissyx @kdavis-mozilla

@kdavis-mozilla @lissyx I think I've made all the requested changes.
```python
from corporacreator.utils import maybe_normalize, replace_numbers, FIND_PUNCTUATIONS_REG, FIND_MULTIPLE_SPACES_REG

FIND_ORDINAL_REG = re.compile(r"(\d+)([ème|éme|ieme|ier|iere]+)")

# TODO: Clean up fr data
return sentence
text = maybe_normalize(sentence, mapping=FR_NORMALIZATIONS + [REPLACE_SPELLED_ACRONYMS])
text = replace_numbers(text, locale='fr', ordinal_regex=FIND_ORDINAL_REG)
```
Just wondering if we should do that now or later: as you showed, my heuristics were not perfect, so maybe it'd be best if I listened to the recordings, adjusted with client_id, and fixed the Common Voice dataset at the same time?
As far as I can tell, it works just fine as is (I am also using it in the trainingspeech project).
It is a good idea to pick and listen to a few examples to check, but checking all the examples would take a lot of time...
Personally, I won't have that bandwidth anytime soon...
That's why I was suggesting that I do it :)
@nicolaspanel That's what was implied :)
@lissyx @nicolaspanel This is related to my "invalidate all sentences with digits" comment above.
I'd be interested in your take on the Draconian idea to have common.py mark as invalid any sentences with digits in them.
@kdavis-mozilla I guess it's not such a bad idea, with a logging mode to help identify and fix the dataset as well.
```
@@ -0,0 +1,34 @@
import pytest
```
kdavis-mozilla
left a comment
I've added a few comments on the code, but more than anything I wanted to ask everyone following this issue about the Draconian idea of having a separate PR that has common.py mark as invalid any sentences with digits in them.
The most complicated part of this PR, and of other PRs in other languages, is the digits. So my thought was to just solve the problem once and for all in all languages: throw out any sentence with digits.
That said, if a separate PR is made to have common.py mark as invalid any sentences with digits in them, then a lot of the code here is not needed.
@lissyx it'd be great to have your feedback too!
```python
FIND_ORDINAL_REG = re.compile(r"(\d+)([ème|éme|ieme|ier|iere]+)")

SPELLED_ACRONYMS = {
```
If this contains all the acronyms in fr for the current clips.tsv, then we're fine.
@lissyx @nicolaspanel Sorry for my review comment being more about starting a discussion, but I think it's a discussion that needs to happen, as complicated digit manipulation in many languages isn't going to happen on a timescale that's useful for a data release. (@nicolaspanel this is not a reflection on you, as you are the only person who's taken a real shot at doing the digits right!)
@nicolaspanel @lissyx I am going to introduce a PR later today to mark as invalid any sentence in any language with digits; see issue 89. As the release date looms and the number of languages that deal with digits properly is almost zero, I talked with other members of the project to establish that this is the most practical way forward to maintain a high-quality data set with a release in the foreseeable future. As a note, this would increase the number of invalid French sentences by 0.52%, which I think is acceptable.
lissyx
left a comment
LGTM now, let's not lose time on numbers.
@nicolaspanel I just merged code that marks as invalid all sentences in all languages that have digits. For French this decreases the number of valid sentences by 0.52%, which I think is acceptable. So could you remove all the code in your PR that deals with digits?
@kdavis-mozilla Done. I marked the related unit tests as skipped since we may want to support them later.
kdavis-mozilla
left a comment
Thanks for removing most of the digits code, but it seems like there are still digit relics in the regular expressions that should be removed.
I think I commented on all of them with "Looks like there are a few leftover digits being searched for here.", but I might have missed one or two.
```python
FR_NORMALIZATIONS = [
    ['Jean-Paul II', 'Jean-Paul deux'],
    [re.compile(r'(^|\s)(\d+)T(\s|\.|,|\?|!|$)'), r'\1\2 tonnes\3'],
```
Looks like there are a few leftover digits being searched for here.
```python
FR_NORMALIZATIONS = [
    [re.compile(r'(^|\s)(\d+)\s(0{3})(\s|\.|,|\?|!|$)'), r'\1\2\3\4'],  # "123 000 …" => "123000 …"
```
Looks like there are a few leftover digits being searched for here.
```python
[re.compile(r'(^|\s)/an(\s|\.|,|\?|!|$)'), r'\1par an\2'],
[re.compile(r'(^|\s)(\d+)\s(0{3})(\s|\.|,|\?|!|$)'), r'\1\2\3\4'],  # "123 000 …" => "123000 …"
[re.compile(r'(^|\s)km(\s|\.|,|\?|!|$)'), r'\1 kilomètres \2'],
[re.compile(r'(^|\s)0(\d)(\s|\.|,|\?|!|$)'), r'\1zéro \2 \3'],
```
Looks like there are a few leftover digits being searched for here.
```python
[re.compile(r'(^|\s)0(\d)(\s|\.|,|\?|!|$)'), r'\1zéro \2 \3'],
['%', ' pourcent'],
[re.compile(r'(^|\s)\+(\s|\.|,|\?|!|$)'), r'\1 plus \2'],
[re.compile(r'(\d+)\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1 mètre carré\2'],
```
Looks like there are a few leftover digits being searched for here.
```python
[re.compile(r'(\d+)\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1 mètre carré\2'],
[re.compile(r'(^|\s)m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1mètre carré\2'],
[re.compile(r'/\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r' par mètre carré\1'],
[re.compile(r'(^|\s)(\d+),(\d{2})\s?€(\s|\.|,|\?|!|$)'), r'\1\2 euros \3 \4'],
```
Looks like there are a few leftover digits being searched for here.
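To make the review concrete, here is what the surface-area and price rules quoted above actually produce (patterns copied from the diff; the example strings are mine):

```python
import re

# Patterns quoted from FR_NORMALIZATIONS in the diff above.
M2 = re.compile(r'(\d+)\s?m(?:2|²)(\s|\.|,|\?|!|$)')
EUR = re.compile(r'(^|\s)(\d+),(\d{2})\s?€(\s|\.|,|\?|!|$)')

print(M2.sub(r'\1 mètre carré\2', "un studio de 25 m², proche"))
# un studio de 25 mètre carré, proche

# Note the double space this rule leaves behind (later collapsed by
# FIND_MULTIPLE_SPACES_REG):
print(EUR.sub(r'\1\2 euros \3 \4', "ça coûte 12,50 € environ"))
# ça coûte 12 euros 50  environ
```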
```python
FIND_MULTIPLE_SPACES_REG = re.compile(r'\s{2,}')
FIND_PUNCTUATIONS_REG = re.compile(r"[/°\-,;!?.()\[\]*…—«»]")
```
In parallel with the removal of the commented-out code above, I'd ask that this also be removed, as it's not used.
```python
text = maybe_normalize(sentence, mapping=FR_NORMALIZATIONS + [REPLACE_SPELLED_ACRONYMS])
# TODO: restore this once we are clear on which punctuation marks should be kept or removed
# text = FIND_PUNCTUATIONS_REG.sub(' ', text)
text = FIND_MULTIPLE_SPACES_REG.sub(' ', text)
```
This has no effect, as multiple whitespace characters are already removed here. So it seems like it should be removed.
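A quick check of the whitespace-collapsing step, which supports the "no effect" observation: the substitution is idempotent, so a second pass after common.py has already collapsed spaces changes nothing (regex copied from the diff; the example string is mine):

```python
import re

FIND_MULTIPLE_SPACES_REG = re.compile(r'\s{2,}')

once = FIND_MULTIPLE_SPACES_REG.sub(' ', "du  texte   avec    espaces")
print(once)  # du texte avec espaces

# Running the same substitution again changes nothing, which is why the
# second call is dead code once common.py already collapses spaces.
assert FIND_MULTIPLE_SPACES_REG.sub(' ', once) == once
```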
```python
from corporacreator import preprocessors

@pytest.mark.parametrize('locale, client_id, sentence, expected', [
```
I'd suggest removing tests that no longer make sense in light of digits being banned. For example:

```python
('fr', '*', "donc, ce sera 299 € + 99 €", "donc, ce sera deux cent quatre-vingt-dix-neuf euros plus quatre-vingt-dix-neuf euros"),
```

Some of the tests here, for example:

```python
('fr', '*', "Jean-Paul II.", "Jean-Paul deux."),
```

have nothing to do with digits and can actually be run independent of this comment.
```python
['%', ' pourcent'],
[re.compile(r'(^|\s)\+(\s|\.|,|\?|!|$)'), r'\1 plus \2'],
[re.compile(r'(\d+)\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1 mètre carré\2'],
[re.compile(r'(^|\s)m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1mètre carré\2'],
```
Looks like there are a few leftover digits being searched for here.
```python
[re.compile(r'(^|\s)\+(\s|\.|,|\?|!|$)'), r'\1 plus \2'],
[re.compile(r'(\d+)\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1 mètre carré\2'],
[re.compile(r'(^|\s)m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1mètre carré\2'],
[re.compile(r'/\s?m(?:2|²)(\s|\.|,|\?|!|$)'), r' par mètre carré\1'],
```
Looks like there are a few leftover digits being searched for here.
@kdavis-mozilla you're right, sorry I missed it. Fixed in 149e960.
kdavis-mozilla
left a comment
Assuming my eyes are parsing the regexes correctly, it looks like there are still some regexes that deal with digits. For example:

```python
[re.compile(r'(^|\s)m(?:2|²)(\s|\.|,|\?|!|$)'), r'\1mètre carré\2']
```

Also there is some "dead code":

```python
text = FIND_MULTIPLE_SPACES_REG.sub(' ', text)
```

that's "dead" as a result of code in common.py.