Added pseudo alignment strategy based on phoneme duration #116

popcornell · 2025-06-24T11:38:24Z

Hi guys,

Greetings from Brno.
I am trying to add phoneme-based duration as another pseudo-alignment strategy for word durations.
This could enable tcpWER for languages such as Japanese, where a single character (e.g., a kanji) can have a much longer duration than others.
Also adding Alexander Polok here, as he is responsible for https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard
@Lakoc
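
For readers new to the idea, here is a minimal sketch of a proportional split, assuming the per-word unit counts (e.g., phoneme counts) are already known. The function name `proportional_intervals` and the `unit_counts` parameter are hypothetical and are not taken from the PR.

```python
def proportional_intervals(interval, words, unit_counts):
    """Split `interval` into one sub-interval per word.

    The width of each sub-interval is proportional to the matching entry
    in `unit_counts` (for example, the number of phonemes in the word).
    """
    if len(words) != len(unit_counts):
        raise ValueError('words and unit_counts must have the same length')
    if not words:
        return []

    start, end = interval
    total = sum(unit_counts)
    if total == 0:
        # No usable counts: fall back to equal weights (assumption).
        unit_counts = [1] * len(words)
        total = len(words)

    duration = end - start
    boundaries = [start]
    accumulated = 0
    for count in unit_counts:
        accumulated += count
        boundaries.append(start + duration * accumulated / total)

    # Consecutive boundaries form the per-word intervals.
    return list(zip(boundaries[:-1], boundaries[1:]))
```

With a segment from 0 to 10 seconds and two words with 1 and 4 phonemes, the first word would receive the first fifth of the segment and the second word the rest.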

boeddeker · 2025-06-24T12:26:37Z

meeteval/wer/wer/time_constrained.py

 import typing
 from dataclasses import dataclass, replace

+import transphone


Can you use a lazy import? We want to keep the mandatory dependencies for the core code small.
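
A minimal sketch of what a lazy import could look like, assuming a small helper is acceptable. The helper name `lazy_import` and the error message are hypothetical; the same effect can be achieved by placing the `import` statement inside the function that needs the optional package.

```python
import importlib


def lazy_import(module_name, purpose):
    """Import `module_name` only when it is actually needed.

    Calling this inside the function that uses the optional dependency
    keeps the package importable when that dependency is not installed.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f'The package {module_name!r} is required for {purpose}. '
            f'Please install it to use this feature.'
        ) from e


# Intended use inside the strategy function (not executed here):
#     transphone = lazy_import('transphone', 'phoneme-based pseudo-alignment')
```

This way, users who never select the phoneme-based strategy do not need the extra dependency.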

thequilo · 2025-06-24T12:27:20Z

Hey @popcornell!

This is a good extension for evaluating such languages as Japanese.

I'm unsure about your choice for the interface. I'm not that happy to add a language argument to that many functions just for this one pseudo word level timestamp strategy. This would have to be added to the whole interface including api and the CLI. I prefer something like strategy='phoneme_based_jpn' or strategy=phoneme_based('jpn'), so that the interface doesn't change.

boeddeker · 2025-06-24T12:31:45Z

meeteval/wer/wer/time_constrained.py

+    """Divides the interval into one interval per word where the size of the interval is
+    proportional to the number of phonemes in the word."""
+
+    g2p = transphone.read_tokenizer(language)


The name transphone.read_tokenizer sounds like it loads a model. Does it cache the result?
If not, we should add caching, since this function is called for every segment.
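
One possible way to add such caching is `functools.lru_cache` around the loader, sketched below. `_load_tokenizer` is a hypothetical stand-in for the expensive call (in the PR that would be `transphone.read_tokenizer(language)`), and `load_calls` exists only to make the caching visible.

```python
import functools

load_calls = []  # records loader invocations, for illustration only


def _load_tokenizer(language):
    # Hypothetical stand-in for an expensive model load.
    load_calls.append(language)
    return lambda word: list(word)


@functools.lru_cache(maxsize=None)
def get_tokenizer(language):
    """Return the tokenizer for `language`, loading it at most once."""
    return _load_tokenizer(language)
```

Each language is then loaded on first use only, and every later segment reuses the same object.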

boeddeker · 2025-06-24T12:36:17Z

meeteval/wer/preprocess.py

                (s['start_time'], s['end_time']),
-                words
-            )
+                words, language)


Can you change this to words, language=language)?

boeddeker · 2025-06-24T12:37:47Z

meeteval/wer/wer/time_constrained.py


 # pseudo-timestamp strategies
-def equidistant_intervals(interval, words):
+def equidistant_intervals(interval, words, *args):


I would prefer (interval, words, language) as signature, or at least (interval, words, **kwargs).

boeddeker · 2025-06-24T12:46:21Z

Hi, thanks for the PR.
Having phone-based splitting sounds great.

Is transphone some kind of standard, or is at least its output some kind of standard?

I am wondering whether transphone_phoneme_based would be a better name for that option. It is a bit lengthy, but it tells the user what is used and doesn't block the introduction of alternative phone-based splitters.

boeddeker · 2025-06-24T13:12:24Z

> I'm unsure about your choice for the interface. I'm not that happy to add a language argument to that many functions just for this one pseudo word level timestamp strategy. This would have to be added to the whole interface including api and the CLI. I prefer something like strategy='phoneme_based_jpn' or strategy=phoneme_based('jpn'), so that the interface doesn't change.

I see arguments for both approaches.
The language argument makes the code a bit simpler, e.g., a simple dict lookup to get the subsegment function, better CLI help text, and CLI value checking (not sure if we have checks for this implemented).
On the other hand, encoding the language into the strategy makes it more obvious that only the strategy uses the language.

Since we now have an expert in this chat, maybe one other question first:
Samuele, do you know whether splitting transcripts at whitespace is the correct implementation for languages like Japanese?
For Japanese, the character error rate is often reported, but I don't know whether the tools are usually language-aware or whether people prepare the transcript so that WER calculators can be used. Depending on your answer, the language argument could be used in multiple places.

thequilo · 2025-06-24T13:23:42Z

I agree: if the language argument is useful in other places, like word splitting or normalization, it may be worth adding it to the interface.

popcornell · 2025-06-24T15:03:49Z

> For Japanese often the character error rate is reported, but I don't know if usually the tools are language aware or people prepare the transcript to use WER calculators. Depending on your answer, the language argument could be used at multiple positions.

Yeah, actually I was unsure about that too. I was assuming that one would split the reference before feeding it to meeteval, but maybe the best way to handle this is to make it dependent on the language or to have an additional argument.
I am not sure we should handle it based on the language, though, because the reference and/or the system output may or may not contain whitespace.

Are you guys OK with another argument? Something like has_whitespaces: Optional[bool] = True

thequilo · 2025-06-25T13:54:11Z

@popcornell Do you have examples of the output of a Japanese ASR system? The guys from NTT said that CER is usually used instead of WER; it completely ignores whitespace and splits the text into individual characters. In that case, we may want to add a time-constrained CER.

boeddeker · 2025-06-27T14:15:44Z

For documentation:

Until now, we have no clear answer on the best way to support Japanese and Chinese (e.g., what typical system outputs look like).
- Supporting CER is probably the best option (e.g., removing whitespace and converting the string into a list of characters instead of the split call). The current python api already supports this implicitly, as the user can do the split manually.
We discussed CER, and as of now, we tend toward introducing unit='word' and unit='char' in the python api signature, and adding meeteval.cer as a CLI entry point.
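
A minimal sketch of what the word/character distinction could look like at the tokenization step. The helper name `split_units` is hypothetical; only the `unit='word'` and `unit='char'` values come from the discussion above.

```python
def split_units(transcript, unit='word'):
    """Split a transcript into the units that the error rate is computed on.

    unit='word': split at whitespace (WER).
    unit='char': drop all whitespace and use single characters (CER).
    """
    if unit == 'word':
        return transcript.split()
    if unit == 'char':
        return [c for c in transcript if not c.isspace()]
    raise ValueError(f"unit must be 'word' or 'char', got {unit!r}")
```

Feeding the character list to a WER function in place of the word list would then yield a CER, which is what a user can already do manually with the Python API.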

thequilo · 2025-07-16T12:51:23Z

We discussed the following:

The language should be encoded in the strategy name.
We found no use case for a language argument other than this alignment strategy. The distinction between CER and WER should be explicit and independent of the language. So, we want the language to be encoded in the alignment strategy key, like transphone_phoneme_based_jpn. For this, pseudo_word_level_strategies should become a class so that it can split off the language and pass it on to the transphone library, similar to the normalizer in https://github.com/fgnt/meeteval/blob/main/meeteval/wer/normalizer.py.

Since there are potentially multiple libraries for obtaining phoneme durations, the package name should also be encoded in the strategy name.
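
A rough sketch of how such a class could resolve keys like transphone_phoneme_based_jpn. The class name, its methods, and the placeholder factory are all hypothetical and only illustrate the prefix-plus-language lookup; this is not the existing meeteval code.

```python
class PseudoWordLevelStrategies:
    """Hypothetical registry that maps strategy keys to callables."""

    def __init__(self):
        self._fixed = {}     # exact key -> strategy function
        self._prefixed = {}  # key prefix -> factory(language) -> strategy function

    def register(self, key, fn):
        self._fixed[key] = fn

    def register_prefix(self, prefix, factory):
        self._prefixed[prefix] = factory

    def __getitem__(self, key):
        if key in self._fixed:
            return self._fixed[key]
        for prefix, factory in self._prefixed.items():
            # 'transphone_phoneme_based_jpn' -> head '', language 'jpn'
            head, sep, language = key.partition(prefix + '_')
            if head == '' and sep and language:
                return factory(language)
        raise KeyError(f'Unknown pseudo-word-level strategy: {key!r}')


def make_placeholder_strategy(language):
    # Placeholder factory: a real one would build a phoneme-based
    # splitter for `language` instead of repeating the full interval.
    def strategy(interval, words):
        return [interval] * len(words)
    strategy.language = language
    return strategy


strategies = PseudoWordLevelStrategies()
strategies.register_prefix('transphone_phoneme_based', make_placeholder_strategy)
```

Because the lookup still happens through a single key, the function signatures, the API, and the CLI would not need an extra language argument.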

CER in a different PR
We'll do the split into characters in a different pull request. In that PR, we'll for now supply a short script that splits segments into words as a preprocessing step, so that the WER functions yield a CER.

@popcornell Are you willing to adjust the PR with the required changes or should we do it?

popcornell · 2025-07-16T14:33:25Z

Hey guys, yeah, I plan to adjust it. But I am currently busy at JSALT; I thought I would have more time. I can do it this weekend, though.

thequilo · 2025-09-18T12:16:14Z

@popcornell ping! Do you still plan to work on this? If not, I'd have some time.

popcornell · 2025-09-21T13:16:34Z

Hey Thilo, currently still busy for ICLR...

popcornell added 2 commits June 24, 2025 13:29

added phoneme counting duration

2ca75bf

added tests

a72cfe3
