standardize_tag does not properly identify macrolanguages #67

@rimusa

Hi! I am trying to use this package to normalize language codes in a project. For our use case, we don't differentiate between macrolanguages and their individual member languages, because the data was compiled without that distinction. In some cases, however, the individual language was listed, so we need to identify its macrolanguage.

According to the documentation, using standardize_tag(code,macro=True) should do this. However, we have noticed that it does not always work. Some examples follow:

Standardizing Mandarin gives the macro code for Chinese: standardize_tag("cmn",macro=True) gives 'zh' as an output.

Standardizing Northern Ping Chinese, Hainanese, or Pu-Xian Chinese does not give the macro code for Chinese; it returns the code of the individual language instead. As an example, standardize_tag("cpx",macro=True) should give 'zh' as an output but gives 'cpx' instead.

The same thing happens with variations of Arabic, where it fails to identify even the most common dialects, such as Levantine Arabic, Moroccan Arabic, or Egyptian Arabic.

I am currently using version 3.3.0 of langcodes.
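In case it helps with reproducing this, here is a minimal sketch of the check I have in mind. The helper names (`CASES`, `find_mismatches`, `run_against_langcodes`) are my own and not part of langcodes. The `cmn` and `cpx` cases come from the examples above; the `arz` and `ary` entries assume the ISO 639-3 codes for Egyptian and Moroccan Arabic, which I did not spell out above.

```python
from typing import Callable, Dict, List, Tuple

# Input tag -> macrolanguage tag we would expect back with macro=True.
CASES: Dict[str, str] = {
    "cmn": "zh",  # Mandarin: reported to work
    "cpx": "zh",  # Pu-Xian Chinese: reported to come back unchanged
    "arz": "ar",  # Egyptian Arabic (assumed ISO 639-3 code)
    "ary": "ar",  # Moroccan Arabic (assumed ISO 639-3 code)
}


def find_mismatches(
    standardize: Callable[[str], str], cases: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (tag, expected, actual) for each case whose output differs."""
    mismatches = []
    for tag, expected in cases.items():
        actual = standardize(tag)
        if actual != expected:
            mismatches.append((tag, expected, actual))
    return mismatches


def run_against_langcodes() -> None:
    """Print every case where standardize_tag(tag, macro=True) differs."""
    # Imported here so find_mismatches can be used without langcodes installed.
    from langcodes import standardize_tag

    results = find_mismatches(lambda t: standardize_tag(t, macro=True), CASES)
    for tag, expected, actual in results:
        print(f"{tag}: expected {expected!r}, got {actual!r}")
```

Calling `run_against_langcodes()` in an environment with langcodes installed prints one line per tag that is not mapped to its macrolanguage; what it prints will depend on the installed version and its data.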
