Non-ascii letters not recognised #24

@premasagar

For this simple query in Spanish, with empty stopwords (or with stopwords; it doesn't matter):

rake.generate("Cuantos años tienes?", {stopwords: []})

I get the error:

TypeError: Cannot read property 'forEach' of null
    at phraseList.forEach
    at Array.forEach
    at Rake.calculatePhraseScores

If I omit the stopwords, then there is no error, but the word "años" is incorrectly split up:

rake.generate("Cuantos años tienes?")

=> [ 'ños tienes', 'Cuantos' ]

I think the code is treating the ñ as a word-break character. That would explain the word being split in the second example. It would also mean the single character ñ ends up being used as a whole phrase in the function calculatePhraseScores, which leads to the error in the first example. The wordList regex seems to accept only 0-9a-z as word characters, which excludes accented and other non-ASCII letters.
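
To illustrate the suspected cause, here is a minimal sketch comparing an ASCII-only word pattern with a Unicode-aware one. The helpers `splitWords` and `splitWordsAscii` are hypothetical names made up for this example and are not taken from the library's source; the ASCII pattern is only an assumption about what the wordList regex might look like. The sketch relies on Unicode property escapes, which need an ES2018-capable engine.

```javascript
// Hypothetical helpers for illustration only; not the library's actual code.

// Unicode-aware split: \p{L} matches any letter, \p{M} any combining mark,
// \p{N} any number. The `u` flag is required for property escapes.
function splitWords(text) {
  // String.prototype.match returns null when nothing matches,
  // so fall back to an empty array before anyone calls forEach on it.
  return text.match(/[\p{L}\p{M}\p{N}]+/gu) || [];
}

// ASCII-only split, assumed here to resemble the pattern the issue suspects.
function splitWordsAscii(text) {
  return text.match(/[0-9a-z]+/gi) || [];
}

// "\u00F1" is the precomposed letter ñ, so this is "Cuantos años tienes?".
const query = "Cuantos a\u00F1os tienes?";

console.log(splitWords(query));      // [ 'Cuantos', 'años', 'tienes' ]
console.log(splitWordsAscii(query)); // [ 'Cuantos', 'a', 'os', 'tienes' ]
```

The `|| []` fallback matters independently of the regex: an unguarded `match` call that finds nothing returns null, which is one plausible way to arrive at "Cannot read property 'forEach' of null". Whether that is what happens inside calculatePhraseScores is a guess on my part.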
