Expand Unicode Support #85

@Thomas-S-Allen

Description

Some Unicode entities are unsupported in adsrefpipe/refparsers/unicode.py. One example is the entity for parrot (&#x1f99c, rendered as 🦜 in the error trace below). Support needs to be extended to other characters, error handling should be improved, or both.

Traceback (most recent call last):
  File "/app/adsrefpipe/refparsers/unicode.py", line 222, in __sub_hexnumasc_entity
    if self.unicode[entno]:
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 323, in <module>
    process_files(source_filenames)
  File "run.py", line 107, in process_files
    parsed_references = toREFs.process_and_dispatch()
  File "/app/adsrefpipe/refparsers/arXivTXT.py", line 80, in process_and_dispatch
    reference = self.cleanup(raw_reference)
  File "/app/adsrefpipe/refparsers/arXivTXT.py", line 61, in cleanup
    reference = unicode_handler.ent2asc(reference)
  File "/app/adsrefpipe/refparsers/unicode.py", line 171, in ent2asc
    result = self.re_hexnumentity.sub(self.__sub_hexnumasc_entity, result)
  File "/app/adsrefpipe/refparsers/unicode.py", line 227, in __sub_hexnumasc_entity
    raise UnicodeHandlerError('Unknown hexadecimal entity: %s' % match.group(0))
adsrefpipe.refparsers.unicode.UnicodeHandlerError: Unknown hexadecimal entity: 🦜
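
The trace suggests the lookup `self.unicode[entno]` fails with an IndexError when the code point is past the end of the table, and the handler then raises. One possible direction for the "better error handling" half is sketched below. It is a standalone illustration, not the adsrefpipe code: `RE_HEXNUMENTITY`, `hex_entity_to_ascii`, `ent2asc_sketch`, the `table` dict and the `placeholder` argument are all hypothetical names, and the real `re_hexnumentity` pattern may differ. The idea is to try an explicit mapping first, then an NFKD decomposition from the standard library, and finally fall back to a placeholder instead of raising.

```python
import re
import unicodedata

# Hypothetical pattern for hexadecimal entities such as "&#x1f99c;".
# The actual re_hexnumentity in unicode.py may be defined differently.
RE_HEXNUMENTITY = re.compile(r"&#x([0-9a-fA-F]+);")


def hex_entity_to_ascii(match, table=None, placeholder=""):
    """Convert one hexadecimal entity to ASCII without raising on unknown code points."""
    table = table or {}
    codepoint = int(match.group(1), 16)

    # 1. An explicit mapping wins (assumed here to be a dict keyed by code point).
    if codepoint in table:
        return table[codepoint]

    # 2. Values outside the Unicode range, or lone surrogates, have no character.
    if codepoint > 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        return placeholder

    # 3. Decompose the character and keep only the ASCII part (e.g. "ö" becomes "o").
    decomposed = unicodedata.normalize("NFKD", chr(codepoint))
    ascii_form = decomposed.encode("ascii", "ignore").decode("ascii")

    # 4. Nothing ASCII survives for symbols such as emoji, so use the placeholder.
    return ascii_form or placeholder


def ent2asc_sketch(text, table=None, placeholder=""):
    """Apply the fallback conversion to every hexadecimal entity in text."""
    return RE_HEXNUMENTITY.sub(
        lambda m: hex_entity_to_ascii(m, table, placeholder), text
    )
```

With this sketch, `ent2asc_sketch("Erd&#xf6;s")` gives `"Erdos"`, and an input containing `&#x1f99c;` has the entity dropped (or replaced by whatever `placeholder` is passed) rather than aborting the whole file. Whether dropping, substituting, or logging is the right behavior for reference parsing is a design decision for the maintainers.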
