TypeError when string contains certain Unicode Format characters #70

@l30t

Summary

string-width 8.1.0 crashes with TypeError: Expected a code point, got 'undefined' when processing strings that contain certain Unicode Format characters that are not in the Default_Ignorable_Code_Point category.

Environment

  • string-width version: 8.1.0
  • get-east-asian-width version: 1.3.0
  • Node.js version: 22.x / Bun 1.x
  • OS: macOS, Linux, Windows (all affected)

Steps to Reproduce

import stringWidth from 'string-width';

// These Unicode Format characters cause the crash:
// They are in \p{Format} but NOT in \p{Default_Ignorable_Code_Point}

stringWidth('\u0600'); // ARABIC NUMBER SIGN - CRASH!
stringWidth('\u0601'); // ARABIC SIGN SANAH - CRASH!
stringWidth('\u0602'); // ARABIC FOOTNOTE MARKER - CRASH!
stringWidth('\u0603'); // ARABIC SIGN SAFHA - CRASH!
stringWidth('\u0604'); // ARABIC SIGN SAMVAT - CRASH!
stringWidth('\u0605'); // ARABIC NUMBER MARK ABOVE - CRASH!
stringWidth('\u06DD'); // ARABIC END OF AYAH - CRASH!
stringWidth('\u070F'); // SYRIAC ABBREVIATION MARK - CRASH!
stringWidth('\u0890'); // ARABIC POUND MARK ABOVE - CRASH!
stringWidth('\u0891'); // ARABIC PIASTRE MARK ABOVE - CRASH!
stringWidth('\u08E2'); // ARABIC DISPUTED END OF AYAH - CRASH!
stringWidth('\u{110BD}'); // KAITHI NUMBER SIGN - CRASH!
stringWidth('\u{110CD}'); // KAITHI NUMBER SIGN ABOVE - CRASH!

// Real-world example - Arabic text often contains these:
const arabicText = '؀١٢٣'; // Arabic number sign followed by digits
console.log(stringWidth(arabicText)); // CRASH!
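
One detail worth checking when reproducing the Kaithi cases: code points above U+FFFF need the braced \u{...} escape, because the four-digit \uXXXX form stops after four hex digits. The following is a small sketch (the variable names are illustrative only) showing the difference:

```javascript
// '\u110BD' is parsed as U+110B followed by the letter 'D'.
const fourDigitForm = '\u110BD';
// '\u{110BD}' is the single supplementary-plane code point U+110BD.
const bracedForm = '\u{110BD}';

// Spread iterates by code point, so each entry is one code point in hex.
const toHex = string => [...string].map(character => character.codePointAt(0).toString(16));

console.log(toHex(fourDigitForm)); // [ '110b', '44' ]
console.log(toHex(bracedForm)); // [ '110bd' ]
```

String.fromCodePoint(0x110BD) is an equivalent way to build the character without relying on escape syntax.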

Expected Behavior

The function should return 0 for zero-width Format characters, or handle them gracefully without throwing.

Actual Behavior

TypeError: Expected a code point, got `undefined`.
    at validate (node_modules/get-east-asian-width/index.js:5:13)
    at eastAsianWidth (node_modules/get-east-asian-width/index.js:16:2)
    at node_modules/string-width/index.js:82:12

Root Cause Analysis

The bug is a mismatch between zeroWidthClusterRegex and leadingNonPrintingRegex in string-width/index.js.

The Problem

  1. zeroWidthClusterRegex uses \p{Default_Ignorable_Code_Point}:

    const zeroWidthClusterRegex = /^(?:\p{Default_Ignorable_Code_Point}|\p{Control}|\p{Mark}|\p{Surrogate})+$/v;
  2. leadingNonPrintingRegex uses \p{Format}:

    const leadingNonPrintingRegex = /^[\p{Default_Ignorable_Code_Point}\p{Control}\p{Format}\p{Mark}\p{Surrogate}]+/v;
  3. The gap: Some Unicode \p{Format} characters are NOT in \p{Default_Ignorable_Code_Point}. These include:

    • U+0600-U+0605 (Arabic number signs)
    • U+06DD (Arabic end of ayah)
    • U+070F (Syriac abbreviation mark)
    • U+0890-U+0891 (Arabic pound/piastre marks)
    • U+08E2 (Arabic disputed end of ayah)
    • U+110BD, U+110CD (Kaithi number signs)
    • And others...
  4. The fatal sequence (lines 66-73 in index.js):

    for (const {segment} of segmenter.segment(string)) {
        // Zero-width / non-printing clusters
        if (isZeroWidthCluster(segment)) {
            continue;  // ❌ Does NOT catch \u0600 (not Default_Ignorable)
        }
        // ...
        const codePoint = baseVisible(segment).codePointAt(0);
        //                 ↑ Strips \u0600 via \p{Format}, returns ''
        //                                      ↑ Returns undefined for ''
        width += eastAsianWidth(codePoint);  // 💥 CRASH!
    }
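
The mismatch can be observed without string-width installed. The sketch below copies the two regexes quoted above, but uses the `u` flag instead of `v` so that it also runs on older runtimes (an assumption: the two flags should behave the same for these patterns). traceSegment is a hypothetical helper, not part of string-width.

```javascript
// Regexes as quoted in this issue, with `u` swapped in for `v`.
const zeroWidthClusterRegex = /^(?:\p{Default_Ignorable_Code_Point}|\p{Control}|\p{Mark}|\p{Surrogate})+$/u;
const leadingNonPrintingRegex = /^[\p{Default_Ignorable_Code_Point}\p{Control}\p{Format}\p{Mark}\p{Surrogate}]+/u;

// Hypothetical helper: reports what each step of the loop sees for one segment.
function traceSegment(segment) {
    const skipped = zeroWidthClusterRegex.test(segment); // would the loop `continue`?
    const base = segment.replace(leadingNonPrintingRegex, ''); // what baseVisible() would return
    return {skipped, base, codePoint: base.codePointAt(0)};
}

console.log(traceSegment('\u0600')); // Format, not default-ignorable
console.log(traceSegment('\u200E')); // LEFT-TO-RIGHT MARK, default-ignorable
console.log(traceSegment('a')); // ordinary visible character
```

For U+0600 the segment is not skipped, the stripped base is the empty string, and codePoint is undefined, which is the value that reaches eastAsianWidth(). U+200E is skipped by the first check and never reaches that point.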

Why these specific characters?

The Unicode Standard defines Default_Ignorable_Code_Point as characters that should be ignored in rendering if not supported. Most Format characters are default-ignorable, but a few are deliberately excluded. The Arabic number signs and the other characters listed above are excluded because they are meant to be displayed: they are visible marks that attach to the digits or text that follows them (e.g., in Quranic text formatting).

The leadingNonPrintingRegex includes \p{Format} and therefore strips these characters, but zeroWidthClusterRegex does not include \p{Format}. A segment consisting only of these Format characters is not skipped by the zero-width check, yet baseVisible() strips it to an empty string, and ''.codePointAt(0) returns undefined.

Suggested Fix

There are three possible fixes:

Option 1: Guard against empty string after baseVisible() (Minimal fix)

// In string-width/index.js, after baseVisible() call:
const base = baseVisible(segment);
if (base.length === 0) {
    continue; // Skip segments that become empty after stripping ignorables
}
const codePoint = base.codePointAt(0);
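
To illustrate the effect of the guard in isolation, here is a minimal sketch of the loop. It is not the string-width implementation: eastAsianWidthStub is a hypothetical stand-in that rejects undefined the way the stack trace shows and otherwise treats every code point as width 1, the emoji handling is omitted, and the regexes use the `u` flag instead of `v`.

```javascript
const segmenter = new Intl.Segmenter();
const zeroWidthClusterRegex = /^(?:\p{Default_Ignorable_Code_Point}|\p{Control}|\p{Mark}|\p{Surrogate})+$/u;
const leadingNonPrintingRegex = /^[\p{Default_Ignorable_Code_Point}\p{Control}\p{Format}\p{Mark}\p{Surrogate}]+/u;

// Hypothetical stand-in for eastAsianWidth(): throws on non-integers, else width 1.
function eastAsianWidthStub(codePoint) {
    if (!Number.isSafeInteger(codePoint)) {
        throw new TypeError(`Expected a code point, got \`${typeof codePoint}\`.`);
    }
    return 1;
}

// Simplified loop; `guard` toggles the Option 1 check.
function measure(string, {guard}) {
    let width = 0;
    for (const {segment} of segmenter.segment(string)) {
        if (zeroWidthClusterRegex.test(segment)) {
            continue;
        }
        const base = segment.replace(leadingNonPrintingRegex, '');
        if (guard && base.length === 0) {
            continue; // Option 1: nothing visible is left to measure
        }
        width += eastAsianWidthStub(base.codePointAt(0));
    }
    return width;
}

console.log(measure('\u0600', {guard: true})); // 0
```

Calling measure('\u0600', {guard: false}) throws the same TypeError as in the report, while the guarded call counts the segment as zero width.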

Option 2: Expand isZeroWidthCluster() to catch all Format characters (More thorough)

// Add explicit check for Format-only segments before baseVisible():
const isFormatOnlyCluster = segment => /^[\p{Cf}]+$/u.test(segment);

// Then in the loop:
if (isZeroWidthCluster(segment) || isFormatOnlyCluster(segment)) {
    continue;
}

Option 3: Fix in get-east-asian-width (Defense in depth)

The validate() function in get-east-asian-width could handle undefined gracefully:

function validate(codePoint) {
    if (codePoint === undefined || codePoint === null) {
        return; // Allow undefined, caller will handle
    }
    if (!Number.isSafeInteger(codePoint)) {
        throw new TypeError(`Expected a code point, got \`${typeof codePoint}\`.`);
    }
}

export function eastAsianWidth(codePoint, {ambiguousAsWide = false} = {}) {
    validate(codePoint);
    if (codePoint === undefined || codePoint === null) {
        return 1; // Default to narrow width for invalid input
    }
    // ... rest of function
}

Recommended Fix

Option 1 is the cleanest fix for string-width - it's minimal, targeted, and handles the root cause (empty string after stripping ignorables).

Workaround

Until this is fixed upstream, users can sanitize input before calling string-width:

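// Remove the Format characters that are not Default_Ignorable_Code_Point before measuring
// (the `u` flag is required for the \u{...} escapes)
const PROBLEMATIC_FORMAT_CHARS = /[\u0600-\u0605\u06DD\u070F\u0890\u0891\u08E2\u{110BD}\u{110CD}]/gu;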

function safeStringWidth(str) {
    return stringWidth(str.replace(PROBLEMATIC_FORMAT_CHARS, ''));
}
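
As an alternative to a hard-coded character list, the same sanitizing step can be derived from the Unicode properties themselves. This is only a sketch: stripNonIgnorableFormat is a hypothetical helper name, and the set it removes depends on the Unicode version built into the runtime.

```javascript
// Matches any \p{Format} character that is not Default_Ignorable_Code_Point.
const nonIgnorableFormat = /(?!\p{Default_Ignorable_Code_Point})\p{Format}/gu;

// Hypothetical helper: remove those characters before the string is measured.
function stripNonIgnorableFormat(string) {
    return string.replace(nonIgnorableFormat, '');
}

console.log(JSON.stringify(stripNonIgnorableFormat('x\u06DDy'))); // "xy"
```

Default-ignorable characters such as U+200E are left in place, since string-width already treats them as zero width. Note that this removes marks that are normally displayed, so the measured width may differ from what a terminal renders.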

Impact

This bug affects any application using string-width 8.x that processes:

  • Text copied from web pages (may carry invisible formatting characters)
  • Text from PDFs (frequently include formatting characters)
  • Internationalized text (Arabic, Syriac, and Kaithi text can contain the Format characters listed in this issue)
  • User-generated content (may contain any Unicode characters)

Popular affected packages include:

  • ink (React for CLI) - crashes during terminal rendering
  • cli-table3, boxen, ora - any CLI tool measuring string widths
  • prettier, eslint - when processing files with these characters

Test Cases

import stringWidth from 'string-width';
import { describe, it, expect } from 'your-test-framework';

describe('Format characters not in Default_Ignorable_Code_Point', () => {
    // These are the characters that ACTUALLY crash in 8.1.0
    it('should not crash on ARABIC NUMBER SIGN (U+0600)', () => {
        expect(() => stringWidth('\u0600')).not.toThrow();
    });

    it('should not crash on ARABIC SIGN SANAH (U+0601)', () => {
        expect(() => stringWidth('\u0601')).not.toThrow();
    });

    it('should not crash on ARABIC FOOTNOTE MARKER (U+0602)', () => {
        expect(() => stringWidth('\u0602')).not.toThrow();
    });

    it('should not crash on ARABIC SIGN SAFHA (U+0603)', () => {
        expect(() => stringWidth('\u0603')).not.toThrow();
    });

    it('should not crash on ARABIC SIGN SAMVAT (U+0604)', () => {
        expect(() => stringWidth('\u0604')).not.toThrow();
    });

    it('should not crash on ARABIC NUMBER MARK ABOVE (U+0605)', () => {
        expect(() => stringWidth('\u0605')).not.toThrow();
    });

    it('should not crash on ARABIC END OF AYAH (U+06DD)', () => {
        expect(() => stringWidth('\u06DD')).not.toThrow();
    });

    it('should not crash on SYRIAC ABBREVIATION MARK (U+070F)', () => {
        expect(() => stringWidth('\u070F')).not.toThrow();
    });

    it('should not crash on ARABIC POUND MARK ABOVE (U+0890)', () => {
        expect(() => stringWidth('\u0890')).not.toThrow();
    });

    it('should not crash on ARABIC PIASTRE MARK ABOVE (U+0891)', () => {
        expect(() => stringWidth('\u0891')).not.toThrow();
    });

    it('should not crash on ARABIC DISPUTED END OF AYAH (U+08E2)', () => {
        expect(() => stringWidth('\u08E2')).not.toThrow();
    });

    it('should handle Arabic text with number signs', () => {
        // U+0600 followed by Arabic-Indic digits
        expect(() => stringWidth('\u0600\u0661\u0662\u0663')).not.toThrow();
    });
});

Complete List of Affected Characters

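The \p{Format} characters outside \p{Default_Ignorable_Code_Point} that this report covers are listed below. The Unicode derivation of Default_Ignorable_Code_Point also excludes the interlinear annotation characters (U+FFF9-U+FFFB) and the Egyptian hieroglyph format controls (starting at U+13430); those have not been tested here: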

Code Point | Name | Script
U+0600 | ARABIC NUMBER SIGN | Arabic
U+0601 | ARABIC SIGN SANAH | Arabic
U+0602 | ARABIC FOOTNOTE MARKER | Arabic
U+0603 | ARABIC SIGN SAFHA | Arabic
U+0604 | ARABIC SIGN SAMVAT | Arabic
U+0605 | ARABIC NUMBER MARK ABOVE | Arabic
U+06DD | ARABIC END OF AYAH | Arabic
U+070F | SYRIAC ABBREVIATION MARK | Syriac
U+0890 | ARABIC POUND MARK ABOVE | Arabic
U+0891 | ARABIC PIASTRE MARK ABOVE | Arabic
U+08E2 | ARABIC DISPUTED END OF AYAH | Arabic
U+110BD | KAITHI NUMBER SIGN | Kaithi
U+110CD | KAITHI NUMBER SIGN ABOVE | Kaithi
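
The list above can be cross-checked against the Unicode data built into the runtime. The sketch below scans every code point and collects those that match \p{Format} but not \p{Default_Ignorable_Code_Point}; formatButNotDefaultIgnorable is a hypothetical helper, and the result depends on the Unicode version the engine ships with, so no expected output is given.

```javascript
const isFormat = /^\p{Format}$/u;
const isDefaultIgnorable = /^\p{Default_Ignorable_Code_Point}$/u;

// Hypothetical helper: code points in General_Category=Format that are
// not Default_Ignorable_Code_Point, according to this runtime.
function formatButNotDefaultIgnorable() {
    const result = [];
    for (let codePoint = 0; codePoint <= 0x10FFFF; codePoint++) {
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
            continue; // surrogates are never Format characters
        }
        const character = String.fromCodePoint(codePoint);
        if (isFormat.test(character) && !isDefaultIgnorable.test(character)) {
            result.push(codePoint);
        }
    }
    return result;
}

const gap = formatButNotDefaultIgnorable();
for (const codePoint of gap) {
    console.log('U+' + codePoint.toString(16).toUpperCase().padStart(4, '0'));
}
```

Any code point printed here that is missing from the table is a candidate for an additional test case.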
