This library enables the identification, parsing, and generation of hashtags within text. It defines a syntax that supports character escaping, delimiters, and Unicode processing.
Two distinct hashtag formats are recognized: unwrapped and wrapped.
The unwrapped format begins with a number sign (#) followed
immediately by properly encoded Unicode text (including emojis and
surrogate pairs), terminating at spaces or specific punctuation
characters: full stop (.), comma (,), semicolon (;), colon (:),
exclamation mark (!), and question mark (?). Punctuation characters
are treated as part of the hashtag only if they are followed by a
character that is neither a space nor another punctuation character.
Therefore, #v1.0 is a valid hashtag producing v1.0.
A reverse solidus (\) allows the inclusion of spaces, punctuation
characters, a literal less-than sign (<), and itself within an
unwrapped hashtag. For example, #this\ is\ example yields
this is example. Since a less-than sign following the number sign
(#<) initiates a wrapped hashtag, a sequence like #<example results
in an error if the closing bracket is missing; to include a literal
less-than sign at the start, it must be escaped (#\<example). An
unwrapped hashtag must contain at least one valid character, so a
number sign followed by a space (# ) or by a punctuation character and
a space (e.g., #. ) is not a valid unwrapped hashtag.
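The unwrapped rules above can be sketched as a small forward scan. This is an illustration only, not the library's scanner: the names PUNCT, isStrong, isRegular and readUnwrapped are hypothetical, a dangling backslash at the end of input is assumed to end the tag, and surrogate validation is left out.

```typescript
// Hypothetical helpers sketching the unwrapped rules; not the library's code.
const PUNCT = new Set([".", ",", ";", ":", "!", "?"]);

// Whitespace and control characters that always end an unwrapped hashtag.
function isStrong(ch: string): boolean {
  const code = ch.charCodeAt(0);
  return code <= 0x20 || (code >= 0x7f && code <= 0x9f);
}

// A character that may appear unescaped and never terminates the tag.
function isRegular(ch: string): boolean {
  return ch !== "" && !isStrong(ch) && !PUNCT.has(ch) && ch !== "#" && ch !== "\\";
}

// Reads an unwrapped hashtag whose "#" sits at hashIndex.
// Returns the unescaped text, or null when no valid unwrapped hashtag starts there.
function readUnwrapped(input: string, hashIndex: number): string | null {
  if (input.charAt(hashIndex) !== "#" || input.charAt(hashIndex + 1) === "<") return null;
  let text = "";
  let i = hashIndex + 1;
  while (i < input.length) {
    const ch = input.charAt(i);
    if (ch === "\\") {
      if (i + 1 >= input.length) break; // dangling backslash ends the tag (assumption)
      text += input.charAt(i + 1);      // escaped character is taken literally
      i += 2;
    } else if (PUNCT.has(ch)) {
      // Punctuation belongs to the tag only when a regular character follows it.
      if (!isRegular(input.charAt(i + 1))) break;
      text += ch;
      i += 1;
    } else if (isRegular(ch)) {
      text += ch;
      i += 1;
    } else {
      break; // whitespace, control character, or unescaped "#"
    }
  }
  return text.length > 0 ? text : null;
}
```

With this sketch, readUnwrapped("#v1.0 released", 0) keeps the inner full stop because a digit follows it, while the trailing full stop in "#end." is dropped.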
The wrapped format encloses hashtag text between a less-than sign (<)
and a greater-than sign (>). This format allows spaces and special
characters within the hashtag. The less-than sign is valid without
escaping inside the brackets, but the greater-than sign and the reverse
solidus must be preceded by a reverse solidus to be interpreted as
text. Therefore, #<<example> is valid and equivalent to
#<\<example>, both producing the text <example. A wrapped hashtag
must likewise contain at least one valid character, so #<> is not a
valid hashtag.
Wrapped hashtags may span multiple lines. Line breaks (\n, \r, or
\r\n) in wrapped hashtags are normalized to a single space character,
and any horizontal whitespace immediately following the line break is
ignored.
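A minimal sketch of the wrapped rules, including line-break normalization, might look as follows. The function name readWrapped is hypothetical, "horizontal whitespace" is assumed to mean spaces and tabs, and an escaped character is assumed to be copied literally; the library may differ on these points.

```typescript
// Hypothetical helper sketching the wrapped rules; not the library's code.
// Reads a wrapped hashtag whose "#<" starts at hashIndex.
// Returns the unescaped, normalized text, or null if the tag is empty or unclosed.
function readWrapped(input: string, hashIndex: number): string | null {
  if (!input.startsWith("#<", hashIndex)) return null;
  let text = "";
  let i = hashIndex + 2;
  while (i < input.length) {
    const ch = input.charAt(i);
    if (ch === ">") {
      return text.length > 0 ? text : null; // "#<>" is rejected
    }
    if (ch === "\\") {
      if (i + 1 >= input.length) return null; // nothing left to escape
      text += input.charAt(i + 1);            // escaped ">" or "\" becomes text
      i += 2;
    } else if (ch === "\n" || ch === "\r") {
      // Treat "\r\n" as one line break, then emit a single space.
      if (ch === "\r" && input.charAt(i + 1) === "\n") i += 1;
      i += 1;
      text += " ";
      // Skip horizontal whitespace after the break (space and tab assumed).
      while (input.charAt(i) === " " || input.charAt(i) === "\t") i += 1;
    } else {
      text += ch; // includes an unescaped "<"
      i += 1;
    }
  }
  return null; // closing ">" is missing
}
```

Under these assumptions, a tag broken across lines such as "#<multi\r\n   line>" yields the text "multi line".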
unwrapped-hashtag = unescaped-hash unwrapped-text
; hash must NOT be followed by an unescaped "<"
unwrapped-text = 1*unwrapped-char
unwrapped-char = escape-pair / punct-continuation / regular-char
escape-pair = BACKSLASH ANY
; allows spaces, punctuation, HASH, BACKSLASH, "<"
; always continues, never terminates
punct-continuation = PUNCT non-terminator
; punctuation followed by continuing character
regular-char = %x22 / %x24-2B / %x2D / %x2F-39 / %x3C-3E
/ %x40-5B / %x5D-7E / %xA0-10FFFF
; excludes: STRONG, HASH, PUNCT, BACKSLASH
; Note: < and > are valid characters in unwrapped form
non-terminator = regular-char
; any character that doesn't terminate

wrapped-hashtag = unescaped-hash "<" wrapped-text ">"
wrapped-text = 1*wrapped-char
wrapped-char = escape-pair / regular-char
escape-pair = BACKSLASH ANY
; specifically allows escaping ">" and BACKSLASH
; < does not need to be escaped
regular-char = %x00-3D / %x3F-5B / %x5D-10FFFF
; any character except ">" and BACKSLASH

; Shared Core Definitions
unescaped-hash = "#"
; preceded by even number of backslashes (including zero)
STRONG = %x00-20 / %x7F-9F
; whitespace, control characters, DEL, C1 controls
PUNCT = "." / "," / ";" / ":" / "!" / "?"
HASH = "#"
BACKSLASH = "\"
ANY = %x00-10FFFF

The parser is a deterministic linear-time scanner. It traverses the input in a single pass, using a finite state machine (FSM) to handle delimiter detection and Unicode surrogate pairs. Scanning takes O(n) time and O(1) auxiliary space; returned values allocate memory in proportion to the number and size of matches.
The syntax is impractical to express as a single standard regular expression. Determining whether a delimiter is escaped requires tracking the parity (even or odd count) of preceding backslashes, which is awkward to encode in regex syntax even though a hand-written state machine handles it directly. Furthermore, the grammar requires conditional lookahead to validate punctuation characters within unwrapped tags.
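The backslash parity check can be illustrated with a short helper. The name isEscaped is hypothetical and the backward count is chosen for clarity only; a single-pass scanner would more likely carry this state forward as it reads.

```typescript
// Hypothetical helper: reports whether the character at `index` is escaped,
// i.e. preceded by an odd number of consecutive backslashes.
function isEscaped(input: string, index: number): boolean {
  let backslashes = 0;
  for (let i = index - 1; i >= 0 && input.charAt(i) === "\\"; i -= 1) {
    backslashes += 1;
  }
  return backslashes % 2 === 1;
}
```

For example, the number sign in "\\#tag" (two backslashes in the actual string) is not escaped, because the two backslashes form an escape pair of their own.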
type HashtagType = 'unwrapped' | 'wrapped';

type HashtagMatch = {
type: HashtagType;
start: number;
end: number;
raw: string;
rawText: string;
text: string;
};

Represents a parsed hashtag in a source string.
start and end are UTF-16 indices, with end being exclusive. raw is the full matched token, including the prefix and wrappers. rawText is the escaped payload (no wrappers). text is the unescaped payload. For wrapped hashtags, line breaks are normalized to a single space and any following horizontal whitespace is ignored.
Malformed surrogate code units are rejected inside hashtags.
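To make the UTF-16 indexing and the surrogate rule concrete, here is a small sketch. The helper hasLoneSurrogate and the sample values are hypothetical and only show the kind of check involved, not how the library performs it.

```typescript
// Hypothetical helper: true when the string contains a surrogate code unit
// that is not part of a valid high+low pair.
function hasLoneSurrogate(s: string): boolean {
  for (let i = 0; i < s.length; i += 1) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      const next = s.charCodeAt(i + 1); // NaN at the end of the string
      if (!(next >= 0xdc00 && next <= 0xdfff)) return true;
      i += 1; // skip the low half of a valid pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      return true; // low surrogate without a preceding high surrogate
    }
  }
  return false;
}

// UTF-16 indexing: a character outside the BMP occupies two code units.
const sample = "go #\u{1F680} now";
const startIndex = sample.indexOf("#");              // 3
const endIndex = startIndex + "#\u{1F680}".length;   // 6, exclusive
const rawSlice = sample.slice(startIndex, endIndex); // "#" followed by the rocket emoji
```

Because indices count code units, the slice above spans three units even though it shows two visible characters.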
type HashtagPatternOptions = {
type?: HashtagType | 'any';
global?: boolean;
sticky?: boolean;
capture?: 'rawText' | 'text';
};
type HashtagPattern = {
source: string;
flags: string;
lastIndex: number;
exec(input: string): RegExpExecArray | null;
test(input: string): boolean;
reset(): void;
execMatch(input: string): HashtagMatch | null;
matchAll(input: string): IterableIterator<RegExpExecArray>;
matchAllMatches(input: string): IterableIterator<HashtagMatch>;
};
function hashtagPattern(options?: HashtagPatternOptions): HashtagPattern;

Creates a RegExp-like matcher.
- global defaults to false, matching JavaScript RegExp behavior.
- If type is 'any', exec() returns [full, payload, type].
- If type is 'wrapped' or 'unwrapped', exec() returns [full, payload].
- payload is rawText by default; set capture: 'text' to capture the unescaped text instead.
- If sticky is true, a match is accepted only at lastIndex.
If global or sticky is enabled, a failed exec() resets lastIndex
to 0.
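Since these lastIndex rules mirror native RegExp, they can be illustrated with a plain regular expression as a stand-in. The pattern /#\w+/ below is a deliberately simplified placeholder and does not implement the hashtag grammar.

```typescript
// Native RegExp used as a stand-in; /#\w+/ is NOT the library's grammar.
const simpleTag = /#\w+/g;
const sampleText = "#one #two";

const first = simpleTag.exec(sampleText);  // matches "#one"
const afterFirst = simpleTag.lastIndex;    // 4
const second = simpleTag.exec(sampleText); // matches "#two"
const afterSecond = simpleTag.lastIndex;   // 9
const third = simpleTag.exec(sampleText);  // null: no further match
const afterFailure = simpleTag.lastIndex;  // 0: reset by the failed exec()

// Sticky: a match is accepted only at lastIndex.
const stickyTag = /#\w+/y;
stickyTag.lastIndex = 1;
const stickyMiss = stickyTag.exec(sampleText); // null: no "#" at index 1
const afterMiss = stickyTag.lastIndex;         // 0
stickyTag.lastIndex = 5;
const stickyHit = stickyTag.exec(sampleText);  // matches "#two" at index 5
```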
const hashtag: HashtagPattern;
const wrappedHashtag: HashtagPattern;
const unwrappedHashtag: HashtagPattern;

These are equivalent to:
hashtagPattern({ type: 'any' });
hashtagPattern({ type: 'wrapped' });
hashtagPattern({ type: 'unwrapped' });

type FindOptions = {
type?: HashtagType | 'any';
fromIndex?: number;
};
function findFirstHashtag(
input: string,
options?: FindOptions,
): HashtagMatch | null;
function findAllHashtags(
input: string,
options?: FindOptions,
): HashtagMatch[];
function iterateHashtags(
input: string,
options?: FindOptions,
): IterableIterator<HashtagMatch>;

These helpers operate as a thin layer on top of hashtagPattern with
global: true and use fromIndex to initialize the scan position.
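The "thin layer" idea can be sketched generically: set lastIndex to fromIndex, then call exec() until it fails. The PatternLike interface and collectMatches function are hypothetical, and a native global RegExp stands in for the pattern object in the usage line.

```typescript
// Minimal shape assumed for this sketch; the real HashtagPattern has more members.
interface PatternLike {
  lastIndex: number;
  exec(input: string): RegExpExecArray | null;
}

// Hypothetical helper: collects every match from fromIndex onward.
// The pattern must advance lastIndex on success (global behavior),
// otherwise the loop would never terminate.
function collectMatches(
  pattern: PatternLike,
  input: string,
  fromIndex = 0,
): RegExpExecArray[] {
  const results: RegExpExecArray[] = [];
  pattern.lastIndex = fromIndex;
  let match = pattern.exec(input);
  while (match !== null) {
    results.push(match);
    // Guard against zero-length matches stalling the scan.
    if (match[0].length === 0) pattern.lastIndex += 1;
    match = pattern.exec(input);
  }
  return results;
}

// Stand-in usage with a native global RegExp (not the hashtag grammar).
const found = collectMatches(/#\w+/g, "#a #bb #ccc", 1).map((m) => m[0]);
```

Starting the scan at index 1 skips the first tag, so found holds "#bb" and "#ccc".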
function createHashtag(text: string): string

Generates a hashtag string from the provided text, automatically selecting the wrapped or unwrapped format based on the content.
createHashtag("hello world");
createHashtag("simple");

function unescapeHashtagText(text: string): string

Removes escape backslashes from a raw hashtag payload, returning the clean text content.
unescapeHashtagText("foo\\ bar");

MIT