feat: add configurable connection charset (lc_ctype) #164
reinhardt1053 wants to merge 3 commits into asfernandes:master
Add charset option to ConnectOptions allowing users to specify the connection character set used in the DPB (lc_ctype parameter). The charset is also propagated to the data reader and writer so that string encoding/decoding matches the connection charset.

This is essential for legacy Firebird databases (commonly created with Delphi/IBX) where columns use charset NONE. In these databases, text is stored as raw bytes in the application's encoding (typically WIN1252) without any charset metadata on the columns.

With the current hardcoded 'utf8' charset, the driver tells Firebird to communicate in UTF-8, but Firebird does not transliterate charset NONE columns. The raw WIN1252 bytes are then incorrectly decoded as UTF-8, corrupting accented characters (e.g. 'Tournée' becomes 'Tourn�e').

By setting charset: 'WIN1252' in ConnectOptions, Firebird sends the correct bytes and the driver decodes them using the matching Node.js encoding (latin1, which provides 1:1 byte-to-codepoint mapping for single-byte charsets).

Changes:
- ConnectOptions: add optional charset property
- createDpb(): use options.charset instead of hardcoded 'utf8'
- mapCharsetToEncoding(): map Firebird charset names to Node.js encodings
- AbstractAttachment: store encoding from connection charset
- createDataReader(): accept encoding parameter for string decoding
- createDataWriter(): accept encoding parameter for string encoding
- AttachmentImpl: set encoding on connect using mapCharsetToEncoding()
- StatementImpl: pass attachment.encoding to reader/writer

Backward compatible: defaults to 'utf8' when charset is not specified.
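To make the createDpb() change more concrete, here is a minimal sketch of a version-1 DPB that carries only the lc_ctype item and falls back to utf8 when no charset is given. The function name buildCharsetDpb and the hand-built byte layout are assumptions for illustration only; the driver may well assemble its DPB through Firebird's own builder API instead. The tag values are the ones defined in Firebird's ibase.h.

```typescript
// Tag values from Firebird's ibase.h.
const ISC_DPB_VERSION1 = 1;
const ISC_DPB_LC_CTYPE = 48;

// Hypothetical helper: builds a DPB holding a single lc_ctype item.
// Layout of a version-1 DPB: [version][tag][length][value bytes]...
function buildCharsetDpb(options: { charset?: string }): Uint8Array {
  // Same default as described in the commit message.
  const charset = options.charset ?? 'utf8';
  // Charset names are plain ASCII, so UTF-8 encoding yields one byte per character.
  const name = new TextEncoder().encode(charset);

  const dpb = new Uint8Array(3 + name.length);
  dpb[0] = ISC_DPB_VERSION1;
  dpb[1] = ISC_DPB_LC_CTYPE;
  dpb[2] = name.length;
  dpb.set(name, 3);
  return dpb;
}

const defaultDpb = buildCharsetDpb({});
const win1252Dpb = buildCharsetDpb({ charset: 'WIN1252' });
```

With no charset the item value is the four bytes of "utf8"; with charset: 'WIN1252' it is the seven bytes of that name.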
Aren't Node.js strings assumed to be UTF-8?
JavaScript strings are Unicode internally, but the key issue is how raw bytes from the wire are decoded into JS strings, and how JS strings are encoded back to bytes when writing. At the moment, with charset NONE columns, Firebird sends raw bytes without transliteration. The byte 0xE9 (which is é in WIN1252) is not valid as a single-byte UTF-8 sequence, so StringDecoder('utf8') replaces it with �.
With NONE columns the data is indeed just bytes, and the driver can't know the encoding. That's why it's left to the user to specify it via the charset option: the user knows what encoding their application uses (e.g. Delphi apps typically use WIN1252). The driver then uses the corresponding Node.js encoding (latin1) to decode/encode correctly.
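The decoding problem described above can be reproduced without Firebird at all. This small sketch uses the standard TextDecoder (rather than StringDecoder) on the WIN1252 bytes of "Tournée"; the variable names are illustrative, and the 'windows-1252' label assumes a Node.js build with full ICU, which is the default for official builds.

```typescript
// "Tournée" as it would be stored by a WIN1252 application: é is the single byte 0xE9.
const rawBytes = new Uint8Array([0x54, 0x6f, 0x75, 0x72, 0x6e, 0xe9, 0x65]);

// Decoding as UTF-8: 0xE9 starts a multi-byte sequence, but the next byte (0x65)
// is not a continuation byte, so the decoder emits U+FFFD and then decodes "e".
const decodedAsUtf8 = new TextDecoder('utf-8').decode(rawBytes);

// Decoding with the encoding the application actually used.
const decodedAsWin1252 = new TextDecoder('windows-1252').decode(rawBytes);

console.log(decodedAsUtf8);
console.log(decodedAsWin1252);
```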
Usage of |
Use TextDecoder for reading and a reverse lookup table for writing, which correctly supports all single-byte Firebird charsets (WIN1252, WIN1251, WIN1250, ISO8859_x, etc.) without external dependencies. Replaces the previous latin1-based approach as suggested in review.
Updated with your recommendation: it now uses TextDecoder for reading and a reverse lookup table for writing instead of latin1. This correctly supports all single-byte Firebird charsets (WIN1252, WIN1251, ISO8859_x, etc.).
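One way the TextDecoder plus reverse-lookup-table approach could look is sketched below. This is not the code from the PR: the charset-to-label map, the function names and the choice of '?' (0x3F) for unmappable characters are all assumptions made for illustration.

```typescript
// Hypothetical mapping from Firebird charset names to WHATWG encoding labels.
const CHARSET_LABELS: Record<string, string> = {
  WIN1250: 'windows-1250',
  WIN1251: 'windows-1251',
  WIN1252: 'windows-1252',
  ISO8859_2: 'iso-8859-2',
};

interface SingleByteCodec {
  decode(bytes: Uint8Array): string;
  encode(text: string): Uint8Array;
}

// Builds a codec for one single-byte charset. Reading uses TextDecoder directly;
// writing uses a table built by decoding each of the 256 possible byte values once.
function createSingleByteCodec(charset: string): SingleByteCodec {
  const label = CHARSET_LABELS[charset.toUpperCase()];
  if (label === undefined) {
    throw new Error(`Unsupported charset in this sketch: ${charset}`);
  }

  const decoder = new TextDecoder(label);
  const charToByte = new Map<string, number>();
  for (let byte = 0; byte < 256; byte++) {
    const ch = decoder.decode(new Uint8Array([byte]));
    // Bytes with no mapping decode to U+FFFD; leave them out of the table.
    if (ch !== '\uFFFD') {
      charToByte.set(ch, byte);
    }
  }

  return {
    decode: (bytes) => decoder.decode(bytes),
    encode: (text) => {
      const chars = Array.from(text); // iterate by code point
      const out = new Uint8Array(chars.length);
      chars.forEach((ch, i) => {
        out[i] = charToByte.get(ch) ?? 0x3f; // assumption: '?' for unmappable characters
      });
      return out;
    },
  };
}

const win1252 = createSingleByteCodec('WIN1252');
const win1251 = createSingleByteCodec('WIN1251');
```

Building the table from the decoder keeps reading and writing consistent with each other and avoids shipping a separate table per charset, at the cost of 256 tiny decode calls when a codec is created.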
I offer this alternative: #165
Problem
Legacy Firebird databases commonly use charset NONE on text columns. In these databases, text is stored as raw bytes in the application's encoding (typically WIN1252) without any charset metadata on the columns.

The driver currently hardcodes utf8 as the connection charset (lc_ctype in the DPB). When connecting to a database with charset NONE columns, Firebird does not transliterate the data: it sends the raw bytes as-is. The driver then incorrectly decodes these WIN1252 bytes as UTF-8, corrupting accented characters:
- Tournée → Tourn�e
- Café → Caf�

This affects a large number of production Firebird databases where charset NONE was the default.

Solution
Add a charset option to ConnectOptions that:
- sets lc_ctype to the specified charset instead of the hardcoded utf8

Usage
How it works
- mapCharsetToEncoding() maps Firebird charset names to Node.js BufferEncoding values (utf8 for UTF8, latin1 for all single-byte charsets)
- The latin1 encoding in Node.js provides a 1:1 byte-to-codepoint mapping, which correctly handles any single-byte Firebird charset (WIN1252, ISO8859_1, WIN1250, etc.)
- The encoding is stored on AbstractAttachment and passed through to createDataReader() and createDataWriter() via StatementImpl.prepare()

Changes
node-firebird-driver:
- ConnectOptions: add optional charset property
- createDpb(): use options.charset instead of hardcoded 'utf8'
- mapCharsetToEncoding(): new helper to map Firebird charset → Node.js encoding
- AbstractAttachment: add encoding property
- createDataReader() / createDataWriter(): accept encoding parameter

node-firebird-driver-native:
- AttachmentImpl.connect(): set encoding from mapCharsetToEncoding(options.charset)
- StatementImpl.prepare(): pass attachment.encoding to reader/writer

Backward compatible
When charset is not specified, the behavior is identical to before (defaults to utf8).
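The mapping described under "How it works", together with the backward-compatible default, might be sketched as follows. The function body, the SketchEncoding type and the list of single-byte charsets are assumptions; only the function name and the utf8/latin1 split come from the description above.

```typescript
// Subset of Node.js BufferEncoding values used by this sketch.
type SketchEncoding = 'utf8' | 'latin1';

// Assumed list of single-byte Firebird charsets; the real helper may cover more or fewer.
const SINGLE_BYTE_CHARSETS = new Set(['WIN1250', 'WIN1251', 'WIN1252', 'ISO8859_1']);

// Maps a Firebird charset name to the encoding used for decoding and encoding strings.
// An omitted charset keeps the previous behavior (utf8).
function mapCharsetToEncoding(charset?: string): SketchEncoding {
  if (charset === undefined) {
    return 'utf8';
  }
  return SINGLE_BYTE_CHARSETS.has(charset.toUpperCase()) ? 'latin1' : 'utf8';
}

const defaultEncoding = mapCharsetToEncoding();
const legacyEncoding = mapCharsetToEncoding('WIN1252');
const explicitUtf8 = mapCharsetToEncoding('UTF8');
```

Note that a strict 1:1 byte-to-codepoint mapping turns WIN1252 bytes in the range 0x80 to 0x9F into C1 control code points rather than characters such as €, which is one reason a per-charset decoder can be preferable.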