Description
The Point section says:
The line field (1-indexed integer) represents a line in a source file. The column field (1-indexed integer) represents a column in a source file. The offset field (0-indexed integer) represents a character in a source file.
What is the unit of 'character' and 'column'? Is it a UTF-16 code unit (as used in JavaScript) or a Unicode code point? See Wikipedia:
[UTF-16] encoding is variable-length, as code points are encoded with one or two 16-bit code units
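To make the one-or-two code units point concrete, here is a minimal Python sketch that counts the same string in both units. The helper names `utf16_units` and `utf16_length` are made up for illustration and are not part of remark or unist.

```python
def utf16_units(ch: str) -> int:
    """Return how many 16-bit code units UTF-16 needs for one code point."""
    # Code points above U+FFFF lie outside the Basic Multilingual Plane
    # and are encoded as a surrogate pair, i.e. two code units.
    return 2 if ord(ch) > 0xFFFF else 1


def utf16_length(text: str) -> int:
    """Length of text measured in UTF-16 code units (hypothetical helper)."""
    return sum(utf16_units(ch) for ch in text)


text = "a𠮷b"  # 𠮷 is U+20BB7, outside the Basic Multilingual Plane

print(len(text))                            # code points: 3
print(utf16_length(text))                   # UTF-16 code units: 4
print(len(text.encode("utf-16-le")) // 2)   # same count via the encoder: 4
```

The last line cross-checks the hand-written count against Python's own UTF-16 encoder, which uses two bytes per code unit.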
I tried using remark to parse this markdown piece:
a𠮷b
Here, 𠮷 is a single Unicode code point that cannot be encoded in one UTF-16 code unit. Because JavaScript strings use UTF-16:
'a𠮷b'.length
//=> 4
But in other languages, like Python:
len('a𠮷b')
#=> 3
As for remark, the above markdown piece is parsed into:
{
  "type": "text",
  "value": "a𠮷b",
  "position": {
    "start": {
      "line": 1,
      "column": 1,
      "offset": 0
    },
    "end": {
      "line": 1,
      "column": 5,
      "offset": 4
    },
    "indent": []
  }
}
The end column is 5 and the end offset is 4, which means remark treats this text as four 'chars' long, measured in UTF-16 code units.
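For comparison, the sketch below shows how the end point of this one-line text would differ depending on the unit chosen. It assumes the text starts at line 1, column 1, offset 0 and contains no line breaks. The function `end_point` and its unit names are hypothetical and are not a remark or unist API.

```python
def end_point(text: str, unit: str) -> dict:
    """End position of a single-line text that starts at line 1, column 1, offset 0.

    `unit` selects what one 'character' means (names are illustrative only).
    """
    if unit == "code_point":
        length = len(text)                           # Unicode code points
    elif unit == "utf16":
        length = len(text.encode("utf-16-le")) // 2  # 16-bit code units
    elif unit == "utf8":
        length = len(text.encode("utf-8"))           # bytes
    else:
        raise ValueError(f"unknown unit: {unit}")
    # Columns are 1-indexed and offsets are 0-indexed, per the Point description.
    return {"line": 1, "column": 1 + length, "offset": length}


for unit in ("code_point", "utf16", "utf8"):
    print(unit, end_point("a𠮷b", unit))
# code_point {'line': 1, 'column': 4, 'offset': 3}
# utf16 {'line': 1, 'column': 5, 'offset': 4}
# utf8 {'line': 1, 'column': 7, 'offset': 6}
```

Only the `utf16` result matches the position remark reported above.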
So what is the unit of 'character'? The current wording is confusing.