This repository was archived by the owner on May 22, 2019. It is now read-only.

Improve TokenParser in cases containing abbreviations #403

@irinakhismatullina

While using TokenParser to correct typos in identifiers, I constantly run into bad splits like
HTMLElement -> htmle lement.

It seems to me that in this case (several uppercase letters in a row, followed by lowercase letters) it would be better to attach the last uppercase letter to the next token. I've seen many cases where this would be the right call, and almost none where it would break the logic.
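To make the proposed rule concrete, here is a minimal sketch. It is not the actual TokenParser code: the function name `split_identifier`, the flag `attach_last_upper`, and both regular expressions are assumptions made for illustration. The first pattern only mimics the splits reported in this issue, and the second applies the proposed rule.

```python
import re

# Hypothetical patterns for illustration; not the real TokenParser logic.
# Mimics the reported behavior: a run of 2+ uppercase letters is taken whole,
# so it swallows the capital that starts the next word.
_CURRENT = re.compile(r"[A-Z]{2,}|[A-Z]?[a-z]+|[A-Z]")
# Proposed rule: an uppercase run stops before an uppercase letter
# that is followed by a lowercase letter, leaving it to the next token.
_PROPOSED = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+")


def split_identifier(identifier, attach_last_upper=True):
    """Split an identifier into lowercase tokens.

    attach_last_upper is a hypothetical switch: True applies the proposed
    rule, False keeps the behavior described in this issue.
    """
    pattern = _PROPOSED if attach_last_upper else _CURRENT
    # Non-letter characters (underscores, digits) are simply skipped.
    return [part.lower() for part in pattern.findall(identifier)]


print(split_identifier("HTMLElement", attach_last_upper=False))  # ['htmle', 'lement']
print(split_identifier("HTMLElement"))                           # ['html', 'element']
print(split_identifier("createElement"))                         # ['create', 'element']
```

Identifiers that already split correctly, such as createElement, come out the same with either setting, so the switch only affects the uppercase-run case.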

E.g. the token 'lement' is one of the most frequent typo-like tokens produced by splitting. And here's where it comes from (top examples):

data[data.token_split.str.contains(" lement")]
| pos | num_occ | num_repos | identifier | token_split | num_files |
|---|---|---|---|---|---|
| 3993 | 66995 | 4764 | HTMLElement | htmle lement | 13079 |
| 14139 | 16425 | 103 | NSXMLElement | nsxmle lement | 1741 |
| 47404 | 4496 | 85 | JAXBElement | jaxbe lement | 453 |
| 64825 | 3276 | 16 | HTMLElementEventMap | htmle lement event map | 42 |
| 66583 | 3182 | 41 | IHTMLElement | ihtmle lement | 209 |
| 86788 | 2389 | 471 | SVGSVGElement | svgsvge lement | 784 |
| 107285 | 1895 | 653 | HTMLLIElement | htmllie lement | 967 |
| 123871 | 1618 | 548 | HTMLHRElement | htmlhre lement | 811 |
| 126724 | 1579 | 551 | HTMLBRElement | htmlbre lement | 825 |
| 128322 | 1556 | 418 | SVGGElement | svgge lement | 718 |
| 144583 | 1365 | 19 | BSONElement | bsone lement | 198 |
| 150084 | 1309 | 33 | IXMLDOMElement | ixmldome lement | 178 |

And here are the correct parses for comparison:

data[data.token_split.str.contains(" element")]
| pos | num_occ | num_repos | identifier | token_split | num_files |
|---|---|---|---|---|---|
| 194 | 1608035 | 27484 | createElement | create element | 185424 |
| 458 | 740521 | 22 | as_fusion_element | as fusion element | 628 |
| 604 | 568326 | 19962 | documentElement | document element | 90360 |
| 618 | 555927 | 20933 | getElementsByTagName | get elements by tag name | 91772 |
| 794 | 407035 | 22788 | getElementById | get element by id | 97313 |
| 1888 | 155867 | 12876 | getElementsByClassName | get elements by class name | 29477 |
| 2182 | 131254 | 13040 | activeElement | active element | 37437 |
| 2538 | 111209 | 3936 | getElement | get element | 19493 |
| 3153 | 87404 | 137 | FieldElement | field element | 449 |
| 3221 | 85380 | 1091 | _currentElement | current element | 2370 |
| 3306 | 83096 | 6811 | parentElement | parent element | 18498 |
| 3550 | 76270 | 1698 | domElement | dom element | 5811 |
| 3765 | 71809 | 1496 | buttonElement | button element | 2572 |
| 3843 | 69912 | 12145 | srcElement | src element | 35165 |

TL;DR: Can I add this case to TokenParser? It would be possible to switch it off at first, and I would like to try it out with typo correction.
