Comparison of supervised and unsupervised tokenization algorithms, evaluated via sentiment analysis on a Chinese corpus (ChnSentiCorp)
Regularized logistic regression trained on 5205 examples and tested on 579 examples (90/10 split)
| Tokenizer | Accuracy (%) |
|---|---|
| no tokenizer | 83.07 |
| jieba | 89.32 |
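
The baseline setup can be reproduced roughly as follows. This is a minimal sketch, not the original configuration: the file name, column names, vectorizer choice, and regularization strength are all assumptions.

```python
# Sketch of the jieba baseline: tokenize, vectorize, train L2-regularized logistic regression.
import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("ChnSentiCorp.csv")                 # assumed: columns "text" and "label"
texts = [" ".join(jieba.cut(t)) for t in df["text"]]  # whitespace-join jieba tokens

X_train, X_test, y_train, y_test = train_test_split(
    texts, df["label"], test_size=0.1, random_state=0)  # 90/10 split

vec = TfidfVectorizer(token_pattern=r"\S+")          # treat each jieba token as a feature
clf = LogisticRegression(C=1.0, max_iter=1000)       # L2 regularization by default
clf.fit(vec.fit_transform(X_train), y_train)

acc = accuracy_score(y_test, clf.predict(vec.transform(X_test)))
print(f"accuracy: {acc:.4f}")
```

Swapping the tokenizer (no tokenizer, jieba, or a SentencePiece model) only changes how `texts` is produced; the classifier stays the same.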
| SentencePiece (SPM) model type | vocab_size=2000 | vocab_size=4000 | vocab_size=8000 | vocab_size=16000 |
|---|---|---|---|---|
| Unigram | Aborted | 87.21 | 90.43 | 90.08 |
| Byte Pair Encoding | Aborted | 86.70 | 90.81 | 90.81 |
| Char | 53.36 | 48.46 | 48.98 | 47.35 |
| Word | 85.18 | 85.73 | 84.59 | Aborted |
Aborted = SentencePiece training failed because the requested vocab_size was either too small or too large for that model type on this corpus
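
The SentencePiece rows come from training one model per (model_type, vocab_size) pair and then reusing the same classifier pipeline on the resulting tokens. Below is a hedged sketch of that training loop; the training-file path is an assumption, and SentencePiece raises an error when the requested vocab_size is infeasible, which is what the "Aborted" cells record.

```python
# Sketch: train one SentencePiece model per (model_type, vocab_size) cell in the table.
import sentencepiece as spm

for model_type in ["unigram", "bpe", "char", "word"]:
    for vocab_size in [2000, 4000, 8000, 16000]:
        try:
            spm.SentencePieceTrainer.train(
                input="chnsenticorp_train.txt",   # assumed: one sentence per line;
                                                  # model_type="word" needs pre-tokenized
                                                  # (whitespace-separated) input, e.g. via jieba
                model_prefix=f"spm_{model_type}_{vocab_size}",
                model_type=model_type,
                vocab_size=vocab_size,
            )
        except RuntimeError as err:
            # Corresponds to "Aborted": vocab_size too small or too large for this model type
            print(f"{model_type} @ {vocab_size}: aborted ({err})")
```

A trained model can then be loaded with `spm.SentencePieceProcessor(model_file="spm_unigram_8000.model")` and applied with `encode(text, out_type=str)` to produce the tokens fed to the classifier.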