This repository was archived by the owner on Dec 29, 2022. It is now read-only.

Handling special characters for character seq2seq model  #153

@shubhamagarwal92

Description


I was trying to train the character-level seq2seq model. My source and target files (both UTF-8 encoded) contain special characters such as the pound symbol.

I was hitting "UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c" at several places in the code.

I think the issue is that UTF-8 encodes these special characters as two bytes; for example, the pound sign is represented as '\xc2\xa3':
print '\xc2\xa3'.decode('utf-8')

The vocabulary generated by generate_vocab.py also doesn't handle this.
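To make the problem concrete, here is a minimal standalone sketch (in Python 3 terms, not the project's code) of what byte-level splitting does to a multi-byte character:

```python
# The pound sign u'\xa3' is encoded as the two bytes b'\xc2\xa3' in
# UTF-8, so any byte-by-byte split produces two tokens for what is
# logically a single character.
text = u'This \xa3 is a string'            # \xa3 is the pound sign
encoded = text.encode('utf-8')             # pound becomes b'\xc2\xa3'
byte_tokens = [encoded[i:i + 1] for i in range(len(encoded))]
# the single pound character now spans two "tokens", neither of
# which can be decoded as UTF-8 on its own
```

Either byte on its own raises the same UnicodeDecodeError when decoded as UTF-8.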

To create a character-level vocab, it applies list() when the delimiter is '', which treats a special character (two bytes in UTF-8) as two separate characters. The same happens with the tf.string_split function in split_tokens_decoder.py. For example, consider the result of:

import tensorflow as tf

sess = tf.InteractiveSession()
data = tf.constant("This \xc2\xa3 is a string")
# with delimiter='' the split is byte-by-byte, so the two UTF-8
# bytes of the pound sign come back as two separate tokens
tokens = tf.string_split([data], delimiter='')
tokens.values.eval()

compared with

print (data.eval())

and you get the same UnicodeDecodeError when you try to decode a single byte of it:
character = data.eval()[5].decode('utf-8')

If we create our own vocabulary file (to handle these special characters), we still get a UnicodeDecodeError later in the code (in hooks.py).

To work around this, I converted both my source and target files to latin-1 encoding (which, unlike UTF-8, encodes these characters as a single byte) and changed the encoding to latin-1 in hooks.py, decode_text.py and metrics_specs.py.
The model then ran out of the box with the latin-1 encoding.
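The file conversion itself can be sketched like this (paths and the helper name are hypothetical, not the project's code):

```python
import io

def reencode(src_path, dst_path, src_enc='utf-8', dst_enc='latin-1'):
    # Read the file as unicode text, then write it back out in the
    # target encoding; the pound sign shrinks from two bytes to one.
    with io.open(src_path, 'r', encoding=src_enc) as src:
        text = src.read()
    with io.open(dst_path, 'w', encoding=dst_enc) as dst:
        dst.write(text)
```

Note that latin-1 only covers code points up to U+00FF, so this workaround only applies when the data contains no characters outside that range.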

Another solution could be to convert the sequence of characters directly into a sequence of integers. I have created this gist for that.
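A sketch of that idea (not the contents of the gist itself, and the function names are made up): map each unicode character to its code point, so a multi-byte UTF-8 character becomes a single integer id.

```python
# Character <-> integer mapping via code points (Python 3 semantics).
def chars_to_ids(text):
    # each unicode character, including the pound sign, is one id
    return [ord(ch) for ch in text]

def ids_to_chars(ids):
    # inverse mapping: code points back to a unicode string
    return u''.join(chr(i) for i in ids)
```

A real vocabulary would map only the characters seen in the training data to a dense id range, but the round trip works the same way.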

@dennybritz I think the simplest solution would be to decode the text to unicode while reading (so that each of these special characters is a single character rather than two bytes) and encode back to the desired file encoding (usually utf-8) while writing the results. What do you suggest?
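In code, that suggestion would look roughly like the following (file names and helper names are hypothetical):

```python
import io

def read_lines(path, encoding='utf-8'):
    # io.open decodes to unicode on read: each special character is a
    # single code point, so character-level splitting works as intended
    with io.open(path, 'r', encoding=encoding) as f:
        return f.read().splitlines()

def write_lines(path, lines, encoding='utf-8'):
    # encode back to the desired file encoding only when writing results
    with io.open(path, 'w', encoding=encoding) as f:
        f.write(u'\n'.join(lines) + u'\n')
```

With this approach the byte-level representation never reaches the tokenizer, so no latin-1 conversion is needed.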
