I was trying to train a character-level seq2seq model. My source and target files (both UTF-8 encoded) contain special characters such as the pound symbol.
I was hitting "UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c" at several places in the code.
I think the issue is that UTF-8 encodes these special characters as two bytes; for example, the pound sign is represented as '\xc2\xa3':
print '\xc2\xa3'.decode('utf-8')  # the two bytes decode to the single pound character
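A quick illustration of the same issue in plain Python 2, reusing the pound bytes from above:
s = 'This \xc2\xa3 is a string'
print list(s)[5:7]                        # ['\xc2', '\xa3'] -- the pound sign becomes two "characters"
print repr(list(s.decode('utf-8'))[5])    # u'\xa3'          -- a single character after decoding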
The vocab generated by generate_vocab.py also doesn't handle this.
To build a character-level vocab with delimiter '', it uses 'list', which treats a special character (two bytes in UTF-8) as two separate characters. The same happens with the tf.string_split function in split_tokens_decoder.py. For example, consider the result of:
import tensorflow as tf
sess = tf.InteractiveSession()
data = tf.constant("This \xc2\xa3 is a string")
# With an empty delimiter, tf.string_split splits the string into individual
# bytes, so the pound sign comes out as two separate tokens: '\xc2' and '\xa3'.
tokens = tf.string_split([data], delimiter='')
tokens.values.eval()
compared with:
print(data.eval())
and you get the same UnicodeDecodeError when you try to decode one of those single bytes:
character = data.eval()[5].decode('utf-8')  # '\xc2' on its own is not valid UTF-8
If we create our own vocabulary file (to handle these special characters), we still get a UnicodeDecodeError later in the code (in hooks.py).
To solve this, I converted both my source and target files to Latin-1 encoding (which, unlike UTF-8, encodes these characters as a single byte) and changed the encoding to latin-1 in hooks.py, decode_text.py, and metrics_specs.py.
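A minimal sketch of the re-encoding step (the file names here are just placeholders):
import io
# Read the UTF-8 file and write it back out as Latin-1, where the pound
# symbol (and other Latin-1 characters) take up a single byte each.
with io.open('sources.txt', encoding='utf-8') as f:
    text = f.read()
with io.open('sources.latin1.txt', 'w', encoding='latin-1') as f:
    f.write(text)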
The model then ran out of the box with the Latin-1 encoding.
Another solution could be to convert the sequence of characters directly into a sequence of integers. I have created this gist for that.
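Roughly, the idea is something like this (a sketch of the approach, not the gist itself):
# Map each Unicode character to an integer id and back, so the pipeline
# never has to split raw UTF-8 bytes.
text = u'This \xa3 is a string'                      # already-decoded Unicode text
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
ids = [vocab[ch] for ch in text]                     # characters -> integers
inv_vocab = {i: ch for ch, i in vocab.items()}
restored = u''.join(inv_vocab[i] for i in ids)       # integers -> characters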
@dennybritz I think the simplest solution would be to decode the text to Unicode (so each of these special characters becomes a single code point) while reading, and to encode it back to the desired file encoding (usually UTF-8) when writing the results. What do you suggest?
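Something along these lines (a rough sketch; the file names are placeholders):
import io
# Decode to Unicode on read: each special character is one code point,
# so character-level splitting works as expected.
with io.open('source.txt', encoding='utf-8') as f:
    lines = [line.rstrip(u'\n') for line in f]
char_tokens = [list(line) for line in lines]          # true character-level split
# Encode back to the desired file encoding (UTF-8 here) on write.
with io.open('predictions.txt', 'w', encoding='utf-8') as f:
    f.write(u'\n'.join(u''.join(t) for t in char_tokens))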