This repository was archived by the owner on Dec 29, 2022. It is now read-only.

Handling special characters for character seq2seq model  #153

@shubhamagarwal92

Description


I was trying to train the character-level seq2seq model. My source and target files (both UTF-8 encoded) contain special characters such as the pound symbol.

I was hitting "UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c" at several places in the code.

I think the issue is that UTF-8 encodes these special characters as two bytes; for example, the pound sign is represented as '\xc2\xa3':
print '\xc2\xa3'.decode('utf-8')

The vocabulary generated by generate_vocab.py also doesn't handle this.
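To make the problem concrete, here is a minimal standalone sketch (in Python 3 terms, not the project's code) of what byte-level splitting does to a multi-byte character:

```python
# The pound sign u'\xa3' is encoded as the two bytes b'\xc2\xa3' in
# UTF-8, so any byte-by-byte split produces two tokens for what is
# logically a single character.
text = u'This \xa3 is a string'            # \xa3 is the pound sign
encoded = text.encode('utf-8')             # pound becomes b'\xc2\xa3'
byte_tokens = [encoded[i:i + 1] for i in range(len(encoded))]
# the single pound character now spans two "tokens", neither of
# which can be decoded as UTF-8 on its own
```

Either byte on its own raises the same UnicodeDecodeError when decoded as UTF-8.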

To create a character-level vocab, it applies list() when the delimiter is '', which treats a special character (two bytes in UTF-8) as two separate characters. The same happens with the tf.string_split function in split_tokens_decoder.py. For example, consider the result of:

import tensorflow as tf

sess = tf.InteractiveSession()
data = tf.constant("This \xc2\xa3 is a string")
# with delimiter='' the split is byte-by-byte, so the two UTF-8
# bytes of the pound sign come back as two separate tokens
tokens = tf.string_split([data], delimiter='')
tokens.values.eval()

compared with

print (data.eval())

and you get the same UnicodeDecodeError when you try to decode a single byte of it:
character = data.eval()[5].decode('utf-8')

If we create our own vocabulary file (to handle these special characters), we still get a UnicodeDecodeError later in the code (in hooks.py).

To work around this, I converted both my source and target files to latin-1 encoding (which, unlike UTF-8, encodes these characters as a single byte) and changed the encoding to latin-1 in hooks.py, decode_text.py and metrics_specs.py.
The model then ran out of the box with the latin-1 encoding.
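The file conversion itself can be sketched like this (paths and the helper name are hypothetical, not the project's code):

```python
import io

def reencode(src_path, dst_path, src_enc='utf-8', dst_enc='latin-1'):
    # Read the file as unicode text, then write it back out in the
    # target encoding; the pound sign shrinks from two bytes to one.
    with io.open(src_path, 'r', encoding=src_enc) as src:
        text = src.read()
    with io.open(dst_path, 'w', encoding=dst_enc) as dst:
        dst.write(text)
```

Note that latin-1 only covers code points up to U+00FF, so this workaround only applies when the data contains no characters outside that range.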

Another solution could be to convert the sequence of characters directly into a sequence of integers. I have created this gist for that.
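A sketch of that idea (not the contents of the gist itself, and the function names are made up): map each unicode character to its code point, so a multi-byte UTF-8 character becomes a single integer id.

```python
# Character <-> integer mapping via code points (Python 3 semantics).
def chars_to_ids(text):
    # each unicode character, including the pound sign, is one id
    return [ord(ch) for ch in text]

def ids_to_chars(ids):
    # inverse mapping: code points back to a unicode string
    return u''.join(chr(i) for i in ids)
```

A real vocabulary would map only the characters seen in the training data to a dense id range, but the round trip works the same way.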

@dennybritz I think the simplest solution would be to decode the text to unicode while reading (so that each of these special characters is a single character rather than two bytes) and encode back to the desired file encoding (usually utf-8) while writing the results. What do you suggest?
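In code, that suggestion would look roughly like the following (file names and helper names are hypothetical):

```python
import io

def read_lines(path, encoding='utf-8'):
    # io.open decodes to unicode on read: each special character is a
    # single code point, so character-level splitting works as intended
    with io.open(path, 'r', encoding=encoding) as f:
        return f.read().splitlines()

def write_lines(path, lines, encoding='utf-8'):
    # encode back to the desired file encoding only when writing results
    with io.open(path, 'w', encoding=encoding) as f:
        f.write(u'\n'.join(lines) + u'\n')
```

With this approach the byte-level representation never reaches the tokenizer, so no latin-1 conversion is needed.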
