This repository was archived by the owner on Dec 29, 2022. It is now read-only.

Fix unicode print error#96

Open
melynx wants to merge 1 commit into google:master from melynx:master

Conversation

@melynx
Contributor

@melynx melynx commented Mar 22, 2017

Convert unicode string to byte string before printing.

@dennybritz
Contributor

dennybritz commented Mar 22, 2017

Hm, thanks for the PR, but I'm not sure if this is 100% correct. For me, your change results in the following being printed when I run pipeline_test, using Python 3.

b'\xe6\xb3\xa3'

It worked without your change. I'm not sure what the right approach is to make it work on all platforms. Python unicode is a mystery to me.
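
For readers following along, here is a minimal sketch of why the output above appears under Python 3. It assumes the printed value was the single character U+6CE3 encoded as UTF-8; the variable names are illustrative and not taken from the PR.

```python
# Python 3 sketch: printing an encoded string shows the bytes repr,
# not the character itself.
text = "\u6ce3"                 # one CJK character (assumed from the bytes shown above)
encoded = text.encode("utf-8")  # str -> bytes

# print() calls str() on its argument; for bytes that yields the b'...' repr.
shown_for_bytes = str(encoded)
shown_for_text = str(text)

print(shown_for_bytes)
```

Under Python 2 the same `encode` call returns a plain `str`, which `print` writes out byte for byte, so the two interpreters behave differently here.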

@dennybritz
Contributor

Can you try setting: export PYTHONIOENCODING=UTF-8 as an environment variable? Does that solve your issue?
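
A rough way to check what that variable changes, sketched for Python 3: start a child interpreter with `PYTHONIOENCODING` set and ask it which encoding its stdout uses. The helper name `child_stdout_encoding` is made up for this illustration.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding=None):
    """Return the normalized stdout encoding reported by a child interpreter.

    `io_encoding` is placed in PYTHONIOENCODING for the child; None leaves
    the variable unset so the platform default applies.
    """
    env = dict(os.environ)
    env.pop("PYTHONIOENCODING", None)
    if io_encoding is not None:
        env["PYTHONIOENCODING"] = io_encoding
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    # codecs.lookup normalizes spellings such as "UTF-8" and "utf8".
    return codecs.lookup(out.decode("ascii").strip()).name


forced = child_stdout_encoding("UTF-8")
print(forced)
```

Calling `child_stdout_encoding()` with no argument reports whatever the platform picks by default, which is the value that differs between environments.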

@melynx
Contributor Author

melynx commented Mar 22, 2017

Ah... that's because encode() converts it to a byte string.
Yes. But isn't it better if the code works even if the environment variable is not set?

My original fix was actually to set the default encoding of sys, but I'm not sure whether that is a good fix, which is why I changed it to io.open() for PR #93.

# Python 2 only: reload() restores setdefaultencoding, which site.py deletes at startup.
import sys
reload(sys)
sys.setdefaultencoding('utf8')

This seems very hackish and might break stuff which assumes ascii encoding.
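
For comparison, the `io.open()` approach mentioned above can be sketched as follows. This is only an illustration of passing an explicit encoding, not the actual change in PR #93; the file name is hypothetical.

```python
import io
import os
import tempfile

text = u"\u6ce3"

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "vocab_sample.txt")

# io.open takes an explicit encoding on both Python 2 and Python 3,
# so the result does not depend on the process-wide default.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text + u"\n")

with io.open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read().strip()

with io.open(path, "rb") as f:
    raw = f.read()

print(round_tripped == text)
```

This keeps the encoding decision local to each file handle instead of changing global interpreter state.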

@dennybritz
Contributor

> Yes. But isn't it better if the code works even if the environment variable is not set?

Definitely, but your change does not work in my environment ;) It should print a unicode string, but it doesn't. I am still not sure how to make it work in all environments.

Yeah, the original fix isn't great; also see http://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script

@melynx
Contributor Author

melynx commented Mar 22, 2017

Yup, I agree. I'm thinking of writing a separate Unicode printing function, but I'm not sure if it's worth the effort to change all the prints to the custom version. XD
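
One possible shape for such a helper, as a Python 3 sketch only. The name `safe_print` and the fallback to backslash escapes are assumptions made for this illustration; nothing like it exists in the repository.

```python
import io
import sys


def safe_print(text, stream=None):
    """Write text to a stream without raising UnicodeEncodeError.

    Hypothetical helper: characters the stream's encoding cannot represent
    are replaced with backslash escapes instead of aborting the program.
    """
    stream = sys.stdout if stream is None else stream
    if isinstance(text, bytes):
        text = text.decode("utf-8")  # assumes byte input is UTF-8
    encoding = getattr(stream, "encoding", None) or "utf-8"
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        text = text.encode(encoding, errors="backslashreplace").decode(encoding)
    stream.write(text + "\n")


# Usage with an ASCII-only stream, which would normally fail on this character.
ascii_buffer = io.BytesIO()
ascii_stream = io.TextIOWrapper(ascii_buffer, encoding="ascii", newline="\n")
safe_print("\u6ce3", stream=ascii_stream)
ascii_stream.flush()
print(ascii_buffer.getvalue())
```

The cost noted above still applies: every `print` call site would have to switch to the helper.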

@dennybritz
Contributor

Yeah. There must be an easier/correct way to do this, I just don't know what it is...

@pltrdy
Contributor

pltrdy commented Mar 27, 2017

I'm not sure how off-topic this is, but did you consider using unicode_literals (from __future__)?

@dennybritz
Contributor

I think unicode_literals is imported pretty much everywhere, but I don't think that's related since it only applies to literals defined in the code.

@darrengarvey

You might want to try using tf.compat.as_text(), which deals with this Python versioning ugliness.
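
To show the idea without requiring TensorFlow, here is a pure-Python stand-in written for Python 3. The function `as_text_sketch` is hypothetical and only approximates what `tf.compat.as_text()` does: return text unchanged, decode bytes (UTF-8 by default), and reject anything else.

```python
def as_text_sketch(bytes_or_text, encoding="utf-8"):
    """Approximate stand-in for tf.compat.as_text (illustration only)."""
    if isinstance(bytes_or_text, str):
        return bytes_or_text
    if isinstance(bytes_or_text, bytes):
        return bytes_or_text.decode(encoding)
    raise TypeError("Expected bytes or text, got %r" % (bytes_or_text,))


from_bytes = as_text_sketch(b"\xe6\xb3\xa3")
from_text = as_text_sketch("\u6ce3")
print(from_bytes == from_text)
```

With TensorFlow installed, the real call would be `tf.compat.as_text(value)` in place of the stand-in.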

@sheerun

sheerun commented Jul 9, 2017

I had an encoding issue when using ./bin/tools/generate_vocab.py

The solution seems to be to use python3 ./bin/tools/generate_vocab.py instead. Probably the same applies to the whole seq2seq training.
