Fix for handling non-Latin characters by qurbat · Pull Request #7 · deepseagirl/degoogle

qurbat · 2022-04-18T19:02:00Z

This change introduces support for search results containing non-Latin characters as part of the URL or description.

This is done in the last print call, which now passes `final_string` to `html.unescape()` instead of printing it directly.

This change introduces support for text containing non-Latin characters (Hindi, Urdu, Greek, for example). This is done by printing `html.unescape(final_string)` instead of `final_string`.
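A minimal sketch of the effect, not the actual degoogle.py code: the value assigned to `final_string` here is a hypothetical stand-in, borrowing the numeric character reference that appears later in this thread plus a pair of named Greek references.

```python
import html

# Hypothetical stand-in for the script's final_string: result text that
# still contains HTML character references instead of the characters.
final_string = "Talk:&#10239; - Wiktionary (&alpha;&beta;)"

# Printing final_string directly would show the raw references.
# html.unescape() converts numeric and named references to Unicode text.
decoded = html.unescape(final_string)
print(decoded)
```

Text that contains no character references passes through `html.unescape()` unchanged, so plain Latin results are unaffected.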

qurbat · 2022-04-19T21:57:55Z

@deepseagirl could you merge this after review?

qurbat · 2022-05-26T23:14:58Z

@deepseagirl hi, just sending a ping on this. thanks!

qurbat · 2022-06-27T18:11:31Z

@deepseagirl Can we close this?

deepseagirl · 2022-07-12T19:24:41Z

hi, thanks. this is a good improvement :)
i moved the unescape to only occur on the result descriptions directly
with a flag to toggle the behavior on/off

new default will be to decode character references:

$ python3 degoogle.py "intitle:⟿ inurl:⟿"
-- 9 results --

TranslingualEdit - Wiktionary
https://en.wiktionary.org/wiki/%E2%9F%BF

Talk:⟿ - Wiktionary
https://en.wiktionary.org/wiki/Talk:%E2%9F%BF

flag to turn decoding off:

$ python3 degoogle.py -d "intitle:⟿ inurl:⟿"
-- 9 results --

TranslingualEdit - Wiktionary
https://en.wiktionary.org/wiki/%E2%9F%BF

Talk:&#10239; - Wiktionary
https://en.wiktionary.org/wiki/Talk:%E2%9F%BF

the html.unescape python doc links to this list of named character references which seemed handy.
i didn't realize char references were such an in-depth thing until now. if you're interested, here is that link:
https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references
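A rough sketch of how a toggle like the one described above might be wired up, assuming `argparse`. Only the `-d` flag itself comes from this thread; the long option name `--no-decode`, the `decode` attribute, and the helpers `format_description` and `build_parser` are hypothetical and are not taken from degoogle.py.

```python
import argparse
import html


def format_description(desc: str, decode: bool = True) -> str:
    """Decode character references in a result description unless disabled."""
    return html.unescape(desc) if decode else desc


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="decoding toggle sketch")
    parser.add_argument("query")
    # -d is the flag shown in the thread; the long name and dest are assumptions.
    # store_false makes decoding the default and -d the way to switch it off.
    parser.add_argument("-d", "--no-decode", dest="decode", action="store_false",
                        help="leave character references undecoded")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["-d", "intitle:⟿ inurl:⟿"])
print(format_description("Talk:&#10239; - Wiktionary", args.decode))
```

Decoding only the description, rather than the whole output string, leaves the percent-encoded URLs untouched, which matches the sample output above.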

deepseagirl · 2022-07-12T19:33:34Z

i'll finalize this when i have a few more mins. should be soon now that it's this far along. thanks again

qurbat · 2022-07-12T19:43:31Z

@deepseagirl no worries, and I realize you were not able to access a computer earlier, so it is no problem. the new changes look great! thank you & tc =)

qurbat · 2022-10-08T13:35:22Z

@deepseagirl can we close?

qurbat added 2 commits April 19, 2022 00:28

Convert HTML entities

faa0615

This change introduces support for text containing non-Latin characters (Hindi, Urdu, Greek, for example). This is done by printing `html.unescape(final_string)` instead of `final_string`.

Update degoogle.py

aec8f17

qurbat mentioned this pull request Apr 18, 2022

Introduce support for non-Latin characters #8

Open

qurbat changed the title ~~Convert HTML entities~~ Fix for handling non-Latin characters May 26, 2022

deepseagirl added 2 commits July 12, 2022 15:26

add decoding for special chars in result desc

1264719

document new decoding toggle flag

610397e

qurbat wants to merge 4 commits into deepseagirl:master from qurbat:patch-1
