Description
Hello,
as reported at lynx-dev:
[BUG] [DOCS] broken HTML file (charset declaration wrong) - override possibility?? (ASSUME_CHARSET areas etc.),
I have an HTML file (Microsoft-originating data; "Microsoft Word 15") with a wrong charset declaration (iso-8859-1), whereas
the document body actually contains UTF-8 code units (as can be seen directly in
the transport-side quoted-printable encoding:
<p class=3D"MsoNormal" style=3D"margin-bottom:0cm;line-height:normal"><span=
style=3D"color:#003C74;mso-fareast-language:DE">Viele Gr=C3=BC=C3=9Fe,
<o:p></o:p></span></p>
).
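To make the mismatch concrete, here is a minimal Python sketch (not part of the original report) that decodes the quoted-printable fragment above and interprets the resulting bytes both ways. The variable names are illustrative only.

```python
import quopri

# Quoted-printable fragment copied from the HTML snippet above.
qp_fragment = b"Viele Gr=C3=BC=C3=9Fe,"

# Undo the transport encoding to get the raw document bytes.
raw = quopri.decodestring(qp_fragment)

# Interpretation 1: the bytes are UTF-8 (what the body really contains).
as_utf8 = raw.decode("utf-8")

# Interpretation 2: the bytes are iso-8859-1 (what the file declares).
# Every UTF-8 two-byte sequence turns into two separate characters.
as_latin1 = raw.decode("iso-8859-1")

print(repr(raw))
print(as_utf8)
print(repr(as_latin1))
```

The bytes C3 BC and C3 9F are the UTF-8 encodings of "ü" and "ß"; read as iso-8859-1 they become two characters each, which is the Mojibake described here.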
links -dump test.mre.html
(at least version 0.18.0, i.e. older than HEAD)
will display glorious Mojibake-laden
Hallo Herr Mustermann,
vielen Dank für Ihre Meldung. Hiermit bestätigt.
output.
I then tried to
override the declared charset via
links -dump -dump-charset UTF-8 test.mre.html
This did not work.
(The -dump-charset option does seem to be parsed at least, since
an invalid value such as ATF-8 will properly cause an
ELinks: Cannot parse option ATF-8: Read error
error report).
Thus, I suspect that
[e]links has
the same limitation that
lynx has (overriding a broken encoding declaration is not possible).
...or it might just be that
the -dump-charset option is intended to
handle this, yet its implementation is simply
broken at the moment.
This issue should also be easy to reproduce by
taking a properly UTF-8-encoded HTML file (containing
non-ASCII characters, umlauts etc.) and changing it to
declare the iso-8859-1 charset.
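A test file of that kind could be generated roughly as follows. This is only a sketch: the function name `write_mislabeled_html`, the file name `mislabeled.html`, and the exact markup are my own assumptions, not taken from the original file.

```python
import tempfile
from pathlib import Path


def write_mislabeled_html(path, declared="iso-8859-1"):
    """Write UTF-8 encoded HTML whose meta tag declares another charset."""
    html = (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" '
        f'content="text/html; charset={declared}">\n'
        "</head><body>\n"
        "<p>Viele Grüße, vielen Dank für Ihre Meldung.</p>\n"
        "</body></html>\n"
    )
    # The body is deliberately encoded as UTF-8, contradicting the declaration.
    data = html.encode("utf-8")
    Path(path).write_bytes(data)
    return data


# Hypothetical output location; any writable path will do.
out = Path(tempfile.gettempdir()) / "mislabeled.html"
data = write_mislabeled_html(out)
print(out)
```

Running `links -dump` on the printed path should then show whether the declared charset wins over the actual encoding.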
Thank you!!