0.18.0: broken HTML file (charset declaration *wrong*) - override possibility?? #417

@andim2

Hello,

as reported at lynx-dev:
[BUG] [DOCS] broken HTML file (charset declaration wrong) - override possibility?? (ASSUME_CHARSET areas etc.),
I have an HTML file (Microsoft-originated data; generator "Microsoft Word 15") whose charset declaration (iso-8859-1) is wrong:
the document body actually contains UTF-8 code units (as can be seen directly in
the transport-side quoted-printable encoding:

<p class=3D"MsoNormal" style=3D"margin-bottom:0cm;line-height:normal"><span=
 style=3D"color:#003C74;mso-fareast-language:DE">Viele Gr=C3=BC=C3=9Fe,
<o:p></o:p></span></p>

).
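To make the mismatch concrete, here is a minimal sketch in Python (not part of the original report) that decodes the quoted-printable fragment from the excerpt above and interprets the resulting bytes both ways. The variable names are illustrative only.

```python
import quopri

# Fragment copied from the quoted-printable excerpt above.
qp_fragment = b"Viele Gr=C3=BC=C3=9Fe,"

# Undo the transport encoding to get the raw document bytes.
raw = quopri.decodestring(qp_fragment)

# Interpreted as UTF-8 (what the body really is): readable text.
as_utf8 = raw.decode("utf-8")

# Interpreted as ISO-8859-1 (what the file declares): each UTF-8
# code unit becomes its own character, i.e. Mojibake.
as_latin1 = raw.decode("iso-8859-1")

print(repr(raw))
print(as_utf8)
print(repr(as_latin1))
```

The bytes C3 BC and C3 9F are the two-byte UTF-8 sequences for "ü" and "ß". A reader that trusts the iso-8859-1 declaration turns each of them into two separate characters.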

links -dump test.mre.html
(at least version 0.18.0, i.e. older than HEAD)

will display glorious Mojibake-laden

Hallo Herr Mustermann,
 
vielen Dank für Ihre Meldung. Hiermit bestätigt.

output.

I then tried to
override the declared charset via

links -dump -dump-charset UTF-8 test.mre.html

This did not work.

(The -dump-charset option does at least seem to be parsed, since
an invalid value such as ATF-8 properly causes the error report
ELinks: Cannot parse option ATF-8: Read error
).

Thus, I suspect that
[e]links has
the same support weakness that
lynx has (overriding a broken encoding declaration is not possible).

...or it might just be that
the -dump-charset option is intended to
handle this, yet its implementation is
currently broken.

This issue should also be easy to verify in another way:
take a proper UTF-8 HTML file (containing
extended, i.e. non-ASCII-range, characters such as umlauts) and change it to
declare the iso-8859-1 charset.
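A reproduction file of that kind could be generated as follows. This is only a sketch: the helper name `build_mismatched_html`, the output file name `charset-mismatch.html`, and the exact `<meta>` markup are assumptions for illustration, not taken from the original Word-generated file.

```python
from pathlib import Path


def build_mismatched_html(body_text: str) -> bytes:
    """Return an HTML document that declares iso-8859-1 but is encoded as UTF-8."""
    template = (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        # Deliberately wrong declaration: the bytes below are UTF-8.
        '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">\n'
        "<title>charset mismatch test</title>\n</head>\n"
        "<body>\n<p>{body}</p>\n</body>\n</html>\n"
    )
    return template.format(body=body_text).encode("utf-8")


if __name__ == "__main__":
    # Hypothetical file name; any path will do.
    Path("charset-mismatch.html").write_bytes(build_mismatched_html("Viele Grüße"))
```

Running `links -dump` on the generated file, with and without `-dump-charset UTF-8`, should then show whether the declared charset can be overridden.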

Thank you!!
