Implement content sniffing for HTML parsing#808

Merged
asciimoo merged 2 commits into gocolly:master from WGH-:content-sniffing on Mar 27, 2024

WGH- (Collaborator) commented Mar 25, 2024

Web pages can be served without Content-Type set, in which case browsers employ content sniffing. Do the same here, in Colly.

While we're at it, change the Content-Type check to something stricter than mere "html" substring match.
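
As a rough illustration of the two changes described above, the sketch below sniffs the body only when no Content-Type is available and compares the bare media type exactly instead of searching for "html". `looksLikeHTML` is a hypothetical helper written for this example, not the function added by this PR, and accepting only `text/html` is an assumption. It relies on the standard library's `http.DetectContentType`.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// looksLikeHTML is a hypothetical helper (not Colly's actual code).
// It trusts the Content-Type header when present and sniffs the body otherwise.
func looksLikeHTML(contentType string, body []byte) bool {
	if strings.TrimSpace(contentType) == "" {
		// No header: guess from the leading bytes, as browsers do.
		// http.DetectContentType considers at most the first 512 bytes.
		contentType = http.DetectContentType(body)
	}
	// Compare the bare media type exactly instead of searching for "html".
	mediaType, _, _ := strings.Cut(contentType, ";")
	mediaType = strings.ToLower(strings.TrimSpace(mediaType))
	// Assumption: only text/html counts; a real check might accept more types.
	return mediaType == "text/html"
}

func main() {
	page := []byte("<!DOCTYPE html><html><body>hi</body></html>")
	fmt.Println(looksLikeHTML("", page))                       // header missing, body sniffed
	fmt.Println(looksLikeHTML("", []byte(`{"key": "value"}`))) // header missing, not HTML
	fmt.Println(looksLikeHTML("application/json", page))       // header wins over body
	fmt.Println(looksLikeHTML("text/x-html-template", page))   // substring match would pass, exact match does not
}
```

Note that in this sketch an explicit header always wins: a body that looks like HTML but is served as `application/json` is not treated as HTML.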

WGH- force-pushed the content-sniffing branch from 69cc94a to 40d3e41 on March 25, 2024 21:30

WGH- (Collaborator, Author) commented Mar 25, 2024

Welp, strings.Cut appeared only in Go 1.18. Instead of rewriting it the old way I decided to drop old Go versions (#810).

WGH- added 2 commits March 27, 2024 17:57

Instead of looking for "html" substring, actually parse the MIME type
string. Don't use mime.ParseMediaType though as it doesn't handle
invalid duplicate parameters (e.g. "text/html; charset=UTF-8; charset=utf-8")
that occur in the wild.

Web pages can be served without Content-Type set, in which case
browsers employ content sniffing. Do the same here, in Colly.
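
To make the first commit message's point about duplicate parameters concrete, the sketch below feeds the header quoted there to `mime.ParseMediaType` and to a lenient alternative that ignores parameters altogether. `bareMediaType` is a hypothetical helper for illustration, not the code from this PR. The exact error text from `mime.ParseMediaType` may vary between Go releases, so the program prints it rather than assuming it.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// bareMediaType is a hypothetical lenient parser: it keeps only the part
// before the first ';' and never inspects the parameters, so malformed or
// duplicated parameters cannot make it fail.
func bareMediaType(contentType string) string {
	mediaType, _, _ := strings.Cut(contentType, ";")
	return strings.ToLower(strings.TrimSpace(mediaType))
}

func main() {
	// Header taken from the commit message above.
	header := "text/html; charset=UTF-8; charset=utf-8"

	// The strict parser validates parameters as well as the media type.
	mt, params, err := mime.ParseMediaType(header)
	fmt.Printf("mime.ParseMediaType: %q %v err=%v\n", mt, params, err)

	// The lenient parser only cares about the media type itself.
	fmt.Printf("bareMediaType: %q\n", bareMediaType(header))
}
```

The trade-off is that the lenient version gives up access to parameters such as `charset`, which is acceptable when the only question is whether the response is HTML.
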
WGH- force-pushed the content-sniffing branch from 40d3e41 to bad50ff on March 27, 2024 14:57
WGH- marked this pull request as ready for review March 27, 2024 15:02
WGH- requested a review from asciimoo March 27, 2024 15:07
asciimoo merged commit 5224b97 into gocolly:master Mar 27, 2024