Closed
Description
Invalid base URLs, e.g. <base href="">, are parsed and set on the request object without any check on whether they're valid.
I'm not sure what the original intent was when this was written, so it would be great to get feedback on that, but I'm happy to write up a PR to handle it. I see two viable approaches to fixing this:
- Use url.ParseRequestURI(href)
Switch out the current parsing:
resp.Request.baseURL, _ = url.Parse(href)
... to instead use:
baseURL, err := url.ParseRequestURI(href)
if err == nil {
    resp.Request.baseURL = baseURL
}
- Just check for the "" case
Do a less strict check: test that the URL returned by url.Parse isn't empty (for example, that its String() is not "") before assigning it.
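To make the difference between the two parsers concrete, here is a minimal sketch using only net/url. The helper parseBaseHref is a hypothetical name for illustration and is not part of colly; it just wraps the stricter call from the first approach.

```go
package main

import (
	"fmt"
	"net/url"
)

// parseBaseHref is a hypothetical helper (not part of colly). It returns the
// parsed base URL, or nil when href is rejected by url.ParseRequestURI.
func parseBaseHref(href string) *url.URL {
	u, err := url.ParseRequestURI(href)
	if err != nil {
		return nil
	}
	return u
}

func main() {
	// url.Parse accepts the empty string: it returns an empty URL and no error.
	u, err := url.Parse("")
	fmt.Printf("url.Parse(\"\"): url=%q err=%v\n", u.String(), err)

	// url.ParseRequestURI rejects the empty string with an error.
	_, err = url.ParseRequestURI("")
	fmt.Println("url.ParseRequestURI(\"\") failed:", err != nil)

	// The helper therefore skips an empty href and keeps an absolute one.
	fmt.Println("empty href skipped:", parseBaseHref("") == nil)
	fmt.Println("absolute href kept:", parseBaseHref("https://example.com/shop/") != nil)
}
```

One trade-off worth noting: url.ParseRequestURI only accepts absolute URLs and absolute paths, so a relative base such as "images/" would also be skipped under the first approach, whereas the "" check in the second approach would still let it through.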
Example to replicate this issue:
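package main

import (
    "fmt"
    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(
        colly.MaxDepth(2),
    )
    // Follow every gallery link and print the error returned by Visit.
    c.OnHTML(".sqs-gallery .slide > div.margin-wrapper > a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        err := e.Request.Visit(link)
        fmt.Println(err)
    })
    c.Visit("https://www.sagrada.com/onlineshop") // Has a broken <base> tag
}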
Output:
Get http:///new-online-books: http: no Host in request URL
Get http:///new-online-shop-earth-spirit: http: no Host in request URL
Get http:///new-online-shop-crystals: http: no Host in request URL
Get http:///jewelry: http: no Host in request URL
Get http:///available-treasures: http: no Host in request URL
Get http:///new-online-shop-beeswax: http: no Host in request URL
Get http:///rosaries-prayer-malas: http: no Host in request URL
Get http:///mary-magdalene: http: no Host in request URL
Get http:///the-black-madonna: http: no Host in request URL
Get http:///angels-saints: http: no Host in request URL
Get http:///buddhisthindu-statues: http: no Host in request URL
Get http:///journals: http: no Host in request URL
Get http:///new-online-tarot-oracle-decks: http: no Host in request URL
Get http:///candles-incense-oils: http: no Host in request URL
Get http:///gift-bundles: http: no Host in request URL
Get http:///childrens-books: http: no Host in request URL
Get http:///vestments: http: no Host in request URL
<nil>
Get http:///artists: http: no Host in request URL
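The missing host in the output above is consistent with relative links being resolved against an empty base URL. The sketch below shows that effect with net/url alone; resolveAgainst is a hypothetical helper, and this is an assumption about the mechanism rather than a trace of colly's actual code path.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveAgainst resolves link relative to base using only net/url.
// It is a hypothetical stand-in, not colly's internal implementation.
func resolveAgainst(base, link string) (*url.URL, error) {
	b, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	return b.Parse(link)
}

func main() {
	// Compare an empty base (as produced by <base href="">) with a real one.
	for _, base := range []string{"", "https://www.sagrada.com/"} {
		u, err := resolveAgainst(base, "/jewelry")
		if err != nil {
			fmt.Println("resolve error:", err)
			continue
		}
		fmt.Printf("base=%q -> url=%q host=%q\n", base, u.String(), u.Host)
	}
}
```

With the empty base, the resolved URL keeps its path but has no host, which is the condition the "no Host in request URL" error complains about.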