Colly should not set baseURL on requests for invalid base URLs #535

@riverbeecat

Invalid base URLs, e.g. <base href="">, are parsed and set on the request object without any check on whether they are valid.

I'm not sure what the original intent was when this was written, so it would be great to get feedback on that, but I'm happy to write up a PR that handles this. I see two viable approaches to fixing it:

  1. Use url.ParseRequestURI(href)

Switch out the current parsing via:

resp.Request.baseURL, _ = url.Parse(href)

... to instead use:

baseURL, err := url.ParseRequestURI(href)
if err == nil {
  resp.Request.baseURL = baseURL
}
  2. Just check for the "" case

Do a less strict check: keep url.Parse, but confirm that the URL it returns is not empty (its String() is not "") before assigning it.
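To make the trade-off between the two approaches concrete, here is a minimal sketch that uses only the standard net/url package. The helpers strictBase and looseBase are hypothetical names for illustration and are not part of Colly; example.com and assets/ are placeholder values.

```go
package main

import (
	"fmt"
	"net/url"
)

// strictBase sketches approach 1 (hypothetical helper, not Colly code):
// accept href only if url.ParseRequestURI parses it without error.
func strictBase(href string) *url.URL {
	u, err := url.ParseRequestURI(href)
	if err != nil {
		return nil
	}
	return u
}

// looseBase sketches approach 2 (hypothetical helper, not Colly code):
// keep url.Parse, but reject a result that serializes to "".
func looseBase(href string) *url.URL {
	u, err := url.Parse(href)
	if err != nil || u.String() == "" {
		return nil
	}
	return u
}

func main() {
	// url.Parse accepts "" and returns an empty URL with a nil error.
	empty, err := url.Parse("")
	fmt.Printf("url.Parse(\"\"): %q, err=%v\n", empty.String(), err)

	// Resolving a link against that empty base leaves the host empty.
	resolved, _ := empty.Parse("/jewelry")
	fmt.Printf("resolved=%q host=%q\n", resolved.String(), resolved.Host)

	// Compare both checks on an empty, an absolute, and a relative href.
	for _, href := range []string{"", "https://example.com/shop/", "assets/"} {
		fmt.Printf("href=%q strict=%v loose=%v\n",
			href, strictBase(href) != nil, looseBase(href) != nil)
	}
}
```

One thing to weigh when choosing: url.ParseRequestURI assumes an absolute URI or an absolute path, so the strict check also rejects relative values such as assets/, while the loose check only filters out the empty case.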

Example to replicate this issue:

	c := colly.NewCollector(
		colly.MaxDepth(2),
	)

	c.OnHTML(".sqs-gallery .slide > div.margin-wrapper > a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		err := e.Request.Visit(link)
		fmt.Println(err)
	})

	c.Visit("https://www.sagrada.com/onlineshop") // Has a broken <base> tag

Output:

Get http:///new-online-books: http: no Host in request URL
Get http:///new-online-shop-earth-spirit: http: no Host in request URL
Get http:///new-online-shop-crystals: http: no Host in request URL
Get http:///jewelry: http: no Host in request URL
Get http:///available-treasures: http: no Host in request URL
Get http:///new-online-shop-beeswax: http: no Host in request URL
Get http:///rosaries-prayer-malas: http: no Host in request URL
Get http:///mary-magdalene: http: no Host in request URL
Get http:///the-black-madonna: http: no Host in request URL
Get http:///angels-saints: http: no Host in request URL
Get http:///buddhisthindu-statues: http: no Host in request URL
Get http:///journals: http: no Host in request URL
Get http:///new-online-tarot-oracle-decks: http: no Host in request URL
Get http:///candles-incense-oils: http: no Host in request URL
Get http:///gift-bundles: http: no Host in request URL
Get http:///childrens-books: http: no Host in request URL
Get http:///vestments: http: no Host in request URL
<nil>
Get http:///artists: http: no Host in request URL
