can_fetch() returns TRUE ... #2

@petermeissner

Hey,

While integrating spiderbar's can_fetch() into the robotstxt package, I encountered a test case where can_fetch() and paths_allowed(check_method="robotstxt") return different results.

Consider the following robots.txt file:

User-agent: UniversalRobot/1.0
User-agent: mein-Robot
Disallow: /quellen/dtd/

User-agent: *
Disallow: /unsinn/
Disallow: /temp/
Disallow: /newsticker.shtml

Now try this:

library(robotstxt)

rtxt <- "# robots.txt zu http://www.example.org/\n\nUser-agent: UniversalRobot/1.0\nUser-agent: mein-Robot\nDisallow: /quellen/dtd/\n\nUser-agent: *\nDisallow: /unsinn/\nDisallow: /temp/\nDisallow: /newsticker.shtml"

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "robotstxt",
  bot            = "*"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "spiderbar",
  bot            = "*"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "robotstxt",
  bot            = "mein-Robot"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "spiderbar",
  bot            = "mein-Robot"
)
#> [1] TRUE

can_fetch() seems to ignore the rules that ought to apply to all bots (the User-agent: * group) when a specific bot name / user agent is used.
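
To rule out the robotstxt wrapper as the cause, the same check can be run against spiderbar directly. This is only a minimal sketch: it assumes spiderbar's exported robxp() parser and can_fetch(obj, path, user_agent), and the variable names (rt, res_all, res_bot) are just for illustration. No expected output is given here, since the paths_allowed() results above are the only ones actually observed.

```r
library(spiderbar)

# Same robots.txt content as in the paths_allowed() calls above
rtxt <- "# robots.txt zu http://www.example.org/\n\nUser-agent: UniversalRobot/1.0\nUser-agent: mein-Robot\nDisallow: /quellen/dtd/\n\nUser-agent: *\nDisallow: /unsinn/\nDisallow: /temp/\nDisallow: /newsticker.shtml"

# Parse once, then query the same path for both user agents
rt <- robxp(rtxt)

# Wildcard agent: /temp/ is listed in the "User-agent: *" group
res_all <- can_fetch(rt, path = "/temp/some_file.txt", user_agent = "*")

# Named agent: only /quellen/dtd/ is listed in its own group
res_bot <- can_fetch(rt, path = "/temp/some_file.txt", user_agent = "mein-Robot")

print(c(all_bots = res_all, mein_robot = res_bot))
```

If the two values differ here as well, the discrepancy comes from can_fetch() itself and not from the way paths_allowed() passes the robots.txt text through.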
