can_fetch() returns TRUE ... #2 (Closed)
Hey,

while integrating spiderbar's can_fetch() into the robotstxt package, I ran into a test case where can_fetch() and paths_allowed(check_method = "robotstxt") disagree.
Consider the following robots.txt file:
User-agent: UniversalRobot/1.0
User-agent: mein-Robot
Disallow: /quellen/dtd/
User-agent: *
Disallow: /unsinn/
Disallow: /temp/
Disallow: /newsticker.shtml

Now try this:
library(robotstxt)
rtxt <- "# robots.txt zu http://www.example.org/\n\nUser-agent: UniversalRobot/1.0\nUser-agent: mein-Robot\nDisallow: /quellen/dtd/\n\nUser-agent: *\nDisallow: /unsinn/\nDisallow: /temp/\nDisallow: /newsticker.shtml"
paths_allowed(
paths = "/temp/some_file.txt",
robotstxt_list = list(rtxt),
check_method = "robotstxt",
bot = "*"
)
#> [1] FALSE
paths_allowed(
paths = "/temp/some_file.txt",
robotstxt_list = list(rtxt),
check_method = "spiderbar",
bot = "*"
)
#> [1] FALSE
paths_allowed(
paths = "/temp/some_file.txt",
robotstxt_list = list(rtxt),
check_method = "robotstxt",
bot = "mein-Robot"
)
#> [1] FALSE
paths_allowed(
paths = "/temp/some_file.txt",
robotstxt_list = list(rtxt),
check_method = "spiderbar",
bot = "mein-Robot"
)
#> [1] TRUE

can_fetch() seems to ignore the rules that ought to apply to all bots whenever a specific bot name / user agent is given.
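For an independent cross-check (not part of either R package), the sketch below feeds the same robots.txt to Python's standard-library urllib.robotparser. That parser applies only the most specific matching user-agent group and does not merge in the `*` group, so for this file it happens to agree with spiderbar's answer for mein-Robot; the hypothetical agent name SomeOtherBot is used only to trigger the `*` group.

```python
from urllib.robotparser import RobotFileParser

# Same robots.txt as above (comment translated from German:
# "robots.txt for http://www.example.org/")
rtxt = """# robots.txt for http://www.example.org/

User-agent: UniversalRobot/1.0
User-agent: mein-Robot
Disallow: /quellen/dtd/

User-agent: *
Disallow: /unsinn/
Disallow: /temp/
Disallow: /newsticker.shtml
"""

rp = RobotFileParser()
rp.parse(rtxt.splitlines())

# mein-Robot matches its own group, which only disallows /quellen/dtd/,
# so /temp/ is reachable; an unlisted bot falls through to the '*' group.
print(rp.can_fetch("mein-Robot", "/temp/some_file.txt"))
print(rp.can_fetch("SomeOtherBot", "/temp/some_file.txt"))
```

This does not settle which R behaviour is intended; it only shows that the "pick one matching group" reading of the protocol is what at least one widely used reference implementation does.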