Guess domain name with hyphen(s) correctly #56
Merged
petermeissner merged 2 commits into ropensci:master on Jul 22, 2020
Hi @petermeissner! I found a usable work-around to the hyphen problem, see diff. This works for the data analysis script I'm building.

Refactoring the whole `parse_url <- function(url)` block failed. Hopefully useful as a follow-up.

    match <- httr::parse_url(url)
    match <- purrr::map(match, ~ ifelse(is.null(.x), "NA", .x))
    data.frame(
      protocol = match$scheme[1],
      domain = match$path[1],
      path = "",
      stringsAsFactors = FALSE
    )

resulted in errors:

    test_attribute_handling.R:5: error: get_robotstxt produces attributes
      arguments imply differing number of rows: 0, 1
    …
    test_attribute_handling.R:5: error: get_robotstxt produces attributes
      arguments imply differing number of rows: 0, 1
    …
    test_http_event_handling.R:4: error: www redirects are handled silently
      arguments imply differing number of rows: 0, 1
    …
    test_http_event_handling.R:20: error: on_redirect detected
      arguments imply differing number of rows: 0, 1
    …
    test_http_event_handling.R:33: error: on_domain_change_detected
      arguments imply differing number of rows: 0, 1
    …
    test_http_event_handling.R:48: error: non www redirects are handled non silently
      arguments imply differing number of rows: 0, 1
    …
    test_http_event_handling.R:59: error: warn = FALSE does silences warnings
      arguments imply differing number of rows: 0, 1
    …
    test_http_event_handling.R:71: error: suspect content
      arguments imply differing number of rows: 0, 1
    …
    test_http_event_handling.R:91: error: all ok
      arguments imply differing number of rows: 0, …
    test_http_event_handling.R:186: error: server error
      length(url) == 1 is not TRUE
    …
    test_http_event_handling.R:138: error: client error
      length(url) == 1 is not TRUE

each with

    Backtrace:
    …
    6. robotstxt::get_robotstxt(...)
    7. robotstxt::rt_request_handler(...) R/get_robotstxt.R:102:4
    8. robotstxt::http_domain_changed(request) R/rt_request_handler.R:158:4
    9. robotstxt::guess_domain(response$request$url) R/http_domain_changed.R:13:4
    10. robotstxt::parse_url(url = x) R/guess_domain.R:17:4
    11. base::data.frame(...) R/parse_url.R:32:2

Didn't get the
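For readers following the backtrace: the message "arguments imply differing number of rows: 0, 1" is what `base::data.frame()` raises when one column has length 0 and another has length 1. Which component was empty in the attempt above is not stated in the thread, so the following is only a minimal sketch of the failure mode. The helper `first_or_na()` and the `parts` list are hypothetical and are not part of robotstxt or httr.

```r
# Reproduce the message: one column has length 0, the other has length 1.
msg <- tryCatch(
  data.frame(protocol = character(0), domain = "example.com",
             stringsAsFactors = FALSE),
  error = function(e) conditionMessage(e)
)
print(msg)

# Hypothetical helper (an assumption, not package code): return the first
# element of x as a character, or NA when x is NULL or has length 0.
first_or_na <- function(x) {
  if (length(x) == 0) NA_character_ else as.character(x[[1]])
}

# Stand-in for a parsed URL whose components came back empty.
parts <- list(scheme = NULL, path = character(0))

# Every column now has exactly one element, so data.frame() succeeds.
df <- data.frame(
  protocol = first_or_na(parts$scheme),
  domain   = first_or_na(parts$path),
  path     = "",
  stringsAsFactors = FALSE
)
print(nrow(df))
```

Guarding each component this way keeps the data frame at one row, though it does not by itself address the `length(url) == 1 is not TRUE` failures listed above.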
Hey, thanks for bringing this up and for working on a fix.
Yeah, that would be cool, but httr as well as some other parsers caused other problems in the past (I think it was about sub-domains), which is why I had to write my own. I'll have a look in the next few days.
@gittaca Thanks for investigating, proposing a solution, and !!!writing tests!!! 💐
Hi! In v0.7.7, `robotstxt:::guess_domain("www.some-domain.com")` returns `www.some`, also when prefixing `http(s)://` or suffixing `/some/path/index.html`.

I suggest to rely on a more common parse_url.R variant. Maybe from httr?
This is a quick attempt to integrate the bug fix. Please feel free to take it where you feel it's best.
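To illustrate how a result like `www.some` can arise, here is a minimal base-R sketch. It assumes the host is matched with a character class that does not include the hyphen. Both patterns below are hypothetical and are not taken from the robotstxt source or from the diff in this pull request.

```r
# Hypothetical patterns for illustration only; not the package's actual regex.
url <- "https://www.some-domain.com/some/path/index.html"

# Strip an optional scheme so that the host starts the string.
host_and_path <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)

# A character class without "-" stops matching at the first hyphen.
no_hyphen <- regmatches(
  host_and_path,
  regexpr("^[A-Za-z0-9.]+", host_and_path)
)

# Adding "-" to the class keeps the full host name.
with_hyphen <- regmatches(
  host_and_path,
  regexpr("^[A-Za-z0-9.-]+", host_and_path)
)

print(no_hyphen)    # "www.some"
print(with_hyphen)  # "www.some-domain.com"
```

The same truncation shows up with or without the scheme and path, which matches the behaviour described above for the prefixed and suffixed inputs.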