-
Notifications
You must be signed in to change notification settings - Fork 0
Added tests and refactorized code #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
inirudebwoy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finished. There are few things that needs to be fixed. Can you point me at places you are not sure about.
…crawler.py and tests.py
|
I made some changes, few of them listed here:
|
| all_links = page_links(page_url) | ||
| links = filtered_out_links(all_links, domain.name) | ||
| title = page_title(page_url) | ||
| site_mapping.update({page_url: {"title": title, "links": links}}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure if this was good idea to delete page_properties function, but it gave me apportunity to handle links of specific page and loop thorugh links to update urls list. Instead of this function I made two functions which return title and page links.
I'm also not sure about idea to split getting links into two functions, because getting all links(even this which aren't from given domain) is not useful in my case.
But on the other hand this gives possibility to modificate crawler.py to obtain other results like all links from pages, not only links from given url's domain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it was a good move to split page_properties into two functions. They are easier to read and understand now.
Regarding having two functions for extracting links is also a good idea. One that fetches everything is required, as you do want to have everything so you can filter it down. Having filter function is also useful as you may extract links to other domains this way.
| urls = updated_urls_list(urls, links) | ||
| return site_mapping | ||
| else: | ||
| return "URL wasn't valid!" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about returning simple string as information for user like here? Should I do it differently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is fine in a small application. In bigger ones you would either move it to a different file under some variable name like INVALID_URL_MESSAGE = "URL wasn't valid!"and there it could be processed if translation was required.
Other solution is to keep it in the code but mark it for translation.
Recently I changed my mind and prefer to have text in the code. Unless it needs to be used in many places than I'd create variable but still keep it in the module, maybe at the top.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made few comments that would need attention.
I do not know which Python you have used but on 3.7.2 this is what I'm getting when running tests.
====================================================================================== warnings summary ======================================================================================
tests.py::test_page_title
/home/majki/.virtualenvs/Web-Crawler-jcOar5YG/lib/python3.7/site-packages/html5lib/_trie/_base.py:3: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
from collections import Mapping
-- Docs: https://docs.pytest.org/en/latest/warnings.htmlWarnings should be fixed if it is possible as they may change to errors, like described in this one above :)
crawler.py
Outdated
| return Domain(f"{parsed_url.scheme}://{parsed_url.netloc}", parsed_url.netloc) | ||
|
|
||
|
|
||
| def page_title(url): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice small function, same as page_links but unfortunately they both fetch same page. This is unnecessary and you could cache your result and pass it into each function as argument.
tests.py
Outdated
| ) | ||
|
|
||
|
|
||
| def test_is_validated0(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to describe what these test are actually testing. You can either change the name of the function like test_is_validated_full_domain_with_protocol or add a docstring
""" Test case for a full domain. """
This applies to all your test cases.
crawler.py
Outdated
| def is_validated(url): | ||
| """Checks if given url is proper.""" | ||
| parsed_url = urlparse(url) | ||
| if parsed_url.scheme and parsed_url.netloc: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can leverage pythons flexibility here and simply have
return parsed_url.scheme and parsed_url.netlocand operator tells interpreter that it needs to check for truth value of first and then second argument.
https://docs.python.org/3.7/library/stdtypes.html#boolean-operations-and-or-not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This way just made my day, amazing!
| urls = updated_urls_list(urls, links) | ||
| return site_mapping | ||
| else: | ||
| return "URL wasn't valid!" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is fine in a small application. In bigger ones you would either move it to a different file under some variable name like INVALID_URL_MESSAGE = "URL wasn't valid!"and there it could be processed if translation was required.
Other solution is to keep it in the code but mark it for translation.
Recently I changed my mind and prefer to have text in the code. Unless it needs to be used in many places than I'd create variable but still keep it in the module, maybe at the top.
I investigated this warning and I found that this problem is with html5lib where in one of files there's "from collections import Mapping" instead of "from collection.abc import Mapping". I found official repository of html5lib, but there this problem is solved https://github.com/html5lib/html5lib-python/blob/master/html5lib/_trie/_base.py. The problem is that when I want to upgrade my package pip says that "Requirement already up-to-date, what should I do in this situation? |
Sorry I can't for some reason reply above. Do not worry right now, post your |
|
This warning you have been receiving will be fixed in latest release of |
inirudebwoy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, looks good 😄 You may merge this PR and now we can work on logic. We can start from site_map function, practice of modifying collection you iterate over may lead to issues.
Code isn't that good how I wish, but I decided to try discuss changes that I maked already.