Added tests and refactorized code #1

mar123zaj · 2019-04-01T10:08:45Z

The code isn't as good as I'd like, but I decided to open a discussion about the changes I've made so far.

inirudebwoy

Finished. There are a few things that need to be fixed. Can you point me at the places you are not sure about?

mar123zaj · 2019-04-03T19:53:38Z

I made some changes; a few of them are listed here:

checking whether the url is valid is not as expensive as it was
in my opinion the naming is much better this time
as you proposed, the user can now pass the url argument from the terminal
the loop is modified: there is no longer a function that returns a dictionary with the page url, title and links; instead there are three functions, one for the title, one for all links on the page, and one that keeps only links from the domain of the given url
the part that loops over links and appends them to the urls list is extracted into a function that returns the updated list with the new links

mar123zaj · 2019-04-03T19:58:11Z

crawler.py

+            all_links = page_links(page_url)
+            links = filtered_out_links(all_links, domain.name)
+            title = page_title(page_url)
+            site_mapping.update({page_url: {"title": title, "links": links}})


I wasn't sure whether it was a good idea to delete the page_properties function, but doing so gave me the opportunity to handle the links of a specific page and loop through them to update the urls list. Instead of this function I made two functions which return the title and the page links.

I'm also not sure about splitting link extraction into two functions, because getting all links (even those that aren't from the given domain) is not useful in my case.
On the other hand, this makes it possible to modify crawler.py to obtain other results, such as all links from pages, not only links from the given url's domain.

I think it was a good move to split page_properties into two functions. They are easier to read and understand now.

Having two functions for extracting links is also a good idea. One that fetches everything is required, since you want to have everything so you can filter it down. Having a filter function is also useful, as you can extract links to other domains this way.
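
A minimal sketch of the "fetch everything, then filter" split discussed here. The name links_from_domain and its signature are hypothetical and are not the PR's actual filtered_out_links; the sketch assumes links arrive as absolute or relative URL strings and that a domain is identified by its netloc.

```python
from urllib.parse import urljoin, urlparse


def links_from_domain(all_links, base_url):
    """Keep only links on the same host as base_url (hypothetical helper)."""
    base_netloc = urlparse(base_url).netloc
    kept = []
    for link in all_links:
        # Resolve relative links such as "/about" against the page URL.
        absolute = urljoin(base_url, link)
        if urlparse(absolute).netloc == base_netloc:
            kept.append(absolute)
    return kept
```

Because the unfiltered list is still available to the caller, the same data could later be filtered the other way round, for example to collect only external links.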

mar123zaj · 2019-04-03T20:13:19Z

crawler.py

+            urls = updated_urls_list(urls, links)
        return site_mapping
    else:
        return "URL wasn't valid!"


What do you think about returning a simple string as information for the user, like here? Should I do it differently?

It is fine in a small application. In bigger ones you would either move it to a different file under some variable name like INVALID_URL_MESSAGE = "URL wasn't valid!", where it could be processed if translation was required, or keep it in the code but mark it for translation.

Recently I changed my mind and now prefer to have the text in the code. Only if it needs to be used in many places would I create a variable, and even then I'd keep it in the module, maybe at the top.
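
A small sketch of the two options mentioned here, assuming the standard library's gettext is the translation mechanism (the thread does not say which one would be used). The function describe_invalid_url is hypothetical and exists only to show both variants side by side.

```python
from gettext import gettext as _

# Option 1: a module-level constant kept near the top of the module.
INVALID_URL_MESSAGE = "URL wasn't valid!"


def describe_invalid_url(translate=False):
    """Return the user-facing message (hypothetical helper for illustration)."""
    if translate:
        # Option 2: keep the literal in the code, wrapped in _() so that
        # extraction tools can find it and a catalog can translate it.
        return _("URL wasn't valid!")
    return INVALID_URL_MESSAGE
```

When no translation catalog is installed, gettext hands the original string back, so the second variant behaves like the first until translations are added.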

inirudebwoy

I have made a few comments that need attention.

I do not know which Python you have used, but on 3.7.2 this is what I'm getting when running the tests.

====================================================================================== warnings summary ======================================================================================
tests.py::test_page_title
  /home/majki/.virtualenvs/Web-Crawler-jcOar5YG/lib/python3.7/site-packages/html5lib/_trie/_base.py:3: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
    from collections import Mapping

-- Docs: https://docs.pytest.org/en/latest/warnings.html

Warnings should be fixed where possible, as they may turn into errors, as described in the one above :)
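
For illustration, the import form the warning asks for, plus a way to surface such warnings early. The warning itself comes from html5lib rather than from the code in this PR, and both helper names below are hypothetical.

```python
import warnings
from collections.abc import Mapping  # the non-deprecated location of the ABCs


def is_mapping(obj):
    """True for dict-like objects, using the ABC from collections.abc."""
    return isinstance(obj, Mapping)


def raises_on_deprecation(func):
    """Call func with DeprecationWarning escalated to an exception.

    Returns True if the call triggered a DeprecationWarning, else False.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("error", DeprecationWarning)
        try:
            func()
        except DeprecationWarning:
            return True
    return False
```

Escalating the warning in a test run is one way to notice a deprecation before the Python version that removes the old import arrives.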

inirudebwoy · 2019-04-03T20:20:29Z

crawler.py

+    return Domain(f"{parsed_url.scheme}://{parsed_url.netloc}", parsed_url.netloc)
+
+
+def page_title(url):


This is a nice small function, same as page_links, but unfortunately they both fetch the same page. This is unnecessary: you could cache the result and pass it into each function as an argument.
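
A rough sketch of that suggestion: parse the page once and hand the parsed result to both helpers. It uses the standard library's html.parser purely for illustration (the PR itself appears to rely on mechanize and html5lib), and parse_page plus the changed signatures of page_title and page_links are assumptions, not the PR's code.

```python
from html.parser import HTMLParser


class _TitleAndLinks(HTMLParser):
    """Collects the <title> text and every <a href> in a single pass."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(html):
    """Parse already-fetched HTML once; the result is shared by the helpers."""
    parser = _TitleAndLinks()
    parser.feed(html)
    return parser


def page_title(parsed):
    return parsed.title.strip()


def page_links(parsed):
    return list(parsed.links)
```

With this shape the network request happens in one place, and the two small functions stay easy to test against a plain HTML string.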

inirudebwoy · 2019-04-03T20:23:33Z

tests.py

+)


+def test_is_validated0():


You need to describe what these tests are actually testing. You can either change the name of the function, e.g. test_is_validated_full_domain_with_protocol, or add a docstring
""" Test case for a full domain. """

This applies to all your test cases.
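
As an illustration of both options, a sketch with descriptive names and docstrings. The is_validated below is a stand-in so the example runs on its own, and the second test name is hypothetical.

```python
from urllib.parse import urlparse


def is_validated(url):
    """Stand-in validator: True when both a scheme and a host are present."""
    parsed_url = urlparse(url)
    return bool(parsed_url.scheme and parsed_url.netloc)


def test_is_validated_full_domain_with_protocol():
    """A URL with both a scheme and a host is accepted."""
    assert is_validated("https://example.com")


def test_is_validated_missing_protocol():
    """A bare host name without a scheme is rejected."""
    assert not is_validated("example.com")
```

The test name shows up in pytest's failure output, so a descriptive name tells you what broke without opening the file.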

inirudebwoy · 2019-04-03T20:32:04Z

crawler.py

+def is_validated(url):
+    """Checks if given url is proper."""
+    parsed_url = urlparse(url)
+    if parsed_url.scheme and parsed_url.netloc:


You can leverage Python's flexibility here and simply have

return parsed_url.scheme and parsed_url.netloc

The "and" operator tells the interpreter to check the truth value of the first operand and then the second.
https://docs.python.org/3.7/library/stdtypes.html#boolean-operations-and-or-not

This just made my day, amazing!
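
One detail worth keeping in mind with this shortcut: "and" returns one of its operands rather than a strict True or False. The sketch below shows the suggested one-liner next to a variant wrapped in bool(); the name is_validated_bool is hypothetical.

```python
from urllib.parse import urlparse


def is_validated(url):
    """The one-liner from the review: returns a string, truthy or empty."""
    parsed_url = urlparse(url)
    return parsed_url.scheme and parsed_url.netloc


def is_validated_bool(url):
    """Same check, but always returns True or False."""
    parsed_url = urlparse(url)
    return bool(parsed_url.scheme and parsed_url.netloc)


print(repr(is_validated("https://example.com")))  # 'example.com'
print(repr(is_validated("example.com")))          # ''
print(is_validated_bool("https://example.com"))   # True
```

Inside an if statement the two behave the same; the difference only matters when the result is compared with True, printed, or serialized.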

mar123zaj · 2019-04-04T09:37:58Z

I investigated this warning and found that the problem is in html5lib, where one of the files has "from collections import Mapping" instead of "from collections.abc import Mapping". I found the official html5lib repository, and there the problem is already solved: https://github.com/html5lib/html5lib-python/blob/master/html5lib/_trie/_base.py. The problem is that when I try to upgrade my package, pip says "Requirement already up-to-date". What should I do in this situation?

inirudebwoy · 2019-04-04T18:32:11Z

Sorry, for some reason I can't reply above. Don't worry about it right now; post your requirements.txt and we can go from there.

inirudebwoy · 2019-04-10T13:24:34Z

The warning you have been receiving will be fixed in the latest release of html5lib: html5lib/html5lib-python#403
html5lib is a dependency of mechanize.

inirudebwoy

Ok, looks good 😄 You may merge this PR, and then we can work on the logic. We can start with the site_map function: modifying a collection while you iterate over it may lead to issues.
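
One common way around that issue, sketched under assumptions: keep a queue of pending URLs and a set of URLs already seen, and never loop over the list being extended. The function crawl_order and the links_of callable are hypothetical and stand in for the PR's site_map logic, which is not shown in this thread.

```python
from collections import deque


def crawl_order(start_url, links_of):
    """Visit every reachable URL exactly once.

    links_of is a hypothetical callable mapping a url to the list of urls
    found on that page.
    """
    pending = deque([start_url])  # work still to do
    seen = {start_url}            # everything ever queued
    visited = []
    while pending:
        url = pending.popleft()
        visited.append(url)
        for link in links_of(url):
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return visited
```

The loop condition only checks whether work remains, so adding new URLs to the queue never interferes with an iterator.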

Added tests and refactorized code

bfcdc2b

mar123zaj marked this pull request as ready for review April 1, 2019 16:44

inirudebwoy self-requested a review April 1, 2019 17:12

inirudebwoy requested changes Apr 1, 2019

Added README file, example site for testing and made some changes to crawler.py and tests.py

b3bae79

mar123zaj commented Apr 3, 2019

inirudebwoy reviewed Apr 3, 2019

mar123zaj commented Apr 3, 2019

inirudebwoy requested changes Apr 3, 2019

mar123zaj added 2 commits April 8, 2019 09:41

Added requiremenets.py, changed few things in crawler and tests

92df342

Added docstrings in tests and changed tests names

19a3351

mar123zaj requested a review from inirudebwoy April 8, 2019 15:21

inirudebwoy approved these changes Apr 10, 2019

Merge branch 'master' into repair_crawler

a604d80

mar123zaj merged commit 665f7fe into master Apr 11, 2019
