main.py (11 changes: 6 additions & 5 deletions)
@@ -133,27 +133,29 @@ def exclude_url(exclude, link):
 	crawling = tocrawl.pop()
 
 	url = urlparse(crawling)
+	crawled.add(crawling)
 	try:
 		request = Request(crawling, headers={"User-Agent":'Sitemap crawler'})
 		response = urlopen(request)
 		if response.getcode() in responseCode:
 			responseCode[response.getcode()]+=1
 		else:
-			responseCode[response.getcode()] = 0
+			responseCode[response.getcode()] = 1
 		if response.getcode()==200:
 			msg = response.read()
 		else:
-			msg = ""
+			response.close()
+			continue
+
 		response.close()
 	except Exception as e:
 		if arg.debug:
 			logging.debug ("{1} ==> {0}".format(e, crawling))
 		continue
-
 
 	print ("<url><loc>"+url.geturl()+"</loc></url>", file=output_file)
+	output_file.flush()
 	links = linkregex.findall(msg)
-	crawled.add(crawling)
 	for link in links:
 		link = link.decode("utf-8")
 		if link.startswith('/'):
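
This hunk makes three behavioral fixes: a status code seen for the first time is now tallied as 1 instead of 0 (the old branch undercounted every code by one), a non-200 response is closed and skipped instead of being parsed as an empty page, and crawled.add(crawling) moves up so a URL is marked visited even when the request fails, preventing it from being re-queued and retried indefinitely. The counting fix is the classic first-occurrence off-by-one; below is a minimal sketch of an equivalent idiom with collections.Counter that sidesteps it entirely (illustrative only, not code from this patch):

    from collections import Counter

    # A missing key reads as 0, so the first occurrence of a status
    # code is tallied as 1 automatically; no if/else needed.
    responseCode = Counter()

    def record(status):
        responseCode[status] += 1

    record(200); record(200); record(404)
    print(responseCode)  # Counter({200: 2, 404: 1})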
@@ -173,7 +175,6 @@ def exclude_url(exclude, link):
 	target_extension = os.path.splitext(parsed_link.path)[1][1:]
 
 	if (link not in crawled) and (link not in tocrawl) and (domain_link == target_domain) and can_fetch(arg.parserobots, rp, link,arg.debug) and ("javascript:" not in link) and (target_extension not in arg.skipext) and (exclude_url(arg.exclude, link)):
-		print ("<url><loc>"+link+"</loc></url>", file=output_file)
 		tocrawl.add(link)
 print (footer, file=output_file)
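
The deleted print here is the duplicate-entry fix: previously a link was written to the sitemap when it was queued and again when it was later crawled (the print in the first hunk), so reachable pages appeared twice and dead links appeared once despite never being fetched. Below is a stripped-down sketch of the queue-then-emit discipline the patch converges on; fetch_ok and extract_links are hypothetical stand-ins for the real urlopen and linkregex code:

    def fetch_ok(url):
        return True  # stand-in for Request/urlopen plus the 200 check

    def extract_links(url):
        return []    # stand-in for linkregex.findall on the page body

    tocrawl, crawled = {"http://example.com/"}, set()
    while tocrawl:
        url = tocrawl.pop()
        crawled.add(url)      # mark before fetching so failures are not retried
        if not fetch_ok(url):
            continue          # error or non-200: no sitemap entry, no parsing
        print("<url><loc>" + url + "</loc></url>")  # emitted once, on fetch
        for link in extract_links(url):
            if link not in crawled and link not in tocrawl:
                tocrawl.add(link)  # queue only; emission happens at fetch time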
