This is an experiment to calculate PageRank over a wide range of real websites crawled by BDCS.
You will need:

- `pyssdb`
- A database of websites generated by BDCS (a quick connectivity check is sketched below)
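If you are unsure whether the crawl database is reachable, a check along these lines can help. This is only a sketch: the host and port are assumptions for a default local SSDB setup, not values taken from BDCS itself.

```python
# Quick connectivity check for the SSDB instance that holds the BDCS crawl.
# Host and port are assumptions; adjust them to wherever your SSDB server runs.
import pyssdb

client = pyssdb.Client(host="127.0.0.1", port=8888)
print(client.info())  # raises if the server is unreachable
```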
- After you have stopped the spider, run `./pagerank.py` to calculate the PageRank for all the pages in the collection.
- By default it does 14 iterations, but you can change this at the top of the script if you want. A rough sketch of the kind of update each iteration performs is shown below.
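For readers unfamiliar with the algorithm, here is a minimal sketch of the damped power iteration that a script like `./pagerank.py` would perform. It is not the actual implementation; the in-memory `links` dict (page URL mapped to its outgoing link URLs) is a hypothetical stand-in for whatever the BDCS database actually stores.

```python
# Minimal power-iteration PageRank sketch (not the actual pagerank.py).
# `links` is a hypothetical dict: page URL -> list of outgoing link URLs,
# assumed to have been loaded from the BDCS database beforehand.

DAMPING = 0.85      # common choice for the damping factor d
ITERATIONS = 14     # matches the script's default mentioned above


def pagerank(links, damping=DAMPING, iterations=ITERATIONS):
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # PR(p) = (1 - d) / N + d * sum(PR(q) / outdegree(q))
            # summed over every page q that links to p.
            incoming = sum(
                ranks[q] / len(links[q])
                for q in pages
                if page in links[q]
            )
            new_ranks[page] = (1 - damping) / n + damping * incoming
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    demo = {
        "a.example": ["b.example", "c.example"],
        "b.example": ["c.example"],
        "c.example": ["a.example"],
    }
    for url, score in sorted(pagerank(demo).items(), key=lambda kv: -kv[1]):
        print(f"{score:.4f}  {url}")
```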
- After you have calculated PageRanks, you can search the database using `./search.py "<search query>"`.
- Search queries consist of any number of globs in the form `<element>:<term>` (a sketch of how such globs can be parsed is shown after this list).
- For example, a search in the form `h1:channing h1:tatum` will find websites with `<h1>` elements containing the words `channing` and `tatum`.
- You can also combine elements to formulate a search like `t:channing h1:tatum w:news`, which will return all the pages in the collection with page titles containing the word `channing`, `<h1>` elements containing the word `tatum`, and `<p>` elements containing the word `news`.
- The list of valid elements is as follows:
  - `h(n)` where `n` is a number 1-6 - For all header tags `<h1>` through `<h6>`
  - `t` - For the title of a page
  - `w` - For all `<p>` tags in a page
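To make the query format concrete, here is an illustrative parser for the `<element>:<term>` globs described above. It is not the actual `search.py` code, just a sketch that enforces the element names listed.

```python
# Illustrative parser for the <element>:<term> query format described above
# (not the actual search.py implementation).
import re

VALID_ELEMENTS = re.compile(r"^(h[1-6]|t|w)$")


def parse_query(query):
    """Split a query string into (element, term) pairs, e.g.
    't:channing h1:tatum' -> [('t', 'channing'), ('h1', 'tatum')]."""
    pairs = []
    for glob in query.split():
        element, _, term = glob.partition(":")
        if not term or not VALID_ELEMENTS.match(element):
            raise ValueError(f"invalid glob: {glob!r}")
        pairs.append((element, term.lower()))
    return pairs


if __name__ == "__main__":
    print(parse_query("t:channing h1:tatum w:news"))
```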
- I don't even know if I implemented PageRank right.
See the License section of the BDCS Readme for more info on the AGPL license.