>>21377 The libraries I use for my scraper are as follows (I have replaced some):
requests with requests.adapters → aiohttp with aiohttp_socks (sending network requests; switched to aiohttp for the async functionality)
sqlite3 → aiosqlite (interacting with SQLite databases; also switched for async)
string (for filtering)
colorama (for the looks)
asyncio (needed for async)
bs4 (for parsing HTML)
urllib3 (working with URL parts)
urllib (I'm not sure why there are two separate ones, but I need both)
subprocess (for auto-starting tor; rough sketch below this list)
time (for benchmarking and keeping track of when websites were last checked)
hashlib (to hash content and URLs)
aiohttp_socks (you need pysocks installed for this to work)
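Since the tor auto-start tripped me up at first, here's a minimal sketch of how it can be done (assumes the tor binary is on your PATH and the default 9050 SOCKS port; the poll loop is just one way to wait for it, not necessarily how the final scraper will do it):

import socket
import subprocess
import time

def start_tor(port=9050, timeout=60.0):
    # Launch the tor daemon (not the browser) and wait for its SOCKS port.
    proc = subprocess.Popen(["tor"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            # Note: the port can accept connections before tor has fully
            # bootstrapped, so the first few requests may still fail.
            with socket.create_connection(("localhost", port), timeout=1):
                return proc
        except OSError:
            time.sleep(0.5)
    proc.terminate()
    raise RuntimeError("tor did not open its SOCKS port in time")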
>>21397 Once I am finished with it, I will release at least part of the source code. I might release some before that, but I don't know when I will be satisfied enough with the result to call it complete. If it's good enough, I will submit it to SoyGNU, although I'm not sure if they want this kind of stuff on there.
>>21399 Use a network request library of your choice (requests, httpclient, aiohttp, etc…) and make sure it supports SOCKS5 proxies (you need pysocks installed as well). Install tor (I don't mean the browser); the default SOCKS port should be 9050. Set the proxy config in your request library to either "socks5h://localhost:9050" (requests) or "socks5://localhost:9050" (aiohttp). Now you should be able to send requests to .onion sites (hidden services). I suggest you use a very common Windows user agent, as some sites block you when you don't send one.
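Here's a minimal sketch of the aiohttp route (assumes tor is already listening on 9050; the user agent string is just an example, pick any common Windows one):

import asyncio
import aiohttp
from aiohttp_socks import ProxyConnector

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # example UA

async def fetch(url):
    # Route the request through tor's local SOCKS5 proxy.
    connector = ProxyConnector.from_url("socks5://localhost:9050")
    async with aiohttp.ClientSession(connector=connector, headers=HEADERS) as session:
        async with session.get(url) as resp:
            return await resp.text()

# asyncio.run(fetch("http://<some onion address>.onion/"))

With requests it's even shorter: pass proxies={"http": "socks5h://localhost:9050", "https": "socks5h://localhost:9050"} to requests.get. The "h" matters, it makes tor do the DNS resolution instead of your machine.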
<Once you have the basic network requests done, it gets a bit easier. Request the site you want to scrape. Use Beautiful Soup 4 (bs4) to get all the "a" (anchor) elements that have a non-empty href field. Take the values of all those href fields and turn them into absolute URLs (e.g., "/about.html" → "abc234….onion/about.html"), then remove any query parameters (you can keep them, it just gets very messy) and "#" fragments (they are just links to anchors inside a document, so they won't give you any new sites).
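The whole extraction step fits in one function. A sketch (the function name is mine, and I use urllib.parse here since it's the stdlib tool for URL parts):

from urllib.parse import urljoin, urlsplit, urlunsplit
from bs4 import BeautifulSoup

def extract_links(base_url, html):
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        href = a["href"].strip()
        if not href:
            continue  # skip empty href fields
        absolute = urljoin(base_url, href)  # "/about.html" -> "http://abc234….onion/about.html"
        scheme, netloc, path, _query, _fragment = urlsplit(absolute)
        # Drop query parameters and the #fragment so the same page
        # doesn't get queued under a dozen different URLs.
        links.add(urlunsplit((scheme, netloc, path, "", "")))
    return links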
<For a starting point, you could use Ahmia's indexed domain list, torskan's indexed site list, or a reputable link list. I won't link to any of these because they aren't 100% 'p free (something that I want to beat). I highly recommend you use Ahmia's blocklist and remove sites that match any of its entries.
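For the blocklist check, something like this works (assumes the blocklist is a set of MD5 hashes of the bare onion domains, which is the form I've seen Ahmia distribute it in; verify the current format yourself before trusting this):

import hashlib
from urllib.parse import urlsplit

def is_blocked(url, banned_md5s):
    # banned_md5s: lowercase MD5 hex digests of banned onion domains.
    domain = urlsplit(url).netloc.split(":")[0].lower()
    return hashlib.md5(domain.encode()).hexdigest() in banned_md5s

Run every discovered URL through this before it touches your queue or your database.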