I'm currently working on my fourth attempt at making a decent dark web search engine.
<My first attempt was more of a test to see if I could actually use Tor in my code. It didn't have a front end, and searching was limited to each site's description with no real search algorithm apart from a LIKE '%query%' match. It all ran in the terminal without actually serving any site to the Tor network.
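For reference, this is roughly what that kind of "search" boils down to (a minimal sketch, not my actual code, assuming a SQLite table named sites with url and description columns):

import sqlite3

def naive_search(db_path, query):
    # The whole "algorithm": a plain substring match on the description column.
    # Parameterised so the query can't break the SQL, but there's no ranking at all.
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT url, description FROM sites WHERE description LIKE ?",
        (f"%{query}%",),
    ).fetchall()
    conn.close()
    return rows

# Example: list every indexed site whose description mentions "forum".
for url, desc in naive_search("index.db", "forum"):
    print(url, "-", desc)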
<My second attempt went better. This time I wanted to focus on the front end (the part I hate the most). I figured out how to host hidden services with Flask and, after spending way too long generating a cool vanity domain, I finally had a working site up and running. It had the same search algorithm, if you can even call it that, as the last attempt. It also took forever to load (around 2 minutes without cache) because Nginx was being a selfish little fuck and not letting me use it with Tor (this could have been because I had 2 hidden services running at the same time, as I was working on another project (Onion365) back then). Picrel is what this attempt looked like. I got carried away and added way too much bloat to the homepage. This one had 129644 sites indexed, though most of them were just homepages.
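If anyone wants to replicate the hosting part, serving a Flask app as a hidden service mostly comes down to pointing tor at a local port (a minimal sketch, assuming you can edit torrc; my actual setup also tried to put Nginx in between, which is where the trouble started):

# torrc needs two lines along these lines (service directory + port mapping):
#   HiddenServiceDir /var/lib/tor/hidden_service/
#   HiddenServicePort 80 127.0.0.1:5000
# After restarting tor, the .onion address appears in that directory's
# "hostname" file and port 80 traffic gets forwarded to the Flask app below.

from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def home():
    # Hypothetical search page; the real front end obviously has more than this.
    query = request.args.get("q", "")
    return f"You searched for: {query}"

if __name__ == "__main__":
    # Bind to localhost only; Tor (or Nginx in front of it) handles exposure.
    app.run(host="127.0.0.1", port=5000)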
<My third attempt was just back-end stuff again. This included improving the filters, reworking the scraper to be more efficient, and entirely remaking the search algorithm to use tokenizers for much better search results. This had no front end, and I didn't do much testing with the tokenizers before moving on, so I'm not sure how much that actually helped.
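The idea behind the tokenizer rework is to rank pages by token overlap instead of raw substring matching (a rough sketch of that idea, not my actual implementation, which sits on a real index plus the filters):

import re
from collections import Counter

def tokenize(text):
    # Lowercase word tokens; a real tokenizer would also handle stemming/stopwords.
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query_tokens, doc_tokens):
    # How many query tokens show up in the document, weighted by frequency.
    counts = Counter(doc_tokens)
    return sum(counts[t] for t in query_tokens)

def search(query, docs):
    # docs maps url -> page text; returns matching urls ranked best-first.
    q = tokenize(query)
    scores = {url: score(q, tokenize(text)) for url, text in docs.items()}
    hits = [url for url, s in scores.items() if s > 0]
    return sorted(hits, key=lambda url: scores[url], reverse=True)

print(search("hidden wiki", {"http://example.onion": "yet another hidden wiki mirror"}))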
<My current attempt is focused on remaking the crawler to be more efficient and to work asynchronously (the latter I just implemented). It's already working way faster than any of my previous ones (around 0.5 seconds per scrape instead of 3-5). Another goal this time is to finally get Nginx running correctly. I'm also selectively caching websites (only the pure HTML, no media), since the current archival hidden services are unreliable. I plan to implement AI-assisted filters to avoid the false detections you get from keyword matching alone. I will rework part of the front end (more search parameters and such), but not much will change. I have over 200k domains queued up to be crawled.
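The async part is basically aiohttp fanned out through Tor's SOCKS proxy (a sketch of the general shape, assuming aiohttp + aiohttp_socks with Tor listening on 127.0.0.1:9050; the real crawler also does retries, filtering and the HTML cache, and the .onion URL below is just a placeholder):

import asyncio
import aiohttp
from aiohttp_socks import ProxyConnector

async def fetch(session, url):
    # Grab raw HTML only; media is never requested, let alone cached.
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            return url, await resp.text()
    except Exception:
        return url, None

async def crawl(urls, concurrency=50):
    # rdns=True so .onion names are resolved by Tor, not locally.
    connector = ProxyConnector.from_url("socks5://127.0.0.1:9050", rdns=True)
    sem = asyncio.Semaphore(concurrency)

    async def bounded(url):
        async with sem:
            return await fetch(session, url)

    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(crawl(["http://exampleaddressxxxxxxxx.onion/"]))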