
/tech/ - Soyence and Technology


File: Screenshot 2025-11-19 at 2….png (1.3 MB, 4078x2336)

File: Screenshot 2025-11-19 at 2….png (584.95 KB, 4060x1406)

 21351[Quote]

I'm currently working on my fourth attempt at making a decent dark web search engine.
<
My first attempt was more of a test to see if I could actually use Tor in my code. It didn't have a front end, and searching was limited to the site's description with no real search algorithm beyond a plain "LIKE '%query%'" match. It all ran in the terminal without actually serving any site to the Tor network.
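
For reference, the "search" was basically just this (a simplified sketch; the table and column names are made up for illustration):

import sqlite3

def search(query: str) -> list[tuple[str, str]]:
    # Naive substring match on the description column, nothing more.
    conn = sqlite3.connect("index.db")
    rows = conn.execute(
        "SELECT url, description FROM sites WHERE description LIKE ?",
        (f"%{query}%",),
    ).fetchall()
    conn.close()
    return rows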
<
My second attempt went better. This time I wanted to focus on the front end (the part I hate the most). I figured out how to host hidden services with Flask and, after spending way too long generating a cool domain, I finally had a working site up and running. It had the same search algo, if you can even call it that, as the last one. It also took forever (around 2 minutes without cache) to load because Nginx was being a selfish little fuck and not letting me use it with Tor (this could have been because I had 2 hidden services running at the same time, as I was working on another project (Onion365) back then). Picrel is what this attempt looked like. I got carried away and added way too much bloat to the homepage. This one had 129,644 sites indexed, though most of them were just homepages.
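
The hidden service part is simpler than it sounds: tor just forwards the .onion's port to a local Flask server. Roughly like this (an assumed minimal setup, not my exact config):

# In torrc:
#   HiddenServiceDir /var/lib/tor/search_service/
#   HiddenServicePort 80 127.0.0.1:5000
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def home():
    query = request.args.get("q", "")
    return f"You searched for: {query}"

if __name__ == "__main__":
    # Bind to localhost only; tor exposes it as the .onion address.
    app.run(host="127.0.0.1", port=5000)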
<
My third attempt was just back-end stuff again. This included improving the filters, reworking the scraper to work more efficiently, and entirely remaking the search algorithm to use tokenizers for way better search results. It had no front end, and I didn't do much testing with the tokenizers before moving on, so I'm not sure how much that actually helped.
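
The idea behind the tokenizer version, very roughly (a hedged sketch with a basic regex tokenizer; the real one is more involved):

import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Lowercase and split into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query: str, description: str) -> int:
    # Rank sites by how often any query token shows up in the description.
    query_tokens = set(tokenize(query))
    doc_tokens = Counter(tokenize(description))
    return sum(doc_tokens[token] for token in query_tokens)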
<
My current attempt is focused on remaking the crawler to be more efficient and work asynchronously (the latter I just implemented). It's already working way faster than any of my previous ones (around 0.5 seconds per scrape instead of 3-5). Another one of my goals for this is to finally get Nginx to run correctly. I'm also selectively caching websites (only the pure HTML, no media) this time, as current archival hidden services are unreliable. I plan to implement AI-assisted filters to avoid false detections from just using keywords. I will rework part of the front end (more search parameters and such), but not much will change. I have over 200k domains queued up to be crawled.
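
The core of the async fetching looks roughly like this (a simplified sketch assuming aiohttp + aiohttp_socks pointed at the local Tor SOCKS port):

import asyncio
import aiohttp
from aiohttp_socks import ProxyConnector

async def fetch(session: aiohttp.ClientSession, url: str) -> str | None:
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=60)) as resp:
            return await resp.text()
    except Exception:
        # Tor is flaky; treat any failure as "try again later".
        return None

async def crawl(urls: list[str]) -> list[str | None]:
    connector = ProxyConnector.from_url("socks5://127.0.0.1:9050")
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

# asyncio.run(crawl(["http://exampleaddress.onion/"]))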
<
I did not "vibe code" any of this. I strongly dislike that term and the people who do/promote it. I like to keep LLM usage to a minimum for my projects, but I did need the asyncio library explained before I could implement it and fix the bugs myself. As I said, I suck at front end, so I also needed some help getting the CSS to look the way I wanted.
<
I'm unsure whether it's against the rules to link to my hidden service, even though I aggressively filter out any and all pornography and erotic content from search results, so I'll hold off on that for now. If this post includes anything against the rules, it was not on purpose. I know the dark web is a touchy subject, but I just wanted to share my hobby project with you guys.
<
Leave any suggestions or questions you have ITT.

 21353[Quote]

File: 1759554340217h.png (67.64 KB, 255x252)

very aryan
you will share it here once it's done

 21355[Quote]

>>21351 (OP)
This looks gemmy. Do you have an email I can contact you at? Wordfilters will likely autoban you if you link the hidden service. You could always link a clearnet homepage that links to the service (if you're worried about getting banned, I can post it for you).

What is your filtering method like? Do you have any ideas on how to implement the AI-assisted filters? I've always wondered how that could be done in a lightweight manner.

 21371[Quote]

>>21355
Thanks. You should have gotten my email by now, if the one you put in your email field is real.

It will probably be a while until I get a working website up and running again, due to privacy concerns with owning a clearnet site and not having secured my hidden service. When I do, though, I'd be more than happy to have you test it.

I currently just get the title, description, and keywords and check for banned words in there while also accounting for leetspeak. I do this rather than searching the full website for any banned words, as mentioning anything in those meta fields is more intentional and thus might rule out some false positives.
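
The check itself is something like this (a simplified sketch; the actual banned-word list and leetspeak map are obviously much bigger):

LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"})
BANNED_WORDS = {"bannedword1", "bannedword2"}  # placeholders

def normalize(text: str) -> str:
    # Lowercase and undo common leetspeak substitutions.
    return text.lower().translate(LEET_MAP)

def is_blocked(title: str, description: str, keywords: str) -> bool:
    # Only the meta fields are checked, not the full page text.
    meta = normalize(" ".join((title, description, keywords)))
    return any(word in meta for word in BANNED_WORDS)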

In the future I hope to use something like torchtext to classify websites. This would also allow for further search filtering (like possible scam detection or categorizing hidden services), not just sorting out bad sites. I know Ahmia already does something like this (I use their blocklist and their indexed sites as a starting point), but they seem to have a few false positives.
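
None of that is implemented yet, but the rough direction would be a small text classifier along these lines (plain PyTorch, made-up label set):

import torch
from torch import nn

class SiteClassifier(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 64, num_classes: int = 4):
        super().__init__()
        # EmbeddingBag averages the token embeddings of a site's text into one vector.
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)
        # Example classes: forum / marketplace / scam / blocked.
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.embedding(token_ids, offsets))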


 21374[Quote]

File: 1763618149360.jpg (417.05 KB, 1079x685)

A soy version of the Dark Web would unironically be hilarious.

 21375[Quote]

What are you using to make this? Rust?

 21377[Quote]

>>21374
Yeah, I thought of making my own private darknet, but my other laptop's network card is too old for that, so I couldn't test it.
>>21375
I am using Python 3.11 to make it. I'll give a full list of all the libraries I used later. Although I have attempted to learn Rust several times, it's just too different from the languages I already know for me to use it effectively in these kinds of projects.

 21381[Quote]

how will you manage illegal content?

 21382[Quote]

how do you get the information on how to do that kind of project?

 21384[Quote]

>>21377
isn't python too slow for a service like this? have you done any benchmarking?

 21385[Quote]

>>21381
All erotic content will be blacklisted and excluded from search results. The same goes for any sort of animal abuse (not including niggers). Any scams and anything obviously fake like hitmen, crypto wallet shops, red rooms will also be blocked. I currently don’t have a problem with marketplaces unless they sell any aforementioned services/content or sell HRT. Forums, image boards, and other user generated content sites might get special treatment, as banning their entire domain because one guy mentioned porn once is unfair. However if they promote this content or don’t properly enforce their rules, they too will be blacklisted. I currently just check for keywords, but I will use AI to categorize websites in the future.

>>21382
I have used Python for years now and I've always liked interacting with websites through scripting. I made bots for some website /raid/ threads a couple of months back. I was frustrated by how shit all the current search engines are (either no rules at all or rule-cucked to death), so I decided to make my own. You don't need a tutorial; I certainly didn't use one. If you have enough Python experience you can try to put it all together on your own, which is how my first attempt came to be.

>>21384
For a web server? Sure, but I don't expect much traffic anyway. Maybe I'll even keep it invite-only or just use it for myself. As a scraper? Not in this case. The limiting factor for me right now is my internet connection and Tor's abysmal reliability due to the recent DDoS attacks (this is part of the reason why I cache websites). As an indexer? Again, not in this case. The dark web has way less content than the clear web and I'm in no real rush to index it all. I will just index already-cached pages to avoid having to deal with the low Tor speed. Storage is my limiting factor with indexing, not speed.

 21386[Quote]

>>21384
The only thing it might be too slow for is the searching. In the 2nd picrel you can see it took around 0.12 seconds to search through ~130,000 sites. Keep in mind this uses the older search algorithm without tokenizers; I haven't benchmarked the newer one yet. It's just not a top priority right now, as 0.1-1.2 seconds of extra wait time is pretty much unnoticeable compared to current Tor speeds.
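
The benchmark itself is nothing fancy, basically just this (a sketch; search() stands in for whichever search function is being timed):

import time

start = time.perf_counter()
results = search("example query")  # whichever search function is being benchmarked
elapsed = time.perf_counter() - start
print(f"{len(results)} results in {elapsed:.3f}s")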

 21389[Quote]

LITERALLY 'P SEARCH ENGINE WHAT IS THAT DOCTOOOOOOSS

 21391[Quote]

>>21389
ev&do I mentioned several times that I strictly block all pornographic content, adult or not

 21392[Quote]

>>21384
pretty sure most of it is I/O so it shouldn't matter too much

 21397[Quote]

Will you ever release the source code?

 21399[Quote]

What exactly is the process to scrape/discover onion services?

 21400[Quote]

>>21377
The libraries I use for my scraper are as follows (I have replaced some):
requests with requests.adapters → aiohttp with aiohttp_socks (sending network requests. Switched to aiohttp for the async functionality)
sqlite3 → aiosqlite (interacting with SQLite databases. Also switched for async)
string (for filtering)
colorama (for the looks)
asyncio (needed for async)
bs4 (for parsing html)
urllib3 (work with URL parts)
urllib (I'm not sure why there are 2 different ones, but I need both)
subprocess (for auto starting tor)
time (for benchmarking and keeping track of when websites were last checked)
hashlib (to hash content and urls)
aiohttp_socks (you need pysocks installed for this to work)


>>21397
Once I am finished with it, I will release at least part of the source code. I might release some before that, but I don't know when I will be satisfied enough with the result to call it complete. If it's good enough, I will submit it to SoyGNU, although I'm not sure if they want this kind of stuff on there.


>>21399
Use a network request library of your choice (requests, httpclient, aiohttp, etc…) and make sure it supports SOCKS5 proxies (you need pysocks installed as well). Install tor (I don't mean the browser); the default SOCKS port should be 9050. Set the proxy config in your request library to either "socks5h://localhost:9050" (requests) or "socks5://localhost:9050" (aiohttp). Now you should be able to send requests to .onion sites (hidden services). I suggest you use a very common Windows user agent, as some sites block you when you don't send one.
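
With requests, that whole setup is only a few lines (a sketch; the onion address is a placeholder):

import requests

PROXIES = {
    "http": "socks5h://localhost:9050",
    "https": "socks5h://localhost:9050",
}
HEADERS = {
    # A common Windows user agent so picky sites don't reject the request.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
}

response = requests.get("http://exampleaddress.onion/", proxies=PROXIES, headers=HEADERS, timeout=60)
print(response.status_code)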
<
Once you have the basic network requests done, it gets a bit easier. Request the site you want to scrape. Use Beautiful Soup 4 (bs4) to get all the "a" (anchor) elements that have a non-empty href field. Get the values of all those href fields, turn them into absolute URLs (e.g., "/about.html" → "abc234….onion/about.html"), and remove any query parameters (you can keep them, it just gets very messy) and "#" fragments (they only point to anchors inside a document, so they won't give you any new sites).
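
That step, roughly (a sketch using bs4 and urllib.parse):

from urllib.parse import urljoin, urldefrag, urlsplit, urlunsplit
from bs4 import BeautifulSoup

def extract_links(page_url: str, html: str) -> set[str]:
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        absolute = urljoin(page_url, a["href"])    # "/about.html" -> "http://abc234….onion/about.html"
        absolute, _fragment = urldefrag(absolute)  # drop "#..." fragments
        parts = urlsplit(absolute)
        # Rebuild the URL without query parameters.
        links.add(urlunsplit((parts.scheme, parts.netloc, parts.path, "", "")))
    return links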
<
For a starting point, you could use Ahmia's indexed domain list, torskan's indexed site list, or a reputable link list. I won't link to any of these because they aren't 100% 'p free (something that I want to beat). I highly recommend you use Ahmia's blocklist and remove sites that match any of its entries.

 21401[Quote]

Buying meth on soyjak.browserty before GTA 6

 21407[Quote]

>>21400
Never mind, I might switch back to regular requests. aiohttp and httpclient are too unreliable when sending bulk requests (up to 100 at a time). I'm probably just using them wrong, but I have hardly any experience with async. From my benchmarking, the normal requests library seems to be not only faster but also more reliable. My old code already made requests work somewhat asynchronously, and that's good enough for now.
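
If I do go back to plain requests, one simple way to keep some concurrency is a thread pool (a sketch, not necessarily what my old code did):

from concurrent.futures import ThreadPoolExecutor
import requests

PROXIES = {"http": "socks5h://localhost:9050", "https": "socks5h://localhost:9050"}

def fetch(url: str) -> str | None:
    try:
        return requests.get(url, proxies=PROXIES, timeout=60).text
    except requests.RequestException:
        return None

def fetch_many(urls: list[str]) -> list[str | None]:
    # 20 worker threads; each one blocks on its own request.
    with ThreadPoolExecutor(max_workers=20) as pool:
        return list(pool.map(fetch, urls))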

>>21401
geg

 21410[Quote]

What do (You) use the dark web for?

 21413[Quote]

>>21410
I like to explore it and find interesting sites/tools. That's mostly why I started this project. It'd also be cool to make some money on the side there, but I haven't tried so far and I sadly lost my only monero hardware wallet in a tragic boating accident.

 21424[Quote]

>>21400
don't use sqlite if ur database is actually gonna be huge btw

 21426[Quote]

>>21424
I tried MangoDB but it was way harder to set up and had longer query times. Do you know of anything better? Preferably something that works well in python.

 21427[Quote]

File: 1760473012297a.gif (906.58 KB, 300x409)

>>21426
>mangoDB
>>21426
what OS are you on? i'd recommend just locally hosting a db since sqlite doesn't scale well

 21431[Quote]

>>21427
MacOS but I could switch to my linux computer if it's necessary, that one's just not as fast

 21441[Quote]

We should make one with Rust for the backend and either iced or JavaFX for the frontend

 21443[Quote]

>rust

 21447[Quote]

>>21443
despite me not liking rust it'd probably work for this tbf
(not op)

 21449[Quote]

>>21441
Isn't iced a gui library?

 21451[Quote]

>>21441
>rust for frontend
kys retard

 21452[Quote]

>>21451
Iced is a good library and I am not touching any electron or react jeetslop with a ten foot pole
I also offered JavaFX as another option which is actually pretty comfy to use

 21453[Quote]

>>21452
use qt/gtk
javafx is a dnb

 21455[Quote]

Stop talking about UI libraries. This is a website and I'm making it in pure HTML and CSS. I might make it a stand-alone program later, but that's the lowest priority at the moment. Also, isn't async horrible to do in Rust?

 21457[Quote]

>>21441
iced.rs is the only thing I like about rust, and that's probably only because Kraken Desktop uses it.

 21463[Quote]

>>21377
Using Rust for this is a stupid idea, but so is Python. Python is really not good for making a search engine; if you want something well optimized, try C++. You can use basically anything for the frontend, except JavaScript React autism, and if you use that you should kill yourself.


