I want to know what /tech/ thinks about this subject.
I tried to talk about this on the sharty a few weeks ago, but I'd forgotten this board existed at the time.
However, since Quote mentioned one of the papers that originally inspired this post in his request for that archive website to stop scraping us, I think now's a good time to bring the topic back.
Link to the paper on arXiv if you don't trust my PDF:
https://arxiv.org/abs/2602.16800
My thoughts are that while the paper is scary, reading it reveals we still have a couple of years until things get really bad. Small communities are more at risk right now than larger ones, though, since effectiveness drops off as the candidate pool grows.
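To see why pool size matters, here's a toy simulation, nothing to do with the paper's actual model and with made-up numbers: give the true author a noisy similarity score that's higher on average than everyone else's, then check how often they still come out on top as the pool grows.
[code]
import random

random.seed(1)

def recall_at_1(pool_size, trials=2000):
    # Fraction of trials where the true author outscores every distractor.
    hits = 0
    for _ in range(trials):
        true_score = random.gauss(1.0, 0.5)   # true author scores higher on average
        best_distractor = max(random.gauss(0.0, 0.5) for _ in range(pool_size - 1))
        if true_score > best_distractor:
            hits += 1
    return hits / trials

for n in (10, 100, 1000):
    print(n, recall_at_1(n))
[/code]
Even with a matcher whose per-pair accuracy never changes, top-1 accuracy craters once there are thousands of plausible candidates.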
The 67% recall rate mentioned in their abstract also comes from accounts the authors admit
>are likely easier to deanonymize than an average profile
The success rate is much lower when matching posts made by the same RedditBVLL across different subreddits (8.5% on average, although it reached 48.1% for very active users).
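For anyone wondering what "matching posts by the same user" even looks like mechanically, here's a crude sketch. This is NOT the paper's method (they use an LLM end to end); it's just character n-gram TF-IDF cosine similarity as a stand-in for writing-style features, with made-up posts.
[code]
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up candidate authors and an unattributed post.
known = {
    "anon_a": "desu I reckon the compiler is at fault here, lurk more",
    "anon_b": "Honestly, one should always verify the checksums first.",
}
unknown = "desu the linker is at fault here, lurk more"

vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
tfidf = vec.fit_transform(list(known.values()) + [unknown])
scores = cosine_similarity(tfidf[-1], tfidf[:-1])[0]
print(max(zip(known, scores), key=lambda t: t[1]))  # best stylistic match
[/code]
The LLM version in the paper is doing something far richer than n-gram overlap (it picks up on topics, phrasing habits, biographical slips), which is exactly why it costs so much per query.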
So while it's not as bad as the abstract makes it out to be, these numbers are guaranteed to rise as LLMs get better.
Not really sure what an individual could do to mitigate this besides being extremely careful with what you write and how you write it, and including red herrings in your posts.
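If you want to audit yourself, the same crude similarity trick from above works in reverse: score a draft against your own post history and rewrite until the score drops. A minimal sketch, with an arbitrary cutoff and placeholder posts:
[code]
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

my_old_posts = [
    "placeholder old post one, pretend this is your history",
    "placeholder old post two, same deal",
]
draft = "the post you are about to make goes here"

vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
m = vec.fit_transform(my_old_posts + [draft])
score = cosine_similarity(m[-1], m[:-1]).max()
if score > 0.3:  # arbitrary cutoff, tune it yourself
    print("reads too much like you, rewrite it:", score)
[/code]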
On the other hand I'm optimistic we could use this to our advantage when doxing in the future.
Thoughever, this would require someone to set up a pipeline like the one mentioned in the paper to keep costs down, and even then, at $1-$4 per query, costs could balloon very quickly, so it would have to be limited to trusted users and only used on doxing targets we otherwise can't make progress against.
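Something like this is what I have in mind for keeping costs sane. The two-stage structure (cheap local filter first, then the paid model only on a shortlist, under a hard budget cap) is the idea; every name, price and threshold below is my own assumption, not the paper's:
[code]
import random

BUDGET_USD = 50.0
COST_PER_QUERY = 4.0   # worst case of the $1-$4 range above
SHORTLIST = 20

def cheap_score(target_posts, candidate_posts):
    # Placeholder: a real version would run local stylometry here.
    return random.random()

def llm_match(target_posts, candidate_posts):
    # Placeholder: a real version would call the paid API here.
    return False

def run(target_posts, candidates):
    # Stage 1: shortlist candidates with the free local scorer.
    shortlist = sorted(candidates,
                       key=lambda name: cheap_score(target_posts, candidates[name]),
                       reverse=True)[:SHORTLIST]
    # Stage 2: spend money only on the shortlist, never past the cap.
    spent = 0.0
    for name in shortlist:
        if spent + COST_PER_QUERY > BUDGET_USD:
            break  # hard stop so costs can't balloon
        spent += COST_PER_QUERY
        if llm_match(target_posts, candidates[name]):
            return name, spent
    return None, spent

print(run(["target's posts"], {f"anon_{i}": ["their posts"] for i in range(100)}))
[/code]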
I know there's that one site that already uses AI on data breaches, but it runs a local model that's nowhere near the level of what was used in this paper.