>>16439569
>If your goal is to make a chatbot that talks like posters from Soyjak communities, the general process is:

1. Collect data
* Crawl pages from soyjakwiki.org.
* Archive thread data from soyjak.party (respecting the site's rules and terms).
* Extract only the text you want the model to learn from.
* Remove duplicates, broken posts, spam floods, and irrelevant content.
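For step 1, a minimal sketch of the duplicate and broken-post filter could look like the following. It assumes the posts are already extracted into plain strings. The `min_chars` threshold and the function names are illustrative choices, not something the process above prescribes.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical posts hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_posts(posts: list[str], min_chars: int = 10) -> list[str]:
    """Drop very short posts and exact duplicates (compared after normalization)."""
    seen: set[str] = set()
    kept: list[str] = []
    for post in posts:
        norm = normalize(post)
        if len(norm) < min_chars:  # assumed cutoff for empty or broken posts
            continue
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:  # same text seen before, e.g. a repeated flood post
            continue
        seen.add(digest)
        kept.append(post.strip())
    return kept


if __name__ == "__main__":
    raw = ["Hello  there, friend", "hello there, friend", "k", "A different post entirely"]
    print(clean_posts(raw))
```

This only catches exact repeats after normalization. Near-duplicates (one changed word, different punctuation) would need fuzzy matching on top.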
2. Clean the dataset
* Convert posts into a conversational format.
* Remove personally identifying information.
* Decide whether to keep things like greentexts, slang, emojis, image captions, and reaction phrases.
* Filter out content you don't want the model reproducing.
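For the PII bullet in step 2, a rough first pass can be done with regular expressions. The patterns and placeholder tokens below are assumptions for illustration. They will miss plenty of cases (names, handles, addresses) and can produce false positives on long digit runs, so treat this as a starting point and not a complete scrubber.

```python
import re

# Illustrative patterns only; real PII removal needs broader coverage and manual review.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{8,}\d")


def scrub_pii(text: str) -> str:
    """Replace emails, IPv4 addresses, and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    # IPs are replaced before phones because the phone pattern would also match them.
    text = IPV4_RE.sub("[IP]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    print(scrub_pii("mail me at anon@example.com from 192.168.0.1"))
    print(scrub_pii("call +1 555-123-4567 now"))
```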
3. Choose a base model
* Small models: Llama 3 derivatives, Qwen, Mistral.
* For a hobby project, a 7B–14B parameter model is often enough.
4. Fine-tune
* Use LoRA/QLoRA rather than training from scratch.
* Convert your dataset into instruction/chat format:
```json
{
"messages": [
{"role":"user","content":"What do you think of X?"},
{"role":"assistant","content":"Typical /soy/ style response"}
]
}
```
* Train on a GPU using frameworks such as [Axolotl](https://github.com/axolotl-ai-cloud/axolotl), [Unsloth](https://github.com/unslothai/unsloth), or [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).
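The conversion in step 4 can be sketched as follows: each prompt/reply pair is wrapped in the `messages` layout from the JSON example and written as one JSON object per line (JSON Lines). The helper names are hypothetical. Whether a given framework expects exactly these field names is an assumption, so check its dataset documentation before training.

```python
import json


def to_chat_record(prompt: str, reply: str) -> dict:
    """Wrap one prompt/reply pair in the messages layout from the JSON example."""
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": reply},
        ]
    }


def to_jsonl(pairs: list[tuple[str, str]]) -> str:
    """Serialize pairs as JSON Lines: one JSON object per line."""
    return "\n".join(
        json.dumps(to_chat_record(prompt, reply), ensure_ascii=False)
        for prompt, reply in pairs
    )


if __name__ == "__main__":
    pairs = [("What do you think of X?", "Typical /soy/ style response")]
    print(to_jsonl(pairs))
```

Writing the returned string to a file gives a dataset with one training example per line. `ensure_ascii=False` keeps emojis and other non-ASCII characters readable instead of escaping them.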
5. Add retrieval (optional)
* Instead of forcing the model to memorize everything, store Soyjak Wiki articles in a vector database.
* Let the model search them when answering questions.
* This usually works better than pure fine-tuning for factual wiki content.
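To show the idea behind step 5 without extra dependencies, here is a toy in-memory index. It is a stand-in, not a real vector database: the "embedding" is just word counts compared by cosine similarity, where a real setup would use a sentence-embedding model and a proper vector store. `TinyIndex`, `embed`, and `cosine` are hypothetical names.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (assumption; swap in a real model)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyIndex:
    """In-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self._docs: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._docs.append((text, embed(text)))

    def search(self, query: str, k: int = 3) -> list[str]:
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self._docs, key=lambda d: cosine(q, d[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    index = TinyIndex()
    index.add("The wiki was founded to document board history and memes.")
    index.add("GPU rental prices vary by provider.")
    print(index.search("when was the wiki founded", k=1))
```

The retrieved articles are then pasted into the prompt, so the model reads the facts at answer time instead of having to recall them from its weights.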
### Hardware
For a small hobby project:
* 7B model + QLoRA: 16–24 GB VRAM.
* 14B model + QLoRA: 24–48 GB VRAM.
* Renting GPUs is often cheaper than buying one.
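As a rough sanity check on those ranges, the quantized weights alone are easy to estimate: at 4 bits per parameter, each parameter takes half a byte. The helper below is a back-of-the-envelope sketch with a hypothetical name. It covers the weights only. Activations, LoRA adapter weights, optimizer state, and framework overhead come on top and depend on sequence length and batch size, which is why the practical figures are well above this floor.

```python
def quantized_weight_gb(n_params: float, bits_per_param: float = 4.0) -> float:
    """Approximate size of the quantized base weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    for name, n_params in [("7B", 7e9), ("14B", 14e9)]:
        print(f"{name}: about {quantized_weight_gb(n_params):.1f} GB of 4-bit weights")
```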
### Important limitation
A model trained heavily on Soyjak Party threads will tend to reproduce the language patterns found there, including offensive, hateful, harassing, or otherwise toxic content. If you intend to distribute the model publicly, you'll want additional filtering and moderation during dataset preparation and inference.
### Alternative approach
If your goal is specifically "make ChatGPT sound like /soy/" rather than fine-tuning your own model, a cheaper approach is:
* Download a large corpus of /soy/ posts.
* Put them into a vector database.
* Use a local model such as Llama 3 or Qwen 3.
* Give the model a system prompt describing the culture and slang.
* Retrieve relevant posts as examples before each response.
That can get surprisingly close to a "/soy/GPT" without needing a full fine-tuning run.
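The last two bullets amount to assembling a prompt before each reply. A minimal sketch, assuming the retrieved posts are already available as a list of strings and that your local runtime accepts a list of `role`/`content` messages (an assumption to verify for whatever you run the model with). `build_messages` is a hypothetical helper.

```python
def build_messages(
    system_prompt: str,
    examples: list[str],
    user_message: str,
    max_examples: int = 5,
) -> list[dict]:
    """Combine the style description, retrieved example posts, and the user's message."""
    shots = "\n".join(f"- {ex}" for ex in examples[:max_examples])
    system = system_prompt
    if shots:
        system += "\n\nExample posts showing the target style:\n" + shots
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_message},
    ]


if __name__ == "__main__":
    messages = build_messages(
        "Reply in the board's slang and tone.",
        ["first retrieved post", "second retrieved post"],
        "What do you think of X?",
    )
    for message in messages:
        print(message["role"], "->", message["content"])
```

Capping the number of examples keeps the prompt within the model's context window. Which posts get picked matters more than how many.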