view article Article Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective 29 days ago • 57
Parallel Sentences Datasets Collection These datasets all have "english" and "non_english" columns for numerous datasets. They can be used to make embedding models multilingual. • 14 items • Updated Dec 10, 2025 • 21
RedPajama: an Open Dataset for Training Large Language Models Paper • 2411.12372 • Published Nov 19, 2024 • 56