Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs (arXiv:2506.19290)
Data Efficacy for Language Model Training (arXiv:2506.21545)
Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents (arXiv:2507.04009)
RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs (arXiv:2507.03253)
Scaling Laws for Optimal Data Mixtures (arXiv:2507.09404)
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning (arXiv:2507.16746)
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining (arXiv:2508.10975)
TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training (arXiv:2508.17677)
PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits (arXiv:2509.11362)
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing (arXiv:2509.24900)
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI (arXiv:2512.16676)
Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs (arXiv:2601.17058)
OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration (arXiv:2602.05400)
DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning (arXiv:2602.16742)