about an hour ago
I love you either way
about 2 hours ago
We operate under the Creative Commons Attribution-NonCommercial principle, sharing knowledge freely and openly. We welcome your ideas, suggestions, and feedback to help shape ScalingOpt's future.
If you find this initiative valuable, please consider following and starring the project to show your support. Thank you!
about 2 hours ago
- proxima-ocr-d.markdown-post3.0.l:
- epsilon-ocr-d.markdown-post3.0.m:
- proxima-ocr-d.markdown-post3.0.l-gguf:
- epsilon-ocr-d.markdown-post3.0.m-gguf:
- Collection:
- Multimodal Apps:
These models are stage-progression models and may currently contain artifacts.
To learn more, visit the app page or the respective model page!
about 3 hours ago
57% say privacy is the biggest AI barrier (IBM). 48% share company data anyway (Cisco).
CRAFT Framework: privacy through architecture, not policy.
Beta: February 2026
Follow → craftframework.ai
about 7 hours ago
First up: our new Raiden-Mini dataset, powered by DeepSeek's newest model!
- A V3.2-Speciale reasoning showcase: the Raiden prompts test the model's creative, analytic, and general reasoning skills!
- HEAD TO HEAD: a comparison subset pits V3.2-Speciale against V3.2 with the same prompts, providing a direct look at each model's advantages!
Get the new Raiden-Mini dataset:
On the model side, we've also brought Shining Valiant 3 to Ministral 3!
- Science-reasoning: for physics, biology, chemistry, compsci, astronomy, Earth science, and information theory.
- AI to build AI: the dataset for high-quality reasoning performance on AI, MLOps, math and CUDA, complex adaptive and agentic systems, cognition, logic, linguistics, simulation, knowledge management, and more!
- Creative reasoning and general chat performance supplemented with
Get the newest SV3:
Esper 3.1 is available for Ministral 3 as well:
We're working hard on our next Big New Release, coming out in the next few weeks :)
Help support our releases, donations used for models and datasets:
Open source matters. Fight for it with us.
with love and friendship,
allegra
about 11 hours ago
Replicating every expert on every GPU wastes VRAM. Slicing them with tensor parallelism (TP) wastes bandwidth. The solution is Expert Parallelism (EP), which distributes the experts themselves across GPUs and routes tokens to wherever their "chosen" expert lives.
The hardware catch? It isn't matrix splitting or pipeline bubbles; it's the "Router's Dilemma." You must shuffle massive volumes of tokens across the cluster using All-to-All communication, and any imbalance can leave expensive GPUs idle.
My latest guide dives into the mechanics of EP and why the interconnect becomes the ultimate bottleneck.
In this breakdown, we explore:
The Token Routing Lifecycle
A four-step hardware flow: Local routing to pick experts, Dispatch (All-to-All shuffle), Expert computation on the "home" GPU, and Combine (another All-to-All to return results).
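To make the flow concrete, here is a minimal sketch of those four steps, assuming one expert per rank, top-1 routing, and an already-initialized torch.distributed process group (e.g. launched via torchrun with NCCL). The `gate` and `expert` names are placeholders for illustration, not the article's implementation.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

def ep_forward(x: torch.Tensor, gate: nn.Linear, expert: nn.Module) -> torch.Tensor:
    """Sketch of EP with num_experts == world_size and top-1 routing."""
    world_size = dist.get_world_size()

    # 1) Local routing: each token picks its expert (= destination rank) via the gate.
    logits = gate(x)                              # [tokens, num_experts]
    dest = logits.argmax(dim=-1)                  # top-1 expert id per token

    # Sort tokens by destination so each rank's slice is contiguous for the shuffle.
    order = torch.argsort(dest)
    x_sorted = x[order]
    send_counts = torch.bincount(dest, minlength=world_size)

    # Exchange counts so every rank knows how much it will receive from each peer.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # 2) Dispatch: All-to-All shuffle of token activations to their "home" experts.
    recv_buf = x_sorted.new_empty(int(recv_counts.sum()), x.size(-1))
    dist.all_to_all_single(
        recv_buf, x_sorted,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )

    # 3) Expert computation on the resident GPU
    #    (the expert is assumed to preserve the hidden size, like a standard MoE FFN).
    out_local = expert(recv_buf)

    # 4) Combine: reverse All-to-All returns results to the tokens' original ranks.
    out_sorted = x_sorted.new_empty(x_sorted.shape)
    dist.all_to_all_single(
        out_sorted, out_local,
        output_split_sizes=send_counts.tolist(),
        input_split_sizes=recv_counts.tolist(),
    )

    # Undo the sort so outputs line up with the original token order.
    out = torch.empty_like(out_sorted)
    out[order] = out_sorted
    return out
```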
The All-to-All Primitive
Unlike the ring-based syncs in TP, All-to-All creates a dense mesh of personalized data transfers. We compare it to All-Reduce and show why uneven token distribution (load imbalance) causes network congestion and compute skew.
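As a rough illustration of that difference (assuming torchrun with NCCL and one GPU per rank; the split sizes are toy values, not measurements):

```python
import torch
import torch.distributed as dist

rank, world = dist.get_rank(), dist.get_world_size()
dev = torch.device("cuda", rank)

# All-Reduce (TP-style sync): every rank moves the SAME fixed-size buffer,
# so traffic is symmetric and predictable.
grad = torch.ones(1024, device=dev)
dist.all_reduce(grad, op=dist.ReduceOp.SUM)

# All-to-All (EP dispatch): each rank sends a personalized, possibly different
# amount to every peer. A skewed router makes these splits uneven, so one link
# carries most of the bytes while the slowest transfer gates the whole step.
send_splits = [(rank + dst) % world + 1 for dst in range(world)]   # toy uneven splits
send_buf = torch.randn(sum(send_splits), 16, device=dev)

recv_splits = torch.empty(world, dtype=torch.int64, device=dev)
dist.all_to_all_single(recv_splits, torch.tensor(send_splits, device=dev))
recv_buf = torch.empty(int(recv_splits.sum()), 16, device=dev)
dist.all_to_all_single(recv_buf, send_buf,
                       output_split_sizes=recv_splits.tolist(),
                       input_split_sizes=send_splits)
```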
Load Balancing: The Hardware Nightmare
If one expert gets 90% of the tokens, its GPU bottlenecks while others stall. We discuss mitigation strategies like token dropping and auxiliary losses to keep utilization high.
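A minimal sketch of both mitigations, assuming a Switch-Transformer-style top-1 router; the function names are illustrative and not taken from the article:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Auxiliary loss that nudges the router toward a uniform token split."""
    probs = router_logits.softmax(dim=-1)                        # [tokens, experts]
    top1 = probs.argmax(dim=-1)                                  # chosen expert per token
    frac_tokens = F.one_hot(top1, num_experts).float().mean(0)   # f_i: share of tokens per expert
    frac_probs = probs.mean(0)                                   # P_i: mean router prob per expert
    return num_experts * torch.sum(frac_tokens * frac_probs)     # minimized by a uniform split

def drop_overflow(token_ids: torch.Tensor, capacity: int) -> torch.Tensor:
    """Capacity-based token dropping for the tokens routed to ONE expert.

    Tokens past the capacity are simply dropped; in practice they flow through
    the layer's residual connection instead of being computed by the expert.
    """
    return token_ids[:capacity]
```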
The article includes a raw PyTorch implementation of an EP layer using torch.distributed.all_to_all_single to reveal exactly how the data shuffles and where the stalls happen.
Read the full hardware-centric guide here:
about 19 hours ago
Has anyone seen similar ones?
TIA
about 20 hours ago
Can this leaderboard be saturated in 2026?
about 21 hours ago
"In the long run, we expect our methods to automatically represent, transform, and blend data to optimize measurable abilities in pre-training. Our work on modeling data taxonomies led to new approaches for jointly clustering and mixing data distributions under data repetition penalties. Many improvements in our STEM abilities can be traced back to this. "
This resonates with the recent work we did around optimal dataset mixing for pretraining where we saw have the right mix can increase the efficiency of training -
about a day ago
Don't forget to check out our latest amazing training tutorial:
Z-Image Turbo LoRA training with AI Toolkit and Z-Image ControlNet Full Tutorial for Highest Quality