upgraedd
about an hour ago
I've been homeless 10 years. Nobody to call, nobody helping. I started this project to give back to a world that has discarded me. The only wish I have is that you all see that one day before I die out here.



I love you either way 𒀭ꙮ
Juanxi
about 2 hours ago
ScalingOpt is continuously evolving! We are steadily expanding the Community section with new content. Our Blog has launched with featured work from Jianlin Su, and we are actively translating insightful posts from scientific communities into English to share on ScalingOpt (we'll keep curating excellent community blogs and providing English versions alongside the originals).

We operate under the Creative Commons Attribution-NonCommercial principle, sharing knowledge freely and openly. We welcome your ideas, suggestions, and feedback to help shape ScalingOpt's future.

If you find this initiative valuable, please consider following and starring the project to show your support. Thank you!
prithivMLmods
about 2 hours ago
Introducing the D.Markdown Experimental Models: Proxima and Epsilon, OCR models built on top of Qwen3-VL and Qwen2.5-VL respectively. Proxima is optimized for Markdown generation and can embed inline programming code snippets and generate rich structured output such as HTML, XML, JSON, and YAML. Epsilon is optimized for reconstructing complex layouts, including tables, forms, and mathematical content. 🌌✨

โ— proxima-ocr-d.markdown-post3.0.l:
โ— epsilon-ocr-d.markdown-post3.0.m:
โ— proxima-ocr-d.markdown-post3.0.l-gguf:
โ— epsilon-ocr-d.markdown-post3.0.m-gguf:

โ— Collection:
โ— Multimodal Apps:

👉 These are stage-progression models, and they may currently contain artifacts.

To learn more, visit the app page or the respective model pages!
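
For reference, a hedged usage sketch following the generic Qwen2.5-VL inference pattern in transformers; the repo id, image file, and prompt below are assumptions, so check the model cards for the recommended setup:

```python
# Minimal OCR inference sketch (repo id assumed from the post; verify on the Hub).
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image

repo = "prithivMLmods/epsilon-ocr-d.markdown-post3.0.m"  # assumption
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    repo, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(repo)

image = Image.open("document.png")                       # any page scan
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Reconstruct this page as Markdown."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```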
CRAFTFramework
about 3 hours ago
๐Ÿ” The AI Privacy Trade-Off

57% say privacy is the biggest barrier to AI adoption (IBM). 48% share company data with AI tools anyway (Cisco).

CRAFT Framework: privacy through architecture, not policy.

Beta: February 2026
Follow → craftframework.ai
sequelbox
about 7 hours ago
Two new releases today!

First, our new Raiden-Mini dataset, powered by DeepSeek's newest model!
- A V3.2-Speciale reasoning showcase: the Raiden prompts test the model's creative, analytic, and general reasoning skills!
- HEAD TO HEAD: a comparison subset pits V3.2-Speciale against V3.2 on the same prompts, providing a direct look at each model's advantages!

Get the new Raiden-Mini dataset:

On the model side, we've also brought Shining Valiant 3 to Ministral 3!
- Science reasoning: for physics, biology, chemistry, compsci, astronomy, Earth science, and information theory.
- AI to build AI: the dataset for high-quality reasoning performance on AI, MLOps, math and CUDA, complex adaptive and agentic systems, cognition, logic, linguistics, simulation, knowledge management, and more!
- Creative reasoning and general chat performance supplemented with

Get the newest SV3:

Esper 3.1 is available for Ministral 3 as well:

We're working hard on our next Big New Release, coming out in the next few weeks :)

Help support our releases; donations are used for models and datasets:

Open source matters. Fight for it with us.

with love and friendship,
allegra
flozi00
about 11 hours ago
We have covered Tensor Parallelism for slicing matrices and Pipeline Parallelism for stacking layers. But what if your model isn't just deep or wide, but a sprawling Mixture-of-Experts (MoE) architecture like Mixtral or DeepSeek, with hundreds of billions of parameters that are mostly idle for any given token?

Replicating those experts wastes VRAM. Slicing them with TP wastes bandwidth. The solution is Expert Parallelism (EP), which distributes the experts themselves across GPUs and routes tokens to wherever their "chosen" expert lives.

The hardware catch? It is not matrix splitting or pipeline bubbles; it's the "Router's Dilemma." You must shuffle massive volumes of tokens across the cluster using All-to-All communication, and any imbalance can leave expensive GPUs idle.

My latest guide dives into the mechanics of EP and why the interconnect becomes the ultimate bottleneck.

In this breakdown, we explore:

The Token Routing Lifecycle
A four-step hardware flow: Local routing to pick experts, Dispatch (the All-to-All shuffle), Expert computation on the "home" GPU, and Combine (another All-to-All to return results).
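
As a rough sketch of that flow (assuming one expert per rank, top-1 routing, and an initialized process group; not the guide's full implementation):

```python
# Minimal EP forward pass: route, dispatch, compute, combine.
import torch
import torch.distributed as dist

def ep_forward(tokens, router_logits, expert):
    world = dist.get_world_size()
    # Step 1 -- Local routing: each token picks its expert (= destination rank).
    dest = router_logits.argmax(dim=-1)                  # [num_tokens]
    order = dest.argsort()                               # group tokens by destination
    sorted_tokens = tokens[order]
    send_splits = torch.bincount(dest, minlength=world)
    recv_splits = torch.empty_like(send_splits)
    # Exchange split sizes so every rank knows how much it will receive.
    dist.all_to_all_single(recv_splits, send_splits)
    recv = sorted_tokens.new_empty(int(recv_splits.sum()), tokens.shape[-1])
    # Step 2 -- Dispatch: All-to-All shuffles tokens to their expert's home GPU.
    dist.all_to_all_single(recv, sorted_tokens,
                           output_split_sizes=recv_splits.tolist(),
                           input_split_sizes=send_splits.tolist())
    # Step 3 -- Expert computation on whatever tokens landed here.
    out = expert(recv)
    # Step 4 -- Combine: the reverse All-to-All returns results to their sources.
    combined = sorted_tokens.new_empty(sorted_tokens.shape)
    dist.all_to_all_single(combined, out,
                           output_split_sizes=send_splits.tolist(),
                           input_split_sizes=recv_splits.tolist())
    return combined[order.argsort()]                     # restore original order
```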

The All-to-All Primitive
Unlike the ring-based syncs in TP, All-to-All creates a dense mesh of personalized data transfers. We compare it to All-Reduce and show why uneven token distribution (load imbalance) causes network congestion and compute skew.
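
A toy contrast between the two primitives (assumed to run under torchrun with NCCL; shapes are illustrative):

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank)

# All-Reduce (the TP-style sync): every rank ends up with the SAME summed tensor.
x = torch.full((8,), float(rank), device="cuda")
dist.all_reduce(x)                    # x is now identical on all ranks

# All-to-All (the EP-style shuffle): every rank sends a DIFFERENT chunk to each
# peer; make the chunks uneven and you have the load-imbalance problem.
send = torch.arange(world, dtype=torch.float32, device="cuda")
recv = torch.empty(world, device="cuda")
dist.all_to_all_single(recv, send)    # recv[i] arrived from rank i
```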

Load Balancing: The Hardware Nightmare
If one expert gets 90% of the tokens, its GPU bottlenecks while the others stall. We discuss mitigation strategies like token dropping and auxiliary losses to keep utilization high.
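
For reference, a sketch of the standard Switch-style auxiliary balancing loss (not necessarily the exact formulation in the guide):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, num_experts):
    """Switch-style auxiliary loss: equals 1 at perfect balance, grows with skew."""
    probs = F.softmax(router_logits, dim=-1)            # [num_tokens, num_experts]
    top1 = probs.argmax(dim=-1)
    # f_i: fraction of tokens actually dispatched to expert i.
    dispatch_frac = F.one_hot(top1, num_experts).float().mean(dim=0)
    # P_i: mean router probability mass assigned to expert i.
    prob_frac = probs.mean(dim=0)
    return num_experts * torch.sum(dispatch_frac * prob_frac)
```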

The article includes a raw PyTorch implementation of an EP layer using torch.distributed.all_to_all_single to reveal exactly how the data shuffles and where the stalls happen.

Read the full hardware-centric guide here:
melvindave
about 19 hours ago
Looking for receipt and document datasets such as this for OCR purposes.

Has anyone seen similar ones?

TIA
codelion
about 21 hours ago
Recently, Essential AI released a new 8B base model; they highlighted the importance of the data mix for pretraining -

"In the long run, we expect our methods to automatically represent, transform, and blend data to optimize measurable abilities in pre-training. Our work on modeling data taxonomies led to new approaches for jointly clustering and mixing data distributions under data repetition penalties. Many improvements in our STEM abilities can be traced back to this."

This resonates with our recent work on optimal dataset mixing for pretraining, where we saw that having the right mix can increase the efficiency of training -
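
As a toy illustration of what a mixture looks like in practice, here is a minimal sketch; the source names and weights are hypothetical, not the mix from either work:

```python
# Hypothetical mixture weights over pretraining data sources; finding the
# right values is exactly the optimization problem discussed above.
import random

mix = {"web": 0.5, "code": 0.2, "stem": 0.2, "books": 0.1}

def next_source(mix):
    # Draw the source of the next training sample in proportion to its weight.
    names, weights = zip(*mix.items())
    return random.choices(names, weights=weights, k=1)[0]

print([next_source(mix) for _ in range(8)])  # e.g. ['web', 'stem', 'web', ...]
```

At the dataset level, `datasets.interleave_datasets(..., probabilities=...)` from the Hugging Face datasets library implements the same idea.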
MonsterMMORPG
about a day ago

Don't forget to check out our latest amazing training tutorial:

Z-Image Turbo LoRA training with AI Toolkit and Z-Image ControlNet Full Tutorial for Highest Quality
