DiffusionGemma rollout and Copilot CLI updates #45
Today's Letter
- Google DeepMind, DiffusionGemma unveiled
- GitHub Copilot CLI, language server integration outlined
- NVIDIA, DiffusionGemma deployment stack detailed
Google DeepMind, DiffusionGemma unveiled

- Google DeepMind introduced DiffusionGemma on June 10, 2026 as a new text generation model.
- The company says the release targets 4x faster text generation.
- Google lists DiffusionGemma as a 26B model and says it can exceed 1,000 tokens per second on a single NVIDIA H100.
- Google also says the model can exceed 700 tokens per second on an NVIDIA GeForce RTX 5090.
- Google positions the release within the Gemma model line and references Gemma 4 and Gemini Diffusion alongside it.
- DiffusionGemma is released under the Apache 2.0 license.
Source: blog.google
More: deepmind.google
GitHub Copilot CLI, language server integration outlined

- GitHub published a June 10 post on connecting GitHub Copilot CLI to language servers for stronger code-aware terminal assistance.
- The post positions language-server data as a way to give Copilot CLI project context beyond plain file and shell input.
- The workflow centers on GitHub Copilot CLI, and the published material references version 4.5.14.
- GitHub frames the setup as a lightweight addition that can be tried in about 5 minutes.
- The update is aimed at developers who want terminal AI help to work with real code structure instead of only prompt text.
- The article is a GitHub Blog guide rather than a standalone product launch note, so it emphasizes integration workflow and developer usage.
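
For readers unfamiliar with what a language server actually speaks, here is a minimal sketch of Language Server Protocol message framing: a JSON-RPC 2.0 body preceded by a Content-Length header. It is not taken from the GitHub post and says nothing about how Copilot CLI is configured. The helper names and the project path are hypothetical, and only the framing and the `initialize` fields come from the public LSP specification.

```python
import json


def frame_lsp_message(payload: dict) -> bytes:
    """Wrap a JSON-RPC payload in the LSP base-protocol header (hypothetical helper)."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    # Content-Length counts bytes of the body, not characters.
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def parse_lsp_message(data: bytes) -> dict:
    """Read one framed message back into a dict (hypothetical helper)."""
    header, _, rest = data.partition(b"\r\n\r\n")
    length = None
    for line in header.split(b"\r\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("missing Content-Length header")
    return json.loads(rest[:length].decode("utf-8"))


# First request a client sends to a language server.
# The rootUri below is a placeholder, not a real project.
initialize_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "processId": None,
        "rootUri": "file:///path/to/project",
        "capabilities": {},
    },
}

framed = frame_lsp_message(initialize_request)
round_tripped = parse_lsp_message(framed)
```

Any tool that consumes language-server data, whether an editor or a terminal assistant, exchanges messages of this shape to ask for symbols, definitions, and diagnostics.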
Source: github.blog
NVIDIA, DiffusionGemma deployment stack detailed

- NVIDIA published deployment guidance for DiffusionGemma on June 10, positioning the Google DeepMind model as a high-throughput text generation option on NVIDIA hardware.
- DiffusionGemma generates 256 tokens in parallel per denoising step instead of sequential token-by-token decoding.
- NVIDIA states throughput reaches up to 1,000 tokens/sec on a single H100, 150 tokens/sec on DGX Spark, and 2,000 tokens/sec on DGX Station.
- The model is based on the Gemma 4 26B A4B MoE architecture, with 25.2B total parameters, 3.8B active parameters, and context windows up to 256K tokens.
- Available precision formats include BF16 and NVFP4, with BF16 checkpoints on Hugging Face and an NVFP4 checkpoint distributed through NVIDIA Model Optimizer.
- NVIDIA recommends Hugging Face Transformers for initial prototyping on systems such as GeForce RTX 5090 and DGX Spark, and vLLM for higher-throughput or multi-user serving.
- Production deployment is also available through NVIDIA NIM, a containerized inference microservice with an OpenAI-compatible API.
- Fine-tuning paths are provided through NVIDIA NeMo AutoModel, which can adapt Hugging Face checkpoints directly for task- or domain-specific use.
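
To put the figures above in perspective, here is a back-of-the-envelope sketch in Python. It uses only the numbers quoted in this item, with one exception: the number of denoising steps needed per 256-token block is not stated, so `steps_per_block` is a hypothetical parameter and the value 8 is purely illustrative. The function names are also made up for this example.

```python
import math

BLOCK_TOKENS = 256       # tokens refined in parallel per denoising step (from the item above)
TOTAL_PARAMS_B = 25.2    # total parameters, in billions
ACTIVE_PARAMS_B = 3.8    # active parameters per token, in billions


def sequential_passes(num_tokens: int) -> int:
    """Token-by-token decoding needs one model pass per generated token."""
    return num_tokens


def diffusion_passes(num_tokens: int, steps_per_block: int) -> int:
    """Model passes if each 256-token block takes `steps_per_block` denoising steps.

    `steps_per_block` is an assumption; the source does not give it.
    """
    blocks = math.ceil(num_tokens / BLOCK_TOKENS)
    return blocks * steps_per_block


def best_case_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Lower bound on generation time, since the quoted rates are 'up to' figures."""
    return num_tokens / tokens_per_sec


# Share of the MoE weights that are active for any one token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

if __name__ == "__main__":
    n = 1024
    print("sequential passes:", sequential_passes(n))
    print("diffusion passes (assuming 8 steps per block):", diffusion_passes(n, 8))
    for name, rate in [("H100", 1000), ("DGX Spark", 150), ("DGX Station", 2000)]:
        print(f"{name}: at least {best_case_seconds(n, rate):.2f} s for {n} tokens")
    print(f"active parameter share: {active_fraction:.1%}")
```

Under these assumptions, parallel decoding needs fewer model passes than sequential decoding whenever a block converges in fewer than 256 denoising steps. Actual speedups depend on the cost of each pass and on hardware, which this sketch does not model.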
Source: developer.nvidia.com
Jocoletter curates AI, software, and product trends for developers and builders.
#GitHub #GoogleDeepMind #NVIDIA