
Red Hat AI


2:4 Sparse Llama FP8: SOTA performance for NVIDIA Hopper GPUs
Combining 2:4 structured sparsity with FP8 quantization delivers state-of-the-art inference performance on NVIDIA Hopper GPUs, making sparsity a cornerstone of AI efficiency.
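For intuition, here is a minimal PyTorch sketch of the 2:4 pattern itself: in every contiguous group of four weights, the two smallest-magnitude values are zeroed, which is the structure Hopper's sparse tensor cores accelerate. This illustrates the pattern only, not the Sparse Llama pruning or training recipe.

```python
import torch

def apply_2_of_4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude values in every contiguous group of
    4 weights (the 2:4 pattern NVIDIA sparse tensor cores accelerate)."""
    groups = weight.reshape(-1, 4)  # view the weights as groups of 4
    _, drop = torch.topk(groups.abs(), k=2, dim=1, largest=False)
    return groups.scatter(1, drop, 0.0).reshape(weight.shape)

w = torch.randn(8, 8)
w_sparse = apply_2_of_4_sparsity(w)
# Every group of 4 now has at most 2 nonzero entries.
assert (w_sparse.reshape(-1, 4) != 0).sum(dim=1).max() <= 2
```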

We ran over half a million evaluations on quantized LLMs—here's what we found
Across more than 500,000 evaluations, quantized LLMs retained near-full accuracy with minimal trade-offs, making them efficient, high-performance options for AI model deployment.

Introducing Machete, a mixed-input GEMM kernel optimized for NVIDIA Hopper GPUs
Machete, Neural Magic's mixed-input GEMM kernel for NVIDIA Hopper GPUs, delivers up to 4x memory savings and faster LLM inference with quantized weights (e.g., FP16xINT4) in vLLM.

LLM Compressor is here: Faster inference with vLLM
Discover LLM Compressor, a unified library for creating accurate, compressed models that enable cheaper and faster inference with vLLM.
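As a rough sketch of the workflow, based on the library's published examples (import paths and argument names may differ across versions), one-shot FP8 quantization looks roughly like this:

```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# One-shot dynamic FP8 quantization of every Linear layer except the LM head.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC",
                              ignore=["lm_head"])
oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",        # any HF causal LM
    recipe=recipe,
    output_dir="Meta-Llama-3-8B-Instruct-FP8-Dynamic",  # loadable by vLLM
)
```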

vLLM brings FP8 inference to the open source community
Explore the integration of FP8 in vLLM and learn how to achieve up to a 2x reduction in latency on NVIDIA GPUs with minimal accuracy degradation.
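A minimal sketch of on-the-fly FP8 in vLLM (the model name is illustrative, and an FP8-capable GPU such as H100 is assumed):

```python
from vllm import LLM

# Quantize weights to FP8 dynamically at load time; no pre-quantized
# checkpoint is needed.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")
```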

Deploy Llama 3 8B with vLLM
Llama 3's advancements, particularly in its 8-billion-parameter variant, make capable AI models more accessible and efficient to deploy.
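A minimal offline-inference sketch using vLLM's Python API (the prompt and sampling settings are illustrative):

```python
from vllm import LLM, SamplingParams

# Load Llama 3 8B Instruct and generate a single completion.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)
outputs = llm.generate(["Explain paged attention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```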

How Marlin pushes the boundaries of mixed-precision LLM inference
Learn about Marlin, a mixed-precision matrix multiplication kernel that delivers a 4x speedup with FP16xINT4 computation for batch sizes up to 32.
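To make the data flow concrete, here is a naive, unfused sketch of the W4A16 math. Marlin's actual kernel fuses dequantization into the GEMM and operates on packed INT4 data; this version uses float32 so it runs anywhere, and all shapes and scales are illustrative.

```python
import torch

# Unfused mixed-input GEMM: dequantize 4-bit weight codes, then matmul.
K, N, group = 128, 64, 32
qweight = torch.randint(0, 16, (K, N))       # 4-bit weight codes (0..15)
scales = torch.rand(K // group, N) * 0.1     # one scale per 32-row group
zero_point = 8                               # symmetric midpoint

w = (qweight.float() - zero_point) * scales.repeat_interleave(group, dim=0)
x = torch.randn(4, K)                        # small batch of activations
y = x @ w                                    # the GEMM Marlin accelerates
print(y.shape)                               # torch.Size([4, 64])
```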

How well do quantized models handle long-context tasks?
4-bit and 8-bit quantized LLMs excel at long-context tasks, retaining over 99% of baseline accuracy across sequence lengths from 4K to 64K.

Sparse fine-tuning for accelerating large language models with DeepSparse
Sparse fine-tuning, combined with sparsity-aware inference software like DeepSparse, unlocks ubiquitous CPU hardware as a deployment target for LLM inference.

SparseGPT: Remove 100 billion parameters for free
Compress large language models (LLMs) with SparseGPT to make inference fast and efficient. Prune in one shot with minimal accuracy loss.
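SparseGPT itself uses a Hessian-based one-shot solver; as a far simpler baseline for intuition only (this is not SparseGPT's algorithm), here is one-shot magnitude pruning with PyTorch's built-in utilities:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(4096, 4096)
# One-shot 50% unstructured pruning: zero the smallest-magnitude weights.
prune.l1_unstructured(layer, name="weight", amount=0.5)
prune.remove(layer, "weight")  # bake the mask into the weight tensor
print(f"sparsity: {(layer.weight == 0).float().mean():.0%}")  # ~50%
```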

Big Data: Understanding the variety, volume, and velocity behind data
Turn the data you collect into real-time information you can use for optimization.