
Red Hat AI


2:4 Sparse Llama FP8: SOTA performance for NVIDIA Hopper GPUs
Combining 2:4 structured sparsity with FP8 quantization delivers state-of-the-art inference performance on NVIDIA Hopper GPUs, making sparsity a cornerstone of AI efficiency.
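For intuition, here is a minimal PyTorch sketch of the 2:4 pattern itself: in every contiguous group of four weights, the two smallest-magnitude values are zeroed, which is the structure Hopper's sparse tensor cores accelerate. This illustrates the pattern only, not the Sparse Llama pruning or training recipe.

```python
import torch

def apply_2_of_4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude values in every contiguous group of
    4 weights (the 2:4 pattern NVIDIA sparse tensor cores accelerate)."""
    groups = weight.reshape(-1, 4)  # view the weights as groups of 4
    _, drop = torch.topk(groups.abs(), k=2, dim=1, largest=False)
    return groups.scatter(1, drop, 0.0).reshape(weight.shape)

w = torch.randn(8, 8)
w_sparse = apply_2_of_4_sparsity(w)
# Every group of 4 now has at most 2 nonzero entries.
assert (w_sparse.reshape(-1, 4) != 0).sum(dim=1).max() <= 2
```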

We ran over half a million evaluations on quantized LLMs—here's what we found
Across more than 500,000 evaluations, quantized LLMs retained near-full accuracy with minimal trade-offs, making them efficient, high-performance options for AI model deployment.

Introducing Machete, a mixed-input GEMM kernel optimized for NVIDIA Hopper GPUs
Machete, Neural Magic's mixed-input GEMM kernel for NVIDIA Hopper GPUs, delivers up to 4x memory savings and faster LLM inference with quantized weights (e.g., FP16xINT4) in vLLM.

LLM Compressor is here: Faster inference with vLLM
Discover LLM Compressor, a unified library for creating accurate, compressed models that enable cheaper and faster inference with vLLM.
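As a rough sketch of the workflow, based on the library's published examples (import paths and argument names may differ across versions), one-shot FP8 quantization looks roughly like this:

```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# One-shot dynamic FP8 quantization of every Linear layer except the LM head.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC",
                              ignore=["lm_head"])
oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",        # any HF causal LM
    recipe=recipe,
    output_dir="Meta-Llama-3-8B-Instruct-FP8-Dynamic",  # loadable by vLLM
)
```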

vLLM brings FP8 inference to the open source community
Explore the integration of FP8 in vLLM and learn how to achieve up to a 2x reduction in latency on NVIDIA GPUs with minimal accuracy degradation.
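A minimal sketch of on-the-fly FP8 in vLLM (the model name is illustrative, and an FP8-capable GPU such as H100 is assumed):

```python
from vllm import LLM

# Quantize weights to FP8 dynamically at load time; no pre-quantized
# checkpoint is needed.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")
```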

Deploy Llama 3 8B with vLLM
Llama 3's advancements, particularly in its 8-billion-parameter variant, make capable AI models more accessible and efficient to deploy.
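A minimal offline-inference sketch using vLLM's Python API (the prompt and sampling settings are illustrative):

```python
from vllm import LLM, SamplingParams

# Load Llama 3 8B Instruct and generate a single completion.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)
outputs = llm.generate(["Explain paged attention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```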

How Marlin pushes the boundaries of mixed-precision LLM inference
Learn about Marlin, a mixed-precision matrix multiplication kernel that delivers a 4x speedup with FP16xINT4 computation for batch sizes up to 32.
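To make the data flow concrete, here is a naive, unfused sketch of the W4A16 math. Marlin's actual kernel fuses dequantization into the GEMM and operates on packed INT4 data; this version uses float32 so it runs anywhere, and all shapes and scales are illustrative.

```python
import torch

# Unfused mixed-input GEMM: dequantize 4-bit weight codes, then matmul.
K, N, group = 128, 64, 32
qweight = torch.randint(0, 16, (K, N))       # 4-bit weight codes (0..15)
scales = torch.rand(K // group, N) * 0.1     # one scale per 32-row group
zero_point = 8                               # symmetric midpoint

w = (qweight.float() - zero_point) * scales.repeat_interleave(group, dim=0)
x = torch.randn(4, K)                        # small batch of activations
y = x @ w                                    # the GEMM Marlin accelerates
print(y.shape)                               # torch.Size([4, 64])
```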

How well do quantized models handle long-context tasks?
4-bit and 8-bit quantized LLMs excel at long-context tasks, retaining over 99% of baseline accuracy across sequence lengths from 4K to 64K.

Sparse fine-tuning for accelerating large language models with DeepSparse
Sparse fine-tuning, combined with sparsity-aware inference software like DeepSparse, unlocks ubiquitous CPU hardware as a deployment target for LLM inference.

SparseGPT: Remove 100 billion parameters for free
Compress large language models (LLMs) with SparseGPT to make inference fast and efficient. Prune in one shot with minimal accuracy loss.
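SparseGPT itself uses a Hessian-based one-shot solver; as a far simpler baseline for intuition only (this is not SparseGPT's algorithm), here is one-shot magnitude pruning with PyTorch's built-in utilities:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(4096, 4096)
# One-shot 50% unstructured pruning: zero the smallest-magnitude weights.
prune.l1_unstructured(layer, name="weight", amount=0.5)
prune.remove(layer, "weight")  # bake the mask into the weight tensor
print(f"sparsity: {(layer.weight == 0).float().mean():.0%}")  # ~50%
```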

Big Data: Understanding the variety, volume, and velocity behind data
Turn the data you collect into real-time information you can use for optimization.