KV Cache Pruning - Search Videos

Unlock 90% KV Cache Hit Rates with llm-d Intelligent Routing | Tushar Katarki

6.3K views · 5 months ago

New KV cache compaction technique cuts LLM memory 50x without accuracy loss

venturebeat.com

KV Cache Speeds Up Large Language Model Inference | Tushar Kumar posted on the topic | LinkedIn

2K views · 1 month ago

Making AI Faster | The KV Cache

7 views · 1 month ago

YouTube · Like Engineer

Kv cache algorithms HBM #ai #travel #nvidia #nvidia #viral #gpu #viral #gpu #motivation #aiinfra

YouTube · Amit_Chopra_assruc

I Split LLM Inference Across Two GPUs: Prefill, Decode, and KV Cache

489 views · 1 week ago

YouTube · Onchain AI Garage

The KV Cache Hack That Saved My GPU (TurboQuant Explained)

63 views · 1 month ago

YouTube · OEvortex

Breaking Memory Barriers: How KV Cache & DiskANN Optimizations Unlock Scalable AI Video Analytics

11 views · 1 month ago

YouTube · Metrum AI

oMLX vs Ollama: Extreme Context, SSD KV Cache & Mac Crashes

1.5K views · 1 week ago

YouTube · Protorikis

Konrad Staniszewski - Cache Me If You Can: Reducing Model Size and KV Cache Traffic | ML in PL 2025

52 views · 2 months ago

YouTube · ML in PL

AGI Dreams Podcast – May 06, 2026

5 views · 2 weeks ago

YouTube · Robert E. Lee

Tensormesh: KV Cache Persistence for Faster, Cheaper, Smarter Inference

97 views · 2 months ago

YouTube · Bryan Bamford

The Secret to Fast VLMs: Stop Pruning Wrong (QUOTA Guide) #Shorts

4 views · 3 weeks ago

YouTube · CollapsedLatents

LMCache Explained: Persistent KV Caching for Efficient Agentic AI

3 views · 1 month ago

YouTube · Mustafa Assaf

LLM Optimization KV Cache Flash Attention MQA GQA | Hugging Face Explained

26 views · 2 months ago

YouTube · Switch 2 AI

[IAD, Fall 2025] Bayesian Model Selection. Lecture 13: Hamiltonian MCMC Methods

92 views · 5 months ago

YouTube · Machine Learning – Intelligent Systems

KV Cache Explained ⚡ | Why LLMs Get Faster as They Generate #kvcache #llm #transformers #ai #ml

186 views · 2 weeks ago

YouTube · Tushar Anand Tech

LLM Context Management Optimization: Memento Cuts KV Cache by 2–3x

10 views · 1 month ago

How DeepSeek reduced KV cache by 98% - MLA explained.

37 views · 4 weeks ago

YouTube · Vicky Explores AI

TurboQuant Explained: How to Shrink KV Cache Without Breaking Attention

169 views · 1 month ago

YouTube · Reinike AI

TurboQuant Explained: 3-Bit KV Cache Quantization

866 views · 4 weeks ago

YouTube · Tales Of Tensors

Top 10 KV Cache Compression Techniques for LLM Inference!

21 views · 3 weeks ago

YouTube · The AI Opus

KV Cache Explained: The 4-Layer Fix Every AI Engineer Must Know | Gen AI Interview Series | EP#01

66 views · 1 month ago

What is KV Cache Compression? (LLM Memory Visualized)

1 view · 3 weeks ago

YouTube · Edumation

【Whitepaper】KV Cache Offload to Improve AI Inferencing Cost and Performance

42 views · 2 months ago

KV Cache: The Invisible Trick Behind Every LLM

8.9K views · 2 weeks ago

YouTube · Adam Rosler

How Tool-Calling Changes Everything: KV Cache & Prefill Explained 🧠

25 views · 2 months ago

YouTube · SAIL Media

kvcached: Revolutionizing GPU Memory for LLMs

1 view · 3 weeks ago

YouTube · The AI Opus

after turboquant and qwen3.5-35b-a3b, i got curious: how realistic is it to use kv cache as a document store today? to have vectorless, RAG-less search. so i prefilled 258K out of 262K context window on L4 (a budget GPU popular in prod). ~99% of the slot is pre-computed and stored, users load it on the fly in ~1s. system prompt + query append to the end, generation takes ~3K tokens, enough for search. at 99% fill rate, decoding runs ~20 tps on L4. i prepared some ego datasets (jina papers, which

42.2K views · 1 month ago

I added KV caching and INT8 KV quantization to our transformer inference, improving throughput by 35x.
All of this was done from scratch in Rust + CUDA, on top of a homemade ML framework.
On a 4-token prompt with 252 generated tokens:
- Original: 0.76 tok/s
- KV cache fp32: 27.21 tok/s
- KV cache int8 (quantized): 27.29 tok/s
Try it out yourself here: https://t.co/kFS9Z0fs4h
In practice:
- KV caching gave us about a 35x end-to-end speedup
- INT8 KV cache kept roughly the same speed as fp32 but cut KV cac

48.8K views · 1 month ago

x.com · Reese Chong
