@anikapatel · VP Engineering at Anthropic
We just open-sourced our inference optimization toolkit that reduced our serving costs by 73% while maintaining 99.9% accuracy parity. Key techniques:
• Speculative decoding with draft models
• KV cache compression (4-bit quantization)
• Dynamic batching with priority queues
• Prefix caching for repeated prompts
Rough sketches of each technique below. Repo link in comments. Happy to answer questions about production deployment.
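Speculative decoding in one picture: a cheap draft model proposes several tokens, the big target model verifies them in a single pass, and you keep the agreeing prefix. The sketch below is a simplified greedy-agreement variant with toy stand-in models (every name here is hypothetical, not from the repo); the full method uses a rejection-sampling acceptance rule, but the control flow is the same.

```python
VOCAB = 32

def draft_next(ctx):
    # Hypothetical stand-in for a small, cheap draft model.
    return (ctx[-1] * 7 + 3) % VOCAB

def target_next(ctx):
    # Hypothetical stand-in for the large target model: usually agrees
    # with the draft, occasionally diverges.
    guess = draft_next(ctx)
    return guess if ctx[-1] % 5 else (guess + 1) % VOCAB

def speculative_step(tokens, k=4):
    """Draft k tokens cheaply, then verify them with the target model.

    Accept the longest prefix where target and draft agree; the first
    disagreement is repaired with the target's token, so one verification
    pass yields between 1 and k tokens instead of exactly 1.
    """
    draft = []
    for _ in range(k):
        draft.append(draft_next(list(tokens) + draft))
    out = list(tokens)
    for t in draft:
        correct = target_next(out)  # in practice: one batched forward pass
        if correct == t:
            out.append(t)
        else:
            out.append(correct)     # keep the target's choice and stop
            break
    return out

seq = [1]
for _ in range(6):
    seq = speculative_step(seq)
print(seq)
```

The speedup comes from the draft being cheap enough that verifying k guesses in one target pass beats k sequential target passes, and the output distribution stays that of the target model.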
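For the 4-bit KV cache compression, the core idea is symmetric quantization with a per-group scale: each group of values maps to the integer range [-7, 7], so two codes fit in one byte. A minimal round-trip sketch (nibble packing and per-channel scale choices omitted; this is illustrative, not the repo's implementation):

```python
import numpy as np

def quantize_4bit(x, group=32):
    """Quantize a float KV tensor to 4-bit codes with a per-group scale."""
    x = x.reshape(-1, group)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0                       # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

kv = np.random.default_rng(0).standard_normal((4, 64)).astype(np.float32)
q, s = quantize_4bit(kv)
kv_hat = dequantize_4bit(q, s, kv.shape)
print("max abs error:", np.abs(kv - kv_hat).max())
```

That's a 4x memory cut on the cache (8x vs fp32), which is usually the serving bottleneck at long context lengths; the group-wise scale keeps the error small enough that downstream attention barely notices.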
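Dynamic batching with priority queues is mostly bookkeeping: requests accumulate in a heap keyed by priority, and the serving loop drains up to a max batch size each step. A toy sketch (class and priority levels are hypothetical, not the toolkit's API):

```python
import heapq
import itertools

class DynamicBatcher:
    """Collects requests in a priority heap and drains them in batches.

    Lower priority value = served sooner (e.g. 0 = interactive chat,
    5 = background jobs); a monotonic counter keeps FIFO order within
    a priority level and breaks ties for the heap.
    """
    def __init__(self, max_batch=8):
        self.heap = []
        self.counter = itertools.count()
        self.max_batch = max_batch

    def submit(self, prompt, priority=1):
        heapq.heappush(self.heap, (priority, next(self.counter), prompt))

    def next_batch(self):
        batch = []
        while self.heap and len(batch) < self.max_batch:
            _, _, prompt = heapq.heappop(self.heap)
            batch.append(prompt)
        return batch

b = DynamicBatcher(max_batch=2)
b.submit("background summarize", priority=5)
b.submit("chat turn 1", priority=0)
b.submit("chat turn 2", priority=0)
print(b.next_batch())  # ['chat turn 1', 'chat turn 2']
```

In production you'd also bound queue wait time so low-priority work can't starve, but the heap plus a batch-size cap is the essential mechanism.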
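Prefix caching exploits the fact that many prompts share a long common prefix (system prompts, few-shot examples): cache the state computed for a prefix once and only run the model over the new suffix. A deliberately tiny sketch where an integer stands in for the real KV cache (real systems cache at block granularity rather than every prefix, which this toy ignores):

```python
class PrefixCache:
    """Caches per-prefix state so repeated prompt prefixes skip recompute."""

    def __init__(self):
        self.store = {}  # tuple(prefix tokens) -> cached state

    def longest_hit(self, tokens):
        # Find the longest already-cached prefix of this prompt.
        for n in range(len(tokens), 0, -1):
            state = self.store.get(tuple(tokens[:n]))
            if state is not None:
                return n, state
        return 0, None

    def encode(self, tokens):
        n, state = self.longest_hit(tokens)
        for i in range(n, len(tokens)):  # compute only the uncached suffix
            state = ((state or 0) * 31 + tokens[i]) & 0xFFFFFFFF  # dummy "KV"
            self.store[tuple(tokens[:i + 1])] = state
        return state

cache = PrefixCache()
cache.encode([1, 2, 3, 4])   # cold: computes all 4 steps
cache.encode([1, 2, 3, 9])   # warm: reuses prefix [1, 2, 3], computes 1 step
print(len(cache.store))      # 5 cached states
```

For chat workloads where the system prompt dominates the input, this turns most of the prefill into a lookup.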
190 reactions · 10 reposts