We just open-sourced our inference optimization toolkit. It cut our serving costs by 73% while holding accuracy at 99.9% parity with the unoptimized baseline.
Key techniques:
• Speculative decoding with draft models
• KV cache compression (4-bit quantization; rough sketch below)
• Dynamic batching with priority queues
• Prefix caching for repeated prompts
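Since people usually ask about the KV cache piece first, here is a rough NumPy sketch of the idea behind the 4-bit quantization: symmetric absmax scaling with one scale per group of values. The group size, the int8 storage, and the function names are simplifications for readability, not the toolkit's actual API, and a real kernel would pack two 4-bit values per byte.

```python
# Rough sketch of 4-bit KV cache quantization: symmetric absmax scaling with
# one scale per group. Group size, layout, and names are illustrative only;
# a production kernel packs two 4-bit values into each byte.
import numpy as np

def quantize_kv_4bit(kv: np.ndarray, group_size: int = 64):
    """Quantize a KV tensor to the int4 range [-8, 7], one scale per group."""
    flat = kv.astype(np.float32).reshape(-1, group_size)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)           # guard all-zero groups
    q = np.clip(np.round(flat / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_kv_4bit(q: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(shape)

# Quick check: quantize a fake KV block and measure the round-trip error.
kv = np.random.randn(2, 8, 128, 64).astype(np.float32)  # (batch, heads, seq, head_dim)
q, s = quantize_kv_4bit(kv)
err = np.abs(kv - dequantize_kv_4bit(q, s, kv.shape)).max()
print(f"max abs round-trip error: {err:.4f}")
```

With a group size of 64 and one fp16 scale per group, that works out to roughly 4.25 bits per stored value, which is where most of the KV memory savings come from.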
Repo link in comments. Happy to answer questions about production deployment.
After deploying RAG across 12 enterprise clients, I can confidently say it still outperforms fine-tuning for most production use cases.
Here's what we found:
1. RAG gives you updateability — refresh knowledge without retraining
2. Fine-tuning wins on style/tone but loses on factual accuracy
3. Hybrid approaches (RAG + lightweight fine-tune) are the sweet spot (rough sketch after this list)
4. Cost difference is 10-50x in favo...
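For anyone curious how the hybrid option fits together, here is a stripped-down sketch of the retrieval half feeding a lightly fine-tuned generator. embed() and generate() are placeholders for whatever embedding model and fine-tuned model you actually run, and the in-memory cosine-similarity index is purely illustrative, not how any of the client deployments work.

```python
# Minimal RAG-over-a-fine-tuned-model sketch: retrieval supplies fresh facts,
# the fine-tuned generator supplies style and tone. Index and prompt format
# are placeholder choices for illustration.
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, chunks, k=3):
    """Return the k chunks whose embeddings are most similar to the query."""
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return [chunks[i] for i in np.argsort(-sims)[:k]]

def answer(query, chunks, chunk_vecs, embed, generate):
    """embed() and generate() are whatever embedding model and fine-tuned LM you use."""
    context = "\n\n".join(top_k_chunks(embed(query), chunk_vecs, chunks, k=3))
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)  # fine-tuned model handles tone; context handles facts
```

The division of labor is the whole point: retrieval owns freshness and factual grounding, the fine-tuned model owns style, so updating knowledge never requires retraining.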
@marcusthompson · Founding Engineer at Runway · Ex-OpenAI
Hot take: Most AI startups are over-engineering their ML pipelines and under-engineering their data pipelines.
Your model is only as good as your data. Spend 80% of your time on data quality, not architecture.
I've seen this pattern at 3 companies now. The ones that win focus relentlessly on data curation.
@weizhang · Head of AI Safety Research at Anthropic
Excited to announce: I'm joining Anthropic as Head of AI Safety Research.
After 8 years at DeepMind, this feels like the right moment to focus entirely on alignment. The problems are getting harder, but the community is getting stronger.
Grateful for everyone who supported this journey. Let's build safe AI together. 🙏
Controversial: The 'bigger is better' era of foundation models is ending.
Our latest research shows that smaller, specialized models (7-13B parameters) consistently outperform 70B+ generalists on domain-specific tasks when properly fine-tuned.
The future isn't one mega-model. It's an ecosystem of specialized experts.
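To make "properly fine-tuned" concrete, here is a minimal parameter-efficient setup with LoRA via Hugging Face peft. The base model, target modules, and hyperparameters below are placeholder choices for illustration, not the configuration behind the results above.

```python
# Minimal LoRA fine-tuning setup for a 7B-class model using Hugging Face peft.
# Model choice, target modules, and hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"          # any 7-13B causal LM works here
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16,                                   # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()          # typically well under 1% of weights
# From here, train on the domain-specific dataset with your usual training loop.
```

One practical upside: each specialist can live as a small adapter on a shared base model, so running an ecosystem of experts doesn't mean storing dozens of full checkpoints.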