Large language models are increasingly being deployed in edge environments including smartphones, IoT devices, and embedded systems. However, current inference times are too slow for real-time applications.
We're looking for innovative solutions that can:
- Reduce inference latency by at least 50%
- Maintain model quality within 5% of the baseline
- Work across different hardware architectures (ARM, RISC-V)
- Be compatible with popular frameworks (ONNX, TensorFlow Lite)
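As a rough illustration of how the first two targets might be checked against measurements, here is a minimal sketch. The function name `meets_targets` and its parameters are hypothetical, and it assumes that "within 5%" means a relative drop on a higher-is-better quality metric (for example accuracy); the problem statement does not specify the metric.

```python
def meets_targets(base_latency_ms, new_latency_ms, base_quality, new_quality,
                  min_latency_reduction=0.50, max_quality_drop=0.05):
    """Return True if a candidate meets the latency and quality targets.

    Assumptions (not stated in the problem): quality is a higher-is-better
    score, and "within 5%" is a relative drop compared to the baseline.
    """
    if base_latency_ms <= 0 or base_quality <= 0:
        raise ValueError("baseline latency and quality must be positive")
    # Fraction of baseline latency that was removed (0.6 means 60% faster).
    latency_reduction = 1.0 - new_latency_ms / base_latency_ms
    # Relative quality loss; negative values mean quality improved.
    quality_drop = 1.0 - new_quality / base_quality
    return (latency_reduction >= min_latency_reduction
            and quality_drop <= max_quality_drop)


if __name__ == "__main__":
    # Hypothetical measurements: 100 ms -> 40 ms, accuracy 0.80 -> 0.78.
    print(meets_targets(100.0, 40.0, 0.80, 0.78))
```

The hardware and framework requirements are not covered by this check; they would need to be verified separately on each target.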
Solutions can include model compression, quantization, architectural innovations, or novel inference engines.
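To make the quantization option concrete, the sketch below shows symmetric per-tensor int8 quantization of a weight list in plain Python. It is one simple post-training scheme chosen for illustration, not a method prescribed by this problem; the helper names `quantize_int8` and `dequantize` are hypothetical, and a real submission would more likely rely on the quantization tooling of its chosen framework.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to the range [-127, 127].

    Returns the integer codes and the scale needed to reconstruct the floats.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # A single scale maps the largest magnitude onto 127; use 1.0 for all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map int8 codes back to approximate float weights."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]  # illustrative values only
    codes, scale = quantize_int8(weights)
    restored = dequantize(codes, scale)
    # Rounding error per weight is bounded by half a quantization step.
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(codes, scale, worst)
```

Storing weights as 8-bit integers instead of 32-bit floats cuts weight storage by roughly 4x; the latency gain depends on whether the target hardware and runtime provide efficient integer kernels.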
Discussion (3)
@alex_ml, 2 hours ago
I've been working on quantization-aware training that maintains 98% accuracy. Happy to share my approach.
@edge_dev, 5 hours ago
Have you considered structured pruning combined with knowledge distillation? We saw 3x speedup on ARM.
@research_ai, 1 day ago
The key bottleneck is attention computation. Flash Attention-style kernels plus dynamic batching could help.