I've been working on quantization-aware training that maintains 98% accuracy. Happy to share my approach.
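For anyone new to the idea, here is a minimal sketch of the "fake quantization" step that quantization-aware training typically runs in the forward pass. It is not the approach mentioned above, which hasn't been shared yet. The function name `fake_quantize` and the symmetric per-tensor 8-bit scheme are assumptions chosen for illustration.

```python
def fake_quantize(values, num_bits=8):
    """Round values to a signed integer grid, then map them back to floats.

    Hypothetical helper: symmetric, per-tensor scaling is an assumption here,
    not a description of any particular framework's scheme.
    """
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)                 # nothing to scale
    scale = max_abs / qmax                  # float step between integer levels
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized]   # dequantize back to float


weights = [0.5, -1.0, 0.25, 0.0]
print(fake_quantize(weights))
```

During training the model sees these rounded weights, so it learns to tolerate the rounding error. Gradients are usually passed through the rounding step unchanged (the straight-through estimator), which this sketch leaves out.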
Have you considered structured pruning combined with knowledge distillation? We saw a 3x speedup on ARM.
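As a rough illustration of the distillation half of that suggestion, the sketch below computes the usual soft-target loss: the KL divergence between temperature-softened teacher and student distributions. The function names and the temperature of 2.0 are assumptions for the example, not details from the setup described above.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Hypothetical helper; the T^2 factor is the common convention for keeping
    gradient magnitudes comparable across temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(p_teacher, p_student)
        if t > 0.0
    )
    return kl * temperature ** 2


print(distillation_loss([1.0, 0.5, -0.5], [2.0, 0.1, -1.0]))
```

In practice this term is usually mixed with the ordinary cross-entropy loss on the true labels, and the pruned model plays the role of the student.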
The key bottleneck is attention computation. FlashAttention-style kernels plus dynamic batching could help.
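To make the kernel idea concrete, here is a small sketch of the online-softmax trick that FlashAttention-style kernels rely on: keys and values are processed in blocks, and a running maximum and running sum are rescaled as each block arrives, so the full score vector is never stored. This is plain Python for a single query, with hypothetical function names. It only shows the arithmetic, not a real fused kernel.

```python
import math


def attention_naive(q, keys, values):
    """Reference: softmax(q.k / sqrt(d)) weighted sum of values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_blockwise(q, keys, values, block_size=2):
    """Same result, computed one block of keys/values at a time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [
            a * correction + sum(w * v[d] for w, v in zip(weights, v_blk))
            for d, a in enumerate(acc)
        ]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.1, 0.2]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5], [0.3, -0.7]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(attention_naive(q, keys, values))
print(attention_blockwise(q, keys, values))
```

The two functions should agree up to floating-point rounding. A real kernel also tiles the queries and keeps each block in fast on-chip memory, which is where the speedup comes from.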