Apple Releases FastVLM Visual-Language Model

The new, ultra-fast model is available for select iPhone devices.

May 12, 2025

Apple has officially released FastVLM, a visual-language model (VLM) optimized for high-resolution image processing. The model has drawn industry attention for its efficient operation and strong performance on mobile devices such as the iPhone: its innovative FastViTHD visual encoder delivers an 85x improvement in encoding speed, paving the way for real-time multimodal AI applications.

How It Works

The core of FastVLM is its newly designed FastViTHD hybrid visual encoder, which is deeply optimized for high-resolution image processing. Compared with traditional Vision Transformer (ViT) encoders, FastViTHD improves efficiency through several innovations, including dynamic resolution adjustment via multiscale feature fusion, which intelligently identifies key image regions and reduces redundant computation.
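
To make the idea concrete, here is a minimal sketch of a hybrid encoder that fuses features from multiple scales into a coarse token grid. FastViTHD's actual architecture is not reproduced here; the layer sizes, strides, and fusion scheme below are illustrative assumptions only.

```python
# Conceptual sketch only: not the real FastViTHD architecture.
# Convolutional stages downsample aggressively, features from several
# scales are fused on the coarsest grid, and a light transformer layer
# runs over the resulting (much shorter) token sequence.
import torch
import torch.nn as nn

class MultiScaleFusionEncoder(nn.Module):
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Progressively coarser convolutional feature maps.
        self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=4, padding=1), nn.GELU())
        self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.GELU())
        self.stage3 = nn.Sequential(nn.Conv2d(128, embed_dim, 3, stride=2, padding=1), nn.GELU())
        # Project each scale to a common width before fusion.
        self.proj2 = nn.Conv2d(128, embed_dim, 1)
        self.proj3 = nn.Conv2d(embed_dim, embed_dim, 1)
        # Self-attention over the fused, coarse token grid.
        self.attn = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        f1 = self.stage1(image)  # high-resolution, local features
        f2 = self.stage2(f1)     # mid-resolution features
        f3 = self.stage3(f2)     # low-resolution, semantic features
        # Fuse: pool the finer map down to the coarsest grid and sum.
        target = f3.shape[-2:]
        fused = nn.functional.adaptive_avg_pool2d(self.proj2(f2), target) + self.proj3(f3)
        tokens = fused.flatten(2).transpose(1, 2)  # (batch, n_tokens, dim)
        return self.attn(tokens)

# A 512x512 input yields a 32x32 grid, i.e. 1024 visual tokens, in this toy setup.
encoder = MultiScaleFusionEncoder()
out = encoder(torch.randn(1, 3, 512, 512))
print(out.shape)  # torch.Size([1, 1024, 256])
```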

FastViTHD also applies hierarchical token compression, reducing the number of visual tokens from 1536 to 576 and cutting the computational load by 62.5%. The encoder is further optimized for Apple silicon (such as the M2 and A18), with matrix operations that support FP16 and INT8 quantization for low-power operation on mobile devices.
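
The arithmetic behind the 62.5% figure, and one simple way such a reduction could be realized, is sketched below. The pooled-merge compression shown here is an assumption for illustration, not Apple's published mechanism.

```python
# A minimal sketch of visual-token compression, assuming a simple pooled
# merge. FastVLM's real hierarchical compression is internal to FastViTHD;
# this only illustrates the 1536 -> 576 reduction quoted above.
import torch
import torch.nn as nn

def compress_tokens(tokens: torch.Tensor, target_len: int = 576) -> torch.Tensor:
    """Pool a (batch, n_tokens, dim) sequence down to target_len tokens."""
    pooled = nn.functional.adaptive_avg_pool1d(
        tokens.transpose(1, 2),  # (batch, dim, n_tokens) for 1D pooling
        target_len,
    )
    return pooled.transpose(1, 2)  # back to (batch, target_len, dim)

visual_tokens = torch.randn(1, 1536, 1024)   # toy encoder output
compressed = compress_tokens(visual_tokens)  # -> (1, 576, 1024)

# 1536 -> 576 removes (1536 - 576) / 1536 = 62.5% of the tokens the language
# model must attend to, which is where the 62.5% reduction figure comes from.
print(compressed.shape)  # torch.Size([1, 576, 1024])
```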

The FastVLM series includes 0.5B-, 1.5B-, and 7B-parameter variants, covering applications from lightweight to high-performance. The smallest model, FastVLM-0.5B, encodes 85 times faster than LLaVA-OneVision-0.5B and uses a 3.4x smaller visual encoder, while maintaining comparable performance.

Benchmark Performance

FastVLM performs strongly on visual-language tasks, handling multiple tasks with a single image encoder and no additional token pruning, which simplifies the model design. The 7B variant, built on Qwen2-7B, achieves 82.1% accuracy on the COCO Caption benchmark while maintaining a 7.9x advantage in time-to-first-token (TTFT), providing a solid foundation for real-time applications.
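
TTFT is simply the latency from submitting a prompt to receiving the first generated token. A rough way to measure it is sketched below; the checkpoint name is a placeholder (Qwen2-7B is only the language backbone the 7B variant builds on, not the full FastVLM pipeline with its image encoder).

```python
# Rough TTFT measurement: time the prefill plus the first generated token.
# The model id below is a placeholder backbone, not a FastVLM checkpoint;
# substitute whatever model you are actually benchmarking.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B"  # placeholder; FastVLM-7B pairs FastViTHD with this LLM
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Describe the image in one sentence.", return_tensors="pt").to(model.device)

start = time.perf_counter()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=1)  # prefill + first token only
ttft = time.perf_counter() - start
print(f"TTFT: {ttft * 1000:.1f} ms")
```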

Apple also released an iOS demo app to showcase FastVLM's real-time performance on mobile devices. Cited application examples include 93.7% accuracy in lung-nodule detection with a 40% improvement in diagnostic efficiency, and a reduction in defect false-positive rates from 2.1% to 0.7% in smartphone production-line quality inspection.

Open-Source Availability

FastVLM's code and models are open-sourced on GitHub and Hugging Face, with training built on the LLaVA codebase. Developers can customize the model by following the provided inference and fine-tuning guides. Apple's open-source move not only showcases its technical strength in visual-language models but also reflects its commitment to fostering an open AI ecosystem.
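
For developers who want to try the released checkpoints, a minimal starting point might look like the sketch below. The repository id shown is an assumption for illustration; consult the GitHub project's README for the actual model names and the LLaVA-based inference and fine-tuning scripts mentioned above.

```python
# Hedged sketch: download a FastVLM checkpoint from Hugging Face with the
# standard huggingface_hub client. The repo id is assumed for illustration;
# check Apple's release notes for the real model identifiers.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="apple/FastVLM-0.5B")  # assumed repo id
print("Checkpoint downloaded to:", local_dir)

# From here, the released LLaVA-style inference and fine-tuning scripts
# (documented in the repository) load the checkpoint for image+prompt inference.
```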

FastVLM's release marks a significant step in Apple's mobile AI strategy. Combined with the hardware advantages of its A18 chip and C1 modem, Apple is building an efficient, privacy-first, on-device AI ecosystem, with potential future expansion into Xcode programming assistants and visual expression features in the Messages app.