- 04-10-2024
- LLM
Meta's Llama 3.2 models bring powerful multimodal capabilities, combining text and image processing to enable tasks like image captioning, visual Q&A, summarization, and rewriting.
Meta's new Llama 3.2 models bring powerful multimodal capabilities, combining text and image processing for tasks like image captioning and visual question answering. The 90B vision model is designed for enterprise-level reasoning, while the more compact 11B vision model is well suited to content creation. The smaller 1B and 3B text models focus on tasks like summarization and rewriting and are optimized for local, on-device use. Built on Llama 3.1, the new models add image understanding through a vision tower and an image adapter that connect visual features to the language model. Meta's evaluations show Llama 3.2 is competitive with leading AI models.
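For readers who want to try the vision models, here is a minimal sketch of visual question answering with the 11B checkpoint via Hugging Face transformers. It assumes you have been granted access to the gated meta-llama repository and have a recent transformers release with the mllama integration; the image URL is a placeholder, not from the announcement.

```python
# Minimal sketch: visual Q&A with Llama 3.2 11B Vision (assumed setup:
# gated model access granted, recent transformers + accelerate installed).
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image URL; substitute any reachable image.
url = "https://example.com/photo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The chat template pairs an image placeholder with the text prompt.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same pattern covers image captioning: only the text prompt changes, while the image adapter feeds visual features into the underlying Llama 3.1 language model.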