Vision Language Models (VLMs), GenAI for CV
Vision language models (VLMs) learn jointly from images and text, which lets a single model tackle many tasks, from visual question answering to image captioning.
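As a quick taste of the captioning task, here is a minimal sketch assuming the Hugging Face transformers library; the BLIP checkpoint and image URL are illustrative placeholders, and any captioning-capable VLM checkpoint would work:

```python
from transformers import pipeline

# The image-to-text pipeline wraps a captioning-capable VLM;
# BLIP is used here purely as an illustrative checkpoint.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# Accepts a local path or URL to an image (placeholder URL below).
result = captioner("https://example.com/cat.jpg")
print(result[0]["generated_text"])  # e.g. "a cat sitting on a couch"
```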
Resources
- An Introduction to VLMs: The Future of Computer Vision Models (Ro Isachenko, Towards Data Science, Nov 2024)
- Vision Language Models Explained
- How Far is Video Generation from World Model: A Physical Law Perspective
- AI Image Generators Compared Side-By-Side Reveals Stark Differences
Leaderboards
- Vision Language Leaderboards - a Hugging Face Collection by merve
- GenAI Arena - a Hugging Face Space by TIGER-Lab
- Open VLM Leaderboard - a Hugging Face Space by opencompass
- Text-to-video model arena
Models
- Mochi (Genmo) - an open-source, state-of-the-art video generation model, released under the Apache 2.0 license
- Midjourney
- DALL-E (OpenAI)
- Veo 2 (Google DeepMind)
- Imagen (Google)
- Imagen Video (Google)
- Stable Diffusion
- Make-A-Video (Meta)
- playgroundai/playground-v2.5-1024px-aesthetic · Hugging Face
Applications
- Vidu - AI video generation platform
- Infinite AI Artboard - Recraft
- AI Image Generator - Create Art, Images & Video | Leonardo AI
- Introducing FLUX.1 Tools - Black Forest Labs
References
- #PAPER DALL-E - Creating Images from Text (Ramesh 2021)
- #PAPER CLIP - Learning Transferable Visual Models From Natural Language Supervision (Radford 2021) - a zero-shot classification sketch follows this list
- #PAPER Florence: A New Foundation Model for Computer Vision (Yuan 2021)
- #PAPER SimVLM: Simple Visual Language Model Pretraining with Weak Supervision (Wang 2022)
- #PAPER OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework (Wang 2022)
- #PAPER Flamingo: a Visual Language Model for Few-Shot Learning (Alayrac 2022)
- #PAPER LViT: Language meets Vision Transformer in Medical Image Segmentation (Li 2022)
- #PAPER CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers (Ding 2022)
- #PAPER Imagen - Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Saharia 2022)
- https://imagen.research.google/
- Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. It builds on the power of large transformer language models for understanding text and hinges on the strength of diffusion models for high-fidelity image generation (see the text-to-image sketch after this list).
- #CODE https://paperswithcode.com/paper/photorealistic-text-to-image-diffusion-models
- #PAPER Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks (Lu 2022)
- #PAPER #REVIEW A Survey of Vision-Language Pre-Trained Models (Du 2022)
- #PAPER #REVIEW The Creativity of Text-to-Image Generation (Oppenlaender 2022)
- #PAPER GLIGEN: Open-Set Grounded Text-to-Image Generation (Li 2023)
- #PAPER Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models (Wu 2023)
- Visual ChatGPT opens the door to combining ChatGPT with visual foundation models, enabling ChatGPT to handle complex visual tasks (a toy sketch of this tool-routing pattern follows the list)
- #PAPER OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models (Awadalla 2023)
- An open-source framework for training vision-language models with in-context learning (like GPT-4!)
- Includes a Python framework for training Flamingo-style LMMs, a large-scale multimodal dataset with interleaved image and text sequences, an in-context learning evaluation benchmark for vision-language tasks, and trained models (e.g. OpenFlamingo-9B, based on LLaMA); a usage sketch follows the list
- Demo
- #PAPER Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (2023)
- #PAPER Modulating Pretrained Diffusion Models for Multimodal Image Synthesis (Ham 2023)
- #PAPER 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (2024)
- #PAPER Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling (2024)
- #PAPER VideoPoet: A Large Language Model for Zero-Shot Video Generation (2024)
- #PAPER #REVIEW An Introduction to Vision-Language Modeling (2024)
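Code sketches
For the CLIP entry above (Radford 2021): a minimal zero-shot image classification sketch, assuming the Hugging Face transformers library; the checkpoint, image path, and labels are illustrative placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP embeds images and texts into a shared space; zero-shot
# classification picks the label prompt that best matches the image.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # (1, num_labels) similarity scores
probs = logits.softmax(dim=-1)[0]
print({label: round(p.item(), 3) for label, p in zip(labels, probs)})
```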
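Imagen itself was never released, so the text-encoder-plus-diffusion recipe described under the Saharia 2022 entry is illustrated here with an open stand-in (Stable Diffusion v1.5 via the diffusers library); the prompt is arbitrary.

```python
import torch
from diffusers import StableDiffusionPipeline

# Same recipe Imagen describes: a frozen text encoder conditions a
# diffusion model that iteratively denoises an image from pure noise.
# Stable Diffusion stands in here because Imagen weights are not public.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a photorealistic corgi riding a skateboard in Times Square").images[0]
image.save("corgi.png")
```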
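The tool-routing pattern behind Visual ChatGPT (Wu 2023), reduced to a toy dispatcher with stub tools; every function name here is hypothetical, and a real system would let the LLM choose the tool rather than keyword matching.

```python
# Toy version of the Visual ChatGPT loop: a language model routes a user
# request to a visual foundation model (VFM) and folds the result back
# into the chat. The VFMs below are stubs; all names are hypothetical.

def caption_image(path: str) -> str:
    return f"[caption of {path}]"  # stand-in for a captioning VLM

def edit_image(path: str, instruction: str) -> str:
    return f"[{path} edited per: {instruction}]"  # stand-in for an editing model

def route(request: str, image_path: str) -> str:
    # A real system would prompt the LLM to pick a tool; simple
    # keyword matching stands in for that decision here.
    if any(word in request.lower() for word in ("edit", "change", "add", "remove")):
        return edit_image(image_path, request)
    return caption_image(image_path)

print(route("describe this picture", "dog.jpg"))
print(route("edit it to add a party hat", "dog.jpg"))
```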
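A sketch of loading OpenFlamingo (Awadalla 2023) and building a Flamingo-style interleaved few-shot prompt. The call mirrors the project's README at release time; the exact parameter names and values are assumptions and may have changed, so check the repository for current usage.

```python
# pip install open-flamingo -- parameter values below follow the project
# README at release and are assumptions; consult the repo for current ones.
from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="luodian/llama-7b-hf",
    tokenizer_path="luodian/llama-7b-hf",
    cross_attn_every_n_layers=4,
)

# Flamingo-style in-context prompt: images are interleaved with text via
# the special <image> and <|endofchunk|> tokens; the model continues the
# pattern established by the few-shot examples.
prompt = "<image>An image of two cats.<|endofchunk|><image>An image of"
```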