Vision Language Models (VLMs), GenAI for CV
Vision language models (VLMs) learn jointly from images and text, which lets a single model tackle many tasks, from visual question answering to image captioning.
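As a quick taste of the captioning task, here is a minimal sketch assuming the Hugging Face transformers library; the BLIP checkpoint and image URL are illustrative placeholders, and any captioning-capable VLM checkpoint would work:

```python
from transformers import pipeline

# The image-to-text pipeline wraps a captioning-capable VLM;
# BLIP is used here purely as an illustrative checkpoint.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# Accepts a local path or URL to an image (placeholder URL below).
result = captioner("https://example.com/cat.jpg")
print(result[0]["generated_text"])  # e.g. "a cat sitting on a couch"
```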
Resources
- An Introduction to VLMs: The Future of Computer Vision Models (Ro Isachenko, Towards Data Science, Nov 2024)
- Vision Language Models Explained
- How Far is Video Generation from World Model: A Physical Law Perspective
- AI Image Generators Compared Side-By-Side Reveals Stark Differences
Leaderboards
- Vision Language Leaderboards - a Hugging Face Collection by merve
- GenAI Arena - a Hugging Face Space by TIGER-Lab
- Open VLM Leaderboard - a Hugging Face Space by opencompass
- Text-to-video model arena
Models
- Mochi (Genmo) - an open-source, state-of-the-art video generation model, released under the Apache 2.0 license
- Midjourney
- DALL-E (OpenAI)
- Veo 2 (Google DeepMind)
- Imagen (Google)
- Imagen Video (Google)
- Stable Diffusion
- Make-A-Video (Meta)
- playgroundai/playground-v2.5-1024px-aesthetic · Hugging Face
Applications
- Vidu - AI video generation platform
- Infinite AI Artboard - Recraft
- AI Image Generator - Create Art, Images & Video | Leonardo AI
- Introducing FLUX.1 Tools - Black Forest Labs
References
- #PAPER DALL-E - Creating Images from Text (Ramesh 2021)
- #PAPER CLIP - Learning Transferable Visual Models From Natural Language Supervision (Radford 2021) - a zero-shot classification sketch follows this list
- #PAPER Florence: A New Foundation Model for Computer Vision (Yuan 2021)
- #PAPER SimVLM: Simple Visual Language Model Pretraining with Weak Supervision (Wang 2022)
- #PAPER OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework (Wang 2022)
- #PAPER Flamingo: a Visual Language Model for Few-Shot Learning (Alayrac 2022)
- #PAPER LViT: Language meets Vision Transformer in Medical Image Segmentation (Li 2022)
- #PAPER CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers (Ding 2022)
- #PAPER Imagen - Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Saharia 2022)
- https://imagen.research.google/
- Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. It builds on the power of large transformer language models for understanding text and hinges on the strength of diffusion models for high-fidelity image generation (see the text-to-image sketch after this list).
- #CODE https://paperswithcode.com/paper/photorealistic-text-to-image-diffusion-models
- #PAPER Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks (Lu 2022)
- #PAPER #REVIEW A Survey of Vision-Language Pre-Trained Models (Du 2022)
- #PAPER #REVIEW The Creativity of Text-to-Image Generation (Oppenlaender 2022)
- #PAPER GLIGEN: Open-Set Grounded Text-to-Image Generation (Li 2023)
- #PAPER Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models (Wu 2023)
- Visual ChatGPT opens the door to combining ChatGPT with visual foundation models, enabling ChatGPT to handle complex visual tasks (a toy sketch of this tool-routing pattern follows the list)
- #PAPER OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models (Awadalla 2023)
- An open-source framework for training vision-language models with in-context learning (like GPT-4!)
- Includes a Python framework for training Flamingo-style LMMs, a large-scale multimodal dataset with interleaved image and text sequences, an in-context learning evaluation benchmark for vision-language tasks, and trained models (e.g. OpenFlamingo-9B, based on LLaMA); a usage sketch follows the list
- Demo
- #PAPER Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (2023)
- #PAPER Modulating Pretrained Diffusion Models for Multimodal Image Synthesis (Ham 2023)
- #PAPER 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (2024)
- #PAPER Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling (2024)
- #PAPER VideoPoet: A Large Language Model for Zero-Shot Video Generation (2024)
- #PAPER #REVIEW An Introduction to Vision-Language Modeling (2024)
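Code sketches
For the CLIP entry above (Radford 2021): a minimal zero-shot image classification sketch, assuming the Hugging Face transformers library; the checkpoint, image path, and labels are illustrative placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP embeds images and texts into a shared space; zero-shot
# classification picks the label prompt that best matches the image.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # (1, num_labels) similarity scores
probs = logits.softmax(dim=-1)[0]
print({label: round(p.item(), 3) for label, p in zip(labels, probs)})
```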
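Imagen itself was never released, so the text-encoder-plus-diffusion recipe described under the Saharia 2022 entry is illustrated here with an open stand-in (Stable Diffusion v1.5 via the diffusers library); the prompt is arbitrary.

```python
import torch
from diffusers import StableDiffusionPipeline

# Same recipe Imagen describes: a frozen text encoder conditions a
# diffusion model that iteratively denoises an image from pure noise.
# Stable Diffusion stands in here because Imagen weights are not public.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a photorealistic corgi riding a skateboard in Times Square").images[0]
image.save("corgi.png")
```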
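The tool-routing pattern behind Visual ChatGPT (Wu 2023), reduced to a toy dispatcher with stub tools; every function name here is hypothetical, and a real system would let the LLM choose the tool rather than keyword matching.

```python
# Toy version of the Visual ChatGPT loop: a language model routes a user
# request to a visual foundation model (VFM) and folds the result back
# into the chat. The VFMs below are stubs; all names are hypothetical.

def caption_image(path: str) -> str:
    return f"[caption of {path}]"  # stand-in for a captioning VLM

def edit_image(path: str, instruction: str) -> str:
    return f"[{path} edited per: {instruction}]"  # stand-in for an editing model

def route(request: str, image_path: str) -> str:
    # A real system would prompt the LLM to pick a tool; simple
    # keyword matching stands in for that decision here.
    if any(word in request.lower() for word in ("edit", "change", "add", "remove")):
        return edit_image(image_path, request)
    return caption_image(image_path)

print(route("describe this picture", "dog.jpg"))
print(route("edit it to add a party hat", "dog.jpg"))
```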
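A sketch of loading OpenFlamingo (Awadalla 2023) and building a Flamingo-style interleaved few-shot prompt. The call mirrors the project's README at release time; the exact parameter names and values are assumptions and may have changed, so check the repository for current usage.

```python
# pip install open-flamingo -- parameter values below follow the project
# README at release and are assumptions; consult the repo for current ones.
from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="luodian/llama-7b-hf",
    tokenizer_path="luodian/llama-7b-hf",
    cross_attn_every_n_layers=4,
)

# Flamingo-style in-context prompt: images are interleaved with text via
# the special <image> and <|endofchunk|> tokens; the model continues the
# pattern established by the few-shot examples.
prompt = "<image>An image of two cats.<|endofchunk|><image>An image of"
```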