The field of Vision-Language Models (VLMs) has witnessed a rapid surge, with diverse approaches emerging, including MultiModal Large Language Models (MLLMs) and Large Multimodal Models (LMMs). Notable examples of these models include Flamingo, LLaVA, BLIP-2, InternVL, and Qwen-VL, among others. In this discussion, we will take an in-depth look at LLaVA and its derivatives, exploring how these models enable interactive communication with images, with a particular focus on their applications in pathology imaging.
LLaVA (Large Language and Vision Assistant)
Architecture:
For the vision encoder, the pre-trained CLIP visual encoder (ViT-L/14) is used. The input image X_v is fed to it and encoded as embeddings Z_v. A simple linear layer (W) connects the two modalities by projecting the image embeddings into the word embedding space. For text (the language instruction X_q), a tokenizer generates the tokens, which are then mapped to embeddings H_q. For the language model, Vicuna is used (the reasoning given is its better instruction-following capability).
There are various approaches to bringing image embeddings and text embeddings into a common latent space. In LLaVA, as mentioned above, a simple linear layer is used; we will later see how gated cross-attention is used in Flamingo and the Q-Former in BLIP-2.
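As a rough illustration of this wiring, here is a minimal PyTorch-style sketch; the class, attribute names, and dimensions are assumptions for exposition, not LLaVA's actual implementation:

```python
import torch
import torch.nn as nn

class LLaVAStyleModel(nn.Module):
    """Minimal sketch: frozen CLIP features -> linear projection W -> LLM input space."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder             # stand-in for CLIP ViT-L/14
        self.llm = llm                                   # stand-in for Vicuna
        self.projector = nn.Linear(vision_dim, llm_dim)  # W: Z_v -> H_v

    def forward(self, image, text_embeds):
        with torch.no_grad():                            # vision encoder kept frozen
            z_v = self.vision_encoder(image)             # (B, num_patches, vision_dim)
        h_v = self.projector(z_v)                        # (B, num_patches, llm_dim)
        # Prepend the projected image tokens to the text token embeddings H_q.
        inputs = torch.cat([h_v, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```

The key design choice is that only the lightweight projector bridges the two pre-trained models, which keeps alignment training cheap.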
Visual Instruction Data Generation:
- Naive Expansion
Using existing image-text datasets, ranging from CC to LAION, the authors of LLaVA used GPT-4 to generate questions for an image whose existing caption serves as the answer, giving samples of the form:
Human: X_q X_v<STOP> Assistant: X_c<STOP>. Though this is cheap to construct, it lacks diversity and in-depth reasoning.
To mitigate this, they used GPT-4 or ChatGPT (both accept only text as input) as the strong teacher to create instruction-following data involving images. For this, they use two types of symbolic representations to encode an image into a prompt:
Captions: to describe the visual scene from various perspectives
Bounding Boxes: to localize objects in the scene; each bounding box encodes an object concept (e.g., person) and its spatial location.
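To make this concrete, here is a hypothetical sketch of how an image might be serialized into such a symbolic text context for the text-only teacher; the caption wording and box coordinates are illustrative, not taken from the actual dataset.

```python
# Hypothetical symbolic representation of one COCO image: captions plus
# object bounding boxes (normalized [x1, y1, x2, y2]); values are illustrative.
annotations = {
    "captions": [
        "A group of people standing outside of a black vehicle with various luggage.",
        "People try to fit all of their luggage into an SUV.",
    ],
    "boxes": [
        ("person", [0.68, 0.24, 0.77, 0.69]),
        ("backpack", [0.38, 0.69, 0.48, 0.91]),
        ("suitcase", [0.32, 0.84, 0.66, 1.00]),
    ],
}

# The text-only GPT-4/ChatGPT "sees" the image only through this string.
context = "\n".join(annotations["captions"]) + "\n" + "\n".join(
    f"{name}: {coords}" for name, coords in annotations["boxes"]
)
print(context)
```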
Using COCO images, they generate 3 types of instruction-following data:
Conversation
Detailed description
Complex reasoning
Response type 1: conversation
Question: What type of vehicle is featured in the image?
Answer: The image features a black sport utility vehicle (SUV).
Question: Where is the vehicle parked?
Answer: The vehicle is parked in an underground parking area, likely in a public garage.
Question: What are the people in the image doing?
Answer: The people are trying to fit all of their luggage into the SUV, likely preparing for a trip.
Response type 2: detailed description
The image is an underground parking area with a black sport utility vehicle (SUV) parked. There are three people in the scene, with one person standing closer to the left side of the vehicle, another person in the middle, and the third person on the right side. They are all working together to pack their luggage into the SUV for a trip.
In the parking area, various luggage items are scattered around the vehicle. There are two backpacks, one located near the left rear wheel and the other closer to the right side of the vehicle. Additionally, there are two suitcases, one on the right side of the car and another further away near the center of the parking area. A bicycle can also be seen on the left side of the vehicle.
Other cars are parked around the main SUV, with one car positioned behind it and slightly to the left, another behind and slightly to the right, and the third car further behind on the right side.
Response type 3: complex reasoning
Question: What challenges do these people face?
Answer: In the image, a group of people is standing outside a black SUV in a parking area, surrounded by various pieces of luggage, including suitcases and backpacks. They are facing the challenge of fitting all their luggage into the black SUV. There are multiple suitcases and backpacks to be packed, which suggests that the group has a significant amount of belongings to accommodate.
They might have to strategize and arrange the luggage efficiently to ensure that everything fits properly into the vehicle. Additionally, they need to consider the comfort of the passengers and visibility while driving, so the placement of the luggage must not obstruct the driver’s view or make the passengers uncomfortable during the trip.
To curate a brief description, one of the following questions is picked for ChatGPT/GPT-4 to answer:
"Describe the image concisely."
"Provide a brief description of the given image."
"Offer a succinct explanation of the picture presented."
"Summarize the visual content of the image."
"Give a short and clear explanation of the subsequent image."
"Share a concise interpretation of the image provided."
"Present a compact description of the photo’s key features."
"Relay a brief, clear account of the picture shown."
"Render a clear and concise summary of the photo."
"Write a terse but informative summary of the picture."
"Create a compact narrative representing the image presented."
For curating a detailed description, one of the following is used:
"Describe the following image in detail"
"Provide a detailed description of the given image"
"Give an elaborate explanation of the image you see"
"Share a comprehensive rundown of the presented image"
"Offer a thorough analysis of the image"
"Explain the various aspects of the image before you"
"Clarify the contents of the displayed image with great detail"
"Characterize the image using a well-detailed description"
"Break down the elements of the image in a detailed manner"
"Walk through the important details of the image"
"Portray the image with a rich, descriptive narrative"
"Narrate the contents of the image with precision"
"Analyze the image in a comprehensive and detailed manner"
"Illustrate the image through a descriptive explanation"
"Examine the image closely and share its details"
"Write an exhaustive depiction of the given image"
Method to curate Conversation:
messages = [{"role": "system", "content": f"""You are an AI visual assistant, and you are seeing a single image. What you see are provided with five sentences, describing the same image you are looking at. Answer all questions as you are seeing the image.
Design a conversation between you and a person asking about this photo. The answers should be in a tone that a visual AI assistant is seeing the image and answering the question. Ask diverse questions and give corresponding answers.
Include questions asking about the visual content of the image, including the object types, counting the objects, object actions, object locations, relative positions between objects, etc. Only include questions that have definite answers: (1) one can see the content in the image that the question asks about and can answer confidently; (2) one can determine confidently from the image that it is not in the image. Do not ask any question that cannot be answered confidently.
Also include complex questions that are relevant to the content in the image, for example, asking about background knowledge of the objects in the image, asking to discuss about events happening in the image, etc. Again, do not ask about uncertain details. Provide detailed answers when answering complex questions. For example, give detailed examples or reasoning steps to make the content more convincing and well-organized. You can include multiple paragraphs if necessary."""}]
for sample in fewshot_samples:
    messages.append({"role": "user", "content": sample['context']})
    messages.append({"role": "assistant", "content": sample['response']})
messages.append({"role": "user", "content": '\n'.join(query)})
Training:
For each image X_v, multi-turn conversation data (X_q^1, X_a^1, …, X_q^T, X_a^T) is generated, where T is the total number of turns. It is then organized as a sequence by treating all the answers as the assistant's responses, with the instruction at the t-th turn defined as:
$$\mathbf{X}^t_{\text{instruct}} = \begin{cases} \text{Randomly choose } [\mathbf{X}_q^1, \mathbf{X}_v] \text{ or } [\mathbf{X}_v, \mathbf{X}_q^1], & \text{the first turn } t = 1 \\ \mathbf{X}_q^t, & \text{the remaining turns } t > 1 \end{cases}$$
So the input sequence becomes:

$$\mathbf{X}_{\text{system-message}} \langle \text{STOP} \rangle \; \text{Human}: \mathbf{X}^1_{\text{instruct}} \langle \text{STOP} \rangle \; \text{Assistant}: \mathbf{X}^1_a \langle \text{STOP} \rangle \; \text{Human}: \mathbf{X}^2_{\text{instruct}} \langle \text{STOP} \rangle \; \text{Assistant}: \mathbf{X}^2_a \langle \text{STOP} \rangle \cdots$$

where:
$$\begin{align*} &\mathbf{X_{\text{system-message}}} = \text{A chat between a curious human and an artificial intelligence assistant.} \\ &\text{The assistant gives helpful, detailed, and polite answers to the human's questions.} \\ &\langle \text{STOP} \rangle = \texttt{###} \end{align*}$$
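As a concrete illustration, here is a minimal Python sketch of this serialization; the `<image>` placeholder token and the `build_sequence` helper are assumptions for exposition, not LLaVA's released code.

```python
import random

SYSTEM = ("A chat between a curious human and an artificial intelligence assistant. "
          "The assistant gives helpful, detailed, and polite answers to the human's questions.")
STOP = "###"
IMAGE_TOKEN = "<image>"  # placeholder later replaced by the projected visual tokens H_v

def build_sequence(turns):
    """Serialize one multi-turn conversation (list of (question, answer) pairs)
    for a single image, following the turn structure described above."""
    seq = SYSTEM + STOP
    for t, (q, a) in enumerate(turns):
        if t == 0:
            # First turn: randomly place the image before or after the question.
            instruct = random.choice([q + "\n" + IMAGE_TOKEN, IMAGE_TOKEN + "\n" + q])
        else:
            instruct = q
        seq += f" Human: {instruct}{STOP} Assistant: {a}{STOP}"
    return seq
```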
Instruction tuning objective:
The probability of generating the target answers $\mathbf{X}_a$ is

$$p(\mathbf{X}_a \mid \mathbf{X}_v, \mathbf{X}_{\text{instruct}}) = \prod_{i=1}^{L} p_\theta\big(x_i \mid \mathbf{X}_v, \mathbf{X}_{\text{instruct}, <i}, \mathbf{X}_{a, <i}\big)$$

where $x_i$ is the current prediction token, $\mathbf{X}_v$ is the image, $\mathbf{X}_{\text{instruct}, <i}$ are the instruction tokens before $x_i$, $\mathbf{X}_{a, <i}$ are the answer tokens before $x_i$, and $L$ is the length of the target answer sequence. The model is trained to predict the assistant answers and where to stop, so only the answer tokens (highlighted in green in the paper's figures) are used to compute the loss of the auto-regressive LLM.
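A minimal sketch of the corresponding label masking, assuming the serialized conversation has already been tokenized and the answer token spans are known (the helper below is illustrative, not LLaVA's data collator):

```python
import torch

IGNORE_INDEX = -100  # labels with this value are ignored by the cross-entropy loss

def mask_labels(input_ids, answer_spans):
    """Keep the loss only on assistant-answer tokens (and their stop tokens).
    `answer_spans` is a list of (start, end) index pairs into `input_ids`."""
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end in answer_spans:
        labels[start:end] = input_ids[start:end]
    return labels
```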
LLaVA’s training is done in two stages:
Stage-1: Pre-training for the feature alignment
During this stage, the LLM (language model) and the vision encoder are frozen and only the projection layer is trained. The authors filtered CC3M down to 595K image-text pairs and, using the naive approach mentioned above, curated single-turn instruction-following data for training.
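A rough sketch of this freeze pattern, assuming the `LLaVAStyleModel` wiring from the earlier sketch (attribute names and the learning rate are illustrative):

```python
import torch

def configure_stage1(model, lr=2e-3):
    """Stage-1: freeze the vision encoder and the LLM; train only the projector."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    for p in model.llm.parameters():
        p.requires_grad = False
    return torch.optim.AdamW(model.projector.parameters(), lr=lr)
```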
Stage-2: Finetuning end-to-end
For Multimodal Chatbot:
Finetuning on Instruct-158K data
Also finetuning on multi-turn conversation / single-turn description and reasoning
For Science Q&A:
Finetuning on Science QA dataset (21K multimodal MCQs)
Organized as single-turn conversation
Quantitative Evaluation:
GPT-4 is used to measure the quality of the generated responses. Triplets are created consisting of an image, its ground-truth textual description, and a question. LLaVA predicts the answer based on the question and the image, while GPT-4 creates a reference prediction based on the question and the ground-truth textual description. After obtaining the responses from both models, the question, the visual information (in the form of the textual description), and the generated responses from both assistants (GPT-4 and LLaVA) are fed to the judge (text-only GPT-4), which gives each response an overall score on a scale of 1-10.
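A hedged sketch of how such a judging prompt could be assembled; the exact wording used in the paper differs, and `build_judge_prompt` is just an illustrative helper:

```python
def build_judge_prompt(question, gt_description, reference_answer, llava_answer):
    """Assemble a text-only judging prompt: visual context as text, the question,
    and both assistants' responses, asking for 1-10 scores."""
    return (
        "Rate the helpfulness, relevance, accuracy, and level of detail of the two "
        "responses on a scale of 1-10.\n"
        f"[Visual context]\n{gt_description}\n"
        f"[Question]\n{question}\n"
        f"[Assistant 1 (reference, GPT-4)]\n{reference_answer}\n"
        f"[Assistant 2 (LLaVA)]\n{llava_answer}\n"
        "Output the two scores, then a short justification."
    )

# The returned string would be sent to text-only GPT-4, which acts as the judge.
```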
LLaVA-1.5
Instead of a single linear layer to project the image embeddings into the joint embedding space, a two-layer MLP (multi-layer perceptron) is used.
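A sketch of this projector change; the dimensions are illustrative (CLIP ViT-L/14 features projected into a Vicuna-13B-sized embedding space), and the non-linearity shown is an assumption:

```python
import torch.nn as nn

# LLaVA: a single linear projection from vision features to LLM embeddings.
projector_v1 = nn.Linear(1024, 5120)

# LLaVA-1.5: a two-layer MLP with a non-linearity in between.
projector_v15 = nn.Sequential(
    nn.Linear(1024, 5120),
    nn.GELU(),
    nn.Linear(5120, 5120),
)
```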
Further Additions/ Improvements
Image resolution (from 224×224 to 336×336)
Addition of academic-task-oriented datasets for VQA, OCR, and region-level perception
Scaling the Vicuna LLM from 7B to 13B parameters
Models based on LLaVA / LLaVA-1.5 in Pathomics: LLaVA-Med, Quilt-LLaVA, PathGen-LLaVA, etc.
LLaVA-Med:
Medical Concept Alignment:
Dataset: For a biomedical image X_v and its associated caption X_c (from 600K pairs sampled from PMC-15M), a question X_q is sampled that asks to describe the biomedical image. A single-round instruction therefore becomes:
$$\text{Human : } X_q \, X_v \langle \text{STOP} \rangle \backslash n \\ \text{Assistant : } X_c \langle \text{STOP} \rangle \backslash n$$
For each sample, given the language instruction (the describe-the-image question) and the image input, the model is asked to predict the original caption. The vision encoder and the language model are frozen; only the projection matrix is trained. In this way, the image features of vast numbers of novel biomedical visual concepts can be aligned to their textual word embeddings in the pre-trained language model. (Shouldn't they also unfreeze the vision encoder, since the visual features of biomedical images differ a lot from those of natural images?)
Questions list: the same eleven brief-description instructions listed earlier for LLaVA (from "Describe the image concisely." to "Create a compact narrative representing the image presented.") are reused here.
End-to-End Instruction Training:
Dataset: Given an image, GPT-4 is asked to generate multi-round questions and answers in a tone as if it could see the image, curating a VQA (Visual Question Answering) dataset.
In this stage, only the visual encoder weights are frozen; the weights of the projection matrix and the language model are updated.
Prompt used:
messages = [{"role": "system", "content": """You are an AI assistant specialized in biomedical topics.
You are provided with a text description (Figure Caption) of a figure image from a biomedical research paper. In some cases, you may have additional text (Figure Context) that mentions the image. Unfortunately, you don’t have access to the actual image.
Below are requirements for generating the questions and answers in the conversation:
- Avoid quoting or referring to specific facts, terms, abbreviations, dates, numbers, or names, as these may reveal the conversation is based on the text information, rather than the image itself. Focus on the visual aspects of the image that can be inferred without the text information.
- Do not use phrases like "mentioned", "caption", "context" in the conversation. Instead, refer to the information as being "in the image."
- Ensure that questions are diverse and cover a range of visual aspects of the image.
- The conversation should include at least 2-3 turns of questions and answers about the visual aspects of the image.
- Answer responsibly, avoiding overconfidence, and do not provide medical advice or diagnostic information. Encourage the user to consult a healthcare professional for advice."""}]
for sample in fewshot_samples:
    messages.append({"role": "user", "content": sample['context']})
    messages.append({"role": "assistant", "content": sample['response']})
messages.append({"role": "user", "content": query})
Quilt-LLaVA:
Quilt + LLaVA
Quilt-1M (1M pairs in total, ~0.7M unique)
The QUILT curation pipeline processes histopathology data from YouTube videos in several steps:
Video Identification: Relevant histopathology videos are identified through search.
Image Extraction: Histopathology frames are extracted and de-noised using trained models.
Text Processing:
Automatic Speech Recognition (ASR) is applied to extract spoken content.
Unified Medical Language System (UMLS) and large language models (LLMs) are used for post-processing and correcting ASR errors.
Sub-pathology, medical, and region-of-interest (ROI) text are extracted using LLMs.
Image-Text Pairing: Domain-specific algorithms pair images with relevant text, removing duplicates.
Quilt is then combined with data from other sources such as Twitter and research papers, resulting in QUILT-1M.
The curation approach takes inspiration from works such as Connecting Vision and Language with Video Localized Narratives.
Interesting methodology for curating QUILT-Instruct:
So far, mostly an LLM alone has been used to curate instruction-tuning datasets, but here the authors take advantage of educational videos from the domain. In educational videos, narrators often pause while exploring large-scale WSIs before indicating salient areas with their cursors. This richness can be utilized to convert unstructured videos into usable visually-grounded instruction data using these three steps:
Localize narrators’ cursors in videos
Spatio-temporal clustering of cursor location (mapping the region covered by cursor with its corresponding text part in the caption)
Using the extracted grounded captions, use an LLM to generate instruction tuning dataset
For each chunk, a median frame is computed in the pixel domain and subtracted from every frame within the chunk. A threshold is then applied to reduce noise, and the maximum value is used to capture the mouse cursor points. These cursor points are clustered to localize relevant medical content in image captions. Additionally, in the "Trace Clustering and Mapping" step, color is used to encode time, aiding in temporal data visualization. This process ensures precise localization and annotation of medical content in the dataset.
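A rough NumPy sketch of this cursor-localization idea, assuming grayscale frames and an illustrative noise threshold (this is not the authors' implementation):

```python
import numpy as np

def cursor_points(frames, threshold=30):
    """frames: (T, H, W) array of grayscale frames from one narration chunk.
    Returns one (x, y) cursor estimate per frame via median-frame subtraction."""
    median = np.median(frames, axis=0)
    residual = np.abs(frames.astype(np.float32) - median)
    residual[residual < threshold] = 0
    points = []
    for frame_residual in residual:
        y, x = np.unravel_index(np.argmax(frame_residual), frame_residual.shape)
        points.append((int(x), int(y)))
    return points
```

The resulting per-frame points are what get clustered spatio-temporally and mapped to the narration text.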
A training approach similar to LLaVA-Med's is then used.
PathGen-1.6M and PathGen-LLaVA:
PathGen Dataset Curation:
PathGen uses multiple agents working collaboratively to generate high-quality pathology image-text pairs. Text prompts (taken from WSI reports, tissue sources, etc.) are generated for CLIP to retrieve the most relevant patches from a WSI. These patches are then described by a trained pathology LMM (Large Multimodal Model) agent, followed by another LMM agent that revises and summarizes the descriptions.
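A minimal sketch of the retrieval step, under the assumption that text and patch embeddings come from a pathology CLIP model such as PathGen-CLIP; the helper below is illustrative rather than the authors' code:

```python
import numpy as np

def retrieve_top_patches(text_embedding, patch_embeddings, k=5):
    """Rank WSI patch embeddings by cosine similarity to a report-derived
    text-prompt embedding and return the indices of the top-k patches."""
    t = text_embedding / np.linalg.norm(text_embedding)
    p = patch_embeddings / np.linalg.norm(patch_embeddings, axis=1, keepdims=True)
    similarity = p @ t
    return np.argsort(-similarity)[:k]
```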
Agents involved:
PathGen-CLIP-Linit: General models like OpenAI’s CLIP perform poorly in pathology tasks. To improve cross-modal retrieval, a specialized model is trained using a dataset (PathGeninit) comprising 700K samples from PathCap (200K), Quilt-1M (400K), and OpenPath (100K). OpenCLIP framework is used to train a pathology-specific CLIP model with 336 image size, named PathGen-CLIP-Linit.
Description LMM Agent: Existing pathology image captions lack detailed descriptions of complex features like ‘nuclear atypia’ and ‘disordered cellular arrangements.’ To improve captioning, 10,000 image-caption pairs are sampled from PathCap, OpenPath, and Quilt-1M, then enhanced using GPT-4V. This process produces 30,000 refined descriptions. A specialized LMM agent, PathGen-LLaVAdesp, is trained using LLaVA-v1.5-13B, replacing its OpenAI-CLIP encoder with PathGen-CLIP-Linit, improving pathology image descriptions. Performance is compared against LLaVA-Med and Quilt-LLaVA.
Revise LMM Agent: Existing pathology LMMs lack self-correction abilities. To address this, errors are introduced into descriptions using GPT-4 through additions, deletions, or modifications. Pairs of accurate and erroneous descriptions are created, along with inverse operations to correct them. These are used to train the Revise LMM Agent, improving its error correction capabilities.
Summarize Agent: PathGen-LLaVAdesp generates descriptions exceeding the 77-token limit of the CLIP text encoder. GPT-4 generates instruction-tuning data for summarization, and a Llama-2-based summary agent is fine-tuned on it to produce concise summaries for whole-slide image (WSI) patches.
Resources used (Apart from the original papers):
Visual Instruction Tuning by Moon Ye-Bin (Do give them a subscribe), Computer Vision in the Wild