Integral Consulting Company’s Post

👀

Vaibhav Srivastav

GPU poor @ Hugging Face

In the wake of all the rumours, re-reading Multimodal Llama in the Llama 3 paper today! 🦙

It comprises three main components:
1. Image Encoder
2. Image Adapter
3. Video Adapter

Image Encoder
> A ViT-H/14 variant of the vision transformer with 630M parameters, trained on 2.5B image-text pairs
> Processes images at 224x224 resolution, divided into 16x16 patches of 14x14 pixels each
> Multi-layer feature extraction: features from the 4th, 8th, 16th, 24th, and 31st layers, in addition to the final-layer features
> 8 gated self-attention layers, for a total of 40 transformer blocks and 850M total parameters
> The encoder generates a 7680-dimensional representation for each of the 256 patches (see the patch arithmetic sketch below)

Image Adapter
> Cross-attention layers between visual and language-model token representations, applied after every fourth self-attention layer in the language model, using GQA (a toy version is sketched below)
> Initial pre-training: trained on ≈6B image-text pairs, with images resized to fit within four tiles of 336x336 pixels, arranged to accommodate various aspect ratios
> Annealing: training continues on ≈500M images, increasing per-tile resolution to enhance performance on downstream tasks

Video Adapter
> Input: up to 64 video frames, each encoded by the image encoder
> Temporal modeling: aggregates 32 consecutive frames and adds video cross-attention layers
> Aggregator: implemented as a perceiver resampler (sketched below)
> Parameters: 0.6B and 4.6B for Llama 3 7B and 70B, respectively

If/when it gets released, it'd be a leap forward for VLMs! 🤗
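A quick back-of-the-envelope check of the image-encoder numbers above, in plain Python. The 1280-dim hidden size is the standard ViT-H width and is my assumption; the post only states the 7680-dim output.

# Patch and feature-dimension arithmetic for the ViT-H/14 image encoder
image_size = 224          # input resolution (pixels per side)
patch_size = 14           # ViT-H/14 patch size
patches_per_side = image_size // patch_size        # 16
num_patches = patches_per_side ** 2                # 16 x 16 = 256 patches

vit_width = 1280          # ViT-H hidden size (assumed, not stated in the post)
tap_layers = [4, 8, 16, 24, 31]                    # intermediate feature taps
num_feature_levels = len(tap_layers) + 1           # plus final-layer features = 6

patch_embedding_dim = vit_width * num_feature_levels   # 6 * 1280 = 7680
print(num_patches, patch_embedding_dim)                # 256 7680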
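And a minimal PyTorch sketch of the image-adapter idea: gated cross-attention blocks interleaved after every fourth language-model layer. All names and sizes are illustrative, standard multi-head attention stands in for the GQA used in the paper, and the zero-initialised tanh gate keeps the pretrained LM output unchanged at the start of adapter training.

import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> no-op at init

    def forward(self, text_tokens, image_tokens):
        q = self.norm(text_tokens)
        attended, _ = self.attn(q, image_tokens, image_tokens)
        return text_tokens + torch.tanh(self.gate) * attended

class ToyAdaptedLM(nn.Module):
    """A stack of self-attention layers with cross-attention after every 4th layer."""
    def __init__(self, n_layers=8, d_model=512, n_heads=8, cross_every=4):
        super().__init__()
        self.self_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.cross_layers = nn.ModuleDict({
            str(i): GatedCrossAttention(d_model, n_heads)
            for i in range(n_layers) if (i + 1) % cross_every == 0
        })

    def forward(self, text_tokens, image_tokens):
        h = text_tokens
        for i, layer in enumerate(self.self_layers):
            h = layer(h)
            if str(i) in self.cross_layers:
                h = self.cross_layers[str(i)](h, image_tokens)
        return h

# text: 32 tokens; image: 256 patch embeddings already projected to d_model
lm = ToyAdaptedLM()
out = lm(torch.randn(1, 32, 512), torch.randn(1, 256, 512))
print(out.shape)  # torch.Size([1, 32, 512])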
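Finally, a rough sketch of a perceiver-resampler style temporal aggregator: a small set of learned latent queries cross-attends to the patch embeddings of a group of 32 consecutive frames and compresses them into a fixed number of video tokens. Again, all sizes here are illustrative, not the paper's.

import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_latents=64, n_blocks=2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.blocks = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_blocks)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_blocks))

    def forward(self, frame_tokens):
        # frame_tokens: (batch, frames * patches, d_model)
        b = frame_tokens.shape[0]
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        for attn, norm in zip(self.blocks, self.norms):
            attended, _ = attn(norm(latents), frame_tokens, frame_tokens)
            latents = latents + attended
        return latents  # (batch, n_latents, d_model) video tokens

# 32 consecutive frames with 256 patch embeddings each, compressed to 64 tokens
frames = torch.randn(1, 32 * 256, 512)
video_tokens = PerceiverResampler()(frames)
print(video_tokens.shape)  # torch.Size([1, 64, 512])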
