AI Newsletter
Another week, another round of cool updates in the world of AI!
🚀 OpenAI's Strawberry Q: The Next Big Thing in AI
🚀 OpenAI Employee Trouble
🚀 New AI Education Startup
🚀 Update to Gemini on Android
🚀 Google Vids Announcement
🚀 New Code Model from Mistral
French AI company Mistral has released a new code generation model, Codestral Mamba. This open-source model supports up to 256,000 tokens, doubling the input capacity of OpenAI's current ChatGPT. With 7 billion parameters, Codestral Mamba delivers fast responses even with extensive input, making it a promising option for developers seeking advanced code generation tools. If you're exploring alternatives for coding tasks, this model might be worth trying.
🔥 Selfies into 3D-Printed Models
🚀 New ChatGPT Mini
🚀 AI and Forensics: Identifying Sex from Teeth
🚀 New LLM from Mistral and Nvidia
Nvidia and Mistral have teamed up to release Mistral NeMo, a powerful 12-billion-parameter language model with a 128,000-token context window. Designed for local deployment, the model is a strong fit for businesses with limited internet access or strict data privacy needs. While it's tailored for laptops and desktops, it promises robust performance and flexibility. You can check it out now on Nvidia's website, with a downloadable desktop version coming soon.
🚀 Meta Limits AI Offerings in the EU
Meta has announced that its multimodal AI models, including those for image and video generation, will not be available in the European Union. While Meta plans to release its multimodal LLaMA model soon, EU users will only have access to text-based models due to uncertainties surrounding the EU's regulatory environment and GDPR compliance. The company notes that similar regulations in the UK have not posed the same issues, allowing for the launch of their new model there.
🚀 Google AI at the Olympics
Google is stepping into the spotlight as the official AI sponsor for Team USA at this year's Summer Olympics! Expect to see Google AI featured prominently across ads and promotions throughout the event.
New Noteworthy Papers:
Authors: Tao Jiang, Xinchen Xie, Yining Li
Institution: Shanghai AI Laboratory
Abstract: Whole-body pose estimation involves predicting keypoints for the entire body, including the face, torso, hands, and feet. This task is critical for human-centric perception and various applications. The RTMW (Real-Time Multi-person Whole-body) models present a high-performance solution for 2D and 3D whole-body pose estimation. Incorporating the RTMPose model architecture with Feature Pyramid Networks (FPN) and Hierarchical Encoding Modules (HEM), RTMW captures detailed pose information across different body parts and scales. Trained on a comprehensive collection of open-source human keypoint datasets and enhanced through a two-stage distillation strategy, RTMW achieves 70.2 mAP on the COCO-Wholebody benchmark, making it the first open-source model to exceed 70 mAP on this benchmark while maintaining high inference efficiency. RTMW also explores 3D pose estimation through monocular image-based methods.
Authors: Akshay Krishnan, Abhijit Kundu, Kevis-Kokitsi Maninis, James Hays, Matthew Brown
Institutions: Google Research, Georgia Institute of Technology
Abstract: OmniNOCS introduces a large-scale monocular dataset featuring Normalized Object Coordinate Space (NOCS) maps, object masks, and 3D bounding box annotations for a diverse set of indoor and outdoor scenes. This dataset includes 20 times more object classes and 200 times more instances than existing NOCS datasets. The accompanying model, NOCSformer, leverages OmniNOCS to predict 3D oriented boxes (poses) and 3D point clouds (shapes) from 2D object detections. NOCSformer, a transformer-based model, demonstrates its ability to generalize across a wide range of object classes and achieves results comparable to state-of-the-art 3D detection methods like Cube R-CNN. The model also excels in providing detailed 3D object shapes and segmentations. OmniNOCS sets a new benchmark for NOCS prediction tasks, offering a comprehensive resource for advancing 3D object detection and understanding.
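For readers new to NOCS: a NOCS map assigns each object pixel a coordinate inside a normalized [0, 1]³ cube in the object's canonical frame, and a predicted size and pose lift those coordinates to metric 3D points. Here is a minimal sketch of that relation; the variable names are ours, not the paper's.

```python
# Illustrative use of a NOCS map: normalized object coordinates in [0, 1]^3 are
# mapped into the camera frame given a predicted object size, rotation, and
# translation. Names and shapes are assumptions for illustration only.
import numpy as np

def nocs_to_camera(nocs: np.ndarray,      # (N, 3) normalized object coords in [0, 1]
                   extent: np.ndarray,    # (3,) predicted metric box size (w, h, d)
                   R: np.ndarray,         # (3, 3) predicted object rotation
                   t: np.ndarray):        # (3,) predicted object translation
    """Lift NOCS points to metric 3D points in the camera frame."""
    obj_pts = (nocs - 0.5) * extent       # center the unit cube, scale to metric size
    return obj_pts @ R.T + t              # rigid transform into the camera frame

cam_pts = nocs_to_camera(np.random.rand(500, 3),
                         np.array([0.4, 0.3, 0.5]),
                         np.eye(3),
                         np.array([0.0, 0.0, 2.0]))
print(cam_pts.shape)  # (500, 3)
```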
Authors: Benjamin Fuhrer, Chen Tessler, Gal Dalal
Institution: NVIDIA
Abstract: Neural networks have excelled in various tasks but face challenges such as interpretability, support for categorical features, and lightweight implementations. Gradient Boosting Trees (GBT) address these challenges but are underutilized in reinforcement learning (RL). This paper introduces Gradient-Boosting RL (GBRL), a framework that adapts GBTs for RL. GBRL implements various actor-critic algorithms and compares them with neural network-based approaches. It introduces a tree-sharing method for policy and value functions with distinct learning rates, enhancing efficiency over millions of interactions. GBRL achieves competitive performance, especially in environments with structured or categorical features, and offers a GPU-accelerated implementation compatible with popular RL libraries.
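To make the idea concrete, here is a toy sketch of functional gradient boosting applied to a policy: each boosting step fits a small regression tree to the gradient of the policy loss with respect to the ensemble's outputs. This only illustrates the principle; GBRL's GPU-accelerated implementation and its actor-critic tree-sharing scheme are more involved.

```python
# Toy sketch: a REINFORCE-style policy whose logits are an additive ensemble of
# regression trees, trained by fitting each new tree to the policy gradient taken
# with respect to the current logits. Not GBRL's actual implementation.
import numpy as np
import gymnasium as gym
from sklearn.tree import DecisionTreeRegressor

env = gym.make("CartPole-v1")
n_actions = env.action_space.n
trees, lr = [], 0.3

def logits(states):                       # ensemble prediction: sum of tree outputs
    out = np.zeros((len(states), n_actions))
    for tree in trees:
        out += lr * tree.predict(states)
    return out

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for it in range(200):
    # Roll out one episode with the current ensemble policy.
    states, actions, rewards = [], [], []
    s, _ = env.reset()
    done = False
    while not done:
        p = softmax(logits(np.array([s])))[0]
        a = np.random.choice(n_actions, p=p)
        s2, r, term, trunc, _ = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s, done = s2, term or trunc
    returns = np.cumsum(rewards[::-1])[::-1]          # undiscounted reward-to-go

    # Gradient of -sum_t log pi(a_t|s_t) * G_t w.r.t. logits: (pi - onehot(a)) * G.
    S = np.array(states)
    probs = softmax(logits(S))
    grad = probs.copy()
    grad[np.arange(len(actions)), actions] -= 1.0
    grad *= np.array(returns)[:, None]

    # Boosting step: fit a small tree to the negative gradient and add it.
    trees.append(DecisionTreeRegressor(max_depth=3).fit(S, -grad))
```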
Authors: Qianqian Wang, Vickie Ye, Hang Gao, Jake Austin, Zhengqi Li, Angjoo Kanazawa
Institutions: UC Berkeley, Google Research
Abstract: Monocular dynamic reconstruction is a complex problem due to its ill-posed nature. Existing methods either rely on templates, are limited to quasi-static scenes, or fail to model 3D motion comprehensively. This paper introduces a method for reconstructing dynamic scenes with explicit, full-sequence 3D motion from monocular videos. The approach leverages a low-dimensional structure of 3D motion using SE(3) motion bases and consolidates noisy supervisory signals from monocular depth maps and 2D tracks. The method achieves state-of-the-art performance in long-range 3D/2D motion estimation and novel view synthesis.
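The central modeling trick is that each point's trajectory is a weighted blend of a small set of shared, time-varying rigid (SE(3)) transforms, in the spirit of linear blend skinning. A rough sketch of that blending, with shapes chosen purely for illustration:

```python
# Illustrative sketch of per-point motion as a weighted blend of shared SE(3)
# motion bases; not the paper's code, and shapes are assumptions.
import torch

def blend_se3_bases(points0: torch.Tensor,   # (N, 3) canonical 3D points
                    R: torch.Tensor,         # (T, K, 3, 3) per-time rotation of each basis
                    t: torch.Tensor,         # (T, K, 3)   per-time translation of each basis
                    weights: torch.Tensor):  # (N, K) per-point blending weights (sum to 1)
    """Returns (T, N, 3): each point's trajectory as a weighted blend of K rigid motions."""
    # Apply every basis transform to every point: (T, K, N, 3)
    transformed = torch.einsum('tkij,nj->tkni', R, points0) + t[:, :, None, :]
    # Blend the K candidate positions with the per-point weights: (T, N, 3)
    return torch.einsum('nk,tkni->tni', weights, transformed)

# Example: 1000 points, 20 motion bases, 60 frames.
N, K, T = 1000, 20, 60
points0 = torch.randn(N, 3)
R = torch.eye(3).repeat(T, K, 1, 1)              # identity rotations for illustration
t = torch.zeros(T, K, 3)
weights = torch.softmax(torch.randn(N, K), dim=-1)
traj = blend_se3_bases(points0, R, t, weights)   # (60, 1000, 3)
```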
Authors: Skanda Koppula, Ignacio Rocco, Yi Yang, Joe Heyward, João Carreira, Andrew Zisserman, Gabriel Brostow, Carl Doersch
Institutions: Google DeepMind, University College London, University of Oxford
Abstract: Introducing TAPVid-3D, a groundbreaking benchmark for long-range Tracking Any Point in 3D (TAP-3D). While 2D point tracking has well-established benchmarks, 3D point tracking lacked such resources until now. TAPVid-3D leverages over 4,000 real-world videos from diverse sources to evaluate 3D point tracking, incorporating various object types, motion patterns, and environments. The benchmark introduces new metrics to address depth ambiguity, occlusions, and multi-track spatio-temporal smoothness. It also includes verified trajectories and competitive baselines to advance our understanding of precise 3D motion.
Authors: Rui Li, Dong Liu
Institution: University of Science and Technology of China, Hefei, China
Abstract: Motion estimation in videos can be complex due to diverse motion and appearance characteristics. The proposed DecoMotion addresses this challenge by introducing a novel test-time optimization method for per-pixel and long-range motion estimation. DecoMotion decomposes video content into static scenes and dynamic objects, using a quasi-3D canonical volume representation. It coordinates transformations between local and canonical spaces, applying affine transformations for static scenes and rectifying non-rigid transformations for dynamic objects. This approach enhances tracking robustness through occlusions and deformations, resulting in a significant improvement in point-tracking accuracy.
Authors: Chuanrui Zhang, Yonggen Ling, Minglei Lu, Minghan Qin, Haoqian Wang
Institutions: Tsinghua University, Beijing, China; Tencent Robotics X, Shenzhen, China
Abstract: Addressing the challenge of 3D object understanding in robotic manipulation, CODERS introduces a one-stage approach for category-level object detection, pose estimation, and reconstruction from stereo images. Traditional monocular and RGB-D methods often face scale ambiguity due to imprecise depth measurements. CODERS tackles this by integrating an Implicit Stereo Matching module with 3D positional information, coupled with a transform-decoder architecture for end-to-end task learning. This approach not only surpasses existing methods in performance on the public TOD dataset but also generalizes effectively to real-world scenarios when trained on simulated data.
Authors: Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan
Institution: Apple
Abstract: We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language model (LLM) that can jointly capture the detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled video frames in an effective way. Specifically, the Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible (e.g., with 24x24 tokens), and the Fast pathway operates on a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on the motion cues. As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for understanding details along the video. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves comparable or even better performance compared to state-of-the-art Video LLMs that are fine-tuned on video datasets.
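The token arithmetic is the interesting part. Below is a minimal sketch of the two-stream aggregation described above; the strides and pooling factors are illustrative assumptions, not the authors' exact settings.

```python
# Minimal sketch of a SlowFast-style two-stream token aggregation for Video LLMs.
# Shapes and pooling factors are illustrative assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_feats: torch.Tensor,
                    slow_stride: int = 8,   # assumed temporal stride for the Slow path
                    fast_pool: int = 6):    # spatial downsampling factor for the Fast path
    """frame_feats: (T, H, W, C) patch features, e.g. 24x24 tokens per frame."""
    T, H, W, C = frame_feats.shape

    # Slow pathway: few frames, full spatial resolution (keeps spatial detail).
    slow = frame_feats[::slow_stride]                      # (T/slow_stride, H, W, C)
    slow_tokens = slow.reshape(-1, C)

    # Fast pathway: every frame, aggressive spatial pooling (keeps motion cues).
    fast = frame_feats.permute(0, 3, 1, 2)                 # (T, C, H, W)
    fast = F.avg_pool2d(fast, kernel_size=fast_pool)       # (T, C, H/6, W/6)
    fast_tokens = fast.permute(0, 2, 3, 1).reshape(-1, C)

    # Concatenate both streams into the visual token sequence fed to the LLM.
    return torch.cat([slow_tokens, fast_tokens], dim=0)

# Example: 48 sampled frames of 24x24 ViT tokens with dimension 1024.
tokens = slowfast_tokens(torch.randn(48, 24, 24, 1024))
print(tokens.shape)  # 6 frames * 576 tokens + 48 frames * 16 tokens = (4224, 1024)
```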
Authors: Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan, Huihui Xiao, Jiahong Wu, Fan Yang, Size Li, Di Zhang
Institution: Kuaishou Technology
Abstract: In the field of multi-modal language models, most methods use single-layer ViT features, leading to significant computational overhead. EVLM addresses this by employing cross-attention for image-text interaction, hierarchical ViT features for comprehensive visual perception, and a Mixture of Experts (MoE) mechanism to enhance effectiveness. Our model achieves competitive scores on public benchmarks and excels in tasks such as image and video captioning.
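As a quick illustration of why cross-attention keeps the computation down: the text tokens act as queries over the visual features, so image tokens never enter the LLM's own sequence. The dimensions below are ours, not EVLM's.

```python
# Minimal sketch of cross-attention between text tokens (queries) and visual
# features (keys/values); sizes are illustrative assumptions.
import torch
import torch.nn as nn

d_model = 1024
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

text_tokens = torch.randn(1, 77, d_model)      # queries: the language-side tokens
visual_feats = torch.randn(1, 1024, d_model)   # keys/values: (hierarchical) ViT features

# Text attends to the image; the visual tokens never extend the LLM's sequence,
# so the language model's context length and compute stay unchanged.
fused, _ = cross_attn(query=text_tokens, key=visual_feats, value=visual_feats)
print(fused.shape)  # (1, 77, 1024)
```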
Authors: Yuanzhi Zhu, Jiawei Liu, Feiyu Gao, Wenyu Liu, Xinggang Wang, Peng Wang, Fei Huang, Cong Yao, Zhibo Yang
Institutions: Alibaba Group, Huazhong University of Science and Technology
Abstract: Despite advancements in generative models, producing high-quality text images in real-world scenarios remains challenging. The proposed SceneVTG leverages a two-stage approach combining a Multimodal Large Language Model with a conditional diffusion model to generate text images that excel in fidelity, reasonability, and utility. Extensive experiments show SceneVTG surpasses traditional and recent methods in generating realistic and useful text images.
Authors: Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bauer, Matko Bošnjak, Xi Chen, Matthias Minderer, Paul Voigtlaender, Ioana Bica, Ivana Balazevic, Joan Puigcerver, Pinelopi Papalampidi, Olivier Henaff, Xi Xiong, Radu Soricut, Jeremiah Harmsen, Xiaohua Zhai
Institutions: Various
Abstract: PaliGemma is an advanced Vision-Language Model (VLM) that integrates the SigLIP-So400m vision encoder with the Gemma-2B language model. Designed as a versatile base model, PaliGemma excels in transferring to various tasks and demonstrates strong performance across nearly 40 diverse tasks, from standard VLM benchmarks to specialized applications like remote-sensing and segmentation.
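If you want to try it, here is a hedged usage sketch assuming the Hugging Face transformers integration; the model id and the "caption en" prompt prefix follow the public model card, so double-check both before relying on them.

```python
# Hedged usage sketch of PaliGemma via Hugging Face transformers (a recent version
# with PaliGemma support is assumed); verify the model card for prompt formats.
from PIL import Image
import torch
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")          # any local image
inputs = processor(text="caption en", images=image, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=30)

# Decode only the tokens generated after the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(processor.decode(output_ids[0][prompt_len:], skip_special_tokens=True))
```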
Authors: Garrett Tanzer (Google), Biao Zhang (Google DeepMind)
Abstract: Introducing YouTube-SL-25, a groundbreaking, large-scale, open-domain multilingual corpus of sign language videos with well-aligned captions from YouTube. With over 3000 hours of videos across more than 25 sign languages, this dataset is over 3 times larger than YouTube-ASL and stands as the largest parallel sign language dataset to date. It also serves as a pioneering resource for many of its component languages. We provide baselines for sign-to-text tasks using a unified multilingual multitask model based on T5, demonstrating significant benefits from multilingual transfer for both high- and low-resource sign languages.
🎶 Audio:
Authors: Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons
Institution: Stability AI
Abstract: Open generative models are vitally important for the community, allowing for fine-tuning and serving as baselines when presenting new models. This paper describes the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data, showcasing competitive performance and high-quality stereo sound synthesis at 44.1kHz.
Authors: Soumya Sai Vanka, Christian Steinmetz, Jean-Baptiste Rolland, Joshua Reiss, György Fazekas
Institutions: Centre for Digital Music, Queen Mary University of London, UK; Steinberg Media Technologies GmbH, Germany
Abstract: Mixing style transfer automates the generation of a multitrack mix for a given set of tracks by inferring production attributes from a reference song. Existing systems often operate on a fixed number of tracks and lack interpretability. Diff-MST addresses these challenges with a differentiable mixing console, a transformer controller, and an audio production style loss function. It processes raw tracks and a reference song to estimate control parameters for audio effects, producing high-quality mixes and supporting arbitrary numbers of input tracks without source labeling. The framework’s evaluation shows superior performance and the ability for post-hoc adjustments.
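To give a flavor of what a "differentiable mixing console" means, here is a toy gain-and-pan console in PyTorch. Diff-MST's real console includes more effects, but the point is the same: every operation is differentiable, so a controller network can be trained end to end against an audio style loss.

```python
# Conceptual sketch of a differentiable mixing console: per-track gain and pan
# parameters are applied with differentiable ops so gradients can flow back to a
# controller network. Diff-MST's actual console is richer (EQ, dynamics, etc.).
import math
import torch

def differentiable_mix(tracks: torch.Tensor,      # (num_tracks, num_samples) raw mono tracks
                       gain_db: torch.Tensor,     # (num_tracks,) predicted gains in dB
                       pan: torch.Tensor):        # (num_tracks,) predicted pan in [0, 1]
    """Returns a (2, num_samples) stereo mix; every op is differentiable."""
    gain_lin = 10.0 ** (gain_db / 20.0)                    # dB -> linear amplitude
    # Constant-power panning: 0 = hard left, 1 = hard right.
    theta = pan * (math.pi / 2.0)
    left_w = torch.cos(theta) * gain_lin                   # (num_tracks,)
    right_w = torch.sin(theta) * gain_lin
    left = (left_w[:, None] * tracks).sum(dim=0)
    right = (right_w[:, None] * tracks).sum(dim=0)
    return torch.stack([left, right], dim=0)

# Example: mix 8 tracks of one second at 44.1 kHz with learnable parameters.
tracks = torch.randn(8, 44100)
gain_db = torch.zeros(8, requires_grad=True)
pan = torch.full((8,), 0.5, requires_grad=True)
mix = differentiable_mix(tracks, gain_db, pan)             # gradients flow to gain_db and pan
```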
Author: Federico Nicolás Landini
Institution: Brno University of Technology, Faculty of Information Technology, Department of Computer Graphics and Multimedia
Abstract: Speaker diarization determines "who spoke when" in recordings. Historically, modular systems excelled but struggled with overlapped speech. This paper discusses VBx, a Bayesian hidden Markov model for clustering speaker embeddings, and its performance on various datasets. It then explores end-to-end neural diarization (EEND) methods, including synthetic data generation for training and a new EEND-based model, DiaPer, which surpasses previous methods in handling many speakers and overlaps. The paper compares VBx and DiaPer across different corpora.
Authors: Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Fadi Biadsy, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov
Institutions: Google, USA; The University of Tokyo, Japan; Google DeepMind, Japan & USA; Google, Israel
Abstract: Collecting high-quality studio recordings of audio is challenging, limiting the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data sources, leveraging massively multilingual joint speech and text representation learning. This TTS model can generate intelligible speech in over 30 unseen languages without transcribed speech (CER difference of <10% to ground truth). With just 15 minutes of transcribed found data, the intelligibility difference reduces to 1% or less, achieving near-ground-truth naturalness scores.
Authors: Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, Soujanya Poria
Institutions: Singapore University of Technology and Design, Meta AI, University of Michigan
Abstract: Generative multimodal content is increasingly relevant in content creation, particularly in generating audio from text prompts for music and film. Recent diffusion-based text-to-audio models train on large datasets of prompt-audio pairs but often overlook the importance of concepts and their temporal ordering in the generated audio. This paper introduces Tango 2, which improves on these models by using a preference dataset where each text prompt is associated with a "winner" audio output and several "loser" outputs. The model is fine-tuned with direct preference optimization (DPO) to better align generated audio with the prompt, enhancing performance even with limited data.
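For reference, the core of DPO is a simple pairwise objective on winner/loser log-likelihoods. Below is a generic sketch of that loss; Tango 2 applies a diffusion-adapted variant of the same idea, so treat this as the underlying principle rather than its exact training code.

```python
# Generic sketch of the direct preference optimization (DPO) objective: push the
# model to prefer the "winner" output over the "loser" relative to a frozen
# reference model. Not Tango 2's exact (diffusion-based) formulation.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l,     # log-probs of winner/loser under the tuned model
             ref_logp_w, ref_logp_l,           # log-probs under the frozen reference model
             beta: float = 0.1):
    """Each argument is a tensor of shape (batch,) holding sequence log-likelihoods."""
    ratio_w = policy_logp_w - ref_logp_w       # log policy/reference ratio for the winner
    ratio_l = policy_logp_l - ref_logp_l       # ... and for the loser
    # Maximize the margin between winner and loser ratios.
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()

# Example with dummy log-likelihoods for a batch of 4 preference pairs.
loss = dpo_loss(torch.tensor([-10.0, -12.0, -9.5, -11.0]),
                torch.tensor([-11.0, -11.5, -10.0, -12.5]),
                torch.tensor([-10.5, -12.0, -9.8, -11.2]),
                torch.tensor([-10.8, -11.6, -9.9, -12.3]))
print(loss)
```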
Thank you for your attention. Subscribe now to stay informed and join the conversation!
About us:
We also have an amazing team of AI engineers.
We are here to help you maximize efficiency with your available resources.
Reach out whenever you have doubts or questions about AI in your business. Get in touch! 💬