Vector Institute Computer Vision Workshop showcases the field’s current capabilities and future potential
May 29, 2024
By Arber Kacollja
The Vector Institute’s recent Computer Vision (CV) workshop brought together members of the Vector research community to showcase and discuss new work in the field.
Computer vision uses machine learning to analyze and extract information from images and videos, and recent years have seen a surge of interest in generative modeling across the CV research community. CV is a crucial conduit through which AI and ML can tackle human-centric challenges, from facial, sound, and action recognition to autonomous vehicles and the segmentation and classification of medical images, unlocking applications across a wide range of industries.
This March, Vector Faculty Members, Faculty Affiliates, Postdoctoral Fellows, and researchers from the wider Vector research community came together to discuss cutting-edge research, and exchanged insights on various CV-related topics.
Vector Faculty Member Leonid Sigal introduces the panelists of a workshop forum on foundation models in vision.
During his talk, Vector Faculty Member and Canada CIFAR AI Chair Leonid Sigal, who is also a Professor at the University of British Columbia, discussed both the challenges and the opportunities of foundation models, which serve as fundamental building blocks for many applications.
He showed that while such models can generate high-quality images from text prompts, they lack consistency and intuitive controllability, often requiring less-than-intuitive prompt tuning. Sigal presented published work from his group that allows such models to maintain consistency when generating multiple images, as in storyboards for film or TV, by leveraging visual memory. He also presented a novel approach to prompt inversion, which optimizes for the prompt that would likely have generated a given image. Inverting prompts and combining the resulting language descriptions enables an entirely new paradigm for generative image control, such as merging the content of one image with the style of another, or adding particular objects.
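To make the idea of prompt inversion concrete, here is a minimal sketch of the general recipe: optimize a "soft" prompt embedding so that a frozen text-to-image denoiser best explains a given target image under its usual denoising objective. The toy denoiser, dimensions, and simplified noising schedule below are illustrative assumptions, not the approach from Sigal's group.

# Minimal sketch of prompt inversion (illustrative only; the toy denoiser
# and schedule are placeholders, not the method presented in the talk).
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for a frozen text-conditioned diffusion denoiser."""
    def __init__(self, img_dim=64, prompt_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(img_dim + prompt_dim + 1, 128),
                                 nn.ReLU(), nn.Linear(128, img_dim))

    def forward(self, noisy_img, t, prompt_emb):
        x = torch.cat([noisy_img, prompt_emb, t], dim=-1)
        return self.net(x)  # predicts the added noise

denoiser = ToyDenoiser().eval()
for p in denoiser.parameters():
    p.requires_grad_(False)                           # generative model stays frozen

target = torch.randn(1, 64)                           # the image we want to "explain"
prompt = torch.zeros(1, 32, requires_grad=True)       # soft prompt to optimize
opt = torch.optim.Adam([prompt], lr=1e-2)

for step in range(200):
    t = torch.rand(1, 1)                              # random diffusion time
    noise = torch.randn_like(target)
    noisy = (1 - t) * target + t * noise              # simplified forward process
    loss = ((denoiser(noisy, t, prompt) - noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# `prompt` now approximates an embedding that would likely generate `target`;
# inverted prompts from two images could be blended to mix content and style.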
Finally, he discussed generative model biases, which have emerged as significant hurdles to deployment. Many generative models not only inherit the biases of the data on which they are trained but exacerbate them. Sigal presented a novel approach that dynamically assesses the presence and extent of such biases; these measurements, paired with an off-the-shelf mitigation technique, can help alleviate the problem.
Denoising diffusion models are a powerful new class of likelihood-based generative models. Their ability to convert text prompts into high-fidelity images or video has been transformative, enhancing creativity and providing new tools for generative AI. In his talk, Vector Faculty Member and Canada CIFAR AI Chair David Fleet, who is also a Professor at the University of Toronto, explored the extent to which these models might also be effective for other CV tasks. In particular, Fleet showed that, with a simple, generic architecture, diffusion models can be trained to excel at two key image-to-image translation tasks: estimating monocular depth from a single RGB image, and estimating an optical flow field from two consecutive frames of video. Surprisingly, with a generic architecture and training procedure, such models outperform current state-of-the-art, task-specific models. This further supports the ability of diffusion models to approximate complex, high-dimensional, multi-modal distributions.
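As an illustration of framing depth estimation as conditional generation, the sketch below trains a toy denoiser to predict the noise added to a depth map while conditioning on the corresponding RGB image. The architecture, data, and simplified noise schedule are placeholders, not Fleet's models.

# Illustrative sketch of depth estimation as conditional denoising
# (toy architecture and random data; not the models from the talk).
import torch
import torch.nn as nn

class CondDenoiser(nn.Module):
    """Predicts the noise added to a depth map, conditioned on the RGB image."""
    def __init__(self):
        super().__init__()
        # Input: 3 RGB channels + 1 noisy depth channel -> predicted noise
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, rgb, noisy_depth):
        return self.net(torch.cat([rgb, noisy_depth], dim=1))

model = CondDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    rgb = torch.rand(4, 3, 64, 64)              # stand-in image batch
    depth = torch.rand(4, 1, 64, 64)            # stand-in ground-truth depth
    t = torch.rand(4, 1, 1, 1)
    noise = torch.randn_like(depth)
    noisy_depth = (1 - t) * depth + t * noise   # simplified forward process
    loss = ((model(rgb, noisy_depth) - noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# At inference, sampling would start from pure noise and iteratively denoise,
# conditioned on the RGB frame; optical flow can be framed the same way, with
# two consecutive frames as conditioning and a 2-channel flow field as target.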
Vector Faculty Member Graham Taylor moderates a workshop forum on foundation models in vision alongside Vector Faculty Members David Fleet, Yalda Mohsenzadeh, and Leonid Sigal and Vector Faculty Affiliates Marcus Brubaker and April Khademi.
During his talk, "Decoding the Living Library: Computer Vision in Biodiversity Science," Vector Faculty Member and Canada CIFAR AI Chair Graham Taylor, who is also a Professor at the University of Guelph, delved into the significant impact of biodiversity on our ecosystems, emphasizing the urgent need to make biodiversity an individual and collective focus. He highlighted the alarming decline in Earth's biodiversity, with vertebrate populations falling by an average of 68% worldwide between 1970 and 2016, and discussed the driving forces behind this global decline, including climate change and unsustainable agricultural practices. Taylor underscored the potential of ML, specifically deep learning techniques, to transform biodiversity science by enhancing the accuracy of species identification and monitoring through CV. In collaboration with researchers at sister institute Amii in Edmonton, his approach harnesses multi-modal learning, including learning jointly from images and "DNA barcodes" of the same specimen, to facilitate a deeper understanding of biodiversity data and contribute to more informed conservation and policy decisions.
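One common way to learn jointly from specimen images and DNA barcodes is a CLIP-style contrastive objective that pulls together embeddings of the same specimen across modalities. The sketch below assumes toy encoders and randomly generated inputs; it illustrates the general recipe, not the collaboration's actual models.

# Hedged sketch of image/DNA-barcode contrastive alignment (toy encoders).
import torch
import torch.nn as nn
import torch.nn.functional as F

image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
# DNA barcodes are short sequences; here they are encoded as 4 bases x 64 positions.
dna_encoder = nn.Sequential(nn.Flatten(), nn.Linear(4 * 64, 128))
opt = torch.optim.Adam(list(image_encoder.parameters()) +
                       list(dna_encoder.parameters()), lr=1e-3)

for step in range(100):
    images = torch.rand(8, 3, 32, 32)            # specimen photos (stand-in)
    barcodes = torch.rand(8, 4, 64)              # barcodes for the same specimens
    z_img = F.normalize(image_encoder(images), dim=-1)
    z_dna = F.normalize(dna_encoder(barcodes), dim=-1)
    logits = z_img @ z_dna.t() / 0.07            # pairwise similarities
    labels = torch.arange(8)                     # matching pairs lie on the diagonal
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    opt.zero_grad(); loss.backward(); opt.step()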
The rapid advancement of self-supervised learning (SSL) has highlighted its potential to leverage unlabeled data for learning powerful visual representations. However, existing SSL methods, especially those employing different views of the same image, often rely on a limited set of predefined data augmentations. This constrains the diversity and quality of transformations, resulting in suboptimal representations. In the Generative SSL project, Arash Afkanpour, Applied Machine Learning Scientist on Vector's AI Engineering team, introduced a novel framework that enriches the SSL paradigm by integrating generative models to produce semantically consistent image augmentations. By directly conditioning generative models on a source image representation, the team's approach enables the generation of diverse augmentations while preserving the semantics of the original image, offering a richer set of data for self-supervised learning and ultimately enhancing the quality of the learned visual representations.
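A minimal sketch of the generative-augmentation idea follows: a generator conditioned on a source image's representation (plus noise for diversity) produces alternative views, which then feed a standard contrastive SSL objective. The toy encoder, generator, and loss are illustrative assumptions; in practice the generative model would be a pretrained image generator, not a random linear layer.

# Minimal sketch of generative augmentations for SSL (toy components only).
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
# Generator conditioned on the source-image representation plus noise.
generator = nn.Linear(64 + 16, 3 * 32 * 32)

def generate_view(images):
    """Produce a semantically conditioned augmentation of each image."""
    with torch.no_grad():
        h = encoder(images)                       # source-image representation
    z = torch.randn(images.size(0), 16)           # diversity comes from the noise
    return generator(torch.cat([h, z], dim=-1)).view_as(images)

opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
for step in range(100):
    images = torch.rand(8, 3, 32, 32)
    view_a = generate_view(images)
    view_b = generate_view(images)
    z_a = F.normalize(encoder(view_a), dim=-1)
    z_b = F.normalize(encoder(view_b), dim=-1)
    logits = z_a @ z_b.t() / 0.1                  # contrastive objective over views
    loss = F.cross_entropy(logits, torch.arange(8))
    opt.zero_grad(); loss.backward(); opt.step()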
In the realm of multi-modal representation learning, Vector's AI Engineering team has developed a versatile framework designed to help researchers and practitioners build new model architectures, loss functions, and methodologies, and to experiment with both novel and existing techniques. Furthermore, Afkanpour and the team are developing methods that combine contrastive and unimodal representation learning techniques to vastly expand the data available for training multi-modal models. The team's ultimate aim is to construct a healthcare foundation model trained on a wide range of modalities, including medical text, diverse imaging modalities, electronic health records, and more.
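One way such paired and unpaired data can be combined in a single objective is sketched below: a contrastive term on image-text pairs plus a simple reconstruction-style term on extra, unpaired images. The encoders, the choice of reconstruction loss, and the weighting are illustrative assumptions, not the team's framework.

# Hedged sketch of mixing cross-modal contrastive and unimodal objectives.
import torch
import torch.nn as nn
import torch.nn.functional as F

img_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
txt_enc = nn.Linear(128, 64)            # stand-in for a text/EHR encoder
img_dec = nn.Linear(64, 3 * 32 * 32)    # decoder for a unimodal reconstruction term

params = list(img_enc.parameters()) + list(txt_enc.parameters()) + list(img_dec.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

for step in range(100):
    imgs, txts = torch.rand(8, 3, 32, 32), torch.rand(8, 128)   # paired batch
    extra_imgs = torch.rand(16, 3, 32, 32)                      # unpaired images
    z_i = F.normalize(img_enc(imgs), dim=-1)
    z_t = F.normalize(txt_enc(txts), dim=-1)
    labels = torch.arange(8)
    contrastive = (F.cross_entropy(z_i @ z_t.t() / 0.07, labels) +
                   F.cross_entropy(z_t @ z_i.t() / 0.07, labels)) / 2
    recon = F.mse_loss(img_dec(img_enc(extra_imgs)), extra_imgs.flatten(1))
    loss = contrastive + 0.5 * recon    # unimodal term lets unpaired data contribute
    opt.zero_grad(); loss.backward(); opt.step()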
Marshall Wang, Associate Applied Machine Learning Specialist on Vector's AI Engineering team, presented work on visual prompt-tuning for remote sensing segmentation. Image segmentation is crucial in climate change research for analyzing satellite imagery, and is vital for ecosystem mapping, natural disaster assessment, and urban and agricultural planning. The advent of vision foundation models like the Segment Anything Model (SAM) opens new avenues in climate research and remote sensing (RS). SAM can segment virtually any object given manually crafted prompts, but its efficacy largely depends on the quality of those prompts. This issue is particularly pronounced with RS data, which are inherently complex. To use SAM for accurate segmentation at scale on RS imagery, one would need to craft complex prompts for each image, typically involving the selection of dozens of points.
To address this, Wang introduced Prompt-Tuned SAM (PT-SAM), a method that minimizes the need for manual input through a trainable, lightweight prompt embedding. This embedding captures key semantic information for specific objects of interest and remains applicable to unseen images. The approach merges the zero-shot generalization capabilities of the pre-trained SAM model with supervised learning. Importantly, training the prompt embedding not only has minimal hardware requirements, allowing it to run on a CPU, but also requires only a small dataset. With PT-SAM, image segmentation on RS data can be performed at scale without human intervention, achieving accuracy comparable to SAM with human-designed prompts. For example, PT-SAM can be used to analyze forest cover across vast areas, a key factor in understanding the impact of human activities on forests. Its ability to segment a multitude of images makes it ideal for monitoring widespread land-cover changes, providing deeper insights into urbanization.
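The core mechanism of prompt tuning against a frozen promptable segmenter can be sketched in a few lines: only a small prompt embedding receives gradients, trained with a supervised mask loss. The toy decoder below stands in for SAM and is not the PT-SAM implementation.

# Minimal sketch of prompt tuning a frozen segmenter (toy stand-in for SAM).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMaskDecoder(nn.Module):
    """Stand-in for a frozen promptable mask decoder."""
    def __init__(self, prompt_dim=32):
        super().__init__()
        self.img = nn.Conv2d(3, 16, 3, padding=1)
        self.prompt_proj = nn.Linear(prompt_dim, 16)
        self.head = nn.Conv2d(16, 1, 1)

    def forward(self, images, prompt):
        feats = self.img(images) + self.prompt_proj(prompt).view(1, -1, 1, 1)
        return self.head(feats)                          # mask logits

decoder = ToyMaskDecoder().eval()
for p in decoder.parameters():
    p.requires_grad_(False)                              # the big model stays frozen

prompt = torch.zeros(32, requires_grad=True)             # the only trainable parameters
opt = torch.optim.Adam([prompt], lr=1e-2)

for step in range(200):
    images = torch.rand(4, 3, 64, 64)                    # stand-in satellite tiles
    masks = (torch.rand(4, 1, 64, 64) > 0.5).float()     # stand-in labels
    loss = F.binary_cross_entropy_with_logits(decoder(images, prompt), masks)
    opt.zero_grad(); loss.backward(); opt.step()

# Because only a tiny embedding is optimized against a frozen model, this kind
# of tuning can run on a CPU with a small labeled dataset; at inference the
# learned prompt replaces manually clicked points.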
Vector researchers present their research during the poster session.
Computer vision is already shaping daily life in ways previously unimaginable. The continual sophistication of satellite imagery, driven by CV, will enable more accurate monitoring and management of our environment. Computer vision also shows real potential to play a pivotal role in applications such as healthcare, where it promises to enhance simulations and safety, reduce physician workload, and improve patient care. This transformative technology is creating new possibilities and reshaping many aspects of our lives.
Yet we are still in the early stages of exploring CV's full potential, and many of these tools aren't yet ready. What's more, identifying and mitigating ethical challenges such as bias is crucial. Only then can we ensure that the benefits of computer vision are equitably distributed and ethically sound.
Want to learn more about the Vector Institute's current research initiatives in computer vision? Click here to watch Vector Faculty Member Renjie Liao's talk at Vector's Distinguished Lecture Series.