May 2021: I will be joining Mila as a graduate student in fall 2021.
January 2021: Our WACV paper's video is now out on YouTube. Watch it here.
January 2021: I will be speaking at the W&B Deep Learning Salon on "From Smooth Activations to Robustness to Catastrophic Forgetting". I will be joined by Maithra Raghu from Google Brain. Watch it here.
December 2020: I'm starting full time as a Machine Learning Engineer at Weights & Biases.
Research Associate I | Oct. 2023 - Present | Carnegie Mellon University (CMU), Human Sensing Lab (HSL)
Supervisor: Prof. Fernando De la Torre
Research Area: Transfer of personalization under continual updates of diffusion models.
Machine Learning Researcher | Apr. 2022 - Feb. 2023 | Morgan Stanley
Supervisor: Kashif Rasul
Research Area: Continual Learning, Time Series, Model Reprogramming
Remote Visiting Research Scholar | Aug. 2021 - Present | VITA, University of Texas at Austin
Supervisor: Dr. Zhangyang Wang
Research Area: Sparsity, Robustness and Knowledge Distillation.
We propose Mish, a novel self-regularized non-monotonic activation function mathematically defined as $f(x) = x\tanh(\mathrm{softplus}(x))$. As activation functions play a crucial role in the performance and training dynamics of neural networks, we validate Mish experimentally on several well-known benchmarks against the best combinations of architectures and activation functions. We also observe that data augmentation techniques have a favorable effect on benchmarks like ImageNet-1k and MS-COCO across multiple architectures. For example, Mish outperformed Leaky ReLU on YOLOv4 with a CSP-DarkNet-53 backbone by $2.1\%$ in average precision ($AP^{val}_{50}$) on MS-COCO object detection, and ReLU on ResNet-50 by $\approx 1\%$ in Top-1 accuracy on ImageNet-1k, while keeping all other network parameters and hyperparameters constant. Furthermore, we explore the mathematical formulation of Mish in relation to the Swish family of functions and propose an intuitive understanding of how its first-derivative behavior may act as a regularizer aiding the optimization of deep neural networks.
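A minimal NumPy sketch of the definition above (not the reference implementation; the numerically stable softplus via `np.logaddexp` is an implementation choice):

```python
import numpy as np

def softplus(x):
    # numerically stable softplus: log(1 + exp(x))
    return np.logaddexp(0.0, x)

def mish(x):
    # Mish: f(x) = x * tanh(softplus(x)), self-regularized and non-monotonic
    return x * np.tanh(softplus(x))

x = np.array([-1.19, 0.0, 10.0])
y = mish(x)
```

For large positive inputs Mish behaves like the identity, while the small negative dip (minimum near $x \approx -1.19$) is what makes it non-monotonic.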
Benefiting from the capability of building interdependencies among channels or spatial locations, attention mechanisms have recently been extensively studied and broadly used in a variety of computer vision tasks. In this paper, we investigate light-weight but effective attention mechanisms and present triplet attention, a novel method for computing attention weights by capturing cross-dimension interaction using a three-branch structure. For an input tensor, triplet attention builds inter-dimensional dependencies by the rotation operation followed by residual transformations and encodes inter-channel and spatial information with negligible computational overhead. Our method is simple as well as efficient and can be easily plugged into classic backbone networks as an add-on module. We demonstrate the effectiveness of our method on various challenging tasks including image classification on ImageNet-1k and object detection on the MS-COCO and PASCAL VOC datasets. Furthermore, we provide extensive insight into the performance of triplet attention by visually inspecting the GradCAM and GradCAM++ results. The empirical evaluation of our method supports our intuition on the importance of capturing dependencies across dimensions when computing attention weights.
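As a rough illustration of the three-branch idea, the toy NumPy sketch below rotates the tensor so each branch pools over a different dimension, then gates with a sigmoid map. The weighted sum of max- and mean-pooled maps is a simplified stand-in for the paper's k×k convolution, and the weights `w` are illustrative; this is not the published module:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def z_pool(t):
    # concatenate max- and mean-pooling over the leading axis: (C,H,W) -> (2,H,W)
    return np.stack([t.max(axis=0), t.mean(axis=0)])

def branch(t, w):
    # one branch: z-pool, weighted 1x1 combination (stand-in for the k x k
    # convolution of the paper), then sigmoid gating broadcast over the
    # pooled dimension
    pooled = z_pool(t)                                    # (2, ., .)
    attn = sigmoid(w[0] * pooled[0] + w[1] * pooled[1])   # (., .)
    return t * attn

def triplet_attention(x, w=(1.0, 1.0)):
    # x: (C, H, W); each branch rotates a different dimension into the
    # pooled position, capturing cross-dimension interactions
    b1 = branch(x, w)                                                    # pool over C
    b2 = np.transpose(branch(np.transpose(x, (1, 0, 2)), w), (1, 0, 2))  # pool over H
    b3 = np.transpose(branch(np.transpose(x, (2, 1, 0)), w), (2, 1, 0))  # pool over W
    return (b1 + b2 + b3) / 3.0

x = np.random.rand(8, 4, 4)
y = triplet_attention(x)
```

Averaging the three gated branches mirrors the aggregation step described in the abstract; output shape matches the input, so the module is drop-in.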
@inproceedings{misra2021rotate,
title={Rotate to attend: Convolutional triplet attention module},
author={Misra, Diganta and Nalamada, Trikay and Arasanipalai, Ajay Uppili and Hou, Qibin},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
pages={3139--3148},
year={2021}
}
In real-world scenarios, extensive manual annotation for continual learning is impractical due to prohibitive costs. Although prior art, influenced by large-scale webly supervised training, suggests leveraging web-scraped data in continual learning, this poses challenges such as data imbalance, usage restrictions, and privacy concerns. Addressing the risks of continual webly supervised training, we present an online continual learning framework: Generative Name-only Continual Learning (G-NoCL). The proposed G-NoCL uses a set of generators G along with the learner. When encountering new concepts (i.e., classes), G-NoCL employs the novel sample-complexity-guided data ensembling technique DIverSity and COmplexity enhancing ensemBlER (DISCOBER) to optimally sample training data from generated data. Through extensive experimentation, we demonstrate the superior performance of DISCOBER on G-NoCL online CL benchmarks, covering both In-Distribution (ID) and Out-of-Distribution (OOD) generalization evaluations, compared to naive generator ensembling, web supervision, and manually annotated data.
@misc{seo2024just,
title={Just Say the Name: Online Continual Learning with Category Names Only via Data Generation},
author={Minhyuk Seo and Diganta Misra and Seongwon Cho and Minjae Lee and Jonghyun Choi},
year={2024},
eprint={2403.10853},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Mila CERC AAI AI & Scale Workshop 2024 Talk
Pretrained language models underpin several AI applications, but their high computational cost for training limits accessibility. Initiatives such as BLOOM and StarCoder aim to democratize access to pretrained models for collaborative community development. However, existing models face challenges: limited multilingual capabilities, catastrophic forgetting under continual pretraining, the high computational expense of pretraining from scratch, and compliance with AI safety and development laws. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435 billion additional tokens, Aurora-M surpasses 2 trillion tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Aurora-M is rigorously evaluated across various tasks and languages, demonstrating robustness against catastrophic forgetting and outperforming alternatives in multilingual settings, particularly in safety evaluations.
@misc{nakamura2024auroram,
title={Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order},
author={Taishi Nakamura and Mayank Mishra and Simone Tedeschi and Yekun Chai and Jason T Stillerman and Felix Friedrich and Prateek Yadav and Tanmay Laud and Vu Minh Chien and Terry Yue Zhuo and Diganta Misra and Ben Bogin and Xuan-Son Vu and Marzena Karpinska and Arnav Varma Dantuluri and Wojciech Kusa and Tommaso Furlanello and Rio Yokota and Niklas Muennighoff and Suhas Pai and Tosin Adewumi and Veronika Laippala and Xiaozhe Yao and Adalberto Junior and Alpay Ariyak and Aleksandr Drozd and Jordan Clive and Kshitij Gupta and Liangyu Chen and Qi Sun and Ken Tsui and Noah Persaud and Nour Fahmy and Tianlong Chen and Mohit Bansal and Nicolo Monti and Tai Dang and Ziyang Luo and Tien-Tung Bui and Roberto Navigli and Virendra Mehta and Matthew Blumberg and Victor May and Huu Nguyen and Sampo Pyysalo},
year={2024},
eprint={2404.00399},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
With the latest advances in deep learning, there has been a lot of focus on the online learning paradigm due to its relevance in practical settings. Although many methods have been investigated for optimal learning settings in scenarios where the data stream is continuous over time, training sparse networks in such settings has often been overlooked. In this paper, we explore the problem of training a neural network with a target sparsity in a particular case of online learning: the anytime learning at macroscale paradigm (ALMA). We propose a novel way of progressive pruning, referred to as \textit{Anytime Progressive Pruning} (APP); the proposed approach significantly outperforms the baseline dense and Anytime OSP models across multiple architectures and datasets under short, moderate, and long-sequence training. For example, our method shows an improvement in accuracy of $\approx 7\%$ and a reduction in the generalization gap by $\approx 22\%$, while being roughly one-third the size of the dense baseline model, in few-shot restricted-ImageNet training. We further observe interesting non-monotonic transitions in the generalization gap in ALMA settings with a high number of megabatches. The code and experiment dashboards can be accessed at \url{https://github.com/landskape-ai/Progressive-Pruning} and \url{https://wandb.ai/landskape/APP}, respectively.
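The progressive-pruning idea can be illustrated with a toy NumPy sketch. The cubic sparsity schedule and pure magnitude pruning below are common choices from the gradual-pruning literature, used here as stand-ins rather than the exact APP procedure:

```python
import numpy as np

def magnitude_prune(w, sparsity):
    # zero out the smallest-magnitude fraction `sparsity` of weights
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    thresh = np.sort(np.abs(w).ravel())[k - 1]
    pruned = w.copy()
    pruned[np.abs(pruned) <= thresh] = 0.0
    return pruned

def progressive_sparsity(step, total_steps, target=0.9):
    # ramp sparsity from 0 to `target` over the stream of megabatches
    # (cubic schedule, as in gradual-pruning works)
    frac = min(step / total_steps, 1.0)
    return target * (1.0 - (1.0 - frac) ** 3)

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
for step in range(1, 11):
    # in APP, a training phase on each megabatch would sit between prunings
    w = magnitude_prune(w, progressive_sparsity(step, 10))

sparsity = float(np.mean(w == 0.0))
```

The schedule prunes gently early on and approaches the target sparsity as more megabatches arrive, which is what allows the model to stay "anytime" usable mid-stream.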
@misc{misra2022app,
title={APP: Anytime Progressive Pruning},
author={Diganta Misra and Bharat Runwal and Tianlong Chen and Zhangyang Wang and Irina Rish},
year={2022},
eprint={2204.01640},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
NSL presentation /
MLC Research Jam #8 /
MLC Research Jam #9 /
Continual AI Seminar
Learning under constraints has been a fundamental avenue of research in deep learning since the advent of modern deep neural networks. In parallel to the upward trajectory of scaling neural networks, one practical constraint that has embodied efficient deep learning is sparsity. Unstructured weight sparsity has been the cornerstone of pioneering works on pruning and the lottery ticket hypothesis. In this paper, we propose \textbf{$\mathcal{D}^2$-Sparse}, a novel dual dynamic sparse learning system for the low-data learning regime. Our paper combines two popular constraints in deep learning, namely sparsity and low-data learning, often studied in disjoint paradigms, thus opening new directions of research in sparsity. $\mathcal{D}^2$-Sparse outperforms standard iterative pruning schemes when coupled with standard deep networks in computer vision tasks like image classification and in natural language processing tasks like code generation, with no extra inference overhead. Compared to iterative pruning on a $\frac{1}{8}$-th total data budget, $\mathcal{D}^2$-Sparse achieves a $\approx 4\%$ top-1 accuracy boost for ResNet-18 on the CIFAR-100 classification task. Further, we demonstrate the effectiveness of the proposed method in anytime learning scenarios and provide extensive analysis of the evolution of sparse masks in $\mathcal{D}^2$-Sparse over the training process. Code, dashboard, and model weights will be open-sourced for public access upon acceptance.
The ever-changing landscape of programming languages poses a significant challenge in the development and training of models designed for code generation. Code, being a dynamic and constantly evolving environment, necessitates a continuous process of adaptation to stay in sync with the rapidly shifting paradigms, frameworks, and methodologies within the software development domain. The inherent variability in coding styles, the emergence of new programming languages, and the continuous evolution of libraries and packages underscore the imperative for an active approach to updating code generation models. In response to this challenge, we introduce GitChameleon, an innovative dataset comprising more than 12,000 version-sensitive examples in Python, designed to facilitate research into the adaptation of code generation models to the rapidly changing landscape of programming languages. Furthermore, we assess the performance of state-of-the-art code models and demonstrate their inadequacy in generating version-specific code. For example, the latest CodeLlama-70B achieves only a 46.76% exact string match score when evaluated on GitChameleon.
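The exact string match metric referenced above can be sketched as follows; the example pair is a hypothetical stand-in for a version-sensitive item (a model ignoring a pinned library version), not an actual GitChameleon example:

```python
def exact_match_score(predictions, references):
    # fraction of generations that exactly match the reference solution
    # after whitespace stripping (a strict, common code-generation metric)
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

# hypothetical version-sensitive items: the correct call depends on the
# library version pinned in the prompt
references = ["np.alltrue(a)", "np.all(a)"]
predictions = ["np.all(a)", "np.all(a)"]  # model ignores the version pin
score = exact_match_score(predictions, references)
```

Strict string matching is deliberately unforgiving: semantically equivalent but version-inappropriate code scores zero, which is exactly the failure mode the benchmark probes.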
In the era of resource-intensive foundation models, efficient adaptation in downstream tasks has become paramount. Visual Prompting (VP), inspired by prompting in Large Language Models (LLMs), has emerged as a key transfer learning method in computer vision. Aligned with the growing significance of efficiency, research in model compression has become pivotal to alleviate the computational burden in both training and deploying over-parameterized neural networks. A key goal in model compression is the development of sparse models capable of matching or surpassing the performance of their over-parameterized, dense counterparts. While prior research has explored the impact of model sparsity on transfer learning, its effects on visual prompting-based transfer remain unclear. This study addresses this gap, revealing that model sparsity adversely affects the performance of visual prompting-based transfer, particularly in low-data-volume scenarios. Furthermore, our findings highlight the negative influence of sparsity on the calibration of downstream visual-prompted models. This empirical exploration calls for a nuanced understanding beyond accuracy in sparse settings, opening avenues for further research in Visual Prompting for sparse models.
@article{misra2023reprogramming,
title = {Reprogramming under constraints: Revisiting efficient and reliable transferability of lottery tickets},
author = {Diganta Misra and Agam Goyal and Bharat Runwal and Pin Yu Chen},
year = {2023},
journal = {arXiv preprint arXiv: 2308.14969}
}
Cohere ForAI Lightning Talk /
Google Sparsity Reading Group Talk /
MLC Research Jam 17
Standard gradient descent algorithms applied to sequences of tasks are known to produce catastrophic forgetting in deep neural networks. When trained on a new task in a sequence, the model updates its parameters on the current task, forgetting past knowledge. This article explores scenarios where we scale the number of tasks in a finite environment. Those scenarios are composed of a long sequence of tasks with reoccurring data. We show that in such a setting, stochastic gradient descent can learn, progress, and converge to a solution that, according to existing literature, requires a continual learning algorithm. In other words, we show that the model performs knowledge retention and accumulation without specific memorization mechanisms. We propose a new experimentation framework, SCoLe (Scaling Continual Learning), to study the knowledge retention and accumulation of algorithms in potentially infinite sequences of tasks. To explore this setting, we performed a large number of experiments on sequences of 1,000 tasks to better understand this new family of settings. We also propose slight modifications to vanilla stochastic gradient descent to facilitate continual learning in this setting. The SCoLe framework represents a good simulation of practical training environments with reoccurring situations and allows the study of convergence behavior in long sequences. Our experiments show that previous results on short scenarios cannot always be extrapolated to longer scenarios.
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
The strength of modern large-scale neural networks lies in their ability to efficiently adapt to new tasks with few examples. Although extensive research has investigated the transferability of Vision Transformers (ViTs) to various downstream tasks under diverse constraints, this study shifts focus to explore the transfer learning potential of [V]-Mamba. We compare its performance with ViTs across different few-shot data budgets and efficient transfer methods. Our analysis yields three key insights into [V]-Mamba's few-shot transfer performance: (a) [V]-Mamba demonstrates superior or equivalent few-shot learning capabilities compared to ViTs when utilizing linear probing (LP) for transfer, (b) Conversely, [V]-Mamba exhibits weaker or similar few-shot learning performance compared to ViTs when employing visual prompting (VP) as the transfer method, and (c) We observe a weak positive correlation between the performance gap in transfer via LP and VP and the scale of the [V]-Mamba model. This preliminary analysis lays the foundation for more comprehensive studies aimed at furthering our understanding of the capabilities of [V]-Mamba variants and their distinctions from ViTs.
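Linear probing, one of the two transfer methods compared above, amounts to fitting a linear classifier on frozen backbone features. A minimal sketch using a closed-form ridge-regression probe (the synthetic features below stand in for frozen [V]-Mamba or ViT embeddings):

```python
import numpy as np

def linear_probe_fit(features, labels, n_classes, l2=1e-3):
    # closed-form ridge regression onto one-hot targets; the backbone that
    # produced `features` is never updated
    Y = np.eye(n_classes)[labels]
    X = np.hstack([features, np.ones((len(features), 1))])  # bias column
    W = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ Y)
    return W

def linear_probe_predict(features, W):
    X = np.hstack([features, np.ones((len(features), 1))])
    return (X @ W).argmax(axis=1)

rng = np.random.default_rng(0)
# toy frozen features: two well-separated classes, 20 shots each
feats = np.vstack([rng.normal(0.0, 0.1, size=(20, 8)),
                   rng.normal(1.0, 0.1, size=(20, 8))])
labels = np.array([0] * 20 + [1] * 20)
W = linear_probe_fit(feats, labels, n_classes=2)
acc = float(np.mean(linear_probe_predict(feats, W) == labels))
```

Visual prompting, by contrast, would leave the head fixed and optimize a learnable perturbation of the input, which is why the two methods can rank architectures differently.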
@misc{misra2024lowshot,
title={On the low-shot transferability of [V]-Mamba},
author={Diganta Misra and Jay Gala and Antonio Orvieto},
year={2024},
eprint={2403.10696},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
The recent success of Sparse Mixture-of-Experts (SMoE) models has sparked renewed interest in routed networks in deep learning. A prominent aspect of the SMoE is the scaling of the number of total parameters in a model, effectively increasing capacity while keeping computation costs similar to dense models. Yet, these models pose optimization challenges as inputs are routed discretely to experts in each layer. Often, a regularization term is added to the loss function to penalize the imbalanced selection of experts. First, we aim to demonstrate that the heuristic regularization strategies used in recent SMoEs, while successful in some tasks, have significant limitations which we aim to address: in multi-domain or multi-task settings, without explicit knowledge of the task or domain, the network suffers from a tradeoff between expert mode collapse and performance, in which some experts receive significantly less training signal or performance on some tasks suffers. Second, we derive a theoretical basis for the various routing functions, with entropy maximization as a common objective. Third, we demonstrate a first application of Generative Flow Networks (GFlowNets) to SMoEs, with the state, policy, and action space represented at a particular layer of the model by the input, routing network, and sampling from expert probabilities, respectively. We aim to show that SMoEs trained with the Trajectory Balance objective from the GFlowNet literature can achieve competitive performance with state-of-the-art routing methods, such as the Switch Transformer, and suffer less from expert collapse in multi-task (NYUv2, Pascal-Context) and multi-domain (Omniglot) settings. This work lays foundations for further exploration of theoretically motivated approaches to routing in sparse MoEs.
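The heuristic load-balancing regularization discussed above can be illustrated with a Switch-Transformer-style auxiliary loss; this NumPy sketch is a generic version for intuition, not the exact objective of any cited model:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def load_balancing_loss(router_logits):
    # for each expert i, multiply the fraction of tokens routed to it (f_i)
    # by the mean router probability it receives (P_i); sum and scale by the
    # number of experts. Minimized when routing is uniform, large under
    # expert collapse.
    probs = softmax(router_logits)                 # (tokens, experts)
    n_tokens, n_experts = probs.shape
    assignments = probs.argmax(axis=1)             # discrete top-1 routing
    f = np.bincount(assignments, minlength=n_experts) / n_tokens
    P = probs.mean(axis=0)
    return n_experts * float(np.sum(f * P))

rng = np.random.default_rng(0)
balanced = rng.normal(0.0, 0.01, size=(16, 4))   # near-indifferent router
collapsed = np.zeros((16, 4))
collapsed[:, 0] = 10.0                           # all tokens to expert 0
loss_balanced = load_balancing_loss(balanced)
loss_collapsed = load_balancing_loss(collapsed)
```

Under balanced routing the loss sits near 1; under collapse it approaches the number of experts, which is the penalty signal the heuristic relies on.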
Measuring nonlinear feature interaction is an established approach to understanding complex patterns of attribution in many models. In this paper, we use Shapley Taylor interaction indices (STII) to analyze the impact of underlying data structure on model representations in a variety of modalities, tasks, and architectures. Considering linguistic structure in masked and auto-regressive language models (MLMs and ALMs), we find that STII increases within idiomatic expressions and that MLMs scale STII with syntactic distance, relying more on syntax in their nonlinear structure than ALMs do. Our speech model findings reflect the phonetic principle that the openness of the oral cavity determines how much a phoneme varies based on its context. Finally, we study image classifiers and illustrate that feature interactions intuitively reflect object boundaries. Our wide range of results illustrates the benefits of interdisciplinary work and domain expertise in interpretability research.
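For intuition, in the two-feature case the pairwise Shapley interaction reduces to a discrete mixed second difference over feature inclusion: $I(\{i,j\}) = f(x_{ij}) - f(x_i) - f(x_j) + f(\varnothing)$. The sketch below uses toy binary-feature functions, not the full STII estimator used in the paper:

```python
def pairwise_interaction(f, baseline, i, j):
    # mixed second difference over feature inclusion: switch features i and
    # j on (value 1) against the baseline and combine the four evaluations
    def mask(on):
        return tuple(1 if k in on else b for k, b in enumerate(baseline))
    return f(mask({i, j})) - f(mask({i})) - f(mask({j})) + f(mask(set()))

# additive function: features contribute independently, so no interaction
add = lambda x: 2 * x[0] + 3 * x[1]
# multiplicative function: pure nonlinear interaction between the features
mul = lambda x: x[0] * x[1]
```

Additive attribution methods see both functions the same way; the interaction index is what separates them, which is the quantity the paper tracks across modalities.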
@misc{singhvi2024knowing,
title={Knowing Your Nonlinearities: Shapley Interactions Reveal the Underlying Structure of Data},
author={Divyansh Singhvi and Andrej Erkelens and Raghav Jain and Diganta Misra and Naomi Saphra},
year={2024},
eprint={2403.13106},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
The size and prevalence of large language models (LLMs) make them an apt target for model compression. Most LLMs consist of a Transformer encoder and decoder, which each have 6 to 12 layers of multiheaded self-attention blocks, along with fully connected layers. This results in a large number of parameters, making them quite expensive to train and query. Our work focuses on finding techniques to prune CodeBERT, a specific LLM trained to work multimodally between text and code. We explore the effects of structured and unstructured magnitude pruning on the encoder layers of CodeBERT, evaluating on the task of generating natural language comments from a piece of Ruby code.
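The two pruning regimes can be sketched as follows; `structured_prune_rows` stands in for removing whole neurons or attention heads, and the weight matrix is a synthetic stand-in for a CodeBERT encoder layer:

```python
import numpy as np

def unstructured_prune(w, sparsity):
    # zero individual weights with the smallest magnitudes, anywhere in the
    # matrix; preserves accuracy well but yields irregular sparsity
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) > thresh, w, 0.0)

def structured_prune_rows(w, sparsity):
    # zero whole rows (e.g. neurons / heads) with the smallest L2 norm;
    # coarser, but the resulting model is genuinely smaller and faster
    norms = np.linalg.norm(w, axis=1)
    k = int(sparsity * w.shape[0])
    pruned = w.copy()
    pruned[np.argsort(norms)[:k]] = 0.0
    return pruned

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 16))   # toy stand-in for an encoder weight matrix
wu = unstructured_prune(w, 0.5)
ws = structured_prune_rows(w, 0.5)
```

Both reach the same nominal sparsity, but only the structured variant can be exploited by dense hardware without sparse kernels, which is the practical tradeoff the study evaluates on the comment-generation task.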
Dense information retrieval yields strong in-domain performance, but often struggles with out-of-domain generalization, lagging behind unsupervised methods. Retrieval tasks can vary across a number of dimensions including domain, query intent, and language. Using a single dense retrieval model for all tasks often underperforms lexical methods such as BM25. For practical information retrieval systems, it is expensive to deploy a different model for each task. Therefore, our motivation is to develop a cheap and effective information retrieval model that maintains strong performance across different domains while easily adapting to any new domain. Other approaches to domain transfer in information retrieval rely on large auxiliary language models or datasets and create a separate model for each task. In this work, we develop a method utilizing prompt tuning to efficiently adapt dense retrievers with a minimal amount of additional computation. By combining models trained on a variety of different domains, we can effectively boost performance on a target task in a new domain. Specifically, we train dense retrieval models using prompt tuning on a large number of information retrieval tasks across diverse domains and types of query intents. To adapt to a new domain, we create new prompt embeddings by averaging the prompt embeddings from a set of source tasks selected in an unsupervised manner. We evaluate zero-shot transfer performance across a wide variety of information retrieval domains and show competitive performance while leveraging a minimal amount of compute. Notably, our SPIRIT method achieves this while being extremely lightweight and practical to deploy in production.
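The prompt-averaging step described above can be sketched as below; the prompt length, embedding dimension, and task names are illustrative, and the unsupervised source-task selection itself is elided:

```python
import numpy as np

def adapt_prompt(source_prompts, selected):
    # new-domain prompt = mean of the prompt embeddings from the selected
    # source tasks; the selection itself would be made in an unsupervised
    # manner (e.g. by similarity to the target query distribution)
    return np.mean([source_prompts[t] for t in selected], axis=0)

prompt_len, dim = 8, 32
rng = np.random.default_rng(0)
# hypothetical per-task prompt embeddings learned via prompt tuning
source_prompts = {t: rng.normal(size=(prompt_len, dim))
                  for t in ["news", "bio", "legal"]}
new_prompt = adapt_prompt(source_prompts, ["news", "bio"])
```

Because only the small prompt matrix changes per domain, the shared dense retriever backbone is deployed once, which is what keeps the approach cheap in production.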
Printable-electronics-based electromagnetic absorbers are receiving increasing attention from the electromagnetic community because of their unprecedented advantages. This paper presents the design of printable electromagnetic absorbers for the X band. The design of the absorber is optimized using a Genetic Algorithm (GA) to enhance the absorptivity and the absorption bandwidth. The design involves the placement of several square-shaped conductive ink patches at optimal locations on the paper substrate such that the desired absorption characteristics are obtained. Simulations are carried out using the HFSS simulation software. The optimized structure offers an absorptivity of more than 90% in the X band, thereby proving to be a viable solution for stealth applications.
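A generic genetic algorithm of the kind used here can be sketched as follows; the fitness function is a toy stand-in for the HFSS-simulated absorptivity (matching a target placement pattern), and all hyperparameters are illustrative:

```python
import random

def fitness(layout):
    # toy stand-in for simulated absorptivity: each gene marks whether a
    # conductive-ink patch occupies a candidate cell; reward matching a
    # fixed target pattern
    target = [1, 0, 1, 1, 0, 1, 0, 1]
    return sum(1 for a, b in zip(layout, target) if a == b)

def evolve(pop_size=20, genes=8, generations=30, p_mut=0.1, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(genes)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # elitist selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, genes)       # single-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ (rng.random() < p_mut) for g in child]  # bit-flip mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

In the actual design loop, each fitness evaluation would be an HFSS simulation of the candidate patch layout, which is why the GA's sample efficiency matters.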
@inproceedings{misra2018genetic,
title={Genetic Algorithm Optimized Inkjet Printed Electromagnetic Absorber on Paper Substrate},
author={Misra, Diganta and Pelluri, Rahul and Verma, Vijay Kumar and Appasani, Bhargav and Gupta, Nisha},
booktitle={2018 International Conference on Applied Electromagnetics, Signal Processing and Communication (AESPC)},
volume={1},
pages={1--3},
year={2018},
organization={IEEE}
}
In this paper, we propose an unsupervised learning approach with an objective to understand gene expressions for analysis of Wilson’s disease in the liver of Mus musculus organisms. We proceeded to obtain the best parameters for cluster division to correctly classify gene expression sets so as to capture the effect and characteristics of the disease in the genome levels of the organisms in the best possible way. The clustering proved beneficial in capturing the correct genetic analogy of Wilson’s disease. Analytical experiments were carried out using various clustering algorithms and were evaluated using performance metrics including silhouette score analysis and Calinski–Harabasz index.
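The Calinski-Harabasz index used in the evaluation can be sketched directly from its definition; the two-blob data below is synthetic, and the "bad" labeling illustrates how the index penalizes cluster assignments that ignore the data's structure:

```python
import numpy as np

def calinski_harabasz(X, labels):
    # ratio of between-cluster to within-cluster dispersion, scaled by
    # degrees of freedom; higher means better-separated clusters
    n, k = len(X), len(set(labels))
    overall = X.mean(axis=0)
    B = W = 0.0
    for c in set(labels):
        Xc = X[labels == c]
        mu = Xc.mean(axis=0)
        B += len(Xc) * np.sum((mu - overall) ** 2)
        W += np.sum((Xc - mu) ** 2)
    return (B / (k - 1)) / (W / (n - k))

rng = np.random.default_rng(0)
# two tight synthetic blobs, standing in for gene-expression clusters
X = np.vstack([rng.normal(0.0, 0.2, size=(30, 2)),
               rng.normal(3.0, 0.2, size=(30, 2))])
good = np.array([0] * 30 + [1] * 30)   # labels matching the blobs
bad = np.tile([0, 1], 30)              # labels ignoring the structure
ch_good = calinski_harabasz(X, good)
ch_bad = calinski_harabasz(X, bad)
```

Sweeping the number of clusters and keeping the assignment with the best index (together with silhouette analysis) is the standard way such metrics guide the choice of cluster division.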
@inproceedings{misra2019large,
title={Large-Scale Meta-Analysis of Genes Encoding Pattern in Wilson’s Disease},
author={Misra, Diganta and Tiwari, Anurag and Chaturvedi, Amrita},
booktitle={Advances in Computer Communication and Computational Sciences: Proceedings of IC4S 2018},
pages={389--400},
year={2019},
organization={Springer}
}
In this paper, a deep learning-based approach is developed to classify images of galaxies into three major categories, namely elliptical, spiral, and irregular. The classifier successfully classified the images with an accuracy of 97.3958%, outperforming conventional classifiers like Support Vector Machine and Naive Bayes. The convolutional neural network architecture comprises one input convolution layer with 16 filters, followed by 4 hidden layers, 1 penultimate dense layer, and an output Softmax layer. The model was trained on 4,614 images for 200 epochs on an NVIDIA DGX-1 Tesla V100 supercomputer and was subsequently tested on new images to evaluate its robustness and accuracy.
@incollection{misra2020convoluted,
title={Convoluted cosmos: classifying galaxy images using deep learning},
author={Misra, Diganta and Mohanty, Sachi Nandan and Agarwal, Mohit and Gupta, Suneet K},
booktitle={Data Management, Analytics and Innovation},
pages={569--579},
year={2020},
publisher={Springer}
}
I am an active lead maintainer of the Reproducible Continual Learning framework by Avalanche and also work actively on the evaluation framework of Avalanche, mainly on integrating the Weights & Biases API.
Echo is an OSS deep learning package with support for TensorFlow, PyTorch and MegEngine, containing novel validated methods, components and building blocks used in deep learning.
I currently lead the modelling part of Multi-Domain Expert Layers (MDEL) Training: How to increase knowledge without breaking the bank?, a collaborative effort coordinated by Ontocord AI, wherein my team works on different aspects of architecture design and training of the MDEL model on the SUMMIT supercomputer cluster as part of the INCITE allocation.
Data Science Intern | Jun. 2018 - Feb. 2019 | CSIR-CDRI
During this internship, I helped build the analytical pipeline for a project on understanding the demographics of venture capital and early seed investments, covering data collection, data pre-processing and cleaning, geo-spatial analysis, and documentation. As part of a team of three, I was advised and mentored by Dr. Sukant Khurana.
Served as a primary instructor for cultural engagements, alongside teaching basic English and computer science to primary-grade students at RangsonWittaya School, Nakhon Sawan, under the AIESEC SDG #4 programme. I was also part of culture exchange, entrepreneurship, and social service programs at Bangkok University.
I was responsible for developing the content for the Strategies section in the Continual Learning lecture of the Deep Learning Cohort of Neuromatch Academy 2021.
I was the lead organizer of the W&B MLRC 2021, where I actively supported our challenge participants. Our mission in organizing this challenge was to make machine learning research reproducible, transparent, and accessible to everyone. The initiative was also supported by our W&B MLRC Grant of $500 for each participant.
I was awarded the UNIQUE AI Excellence Scholarship, worth CAD $10,000, for the academic year 2022. Under this scholarship, I will be working with Irina Rish and Pouya Bashivan on research in dynamic sparsity.