Arijit Ray's Webpage

Selected publications (all):

Mull-Tokens

Mull-Tokens: Modality-Agnostic Latent Thinking

arXiv 2025

TL;DR: Instead of text reasoning or explicit image thoughts, modality-agnostic thinking tokens pretrained on multimodal thoughts are more effective for visual reasoning tasks.

SIMS-V

SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding

arXiv 2025

TL;DR: We use simulators to create spatially rich video training data for multimodal language models. Our 7B-parameter model fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on real-world spatial reasoning benchmarks.

GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation

TL;DR: Use textual reasoning (e.g., one should grasp the "handle" of a cup to drink tea) and visual reasoning (e.g., in this image, the cup is grasped by the "handle"), along with an object grasping dataset (6-DOF arm positions for grasping objects), to generate data that maps high-level task descriptions (e.g., how should I grasp a cup to drink tea?) to precise 6-DOF grasps.

SAT

SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models

TL;DR: Simulated spatial aptitude data (SAT), generated using 3D physics engines and involving object and camera movements, can improve spatial reasoning in real images and videos for MLMs while maintaining pretraining commonsense. When instruction-tuned on SAT, LLaVA-13B outperforms some larger MLMs like GPT4-V and Gemini-1.5-pro in spatial reasoning.

R2D3

R2D3: Imparting Spatial Reasoning by Reconstructing 3D Scenes from 2D Images

Tech report 2024

TL;DR: A visual anchor with the corresponding 3D location in text helps multimodal language models perform more accurate 3D estimation.

FED

FED: Feedback-Guided Autonomous Driving

TL;DR: MLMs can benefit autonomous driving by understanding natural language feedback and refining the next waypoint prediction.

Cola

Cola: A Benchmark for Compositional Text-to-image Retrieval

TL;DR: Multimodal layers, rather than the visual or textual encoders, are most responsible for compositional reasoning. Tuning multimodal layers over frozen representations is more effective than tuning a similar number of parameters in the individual encoders or even the entire model.

Lasagna

Lasagna: Layered Score Distillation for Disentangled Object Relighting

TL;DR: Examples generated synthetically with 3D graphics engines are effective in teaching physics-aware edits like relighting if we use score distillation to avoid overfitting.

Socratis

Socratis: Are Large Multimodal Models Emotionally Aware?

TL;DR: A preliminary benchmark to test whether MLMs understand why different people may feel different emotions, for diverse reasons, when viewing the same image-text content.

User-targeted content generation

User-targeted content generation using multimodal embeddings

VQA Error Regions

Knowing What VQA Does Not: Pointing to Error-Inducing Regions to Improve Explanation Helpfulness

VQA Consistency

Sunny and Dark Outside?! Improving Answer Consistency in VQA through Entailed Question Generation

EMNLP 2019, also at CVPR-W 2019 VQA and Visual Dialog Workshop

TL;DR: Use a learned teacher module to reward VQA models for predicting consistent answers to entailed questions.

VQA Relevance

Question Relevance in VQA: Identifying Non-Visual And False-Premise Questions

TL;DR: Image captions are a surprisingly strong baseline for identifying whether a question is relevant to an image.

Hobbies

When I am not training LLMs, I love going to techno (a subgenre of electronic music) fests, making latte art, and engineering simple gadgets. In middle school, I started an informal research society to encourage fellow students to take an interest in science by constructing simple gadgets. We won multiple accolades at school- and city-level exhibitions.