I'm a PhD student in Artificial Intelligence at MICC, University of Florence, working under the guidance of Prof. Andrew D. Bagdanov and Prof. Marco Bertini. With a background in Computer Engineering and AI, my research focuses on pushing the boundaries of Multimodal Vision-Language Models (like CLIP) and their real-world applications.
This expertise is demonstrated through my first-author publications in top-tier venues, including ECCV (main conference), ICLR (main conference), and a NeurIPS workshop. These works reflect my dedication to solving challenging problems and advancing the field of AI.
I recently completed an Applied Scientist Internship at Amazon (RufusX Team, London), where I worked on foundational research and development in Generative AI and Multimodal Large Language Models (MLLMs) as part of the Amazon Rufus initiative.
For more information, feel free to visit my website: marcomistretta.github.io
Cross the Gap: Inter-modal CLIP Representations Are Superior for Intra-modal Tasks
ICLR 2025 (main paper)
Authors: Marco Mistretta*, Alberto Baldrati*, Lorenzo Agnolucci*, Marco Bertini, Andrew D. Bagdanov
Code: GitHub Repository
Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation
ECCV 2024 (main paper)
Authors: Marco Mistretta*, Alberto Baldrati*, Marco Bertini, Andrew D. Bagdanov
Code: GitHub Repository
RE-tune: Incremental Fine Tuning of Biomedical Vision-Language Models for Multi-label Chest X-ray Classification
NeurIPS 2023, Medical Imaging meets NeurIPS Workshop
Authors: Marco Mistretta, Andrew D. Bagdanov
Applied Scientist Intern, Amazon (RufusX Team, London): July 2025 – December 2025
- Worked on Generative AI and Multimodal Large Language Models (MLLMs) within the Amazon Rufus initiative.
- Fine-tuned, evaluated, and deployed large-scale multimodal models impacting millions of customers.
- Collaborated with scientists and engineers to advance real-world multimodal reasoning and generation.
I'm really into:
- Multimodal Learning: Combining visual and language data to get a richer understanding of the world.
- Natural Language Processing (NLP): Teaching machines to understand and communicate in human language.
- Contrastive Self-Supervised Learning: Finding patterns in data without the need for human labels.
- Incremental Learning: Allowing AI models to keep learning from new information without forgetting what they already know.
- Few-Shot Adaptation: Quickly adapting AI to diverse data distributions with minimal examples.
- Prompt Learning: Tuning only a few learnable parameters, so-called "prompts", to maximize VLM performance.
- Test-Time Adaptation: Letting models adjust during inference to handle unseen data on the fly.
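To give a flavor of the prompt-learning idea above: instead of fine-tuning a full vision-language model, only a handful of "prompt" vectors prepended to the text embeddings are trained while the backbone stays frozen. The sketch below is a minimal toy illustration in PyTorch; the dimensions, random stand-in embeddings, and class names are hypothetical, and no real CLIP weights are involved.

```python
import torch
import torch.nn as nn

class PromptLearner(nn.Module):
    """Toy prompt learner: a few shared learnable context vectors are
    prepended to frozen (here: random stand-in) class-name embeddings."""

    def __init__(self, n_prompts=4, dim=512, n_classes=10):
        super().__init__()
        # Only these vectors are trained; everything else is frozen.
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        # Frozen stand-in for precomputed class-name token embeddings.
        self.register_buffer("class_tokens", torch.randn(n_classes, 1, dim))

    def forward(self):
        # Prepend the shared prompts to every class embedding, then pool
        # into a single text feature per class.
        n_classes = self.class_tokens.shape[0]
        ctx = self.prompts.unsqueeze(0).expand(n_classes, -1, -1)
        tokens = torch.cat([ctx, self.class_tokens], dim=1)
        return tokens.mean(dim=1)  # shape: (n_classes, dim)

learner = PromptLearner()
text_features = learner()                  # (10, 512), differentiable w.r.t. prompts
image_feature = torch.randn(1, 512)        # stand-in for a frozen image encoder output
logits = image_feature @ text_features.t() # zero-shot-style class scores, (1, 10)
```

In a real setup the pooled text features would come from a frozen CLIP text encoder and the logits would be trained with a standard classification loss, so that only `n_prompts * dim` parameters are updated.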
- Programming Languages: Python, Java, C++, MATLAB, R
- Frameworks & Tools: PyTorch, TensorFlow, Hugging Face, OpenCV
- Research Areas: Vision-Language Models, Self-Supervised Learning, Few-Shot Learning, Prompt Learning, Incremental Learning
I'd love to connect! Feel free to reach out on:
