Adel Bibi is a Senior Researcher in Machine Learning and Computer Vision at the Department of Engineering Science at the University of Oxford, a Research Member of the Common Room and a former Junior Fellow at Kellogg College, and a member of the ELLIS Society. He is also an R&D Distinguished Advisor at SoftServe. Previously, he was a Senior Research Associate and postdoctoral researcher in the Torr Vision Group at Oxford working with Philip H.S. Torr. He obtained his MSc and PhD from KAUST in 2016 and 2020, respectively, under the supervision of Bernard Ghanem.
His research focuses on trustworthy and robust AI, agentic AI safety, and adversarial machine learning. He received the Horizon Europe Grant in 2026, the Coefficient Giving Grant in 2026, the Systemic AI Safety Grant from the UK AI Security Institute in 2025, the Toyota Motor Europe Grant in 2025, the Google Gemma 2 Academic Award in 2024, and an Amazon Research Award in 2022. His recent work on vulnerabilities in AI agents, transferable adversarial attacks, and AI safety evaluations has received broad international attention, including coverage by Scientific American, The Guardian, NBC News, Computerphile, TLDR News, Globo TV, and the Hugging Face Evaluation Guidebook. His work has received multiple paper and service distinctions, including four best/outstanding workshop paper awards (NeurIPS'23, ICML'23, CVPR'22, OBD'18), four outstanding reviewer awards (CVPR18, CVPR19, ICCV19, ICLR22), and a Notable Area Chair Award at NeurIPS 2023. He regularly serves as a Senior Area Chair and Area Chair for major AI conferences including NeurIPS, ICML, ICLR, etc. Bibi has authored more than 50 publications in top-tier machine learning and computer vision venues, including NeurIPS, ICML, ICLR, CVPR, ICCV, ECCV, AAAI, and TPAMI.
Download my resume
[Note!] I am always looking for strong self-motivated PhD students. If you are interested in AI Safety, Trustworthy, and Security of AI models and Agentic AI, reach out!
[Consulting Expertise] I have consulted in the past on projects spanning core machine learning and data science, computer vision, certification and AI safety, optimization formulations for matching and resource allocation problems, among other areas.
PhD in Electrical Engineering (4.0/4.0); Machine Learning and Optimization Track, 2020
King Abdullah University of Science and Technology (KAUST)
MSc in Electrical Engineering (4.0/4.0); Computer Vision Track, 2016
King Abdullah University of Science and Technology (KAUST)
BSc in Electrical Engineering (3.99/4.0), 2014
Kuwait University
~~ End of 2025 ~~
~~ End of 2023 ~~
~~ End of 2022 ~~
~~ End of 2021 ~~
~~ End of 2020 ~~
~~ End of 2019 ~~
~~ End of 2018 ~~
~~ End of 2017 ~~
~~ End of 2016 ~~
~~ End of 2015 ~~
The integration of new modalities enhances the capabilities of multimodal large language models (MLLMs) but also introduces additional vulnerabilities. In particular, simple visual jailbreaking attacks can manipulate open-source MLLMs more readily than sophisticated textual attacks. However, these underdeveloped attacks exhibit extremely limited cross-model transferability, failing to reliably identify vulnerabilities in closed-source MLLMs. In this work, we analyse the loss landscape of these jailbreaking attacks and find that the generated attacks tend to reside in high-sharpness regions, whose effectiveness is highly sensitive to even minor parameter changes during transfer. To further explain the high-sharpness localisations, we analyse their feature representations in both the intermediate layers and the spectral domain, revealing an improper reliance on narrow layer representations and semantically poor frequency components. Building on this, we propose a Feature Over-Reliance CorrEction (FORCE) method, which guides the attack to explore broader feasible regions across layer features and rescales the influence of frequency features according to their semantic content. By eliminating non-generalizable reliance on both layer and spectral features, our method discovers flattened feasible regions for visual jailbreaking attacks, thereby improving cross-model transferability. Extensive experiments demonstrate that our approach effectively facilitates visual red-teaming evaluations against closed-source MLLMs.
Monitoring large language models' (LLMs) activations is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible–costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term-by-term. At test-time, one can early-stop for lightweight monitoring, or use more terms for stronger guardrails when needed. TPCs provide two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can buy stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring costs. On two large-scale safety datasets (WildGuardMix and BeaverTails), for 4 models with up to 30B parameters, we show that TPCs compete with or outperform MLP-based probe baselines of the same size, all the while being more interpretable than their black-box counterparts.
Agents backed by large language models (LLMs) increasingly rely on external tools drawn from marketplaces where multiple providers offer functionally equivalent options. This raises a critical fairness concern: systematic bias in tool selection can degrade user experience and distort competition by privileging certain providers over others. We introduce a benchmark of diverse tool categories, each containing multiple functionally equivalent tools, to systematically evaluate tool-selection bias. Using this benchmark, we evaluate seven LLMs and show that substantial bias persists, with models either fixating on a single provider or disproportionately favoring tools that appear earlier in the context. To uncover the sources of this behavior, we conduct controlled experiments that isolate the effects of tool features, exposed metadata (name, description, and parameters), and pre-training exposure. We find that (1) semantic alignment between user queries and tool metadata is the strongest driver of selection; (2) small perturbations to tool descriptions can significantly shift choices; and (3) repeated pre-training exposure to a single endpoint amplifies provider-level bias. Finally, we propose a lightweight mitigation strategy that first filters tools to a relevant subset and then samples uniformly, substantially reducing selection bias while maintaining strong task coverage. Our results highlight tool-selection bias as a key obstacle to the fair deployment of tool-augmented LLM agents.