Zhang, H., Cui, Z., Wang, X., Zhang, Q., Wang, Z., Wu, D., & Hu, S. (2025). If Multi-Agent Debate is the Answer, What is the Question?
@unpublished{zhang2025multiagentdebateanswerquestion,
title = {If Multi-Agent Debate is the Answer, What is the Question?},
author = {Zhang, Hangfan and Cui, Zhiyao and Wang, Xinrun and Zhang, Qiaosheng and Wang, Zhen and Wu, Dinghao and Hu, Shuyue},
year = {2025},
eprint = {2502.08788},
archiveprefix = {arXiv},
primaryclass = {cs.CL},
arxiv = {https://arxiv.org/abs/2502.08788}
}
Multi-agent debate (MAD) has emerged as a promising approach to enhance the factual accuracy and reasoning quality of large language models (LLMs) by engaging multiple agents in iterative discussions during inference. Despite its potential, we argue that current MAD research suffers from critical shortcomings in evaluation practices, including limited dataset overlap and inconsistent baselines, raising significant concerns about generalizability. Correspondingly, this paper presents a systematic evaluation of five representative MAD methods across nine benchmarks using four foundational models. Surprisingly, our findings reveal that MAD methods fail to reliably outperform simple single-agent baselines such as Chain-of-Thought and Self-Consistency, even when consuming additional inference-time computation. From our analysis, we found that model heterogeneity can significantly improve MAD frameworks. We propose Heter-MAD enabling a single LLM agent to access the output from heterogeneous foundation models, which boosts the performance of current MAD frameworks. Finally, we outline potential directions for advancing MAD, aiming to spark a broader conversation and inspire future work in this area.
Wang, X., Cui, Z., Li, H., Zeng, Y., Wang, C., Song, R., Chen, Y., Shao, K., Zhang, Q., Liu, J., Ren, S., Hu, S., & Wang, Z. (2025). PerPilot: Personalizing VLM-based Mobile Agents via Memory and Exploration.
@unpublished{wang2025perpilotpersonalizingvlmbasedmobile,
title = {PerPilot: Personalizing VLM-based Mobile Agents via Memory and Exploration},
author = {Wang, Xin and Cui, Zhiyao and Li, Hao and Zeng, Ya and Wang, Chenxu and Song, Ruiqi and Chen, Yihang and Shao, Kun and Zhang, Qiaosheng and Liu, Jinzhuo and Ren, Siyue and Hu, Shuyue and Wang, Zhen},
year = {2025},
eprint = {2508.18040},
archiveprefix = {arXiv},
primaryclass = {cs.AI},
arxiv = {https://arxiv.org/abs/2508.18040}
}
Vision language model (VLM)-based mobile agents show great potential for assisting users in performing instruction-driven tasks. However, these agents typically struggle with personalized instructions – those containing ambiguous, user-specific context – a challenge that has been largely overlooked in previous research. In this paper, we define personalized instructions and introduce PerInstruct, a novel human-annotated dataset covering diverse personalized instructions across various mobile scenarios. Furthermore, given the limited personalization capabilities of existing mobile agents, we propose PerPilot, a plug-and-play framework powered by large language models (LLMs) that enables mobile agents to autonomously perceive, understand, and execute personalized user instructions. PerPilot identifies personalized elements and autonomously completes instructions via two complementary approaches: memory-based retrieval and reasoning-based exploration. Experimental results demonstrate that PerPilot effectively handles personalized tasks with minimal user intervention and progressively improves its performance with continued use, underscoring the importance of personalization-aware reasoning for next-generation mobile agents.
Refereed journal articles
Qi, M., Cui, Z., & Liang, G. (2023). TBVPAKE: An efficient and provably secure verifier-based PAKE protocol for IoT applications. Journal of Systems Architecture, 139, 102874. https://www.sciencedirect.com/science/article/pii/S138376212300053X
@article{QI2023102874,
title = {TBVPAKE: An efficient and provably secure verifier-based PAKE protocol for IoT applications},
journal = {Journal of Systems Architecture},
volume = {139},
pages = {102874},
year = {2023},
issn = {1383-7621},
doi = {10.1016/j.sysarc.2023.102874},
url = {https://www.sciencedirect.com/science/article/pii/S138376212300053X},
author = {Qi, Mingping and Cui, Zhiyao and Liang, Gaowei},
keywords = {Password-authenticated key exchange, PAKE, Provably secure, IoT}
}
Password-authenticated key exchange (PAKE) is an important cryptographic primitive that allows two parties to authenticate each other and establish a cryptographically strong key from a low-entropy password over an insecure channel. It is therefore suitable for access control and for securing communications between low-cost Internet of Things (IoT) devices, where sound security mechanisms are difficult to apply. This paper contributes to securing IoT applications by presenting a secure, efficient, and easy-to-implement verifier-based PAKE protocol named TBVPAKE (short for Two-Basis Verifier-based PAKE). It is secure against off-line dictionary attacks and server compromise attacks, and it supports perfect forward secrecy. Under the widely accepted BPR security model, the security of TBVPAKE is formally proved in the random oracle model by reduction to the Computational Diffie–Hellman (CDH) and Simultaneous Diffie–Hellman (SDH) assumptions. In addition, we compare TBVPAKE with other prominent verifier-based PAKE protocols by instantiating them over a commonly used elliptic curve group; the comparative analysis shows that TBVPAKE offers better computational efficiency and ease of implementation. TBVPAKE might therefore be a better choice for securing IoT applications.
Refereed conference proceedings
Ren, S., Cui, Z., Song, R., Wang, Z., & Hu, S. (2024). Emergence of Social Norms in Generative Agent Societies: Principles and Architecture. In K. Larson (Ed.), Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24 (pp. 7895–7903). International Joint Conferences on Artificial Intelligence Organization; Human-Centred AI. https://doi.org/10.24963/ijcai.2024/874
@inproceedings{ijcai2024p0874,
title = {Emergence of Social Norms in Generative Agent Societies: Principles and Architecture},
author = {Ren, Siyue and Cui, Zhiyao and Song, Ruiqi and Wang, Zhen and Hu, Shuyue},
booktitle = {Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, {IJCAI-24}},
publisher = {International Joint Conferences on Artificial Intelligence Organization},
editor = {Larson, Kate},
pages = {7895--7903},
year = {2024},
month = aug,
note = {Human-Centred AI},
doi = {10.24963/ijcai.2024/874},
url = {https://doi.org/10.24963/ijcai.2024/874},
arxiv = {https://arxiv.org/abs/2403.08251}
}
Social norms play a crucial role in guiding agents towards understanding and adhering to standards of behavior, thus reducing social conflicts within multi-agent systems (MASs). However, current LLM-based (or generative) MASs lack the capability to be normative. In this paper, we propose a novel architecture, named CRSEC, to empower the emergence of social norms within generative MASs. Our architecture consists of four modules: Creation & Representation, Spreading, Evaluation, and Compliance. This addresses several important aspects of the emergent processes all in one: (i) where social norms come from, (ii) how they are formally represented, (iii) how they spread through agents’ communications and observations, (iv) how they are examined with a sanity check and synthesized in the long term, and (v) how they are incorporated into agents’ planning and actions. Our experiments deployed in the Smallville sandbox game environment demonstrate the capability of our architecture to establish social norms and reduce social conflicts within generative MASs. The positive outcomes of our human evaluation, conducted with 30 evaluators, further affirm the effectiveness of our approach. Our project can be accessed via the following link: https://github.com/sxswz213/CRSEC.