Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/49309
Title: AI-assisted teams outperform AI-led teams but not human-only teams in assessing research reproducibility in quantitative social science
Authors: Brodeur, Abel
Valenta, David
Marcoci, Alexandru
Aparicio, Juan P
Mikola, Derek
Barbarioli, Bruno
Alexander, Rohan
Deer, Lachlan
Stafford, Tom
Vilhuber, Lars
Bensch, Gunther
Motoki, Fabio
Abdelhady, Mohamed
Abdelmoula, Yousra
Baki, Ghina Abdul
Aguirre, Tomás
Aiyer, Sriraj
Akhtar, Shumi
Akhtar, Farida
Albada, Melle R
Altman, Micah
Angenendt, David
Arjmandi Lari, Zahra
De León Tejada, Jorge Armando
Arana, David Rodriguez
Asanov, Igor
Noha, Anastasiya-Mariya
Ashong, Rebecca
Auer, Tobias
Bahamonde-Birke, Francisco J
Baker, Bradley J
Bartram, Söhnke M
Bao, Dongqi
Batinovic, Lucija
Batistoni, Tommaso
Beeder, Monica
Beland, Louis-Philippe
Gero Bienz, Carsten
Aryanto, Christ Billy
Bolibaugh, Cylcia
Bonander, Carl
Bravo, Ramiro
Bronnikov, Egor
BRUNS, Stephan 
Buliskeria, Nino
Caicedo-Silva, Sara
Calef, Andrea
Sebastian Cano Arias, Juan
A Castillo Alvarez, Gustavo
Caulker, Solomon
Cepenas, Simonas
Chatton, Arthur
Chen, Zirou
Chioma Ewurum, Ngozi
Ciocîrlan, Anda-Bianca
Clouth, Felix J
Cook, Nikolai
Collins, Jason
Cornejo, Cesar
Craveiro, João
Créchet, Jonathan
Cui, Jing
Chalil Vayalabron, Niveditha
Czymara, Christian
Bermúdez Jaramillo, Carlos Daniel
Datta, Hannes
Denoo, Lien
Dhaliwal, Arshia
Dhameja, Nency
Djemai, Elodie
Dujeancourt, Erwan
Dündar, Uǧurcan
Duprey, Thibaut
Eissa, Yasmine
El Fassi, Youssef
El Fassi, Ismail
Ellis, Keaton
Elminejad, Ali
Elsherif, Mahmoud
Emirmahmutoglu, Aysil
Etingin-Frati, Giulian
Eze, Emeka
Dollbaum, Jan Fabian
Feld, Jan
Felipe Rengifo Jaramillo, Andres
Fenig, Guidon
Fernandes, Victoria
Fiala, Lenka
Fink, Lukas
Firouzjaeiangalougah, Mojtaba
Fish, Sara
Fitzgerald, Jack
Forshaw, Rachel
Fortier-Chouinard, Alexandre
Fréget, Louis
Frese, Joris
Gabani, Jacopo
Gallegos, Sebastian
Gamill, Max C
Gáspár, Attila
Gauriot, Romain
Gavrilova, Evelina
Geraldes, Diogo
Cantone, Giulio Giacomo
Gibson, Grant
Goldschmitt, Dirk
Gourdon-Kanhukamwe, Amélie
Gregor de Varda, Andrea
Grigoryeva, Idaliya
Gugushvili, Alexi
Fletcher, Aaron H A
Habermann, Florian
Hablicsek, Márton
Haddad, Joanne
Hall, Jonathan D
Hammar, Olle
Hassouneh, Malek
Hausladen, Carina I
Hendrikse, Sophie C F
Hepplewhite, Matthew
Ho, Anson T Y
Hogan-Hennessy, Senan
Howley, Elliot
Huang, Gaoyang
Ilchovska, Zlatomira G
Jaimes Santamaria, Paola
HULSTAERT, Héloïse 
Jakobsson, Niklas
Jansson, Joakim
Jarosz, Ewa
Jebeli, Hossein
Jiang, Yanchen
Junaid, Hiba
Kalluraya, Rohan
Karim, Sunny
Kelly, Edmund
Kimel, Eva
Kingsuwankul, Sorravich
Klotzbücher, Valentin
Krähmer, Daniel
Krūminas, Pijus
Kruus, Nicholas
Kujansuu, Essi
Kurz, Christoph F
Küster, Stephan
Lee-Whiting, Blake
Lewandowski, Felix
Li, Ruoxi
Liu, Dan
Liu, Jiacheng
Li, Tongzhe 
Lo, Helix
Loter, Katharina
Macedo Dias, Felipe
Madan, Christopher R
Mäder, Nicolas
Mandas, Marco
Mantilla, Cesar
Marcus, Jan
Marino Fages, Diego
Martin, Xavier
McWay, Ryan
Medina-Gaspar, Daniel
Meng, Sisi
Meng, Lingyu
Merz, Simon
Miller, Alex P
Mirabel, Thibault
Mishra, Dibya Deepta
Mishra, Sumit
Moges, Belay W
Mohandes Mojarrad, Morteza
Mohnen, Myra
Morin, Louis-Philippe
Muehlenbachs, Lucija
Mullin, Gastón
Musulan, Andreea
Muzzì, Sara
Myers, James A C
Neubauer, Florian
Niazi, Ali
Nordstrom, Ardyn
Nowak, Bartłomiej
NGUYEN, Tuan 
O'Habib, Daneal
Ölkers, Tim
Ong, Justin
Orozco Castiblanco, Valeria
Özak, Ömer
Ozkes, Ali I
Paaso, Mikael
Pandey, Shubham
Papazoglou, Varvara
Penheiro, Romeo
Pham, Linh
Phieler, Ulrike
Pütz, Peter
Qi, Quan
Qiu, Jingyi
Rein, Manuel T
Reinstein, David A
Repo, Juuso
Rudolf, Nicolas
Saha, Shree
Saka, Orkun
Saponaro, Chiara
Sator, Georg
Schoenmakers, Martijn
Seri, Raffaello
Shah, Meet
Siemroth, Christoph
SIBILLE, Paul 
Skavysh, Vladimir
Slater, Ben
Staubli, Stefan
Steindl, Tobias
Waongo, Nomwendé Steven
Stott, Paul
Song, Wenting
Strobel, Stephenson
Sudhaharan, Roshini
Sun, Pu
Swain, Scott D
Talavera, Oleksandr
Tantiangco, Hanz M
Tarasenko, Georgy
Tarlinton, Boyd
Tarraf, Mariam
Teoh, Ken
Thériault, Rémi
Thompson, Bethan
Tian, Tonghui
Tian, Wenjie
Tolani, Emmanuel
Borgen, Nicolai
Topstad Borgen, Solveig
Torralba, Javier
Velez-Ospina, Carolina
Mak, Man Wai
Wallrich, Lukas
Wang, Zeyang
Ward, Leah
Webb, Matthew D
Webb, Duncan
Weber, Bryan S
Weber, Christoph
Weng, Wei-Chien
Westheide, Christian
Wilkinson, Tom
Wong, Kwong-Yu
Wroński, Marcin
Wu, Zhuangchen
Wu, Qixia
Wu, Victor Y
Xiao, Bohan
Xu, Feihong
Yadav, Pranav
Yang Chou, Yu
Yap, Luther
Xu, Cong
Yazbeck, Myra
Yao, Bo
Zagrodzka, Zuzanna
Zahra, Tahreen
Zaneva, Mirela
Zhong, Han
Zirgulis, Aras
Zou, Jiacheng
Zhao, Ziwei
Zhang, Xiaomeng
Zoutman, Floris
Zozoungbo, Christelle
Issue Date: 2026
Source: Proceedings of the National Academy of Sciences of the United States of America, 123 (22) (Art N° e2524747123)
Abstract: Large Language Models (LLMs) such as ChatGPT are transforming how scientists conduct and validate research, offering promise as tools to improve scientific reproducibility. However, computational reproducibility and error detection remain expensive and labor-intensive. We experimentally test how collaboration between researchers and LLM assistants influences the reproduction of quantitative social science findings across different levels of AI autonomy. We randomly assigned 288 researchers to 103 teams working under three conditions: human-only, AI-assisted (using ChatGPT as a collaborative tool), or AI-led (ChatGPT operating with minimal human oversight). Teams reproduced published results from leading social science journals, detected coding errors, and proposed robustness checks. Human-only and AI-assisted teams achieved comparable reproduction rates (94% vs. 91%) and performed similarly on most outcomes, except human-only teams identified significantly more major coding errors. Both substantially outperformed AI-led teams, which achieved only a 37% reproduction rate, detected fewer errors across all categories, proposed weaker robustness checks, and required more time. This autonomous approach, however, likely represents only a lower bound of AI capabilities. Despite rapid model advances, expert human judgment currently remains indispensable for reliable empirical verification. While AI assistance did not degrade most outcomes, it provided no measurable advantages and was associated with reduced detection of major errors. However, the 37% autonomous reproduction rate indicates that AI could provide value in settings where scale or cost constraints preclude human review of papers, even though general-purpose LLMs offer no immediate advantages for human-supervised verification.
Keywords: AI;large language models;reproducibility;Humans;Reproducibility of Results;Large Language Models;Generative Artificial Intelligence;Intelligent Systems;Cooperative Behavior;Social Sciences;Artificial Intelligence
Document URI: http://hdl.handle.net/1942/49309
ISSN: 0027-8424
e-ISSN: 1091-6490
DOI: 10.1073/pnas.2524747123
Rights: 2026 the Author(s). Published by PNAS. This article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).
Category: A1
Type: Journal Contribution
Appears in Collections: Research publications

Files in This Item:
File: brodeur-et-al-2026-ai-assisted-teams-outperform-ai-led-teams-but-not-human-only-teams-in-assessing-research.pdf
Description: Published version
Size: 995.32 kB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.