Artificial intelligence (AI) tools powered by large language models (LLMs) can more accurately predict the results of proposed neuroscience studies than humans, a study has found.
The study, carried out by researchers at UCL and published in Nature Human Behaviour, says the results could pave the way for greater use of LLMs in scientific research, and that the technology “can distil patterns from scientific literature, enabling them to forecast scientific outcomes with superhuman accuracy”.
The researchers developed BrainBench, a benchmark that evaluates how well LLMs can predict neuroscience study results. Its tests consisted of numerous pairs of neuroscience study abstracts: in each pair, one version was the real abstract, while in the other the outcomes had been altered.
They then tested 15 different general-purpose LLMs against 171 human neuroscience experts to see whether the AI or the human experts could more reliably identify which of the two paired abstracts reported the actual study results.
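The article does not describe the scoring mechanism in detail, but the comparison lends itself to a simple illustration: a language model can score each version of an abstract by how probable it finds the text and choose the version it finds less surprising. The sketch below assumes a Hugging Face causal language model; the model choice, the mean token log-likelihood score and the helper names are illustrative, not the study’s actual code.

```python
# A minimal sketch, assuming a Hugging Face causal language model, of comparing
# two versions of an abstract by how "surprising" the model finds each text.
# The model choice ("gpt2"), the mean token log-likelihood score and the helper
# names are illustrative assumptions, not the study's actual method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_log_likelihood(text: str) -> float:
    """Average log-probability the model assigns to the tokens of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels supplied, the model returns the mean cross-entropy loss,
        # i.e. the negative mean token log-likelihood.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return -loss.item()

def pick_real_abstract(version_a: str, version_b: str) -> str:
    """Guess which of two candidate abstracts reports the real results."""
    score_a = mean_log_likelihood(version_a)
    score_b = mean_log_likelihood(version_b)
    return version_a if score_a >= score_b else version_b
```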
On average, the LLMs were 81 per cent accurate, while the human experts averaged 63 per cent.
Even when the study team restricted the comparison to responses from neuroscientists with the highest degree of expertise in a given domain of neuroscience, their accuracy, at 66 per cent, still fell short of the LLMs’. The study also found that when LLMs were more confident in their decisions, they were more likely to be correct.
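One way to make that confidence-accuracy link concrete, purely as an illustration, is to treat the gap between the scores of the chosen and rejected abstract as a confidence signal and check whether accuracy rises with it. The function names, the logistic squash and the binning scheme below are assumptions for the sketch, not the method reported in the paper.

```python
# A rough sketch of the confidence-accuracy check described above: the gap
# between the scores of the chosen and rejected abstract is squashed into a
# 0-1 confidence, trials are binned by confidence, and accuracy is computed
# per bin. The logistic squash and the binning scheme are assumptions.
import numpy as np

def confidence(score_chosen: float, score_rejected: float) -> float:
    """Turn the log-likelihood gap into a 0-1 confidence value."""
    return 1.0 / (1.0 + np.exp(-(score_chosen - score_rejected)))

def accuracy_by_confidence(confidences, correct, n_bins=5):
    """Bin trials by confidence and report accuracy within each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(confidences.min(), confidences.max(), n_bins + 1)
    bin_idx = np.digitize(confidences, edges[1:-1])  # bins 0 .. n_bins - 1
    # A well-calibrated model shows accuracy rising from low- to high-confidence bins.
    return [(b, correct[bin_idx == b].mean())
            for b in range(n_bins) if (bin_idx == b).any()]
```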
The report says that LLMs’ predictions are informed by a “vast scientific literature that no human could read in their lifetime” and, as LLMs improve, “so should their ability to provide accurate predictions”.
“In the future, rather than simply selecting the most likely result for a study, LLMs can generate a set of possible results and judge how likely each is. Scientists may interactively use these future systems to guide the design of their experiments,” it adds.
Bradley Love, a professor of cognitive and decision sciences in experimental psychology at UCL, said that, in light of the results, “we suspect it won’t be long before scientists are using AI tools to design the most effective experiment for their question”.
Professor Love noted that, while the study focused on neuroscience, “our approach was universal and should successfully apply across all of science”.
“What is remarkable is how well LLMs can predict the neuroscience literature. This success suggests that a great deal of science is not truly novel, but conforms to existing patterns of results in the literature. We wonder whether scientists are being sufficiently innovative and exploratory,” he added.
However, the report says that LLMs will form part of larger ecosystems that “assist” researchers in determining the best experiments, and warns that one risk of the technology is that scientists might not pursue studies when their own predictions run counter to those of an LLM.
It adds that LLMs’ outputs should include indicators of the certainty or confidence levels associated with their predictions “for LLMs to serve as trustworthy and effective tools”.
Ken Luo, research fellow in psychological embeddings at UCL and lead author of the report, said: “Building on our results, we are developing AI tools to assist researchers. We envision a future where researchers can input their proposed experiment designs and anticipated findings, with AI offering predictions on the likelihood of various outcomes. This would enable faster iteration and more informed decision-making in experiment design.”