Imagine an AI that can sift through mountains of scientific research faster and more accurately than a seasoned academic! A new study published in Nature reveals a chatbot, developed by academic researchers, that appears to outperform even PhD students and postdocs at conducting scientific literature reviews. And this isn't just about speed: the large language model (LLM) is reportedly capable of producing reliable summaries at a remarkably low cost – less than a penny per review!
But here's where it gets controversial... Researchers in the US set out to tackle the 'hallucinations' – fabricated facts or citations – that plague chatbots like ChatGPT. They put a new model called OpenScholar, along with its spin-off ScholarQABench, to the test, enlisting experts from computer science, physics, neuroscience, and biomedicine to compare the AI tools' summaries against those produced by human PhD students.
The results, detailed in the study released on February 4th, were striking. The domain experts – themselves PhDs and postdocs – preferred the AI-generated responses, favoring OpenScholar 51% of the time and ScholarQABench an impressive 70% of the time over human-written reviews.
And this is the part most people miss... What set the AI apart? The study highlights that the chatbots provided greater breadth and depth of information. Their summaries were significantly longer, averaging 1,447 words for OpenScholar and 706 words for ScholarQABench, compared with a human-written average of just 424 words. This suggests a more comprehensive exploration of the literature.
In contrast, even ChatGPT, when used for summaries, was only preferred over human responses in about 31% of cases, and the study notes it often