As scientists, we stand on the shoulders of giants. Scientific progress requires curation and synthesis of prior knowledge and experimental results. However, the scientific literature is so expansive that synthesis, the comprehensive combination of ideas and results, is a bottleneck. The ability of large language models (LLMs) to comprehend and summarize natural language will transform science by automating the synthesis of scientific knowledge at scale. Yet current LLMs are limited by hallucinations, lack access to the most up-to-date information, and do not provide reliable references for statements.
Here, we present WikiCrow, an automated system that can synthesize cited Wikipedia-style summaries for technical topics from the scientific literature. WikiCrow is built on top of FutureHouse’s internal LLM agent platform, PaperQA, which, in our testing, achieves state-of-the-art (SOTA) performance on a retrieval-focused version of PubMedQA and other benchmarks, including a new retrieval-first benchmark, LitQA, developed internally to evaluate systems retrieving full-text PDFs across the entire scientific literature.
As a demonstration of the potential for AI to impact scientific practice, we use WikiCrow to generate draft articles for the 15,616 human protein-coding genes that currently lack Wikipedia articles, or that have article stubs. WikiCrow creates articles in 8 minutes, is much more consistent than human editors at citing its sources, and makes incorrect inferences or statements about 9% of the time, a number that we expect to improve as we mature our systems. WikiCrow will be a foundational tool for the AI Scientists we plan to build in the coming years, and will help us to democratize access to scientific research.
Background
If you’ve spent time in molecular biology, you have probably encountered the “alphabet soup” problem of genomics. Experiments in genomics uncover lists of genes implicated in a biological process, like MGAT5B and ADGRA3. Researchers turn to tools like Google, UniProt, or Wikipedia to learn more, because no single person can be familiar with all ~20,000 human genes. However, according to our count, only 3,639 of the 19,255 human protein-coding genes recognized by the HGNC have high-quality (non-stub) summaries on Wikipedia; the other 15,616 lack pages or are incomplete stubs. Often, plenty is known about a gene, but no one has taken the time to write up a summary. This is part of a much broader problem today: scientific knowledge is hard to access, and often locked up in impenetrable technical reports. To find out about genes like MGAT5B and ADGRA3, you’d end up sinking hours into reading the primary literature.
WikiCrow is a first step towards automated synthesis of human scientific knowledge. As a first demo, we used WikiCrow to generate drafts of Wikipedia-style articles for all 15,616 of the human protein-coding genes that currently lack articles or have stubs, using information from full-text articles that we have access to through our academic affiliations. We estimate that this task would have taken an expert human ~60,000 hours total (6.8 working years). By contrast, WikiCrow wrote all 15,616 articles in a few days (about 8 minutes per article, with 50 instances running in parallel), drawing on 14,819,358 pages from 871,000 scientific papers that it identified as relevant in the literature.
Our articles are still far from perfect. To evaluate WikiCrow, we randomly selected 100 statements from WikiCrow-generated articles and asked:
- Is the statement cited? Is there a nearby citation that is clearly intended to support this statement, and is the citation relevant?
- Is the statement correct according to the citation? Does the cited literature contain the information that is presented in the statement being evaluated?
All statements were thus characterized as having irrelevant or missing citations; as cited and correct; or as cited and incorrect. We then repeated the same process for human-written articles. In brief, the results were as follows:
As you read WikiCrow articles, you will see incorrect statements about 9% of the time. You may also see repetitive statements, or citations that aren’t correct. We expect that these errors will become rarer as the underlying models and techniques improve. On the other hand, WikiCrow is much better at providing citations than human authors. Make sure to check any information you read here yourself before relying on it, and please alert us to any errors you may find. For more technical details, read on:
PaperQA as a Platform for WikiCrow
WikiCrow is built on top of PaperQA, a Retrieval-Augmented Generation (RAG) agent that, in our testing, can answer questions over the scientific literature better than other LLMs and commercial products (see our paper on PaperQA). PaperQA reduces hallucinations, provides context and references for how an answer was generated, is orders of magnitude faster than humans, and retains accuracy on par with experts.
PaperQA is more than just a search tool; it is an adaptive system that chooses which tools to use based on the question and on its intermediate findings. These tools include the following (a minimal code sketch of the loop appears after the list):
- SEARCH: finding relevant papers in online databases, such as arXiv and PubMed;
- GATHER_EVIDENCE: parsing and summarizing text from these papers;
- ANSWER_QUESTION: ranking the relevance of the gathered context and synthesizing information into a final answer.
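For readers who prefer code, here is a minimal sketch of what such a tool-using agent loop can look like. The tool names mirror the list above, but the `AgentState` fields, the `choose_tool` policy, and the dispatch logic are illustrative assumptions rather than PaperQA's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    question: str
    papers: list = field(default_factory=list)     # papers found so far by SEARCH
    evidence: list = field(default_factory=list)   # summaries gathered by GATHER_EVIDENCE
    answer: str | None = None                      # set by ANSWER_QUESTION

def run_agent(question: str, choose_tool, tools: dict, max_steps: int = 10) -> str:
    """Let an LLM-backed policy pick tools until an answer is produced.

    `choose_tool(state)` returns a tool name ("SEARCH", "GATHER_EVIDENCE",
    or "ANSWER_QUESTION") plus keyword arguments; each tool mutates the state.
    """
    state = AgentState(question=question)
    for _ in range(max_steps):
        tool_name, kwargs = choose_tool(state)
        tools[tool_name](state, **kwargs)          # run the chosen tool
        if state.answer is not None:               # the agent decided it is done
            return state.answer
    return "No answer produced within the step budget."
```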
This tool-use process is non-linear. For example, if PaperQA sees a paper that uses a different term for a concept, it can go back and search again with the new nomenclature. Compared to a standard RAG pipeline, PaperQA makes four key changes (each of which improved performance in ablation testing):
- PaperQA breaks the retrieval and generation (RAG) process down into tools for an AI agent, enabling it to perform multiple searches with different keywords whenever the information at hand isn't enough.
- PaperQA employs a Map-Reduce-inspired approach to summarization, in which the AI first collects (maps) evidence from a range of sources and then condenses (reduces) this information into an answer (see the sketch after this list). This increases the number of sources that can be considered and lets the LLM produce preliminary insights before composing the final answer.
- PaperQA uses a hybrid search approach to work across all accessible papers, which number in the hundreds of millions: LLM-assisted keyword search at the corpus level and semantic search at the granular level of pages of text.
- PaperQA implements prior-knowledge prompting strategies to access the knowledge already embedded within the language model, finds evidence in the scientific literature when needed, and uses the resulting answer as a form of posterior knowledge.
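To make the map-reduce point concrete, here is a minimal sketch of that gather-then-condense pattern. The `llm_call` function, the prompt wording, and the "Relevance: 0-10" scoring format are assumptions for illustration; PaperQA's actual prompts and scoring differ.

```python
import re

def parse_relevance(summary: str) -> int:
    """Pull the trailing 'Relevance: <0-10>' score out of a summary; 0 if missing."""
    match = re.search(r"Relevance:\s*(\d+)", summary)
    return int(match.group(1)) if match else 0

def gather_evidence(question: str, chunks: list[str], llm_call) -> list[tuple[int, str]]:
    """Map step: summarize each text chunk with respect to the question and score it."""
    evidence = []
    for chunk in chunks:
        summary = llm_call(
            "Summarize the text below only as it relates to the question, then "
            "end with a line 'Relevance: <0-10>'.\n"
            f"Question: {question}\nText: {chunk}"
        )
        evidence.append((parse_relevance(summary), summary))
    return evidence

def answer_question(question: str, evidence: list[tuple[int, str]], llm_call, k: int = 8) -> str:
    """Reduce step: keep the top-k most relevant summaries and synthesize an answer."""
    top = sorted(evidence, key=lambda item: item[0], reverse=True)[:k]
    context = "\n\n".join(summary for _, summary in top)
    return llm_call(
        "Answer the question using only the context below, citing its sources.\n"
        f"Question: {question}\nContext: {context}"
    )
```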
Importantly, PaperQA builds upon the unique structure of the scientific literature: its citation graph and its categorization into journals and fields. This is only possible thanks to the excellent contributions of the Semantic Scholar team at the Allen Institute for AI, whose API for exploring the citation graph of science is a key feature of PaperQA. We plan to make the full WikiCrow and PaperQA code available on GitHub soon. Until then, the essential aspects of the PaperQA algorithm are available (although you will need access to your own repository of full-text scientific articles), as are the prompts used for WikiCrow.
Benchmarking PaperQA
In our evaluations, PaperQA outperforms GPT-4, Perplexity, and other LLMs, as well as commercial RAG systems, on several benchmarks. It performs strongly on two scientific question-answering benchmarks, MedQA-USMLE and PubMedQA Blind; the latter is a modified version of PubMedQA in which the original context passages are removed, so the system must find the papers from which to retrieve the context itself. Additionally, PaperQA outperforms a range of systems on LitQA, a new benchmark that we developed to validate our performance. LitQA consists of multiple-choice questions that are difficult or impossible to answer accurately without retrieving one or more specific papers, all of which were published in 2022 or later, after the training cutoffs of GPT-4 and Claude 2. Today, LitQA is small, with only 50 questions, as it is extraordinarily time-consuming to generate and validate these types of questions, but we plan to scale it up in the future. Also note that we performed this testing in October 2023 (except for Gemini Pro, which we tested in December 2023) and did not try to optimize any of the commercial systems, so it is possible they could be engineered for higher performance, or would perform better if tested today.
WikiCrow Mechanics
For each essential Wikipedia article section (Structure, Function, Interactions, and Clinical Significance), we carefully prompt the PaperQA agent to collect information on the gene in question from scientific papers. To develop these prompts, we started with Wikipedia’s existing molecular biology style guide, then made significant changes over several empirical iterations. This highlights the continued importance of prompt engineering and the need for improved alignment strategies.
Afterwards, we use another LLM call to edit these four independent sections into a coherent and concise Wikipedia-style article, adding an Overview paragraph at the top while maintaining all citations. The specific prompts used are available. Additionally, we are in conversations with Wikipedia about hosting these articles, and we will continue to make our versions available programmatically; for example, you can use this gsutil command to list all genes available for download: gsutil ls gs://fh-public/wikicrow/
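To make that flow concrete, here is a minimal sketch of the assembly step just described. `paperqa_query` and `llm_edit` are placeholders for the PaperQA agent and the editing call, and the prompt text is illustrative, not the published WikiCrow prompts.

```python
SECTIONS = ["Structure", "Function", "Interactions", "Clinical Significance"]

def write_gene_article(gene: str, paperqa_query, llm_edit) -> str:
    """Draft each section with PaperQA, then merge them into one article."""
    # 1. Draft the four sections independently, each with its own prompt.
    drafts = {
        section: paperqa_query(
            f"Write the '{section}' section of a Wikipedia-style article about "
            f"the human gene {gene}, citing the scientific literature."
        )
        for section in SECTIONS
    }
    # 2. Merge the drafts, keep every citation, and prepend an Overview paragraph.
    combined = "\n\n".join(f"== {section} ==\n{text}" for section, text in drafts.items())
    return llm_edit(
        "Edit the following sections into a single concise Wikipedia-style article. "
        "Keep all citations, remove redundancy, and add a short Overview paragraph "
        f"at the top.\n\n{combined}"
    )
```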
Statements from human-written Wikipedia articles usually failed evaluation due to irrelevant, inappropriate, or absent citation support. We believe this stems from the varying quality of authorship, as well as from the fact that Wikipedia’s format does not require every statement to be justified with peer-reviewed articles. Interestingly, statements from WikiCrow-generated articles follow the opposite pattern: most failures are due to incorrect transmission of information from the cited article. This was typically due to the model’s difficulty discerning highly similar gene names (e.g., GSDMD vs. GSDME), or its failure to parse the logic of complex sentences, such as “knockdown of a repressive gene”, a clause with multiple negatives.
Evaluation of performance of LLMs powered by RAG is a new area of study, and this evaluation strategy has several limitations and challenges, which we highlight here:
- We do not evaluate absolute statement accuracy: We only evaluate whether statements are cited and whether they are true as cited; we do not evaluate whether statements are objectively accurate. Statements that are accurate but either uncited or incorrectly cited, which are probably more common in human-written Wikipedia articles, are scored as incorrect on either the “properly cited” criterion or the “true as cited” criterion. Trivially correct statements are excluded from evaluation.
- Evaluation is challenging to blind: WikiCrow-written articles use significantly more references to bolster individual claims, so it is usually easy to tell which articles were written by humans and which were written by WikiCrow in evaluations.
- Inconsistent citation strategies: Humans use inconsistent citation strategies which require subjective evaluation. For example, we identified several cases of circular references in human-written Wikipedia articles, and we also identified several cases where human articles would cite large database entries like Entrez, rather than primary literature, which were difficult to evaluate. The need to make subjective decisions about whether to exclude such statements raises bias concerns.
- Sample exclusion: Articles generated both by WikiCrow and by humans often contain trivial statements of fact, which also need to be excluded from evaluation on a subjective basis.
Despite these challenges, we think that our evaluation system is a reasonably accurate reflection of the “ground truth” quality of human-written and WikiCrow-written articles. If you have suggestions about how to improve evaluation, let us know, or consider applying to join our Assessment Team!
Conclusion
We built WikiCrow and PaperQA as foundational tools both for human researchers and for the AI Scientists we are building at FutureHouse. We plan for PaperQA to be one of many tools available to our AI Scientists, aiding in knowledge synthesis, experimental planning, hypothesis generation, and more. Moreover, PaperQA will be part of a closed-loop system, ensuring continuous and informed progression from theory to experimentation.
In addition, we believe that the WikiCrow approach will eventually enable synthesis and curation of all human scientific knowledge, in collaboration with human editors. Some directions we expect to explore include the use of dedicated models that are fine-tuned on Wikipedia edits, and improved alignment strategies to reduce the amount of prompt engineering that is needed for generation of comprehensive and coherent articles for a given topic. In the long run, we even envision a “Super-pedia,” where articles are generated about any topic in real time, on-demand, with the most up-to-date information. If you’re excited to work on this, get in touch.
Interested in using PaperQA or WikiCrow? Fill out the form here