Gaps in the human reference genome

These are sequences that we don’t interrogate today because they are not in the human reference genome—so if they are somehow linked to disease, we wouldn’t know about it.

Anna Lindstrand

Two new studies highlight gaps in the human reference genome: European, East Asian, and African individuals often carry sequences that are absent from the reference.

A reference is an important tool in science. It can serve as a baseline for experiments, it supports scientific consensus, and it is often the foundation of exploratory research. The modern era of genomics began with the Human Genome Project (HGP), which assembled and published the human reference genome. Scientists around the world now widely use the newest version of this reference, called GRCh38 (Genome Reference Consortium Human Build 38).

In the genomics industry, the reference plays a special role. It cuts the cost of sequencing, because a newly sequenced genome can be compared to the reference instead of being assembled from scratch.

As a legacy of the Human Genome Project, about 70% of the reference still comes from one individual, and the remaining 30% was contributed by only four others. It can hardly represent the whole of humankind.

With the advent of extensive genome sequencing projects, scientists began to measure how much we are losing by keeping the old reference genome. The newest study, from Seoul National University, shows that GRCh38 lacks 1,390 coding elements (such as genes) found in the high-quality genome AK1. A comparison with 14 individuals from Europe, East Asia, and Africa reveals that ~4.7% of the sequencing data could not be compared to the reference. Another recent study, involving 1,000 Swedes, reported 46 million missing nucleotides. Earlier studies had already found holes in the human reference genome when comparing it with Icelandic and African populations.
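To make the idea of sequencing data that "could not be compared to the reference" concrete, here is a minimal toy sketch in Python. It is not the method used in either study: real pipelines use dedicated aligners that tolerate mismatches and gaps, while this sketch assumes a read is placed only if it matches the reference exactly on one of the two strands. The function names and the short sequences are hypothetical, invented for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def unplaced_fraction(reads: list[str], reference: str) -> float:
    """Fraction of read bases that match the reference on neither strand.

    Toy criterion (an assumption of this sketch): a read counts as placed
    only if it occurs exactly in the reference or its reverse complement.
    """
    both_strands = (reference, reverse_complement(reference))
    total = sum(len(read) for read in reads)
    unplaced = sum(
        len(read)
        for read in reads
        if not any(read in strand for strand in both_strands)
    )
    return unplaced / total if total else 0.0


# Hypothetical data, invented for illustration only.
reference = "ACGTACGTTAGCCGATAGGCTA"
reads = [
    "ACGTTAGC",  # present in the reference as-is
    "TAGCCTAT",  # reverse complement of "ATAGGCTA", so still placed
    "GGGGCCCC",  # novel sequence, absent from the reference
]

print(f"{unplaced_fraction(reads, reference):.1%} of read bases could not be placed")
```

In this toy case one of the three equally long reads has no counterpart in the reference, so a third of the data is left unplaced. Such leftover reads are the raw material from which sequences missing from the reference can be discovered.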

These observations prompted a new initiative. In September, the U.S. National Human Genome Research Institute (NHGRI) announced the development of a new human reference genome. Dubbed the ‘pan-genome’, it will consist of high-quality data gathered from 350 diverse individuals. The timeline is not yet set; grants for the project are planned for 5 years.

Preprint: Jina Kim, Joohon Sung, Kyudong Han, Wooseok Lee, Seyoung Mun, Jooyeon Lee, Kunhyung Bahk, Inchul Yang, Young-Kyung Bae, Changhoon Kim, Jong-il Kim, Jeongsun Seo (2019). Human Reference Genome and a High Contiguity Ethnic Genome AK1. doi:10.1101/795807.
Publication: Jesper Eisfeldt, Gustaf Mårtensson, Adam Ameur, Daniel Nilsson, Anna Lindstrand (2019). Discovery of Novel Sequences in 1,000 Swedish Genomes. doi:10.1093/molbev/msz176.
Photo: NHGRI
