Group 4: Clinical Comparative Genomics at Scale

Male Speaker:
So we were probably the smallest group, I think, of the breakout sessions, comparative
genomics and evolution, but notwithstanding, we had a pretty lively, sometimes confusing
discussion which I think we finally cemented near the end. We started out by kind of
revisiting, I think, some of the things we think are so important and foundational about
this program that has existed at NHGRI almost since its inception. And the fundamentals,
on which I think we were all in agreement, are essentially that evolution is the unifying
principle on which everything that we are doing rests. So, studies of variation and studies
of mutation are fundamental, and understanding how those processes occur really requires
comparative genomics.

The idea of genotype-phenotype correlations,
which is what most of the people in this room are interested in, also benefits and I would
say it probably makes most sense in the context of evolution, and so it provides us an unbiased
framework for discovery and prioritization of regions and, I would argue, that as we
move perhaps into interrogating non-coding sequences, regulatory sequences and trying
to understand them, that the comparative genomics and evolutionary aspects would become more
and more important. And I guess the last important point is that NHGRI really has blazed a trail
in terms of this research. In terms of mammals and vertebrates specifically, there’s
the expertise, we have the computation, we have the resources in terms of libraries and
other types of things, and we have the consortia. So, the track record and the ability to do
this type of work really surpass those of any other institute at NIH.

So, we began by first, kind of (and I’ll
do this very quickly) just reviewing the accomplishments, and sorry to those that we
don’t list as an accomplishment, but there have been many in this area over the last
15 years. Sixty vertebrate genomes have been sequenced in some form or fashion and aligned
with the human data, revealing about 3 million specific evolutionarily conserved segments, so
about 4.5 percent of our genome. I think the important point to think about is that this
is work in progress. In many cases, the genomes are not assembled or they’re just used essentially
to align to the human reference and so we don’t, in many cases, have stand-alone,
high-quality or even reasonable working draft assemblies from many of the genomes. So, I
just looked at the average N50 contig length for primate genomes and it’s on the order
of about 25 kilobases.

Point number two: one of the missions that
we’ve had for many years is essentially to reconstruct the evolutionary history of every
base in the human genome. We are not there yet. We’ve made some good strides in this
area but we’re lacking critical species in terms of high quality. We’re lacking
prosimians; we don’t have a high-quality tarsier reference genome, for example. There are
one and a half million gaps in the tarsier assembly right now. Less than 50 percent of
that sequence can be aligned to the human reference. So, if you think this is a fait accompli,
you’re wrong. We’re not done in terms of this project, at least in terms of how
we set out. We’ve done moderately well. I changed this from deep catalogues to been
[spelled phonetically] catalogues. There have been efforts by the Great Ape Genome Project,
rhesus macaque, African green monkey to begin to survey some of the genetic variation that
exists within these species. There have been studies that have inbred Drosophila strains
to really provide a framework for quantitative genetic trait studies,
and so these have been good. More could be done in this particular area.

And the last point, which we had some debate
whether this was part of our purview but we think this is fundamentally comparative in
nature because it involves comparative genomes both past and present, is the fact that we’ve
had — and I think this is an achievement — the Human Reference Genome Consortium,
whose mission it has been to continually improve the reference of the human genome as we go
forward. So, for many of you, you might think of this as the housekeeping exercise to kind
of finish off the gaps. It’s much more than that. The regions that are being tackled right
now, and it’s a targeted approach for specific regions, are regions that are incredibly diverse.
Think of MHC, think of T cell receptor regions, think of regions around segmental duplications:
highly dynamic, gene rich, important in terms of human health and highly variable. There
is more variation in the three megabase stretch: tens of kilobases, hundreds of kilobases of sequence
variation between different haplotypes which has not been catalogued because of the complexity
of that type of variation. So, as a result, one of the holes (and here we kind of really
echoed the first goal of the last group) is that we have not yet comprehensively assessed
all genetic variation in any single genome. So, it’s not just a question of allele frequency
spectrum, it’s a question of getting all the variation. All the indels, all the structural
variants, all the copy number changes, and there are more. Estimates are that four to five
times more base pairs are affected by structural variation than by single base pair and indel
events.

All right, so that’s kind of the achievements.
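
For readers less familiar with the N50 contig length quoted in these assembly statistics (for example, roughly 25 kilobases for primate genomes), the following is a minimal sketch of how that summary metric is commonly computed. It assumes the contig lengths are available as a plain list of numbers; the function name `n50` and the toy lengths are illustrative only and do not come from the session.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    ordered = sorted(contig_lengths, reverse=True)  # longest contigs first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (lengths in kilobases), not real data.
toy_contigs_kb = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs_kb))  # prints 70: 80 + 70 = 150 is half of the 300 kb total
```

Read this way, the 4.4 megabase N50 cited later in the report, described there as a 150-fold improvement, would correspond to a short-read N50 of roughly 29 kilobases (4,400,000 / 150), which is in line with the primate figure above.
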
This is just to remind you (this is taken from two papers) of the phylogenies
that have been tapped in terms of this. I highlight here a few. I’ll just
mention gorillas, for example. We’ve worked on many of the primate genomes. The gorilla
genome was recently sequenced and assembled. Its average contig length
is about 11 kilobases, I think, if my recollection on this is right. There’s about a half
a million gaps in the gorilla genome. What that means is when we did the four-way alignment
with the apes, which includes humans, 30 percent of the genome could not be aligned in that
four-way alignment. So, only 65-70 percent of the genome could be aligned. That’s
euchromatic. That’s genes. So, we have a great deal of heterogeneity in terms of the quality
of the genomes that have been generated. Many of them have been just used simply to align
to the human reference. So, many of them, mammalian genomes, roughly the 34 that have
been done, 29 depicted here, are not particularly high-quality draft assemblies. The only high-quality
assemblies on this slide really, to be honest, are human and mouse. And many of the others
are in various stages of working draft.

So, when we set out the goals from our group
we actually broke each one down to really four things.
First, essentially, what’s the big question and why is NHGRI relevant to this particular question?
Second was the tactic or the approach. Third was details and fourth was justification,
not in that order. So, I think we agreed in our session that this was the single most
important goal (people who were in that breakout group can disagree with me): to move
from aligning genomes to essentially doing de novo sequence and assembly without guidance.
To be able to take a genome, no matter what species, human or otherwise, and to be able
to generate a high-quality de novo sequence and assembly of that genome, and so we have
the specifics. We would suggest or argue that what NHGRI should invest in, and not be dependent
solely on the commercial sector for this, is to advance sequencing technologies, to
advance or assemble a genome for $10,000. So, this is not to generate 40x sequence coverage
with Illumina. This is to actually assemble. The cost of assembling genomes actually
still has been prohibitive. We have some statistics now based on assembly of one human genome
using long-read PacBio data that suggest it would cost us about $60,000 [unintelligible]
to assemble a genome with an N50 contig length of 4.4 megabases. That’s a 150-fold improvement
in terms of N50 contig length over just standard Illumina sequencing. I don’t
think it’s unreasonable to think that we could have an order of magnitude drop in that
cost to get us to a $10,000 genome assembly. We suggest that one useful... so, this is
the specifics. I think Jeff Schloss asked for this specifically.
One specific approach or area that NHGRI could invest in would be to apply this to a finite
number of Human Reference Genomes. So, to generate Human Reference Genomes at the quality
of the existing Human Reference Genome or better for 50 different humans representing
diversity or sampled broadly across humanity. So, we’re thinking of this as kind of what
we call gold genomes. Very high-quality genomes where most of the bases and structural variation,
copy number, and indels have all been resolved. We think it would be incredibly powerful because
it would give us a comprehensive view of the types of genetic variation that exist kind
of in the sweet spot that right now we can’t access very well.

So, as a member of the structural variation
working group for the last x number of years, part of the 1000 Genomes and before that, earlier,
on other projects, we are not particularly good at detecting inversions. We are not particularly
good at genotyping or detecting insertions. We are terrible in terms of complex structural
variation events with [unintelligible] duplications. And so this is an area, if you could think
about it, where we would have 50 references (call them continental genomes if you will: Africans,
Asians, Amerindians, Europeans), but we would have high, very high-quality references at
those positions. This would give us, I think, the first truly comprehensive view. We just
ran some statistics recently and we think, based on what we’ve been able to do on one
genome comparing Illumina and PacBio technology, that between 50 base pairs and 5,000 base
pairs we are missing 90 percent of deletion variants. Or, I should say, 62 percent of
deletion variants, 90 percent of insertion variants. So, if we think we are completely
understanding this variation in the human genome, we’re really mistaken. I pushed
for this but there was pushback in the group, so I’ll just mention this. I think
the goal should even be better than this. I think we should push to sequence from telomere
to telomere every human chromosome, including the dark matter. The centromeric,
the acrocentric. I think it can be achieved. Won’t be achieved
today, maybe not achieved in the next couple of years, but no other institute will achieve
this. And we know that variation within centromeres, we know that variation within telomeres is important
in terms of human health.

This one, this is a big lofty goal. What makes
us human? This requires an emphasis on primate genomes. We still have not achieved this,
which is something that we set out over 10 years ago, which was to assign every human
lineage-specific genomic change to a specific branch on the evolutionary tree of primates.
Many in the group were most interested in no human specific changes with functional
consequences, including gene innovations, and so we are still discovering new genes
in 2012, 2013 that aren’t in the Human Reference Genome. These are typically often duplicated
genes, but they are also important in terms of human health and human adaptation. So,
in terms of specifics and concrete steps, we would argue that it’s possible with all the resources
that have been generated now to focus on generation of high-quality de novo assemblies of non-human
primate genomes. We suggest, as a straw man, 16 primate genomes, including many which
have already been in working draft stage, and to assemble them at the quality of the Human
Reference Genome. Sixteen is a number that we use based on looking at available resources,
including BAC resources, but also having at least two representatives from every major
phylogenetic branch from the human lineage. This would provide us fundamental information
on mutational processes, speciation, differences in lineage-specific sorting, gene
flow, et cetera. And there was some discussion in our group, and we think it’s an interesting
observation, that many of the recurrent microdeletions that are actually mediating genomic
disorders in the human population are caused by human-specific duplications and complex
regions that have evolved over the last 5 to 10 million years of human evolution. There’s
remarkable genetic variability in those regions which predisposes some individuals and certain
haplotypes to develop, you know, have recurrent microdeletions and others not.

The last point or goal that I’ll mention
was essentially this: to obtain nucleotide level resolution of every conserved functional
element in humans. We are not there yet. We heard some great stories yesterday about the
power of comparative genomics in helping to identify regulatory elements. The
story from David Kingsley of finding that mutation in the regulatory
element for a kit, and how these weren’t detected by ENCODE but were picked up
based on comparative analyses. The data that’s out there right now, which is roughly the 30
mammals, gets us down to about 12 base pair resolution. Simulations suggested that if you push
this to 100 to 200 mammalian genomes sequenced deeply, you’d get down to single base
pair resolution. And I think that’s an easy target that, without any advances in sequencing
technology, could be generated right now. Some people said, “Well, maybe this will be done
by, you know, the 10K diversity project or the 10K genome project or other projects that
are out there.” I don’t think so. It may be, but that’s not their mission. The mission
here should be to sequence genomes and make that data publicly available so everybody can
analyze it as quickly and rapidly as possible. This would allow us to quantify the selective
constraints on each element across mammals and integrate with existing data sets and
ENCODE in both mouse and human, and if advances in computational technologies and advances
in sequencing technologies came along, it wouldn’t, I think, be beyond the pale to think
about doing additional mammalian genomes at high quality like we have done for the
mouse. All right, I’m going to turn this over to Andy.

Male Speaker:
I wish there was a button like that where I could blank all your screens, too. If you
think of the top causes of mortality and morbidity in humans and consider cardiovascular disease,
COPD, stroke, diabetes, the list that Mike Banky [spelled phonetically] gave, they have
two things that are in common. One is that they’re adult onset and the other is that
they’re remarkably sensitive to the environment, so that your risk is a function of your environmental
stresses as well as some attribute of your genome. So, we’re all unique. Some of us
are more unique than others. And we’re unique not only because of our genomes but because
of that trajectory of environmental stresses and environmental exposures that we’ve had
during our lifetime. Now, when you’re faced with trying to infer causality in a situation
that’s so high-dimensional, where the factors are confounded as badly as they are with genotype
and environment in the case of humans, you have a very tough problem ahead of you, and
we’re all familiar with that.

Now there are two major impediments, then,
for studying and understanding causation in the face of adult onset diseases of this sort.
One is the fact that there isn’t really a good, controlled experiment that any of
us can do. There’s no control for that experiment, because it’s observational.
And the other is that we can’t replicate the experiment. We can’t take the same genotype
and put it in a bunch of environments and ask what happens.

So, there’s good news and bad news. The
good news is that model organisms do precisely this. They allow us to put the same genotype
in multiple environments and take apart genotype by environment interaction very carefully.
The bad news, of course, is that when you do the right experiment, take a set of zebrafish
or mice or flies or worms through a set of different environments, the rank order
of the phenotype that you score almost always flips around. That is, genotype by environment
interaction is almost universal.

So, we’re in the situation where we need
to understand, how do genotypes respond to different environments, and the answer is
we need to figure out what’s the best way to work with model organisms. Now, we all
know of examples where model organisms are terrible for modeling specific diseases. There’s
a human disease where the mouse doesn’t even have that gene. But we need to move beyond
that. We need to ask the question now using genomic technologies, what really is the best
model for each specific disease. I think Aviv Regev’s talk was particularly kind of informative
in thinking about how genomic technologies could really incredibly sharpen our ability
to focus on model organisms for specific diseases. If we had catalogues of the sort she described
for mouse genes, for instance, there’s great opportunity for taking this forward.

So, goal four, then, is to leverage the power
of model organisms in functional genomics. Of course, this resonates very well with goal
one where Eric Boerwinkle emphasized the fact that we need to understand
basic biology before we can understand disease. I would argue that model organisms are an
important path to that. And, of course, in goal two Rick Myers
and Mark Gerstein argued quite well that model organisms do have the
ability to allow us to infer function at the adult stage, or whole organismal
stage.

So, some of the points for how we would proceed
to do this are, for instance, applying large scale genomic and other omics technologies
to reference panels. So many of the model organism communities are working on functional
genomics. There are such panels as the Collaborative Cross, the Diversity Outbred, and so forth
in mice, and many others in other model organisms. There is the idea of taking human mutations forward
into model organisms and studying them both at the cellular as well as the organismal
stage. This will scale beautifully now with CRISPR technology. We’ll be able to make
thousands of human mutations and put them in adult mice and put them in different stressful
environments. So, by doing this across different environments, then we’d have this really
good handle on the sort of full degree of genotype by environment interaction in those
particular physiologies that are relevant to these human disease states.

One other idea was that, well, as we study
these other organisms and take the more comparative evolutionary kind of view of the world, we
look at, for instance, the naked mole rat and we find that they don’t have tumors. We
might identify genes that we suspect might be important in that process. What about doing
the reverse experiment and taking some states from these non-model organisms and putting
them into human cells and seeing how they behave? That’s a kind of out-there idea
that I won’t take credit for.

Okay, so that’s the end of the ‘model
organisms’ sermon. Getting back to the grand scope of comparative and evolutionary genomics,
all of the things that Eric told you about couldn’t be done without serious improvements
to the computational infrastructure that we need. So we need to develop informatics infrastructure
to produce, display, and quantify these multiple species genome alignments. An alignment of
species genomes is a central tool for inference of where along
that phylogeny particular changes occurred. When we layer on that functional information
as well, we get terrific insight about the way genes and phenotypes have evolved. So
this requires development of algorithms, software, alignment methods. It requires development
of new browsers. Anybody who’s been involved in any of these projects knows that you have
a constantly shifting coordinate system for the genomes as you discover huge insertions
in one species that weren’t in the others and so forth. We need to devise methods of
analysis of complex chromosomal rearrangements, methods of representing genomes in the face
of those rearrangements, and finally, we need to produce benchmarks, quality control metrics,
and assessments of accuracy of these methods.

So, I’ll summarize just by kind of listing
all of the goals, going through these quickly, reminding you that evolution is the single
most powerful unifying principle in all of biology. The history of biology
is that we have learned an enormous amount from that. Now, I’ll warn us against the
arrogance that we all are somewhat subject to with the power of the tools of genomics,
to think that, well, now that we can do this in human cells we don’t need to think about
anything else anymore. We can just do all the manipulations in human cells. I think
that evolution still has an enormous amount to teach us, and that model organisms have also
marched forward in their technologies for manipulating and perturbing genomes.

We need to develop strategies and technologies
to obtain high-quality de novo reference sequence. This will be applicable throughout all of
biology, not just even the goals that Eric outlined. I mean, Evan, sorry.
We need to target multiple primate genomes to infer, with high confidence, all of the human-specific
attributes. So this is enormously useful in doing comparative biology: seeing
what are uniquely human traits and how we are different from our closest relatives. That’s
also of tremendous intellectual interest, and I think we could bring in the public in
sort of sharing the excitement over these sort of aspects of the science that we do.
The sort of fundamental question of how did we evolve from our most recent ancestors is
one that resonates very deeply with the public. By sequencing multiple mammals (we were
told last night there are only 5,400 mammals that are known, so eventually we might be able
to get there, but starting with the first few hundred using current technologies, and this
is not expensive), we could easily get to the point of being able to identify all the
human-specific conserved elements from those alignments and comparisons.

The fourth point, again, the model organism
thing. I’ll beat that drum one last time. There’s still enormous utility for understanding
many aspects of basic biology, including, particularly, these context-dependent variant
functions where those contexts include anything from diet to drug treatments and so forth.
Those scale beautifully in experiments with model organisms, and so sequencing reference
panels of them would enable those studies to proceed at a much greater pace.

The fifth goal was, again, this development
of software tools for dealing with multiple sequence alignments, and I’ll close with
just emphasizing again that, of all of the things that we talked about, none of them are actually
in the purview of other institutes. The National Institute of General Medical Sciences does
do a lot of evolutionary biology. They have funded some model organism work and so forth,
but the scale of the sequencing, there is an aspect of this problem that is uniquely
NHGRI and we would like to see them do it. So, with that I’ll take questions. Dave?

Male Speaker:
So, I want to point out that this so resonates with the goal that we heard from Heidi: detect
all types of clinically relevant variation in a single genome-scale test. That’s very,
very consistent with the $10,000 genome goal. And I would say that I love these charts that
we’ve been seeing with Moore’s Law and the approaching $1,000 genome, but
there’s a lot of wishful thinking in there. Those aren’t genomes, right? We really need
to get back and do this the way that Evan described beautifully in this so we can really
say that we’re sequencing the whole genome.

Male Speaker:
I mean, I agree. I was really pleased. I was hoping that one other group would come up
with the importance of being able to do a single genome...

Male Speaker:
Yes.

Male Speaker:
...without alignment to the reference, because that is fundamental human genomics. A wise
man once told me that our field is skin-deep in terms of its fundamental algorithm, which
is to understand all the genetic variation comprehensively, and once we do that, that’s
finite. We can assess links to human phenotype. And so this is where we will be, whether NHGRI
leads the charge or not, in five, 10 years from now. We won’t be doing these alignments
anymore, we’ll be doing de novo assembly.

Male Speaker:
And there has been a spectacular improvement starting with $300 million in 2000
to what we can do now in terms of getting up where real high-quality is though.

Male Speaker:
And it’s not as if the private sector isn’t going to play an important role. Companies
like PacBio, Nanopore, Oxford Nanopore, they can continue, but if there was an incentive
or push at some level to drive this even more, I think we could accelerate and get to that
point of a single genome assembly, you know, instead of ten years, five years, as part of
a routine clinical test.

Male Speaker:
Jim.

Jim Evans:
To follow up on those statements, I think this is so important, particularly from
the evolutionary aspect. Since new mutation frequency is much greater on a locus-specific
basis for copy number versus single nucleotide variants, we need, really, to get better structural
information and assays for this. And it ties you right into an institute nobody I’ve
heard talk about right now, and that’s environmental health, NIEHS, and we’ve already got evidence
from Tom Glover’s work that hydroxyurea, the chemical we use
clinically, induces CNV mutations at high rates in both yeast and mammalian cells in
vitro, so studying these mutational methods... I mean, the Ames test doesn’t even test
for copy number. It gets at mostly single nucleotide variants, so how the environment
interacts with our genome with respect to copy number is a total black box.

Male Speaker:
I mean, a related idea to this inference of mutational processes is inferences about differences
in recombinational processes, which are also fundamental to, well, evolution and everything
else about the map. So, yes, I agree.

Jim Evans:
And Richard, to follow up on that, I mean, the fact that Alec Jeffreys
has nicely shown PRDM9 alleles influence genomic disorder rates and that you can change
that in different environments, I think we have to go at that in a big way for mutations.

Male Speaker:
Richard.

Male Speaker:
This seems an opportunity to move the center of gravity of disease models towards primates
from mouse, and whilst you mentioned primates in the other context, I think you didn’t
strongly state that building and exploiting primate models for human disease should be a priority.
Was that discussed?

Male Speaker:
Well, we actually didn’t discuss it too much. I think the impediment there is the
problems of working with primates. There are limits to what we can do. They’re
not totally insurmountable, but that’s a limit.

Male Speaker:
Well, I think you could argue too that there have been technological changes in all the areas
that really change the complexion of that.

Male Speaker:
I was interested that in goal two you didn’t mention archaic humans in this high-confidence
list of human-specific genome attributes. I think that is... I know it’s not science
fiction, obviously...

Male Speaker:
We discussed this specifically, NAME, and this came up in the context of, yes, there’ll
be more archaic hominins that’ll be sequenced. Most of that will be done with short-read
technology, largely because the fragment lengths are so short in these archaic hominins that
it doesn’t really lend itself to a de novo assembly of Neanderthal or Denisovan for that
matter. It’s generally something that we feel is going to happen whether NHGRI invests
or not, so we were looking for those seven characteristics that, you know, they laid
out in the beginning. You know, high throughput, you know, consortia, technology advance...

Male Speaker:
I mean, then that’s a focus on the data generation component of it and not the data
analysis component. I think it would be very foolish in the data analysis component to
ignore the archaic human.

Male Speaker:
Absolutely.

Male Speaker:
Absolutely.

Male Speaker:
Just a kind of push. I think on the model organisms there is this
huge gene-environment effect. It’s very clear that you just have opportunities there,
especially with fixed genotypes, to explore things which are very, very hard to do observationally
in humans, I mean, so I really believe that. A personal plea is that we don’t narrow
ourselves down to just mice and zebrafish as model organisms. I just think that we’ve
got a much bigger repertoire of organisms out there than just saying model organisms
equals, you know, a very small list of species, and there’s a bigger diversity of useful
organisms out there beyond those two.

Male Speaker:
We’ll include medaka. Mark.

Male Speaker:
I just wanted to say that I really agree with Evan’s point about the impact of structural
variation, the importance of having high-quality genomes. I just want to mention that in the
functional breakout group we also did really talk about thinking about the functional impact
of structural variants. I mean, it’s much more complicated and potentially much larger
than single nucleotide variants. We really have to think about ways of thinking about
this impact.

Male Speaker:
I agree. Carlos.

Carlos Bustamante:
Just to add to Ewan’s point on the potential for archaic and ancient DNA, in fact, there’s
a ton of technology development recently that is sort of upending this issue about, you
know, how much can you really get out, right? So, when they do, for example, the single-stranded
library prep it turns out you get many more molecules than the double-stranded and that’s
why you’re able to take these [unintelligible] to high coverage so it’s actually an area
that the U.S. has invested almost no money in, right. All of the development has actually
happened in Europe, and you could imagine that, you know, in fact, there may be some bones
somewhere that have somewhat larger fragments that could be sequenced. So, I wouldn’t
totally rule it out, and nobody knows how far back you could go, right? We don’t have
a Homo erectus sequence yet; that doesn’t mean it can’t be done, but you know, it
just hasn’t been prioritized in terms of what’s been done, and there are basically
two, three labs in Europe that are leading in that area.

Male Speaker:
Yeah.

Female Speaker:
So, just to goal five, to point five, was there discussion on the committee about partnering
with, say, NSF on advancing some of the alignment algorithms and comparative data analysis tools
because they also expend a lot of money in this area? It’d be a good partnership.

Male Speaker:
There wasn’t. That is a very good idea. The NIH/NSF joint developments in quantitative
biology have been very encouraging and that’s a very good suggestion.

Male Speaker:
I mean, I guess my feeling on this is that most of the genome browsers that are being
used, I mean that’s just one, obviously, way to do this, but have been driven largely
by the genomics community funded by Wellcome and NIH. And I think, I mean, we have a lot
of experience in this and there’s been a lot of discussion, I’ve been to several
meetings about how would you, if you had 50 human references right now at high-quality,
how would you display them so people could access the information, optimize their mapping
so they could find, you know, the right genome and still be able to communicate these ideas.
It’s not trivial at all. I mean, for sure if there is value for adding NSF in partnership,
I mean, we should take advantage of everything we got. But I think my feeling on this is
that we as a community have taken the leadership role in this and we should continue to push
on this because this is, again, not an easy or solved problem. Male Speaker:
Eric. Eric Green:
So, I wonder if it’s okay to make a comment across the four sessions so far because we’re,
I assume, coming to the end. So, it’s a spectacular range of projects and I want to
share the general enthusiasm about the specific things that have been proposed. I think many of
these things in this particular breakout are incredibly important, but I was trying to
think about what’s missing in what we’ve been talking about this morning? And I think
it’s probably there, but it’s maybe hidden a little bit when we’re spending a lot of
attention on structure. What are the nucleotides? What are the variations in them? How do they
correlate with disease? And then we mentioned disease relevant functions, and we talk a
lot about mutating the gene and seeing in an assay what effect it has. What I’m wondering is missing is the connection
with cellular circuitry. That being able to accomplish our goal of interpreting variation
in the context of disease, we may be able to interpret variation in the context of the
protein that it affects, but to truly interpret it in the context of disease, there’s a
set of NHGRI-ish kind of activities of systematically dissecting cellular circuitry when this enhancer
gets affected, when this protein gets affected, what are the consequences? When 108 loci in schizophrenia
get identified or 60 genes in heart disease get identified, how do we recognize what effect
that is having on the cell? And, so, I would not like NHGRI to have to
pay for all of it, but I do think that there is a set of infrastructure, it’s a little
related to the LINCS project, it’s related to what Aviv was talking about yesterday,
it’s related to going from individual enhancers to whole circuits, and somewhere NHGRI ought
to be the intellectual leader of that, and it ought to be paid for maybe by the Common Fund
or others, but we’re not going to be able to interpret disease without it. We’re going
to get, based on everything that’s been laid out here, a great description of the
structural problems, I agree, down to completeness. We’re going to get great correlations. We’re
going to get protein structure responses, but there is a piece and we haven’t defined
it here that we better define, and it’s a set of databases about circuitry, circuitry
responses, cells, and I was unclear whether it was in bounds or out of bounds but as I
think about what we’re doing, if we don’t make sure that that piece gets done it’s
going to be really hard to interpret these subtle mutations even with all the assays
and all the other things we’re doing. So, I don’t mean to destabilize anything here
by arguing and, I think, we should go forward with these things, but somewhere we better
also launch a process that’s doing that. Male Speaker:
But I think to make sense of those, Eric, you really have to start by making sure that
you’ve got the finite aspect of our universe, which is our genome sequence totally understood,
because all of those things really make sense in the context of the variation in which those
mutations are found. Eric Green:
Look, I’m agreeing that we want to get that sequence, but where I’m disagreeing is you
first must do that because to meet our goal which is understand disease, we must do that
and then we’re going to have to interpret that in terms of physiology, and all of this
structural stuff — very important — isn’t going to get us the physiology that we owe
for disease so it’s not at the expense of it, we just also in addition, not separately,
not competing better be doing it, that’s all. Male Speaker:
It reminds me of something 10 years ago when we were actually first analyzing Venter’s
genome and comparing it to the human reference. And Venter’s genome, if you remember
(I’m sure you do), was significantly shorter. And where it was shorter were all
the genes and all the segmental duplications which were not in Venter’s genome. So,
we could not as a community, have begun to understand break points of genomic disorders,
copy number variation, without actually the investment of NHGRI in terms of building a
better reference because what you can’t see, you can’t assay. So, I think — Eric Green:
You’re defending the need to get complete structure. I’m totally agreeing; my comment
is independent of your comment. Let me totally endorse all of what you’re saying and then
add we still are not going to be able to reach our goal of interpreting disease without additional
things. All of it is great, but we have an obligation to go all the way and I was just
trying to figure out what piece feels like it’s not been discussed in this meeting, that’s
all. Ewan Birney:
[unintelligible] Rick Myers, and — Eric Green:
It does, and I was going back through the slides and, again, a lot of the emphasis was
per variant. Understanding the effect of this variant and my concern, Ewan, is exactly that
that when you look at the variant-centered point of view you don’t, for example, have
a circuitry center. You don’t see what happens across large numbers of interacting things.
For example, take cellular circuitry and break it down into a catalogue of 2,337 processes
so that we have a finite list of those processes and we understand the context in which that
variant functions. So, a lot of what we have is bottom-up construction from individual
variants interpreting it for patients in the clinic. Again, incredibly important; I’m
not arguing against it. I’m just saying that the bottom-up inference is going to miss
things if we also don’t have a sort of top-down completeness of a wiring diagram and many
of the things, even in that presentation number two, didn’t really get at that higher order
picture. So, again, not that we shouldn’t do all those things, I’m just concerned. Male Speaker:
We do have some key players in systems biology — Eric Green:
I agree, in this room. Male Speaker:
— so let’s start with Manolis Kellis and then Mark. Male Speaker:
So, first of all, I just want to briefly second Eric’s point and basically say we had this
fifth panel that was never created and I think systems biology could have been one of them.
I think this paradigm of learning what’s common across all of the variants that are
associated with the disease and then learning common properties of these variants like what
tissue are they in, what type of enhancer are they in, what motifs are nearby, and then
using that knowledge and applying it back to individual variants is something that’s
emerging a lot in our community and that’s something that has been a paradigm of genomics, the
fact that you have the whole genome allows you to learn global properties and then go
back to the individual regions armed with those properties, and interpret them better
than what you could do in isolation. I think that paradigm is pervasive, and I think
systems biology approaches and regulatory genomics approaches should, you know, could
be one of these sort of fifth-panel-type recommendations. Going back to the — and this is actually
related to the comment that I wanted to make on the comparative genomics — the heroic
effort that we saw for the, you know, blonde hair variant where that particular nucleotide
could only be interpreted through, not just sequence level conservation, but understanding
exactly what are the regulatory regions that are active in these other genomes and what
are the motifs and how they’ve moved and how they’ve changed, I think it’s something
that should be routine. It’s something that rests upon, sort of, comparative genomics
to provide it as a resource to the whole community just like ENCODE has provided a set of regulatory
annotations at varying degrees of resolution and varying degrees of sophistication, I think
comparative genomics should have a mandate to provide such a list so that the next time
we find such a motif you don’t have to go through, you know, years of experimentation,
that you have the catalogue of exactly how all these motifs have changed. And the recent realization, I mean, from
mouse ENCODE, which is not yet published, that there’s a huge amount of regulatory
conservation between human and mouse which is actually not reflected in the nucleotide
level conservation means that it’s imperative for us to develop better methods for understanding
regulatory evolution and regulatory conservation because there’s bound to be much more conservation
and that’s what we’re starting to realize with mouse ENCODE. Much more regulatory conservation
than sequence models allow us to infer and if you had a better way of detecting that,
and that goes back to the proposal with NSF, then I think we would provide a great resource
for understanding disease. Male Speaker:
You’re straying a bit from Eric’s primary point which I completely agree with, which
is there was an awful lot of language in this meeting that had the view that the effect
of a SNP is sort of this unitary thing that you can study outside of the context of the
rest of the genome as well as different environments. The genotype by environment thing was one
way that we’re sort of stepping away from that view but the idea that that genetic variant
is embedded in other genetic variants — there’s gene by gene and other sorts of things — and
the way we word this now is to think of human disease as perturbations of these networks
of genes that are involved in them and metabolic disorders, particularly, there’s a lot of
literature there. I do agree completely. Male Speaker:
Andy, I’m sorry and everyone I’m sorry to stop the discussion but we are well over
time at this point and we do need to break. Thanks. [end of transcript]
