BIG DATA: How biological data science can improve our health, foods & energy – June 18, 2014


Hello, can everyone hear? Okay, great. So thanks very much for the introduction, and
thanks very much to all of you for coming out tonight. I'm a computer scientist by
training, as you just heard, and in fact in my first job out of college I was working at
a security company doing code-breaking type work: we were trying to invent, and sometimes
break, the encryption systems that keep your data safe on the internet as you're doing
online shopping, online banking, these sorts of things. For the last 12 years, though,
I've shifted from computer code breaking to genetic code breaking: studying the genomes
of plants and animals and microbes and people, trying to understand some of the genetic
causes of diseases, trying to understand the relationships between genomes and how crops
grow, trying to understand how we can develop better biofuels, questions of that sort.

So tonight I hope to tell you a little bit about my experiences: some of the fun things
we've done along the way, some of the amazing discoveries we've made, and some of the
technologies and systems that go into doing so, and to talk about their capabilities
today. One thing I'm going to touch on at the end is looking into the future. We're at a
pretty amazing point today, but the technologies are advancing so quickly that I think
we're just getting started on where the field is moving. So tonight the lecture is going
to start
with the following observation, which is that your genome, along with your environment
and your experiences, really shapes who you are. What I mean by this is in many, many
different dimensions. In the most immediate way, it shapes who you are in terms of your
physical characteristics: what you look like, how tall you are, your skin color, your eye
color, your hair color. It's no secret why you look like your parents: you've inherited
their genetic material. Beyond your physical form, your genome shapes who you are in many
different ways. It shapes your personality to some degree, shapes your intelligence to
some degree, shapes how long you're going to live; it's also going to shape your response
to drug treatments and your susceptibility to disease. In all sorts of different ways,
your genome shapes who you are.

So in some ways the genome is really the final word, and this is true for things like eye
color, which is very much determined by your genome. Some things are not at all genetic;
they're entirely social, entirely up to your experiences and your environment. You can
take a Chinese kid and have them grow up in America, and they learn to speak English no
problem; and vice versa, I can learn to speak Chinese if I put enough work into it. So
social things tend to be very environmental, physical things tend to be genetic, and
everything else is some interplay between your genome and your experiences and your
environment. And then, just to go a little
bit deeper, more specifically: your body is composed of trillions and trillions of cells,
and if you're able to zoom in really close (here's a kind of blow-up of one of these
cells), we would see that inside there's this very special, big molecule called DNA, the
structure of which was discovered by Jim Watson, here in the front row, and his friend
and collaborator Francis Crick. The way I like to think about DNA is that it's the sheet
of paper on which genetic messages are written. Yes, there is a double helix structure to
it; yes, it's made out of molecules; but really the important thing about DNA is the
particular sequence of what are called nucleotides that are spelled out along this very,
very long chain. So here we have TGACGT, on and on and on. Inside each of your cells you
have a more or less exact copy of that sequence, which is called your genome. For humans
it's about three billion base pairs; for little tiny microbes it's a few million.
Interestingly, though, we do not have the biggest genome on the planet. The loblolly pine
tree genome is about eight times larger than ours, and there are certain fish species
whose genome is ten or maybe even a hundred times larger than ours, just filled with all
sorts of complicated genetic material. Now this genome, you can
think of it, for the computer people in the room, as being like your hard drive: it's
more or less fixed. But of course your cell is very dynamic, so what's really active
inside your cells are genes, and then other features that control the expression of those
genes. You can think of the sequence as something like a recipe: what is the recipe for
growing Mike? What is the recipe for growing all the plants and animals around us? All
the different cell types, everything from your brain down to your blood down to your
bones, are spelled out by this genetic code.

Because it is so incredibly important to us, to our environment, to all living things,
it's been a massive undertaking for four decades to understand the structure and develop
technologies to sequence it. Again, what we're really interested in is the particular
sequence of base pairs along this molecule. The first major milestone towards sequencing
DNA was in 1977, by Fred Sanger, who developed a new technique to sequence DNA molecules.
This was the insight that led to his being awarded the Nobel Prize, actually his second
Nobel Prize, in 1980. Very smart guy. The way that it worked was incredibly complicated
and also dangerous: you had to prepare cells and then treat them with radioactive
materials so that, on the right sort of photographic film, you could expose the sequence
of nucleotides that was there. It worked. With this technology we were able to sequence
the genomes of teeny little viruses called bacteriophages, and that led to some initial
breakthroughs in the late 70s into the 80s. After that time, though, the technologies for
sequencing genomes
became much more automated, much more precise, and much higher throughput. This is a
picture inside the sequencing facility at The Institute for Genomic Research, TIGR, where
I used to work, right around the late nineties or so. Inside these rooms you had more or
less a factory, where DNA samples would go into a special room, be prepared with various
chemical treatments, and be loaded into these very complicated scientific instruments,
and then out would come the genetic sequence that was present in those different species.

Using these technologies, in the 90s we saw the first major milestones in what is now
called genomics: being able to read off the sequences of different genomes. In 1995, and
this was a really big headline in the journal Science, we were able to sequence for the
first time the complete genome of a free-living organism, a little tiny microbe called
Haemophilus influenzae. Throughout the late 90s we saw the rise of using genomics and DNA
sequencing to sequence larger and larger organisms: a worm called C. elegans, a little
plant called Arabidopsis, fruit flies. We saw the complexity of the genomes increasing as
the technologies matured. And then finally, in 2001, we reached what is arguably one of
the most important milestones ever in science, and certainly one of the most important
milestones in genomics: for the first time we were able to sequence the more or less
complete human genome. This made enormous headlines, with enormous press conferences
between the President here and the Prime Minister; it was really an international effort
to sequence a human genome.

Since then, I'm sure you've seen various articles in the popular press about how it's
being studied, and I guess one major takeaway from tonight is that this milestone in 2001
was just the beginning. We had the genome of one person, and we had very limited capacity
to map it out and really understand what that sequence meant. It was more or less just
this long string of A, C, G, and T that we had limited power to understand. So this is
really just the beginning of the story tonight, and I really hope to focus on everything
that's happened since then. So one major
advance since then has been that we've seen these technologies applied to larger and
larger collections of species across what I have here, my little cartoon version of the
Tree of Life. Let me give you a little snapshot of some of the projects that are going on
now. This is the logo from the 1000 Genomes Project, where researchers are studying the
genomes of a thousand people from around the world to look at all the variants and genetic
changes that occur across healthy, normal people, whereas that first genome was from just
a single person. This is the logo for The Cancer Genome Atlas project, where they selected
hundreds and hundreds of tumors of different types of cancer, along with healthy blood
tissue, to try to understand what mutations are driving cancer and how we can develop
treatments that will help cure those diseases. This is the logo from the Genome 10K
project. I'm teasing a little bit, but the spirit of this project was to go to the San
Diego Zoo and basically say, "Yup, we're going to sequence all the different species that
are here." I'm teasing a little bit, but the reason you might be interested in doing that
is that by sequencing many different genomes across the Tree of Life, we're actually
developing ways to look backwards in time. We can study how, across this tree, different
species have diverged from each other and how evolution has progressed. The spirit of
this is that if we can identify sequences that are present not only in people but in
primates and other mammals, and maybe going back to even more primitive organisms, we get
a really strong clue that this sequence must be doing something really important, because
it's been conserved so well over the eons. This is the logo from the
ENCODE project. In addition to sequencing genomes, we can use high-throughput sequencing
technologies to study all the active parts of cells: how genes are expressed, how proteins
form, the rates at which those things happen, and all sorts of epigenetic modifications,
looking beyond the nucleotides to the different chemical modifications along the way. And
finally, I have here the logo from the Human Microbiome Project, where they've been
sampling all the different microbes that live on top of us and inside of us to really
understand all the microbial diversity that's present there. I think this is a really
interesting line of research that's only recently been made possible through the rise of
these technologies, and it's starting to show some really interesting results. Interesting
result number one is that if you look inside a typical person, your body is actually
mostly microbes: the microbial cells outnumber human cells by something like ten to one.
That's because they're teeny tiny compared to the big human cells. The other interesting
result from this work is that the microbiome has been demonstrated to play a very big role
in your metabolic processing. In particular, they've taken skinny mice and obese mice and
swapped their stomach contents, and what you find is that the obese mice become skinny and
the skinny mice become obese, without any other changes to diet or exercise or anything.
It's really the microbes and their ability to process foods that is driving the obesity,
in those mice at least, and that suggests it could also play a role in human development.
So all these projects, literally across the
tree of life, from the smallest microorganisms to the biggest trees to everything in
between, have been using genomics and these technologies to study all the DNA that is
present there. From all these projects we've come to realize that there are lots and lots
of questions that can be asked of these data. They range from very specific ones (what is
the genome sequence? how does my genome compare to your genome, or to other genomes?) up
to what all the activities inside the cells are: how are the genes being expressed, and
how do those genes change during development or in different disease states? In addition
to genes, we can look at other chemical modifications like methylation, or at chromatin,
which describes how the genome folds up inside of cells, and so forth. In addition to
human genomes, we can look at viruses and microbes and the role of infection: how they
attack us, how our body responds. We can look at mutations as they relate to disease,
especially genetic diseases like cancer. It's certainly the case that understanding the
genome lends itself to understanding the outcomes of these diseases, and to understanding
what sort of treatments we should provide in response. Plus hundreds and hundreds more; a
virtually limitless set of questions can be asked of all these data.

The key point, though, is that yes, we do have high-throughput sequencing devices, but
these instruments provide the data and none of the answers to any of these questions.
That's my important observation number one. So that leads to: if these instruments aren't
going to answer any of these questions, what will? Because the data are so big, we can't
possibly study them by hand. That's why we need software and computer systems that can go
in and look for all the patterns, try to figure out what the important mutations are, look
at the evolutionary processes, and so forth. In addition (we were teasing that maybe IBM
is getting interested in chocolate, but I'm a little bit nervous about these corporate
things), if we need software and computer systems, who's going to create them? These are
the two main questions that I hope to answer tonight. So this question
about who will create them: I would argue it's not going to be any one person; it's going
to be teams of people. In genomics, and actually in many different fields of study, we're
seeing the rise of what's sometimes called data science, or the data scientist. The idea
is, here's my picture of the tsunami of data coming in. It's not going to be one person
who's able to fight this off; it's going to require a lot of teamwork. Some of this is
going to require really deep computing skills and really deep math and statistics skills,
and also domain expertise: for studying biological data, we need someone who can tell us,
does this make sense at the end of the day? All these different skills get integrated
together into this new, very interdisciplinary science called data science. Here at Cold
Spring Harbor we've assembled in our Quantitative Biology Center a set of really
outstanding researchers, myself, I like to think, included. Here I've put up their
pictures, and in the lower left corner of each picture I put a little logo to indicate
whether their background is in physics, like we have here, in computer science (this is a
little computer chip), or in mathematics. The point I want to make is that yes, we're all
studying biology, but we all come to biology from these different backgrounds, bringing
different skills and different expertise to the table to understand all these different
data sets.

So that's the who. The next question is how: how are we going to study these data, how are
we going to answer all these questions? That's where the focus of my research is, and
that's the focus of a lot of the research here at Cold Spring Harbor.
And the answer is nuanced; the answer is complicated. Here I've tried to draw you a map of
how we're going to answer all these different questions. We're not going to go to
Microsoft and download the "Microsoft genome solution" or anything like that. What we're
going to need is an entire stack of technologies that can take us from DNA sequencing
information, the raw information, and then refine it and refine it as we work our way up
the stack, until finally, at the top of the pyramid, we can make statements like "these
are the mutations that are responsible for cancer" or "these are the treatments that we
should give you in response to those mutations." Over the next half hour or so we're going
to take a little tour through this technology stack and talk about all the different
advances and systems that come into play to answer these questions. We'll start at the
bottom and work our way up. If you're a techno geek like me, you'll get very excited about
this; if you're more interested in the applications, just hang on for a minute, we'll get
to the top of the pyramid very soon.

At the very bottom, it all starts with the sequencers, the instruments that can read off
the sequences of DNA. This is a picture of an instrument produced by a company called
Illumina, based out of San Diego. This instrument is about yea big, costs three quarters
of a million dollars, and you can think of it as an incredibly sophisticated, very fancy
digital camera. Inside this refrigerator-sized device is the technology to amplify and
read off DNA. This is done by attaching teeny tiny fluorescent probes to the different
molecules, shining a laser on them, and letting them glow different colors representing
the different nucleotides. This is a very zoomed-in picture of what the raw data looks
like inside one of these instruments: different colored dots corresponding to A, C, G, and
T. The special capability here is that we don't sequence one molecule at a time; we
sequence hundreds of millions to billions of molecules at once, just by observing the
sequence of colors. So red, red, red may correspond to AAA; red, blue, green would be ACG;
something like that. We can read off all these different molecules at once and collect
huge volumes of data.
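
The color-reading idea described here can be sketched in a few lines of Python. This is only a toy illustration, not how a real instrument's software works: the red, blue, and green assignments follow the example in the talk, while the color for T and the names `COLOR_TO_BASE`, `call_bases`, and `call_all` are assumptions introduced for the sketch.

```python
# Hypothetical color-to-base table. Red, blue, and green follow the talk's
# example; the color for T is an assumption made only to complete the table.
COLOR_TO_BASE = {"red": "A", "blue": "C", "green": "G", "yellow": "T"}


def call_bases(colors):
    """Translate the colors observed at one spot, cycle by cycle, into DNA."""
    return "".join(COLOR_TO_BASE[color] for color in colors)


def call_all(spots):
    """Decode every spot in the image; each spot stands for one molecule."""
    return [call_bases(colors) for colors in spots]


# Two spots observed over three imaging cycles.
spots = [
    ["red", "red", "red"],
    ["red", "blue", "green"],
]
print(call_all(spots))  # ['AAA', 'ACG']
```

The point of the sketch is the shape of the computation: every spot is decoded independently, which is why hundreds of millions of molecules can be read in the same run.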
This technology and others have rapidly matured over the last 10 or 12 years. This is a
chart showing the cost of sequencing one human genome. Back in the early 2000s, the cost
to sequence one human genome was around a hundred million dollars. This was a huge amount
of money. If you go back and look at that first Human Genome Project, some people put the
estimate closer to a billion dollars just to sequence that first one; the investment was
on par with sending a person to the moon, using the technology that we had at the time.
But if you notice here, the costs are dropping off very quickly. I want to highlight that
the axis here is in what's called log scale, where every step along the way is a factor of
10: 100 million, 10 million, 1 million, a hundred thousand. Today the costs hover right
around a thousand dollars. So over the last 12 years we've seen this incredible decrease
in the cost of sequencing genomes, and as a consequence it's fundamentally changing the
types of questions that can be asked. Speaking of questions, we have one up front here.
Yeah? (Audience Member: Any ideas what led to the drop-off of cost?)

Yeah, so this spot in particular was when we saw the first application of Illumina
sequencing. The way I like to think about it is that up until that time sequencing was a
very analog process, if you will, making low-throughput measurements. Here is when we saw
this Illumina technology first come on board, and at the time it was like a transition
from analog to digital, where we could just do it in great numbers. When the technology
first came on board it was not optimized at all, so they had a lot of room to accelerate,
and over the last several years they've been incredibly successful at refining and
refining that basic technology. Great question.
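
To make the log-scale chart concrete, here is a small Python sketch using only the round figures quoted in the talk (about a hundred million dollars in the early 2000s, about a thousand dollars today, roughly 12 years apart). The variable names are illustrative, and the yearly figure is a smoothed average: as the answer above explains, the real drop was much steeper once the Illumina technology arrived.

```python
import math

cost_then = 100_000_000  # dollars per genome, early 2000s (figure from the talk)
cost_now = 1_000         # dollars per genome, 2014 (figure from the talk)
years = 12               # time span quoted in the talk

fold_drop = cost_then / cost_now
steps_on_log_axis = math.log10(fold_drop)  # number of factor-of-10 steps
yearly_factor = fold_drop ** (1 / years)   # average fold drop per year

print(f"total drop: {fold_drop:,.0f}x")
print(f"factor-of-10 steps on the chart: {steps_on_log_axis:.0f}")
print(f"average drop per year: {yearly_factor:.2f}x")
```

Five steps on a log axis look modest on the slide, but they correspond to a hundred-thousand-fold reduction in cost.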
Today the costs hover around a thousand dollars, and if you zoom in and look at what
enabled us to hit that price point, it was the rise of a new instrument from Illumina
called the HiSeq X Ten. This is their branding of the latest generation of their
technology. It's an expensive instrument; again, it's the size of a refrigerator, and it
costs about 10 million dollars upfront to purchase, but once you have one you can sequence
a genome for about a thousand dollars. It has the capacity to sequence something like 18
thousand genomes per year, every single year. It is expensive, it is ten million dollars,
but to date nine institutions have announced that they have purchased one. They're really
excited to apply these technologies to sequence vast numbers of genomes. Many of these
institutions are research institutions not unlike Cold Spring Harbor. In New York City,
our friends at the New York Genome Center have just launched; they have their main office
in SoHo, a really beautiful building and environment, where, like Cold Spring Harbor,
they're sequencing genomes, trying to understand in particular the causes of human
diseases. Similar institutions include the Broad, which is one of the research arms of
Harvard and MIT; the Sanger Institute is another major sequencing and research center, in
England. In addition to research centers, we also have a number of commercial companies
interested in using these technologies. So we have Human Longevity (excuse me, thank you
guys, got it for me, I'm a little tongue-tied), a new endeavor launched by Craig Venter,
looking to sequence tens of thousands of genomes a year to really understand why it is
that some people can live so much longer than other people. So there are lots of
applications in pharma, lots of applications in research, lots of applications to serve
everyday people, and these instruments are really widely used.

So those are where the X Ten instruments are, but if you look at where all of the
high-throughput DNA sequencing instruments in the world are, it's been nicely summarized
on this map, available at this website. It's all the places that you'd expect to have lots
of capacity. Here on Long Island, between us and the New York Genome Center and the Broad,
there are hundreds and hundreds of instruments. On this campus, actually, just down the
road from here at our sequencing center, we have about 20 of these high-throughput
instruments working every single day, producing more and more sequencing data. In addition
to the East Coast, the West Coast has a huge presence, and Europe has a huge presence.
Actually, though, the number one sequencing center in terms of the most capacity at a
single site is in Shenzhen, China, over here, at an institution called BGI. The Chinese
government has really made genomics and genomic medicine one of its premier applications,
and is investing huge amounts of money into the technologies and the people to be able to
keep up. Around the world, if you take all these
instruments, of which there are about 2,000 or so, and you multiply them by their
capacity, how much each can do every day, the number is something like 25 petabases per
year. I'll explain what that means in a second, because I'm sure that's a pretty
unfamiliar unit. The punch line is: whoa, this is very big data that we're talking about.
It's hard to get an exact number on this, but as far as I can tell something like 50,000
whole human genomes have been sequenced, and in addition to that something like a hundred
thousand of what are called exome capture experiments have taken place, where we've gone
in and sequenced just the gene regions in different people; it's a way to keep costs down.

Let me try to explain what 25 petabytes is, because I'm sure this is going to be
unfamiliar. In computer science jargon we're interested in multiples of a thousand. One
thing that might be familiar to you is a kilobyte of information; this is about the size
of, I don't know, one email message. If we go up by a factor of a thousand, this is a
megabyte; this would be like one of your favorite songs encoded as an MP3 file. One
gigabyte is about how much memory it requires to store one movie. If you scale up by
another factor of a thousand, that's where you hit the terabyte range; now we're talking
about something like a thousand DVDs collected together. And finally, if we go another
factor of a thousand, we're up to a petabyte. That's how much raw information there is.
These numbers, though, are still quite abstract.
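
The ladder of units just described, where each step is a factor of a thousand, can be written out as a short Python sketch. The helper name `size_in_bytes` is made up for illustration, and the one-gigabyte "movie" is the talk's rough analogy rather than a precise file size.

```python
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte", "petabyte"]


def size_in_bytes(unit):
    """Each step up the ladder multiplies by 1,000 (decimal convention)."""
    return 1000 ** UNITS.index(unit)


for unit in UNITS[1:]:
    print(f"1 {unit:<9} = {size_in_bytes(unit):>25,} bytes")

# 25 petabytes expressed in one-gigabyte "movies" (the talk's rough analogy).
movies = 25 * size_in_bytes("petabyte") // size_in_bytes("gigabyte")
print(f"25 petabytes is roughly {movies:,} one-gigabyte movies")
```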
So let me try to put this into other terms. Suppose we took a petabyte of data and wrote
it out to regular DVDs; they're this big, and teeny tiny, 1.2 millimeters thick. To write
out one petabyte of data, which would be enough capacity for about ten thousand genomes,
the stack would be about 200,000 DVDs tall. If you just take all those discs and stack
them up, I think it doesn't quite make it to the tippy top of the Empire State Building,
but that's the right picture to have in your mind of how tall the stack would be. Now of
course we never actually write it out to DVDs; you'd spend all day swapping them in and
out, it would take forever, you couldn't possibly do that. What we do instead is buy
stacks and stacks of hard drives. Actually, just up the hill is where the data center is
here at Cold Spring Harbor; we have capacity for about 5 petabytes of storage on campus,
and it looks very much like this picture, just hard drive after hard drive after hard
drive, and electronically we can move the data between those hard drives as we need to
access them.

So there's a lot of data available. The incredible thing, though, is that we're just at
the very beginning. Today, in 2014, there's capacity for something like 25 petabytes a
year, and it's growing at this incredible rate of about 3x every single year. If you look
forward just a few years, it's going to jump up by a factor of three each year, so 75,
then times 3, times 3, times 3, and suddenly by about the year 2018 we'll have capacity
for what's called an exabyte of data. This is a thousand times larger than what we've seen
before.
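
The projection just described is simple compounding, and a minimal Python sketch makes the arithmetic explicit. It assumes the talk's round numbers (25 petabytes per year in 2014, tripling every year) hold exactly, which is a simplification; `project_capacity` is a hypothetical helper name.

```python
def project_capacity(start_year, start_pb, growth, end_year):
    """Return {year: capacity in petabytes} under constant yearly growth."""
    return {
        year: start_pb * growth ** (year - start_year)
        for year in range(start_year, end_year + 1)
    }


projection = project_capacity(2014, 25, 3, 2018)
for year, petabytes in projection.items():
    print(f"{year}: {petabytes:>6,} PB")
```

Under these assumptions the 2018 figure comes out at 2,025 petabytes, or about two exabytes, since an exabyte is a thousand petabytes. Passing a later `end_year` extends the same curve further out.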
And then, just for fun, you can extend this out for another five years, and suddenly by
2024 we've gone to this next crazy unit: a zettabyte of information should be available,
if these technologies can continue to mature at the rate that they have for the last ten
years or so. Wow. So this is a weird, huge number; let me try to explain what it is. Here's
the number, with many, many zeros. Let me put this into more concrete terms to try to make
it more understandable. How much is a zettabyte? Now, instead of ten thousand genomes, we
have capacity for something like 10 billion whole human genomes. The stack of DVDs, at
something like 200 billion discs, is now going to go about half the distance to the moon.
Of course we're never actually going to write that many DVDs. So genomics is arguably
already among the bigger data sciences, and all the projections show that over the next
ten years, if we're not going to be the absolute biggest, we're going to be in the
running, neck and neck with other data sciences. This is on the same order of magnitude as
radio astronomy, where they have these enormous radio telescopes listening to the stars.
Interestingly, this is also on the same order as everyone's favorite video channel,
YouTube, which currently has around 100 petabytes of data uploaded; every minute of every
day an hour of video gets uploaded, something like that, so it's growing at an incredible
rate as well. So this is kind of a funny number, right? Today this exceeds the population
of the earth.
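
The back-of-the-envelope figures in this part of the talk (genomes per petabyte, the height of the DVD stack) can be checked with a short Python sketch. The 1.2 mm disc thickness and the ten thousand genomes per petabyte come from the talk; the 4.7 GB single-layer DVD capacity and the average Earth-Moon distance are assumptions added here, and `dvd_stack` is a hypothetical helper name.

```python
DVD_BYTES = 4.7e9              # assumption: single-layer DVD capacity
DVD_THICKNESS_M = 1.2e-3       # 1.2 mm, as quoted in the talk
MOON_DISTANCE_M = 3.844e8      # assumption: average Earth-Moon distance
PETABYTE = 1e15
ZETTABYTE = 1e21
GENOMES_PER_PETABYTE = 10_000  # the talk's figure


def dvd_stack(total_bytes):
    """Return (number of discs, stack height in meters) for the given data."""
    discs = total_bytes / DVD_BYTES
    return discs, discs * DVD_THICKNESS_M


# Storage implied per genome by the talk's "ten thousand genomes per petabyte".
bytes_per_genome = PETABYTE / GENOMES_PER_PETABYTE

for name, size in [("petabyte", PETABYTE), ("zettabyte", ZETTABYTE)]:
    discs, height_m = dvd_stack(size)
    genomes = size / bytes_per_genome
    print(
        f"1 {name}: {genomes:,.0f} genomes, {discs:,.0f} DVDs, "
        f"stack {height_m:,.0f} m "
        f"({height_m / MOON_DISTANCE_M:.1%} of the way to the moon)"
    )
```

With these assumptions, a petabyte is a stack of roughly 213,000 discs about 255 meters tall, and a zettabyte reaches roughly two-thirds of the way to the moon, the same order of magnitude as the "about half" quoted in the talk.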
so why is there so many genomes and that’s because we’re coming to
understand that in certain diseases especially cancer we’re starting to
understand that there’s not just one genome but there’s many many genomes
that play there so it could be you know one genome from a 10 billion people or
will take 10 samples from a billion people or a thousand samples from 10
million people so doesn’t necessarily have to be 10 billion separate people so
if this is a map of where the sequencing centers are today if we sort of look forward
10 years in the future here’s my prediction projection of what it’ll be
so the first part of this is that the the sort of established centers that are
here today so this is major research institutions: hospitals, pharma,
biotechnology centers they’re probably not going to go away right they’ve been
successful this far I’m sort of assuming that they’ll continue to be successful
so we’ll see that and we’ll see that continue the new the really new thing is
what I what I see over the next 10 years is going to be something like this or
suddenly we’re going to be able to deploy sequencing instruments you know
literally on every continent all around the world in this really a distributed
way and the evidence I have for this is twofold so very very recently just in
the last couple weeks there’s been a new sequencing instrument that’s been made
available this is called this is called a minION from this company Oxford
nanopore it’s you know it’s literally that big, it fits in the palm of your
hands cost today about a thousand dollars to purchase one of these so
suddenly you know devices that used to be you know this big used to cost
hundreds of thousands of dollars are starting to shrink down such that it
will fit you know so they’re really no bigger than say an iPhone
and in the same way that we all have digital cameras now that we all have our
phones with us it’s easy to imagine a not too distant future where maybe
we would carry around something like this the other
evidence I have for this is already in major transportation centers so here’s
the picture inside the subway system in Washington DC Homeland Security is
starting to deploy these devices that are constantly sniffing the air so if
you remember 10 or 12 years ago in DC there were these anthrax attacks where
people were sending anthrax through the mail so in response to this
Homeland Security has deployed these instruments that are constantly
out there trying to monitor you know if any of these biological
agents are ever sort of out there again we want to know about them as early as
possible I’ve seen these with my own eyes in Penn Station if you kind of
know which corner to look at you can see them tucked away there I also
heard that last year at the super bowl there was a false positive hit and
everyone was kind of panicking you know what can we do about this so there’s
certainly a lot of questions that are going to come as we get these new data
so if you’re not convinced by this here’s more evidence I think this is going to
become more and more widespread so this is a picture at a high school just down
the road where a bunch of high school kids were holding on to their MinIONs
because they’re starting to do sequencing this was done in connection
with our DNA Learning Center some of the members over there are
bringing these instruments into the classroom and this is a picture
of my high school student I didn’t realize he was going to be in the
audience tonight he should be home studying but he’s been developing some
applications that run on your iPhone that can look to see if there are any sort
of viruses that are present and if so you know what they are
and what mutations they have so these technologies are shrinking down
getting into the hands of kids so it’s only a matter of time before it’s in
everyone’s pocket so I think that’s fantastic
so it’s clear where the field is moving we should expect massive growth
in sequencing over the next 10 years so you know we’re in the petabyte
range today exabyte is almost certain and then zettabyte is maybe just
around the corner the major data producers are going to be concentrated
in the sort of hospitals, universities also agricultural companies some of
these big companies like Monsanto have a sequencing capacity that exceeds Cold
Spring Harbor so I’ve talked to people there and
they say you know we sell genomes for a living so of course they’re interested
to see you know what it is that they’re selling how can they optimize it
how can they make it better in addition to sort of genomic information I
expect there to be a huge explosion in other sort of medical records other sort
of personal digital data will be collected on your behavior, on
your heart rate, on your respiration all these sort of signals are going to be
integrated into these major centers to learn more about us to study diseases
especially in addition to that though I also expect there to be widely
distributed mobile sensors so those thumb sized devices can be deployed to
schools, offices, sporting events transportation centers, farms, food
distribution centers last year there were a lot of people that were getting sick
and it turns out it’s because the lettuce had been infected by E. coli so we of
course want to be able to go in and test for that as soon as possible the
idea is that what we expect to see is in the same way that there are weather
stations all around the world so we get a worldwide view about you know what
hurricanes are going to come I’m sure we’ve all heard the story about if a
butterfly flaps its wings you know in Africa that can trigger a
hurricane in the United States so in the same way we want to have these mobile
sensors distributed all around the world so that if there’s an outbreak in you
know some remote area in Indonesia we’ll know about it as soon as possible so
that we can respond to it upfront another question (Audience member: So what does the mobile sensor actually capture?)
yeah so it uses nanopore sequencing so the physical metaphor for this is you know imagine
you have like a glass table where you’ve carved a hole into the middle
of it and you’re flooding it with water so water is flowing through here and
it’s going to flow out at a certain rate if I were to take a ping-pong ball and
throw it onto the table it’ll disrupt the flow of water in a
particular way if I took a wooden cube and threw it in there it would also
disrupt the flow of water but in a slightly different way because a sphere
would block the water differently than a cube or than
a different shape what the nanopore sequencer does is instead
of measuring water it’s measuring electron flow actually it’s this ion flow
passing across this teeny tiny hole and as the different nucleotides the ACGT
pass through this hole they disrupt the ion flow in different ways so we
can add little teeny-tiny voltmeters read off that signal and then from that
we can determine well what was the nucleotide sequence that passed through
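
As a rough illustration of reading a sequence off of disruptions in ion flow, here is a minimal Python sketch. It is a toy and not how a real nanopore base caller works: the current levels in `LEVELS` are made-up numbers, and each nucleotide is assumed to give one clean reading, whereas real signals are noisy and depend on several neighboring nucleotides at once.

```python
# Toy illustration only: map each measured current level to the nucleotide
# whose (hypothetical) reference level is closest.

# Made-up reference current levels in arbitrary units (an assumption).
LEVELS = {"A": 10.0, "C": 20.0, "G": 30.0, "T": 40.0}


def call_base(current):
    """Return the nucleotide whose reference level is nearest to `current`."""
    return min(LEVELS, key=lambda base: abs(LEVELS[base] - current))


def call_sequence(signal):
    """Translate a list of current readings into a nucleotide string."""
    return "".join(call_base(reading) for reading in signal)


if __name__ == "__main__":
    # One noisy reading per nucleotide passing through the pore (hypothetical).
    signal = [11.2, 19.1, 31.5, 38.7, 9.4]
    print(call_sequence(signal))
```

With these made-up levels, the five readings in the usage example are called as the sequence ACGTA.
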
there I really think this is like directly out of Star Trek to be
able to measure nucleotide sequences by teeny tiny fluctuations in this ion
flow but it’s starting to come true it’s quite spectacular so that’s where
I think the data are going to be again there’s already tons
of it and that’s only going to become more and more true over the next several
years so as a result of that we’re going to see sort of
increased attention and focus on sort of the next layer of this technology stack
so this is going to be where we’re going to go from sort of raw unprocessed data and
try to do something with it this is where sort of the actual
hardware the actual computers maybe the clouds will be able to sort of make
sense of all these data to do something with it so again if this is where I
think the sequencing centers are going to be over the next ten years you
know one question might be well where is all this data going to go is it going
to stay local is it going to be aggregated someplace so one concept I
had especially when I was a naive grad student was oh well of course we’re just
going to put this all into say the Amazon Cloud you know
it’s incredibly scalable hardware it’s very flexible it’s very
inexpensive of course we’re going to do this and that was again when I was a
naive student so I’ve gotten a little bit older and wiser and I don’t know more
worldly or more cynical and I’ve come to realize that it’s never going to get
aggregated into one cloud this is because different countries have
different sort of political restrictions over how data can be shared
also there are these different social barriers you know should the system
speak English should it speak Chinese should it speak Japanese you know different
languages and also the questions that we’re going to ask are going to be
wildly different right so if we have one system for all genomic data it probably
doesn’t make sense to put human data and microbes and plants and you know and all
these different things just because there’s going to be so many different
questions that are gonna be asked of it requiring different technologies to do so
so instead of everything going into you know a single cloud what we’re starting to
see is the data going into what we’re going to call a multi cloud where there’s
going to be different cloud resources being set up for different purposes so
in New York probably the biggest one that’s available today is at the New
York Genome Center in Manhattan where they’re building a cloud so for all the
sequencing projects that take place at their center the data are going
to be produced and kept right there in California we have a really big cloud
being set up to store all of the cancer data that’s generated in the United States
so if you’re sequencing you know at Cold Spring Harbor and at MIT, Wash U all these
different centers let’s aggregate all the cancer data together so that it can
live in one place in Chicago I work on a project called KBase that I’ll introduce in a
second where we’re aggregating all the agricultural data together so
that we can analyze biofuels and crop development at that big center in China
there’s a big cloud resource there that’s being set up to be able to
support all the analysis of the data that are generated from all their
instruments so the virtue of cloud computing in addition to you know being
scalable and all these things is that when you’re talking about you know
petabytes or exabytes or zettabytes it’s basically impossible to move
that much information across the regular internet we just don’t
have the capacity to do so so it’s much more tractable to have the sequencers
and the compute systems right next to each other so that we can do all of the
computing all the analysis right there we’re going to move the code
to the data rather than the data to the code as the slogan goes so as I
mentioned I work on a system called the Systems Biology Knowledgebase
sponsored by the Department of Energy and again we’re aggregating all kinds of
data from microbes from the different crops the interactions between
plants and microbes in support of bioenergy resources so in the way that you
know ethanol is used as a sort of additive to fuel let’s see if we can sort of
optimize that process to be able to produce more ethanol optimize it in
terms of having better plants growing optimize it in terms of having microbes
living in their environment that will break down the cell walls and produce
ethanol at a faster rate so one of the key operations that these centers are
doing is comparing one genome to another so just to give you a flavor of how that
works so here’s a picture of Craig Venter he was the head of the Celera
Corporation, head of TIGR where I used to work there’s a picture of Steve Jobs
I just happened to pick because I’m a huge huge fan of his so say we’re
interested in Steve Jobs’ genome so the way that we would do this is again
probably using that Illumina technology we’re going to be able to go
in with our you know very fancy digital camera and be able to take
measurements of the particular nucleotides in his genome the result of
that though is not the end to end genome the result of that is this
enormous file containing these little tiny snippets so it’ll be 100 nucleotides
here 100 nucleotides here a hundred nucleotides there but with no particular
order no particular structure to it it’s very much like your genome had been put
through the paper shredder chopped up into little tiny fragments and then it’s
up to the computers to figure out how those fragments relate to each other so
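
To make the paper-shredder picture concrete, here is a small Python sketch that places unordered fragments back along a known reference and reports where they differ from it. Everything in it is hypothetical: the reference string, the reads, and the brute-force search that tries every position and keeps the one with the fewest mismatches. Real read mappers use an index instead of scanning the whole genome and also handle insertions and deletions, which this sketch does not.

```python
def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def map_read(reference, read, max_mismatches=1):
    """Return (position, mismatches) of the best placement, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(reference[pos:pos + len(read)], read)
        if mm <= max_mismatches and (best is None or mm < best[1]):
            best = (pos, mm)
    return best


def find_differences(reference, reads, max_mismatches=1):
    """Map every read and collect (position, reference_base, read_base)."""
    diffs = set()
    for read in reads:
        hit = map_read(reference, read, max_mismatches)
        if hit is None:
            continue  # read could not be placed within the mismatch budget
        pos, _ = hit
        for offset, base in enumerate(read):
            if reference[pos + offset] != base:
                diffs.add((pos + offset, reference[pos + offset], base))
    return sorted(diffs)


if __name__ == "__main__":
    reference = "ACGTTGCAAGCTTAGGCT"           # hypothetical reference
    reads = ["GCTTAGG", "ACGTTGC", "TGCATGC"]  # shuffled fragments
    for read in reads:
        print(read, map_read(reference, read))
    print(find_differences(reference, reads))
```

In this toy input the third fragment lands at position 4 with one mismatch, so the only reported difference is a T in the read where the reference has an A at position 8.
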
there’s a lot of good software that can do that it can then go in take those
unordered fragments map them up along the genome actually this is exactly what
my high school student was working on and then from that we can
look at all the mutations that are present there we can go to different
databases and we can ask questions you know are these mutations are these
differences have they been known to be associated with any diseases or maybe
other sort of magical characteristics again huge Steve Jobs fan so this
is a complicated computationally expensive operation to perform so the
way that we’re going to make this tractable is not to do this on one
computer but to use many many many computers all together so this is a
picture inside the Cold Spring Harbor datacenter just up the hill here where
we literally have these like racks and racks and racks of thousands and
thousands of computers that are all churning away
trying to make sense of all this genetic information it’s complicated to get
all these computers to work together so myself and other people have
been using different technologies developed at big companies like Google
or Microsoft or Twitter to do this analysis one of the key
technologies that I use is something called Hadoop which is one of
the sort of leading big data analysis systems in addition to doing this at Cold
Spring Harbor we can do this in different cloud resources we can do it
in the Amazon Cloud today it would cost around ten dollars to be able to
analyze the mutations in one human genome it works great so
there’s a lot of concern about you know does Amazon have enough security to protect
all of this really precious genomic data and I would say yes right they have
these incredibly skilled security teams that are able to go and put in really
strong safeguards on your data to make sure no one’s going to hack into your
genome really the challenge of this is designing applications that
can take advantage of many hundreds or thousands of computers and that’s
because in addition to sort of looking for mutations if you
go back to this early slide in addition to trying to understand the
genome there’s a lot of biological activity that takes place inside of
cells that we really want to map out and understand so in
addition to the genome we want to understand you know what parts
of this are transcribed into RNA where the genes are which means where the genes are
activated we want to understand where proteins are going to bind to the DNA
we’re going to want to look to see how it folds up under different disease
states and different developmental conditions in different ways and then
also we’re going to want to look at you know how the RNA gets translated into
proteins so as you might imagine inside of cells there’s a
lot of complicated molecular machinery that we want to understand
at all levels to be able to map it out to look to see where the defects take
place all of these different applications require new software
specialized software that can tease apart all the different
signals that take place inside of there so just to sum up some of the
algorithmic challenges you know we should expect to see not just one cloud
but you know many dozens of different clouds that’ll be organized in particular around
different diseases or around different systems of interest we can’t possibly do
this all on one computer so we’re going to scale this out and use what are
called parallel computing distributed computing technologies and then we’re
going to see a shift from studying say one genome at a time to looking across
huge populations of hundreds or thousands or perhaps even millions of
human genomes in a single analysis okay so we finally made it to the top of the
pyramid where we’ve sort of moved up from low-level technologies let’s talk
about some of the applications some of the ways that these data are actually
being used to impact sort of people’s lives so one of the projects that I’m
most proud to be a part of is here at Cold Spring Harbor we’re involved in a
very detailed study to look at the genetic basis of autism spectrum
disorders I’m sure everyone in this room has some connection to autism either in
their own family or in friends or know of people that are on the
spectrum today the CDC estimates something like 1 in 68 children born in
America are on the autism spectrum there’s a lot of sort of uncertainty
over you know what exactly and how exactly kids can develop this so here at
Cold Spring Harbor we’re trying to you know use our genetic capabilities our
DNA sequencing capabilities to really try to pin down what are the causes of
this disorder so in particular through the Simons Foundation for autism
research they’ve established a collection called the Simons Simplex
Collection so this is thousands and thousands of families where the mother
and the father have donated their own blood and then also the blood
of one of their autistic children and then one of their non autistic children
so we’re able to use all these incredible technologies to ask the
question well are there any variants that are sort of present in children with
autism that are not present in their siblings or perhaps not present even in
their parents and the goal of this is that because we’re studying families
we’re going to have sort of built-in controls where the two kids will grow up
in more or less the same environment be exposed to the same environmental
toxins stressors or anything like that so we can really focus in on what is the
genetic component of this disorder so we’ve done this to date over more than a
thousand different families and we’ve built up this database of millions
and millions and millions of variations that occur across all these people here
is just a summary of where all those different variations can occur and sort
of as they interrupt gene sequences or other sequences and as we drill down the
most interesting most compelling signal that we start to see looks
something like this so your genome is composed more or less entirely of a
combination of your mother’s and your father’s genomes put together in a
particular way you have two copies of every chromosome this is
for sort of healthy individuals so what we see in these family structures
is that for example the father’s genome again has two copies of
his chromosomes both of these copies will be intact with respect to the
reference genome the mother’s genome two copies will be intact will match the
reference genome the non-autistic sibling of the child two copies perfectly
intact no problem there when we’re looking inside the genomes of kids
that are on the autism spectrum what we see is there’s a much greater
impact of what are called de novo mutations so these are just spontaneous
mutations that occur in the genomes of the children that are not
inherited from their parents so in this
kid’s genome what we see is you know what should be here you know four nucleotides
AAA G I think I have that right AAG what has happened is those four nucleotides have
been deleted in some way at some point in sort of the kid’s
development probably what happened is cells are very good at making copies
of each other at some point along the way one of the cells had a mistake where the
copy was not exactly right and these four nucleotides were lost so you
wouldn’t think that you know four nucleotides out of 3 billion would be
significant at all but in the language of genetics this is
a very significant mutation because this introduces what’s called a frameshift
mutation and that’s because the genome is read off three
bases at a time sliding it over by four nucleotides can totally
scramble the gene that is there will totally scramble the proteins that are
made totally scramble sort of the development of this kid
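
The effect of deleting four nucleotides from something that is read three at a time can be shown with a short Python sketch. The DNA string below is made up for illustration and is not the actual gene from the study; the translation uses the standard genetic code table. Because four is not a multiple of three, every codon after the deletion gets regrouped, so the protein changes from that point on.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA three bases at a time, stopping at a stop codon (*)."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def delete(dna, start, length):
    """Remove `length` bases beginning at 0-based index `start`."""
    return dna[:start] + dna[start + length:]


if __name__ == "__main__":
    # Hypothetical coding sequence, not a real gene.
    original = "ATGGCCAAAGGTCTGGAACGTTGGTAA"
    print(translate(original))                 # intact reading frame
    print(translate(delete(original, 6, 4)))   # four bases lost: frameshift
    print(translate(delete(original, 6, 3)))   # three bases lost: frame kept
```

For this made-up sequence the intact protein is MAKGLERW, the four-base deletion gives MAVWNVG (everything after the second amino acid is different), and a three-base deletion gives MAGLERW, which only loses one amino acid.
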
unfortunately so I looked to see what gene this occurs in this occurs in what’s
called CHD2 and this has been associated with controlling how your
genome gets organized inside of cells into what’s called chromatin that’s
packed in there so when we look across this population of many hundreds
or I think today it’s up to about a thousand different families what we see
is that the number of de novo mutations in kids with autism and their
non-autistic siblings is basically the same it’s not that autistic kids have
more of these mutations but what we see is that these kids have more mutations
that will what we call kill the gene so this would be
mutations like we just saw where it’ll be a shift over by four this will be
mutations where what should be an activated gene will have what’s called a
stop codon introduced inside of there where these genes get killed off when we
look to see which genes are getting killed we see that many of them are
related to things like neuron and brain development so it kind of makes sense at
that level that that would be associated with autism spectrum disorders in
addition to their genetic information we’ve been trying to correlate this with
you know the different behavioral assays different you know diet conditions all
sorts of different literally thousands of other different experiments have been
done on these kids to try to understand what is it what else is there besides
their genetics and the strongest thing that we see it’s related to is the age
of the father so this also kind of makes sense in a sad way and that’s that
as fathers get older our abilities to make copies of ourselves particularly in
our sperm cells can have a higher rate of mistakes along the way and then
if those mistakes happen to fall in just the wrong place then it can lead to kids
that have autism so it makes sense why there can be in a family for
generations no symptoms of autism and then all of a sudden aha a child could
be born that is on the autism spectrum in addition to autism we’ve
been applying these technologies to study many other
psychiatric disorders and also other genetic diseases so a great
breakthrough in the study of cancer has been pushed forth by advancing these
technologies in addition to being able to sequence you know a
large collection of blood we can drill down to sequencing the genomes
inside of individual cells so in 2011 a tumor from a
breast cancer was extracted it was cut up into different sectors
individual cells from the tumor were extracted we then measured all the
genetic changes along there and then from that information we can build a
tree and then we can observe oh well inside this tumor there’s a lot of
healthy cells but a lot of diseased cells and a lot of really critically
diseased cells that can lead to metastasis which ultimately went on to
kill the patient understanding the particular mutations in these metastases
is really the key to the treatment and here and at many different research
hospitals they’re trying to really understand these mutations design well
what is the right cocktail of drugs and compounds that we can give people to try
to prevent the spread of the diseases inside of there in addition to being
able to sequence you know inside of tumors there’s a lot of effort underway
now to look at what are called circulating tumor cells so these will
just be cells floating around in your blood and the idea there is we’re going to be
able to detect if you have cancer very early on by being able to identify those cells
we can also look inside of your brain to look to see are there any mutations
across your neurons either in their genetics or in the genes that are
expressed there looking beyond sort of individual people we can also look
across evolutionary timescales so you know obviously we have a lot of you
know relationships at a genetic level to say other primates so
this story comes from Adam Siepel our new director who will be here in a few weeks
and what they’ve been doing is they’ve been studying the different genomes of
from sort of closely related species to further and further distantly related
species so it’s a little bit hard to see but up here this is called a multiple
alignment where we’ve lined up at the top we have the human genome, the
chimpanzee genome, dog, mouse, rat, and chicken so chicken being the most
extreme outlier our nearest common ancestor was on the order of
hundreds of millions of years ago and what they observe here is that for chicken,
rat, mouse, dog, chimp this particular segment has been perfectly conserved for
hundreds of millions of years there’s no way that this happens by accident right
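
One simple way to look for this kind of signal in a multiple alignment is to go column by column and ask whether all of the species carry the same base. The Python sketch below does that on made-up aligned sequences (they are not the real region shown on the slide), and it also counts how many of those conserved columns differ in one other row of the alignment. The function names are hypothetical and the sequences are assumed to be already aligned to the same length.

```python
def conserved_columns(rows):
    """Indexes of alignment columns where every row has the same base."""
    length = len(rows[0])
    assert all(len(r) == length for r in rows), "rows must be aligned"
    return [i for i in range(length) if len({r[i] for r in rows}) == 1]


def changes_in_conserved(rows, other):
    """Conserved columns (among `rows`) where `other` has a different base."""
    return [i for i in conserved_columns(rows) if other[i] != rows[0][i]]


if __name__ == "__main__":
    # Hypothetical, already-aligned sequences of equal length.
    alignment = {
        "chimp":   "ACGTTGCAAGCT",
        "dog":     "ACGTTGCAAGCT",
        "mouse":   "ACGTTGCAAGCT",
        "rat":     "ACGTTGCAAGCT",
        "chicken": "ACGTTGCAAGCA",
        "human":   "ACCTTGGAATCT",
    }
    others = [seq for name, seq in alignment.items() if name != "human"]
    print(conserved_columns(others))
    print(changes_in_conserved(others, alignment["human"]))
```

In this toy alignment eleven of the twelve columns are identical across the five non-human rows, and the human row differs at three of those conserved positions.
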
this is a very important sequence in terms of the operation of all these different
species but when we transition from chimpanzee to human
suddenly there are all these mutations so this is called a human accelerated
region where in humans
throughout the centuries there’s been an acceleration of the mutations
that take place there when we look to see how that impacts the gene sequences
that are there we notice that that leads to a much different shape and also
a much different function of the gene and then when they did a follow-up study
what they determined is that this gene is critically important
very early on in brain development it’s just critically important for
establishing the overall structure of human brains so again it kind of
makes sense that we’d be able to recognize these sorts of signals from
the genomic sequence that’s present and then the last little story I’m going to
tell you you know those are many of the very positive things now of
course we have to be incredibly careful of how we’re going to use this data
right this is in some levels this is the most personal information that there is
about you so a former student here Yaniv Erlich had this really significant
study where he was looking through sort of public genetic data that had
been published on the internet and this data was supposed to be
anonymous and then asking the question well you know is there a way that we can
figure out you know who these people were just from their genome
and it turns out yes you very much can figure out who these people are and then
you know so from the genome we can figure out who you are we can
learn something about are you susceptible to diseases and other things so in addition
to you know incredible privacy considerations that have to take place
with sort of genetic data complementary to this just our general presence
on the internet we leave these digital breadcrumbs along the way and there’s a
nice study from Carnegie Mellon University where they could you know
more or less guess or predict what your entire US social
security number is based on a few little pieces of
information so we know there’s definitely a trade-off involved right
we can make great advances over you know what
diseases you may have how we may treat them learn a lot about evolution we also
have to be incredibly careful about you know making sure this data doesn’t
escape into the wild in a bad way so one of the things they
highlight in this paper is that you know these data are very complicated no one
piece of it you know is bad on its own but there’s the so-called
hidden privacy costs about aggregating all these digital databases together I’m
sure we’ve all heard the reports about you know there’s Snowden type reports
you know there’s lots of monitoring going on on the internet all the
time so I don’t want to alarm you I just want to be cautious so
how are we going to have this balance between the benefits and some of the
negatives so there’s tremendous power from aggregating all these data
together you know we’re really starting to you know crack the codes the genetic
codes behind really significant human diseases really trying to crack the
codes about plant development for our foods the genetic code behind
different cattle other sort of you know food products for us but we have to be
incredibly mindful of the risks that are present so this can be sort of technical
risks where you know you can trick yourself into
believing something that’s not real by looking at various artifacts and then
more significantly we have to be incredibly cautious when we’re looking
at these personal data and there are very strong safeguards in place to make sure
it doesn’t get misused and get put out there so on the last little note here I
think you know hopefully I’ve instilled a little bit of curiosity in
you about you know about these different data so I ask the question
well how can you participate so for students in the room and I see many of
you that’s good I think the number one thing that will help you help your
careers help yourself is to learn how to program in particular I like the
computer language called Python it’s very powerful it’s very flexible
not only do we use it in genomic research they use it in many different
scientific environments they also use it at Google and
Twitter you know any sort of engineering any sort of sciences
this is emerging as the language of choice so I strongly encourage all of
you students in the room to learn that you know take as much math as you can
take as much statistics as you can take as much computer science as you can
these are going to be your tools which will give you special powers to look at
and interpret all these data if you’re able to I strongly encourage you to
check out the DNA Learning Center just across the harbor we run a number of
classes and workshops where you can get really hands-on working with your own
genome working with real genetic data and then for other individuals if you’re
interested you can sign up at places like the Personal Genome
Project that’s run through Harvard or 23andMe where you can provide genetic
samples and they’ll tell you something about your ancestry tell you something
about your own genome and in addition and sort of finally the most important
thing you can do is educate yourself so I’m really pleased that so many people
came out tonight Cold Spring Harbor holds a number of these public lectures
and events so I encourage you all to keep an eye out for those where you can
learn about your genomes learn about all the exciting advances that are being
enabled by those technologies so I’d just like to acknowledge all of the great
people working in my lab it’s really at all levels where it’s postdocs, grad
students, undergraduate students, high school students really happy to do that
it’s really been a tremendous amount of teamwork here at Cold Spring Harbor I’ve
been collaborating with you know many many different labs on campus to be able
to help collect and analyze and think through all these different datasets the
funding to support my research has come from the Simons Foundation, the National
Institutes of Health, National Human Genome Research Institute, Department of
Energy and most recently from the NSF so with that I will thank you I’ll be happy
to take any questions and then also highlight you know the next public
lecture is actually gonna be next week talking in more detail about autism
spectrum disorders so I encourage you all to attend that all right so thank you
very much.
