A genome hacker’s experience with the privacy of shared data | Yaniv Erlich | TEDxDanubia

Translator: Ivana Korom
Reviewer: Krystian Aparta

Hacking is the art of breaking security mechanisms. When I was an undergraduate student, I worked as a hacker
in a computer security firm. We used to be hired by banks
and credit card services to conduct penetration tests, to try to find critical gaps
in the security of these banks and develop better measures.

Let me show you one of my favorite hacks. Here is the younger version of me. And behind me, you can see the door to the IT department of a major bank in Israel. This door is controlled by an intercom. That’s a very simple device – you press a button, it calls the secretary, and if she knows you, she presses 8 on her telephone and the door opens. What I’m going to show you is that each one of you can open this door in five seconds by playing the sound of the 8 key on your cell phone, just next to the intercom. Let’s see how it goes.

[unclear] it’s 10 pm, no secretary there. Okay. (Phone dialing) [Calling] Calling the secretary. No one is there. (Phone ringing) Playing 8. And taking the money. (Laughter) Don’t try this at home. (Laughter)

After a few years of working as a hacker, I graduated, switched fields to genetics and eventually became a professor of computational genomics. In genomics, we don’t have banks. But I found that we have biobanks. These registries collect and store
the DNA of many individuals to give access to scientists. And they can get to a very large scale. For instance, here is
the biobank in Norway that stores the DNA
of half a million individuals. Scientists can use the DNA to understand the genetic basis
of different types of disorders, such as heart disease, cancer
or psychiatric illnesses. Once we know the genetic basis,
we can offer help to patients. We can offer them
early diagnosis and intervention. To illustrate what this process
might look like, I want to show you a short scene
from the futuristic movie Gattaca. In this scene, there is a newborn, and the doctors are going
to collect blood from this newborn and analyze his DNA immediately to know all the predispositions
in the first moments of his life. Let’s see how it goes.

(Video) Man: 10 fingers, 10 toes – that’s all that used to matter. Not now. Now, only seconds old, the exact time and cause of my death was already known.

Yaniv Erlich: Now they take the DNA and start to analyze the predispositions.

(Video) Woman: Neurological condition, 60 percent probability. Manic depression, 42 percent probability. Attention deficit disorder, 89 percent probability. Heart disorder … 99 percent probability.

YE: So now if we know that this baby has a 99% probability for heart disorder, we can intervene early. Maybe by a drug, by a surgery, maybe, in the future, by CRISPR. So, although Gattaca
is a futuristic movie, this scene highlights that DNA can reveal
a lot of information about diseases. And that might be sensitive
for some of you. So biobanks developed security measures to protect the identity and the privacy
of their participants. When scientists access
the DNA in these biobanks, they don’t get the name
or contact information of these people. They only get access to the DNA, with some basic demographic information, such as where the sample was collected or the age of the person. But the identity remains anonymous.

Being a hacker, I wondered if this security mechanism is good. And I wanted to try to hack it and to see if I could breach
the privacy of these participants. So my lab conducted an empirical test. We took a DNA sample from a US biobank. Now, since this person
could match anyone in the US, the initial search space
was the entire US population. Over 300 million individuals. Then we started to work
with demographic identifiers, to narrow down the search space. We knew that the sample came from Utah, which has a population
of 3 million individuals. Then we discovered that the sample
has a Y chromosome. Meaning that this is a male, which halved the search space
to about 1.5 million individuals. And then, based on
demographic information, we knew that this person is 49 years old, which reduced the search space
to 20,000 individuals. By this point, we exhausted
all the demographic identifiers. We still had 20,000 people,
and we didn’t know who the person was. We were stuck. But then we thought
about this really cool hack. We found that we can infer
the surname of the person just by looking at the DNA. Let me show you how it works. Consider here the Kovács family. Now let’s assume that they have a son and the father will give his son
his Y chromosome, and also his surname. Now, if this son is getting married,
and also has a son, he will give him his Y chromosome
and also his surname. So you see, this creates a correlation between the specific Y chromosome
in this Kovács family and the surname. Genetic genealogy companies
take advantage of this correlation and they offer services where they will send you a swab
to sample the DNA in your cheek, you put it in an envelope
with 99 bucks – very important – and send it to them. They will analyze your Y chromosome
and your surname, and will put it in an open database,
such as Ysearch.org. This website is open for every one of you. Without subscription. So you can look at the data
on this website. Many people do these tests just because they want to learn
about their ancestry. They want to find their relatives or maybe meet the black sheep
in their family. In this database right now, there are over 170,000 records
of surnames and Y chromosomes. Going back to our
allegedly anonymous sample, we took the Y chromosome
of this individual and searched this database. And eventually, we found a match
to another Y chromosome. Since these two Y chromosomes
were identical, we also knew that the surnames
should match each other. So now we were able to infer the surname
of this anonymous person, whom I’m going to call Mister X
just to respect his privacy. Now we had a surname. And with it, we reduced the search space
from 20,000 individuals all the way to a single individual. Only one person. At this point, we knew
the identity of this person – his first name and last name. We knew his Facebook profile, email address, contact information,
where he lives – all of these were connected
with his genomic information, with all the predispositions. Just to show that
this process really works, we repeated this analysis
over and over again with more samples and were able to breach the privacy of close to 50 different participants. We had their entire genome
and full name and contact information. And we’re very pleased to see that this study generated
a lot of media attention, because we wanted to start a dialogue between scientists, biobanks
and the general public about how we should move forward. And I want to emphasize that this issue
is not just a problem for the US biobank. It is also a problem for other biobanks around the world, so the study also generated some attention
from the international media. Maybe we should just dismantle
all these biobanks. They cannot protect our privacy. Their security measures don’t work. Maybe we should just shut them down.

Maybe you should not go so fast. Let me tell you what is at stake. I would like you to meet Ariela. Ariela is an adorable little girl
who was born with a facial malformation. During her childhood,
she developed brain cancer. And her family contacted my lab
to see if we could help. We sequenced Ariela’s genome, and were able to pinpoint the mutation that caused
this facial malformation and quite likely also induced
her cancer. Today, Ariela is healthy –
this photo was taken two weeks ago at her Bat Mitzvah celebration in Israel. She’s doing great. Without biobanks,
we would not be able to help Ariela. The whole process of finding
this mutation in her genome relied on analyzing thousands of genomes
of healthy individuals from the biobanks and contrasting these results
with the genome of Ariela. So biobanks are highly important
for advancements in biomedical research. For families like Ariela’s,
but also for other types of diseases. So thinking a lot about this tension
between data sharing and privacy, I realized that in many cases,
when people say, “I want my privacy,” they are actually telling us,
“I don’t trust you with my data.” So instead of fixing privacy, which might be quite hard, we should focus on establishing
trust relationships between scientists and participants. And when we have trust,
we can do great things. Let me show you how it works in practice. This website is called DNA.Land. We put this website up a few months ago, and the aim of this website
is to crowdsource genomic information
from the general public. We launched this website in October, and since then, we have collected
close to 22,000 genomes. About 3,000 genomes every month,
or 100 genomes every day. And we do that because
we have trust relationships. How do we get
to these trust relationships? We return individuals
useful and interesting information when they upload their genome
to our website. For example, for each genome, we analyze the ancestry composition
of the individual. What you see over here is the DNA.Land
results for my own genome. I’m half Ashkenazi Jewish, and you can see that in the
big red segment in my genome, and my mother’s side is from
the Jewish community in Uzbekistan, and these are the other segments
that reflect my maternal line. We do it for every genome and then people feel that we reciprocate,
we give them something back. And this creates these relationships. In addition, to create trust, we want to have this dialogue
with the community. And to have this dialogue
we are very pleased to see that people in our community
put together a Facebook page. And they did it on their own initiative to discuss the results
they get from DNA.Land. To help each other with this information. And also we can go and tell them
what we do with their data. And this is basically the part
where we go to this website, show them what is in their genome, and tell them about the science
that we have over there, creating these trust relationships. In addition, we want
to lead the community. I want to serve as an example
for the community. And I also share my own genome online, because I want to signal to people
that if they give me their genome, I will also show them
my own test results. And I want to encourage each one of you to consider sharing
your genomic information and participate in such studies. Think about it – each one of you
is going to be sick at some point in your lifetime. Worse – our loved ones might be sick – our kids, our spouses, our parents. And when they’re sick,
we want to give them the best clinical care that is available. With genomics, we can
understand the blueprints of different human maladies. We can develop better treatments
and we can really open a new chapter in biomedical research. But this cannot happen
without our participation. So I want to encourage each one of you
to contribute your genome. Thank you very much. (Applause)
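
The re-identification walk-through in the talk (entire US population, then Utah, then males, then 49-year-olds, then a single person once the surname is inferred) can be retraced with simple arithmetic. The sketch below is only an illustration under stated assumptions: the population figures are the rounded numbers quoted in the talk ("over 300 million" is taken as exactly 300 million), and the marker profiles, the toy database, and the function names `narrowing_report` and `infer_surname` are hypothetical. It does not reproduce how Ysearch.org or the lab's actual analysis worked.

```python
import math

# Search-space sizes as quoted in the talk (rounded figures, not census data).
STEPS = [
    ("entire US population", 300_000_000),
    ("sample collected in Utah", 3_000_000),
    ("Y chromosome present (male)", 1_500_000),
    ("49 years old", 20_000),
    ("surname inferred from Y chromosome", 1),
]


def narrowing_report(steps):
    """Return (label, candidates, reduction_factor, bits_gained) per step.

    bits_gained is log2 of the reduction factor: how much identifying
    information that single attribute contributed.
    """
    report = []
    previous = None
    for label, candidates in steps:
        if previous is None:
            factor, bits = 1.0, 0.0
        else:
            factor = previous / candidates
            bits = math.log2(factor)
        report.append((label, candidates, factor, bits))
        previous = candidates
    return report


# Hypothetical toy records: invented marker values mapped to surnames.
# "Kovács" is the example family name used in the talk.
TOY_DATABASE = {
    (13, 24, 14, 11): "Kovács",
    (14, 23, 15, 10): "Example",
}


def infer_surname(profile, database):
    """Exact-match lookup of a marker profile; None if nothing matches."""
    return database.get(tuple(profile))


if __name__ == "__main__":
    for label, candidates, factor, bits in narrowing_report(STEPS):
        print(f"{label:36s} {candidates:>12,d}  x{factor:>8,.0f}  {bits:5.1f} bits")
    print(infer_surname([13, 24, 14, 11], TOY_DATABASE))
```

Expressing each step in bits makes the point of the talk visible: being male contributes only one bit, while the surname alone contributes more than all the demographic identifiers after the state combined.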

