MIT CompBio Lecture 10 - Regulatory Genomics


All right, welcome, welcome everyone. I feel like every lecture I give I'm telling y'all this is my favorite lecture, but this is my favorite lecture. I also heard that some people changed their projects to work on 3D genome architecture after the 3D genome lecture, so I hope you guys don't keep changing topics every lecture, because there's just so much cool material coming up.

Today we're gonna be talking about regulatory genomics. So, number one, how do we discover regulatory motifs? How do we find their instances and use that to predict their targets? And then how do we dissect the regulatory regions within which these motifs are found?

So we are now in the first lecture of the third module. We talked about aligning and modeling genomes, module two was all about gene expression and epigenomics, and we're now getting to the circuitry. So today we're introducing the foundation of the regulatory networks inside the cell by learning about regulatory motifs, and then how do we discover them, represent them, and find their targets. And then on Tuesday next week we're gonna be talking about network structure, and on Thursday next week we're gonna be talking about deep learning.

So the goals for today: number one, introduce the basis of gene regulation in most genomes, focusing on mammalian genomes. Number two, introduce computational techniques for discovering these motifs, based on expectation maximization and Gibbs sampling. Number three, introduce computational techniques for finding the instances of these motifs. So a motif is a construct that spans across many, many different regions and summarizes them, and then this effectively generative model of these sequence patterns can instantiate in different locations in the genome. And number four is going to be about finding these instances. Actually, number five is gonna be finding these instances; number four is going to be about exploiting many of these instances together, regardless of who the regulator is, to learn
regulatory motifs de novo, so basically an experimentally driven technique and an evolutionarily driven one. And then number six, we're gonna be talking about how do we discover these motifs in larger regions: how do we use experimental techniques in extremely high throughput for dissecting the functional nucleotides within them. Everybody with me?

All right, so what are regulatory motifs? I showed you in the very first lecture a region of the yeast genome where I highlighted in red the nucleotides that were important, and these were instances for the Gal4 and the Mig1 regulators, as well as for the TATA-binding protein, TBP.

So how do these motifs work? These motifs are basically constructing a landing pad for different regulators, or different transcription factors, and these landing pads are made out of atoms that are just in the right locations, as well as the entire double-helical structure of the surrounding region that the protein then recognizes. So that protein makes contact with the phosphate backbone of DNA, and this does not depend on the specific nucleotides underneath. But it also makes contact with the stacked nucleotide base pairs of the DNA, which it senses from the side, and depending on what nucleotide you put at a particular position, different atoms are going to be facing that protein, and then they will be making a contact or not making it.

Who's with me on this one? Raise your hands. Who feels that they actually learned something in the last minute? No, I'm serious, because this is not obvious. Basically everybody thinks, oh, we're just gonna recognize that motif. But that motif is not recognized by opening up DNA and then reading the bases, in the same way that RNA is made out of DNA. That motif is recognized in the closed double helix, not by the traditional A, C, G, T base pairing with its hydrogen bonds, but by all of the other decorations that are on the side. This is in fact what leads to the degeneracy of regulatory motifs: sometimes the specific molecules, the specific atoms
that the binding protein recognizes can be achieved by a C or a G at a position, or an A or a G at that position, and so forth, and that's what allows these motifs to be degenerate. That's what leads to these, you know, W, which means weak, which means an A or a T nucleotide. Does that make sense?

So basically genes are turned on or off in response to changing environments, and as I mentioned previously, there's no direct addressing. There's no "go to chromosome 13, position 327,542". Instead the subroutines, the protein-coding genes, contain sequence tags, or motifs, and then there are specialized proteins called transcription factors that recognize these. So I'm gonna put this red tag in front of all of the genes that I need for metabolizing galactose, and I'm gonna put this blue tag in front of all the genes that I need, or, you know, that I don't need, when I metabolize glucose. So therefore, if yeast senses galactose, Gal4 goes and binds and it turns on all these galactose genes, but if yeast senses glucose, which is a more efficient sugar for yeast, Mig1 is gonna turn on and then shut off these genes, and then other genes will turn on.

OK, so what makes motif discovery hard is that motifs are very short, they're only six to eight base pairs long, and they're sometimes degenerate. As I mentioned, you can have any position being a degenerate position. They can contain distances between them: in this particular case the Gal4 transcription factor binds on the same side of the double helix, but eleven nucleotides apart. Every ten and a half nucleotides I have a full turn of the double helix, and that basically means that if I bind on the same side I'm gonna bind two parts of a motif which are gonna be spaced by about eleven or so. And then the motifs can act at variable distances, both upstream and downstream of their target gene.

With me so far? Any questions? You really did not learn anything? [inaudible] Let's make him
happy, he's insecure.

All right, so the regulatory code is all about finding where these motifs are and how they function, and where the corresponding regulatory regions that they define are and how they function. Motifs are everywhere. Motifs can be right where the transcription start site is, recruiting RNA polymerase to the specific location where it will start transcribing DNA into mRNA. They can be at the boundaries between different exons and introns, basically enabling the splicing regulators, many of which are RNA-binding proteins, to bind and guide splicing. Some of them are post-transcriptional, and they guide the binding of microRNAs, which we talked about last time. And some of them are acting at very, very large distances: some can act a million or more nucleotides away, by sort of creating this looping interaction between the distal enhancer region, which will then turn around and find the promoter.

OK, so this is the parts list: basically there are 20 to 30,000 genes, and then the circuitry is all of these regions, enhancers, promoters, splicing, and, you know, [inaudible]. And then the regulatory code is basically some kind of combinatorial coding of unique tags. So, you know, we use multiple individual motifs, and together they allow this data-centric encoding of these addresses, and these are overlaid with memory marks, and then these are the epigenomic states that we learned about last time. All of this allows us to basically modulate this large-scale encoding of regulatory information.

OK, so as I mentioned, transcription factors basically use DNA-binding domains to recognize specific DNA sequences in the genome, and here you can see one such example. You can see how this is the DNA molecule, and as I mentioned, it's a very cool molecule: it has a major groove, it also has a minor groove, and in that major groove, in this particular case, you basically have this protein that's sort of coming in here and making specific
contacts with specific positions. You can see here these amino acids of that protein are contacting, you know, a site far away; these ones are sort of forming many, many different interactions in this loop; and then the other one comes, moves around, and then contacts another. Everybody has an intuition now for how these motifs work? Yeah? [inaudible]

Notice these motifs have some very cool properties. This particular motif, if I read it backwards, what does it spell? Come on, can you read it backwards? T, A, A, T, T, A. Wait, I asked you to read it backwards, this is forward! My son would have such a blast with this, he's six years old, so he could spend all afternoon saying "hey wait, I told you to read it backwards". So anyway, it's a palindromic motif. It's actually a reverse palindrome, so the complementary strand reads exactly the same as the forward. Can you read this one backwards? Uh-huh, I told you to read it forward! All right, so anyway, this is a palindrome as well.

So many of these are palindromes. Why? Because they don't really care: you can just bind this way or that way and it's just as good. There's no directionality in their binding; it doesn't matter how they bind. Others are not palindromes. This is not a palindrome, so this one has the possibility of caring about which direction it binds in. It may have a different effect in one direction than in the other direction. We saw for the CTCF regulator that it is in fact extremely directional: in one way it's a stop sign, cohesin cannot go through; the other way, cohesin goes through, no problem. OK, so directionality actually matters, and the specific structure of the motif dictates how things are gonna bind.

And these motifs are so important because if we disrupt nucleotides, you know, just from natural polymorphism that exists within DNA sequence from human to human, we can actually disrupt functional nucleotides and then lead to a phenotypic difference, and sometimes disease,
from person to person, not because of protein-coding variation but because of non-coding variants. That's why it's extremely important to sort of systematically understand the circuitry of the genome. OK, everybody with me?

So, quick question: what fraction of genetic variants perturb a protein directly? If we look at all of the common variation that has been studied with genome-wide association studies, what fraction of these regions disrupt a protein? Take a guess. Raise your hands, take a guess: what fraction disrupt proteins? Yeah? So one percent of the human genome encodes protein, and therefore you would expect that if polymorphisms are laid out by chance in the genome, randomly, uniformly, you'd expect that about 1% disrupt proteins. But the question is for those that are associated with disease, not just any genetic variant but the disease-associated genetic variants: what fraction of those disrupt proteins? Yeah? 15%? 50%? 15, good. Anybody else want to say higher or lower? Higher? Anything more than 50%?

So you would expect, since proteins are so darn important, that half the time these common variants are going to be disrupting a protein. You're absolutely right, proteins are so darn important, but you're not quite right enough. It's actually much less than 50%; it's even much less than 15%. It's only about 7% of the time that a genetic variant actually disrupts a protein. And the reason lies within this insight that proteins are so darn important that if you disrupt the protein, that genetic variant is not going to be common enough in the population, because of selection. So genetic variants that disrupt proteins are in fact not as common as genetic variants that disrupt the rest, and the rest is effectively 93% of cases: these genetic variants actually disrupt a motif or some kind of non-coding function rather than a protein. OK, so today's lecture is kind of important if you care about the 93%; don't focus on the 7%. All right,
so, as I mentioned earlier, we can actually identify individual nucleotides that can... Yeah? Yeah, so you have a very good point: is there a one-to-one correspondence between amino acids and the nucleotides they contact? And the answer is absolutely not. If you look here, how many nucleotides are there? One, two, three, four, five, six, seven base pairs. OK, and these two are contacting the same one, these three are contacting the same one, this one is not contacting a nucleotide at all, and so on and so forth. So the answer is, we wish it was that simple, but it's actually ridiculous. There are some proteins called zinc finger proteins where there's more of a code, where every three amino acids sort of contact one nucleotide, and therefore you can actually design them to sort of construct new motifs. But most of the time it's just some crazy combination that has been evolved, not engineered.

All right, so in 93% of cases this is what's happening, and there's one very beautiful example that Melina Claussnitzer worked out, working in collaboration with my lab, who basically showed that you can actually trace down the strongest genetic association with obesity to a single nucleotide. By disrupting this T nucleotide and changing it into a C, this AT-rich interaction domain protein can no longer bind there. That leads to the derepression of a super-enhancer, a very massive regulatory region, the derepression of these two genes a million nucleotides away, and then a switch from energy burning by thermogenesis to energy storing by adipocyte lipid storage. And all of that explains seven pounds in an average human adult: basically, those who carry two risk versions, from both mom and dad, for this particular region (thank you Mom, thank you Dad, I got both) have seven more pounds as an adult than otherwise. So when I go to the doctor I can say, hey, I'm actually seven pounds leaner, it's just that Mom and Dad gave me this version here. Anyway, it's not deterministic, and the human brain has something to do with it as well, but
anyway, this basically changes your metabolism due to a single nucleotide change in a regulatory motif that's this small, that's a million nucleotides away from the target.

And then these are sort of different types of structures for how they bind. They can basically bind within the major groove, they can bind across different regions, across sides of the helix. Very often you get binding as a homodimer. So for example Gal4 is a homodimer: one arm contacts CGG on one side of the helix and the other arm contacts CGG on the other side of the helix. So instead of nature basically making two different elements that will both recognize the same motif, it just makes one and uses it twice, and that's why a lot of motifs are palindromes, because you can actually just very easily put them together.

OK, so proteins "feel" DNA: the 3D topology dictates the specificity. And this is not the only type of recognition. MicroRNAs, for example, actually bind by complementarity, nucleosomes care a lot about GC content, and then various RNAs can bind using both structure and sequence.

OK, so given this very complex structure, we'd like to have a representation for capturing what this language of gene regulation looks like, and this representation is very frequently position weight matrices. This is a matrix that basically, at every nucleotide position, gives a weight for each of the four nucleotides. Here's one such example. This is the motif for the ABF1 protein in yeast, and this is the consensus. So this position is nearly always a T, that position is always a C, you know, always an A, always an A; this one here is a G; and here half the time it's an A, half the time it's a G. Similarly here, half the time an A, half the time a G; half the time a C, half the time this is a T. And you can see here, from all of the instances of that motif across the genome, all the places where the ABF1 regulator binds: these are all the target genes, and then these are
the coordinates upstream of the transcription start site where they bind, and then this is the alignment of all of the motif instances. And if you count, at every position, how frequently I saw each of the nucleotides across these hundreds of sites, you end up with exactly this. Is everyone with me here? Awesome.

So what is the limitation of this representation? What does it assume? Correct, great. So they assume that only the nucleotides matter. They don't deal with shape information or folding or other kinds of information. That's good. What else do they lose? What does this representation not capture? Yeah, good. So it basically says that every position is equally important. It only tells me the specificity of that position, but even if this is always an A, maybe it's not as important. So it doesn't weigh them by importance. I could have an overlay on this representation that also weighs each of them by their importance.

What else? Correct, it basically assumes that every position is independent. It basically says that whether I have an A or a G at this position, it doesn't really matter whether I have an A or a G at that position. But maybe in practice every time I see an A here I always see an A here and a C here, and every time I see a G I see a T and a G, but they don't mix and match. Everybody sees that? Raise your hands if you see that. Right, so basically PWMs assume independence, and that's OK most of the time. Some of the time it's not OK, and when it's not OK what we can do is expand that PWM into multiple PWMs, multiple position weight matrices, and basically say the version with the A at that position would be one of them, and then the same thing with the G would be the other one, and then suddenly you've captured this dependency by splitting the motif into two, for example. Or you could have a more complex representation that captures these dependencies together. Some people just use clouds of k-mers to basically say, here's all the k-mers, I'll just use a cloud of them, you know, mixed
together. Everybody with me so far? Awesome.

Yes, they depend on fixed spacing. For example, this representation will not be able to capture variable spacing, and you saw earlier how a protein can bind at two different locations, and maybe the spacing doesn't really matter, it just looks for these two halves, and this representation doesn't capture that.

Yeah, so we make the simplifying assumption that if I score this sequence using this matrix, the score that I would obtain is an approximation of the affinity. But you're right, the two are not exactly the same. Yeah? Correct, correct: basically we assume that there are enough of these samples and that they are representative of the true frequency of binding, and then from frequency we basically, you know, infer affinity information. OK, great.

And there are many experimental ways of finding this affinity. What I mentioned earlier is that experiments are basically being done to see where that transcription factor is bound. For example, using a chromatin immunoprecipitation experiment you can basically say, oh, this regulator is sitting exactly here on the sequence, therefore I'm gonna go and cut that sequence out and then figure out what's there. There's something called footprinting, which basically allows you to take the bound protein-DNA complex, digest the DNA away, and then whatever does not get digested is where the transcription factor was sitting. You can use that to basically find where it was sitting each time. But that doesn't give you the motif; that only gives you the precise coordinates of binding. To get the motif you need many locations, and that's sort of the key thing to recognize: motifs in fact capture information that's shared across many positions, as opposed to an individual instance, which is a fully defined sequence. And every motif can be thought of as a generative model. This is basically telling me how to generate more instances of the same class, just
like an HMM that we talked about earlier. So here I can sample from that probabilistic model: I can basically sample, and then half the time I'm gonna get an A and half the time a G, and so on and so forth for every position. OK, everybody with me? Great.

So how do we find these motifs? If I have a particular protein and I want to find out what it binds, what I could do is hybridize that protein to an array that contains many, many different versions of nucleotide sequence. I can just wash that protein through that array and then I could basically find all the DNA spots that light up, and that can basically tell me that the affinity of that protein has something to do with all of these k-mers. Then I can analyze all these k-mers together and infer a motif from that. Or I could basically do DNA immunoprecipitation with microarray detection: basically pull down my protein, then do a hybridization to either genomic sequences or random sequences, and then figure out where it hybridizes. Or I could actually do evolutionary rounds of selection, where for every protein I keep all of the DNA sequences that it recognizes, and then I keep filtering until I find all the true DNA matches for that particular protein. Or conversely, I can use DNA to select proteins that bind it, and then do multiple rounds of selection and find the best protein match to a particular instance.

So these are some experimental methods. Computationally, there are many ways of dealing with motif discovery. You can basically use expectation maximization or Gibbs sampling, which we're gonna look at today, to find motifs that are overrepresented in a set of sequences. Or you could enumerate all possible motif sequences with wildcards. Or you could basically look at the correlation between a particular motif and the positioning within a ChIP-seq peak where this motif occurs, and the
closer you get to the center of the peak, the higher your score, so that can actually allow you to use a cutoff-free approach rather than just saying, oh, it's between here and here, and it allows you to do something much smoother. So these are all based on a particular set of sequences that contain that motif. You could also use genome-wide conservation, basically looking at not just one instance of this CGG, spaced by eleven, CCG motif, but the entire genome and all of these instances, to basically say how frequently they are conserved. And you could also deal with specific protein domains using PBMs or SELEX, as we saw previously. Then again, motifs are not limited to DNA sequences: they can be splicing sequences, they can be protein sequences, they can be recurrent patterns at the physiological level.

So I like to think of these challenges in regulatory genomics as basically graphs going from each of these entities to the others. So what are these entities? On one hand you have the regulators, which can be a transcription factor or a microRNA or a protein-binding RNA or an RNA-binding protein. Then there's the motif, the specific sequence, and this can be at the DNA level or at the RNA level or the protein level. And then the target, which basically tells you not just the general sequence specificity of a motif but the specific instances where that actually occurs, the individual places in the genome.

To recognize the TFs, we can basically look at their homology, the homology of every protein to known transcription factors and known domains, basically discovering new transcription factors. We can use evolutionary signatures or experimental cloning to discover new microRNAs. To discover the motifs, we can use de novo discovery, as we're gonna see at the end of the lecture. To discover the targets, we can use these evolutionary footprints or these actual DNase footprints: we can digest everything that's not bound by the factor and then end up at these specific locations. And then, so this is
basically techniques that only focus on the regulator, only focus on the motif, or only focus on the target.

To go from the regulator to the motif, if you have a particular protein that you've now discovered and you'd like to know what it binds, that's where you could do a protein binding microarray experiment, or a DIP-chip or SELEX experiment, or a ChIP-seq experiment, and find all of the targets and computationally find common sequence patterns in those targets. If you have the motif and you'd like to find the regulator, that's a little harder: you basically have to use the motif, using a piece of DNA to pull down all the proteins that might be binding that DNA, and then do a mass spectrometry experiment to basically figure out what the corresponding proteins are. If you have the regulator and you'd like to find the targets, a ChIP-chip experiment or ChIP-seq experiment will basically give you the targets directly. And then from the targets you can infer the motif using enrichment, and from the targets you can find the regulator using, you know, network analysis, and from the motif you can find the targets using evolutionary signatures. Basically there are a lot of problems relating regulators to motifs to targets, in any of the directions, and if you have one you can find the other.

So now let's focus on exactly this arrow. I have a bunch of common targets of a particular protein. For example, I did a ChIP-chip experiment or ChIP-seq experiment and I found a bunch of intergenic regions that are all bound by the same protein. Alternatively, I may have just looked for a bunch of genes that are co-expressed: they turn on and they turn off at the same time, and that suggests that they have some common regulation and therefore they might contain a common motif. So let's now focus on how we find an enriched motif in a set of either correlated genes or co-bound regions. Everybody with me on this slide? Awesome. Any questions?

Yes, exactly, correct: we have to find, number one, for every gene, what are the regions of
regulation, and then from the regions go to the motifs. Absolutely right. So the epigenomics lecture can help you here. You can basically say, if I have a bunch of correlated genes, what are the putative regulatory regions? This is where I'm going to have to look. In yeast these are very near the gene; in higher, more complex species you have to search harder. Very good questions.

All right, so now let's look at this enrichment-based discovery. Given a set of co-regulated or functionally related genes or regions, can we find common motifs in their promoter regions? So basically I have a bunch of genes and I'd like to find common motifs within them. So what can I do? I can basically use sequence alignment to align the promoters to each other using some kind of local alignment technique. I can use expert knowledge of what the motifs might look like. I can find some kind of median string by enumeration, either sampling the motif or sampling every k-mer that I find in my sequences. And I could also start with evolutionarily conserved blocks in the upstream region. So these are some of the ideas of what I could do to find this common motif.

But fundamentally what I have is a problem whereby, if I knew the positions of those motif instances, I would be back to my earlier slide: all I'd need to do is simply count at every location, because I'd know exactly where they start. And if I knew their sequence pattern, then all I'd have to do is simply search for that sequence pattern in each of the regions to figure out their starting positions. But I don't know either their starting positions or their pattern. So what can I do when I want to know two things, neither of which I know, that are dependent on each other? Expectation maximization. I can basically iteratively assume I know one and then use it to find the other. So what I'm trying to do is effectively figure out, in these previously unaligned sequences, what are the starting and ending positions of that
shared motif, and what is the specific profile. And what EM does... and all my slides have been messed up for some reason, so all of these boxes have been resized. I fixed some of them right before the lecture, but not all of them.

All right, so basically, given the profile matrix, it is easy to find the starting position probabilities by simply scoring every single position. And then given the starting positions, it is easy to infer the matrix: I just simply count, at every position, how frequently each base appears. Raise your hands if you're with me on those two steps. Awesome. So that's effectively the expectation maximization procedure. Given the motif I can figure out the starting positions, and given the starting positions I can figure out the motif. The key idea is an iterative procedure for estimating both, given the uncertainty, and it is effectively a learning problem with a nuisance parameter, a hidden variable, which is the starting positions.

And that's where we go back to this table that I showed you before. We've seen this before with HMMs. The first time we saw it was with hidden Markov models: we basically said, hey, we don't know the path, we don't know the annotations, and we don't know the parameters, so what we're gonna do is assume some annotation, then use that to infer the parameters, and then go back and forth between them. So basically in the E step we were assigning every instance to a particular label probabilistically, or assigning labels to points. So this is fitting very nicely into that framework we saw before.

And now, in the motif discovery instance, the hidden label will be where the starting positions are, or alternatively what the motif definition is. And then greedily I could find the best motif-matching sequence, and that's probably the first thing you had in mind: if I have a particular sequence pattern, I'm just gonna search for it and see where it matches, and I'm gonna use that. And that's effectively the equivalent of Viterbi
training or k-means. The expectation maximization version, which is implemented in the MEME software, in fact does not ask what is the best position. It instead says, can I take that motif definition and search all possible positions, and weigh each of them by their probability of matching. And then what Gibbs sampling does is that it only uses one position, but it picks that position at random, not completely at random, but at random weighted by the motif match scores.

So then the basic iterative approach is that I have some length parameter, some training set of sequences, and some initial values for my motif, and then I iteratively re-estimate the starting positions given the motif, and then re-estimate the motif from the starting positions. And when that stops changing, because I've now found the optimal or near-optimal motif and the change is very small, I just return both my motif and the starts.

So how do we represent that? Well, if we have a fixed length W, we can represent the motif by a matrix of probabilities that basically tells you, for every position, what is the probability of each character. For example, if my motif is more or less CAG, then the first position is mostly C, the second position mostly A, the third position mostly G, but it's very probabilistic, so, you know, CAG is not that far off from GAG. And then that's my motif model. I also need the background, and the background could be a near-uniform background, which basically says every base is equally frequent. Or in a very GC-rich genome I'm gonna have a much higher frequency of G and C, and then recognizing GC-rich motifs will be harder because they'll be very similar to the background, and an AT-rich motif of the same information content would score higher because it would be more dissimilar from the background. So that's representing the motif.

How do we represent the starting positions? Well, for the starting positions we simply ask, at every position,
what is the probability with which the motif is starting at that position. We can represent that in a matrix Z_ij, which basically tells us, for every sequence i and every position j, what the probability of starting a motif instance is. Okay. So now, if I have my starting positions Z_ij, which basically tell me the probability of starting at every location for every sequence, and if I have my motif matrix, which basically tells me the probability of seeing each of the characters in every position of the motif, then I can now define mathematically what it means to go between them. Computing Z_ij, the matrix of instance probabilities, from the motif is very straightforward. At every position I evaluate the start probability by simply multiplying across the matrix: if I have GGACT, I go G, G, A, C, T, and I just multiply it across. That basically tells me the probability of the motif matching. But what I want is a likelihood ratio, so I also want to compare that with a background model, which can be a mononucleotide background model, or a dinucleotide or trinucleotide background model. I'm going to score that mono-, di-, or trinucleotide model at every starting position, and then come back with a likelihood ratio: how much more likely is it that this particular sequence was generated by the motif rather than by my background model? Raise your hands if you're with me. Any questions? Okay. So now it looks like I can score my Z_ij in a straightforward fashion. But once I've scored it, how do I select the locations from which I'm going to rebuild my motif? Well, I can use expectation maximization, where I choose all of the starts weighted by the distribution; I can use Gibbs sampling, where I take a single start for each sequence by sampling from that Z_ij; or I can use a greedy approach: what is the best starting position for every sequence, and then only choose that. So how do we now calculate Z_ij?
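A minimal sketch of the scoring step just described, written in Python for illustration only. It assumes a mononucleotide background, a uniform prior over start positions, and exactly one motif instance per sequence; the function name `start_probabilities` and all of the example numbers are hypothetical and do not come from the lecture slides.

```python
def start_probabilities(seq, motif, background):
    """Return the start-probability vector Z for one sequence.

    motif: list of dicts (one per motif position) mapping base -> probability.
    background: dict mapping base -> probability (mononucleotide model).
    Assumes a uniform prior over starts and one motif instance per sequence.
    """
    w = len(motif)
    ratios = []
    for j in range(len(seq) - w + 1):
        # Likelihood ratio of the window starting at j: motif vs. background.
        ratio = 1.0
        for k in range(w):
            base = seq[j + k]
            ratio *= motif[k][base] / background[base]
        ratios.append(ratio)
    # Normalize so the start probabilities sum to one across the sequence.
    total = sum(ratios)
    return [r / total for r in ratios]


# Hypothetical motif that is "more or less CAG", with a uniform background.
motif = [
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

z = start_probabilities("TTCAGTT", motif, background)
print([round(p, 3) for p in z])
```

With these made-up numbers, nearly all of the probability mass lands on the window that reads CAG. The bases outside the window are scored by the background under every candidate start, which is why only the window itself needs to be scored.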
So I basically have the probability of starting at a particular position, given my sequence and given my motif model, and this is the posterior probability of the motif starting at that position. But what I have is a generative model that basically tells me the probability of emitting a particular character given my motif at any one position. So what I'm going to do is use Bayes' rule, just like we have before: given the prior of having a motif start at that position, and given the evidence, which is the probability with which my generative model can generate the entire sequence, I can infer the posterior probability. Okay. At every iteration I can calculate Z_ij based on the previous motif definition, and we can calculate that very easily. Then, to obtain the total probability, we sum over all starting positions; P(X) is simply a total probability. And then we assume uniform priors, or we could assume some more interesting priors. For example, I can look at evolutionary conservation, look at the frequency with which every nucleotide is conserved, and then maybe use that as a prior: maybe a motif is more likely to start there given the conservation. So if I am scoring this particular motif at every position, then at the first position, which as we mentioned earlier is GCT, I have a score that's obtained by multiplying 0.3, 0.2, 0.1, and then the rest is scored by the background. At the second position it's CTG, so I read C, T, G and multiply across, so 0.4, 0.2, 0.6, and then the first position is generated from the background, and all the other positions are generated from the background as well. With me so far? Great. Then I can do the same at each following start, and so on. Okay, so then that gives me the Z vector. And then the denominator is simply P(X_i): when the motif position is known, I can just simply calculate all of the scores before the motif and all of the scores after the motif using the background
model in each case, and then for every position within the motif of length W I basically have the particular score obtained by reading off the matrix. Okay, so I can basically simplify this, since I have the exact same background score before and after: I can combine all of that together and then simply score the motif itself. So I can speed up my computation by only computing that and assuming that the rest will be the same. So that's the E-step, of basically creating the Z_ij vector from the motif matrix. What about the M-step, of inferring the motif from those positions? As I mentioned earlier, there are three different approaches: the greedy approach, the Gibbs sampling approach, and the expectation maximization approach. Gibbs sampling only picks one at random, greedy always picks the maximum, and expectation maximization averages everything, weighted according to probability. So in this particular example greedy will always pick the maximum. In this particular case all methods more or less agree, because there's really only one maximum. But in this particular case EM will basically use both of these to estimate. Gibbs sampling will only pick one or the other at random: it will pick this one with 2/3 probability, for example, and then only use that one, and it will pick this one with 1/3 probability and only use that one, even though it has a lower probability. And greedy will always pick the max. In this particular case, greedy will ignore most of the probability mass, Gibbs sampling will rapidly converge to some choice, and EM will basically average over the entire sequence. Is the difference between Gibbs, EM, and greedy clear? Raise your hands. Awesome. Now, given this annotation, we can basically estimate the motif using the counts divided by the total number of times I saw this particular sequence, and then I can add some pseudocount each time to avoid overfitting. Then, from this instance and that instance and that instance together, I can estimate the probability
of my motif. I can add some pseudocount and then end up with that motif definition, either using an average of all of them weighted according to their probability, sampling one of them, or selecting the best one. Okay, so what EM will do is iteratively refine the motif and the starting positions of the motif, converging in a provable fashion, always to some local maximum but not always to the global maximum. So to deal with that, we basically initialize at multiple locations and then hope that one of them will actually converge all the way to the true maximum. Okay, who's with me on EM and Gibbs sampling and all of that? Great. So basically this was going from the motif to the positions, and back and forth.

So I'm going to now switch gears. We basically now know how to find a set of common motifs from a set of sequences that were co-regulated or co-expressed or co-bound. What we can do now is look at a completely orthogonal approach for discovering motifs, and this is a de novo approach where we don't need to know the transcription factor in advance. Basically, every region-centric approach needs to have a set of regions that were either bound by the same transcription factor, or co-expressed, or co-activated in the same condition, and so on and so forth. But these TF-centric approaches require knowing the transcription factor, and they take a lot of time and money, basically to generate an experiment for every single regulator. There are thousands of transcriptional regulators in human, and not every one of them has been profiled experimentally. What de novo discovery tries to do is effectively use evolutionary conservation in a completely unbiased fashion to discover a catalog of regulatory motifs without actually knowing what transcription factor they come from. So the key insight is that if I look in a particular location where this factor is bound, I will find one instance of that motif, and it will
be evolutionarily conserved, but so will the surrounding bases. And then if I look at many, many instances, we could join them together across the genome. So basically what we're going to do is increase power by testing the conservation in many regions. If I have CGG...CCG in many, many different places across the genome, together they will gain more power. So the idea is that I'm going to be testing genome-wide conservation, taking all instances of a particular k-mer and then asking how often that instance is conserved. Those k-mers that are conserved more than others will simply get a higher score, a higher genome-wide conservation score, and they will perhaps be more likely to be regulatory. So if we try this with Gal4, for example, that I mentioned earlier, so CGG spaced by 11 nucleotides from CCG, we can now start measuring each of these scores. Yeah, that's exactly right. So basically the key intuition is that in one instance, evolution basically tells us that this motif seems to matter more, because three of its instances are conserved in this particular region. And if we extrapolate to the whole genome, if it is preferentially conserved at every single location, even if it's not always perfectly conserved, it just needs to be preferentially conserved, then together that signal accumulates. Correct, so we're going to be looking outside genes. What's really interesting, and I'm going to get back to your question in a second, is that regulatory motifs within genes are actually less conserved than random nucleotides. Evolution is actually somehow disfavoring motifs that occur within genes, and you can see why: a transcription factor binding there may prevent transcription, or interact in negative ways with other stuff that's around. So if we do that for exactly the Gal4 motif, CGG and CCG spaced by 11 nucleotides, we find that in all
intergenic regions it is conserved 13% of the time, compared to only 2% expected using random k-mer controls. If we ask what the ratio of intergenic conservation to coding conservation is, we find that Gal4 is preferentially conserved in intergenic regions versus coding regions, whereas random controls are of course preferentially conserved in coding regions versus intergenic regions. And you can see here that not only is it more conserved in intergenic regions, but it is actually less conserved in coding regions. Then lastly, you can test upstream versus downstream. In yeast the genes are very close to each other, so you can ask about intergenic regions that are upstream of both flanking genes, and intergenic regions that are downstream of both flanking genes. What we find is that Gal4 is preferentially conserved in these divergent regions and almost never conserved in these convergent regions, whereas random k-mers actually have the same evolutionary conservation in both. So that basically gives us evolutionary signatures for discovering these motifs. I can simply enumerate k-mers and, according to each of these signatures, find k-mers that are highly conserved in intergenic regions, k-mers that are preferentially conserved in intergenic versus coding regions, and k-mers that are preferentially conserved upstream versus downstream. All right, so let's try that. We're just going to simply enumerate k-mers. So this is what we find if we do test one, intergenic conservation, and we ask, for every k-mer, which is a dot here, how often does it appear and how often is it conserved. We find that there is quite a strong diagonal here, with most k-mers being conserved at roughly the same rate: the more instances you have, the more instances you will have conserved. But there are some k-mers, shown in blue here, that are much more conserved than what you would expect given the number of instances, and one of them is in fact the Gal4 motif,
CGG spaced by eleven nucleotides from CCG. But there are many, many more. So I can simply use this biological insight about the preferential evolutionary conservation of regulatory motifs to enumerate every k-mer, and then out will pop things that are preferentially conserved and therefore more likely to be motifs. Who thinks this is kind of cool? Awesome. So I did not need to know what the transcription factor is. I just simply asked which k-mers are preferentially conserved. Let's try the intergenic versus coding. Yeah? I'm just simply counting how many times CGG spaced by eleven nucleotides from CCG occurs in the yeast genome, and I find that it occurs about 300 times, of which about 50 are conserved. That's a gap. So basically this CGG...CCG pattern is what I have here: CGG, it doesn't matter what you have in the middle, then CCG. Does that make sense? So what I'm testing for conservation is only these six positions; the middle positions I don't care about. That basically means that I search for CGG...CCG in the yeast genome, in Saccharomyces cerevisiae, and then I ask: is it conserved in that other species, is it conserved in that other species, is it conserved in that other species? And I find that only a small fraction of them are actually conserved in the other species. Does that answer your question? Great. Yeah? So here I have Saccharomyces cerevisiae, paradoxus, mikatae, and bayanus, and for every single instance, I will search all the instances and I will count how many of them are actually conserved. I have a whole-genome alignment of all these species, so I can simply count how many are conserved. Yeah? So we're going to have a lecture on comparative genomics, and there we're going to talk about how branch length matters. Why? Because if I align a thousand Saccharomyces cerevisiae genomes, any one position has only one chance in a thousand of being different, and then if I pile them all up, every position will
have some small chance of being mutated. If I take much more distantly related species, then every position has many more chances of being mutated. It's a question of power: the more branch length I have, the more power. Therefore taking different species gives me a lot more branch length than a relatively huge sequencing experiment. I could do the same with maybe a million yeast genomes, or maybe a thousand, but it really depends on how divergent they are from each other versus how divergent the species are. There's a whole slide on that in the comparative genomics section. All right, everybody with me on the first slide? Yeah, the gray dots. So basically all the dots were initially gray, but afterwards I asked how unlikely it is to find that many conserved ones given how many I had observed, and that gave me a probability of being over-conserved, and these points are colored blue. Does that make sense? All right, so we can do the same for intergenic versus coding: what is the coding conservation, what is the intergenic conservation? What we find is that most k-mers have more conservation in genes, as you would expect, but there are some k-mers that are preferentially conserved in intergenic regions. Maybe these are my motifs. And then when I looked upstream versus downstream, I found that again most k-mers are equally conserved, but Gal4 is much more conserved upstream. And that basically says, wow, I can basically select those motifs that are upstream controllers. And then the elephant in the room is that, oh, there are a lot of motifs that are preferentially conserved downstream. That's kind of cool. It basically says that there's a whole other class of regulatory elements that are acting downstream of transcription, perhaps in termination, structure, and all kinds of other cool stuff. So we can basically start enumerating. We can take these motif seeds, three-gap-three, and then search them and score them, and then what we can do is actually expand them. We can basically allow for degenerate bases to
start filling in the surrounding positions by basically asking, for all those conserved instances, what differentiates them, and then fill in the unspecified bases surrounding the seed and progressively improve the motif so that I get a higher conservation score. Ultimately we're going to cluster these motifs: some three-gap-three seeds will start here, others will start there, and putting them all together we're going to collapse them into a small number of final consensus motifs. Yeah? Yeah, we're going to get to that. So this is one approach. This is the approach that I took in my PhD thesis, so this was published about 15 years ago, and I feel just as young. I may not be just as young. And then there's another approach, which Amos Tanay took (he's now a professor), where he basically said: instead of looking at conservation, I'm going to see how often motifs are exchanged with each other. How often do I have one k-mer in one species and a different k-mer in the other species? These exchange patterns then define families of k-mers that are frequently exchanged. That approach actually allowed him to discover motifs in just as easy a fashion, and even to discover the degeneracies of these motifs by walking along this graph. So then your question is how many of those are experimentally validated, how many can you discover. The first thing you can do is basically ask how many in fact match known motifs. There's decades of research on discovering motifs, and indeed a large fraction of them match known motifs. You could also ask how many of them are novel motifs that actually make biological sense, and then you can use all kinds of techniques for finding biases in those novel motifs. Basically, if we have a bunch of motifs that we don't know the function of, we can start asking: are they near genes that are co-expressed, are they enriched in regions that are bound by the same regulator, or regions that are showing the same chromatin state, and so on and so
forth. That allowed us to actually assign putative functions to many more motifs that were not previously known. So it not only captured a lot of the known motifs, it also discovered motifs that were not previously known. Everybody with me here? So basically, just like I can discover motifs that are in the same regions based on this iterative sampling, I can discover motifs completely de novo by searching for conserved k-mers across the genome, and then collapsing these k-mers together using either overlap or these graphs, and then end up with a small set of motifs, which very frequently have in fact recaptured known regulatory motifs. So this is the actual paper, and here's the slide that's motivating the approach. So this is what was discovered. You can basically see, for every known motif, and that was exactly your question, you basically asked, hey, are known motifs conserved? What you can see here is that Abf1, the very motif that I showed you on the very first slide, in fact showed one of the strongest conservation scores. You can see here the consensus sequence, and it's actually extremely strongly conserved. And then all of these other motifs are in fact very strongly conserved. Gal4 is down here; it's not the strongest, but it's one of the strongest, and you saw that there were all these dots up there that were much more conserved; this is what they are. Conversely, you can start from discovered motifs and then ask, are these previously known? And in fact the strongest motif is Abf1, and so on and so forth. So it's kind of cool, right? You can basically see that you can recover the known motifs and you can start assigning functions to novel motifs. So these are the novel motifs that we found, sorting solely based on what was discovered, regardless of whether it was known or not. And then I'm basically asking, what is the
category that was the most enriched for that motif. And indeed the top one was Abf1, and then Reb1, and then some expression cluster, and then some Gene Ontology term, and so on and so forth. Many of those matched known motifs, and some matched novel motifs: this one is mitochondrial, downstream; this one is filamentation; this one has a variable gap; and so forth. Who thinks this is kind of cool? Awesome. Yay, you're so nice. All right, so this is now allowing us to discover these motifs de novo, but what we would like to do is also find their individual instances. So this is basically now going from the regulators, knowing the motifs, to the instances. To do this we normally use experimental methods such as ChIP-chip, but what we wanted to do is use a computational way of finding them. The challenge, of course, in target identification is that some instances will be perfectly conserved, other instances will move around and will preserve their function even though they're not at the same exact location, and others will simply not be detected because some sequences are just not captured in the other species. So what we need to do is figure out a way to account for this uncertainty: to search for a particular motif instance, such as this one here, in a bunch of species, and then wherever it occurs, whether there's a gap in the alignment or not, it will basically tell me: yes, this motif is found when I remove all the gaps and realign the sequences locally and allow movement, and here's the total branch length over which it is conserved. Okay, and that we're now going to transform into a measure of significance. So we're going to measure the branch length score, simply adding up all the branches over which it's conserved, and we're going to be allowing for mutations that preserve the motif match, and we're going to be allowing for movement of the motif. And if there are some small branches that are
missing, it's not a problem; that's going to be accounted for. So what we'd like to do now is translate this branch length score into a confidence score, and to do this we need a background model. We need to know how much branch length I would have expected by chance for a motif of that type. Because motifs vary greatly in their composition, in their length, in their information content, and in their k-mer composition, we're going to be asking how frequently motifs of that type occur. So this is the number of instances that are found at every branch length score. What I basically see is that at a very, very strong branch length score there are very few instances, and as I reduce the threshold of scoring quality, if you wish, there are more and more instances. But I need to compare this with expectation. What I can do is use motif-specific shuffled control motifs to determine the expected number of instances at every branch length score by chance alone, or due to non-motif conservation, and then I'm going to compute the confidence score from the ratio of the two. The further I go to the right, the higher the fraction of motif instances at that branch length score that come from my true signal, but as I go to lower and lower scores the noise increases. So at a particular cutoff, say a score of 0.9, I have 75% of my instances at that threshold coming from true signal and only 25% coming from noise, so that branch length score I will translate to 75% confidence. Everybody with me on this translation? Raise your hands if you're following. Awesome. So that's how we're going to do it: we produce control motifs, hundreds of shuffles, then filter the control motifs based on similarity to known motifs, cluster them, and end up with a family of control motifs that we're going to be using for testing. We're going to compute enrichments in the same way, by randomizing
our regions. When we do that, what we find is that as I increase the threshold of significance, for transcription factor motifs I end up selecting instances in promoter regions, but for microRNA motifs I end up selecting instances in 3' UTRs. So that basically means that I am actually selecting functional instances. Because of this bias, you can see that initially they all start at the same frequency, and that's simply what fraction of the genome is covered by each of these classes; as I increase the confidence, for TF motifs it's selecting promoters, and for microRNAs it's selecting 3' UTRs. The fact that they're different from each other shows that it captured actual biology. And then we can actually start validating these motif instances. We found a bunch of instances; how do we validate that they're real? What we can do for microRNA motifs is ask if there's a bias in their strand, because an RNA molecule has only one strand, so if I search the opposite strand you would expect it to not work. And indeed, for TF motifs both strands seem to work equally well, but for microRNA motifs, if I increase the confidence threshold, I end up finding them preferentially on only one strand. I can also ask how often they are conserved in other species, how often they are bound in chromatin immunoprecipitation experiments, and so forth. What we find is that the validation rate increases with the confidence: as I increase the confidence, more and more of them validate. So that's basically allowing me to now say that indeed this approach truly finds de novo motif occurrences. So that's basically now discovering motifs in yet another way, and then identifying their instances using evolutionary conservation.

What I'd like to do now is switch gears again and talk about yet a third approach, for dissecting the regulatory potential of these regions, and this is now focusing on the specific instances of these regions. So as we heard in the epigenomics lecture,
there are many, many different regions that have signatures of enhancers or signatures of promoters, like DNA accessibility, H3K27 acetylation, H3K4 monomethylation, and H3K4 trimethylation. So we can use these epigenomic signatures to predict candidate regulatory regions. But how do I know that these candidate regulatory regions are actually functional? What I need to do is take that chunk of DNA and put it in a reporter construct upstream of a luciferase gene, which basically shines green, and then see how often the cell that receives this construct, where the luciferase is downstream of my predicted enhancer, actually drives expression. How often does it actually function as an enhancer? Does everybody see how this construct works? I'm going to basically take a piece of DNA that has the sequence that I want to test, and then ask whether, when I put that sequence in front of a reporter gene, that reporter gene gets turned on. Who's with me on this one? Raise your hands. Awesome, great. So when this was done at scale, thousands of elements were tested, and only about one third, or about one half, were found to function well and truly drive enhancer activity. So how do we test this? You could do 3,000 experiments, or you could develop new technologies for doing this systematically. The technology that I want to talk about today is the massively parallel reporter assay. What that allows you to do is, instead of putting in a single enhancer, I'm going to put 10,000 enhancers in one experiment. How? By synthesizing them on a microarray. You can order a microarray, generate that microarray, cleave off the little fragments from the microarray, and end up with 10,000 sequences. Everybody with me on how I can synthesize them? Just like you can print a microarray, you just print all of these as different spots on the microarray, and they're 145-base-pair spots. Okay, I've now synthesized them. Then I insert each of them into a different reporter
construct and I add a barcode. When I add a barcode, I can then PCR-amplify the common sequence which is flanking this barcode and simply count barcodes. And when I count barcodes, I will know how often I observed the barcode corresponding to each of the enhancer regions that I'm testing. Who's with me on this one? Raise your hands. Awesome. So when I do this systematically, I basically end up with many different kinds of approaches, MPRA, CRE-seq, STARR-seq, etc., for testing tens of thousands of these elements at once. So you have to deal with a lot of experimental artifacts? Indeed, yes, you have to deal with them. So here's this approach, specifically developed by Dr. Mikkelsen, that we were involved in, that is basically synthesizing all of these different sequences, inserting them, using a different barcode to identify each of them, transfecting them into millions of cells, and then measuring the abundance of each barcode. You can use this, for example, for testing the effect of a single nucleotide polymorphism on expression, and this allows you to now evaluate 10,000 variants in a single experiment. Alternatively, what I could do is use a different version with a single nucleotide change at every one position of the same construct. So here I'm taking the CRE promoter and here I'm taking the IFN-beta promoter, and for every single one of them we can basically ask what the effect is of testing that entire sequence minus this particular nucleotide, where I'm going to change every position into an A, into a C, into a G, or into a T. What it shows me is that the activity changes for these particular locations, and those particular locations correspond to exactly where the known motifs were. So by using 10,000 enhancer constructs in one experiment, I can evaluate the impact of mutating every single nucleotide. Who thinks this is kind of cool? Awesome. So I can do this here, I can do this in another
sequence, every single time inferring this. But the challenge is that we need hundreds of constructs for testing a single region. Can we test thousands of regions jointly? The answer is yes. What we could do is use our motif predictions to test thousands of regions, perturbing only that one motif in every one of these regions. Sounds good, right? So what we're going to do is use the wild-type sequence, shuffle the motif, cut it out, change a single nucleotide, and so forth, and then we're going to build a different reporter construct for each of those. How do we select those? We're basically looking at the expression of the regulator and the activity in a specific cell type, and then we're using that to predict who the activators and repressors are in every single cell type. And when we do that, we basically find that if I take a predicted activator like HNF4 and its motif, then when I test the wild-type motif in the correct cell type, I find that indeed it shows expression very well. If I scramble, remove, or maximally decrease that motif, then it abolishes reporter activity. If I change it ever so slightly, sometimes I get a better and sometimes a worse score, but I maintain activity. If I do a random change that perturbs the motif, I lose activity. So every single measurement here is a different barcode used for the same construct; we're doing a lot of replicates, enabling us to now infer the activity of every one. Okay, so I can take this a step further, and instead of testing a single sequence I can offset things by 50 nucleotides at a time. And when I do that, I basically find that sometimes, as I get closer to the predicted site, there is an increase in activity. And in fact I find regions where none of these offsets work, and suddenly this offset and that offset work very well, and then this offset doesn't work anymore. And when I ask what the difference is between them, there's a 30-base-pair segment which is unique to the regions that worked the best, and that segment indeed contains
the motif for the HNF4 factor, which is active in that cell type. So that basically shows that by tiling regions and offsetting them, I can actually figure out, perhaps in high resolution, where the motifs are. Who's with me on this concept? Awesome. So why do such coarse increments? Let's instead do five-nucleotide increments. So now we're going to be testing these regions at five-base-pair offsets and inferring computationally from these offsets what the likely activity is for every base-pair interval in the genome. When I do that, I find regions that just simply don't work, don't work, don't work, and then, boom, work. The computational algorithm basically says, aha, the specific sequences that work are the ones here, and it allows us to now start predicting in high resolution where the driver nucleotides are for every one of those positions. Everybody with me on this? The last thing that I'm going to talk about is a little crazy. We basically said, why test 15,000 regions at a time if you could test seven million fragments at a time? So what we did is, instead of generating a microarray, we simply cut out accessible regions using an ATAC-seq experiment, the same thing that we learned about in the epigenomics lecture. You're cutting out all of the DNA fragments that are coming from accessible regions, and you're inserting them not upstream in your construct but downstream in the construct. That means that when they drive expression, they will drive their own transcription, and that also means that I can actually use them as their own barcodes. Everybody with me on this? So these are self-transcribing constructs, using a technique called STARR-seq that Alex Stark and his lab developed; he was previously a postdoc in my lab. Using this, we basically tiled all of these accessible regions and were able to infer in high resolution exactly where the activity is coming from, using another deconvolution algorithm, SHARPR, that can work at any variable offset. So this is where I'll stop, and basically recap by
saying that motifs are at the core of all of these regulatory genomics assays, and there are techniques for finding them based on their over-representation, for discovering them de novo based on their evolutionary conservation, for finding their individual instances using this branch length score, and then for dissecting these regulatory regions de novo in high resolution.
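To make the recap concrete, here is a hedged sketch of the motif re-estimation step from the first part of the lecture, showing how the greedy, Gibbs sampling, and EM choices differ only in how the start positions are weighted before counting. The names `choose_weights` and `update_motif`, the pseudocount of 1, and the toy sequences and start probabilities are all assumptions made for illustration; this is not the lecturer's implementation, and it is not the MEME code.

```python
import random

BASES = "ACGT"


def choose_weights(z, strategy, rng):
    """Turn start probabilities z into per-start weights for counting."""
    if strategy == "em":
        # EM: use every start, weighted by its probability.
        return list(z)
    if strategy == "greedy":
        # Greedy: use only the single most probable start.
        best = max(range(len(z)), key=lambda j: z[j])
    elif strategy == "gibbs":
        # Gibbs sampling: draw one start at random, in proportion to z.
        best = rng.choices(range(len(z)), weights=z, k=1)[0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [1.0 if j == best else 0.0 for j in range(len(z))]


def update_motif(seqs, zs, w, strategy="em", pseudocount=1.0, seed=0):
    """Re-estimate a width-w motif from weighted base counts plus pseudocounts."""
    rng = random.Random(seed)
    counts = [{b: pseudocount for b in BASES} for _ in range(w)]
    for seq, z in zip(seqs, zs):
        weights = choose_weights(z, strategy, rng)
        for j, weight in enumerate(weights):
            for k in range(w):
                counts[k][seq[j + k]] += weight
    # Normalize each column of counts into probabilities.
    motif = []
    for col in counts:
        total = sum(col.values())
        motif.append({b: col[b] / total for b in BASES})
    return motif


# Toy input: two sequences, each with made-up start probabilities for w = 3.
seqs = ["ACAGT", "CAGTT"]
zs = [[0.1, 0.8, 0.1], [0.8, 0.1, 0.1]]

greedy_motif = update_motif(seqs, zs, w=3, strategy="greedy")
em_motif = update_motif(seqs, zs, w=3, strategy="em")
gibbs_motif = update_motif(seqs, zs, w=3, strategy="gibbs")
print(greedy_motif[0])
print(em_motif[0])
```

In a full loop, this update would alternate with a re-scoring of the start positions until the motif stops changing, ideally restarted from several initializations because convergence is only to a local maximum.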
