Welcome to another installment of the biodiversity informatics training curriculum. This module treats thresholding ecological niche models. Essentially, this is a part of the process that doesn’t generally take place within the modeling algorithm, so this is a step where, ideally, you as the user will interact directly with the model output, and you will put some special thought into what it means to decide where the break point between presence and absence is in that output.

Most model outputs are, to some degree, continuous, at least for most of the models being used in recent years. So this is just a representation in a map, where we go from some low value … and you can see these contours … up to high values, and we have two areas of high values separated by an area of low values. In reality, this is likely to be continuous … we might have values of 0, 0.1, 0.5, 1, but essentially a real-number value for every pixel in the grid that is your model output. So the question for you is, at what level … maybe it’s from here on down, or from here on down … should you decide that any higher value is presence, or suitability, and any lower value is absence, or unsuitability?

If you look in the literature, you find numerous published papers that treat this issue … essentially, how do you pick a threshold? So here’s a paper, “Selecting thresholds of occurrence in the prediction of species’ distributions,” “A comparison of the performance of threshold criteria for binary classification,” “Threshold criteria for conversion of probability of species’ presence to either-or presence-absence.” These papers go into lots of detail and compare numerous options. If you read them … they are included in the literature attached to this module … you’ll see that they assess a whole bunch of options for making this decision about how to go from continuous to binary.
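To make that continuous-to-binary step concrete, here is a minimal sketch in Python, assuming the model output is simply a grid of suitability values held in a NumPy array; the grid values and the threshold of 0.5 are purely illustrative, not recommendations:

```python
import numpy as np

# Hypothetical continuous model output: one suitability value per pixel.
suitability = np.array([
    [0.0, 0.1, 0.5, 0.2],
    [0.1, 0.8, 1.0, 0.1],
    [0.0, 0.2, 0.9, 0.7],
])

# The decision this module is about: where is the break point?
threshold = 0.5

# Everything at or above the threshold becomes presence (1); below, absence (0).
binary = (suitability >= threshold).astype(int)
```

The entire binarization is one comparison; the hard part, as the rest of this module argues, is choosing `threshold` in a principled way.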
For example, you can take the criterion that maximizes the kappa value, or you can take the value at which sensitivity and specificity are equal. And basically, I want you to read these papers, but I think that we want to set them aside, because there’s a clear answer to what you should be doing in this process.

So, we need to go back and look at the BAM diagram, and in fact there’s another module of this curriculum about the BAM diagram, so I am just going to give you a very quick review of it. Essentially, in the BAM diagram, we imagine each species as having Abiotic requirements, Biotic requirements, and some area of Mobility. These are the areas where the abiotic requirements are correct for the species; these are the areas where any requirements for interactions with other species that are required for persistence are fulfilled; and then M is the area that is accessible to the species. So, we should be seeing occurrences of our species only in this area. Call this area G-O, the occupied geographic distribution.

Now, there are some circumstances where our species can be present outside of G-O. For example, we could have misidentifications, or we could have mis-georeferences that produce presences that fall outside. So we can’t trust ALL of the points to be in G-O only. But we should imagine that those errors will be relatively rare; we will come back to that when we talk about the E parameter.

Now, let’s look at absence data. The species is not present outside of all of these areas. The species is not present here because it can’t reach that site. The species is not present here because some biotic ‘interactor’ is not present. The species is not present here because the abiotic conditions are wrong. The species is not present here for both abiotic and biotic reasons. So basically, all through this area outside of G-O, we have absences.
And indeed we can even have absences within G-O: a local extinction, or a population that’s been exterminated if we are talking about disease vectors. Notice that we can have absences essentially anywhere in this diagram. When we’re dealing with niche models, we are essentially asking about presence or absence of suitable conditions, and those suitable conditions are only within A; the only portion of those suitable areas where A is ‘right’ that the species has any experience with is this area, and then, to the extent that biotic interactions are also limiting, we cut down to just G-O. So, that’s essentially saying that these absence points that are elsewhere are irrelevant. And indeed you can have absences that occur under perfectly suitable conditions, simply because the species has no access to that site.

I tell you this because I want you to think about the relative weight that you should be giving to presence data versus absence data. We have a few opportunities for presence data to be misleading, but we have a lot of opportunities for absence data to be misleading.

Now, let’s take this one step further … we’re going to look at a probability tree, where each branch represents essentially a decision. At the first branch, I’m going to ask whether the BAM conditions are satisfied. We have a yes, and we have a no. A second decision can be whether the site was ever sampled. Again, yes or no. A third decision can be whether the species was detected. Some species are very hard to detect, and so, even if the site was sampled, researchers may have detected it … or not. Yet another one … was the data record available? There may be a record, but it may be sitting in a museum somewhere; it may be sitting in a file cabinet somewhere.
So, even if the site was sampled and the species was detected, it may well be that the data were not available. These are just three examples … after the BAM … three examples of where alternatives exist. On any of these turns in the negative direction, we have suitable conditions, and yet the species is not present, and that means that it is essentially recorded as absent. So, what I want you to see is that we have only a few conditions under which presence can occur, but we have lots and lots of alternatives under which absence can occur. What I want you to think about is that presence data are very reliable, and absence data are less reliable.

The corresponding errors are omission error, which is leaving out a known presence, and commission error, which is including a supposed absence within the prediction of presence. And so, what I would like you to do is think that we need to prioritize omission error over commission error. An error of omission is a much more serious error than an error of commission.

If we keep these lessons in mind, then we can go back to our question of thresholding in a very powerful way. So, returning to our question of thresholding, and bearing in mind this lesson, there are three characteristics that we should demand from any thresholding mechanism. First, we need to spend much more time worrying about omission error than commission error. Second, we need to accept commission to avoid omission. And third, we need to accept only certain amounts of omission. I am going to explain each of these points. The first is a very general point … I hope that I have already convinced you that omission error is more serious than commission error.
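The probability tree above can be put into numbers. This is a toy sketch with assumed branch probabilities (none of these values come from real data); it only illustrates the structural point that a presence record requires a “yes” at every branch, whereas an apparent absence results from any single “no”:

```python
# Assumed, purely illustrative probabilities for each branch of the tree.
p_bam_ok    = 0.30  # BAM conditions satisfied at the site
p_sampled   = 0.20  # the site was ever sampled
p_detected  = 0.50  # the species was detected, given sampling
p_available = 0.70  # the record was made available, not left in a file cabinet

# A presence record requires a "yes" at every single branch ...
p_presence_record = p_bam_ok * p_sampled * p_detected * p_available

# ... whereas a recorded "absence" can arise from any one "no",
# even at a site whose conditions are perfectly suitable.
p_apparent_absence = 1.0 - p_presence_record
```

Under these assumed numbers, only about 2% of suitable, reachable sites would ever yield a presence record; this multiplicative structure is why recorded absences carry so much less evidential weight than recorded presences.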
It means that, in any decision we make about thresholding, if we’re balancing between accepting more area and leaving something out, accepting more area is OK … it’s OK if these thresholding decisions give you a broad area, because we need to avoid omission, which is the more serious error.

The last point requires just a little bit of explanation. If our data were perfect, if our data had no error, then every presence point would document an existing population of the species, and our model should be forced to include every single presence point. But if you’ve worked with data sets in any detail before, you know that most data sets have error. And so, we have to ask … well, how much error is present in our data set? As we prepare our data for analysis … and there are other modules of this curriculum that deal with data cleaning and the details of handling and preparing data … we have to think about how much error is inherent in those data. Some data sets will be very clean … maybe it’s a data set that was collected by the expert in that taxon, with the sites documented with a GPS unit and the identifications checked very carefully. Or, at the other end of the spectrum, it may be data harvested online and not particularly error-checked, and there may be some proportion of the data that are erroneous.

So, we summarize that likely proportion of presence data that can be misleading in terms of sites of occurrence as “E.” It’s simply the proportion of the occurrence data for the species that you expect may hold errors. E in the best cases will be zero, and in the worst cases maybe 5-10%. How to assess error in presence data is the topic for another teaching module, but it is something you have to think about when you are developing these thresholding approaches.
E is, to repeat, the proportion of your occurrence data that are likely or expected to include some error that could, essentially, move one of these points out of the appropriate area and into a less suitable area, erroneously.

Now, we can finally talk about thresholding. Let’s consider two possibilities. If E equals zero … again, we are NOT considering all of the possible techniques for thresholding … we’re just considering the one that fits well with our conceptual framework, which is that omission error is worse than commission error. OK? So, Pearson et al., in 2007, offered the least training presence thresholding approach, which I am going to term the “threshold that includes 100% of the training data.” If we had trained this model using these red points, the lowest level of suitability accorded to any of the training points is this one, and so, in that situation, we would term all of this area as suitable. That’s when E equals zero, and so we use T equals 100.

Now, what happens if we have some error? For example, maybe this point really belongs here. If we have E greater than zero, which is to say, some error in our data set, then we need to ignore some points, and so what we can do is modify T-100 to be T-(100-minus-E). OK? With this idea, if E = 0, you are looking for T-100 … the threshold that includes 100% of the occurrence data that were used to train the model. But if E were 5%, then we’d be looking for T-95. In that case, we might be discounting this point, and then the next-highest suitability value is at THIS level, and so we’re going to take all of this area as being present, which is this. So that’s the idea of essentially ignoring the points that are most likely to reflect the effects of error.

Just to make this crystal clear, let’s look at this in a one-dimensional view.
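Before moving to that one-dimensional view, the T-100 and T-(100-E) rules can be sketched in code. This is a minimal illustration under stated assumptions, not an implementation from Pearson et al. (2007); the function name and the suitability values are mine:

```python
import numpy as np

def threshold_T(train_suitability, E=0.0):
    """Threshold that includes (100 - E)% of the training presences.

    train_suitability: suitability values the model assigns to the
    occurrence points used to train it. E is the percentage of those
    records expected to be erroneous, decided a priori.
    """
    vals = np.sort(np.asarray(train_suitability, dtype=float))
    # Number of points we are allowed to ignore as likely errors.
    n_drop = int(np.floor(len(vals) * E / 100.0))
    # The lowest suitability among the points we keep is the threshold.
    return vals[n_drop]

# Hypothetical suitabilities at 20 training points, one suspiciously low.
suit = [0.05] + [0.60 + 0.02 * i for i in range(19)]

t100 = threshold_T(suit, E=0)  # include every point: threshold is 0.05
t95 = threshold_T(suit, E=5)   # ignore the lowest 5%: threshold is 0.60
```

Note that the threshold is always an observed suitability value at one of the training points, and that E must be fixed before looking at which points would be dropped, not after.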
So, we have our landscape, and we have our suitability values, and … again, we’re tilting this surface flat … maybe we have one area of suitability and then another area of suitability … that and that. Now, our occurrence points fall like this. So, again, you are looking at this map from the side, and those dots are now these occurrence points. We can say … well … our T-100 approach is going to impose a threshold like that, and our predicted area of distribution is going to be something like this across our landscape.

But … and this is the idea of this point, or maybe even a point out here … imagine that we have something that is just erroneous … this one. If we use a T-100 approach, or a least training presence thresholding approach, that erroneous point pushes our suitability threshold down to here, and we then have to classify the whole region as being suitable. But hopefully we have cleaned our data and assessed our data and played with our data enough that we know that there’s some error in there, so we’d have a nonzero value of E. In that case … and obviously we have to decide on the value of the E parameter a priori, not after the fact … we have a justification for removing this point and going back to our original suitability level.

So that’s a very simple way of thresholding ecological niche models, or species distribution models if you prefer that terminology, and it has great advantages. It’s not the typical approach, as you will see in the readings that I’ve offered you for this module. Rather, it’s based on the contrast between the two types of error, omission and commission; it’s based on prioritizing omission error, and essentially doing everything we can to avoid it.
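That one-dimensional picture can also be put into numbers. In this sketch (all values are assumed for illustration), a single erroneous low point drags the minimum-based threshold down and inflates the predicted area, while dropping that one point, as a nonzero E permits, recovers the tighter prediction:

```python
import numpy as np

# A transect viewed "from the side": two humps of suitable conditions.
x = np.linspace(0.0, 10.0, 200)
landscape = np.exp(-(x - 3.0) ** 2) + np.exp(-(x - 7.0) ** 2)

# Suitability at five occurrence points; the 0.02 one is erroneous.
occ_suit = np.array([0.95, 0.80, 0.70, 0.60, 0.02])

t100 = occ_suit.min()        # T-100: the bad point drags the threshold down
t95 = np.sort(occ_suit)[1]   # with E nonzero, drop the single lowest point

# Fraction of the transect classified as "present" under each threshold.
area_t100 = float(np.mean(landscape >= t100))
area_t95 = float(np.mean(landscape >= t95))
```

Here `area_t100` covers most of the transect, including the unsuitable valley between the humps, while `area_t95` stays close to the two genuinely suitable areas.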
And so what we are doing here is trusting our presence points more than our ‘lack of presence’ points, and establishing a threshold based on where those points fall with respect to suitability. So, a very simple framework. But in a niche modeling world, in a world where you’re attempting to delineate the entire ecological potential of the species, and thereby the geographic range of the species, you have the opportunity to decide on the threshold that maximizes the avoidance of omission error, and sacrifices a little bit on the commission side to achieve that goal.

There you go. I hope this module was useful. There are several other modules you should refer to, related to this one: modules about E, about BAM, and about model evaluation all bear on this issue of thresholding. Thanks a lot.