Welcome to the training on Math for data science professionals. This is the first lecture, which is approximately eight minutes long. In this first lesson we will focus on defining what exactly data science is and what a typical data science process looks like. In Chapter one we will first start with a very basic definition of data science. Then we will take up some sample data and see a micro workflow of how data science actually works. Then we will refine our definition into a more in-depth and professional one. At the end of the lesson we will talk about two golden rules which will help you to focus on learning math for data science.

But first, what exactly is data science? Data science is all about converting data to information. We have a huge amount of data lying in a database, in an Excel sheet or wherever, and we want to present that data to management or to a decision maker as a one- or two-line summary. That is what data science is all about. On the left-hand side I have a large amount of data, covering approximately 100 days. If we give this much data to a manager, he won't be able to make any kind of decision. We need to take all these single transactions and create a summary: two or three lines of information which represent the data. By looking at this information the decision maker can make better decisions, which can improve business sales or give some new dimension to the business. That's what data science is all about.

On the left-hand side I have the day number and, for each day, the sales that were made. We can give some summary information to the decision maker. The easiest one is the average,
that is, how much we sell per day on average. We can also give some information like the minimum sales we have made on any of these days, i.e. 27, and the maximum sales we have made. We can also use the Mode. The Mode is the value that repeats most often in a dataset. Over here 27 is repeated a lot of times, 29 is repeated, 22 is repeated. It tells us the most repetitive sales which we made is 80. Here we have given just 4 important summaries to the decision maker, and by using these summaries he can make decisions. Let us not use the word Average; we will use Mean instead. That way we can remember them all as M: Mean, Min, Max, Mode.

We have applied 4 statistical formulas here and have arrived at a summary which we can give to the decision maker. When a decision maker looks at these 4 statistics he can make some decisions or come to some conclusion. He can see that the average sales is 582, but at some point we also made sales of 70000. If we can make sales of 70000, why are we making so much less per day on average? On the last day there is an entry which must have been some kind of bulk order, and because of this the Max looks high and the Mean does not look right. The decision maker can concentrate more on bulk orders so that his business can grow multifold.

What has really happened in the past 4 minutes? In the past 4 minutes we have applied data science, and while doing so we have done three things. The first is that we used statistics: we used math formulas to arrive at summary values. The second is that we used Excel knowledge, with formulas like Mode, Max and so on. Because we had good IT knowledge of Excel, we were able to implement these formulas easily.
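The lecture applies these four formulas in Excel. For readers who later move to Python, as the lecture anticipates, the same four M summaries can be sketched as below. This is only an illustration under stated assumptions: the `daily_sales` list is hypothetical (it is not the lecture's sheet, so the numbers differ from 582 and 70000's surroundings), and `four_m_summary` is a name made up for this example.

```python
import statistics

# Hypothetical daily sales figures (NOT the lecture's Excel data).
# The last entry plays the role of a single bulk order.
daily_sales = [27, 29, 22, 80, 29, 85, 88, 82, 70000]


def four_m_summary(values):
    """Return the Mean, Min, Max and Mode of a list of numbers."""
    return {
        "mean": statistics.mean(values),  # arithmetic average
        "min": min(values),               # smallest value
        "max": max(values),               # largest value
        "mode": statistics.mode(values),  # most frequently repeated value
    }


summary = four_m_summary(daily_sales)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this made-up sample the single bulk order drags the Mean into the thousands while most days sit below 100, which is the same pattern the decision maker notices in the lecture.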
The third and most important thing is domain knowledge. When we saw this 70000 Max value we were able to figure out that it must be some kind of bulk order. Because we had domain knowledge of retail and selling, we were able to figure out why this difference was there. So we used three things: Math, IT and domain knowledge. Let us not just say Excel, because tomorrow we may be using Python or R; in general we used IT. We used Math, IT and domain knowledge to arrive at this summary information. That is exactly what data science is.

Data science is a multi-disciplinary field. We started with a layman's definition; the more complete one is that it is a multi-disciplinary field which comprises statistical math and probability math, IT knowledge (where we can use Excel formulas, Python or R) and domain knowledge.

Throughout these lessons I would advise and request you to follow two golden rules if you really want to succeed in the data science field. The first rule is: do not solve math, apply math. When we wanted to calculate the average of this data, we just used the average formula of Excel. Solving math by hand with pen and paper won't really help us; try to apply math instead. The second golden rule is: stick to Excel and do not jump to Python and R. A lot of people start with Python and R and get into programming language installations, but then they lose their focus on statistical math. Throughout these lessons, use Excel and stick to applying math rather than solving math.

That brings us to the end of this session. At the end of the session we are giving you two Excel sheets. One is the sheet on which we have shown the demo. The other is a practice sheet, in which we give you another dataset and want you to find the Mean, Min, Max and Mode. Here is a hint: there are two odd values in this summary, so try to find them.
After every lesson we will give you a practice sheet and the sheet which we used in the tutorial. At the end of the video a small Q&A deck is flashed up, so you can have a look and summarize what you have learnt in the lesson. In Lesson 2 we will talk about descriptive analysis.

Welcome to Lesson 2. Lesson 2 covers 4 important topics: descriptive statistics, i.e. Mean, Median, Mode, Max and Min; Spread and the importance of Spread; Outliers; and Quartiles. This whole lesson is approximately 15 to 16 minutes long and has 4 chapters.

The first thing a data science engineer should do when he gets any data for analysis is to identify whether the data is spread out or concentrated. If the data is highly spread, the values in the dataset lie far away from one another. If it is concentrated, the data revolves around certain values.

We have two datasets of sales here. Let us plot them using scatter plots. We have plotted Total Sales 1; now create one more scatter plot for the second one. First we will concentrate on Total Sales 1. When we plot Total Sales 1 as a scatter plot, a lot of the data is concentrated around a certain section. Let us bring up all the data and do a small analysis. A lot of the data revolves around 80 to 88, or say 80 to 90. There is another group as well: a small concentration from about 27 to 35. We can also see one data point out here at almost 112. This says the Total Sales 1 data is highly concentrated and not dispersed, or we can say the spread is not too much.

The second dataset plots as a straight line. Add data labels and check how the data is spread: it has a high spread, from 16 to 54. This is almost a linear graph, so we definitely won't find that kind of pattern of high concentration.
A dataset like the first one is said to follow the measures of central tendency. A dataset like the second one is different: it does not really have any central tendency, and the way of doing analysis for such a dataset is different. The first thing to find out as a data science engineer is whether the dataset follows the measures of central tendency or not. The rest of the analysis depends on the answer, that is, on whether the data has a high spread or a low spread.

The problem is that when we have a lot of records, millions of records, it is very difficult to plot the graph and do these things. Creating visual graphs like these for millions of records is almost impossible. The whole point of showing you these graphs was to show visually what concentrated data looks like, what spread data looks like and what following the measures of central tendency looks like. What we need are mathematical formulas which can run quickly and tell us, without plotting a graph, whether the data has a high spread or a low spread and whether it follows the measures of central tendency.

The first thing we need to do is to get a statistical description of the data. Before we calculate any measure of spread, we have five M formulas: the Mean, Median, Mode, Max and Min of the whole dataset. First we will start with the Mean. Mean means average; it gives us a summary of the data. The Median is the center of the data: in a sorted dataset, the Median is the value which comes exactly in the middle. If we have the dataset 10, 20, 30, 40, 50, the Median is the third value, 30. The Mode helps to find the most repetitive value in the dataset; in this case it is 29. These 5 formulas give me the statistical description. It tells us the average value is 581, the Median value is 80, the most repetitive value in the dataset is 29, and it gives us the Max and Min.
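The definition of the Median given above (sort the data, take the value in the middle) can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's Excel workflow: `median_of` is a name invented for this example, and the `sales` list is hypothetical data with one bulk order, not the lecture's sheet. Averaging the two middle values for an even count is the usual convention and is an addition the lecture does not spell out.

```python
import statistics


def median_of(values):
    """Median: the middle value of the sorted data.

    For an even number of values, the two middle values are averaged.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# The lecture's own example: the Median is the third value.
print(median_of([10, 20, 30, 40, 50]))

# Hypothetical sales with one bulk order (not the lecture's data).
sales = [27, 29, 22, 80, 29, 85, 88, 82, 70000]
print(statistics.mean(sales))  # pulled far upward by the bulk order
print(median_of(sales))        # stays at a typical daily value
```

Comparing the last two printed values shows why the Mean and the Median are looked at together: one extreme entry moves the Mean a long way but leaves the Median where it was.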
When we look at this description, the first thing we see is that the Mean and the Median are many times apart (581 versus 80). This is not a good sign. The Mean and the Median should be close to each other, with a difference of at most 5 to 10%. In this dataset we have purposely included a bulk value at the end, and because of it the whole average looks very bad. There is one value in this dataset which is absurd and outside the range. On that day there was a bulk sale to some corporation, which is why the sales jumped up, and because of this one single value the whole average looks very different.

Looking only at the Mean can sometimes be very dangerous. We also have to look at the Median. The Median gives the middle value, so the absurd value does not come into its calculation. The first thing we found here is an absurd value in the dataset which makes the average look very odd. This absurd value is termed an Outlier. An Outlier is a value which is absurd and lies outside the normal range of the dataset.

In this case we could see the absurd value, but what if there were millions of records? To find an Outlier mathematically we need to use something called Quartiles: calculate the Quartiles and then find the Outlier. Quartiles are three cut points which divide the sorted dataset into four parts. To find a Quartile we have a formula. The first Quartile is the Median of the lower half of the dataset. Quartile 2 and the Median are the same value: both are the middle of the dataset. The third Quartile is the Median of the upper half.
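As a rough sketch of the Quartile calculation described above, Python's standard `statistics.quantiles` function returns the three cut points directly. The helper name `quartiles` is made up for this example. The `method="inclusive"` setting interpolates between data points and is assumed here to correspond to Excel's QUARTILE.INC; other quartile conventions can give slightly different values on small datasets.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3): the three cut points that split the
    sorted data into four parts."""
    # "inclusive" treats the data's min and max as the ends of the range.
    # Assumption: this mirrors the convention of Excel's QUARTILE.INC.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q2, q3


data = [10, 20, 30, 40, 50]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)
print(q2 == statistics.median(data))  # Quartile 2 is the Median
```

The last line checks the point made in the lecture: Quartile 2 and the Median are the same value.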
To recap, Quartiles are where we divide the sorted dataset into four sections using three cut points, each of which is found as a Median. To check for an Outlier we need to calculate the Interquartile Range (IQR). The IQR is Quartile 3 - Quartile 1. The normal range of the data then has a high end and a low end: the high range is Quartile 3 + 1.5 * IQR and the low range is Quartile 1 - 1.5 * IQR. According to these calculations the maximum normal value of this dataset is 167 and the minimum is -53, so we can see that this value (70000) is far too high. We will exclude that value for now. With the value excluded and the Mean recalculated, the Mean and the Median look quite close: they are 67 and 80, which looks reasonable. By using the IQR we can find the normal minimum and maximum range of the dataset and then hunt down the Outlier.

Before we start measuring the spread, we first need to check whether the data has values which should not be included: wrong values, absurd values or values which do not go with the current dataset. Try to hunt down the Outlier and remove it so that the calculations are genuine.

Now calculate the Mean, Median, Mode, Max and Min for the other dataset as well. The Mean and the Median are almost equal and there are no repeating values. When the Mean and the Median are equal and there are no repeating values, it is very possible that the data is sequential. Here it is 5, 6, 7, 8, 9 and so on; that is why the Mean and the Median match so closely. The Max value is 140 and the Min value is 5. So by using the Mean, Median and Mode we can find out the nature of the dataset and describe it. In this case we can describe the dataset as sequential.
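The Outlier hunt described above (Quartile 1 - 1.5 * IQR as the low range, Quartile 3 + 1.5 * IQR as the high range) could be sketched in Python as follows. This is an illustration under assumptions, not the lecture's sheet: the `sales` list is hypothetical, the function names `iqr_fences` and `split_outliers` are invented for this example, and the quartiles use the `inclusive` method of `statistics.quantiles`, so the limits differ from the lecture's 167 and -53.

```python
import statistics


def iqr_fences(values):
    """Return (low, high): the normal range implied by the IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # Interquartile Range = Quartile 3 - Quartile 1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def split_outliers(values):
    """Separate values inside the normal range from the outliers."""
    low, high = iqr_fences(values)
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers


# Hypothetical sales with one bulk order (not the lecture's data).
sales = [27, 29, 22, 80, 29, 85, 88, 82, 70000]
low, high = iqr_fences(sales)
kept, outliers = split_outliers(sales)

print("normal range:", low, "to", high)
print("outliers:", outliers)
print("mean without outliers:", statistics.mean(kept))
print("median without outliers:", statistics.median(kept))
```

Once the bulk order is excluded, the Mean and the Median of this sample land close together, which is the same effect the lecture observes after removing the 70000 entry.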
When the Mean and the Median were very far apart, we came to know there was an Outlier. So by using the Mean, Median and Mode we can find out the nature of the dataset.

Now calculate the Range. The Range is Max - Min. The Range of dataset 1 (Total Sales 1) is much smaller than the Range of Total Sales 2. This indicates that the spread of Total Sales 2 is much higher than that of Total Sales 1. The Range is therefore the first and simplest statistical formula for measuring spread.

In this video we were trying to find out what Spread is and how to calculate it. First we did it visually, using graphs. If we have a huge dataset we cannot use graphs; we need to use calculations. That is where we talked about the Mean, Median, Mode, Max and Min. Before we try to find the spread, we must ensure there is no Outlier. To find the Outlier we used Quartiles: we calculated the Quartiles, the IQR and, from the IQR, the normal Max and Min, and eliminated the Outlier. Finally we calculated the Range, and by using the Range we now know that Dataset 2 has a higher spread than Dataset 1.

At the end of this video we will give you a small practice test on whatever we have talked about in this class. We also have a small Q&A at the end of the video; try to answer those questions to revise.
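The Range comparison above is a one-line formula, Max - Min. A minimal Python sketch follows; both lists are hypothetical stand-ins (a concentrated sample with its Outlier already removed, and a sequential sample running from 5 to 140 as described in the lecture), and `value_range` is a name chosen for this example.

```python
def value_range(values):
    """Range = Max - Min, the simplest measure of spread."""
    return max(values) - min(values)


# Hypothetical stand-ins for the two sheets (not the lecture's data).
total_sales_1 = [27, 29, 35, 80, 84, 88, 112]  # concentrated values
total_sales_2 = list(range(5, 141))            # sequential: 5, 6, ..., 140

print("Range of Total Sales 1:", value_range(total_sales_1))
print("Range of Total Sales 2:", value_range(total_sales_2))
```

Because the Range uses only the two extreme values, a single Outlier would dominate it, which is why the lecture removes the Outlier before calculating it.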