Hello everyone and welcome to this

interesting session on the data science full course. So before we begin, let’s

have a quick look at the agenda of this session. First of all, I’ll be starting

off by explaining the evolution of data and how it led to the

growth of data science, machine learning, AI and all the different aspects of data.

Then we’ll have a quick introduction to data science and understand what exactly it

is. Then we’ll move forward to data science careers and salaries, and

understand what the different job profiles in the data science career path are and

how to become a data scientist, data analyst or machine learning engineer.

Then we’ll move on to the first and foremost part of data science, which is

statistics. After completing statistics, we’ll move on to machine learning, where

we’ll understand what exactly machine learning is, what the different types of

machine learning are, how and where they are used, and the different

algorithms. Next, we’ll understand what deep learning is and how deep

learning is different from machine learning, what the relationship

between AI, machine learning and deep learning is in terms of data science, and

understand how exactly a neural network works, how to create a neural network and

much more. So let’s begin our session now. Data is increasingly shaping the systems

that we interact with every day. Whether you are searching for something on Google,

using Siri, or browsing your Facebook feed, you are consuming the result of

data analysis. Data is growing at an alarming rate: we are generating

2.5 quintillion bytes of it every day. Now that’s a lot of data, and considering

there are more than 3 billion Internet users in the world, a quantity

that has tripled in the last 12 years, and 4.3 billion cell phone users, that’s

a heck of a lot of data. This rapid growth has generated opportunity for new

professionals who can make sense of this data. Now, given its transformative

ability, it’s no wonder that so many data-related jobs have been created in

the past few years, like data analysts, data scientists, machine learning

engineers, artificial intelligence engineers and many more. And before we delve into the details of all of these different

professions, let’s understand exactly what data science is. So data science,

also known as data-driven science, is an interdisciplinary field about scientific

methods, processes and systems to extract knowledge or insights from data in

various forms, either structured or unstructured. It is the study of where

information comes from, what it represents and how it can be turned into

a valuable resource in the creation of business and IT strategies. So data

science employs many techniques and theories from fields like mathematics,

statistics, information science as well as computer science, and it can be applied

to small data sets too, yet most people think data science is only about

dealing with big data or large amounts of data. So this brings up the question:

which job profile is suitable for you: is it the data analyst, the data scientist

or the machine learning engineer? Now, data scientist has been called the

sexiest job of the 21st century; nonetheless, data science is a hot and growing field.

So before we drill into data science, let’s discuss all of these profiles one

by one and see what these roles are and how they work in the industry. A data

science career usually starts with mathematics and stats as the base, which

brings up the first profile in our data science career path, which is the data

analyst. A data analyst delivers value to a company by taking information

about specific topics and then interpreting, analyzing and presenting

the findings in comprehensive reports. Many different types of businesses use

data analysts to help. As experts, data analysts are often called on to use their

skills and tools to provide competitive analysis and identify trends within

industries. Most entry-level professionals interested in going into data-related

jobs start off as data analysts. Qualifying for this role is as simple as

it gets: all you need is a bachelor’s degree in computer science or mathematics

and good statistical knowledge. Strong technical skills would be a plus and can

give you an edge over most other applicants. So next we have data

scientists. There are several definitions available for data scientists, but in

simple words, a data scientist is one who practices the art

of data science. The highly popular term data scientist was coined by DJ Patil

and Jeff Hammerbacher. Data scientists are those who crack complex data

problems with strong expertise in certain scientific disciplines. They work

with several elements related to mathematics, statistics, computer science

and much more. Now, data scientists are usually business analysts or data

analysts with a difference: it is a position for specialists, and you can

specialize in different types of skills like speech analytics, text analytics

(which is natural language processing), image processing, video processing,

medicine simulation and material simulation. Each of these specialist roles is

very limited in number, and hence the value of such a specialist is immense.

Now let’s talk about AI or machine learning engineers. Machine learning

engineers are sophisticated programmers who develop machines and systems that

can learn and apply knowledge without specific direction. Artificial

intelligence is the goal of a machine learning engineer. They are computer

programmers, but their focus goes beyond specifically programming machines to

perform specific tasks: they create programs that will enable machines to

take actions without being specifically directed to perform those tasks. So now

if we have a look at the salary trends of all of these professionals,

starting with the data analyst, the average salary in the U.S. is around 83,000

dollars, or almost close to eighty-four thousand dollars, whereas in India

it’s around four lakh and four thousand rupees per annum. Now coming to the data

scientist, the average salary is ninety-one thousand dollars, or about ninety-one point

five thousand dollars, and in India it is almost seven lakh rupees. And finally,

for ML engineers, the average salary in the U.S. is around one hundred and

eleven thousand dollars, whereas in India it is around seven lakh and twenty thousand

rupees. So as you can see, the data scientist and ML engineer positions are

higher positions which require a certain degree of expertise in the

field, so that’s the reason why there is a difference in the salaries of all

three professionals. So if you have a look at the road map of becoming any one

of these professionals, what one first needs to do is earn a

bachelor’s degree. Now, this bachelor’s degree can be in

computer science, mathematics, information technology, statistics,

finance or even economics. After completing a bachelor’s degree, the next

step is fine-tuning the technical skills. Tuning the technical skills is

one of the most important parts of the roadmap, where you learn all the

statistical methods and packages. You learn R, Python or SAS, languages

which are very important. You learn about data warehousing, business intelligence,

data cleaning, visualization and reporting techniques. Working knowledge of Hadoop

and MapReduce is very important, and machine learning

techniques are one of the most important parts of the data science

career. Apart from these technical skills, there are also some business

skills which are very much required. These involve analytical problem-solving,

effective communication, creative thinking, as well as industry knowledge.

Now, after fine-tuning your technical skills and developing all the business

skills, you have the option of either going for a job or going for a

master’s degree or certification programs. I would suggest you go

for a master’s degree, as just coming out of the BTech world and having the

technical skills is not enough; you need to have a certain level of

expertise in the field. So it’s better to go for a master’s or PhD program

in computer science, statistics or machine learning. You can also go for big

data certifications, and you can also go for industry certifications in

data analysis, machine learning or data science. It so happens that

Edureka also provides machine learning, data analysis and data science

certification training. They have master’s programs which are equivalent to

a master’s degree which you get from a university, so do check them out,

guys. I’ll leave the links to all of these in the description box below. And after

you have completed the master’s degree, what comes next is working on projects

which are related to this field. It’s better if you work on machine learning,

deep learning or data analytics projects; that will give you an edge over other

competitors while applying for a job. So a certain level of expertise

in the field is also required, and this is how you will succeed in the data

science career path. There are certain skills which are required, which I was

talking about earlier: the technical skills and the non-technical skills. Now

if you talk about the skills which are required for all of these

professions, they are mostly the same. For any data analyst, first of all you

need to have analytical skills, which involve maths: having good knowledge of

matrix multiplication, Fourier transforms and so on. Next we have

communication skills. Companies looking for a data analyst require someone who has

good communication skills, who can explain all of the technical terms to

non-technical teams such as the marketing or sales team. Another important skill

required is critical thinking: you need to think in certain directions and gain

insights from the data, so that’s one of the most important parts of a data

analyst’s job. Obviously, you need to pay attention to the details, as a minor

shift or deviation in the result or in the calculation of the

analysis might result in some sort of loss for the company. It will not necessarily

create a loss, but it’s better to avoid any kind of deviation in the

results, so paying attention to detail is very important. And then

again we talk about the mathematical skills. Knowing about all the types of

differentiation and integration is going to help a lot, because a

lot of machine learning algorithms are, I would say, mostly mathematical terms

or mathematical functions, so having good knowledge of mathematics is also

required. Apart from this, you need to use technical tools such as Python, R and

SAS. You need to know about the big data ecosystem and how it works, HDFS,

how to extract data and create a pipeline, and you should know a little JavaScript. And

if you talk about the skills of a data scientist, it’s almost the same: having

analytical and statistics knowledge. Another important part here is to know

the machine learning algorithms, as they play an important role in the data

science career. Problem-solving skills, obviously. Now, another important aspect,

the skill which differs from that of a data analyst, is

deep learning. I’ll talk about deep learning later in the

second half of the video. Having a good knowledge of deep

learning and the various frameworks such as TensorFlow,

PyTorch and Theano is very much required for data scientists. And

again, business communication, as I mentioned earlier, is very much required,

because, as you know, these are some of the most technical roles in

the industry, and the output of these roles, or what I would say the output of

what these professionals do, is not that technical; it is more business oriented.

So they have to explain all of these findings to the non-technical

teams, such as sales and marketing. And again, you need the technical tools and the

skills. Now, for a machine learning engineer, obviously programming languages: having

good knowledge of R, Python, C++ or Java is very much required. You need to know

about calculus and statistics, as I mentioned earlier, learning about

matrices and integration. Another important skill here is signal

processing. A lot of times machine learning engineers have to work on

robots and signal processing: they work on human-like robots, they work on

robotics which mimic human behavior, so a lot of signal processing techniques are

also required in this field. Applied mathematics, as I mentioned earlier, and

again neural networks, which are one of the foundations of the artificial intelligence

being used. And again we have natural language processing. As you know, we

have personal assistants like Siri and Cortana, and they work on language

processing, and not just language processing: you have audio processing as

well as video processing, so that they can interact with a real environment and

provide a certain answer to a particular question. So these were the skills, I

would say, for all of these three roles. Next, if we have a look at the

peripherals of data science so first of all we have statistics needless to say

there are programming languages we have short read integrations then we have

machine learning which is a big part of data science and then again we have big

data so let’s start with statistics which is the first area of data science

or I should say the first milestone which we should cover

So for statistics, let’s first understand what exactly data is.

Data, in general terms, refers to facts and statistics collected together for

reference or analysis. When working with statistics, it’s important to recognize

the different types of data. Data can be broadly classified into numerical,

categorical and ordinal. Data with no inherent order or ranking, such as

gender or race, is called nominal data. So as you can see in type 1, we have

male, female, male, female; that is nominal data.

Data with an ordered series is called ordinal data. So as you can see

here, we have an ordered series where we have the customer IDs and the rating

scale. Data with only two options is called binary data. In this

type of data there are only two options, like yes or no, true or false,

or 1 or 0. So as you can see here, we have the customer ID, and in the car owner

column we have either yes or no. The types of data we just discussed

all describe the quality of something, such as size, appearance or value. Such

data is broadly classified as qualitative data. Now, data which can be

categorized into a classification, data which is based upon counts, where there is only

a finite number of values possible and the values cannot be subdivided

meaningfully, is called discrete data. So as you can see here in our example, we

have the organization and the number of products; this cannot be subdivided

into a number of sub-products, right? And data which can be

measured on a continuum or a scale, data which can have almost any numeric

value and can be subdivided into finer and finer increments, is called

continuous data. So as you can see here, for the patient ID we have the weight of the

patient, which is 6.5 kg. Now, kilograms can be subdivided into grams and milligrams, and

finer refinement is also possible. This type of data, which can be measured

by the quantity of something rather than its quality, is called quantitative data.
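
The data types described above can be summarized in a small lookup. This is only a minimal sketch: the column names (gender, rating, owns_car, number_of_products, weight_kg) are hypothetical stand-ins for the examples mentioned in the session, not columns from a real data set.

```python
# Hypothetical example columns mapped to the data types described above.
DATA_TYPES = {
    "gender": "nominal",               # no inherent order or ranking
    "rating": "ordinal",               # ordered series
    "owns_car": "binary",              # only two options: yes or no
    "number_of_products": "discrete",  # counts, cannot be subdivided
    "weight_kg": "continuous",         # measurable on a scale, e.g. 6.5 kg
}

QUALITATIVE = {"nominal", "ordinal", "binary"}
QUANTITATIVE = {"discrete", "continuous"}


def broad_class(column):
    """Return 'qualitative' or 'quantitative' for one of the example columns."""
    kind = DATA_TYPES[column]
    if kind in QUALITATIVE:
        return "qualitative"
    if kind in QUANTITATIVE:
        return "quantitative"
    raise ValueError(f"unknown data type: {kind}")


if __name__ == "__main__":
    for column in DATA_TYPES:
        print(column, DATA_TYPES[column], broad_class(column))
```

The grouping of nominal, ordinal and binary under qualitative, and of discrete and continuous under quantitative, follows the classification given in the session.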

Now that we are done with the different types of data, qualitative and

quantitative, it’s time to understand the types of variables we have. There are

two major types of variables: dependent and independent variables. If you want

to know whether caffeine affects your appetite, the presence or absence, or

the amount, of caffeine would be the independent

variable, and how hungry you are would be the dependent variable. In statistics, the

dependent variable is the outcome of an experiment: as you change the independent

variable, you watch what happens to the dependent variable. The

independent variable, on the other hand, is a variable whose value does not depend on

the other variables being measured; it is usually plotted on the x-axis.

Now, the next step after knowing about the data types and the variables is to

know about population and sampling, and that brings us to experimental research.

In experimental research, the aim is to manipulate an independent variable

and then examine the effect that this change has on a dependent variable.

Since it is possible to manipulate the independent variable, experimental

research has the advantage of enabling a researcher to identify cause and

effect between the variables. Suppose there are 100 volunteers at a

hospital, and a doctor needs to check the working of a particular medicine which

has been cleared by the government. The doctor divides those hundred

patients into two groups of 50, asks one group to take the

medicine and the other group to not take any medicine at all, and after a while

compares the results. In non-experimental research, the researcher

does not manipulate the independent variable. This is not to say that it is

impossible to do so, but it would either be impractical or unethical

to do so. For example, a researcher may be interested in the effect of illegal

recreational drug use, which is the independent variable, on certain types of

behavior, which is the dependent variable. However, while possible, it would be

unethical to ask an individual to take illegal drugs in order to study what

effect this has on certain behaviors. It is always good to go for experimental

research rather than non-experimental research when it is practical and ethical. So next in our session we have

population and sampling. Those are two of the most important terms in statistics,

so let’s understand them. In statistics, the term population refers to the

entire pool from which a sample is drawn. Statisticians also speak of a population

of objects, events, procedures or observations,

including such things as the number of vehicles owned by a

person. A population is thus an aggregate of creatures, things, cases and

so on. A population commonly contains too many individuals to study

conveniently, so an investigation is often restricted to one or more samples drawn

from it. A well-chosen sample will contain most of the information about a

particular population parameter, but the relationship between the sample and the

population must be such as to allow true inferences to be made about the population

from that sample. For that we have different types of sampling techniques.

Sampling methods are classified as either

probability or non-probability. In probability sampling, each member of the

population has a known nonzero probability of being selected. Probability

methods include random sampling, systematic sampling and stratified

sampling, whereas in non-probability sampling, members are selected from the

population in some non-random manner; these methods include convenience sampling,

judgement sampling, quota sampling and snowball sampling. While sampling is

important, there is another term, which is known as sampling error. Sampling

error is the degree to which a sample might differ from the population. When

inferring to a population, results are reported plus or minus the sampling

error. Now, in probability sampling there are three methods, which are random

sampling, systematic sampling and stratified sampling. Talking about

random sampling, each member of the population

has an equal chance of being selected; such a type of sampling is random sampling.
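
The three probability sampling methods named above can be sketched in a few lines of Python using only the standard library. This is a rough illustration under stated assumptions, not a definitive implementation: the function names, the `seed` parameter, and the choice to draw the same fraction from each stratum are all assumptions made for this example.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Random sampling: every member has an equal chance of being selected."""
    rng = random.Random(seed)
    return rng.sample(population, k)


def systematic_sample(population, k, seed=None):
    """Systematic sampling: random start, then every Nth record from the list."""
    step = len(population) // k  # the "N" in "every Nth record"
    if step < 1:
        raise ValueError("sample size is larger than the population")
    rng = random.Random(seed)
    start = rng.randrange(step)
    return population[start::step][:k]


def stratified_sample(population, key, fraction, seed=None):
    """Stratified sampling: group members by a shared characteristic (the
    stratum), then randomly draw the same fraction from every stratum."""
    rng = random.Random(seed)
    strata = {}
    for member in population:
        strata.setdefault(key(member), []).append(member)
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample


if __name__ == "__main__":
    volunteers = list(range(100))  # hypothetical population of 100 IDs
    print(simple_random_sample(volunteers, 10, seed=1))
    print(systematic_sample(volunteers, 10, seed=1))
    # Hypothetical strata: even versus odd IDs.
    print(stratified_sample(volunteers, key=lambda v: v % 2, fraction=0.1, seed=1))
```

Drawing the same fraction from each stratum keeps each group represented in proportion to its share of the population, which is one common way to apply stratified sampling.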

Now, if we talk about systematic sampling, it is often used instead of random sampling,

and it is also called the Nth name selection technique. Pay attention to

the name: Nth name. After the required sample size has been calculated,

every Nth record is selected from the list of population members. Its

only advantage over the random sampling technique is its simplicity. The

final type of sampling is stratified sampling. A stratum is a subset of the

population that shares at least one common characteristic. The researcher

first identifies the relevant strata and their

actual representation in the population before analysis. So now that we know how

our data is and what kind of sampling is done, let’s have a look at the measures of

center, which describe the central value around which the

data tends to cluster. So as you can see, in measures of center we have three terms,

which are the mean, median and mode, and I’m sure everyone is aware of all

of these terms, so I’ll not get into the details of these

terms. What’s more important is to know about the measures of spread. A

measure of spread, sometimes called a measure of dispersion, is used to

describe the variability in a sample or population. It is usually used in

conjunction with a measure of central tendency, such as the mean or median, to

provide an overall description of a set of data. Now, if you talk about deviation,

it is the difference between each x i and the mean of a sample or population,

which is known as the deviation about the mean, whereas variance is based on

deviation and entails computing the squares of the deviations. So as you can see here, we

have the formula for the variance, which is the sum over all data points of the squared

difference between each data point and the mean, divided by the total number of data

points. Standard deviation is basically the square root of the

variance, so as you can see, the formula is the same, just with the square root

over the variance. So that was standard deviation and variance. Another topic in

probability and statistics is skewness. Skewness is a measure of symmetry, or

more precisely, the lack of symmetry. So as you can see here, we have left-skewed,

symmetric and right-skewed distributions. Normally

distributed curves are the most symmetric curves; we’ll talk about the normal

distribution later. So after skewness, what we need to know

about is the confusion matrix. A confusion matrix is a tabular

representation of the actual versus the predicted values. This helps us find

the accuracy of the model. When we are creating any machine learning or

deep learning model, to find the accuracy what we do is plot a confusion matrix.

You can calculate the accuracy of your model by adding

the true positives and the true negatives and dividing that by the true positives

plus true negatives plus false positives plus false negatives. That will give you

the accuracy of the model. So as you can see in the image, we have good and bad for

the predicted as well as the actual values. And as you can see here, the true positive D and the

true negative A are the two areas: in true positive D, we have predicted it was good and the

actual value was good; in true negative A, we have predicted it was bad and

it actually is bad. So the models which get the higher true positives and true

negatives are the ones which have the higher accuracy. So that’s what confusion

matrices are for. Now, the next term, and a very important term in statistics, is

probability. Probability is the measure of how likely it is that something will

occur; it is the ratio of desired outcomes to total outcomes. Now, if I

roll a die, there are six total possibilities: one, two, three, four, five

and six. Each possibility has one outcome, so

each has a probability of one out of six. For instance, the probability of

getting a two is one out of six, since there is only a single two on the

die. Now, when talking about probability distributions,

there are three terms to know: the probability

density function, the normal distribution and the central limit theorem. The probability

density function is the equation describing a continuous probability

distribution; it is usually referred to as the PDF. Now, if we talk about the normal

distribution, the normal distribution is a probability distribution that

associates the normal random variable X with a cumulative probability. The normal

distribution is defined by the following equation. So as you can see here, Y is 1

divided by sigma times the square root of 2 pi, the whole multiplied by e raised to the power

minus (X minus mu) squared divided by 2 sigma squared, where X is a

normal random variable, mu is the mean and sigma is the standard deviation. Now, the

central limit theorem states that the sampling distribution of the mean of any

independent random variable will be normal or nearly normal if the sample

size is large enough. The accuracy of the resemblance to the normal distribution

depends, however, on two factors: the first is the number of sample points taken,

and the second is the shape of the underlying population. Now, enough about

statistics. If you want to know more about statistics and get

in-depth knowledge of it, you can refer to our statistics for data

science video. I’ll leave the link to that video in the description box.

That video talks about statistics and probability in more depth than I

explained here; it also covers p-values and hypothesis testing, which are

required for any data science project. So let’s move on to the next part of our data

science learning path, which is machine learning. So let’s

understand what exactly machine learning is. Machine learning is an

application of artificial intelligence that provides systems the ability to

automatically learn and improve from experience without being explicitly

programmed. It is about getting computers to program themselves and teaching

them to make decisions using data: where writing software is a bottleneck, let the

data do the work instead. Machine learning is a class of algorithms which

is data driven; that is, unlike normal algorithms, it is the data that tells what

the good answer is. So if we have a look at the various features of machine

learning: first of all, it uses data to detect patterns in a data set

and adjusts the program’s actions accordingly. It focuses on the

development of computer programs that can teach themselves to grow and change

when exposed to new data, so it’s not just the old data on which it has been

trained; whenever new data is entered, the program changes accordingly.

It enables computers to find hidden insights using iterative algorithms

without being explicitly programmed to do so. So machine learning is a method

of data analysis that automates analytical model building. Now let’s

understand how exactly it works. If we have a look at the diagram which is

given here, we have traditional programming on one side and machine

learning on the other. In traditional programming, what we used to do

was provide the data and provide the program, and the computer used to generate the

output. Things have changed now. In machine learning, what we do is provide

the data and the expected output to the machine. What the

machine does is learn from the data, find hidden insights and create a model.

It then takes the output data again, and it reiterates, trains and grows

accordingly, so that the model gets better every time it is trained with

new data or new output. So the first and the foremost application of machine

learning in the industry that I would like to draw your attention to is

navigation, or Google Maps. Google Maps is probably the app we use whenever

we go out and require assistance with directions and traffic, right? The other

day I was traveling to another city and took the expressway, and the map

suggested, despite the heavy traffic, that I was on the fastest route. Now, how does

it know that? Well, it’s a combination of people currently using the service, the

historic data of that route collected over time, and a few tricks acquired from

other companies. Everyone using Maps is providing their location, their

average speed and the route on which they are traveling, which in turn helps Google

collect massive data about the traffic, which lets it estimate the upcoming

traffic and adjust your route accordingly, which is pretty amazing,

right? Now coming to the second application, which is social media. If

we talk about Facebook, one of the most common applications is the automatic

friend tag suggestion, and I’m sure you have come across this;

it’s present in other social media platforms as well. Facebook uses

face detection and image recognition to automatically find the face of the

person which matches its database, and hence it suggests that we tag that person,

based on DeepFace. DeepFace is Facebook’s machine

learning project which is responsible for recognizing faces and identifying

which person is in the picture. It also provides alternative tags for

images already uploaded on Facebook. So for example, if we have a look at this

image and inspect it on Facebook, we get the alt tag,

which has a particular description. In our case, what we get here is "Image

may contain: sky, grass, outdoor and nature". Now, transportation and commuting is

another industry where machine learning is used heavily. If you have used an

app to book a cab recently, then you are already using machine learning to an

extent. What happens is that it provides a personalized application

which is unique to you: it automatically detects your location and provides

options to go home, to the office or to any other frequent place, based on your

history and patterns. It uses machine learning algorithms layered on top of

historic trip data to make more accurate ETA predictions. Uber, with

the implementation of machine learning in their app and their website, saw a 26

percent improvement in the accuracy of delivery and pickup, which is a huge gain. Now coming to the

virtual personal assistant. As the name suggests, a virtual personal assistant assists

in finding useful information when asked via voice or text. A few of the

major applications of machine learning here are speech recognition, speech-to-text

conversion, natural language processing and text-to-speech conversion. All you

need to do is ask a simple question like "What is my schedule for tomorrow?" or "Show

my upcoming flights". To answer, your personal assistant searches for

information or recalls your related queries to collect the information.

Recently, personal assistants are being used in chatbots, which are being

implemented in various food ordering apps, online training websites and also

in commuting apps. Again, product recommendation: this is one of the

areas where machine learning is absolutely necessary, and it was one of

the few areas from which the need for machine learning emerged. Now, suppose you check

an item on Amazon but you do not buy it then and there. The next day you are

watching videos on YouTube and suddenly you see an ad for the same item. You

switch to Facebook, and there also you see the same ad. And again you go to some

other site and you see an ad for the same sort of item. So how does this

happen? Well, this happens because Google tracks your search history and

recommends ads based on your search history. This is one of the coolest

applications of machine learning, and in fact 35% of Amazon’s revenue is

generated by product recommendations. Now coming to the cool and highly

technological side of machine learning, we have self-driving cars. The

self-driving car is here, and people are already using it.

Machine learning plays a very important role in self-driving cars. I’m sure

you guys have heard about Tesla, the leader in this business. Its

current artificial intelligence is driven by the hardware manufacturer

Nvidia and is based on an unsupervised learning algorithm, which is a type of

machine learning algorithm. Nvidia states that they did not train their

model to detect people or any objects as such. The model works on deep

learning and crowdsources its data from other vehicles and drivers. It

uses a lot of sensors, which are a part of IoT, and according to the data

gathered by McKinsey, automotive data will hold a tremendous value of 750

billion dollars. That’s a lot of dollars we are talking about. Now next

again we have Google Translate. Now remember the time when you traveled to a

new place and found it difficult to communicate with the locals or to find

local spots where everything is written in a different language? Well, those days

are gone. Google’s GNMT, which is Google

Neural Machine Translation, is a neural machine learning system that works on many

languages and dictionaries. It uses natural language processing to provide

the most accurate translation of any sentence or words. Since the tone of the

words also matters, it uses other techniques like POS tagging, named entity

recognition and chunking, and it is one of the most used applications of machine

learning. Now if we talk about dynamic pricing, setting the right price for a

good or a service is an old problem in economic theory. There are a vast number

of pricing strategies that depend on the objective sought. Be it a movie ticket, a

plane ticket or a cab fare, everything is dynamically priced. Now in recent years

machine learning has enabled pricing solutions to track buying trends and

determine more competitive product prices. Now if we talk about Uber, how

does Uber determine the price of your ride?

Uber’s biggest use of machine learning comes in the form of surge pricing, a

machine learning model named Geosurge. If you are getting late for a

meeting and you need to book an Uber in a crowded area, get ready to pay twice

the normal fare. Even for flights, if you’re traveling in

the festive season, the chances are that prices will be twice as much as the

original price. Now, coming to the final application of machine learning, we have

online video streaming: we have Netflix, Hulu and Amazon Prime Video. Now

here I’m going to explain the application using the Netflix example. So

with over 100 million subscribers, there is no doubt that Netflix is the daddy of

the online streaming world. Netflix’s speedy rise has all the movie

industrialists taken aback, forcing them to ask how on earth one single

website could take on Hollywood. Now the answer is machine learning. The Netflix

algorithm constantly gathers massive amounts of data about user activities,

like when you pause, rewind or fast-forward; what day you watch content (TV shows on

weekdays, movies on weekends); the date you watch; the time you watch; when you

pause and leave content, so that if you ever come back they can suggest the

same video; the rating events, which are about four million per day; the searches,

which are about three million per day; the browsing and the scrolling behavior;

and a lot more. Now they collect this data for each subscriber they have and

use the recommender system and a lot of machine learning applications, and that

is why they have such a huge customer retention rate. So I hope these

applications are enough for you to understand how exactly machine learning

is changing the way we are interacting with society and how fast it is

affecting the world in which we live. So if you have a look at the market

trend of machine learning here, as you can see, initially it wasn’t much in

the market, but if you have a look at the 2016 side, there was enormous growth

in machine learning, and this happened mostly because, you know, earlier we had

the idea of machine learning, but then again we did not have the amount of big

data. So as you can see, the red line we have here in the histogram and the bar

plot is that of big data. So big data also increased during the years,

which led to the increase in the amount of data generated, and recently we got

that power, or I should say the underlying technology and the hardware

to support that power, that lets us create machine learning programs that

will work on this big data. So that is why you see a very steep incline during

the 2016 period as compared to 2012, because during 2016 we got new

hardware and we were able to find insights using that hardware and those

programs and create models which would work on heavy data. Now let’s have a look

at the life cycle of machine learning. So a typical machine learning life cycle

has six steps. The first step is collecting data, the second is data

wrangling, then we have the third step where we analyze the data, the fourth step

where we train the algorithm, the fifth step is when we test the algorithm, and

the sixth step is when we deploy that particular algorithm for industrial uses.

So when we talk about the first step, which is collecting data, here data is

being collected from various sources, and this stage involves the collection of

all the relevant data from various sources. Now if we talk about data

wrangling, data wrangling is the process of cleaning and converting raw

data into a format that allows convenient consumption. Now this is a

very important part of the machine learning lifecycle, as it’s not every

time that we receive data which is clean and in a proper format.

Sometimes there are missing values, sometimes there are wrong values,

sometimes the data format is different, so a major part of a machine learning lifecycle goes

into data wrangling and data cleaning. So if we talk about the next step, which is

data analysis, data is analyzed to select and filter the data required to

prepare the model. So in this step we take the data and use machine learning

algorithms to create a particular model. Now next, when we have a model, what

we do is train the model. Now here we use the data sets, and the algorithm is

trained on the training data set, through which the algorithm understands the patterns

and the rules which govern the particular data. Once we have trained the

algorithm, next comes testing. So the testing data

set determines the accuracy of our models.
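To make the testing step a little more concrete, here is a minimal sketch of how an accuracy figure can be computed from a test set. The labels and predictions below are made-up values for illustration, and `accuracy` is a hypothetical helper written for this example, not a function from any library.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of test examples whose predicted label matches the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


# Made-up test labels and model predictions, for illustration only.
y_test = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

test_accuracy = accuracy(y_test, y_pred)
print(f"test accuracy: {test_accuracy:.0%}")
```

Here eight of the ten predictions match the labels, so the sketch reports 80%. Whether a figure like that is good enough depends on the use case.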

So what we do is provide the test dataset to the model, which tells us

the accuracy of the particular model, whether it’s 60%, 70% or 80%, depending upon

the requirement of the company. And finally we have operation and

optimization. So if the speed and accuracy of the model are acceptable, then

that model should be deployed in the real system. The model that is used in

production should be made with all the available data; models improve with

the amount of available data used to create them. All the results of the model

need to be incorporated in the business strategy. Now after the model is deployed,

based upon its performance, the model is updated and improved. If there is a dip

in the performance, the model is retrained. So all of these happen in the

operation and optimization stage. Now before we move forward, since machine

learning is mostly done in Python and R, if we have a look at the

difference between Python and R, I’m pretty sure most of the people would go

for Python, and the major reason why people go for Python is because Python

has a larger number of libraries and Python is being used for more than just data

analysis and machine learning. So there are some important Python libraries here

which I want to discuss. First of all I’ll talk about matplotlib. Now what

matplotlib does is that it enables you to make bar charts, scatter plots,

line charts and histograms. Basically what it does is help in the visualization

aspect. As a data analyst or machine learning engineer, one needs to

represent the data in such a format that it can be understood by

non-technical people, such as people from marketing, people from sales and other

departments as well. So another important Python library we have here is seaborn,

which is focused on the visuals of statistical models, which include heat

maps and those that depict the overall distributions.

Sometimes people work on data which is more geographically aligned, and I would

say in those cases heat maps are very much required. Now next we come to

scikit-learn, and scikit-learn is one of the most

famous libraries of Python. I would say it’s simple and efficient for data mining

and for data analysis. It is built on NumPy and matplotlib, and it is

open source. Next on our list we have pandas. It is the perfect tool for data

wrangling, which is designed for quick and easy data manipulation, aggregation

and visualization. And finally we have NumPy. Now NumPy stands for Numerical

Python. It provides an abundance of useful features for operations on n-arrays, which

are NumPy arrays, and matrices in Python, and mostly it is used for mathematical

purposes, which gives a plus point to any machine learning algorithm. So

these were the important Python libraries which one must know in order to do any

Python programming for machine learning, or as such, if you are doing

Python programming, you need to know about all of these libraries. So guys,

next what we are going to discuss are the types of machine learning. So then again

we have three types of machine learning, which are supervised, reinforcement and

unsupervised machine learning. So if we talk about supervised machine learning,

supervised learning is where you have the input variable X and the output

variable Y and you use an algorithm to learn the mapping function from the

input to the output. So if we take the case of object detection here, or face

detection I’d rather say, first of all what we do is input the raw data in the

form of labeled faces, and again it’s not necessary that we just input faces

to train the model; what we do is input a mixture of face and non-face images. So

as you can see here we have labeled faces and labeled non-faces. What we do is

provide the data to the algorithm; the algorithm creates a model; it uses the

training dataset to understand what exactly is in a face and what exactly is in

a picture which is not a face. And after the model is done with the training and

processing, to test it what we do is provide a particular input of a face or a

non-face. Now see, the major part of supervised learning here is that we

exactly know the output. So when we are providing a face, we

ourselves know that it’s a face. So to test that particular model and get the

accuracy, we use the labeled input raw data. So next, when we talk about

unsupervised learning, unsupervised learning is the training of a model

using information that is neither classified nor labeled. Now this model

can be used to cluster the input data into classes on the basis of their statistical

properties. For example, for a basket full of vegetables, we can cluster different

vegetables based upon their color or size. So if I have a look at this

particular example we have here, what we are doing is inputting the raw

data, which can be either apple, banana or mango. What we don’t have here, which was

previously there in supervised learning, are the labels. So what the algorithm

does is that it visually gets the features of a particular set of data and

makes clusters. So what will happen is that it will make a cluster of red-looking

fruits, which are apples, and yellow-looking fruits, which are bananas, and based

upon the shape also it determines what exactly the fruit is and categorizes it

as mango, banana or apple. So this is unsupervised learning. Now the third type

of learning which we have here is reinforcement learning. So reinforcement

learning is learning by interacting with a space or an environment. The agent

selects its actions on the basis of its past experience (exploitation) and also

by new choices (exploration). A reinforcement learning agent learns from the consequences of

its actions rather than from being taught explicitly. So if we have a look at the

example here, the input data we have goes to the training, goes to

the agent, where the agent selects the algorithm; it takes the best action from

the environment, gets the reward, and the model is trained. So if you provide a

picture of a green apple, although the apple which it particularly

knows is red, what it will do is try to get an answer with the past

experience it has, and it will recreate the algorithm and then finally

provide an output which is according to our requirements. So now these were the

major types of machine learning algorithms. Next what we’ll do is dig

deep into all of these types of machine learning one by one. So let’s get started

with supervised learning first and understand what exactly is supervised

learning and what are the different algorithms inside it how it works the

algorithms the working and we’ll have a look at the various algorithm demos now

which will make you understand it in a much better way so let’s go ahead and

understand what exactly is supervised learning. So supervised learning is where

you have the input variable X and the output variable Y and use an algorithm to

learn the mapping function from the input to the output, as I mentioned

earlier with the example of face detection. It is called supervised learning

because the process of an algorithm learning from the training data set can

be thought of as a teacher supervising the learning process. So if we have a

look at the supervised learning steps, or I’d rather say the workflow,

as you can see here we have the historic data, then again we

have the random sampling: we split the data into the training data set and the

testing data set. Using the training data set, with the help of machine learning,

which is supervised machine learning, we create a statistical model. Now after we

have a model which is generated with the help of the training data set,

what we do is use the testing data set for prediction and testing. What we do is

get the output, and finally we have the model validation outcome. That was

for training and testing. So if we have a look at the prediction part of any

particular supervised learning algorithm, the model is used for predicting the

outcome of a new data set. So whenever the performance of the model degrades, the

model is retrained, or if there are any performance issues,

the model is retrained with the help of the new data. Now when we talk about

supervised learning, there is not just one but quite a few algorithms here. So we have

linear regression, logistic regression, decision tree; we have random forest; we

have naive Bayes classifiers. So linear regression is used to estimate real

values, such as the cost of houses, the number of cars,

the total sales, based on continuous variables. So that is what linear

regression is. Now when we talk about logistic regression, it is used to

estimate discrete values, for example binary values like zero and

one, yes or no, true and false, based on the given set of independent variables. So for

example, when you are talking about something like the chance of winning,

the answer can be true or false; or will it rain today,

which can be yes or no. So when the output of a

particular algorithm or the particular question is either yes/no or binary, then

only we use logistic regression. Now next we have decision trees. These are

used for classification problems; they work for both categorical and

continuous dependent variables. And if we talk about random forest, random forest

is an ensemble of decision trees; it gives better prediction and accuracy

than a decision tree. So that is another type of supervised learning algorithm.

And finally we have the naive Bayes classifier. It is a classification

technique based on Bayes’ theorem with an assumption of independence

between predictors. So we’ll get more into the details of all of these

algorithms one by one. So let’s get started with linear regression. So first

of all let us understand what exactly linear regression is. So linear

regression analysis is a powerful technique used for predicting the unknown

value of a variable, which is the dependent variable, from the known value

of another variable, which is the independent variable. So a dependent

variable is the variable to be predicted or explained in a regression model,

whereas an independent variable is a variable related to the dependent

variable in a regression equation. So if you have a look here at simple linear

regression, it’s basically equivalent to a simple line with a slope,

which is Y = a + bX, where Y is the dependent variable, a is the

y-intercept, b is the slope of the line, and X is the

independent variable. So the intercept is the value of the

dependent variable Y when the value of the independent variable X is 0; it is

the point where the line cuts the y-axis, whereas the slope

is the change in the dependent variable for a unit increase in the independent

variable; it is the tangent of the angle made by the line with the x-axis. Now

when we talk about the relation between the variables, we have a particular term

which is known as correlation. So correlation is an important factor to

check the dependencies when there are multiple variables. What it does is it

gives us an insight into the mutual relationship among variables, and it can be

visualized by creating a correlation plot with the help of the seaborn library,

which I mentioned earlier, which is one of the most important libraries in

Python. So correlation is a very important term to know about. Now if we talk about

regression lines, linear regression analysis is a powerful technique used

for predicting the unknown value of a variable, which is the dependent variable,

from the regression line, which is simply a single line that best fits the data in

terms of having the smallest overall distance from the line to the points. So

as you can see in the plot here, we have the different points or the data points,

so these are known as the fitted points; then again we have the regression line,

which has the smallest overall distance from the line to the points. So if you have

a look at the distance between the point and the regression line, what this line

shows is the deviation from the regression line, so exactly how far the

point is from the regression line. So let’s understand a simple use case of

linear regression with the help of a demo. So first of all there is a real

estate company use case which I’m going to talk about. So first of all here we

have John; he has some baseline for pricing the villas and the independent

houses he has in Boston. So here we have the data set description which we’re

going to use. So this data set has different columns such as the crime rate

per capita, which is CRIM; it has the proportion of residential

land zoned for lots, the proportion of non-retail business, the river, the nitric

oxide concentration, the average number of rooms, the proportion of

owner-occupied units built prior to 1940, the distance

to the five Boston employment centers, the index of accessibility to radial

highways and much more. So first of all let’s have a look at the data set we

have here. So one important thing here guys is

that I’m gonna be using Jupyter Notebook to execute all my practicals. You are

free to use the Spyder IDE or the console either, so it basically comes

down to your preference. So for my preference I’m going to use the Jupyter

Notebook. So for this use case we’re gonna use the Boston housing data set. So

as you can see here we have the data set, which has CRIM, ZN, INDUS, CHAS, NOX,

the different variables, and we have the data of almost, I would say,

500 houses. So what John needs to do is plan the pricing of the housing

depending upon all of these different variables so that it’s profitable for

him to sell the house and it’s easier for the customers also to buy the house.

So first of all let me open the code here for you. So first of all what we’re

gonna do is import the libraries necessary for this project. So we’re

going to use NumPy: we’re going to import numpy as np, import pandas as pd,

then we’re gonna also import matplotlib, and then what we are going to do

is read the Boston housing data set into the bos1 variable. So now what we are

going to do is create two variables, X and Y. So what we’re gonna do is take columns 0

to 13, I’ll say from CRIM to LSTAT, in X because that’s the independent

variable, and Y here is the dependent variable, which is MEDV, which is the

final price. So first of all what we need to do is plot a correlation. So what

we’re gonna do is import the seaborn library as sns; we’re going to use the

correlations to plot the correlations between the different 0 to 13 variables;

what we’re gonna do is also use MEDV here. So what we’re going to do is call sns

dot heatmap on the correlations; we’re going to use the square option

to differentiate; usually it comes up in squares only or circles, so you don’t know,

so we’re gonna use square; the cmap is YlGnBu,

this is the color; so there’s no rotation in the y axis and we’re gonna rotate the

x-axis labels by 90 degrees, and let’s plot it now. So this is what the

plot looks like. So as you can see here, the thicker or the darker the

color gets, the higher is the correlation between the variables. So for example if

you have a look at CRIM and MEDV, right, as you can see here the color

is very light, so the correlation is very low. So one important thing we

can see here is TAX and RAD: TAX is the full-value property tax rate and

RAD is the index of accessibility to the radial highways. Now these things are

highly correlated, and that is natural because the more it is connected to the

highway and the closer it is to the highway, the easier it is for people

to travel, and hence the tax on it is more as it is closer to the highways. Now

what we’re going to do is, from sklearn.cross_validation (sklearn.model_selection in current releases), we’re going to

import train_test_split, and we’re gonna split the data set now. So what we

are going to do is create four variables, which are X_train, X_test, Y_train and

Y_test, and we’re going to use the train_test_split function to split X

and Y, and here we’re going to use test_size 0.33, which will split the

data set so that the test size will be 33%, whereas the training size will be 67%.
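As a rough sketch of that split, the snippet below runs `train_test_split` with `test_size=0.33`. It uses random stand-in data with 506 rows and 13 feature columns instead of the real housing file, and the `random_state` value is an arbitrary assumption. The import comes from `sklearn.model_selection`, since the older `sklearn.cross_validation` module is no longer available in current scikit-learn releases.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 506 rows and 13 feature columns of random
# numbers, used only to show the shapes produced by the split.
rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13))
Y = rng.normal(size=506)

# Hold out 33% of the rows for testing; the remaining rows are for training.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=5
)

print("training rows:", X_train.shape[0])
print("testing rows:", X_test.shape[0])
```

With 506 rows this gives 339 training rows and 167 testing rows, which is roughly the 67/33 split described above.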

Now this is dependent on you; usually it is either 60/40 or 70/30; this depends on

your use case, the data you have, the kind of output you are getting, the model

you are creating and much more. Then again, from sklearn.linear_model

we’re going to import LinearRegression. Now this is the major function we’re

gonna use, just the LinearRegression class which is present in sklearn, which

is scikit-learn. So we’re going to create our linear regression model as lm,

the model which we are going to create, and we’re going to fit it on the training variables,

which are X_train and Y_train. Then we’re gonna create a prediction

variable, which is lm.predict applied to the X_test variables, which

will provide the predicted Y values. So now finally, if we plot the scatter

plot of Y_test and the predicted Y, and we give the

X label as Y_test and the Y label as Y predicted, we can see the

regression trend in the scatter plot, and if you want to

draw a regression line, usually it will go through all of these points,

excluding the extremities which are present here at the endpoints. So this is how

a normal linear regression works in Python: what you do is create a

correlation plot, you split the dataset into training and testing

variables, then again you define what is going to be your test size, import the

linear regression model, use the training data set to fit the model, use the

test data set to create the predictions, and then use Y_test and

the predicted Y to plot the scatter plot and see how close your model is

to the original data it had, and check the accuracy of that model. Now

typically you use these steps, which were collecting data, data

wrangling, analyzing the data, training the algorithm, testing the algorithm

and then deploying. So fitting a model means that you are making your algorithm

learn the relationship between predictors and the outcomes so that you

can predict the future values of the outcome. So the best fitted model has a

specific set of parameters which best defines the problem at hand. Since this

is a linear model with the equation Y = mX + c, in this case the

parameters that the model learns from the data are m and c. So this is what

model fitting is. Now if we have a look at the types of fitting which are available,

first of all, a machine learning algorithm first attempts to solve the

problem of underfitting, that is, of taking a line that does not

approximate the data well and making it approximate the data better. The

machine does not know where to stop in order to solve the problem, and it can go

ahead from appropriate to overfit mode. Sometimes when we say a model overfits,

we mean that it may have a low error rate for training data but it may not

generalize well to the overall population of the data we are interested

in. So we have underfit, appropriate and overfit; these are the types of fitting.
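One way to see these three kinds of fit is to fit polynomials of increasing degree to the same noisy points and compare the errors. This is only an illustrative sketch: the cubic curve, the noise level and the degrees 1, 3 and 9 are assumptions chosen for the example. The degree-1 fit is the straight line Y = mX + c, so its two coefficients are the m and c mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_curve(x):
    """Assumed underlying relationship, chosen only for illustration."""
    return x**3 - x


# Separate noisy samples for training and for testing.
x_train = np.linspace(-1.0, 1.0, 20)
x_test = np.linspace(-0.95, 0.95, 20)
y_train = true_curve(x_train) + rng.normal(scale=0.05, size=x_train.size)
y_test = true_curve(x_test) + rng.normal(scale=0.05, size=x_test.size)


def mse(coeffs, x, y):
    """Mean squared error of the fitted polynomial on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


errors = {}
for degree in (1, 3, 9):
    # Least-squares fit; for degree 1 the coefficients are the slope m
    # and the intercept c of the line Y = mX + c.
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE {errors[degree][0]:.4f}, "
          f"test MSE {errors[degree][1]:.4f}")
```

The training error can only go down as the degree increases, because each higher-degree model contains the lower-degree ones. The straight line underfits this curve, while a very flexible polynomial tends to chase the noise, and that often shows up as a test error that stops improving or gets worse.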

now guys this was linear regression which is a type of supervised learning

algorithm in machine learning so next what we’re going to do is understand the

need for logistic regression. So let’s consider a use case: political

elections are being contested in our country and suppose that we are

interested to know which candidate will probably win. Now the outcome variable’s

result is binary, either win or lose. The predictor variables are the amount of

money spent, the age, the popularity rank, etcetera. Now here the best fit

line in the regression model is going below 0 and above 1, and since the

value of Y will be discrete, that is either 0 or 1, the linear line has to be

clipped at 0 and 1. Now linear regression gives us only a single line to classify

the output. With linear regression, our resulting curve cannot be formulated

into a single formula, as you obtain three different straight lines. What we

need is a new way to solve this problem, so hence people came up with logistic

regression. So let’s understand what exactly logistic regression is. So logistic

regression is a statistical method for analyzing a data set in which there are

one or more independent variables that determine an outcome, and the outcome is

of a binary class type. For example, a patient goes for a routine checkup in the

hospital and his interest is to know whether the cancer is benign or

malignant. Now the patient’s data, such as sugar level, blood pressure, age, skin

width and the previous medical history, are recorded, and a doctor checks the

patient data and determines the outcome of his illness and the severity of illness.

The outcome will result in binary, that is zero if the cancer is malignant and

one if it’s benign. Now logistic regression is a statistical method used

for analyzing a dataset where there are, say, one or more independent variables, like we

discussed, like the sugar level, blood pressure,

skin width, the previous medical history, and the output is of a binary class type. So

now let’s have a look at the logistic regression curve. Now the logistic

regression curve is also called a sigmoid curve or the S curve. The sigmoid

function converts any value from minus infinity to infinity to a value between 0 and 1, which is then mapped to the discrete

value 0 or 1. Now how to decide whether the value is 0 or 1 from this curve? So

let’s take an example: what we do is provide a threshold value; we set it and we

decide the output from that function. So let’s take an example with the threshold

value of 0.4: any value above 0.4 will be rounded off to 1 and any value below 0.4

will be reduced to 0. So similarly we have polynomial regression also. So when

we have nonlinear data which cannot be predicted with a linear model, we switch

to polynomial regression. Now such a scenario is shown in the graph below. So

as you can see here, we have the equation y = 3x^3 + 4x^2 - 5x + 2.

Now here we cannot fit this with a straight line, so we need polynomial

regression to solve these kinds of problems. Now, alongside logistic

regression, there is another important algorithm, which is the decision tree, and this is one

of the most used algorithms in supervised learning. Now let’s understand

what exactly a decision tree is. So a decision tree is a tree-like structure

in which each internal node represents a test on an attribute, each branch

represents an outcome of the test, and each leaf node represents the class label, which is

a decision taken after computing all attributes.
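That structure can be sketched directly as nested data: internal nodes name the attribute being tested, branches are the possible outcomes, and leaves are class labels. The weather attributes and the yes/no labels below are a hypothetical example, and `classify` is a helper written for this sketch, not a library function.

```python
# Hypothetical tree: each dict is an internal node that tests one attribute,
# each key under "branches" is an outcome of that test, and each plain
# string is a leaf holding the class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rainy": {
            "attribute": "wind",
            "branches": {"strong": "no", "weak": "yes"},
        },
    },
}


def classify(node, example):
    """Follow one path from the root to a leaf and return its class label."""
    while isinstance(node, dict):
        outcome = example[node["attribute"]]
        node = node["branches"][outcome]
    return node


today = {"outlook": "sunny", "humidity": "normal", "wind": "weak"}
print("play:", classify(tree, today))
```

Each root-to-leaf path reads as one rule, for example "if the outlook is sunny and the humidity is normal, then play".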

A path from root to leaf represents classification rules, and a decision tree

is made from our data by analyzing the variables. Now

from the tree we can easily find out whether there will be a game tomorrow if

the conditions are rainy and less windy. Now let’s see how we can implement the

same. So suppose here we have a data set in which we have the outlook. So what we

can do is, for each of the outlooks, divide the data as sunny, overcast

and rainy. So as you can see, on the sunny side we get two yeses and three nos,

because the outlook is sunny, the humidity is high or normal

and the wind is weak or strong. So for a fully sunny day what we have is that

it’s not a pure subset, so what we’re gonna do is split it further. So if you

have a look at the overcast, we have humidity high and normal, wind weak, so yes,

during overcast we can play. And if you have a look at the rainy area, we have

three yeses and two nos, so again what we’re going to do is split it further. So when

we talk about sunny, then we have humidity; in humidity we have high and normal. So

when the humidity is normal we’re going to play, which is a pure subset, and if

the humidity is high we are not going to play, which is also a pure subset. Now

let’s do the same for the rainy day. So during a rainy day we have the wind

attribute: if the wind is weak it becomes a pure subset, we’re going to

play, and if the wind is strong it’s a pure subset, we’re not gonna play. So the

final decision tree looks like this: first of all we check if the outlook is

sunny, overcast or rainy; if it’s overcast we will play; if it’s sunny we then again

check the humidity: if the humidity is high we will not play, if the humidity is

normal we’ll play; then again, in the case of rainy, we check the wind: if the

wind is weak the play will go on, and similarly if the wind is strong the play

must stop. So this is how exactly a decision tree works. So let’s go ahead

and see how we can implement logistic regression and decision trees. Now for

logistic regression we’re going to use the cancer data set. So this is how the

data set looks. So here we have id, diagnosis, radius_mean, texture_mean,

perimeter_mean; these are the stats of particular cancer cells or the cysts

which are present in the body. So we have a total of 33 columns, all the way

starting from id to Unnamed: 32. So our main goal here is to define, or

I’ll say predict, whether the cancer is benign or malignant. So first of all what

we’re gonna do is, from sklearn.model_selection, we’re gonna import

cross_val_score, and again we’re going to use NumPy

for linear algebra; we’re gonna use pandas as pd for data processing,

the CSV file input, data manipulation as in SQL, and most of the stuff. Then

we’re going to import matplotlib; it is used for plotting the graph. We’re

going to import seaborn, which is used to plot graphs; like in the

last example, we saw we plotted a heatmap of correlations. So from sklearn we’re going

to import LogisticRegression, which is the major model or the algorithm

behind the whole logistic regression; we’re gonna import train_test_split so

as to split the data into two parts, the training and testing data sets; we’re

going to import metrics to check the error and the accuracy of the model, and

we’re gonna import DecisionTreeClassifier. So first of all what we’re

gonna do is create a variable data and use pandas, pd, to read the data from

the data set. Here header=0 means that the zeroth row holds our column names,

and to have a look at the data, or the top six rows of the data, we’re going to

use print on data.head and then get data.info. So as you can see here, we

have so many data columns, such as id, diagnosis, radius_mean, texture_mean,

perimeter_mean, area_mean, smoothness_mean; we have texture_worst,

symmetry_worst, we have fractal_dimension_worst and lastly we have the unnamed one. So first

of all we can see we have six rows and 33 columns, and if you have a look at all

of these columns here, we get the total number, which is 569, which is

the total number of observations we have. We check whether it’s non-null and

then again we check the type of the particular column, so it’s integer, it’s

object, float; most of them are float, some are integer. So now again

we’re going to drop the unnamed column, which is column 32 counting from 0,

that is the last of the 33 columns. In this process we will change it in our data

itself, so if you want to save the old data you can also keep a copy of that, but then

again that’s of no use. So data.columns will give us all of these

columns once we remove that, so you can see here in the output we do not

have the final one, which was the unnamed one; the last one we have is of type

float. Next, we also don’t want the id column for our analysis, so what we’re

gonna do is drop the id as well. So as I said above, the data can be

divided into three parts, so let’s divide the features according to their category.

Now as you know, our diagnosis column is of object type, so we can map it to an

integer value. What we wanna do is use data['diagnosis'] and we’re gonna

map M to 1 and B to 0, so that the output is either 1 or 0. Now if we use

data.describe, you can see here we have 8 rows and 31 columns, because we dropped

two of the columns, and for the diagnosis we have the values here. Let’s get the

frequency of the cancer diagnoses. Here we’re going to use seaborn’s

sns.countplot with data['diagnosis'] and the label Count, and if we use plt.show

you can see here that the count for diagnosis 0 is more and for 1 is less. To plot

the correlation among this data we’re going to use plt.figure and

sns.heatmap: we’re gonna use a heatmap and we’re going to plot the correlation

with cbar=True, we’re going to use square=True and we’re gonna use the coolwarm

colormap. So as you can see here, the correlation of radius_worst with

area_worst and perimeter_worst is high, and radius_worst also has a high

correlation with perimeter_mean and area_mean, because if the radius is

larger, the perimeter is larger and the area is larger. So based on the correlation plot let’s select

some features for the model. Now the selection is made in order to remove the

collinearity, so we will have a prediction variable in which we have

texture_mean, perimeter_mean, smoothness_mean, compactness_mean and

symmetry_mean; these are the variables which we’ll use for the

prediction. Now we’re gonna split the data into the training and testing data

sets. In this, our main data is split into a training and a test data set

with a 0.3 test size, that is a 70 to 30 ratio. Next what we’re going to do is

check the dimensions of the training and the testing data sets. What we’re

going to do is use the print command and pass the parameters train.shape and

test.shape. What we can see here is that we have almost 400, that is 398,

observations with 31 columns in the training data set,

whereas there are 171 rows and 31 columns in the testing data set. So then again what we’re

going to do is take the training data input: what we’re going to do is create

train_X with the prediction_var, and train_y is for

the diagnosis. Now this is the output of our training data. The same we do for the

test, so we’re going to use test_X for the test prediction

variables and test_y for the test diagnosis, which is the output of

the test data. Now we’re going to create a LogisticRegression object and call

logistic.fit, in which we’re going to fit the training data set, which

is train_X and train_y. Then we’re going to use temp, which is a

temporary variable in which we store the prediction for test_X, and then what we’re going to

do is compare temp, which is the prediction for test_X, with test_y to

check the accuracy. The accuracy we get here is 0.91. Then again what we need

to do, now that this was logistic regression, is use the decision tree

classifier, so we’re going to create a DecisionTreeClassifier with random_state

given as 0. Next what we’re going to do is create the

cross-validation score with clf: we take the model, we take train_X, train_y

and cv=10 for the cross-validation score. Now if we fit the

training set, the sample_weight we have not defined here, check_input

is True and X_idx_sorted is None. If we get the parameters with deep=True, we predict

using test_X and then predict the log probability of test_X, and if we

compare the score of test_X to test_y with sample_weight None, we get the

same result with the decision tree. So this is how you implement a decision tree

classifier and check the accuracy of the particular model. So that was it.
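
As a rough sketch of the two models just described, the snippet below fits a logistic regression and a decision tree on the same five features. It assumes scikit-learn’s bundled copy of the breast cancer data as a stand-in for the CSV file used in the session, so the column names (for example `mean texture` rather than texture_mean), the random_state of the split and the max_iter value are assumptions, and the scores will not exactly match the figure quoted above.

```python
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled data set stands in for the session's CSV file.
cancer = load_breast_cancer(as_frame=True)

# The five features picked after looking at the correlation plot.
prediction_var = ["mean texture", "mean perimeter", "mean smoothness",
                  "mean compactness", "mean symmetry"]
X = cancer.data[prediction_var]

# The bundled target uses 0 = malignant and 1 = benign; flip it so that
# malignant = 1 and benign = 0, as in the session.
y = (cancer.target == 0).astype(int)

# 70/30 split of the 569 observations.
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Logistic regression: fit, predict, compare with the true labels.
logistic = LogisticRegression(max_iter=5000)
logistic.fit(train_X, train_y)
temp = logistic.predict(test_X)
lr_accuracy = metrics.accuracy_score(test_y, temp)

# Decision tree: 10-fold cross-validation on the training set,
# then a fit and a score on the held-out test set.
clf = DecisionTreeClassifier(random_state=0)
cv_scores = cross_val_score(clf, train_X, train_y, cv=10)
clf.fit(train_X, train_y)
dt_accuracy = clf.score(test_X, test_y)

print("train/test shapes:", train_X.shape, test_X.shape)
print("logistic regression accuracy:", round(lr_accuracy, 3))
print("decision tree CV mean:", round(cv_scores.mean(), 3))
print("decision tree test accuracy:", round(dt_accuracy, 3))
```

The split sizes come out as 398 and 171 rows, the same as in the walkthrough, because the bundled data also has 569 observations.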

So next on our list is random forest, so let’s understand what exactly is a

random forest. A random forest is an ensemble classifier made using many

decision tree models. So what exactly are ensemble models? Ensemble models

combine the results from different models. The result from an ensemble model

is usually better than the result of any one of the individual models, because

every tree votes for one class and the final decision is based upon the majority of

votes. It is better than a decision tree because, compared to a decision tree,

it can be much more accurate, it runs efficiently on large data sets, it can

handle thousands of input variables without variable deletion, and what it

does is it gives an estimate of what variables are important in the

classification. So let’s take the example of weather data: let’s understand

random forest with the help of the hurricanes and typhoons data set. So we

have the data about hurricanes and typhoons from 1851 to 2014, and the data

comprises the location, wind and pressure of tropical cyclones in the Pacific

Ocean. Based on the data we have to classify the storms into hurricanes,

typhoons and the subcategories, as per the predefined classes mentioned.

So the predefined classes are: TD, a tropical cyclone of tropical depression

intensity, which is less than 34 knots; if it’s between 34 and 63 knots

it’s TS; greater than 64 knots it’s HU, which is hurricane intensity; EX

is an extratropical cyclone; SD, at less than 34 knots, is a subtropical cyclone of

subtropical depression intensity; SS, at greater than 34 knots, is a subtropical cyclone

of subtropical storm intensity; and then again we have LO, which is a low that is

neither a tropical cyclone, a subtropical cyclone nor an

extratropical cyclone; and then finally we have DB, which is a disturbance

of any intensity. Now these were the predefined class descriptions. So as you

can see, this is the data, in which we have the ID, name, date, event, status,

latitude, longitude, maximum wind, minimum pressure; there are so many variables.

So let’s start with importing pandas, then again we import

matplotlib, then we’re gonna use the Agg backend in matplotlib. We’re

going to use %matplotlib inline, which is used for plotting graphs inline in the

notebook, and I like it most for plots. So next what we’re going to do is import

seaborn as sns; this is used to plot the graphs again. And we’re going to

import from model selection, which is train_test_split, so we’re gonna import

it from sklearn. From scikit-learn we have to import metrics for checking the

accuracy, then we have to import sklearn, and then again from sklearn we have to

import tree. From sklearn.ensemble we’re gonna import

RandomForestClassifier, and from sklearn.metrics we’re going to import confusion_matrix

so as to check the accuracy, and from sklearn.metrics we’re gonna also import

accuracy_score. So let’s import random and let’s read

the data set and print the first six rows of the data set. You can see here we

have the ID, we have the name, date, time, event, status, latitude, longitude, so

in total we have 22 columns here. As you can see here, we have a column named

Status, which is TS, TS, TS for the first six rows. So what we’re gonna do is set

data['Status'] to pd.Categorical(data['Status']). What we can do is make

it categorical data with codes, so that it’s easier for the machine to

understand it: rather than having the categories as names, we’re gonna use the

categories as numbers, so it’s easier for the computer to do the analysis. So let’s

get the frequency of the different typhoons. What we’re going to do is call

random.seed, then again what we’re gonna do is drop the status, we have to

drop the event, because these are unnecessary. We’re gonna drop latitude,

longitude, we’re gonna drop ID, then name, the date and the time it occurred. So if

we print the prediction list (ignore the error here, that’s not necessary),

we have maximum wind, minimum pressure, low wind NE, low wind SE, low

wind SW and so on, and these are the parameters on which we’re going to do

the predictions. So now we’ll split that into training and testing data sets.
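
The preparation steps above (turning the status into numeric category codes, dropping the columns that are not predictors, then splitting) can be sketched as follows. The small table here is hypothetical and only mimics a few of the columns; the real data set has 22 columns, and the column names and random_state are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature of the storm table, for illustration only.
data = pd.DataFrame({
    "ID": ["EP011949"] * 5 + ["EP021949"] * 5,
    "Status": ["TS", "HU", "TD", "TS", "HU", "LO", "TS", "TD", "HU", "TS"],
    "Maximum Wind": [45, 70, 25, 40, 85, 20, 50, 30, 90, 35],
    "Minimum Pressure": [1000, 980, 1008, 1002, 965, 1010, 998, 1006, 960, 1004],
})

# Replace the status names with integer category codes
# (categories are sorted, so HU=0, LO=1, TD=2, TS=3 here).
data["Status"] = pd.Categorical(data["Status"]).codes

# Everything that is not an identifier or the target is a predictor.
prediction_var = [c for c in data.columns if c not in ("ID", "Status")]

# 70/30 split, then separate inputs and output for each part.
train, test = train_test_split(data, test_size=0.3, random_state=0)
train_x, train_y = train[prediction_var], train["Status"]
test_x, test_y = test[prediction_var], test["Status"]

print(prediction_var)
print(train.shape, test.shape)
```

Using the codes keeps one integer per class, which is enough for tree-based models since they only split on thresholds.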

Then again we have train, test, and we’re gonna use train_test_split

to split especially in the 70/30 industry-standard ratio. Now an important

thing here to note is that you can split it in any form you want: it can be either

60/40, 70/30 or 80/20. It all depends upon the model which you have or the

industrial requirement which you have. So then again, after printing, let’s check

the dimensions. The training data set comprises eighteen thousand two

hundred and ninety-five rows with twenty-two columns, whereas the testing data set

comprises roughly eight thousand rows with twenty-two columns. We have the training

data input train_x and we have train_y; status is the final output of the

training data, which will tell us the status: whether it’s a TS, a TD, whether it’s

an HU, which kind of hurricane or typhoon, or any of the subcategories

which are defined, which were like subtropical cyclone, subtropical

typhoon and much more. So our prediction, or the output variable, will be status.

So these are the list of the training columns which we have here. Now

the same we have to do for the test variables, so we have test_x with the

prediction_var and test_y with the status. So now what we’re going

to do is build a random forest classifier. In the model we have the

RandomForestClassifier with n_estimators as 100, a simple random forest model, and then we

fit the training data set, which is train_x and train_y. Then we again

make the prediction, which is model.predict with test_x.

This will predict for the test data, and prediction will

contain the values predicted by our model, the predicted values of the status column

for the test inputs. So if you print metrics.accuracy_score between

the prediction and test_y to check the accuracy, we get 95%

accuracy. Now if we do the same with a decision tree, again we’re

gonna use the model tree.DecisionTreeClassifier and we’re going to fit

train_x and train_y, which are the training data sets.
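
The two models being compared here can be sketched side by side. This is only an illustration on synthetic data generated with make_classification, not the hurricane data, so the data, the random_state values and the resulting accuracies are assumptions and will differ from the figures in the session.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the storm status labels.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Random forest: 100 trees, each one votes, the majority class wins.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(train_x, train_y)
forest_accuracy = accuracy_score(test_y, forest.predict(test_x))

# A single decision tree trained on the same data.
single_tree = DecisionTreeClassifier(random_state=0)
single_tree.fit(train_x, train_y)
tree_accuracy = accuracy_score(test_y, single_tree.predict(test_x))

print("random forest accuracy:", round(forest_accuracy, 4))
print("decision tree accuracy:", round(tree_accuracy, 4))

# The forest also estimates how important each input variable is.
print("feature importances:", forest.feature_importances_.round(3))
```

The feature_importances_ attribute is the estimate of variable importance mentioned earlier when random forest was introduced.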

The new prediction is model.predict on test_x. We’re going to create a data frame,

which is a pandas DataFrame, and if we have a look at the prediction and

test_y, you can see the status is 10, 10, 3, 3, 10, 10, 11 and 5, 5, 3, 11 and

3, 3, so it goes on and on. It has 7,842 rows and 1 column, and if you print the

accuracy we get 95.57 percent accuracy, and if you

have a look at the accuracy of the random forest, we get 95.66

percent, which is more than 95.57. So as I mentioned earlier, usually

random forest gives a better output, or creates a better model, than the decision

tree classifier because, as I mentioned earlier, it combines the results from

different models, so the final decision is based upon the majority of

votes and is usually more accurate than the decision tree models. So let’s move ahead

with our naive Bayes algorithm and let’s see what exactly naive Bayes is. So naive

Bayes is a simple but surprisingly powerful algorithm for predictive

modeling. It is a classification technique based on Bayes’ theorem with

an assumption of independence among predictors. It comprises two parts,

which are the naive and the Bayes. In simple terms, a naive Bayes classifier

assumes that the presence of a particular feature in a class is

unrelated to the presence of any other feature. Even if these features depend on

each other or upon the existence of the other features, all of these properties

independently contribute to the probability that a fruit is an apple

or an orange, and that is why it is known as naive. A naive Bayes model is easy to build

and particularly useful for very large data sets. In probability theory and

statistics, Bayes’ theorem, which is alternatively known as Bayes’ law or

Bayes’ rule, also written as Bayes’s theorem, describes the probability of an

event based on prior knowledge of conditions that might be related to the

event. So Bayes’ theorem is a way to figure out conditional probability.

Now conditional probability is the probability of an event happening given

that it has some relationship to one or more other events. For example,

your probability of getting a parking space is connected to the time of day you

park, where you park and what conventions are going on at the same time. So Bayes’

theorem is slightly more nuanced: in a nutshell, it gives us the actual

probability of an event given information about tests. So let’s talk

about Bayes’ theorem now. Given a hypothesis H and evidence E,

Bayes’ theorem states that the relationship between the probability of

the hypothesis before getting the evidence, P(H), and the probability of the

hypothesis after getting the evidence, which is P(H|E), is: P(H|E) equals P(E|H) into the

probability of H, divided by the probability of E, which means it’s the probability of

the evidence given the hypothesis, into the prior probability of the hypothesis, divided by

the probability of the evidence. So let’s understand it with a simple example here.
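
Written out as a formula, the statement above reads as follows, where H is the hypothesis and E is the evidence:

```latex
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}
```

Here P(H) is the prior, P(E | H) is the likelihood of the evidence under the hypothesis, P(E) is the probability of the evidence, and P(H | E) is the posterior.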

So now, for example, if a single card is drawn from a standard deck of playing

cards, the probability of that card being a king is 4 out of 52, since there

are 4 kings in a standard deck of 52 cards. Rewording this, if King is

the event that this card is a king, the prior probability of King, that is the

probability of King, equals 4 by 52, which in turn is 1 by 13. Now if evidence

is provided, for instance someone looks at the card, that the single card is a

face card, then the posterior probability, which is P of King given it’s a face,

can be calculated using Bayes’ theorem: the probability of King

given it’s a face is equal to the probability of Face given it’s a king, divided by the

probability of Face, into the probability of King. Since every king is also a face

card, the probability of Face given it’s a king is equal to 1,

and since there are 3 face cards in each suit, that is jack, king and queen, the

probability of a face card is 3 out of 13. Combining these given likelihood ratios,

we get, using Bayes’ theorem, that the probability of King given it’s a

face is equal to 1 out of 3. So for a joint probability distribution

with events A and B and the probability of A intersection B, the conditional

probability of A given B is defined as the probability of A intersection B divided by

the probability of B. Now this is how we get Bayes’ theorem. Now that we know

the basic proof of how we got Bayes’ theorem, let’s have a look at

the working of Bayes’ theorem with the help of an example here. So let’s

take the same example of the data set of the weather forecast, in which we had

sunny, rainy and overcast. So first of all what we’re

gonna do is create a frequency table using each attribute of

the data set. As you can see here, we have the frequency tables here for the

outlook, humidity and the wind. For outlook we have the frequency table here,

and we have the frequency tables for humidity and the wind. So next what we’re gonna do

is compute the probability of sunny given yes, that is three out of ten, and find the

probability of sunny, which is five out of 14; this 14 comes from the total

number of observations there are, from yes and no. Similarly we’re gonna find

the probability of yes also, which is 10 out of 14, which is 0.71. For each

frequency table we’ll generate these kinds of likelihood tables. So the likelihood

of yes given it’s sunny is equal to 0.51.

Similarly, the likelihood of no given sunny is equal to 0.40. So here you can

see that using Bayes’ theorem we have found out the likelihood of yes given

it’s sunny and no given it’s sunny. Similarly we’re gonna build the same

likelihood table for humidity and the same for wind. So for humidity we’re

gonna check the probability of playing yes given the humidity is high, and the probability of

playing no given the humidity is high; we’re going to calculate it using the

same Bayes’ theorem. So suppose we have a day with the following values, in which

we have the outlook as rain, humidity as high and wind as weak. Since we discussed

the same example earlier with the decision tree, we know the answer, so

let’s not get ahead of ourselves and let’s try to find out the answer using

Bayes’ theorem. Let’s understand how naive Bayes works,

actually. So first of all we’re gonna compute the likelihood of yes on that day, so

that equals the probability of outlook rain given yes, into the probability

of humidity high given yes, into the probability of wind weak given yes, into the probability of yes. Okay, so

that gives us 0.019. Similarly, the likelihood of

no on that day is outlook rain given no, into humidity high given

no, into wind weak given no, into the probability of no, and that equals 0.016.

Now what we’re going to do is find the probability of yes and no, and for that

what we’re going to do is take the likelihood and divide it

by the sum of the likelihoods of yes and no, and that way we’re gonna get

the probability of yes overall. Using that formula we get the

probability of yes as 0.55 and the probability of no as

0.45, and our model predicts that there is a fifty-five percent

chance that there will be a game tomorrow if it’s rainy,

the humidity is high and the wind is weak. Now if you have a look at the

industrial use cases of naive Bayes, we have news categorization,

as what happens is that the news comes with a lot of tags and it has to be

categorized so that the user gets the information he needs in a particular

format. Then again we have spam filtering, which is one of the major use cases of

the naive Bayes classifier, as it classifies the email as spam or ham. Then finally we

have weather prediction also, as we saw just now with the example where we predict

whether we’re going to play or not; that sort of prediction is always there.
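
The arithmetic in the play-or-not example can be checked with a few lines of code. This sketch only uses numbers quoted in the talk (3 out of 10, 10 out of 14, 5 out of 14, and the two likelihoods 0.019 and 0.016); the function names are hypothetical. Exact arithmetic gives 3/5 for yes given sunny, which differs from the figure quoted in the talk, and normalising the two rounded likelihoods gives about 0.54 and 0.46, close to the 0.55 and 0.45 quoted, so treat the spoken figures as approximate.

```python
from fractions import Fraction


def bayes_posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / evidence


def normalise(scores):
    """Divide each class score by the sum of all class scores."""
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}


# P(yes | sunny) from P(sunny | yes) = 3/10, P(yes) = 10/14, P(sunny) = 5/14.
p_yes_given_sunny = bayes_posterior(
    Fraction(3, 10), Fraction(10, 14), Fraction(5, 14))
print(p_yes_given_sunny)  # 3/5

# Likelihoods for the day with outlook rain, humidity high, wind weak.
day = normalise({"yes": 0.019, "no": 0.016})
print(round(day["yes"], 2), round(day["no"], 2))  # 0.54 0.46
```

The normalisation step is what turns the two products into probabilities that add up to one, which is why the model can report a percentage chance of play.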

Guys, this was all about supervised learning: we discussed linear regression,

logistic regression, we discussed naive Bayes, we’ve discussed random forest and

decision tree, and we understood how the random forest is better than the decision

tree. In some cases it might be equal to the decision tree, but nonetheless it’s

usually gonna provide us a better result. So guys, that was all about

supervised learning. But before we move on, let’s go ahead and see how exactly we’re

gonna implement naive Bayes. So guys, here we have another data set,

run or walk. It’s a kinematic data set and it has been measured using a

mobile sensor. So let the target variable be y and assign all the columns

after it to X. Using the scikit-learn naive Bayes model we’re going to observe the

accuracy and generate a classification report using scikit-learn.
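
The exercise just described can be sketched like this. The sensor readings below are randomly generated, not the actual run or walk file, so the column names, the way the fake data is produced and the resulting scores are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the sensor data: 0 = walk, 1 = run.
rng = np.random.default_rng(0)
n = 1000
activity = rng.integers(0, 2, size=n)
columns = ["acceleration_x", "acceleration_y", "acceleration_z",
           "gyro_x", "gyro_y", "gyro_z"]
# Running shifts the mean and widens the spread of every reading.
readings = rng.normal(loc=activity[:, None] * 1.0,
                      scale=1.0 + activity[:, None],
                      size=(n, len(columns)))
df = pd.DataFrame(readings, columns=columns)
df["activity"] = activity


def fit_and_report(feature_names):
    """Fit Gaussian naive Bayes on the given columns and score it."""
    X = df[feature_names]
    y = df["activity"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    classifier = GaussianNB()
    classifier.fit(X_train, y_train)
    y_predict = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_predict)
    matrix = confusion_matrix(y_test, y_predict)
    report = classification_report(y_test, y_predict,
                                   target_names=["walk", "run"])
    return accuracy, matrix, report


accuracy_all, matrix_all, report_all = fit_and_report(columns)
accuracy_acc, _, _ = fit_and_report(columns[:3])   # acceleration only
accuracy_gyro, _, _ = fit_and_report(columns[3:])  # gyro only

print("all features:", round(accuracy_all, 3))
print("acceleration only:", round(accuracy_acc, 3))
print("gyro only:", round(accuracy_gyro, 3))
print(matrix_all)
print(report_all)
```

Wrapping the fit in one function makes it easy to repeat the model with different predictor subsets and compare the accuracies.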

Now we’re going to repeat the model, once using only the acceleration values as

predictors and then using only the gyro values as predictors, and we’re going to

comment on the difference in accuracy between the two models. So here we have a

data set which is run or walk, so let me open that for you. Here is the data

set, run or walk, and as you can see we have the date, time, username, wrist,

activity, acceleration x, y, acceleration z, gyro x, gyro y, gyro z. So based on it

let’s see how we can implement the naive Bayes classifier. So first of all

what we’re gonna do is import pandas as pd, then we’re gonna import matplotlib

for plotting. We’re gonna read the run or walk data file with pandas,

pd.read_csv. Let’s have a look at the info: first of all we see that we have

88,588 rows with 11 columns, so we have the

date, time, username, wrist, activity, acceleration x, y, z, gyro x, y, z, and the memory

usage is 7.4 MB of data. This is how you look at the columns:

df.columns. Now again we’re gonna split the data set into training and testing data

sets, so we’re going to use the train_test_split function. What we’re

gonna do is split it into X_train, X_test, y_train, y_test, and we’re gonna split it

with a test size of 0.2 here. So again, I am saying it depends on you what the

test size is. Let’s print the shape of the training set and see: it’s around 70,000

observations with six columns. Now what we’re going to do is, from

sklearn.naive_bayes, we’re going to import GaussianNB, which is the

Gaussian naive Bayes, and we’re going to set the classifier as GaussianNB. Then

we’ll pass on the X_train and y_train variables to the classifier, and again we

have y_predict, which is classifier.predict on X_test, and we’re gonna

compare y_predict with y_test to see the

accuracy. For that, from sklearn.metrics

we’re going to import accuracy_score. Now let’s compare both of these. The

accuracy we get is 95.54 percent. Now another way

is to get a confusion matrix built, so from sklearn.metrics we’re going

to import confusion_matrix and we’re gonna print the matrix of y_predict

and y_test. As you can see here, we have 90 and 699; that’s a very good

number. So now what we’re gonna do is create a classification report. From

metrics we’re gonna import classification_report, we’re

going to put the target names as walk, run and print the report using

y_test and y_predict with the target names we have. For walking we get a

precision of 0.92 and a recall of 0.99, the F1 score is 0.96 and the

support is 8,673, and for running we get a

precision of 99 percent with a recall of 0.92 and an F1 score of

0.95. So guys, this is exactly how you use the Gaussian NB, or the

naive Bayes classifier, and most of these types of algorithms, which are

present in supervised or unsupervised learning, are

present in the scikit-learn library. So once again, sklearn is a very

important library when you are dealing with machine learning, because you do not

have to hard-code any algorithm; the common algorithms are present

there. All you have to do is just pass the data: split the data set into

training and testing data sets, and then again you have to find the predictions

and then compare the predicted y with the test y. That is exactly what

we do every time we work on a machine learning algorithm. Now guys, that was all

about supervised learning; let’s go ahead and understand what exactly is

unsupervised learning. Sometimes the given data is unstructured and unlabeled,

so it becomes difficult to classify the data into different categories. Unsupervised

learning helps to solve this problem. This learning is used to cluster the

input data into classes on the basis of their statistical properties. For example,

we can cluster different bikes based upon their speed limit, their acceleration

or the average mileage that they are giving. So unsupervised learning is a type of machine

learning algorithm used to draw inferences from data sets consisting of

input data without labeled responses. If you have a look at the workflow, or

the process flow, of unsupervised learning, the training data is a

collection of information without any labels. We have the machine learning

algorithm and then we have the clustering model. What it does is that it

distributes the data into different clusters, and again if you provide any

unlabeled new data, it will make a prediction and find out to which cluster

that particular data, or the data set, belongs, or the particular data point

belongs. So one of the most important algorithms in unsupervised learning is

clustering, so let’s understand exactly what clustering is. So clustering

basically is the process of dividing the data sets into groups consisting of

similar data points. It means grouping of objects based on

the information found in the data describing the objects or their

relationships. So clustering models focus on identifying groups of similar

records and labeling records according to the group to which they belong. Now

this is done without the benefit of prior knowledge about the groups and

their characteristics, and in fact we may not even know exactly how many

groups there are to look for. Now these models are often referred to as

unsupervised learning models, since there is no external standard by which to

judge the model’s classification performance; there are no right or wrong

answers for these models. And if we talk about why clustering is used, the goal

of clustering is to determine the intrinsic grouping in a set of unlabeled

data. Sometimes the partitioning is the goal, or the purpose

of the clustering algorithm is to make sense of, and extract value from, large sets of

structured and unstructured data. So that is why clustering is used in the

industry. And if you have a look at the various use cases of clustering in the

industry, first of all it’s being used in marketing: discovering distinct

groups in customer databases, such as customers who make a lot of

long-distance calls or customers who use the internet more than

calls. It’s also used by insurance companies for, like, identifying groups of

corporation insurance policy holders with a high average claim rate, or farmers’ crash

crops, which is profitable. It’s used in seismic studies to define probable

areas of oil or gas exploration based on seismic data, and it’s also used in

the recommendation of movies, if you would say. It’s also used in Flickr

photos, and it’s also used by Amazon for recommending the product and which category

it lies in. So basically if we talk about clustering, there are three types of

clustering. First of all we have exclusive clustering, which is hard

clustering: here an item belongs exclusively to one cluster, not several

clusters, so each data point belongs exclusively to one cluster. An example

of this is k-means clustering; k-means clustering does this exclusive

kind of clustering. Secondly we have overlapping clustering, which is also

known as soft clustering. In this an item can belong to multiple clusters, as its

degree of association with each cluster is shown, and for example we have fuzzy

c-means clustering, which is used for overlapping clustering.

And finally we have hierarchical clustering: when two clusters have a

parent-child relationship, or a tree-like structure, then it is known as a

hierarchical cluster. As you can see here from the example, we have a

parent-child kind of relationship in the clusters given here. So let’s understand

what exactly k-means clustering is. K-means clustering is an algorithm

whose main goal is to group similar elements or data points into a cluster,

and it is the process by which objects are classified into a predefined number

of groups so that they are as dissimilar

as possible from one group to another group, but as similar as possible

within each group. Now if you have a look at how the algorithm works here,

first of all it starts with identifying the number of clusters, which is

K. Then again we find the centroids, we find the

distance of the objects to the centroids, then we find

the grouping based on the minimum distance. Have the centroids converged? If

true, then we make the clusters; if false, we then again find the centroids and repeat all

of the steps again and again. So let me show you how exactly clustering works with

an example here. First we need to decide the number of clusters to be made.

Now another important task here is how to decide the optimal number of

clusters, or how to decide the number of clusters; we’ll get into that later. So

first let’s assume that the number of clusters we have decided is three.

After that we provide the centroids for all the clusters, which is guessing.

The algorithm calculates the Euclidean distance of each point from each centroid

and assigns the data point to the closest cluster. Now Euclidean distance,

as all of you know, is the square root of the sum of the

squared differences between the coordinates. Next, when the centroids are calculated again, we have

our new clusters for each data point. Then again the distances from the points

to the new centroids are calculated, and then again the points are assigned to

the closest cluster, and then again we have the new centroids calculated. Now

these steps are repeated until we have a repetition in the centroids, or the new

centroids are very close to the previous ones. So unless our output

gets repeated, or the outputs are very close, we do not stop this

process: we keep on calculating the Euclidean distance of all the points to

the centroids, then we calculate the new centroids, and that is how k-means

clustering works, basically. So an important part here is to understand how

to decide the value of K, or the number of clusters.
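
Before turning to the choice of K, the loop described above (guess the centroids, assign every point to the nearest one by Euclidean distance, recompute the centroids, repeat until they stop moving) can be sketched in plain NumPy. The function name, the tolerance, the seed and the three made-up blobs of points are all assumptions for illustration; this is not the library implementation.

```python
import numpy as np


def assign_to_nearest(points, centroids):
    """Label each point with the index of its closest centroid."""
    # Euclidean distance from every point to every centroid.
    distances = np.linalg.norm(
        points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def k_means(points, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal k-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Guess the initial centroids by picking k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        labels = assign_to_nearest(points, centroids)
        # New centroid = mean of the points assigned to it
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centroids[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centroids, centroids, atol=tol)
        centroids = new_centroids
        if converged:
            break
    return centroids, assign_to_nearest(points, centroids)


# Three made-up blobs of 2-D points.
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(50, 2))
    for center in [(0, 0), (5, 5), (0, 5)]
])
centroids, labels = k_means(points, k=3)
print(centroids)
print(np.bincount(labels, minlength=3))
```

Because the starting centroids are a random guess, different seeds can give different clusterings, which is why several runs with different starting points are usually recommended.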

It does not make any sense if you do not know how many clusters you are going to

make. So to decide the number of clusters we have the elbow method. So let’s

first of all compute the sum of squared errors, which is the SSE, for some values

of K, for example let’s take two, four, six and eight. Now the SSE, which is the sum of

squared errors, is defined as the sum of the squared distances between each

member of the cluster and its centroid. Mathematically

it is given by the equation which is provided here, and if you plot K

against the SSE, you will see that the error decreases as K gets larger. Now this

is because as the number of clusters increases, the clusters should be smaller, so the

distortion is also smaller. Now the idea of the elbow method is to choose the K

at which the SSE decreases abruptly. So for example, if we have a look at

the figure given here, we see that the best number of clusters is at the elbow.
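
The elbow method can be tried out with scikit-learn, where the fitted KMeans model exposes the SSE as its inertia_ attribute. The data here is generated with make_blobs around four centers, so the data, the range of K values and the random_state are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Made-up data with four groups, so the elbow is expected near K = 4.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.8, random_state=0)

# SSE (sum of squared distances of points to their closest centroid)
# for each candidate number of clusters.
sse = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = model.inertia_

for k, value in sse.items():
    print(k, round(value, 1))
```

Plotting K against these values gives the curve described above: the SSE keeps dropping as K grows, and the K where the drop flattens out is the elbow.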

So as you can see here, the graph changes abruptly after number four, so for

this particular example we’re going to use four as the number of clusters. So first

of all, while working with k-means clustering there are two key points to

know. First of all, be careful about where you start: choose the first center

at random, choose the second center so that it is far away from the first center,

and similarly choose the nth center as far away as possible from the closest of

all the other centers. And the second idea is to do many runs of k-means,

each with different random starting points, so that you get an idea of where

exactly and how many clusters you need to make, where exactly the centroids

lie and how the data is converging. Now k-means is not exactly

a very good method, so let’s understand the pros and cons of k-means clustering.

So we know that k-means is simple and understandable;

everyone gets it at the first go, and the items are automatically assigned to the

clusters. Now if we have a look at the cons, first of all one needs to

define the number of clusters. This is a very heavy task: whether we have three, four

or ten categories, if you do not know

what the number of clusters is going to be, it’s very difficult for anyone to

guess it. Next, all the items are forced into clusters: whether or not

they actually belong to some other cluster or category, they are

forced to lie in the cluster they are closest to. And this

again happens because of not defining the correct

number of clusters, or not being able to guess the correct number of clusters.

And most of all, it’s unable to handle noisy data and outliers.

Anyway, machine learning engineers and data scientists have to clean the

data, but then again it comes down to the analysis they are doing and the

method they are using. Typically people do not clean the data for k-means

clustering, or even if they clean it, sometimes noise and

outliers remain in the data and affect the whole model. So that was all for k-means

clustering. So what we’re gonna do now is use k-means clustering on the movie

data set, so we have to find out the number of clusters and divide it

accordingly. So the use case is that first of all we have a data set of

five thousand movies, and what we want to do is group the movies into

clusters based on the Facebook likes. So guys, let’s have a look at the demo here.

So first of all what we’re gonna do is import deepcopy, NumPy, pandas, Seaborn,

the various libraries which we’re going to use now, and from Matplotlib we’re

gonna use pyplot, and we’re gonna use the ggplot style. And next what we’re gonna

do is import the data set and look at the shape of the data set. So if you have

a look at the shape of the data set, we can see that it has five thousand and

forty-three rows with 28 columns, and if you have a look at the head of the data

set, we can see it has five thousand forty-three data points. So

what we’re gonna do is place the data points in the plot. We take the director

Facebook likes, and if we have a look at the data columns, yeah: face number in poster,

cast total Facebook likes, director Facebook likes. So what we have done here

now is take the director Facebook likes and the actor 3 Facebook likes,

right, so we have five thousand forty-three rows and two columns. Now using the

KMeans from sklearn, what we’re going to do is import it first:

we import KMeans from sklearn.cluster. Remember guys, sklearn is a very

important library in Python for machine learning. And the number of clusters

we’re gonna provide is five. Note that, again, the number of clusters

depends upon the SSE, which is the sum of squared errors, for which we use

the elbow method, so I’m not going to go into the details of that again. So we’re

gonna fit the data with kmeans.fit, and then we find the cluster centers

for the k-means and print them. So what we find is an array of five

cluster centers, and we can also print the labels of the k-means clusters. Now next

what we’re gonna do is plot the data which we have with the new

clusters which we have found, and for this we’re going to use Seaborn. And

as you can see here, we have plotted the data into the

grid, and you can see here we have five clusters. So probably what I would say is

that cluster three and cluster zero are very, very close, so it might

depend. See, that’s exactly what I was going to say: the main

challenge in k-means clustering is to define the number of centers, which is

the K. So as you can see here, the

third cluster and the zeroth cluster are very, very close to each other, so guys, they

probably could have been one single cluster. And another disadvantage was

that we do not exactly know how the points are to be arranged, so it’s very

difficult to force the data into any other cluster, which makes our analysis a

little difficult. It works fine, but sometimes it might be

difficult to code in k-means clustering.

Now let’s understand what exactly c-means clustering is. So fuzzy c-means

is an extension of k-means, the popular simple

clustering technique. Fuzzy clustering, also referred to as soft clustering, is a

form of clustering in which each data point can belong to more than one

cluster. So k-means tries to find hard clusters, where each point belongs

to one cluster, whereas fuzzy c-means discovers soft clusters. In a soft

cluster, any point can belong to more than one cluster at a time, with a

certain affinity value towards each. Fuzzy c-means assigns a degree of

membership, which ranges from 0 to 1, of an object to a given cluster, and there is

a stipulation that the sum of the fuzzy memberships of an object over all the

clusters it belongs to must be equal to 1. So say the degrees of membership of this

particular point to the two clusters are 0.6 and 0.4; if you add them

up you get 1. So that is the logic behind fuzzy c-means, and this

affinity is inversely related to the distance from the point to the center of the

cluster. Now then again we have the pros and cons of fuzzy c-means. So first of

all, it allows a data point to be in multiple clusters, that’s a pro. It’s a

more natural representation of the behavior of genes: genes usually are

involved in multiple functions, so it is a very good type of clustering when we

are talking about genes. And if we talk about the cons, again we

have to define C, which is the number of clusters, same as K. Next, we need to

determine the membership cutoff value also, and that

is time-consuming. And the clusters are sensitive to the initial assignment of

centroids, so a slight change or deviation in the centers is going to result in

a very different, you know, a funny kind of output from

fuzzy c-means. And one of the major disadvantages of c-means clustering is

that it’s a non-deterministic algorithm, so it does not give you one

particular output as such. So that’s that. Now let’s have a look at the

third type of clustering, which is hierarchical clustering. So

hierarchical clustering is an alternative approach which builds a

hierarchy from the bottom up or from the top down, and does not require you to

specify the number of clusters beforehand.
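
For the bottom-up variant, a minimal sketch might look like the following: every point starts in its own cluster, and the two closest clusters are merged repeatedly until one cluster remains. The one-dimensional toy points, the function names and the choice of single linkage (distance between the closest pair of members) are assumptions of this sketch; the session does not say which distance between clusters is used.

```python
def single_linkage(a, b):
    """Distance between two clusters of 1-D points: their closest pair of members."""
    return min(abs(x - y) for x in a for y in b)

def agglomerative(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]             # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest linkage distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        distance = single_linkage(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], distance))
        merged = sorted(clusters[i] + clusters[j])
        # replace the two merged clusters by their union
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

history = agglomerative([1.0, 2.0, 9.0, 10.0, 25.0])
for left, right, distance in history:
    print(left, "+", right, "at distance", distance)
```

The merge history, read from first to last merge, is the information a dendrogram draws from the bottom level upward.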

And the algorithm works as follows: first of all we put each data point in its own

cluster, identify the closest two clusters and combine them into one cluster, and

repeat the above step till all the data points are in a single cluster. Now there

are two types of hierarchical clustering: one is agglomerative clustering and the

other one is divisive clustering. So agglomerative clustering builds the

dendrogram from the bottom level, while divisive clustering starts with all the

data points in one cluster, the root cluster. Now again, hierarchical clustering also

has some pros and cons. So in the pros, no assumption of a

particular number of clusters is required, and it may correspond to meaningful

taxonomies. Whereas if we talk about the cons, once a decision is made to

combine two clusters it cannot be undone, and one of the major disadvantages of

hierarchical clustering is that it becomes very slow if we talk about very,

very large data sets. And nowadays I think every industry is using large data

sets and collecting large amounts of data, so hierarchical clustering is not the

apt or the best method someone might go for. So there’s that. Now when

we talk about unsupervised learning, we have k-means clustering, and

there’s another important topic which people usually miss while talking about

unsupervised learning: the very important concept of market basket

analysis. Now it is one of the key techniques used by large retailers to

uncover associations between items. It works by looking for combinations of

items that occur together frequently in transactions. To put it another

way, it allows retailers to analyze the relationships between the items that

people buy. For example, people who buy bread also tend to buy butter. The

marketing team at the retail store should target customers who buy bread

and butter and provide them an offer so that they buy a third item,

like eggs. So if a customer buys bread and butter and sees a discount or an

offer on eggs, he will be encouraged to spend more money and buy the eggs.

This is what market basket analysis is all about. Now to find the association

between two items and make predictions about what the customers

will buy, there are two algorithms, which are association rule mining and the

Apriori algorithm. So let’s discuss each of these algorithms with an example. First

of all, if we have a look at association rule mining, it’s a

technique that shows how items are associated with each other. For example,

customers who purchase bread have a 60% likelihood of also purchasing jam, and

customers who purchase a laptop are more likely to purchase laptop bags. Now if

you take an example of an association rule, if you have a look at the example

here, A → B, it means that if a person buys an item A then he will also buy an

item B. Now there are three common ways to measure a particular association,

because we have to find these rules on the basis of some statistics, right? So

what we do is use support, confidence and lift. These are the three common

measures used to look at an association rule and know exactly

how good that rule is. So first of all we have support. Support gives the

fraction of the transactions which contain both items A and B, so it’s

basically the frequency of the item set in the whole set of transactions:

support(A, B) = freq(A, B) / N, where N is the number of transactions. Whereas

confidence gives how often the items A and B occur together, given the

number of times A occurs, so it’s freq(A, B) divided by

the frequency of A. Now what lift indicates is the strength of the rule

over the random co-occurrence of A and B:

lift = support(A, B) / (support(A) × support(B)). If you have a close look at the

denominator of the lift formula here, we have support(A) into support(B). Now a

major thing which can be noted from this is that the supports of A and B are

taken independently here. So if the denominator of the lift is

larger, it means that the items are selling more independently, not together,

so that in turn will decrease the value of lift. So suppose

the value of lift

we get is high: it implies that the rule is strong and it can be used for later

purposes, because in that case the support(A) into support(B) value, which is

the denominator of lift, will be low relative to the joint support, which in turn

means that there’s a

relationship between the items A and B. So let’s take an example of association

rule mining and understand how exactly it works. So let’s suppose we have a set

of items A, B, C, D and E, and we have the set of transactions, which are T1, T2, T3,

T4 and T5, and what we need to do is create some sort of rules. For example,

you can see A → D, which means that if a person buys A he buys D; if a person buys

C he buys A; if a person buys A he buys C; and the fourth one is, if a person

buys B and C he in turn buys A. Now what we need to do is calculate the

support, confidence and lift of these rules. Now here again we talk about the

Apriori algorithm. So the Apriori algorithm and association rule mining go

hand in hand. What the Apriori algorithm does is use the frequent item sets

to generate the association rules, and it is based on the concept that a subset of

a frequent item set must also be a frequent item set. So let’s understand

what a frequent item set is and how all of these work together. So if we take the

following transactions of items, we have transactions T1, T2, up to T5, and the items

are {1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5} and {1, 3, 5}. Now another important

thing about

support which I forgot to mention is that when talking about association rule

mining there is a minimum support count which we need to set. Now the first step is

to build a list of item sets of size 1 using this transaction data set, and use

a minimum support count of 2. Now let’s see how we do that. If we create the

table C1 and have a close look at it, we have the item set {1},

which has a support of 3 because it appears in transactions T1, T3 and T5. Similarly, if

you have a look at the single item 3, it has a support of 4:

it appears in T1, T2, T3 and T5. But if we have a look at

the item 4, it appears in only one transaction, so its support value

is 1. Now the item sets with a support value less than the minimum

support value, that is 2, have to be eliminated, so the final table, which is

table F1, has 1, 2, 3 and 5; it does not contain 4. Now what we’re going to do

is create the item sets of size 2, and all the combinations of the items

in F1 are used in this iteration. So we’ve left 4 behind; we just have 1, 2,

3 and 5, so the possible item sets are {1, 2}, {1, 3}, {1, 5}, {2, 3}, {2, 5} and

{3, 5}. Then again we’ll

calculate the support. So in this case, if we have a closer look at the table C2,

we see that the item set {1, 2} has a support value of 1, so it has to be

eliminated, and the final table F2 does not contain {1, 2}. Similarly, we

create the item sets of size 3 and calculate their support values, but

before calculating the support let’s perform pruning on the data set. Now

what is pruning? After all the combinations are made, we go through the

item sets in table C3 to check whether any of them has a subset whose support is less

than the minimum support value. This is the Apriori principle. So in the item set

{1, 2, 3} we can see that we have {1, 2}, and in {1, 2, 5} again we have {1, 2}, so we’ll

discard both of these item sets and we’ll be left with {1, 3, 5} and {2, 3, 5}. So with

{1, 3, 5} we have three subsets, {1, 5}, {1, 3} and {3, 5}, which are present in table F2;

then again we have {2, 3}, {2, 5} and {3, 5}, which are also present in table F2. So we

have to remove the item sets containing {1, 2} from the table C3 and create the table F3.
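
The level-by-level search walked through above can be sketched in a few lines of plain Python on the same five transactions. This is an illustrative sketch, not the code from the session: the function names are made up here, and the confidence and lift helpers simply follow the definitions given earlier (freq(A, B) / freq(A), and support(A, B) / (support(A) × support(B))).

```python
from itertools import combinations

# The five transactions T1..T5 from the walkthrough.
TRANSACTIONS = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {1, 3, 5}]
MIN_SUPPORT_COUNT = 2

def support_count(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def apriori(transactions, min_count):
    """Return {size: {itemset: support count}} for every level with frequent item sets."""
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]          # table C1
    frequent_by_size = {}
    size = 1
    while candidates:
        counted = {c: support_count(c, transactions) for c in candidates}
        frequent = {c: n for c, n in counted.items() if n >= min_count}  # table F(size)
        if not frequent:
            break
        frequent_by_size[size] = frequent
        size += 1
        items_left = sorted({i for c in frequent for i in c})
        # next candidates: combinations whose smaller subsets are all frequent (pruning)
        candidates = [
            frozenset(combo)
            for combo in combinations(items_left, size)
            if all(frozenset(sub) in frequent for sub in combinations(combo, size - 1))
        ]
    return frequent_by_size

def confidence(antecedent, consequent):
    """freq(A, B) / freq(A) for the rule antecedent -> consequent."""
    return support_count(set(antecedent) | set(consequent)) / support_count(antecedent)

def lift(antecedent, consequent):
    """support(A, B) / (support(A) * support(B))."""
    n = len(TRANSACTIONS)
    joint = support_count(set(antecedent) | set(consequent)) / n
    return joint / ((support_count(antecedent) / n) * (support_count(consequent) / n))

levels = apriori(TRANSACTIONS, MIN_SUPPORT_COUNT)
for size, table in levels.items():
    print(size, sorted((sorted(s), n) for s, n in table.items()))
print("confidence {1,3} -> {5}:", confidence({1, 3}, {5}))
```

Because the size-3 subsets of {1, 2, 3, 5} are not all frequent, this sketch prunes that candidate before counting it, so the search ends with the same two item sets of size 3.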

Now if we use the item sets of C3 to create the item sets of C4, what we find

is that we have the item set {1, 2, 3, 5}, whose support value is 1, which is less than

the minimum support value of 2. So what we’re going to do is stop

and return to the previous item sets, that is the table C3. So the

final table F3 has {1, 3, 5} with a support value of 2 and {2, 3,

5} with a support value of 2. Now what we’re going to do is generate all

the subsets of each frequent item set. Let’s assume that our minimum confidence

value is 60%. For every subset S of a frequent item set I, the output

rule is S → (I - S), that is, S recommends I - S, if the support of I

divided by the support of S is greater than or equal to the minimum confidence

value; only then do we proceed further. So keep in mind that we have not used lift

till now; we are only working with support and confidence. So applying the rules

to the item sets of F3, we get rule 1, which is {1, 3} → ({1, 3, 5} - {1, 3}). It

means if you buy 1 and 3 there’s a 66% chance that you’ll buy

item 5 also. Similarly, the rule for {1, 5} means that if you buy 1 and 5

there’s a 100% chance that you will buy 3 also. Similarly, if we have a look at rules

5 and 6 here, the confidence value is less than 60%, which was the assumed minimum

confidence value, so what we’re going to do is reject these rules. Now an

important thing to note here: have a closer look at rule 5 and

rule 3. You see it has 1, 5, 3; 1, 5, 3; 3, 1, 5.

It’s very confusing, so one thing to keep in mind is that the order of the item

sets is also very important; that will help us create good rules and

avoid any kind of confusion. So that’s done. So now let’s learn how association

rule mining is used in a market basket analysis problem. So what we’ll do is we will be

using the online transactional data of a retail store for generating association

rules. So first of all what you need to do is import the pandas and mlxtend

libraries and read the data. So first of all what we’re going to

do is read the data, and from

mlxtend.frequent_patterns we’re going to import apriori and association_rules.

As you can see here, we have the head of the data: you can see we have

the invoice number, stock code, the description, quantity, the invoice date,

unit price, customer ID and the country. So in the next step what we will do is

the data cleanup, which includes removing spaces from some of

the descriptions given, and what we’re going to do is drop the rows that do

not have invoice numbers and remove the credit transactions. So here what

we’re gonna do is remove the rows which do not have an invoice number; if the

invoice number string contains a C, then we’re going to remove that row, as those are the

credits; and we remove any extra spaces from the descriptions. So as you can see here,

we have like five hundred and thirty-two thousand rows with eight columns. So next,

after the cleanup, we need to consolidate the items into one

transaction per row, with each product as a column. For the sake of keeping the data set

small, we’re gonna only look at the sales for France. So we’re gonna use only

France and group by invoice number and description, with the quantity summed up,

which leaves us with 392 rows and 1563 columns. Now there are a lot of

zeros in the data, but we also need to make sure any positive values are

converted to a 1 and anything less than or equal to 0 is set to 0. So for that

we’re going to

use this code, defining encode_units: if x is less than or equal to 0 it returns 0,

and if x is

greater than or equal to 1 it returns 1. So what we’re going to do is map and apply it to the

whole data set we have here. So now that we have structured the data properly,

the next step is to generate the frequent item sets that have a support of at

least 7%. Now this number is chosen so that you

get enough item sets to work with. Now what we’re gonna do is generate the rules with the

corresponding support, confidence and lift. So we had given the minimum support

as 0.07; for the rules, the input is the frequent item sets, the metric is lift

and the threshold is 1. So these are

the resulting rules. Now there are a few rules with a high lift value, which means that

the combination

occurs more frequently than would be expected given the number of transactions and

the product combinations. In most of the cases the confidence is high as well. So

these are a few of the observations we get here. If we filter the data frame

using standard pandas code for a large lift (at least 6) and a high confidence

(at least 0.8), this is

what the output is going to look like: these are 1, 2, 3, 4, 5, 6, 7, 8. So as you can

see here, we have the eight rules, which are the final rules given by

association rule mining. And that is how all the industries, or the

large retailers we’ve talked about, get to know how their products are bought together and

how exactly they should rearrange them and provide offers on the products so

that people spend more and more money and time in the shop. So that was all

about association rule mining. So guys, that’s all for unsupervised learning. I

hope you got to know about the different formulas and how unsupervised learning works,

because, you know, we did not provide any labels to the data. All we did was create

some rules without knowing what the data is, and we did clustering,

different types of clustering: k-means, c-means, hierarchical clustering. So now

coming to the third and last type of learning, which is reinforcement learning.

So what is reinforcement learning? It’s a type of machine learning where an agent

is put in an environment and it learns to behave in this environment by

performing certain actions and observing the rewards which it gets from those

actions. So reinforcement learning is all about taking an appropriate action

in order to maximize the reward in a particular situation. In supervised

learning, the training data comprises the input and the expected output,

so the model is trained with the expected output itself. But when it comes

to reinforcement learning, there is no expected output: the

reinforcement agent decides what actions to take in order to perform a given task.

In the absence of a training data set, it is bound to learn from its experience. So

let’s understand reinforcement learning with an analogy. Consider a scenario

wherein a baby is learning how to walk. Now this scenario can go in two ways.

First, the baby starts walking and makes it to the candy.

Now since the candy is the end goal, the baby is happy. It’s positive: the baby is

happy, positive reward. Now coming to the second scenario, the baby starts walking

but falls due to some hurdle in between. Now the baby gets hurt and does not get

to the candy. It’s negative: the baby is sad, negative reward. Just like we humans

learn from our mistakes by trial and error, reinforcement learning is also

similar. We have an agent, which is the baby, a reward, which is the candy, and many

hurdles in between. The agent is supposed to find the best possible path to reach

the reward. So guys, let’s have a look at some of the important reinforcement

learning definitions. First of all we have the agent: the reinforcement

learning algorithm that learns from trial and error, that’s the agent. Now if we

talk about the environment, the world through which the agent moves, or the obstacles

which the agent has to conquer, are the environment.

Now actions A are all the possible steps that the agent can take. The state S is

the current condition returned by the environment. Then again we have reward R,

an instant return from the environment to appraise the last action. Then again

we have policy, which is π: it is the approach that the agent uses to determine

the next action based on the current state. We have value V, which is the

expected long-term return with discount, as opposed to the short-term reward R. Then

again we have the action value Q: this is similar to value, except it takes an

extra parameter, which is the current action A. Now let’s talk

about reward maximization for a moment. A reinforcement learning agent works

based on the theory of reward maximization. This is exactly why the RL agent

must be trained in such a way that it takes the best action, so that the reward

is maximum. Now the cumulative reward at a

particular time, with the respective actions, is written as G_t = R_{t+1} +

R_{t+2} + and so on. Now this equation is an ideal

representation of rewards; generally things do not work out like this while

summing up the cumulative rewards. Now let me explain this with a small game. In

the figure you see a fox, some meat and a tiger. Our reinforcement learning

agent is the fox, and his end goal is to eat the maximum amount of meat before being

eaten by the tiger. Since this fox is a clever fellow, he eats

the meat that is closer to him rather than the meat which is close to the

tiger, because the closer he goes to the tiger, the higher are his

chances of getting killed. As a result, the rewards near the tiger, even if they are

bigger meat chunks, will be discounted. This is done because of the uncertainty

factor that the tiger might kill the fox. Now the next thing to understand is how

discounting of rewards works. To do this, we define a discount rate called

gamma. The value of gamma is between 0 and 1.
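
A quick sketch of what the discount does, using the standard discounted-return sum G_t = Σ γ^k R_{t+k+1}: each future reward is multiplied by gamma once for every step it lies in the future. The reward values below are hypothetical (a big chunk of meat two steps away), and cutting the sum off after a finite list of rewards is an assumption made to keep the example small.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * reward_k over a finite list of future rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical rewards: two small pieces of meat nearby, a big one further away.
rewards = [1.0, 1.0, 10.0]

for gamma in (1.0, 0.9, 0.5, 0.0):
    print("gamma =", gamma, "return =", discounted_return(rewards, gamma))
```

With gamma equal to 1 nothing is discounted and the far reward counts in full; with gamma equal to 0 only the immediate reward counts.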

The smaller the gamma, the larger the discount, and vice versa. So our

cumulative discounted reward is G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1},

where gamma belongs to the interval from 0

to 1. But if the fox decides to explore a bit, it can find bigger rewards, that is,

this big chunk of meat. This is called exploration. So reinforcement

learning basically works on the basis of exploration and exploitation.

Exploitation is about using the already known information to

heighten the rewards, whereas exploration is all about exploring and capturing

more information about the environment. There is another problem, which is known

as the K-armed bandit problem. The K-armed bandit is a metaphor

representing a casino slot machine with K pull levers, or arms. The user or the

customer pulls any one of the levers to win a projected reward.
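
A minimal sketch of such a bandit, played with the epsilon-greedy strategy that the session turns to next: with probability epsilon the player pulls a random lever (exploration), otherwise it pulls the lever with the best average payout so far (exploitation). The payout model (a fixed mean per lever plus a little bounded noise), the parameter values and the function name are assumptions of this sketch, and the uniform random exploration used here is the common textbook form rather than the exact two-option split described in the session.

```python
import random

def run_bandit(true_means, epsilon, steps, rng):
    """Play a K-armed bandit epsilon-greedily; return (estimates, pulls) per lever."""
    k = len(true_means)
    estimates = [0.0] * k        # running average payout observed for each lever
    pulls = [0] * k              # how many times each lever has been pulled
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                            # explore: any lever
        else:
            arm = max(range(k), key=lambda a: estimates[a])   # exploit: best so far
        # hypothetical payout: the lever's mean plus a little bounded noise
        reward = true_means[arm] + rng.uniform(-0.05, 0.05)
        pulls[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]  # incremental mean
    return estimates, pulls

rng = random.Random(0)
estimates, pulls = run_bandit([0.2, 0.5, 0.8], epsilon=0.1, steps=2000, rng=rng)
best = max(range(len(estimates)), key=lambda a: estimates[a])
print("estimated best lever:", best)
print("pulls per lever:", pulls)
```

Without the exploration step the player would keep pulling the first lever it tried, since that lever is the only one with a payout estimate above zero.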

The objective is to select the lever that will provide the user with the

highest reward. Now here comes the epsilon-greedy algorithm. It tries to be

fair to the two opposite goals of exploration and exploitation by using a mechanism of

flipping a coin: if you flip a coin and it comes up heads, you should

explore for a moment, but if it comes up tails, you should exploit, that is, take whatever

action seems best at the present moment. So with probability 1 - epsilon, the

epsilon-greedy algorithm exploits the best known option; with

probability epsilon by 2 the algorithm

explores the best known option; and with probability epsilon by 2 the

epsilon-greedy algorithm explores the worst known option. Now let’s talk about the Markov

decision process. The mathematical approach for mapping a solution in

reinforcement learning is called the Markov decision process, which is MDP. In a way,

the purpose of reinforcement learning is to solve a Markov decision process. Now

the following parameters are used to attain a solution: a set of actions A, a set

of states S, the reward R, the policy π and the value V, and we have the

transition function T, the probability that action a from state s leads to state s′.

Now to briefly

sum it up, the agent must take an action A to transition from the start

state to the end state S. While doing so, the agent receives a reward R for each

action it takes. The series of actions taken by the agent defines the policy π,

and the rewards collected define the value V. The main goal

here is to maximize the rewards by choosing the optimum policy. Now let’s

take an example of choosing the shortest path. Consider the example given here.

Given the above representation, our goal here is to find

the shortest path between A and D. Each edge has a number linked to it, and this

denotes the cost to traverse that edge. Now the task at hand is to traverse from

point A to D with the minimum possible cost. In this problem the set of states

is denoted by the nodes A, B, C, D. The action is to traverse from one

node to another, given by A → B or C → D. The reward is the cost

represented by each edge, and the policy is the path taken to reach the

destination, A to C to D. So you start off at node A and take baby steps to your

destination. Initially only the next possible node is visible to you. If you

follow the greedy approach and take the most optimal step, that is choosing A to

C instead of A to B to C, you are now at node C and want to traverse to node D.

You must again choose the path wisely: choose the path with the lowest cost. We

can see that A, C, D has the lowest cost, and hence we take that path. To conclude,

the policy is A to C to D and the value is 120.

So let’s understand the Q-learning algorithm, which is one of the most used

reinforcement learning algorithms, with the help of an example.

So we have five rooms in a building connected by doors, and each room is

numbered from 0 through 4. The outside of the building can be thought of as one

big room, which is room number 5. Now the doors of rooms 1 and 4 lead into the building

from room 5, the outside. Now let’s represent the rooms on

a graph, with each room as a node and each door as a link. So as you can

see here, we have represented it as a graph, and our goal is to reach node

5, which is the outer space. So the next step is to

associate a reward value with each door. The doors that lead directly to the goal

will have a reward of 100, whereas the doors that do not directly connect to

the target have a reward of zero. And because the doors are two-way, two arrows are

assigned between connected rooms, and each arrow contains an instant reward value. So

after that, the terminology in Q-learning includes the terms state and

action. Each room, including room 5, represents a state; the agent’s movement from

one room to another

room represents an action. In this figure a state is depicted as a node,

while an action is represented by an arrow. So for example, let’s say an agent

wants to traverse from room 2 to room 5. So the initial state is gonna be

state 2; then the next step is from state 2 to state 3; next it moves

from state 3 to either state 2, 1 or 4; and if it goes to 4, it reaches state 5.

So that’s how you represent the whole traversal of any particular agent through

all of these rooms: states are represented by nodes and actions by arrows. So we can put this

state diagram and the instant reward values into a reward table, which is the matrix

R. So as you can see, the minus 1 here in the table represents the null values,

because you cannot go from 1 to 1, right? And since there is no way to go

from 1 to 0, that is also minus 1. So minus 1 represents the null values,

whereas 0 represents zero reward and 100

represents the reward for going to room 5. So one more important thing to know

here is that if you’re in room 5 you can go to room 5 itself, and the reward is

100. So what we need to do is add another matrix Q, representing the memory

of what the agent has learned through experience. The rows of matrix Q

represent the current state of the agent, whereas the columns represent the

possible actions leading to the next state. Now the formula to calculate

the Q matrix is: Q at a particular state and a given action is

equal to the R of that state and action, plus gamma,

the gamma parameter which we discussed earlier, which ranges from 0 to 1, into

the maximum of Q over the next state and all its actions, that is,

Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)]. So let’s understand

this with an example. So here are the nine steps which any Q-learning

algorithm has. First of all, set the gamma parameter and

the environment rewards in the matrix R. Then initialize the

matrix Q to zero. Select a random initial state. Set the initial state as the current

state. Select one among all the possible actions for the current state. Using this

possible action, consider going to the next state. When you get the next state,

get the maximum Q value for this next state based on all the possible actions. Compute

the Q value using the formula. Repeat the above steps until the current state

equals your goal. So the first step is to set the values of the learning

parameters gamma which is 0.8 and initial state as room number one so the

next initialize the Q matrix a zero matrix so on the left hand side as you

can see here we have the Q matrix which has all the values as 0 now from room 1

you can either go to room 3 or room 5 so let’s select room 5 because that’s our

end goal. For room 5, calculate the maximum Q value for this next state

based on all possible actions. So Q(1, 5) equals R(1, 5), which is

100, plus 0.8, which is the gamma, into the maximum of Q(5, 1),

Q(5, 4) and Q(5, 5). As you can see

here, the Q values are initialized to zero, so

the maximum is zero, and the final Q value for Q(1, 5) is 100. That's how

we're going to update our Q matrix: the position (1, 5), in the

second row, gets updated to 100. So the first episode is finished. Now,

for the next episode we start with a randomly chosen initial state so let’s

assume that the state is 3. From room number 3 you can either go to room

number 1, 2 or 4, so let's select the option of room number 1, because from our

previous experience what we've seen is that 1 is directly connected to room

5. For room 1, calculate the maximum Q value for this next state based on all

possible actions. So Q(3, 1) equals R(3, 1), which is 0, plus 0.8

into the maximum of Q(1, 3) and Q(1, 5), which is 100, so we get the value as 80, and the matrix Q gets updated.

Now the next state, 1, becomes the current state. We repeat the

inner loop of the Q-learning algorithm because state 1 is not the goal state. From

1 you can either go to 3 or 5, so let's select 5, as that's our goal. From

room 5 again we can go to 1, 4 or 5, but the Q matrix remains the same

since Q(1, 5) is already fed to the agent. That is how you select random

starting points, fill up the Q matrix, and see which path will

lead us there with the maximum reward points. Now what we're going to do is do the

same coding using Python. So what we're going to do is

import numpy as np. We're going to take the R matrix as we defined earlier, so

that the minus 1s are the null values, the zeros are the moves which provide a reward of 0,

and 100 is the reward for reaching the goal. What we're going to do is initialize the Q matrix

now to 0, put gamma as 0.8, and set the initial state as 1.

Now, this function here returns all the available actions in the state given as an

argument, so if we call available actions with the given state we get the

available actions in the current state. We have another function here, which

is known as sample next action. What this function does is choose at

random which action is to be performed within the range of all the available

actions. And finally we have action, which is the sample next action with the

available actions. Now again we have another function, which is update. What it

does is update the Q matrix according to the path selected and the

Q-learning algorithm. Initially our Q matrix is all 0, so what we're going to do

is train it over 10,000 iterations and see what exactly

the output of the Q values is. As the agent learns more through further

iterations, it will finally reach convergence values in the Q matrix. The Q

matrix can then be normalized, that is, converted to percentages, by dividing all

the non-zero entries by the highest number, which is 500 in this case. Once

the matrix Q gets close enough to the state of convergence, the agent has learned

the most optimal paths to the goal state. So what we're going to do next is divide it

by 500, which is the maximum here: Q divided by np.max(Q), into 100, so that we get a

normalized matrix. The optimal

path given by the Q-learning algorithm is: if it starts from 2, it will go to

3, then go to 1, and then go to 5. If it starts at 2 it can also go to 3, then 4, then 5;

that will give us the same result. As you can see here, the output given by

the Q-learning algorithm is the selected path 2, 3, 1 and 5, starting from

state 2. So this is how exactly a reinforcement learning algorithm works.
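
To make the walkthrough concrete, here is a minimal sketch of the same loop in plain Python. It is not the code shown in the session: it uses lists instead of NumPy so that it has no dependencies, and the reward matrix `R` is an assumption, reconstructed to match the moves described above (room 1 connects to 3 and 5, room 3 to 1, 2 and 4, room 5 to 1, 4 and itself). The helper name `greedy_path` is hypothetical.

```python
import random

GAMMA = 0.8   # gamma parameter used in the walkthrough
GOAL = 5      # room 5 is the goal state

# Assumed reward matrix: -1 = no door, 0 = door, 100 = door into the goal.
R = [
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
]

def available_actions(state):
    """Return every room reachable from `state` (reward is not -1)."""
    return [room for room, reward in enumerate(R[state]) if reward >= 0]

def update(Q, state, action):
    """Apply Q(s, a) = R(s, a) + gamma * max(Q(next state, all actions))."""
    Q[state][action] = R[state][action] + GAMMA * max(Q[action])

def train(iterations=10000, seed=0):
    """Pick a random state and a random valid action, then update Q."""
    rng = random.Random(seed)
    Q = [[0.0] * 6 for _ in range(6)]
    for _ in range(iterations):
        state = rng.randrange(6)
        action = rng.choice(available_actions(state))
        update(Q, state, action)
    return Q

def greedy_path(Q, start, max_steps=10):
    """Follow the highest Q value from `start` until the goal is reached."""
    path = [start]
    while path[-1] != GOAL and len(path) <= max_steps:
        state = path[-1]
        path.append(max(available_actions(state), key=lambda a: Q[state][a]))
    return path

# The two hand-worked updates from the walkthrough.
Q_demo = [[0.0] * 6 for _ in range(6)]
update(Q_demo, 1, 5)   # Q(1, 5) = 100 + 0.8 * 0
update(Q_demo, 3, 1)   # Q(3, 1) = 0 + 0.8 * 100

Q = train()
path_from_2 = greedy_path(Q, start=2)
print(Q_demo[1][5], Q_demo[3][1])
print(path_from_2)
```

Under this assumed matrix the largest converged Q value is 500, and the routes 2, 3, 1, 5 and 2, 3, 4, 5 score the same, so which of the two the greedy walk prints depends only on tie-breaking.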

It finds the optimal solution using the path, given the actions and rewards

and the various other definitions, or the various other challenges, I would say.

Actually, the main goal is to get the maximum reward and get the maximum value

through the environment, and that's how an agent learns through its own path,

going through millions and millions of iterations, learning what reward each path will

give us. So that's how the Q-learning algorithm works, and that's how

it works in Python as well, as I showed you. So now that you have a clear idea of

the different machine learning algorithms, how they work, the different

phases of machine learning, the different applications of machine learning, how

supervised learning works, how unsupervised learning works, how

reinforcement learning works, what to choose in what scenario, and what the

different algorithms under all of these types of machine learning are, let's move

forward to the next part of our session, which is understanding artificial

intelligence, deep learning and machine learning.

Well, data science is something that has been there for ages; data

science is the extraction of knowledge from data by using scientific techniques

and algorithms. People usually have a certain level of dilemma, or I would say

a certain level of confusion, when it comes to differentiating between the

terms artificial intelligence, machine learning and deep learning. So don't

worry, I'll clear all of these doubts for you. Artificial intelligence is a

technique which enables machines to mimic human behavior. The idea behind

artificial intelligence is fairly simple yet fascinating, which is to make

intelligent machines that can take decisions on their own. For years it

was thought that computers would never match the power of the human brain. Well,

back then we did not have enough data and computational power, but now, with big

data coming into existence and with the advent of GPUs, artificial intelligence

is possible. Machine learning is a subset of artificial intelligence,

a technique which uses statistical methods to enable machines to improve with

experience, whereas deep learning is a subset of machine learning which makes

the computation of multi-layer neural networks feasible. It uses neural

networks to simulate human-like decision-making. So as you can see, if we

talk about the data science ecosystem, we have artificial intelligence, machine

learning and deep learning; deep learning, being the innermost circle, is very much

required for machine learning as well as artificial intelligence.

But why was deep learning required? For that, let's understand the need for

deep learning. A step towards artificial intelligence was machine learning, and

machine learning was a subset of AI. It deals with the extraction of patterns

from large data sets. Handling large data sets was not a problem; what was a problem was that

machine learning algorithms could not handle high-dimensional data, where

we have a large number of inputs and outputs which run into thousands of

dimensions. Handling and processing such type of data becomes very complex and

resource exhausting. This is also termed the curse of dimensionality.

Another challenge faced by machine learning was to specify the features to

be extracted. As we saw earlier, in all the algorithms which we discussed,

we had to specify the features to be extracted. This plays an important

role in predicting the outcome as well as in achieving better accuracy. Therefore,

without feature extraction, the challenge for the programmer increases, as the

effectiveness of the algorithm very much depends on how insightful the programmer

is. Now this is where deep learning comes into the picture and comes to the rescue.

Deep learning is capable of handling high-dimensional data and is also

efficient in focusing on the right features on its own. So what exactly is

deep learning? Deep learning is a subset of machine learning, as I mentioned earlier,

where similar machine learning algorithms are used to train deep neural

networks so as to achieve better accuracy in those cases where the former

was not performing up to the mark. Basically, deep learning mimics the way

our brain functions and learns from experience. As you know, our brain is

made up of billions of neurons that allow us to do amazing things. Even the

brain of a small kid is capable of solving complex problems which are very

difficult to solve even using supercomputers. So how can we achieve the

same functionality in programs? This is where we need to understand the artificial neuron

and artificial neural networks. But first of all, let's have a look at the

different applications of deep learning. We have automatic machine translation,

object classification, automatic handwriting generation,

character text generation, image caption generation, colorization

of black and white images, automatic game playing and much more. Now,

Google Lens is a set of vision-based computing capabilities that allows your

smartphone to understand what's going on in a photo, video or any live feed.

For instance, point your phone at a flower and Google Lens will tell you on

the screen which type of flower it is. You can aim the camera at any restaurant

sign to see the reviews and other recommendations. Now, if we talk about

machine translation, this is a task where you are given words in some language and

you have to translate the words to a desired language, say English. This

kind of translation is a classic example of image recognition. The final

application of deep learning which we have here is image colorization, that is,

automatic colorization of black and white images. As you know, earlier we did

not have color photographs; back there in the 40s and 50s we did not have any color

photographs. Through deep learning, by analyzing what shadows are present in

the image and how the light is bouncing off the skin tone of the people, automatic

colorization is now possible, and this is all possible because of deep learning.

Now deep learning studies the basic unit of the brain, a cell called a neuron. Let

us understand the functionality of a biological neuron and how we mimic this

functionality in the perceptron, or what we call an artificial neuron. As

you can see here, we have the image of a biological neuron. It has a cell body,

it has mitochondria, a nucleus, we have dendrites there, we have the axon, we have

the node of Ranvier, we have the Schwann cell and the synapse. We need

not know about all of these. What we need to know mostly about is the dendrite,

which receives signals from other neurons, the cell body, which sums

up all the inputs, and the axon, which is used to transmit the signals to the

other cells. Now, an artificial neuron or perceptron is a linear model which is

based upon the same principle and is used for binary classification.

It models a neuron which has a set of inputs, each of which is given a specific

weight, and the neuron computes some function on these weighted inputs and

gives the output. It receives n inputs, corresponding to each feature; it then

sums up those inputs, applies the transformation and produces an output. It

has generally two functions, which are the summation and the transformation, and

the transformation is also known as the activation function. As you can see

here, we have certain inputs, we have certain weights, we have the transfer

function and then we have the activation function. The transfer function is

nothing but the summation function here, and this is the schematic for a neuron in

a neural network. This is how we mimic a biological neuron in terms of

programming. Now, the weight shows the effectiveness of a particular input: the more

the weight of an input, the more it will have an impact on the neural network. On the

other hand, bias is an additional parameter in the perceptron which is

used to adjust the output along with the weighted sum of the inputs to the

neuron, which helps the model in a way that it can best fit the given data.
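
As a rough illustration of the summation and transformation just described, here is a minimal sketch of a single artificial neuron in plain Python. The function name `perceptron`, the threshold of 0 and the example numbers are all assumptions chosen for illustration; they do not come from the session.

```python
def perceptron(inputs, weights, bias, threshold=0.0):
    """Weighted sum of the inputs plus bias, followed by a step activation."""
    # Summation (transfer function): each input is scaled by its weight.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Transformation (activation function): fire only at or above the threshold.
    return 1 if weighted_sum >= threshold else 0

# Same inputs and weights, different bias: the bias alone shifts the decision.
fires = perceptron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
silent = perceptron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=-0.1)
print(fires, silent)
```

Here the weighted sum of the inputs is exactly zero, so the output is decided entirely by the bias, which is one way to see why the bias helps the model fit the data.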

Activation functions translate the inputs into outputs, and they use a

threshold to produce an output. There are many functions that are used as

activation functions, such as linear or identity, unit or binary step,

sigmoid or logistic, tanh, ReLU and softmax. Now, if we talk about the

linear activation function, a linear transform is

basically the identity function, where the dependent variable has a direct

proportional relationship with the independent variable. In practical

terms, it means that the function passes the signal through unchanged. Now the

question arises: when to use a linear transform function? The simple answer is, when

we want to solve a linear regression problem we apply a linear transformation

function. Next in our list of activation functions we have the unit

step. The output of a unit step function is either 1 or 0, depending on the

threshold value we define. A step function with the threshold value

5 is shown here. So let's consider the threshold x is 5: if the value is less than

5 the output will be 0, whereas if the value is equal to or greater than

5 then the value will be 1. This equal to is very much important to consider here,

because sometimes people put the equal to on the lower side.

That's not how it is used; rather, it's used on the upper side,

where only if the value is greater than or equal to x

will the value be 1. Now, a sigmoid function is a machine that

converts an independent variable of near infinite range into simple probabilities

between 0 and 1, and most of its output will be very close to either 0 or 1.

If you have a look at the function here, we have 1 divided by 1 plus e raised to the

power minus beta x. I'm not going into the details of the mathematical function

of a particular sigmoid, but it's very much used to convert independent

variables of very large, near infinite range to values between 0 and 1.
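
The three activation functions covered so far can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names are assumptions, the default threshold of 5 mirrors the example above, and the sigmoid is written in a numerically stable form so that very negative inputs do not overflow.

```python
import math

def identity(x):
    """Linear activation: passes the signal through unchanged."""
    return x

def unit_step(x, threshold=5.0):
    """Outputs 1 when x is greater than or equal to the threshold, else 0."""
    return 1 if x >= threshold else 0

def sigmoid(x, beta=1.0):
    """Computes 1 / (1 + e^(-beta * x)), squashing any input into (0, 1)."""
    z = beta * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form for negative z that avoids overflow in exp().
    e = math.exp(z)
    return e / (1.0 + e)

print(unit_step(4.9), unit_step(5.0))
print(sigmoid(0.0), sigmoid(10.0), sigmoid(-10.0))
```

Note how the step function places the "equal to" on the upper side, as stressed above, and how the sigmoid outputs for inputs of 10 and minus 10 already sit very close to 1 and 0.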

Now the question arises: when to use a sigmoid transformation function? When

we want to map the input values to a value in the range of 0 to 1, where we

know the output should lie only between these two numbers, we apply the sigmoid

transformation function. Now, tanh is a hyperbolic trigonometric function.

Unlike the sigmoid function, the normalized range of tanh is minus 1 to

1. It's very much similar to the sigmoid function, but the advantage of tanh is

that it can deal more easily with negative numbers. Next on our list we

have ReLU. ReLU, or the rectified linear unit transform function, only

activates a node if the input is above a certain quantity. While the input is

below 0, the output is 0, but when the input rises above a certain threshold,

which in this case is 0,

it has a linear relationship with the dependent variable.

This is very much different from a normal linear transformation, since it

has a certain threshold. Now the question arises here again: when to use a ReLU

transformation function? When we want to map the input values to a value in

the range from 0 upward, that is, input x to max(0, x), so that it maps the negative

inputs to 0 and the positive inputs are output without any change, we apply the

rectified linear unit, or ReLU, transformation function. Now the final

one which we have is softmax. When we have four or five classes of outputs, the

softmax function will give the probability distribution over them. It is

useful for finding out the class which has the maximum probability. Softmax

is a function you will often find at the output layer of a classifier. Now suppose

we have an input of, say, the letters of English words and we want to classify

which letter it is. For that case we're going to use the softmax function,

because in the output we have certain classes; I would say, if we

take English, we have 26 classes from A to Z. So in that case the softmax activation

function is very much important. Now, an artificial neuron can be used to

implement logic gates. I'm sure you guys must be familiar with the working

of an OR gate, that is, the output is 1 if any of the inputs is also 1. Therefore a

perceptron can be used as a separator or a decision line that divides the input

set of the OR gate into two classes, the first class being the inputs having

output as 0, which lie below the decision line, and the second class being the

inputs having output as 1, which lie above the decision line or the separator. So

mathematically, a perceptron can be thought of like an equation of weights,

inputs and bias. As you can see here, we have f(x) is equal to the weight vector into the

input vector plus the bias. So let's go ahead with our demo and understand how we

can implement this perceptron example, which is of an OR gate, using

an artificial neuron or the perceptron. Here we're going to use

TensorFlow along with Python, so let's understand what exactly TensorFlow is

first, before going into the demo. Basically, TensorFlow is a deep learning

framework by Google. To understand it in a very easy way,

let's understand the two terms of TensorFlow, which are the tensors and the

flow. Starting with tensors: tensors are the standard way of representing data

in deep learning, and they are just multi-dimensional arrays, an

extension of two-dimensional tables or matrices to data with higher

dimensions. As you can see here, first of all we have a tensor of dimension 6,

then we have a tensor of dimension 6 comma 4, which is 2D, and again we have a

tensor of dimension 6, 4 and 2, which is 3D. Now this dimension is not

restricted to 3; we can have four dimensions, five dimensions. It depends

upon the number of inputs or the number of classes or the parameters which we

provide to a particular neural net or a particular perceptron. So, which brings us

to TensorFlow. In TensorFlow, the computation is approached as a data flow

graph. So we have a tensor and then again we have a flow, in which, suppose,

taking the example here, we have the data, we do addition, then we do matrix

multiplication, then we check the result. If it's good then it's fine, and if the

result is not good then we again do some sort of matrix multiplication or

addition, depending upon the function we are using, and then finally we

have the output. If you want to know more about TensorFlow, we have an entire

playlist on TensorFlow and deep learning which you should see. I'll give

the links to all of these videos in the description box. So let's go ahead with

our demo and understand how we can implement the OR gate using a perceptron.

First of all, what we're going to do is import all the required libraries, and

here I am going to import only one library, which is the TensorFlow library.

So what we're going to do is import tensorflow as tf.

Now the next step is to define vector variables for input and

output. For that we need to create variables for storing the input, the output

and the bias for the perceptron. As you can see here, we have the training

input and again we have the training output. What we're going to do next

is define the weight variable. Here we will define a tensor

variable of shape 3 comma 1 for our weights, and we will assign

some random values to it initially. So we're going to use tf.Variable and

we're going to use tf.random_normal to assign random values to the 3

cross 1 tensor. Next, what we do is define placeholders for input and output, so

that they can accept external inputs on the run. These will be of type tf.float32;

for x we are going to use a dimension of 3 and for y a dimension of 1. Now,

as discussed earlier, the input received by a perceptron is first multiplied by the

respective weights, and then all of these weighted inputs are summed together. This

summed value is then fed to the activation function for obtaining the final result of the OR

gate perceptron. So this is the output which we are defining here: it's

tf.nn.relu, so we are using the ReLU activation function here, and we are

doing the matrix multiplication of the inputs, which include the bias, and the weights. In this case I have

used the ReLU function, but you are free to use any of the activation

functions according to your needs. Next, what we're going to do is calculate

the cost or error. We need to calculate the cost, which is the mean squared error,

which is nothing but the square of the difference of the perceptron output and

the desired output. So the equation will be loss equals tf.reduce_sum, and

we'll use tf.square of output minus y. Now the goal of a perceptron is to

minimize the loss or the cost or the error, so here we are going to use the

gradient descent optimizer, which will reduce the loss. It is a very

important part of any neural network to use some sort of optimizer, so here we are

using the gradient descent optimizer. You can know more about the gradient descent

optimizer in the other Edureka videos on deep learning and neural networks.
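
For readers who want to see what the optimizer is doing under the hood, here is a minimal sketch of the same OR gate set-up in plain Python, without TensorFlow. It keeps the pieces described above (a bias input column, three weights, ReLU, a sum of squared errors and gradient descent), but the fixed starting weights of 0.1 and the learning rate of 0.05 are assumptions made so the run is repeatable; the session uses random initial weights, so the loss values printed here will not match the ones in the video.

```python
# Each row is [x1, x2, bias input]; the constant 1.0 lets the third weight act as the bias.
TRAIN_IN = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
TRAIN_OUT = [0.0, 1.0, 1.0, 1.0]   # OR gate targets

def relu(z):
    return max(0.0, z)

def forward(weights, row):
    """Weighted sum of one input row followed by ReLU."""
    return relu(sum(w * x for w, x in zip(weights, row)))

def loss_and_gradient(weights):
    """Sum of squared errors over the four rows, and its gradient."""
    loss = 0.0
    grad = [0.0] * len(weights)
    for row, target in zip(TRAIN_IN, TRAIN_OUT):
        z = sum(w * x for w, x in zip(weights, row))
        error = relu(z) - target
        loss += error ** 2
        if z > 0:  # ReLU passes gradient only where it is active
            for i, x in enumerate(row):
                grad[i] += 2.0 * error * x
    return loss, grad

def train(epochs=100, learning_rate=0.05):
    weights = [0.1, 0.1, 0.1]   # assumed fixed start instead of random values
    losses = []
    for _ in range(epochs):
        loss, grad = loss_and_gradient(weights)
        losses.append(loss)
        # Gradient descent step: move each weight against its gradient.
        weights = [w - learning_rate * g for w, g in zip(weights, grad)]
    return weights, losses

weights, losses = train()
predictions = [1 if forward(weights, row) >= 0.5 else 0 for row in TRAIN_IN]
print(losses[0], losses[-1])
print(predictions)
```

With this set-up the loss cannot fall to zero, because a single neuron of this form cannot hit all four targets exactly, yet thresholding its output at 0.5 still separates the OR gate inputs correctly.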

Now the next step is to initialize the variables. Variables are only

defined with tf.Variable, so we need to initialize the variables we defined. For

that we're going to use tf.global_variables_initializer,

and we're going to create a tf.Session and run it with the initializer, so that

all the variables are initialized. Now coming to the last step, what we're going

to do is train our perceptron, that is, update the values of the

weights and the biases in successive iterations to minimize the error or the

loss. Here I will be training our perceptron for 100 epochs. As you can see here, for i in range

100, we are going to run the session with the training input as x and the training

output as y, and we're going to calculate the loss, feeding it directly

the x train and y train, and again print the epoch. As you can see

here, for the first iteration the loss was 2.07, and

as the iterations increase the loss is decreasing, because with the

gradient descent optimizer it's learning how the data is. Coming down to the

hundredth or the final epoch, here we have the loss of 0.27.

We started with 2.07 initially and we ended up with a loss of 0.27,

which is very good. This was how a perceptron works on a

particular given data set. It learns about it, and as you saw earlier, we gave

a set of inputs, the input variables, we provided weights, we had a summation

function, and then we used the ReLU activation function in the code to get

the final output. Then we trained the particular model for 100 iterations

with the training data so as to minimize the loss, and the loss came down all the

way from 2.07 to 0.27. Well, if you think a perceptron

solves all the problems of making a human brain, then you are wrong. There are two

major problems. The first problem is that the single-layer perceptron cannot

classify non-linearly separable data points, and the second is that complex problems

that involve a lot of parameters cannot be solved by a single-layer perceptron.

Now consider the example here and the complexity of the parameters involved

in a decision taken by the marketing team. As you can see here, there are categories such as email,

direct, paid, referral program and organic; we have a certain number of social media

subcategories, Google, Facebook, LinkedIn and Twitter; and then we have the

types, such as search ads, remarketing ads, interest ads and lookalike ads. And

again, the parameters to be considered are the customer acquisition cost, money

spent, leads generated, customers generated and time taken to become a customer.

All of these problems cannot be solved by a single layer of perceptrons,

as one neuron cannot take in so many inputs, and that is why more than one

neuron would be used to solve these kinds of problems.
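
The first limitation can be checked with a small experiment. The sketch below uses XOR, a standard example of non-linearly separable data that is not mentioned in the session, and searches a grid of weights and biases for a single step-activated perceptron that reproduces a truth table. The grid range, the function names and the hand-picked weights in `xor_two_layer` are all assumptions for illustration.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
OR_TARGETS = [0, 1, 1, 1]
XOR_TARGETS = [0, 1, 1, 0]

def step(z):
    return 1 if z >= 0 else 0

def perceptron(x1, x2, w1, w2, b):
    return step(w1 * x1 + w2 * x2 + b)

def find_single_perceptron(targets):
    """Search weights and bias in [-4, 4] (step 0.5) for an exact fit."""
    grid = [i / 2 for i in range(-8, 9)]
    for w1, w2, b in product(grid, repeat=3):
        outputs = [perceptron(x1, x2, w1, w2, b) for x1, x2 in INPUTS]
        if outputs == targets:
            return (w1, w2, b)
    return None

def xor_two_layer(x1, x2):
    """Two hidden neurons (OR and AND) feeding one output neuron."""
    h_or = perceptron(x1, x2, 1.0, 1.0, -0.5)
    h_and = perceptron(x1, x2, 1.0, 1.0, -1.5)
    return step(h_or - h_and - 0.5)   # fires for "OR but not AND"

or_solution = find_single_perceptron(OR_TARGETS)
xor_solution = find_single_perceptron(XOR_TARGETS)
xor_outputs = [xor_two_layer(x1, x2) for x1, x2 in INPUTS]
print(or_solution, xor_solution, xor_outputs)
```

The search finds weights for OR but returns nothing for XOR, since no single straight decision line separates its classes, while the small two-layer arrangement handles it, which is the motivation for using more than one neuron.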

So a neural network is really just a composition of perceptrons connected in

different ways and operating on activation functions. For that we have

three different terminologies in a particular neural network: we have the

input layer, we have the hidden layers and we have the output layer. In the

input layer we have input nodes, which provide information from the outside

world to the network and are together referred to as the input layer. The

hidden nodes perform computations and transfer information from the input

nodes to the output nodes. A collection of hidden nodes forms a hidden

layer; in our image we have one, two, three, four hidden layers. Finally, the

output nodes are collectively referred to as the output layer and are responsible

for computation and transferring information from the network to the

outside world. Now that you have an idea of how a

perceptron behaves, the different parameters involved and the different

layers of neural networks, let's continue this session and see how we can create

our own neural network from scratch. In this image, as you can see here, we are

given a list of faces. First of all, the patterns of local contrast are

computed in the input layer; then in hidden layer 1 we get the face features,

and in hidden layer 2 we get the different features of the face, and

finally we have the output layer. Now, if we talk about training networks and

weights in a particular neural network, we can estimate the weight values for

our training data using the stochastic gradient descent optimizer,

as I mentioned earlier. It requires two parameters, which are the learning

rate and the epochs. As I mentioned earlier, the learning rate is used to limit the amount each

weight is corrected each time it is updated, and epochs is the number of times

to run through the training data while updating the weights. In the previous

example we had 100 epochs, so we trained the whole model 100 times. These,

along with the training data, will be the arguments to the function. For data

scientists, data analysts or machine learning engineers, working on

the hyperparameters is the most important part, because anyone can do the

coding. It's your experience and your way of thinking about the learning rate and

the epochs, the model which you are working on, the input data you are taking, and

how much time it will require to train, because time is limited. As you know,

these hyperparameters are the only things which successful data scientists

will be guessing when creating a particular model, and these play a huge

role in the model: even a slight difference in the learning rate or the

epochs might change the model training time, as it will take a longer time to

train with a large amount of data. These

are all things which a data scientist or machine learning engineer keeps in mind

while creating the model. Let's create our own neural network, and here we are going to

use the MNIST data set. The MNIST data set consists of 60,000 training samples

and 10,000 testing samples of handwritten digit images. The images

are of size 28 into 28 pixels and the output can lie anywhere between 0 and

9. The task here is to train a model which can accurately identify the digit

present in the image. So let's see how we can do this using TensorFlow and Python.

Firstly, we are going to use the import here to bring the

print function from Python 3 into Python 2.6, through the __future__ statement. Let's

continue with our code. Next, what we are going to do is, from

tensorflow.examples.tutorials, take the MNIST data, which is already

provided by TensorFlow in their example tutorials. This is only for the

learning part; later on you can use this particular data for more purposes

for your learning. Next, what we are going to do is create mnist, and we're

going to use input_data.read_data_sets, and one_hot is given as True

here. Then we're going to import tensorflow and matplotlib. Next, what we are

going to do is define the hyperparameters here. As I mentioned

earlier, we have a few hyperparameters, like learning rate, epochs and batch size;

display step is not a very big hyperparameter to consider here. The

learning rate we have given here is 0.001. Training epochs is 15; that is up

to you, because the more the number of epochs, the more time it will take for

the model to train, and here you have to take a decision between the amount of

time it takes for the model to train and give the output versus the speed. Again,

we have the batch size of 100. This is one of the most important hyperparameters

to be considered, because you cannot take all of the images at once

and process them together; you need to do it in batches, and for that we

define a batch size of 100. So out of 60,000 we're going to take 100 as a batch

size, 100 images, which will go through 15 iterations. The training set has

60,000 images, so you do the math on how many batches we will require, and

for each batch we'll have 15 epochs. The next step is defining the hidden

layers, the input and the classes. For the first hidden layer I have taken 256;

this is the number of perceptrons I need, or the number of features to be

extracted in the first layer. This number is arbitrary; you can choose it

according to your requirements and your needs. For simplicity I am using 256

here, and the same I'm going to use for hidden layer 2. Now for the

number of inputs I'm going to use 784, and that is because, as I discussed

earlier, the MNIST data has images of shape 28 cross 28, which is

784. So in short we have 784 pixels to be considered in a particular image, and

each pixel will provide an immense amount of data, so I am taking 784 inputs. For the

number of output classes, here I am defining 10, because the output can

range over 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9,

so the total number of output classes I'm going to use here is

10. And again we are going to create x and y variables, x for the input and y

for the output classes. Now, as you can see here, we have the multi-layer

perceptron, in which we have defined all the hidden layers and the output layer.

Layer 1 will first do the matrix

multiplication of the input and the weights, then add the biases, which gives

the summation, and the output of this will be given to

layer 2 after applying the ReLU activation function. So as you can

see here, we have the ReLU activation function for layer 1. Layer 2 will take

the output of layer 1 as its input, with the weights provided in h2 for hidden layer 2 and the

biases of b2. It will do the multiplication of layer 1 into the weights,

add the biases, and then again we'll have a ReLU activation function,

and the output of this layer 2 will be given to the output layer. As you can

see here, in the final output layer we have the matrix multiplication of layer 2

into the weights of the output layer, plus the biases of the output layer, and what

we're going to do is return the output. So let's mention the weights and the

biases; here we are taking random values for those. Next, what we're

going to do is get the prediction of the multi-layer perceptron using the input,

weights and biases. One more important thing we're going to do here is

define the cost. We're going to use tf.reduce_mean, and we are

using softmax cross entropy with logits.
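
To show what the forward pass and the cost compute, here is a minimal sketch in plain Python rather than TensorFlow. It follows the structure described above (two ReLU hidden layers, a linear output layer, then softmax cross entropy on the raw outputs), but the layer sizes are tiny stand-ins for 784, 256, 256 and 10, and the dictionary keys other than h2 and b2, the helper names and the sample input are assumptions for illustration.

```python
import math
import random

def dense(x, W, b):
    """Multiply input vector x by W (n_in rows, n_out columns) and add bias b."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def relu(v):
    return [max(0.0, z) for z in v]

def multilayer_perceptron(x, weights, biases):
    layer_1 = relu(dense(x, weights["h1"], biases["b1"]))
    layer_2 = relu(dense(layer_1, weights["h2"], biases["b2"]))
    return dense(layer_2, weights["out"], biases["out"])   # raw scores (logits)

def softmax(logits):
    shift = max(logits)                      # subtract the max for stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cross_entropy(logits, one_hot_label):
    """Negative log of the probability assigned to the true class."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs) if t > 0)

def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_input, n_hidden_1, n_hidden_2, n_classes = 4, 3, 3, 3   # tiny stand-in sizes
weights = {
    "h1": random_matrix(rng, n_input, n_hidden_1),
    "h2": random_matrix(rng, n_hidden_1, n_hidden_2),
    "out": random_matrix(rng, n_hidden_2, n_classes),
}
biases = {
    "b1": [rng.gauss(0.0, 1.0) for _ in range(n_hidden_1)],
    "b2": [rng.gauss(0.0, 1.0) for _ in range(n_hidden_2)],
    "out": [rng.gauss(0.0, 1.0) for _ in range(n_classes)],
}

x = [0.2, 0.0, 0.9, 0.5]        # one made-up input vector
label = [0.0, 1.0, 0.0]         # one-hot label for class 1
logits = multilayer_perceptron(x, weights, biases)
cost = softmax_cross_entropy(logits, label)
print(logits, cost)
```

The cost is small when the softmax puts most of its probability on the true class and grows as that probability shrinks, which is why minimizing it pushes the network toward correct predictions.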

this is a function and here we are using the atom optimizer rather than the

gradient descent optimizer with learning rate provided initially and what we’re

going to do is minimize the cost so again we’re going to initialize all

the global variables and we have two arrays for cos history and accuracy

history so as to store all the values and train our model so we’re going to

create a session and the training cycle. For each epoch in the range of 15, we first

initialize the average cost to zero, and the total batch is the number of MNIST

examples divided by the batch size, which is 100. We loop over all the

batches, run the optimization (the backpropagation) and the cost operation to

get the loss value, and then we display the logs for each epoch, which

will show the epoch and the cost at each step. We’re going to calculate the

accuracy from the correct predictions and append the accuracy

to the list after every epoch, and we will append the cost after every epoch as well.
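The bookkeeping of this training cycle can be sketched as follows. This is a hedged stand-in, not the session code from the video: to stay runnable without TensorFlow or the MNIST data, it trains a tiny logistic regression with plain gradient descent (not Adam) on a synthetic dataset, and every name, the batch size of 20 and the learning rate of 0.5 are assumptions chosen for illustration. What it shares with the description above is the structure: 15 epochs, an average cost reset to zero at the start of each epoch, a total batch count of examples divided by batch size, and one append per epoch to the cost and accuracy lists.

```python
import math
import random

rng = random.Random(42)

# Synthetic stand-in for the dataset: 200 points, label 1 when x0 + x1 > 0.
features = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
labels = [1.0 if x0 + x1 > 0 else 0.0 for x0, x1 in features]


def predict(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    z = w[0] * x[0] + w[1] * x[1] + b
    return 1.0 / (1.0 + math.exp(-z))


def train_step(w, b, xs, ys, lr):
    """One gradient-descent update on a batch; returns new parameters and the batch cost."""
    eps = 1e-12
    grad_w, grad_b, cost = [0.0, 0.0], 0.0, 0.0
    for x, y in zip(xs, ys):
        p = predict(w, b, x)
        cost -= y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps)
        grad_w[0] += (p - y) * x[0]
        grad_w[1] += (p - y) * x[1]
        grad_b += p - y
    n = len(xs)
    new_w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
    new_b = b - lr * grad_b / n
    return new_w, new_b, cost / n


training_epochs = 15
batch_size = 20       # the video uses 100 on MNIST; 20 suits this toy dataset
learning_rate = 0.5
w, b = [0.0, 0.0], 0.0
cost_history, accuracy_history = [], []

for epoch in range(training_epochs):
    avg_cost = 0.0
    total_batch = len(features) // batch_size
    for i in range(total_batch):
        batch_x = features[i * batch_size:(i + 1) * batch_size]
        batch_y = labels[i * batch_size:(i + 1) * batch_size]
        w, b, c = train_step(w, b, batch_x, batch_y, learning_rate)
        avg_cost += c / total_batch  # running average of the batch costs

    # Accuracy: fraction of examples whose predicted class matches the label.
    correct = sum((predict(w, b, x) > 0.5) == (y == 1.0)
                  for x, y in zip(features, labels))
    accuracy = correct / len(features)

    cost_history.append(avg_cost)
    accuracy_history.append(accuracy)
    print(f"Epoch: {epoch + 1:02d} cost={avg_cost:.4f} accuracy={accuracy:.3f}")
```

The two lists can then be handed to any plotting library to draw the cost and accuracy curves per epoch.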

That is what we have created the cost history and the accuracy history lists for.

Finally, we will plot the cost history using matplotlib,

and we’ll plot the accuracy history also, and what we’re going to do is

see how accurate our model is. So let’s train it now. As you can

see, at the first epoch we have cost 188 and the accuracy is 0.85. If you see, just after

the second epoch the cost has reduced from 188 to 42, and now it’s 26. As you can

see, the accuracy is increasing from 0.85 to 0.909. Once you have reached five

epochs, you see the cost is diminishing at a huge rate, which is very good, and

you can use different types of optimizers, be it gradient descent or the

Adam optimizer. I will not go into the details of the optimization, because that would take

another half an hour or one hour to explain to you guys what exactly it is and

how exactly it works. So as you can see, by the tenth or eleventh epoch we

have cost 2.4 and the accuracy is 0.94. Let’s wait a little further till the

15th epoch is done. So as you can see, in the 15th epoch

we have cost 0.83 and the accuracy is 0.94. We started with cost 188 and accuracy 0.85,

and we have reached an accuracy of 0.94. So as you can see, this is the graph

of the cost, which started from 188 and ended at 0.83. We

have the graph of the accuracy, which started from 0.84 or 0.85 and went all

the way to 0.94. As you can see, the 14th epoch reached an

accuracy of 0.947, as you can see here in the graph again, and in the 15th epoch

we came to an accuracy of 0.94. Now one might ask the question: the accuracy

was higher in that particular epoch, so why has the accuracy decreased? Another

important aspect or parameter to consider here is the cost. In general, the lower

the cost, the more accurate your model will be, so the goal is to minimize the cost,

which will in turn increase the accuracy. And finally, for accuracy, here we have about 0.94,

which is very good. Now, this was all about deep learning, neural

networks and TensorFlow: how to create a perceptron or deep neural network, what

the different hyperparameters involved are, and how a neuron works. So let’s

have a look at the companies hiring these data

professionals. In the data science environment we have companies all the

way from startups to big giants. The major companies we can see here are

Dropbox, Adobe, IBM, Walmart, Chase, LinkedIn and Red Hat, and there

are so many other companies. As I mentioned earlier, the demand for these

professionals is high, but the number of people applying is too low, because you need a

certain level of experience to understand how things are working. You

need to understand machine learning to understand deep learning, and you need to

understand all the statistics and probability, and that is not an easy task. So

you require at least 3 to 6 months of rigorous training, with a minimum of one to

two years of practical implementation and project work, I would say, to go into

a data science career, if you think that’s the career you want to go into. So

Edureka, as you know, provides a Data Science Masters Program. We have a Machine

Learning Masters Program too, but as you can see, in the Data Science Masters Program we have

Python Statistics, we have R Statistics, we have Data Science using R, Python for

Data Science, we have Apache Spark and Scala, we have AI and Deep Learning with

TensorFlow, and we have Tableau. So guys, as you can see here, we have 12 courses in

this masters program, with 250 hours of interactive learning via capstone

projects, and as you can see here, we have a certain discount going on. The hike in

salary you get is much more if you go for data science rather than any other

program. So you can see we have Python Statistics, R Statistics, Data Science

using Python, we have Python for Data Science, Apache Spark and Scala, which is

a very important part of data science, you need to know about the Hadoop

ecosystem, we have Deep Learning with TensorFlow,

you have Tableau, and this is a 31-week course. As I mentioned earlier, it’s not

an easy task, and you do not become a data scientist in one month or in two

months. You require a lot of training and a lot of practice to become a data

scientist or machine learning engineer or even a data analyst, because, you see, a

lot of topics across a vast list of areas is what you need to cover,

and once you cover all of these topics, what you need to do is select an area

in which you want to work and the kind of data which you’re going to be handling,

whether it be text data, medical records, or video, audio or

images for processing. It is not an easy task to become a data scientist, so you

need a very good and a very correct path of learning to become a data scientist.

So guys, that’s it for this session. I hope you enjoyed the session and got to

know about data science, the different aspects of data science and how it works, all

the way from statistics and probability to machine learning, deep

learning, and finally coming to AI. So this was the path of data science, and I

hope you enjoyed this session. If you have any queries regarding this session or

any other session, please feel free to mention them in the comment section below,

and we’ll happily answer all of your queries. Till then, thank you

and happy learning. I hope you have enjoyed listening to this video. Please

be kind enough to like it, and you can comment any of your doubts and queries

and we will reply to them at the earliest. Do look out for more videos in our

playlist and subscribe to the edureka! channel to learn more. Happy learning!

