Statistics in Machine Learning

Note: In the data science life cycle, more than 60% of the time goes into data analysis such as feature engineering and feature selection. Feature engineering means cleaning data and handling missing values, imbalanced data, and categorical features.
 
Python packages
  • Pandas - read and manipulate datasets
    • read_csv, head, isnull, get_dummies, drop, concat
  • NumPy - work with arrays
  • matplotlib.pyplot - for visualization
  • Seaborn - for visualization
    • heatmap, countplot, boxplot
Handling Categorical features
  • One-hot encoding for nominal variables
  • Label encoding for ordinal variables (a minimal sketch of both follows below)
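A minimal sketch of both encodings with pandas, on a made-up DataFrame (the 'city' and 'size' columns are hypothetical):

    import pandas as pd

    df = pd.DataFrame({
        'city': ['Delhi', 'Mumbai', 'Delhi'],   # nominal: no natural order
        'size': ['small', 'large', 'medium'],   # ordinal: small < medium < large
    })

    # One-hot encoding for the nominal variable
    one_hot = pd.get_dummies(df, columns=['city'])

    # Label encoding for the ordinal variable, preserving the order
    size_order = {'small': 0, 'medium': 1, 'large': 2}
    df['size_encoded'] = df['size'].map(size_order)

    print(one_hot)
    print(df)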
Ways of finding outliers
  • Scatter plot
  • Box plot 
  • z-score
  • IQR

Correlation

  • Strength of association between two variables 
    • Correlation is symmetric: the correlation between A and B equals the correlation between B and A
Regression
  • Used when one of the variables is dependent and the other is independent
  • Regression equation: the average value of y is modeled as a function of x
R Square
  • Proportion of the variance in y that the regression explains
Significance of F & P values
  • The F statistic tests overall model significance; p-values test the individual coefficients (see the sketch below)
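A minimal sketch, assuming statsmodels is available, on made-up data; it shows where R Square and the F and p values are read from after fitting a regression:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 2.0 * x + rng.normal(size=100)   # y depends on x plus noise

    X = sm.add_constant(x)               # add the intercept term
    model = sm.OLS(y, X).fit()

    print(model.rsquared)   # R Square: fraction of variance in y explained
    print(model.fvalue)     # F statistic: overall model significance
    print(model.pvalues)    # p-values for the intercept and the slope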
Covariance(cov)
  • Quantifies the relationship between features (random variables) in a given dataset
  • Helps in finding the direction of the relationship
  • Important topic for data preprocessing
    •  Examples
      • Quantify relationship between Size & prices of houses
      • Height, weight
  • cov(size, price) = (1/n) * sigma((xi - xmu) * (yi - ymu)), summed over all elements (sketched below)
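A minimal sketch of this formula on made-up size/price numbers; np.cov is used only as a cross-check (bias=True makes NumPy divide by n, matching the formula above):

    import numpy as np

    size  = np.array([1000.0, 1500.0, 2000.0, 2500.0])
    price = np.array([200.0, 280.0, 350.0, 425.0])

    n = len(size)
    cov_manual = np.sum((size - size.mean()) * (price - price.mean())) / n

    print(cov_manual)
    print(np.cov(size, price, bias=True)[0, 1])   # same value via NumPy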

Pearson Correlation Coefficient
  • Pearson CC = cov(x, y) / (stddev(x) * stddev(y))
  • Provides both the direction and the strength of the relationship
  • Value always ranges between -1 and 1 (see the sketch below)
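A sketch of the same ratio on the toy size/price data above; np.corrcoef computes it directly (NumPy's default std divides by n, which matches the 1/n covariance):

    import numpy as np

    size  = np.array([1000.0, 1500.0, 2000.0, 2500.0])
    price = np.array([200.0, 280.0, 350.0, 425.0])

    n = len(size)
    cov_xy = np.sum((size - size.mean()) * (price - price.mean())) / n
    pearson = cov_xy / (size.std() * price.std())

    print(pearson)
    print(np.corrcoef(size, price)[0, 1])   # same value via NumPy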
Spearman's Rank Correlation Coefficient (covered in detail below)

Finding outliers in a dataset
  • Z-score
    • z score = (xi - mu) / stddev
    • A data point beyond the 3rd standard deviation (|z| > 3) is treated as an outlier
  • IQR
    • Inter Quartile Range
      • IQR = Q3 - Q1 (75th percentile - 25th percentile); points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers (both rules are sketched below)
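A minimal sketch of both rules on a made-up array with one planted outlier:

    import numpy as np

    rng = np.random.default_rng(0)
    data = np.append(rng.normal(50, 5, 100), 120.0)   # 120 is the planted outlier

    # Z-score rule: |z| > 3 flags an outlier
    z = (data - data.mean()) / data.std()
    print(data[np.abs(z) > 3])

    # IQR rule: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    print(data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)])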
Logistic Regression
  • from sklearn.linear_model import LogisticRegression
  • from sklearn.metrics import confusion_matrix
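A minimal end-to-end sketch using those two imports on a made-up binary dataset (the data itself is hypothetical):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy binary target

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print(confusion_matrix(y_test, y_pred))   # rows: actual, columns: predicted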

T test, Chi Square test, Anova test

  • One-sample proportion test
    • When there is one categorical feature
  • Chi-square test
    • Two categorical features (sketched below)
  • T test
    • One numerical feature
  • Correlation
    • Two numerical features
  • Anova
    • One numerical feature compared across more than two categories (groups)
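As one example from the table above, a chi-square test of independence between two hypothetical categorical features (gender vs. purchased), sketched with scipy:

    import pandas as pd
    from scipy.stats import chi2_contingency

    df = pd.DataFrame({
        'gender':    ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
        'purchased': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'yes'],
    })

    table = pd.crosstab(df['gender'], df['purchased'])   # contingency table
    chi2, p, dof, expected = chi2_contingency(table)

    print(p)   # a small p-value suggests the two features are dependent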

Cosine Similarity & Cosine Distance

  • Cosine similarity is represented as cos(theta)
  • It is the cosine of the angle between two vectors, e.g. on a two-dimensional chart of movies (action on the x-axis, comedy on the y-axis)
  • cosine distance = 1 - cosine similarity (sketched below)
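A minimal sketch with two hypothetical movie vectors (action score on the x-axis, comedy score on the y-axis):

    import numpy as np

    movie_a = np.array([4.0, 1.0])   # mostly action
    movie_b = np.array([3.0, 2.0])   # action with some comedy

    cos_sim = np.dot(movie_a, movie_b) / (np.linalg.norm(movie_a) * np.linalg.norm(movie_b))
    cos_dist = 1 - cos_sim

    print(cos_sim, cos_dist)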
Covariance (recap)
  • variance(x) = (1/n) * sigma((xi - xmu)^2)
  • cov(x, y) = (1/n) * sigma((xi - xmu) * (yi - ymu))
  • Quantifies the relationship between features (random variables) in a dataset
  • Tells the direction of the relationship, positive or negative, BUT not its strength (how much)
Pearson Correlation Coefficient (recap)
  • Pearson CC = cov(x, y) / (stddev(x) * stddev(y))
  • Tells the direction of the relationship as well as its strength
  • Value ranges between -1 and 1
Spearman Rank Correlation Coefficient
  • Spearman CC = the Pearson correlation coefficient applied to rank(x) and rank(y)
  • Find rank(x) and rank(y); take the difference di = rank(xi) - rank(yi) for each pair
  • Spearman rank CC = 1 - (6 * sigma(di^2)) / (n * (n^2 - 1)) (sketched below)
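A sketch of the rank formula next to scipy's built-in version, on made-up values (the two agree when there are no ties):

    import numpy as np
    from scipy.stats import spearmanr, rankdata

    x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
    y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

    # Manual formula: 1 - 6 * sum(di^2) / (n * (n^2 - 1))
    d = rankdata(x) - rankdata(y)
    n = len(x)
    manual = 1 - (6 * np.sum(d ** 2)) / (n * (n ** 2 - 1))

    rho, p = spearmanr(x, y)
    print(manual, rho)   # both print 0.8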

 
