Statistics in Machine Learning
Note: In the data science life cycle, more than 60% of the time goes into data analysis, such as feature engineering and feature selection. Feature engineering means cleaning data, handling missing values, balancing unbalanced data, and encoding categorical features.
Python packages
- Pandas - read datasets
- read_csv, head, isnull, get_dummies, drop, concat
- Numpy - work with arrays
- matplotlib.pyplot - for visualization
- Seaborn - for visualization
- heatmap, countplot, boxplot
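The pandas functions listed above can be sketched on a tiny made-up dataset (the CSV string below is a hypothetical stand-in for a real file passed to `read_csv`):

```python
import io
import pandas as pd

# Hypothetical CSV standing in for a real file on disk
csv = io.StringIO(
    "age,city,price\n"
    "34,Delhi,100\n"
    ",Mumbai,150\n"
    "29,Delhi,120\n"
)

df = pd.read_csv(csv)
print(df.head())            # first rows of the dataset
print(df.isnull().sum())    # count of missing values per column

# One-hot encode the 'city' column, then swap it into the frame
dummies = pd.get_dummies(df["city"])
df = pd.concat([df.drop("city", axis=1), dummies], axis=1)
print(df.columns.tolist())
```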
Handling Categorical features
- one hot encoding for nominal variables
- label encoding for ordinal variables
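A minimal sketch of both encodings on a made-up frame with one nominal and one ordinal column (the column names and the ordering map are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "blue"],      # nominal: no natural order
    "size":  ["small", "large", "medium"],  # ordinal: has a natural order
})

# One-hot encoding for the nominal feature
onehot = pd.get_dummies(df["color"], prefix="color")

# Label encoding for the ordinal feature, with an explicit order
order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(order)

print(onehot.columns.tolist())
print(df["size_encoded"].tolist())
```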
Ways of finding outliers
- Scatter plot
- Box plot
- z-score
- IQR
Correlation:
- Strength of association between two variables
- It is symmetric both ways: corr(A, B) = corr(B, A)
Regression:
- Used when one variable is dependent & the other is independent
- Regression equation: the average value of 'y' is a function of x
R Square
- Fraction of the variance in 'y' explained by the model
Significance of F & P values
- Whether the fitted relationship is statistically significant
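As a hedged sketch of these ideas, `scipy.stats.linregress` fits the regression line and reports r (square it to get R Square) and the p-value of the slope; the size/price numbers below are made up:

```python
from scipy.stats import linregress

# Hypothetical house sizes (sq ft) and prices: price grows with size
size  = [1000, 1500, 2000, 2500, 3000]
price = [150, 200, 260, 300, 360]

result = linregress(size, price)
print(f"slope={result.slope:.3f}")           # average change in y per unit x
print(f"r_squared={result.rvalue ** 2:.3f}")  # fraction of variance explained
print(f"p_value={result.pvalue:.4f}")        # significance of the slope
```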
Covariance
- Quantifies the relationship between features (random variables) in a dataset
- This helps in finding the direction of the relationship
- Important topic for data preprocessing
- Examples
- Quantify relationship between size & price of houses
- Height & weight
- cov(x, y) = 1/n * sigma((xi - xmu) * (yi - ymu))
Pearson Correlation Coefficient
- Pearson CC = cov(x, y) / (stddev(x) * stddev(y))
- This provides the direction of the relationship & the strength of the relationship
- Value always ranges between -1 & 1
Spearman's rank correlation coefficient
Finding outliers in a dataset
- Z-score
- z score = (xi - mu) / stddev
- A point with |z-score| > 3 (beyond the 3rd stddev) is treated as an outlier
- IQR
- Inter Quartile Range
- IQR = Q3 (75th percentile) - Q1 (25th percentile); points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers
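Both detection rules can be sketched with NumPy on made-up data containing one obvious outlier:

```python
import numpy as np

# Hypothetical data: values near 10-13 plus one obvious outlier (100)
data = np.array([10, 12, 11, 13, 12, 11, 10, 13, 12, 11,
                 10, 12, 11, 13, 12, 11, 10, 13, 12, 100])

# Z-score rule: points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# IQR rule: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(z_outliers, iqr_outliers)
```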
- from sklearn.linear_model import LogisticRegression
- from sklearn.metrics import confusion_matrix
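A minimal usage sketch of those two imports on toy, linearly separable data (all numbers are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Toy binary classification: one feature, classes separate around x = 5
X = np.array([[1], [2], [3], [4], [6], [7], [8], [9]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)
pred = model.predict(X)

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y, pred))
```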
T test, Chi Square test, Anova test
- One sample proportion test
  - When there is one categorical feature
- Chi Square test
  - Two categorical features
- T test
  - One numerical feature
- Correlation
  - Two numerical features
- Anova
  - One numerical feature compared across more than 2 categories (groups)
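For the two-categorical-features case, a chi-square test of independence can be sketched with `scipy.stats.chi2_contingency`; the contingency table below is hypothetical:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: gender (rows) vs product preference (cols)
table = np.array([[30, 10],
                  [10, 30]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
# A small p (< 0.05) rejects independence: the two categorical features are related
```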
Cosine Similarity & Cosine Distance
- Cosine similarity is represented as cos(theta)
- It is the cosine of the angle between two vectors, e.g. on a two-dimensional chart of movies (action on the x-axis, comedy on the y-axis)
- cosineDistance= 1 - cosine similarity
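A short NumPy sketch of the movie example, with made-up (action, comedy) scores for two profiles:

```python
import numpy as np

# Two hypothetical movie profiles: (action score, comedy score)
a = np.array([4.0, 1.0])
b = np.array([5.0, 2.0])

# Cosine similarity: dot product divided by the product of the vector norms
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
cos_dist = 1 - cos_sim

print(cos_sim, cos_dist)
```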
Covariance
- variance(x) = 1/n * sigma((xi - xmu)^2)
- cov(x, y) = 1/n * sigma((xi - xmu) * (yi - ymu))
- Covariance tells the direction of the relationship BUT not the strength
- It helps us quantify the relationship between features (random variables) in a dataset
- It tells whether two variables are related positively or negatively, but not by how much
Pearson Correlation Coefficient
- Pearson CC = cov(x, y) / (stddev(x) * stddev(y))
- This tells Direction of relationship as well as Strength of relationship
- Value ranges between -1 and 1
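A sketch contrasting the two formulas on made-up size/price data: the covariance's magnitude depends on the units of x and y, while the Pearson coefficient is normalised into [-1, 1]:

```python
import numpy as np

# Hypothetical sizes (sq ft) and prices: a clear positive relationship
x = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
y = np.array([150, 200, 260, 300, 360], dtype=float)

# Covariance: direction only, magnitude depends on the units
cov = np.mean((x - x.mean()) * (y - y.mean()))

# Pearson: covariance normalised by both standard deviations
pearson = cov / (x.std() * y.std())

print(cov, pearson)
```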
Spearman rank Correlation Coefficient
- Spearman CC is the Pearson correlation coefficient of rank(X) & rank(Y)
- Find rank(x) and rank(y); take the differences di = rank(xi) - rank(yi)
- Spearman rank CC = 1 - (6 * sigma(di^2)) / (n * (n^2 - 1))
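The rank formula can be sketched and cross-checked against `scipy.stats.spearmanr`; the two score lists are made up and perfectly monotone, so rho comes out as 1:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Hypothetical scores: monotone but not linear relationship
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 8, 16, 32], dtype=float)

# Rank-difference formula: 1 - 6 * sum(di^2) / (n * (n^2 - 1))
d = rankdata(x) - rankdata(y)
n = len(x)
rho = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

rho_lib, p = spearmanr(x, y)  # library cross-check
print(rho, rho_lib)
```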