Statistics in Machine Learning

Note: In the data science life cycle, more than 60% of the time goes into data analysis such as feature engineering and feature selection. Feature engineering means cleaning data and handling missing values, imbalanced data, and categorical features.
 
Python packages
  • Pandas - read and manipulate datasets
    • read_csv, head, isnull, get_dummies, drop, concat
  • NumPy - work with arrays
  • matplotlib.pyplot - for visualization
  • Seaborn - for visualization
    • heatmap, countplot, boxplot
Handling Categorical features
  • One-hot encoding for nominal variables
  • Label encoding for ordinal variables (a minimal sketch of both follows below)
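A minimal sketch of both encodings with pandas, on a made-up DataFrame (the 'city' and 'size' columns are hypothetical):

    import pandas as pd

    df = pd.DataFrame({
        'city': ['Delhi', 'Mumbai', 'Delhi'],   # nominal: no natural order
        'size': ['small', 'large', 'medium'],   # ordinal: small < medium < large
    })

    # One-hot encoding for the nominal variable
    one_hot = pd.get_dummies(df, columns=['city'])

    # Label encoding for the ordinal variable, preserving the order
    size_order = {'small': 0, 'medium': 1, 'large': 2}
    df['size_encoded'] = df['size'].map(size_order)

    print(one_hot)
    print(df)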
Ways of finding outliers
  • Scatter plot
  • Box plot 
  • z-score
  • IQR

Correlation

  • Strength of association between two variables 
    • Correlation is symmetric: the correlation between A and B equals the correlation between B and A
Regression
  • Used when one of the variables is dependent and the other is independent
  • Regression equation: the average value of y is modeled as a function of x
R Square
  • Proportion of the variance in y that the regression explains
Significance of F & P values
  • The F statistic tests overall model significance; p-values test the individual coefficients (see the sketch below)
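A minimal sketch, assuming statsmodels is available, on made-up data; it shows where R Square and the F and p values are read from after fitting a regression:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 2.0 * x + rng.normal(size=100)   # y depends on x plus noise

    X = sm.add_constant(x)               # add the intercept term
    model = sm.OLS(y, X).fit()

    print(model.rsquared)   # R Square: fraction of variance in y explained
    print(model.fvalue)     # F statistic: overall model significance
    print(model.pvalues)    # p-values for the intercept and the slope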
Covariance(cov)
  • Quantifies the relationship between features (random variables) in a given dataset
  • Helps in finding the direction of the relationship
  • Important topic for data preprocessing
    •  Examples
      • Quantify relationship between Size & prices of houses
      • Height, weight
  • cov(size, price) = (1/n) * sigma((xi - xmu) * (yi - ymu)), summed over all elements (sketched below)
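A minimal sketch of this formula on made-up size/price numbers; np.cov is used only as a cross-check (bias=True makes NumPy divide by n, matching the formula above):

    import numpy as np

    size  = np.array([1000.0, 1500.0, 2000.0, 2500.0])
    price = np.array([200.0, 280.0, 350.0, 425.0])

    n = len(size)
    cov_manual = np.sum((size - size.mean()) * (price - price.mean())) / n

    print(cov_manual)
    print(np.cov(size, price, bias=True)[0, 1])   # same value via NumPy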

Pearson Correlation Coefficient
  • Pearson CC = cov(x, y) / (stddev(x) * stddev(y))
  • Provides both the direction and the strength of the relationship
  • Value always ranges between -1 and 1 (see the sketch below)
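A sketch of the same ratio on the toy size/price data above; np.corrcoef computes it directly (NumPy's default std divides by n, which matches the 1/n covariance):

    import numpy as np

    size  = np.array([1000.0, 1500.0, 2000.0, 2500.0])
    price = np.array([200.0, 280.0, 350.0, 425.0])

    n = len(size)
    cov_xy = np.sum((size - size.mean()) * (price - price.mean())) / n
    pearson = cov_xy / (size.std() * price.std())

    print(pearson)
    print(np.corrcoef(size, price)[0, 1])   # same value via NumPy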
Spearman's Rank Correlation Coefficient (covered in detail below)

Finding outliers in a dataset
  • Z-score
    • z score = (xi - mu) / stddev
    • A data point beyond the 3rd standard deviation (|z| > 3) is treated as an outlier
  • IQR
    • Inter Quartile Range
      • IQR = Q3 - Q1 (75th percentile - 25th percentile); points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers (both rules are sketched below)
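A minimal sketch of both rules on a made-up array with one planted outlier:

    import numpy as np

    rng = np.random.default_rng(0)
    data = np.append(rng.normal(50, 5, 100), 120.0)   # 120 is the planted outlier

    # Z-score rule: |z| > 3 flags an outlier
    z = (data - data.mean()) / data.std()
    print(data[np.abs(z) > 3])

    # IQR rule: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    print(data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)])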
Logistic Regression
  • from sklearn.linear_model import LogisticRegression
  • from sklearn.metrics import confusion_matrix
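A minimal end-to-end sketch using those two imports on a made-up binary dataset (the data itself is hypothetical):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy binary target

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print(confusion_matrix(y_test, y_pred))   # rows: actual, columns: predicted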

T test, Chi Square test, Anova test

  • One-sample proportion test
    • When there is one categorical feature
  • Chi-square test
    • Two categorical features (sketched below)
  • T test
    • One numerical feature
  • Correlation
    • Two numerical features
  • Anova
    • One numerical feature compared across more than two categories (groups)
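As one example from the table above, a chi-square test of independence between two hypothetical categorical features (gender vs. purchased), sketched with scipy:

    import pandas as pd
    from scipy.stats import chi2_contingency

    df = pd.DataFrame({
        'gender':    ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
        'purchased': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'yes'],
    })

    table = pd.crosstab(df['gender'], df['purchased'])   # contingency table
    chi2, p, dof, expected = chi2_contingency(table)

    print(p)   # a small p-value suggests the two features are dependent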

Cosine Similarity & Cosine Distance

  • Cosine similarity is represented as cos(theta)
  • It is the cosine of the angle between two vectors, e.g. on a two-dimensional chart of movies (action on the x-axis, comedy on the y-axis)
  • cosine distance = 1 - cosine similarity (sketched below)
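A minimal sketch with two hypothetical movie vectors (action score on the x-axis, comedy score on the y-axis):

    import numpy as np

    movie_a = np.array([4.0, 1.0])   # mostly action
    movie_b = np.array([3.0, 2.0])   # action with some comedy

    cos_sim = np.dot(movie_a, movie_b) / (np.linalg.norm(movie_a) * np.linalg.norm(movie_b))
    cos_dist = 1 - cos_sim

    print(cos_sim, cos_dist)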
Covariance (recap)
  • variance(x) = (1/n) * sigma((xi - xmu)^2)
  • cov(x, y) = (1/n) * sigma((xi - xmu) * (yi - ymu))
  • Quantifies the relationship between features (random variables) in a dataset
  • Tells the direction of the relationship, positive or negative, BUT not its strength (how much)
Pearson Correlation Coefficient (recap)
  • Pearson CC = cov(x, y) / (stddev(x) * stddev(y))
  • Tells the direction of the relationship as well as its strength
  • Value ranges between -1 and 1
Spearman Rank Correlation Coefficient
  • Spearman CC = the Pearson correlation coefficient applied to rank(x) and rank(y)
  • Find rank(x) and rank(y); take the difference di = rank(xi) - rank(yi) for each pair
  • Spearman rank CC = 1 - (6 * sigma(di^2)) / (n * (n^2 - 1)) (sketched below)
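A sketch of the rank formula next to scipy's built-in version, on made-up values (the two agree when there are no ties):

    import numpy as np
    from scipy.stats import spearmanr, rankdata

    x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
    y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

    # Manual formula: 1 - 6 * sum(di^2) / (n * (n^2 - 1))
    d = rankdata(x) - rankdata(y)
    n = len(x)
    manual = 1 - (6 * np.sum(d ** 2)) / (n * (n ** 2 - 1))

    rho, p = spearmanr(x, y)
    print(manual, rho)   # both print 0.8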

 
