
Complete Guide to Learn Statistics

Statistics is a branch of mathematics concerned with collecting, analyzing, and interpreting data. It involves using statistical methods to make inferences and predictions about a population based on a sample of data. Statistics is used in a variety of fields, including biology, finance, social sciences, and many others, to gain insights and make decisions based on data.

Statistics and Data Science

Statistics and data science are related fields that both involve working with data to extract insights and make decisions. Statistics provides a set of tools and methods for analyzing and interpreting data, while data science encompasses a broader range of techniques and approaches for working with data, including those from computer science, statistics, and domain-specific knowledge. Data scientists use statistical methods, among other techniques, to analyze data and create predictive models. In addition to statistical analysis, data science often involves tasks such as data cleaning, data visualization, and communicating results to stakeholders.

Some important concepts are:

  1. Descriptive Statistics
  2. Probability
  3. Inferential Statistics
  4. Non-parametric Statistics
  5. Analysis of Variance (ANOVA)
  6. Experimental Design
  7. Regression Analysis
  8. Time Series Analysis
  9. Multivariate Analysis
  10. Machine Learning Algorithms

Descriptive Statistics: Mean, median, mode, standard deviation, etc.

Descriptive statistics is a branch of statistics that deals with summarizing, organizing, and presenting data in a meaningful way. Descriptive statistics provides ways to describe and characterize the central tendency, dispersion, and distribution of data. Some common descriptive statistics measures include mean, median, mode, standard deviation, variance, and percentiles. Descriptive statistics helps to simplify and summarize large amounts of data and make it easier to understand and communicate. It is a useful tool for exploratory data analysis and can provide important insights into the characteristics of a dataset.

Python Implementation:

import numpy as np
import pandas as pd

# Generate data
data = np.random.normal(100, 10, 1000)

# Create a Pandas DataFrame from the data
df = pd.DataFrame(data, columns=['Values'])

# Calculate mean
mean = df['Values'].mean()

# Calculate median
median = df['Values'].median()

# Calculate mode
mode = df['Values'].mode().values[0]

# Calculate range (named value_range to avoid shadowing the built-in range)
value_range = df['Values'].max() - df['Values'].min()

# Calculate variance
variance = df['Values'].var()

# Calculate standard deviation
standard_deviation = df['Values'].std()

# Calculate interquartile range (IQR)
IQR = df['Values'].quantile(0.75) - df['Values'].quantile(0.25)

# Print results
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
print("Range:", range)
print("Variance:", variance)
print("Standard Deviation:", standard_deviation)
print("IQR:", IQR)

This code generates random data using the numpy library, creates a Pandas DataFrame from the data, and calculates various descriptive statistics measures such as mean, median, mode, range, variance, standard deviation, and interquartile range (IQR). The results are then printed to the console.

Probability: Basic rules of probability, Bayes’ Theorem, random variables and distributions.

Probability is a branch of mathematics that deals with the likelihood of events and the analysis of random phenomena. It provides a mathematical framework for modeling and understanding uncertainty. The basic idea of probability is to assign a number between 0 and 1 to represent the likelihood of an event occurring, with 0 meaning that an event is impossible and 1 meaning that an event is certain.

Probability theory is used in a wide range of applications, including statistical inference, decision making under uncertainty, risk management, and many others. It provides a set of mathematical tools for making predictions about the behavior of random processes and for estimating the likelihood of different outcomes. The laws of probability and statistical inference form the foundation of many statistical methods and are essential for data analysis and machine learning.

import numpy as np
from scipy.stats import norm

# Basic Rules of Probability
# P(A or B) = P(A) + P(B) - P(A and B)
prob_a = 0.7
prob_b = 0.5
prob_a_and_b = 0.3
prob_a_or_b = prob_a + prob_b - prob_a_and_b
print("P(A or B):", prob_a_or_b)

# Bayes' Theorem
# P(A|B) = P(B|A) * P(A) / P(B)
prob_b_given_a = 0.6
prob_a_given_b = prob_b_given_a * prob_a / prob_b
print("P(A|B):", prob_a_given_b)

# Random Variables and Distributions
# Generate random samples from a normal distribution
np.random.seed(0)
samples = norm.rvs(loc=0, scale=1, size=1000)
print("Mean of samples:", np.mean(samples))
print("Variance of samples:", np.var(samples))

# Plot the histogram of the samples
import matplotlib.pyplot as plt
plt.hist(samples, bins=50, density=True)
plt.show()

In this code, numpy is used for basic calculations and scipy.stats is used to generate random samples from a normal distribution. The histogram of the samples is plotted using matplotlib to visualize the distribution.

Inferential Statistics: Point and interval estimation, hypothesis testing, correlation and regression analysis.

Inferential statistics is a branch of statistics that deals with making inferences about a population based on a sample of data. It involves using statistical methods to make predictions and draw conclusions about a population from a sample. The goal of inferential statistics is to generalize from a sample to a population and make inferences about population parameters based on sample statistics.

Inferential statistics makes use of probability theory to quantify the uncertainty in inferences. It provides methods for hypothesis testing, estimating population parameters, and making predictions about future events based on past data. Inferential statistics plays a crucial role in many scientific and engineering disciplines, including medicine, finance, marketing, and many others. By using statistical methods to make inferences, inferential statistics provides a powerful tool for making decisions and drawing conclusions based on data.

import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt

# Generating a random sample data
np.random.seed(100)
sample_data = np.random.normal(100, 10, 100)

# Point Estimation
mean_estimate = np.mean(sample_data)
print("Mean Estimate: ", mean_estimate)

# Interval Estimation
conf_int = stats.norm.interval(0.95, loc=mean_estimate, scale=10/np.sqrt(100))
print("Confidence Interval: ", conf_int)

# Hypothesis Testing
hyp_mean = 105
t_statistic, p_value = stats.ttest_1samp(sample_data, hyp_mean)
if p_value < 0.05:
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")

# Correlation Analysis
df = pd.DataFrame({'Variable1': sample_data, 'Variable2': np.random.normal(105, 15, 100)})
corr_matrix = df.corr()
print("Correlation Matrix: ", corr_matrix)

# Regression Analysis
slope, intercept, r_value, p_value, std_err = stats.linregress(df['Variable1'], df['Variable2'])
print("Regression Results: Slope - {}, Intercept - {}, R-value - {}, P-value - {}".format(slope, intercept, r_value, p_value))

This code demonstrates the implementation of point and interval estimation, hypothesis testing, correlation analysis, and regression analysis using various statistical functions available in Python’s numpy and scipy.stats libraries.

Non-parametric Statistics: Wilcoxon rank-sum test, Kruskal-Wallis test, etc.

Non-parametric statistics is a branch of statistics that does not assume a specific distribution for the data, unlike parametric statistics which assumes a particular distribution (such as normal distribution). Non-parametric methods are used when the distribution of the data is unknown or when the data does not meet the assumptions required for parametric methods.

Non-parametric methods provide a flexible alternative to parametric methods and can be used when the sample size is small, the data is categorical, or when the data is not normally distributed. Some common non-parametric methods include the Wilcoxon rank-sum test (also known as the Mann-Whitney U test) for comparing two independent samples and the Kruskal-Wallis test for comparing more than two groups. Non-parametric methods are widely used in fields such as psychology, sociology, medicine, and many others. They provide a valuable tool for making inferences and drawing conclusions from data without making strong assumptions about the underlying distribution of the data.

The following sample code performs the Wilcoxon rank-sum test (also known as the Mann-Whitney U test) in Python using the scipy library:

import numpy as np
from scipy.stats import mannwhitneyu

# sample data
data1 = [1, 2, 3, 4, 5]
data2 = [2, 4, 6, 8, 10]

# perform Wilcoxon rank-sum test
stat, p_value = mannwhitneyu(data1, data2)

# print results
print("Statistic: ", stat)
print("P-value: ", p_value)

# interpret results
if p_value < 0.05:
    print("Reject null hypothesis, samples are significantly different")
else:
    print("Fail to reject null hypothesis, samples are not significantly different")

The following code uses the kruskal function from the scipy.stats library to perform the Kruskal-Wallis test. The function takes as input a list of arrays representing the different groups. The test returns a test statistic and a p-value, which are used to determine if there is a significant difference between the groups. If the p-value is less than 0.05, we reject the null hypothesis that the groups are from the same population, indicating that there is a significant difference between the groups.

import scipy.stats as stats
import numpy as np

# Define the data for three different groups
group1 = [1, 2, 3, 4, 5]
group2 = [6, 7, 8, 9, 10]
group3 = [11, 12, 13, 14, 15]

# Combine the data into a single list
data = [group1, group2, group3]

# Perform the Kruskal-Wallis test
stat, p_value = stats.kruskal(*data)

# Print the results
print("Statistic: ", stat)
print("p-value: ", p_value)

# Interpret the results
if p_value < 0.05:
    print("Reject the null hypothesis, the groups are not from the same population.")
else:
    print("Fail to reject the null hypothesis, the groups are from the same population.")

Analysis of Variance (ANOVA): One-way ANOVA, two-way ANOVA, repeated measures ANOVA, etc.

Analysis of Variance (ANOVA) is a statistical technique used to test the equality of means among two or more groups. It is used to determine if there is a significant difference between the means of two or more groups, and is commonly used in experimental design and analysis to test for the effects of one or more independent variables on a dependent variable.

ANOVA decomposes the total variance in a data set into the variance between groups and the variance within groups. The between-group variance measures the variability of the group means, while the within-group variance measures the variability of the observations within each group. ANOVA provides a statistical test to determine if the between-group variance is significantly larger than the within-group variance, which would indicate a significant difference between the means of the groups.

ANOVA is used in a wide range of fields, including psychology, biology, marketing, and many others, to test hypotheses about differences between groups and to make inferences about population means. ANOVA is a powerful tool for making decisions and drawing conclusions based on data and is widely used in experimental design and data analysis.

import scipy.stats as stats

# Sample data for groups 1, 2, and 3
group1 = [12, 15, 20, 21, 23, 25]
group2 = [10, 12, 15, 18, 22, 27]
group3 = [13, 14, 17, 19, 22, 24]

# Combine the groups into a list of lists
groups = [group1, group2, group3]

# Perform one-way ANOVA
f_value, p_value = stats.f_oneway(*groups)

# Print the results
print("F-value: ", f_value)
print("P-value: ", p_value)

# Interpret the results
if p_value < 0.05:
    print("Reject null hypothesis: Means are not equal.")
else:
    print("Fail to reject null hypothesis: Means are equal.")

In this code, we use the scipy.stats.f_oneway function to perform the one-way ANOVA test. The function returns the F-value and P-value of the test. If the P-value is less than 0.05, we reject the null hypothesis that the means of the groups are equal and conclude that at least one of the groups has a different mean.
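
The heading above also lists two-way ANOVA, which tests two factors and their interaction at the same time. The sketch below is a minimal illustration using statsmodels; the fertilizer/watering dataset and column names are invented for the example:

import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Hypothetical balanced dataset: crop yield under two factors
df = pd.DataFrame({
    'fertilizer': ['A', 'A', 'A', 'B', 'B', 'B'] * 2,
    'watering': ['low'] * 6 + ['high'] * 6,
    'yield_': [20, 22, 19, 25, 27, 26, 30, 31, 29, 35, 36, 34]
})

# Fit a linear model with both main effects and their interaction
model = ols('yield_ ~ C(fertilizer) * C(watering)', data=df).fit()

# Two-way ANOVA table (Type II sums of squares)
print(anova_lm(model, typ=2))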

Experimental Design: Completely randomized design, randomized block design, factorial design, etc.

Experimental design is the process of planning and conducting a study to test a hypothesis. It involves defining the research question, selecting a sample, manipulating independent variables, controlling extraneous variables, and measuring a dependent variable. The goal of experimental design is to establish a cause-and-effect relationship between the independent and dependent variables by manipulating the independent variable and observing the effect on the dependent variable.

Experimental design plays a critical role in many scientific and engineering disciplines, including medicine, psychology, biology, and many others. It provides a systematic approach to test hypotheses and draw conclusions about the relationships between variables. Good experimental design requires careful planning, control of extraneous variables, and the use of appropriate statistical methods to analyze the data.

There are many different types of experimental designs, including completely randomized designs, randomized block designs, factorial designs, and others. The choice of design depends on the research question and the type of data being collected. Effective experimental design is essential for accurately testing hypotheses and making valid inferences about population parameters.

import numpy as np

def randomized_block_design(treatments, blocks):
    # Number of treatments and blocks
    n_treatments = len(treatments)
    n_blocks = len(blocks)

    # Each row corresponds to one block; every block receives each treatment once
    assigned_treatments = np.zeros((n_blocks, n_treatments), dtype=int)

    # Randomly permute the order of the treatments within each block
    for i in range(n_blocks):
        assigned_treatments[i] = np.random.permutation(treatments)

    return assigned_treatments

# Define treatments
treatments = np.array([1, 2, 3, 4])

# Define blocks
blocks = np.array([1, 2, 3, 4])

# Generate randomized block design
assigned_treatments = randomized_block_design(treatments, blocks)

print(assigned_treatments)

This code generates a randomized block design in which every block receives all the treatments, applied in a randomly permuted order within each block. Each row of the assigned_treatments array corresponds to one block.
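
For comparison, a completely randomized design assigns treatments to experimental units entirely at random, without blocking. A minimal sketch, assuming 4 treatments with 3 replicates each:

import numpy as np

# 4 treatments, 3 replicates each (assumed numbers for illustration)
treatments = np.array([1, 2, 3, 4])
n_replicates = 3

# Repeat each treatment label and randomly assign the labels to the 12 units
assignments = np.random.permutation(np.repeat(treatments, n_replicates))
print(assignments)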

Regression Analysis: Simple linear regression, multiple linear regression, logistic regression, etc.

Regression analysis is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. It is used to make predictions about the value of the dependent variable based on the values of the independent variables. Regression analysis is widely used in many fields, including economics, finance, psychology, and many others, to understand and quantify the relationships between variables.

Regression models can be linear or non-linear and can have one or multiple independent variables. The most common type of regression is linear regression, which models the relationship between the dependent variable and independent variable as a linear equation. Non-linear regression models are used when the relationship between the variables is more complex.

Regression analysis involves fitting a regression model to the data and using statistical methods to assess the strength and significance of the relationship between the variables. The goal of regression analysis is to find the best-fitting model that accurately predicts the value of the dependent variable based on the independent variables. Regression analysis provides a powerful tool for making predictions, understanding the relationships between variables, and making inferences about population parameters.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load the data into a Pandas dataframe
data = pd.read_csv("data.csv")

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data[['x']], data['y'], test_size=0.2)

# Create a Linear Regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Predict the target variable on the test data
y_pred = model.predict(X_test)

# Print the model coefficients
print('Intercept:', model.intercept_)
print('Slope:', model.coef_[0])
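
The section heading also lists logistic regression, which models a binary outcome rather than a continuous one. The following is a minimal sketch using scikit-learn; the small hours-studied/pass-fail dataset is invented for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Fit the logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X, y)

# Predicted probability of passing after 4.5 hours of study
print(log_reg.predict_proba([[4.5]])[0, 1])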

Time Series Analysis: ARIMA, GARCH, exponential smoothing, etc.

Time series analysis is a statistical technique used to analyze and model the behavior of a variable over time. It is used to analyze data that is collected at regular intervals, such as daily, weekly, monthly, or yearly data. Time series analysis is widely used in fields such as economics, finance, weather forecasting, and many others to make predictions and understand the patterns and trends in data over time.

Time series analysis involves analyzing the properties of the data, such as trends, seasonality, and residuals, and modeling the data using statistical techniques. The goal of time series analysis is to identify patterns and trends in the data, make predictions about future values, and develop a model that can be used to make predictions based on past data.

There are many different methods used in time series analysis, including trend analysis, seasonality analysis, exponential smoothing, and ARIMA (AutoRegressive Integrated Moving Average) models. The choice of method depends on the characteristics of the data and the research question. Time series analysis provides a powerful tool for understanding and predicting the behavior of a variable over time.

The following example performs ARIMA (AutoRegressive Integrated Moving Average) time series analysis using Python’s statsmodels library:

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error

# Load time series data
data = pd.read_csv("time_series_data.csv")

# Split time series data into train and test sets
train = data[:int(len(data) * 0.8)]
test = data[int(len(data) * 0.8):]

# Fit ARIMA model
model = ARIMA(train, order=(2, 1, 1))
model_fit = model.fit()

# Make predictions for the test period
forecast = model_fit.forecast(steps=len(test))

# Evaluate the model
error = mean_squared_error(test, forecast)
print("MSE: ", error)

This code will fit an ARIMA model to the training data and make predictions on the test data. The mean squared error (MSE) between the actual and predicted values is calculated and printed as the evaluation of the model.

The order of the ARIMA model, (2, 1, 1), refers to the number of auto-regressive (AR) terms, the number of differences (I) needed for stationarity, and the number of moving average (MA) terms, respectively. The choice of the ARIMA order can be determined through trial and error or by using methods such as the auto-ARIMA algorithm.
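
As a hedged sketch of the automatic approach, the separate pmdarima package (an assumption, not used elsewhere in this guide) can search over candidate orders; the single-column CSV layout is also assumed:

import pandas as pd
import pmdarima as pm

# Load the time series as a single column and keep the training portion
data = pd.read_csv("time_series_data.csv")
series = data.iloc[:, 0]
train = series[:int(len(series) * 0.8)]

# Search over candidate (p, d, q) orders using information criteria
auto_model = pm.auto_arima(train, seasonal=False, stepwise=True)
print(auto_model.order)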

The following example performs GARCH (Generalized Autoregressive Conditional Heteroskedasticity) time series analysis using Python’s arch library:

import numpy as np
import pandas as pd
from arch import arch_model
from sklearn.metrics import mean_squared_error

# Load time series data
data = pd.read_csv("time_series_data.csv")

# Split time series data into train and test sets
train = data[:int(len(data) * 0.8)]
test = data[int(len(data) * 0.8):]

# Fit GARCH model
model = arch_model(train, mean="Zero", vol="GARCH", p=1, q=1)
model_fit = model.fit()

# Forecast the conditional variance over the test period
forecast = model_fit.forecast(horizon=len(test), method='simulation')

# Evaluate the model: GARCH predicts volatility, so compare the forecast
# variance with the squared observed values
error = mean_squared_error(test ** 2, forecast.variance.iloc[-1, :])
print("MSE: ", error)

This code fits a GARCH model to the training data and forecasts the conditional variance over the test period. Because GARCH models volatility rather than the level of the series (it is typically fitted to return series), the forecast variance is compared with the squared observed values to compute the MSE. In the arch library, p is the number of ARCH terms (lagged squared residuals) and q is the number of GARCH terms (lagged conditional variances); suitable orders can be chosen by trial and error or with information criteria.
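
Exponential smoothing, also listed above, can be fitted with statsmodels. The sketch below assumes the same single-column CSV layout as the previous examples:

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_squared_error

# Load the time series as a single column and split into train and test sets
data = pd.read_csv("time_series_data.csv")
series = data.iloc[:, 0]
train = series[:int(len(series) * 0.8)]
test = series[int(len(series) * 0.8):]

# Fit an exponential smoothing model with an additive trend
model = ExponentialSmoothing(train, trend="add")
model_fit = model.fit()

# Forecast the test period and evaluate with MSE
forecast = model_fit.forecast(steps=len(test))
print("MSE: ", mean_squared_error(test, forecast))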

Multivariate Analysis: Principal Component Analysis (PCA), Factor Analysis, Canonical Correlation, etc.

Multivariate analysis is a statistical technique used to analyze and model the relationships between multiple variables. It is used to examine the relationship between two or more independent variables and a dependent variable, and is commonly used in fields such as psychology, marketing, finance, and many others.

Multivariate analysis is an extension of regression analysis, which models the relationship between a dependent variable and one independent variable. Multivariate analysis can be used to model complex relationships between multiple independent variables and a dependent variable. There are many different methods of multivariate analysis, including multiple regression, logistic regression, discriminant analysis, and others.

The goal of multivariate analysis is to understand the relationships between multiple variables and to make predictions about the value of the dependent variable based on the values of the independent variables. Multivariate analysis provides a powerful tool for making predictions, understanding complex relationships between variables, and making inferences about population parameters. It is a useful tool for analyzing complex data sets and for making decisions based on multiple variables.

import numpy as np
from sklearn.decomposition import PCA

# Input data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# PCA model with 2 components
pca = PCA(n_components=2)

# Fit the model and transform the data to the first two principal components
X_pca = pca.fit_transform(X)

# Print the transformed data
print(X_pca)

The output will be a transformed data set with reduced dimensionality, represented by the first two principal components.
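
Factor analysis, also mentioned in the heading, models the observed variables as combinations of a smaller number of latent factors. A minimal sketch with scikit-learn on the same toy data:

import numpy as np
from sklearn.decomposition import FactorAnalysis

# Same toy data as the PCA example
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Factor analysis with a single latent factor
fa = FactorAnalysis(n_components=1)
X_fa = fa.fit_transform(X)
print(X_fa)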

Machine Learning Algorithms: K-nearest neighbors (KNN), decision trees, random forests, etc.

The k-nearest neighbors (KNN) algorithm is a supervised machine learning algorithm used for classification and regression. It is based on the idea that the closest neighbors to a data point are the most similar and should have the most influence on the prediction for that data point.

In Python, the KNN algorithm can be implemented using the scikit-learn library. The following code provides a basic example of KNN implementation in Python:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# sample data
X = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
y = np.array([0, 0, 1, 1, 1, 0])

# create the model
knn = KNeighborsClassifier(n_neighbors=3)

# fit the model
knn.fit(X, y)

# predict for new data
new_data = np.array([[3, 4]])
prediction = knn.predict(new_data)
print(prediction)

In this example, the KNN algorithm is used to predict the class of a new data point [3, 4] using the training data X and the target labels y. The parameter n_neighbors=3 means that the algorithm will consider the 3 nearest neighbors to make a prediction. The prediction result will be an array with a single value, either 0 or 1, depending on the class to which the new data point belongs.

A decision tree is a type of algorithm used in supervised learning for both classification and regression. It is a tree-like model of decisions and their possible consequences, used to predict the value of a target variable by learning simple decision rules inferred from the data features.

In Python, decision trees can be implemented using the scikit-learn library. The following code provides a basic example of a decision tree implementation in Python:

import numpy as np
from sklearn import tree

# sample data
X = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
y = np.array([0, 0, 1, 1, 1, 0])

# create the model
clf = tree.DecisionTreeClassifier()

# fit the model
clf = clf.fit(X, y)

# predict for new data
new_data = np.array([[3, 4]])
prediction = clf.predict(new_data)
print(prediction)

In this example, the decision tree algorithm is used to predict the class of a new data point [3, 4] using the training data X and the target labels y. The prediction result will be an array with a single value, either 0 or 1, depending on the class to which the new data point belongs. The decision tree algorithm splits the data into smaller subgroups based on the features, making a prediction for each subgroup, until it reaches a leaf node with a predicted value.
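
Random forests, the third algorithm listed in the heading, combine many decision trees trained on bootstrap samples and average their predictions. A minimal sketch with scikit-learn on the same toy data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Same toy data as the KNN and decision tree examples
X = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
y = np.array([0, 0, 1, 1, 1, 0])

# Random forest with 100 trees
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Predict the class of a new data point
print(rf.predict(np.array([[3, 4]])))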

If you need help running any of the code above, please DM me and I will be happy to answer your queries.

Online Learning Resource

Join the Job Guaranteed Data Science course at elearners365.com


Join Free Data Science Webinars

If you have any questions or need help getting started, please let me know. I would be more than happy to assist you.

My LinkedIn: www.linkedin.com/in/connectjaya

My Email: [email protected]
