Most Important Statistics Topics in Data Science

Think Different - Dhiraj Patra
16 min read · Jul 26, 2023


The top statistics concepts asked in data science interviews are:

  • p-value

p-value: The p-value is a statistical measure used to assess the significance of a result. It is the probability of observing data at least as extreme as what was actually observed, assuming the null hypothesis is true. A low p-value (typically less than 0.05) indicates that the observed data would be unlikely under the null hypothesis, so the null hypothesis is rejected.

Example: A company is testing a new marketing campaign to see if it increases sales. They compare sales data from the months before and after the campaign was launched. The p-value for the difference in sales is 0.01. This means that if the campaign actually had no effect, there would be only a 1% chance of seeing a difference in sales this large. The company can therefore conclude that the increase in sales is statistically significant.
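As a sketch of how such a p-value might be computed (the before/after sales figures below are invented for illustration), a two-sample t-test returns it directly:

```python
from scipy import stats

# Hypothetical daily sales before and after the campaign (invented numbers)
before = [102, 98, 95, 101, 99, 97, 103, 100]
after = [108, 112, 105, 110, 107, 111, 109, 106]

# Two-sample t-test: the null hypothesis is "no difference in mean sales"
t_stat, p_value = stats.ttest_ind(after, before)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Reject the null hypothesis: the difference is statistically significant.")
```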

  • Core subcategories

Core subcategories: These are the areas of statistics most commonly tested in data science interviews:

Probability, sampling, hypothesis testing, confidence intervals, regression analysis, time series analysis, and machine learning.

  • Linear Regression

Linear Regression: Linear regression is a statistical method that is used to model the relationship between two or more variables. In its simplest form, linear regression models the relationship between a dependent variable and one independent variable. The dependent variable is the variable that is being predicted, and the independent variable is the variable that is used to make the prediction.

Example: A company is trying to predict how much money they will make in sales next year. They have data on the company’s sales for the past five years, as well as data on the company’s marketing expenses for the past five years. They can use linear regression to model the relationship between sales and marketing expenses. The model can then be used to predict how much money the company will make in sales next year based on their marketing expenses.
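A minimal sketch with scikit-learn (the marketing-spend and sales figures below are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical yearly marketing spend and sales, both in $1000s (invented)
marketing = np.array([[50], [60], [70], [80], [90]])  # past five years
sales = np.array([540, 610, 700, 760, 850])

model = LinearRegression().fit(marketing, sales)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("R^2 on training data:", model.score(marketing, sales))

# Predict next year's sales from a planned marketing budget of $100k
print("predicted sales:", model.predict([[100]])[0])
```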

  • t-test

t-test: The t-test is a statistical test that is used to compare the means of two groups. The t-test can be used to test for a difference in means between two independent groups, or between two dependent groups.

Example: A company is testing two different types of marketing campaigns to see which one is more effective. They randomly assign half of their customers to one campaign and the other half to the other campaign. They then measure the sales of each group of customers after one month. The t-test can be used to test for a difference in sales between the two groups of customers.
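A minimal sketch with scipy, using sales simulated from invented distributions; the closing comment notes the paired variant for dependent groups:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical monthly sales per customer under each campaign (invented)
campaign_a = rng.normal(100, 15, size=50)
campaign_b = rng.normal(108, 15, size=50)

# Independent two-sample t-test
t, p = stats.ttest_ind(campaign_a, campaign_b)
print(f"independent: t = {t:.2f}, p = {p:.4f}")

# For dependent (paired) groups, e.g. the same customers measured twice,
# use stats.ttest_rel(before, after) instead.
```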

  • Correlation Coefficient

Correlation Coefficient: The correlation coefficient is a statistical measure that is used to quantify the strength of the relationship between two variables. The correlation coefficient can range from -1 to 1, where 0 indicates no linear correlation, -1 indicates a perfect negative correlation, and 1 indicates a perfect positive correlation.

Example: A company is trying to determine if there is a correlation between the number of times a customer visits their website and the amount of money they spend on the website. They can use the correlation coefficient to quantify the strength of the relationship between these two variables.
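A one-line computation with NumPy (the visit and spend figures are invented):

```python
import numpy as np

visits = np.array([1, 3, 5, 7, 9, 11])       # hypothetical website visits
spend = np.array([10, 28, 55, 60, 95, 110])  # hypothetical spend in $

r = np.corrcoef(visits, spend)[0, 1]  # Pearson correlation coefficient
print(f"r = {r:.2f}")  # close to +1: strong positive relationship
```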

  • Type of Errors

Type of Errors: There are two types of errors that can be made in hypothesis testing: Type I errors and Type II errors.

A Type I error is made when the null hypothesis is rejected when it is actually true.

A Type II error is made when the null hypothesis is not rejected when it is actually false.

The probability of making a Type I error (alpha) is typically set at 0.05, which means that there is a 5% chance of making this error. The probability of making a Type II error (beta) is often allowed to be up to 0.20, corresponding to a power of 0.80, i.e., a 20% chance of missing a real effect.

  • z-test

z-test: The z-test is a statistical test that is used to test for a difference in means between two groups when the population standard deviations are known.

Example: A company is testing two different types of marketing campaigns to see which one is more effective. They know the population standard deviation for each type of campaign. The z-test can be used to test for a difference in means between the two campaigns.
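Since common library helpers estimate the standard deviation from the sample, the sketch below computes the two-sample z statistic by hand from assumed known population values (all numbers invented):

```python
import numpy as np
from scipy import stats

# Hypothetical sample means, known population std devs, and sample sizes
mean_a, mean_b = 105.0, 110.0
sigma_a, sigma_b = 12.0, 15.0   # known population standard deviations
n_a, n_b = 200, 200

# z statistic for a difference in means with known variances
se = np.sqrt(sigma_a**2 / n_a + sigma_b**2 / n_b)
z = (mean_b - mean_a) / se
p = 2 * stats.norm.sf(abs(z))   # two-sided p-value
print(f"z = {z:.2f}, p = {p:.4f}")
```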

  • Central Limit Theorem

Central Limit Theorem: The Central Limit Theorem states that the distribution of the sample mean will approach a normal distribution as the sample size increases.

Example: A company is taking samples of its customers' satisfaction ratings. Even if individual ratings follow a skewed distribution, the Central Limit Theorem says that the mean of a sufficiently large sample will be approximately normally distributed. This is what justifies applying normal-theory methods, such as z-tests and confidence intervals, to the sample mean.
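A short simulation illustrates the theorem, using an invented, heavily skewed ratings population:

```python
import numpy as np

rng = np.random.default_rng(42)
# Individual values drawn from a heavily skewed (exponential) distribution
population = rng.exponential(scale=2.0, size=100_000)

# Means of many samples of size 50: their distribution is approximately normal
sample_means = [rng.choice(population, size=50).mean() for _ in range(2000)]
print("mean of sample means:", np.mean(sample_means))  # close to the population mean (2.0)
print("std of sample means:", np.std(sample_means))    # close to sigma / sqrt(50)
```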

  • Skewed Distribution

Skewed Distribution: A skewed distribution is a distribution where the majority of the data points are concentrated on one side of the mean. In a right-skewed (positively skewed) distribution, the long tail is on the right and the mean is greater than the median; in a left-skewed distribution, the tail is on the left and the mean is less than the median. Income is a classic example of a right-skewed variable.

  • Power Analysis

Power Analysis: Power analysis is a statistical method used to determine the minimum sample size needed to detect a statistically significant effect. The power of a test is the probability of rejecting the null hypothesis when the alternative hypothesis is true.

Example: A company is testing a new marketing campaign to see if it increases sales. They want to have a power of 0.8, which means that there is an 80% chance of rejecting the null hypothesis when the alternative hypothesis is true. They know that the standard deviation of sales is 10 units, and they want to detect an increase in average sales of 5 units (a standardized effect size of 0.5). The power analysis will tell them how many customers they need to sample in order to have a power of 0.8.
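statsmodels can solve for the required sample size directly; the sketch below plugs in the standardized effect size of 5 / 10 = 0.5 from the example:

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the per-group sample size needed to detect Cohen's d = 0.5
analysis = TTestIndPower()
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"required sample size per group: {n:.0f}")  # roughly 64
```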

  • Power

Power: Power is a measure of the ability of a statistical test to detect a real effect. A high power means that the test is likely to detect a real effect when one exists. A low power means that the test may fail to detect a real effect, even a substantial one.

Example: In the example of the company testing a new marketing campaign, the power of the test is 0.8. This means that there is an 80% chance of rejecting the null hypothesis when the alternative hypothesis is true. In other words, there is an 80% chance that the company will be able to detect a real increase in sales if the new marketing campaign is actually effective.

  • Simpson’s Paradox

Simpson’s Paradox: Simpson’s Paradox is a statistical phenomenon that occurs when the overall trend in a data set is reversed when the data is stratified by another variable.

Example: A company is testing two different types of marketing campaigns to see which one is more effective. They look at the overall results of the test and find that campaign A is more effective than campaign B. However, when they stratify the data by gender, they find that campaign B is actually more effective for men, while campaign A is more effective for women. This is an example of Simpson’s Paradox.
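A toy pandas example (all numbers invented) showing the classic full reversal: campaign A wins within every segment, yet campaign B wins overall because the segments are unevenly sized:

```python
import pandas as pd

df = pd.DataFrame({
    "campaign":  ["A", "A", "B", "B"],
    "gender":    ["men", "women", "men", "women"],
    "shown":     [87, 263, 270, 80],
    "converted": [81, 192, 234, 55],
})

# Within each gender, A has the higher conversion rate
print(df.assign(rate=df.converted / df.shown))

# Aggregated over genders, B has the higher conversion rate
overall = df.groupby("campaign")[["converted", "shown"]].sum()
print(overall.converted / overall.shown)
```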

  • R Squared

R Squared: R squared is the proportion of the variance in the dependent variable that is explained by a regression model. It can range from 0 to 1, where 0 means the model explains none of the variance and 1 means it explains all of it.

Example: A company is trying to determine if there is a relationship between the number of times a customer visits their website and the amount of money they spend on the website. They use R squared to quantify the strength of the relationship between these two variables. R squared is 0.5, which means that visit count explains 50% of the variance in spending, a moderate relationship.

  • Confidence Interval

Confidence Interval: A confidence interval is a range of values that is likely to contain the true value of a population parameter. The confidence interval is calculated using the sample data and the confidence level.

Example: A company is taking samples of its customers' satisfaction ratings and wants to estimate the true average rating. The 95% confidence interval is 80 to 90 (on a 0-100 scale). Strictly speaking, this means that if the sampling were repeated many times, 95% of the intervals constructed this way would contain the true average rating.
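A minimal sketch in Python (the ratings are invented); stats.t.interval builds the interval from the sample mean and its standard error:

```python
import numpy as np
from scipy import stats

# Hypothetical satisfaction ratings on a 0-100 scale (invented)
ratings = np.array([82, 88, 79, 91, 85, 87, 84, 90, 86, 83])

mean = ratings.mean()
sem = stats.sem(ratings)  # standard error of the mean
lo, hi = stats.t.interval(0.95, df=len(ratings) - 1, loc=mean, scale=sem)
print(f"95% CI for the true mean rating: ({lo:.1f}, {hi:.1f})")
```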

  • Multiple Testing

Multiple Testing: Multiple testing refers to the problem that arises when many statistical tests are performed at once: the chance of at least one false positive grows with the number of tests. Correction procedures such as Bonferroni (which controls the family-wise error rate) and Benjamini-Hochberg (which controls the false discovery rate, FDR) adjust for this. The FDR is the expected proportion of false positives among all the rejected null hypotheses.

Example: A company is testing 10 different marketing campaigns to see which ones are effective. They apply a Benjamini-Hochberg correction to control the FDR at 0.05, so that, among the campaigns declared effective, no more than 5% are expected to be false positives.
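A sketch using statsmodels' multipletests with the Benjamini-Hochberg method (the raw p-values below are invented):

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from testing 10 marketing campaigns (invented)
p_values = [0.001, 0.008, 0.02, 0.04, 0.05, 0.10, 0.20, 0.35, 0.60, 0.90]

# Benjamini-Hochberg procedure controls the false discovery rate at 5%
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print("significant after correction:", reject)
print("adjusted p-values:", p_adjusted.round(3))
```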

  • Exponential Distribution

Exponential Distribution: The exponential distribution is a continuous probability distribution that is often used to model the time between events. The exponential distribution has a parameter called the rate, which is the average number of events that occur in a unit of time.

Example: A company is trying to estimate the average time it takes for a customer to make a purchase. They can use the exponential distribution to model the time between purchases. The rate of the exponential distribution will be the average number of purchases that a customer makes in a unit of time.
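A sketch with simulated purchase gaps (the 5-day average gap is an invented assumption):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical times between purchases in days (invented): mean of 5 days
gaps = rng.exponential(scale=5.0, size=1000)

# Fit an exponential distribution; the rate is 1 / mean time between events
loc, scale = stats.expon.fit(gaps, floc=0)
print(f"estimated mean gap: {scale:.2f} days, rate: {1/scale:.2f} per day")
```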

  • Expectation

Expectation: Expectation is a measure of the central tendency of a probability distribution. The expectation of a random variable is the average value of the variable.

Example: A company is trying to estimate the average salary of its employees. The expectation of the salary distribution is the probability-weighted average of all possible salaries; for a discrete variable, E[X] = sum of x * P(x) over all values x.

  • Bootstrap

Bootstrap: Bootstrapping is a statistical method that is used to estimate the uncertainty of a statistical estimate. Bootstrapping works by resampling the original data set with replacement and then re-estimating the parameter of interest. This process is repeated many times, and the distribution of the bootstrap estimates is used to estimate the uncertainty of the original estimate.

Example: A company is trying to estimate the average salary of its employees. They use bootstrapping to estimate the uncertainty of the estimate. They resample the data set with replacement 1000 times, and then re-estimate the average salary each time. The distribution of the bootstrap estimates is used to estimate the uncertainty of the original estimate.
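A minimal sketch of a percentile bootstrap in NumPy (the salaries are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical salaries in $1000s (invented)
salaries = np.array([45, 52, 48, 61, 55, 72, 50, 58, 49, 66])

boot_means = [
    rng.choice(salaries, size=len(salaries), replace=True).mean()
    for _ in range(1000)  # resample with replacement 1000 times
]
# Percentile bootstrap 95% confidence interval for the mean salary
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap 95% CI for the mean: ({lo:.1f}, {hi:.1f})")
```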

  • Overfitting

Overfitting: Overfitting is a problem that occurs in machine learning when the model fits the training data too well. This can happen when the model is too complex or when there is not enough data. Overfitting can lead to poor performance on new data.

Example: A company is using a machine learning model to predict customer churn. The model is trained on data from the past year. The model fits the training data very well, but it performs poorly on new data. This is an example of overfitting.
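A small scikit-learn sketch of overfitting: a degree-15 polynomial fitted to 30 noisy points from a truly linear relationship (all data simulated):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = 2 * X.ravel() + rng.normal(0, 0.2, size=30)  # linear signal + noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A degree-15 polynomial fits the training data almost perfectly
# but generalizes poorly: classic overfitting.
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_tr, y_tr)
print("train R^2:", model.score(X_tr, y_tr))  # very high
print("test R^2:", model.score(X_te, y_te))   # much lower
```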

  • Coefficients

Coefficients: In statistics, a coefficient is a number that is used to quantify the relationship between two variables. Coefficients can be used to estimate the slope of a line in a linear regression model, or the strength of a correlation between two variables.

Example: A company is using linear regression to predict sales. The coefficient for the marketing budget variable is 0.5. This means that for every $1 increase in the marketing budget, sales are expected to increase by $0.50, holding the other variables in the model constant.

  • Covariance

Covariance: In statistics, covariance is a measure of the linear relationship between two variables. Covariance can be positive, negative, or zero. A positive covariance indicates that the two variables tend to move in the same direction, while a negative covariance indicates that the two variables tend to move in opposite directions. A covariance of zero indicates that there is no linear relationship between the two variables.

Example: A company is trying to determine if there is a relationship between the number of times a customer visits their website and the amount of money they spend on the website. They calculate the covariance between these two variables and find that it is positive. This indicates that the two variables tend to move in the same direction.

  • Mann-Whitney U Test

Mann-Whitney U Test: The Mann-Whitney U test is a non-parametric statistical test used to compare two independent groups, often interpreted as a comparison of their medians. Because it is non-parametric, it does not assume any particular distribution for the data.

Example: A company is trying to determine if there is a difference in the satisfaction ratings of two different groups of customers. They use the Mann-Whitney U test to compare the median satisfaction ratings of the two groups. The Mann-Whitney U test finds that there is a significant difference in the median satisfaction ratings of the two groups.
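A minimal sketch with scipy (the satisfaction ratings are invented):

```python
from scipy import stats

# Hypothetical satisfaction ratings (1-10) for two customer groups (invented)
group_1 = [7, 8, 6, 9, 7, 8, 5, 9]
group_2 = [4, 5, 6, 3, 5, 4, 6, 5]

u_stat, p = stats.mannwhitneyu(group_1, group_2, alternative="two-sided")
print(f"U = {u_stat}, p = {p:.4f}")
```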

  • Estimator

Estimator: In statistics, an estimator is a statistic that is used to estimate the value of a population parameter. Estimators can be biased or unbiased. A biased estimator is an estimator that systematically underestimates or overestimates the value of the population parameter. An unbiased estimator is an estimator that is not systematically biased.

Example: A company is trying to estimate the average salary of its employees. They can use the sample mean as an estimator of the population mean. The sample mean is an unbiased estimator of the population mean.

  • Normality Test

Normality Test: A normality test is a statistical test used to determine whether a data set is normally distributed, i.e., whether it follows a bell-shaped curve. Common normality tests include the Shapiro-Wilk, Kolmogorov-Smirnov, and Anderson-Darling tests; graphical checks such as Q-Q plots are often used alongside them.

Example: A company is trying to determine if the distribution of customer satisfaction ratings is normally distributed. They use the Shapiro-Wilk normality test to test for normality. The Shapiro-Wilk normality test finds that the distribution of customer satisfaction ratings is not normally distributed.
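A sketch of the Shapiro-Wilk test in Python, run on deliberately non-normal simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
ratings = rng.exponential(scale=3.0, size=200)  # deliberately skewed, non-normal

stat, p = stats.shapiro(ratings)
print(f"W = {stat:.3f}, p = {p:.4g}")
if p < 0.05:
    print("Reject normality: the data do not look normally distributed.")
```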

  • Bayes Theorem

Bayes Theorem: Bayes’ theorem is a theorem in probability theory that is used to calculate the probability of an event given the probability of another event. Bayes’ theorem can be used to update our beliefs about the probability of an event based on new information.

Example: A company is trying to determine if a new marketing campaign is effective. They use Bayes’ theorem to update their beliefs about the effectiveness of the campaign based on the results of the campaign.
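Bayes' theorem states that P(A|B) = P(B|A) * P(A) / P(B). A tiny numeric sketch of the belief update (the prior and likelihoods below are invented for illustration):

```python
# Hypothetical numbers: prior belief the campaign is effective is 30%;
# an observed sales lift is 80% likely if it works, 20% likely if it doesn't.
p_effective = 0.30
p_lift_given_effective = 0.80
p_lift_given_not = 0.20

p_lift = (p_lift_given_effective * p_effective
          + p_lift_given_not * (1 - p_effective))
posterior = p_lift_given_effective * p_effective / p_lift
print(f"P(effective | lift observed) = {posterior:.2f}")  # roughly 0.63
```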

  • Binomial Distribution

Binomial Distribution: In statistics, the binomial distribution is a discrete probability distribution that describes the number of successes in a sequence of independent Bernoulli trials, each with the same probability of success.

Here is a detailed example of how the binomial distribution can be used to calculate the probability of success for a new marketing campaign.

The company knows that the probability of a customer clicking on an advertisement is 0.25.

The company wants to know the probability of getting 10 clicks on 40 advertisements.

To calculate this probability, we can use the binomial distribution formula:

P(X = k) = nCk * p^k * (1 - p)^(n - k)

Where:

P(X = k) is the probability of getting k successes in n trials.

n is the number of trials.

k is the number of successes.

p is the probability of success on a single trial.

In this case, we have:

n = 40

k = 10

p = 0.25

Plugging these values into the formula, we get:

P(X = 10) = 40C10 * (0.25)^10 * (0.75)^30

Evaluating this expression with a calculator or statistical software gives approximately 0.1444.

This means that there is roughly a 14.4% chance of getting exactly 10 clicks on 40 advertisements when the probability of a customer clicking on an advertisement is 0.25.
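As a quick check, the same probability can be computed with scipy:

```python
from scipy import stats

# P(X = 10) for n = 40 trials with success probability p = 0.25
p_10 = stats.binom.pmf(10, n=40, p=0.25)
print(f"P(X = 10) = {p_10:.4f}")  # approximately 0.1444

# Probability of at least 10 clicks, for comparison
print(f"P(X >= 10) = {stats.binom.sf(9, n=40, p=0.25):.4f}")
```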

  • ANOVA

ANOVA (Analysis of Variance) is a statistical test used to compare the means of two or more groups; it is most useful with three or more groups (with exactly two, it reduces to the t-test). ANOVA determines whether there is a statistically significant difference among the group means.

Example: A company is trying to determine if there is a difference in the average salary of male and female employees. They use ANOVA to compare the average salaries of male and female employees. ANOVA finds that there is a significant difference in the average salaries of male and female employees.
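A minimal sketch with scipy; here three hypothetical departments are compared (salaries invented), since ANOVA shines with more than two groups:

```python
from scipy import stats

# Hypothetical salaries ($1000s) for three departments (invented)
dept_a = [55, 60, 58, 62, 57]
dept_b = [65, 70, 68, 72, 66]
dept_c = [56, 59, 61, 58, 60]

f_stat, p = stats.f_oneway(dept_a, dept_b, dept_c)
print(f"F = {f_stat:.2f}, p = {p:.4f}")
```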

  • f-test

f-test is a statistical test that is used to compare the variances of two or more groups. The f-test is a parametric test, which means that it makes assumptions about the distribution of the data.

Example: A company is trying to determine if there is a difference in the variance of customer satisfaction ratings between two different groups of customers. They use the f-test to compare the variances of customer satisfaction ratings between the two groups. The f-test finds that there is a significant difference in the variance of customer satisfaction ratings between the two groups.

  • Variance

Variance is a measure of how spread out a data set is. It is calculated by averaging the squared differences between each data point and the mean: Var = sum of (x - mean)^2 / n (use n - 1 in the denominator for the sample variance).

  • Repeated measures design

Repeated measures design is a type of experimental design used to study the effect of a treatment over time. In a repeated measures design, each participant is measured multiple times, and the results are compared to see whether the treatment has had an effect.

  • Multivariate Analysis

Multivariate Analysis is a statistical technique that is used to analyze data that has multiple variables. Multivariate analysis can be used to identify relationships between variables, and to predict the value of one variable based on the value of other variables.

  • KS test

KS test (Kolmogorov-Smirnov test): The KS test is a non-parametric statistical test used to compare two data sets to see whether they come from the same distribution, or to compare one data set against a reference distribution. Because it is non-parametric, it makes no assumptions about the distribution of the data.
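A minimal sketch with scipy, comparing two simulated samples drawn from deliberately different distributions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample_a = rng.normal(0, 1, size=300)
sample_b = rng.normal(0.5, 1, size=300)  # shifted: a different distribution

stat, p = stats.ks_2samp(sample_a, sample_b)
print(f"KS statistic = {stat:.3f}, p = {p:.4g}")
```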

  • Poisson Distribution

Poisson Distribution is a discrete probability distribution that describes the number of events that occur in a given interval of time or space. The Poisson distribution is often used to model the number of customers that arrive at a store in a given hour, or the number of defects that occur in a manufactured product.
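For instance, with scipy (the arrival rate of 12 customers per hour is an invented assumption):

```python
from scipy import stats

# Hypothetical store: an average of 12 customers arrive per hour
rate = 12
print(stats.poisson.pmf(15, mu=rate))  # P(exactly 15 arrivals in an hour)
print(stats.poisson.sf(19, mu=rate))   # P(20 or more arrivals in an hour)
```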

  • MLE

MLE (Maximum Likelihood Estimation) is a statistical method that is used to estimate the parameters of a statistical model. MLE is a method of finding the parameters of a model that maximize the likelihood of the observed data.
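A sketch using scipy's built-in fitting on simulated salary data; for a normal model, the MLE of the mean is simply the sample mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
data = rng.normal(loc=50, scale=10, size=500)  # simulated salaries ($1000s)

# norm.fit returns the maximum likelihood estimates of mean and std dev
mu_hat, sigma_hat = stats.norm.fit(data)
print(f"MLE: mu = {mu_hat:.2f}, sigma = {sigma_hat:.2f}")
```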

  • MAP

MAP (Maximum A Posteriori) estimation is a statistical method used to estimate the parameters of a statistical model. MAP finds the parameters that maximize the posterior probability of the parameters given the data, i.e., the likelihood multiplied by a prior; with a uniform prior, MAP reduces to MLE.

  • MGF

MGF (Moment Generating Function) is a function used to characterize the probability distribution of a random variable: M_X(t) = E[e^(tX)]. Differentiating the MGF at t = 0 yields the moments of the distribution, so it can be used to compute the mean, variance, and higher moments.

  • Non-parametric Tests

Non-parametric Tests are statistical tests that do not make any assumptions about the distribution of the data. Non-parametric tests are often used when the data is not normally distributed.

  • Newton’s method

Newton’s method is a numerical method that is used to find the roots of a function. It is an iterative method that starts with an initial guess x0 and repeatedly applies the update x_{n+1} = x_n - f(x_n) / f'(x_n) until the change between successive guesses is small enough.
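A compact sketch of the method in Python, applied to finding the square root of 2:

```python
def newton(f, f_prime, x0, tol=1e-10, max_iter=100):
    """Find a root of f via Newton's method: x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("did not converge")

# Example: the square root of 2 is a root of f(x) = x^2 - 2
root = newton(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root)  # approximately 1.41421356...
```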

  • Combinatorics

Combinatorics is the study of counting. It is used to solve problems that involve counting the number of possible arrangements, combinations, or permutations of objects.

  • ARIMA

ARIMA (Autoregressive Integrated Moving Average) is a statistical model that is used to model time series data. ARIMA models are used to predict future values of a time series, and to identify patterns in time series data.

Here is an example of how ARIMA can be used:

A company is trying to predict the number of sales they will make next month. They have data on the number of sales they have made for the past 12 months. They use ARIMA to model the time series data.

The ARIMA model has three parameters: p, d, and q.

p is the number of autoregressive terms. An autoregressive term is a term that depends on the previous values of the time series.

d is the number of differencing terms. Differencing is a technique that is used to make the time series stationary. A stationary time series is a time series that does not have a trend or a seasonal component.

q is the number of moving average terms. A moving average term is a term that depends on the previous errors of the model.

The ARIMA model is fit to the data using a technique called maximum likelihood estimation. Maximum likelihood estimation is a method of finding the parameters of a model that maximize the likelihood of the observed data.

Once the ARIMA model is fit, the company can use it to predict the number of sales they will make next month.

Here are the steps involved in using ARIMA to predict the number of sales:

Collect the data on the number of sales for the past 12 months.

Plot the data to see if it is stationary. If the data is not stationary, use differencing to make it stationary.

Choose the values of p, d, and q. There are no hard and fast rules; in practice, ACF and PACF plots, or information criteria such as AIC, guide the choice. A simple starting point is p = 1, d = 1, and q = 0.

Fit the ARIMA model to the data using maximum likelihood estimation.

Use the ARIMA model to predict the number of sales for the next month.
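A minimal sketch of these steps using statsmodels (the 12 monthly sales figures below are invented):

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly sales for the past 12 months (invented numbers)
sales = pd.Series(
    [120, 125, 130, 128, 135, 140, 138, 145, 150, 148, 155, 160],
    index=pd.date_range("2022-08-01", periods=12, freq="MS"),
)

# ARIMA(p=1, d=1, q=0): one autoregressive term on the differenced series
model = ARIMA(sales, order=(1, 1, 0))
fit = model.fit()  # maximum likelihood estimation
print(fit.forecast(steps=1))  # predicted sales for next month
```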

  • Hypothesis testing

Hypothesis testing is a statistical method that is used to test the validity of a hypothesis. Hypothesis testing involves making a statement about the value of a population parameter, and then collecting data to see if the data supports the statement.

Here is an example of a hypothesis:

Hypothesis: A new marketing campaign will increase sales by 10%.

To test this hypothesis, a company could conduct an experiment. The company could randomly divide its customers into two groups: the first group would be exposed to the new marketing campaign, and the second group would not. The company could then track sales for both groups over a period of time and compare them with a statistical test. If the exposed group's sales were significantly higher, with an increase of around 10%, the data would support the hypothesis.

Here are the steps involved in testing a hypothesis:

State the hypothesis. The hypothesis is a statement about the relationship between two or more variables.

Collect data. The data should be collected in a way that is unbiased and representative of the population.

Analyze the data. The data should be analyzed using a statistical test that is appropriate for the type of data and the hypothesis being tested.

Make a decision. The decision is about whether or not the hypothesis is supported by the data.
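A minimal end-to-end sketch of these four steps in Python, using simulated sales data (all numbers invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Steps 1-2: state the hypothesis and collect (here: simulate) the data.
# Null hypothesis: the campaign has no effect on sales.
control = rng.normal(100, 20, size=100)  # not exposed to the campaign
treated = rng.normal(110, 20, size=100)  # exposed to the campaign

# Step 3: analyze with an appropriate test (one-sided two-sample t-test).
t, p = stats.ttest_ind(treated, control, alternative="greater")

# Step 4: decide at the 0.05 significance level.
print(f"t = {t:.2f}, p = {p:.4f}")
print("reject H0" if p < 0.05 else "fail to reject H0")
```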

Here are some of the types of errors that can be made in hypothesis testing:

Type I error: A type I error is rejecting the null hypothesis when it is actually true.

Type II error: A type II error is failing to reject the null hypothesis when it is actually false.

The level of significance is the probability of making a type I error. The level of significance is typically set at 0.05, which means that there is a 5% chance of making a type I error.

The power of a test is the probability of rejecting the null hypothesis when it is actually false. The power of a test can be increased by increasing the sample size.

Here are some of the advantages of using hypothesis testing:

Hypothesis testing can be used to determine if there is a statistically significant relationship between two or more variables.

Hypothesis testing can be used to make decisions about whether or not to reject the null hypothesis.

Here are some of the disadvantages of using hypothesis testing:

Hypothesis testing can be complex and time-consuming.

Hypothesis testing can only reject or fail to reject the null hypothesis; failing to reject does not prove that the null hypothesis is true.


Hope this will help you.

I am a Software Architect | AI, Data Science, IoT, Cloud ⌨️ 👨🏽 💻

Love to learn and share. Thank you.
