# Statistics in Data Science [Part 1]

It is said that Statistics is the semantics of Data Science. So the first question that immediately comes to mind is “*What is Statistics?*”.

## Statistics is a set of mathematical methods and tools that enable us to answer crucial questions about data.

The next question is probably “*What kind of statistics should I learn for data science in particular?*”. Rather than cherry-picking topics, it is worth learning statistics as a whole, because it is the craft of uncovering the stories hidden inside datasets.

Statistics is divided into **two categories**, namely:

**1. Descriptive Statistics** — The data is summarized through the given observations. It is a way to describe, organize and represent a collection of data using graphs, tables and summary measures.

- Under Descriptive Statistics we have **measures of central tendency** (mean, median and mode), **measures of dispersion** (variance, standard deviation, interquartile range and skewness) and **correlation**.

**2. Inferential Statistics** — This interprets the meaning of Descriptive Statistics. This method uses information collected from a sample to make decisions, predictions or inferences about a population.

- Hypothesis testing, z-test, t-test and confidence intervals are all found under inferential statistics.

If you are a novice data scientist (somewhat like me 😄) you are probably thinking, “*I did not see any mention of regression, is that not part of statistics?*”. Linear regression, multiple linear regression and forecasting fall into a bracket known as statistical analysis or statistical modeling. These methods are seen more as standard analysis tools for inferential statistics than as a branch of Statistics.

This is going to be a series of posts delving into each of the categories or branches of Statistics. In this part I will walk you through how to carry out Descriptive Statistics in Python. **Please note that you can use any programming language you are familiar with, or one that you are interested in learning; Python is just my preference.**

I will take the CO2 Emissions by Vehicles dataset, which you can download here, as an example to demonstrate the notions and approaches I have grasped. This dataset details how the CO2 emissions of a vehicle vary with respect to its different features.

Step 1 — Importing and Processing Data Set

Import the necessary libraries (pandas, numpy, matplotlib and seaborn) and load the dataset.

- The dataset is comma separated.
- The **.head()** function from the pandas library displays the first five data entries; the **.tail()** function displays the last five.
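A minimal sketch of this step, using a tiny hand-made comma-separated sample in place of the real file (the column names follow the dataset, the rows are illustrative only; with the actual CSV you would pass its file path to `pd.read_csv` instead):

```python
import io
import pandas as pd

# Tiny comma-separated sample standing in for the downloaded CSV.
sample_csv = io.StringIO(
    "Make,Model,Cylinders,CO2 Emissions(g/km)\n"
    "FORD,F-150,6,250\n"
    "HONDA,CIVIC,4,180\n"
    "FORD,FOCUS,4,190\n"
)
df = pd.read_csv(sample_csv)

print(df.head())  # first five rows (here, all three)
print(df.tail())  # last five rows
```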

Step 2 — Exploring the Data Set

The **.shape** attribute returns the total number of rows and columns in the dataset.

- The dataset is made up of 7385 data entries and 12 columns or features.

The **.info()** method tells us the data types associated with our columns and whether the dataset contains null values.

- The dataset comprises integer, float and object values.
- There are no null or missing values in the columns.
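The exploration step above can be sketched as follows, again on a hand-made sample (note that `.shape` is an attribute, not a method call, and `.isnull().sum()` is a common way to count missing values per column):

```python
import io
import pandas as pd

sample_csv = io.StringIO(
    "Make,Cylinders,CO2 Emissions(g/km)\n"
    "FORD,6,250\n"
    "HONDA,4,180\n"
)
df = pd.read_csv(sample_csv)

print(df.shape)           # (rows, columns) — (7385, 12) for the full dataset
df.info()                 # column dtypes and non-null counts
print(df.isnull().sum())  # per-column count of missing values
```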

Step 3 — Descriptive Statistics

The pandas function, **.describe()**, is very convenient for obtaining various summary statistics associated with the columns or features of our dataset.

*i) Measures of Central Tendency*

- The summary statistics displayed are for the numerical features (float and integer data types).
- The **mean** is higher than the **median** (50% quartile) for all features except Cylinders (number of cylinders); the difference is attributed to the distribution of the data.
- The **maximum values** for all features are higher than the 75% quartile, which implies that there are outliers in our dataset.
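A small sketch of the mean-versus-median comparison, using made-up numbers for one column (the column name follows the dataset, the values do not):

```python
import pandas as pd

# Illustrative values only; the 400 acts as a single high outlier.
df = pd.DataFrame({"CO2 Emissions(g/km)": [180, 190, 200, 210, 400]})

summary = df.describe()
print(summary)

# A mean above the median (the 50% row) hints at a right-skewed column:
mean = summary.loc["mean", "CO2 Emissions(g/km)"]
median = summary.loc["50%", "CO2 Emissions(g/km)"]
print(mean > median)  # True here, the mean is pulled up by the 400 outlier
```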

- The **mode** is the most frequent value of a feature in the dataset and can be used with categorical features.
- Ford is the most frequent make, Automatic with Select Shift is the most frequent transmission, and regular gasoline is the most frequent fuel type.
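The mode of categorical columns can be read off with pandas' `.mode()`; a toy sample (illustrative values, column names assumed from the dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "Make": ["FORD", "FORD", "HONDA"],
    "Fuel Type": ["X", "X", "Z"],
})

# .mode() works on object (categorical) columns too; the first row of the
# result holds the most frequent value of each column.
print(df.mode().iloc[0])
```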

*ii) Measures of Dispersion*

- Our **standard deviation** values (from the **.describe()** summary) could be considered small, which means that our values are, on average, close to the mean.
- The **variance** is just the square of the standard deviation, and the same interpretation applies.
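The square relationship between the two measures can be checked directly (the numbers here are a classic illustrative sample, not dataset values; note pandas defaults to the sample estimate `ddof=1`, so `ddof=0` is passed for the population versions):

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

std = s.std(ddof=0)  # population standard deviation
var = s.var(ddof=0)  # population variance
print(std, var)      # the variance is the square of the standard deviation
```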

- In creating the box-plots (to better visualize the **interquartile range**), I saved the features as their own variables for convenience.

- The **Interquartile Range (IQR)** is a measure of statistical dispersion, calculated as the difference between the upper quartile (75th percentile) and the lower quartile (25th percentile).
- It is better visualized with box-plots, which make outliers easy to identify.
- As can be seen on the box-plots, all features contain outliers. These data points differ significantly from the rest.
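A sketch of the IQR and the usual box-plot outlier rule (points beyond 1.5 × IQR from the quartiles), on made-up numbers with one obvious outlier:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])  # 40 is an outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# The standard box-plot convention flags points outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers.
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(iqr, list(outliers))
```

To actually draw the box-plot you could call `s.plot.box()` (with matplotlib installed) on each saved feature variable.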

- **Skewness** is a measure of the symmetry, or lack thereof, of a random variable about its mean.
- The value can be positive, negative or undefined.
- Cylinders and Fuel Consumption in Highway Roads exhibit a highly skewed distribution, whereas the rest of the features exhibit a moderately skewed distribution.
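Pandas exposes this directly as `.skew()`; a toy illustration with made-up numbers (a common rule of thumb reads an absolute skewness between 0.5 and 1 as moderately skewed and above 1 as highly skewed):

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 1, 1, 2, 10])

print(symmetric.skew())     # zero for a perfectly symmetric sample
print(right_tailed.skew())  # positive: the long tail is on the right
```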

*iii) Correlation*

- **Correlation** between variables indicates that as one feature changes in value, the other feature tends to change in a specific direction.
- **Positive correlation coefficients** mean that when the value of one feature increases, the value of the other feature also tends to increase.
- For **negative correlation coefficients**, when the value of one feature increases, the value of the other feature tends to decrease.
- There is a high positive correlation between the carbon dioxide emissions (CO2 Emissions(g/km)) and the rest of the features, except Fuel Consumption Comb(mpg), where there is a high negative correlation.
- Fuel Consumption Comb(mpg) displays a high negative correlation with all the other features.
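A sketch of how these coefficients are computed with `.corr()`, on hand-made numbers (column names assumed from the dataset; the values are illustrative, chosen so that engine size rises with emissions while mpg falls):

```python
import pandas as pd

df = pd.DataFrame({
    "Engine Size(L)": [1.6, 2.0, 3.0, 3.5, 5.0],
    "CO2 Emissions(g/km)": [180, 200, 250, 270, 350],
    "Fuel Consumption Comb (mpg)": [35, 31, 25, 23, 17],
})

corr = df.corr()  # pairwise Pearson correlation of the numeric columns
print(corr.round(2))
# A seaborn heatmap makes the matrix easier to read:
# import seaborn as sns; sns.heatmap(corr, annot=True)
```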

**Conclusion**

I hope I was able to clarify the concept of Descriptive Statistics using Python. Please do try these steps on different datasets and give some feedback in the comments section.

If I failed to capture any other useful information in my approach, do share in the comments. Also feel free to share descriptive statistics done in a different programming language.

Below are some reference posts that I found very insightful; you can utilize them to grow your own knowledge base as well. Thank you for taking the time to read 😀.