How to Display General Numeric Analysis of a DataFrame

Introduction

This article focuses on one method for calculating basic statistics with the Pandas library. The method is ‘describe()’, which is available for both Series and DataFrame objects. It reports basic statistical details such as percentiles, mean, and standard deviation. It can analyze numeric columns and object columns, as well as a DataFrame with mixed data types. By default, however, it summarizes only the numeric columns of a mixed DataFrame. When it is applied to a Series of strings, it returns a different output: count, unique, top, and freq instead of the numeric statistics.
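As a minimal sketch of that difference, the snippet below calls describe() on one numeric Series and one Series of strings. The sample values and the variable names are assumptions made up for illustration; they do not come from the dataset used later in this article.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
numbers = pd.Series([10, 20, 30, 40])
words = pd.Series(["fixed", "variable", "fixed", "fixed"])

# Numeric data: count, mean, std, min, 25%, 50%, 75%, max.
numeric_summary = numbers.describe()

# String data: count, unique, top (most frequent value), freq (its frequency).
text_summary = words.describe()

print(numeric_summary)
print(text_summary)
```

The two results have different row labels, so it helps to check the data type of a column before reading its summary.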

This article shows how to use the Pandas describe() method to produce summary descriptive statistics for a Pandas Series or DataFrame. Executing describe() returns the summary statistics as a new Series or DataFrame. Understanding the data through these high-level summary statistics is important, and it is usually the first step of exploratory data analysis (EDA). That makes it especially helpful for data science work, since it points to the statistics that need further exploration. The Pandas describe() method provides generalized descriptive statistics that summarize the central tendency of the data, the dispersion, and the shape of the dataset’s distribution. It also helps to spot missing data, because count excludes NaN values.
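The point about missing data can be sketched as follows. The income values here are invented for illustration, and subtracting count from the length of the Series is just one simple way to derive the number of missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical income values with one missing entry (NaN).
income = pd.Series([50.0, np.nan, 70.0, 90.0])

summary = income.describe()

# count only includes non-missing values, so the gap to the
# full length is the number of NaN entries.
missing = len(income) - int(summary["count"])

print(summary)
print("missing values:", missing)
```

The other statistics, such as mean, are also calculated from the non-missing values only.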

How to Display General Numeric Analysis of a DataFrame

This part shows an actual execution of the method, whose output can also be adjusted through its parameters. If the DataFrame contains numerical data, the description contains the following information for each column:

count - The number of non-empty (non-null) values.
mean - The average (mean) value.
std - The standard deviation.
min - The minimum value.
25% - The 25th percentile.
50% - The 50th percentile (the median).
75% - The 75th percentile.
max - The maximum value.
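To make the meaning of each row concrete, the sketch below rebuilds the eight statistics for one column by hand and compares them with the output of describe(). The small DataFrame is a hypothetical example; only the column names are borrowed from the dataset used later in this article.

```python
import pandas as pd

# Hypothetical loan data for illustration only.
loans = pd.DataFrame({
    "loan_amount": [100.0, 200.0, 300.0, 400.0, 500.0],
    "loan_term": [120.0, 240.0, 360.0, 360.0, 360.0],
})

summary = loans.describe()

# The same statistics for one column, computed one by one.
col = loans["loan_amount"]
manual = pd.Series({
    "count": col.count(),        # non-null values
    "mean": col.mean(),
    "std": col.std(),            # sample standard deviation (ddof=1)
    "min": col.min(),
    "25%": col.quantile(0.25),
    "50%": col.quantile(0.50),   # the median
    "75%": col.quantile(0.75),
    "max": col.max(),
})

print(summary)
print(manual)
```

Both outputs should show the same numbers for loan_amount, which suggests that describe() is a convenient shortcut for these individual calls.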

Below is an example for a DataFrame object, with the steps leading up to the execution of the method:

  1. First, open Command Prompt, since the commands for describing the DataFrame run from there, as follows:

    Microsoft Windows [Version 10.0.22000.856]
    (c) Microsoft Corporation. All rights reserved.
    
    C:\Users\Personal>
    
    
  2. After that, start the Python command console. Make sure that the ‘python’ tool exists on the device. Read the article ‘How to Install Python in Microsoft Windows‘ as a reference for installing Python. There is also a similar article, ‘How to Install Python in Microsoft Windows 11‘, as an additional reference.

    Microsoft Windows [Version 10.0.22000.856]
    (c) Microsoft Corporation. All rights reserved.
    
    C:\Users\Personal>python
    Python 3.10.5 (tags/v3.10.5:f377153, Jun 6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>>
    
  3. Next, import the Pandas library by executing the following source code:

    Microsoft Windows [Version 10.0.22000.856]
    (c) Microsoft Corporation. All rights reserved.
    
    C:\Users\Personal>python
    Python 3.10.5 (tags/v3.10.5:f377153, Jun 6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pandas as pd
    

    Before importing the Pandas library, make sure that it is installed. Read the article ‘How to Use Pandas‘ for further information about how to install and use the Pandas library.

  4. After that, read a CSV file to load its data into a DataFrame object variable as follows:

    Microsoft Windows [Version 10.0.22000.856]
    (c) Microsoft Corporation. All rights reserved.
    
    C:\Users\Personal>python
    Python 3.10.5 (tags/v3.10.5:f377153, Jun 6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pandas as pd
    >>> df = pd.read_csv("mortgage-testing-data.csv")
    

    The file used in the above source code is available at this link for reference.

  5. Before executing the ‘describe()’ method, check the available columns and their data types in the DataFrame by executing the following command:

    Microsoft Windows [Version 10.0.22000.856]
    (c) Microsoft Corporation. All rights reserved.
    
    C:\Users\Personal>python
    Python 3.10.5 (tags/v3.10.5:f377153, Jun 6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pandas as pd
    >>> df = pd.read_csv("mortgage-testing-data.csv")
    >>> df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1000 entries, 0 to 999
    Data columns (total 8 columns):
    # Column Non-Null Count Dtype
    --- ------ -------------- -----
    0 conforming_loan_limit 1000 non-null object
    1 derived_sex 1000 non-null object
    2 action_taken 1000 non-null int64
    3 loan_amount 1000 non-null float64
    4 loan_term 1000 non-null float64
    5 property_value 1000 non-null float64
    6 income 1000 non-null float64
    7 debt_to_income_ratio 1000 non-null object
    dtypes: float64(4), int64(1), object(3)
    memory usage: 62.6+ KB
    >>>
    
  6. After checking the available columns and their respective data types, execute the ‘describe()’ method to display a basic summary of statistical details as follows:

    >>> df.describe()
          action_taken loan_amount  loan_term   property_value income
    count 1000.0000    1.000000e+03 1000.000000 1.000000e+03   1000.000000
    mean     2.0000    2.598000e+05 328.034000  3.292000e+05    162.068000
    std      1.0005    2.485964e+05 60.285132   3.374051e+05   1542.864534
    min      1.0000    5.000000e+03 12.000000   5.000000e+03    -15.000000
    25%      1.0000    1.050000e+05 300.000000  1.250000e+05     43.750000
    50%      2.0000    2.050000e+05 360.000000  2.400000e+05     74.000000
    75%      3.0000    3.250000e+05 360.000000  4.050000e+05    122.250000
    max      3.0000    2.765000e+06 372.000000  3.455000e+06  47444.000000
    >>>

    The above command execution displays several basic statistical details. It only contains the statistics for columns with a numeric data type, so the three object columns listed by df.info() do not appear. The first row, count, is the number of non-empty elements. Next is mean, which describes the central tendency. After that is std, which stands for standard deviation and describes the dispersion of each column. The min and max rows show the smallest and the largest value in each column. Finally, 25%, 50%, and 75% are percentiles: the values below which 25%, 50%, and 75% of the elements in a column fall. The 50% row is the median. For example, half of the loan_amount values are at or below 2.050000e+05.
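The steps above can be sketched as one short script. Since the CSV file itself is not reproduced here, the sketch reads a tiny in-memory stand-in: the column names are borrowed from the df.info() output above, but every value is invented for illustration. The include and percentiles arguments are documented parameters of describe(), shown here only as one possible way to adjust the output.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file. The column names follow
# the df.info() output above; the values are made up.
csv_text = """derived_sex,action_taken,loan_amount,income
Male,1,105000.0,44.0
Female,3,205000.0,74.0
Joint,1,325000.0,122.0
Male,2,405000.0,-15.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Default call: numeric columns only.
numeric_summary = df.describe()

# include="all" also summarizes the non-numeric columns.
full_summary = df.describe(include="all")

# Custom percentiles replace the default 25% and 75% rows.
custom_summary = df.describe(percentiles=[0.1, 0.9])

print(numeric_summary)
print(full_summary)
print(custom_summary)
```

In the full summary, statistics that do not apply to a column are shown as NaN. For example, derived_sex has no mean, and loan_amount has no unique or top value.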
