Before creating any model, the first thing we generally do is create the feature matrix and the target matrix. Let’s see how to do that.
Also refer to this post to see how we implement an algorithm after selecting the feature and target matrix.
Before that, let’s understand our dataset, which was taken from Kaggle:
- A new coronavirus designated 2019-nCoV was first identified in Wuhan, the capital of China’s Hubei province
- People developed pneumonia without a clear cause, and existing vaccines or treatments were not effective against it
- The virus has shown evidence of human-to-human transmission
- Transmission rate (rate of infection) appeared to escalate in mid-January 2020
- As of 30 January 2020, approximately 8,243 cases have been confirmed
Each row contains the report from one region/location for one day.
The Confirmed, Deaths and Recovered columns hold the number of cases reported for that country/region.
Now that we know what the dataset is about, let’s dive straight in and see what it looks like:
1. Load the dataset
import pandas as pd
df = pd.read_csv("covid_19_clean_complete.csv")
df.head()
2. Extracting all the rows whose Country/Region column value is Afghanistan
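A minimal sketch of this step using a boolean mask. The small DataFrame below is a made-up stand-in that only mirrors the dataset's column names, and it assumes the country is spelled "Afghanistan" in the file; with the real data, `df` would come from `pd.read_csv` as in step 1.

```python
import pandas as pd

# Made-up stand-in rows; only the column names mirror the dataset
df = pd.DataFrame({
    "Province/State": [None, None, "Province A", None],
    "Country/Region": ["Afghanistan", "Country B", "Country C", "Afghanistan"],
    "Lat": [33.0, 41.0, 30.0, 33.0],
    "Long": [65.0, 20.0, 112.0, 65.0],
    "Date": ["1/22/20", "1/22/20", "1/22/20", "1/23/20"],
    "Confirmed": [0, 0, 10, 1],
    "Deaths": [0, 0, 1, 0],
    "Recovered": [0, 0, 2, 0],
})

# The comparison yields a boolean Series; indexing with it keeps the True rows
afghanistan = df[df["Country/Region"] == "Afghanistan"]
print(afghanistan)
```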
3. Displaying the column names
df.columns
O/P: Index(['Province/State', 'Country/Region', 'Lat', 'Long', 'Date', 'Confirmed', 'Deaths', 'Recovered'], dtype='object')
4. Displaying the total number of rows in the dataset
df.index
O/P: RangeIndex(start=0, stop=21484, step=1)
5. Setting the index of the dataframe to ‘Country/Region’
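One way this step could look, sketched on a made-up stand-in DataFrame (only the column names mirror the dataset). `set_index` returns a new DataFrame by default, so the result is stored under the illustrative name `df_indexed`; that name is an assumption, not taken from the original post.

```python
import pandas as pd

# Made-up stand-in rows; only the column names mirror the dataset
df = pd.DataFrame({
    "Province/State": [None, "Province A", None],
    "Country/Region": ["Country A", "Country B", "Country C"],
    "Lat": [33.0, 30.0, 41.0],
    "Long": [65.0, 112.0, 20.0],
    "Date": ["1/22/20", "1/22/20", "1/22/20"],
    "Confirmed": [0, 10, 1],
    "Deaths": [0, 1, 0],
    "Recovered": [0, 2, 0],
})

# Move the 'Country/Region' column into the row index
df_indexed = df.set_index("Country/Region")
print(df_indexed)
```

After this call, 'Country/Region' is no longer a regular column, so rows can be looked up by country with `df_indexed.loc[...]`.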
6. Displaying the rows of the dataframe indexed by ‘Country/Region’ where Province/State is not null
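A possible sketch of this filter, again on a made-up stand-in DataFrame. The names `df_indexed` and `with_province` are illustrative assumptions.

```python
import pandas as pd

# Made-up stand-in rows; only the column names mirror the dataset
df = pd.DataFrame({
    "Province/State": [None, "Province A", None, "Province B"],
    "Country/Region": ["Country A", "Country B", "Country C", "Country D"],
    "Lat": [33.0, 30.0, 41.0, -35.0],
    "Long": [65.0, 112.0, 20.0, 149.0],
    "Confirmed": [0, 10, 1, 3],
})

df_indexed = df.set_index("Country/Region")

# notnull() is True wherever a province/state is present
with_province = df_indexed[df_indexed["Province/State"].notnull()]
print(with_province)
```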
7. Resetting the index
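A minimal sketch of resetting the index, assuming the dataframe was previously indexed by 'Country/Region'. The data and the variable names are made up for illustration.

```python
import pandas as pd

# Made-up stand-in rows; only the column names mirror the dataset
df = pd.DataFrame({
    "Province/State": [None, "Province A", None],
    "Country/Region": ["Country A", "Country B", "Country C"],
    "Lat": [33.0, 30.0, 41.0],
    "Long": [65.0, 112.0, 20.0],
})

df_indexed = df.set_index("Country/Region")

# reset_index() restores the default integer index and turns
# 'Country/Region' back into a regular column
df_reset = df_indexed.reset_index()
print(df_reset)
```

Note that the restored column is placed first, so the column order can differ from the original dataframe.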
8. Retrieving the first 5 rows and first 3 columns
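This kind of positional selection is commonly done with `iloc`. The sketch below assumes that approach and uses a made-up stand-in DataFrame whose column order mirrors the dataset.

```python
import pandas as pd

# Made-up stand-in rows; only the column names mirror the dataset
df = pd.DataFrame({
    "Province/State": [None, None, "Province A", None, "Province B", None],
    "Country/Region": ["Country A", "Country B", "Country C",
                       "Country D", "Country C", "Country E"],
    "Lat": [33.0, 41.0, 30.0, 28.0, 31.0, -35.0],
    "Long": [65.0, 20.0, 112.0, 1.0, 113.0, 149.0],
    "Confirmed": [0, 1, 5, 0, 2, 3],
})

# iloc selects by position: rows 0-4 and columns 0-2 (end positions are exclusive)
first_rows = df.iloc[:5, :3]
print(first_rows)
```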
9. Retrieving the first 20 rows of Country/Region, Province/State, Lat, Long
df1 = df.loc[0:19,['Country/Region','Province/State','Lat','Long']]  # loc includes the end label, so 0:19 gives 20 rows
df1
10. Retrieving the rows from the new dataframe df1 where Province/State is not null
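A sketch of this filter on a generated stand-in DataFrame. The 25 rows are made up, with a province set on every fifth row, and the name `df1_with_province` is an illustrative assumption.

```python
import pandas as pd

# Generated stand-in data: 25 made-up rows, a province on every fifth row
n = 25
df = pd.DataFrame({
    "Country/Region": ["Country %d" % i for i in range(n)],
    "Province/State": ["Province %d" % i if i % 5 == 0 else None for i in range(n)],
    "Lat": [float(i) for i in range(n)],
    "Long": [float(-i) for i in range(n)],
})

# Label-based slicing with loc includes the end label, so 0:19 selects 20 rows
df1 = df.loc[0:19, ["Country/Region", "Province/State", "Lat", "Long"]]

# Keep only the rows of df1 where a province/state is present
df1_with_province = df1[df1["Province/State"].notnull()]
print(df1_with_province)
```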
11. Creating the variable X to store the independent features by dropping the dependent feature (Recovered)
X = df.drop('Recovered',axis=1)
X.head()
12. Printing the shape of X
X.shape
O/P: (21484, 7)
13. Creating the variable y to store the label/dependent variable
y = df['Recovered']
y.head()
14. Printing the shape of variable y
y.shape
O/P: (21484,)
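The two shapes can be checked against each other: X and y must have the same number of rows, and the target column must not remain in X. A small sketch of that sanity check, on a made-up stand-in DataFrame rather than the real file:

```python
import pandas as pd

# Made-up stand-in rows; only the column names mirror the dataset
df = pd.DataFrame({
    "Province/State": [None, "Province A", None],
    "Country/Region": ["Country A", "Country B", "Country C"],
    "Lat": [33.0, 30.0, 41.0],
    "Long": [65.0, 112.0, 20.0],
    "Date": ["1/22/20", "1/22/20", "1/22/20"],
    "Confirmed": [0, 10, 1],
    "Deaths": [0, 1, 0],
    "Recovered": [0, 2, 0],
})

X = df.drop("Recovered", axis=1)  # feature matrix: every column except the target
y = df["Recovered"]               # target vector: a one-dimensional Series

# One target value per feature row, and no target column left inside X
same_rows = X.shape[0] == y.shape[0]
target_removed = "Recovered" not in X.columns
print(X.shape, y.shape, same_rows, target_removed)
```

X is two-dimensional (rows by features) while y is one-dimensional, which is why the shape of y has a single entry.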