During the TensorFlow tutorial, you will use the adult dataset. It is often used with classification task. It is available in this URL https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
The data is stored in a CSV format. This dataset includes eights categorical variables:
This dataset includes 8 categorical variables:
And, 6 continuous variables:
In this tutorial, you will learn:
To import a CSV dataset, you can use the object pd.read_csv(). The basic argument inside is:
Pandas read_csv() Syntax
- filepath_or_buffer: Path or URL with the data
- sep=’, ‘: Define the delimiter to use
- `names=None`: Name the columns. If the dataset has ten columns, you need to pass ten names
- `index_col=None`: If yes, the first column is used as a row index
- `skipinitialspace=False`: Skip spaces after delimiter.
For more information about read_csv(), please check the official documentation:
Pandas read_csv() Example
## Import csv import pandas as pd ## Define path data COLUMNS = ['age','workclass', 'fnlwgt', 'education', 'education_num', 'marital', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_week', 'native_country', 'label'] PATH = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" df_train = pd.read_csv(PATH, skipinitialspace=True, names = COLUMNS, index_col=False) df_train.shape
An easy way to see the data is to use the groupby method. This method can help you to summarize the data by group. Below is a list of methods available with groupby() method:
- count: count
- min: min
- max: max
- mean: mean
- median: median
- standard deviation: sdt
Inside groupby(), you can use the column you want to apply the method.
Let’s have a look at a single grouping with the adult dataset. You will get the mean of all the continuous variables by type of revenue, i.e., above 50k or below 50k:
You can get the minimum of age by type of household:
df_train.groupby(['label'])['age'].min() label <=50K 17 >50K 19 Name: age, dtype: int64
You can also group by multiple columns. For instance, you can get the maximum capital gain according to the household type and marital status.
df_train.groupby(['label', 'marital'])['capital_gain'].max() label marital <=50K Divorced 34095 Married-AF-spouse 2653 Married-civ-spouse 41310 Married-spouse-absent 6849 Never-married 34095 Separated 7443 Widowed 6849 >50K Divorced 99999 Married-AF-spouse 7298 Married-civ-spouse 99999 Married-spouse-absent 99999 Never-married 99999 Separated 99999 Widowed 99999 Name: capital_gain, dtype: int64
You can create a plot following groupby. One way to do it is to use a plot after the grouping.
To create a more excellent plot, you will use unstack() after mean() so that you have the same multilevel index, or you join the values by revenue lower than 50k and above 50k. In this case, the plot will have two groups instead of 14 (2*7).
If you use Jupyter Notebook, make sure to add %matplotlib inline, otherwise, no plot will be displayed
%matplotlib inline df_plot = df_train.groupby(['label', 'marital'])['capital_gain'].mean().unstack() df_plot
- To import a CSV dataset in Pandas, you can use the object pd.read_csv().
- The groupby() method can help you to summarize the data by group.
- You can also group by multiple columns. For instance, you can get the maximum capital gain according to the household type and marital status.