Tableau
Introduction to Tableau Desktop Software: Workspace & Navigation
Tableau Desktop Workspace In the start screen, go to File > New to open a Tableau Workspace The...
In this tutorial, you will learn:
During the TensorFlow tutorial, you will use the adult dataset. It is often used with classification task. It is available in this URL https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
The data is stored in a CSV format. This dataset includes eights categorical variables:
This dataset includes eights categorical variables:
moreover, six continuous variables:
hours_week
To import a CSV dataset, you can use the object pd.read_csv(). The basic argument inside is:
Syntax:
pandas.read_csv(filepath_or_buffer,sep=', ',`names=None`,`index_col=None`,`skipinitialspace=False`)
For more information about readcsv(), please check the official documentation
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html.
Consider the following Example
## Import csv import pandas as pd ## Define path data COLUMNS = ['age','workclass', 'fnlwgt', 'education', 'education_num', 'marital', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_week', 'native_country', 'label'] PATH = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" df_train = pd.read_csv(PATH, skipinitialspace=True, names = COLUMNS, index_col=False) df_train.shape
Output:
(32561, 15)
An easy way to see the data is to use the groupby method. This method can help you to summarize the data by group. Below is a list of methods available with groupby:
Inside groupby(), you can use the column you want to apply the method.
Let's have a look at a single grouping with the adult dataset. You will get the mean of all the continuous variables by type of revenue, i.e., above 50k or below 50k
df_train.groupby(['label']).mean()
age | fnlwgt | education_num | capital_gain | capital_loss | hours_week | |
label | ||||||
<=50K | 36.783738 | 190340.86517 | 9.595065 | 148.752468 | 53.142921 | 38.840210 |
>50K | 44.249841 | 188005.00000 | 11.611657 | 4006.142456 | 195.001530 | 45.473026 |
You can get the minimum of age by type of household
df_train.groupby(['label'])['age'].min()
label <=50K 17 >50K 19 Name: age, dtype: int64
You can also group by multiple columns. For instance, you can get the maximum capital gain according to the household type and marital status.
df_train.groupby(['label', 'marital'])['capital_gain'].max() label marital <=50K Divorced 34095 Married-AF-spouse 2653 Married-civ-spouse 41310 Married-spouse-absent 6849 Never-married 34095 Separated 7443 Widowed 6849 >50K Divorced 99999 Married-AF-spouse 7298 Married-civ-spouse 99999 Married-spouse-absent 99999 Never-married 99999 Separated 99999 Widowed 99999 Name: capital_gain, dtype: int64
You can create a plot following groupby. One way to do it is to use a plot after the grouping.
To create a more excellent plot, you will use unstack() after mean() so that you have the same multilevel index, or you join the values by revenue lower than 50k and above 50k. In this case, the plot will have two groups instead of 14 (2*7).
If you use Jupyter Notebook, make sure to add %matplotlib inline, otherwise, no plot will be displayed
%matplotlib inline df_plot = df_train.groupby(['label', 'marital'])['capital_gain'].mean().unstack() df_plot
Tableau Desktop Workspace In the start screen, go to File > New to open a Tableau Workspace The...
What is Tableau? Tableau is a powerful and fastest-growing data visualization tool used in the...
What is ETL? ETL is a process that extracts the data from different source systems, then...
Security Information and Event Management tool is a software solution that aggregates and analyses activity...
{loadposition top-ads-automation-testing-tools} A Data Warehouse is a collection of software tools...
What is OLTP? OLTP is an operational system that supports transaction-oriented applications in a...