preparing School Data
Project Overview
A school district wants to predict the per-pupil costs of a school based on high-level summary data. This way, they'll have a reasonable estimation of how well a school manages its costs relative to what a model would predict. The objective is to prepare the data for modelling.
Approach
Step 1 - Combine the data: Combine data from various files into one sheet, with one row per School.
Step 2 - Clean the Data: Address duplicate data, missing data, and any other data issues.
Step 3 - Outliers Detection: Search for outliers and determine the best way to address them.
Data Understanding
The data comes in four CSV files that contain data for two different school districts.
DistrictA_Attendance - This file contains average daily attendance, per cent attendance, and pupil-teacher ratio data for the 25 schools in district A.
DistrictA_Finance - This file includes average monthly teacher salary and per-pupil cost data for the 25 schools in district A.
DistrictB_Attendance - This file contains average daily attendance, per cent attendance, and pupil-teacher ratio data for the 21 schools in district B.
DistrictB_Finance - This file includes average monthly teacher salary and per-pupil cost data for the 21 schools in district B.
Step 1 - Combine the data
We'll bring in four input tools, each one bringing in one of the files to import. There are two files for each district: an attendance file and a finance file.
Starting with District A's attendance file, we have one record for each school, and there are three numeric fields, average attendance, per cent attendance, and pupil-teacher ratio.
District A's finance tab is structured differently. It has multiple rows for each School because the numeric fields are stacked on top of each other. So this data will have to be transformed before we can merge it with the attendance data.
District B's two datasets are structured similarly.
Formatting and Blending Data
To build the dataset, we'll have to merge each of these datasets into one.
First, we'll have to transform the finance datasets
Then we'll merge the finance and attendance datasets for each district
And lastly, we'll combine the data for the two districts together.
Transforming with Crosstab
To transform the finance datasets, connect a Crosstab tool to each of the finance input tools. In the configuration window, select School as the group by value, metric as the column headers, and value for the Values for New Columns. This will format the data in the same way the attendance data is formatted.
Join
Next, bring in a join tool and connect both datasets. In the configuration window, select the school variable in each of the datasets to join on. Then choose the variables you want to keep. You can drop one of the school variables, but you should keep the five numeric variables. You'll want to do the same for the other district.
Step 2 - Clean the Data
Now let's check for data issues. Specifically, let's look for duplicates and missing data first.
Visualizing with the Field Summary Tool
The field summary report gives a few helpful reports. Looking at histograms of each variable is an excellent way to do this.
Duplicate Data
We see a potential duplicate because there's one School with two records. Let's take a look at the observations to confirm. All the data is the same for these two records, so it appears to be a duplicate. So let's delete one of the observations.
There are several ways to do this. An easy way is to use the select records tool. The record we want to delete is number 39, so in the configuration window of the select records tool, we can select 1-38 and everything 40 and after.
The record we want to delete is number 39, so in the configuration window of the select records tool, we can select 1-38 and everything 40 and after.
Missing Data
You can see on the field summary report we are missing a few observations for the two attendance variables. We can either delete or impute. Since there are only a few, let's delete the records, but we'll note it since we may come back to run the model with the data imputed or if we end up not using those variables in the model.
An easy way to remove missing data is to use a filter tool. In the configuration window, you can select the ATT variable, and filter out NULL records. To do this, you want to keep all records that are not NULL, so you can use the drop-downs to select ATT Is Not Null or type in the formula !IsNull(ATT).
Union
Next, bring in a union tool, which will stack the data on top of each other.
Step 3 - Outlier Detection
To check for outliers, let's start with some scatterplots to visualize the relationship between each predictor variable and the target variable. Since there are four predictor variables, let's drag in four scatterplot tools and connect them to the cleaned dataset. Then we configure each one with a different predictor variable, attach browse tools, and run the workflow.
Let's start with attendance. The box and whisker plots on either axis use the interquartile range to determine whether a point is an outlier. You can see here there are two dramatic outliers for attendance, so let's address those first, then come back to the others.
For these outliers, it looks like the data may not be adequately recorded. Let's look at the observations by attaching a sort tool and browse tool to the dataset. You can see that almost every observation has a decimal, and these two observations seem to be about 10X the average, so it's highly likely that the data is an error, and we should divide by 10. Normally we would validate with the source, but for now, we'll make the assumption. I'll use a formula tool to filter out these records and create another scatter plot.
ATTENDANCE: There still is one more outlier. For this one, it seems like the data is probably correct, and it's just a larger school. My concern with keeping this observation is that it may skew the data, creating or masking a relationship with PPC. If I knew that none of the schools I was predicting for was going to be that size, I'd delete it. Otherwise, I'd keep it in. Let's plan on building a model with and without this variable.
PERCENT ATTENDANCE: Now, let's look at per cent attendance. There doesn't appear to be outliers.
PUPIL TEACHER RATIO: For Pupil-Teacher Ratio, there are 2 outliers, which are also outliers for PPC. This makes sense since we'd expect the relationship to behave this way. Based on the fitted line, the outliers are in line with the relationship, so we'd leave them in.
TEACHER SALARIES: Two records are outliers, with meager salaries. Just like for PTR, these seem to be in line with the trend and not dramatically different, so it's probably best to keep them in.
How to deal with outliers
As a reminder, let's quickly review how to handle an outlier. There are three main methods:
Delete: When data is erroneous or when the outlier hurts the model's ability to make predictions (perhaps the value is unlikely to appear again, so keeping it in the model will skew all other predictions).
Impute: Also, when data is erroneous, we could use the average or median value in its place.
Leave it: If the data is good data, it may be best to leave the data in and try with and without it to see the difference
Conclusion
We tackled common data wrangling techniques in this project to blend and transform data into a tidy form using Alteryx. This involved drawing in disparate datasets and merging them, identifying duplicate data, missing values and outliers.