Predicting Diamond Prices

Project Overview

A jewellery company wants to put in a bid to purchase a large set of diamonds but is unsure how much to bid. In this project, the results from a regression model are used to recommend how much the company should bid for the diamonds.


 Step 1 – Understand the data: There are two datasets. One contains the data used to build the regression model, and the other has the data for the diamonds the company would like to purchase. Both datasets contain carat, cut, and clarity data for each diamond. 

  • Carat represents the weight of the diamond and is a numerical variable.

  • Cut represents the quality of the cut of the diamond and falls into five categories: fair, good, very good, ideal, and premium.

  • Clarity represents the internal purity of the diamond and falls into eight categories: I1, SI2, SI1, VS2, VS1, VVS2, VVS1.

  • Color represents the colour of the diamond and is rated D through J, with D being the most colourless (and valuable) and J being the most yellow.

Step 2 – Build the regression model 

Step 3 – Calculate the predicted price for diamonds: For each diamond, the values for each of the variables are plugged into the equation. The equation is then solved to get the estimated or predicted diamond price.

Step 4 – Make a recommendation: The predicted price of each diamond is then used to calculate the bid price for the whole set. 


Review the business problem

We have one dataset with 50,000 diamonds with data on cut, clarity, colour, carat weight, and retail price. We'll use it to build a model to predict prices for a batch of 3,000 diamonds. For this batch, we have the same information, minus the price. Ultimately, we want to recommend the bid price for the entire batch, which will be 70% of the sum of the predicted diamond prices.

Input the data

We'll start by bringing in the diamond data with an input tool. Since the data is in a csv, we'll need to change the data types, bring in a select tool, and set the numeric data to double and the ordinal data to an integer.

Now let's take a look at the data to get a feel for it. Let's bring in a scatter plot. There is only one continuous numeric predictor variable, carat, so let's chart carat and price. As expected, price increases with carat weight, but there's a lot of variation among diamonds of the same weight. The additional predictor variables explain some of this variation.

PredDiamondPrice_ScatterPlot.png

Linear regression tool

Now we can add a linear regression tool and set the target variable to price. Remember that the target variable is defined as the variable that we want to predict. In this case, we want to predict price.

Let's start by adding all the predictor variables and look at the results. The first thing I check is the p-value on the predictor variables. We don't want to include any variables that aren't statistically significant. If the p-value is less than .05, we can be 95% confident that there exists a relationship between the predictor and the target variable.

 

Regression Results

For this model, it so happens that all the predictor variables are statistically significant so we can leave them all in. Let's take a look at the adjusted R-squared. It's above 0.9. While a high r-squared is not a guarantee that the model is good, in this case, we have confidence that our model explains a significant portion of the variation in prices.

Notice that for each of the categorical variables (cut, clarity, and colour), there's one value that doesn't have its coefficient. This is because one category represents the baseline case that all other categories are compared against. In this case, the baseline is a diamond with D for colour, fair for cut, and I1 for clarity.

Score the model

Now that we have a model, it's time to apply it to the batch of 3,000 to make a prediction and recommendation. Alteryx makes this simple with a tool called a Score tool.

 

Score Tool

Let's start by bringing in the data set for which prices will be predicted and set the data types correctly. Now, let's bring in the score tool. We attach the linear regression to one side and the new_diamonds dataset to the other. The results of the Score tool will create a field called "Score", which represents the predicted diamond price for each of the diamonds. Now attach a browse tool and a scatterplot tool so we can visualize the data.

PredDiamondPrice_ScatterPlot2.png

You can see the data is a bit tighter than the larger dataset of diamonds that had actual prices. This shows that the model doesn't account for all the variation. Also, we can see that some of the prices are predicted to be negative, which doesn't make sense. To correct this, a minimum price can be set for a diamond. However, I expect some prices to be overestimated and others underestimated. 

 

Summarize the data

Next, we bring in a summarize tool to add all the predicted prices together. This gives us an estimate of the total retail value of the batch of 3,000 diamonds. To develop a bid price, we multiply this bid by 70% since the company targets purchases of diamonds at 70% of retail value.

Summarize Tool

First, attach the Summarize tool to the Score tool. Then you'll see the configuration panel like the one below. To sum the Score field, click on Score in the top panel, then click add and select the Sum function. This will be added to the action panel and will create a field called Sum_Score.

PredDiamondPrice_SummariseConfig.png

 Formula Tool

Next, we'll add a formula tool to multiple the sum_score field by 0.7. To do this:

  • Attach the Formula tool to the Summarize tool and take a look at the configuration panel.

  • For the output field at the top, select sum_score, since it's the field you will use to make a calculation.

  • Either select the variable sum_score from the Existing Fields drop-down or write in [Sum_Score] into the Expression panel.

  • Just like in Excel, write the formula: [Sum_Score] * 0.7. This will take the sum of all the predicted diamond prices and multiply it by 0.7.

  • Attach a browse tool and run the workflow.

PredDiamondPrice_FormulaConfig.png

Bid Recommendation

The output of the formula tool gives an answer of approximately $8,230,695, which is the recommended bid price for the 3000 diamonds.

Previous
Previous

Predicting Catalogue Demand

Next
Next

Simple & Multiple Linear Regression