Fully Charged: Diamond Price Modeling

Princess Cut Model and Methodology

While researching whether it was possible to scrape diamond price database info from BlueNile, I came across this excellent post. A. Marder wrote a nice script to query BlueNile and report the data in CSV format for further analysis.

Diamond Grading Basics

The most common grading parameters for diamonds are the "four C's":
• $p_1$ : Color (how close to colorless)
• $p_2$ : Clarity (lack of physical crystal defects/inclusions)
• $p_3$ : Cut (quality of the resulting refraction/sparkle)
• $p_4$ : Carat (weight of the stone - related to diameter/size)

See the Gemological Institute of America's guide for details on the particular grading parameters.

A side-note, make sure to view a few samples of different colors/clarities before purchasing as you may find there to be very little difference in "lower quality" gems vs. the price.

Data Scraping

I use A. Marder's Python script to scrape price data from BlueNile, one of the main online diamond retailers.

I used the Windows 10 beta Ubuntu bash shell to install and run the script using the following commands:

Installed GIT to allow for easy installation from A. Marder's repository

 sudo apt install git

Installed python-dev for some needed setup dependencies

 sudo apt install python-dev

Needed to do upgrade/update to make sure packages were at their latest version

 sudo apt upgrade

 sudo apt update

Needed to force pip to upgrade virtualenv to mitigate some other errors

 sudo pip install --upgrade virtualenv

Installed iterdata scraping script by A. Marder from GIT repository

 pip install git+https://github.com/amarder/iterdata.git

Lastly I created the following shell script (scriptname.sh) to perform the database scraping:

 echo "BlueNile Scraping Started..."  
 datapipe read_bluenile --pause 10 --min-carat 0.5 --max-carat 1.5 -- shape PR \  
 limit --count 10000 \  
 write_csv --path diamonds.csv

Execute the script with

 ./scriptname.sh

--pause adjusts the time between queries to avoid hammering BlueNile's site
--min-carat and --max-carat bound the weight of returned diamonds
--shape defines the particular cut shape, in this case only "princess" cut diamonds are returned. Round cuts use shape "RD"
limit --count stops scraping after a certain number of results are returned. Note that by default BlueNile will return results least expensive first, so if only a low number of results are returned, you may not accumulate a sample that's representative of the total population

Review the code in the GIT repository for other scraping options, including:

--pause
--shape
--min-price
--max-price
--min-carat
--max-carat
--min-cut
--max-cut
--min-color
--max-color
--min-clarity
--max-clarity
--start-index
--page-size
--country
--language
--currency
--sort-column
--sort-direction

Price Modeling

I used a similar price model to A. Marder, with an exponential dependency on carat weight and a multiplicative factor for each of the other "four C's."

$(O_4+{M_4}{p_4}^{E_4}){M_1}{M_2}{M_3}$
Where
$O_4$ is the carat offset
$M_4$ is the carat multiplier
$E_4$ is the carat exponent
$p_4$ is the actual carat value (e.g. 0.75)
$M_1$ is the color multiplier
$M_2$ is the clarity multiplier
and
$M_3$ is the cut quality multiplier

These parameters were (simultaneously) optimized using Excel's GRG Nonlinear solver (with multistart enabled) to minimize the summed square error of the predicted cost vs. actual. This requires constraining all variables and does take some time to run through the optimization process (<30 minutes on a reasonable computer).

Using the optimized parameters yields a predicted price for any particular combination of the "four C's" by choosing the appropriate parameters from the model fitting results below based on desired parameters and plugging it into the price model.

Model Fitting Results

For my particular set of raw data, I arrived at the following model parameters (truncated from Excel's exact solution for each value):

Model Parameters		Value	Data Samples	Marginal (Upgrade) Ratio
Carat Score	Offset $O_4$	-32.95
	Carat Exponent $E_4$	1.9388
	Carat Weight Score $M_4$	438.03
Color Score $M_1$	D	2.3801	1065
	E	2.0591	2017	15.6%
	F	1.8983	1859	8.5%
	G	1.7152	1540	10.7%
	H	1.4993	997	14.4%
	I	1.3626	787	10.0%
	J	1.1318	642	20.4%
Cut Score* $M_3$	Signature Ideal	7.8167	1123
	Ideal	0.9039	0	764.8%
	Very Good	5.9463	5668	-84.8%
	Good	5.6206	2116	5.8%
Clarity Score $M_2$	FL	1.7116	34
	IF	1.5224	405	12.4%
	VVS1	1.2489	1234	21.9%
	VVS2	1.1386	1185	9.7%
	VS1	1.0787	1747	5.5%
	VS2	0.9952	1832	8.4%
	SI1	0.8895	1707	11.9%
	SI2	0.7658	763	16.1%

*The marginal upgrade ratio for this cut is obviously incorrect for Ideal/Very Good because there are no Ideal cut samples. This is because Princess Cut diamonds do not have a standardized/accepted cut grading system like Rounds.

and the following overall descriptive statistics for the model:

Total Mean Square Error	3.7E+09
Average Error	12.7482
StDev Error	645.065

Model Validation

The below is a histogram of the error in the model-predicted price vs. the actual price for the entire data set. The best-fit normal distribution for the displayed range had a standard deviation of ~360$

The error in the model is roughly normally distributed from Z-scores of -2 to 2 (~95% of the population) as seen by general inspection of the linearity in the above plot but deviates in the number of samples in the long tails (severely under or over-estimated pricing cases).

The "clustering" near the mean in the above the normal distribution overlay with histogram can be explained by looking at the probability plot in the 1-sigma region around the mean below (~68% of the population), which shows a somewhat lower standard deviation (slope) of ~400 (vs ~600 for the larger Z-score region of -2 to 2 seen above).

The residual error versus carat plot below shows no net trend, indicating a good fit of the model. The carat plot also illustrates the popularity of certain "round number" carat values. The density in number of samples clearly trails off between "common" weights.

There is a slight trend in residuals versus actual price (shown below), which indicates that this model trends towards under-estimating as the price of a diamond increases.

There is little overall trend in the error versus clarity (seen below), indicating that parameter to be well modeled. If anything the model tends to slightly under-estimate pricing of lower quality diamonds and over-estimate the pricing of higher quality diamonds with regards to clarity.

There is little overall trend in the error versus color (shown below), indicating that parameter to be well modeled. "H" and "I" color grades tend to be slightly underestimated, perhaps due to their popularity as a "good value" grade. The model tends to slightly overestimate the pricing of the very best grade "D".

Conclusions

Overall the model appears to provide a good general prediction of price, sufficient to give you some confidence in whether a given stone is above or below the prevailing market averages.

There appears to be several hundred dollars variation due to un-modeled factors and market variance ($ $\sigma$ of ~400-600). This translates to a swing around a mean prediction of (approximately) 1000$ ( $\pm \$500$ ) which is suggesting prices seem to vary more drastically than I might have initially expected. Some of this may be a result of no standardized "cut" grading for Princess Cut diamonds.

The only significant un-modeled parameter I can think of in this case is the length/width ratio of the square princess cut (ideally 1:1). There are a number of significant outliers - which could represent very good/bad deals, or could reflect the impact of an un-modeled parameter.

The modeling of diamond price relationship to carat weight by an exponent of ~2 fits the data well $price \propto carat^2$

Very generally, the marginal cost for improving a clarity or color by one grade is +10%.

Availability rapidly diminishes between "common" carat weight values, although the model pricing still appears generally accurate.

The analysis was repeated for round-cut diamonds, see below the break.

Round Cut Model

The above was repeated for round cuts in the same carat range.

Model Parameters		Value	Samples	Marginal (Upgrade) Ratio
Carat Score	Offset $O_4$	-31.723
	Carat Exponent $E_4$	1.74712
	Carat Weight $M_4$	331.436
Color Score $M_1$	D	3.01632	15652
	E	2.70267	15443	11.6%
	F	2.51447	15895	7.5%
	G	2.25572	17492	11.5%
	H	2.00437	14864	12.5%
	I	1.73161	10323	15.8%
	J	1.45516	6241	19.0%
Cut Score $M_3$	Signature Ideal	10	1432
	Ideal	7.73571	66171	29.3%
	Very Good	7.03961	21319	9.9%
	Good	6.45063	6988	9.1%
Clarity Score $M_2$	FL	2.0324	218
	IF	1.59749	6786	27.2%
	VVS1	1.42161	11587	12.4%
	VVS2	1.27298	13835	11.7%
	VS1	1.19254	17597	6.7%
	VS2	1.1007	18229	8.3%
	SI1	0.96349	16473	14.2%
	SI2	0.81737	11185	17.9%

Conclusions

Overall the price of a round cut versus an equivalent princess cut is approximately 25% higher, likely due to demand and the fact that diamond crystals grow closer to the princess cut shape, which means there is less weight removed to create the princess than the round.

Model fit was good although there is a more pronounced deviation from the model from 1 to 1.2 carats where the model tends to underestimate pricing, likely due to demand/popularity of those weights and cut.

Clarity showed a very good fit to the multiplicative model. Color showed a slight trend with the best qualities prices over-estimated by ~100$ (on average) and lower quality prices overestimated by ~100$ (on average).

Clarity, cut, and color all showed approximately +10% marginal cost to a one-step upgrade except at the very lowest and highest grades.

Standard deviation of the error over the whole data set was similar to Princess Cut at ~670$, with a similar best-fit over the main central distribution of 373$.

Fully Charged

Sunday, January 29, 2017

Diamond Price Modeling

Princess Cut Model and Methodology

Diamond Grading Basics

Data Scraping

Price Modeling

Model Fitting Results

Model Validation

Conclusions

Round Cut Model

Conclusions

No comments:

Post a Comment