Sunday, January 29, 2017

Diamond Price Modeling


Princess Cut Model and Methodology


While researching whether it was possible to scrape diamond price database info from BlueNile, I came across this excellent post. A. Marder wrote a nice script to query BlueNile and report the data in CSV format for further analysis.

Diamond Grading Basics

The most common grading parameters for diamonds are the "four C's":
: Color (how close to colorless)
: Clarity (lack of physical crystal defects/inclusions)
: Cut (quality of the resulting refraction/sparkle)
: Carat (weight of the stone - related to diameter/size)

See the Gemological Institute of America's guide for details on the particular grading parameters.

A side-note, make sure to view a few samples of different colors/clarities before purchasing as you may find there to be very little difference in "lower quality" gems vs. the price.

Data Scraping

I use A. Marder's Python script to scrape price data from BlueNile, one of the main online diamond retailers.

I used the Windows 10 beta Ubuntu bash shell to install and run the script using the following commands:

Installed GIT to allow for easy installation from A. Marder's repository
 sudo apt install git  

Installed python-dev for some needed setup dependencies
 sudo apt install python-dev  

Needed to do upgrade/update to make sure packages were at their latest version
 sudo apt upgrade  
 sudo apt update  

Needed to force pip to upgrade virtualenv to mitigate some other errors
 sudo pip install --upgrade virtualenv

Installed iterdata scraping script by A. Marder from GIT repository
 pip install git+https://github.com/amarder/iterdata.git 

Lastly I created the following shell script (scriptname.sh) to perform the database scraping:

 echo "BlueNile Scraping Started..."  
 datapipe read_bluenile --pause 10 --min-carat 0.5 --max-carat 1.5 -- shape PR \  
 limit --count 10000 \  
 write_csv --path diamonds.csv  

Execute the script with 
 ./scriptname.sh  

--pause adjusts the time between queries to avoid hammering BlueNile's site
--min-carat and --max-carat bound the weight of returned diamonds
--shape defines the particular cut shape, in this case only "princess" cut diamonds are returned. Round cuts use shape "RD"
limit --count stops scraping after a certain number of results are returned. Note that by default BlueNile will return results least expensive first, so if only a low number of results are returned, you may not accumulate a sample that's representative of the total population

Review the code in the GIT repository for other scraping options, including:
  • --pause
  • --shape
  • --min-price
  • --max-price
  • --min-carat
  • --max-carat
  • --min-cut
  • --max-cut
  • --min-color
  • --max-color
  • --min-clarity
  • --max-clarity
  • --start-index
  • --page-size
  • --country
  • --language
  • --currency
  • --sort-column
  • --sort-direction

Price Modeling

I used a similar price model to A. Marder, with an exponential dependency on carat weight and a multiplicative factor for each of the other "four C's."


Where
 is the carat offset
 is the carat multiplier
 is the carat exponent
 is the actual carat value (e.g. 0.75)
 is the color multiplier
 is the clarity multiplier
and
 is the cut quality multiplier

These parameters were (simultaneously) optimized using Excel's GRG Nonlinear solver (with multistart enabled) to minimize the summed square error of the predicted cost vs. actual. This requires constraining all variables and does take some time to run through the optimization process (<30 minutes on a reasonable computer).

Using the optimized parameters yields a predicted price for any particular combination of the "four C's" by choosing the appropriate parameters from the model fitting results below based on desired parameters and plugging it into the price model.

Model Fitting Results

For my particular set of raw data, I arrived at the following model parameters (truncated from Excel's exact solution for each value):

Model Parameters
Value
Data Samples
Marginal (Upgrade) Ratio
Carat Score
Offset $O_4$
-32.95


Carat Exponent $E_4$
1.9388


Carat Weight Score $M_4$
438.03


Color Score
D
2.3801
1065

E
2.0591
2017
15.6%
F
1.8983
1859
8.5%
G
1.7152
1540
10.7%
H
1.4993
997
14.4%
I
1.3626
787
10.0%
J
1.1318
642
20.4%
Cut Score*
Signature Ideal
7.8167
1123

Ideal
0.9039
0
764.8%
Very Good
5.9463
5668
-84.8%
Good
5.6206
2116
5.8%
Clarity Score
FL
1.7116
34

IF
1.5224
405
12.4%
VVS1
1.2489
1234
21.9%
VVS2
1.1386
1185
9.7%
VS1
1.0787
1747
5.5%
VS2
0.9952
1832
8.4%
SI1
0.8895
1707
11.9%
SI2
0.7658
763
16.1%

*The marginal upgrade ratio for this cut is obviously incorrect for Ideal/Very Good because there are no Ideal cut samples. This is because Princess Cut diamonds do not have a standardized/accepted cut grading system like Rounds.

and the following overall descriptive statistics for the model:

Total Mean Square Error
3.7E+09
Average Error
12.7482
StDev Error
645.065



Princess-Cut Live Model




Model Validation

The below is a histogram of the error in the model-predicted price vs. the actual price for the entire data set. The best-fit normal distribution for the displayed range had a standard deviation of ~360$





The error in the model is roughly normally distributed from Z-scores of -2 to 2 (~95% of the population) as seen by general inspection of the linearity in the above plot but deviates in the number of samples in the long tails (severely under or over-estimated pricing cases).

The "clustering" near the mean in the above the normal distribution overlay with histogram can be explained by looking at the probability plot in the 1-sigma region around the mean below (~68% of the population), which shows a somewhat lower standard deviation (slope) of ~400 (vs ~600 for the larger Z-score region of -2 to 2 seen above).


The residual error versus carat plot below shows no net trend, indicating a good fit of the model. The carat plot also illustrates the popularity of certain "round number" carat values. The density in number of samples clearly trails off between "common" weights.


There is a slight trend in residuals versus actual price (shown below), which indicates that this model trends towards under-estimating as the price of a diamond increases.


There is little overall trend in the error versus clarity (seen below), indicating that parameter to be well modeled. If anything the model tends to slightly under-estimate pricing of lower quality diamonds and over-estimate the pricing of higher quality diamonds with regards to clarity.


There is little overall trend in the error versus color (shown below), indicating that parameter to be well modeled. "H" and "I" color grades tend to be slightly underestimated, perhaps due to their popularity as a "good value" grade. The model tends to slightly overestimate the pricing of the very best grade "D".


Conclusions


Overall the model appears to provide a good general prediction of price, sufficient to give you some confidence in whether a given stone is above or below the prevailing market averages.

There appears to be several hundred dollars variation due to un-modeled factors and market variance ($  of ~400-600). This translates to a swing around a mean prediction of (approximately) 1000$ () which is suggesting prices seem to vary more drastically than I might have initially expected. Some of this may be a result of no standardized "cut" grading for Princess Cut diamonds.

The only significant un-modeled parameter I can think of in this case is the length/width ratio of the square princess cut (ideally 1:1). There are a number of significant outliers - which could represent very good/bad deals, or could reflect the impact of an un-modeled parameter.

The modeling of diamond price relationship to carat weight by an exponent of ~2 fits the data well 

Very generally, the marginal cost for improving a clarity or color by one grade is +10%.

Availability rapidly diminishes between "common" carat weight values, although the model pricing still appears generally accurate.

The analysis was repeated for round-cut diamonds, see below the break.