Princess Cut Model and Methodology
While researching whether it was possible to scrape diamond price database info from BlueNile, I came across this excellent post. A. Marder wrote a nice script to query BlueNile and report the data in CSV format for further analysis.
Diamond Grading Basics
The most common grading parameters for diamonds are the "four C's":• : Color (how close to colorless)
• : Clarity (lack of physical crystal defects/inclusions)
• : Cut (quality of the resulting refraction/sparkle)
• : Carat (weight of the stone - related to diameter/size)
See the Gemological Institute of America's guide for details on the particular grading parameters.
A side-note, make sure to view a few samples of different colors/clarities before purchasing as you may find there to be very little difference in "lower quality" gems vs. the price.
Data Scraping
I use A. Marder's Python script to scrape price data from BlueNile, one of the main online diamond retailers.
I used the Windows 10 beta Ubuntu bash shell to install and run the script using the following commands:
Installed GIT to allow for easy installation from A. Marder's repository
Installed GIT to allow for easy installation from A. Marder's repository
sudo apt install git
Installed python-dev for some needed setup dependencies
sudo apt install python-dev
Needed to do upgrade/update to make sure packages were at their latest version
sudo apt upgrade
sudo apt update
Needed to force pip to upgrade virtualenv to mitigate some other errors
sudo pip install --upgrade virtualenv
Installed iterdata scraping script by A. Marder from GIT repository
pip install git+https://github.com/amarder/iterdata.git
Lastly I created the following shell script (scriptname.sh) to perform the database scraping:
Execute the script with
--pause adjusts the time between queries to avoid hammering BlueNile's site
--min-carat and --max-carat bound the weight of returned diamonds
--shape defines the particular cut shape, in this case only "princess" cut diamonds are returned. Round cuts use shape "RD"
limit --count stops scraping after a certain number of results are returned. Note that by default BlueNile will return results least expensive first, so if only a low number of results are returned, you may not accumulate a sample that's representative of the total population
echo "BlueNile Scraping Started..."
datapipe read_bluenile --pause 10 --min-carat 0.5 --max-carat 1.5 -- shape PR \
limit --count 10000 \
write_csv --path diamonds.csv
Execute the script with
./scriptname.sh
--pause adjusts the time between queries to avoid hammering BlueNile's site
--min-carat and --max-carat bound the weight of returned diamonds
--shape defines the particular cut shape, in this case only "princess" cut diamonds are returned. Round cuts use shape "RD"
limit --count stops scraping after a certain number of results are returned. Note that by default BlueNile will return results least expensive first, so if only a low number of results are returned, you may not accumulate a sample that's representative of the total population
Review the code in the GIT repository for other scraping options, including:
- --pause
- --shape
- --min-price
- --max-price
- --min-carat
- --max-carat
- --min-cut
- --max-cut
- --min-color
- --max-color
- --min-clarity
- --max-clarity
- --start-index
- --page-size
- --country
- --language
- --currency
- --sort-column
- --sort-direction
Price Modeling
I used a similar price model to A. Marder, with an exponential dependency on carat weight and a multiplicative factor for each of the other "four C's."Where
is the carat offset
is the carat multiplier
is the carat exponent
is the actual carat value (e.g. 0.75)
is the color multiplier
is the clarity multiplier
and
is the cut quality multiplier
These parameters were (simultaneously) optimized using Excel's GRG Nonlinear solver (with multistart enabled) to minimize the summed square error of the predicted cost vs. actual. This requires constraining all variables and does take some time to run through the optimization process (<30 minutes on a reasonable computer).
Using the optimized parameters yields a predicted price for any particular combination of the "four C's" by choosing the appropriate parameters from the model fitting results below based on desired parameters and plugging it into the price model.
Model Fitting Results
For my particular set of raw data, I arrived at the following model parameters (truncated from Excel's exact solution for each value):
Model Parameters
|
Value
|
Data Samples
|
Marginal (Upgrade) Ratio
|
|
Carat Score
|
Offset $O_4$
|
-32.95
|
||
Carat Exponent $E_4$
|
1.9388
|
|||
Carat Weight Score $M_4$
|
438.03
|
|||
Color
Score
|
D
|
2.3801
|
1065
|
|
E
|
2.0591
|
2017
|
15.6%
|
|
F
|
1.8983
|
1859
|
8.5%
|
|
G
|
1.7152
|
1540
|
10.7%
|
|
H
|
1.4993
|
997
|
14.4%
|
|
I
|
1.3626
|
787
|
10.0%
|
|
J
|
1.1318
|
642
|
20.4%
|
|
Cut
Score*
|
Signature Ideal
|
7.8167
|
1123
|
|
Ideal
|
0.9039
|
0
|
764.8%
|
|
Very Good
|
5.9463
|
5668
|
-84.8%
|
|
Good
|
5.6206
|
2116
|
5.8%
|
|
Clarity
Score
|
FL
|
1.7116
|
34
|
|
IF
|
1.5224
|
405
|
12.4%
|
|
VVS1
|
1.2489
|
1234
|
21.9%
|
|
VVS2
|
1.1386
|
1185
|
9.7%
|
|
VS1
|
1.0787
|
1747
|
5.5%
|
|
VS2
|
0.9952
|
1832
|
8.4%
|
|
SI1
|
0.8895
|
1707
|
11.9%
|
|
SI2
|
0.7658
|
763
|
16.1%
|
*The marginal upgrade ratio for this cut is obviously incorrect for Ideal/Very Good because there are no Ideal cut samples. This is because Princess Cut diamonds do not have a standardized/accepted cut grading system like Rounds.
and the following overall descriptive statistics for the model:
Total Mean Square Error
|
3.7E+09
|
Average Error
|
12.7482
|
StDev Error
|
645.065
|
Model Validation
The below is a histogram of the error in the model-predicted price vs. the actual price for the entire data set. The best-fit normal distribution for the displayed range had a standard deviation of ~360$The error in the model is roughly normally distributed from Z-scores of -2 to 2 (~95% of the population) as seen by general inspection of the linearity in the above plot but deviates in the number of samples in the long tails (severely under or over-estimated pricing cases).
The "clustering" near the mean in the above the normal distribution overlay with histogram can be explained by looking at the probability plot in the 1-sigma region around the mean below (~68% of the population), which shows a somewhat lower standard deviation (slope) of ~400 (vs ~600 for the larger Z-score region of -2 to 2 seen above).
The residual error versus carat plot below shows no net trend, indicating a good fit of the model. The carat plot also illustrates the popularity of certain "round number" carat values. The density in number of samples clearly trails off between "common" weights.
There is a slight trend in residuals versus actual price (shown below), which indicates that this model trends towards under-estimating as the price of a diamond increases.
There is little overall trend in the error versus clarity (seen below), indicating that parameter to be well modeled. If anything the model tends to slightly under-estimate pricing of lower quality diamonds and over-estimate the pricing of higher quality diamonds with regards to clarity.
There is little overall trend in the error versus color (shown below), indicating that parameter to be well modeled. "H" and "I" color grades tend to be slightly underestimated, perhaps due to their popularity as a "good value" grade. The model tends to slightly overestimate the pricing of the very best grade "D".
Conclusions
Overall the model appears to provide a good general prediction of price, sufficient to give you some confidence in whether a given stone is above or below the prevailing market averages.
There appears to be several hundred dollars variation due to un-modeled factors and market variance ($ of ~400-600). This translates to a swing around a mean prediction of (approximately) 1000$ () which is suggesting prices seem to vary more drastically than I might have initially expected. Some of this may be a result of no standardized "cut" grading for Princess Cut diamonds.
The only significant un-modeled parameter I can think of in this case is the length/width ratio of the square princess cut (ideally 1:1). There are a number of significant outliers - which could represent very good/bad deals, or could reflect the impact of an un-modeled parameter.
The modeling of diamond price relationship to carat weight by an exponent of ~2 fits the data well
Very generally, the marginal cost for improving a clarity or color by one grade is +10%.
Availability rapidly diminishes between "common" carat weight values, although the model pricing still appears generally accurate.
The analysis was repeated for round-cut diamonds, see below the break.