Component Interpretation Summary via SDA

D.E. Beaudette
2016-08-17
This document is based on soilDB version 1.8-1.

You probably need to install some packages before this will work. Have a look at this tutorial to get setup.

Lets get some interpretations that are relevant to MLRA 18 (Sierra Nevada Foothills) soils so that we can compare all components named Auburn and Dunstone. Just for fun lets throw in a very different soil from MLRA 17 (San Joaquin Valley); Hanford.

Note that SDA has a 100,000 row limit, so we have to be a little clever when writing the queries. In this case, letting the database filter out NULL fuzzy ratings and the reasons for ratings.

library(soilDB)
library(plyr)
library(reshape2)
library(lattice)

# beware, there are hard limits (10k rows) on what can be returned by SDA
q <- "SELECT component.cokey, compname, mrulename, interplr, interplrc
FROM component INNER JOIN cointerp ON component.cokey = cointerp.cokey
WHERE compname IN ('Auburn', 'Dunstone', 'Hanford')
AND seqnum = 0
AND mrulename IN ('ENG - Construction Materials; Topsoil', 
'ENG - Sewage Lagoons', 'ENG - Septic Tank Absorption Fields', 
'ENG - Unpaved Local Roads and Streets', 
'AGR - California Revised Storie Index (CA)', 
'AGR - Pesticide Loss Potential-Leaching')
AND interplr IS NOT NULL;"

# query and check
x <- SDA_query(q)
head(x)

cokey	compname	mrulename	interplr	interplrc
11730485	Hanford	ENG - Unpaved Local Roads and Streets	0.400	Somewhat limited
11730485	Hanford	ENG - Sewage Lagoons	1.000	Very limited
11730485	Hanford	ENG - Septic Tank Absorption Fields	1.000	Very limited
11730485	Hanford	ENG - Construction Materials; Topsoil	0.822	Fair
10789621	Auburn	AGR - Pesticide Loss Potential-Leaching	0.000	Not limited
10789621	Auburn	AGR - California Revised Storie Index (CA)	0.290	Grade 4 - Poor

OK, so how do the fuzzy numbers (interplr values) for these components compare? Box and whisker plots are the simplest, but sometimes density plots are helpful for viewing the complete distribution. Looks like Auburn and Dunstone are pretty similar, at least in terms of these interpretations.

# compute number of rules and soils
n.soils <- length(unique(x$compname))
n.rules <- length(unique(x$mrulename))

# compare population with box-whisker plot
bwplot(compname ~ interplr | mrulename, data=x, layout=c(1,n.rules))

densityplot(~ interplr | mrulename, groups=compname, data=x, layout=c(1,n.rules), auto.key=list(points=FALSE, lines=TRUE, columns=n.soils), scales=list(y=list(relation='free')))

Those figures were useful, sometimes it is nice to see the median values for each interpretation (rows) and component (columns).

s <- ddply(x, c('mrulename', 'compname'), summarize, 
           low=quantile(interplr, probs=0.05, na.rm=TRUE), 
           rv=quantile(interplr, probs=0.5, na.rm=TRUE), 
           high=quantile(interplr, probs=0.95, na.rm=TRUE))

knitr::kable(dcast(s, mrulename ~ compname, value.var = 'rv'), caption = "Median Fuzzy Ratings")

Median Fuzzy Ratings
mrulename	Auburn	Dunstone	Hanford
AGR - California Revised Storie Index (CA)	0.307	0.23	0.8530
AGR - Pesticide Loss Potential-Leaching	0.000	NA	1.0000
ENG - Construction Materials; Topsoil	0.000	0.00	0.8220
ENG - Septic Tank Absorption Fields	1.000	1.00	1.0000
ENG - Sewage Lagoons	1.000	1.00	1.0000
ENG - Unpaved Local Roads and Streets	1.000	1.00	0.0135

What about the categorical ratings? Note that the kable function from the knitr package makes nice HTML tables for us.

knitr::kable(xtabs(~ interplrc + compname , data=x, subset= mrulename == 'ENG - Construction Materials; Topsoil'), caption="ENG - Construction Materials; Topsoil")

ENG - Construction Materials; Topsoil
	Auburn	Dunstone	Hanford
Fair	5	0	76
Poor	89	13	10

knitr::kable(xtabs(~ interplrc + compname , data=x, subset= mrulename == 'AGR - California Revised Storie Index (CA)'), caption = "AGR - California Revised Storie Index (CA)'")

AGR - California Revised Storie Index (CA)'
	Auburn	Dunstone	Hanford
Grade 1 - Excellent	0	0	59
Grade 2 - Good	0	0	25
Grade 3 - Fair	6	0	2
Grade 4 - Poor	75	9	0
Grade 5 - Very Poor	13	3	0
Grade 6 - Nonagricultural	0	1	0

knitr::kable(xtabs(~ interplrc + compname , data=x, subset= mrulename == 'ENG - Septic Tank Absorption Fields'), caption = "ENG - Septic Tank Absorption Fields")

ENG - Septic Tank Absorption Fields
	Auburn	Dunstone	Hanford
Somewhat limited	0	0	4
Very limited	89	13	82

knitr::kable(xtabs(~ interplrc + compname , data=x, subset= mrulename == 'ENG - Sewage Lagoons'), caption = "ENG - Sewage Lagoons")

ENG - Sewage Lagoons
	Auburn	Dunstone	Hanford
Very limited	89	13	86

Interpretation Based Similarity

Here is a crazy idea, what if we could sort a collection of soil series based on a subset of relevant interpretations? We can, as long as some assumptions are made:

there exists a small set of interpretations that reliably describe some aspect of "similarity"
the mean fuzzy rating is a realistic description of central tendency
in aggregate, the collection of components named for a soil series will approximate the central tendency of that series

Lets try it with a small set of MLRA 17, 18, and 22A soils. NULL ratings will confound interpretation of the results--there are no "AGR - Pesticide Loss Potential-Leaching" ratings for components named "Dunstone". I have tried to select a small set of relevant interpretations, but clearly there are many possibilities.

library(cluster)
library(ape)

# set list of soil series (component names)
soil.list <- c('Pardee', 'Yolo', 'Capay', 'Aiken', 'Amador', 'Pentz', 'Sobrante',
'Argonaut', 'Toomes', 'Jocal', 'Holland', 'Auburn', 'Dunstone', 
'Hanford', 'Redding', 'Columbia', 'San Joaquin', 'Fresno')

# set list of relevant interpretations
interp.list <- c('ENG - Construction Materials; Topsoil', 
'ENG - Sewage Lagoons', 'ENG - Septic Tank Absorption Fields', 
'ENG - Unpaved Local Roads and Streets', 
'AGR - California Revised Storie Index (CA)')

# compose query
q <- paste0("SELECT compname, mrulename, AVG(interplr) as interplr_mean
FROM component INNER JOIN cointerp ON component.cokey = cointerp.cokey
WHERE compname IN ", format_SQL_in_statement(soil.list), "
AND seqnum = 0
AND mrulename IN ", format_SQL_in_statement(interp.list), "
AND interplr IS NOT NULL
GROUP BY compname, mrulename;")

# send query
x <- SDA_query(q)

# reshape long -> wide
x.wide <- dcast(x, compname ~ mrulename, value.var = 'interplr_mean')
knitr::kable(x.wide, digits = 3, caption="Mean Fuzzy Ratings for Select Soil Series")

Mean Fuzzy Ratings for Select Soil Series
compname	AGR - California Revised Storie Index (CA)	ENG - Construction Materials; Topsoil	ENG - Septic Tank Absorption Fields	ENG - Sewage Lagoons	ENG - Unpaved Local Roads and Streets
Aiken	0.739	0.217	1.000	0.966	0.909
Amador	0.217	0.000	1.000	1.000	0.732
Argonaut	0.441	0.050	1.000	1.000	0.996
Auburn	0.297	0.001	1.000	1.000	0.997
Capay	0.532	0.170	1.000	0.677	1.000
Columbia	0.594	0.426	0.984	0.984	0.874
Dunstone	0.213	0.000	1.000	1.000	0.923
Fresno	0.135	0.000	1.000	1.000	0.419
Hanford	0.843	0.667	0.988	1.000	0.212
Holland	0.634	0.108	0.979	0.999	0.867
Jocal	0.626	0.095	0.921	0.993	1.000
Pardee	0.207	0.000	1.000	1.000	1.000
Pentz	0.243	0.004	1.000	1.000	0.739
Redding	0.239	0.039	1.000	1.000	0.892
San Joaquin	0.268	0.218	1.000	1.000	0.632
Sobrante	0.439	0.071	1.000	1.000	0.871
Toomes	0.170	0.000	1.000	1.000	1.000
Yolo	0.830	0.826	0.885	0.633	0.730

# create distance matrix
d <- daisy(x.wide[, -1])

# cluster via divisive hierachical method
h <- as.hclust(diana(d))

# transfer compname labels and convert to 'ape' class for plotting
h$labels <- x.wide$compname
h <- as.phylo(h)

# plot as dendrogram
par(mar=c(1,1,3,1))
plot(h)
title('Component Similarity via Select Interpretation Fuzzy Values')

Interesting. Discussion to be continued...

This document is based on soilDB version 2016-08-17.

Component interpretation comparison via SDA

Component Interpretation Summary via SDA

Interpretation Based Similarity