
Subsampling Comparisons

[1]:
import rioxarray as rxr
import geopandas as gpd
from shapely import unary_union

from gval import CatStats

Create Subsampling DataFrame

Let’s open up a geopackage with polygons to use for subsampling:

[2]:
data_path = './'
polygons_continuous = gpd.read_file(f'{data_path}/subsample_continuous_polygons.gpkg')

To use this DataFrame for subsampling, call create_subsampling_df:

[3]:
polygons_continuous.gval.create_subsampling_df(subsampling_type=["include", "include"], inplace=True)
polygons_continuous
[3]:
geometry subsample_type subsample_id
0 POLYGON ((-97.72375 29.56328, -97.72304 29.558... include 1
1 POLYGON ((-97.71604 29.55635, -97.71587 29.551... include 2

The DataFrame above has a geometry column; a subsample_type column whose value is either “include” (compute statistics only over data within the geometry) or “exclude” (remove all data contained within the geometry); and a subsample_id column identifying each subsample.
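For illustration only, a DataFrame with the same structure could be assembled by hand (a minimal sketch, assuming the accessor simply adds the two columns shown above):

manual_df = polygons_continuous[["geometry"]].copy()      # keep only the geometries
manual_df["subsample_type"] = "include"                    # or "exclude", per geometry
manual_df["subsample_id"] = range(1, len(manual_df) + 1)   # one id per subsample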

You can also supply subsampling_weights:

[4]:
polygons_continuous = polygons_continuous.gval.create_subsampling_df(subsampling_type=["exclude", "exclude"], subsampling_weights=[2, 1])
polygons_continuous.explore()
[4]:
(interactive map of the subsampling polygons)
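Since the interactive folium map from explore() does not render in a static page, a static view of the same polygons can be produced with GeoDataFrame.plot (a minimal sketch; styling choices are arbitrary):

polygons_continuous.plot(edgecolor="black", alpha=0.5)   # static alternative to .explore()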

Continuous Compare Subsampling

[5]:
cds = rxr.open_rasterio(f'{data_path}/candidate_continuous_1.tif', band_as_variable=True, mask_and_scale=True)
bds = rxr.open_rasterio(f'{data_path}/benchmark_continuous_1.tif', band_as_variable=True, mask_and_scale=True)
[6]:
cds.gval.cont_plot(title="Candidate Map")
[6]:
<matplotlib.collections.QuadMesh at 0x7fa3aac9af50>
_images/SphinxSubsamplingTutorial_12_1.png
[7]:
bds.gval.cont_plot(title="Benchmark Map")
[7]:
<matplotlib.collections.QuadMesh at 0x7fa3aaa6d180>
_images/SphinxSubsamplingTutorial_13_1.png

Let’s use this newly created subsampling DataFrame in a continuous comparison. For each subsample, an agreement map is created and then used to calculate continuous statistics. There are four subsampling_average options:

  1. full-detail: reports all metrics calculated on separate bands and subsamples.

  2. band: reports all metrics on subsamples with band values averaged.

  3. subsample: reports all metrics on bands with subsample values averaged.

  4. weighted: reports all metrics on bands with subsample values averaged and scaled by weights.

Full-Detail

[8]:
ag, met = cds.gval.continuous_compare(benchmark_map=bds,
                                      metrics=["mean_percentage_error"],
                                      subsampling_df=polygons_continuous,
                                      subsampling_average="full-detail")
met
[8]:
subsample band mean_percentage_error
0 1 1 0.125928
1 1 2 -0.111844
2 2 1 0.167116
3 2 2 -0.143187
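The full-detail table is also a convenient way to sanity-check the averaged modes shown in the next sections: the non-weighted “band” and “subsample” averages appear to be plain means over this table (a hedged check, not part of gval’s API):

print(met.groupby("subsample", as_index=False)["mean_percentage_error"].mean())  # matches the "band" results below
print(met.groupby("band", as_index=False)["mean_percentage_error"].mean())       # matches the "subsample" results below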

The agreement maps for each subsample will look as follows:

[9]:
ag[0].gval.cont_plot(title="Agreement Map 1st Exclusion")
[9]:
<matplotlib.collections.QuadMesh at 0x7fa3aa732590>
_images/SphinxSubsamplingTutorial_18_1.png
[10]:
ag[1].gval.cont_plot(title="Agreement Map 2nd Exclusion")
[10]:
<matplotlib.collections.QuadMesh at 0x7fa3aa6e5cc0>
_images/SphinxSubsamplingTutorial_19_1.png

Band

[11]:
ag, met = cds.gval.continuous_compare(benchmark_map=bds,
                                      metrics=["mean_percentage_error"],
                                      subsampling_df=polygons_continuous,
                                      subsampling_average="band")
met
[11]:
subsample band mean_percentage_error
0 1 averaged 0.007042
1 2 averaged 0.011964

Subsample

[12]:
ag, met = cds.gval.continuous_compare(benchmark_map=bds,
                                      metrics=["mean_percentage_error"],
                                      subsampling_df=polygons_continuous,
                                      subsampling_average="subsample")
met
[12]:
subsample band mean_percentage_error
0 averaged 1 0.146522
1 averaged 2 -0.127515

Weighted

[13]:
ag, met = cds.gval.continuous_compare(benchmark_map=bds,
                                      metrics=["mean_percentage_error"],
                                      subsampling_df=polygons_continuous,
                                      subsampling_average="weighted")
met
[13]:
subsample band mean_percentage_error
0 averaged 1 0.083952
1 averaged 2 -0.037281

If you want a single subsample with all of the exclusions combined, use Shapely’s unary_union:

[14]:
combined_df = gpd.GeoDataFrame(geometry=[unary_union(polygons_continuous['geometry'])],
                               crs=polygons_continuous.crs)
combined_df.gval.create_subsampling_df(subsampling_type=["exclude"], inplace=True)
ag, met = cds.gval.continuous_compare(benchmark_map=bds,
                                      metrics=["mean_percentage_error"],
                                      subsampling_df=combined_df,
                                      subsampling_average="full-detail")
met
[14]:
subsample band mean_percentage_error
0 1 1 0.133122
1 1 2 -0.117482
[15]:
ag[0].gval.cont_plot(title="Agreement Map Exclusion")
[15]:
<matplotlib.collections.QuadMesh at 0x7fa3aa4e0820>
_images/SphinxSubsamplingTutorial_28_1.png

Categorical Compare Subsampling

[16]:
# Subsampling DF
polygons_categorical = gpd.read_file(f'{data_path}/subsample_two-class_polygons.gpkg')
polygons_categorical.gval.create_subsampling_df(subsampling_type=["include", "include"], inplace=True)
polygons_categorical.explore()
[16]:
(interactive map of the subsampling polygons)
[17]:
# Candidate and Benchmark
cda = rxr.open_rasterio(f'{data_path}/candidate_map_multiband_two_class_categorical.tif', mask_and_scale=True)
bda = rxr.open_rasterio(f'{data_path}/benchmark_map_multiband_two_class_categorical.tif', mask_and_scale=True)
[18]:
cda.gval.cat_plot("Candidate Map")
[18]:
<matplotlib.collections.QuadMesh at 0x7fa399f6e290>
_images/SphinxSubsamplingTutorial_32_1.png
[19]:
bda.gval.cat_plot("Benchmark Map")
[19]:
<matplotlib.collections.QuadMesh at 0x7fa38a781ba0>
_images/SphinxSubsamplingTutorial_33_1.png

Just as with the continuous comparison above, the following performs subsampling on categorical comparisons. For each subsample, an agreement map, a cross-tabulation table, and a metric table are created. There are three subsampling_average options:

  1. full-detail: reports all metrics calculated on separate bands and subsamples.

  2. band: reports all metrics on subsamples with band values averaged.

  3. subsample: reports all metrics on bands with subsample values averaged.

Full-Detail

[20]:
ag, ctab, met = cda.gval.categorical_compare(benchmark_map=bda,
                                             metrics="all",
                                             positive_categories=[2],
                                             negative_categories=[0, 1],
                                             subsampling_df=polygons_categorical,
                                             subsampling_average="full-detail")
met.transpose()
[20]:
0 1 2 3
band 1 1 2 2
subsample 1 2 1 2
fn 201953.0 182389.0 68239.0 58638.0
fp 761242.0 397531.0 43646.0 65967.0
tn 762262.0 398338.0 1479858.0 729902.0
tp 201936.0 181301.0 335650.0 305052.0
accuracy 0.50026 0.499879 0.94195 0.892541
balanced_accuracy 0.500157 0.499506 0.901198 0.877941
critical_success_index 0.173316 0.238171 0.749997 0.70999
equitable_threat_score 0.000104 -0.000426 0.696008 0.602268
f_score 0.29543 0.384715 0.857141 0.830402
false_discovery_rate 0.790344 0.686781 0.115071 0.1778
false_negative_rate 0.500021 0.501496 0.168955 0.161231
false_omission_rate 0.209448 0.31407 0.044079 0.074363
false_positive_rate 0.499665 0.499493 0.028648 0.082887
fowlkes_mallows_index 0.323765 0.395147 0.857564 0.830444
matthews_correlation_coefficient 0.000255 -0.000918 0.821398 0.751849
negative_likelihood_ratio 0.999373 1.001976 0.173938 0.175802
negative_predictive_value 0.790552 0.68593 0.955921 0.925637
overall_bias 2.384759 1.591553 0.93911 1.020152
positive_likelihood_ratio 1.000628 0.99802 29.0084 10.119461
positive_predictive_value 0.209656 0.313219 0.884929 0.8222
prevalence 0.209552 0.313645 0.209552 0.313645
prevalence_threshold 0.499922 0.500248 0.156594 0.239171
true_negative_rate 0.500335 0.500507 0.971352 0.917113
true_positive_rate 0.499979 0.498504 0.831045 0.838769
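The cross-tabulation tables returned as ctab are not displayed in this tutorial. A hedged way to inspect them (assuming ctab is either a single DataFrame or, like ag, a per-subsample sequence):

print(ctab[0] if isinstance(ctab, (list, tuple)) else ctab)   # peek at the cross-tabulation output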

The agreement maps will look as follows:

[21]:
ag[0].gval.cat_plot(title="Agreement Map 1st Inclusion")
[21]:
<matplotlib.collections.QuadMesh at 0x7fa39a335600>
_images/SphinxSubsamplingTutorial_38_1.png
[23]:
ag[1].gval.cat_plot(title="Agreement Map 2nd Inclusion")
[23]:
<matplotlib.collections.QuadMesh at 0x7fa3a6fa5d80>
_images/SphinxSubsamplingTutorial_39_1.png

Band

[24]:
ag, ctab, met = cda.gval.categorical_compare(benchmark_map=bda,
                                             metrics="all",
                                             positive_categories=[2],
                                             negative_categories=[0, 1],
                                             subsampling_df=polygons_categorical,
                                             subsampling_average="band")
met.transpose()
[24]:
0 1
subsample 1 2
band averaged averaged
fn 270192.0 241027.0
fp 804888.0 463498.0
tn 2242120.0 1128240.0
tp 537586.0 486353.0
accuracy 0.721105 0.69621
balanced_accuracy 0.700678 0.688723
critical_success_index 0.333352 0.408399
equitable_threat_score 0.192488 0.211025
f_score 0.500021 0.579948
false_discovery_rate 0.599556 0.487969
false_negative_rate 0.334488 0.331363
false_omission_rate 0.107547 0.176026
false_positive_rate 0.264157 0.29119
fowlkes_mallows_index 0.516237 0.585118
matthews_correlation_coefficient 0.342864 0.356123
negative_likelihood_ratio 0.454564 0.467492
negative_predictive_value 0.892453 0.823974
overall_bias 1.661934 1.305853
positive_likelihood_ratio 2.519382 2.296222
positive_predictive_value 0.400444 0.512031
prevalence 0.209552 0.313645
prevalence_threshold 0.38651 0.397562
true_negative_rate 0.735843 0.70881
true_positive_rate 0.665512 0.668637
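Note from the printed tables that the contingency counts in the band-averaged output pool (sum) the full-detail counts across bands for each subsample. A quick arithmetic check using the numbers above for subsample 1:

print(201953.0 + 68239.0)   # fn from the full-detail table, bands 1 and 2 -> 270192.0, matching the fn above
print(761242.0 + 43646.0)   # fp from the full-detail table, bands 1 and 2 -> 804888.0, matching the fp above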

Subsample

[25]:
ag, ctab, met = cda.gval.categorical_compare(benchmark_map=bda,
                                             metrics="all",
                                             positive_categories=[2],
                                             negative_categories=[0, 1],
                                             subsampling_df=polygons_categorical,
                                             subsampling_average="subsample")
met.transpose()
[25]:
0 1
subsample averaged averaged
band 1 2
fn 384342.0 126877.0
fp 1158773.0 109613.0
tn 1160600.0 2209760.0
tp 383237.0 640702.0
accuracy 0.500117 0.92339
balanced_accuracy 0.499837 0.893723
critical_success_index 0.198944 0.730401
equitable_threat_score -0.000122 0.657571
f_score 0.331866 0.844199
false_discovery_rate 0.751469 0.146089
false_negative_rate 0.50072 0.165295
false_omission_rate 0.248774 0.054299
false_positive_rate 0.499606 0.04726
fowlkes_mallows_index 0.352259 0.844253
matthews_correlation_coefficient -0.000282 0.793505
negative_likelihood_ratio 1.000651 0.173494
negative_predictive_value 0.751226 0.945701
overall_bias 2.008927 0.977509
positive_likelihood_ratio 0.999348 17.662067
positive_predictive_value 0.248531 0.853911
prevalence 0.248653 0.248653
prevalence_threshold 0.500082 0.192211
true_negative_rate 0.500394 0.95274
true_positive_rate 0.49928 0.834705

As before, to combine multiple inclusion polygons into a single subsample, use unary_union:

[26]:
combined_df = gpd.GeoDataFrame(geometry=[unary_union(polygons_categorical['geometry'])],
                               crs=polygons_categorical.crs)
combined_df.gval.create_subsampling_df(subsampling_type=["include"], inplace=True)
ag, ctab, met = cda.gval.categorical_compare(benchmark_map=bda,
                                             metrics="all",
                                             positive_categories=[2],
                                             negative_categories=[0, 1],
                                             subsampling_df=combined_df,
                                             subsampling_average="full-detail")
met.transpose()
[26]:
0 1
band 1 2
subsample 1 1
fn 384342.0 126877.0
fp 1158773.0 109613.0
tn 1160600.0 2209760.0
tp 383237.0 640702.0
accuracy 0.500117 0.92339
balanced_accuracy 0.499837 0.893723
critical_success_index 0.198944 0.730401
equitable_threat_score -0.000122 0.657571
f_score 0.331866 0.844199
false_discovery_rate 0.751469 0.146089
false_negative_rate 0.50072 0.165295
false_omission_rate 0.248774 0.054299
false_positive_rate 0.499606 0.04726
fowlkes_mallows_index 0.352259 0.844253
matthews_correlation_coefficient -0.000282 0.793505
negative_likelihood_ratio 1.000651 0.173494
negative_predictive_value 0.751226 0.945701
overall_bias 2.008927 0.977509
positive_likelihood_ratio 0.999348 17.662067
positive_predictive_value 0.248531 0.853911
prevalence 0.248653 0.248653
prevalence_threshold 0.500082 0.192211
true_negative_rate 0.500394 0.95274
true_positive_rate 0.49928 0.834705
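These values are identical to the “subsample”-averaged table above: merging the two inclusion polygons into one geometry pools the same cells. A hedged check, assuming the earlier subsample-averaged metric table was saved to a hypothetical variable met_subsample:

import numpy as np

metric_cols = [c for c in met.columns if c not in ("band", "subsample")]
print(np.allclose(met[metric_cols].to_numpy(dtype=float),
                  met_subsample[metric_cols].to_numpy(dtype=float)))   # expected: True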
[27]:
ag[0].gval.cat_plot(title="Agreement Map Inclusion")
[27]:
<matplotlib.collections.QuadMesh at 0x7fa3a1331e40>
_images/SphinxSubsamplingTutorial_46_1.png