Building your own subset
Mike Johnson
Lynker, NOAA-AffiliateJustin Singh
Lynker, NOAA-AffiliateSource:
vignettes/05-subsetting.Rmd
05-subsetting.Rmd
We are now at the point where we have a common understanding of how hydrofabrics are sourced, manipulated for a given modeling task, and generated on a VPU basis.
This aim of this section is that each of us can build a subset network for a hydrolocation of interest. Doing this requires an understanding of the following:
- The CONUS network file
- General topological sorts and pitfalls
- The R based
hydrofabric::subset_network
- The CLI, Go based, utility for micro service subsets.
The CONUS network file
Remember, reference, refactor, and NextGen products are distributed by VPU.
While the data products are segmented, they do describe a holistic network.
To easy data discoverabilty, cross-walking, and indexing, the network layers of each VPU are joined into a single Parquet file (~80MB in size)
net = arrow::read_parquet("../data/conus_net.parquet")
glimpse(net)
## Rows: 3,178,347
## Columns: 16
## $ id <chr> "wb-282", "wb-282", "wb-282", "wb-282", "wb-2462…
## $ toid <chr> "tnx-1000000171", "tnx-1000000171", "tnx-1000000…
## $ divide_id <chr> "cat-282", "cat-282", "cat-282", "cat-282", "cat…
## $ ds_id <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mainstem <int> 2238463, 2238463, 2238463, 2238463, 2243682, 224…
## $ hl_id <chr> NA, NA, NA, NA, "176", "176", "176", "183", "183…
## $ hydroseq <int> 19980, 19980, 19980, 19980, 19979, 19979, 19979,…
## $ hl_uri <chr> NA, NA, NA, NA, "HUC12-010100100101", "HUC12-010…
## $ hf_source <chr> "NHDPlusV2", "NHDPlusV2", "NHDPlusV2", "NHDPlusV…
## $ hf_id <dbl> 4287171, 4287061, 4287169, 4287115, 816563, 8164…
## $ lengthkm <dbl> 3.695258, 3.695258, 3.695258, 3.695258, 5.135779…
## $ areasqkm <dbl> 6.786451, 6.786451, 6.786451, 6.786451, 13.28626…
## $ tot_drainage_areasqkm <dbl> 6.786451, 6.786451, 6.786451, 6.786451, 51.65776…
## $ type <chr> "terminal", "terminal", "terminal", "terminal", …
## $ vpu <chr> "01", "01", "01", "01", "01", "01", "01", "01", …
## $ hf_hydroseq <dbl> 150061851, 150030537, 150035911, 150045246, 1500…
In the above table you’ll see that every relationship between the features in the current hydrofabric, the source hydrofabric, and all hydrolocations, have been exploded into a “many-to-many” table.
For example, we can look at the flowpath wb-1002
and
find that it is defined by the aggregation of NHDPlusV2 COMID 2239831
and 2239831.
glimpse(filter(net, id == "wb-1002"))
## Rows: 2
## Columns: 16
## $ id <chr> "wb-1002", "wb-1002"
## $ toid <chr> "nex-294", "nex-294"
## $ divide_id <chr> "cat-1002", "cat-1002"
## $ ds_id <dbl> NA, NA
## $ mainstem <int> 2239831, 2239831
## $ hl_id <chr> NA, NA
## $ hydroseq <int> 1927, 1927
## $ hl_uri <chr> NA, NA
## $ hf_source <chr> "NHDPlusV2", "NHDPlusV2"
## $ hf_id <dbl> 4290271, 4290437
## $ lengthkm <dbl> 6.291308, 6.291308
## $ areasqkm <dbl> 23.06835, 23.06835
## $ tot_drainage_areasqkm <dbl> 54.84601, 54.84601
## $ type <chr> "nexus", "nexus"
## $ vpu <chr> "01", "01"
## $ hf_hydroseq <dbl> 150045065, 150035814
Or, that the terminal outflow of ‘HUC12-010100100101’ occurs at
tnx-1000000019
which is fed by an aggregate flowpath made
up of three source flowpaths
(COMID={816563, 816417, 816415}
)
glimpse(filter(net, hl_uri == 'HUC12-010100100101'))
## Rows: 3
## Columns: 16
## $ id <chr> "wb-2462", "wb-2462", "wb-2462"
## $ toid <chr> "tnx-1000000019", "tnx-1000000019", "tnx-1000000…
## $ divide_id <chr> "cat-2462", "cat-2462", "cat-2462"
## $ ds_id <dbl> NA, NA, NA
## $ mainstem <int> 2243682, 2243682, 2243682
## $ hl_id <chr> "176", "176", "176"
## $ hydroseq <int> 19979, 19979, 19979
## $ hl_uri <chr> "HUC12-010100100101", "HUC12-010100100101", "HUC…
## $ hf_source <chr> "NHDPlusV2", "NHDPlusV2", "NHDPlusV2"
## $ hf_id <dbl> 816563, 816417, 816415
## $ lengthkm <dbl> 5.135779, 5.135779, 5.135779
## $ areasqkm <dbl> 13.28626, 13.28626, 13.28626
## $ tot_drainage_areasqkm <dbl> 51.65776, 51.65776, 51.65776
## $ type <chr> "terminal", "terminal", "terminal"
## $ vpu <chr> "01", "01", "01"
## $ hf_hydroseq <dbl> 150027523, 150024646, 150066278
Why do it this way?
1. Speed
system.time({
net = read_parquet('s3://nextgen-hydrofabric/pre-release/conus_net.parquet')
})
# user system elapsed
# 2.180 2.123 41.288
Defered Evaluation
The times you would want to open the entire network are limited, often, we are interested in small bits of data that
system.time({
net = open_dataset('s3://nextgen-hydrofabric/pre-release/conus_net.parquet') %>%
filter(id == "wb-1002") %>%
collect()
})
# user system elapsed
# 2.294 2.588 24.261
2. Feature Identification
By location
here = AOI::geocode("National Water Center, Alabama", pt = TRUE)
x = dataRetrieval::findNLDI(location = here)
(filter(net, hf_id == x$origin$comid))
## # A tibble: 1 × 16
## id toid divide_id ds_id mainstem hl_id hydroseq hl_uri hf_source hf_id
## <chr> <chr> <chr> <dbl> <int> <chr> <int> <chr> <chr> <dbl>
## 1 wb-1649… nex-… cat-1649… NA 1717455 NA 4370 NA NHDPlusV2 1.58e7
## # ℹ 6 more variables: lengthkm <dbl>, areasqkm <dbl>,
## # tot_drainage_areasqkm <dbl>, type <chr>, vpu <chr>, hf_hydroseq <dbl>
For a gage in Calfornia
hl = filter(net, hl_uri == "Gages-11123000")
(slice_max(hl, hf_hydroseq))
## # A tibble: 1 × 16
## id toid divide_id ds_id mainstem hl_id hydroseq hl_uri hf_source hf_id
## <chr> <chr> <chr> <dbl> <int> <chr> <int> <chr> <chr> <dbl>
## 1 wb-3400… nex-… cat-3400… NA 1839328 3569 30250 Gages… NHDPlusV2 1.76e7
## # ℹ 6 more variables: lengthkm <dbl>, areasqkm <dbl>,
## # tot_drainage_areasqkm <dbl>, type <chr>, vpu <chr>, hf_hydroseq <dbl>
For the HUC12 of Atascadero Creek in Santa Barbara:
hl = filter(net, hl_uri == "HUC12-180600130201")
(wbd = slice_max(hl, hf_hydroseq))
## # A tibble: 1 × 16
## id toid divide_id ds_id mainstem hl_id hydroseq hl_uri hf_source hf_id
## <chr> <chr> <chr> <dbl> <int> <chr> <int> <chr> <chr> <dbl>
## 1 wb-3400… nex-… cat-3400… NA 1841917 2281 16722 HUC12… NHDPlusV2 1.76e7
## # ℹ 6 more variables: lengthkm <dbl>, areasqkm <dbl>,
## # tot_drainage_areasqkm <dbl>, type <chr>, vpu <chr>, hf_hydroseq <dbl>
For the dam on Horsetooth Reservoir
hl = filter(net, hl_uri == "NID-CO01659-1")
(wbd = slice_max(hl, hf_hydroseq))
## # A tibble: 1 × 16
## id toid divide_id ds_id mainstem hl_id hydroseq hl_uri hf_source hf_id
## <chr> <chr> <chr> <dbl> <int> <chr> <int> <chr> <chr> <dbl>
## 1 wb-3314… nex-… cat-3314… NA 1223227 7253 11610 NID-C… NHDPlusV2 2.90e6
## # ℹ 6 more variables: lengthkm <dbl>, areasqkm <dbl>,
## # tot_drainage_areasqkm <dbl>, type <chr>, vpu <chr>, hf_hydroseq <dbl>
By known COMID
filter(net, hf_id == 101)
## # A tibble: 1 × 16
## id toid divide_id ds_id mainstem hl_id hydroseq hl_uri hf_source hf_id
## <chr> <chr> <chr> <dbl> <int> <chr> <int> <chr> <chr> <dbl>
## 1 wb-26052… nex-… cat-2605… NA 2603469 281 120 HUC12… NHDPlusV2 101
## # ℹ 6 more variables: lengthkm <dbl>, areasqkm <dbl>,
## # tot_drainage_areasqkm <dbl>, type <chr>, vpu <chr>, hf_hydroseq <dbl>
3. Topo sorting!
A national network provides an opportunity for rapid network navigation and subsetting. The structure of the CONUS network file also facilitates this, if a few precautions are taken.
While the “many-to-many” relationship of the network file allows for fast indexing and discovery, the duplication of feature IDs is not conducive to graph based approaches.
# Most downstream feature associated with
(wbd = filter(net, hl_uri == "HUC12-180600130201") %>%
select(id, toid, hl_uri, vpu, hydroseq) %>%
slice_max(hydroseq, with_ties = TRUE))
## # A tibble: 8 × 5
## id toid hl_uri vpu hydroseq
## <chr> <chr> <chr> <chr> <int>
## 1 wb-3400759 nex-3400758 HUC12-180600130201 18 16731
## 2 wb-3400759 nex-3400758 HUC12-180600130201 18 16731
## 3 wb-3400759 nex-3400758 HUC12-180600130201 18 16731
## 4 wb-3400759 nex-3400758 HUC12-180600130201 18 16731
## 5 wb-3400759 nex-3400758 HUC12-180600130201 18 16731
## 6 wb-3400759 nex-3400758 HUC12-180600130201 18 16731
## 7 wb-3400759 nex-3400758 HUC12-180600130201 18 16731
## 8 wb-3400759 nex-3400758 HUC12-180600130201 18 16731
Seeking just the topology of the network would be impractical/wrong
(filter(net, vpu == wbd$vpu[1]) %>%
select(id, toid, divide_id))
## # A tibble: 170,349 × 3
## id toid divide_id
## <chr> <chr> <chr>
## 1 wb-3378885 tnx-1000012409 cat-3378885
## 2 wb-3378885 tnx-1000012409 cat-3378885
## 3 wb-3378885 tnx-1000012409 cat-3378885
## 4 wb-3378885 tnx-1000012409 cat-3378885
## 5 wb-3378885 tnx-1000012409 cat-3378885
## 6 wb-3378885 tnx-1000012409 cat-3378885
## 7 wb-3378885 tnx-1000012409 cat-3378885
## 8 wb-3378885 tnx-1000012409 cat-3378885
## 9 wb-3378885 tnx-1000012409 cat-3378885
## 10 wb-3378885 tnx-1000012409 cat-3378885
## # ℹ 170,339 more rows
Thus, you NEED to make sure that operations running on the network file are dealing with distinct rows for the operations you want.
In the case of hydrosequencing, these are unique
id
/toid
pairs
(wbd_net = filter(net, vpu == wbd$vpu) %>%
select(id, toid, divide_id) %>%
distinct())
## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `vpu == wbd$vpu`.
## Caused by warning in `vpu == wbd$vpu`:
## ! longer object length is not a multiple of shorter object length
## # A tibble: 65,280 × 3
## id toid divide_id
## <chr> <chr> <chr>
## 1 wb-3378885 tnx-1000012409 cat-3378885
## 2 wb-3378890 tnx-1000013027 cat-3378890
## 3 wb-3378892 tnx-1000012410 cat-3378892
## 4 wb-3378895 tnx-1000013028 cat-3378895
## 5 wb-3378974 tnx-1000013029 cat-3378974
## 6 wb-3380350 tnx-1000012411 cat-3380350
## 7 wb-3380601 tnx-1000012412 cat-3380601
## 8 wb-3380616 tnx-1000012414 cat-3380616
## 9 wb-3380617 tnx-1000013030 cat-3380617
## 10 wb-3380618 tnx-1000012415 cat-3380618
## # ℹ 65,270 more rows
With a complete, non duplicated network, we can easily sort and subset based on a defined outlet(s)
(get_sorted(wbd_net, outlets = wbd$toid[1]))
## Warning in check_hy_outlets(x, fix = FALSE): Outlets don't follow hydroloom
## convention of 0 or '', not fixing.
## # A tibble: 20 × 3
## id toid divide_id
## <chr> <chr> <chr>
## 1 wb-3400765 nex-3400757 cat-3400765
## 2 nex-3400757 wb-3400757 NA
## 3 wb-3400757 nex-3400758 cat-3400757
## 4 wb-3400769 nex-3400759 cat-3400769
## 5 nex-3400759 wb-3400759 NA
## 6 wb-3400759 nex-3400758 cat-3400759
## 7 wb-3400760 nex-3400761 cat-3400760
## 8 nex-3400761 wb-3400761 NA
## 9 wb-3400761 nex-3400758 cat-3400761
## 10 wb-3400768 nex-3400758 cat-3400768
## 11 wb-3400766 nex-3400762 cat-3400766
## 12 nex-3400762 wb-3400762 NA
## 13 wb-3400762 nex-3400763 cat-3400762
## 14 wb-3400767 nex-3400764 cat-3400767
## 15 nex-3400764 wb-3400764 NA
## 16 wb-3400764 nex-3400763 cat-3400764
## 17 wb-3400770 nex-3400763 cat-3400770
## 18 nex-3400763 wb-3400763 NA
## 19 wb-3400763 nex-3400758 cat-3400763
## 20 nex-3400758 wb-3400758 NA
Complete Example:
system.time({
wbd = filter(net, hl_uri == "HUC12-180600130201") %>%
select(toid, vpu, hydroseq) %>%
slice_max(hydroseq, with_ties = FALSE)
wbd_net = filter(net, vpu == wbd$vpu) %>%
select(id, toid, divide_id) %>%
distinct() %>%
get_sorted(outlets = wbd$toid)
})
## Warning in check_hy_outlets(x, fix = FALSE): Outlets don't follow hydroloom
## convention of 0 or '', not fixing.
## user system elapsed
## 0.890 0.137 1.057
## [1] "wb-3400765" "nex-3400757" "wb-3400757" "wb-3400769" "nex-3400759"
## [6] "wb-3400759" "wb-3400760" "nex-3400761" "wb-3400761" "wb-3400768"
## [11] "wb-3400766" "nex-3400762" "wb-3400762" "wb-3400767" "nex-3400764"
## [16] "wb-3400764" "wb-3400770" "nex-3400763" "wb-3400763" "nex-3400758"
## [21] "wb-3400758" "cat-3400765" "cat-3400757" "cat-3400769" "cat-3400759"
## [26] "cat-3400760" "cat-3400761" "cat-3400768" "cat-3400766" "cat-3400762"
## [31] "cat-3400767" "cat-3400764" "cat-3400770" "cat-3400763"
## attr(,"na.action")
## [1] 23
## attr(,"class")
## [1] "omit"
4. Easy integration with SQL
Once you have a set of IDs, the SQL backing of GPKG allows for quickly extracting features based on that ID set
## USE GET HYDROFABRIC HERE
(gpkg = get_fabric(VPU = wbd$vpu, cache_dir = "cihro-data"))
## cihro-data/nextgen_18.gpkg
Subsetting!
Now, while conceptual that is straight forward, doing this repetitively is annoying. Equally, writing out the logic for all layers is a loop that can be prone to easy errors.
In these cases, we provide a subset_network function that works o n
system.time({
xx = subset_network(id = wbd$toid,
lyrs = c("divides", "nexus", "flowpaths", "lakes", "flowpath_attributes"),
cache_dir = "cihro-data")
})
## Warning: There was 1 warning in `dplyr::filter()`.
## ℹ In argument: `hf_id == comid`.
## Caused by warning in `hf_id == comid`:
## ! longer object length is not a multiple of shorter object length
## Starting from: `wb-3400757`
## Subsetting: divides (1/5)
## Subsetting: nexus (2/5)
## Subsetting: flowpaths (3/5)
## Subsetting: lakes (4/5)
## Subsetting: flowpath_attributes (5/5)
## user system elapsed
## 5.082 1.045 15.140
lapply(xx, names)
## $divides
## [1] "fid" "geom" "divide_id"
## [4] "toid" "type" "ds_id"
## [7] "areasqkm" "id" "lengthkm"
## [10] "tot_drainage_areasqkm" "has_flowline"
##
## $nexus
## [1] "fid" "geom" "id" "toid" "hl_id" "hl_uri" "type"
##
## $flowpaths
## [1] "fid" "geom" "id"
## [4] "toid" "mainstem" "order"
## [7] "hydroseq" "lengthkm" "areasqkm"
## [10] "tot_drainage_areasqkm" "has_divide" "divide_id"
##
## $lakes
## [1] "fid" "geom" "id" "toid" "hl_id"
## [6] "hl_reference" "hl_link" "hl_uri" "Dam_Length" "ifd"
## [11] "LkArea" "LkMxE" "OrificeA" "OrificeC" "OrificeE"
## [16] "time" "WeirC" "WeirE" "WeirL"
##
## $flowpath_attributes
## [1] "fid" "id" "rl_gages"
## [4] "rl_NHDWaterbodyComID" "Qi" "MusK"
## [7] "MusX" "n" "So"
## [10] "ChSlp" "BtmWdth" "time"
## [13] "Kchan" "nCC" "TopWdthCC"
## [16] "TopWdth" "length_m"
Example for Poudre River!
system.time({
xx = subset_network(hl_uri = 'Gages-06752260',
lyrs = c("divides", "nexus", "flowpaths"),
cache_dir = "cihro-data",
outfile = "cihro-data/poudre-subset.gpkg")
})
## Starting from: `nex-278699`
## Subsetting: divides (1/3)
## Subsetting: nexus (2/3)
## Subsetting: flowpaths (3/3)
## Deleting layer `layer_styles' using driver `GPKG'
## user system elapsed
## 3.882 1.330 14.254
st_layers(xx)
## Driver: GPKG
## Available layers:
## layer_name geometry_type features fields crs_name
## 1 divides Multi Polygon 419 9 NAD83 / Conus Albers
## 2 nexus Point 171 5 NAD83 / Conus Albers
## 3 flowpaths Multi Line String 419 10 NAD83 / Conus Albers
## 4 layer_styles NA 3 12 <NA>
CLI Option
For those interested in using the NOAA NextGen fabric as is, we have provided a Go-based CLI here
This utility has the following syntax:
hfsubset - Hydrofabric Subsetter
Usage:
hfsubset [OPTIONS] identifiers...
hfsubset (-h | --help)
Example:
hfsubset -l divides -o ./poudre-divides.gpkg -r "pre-release" -t hl "Gages-06752260"
hfsubset -o ./poudre-all.gpkg -t hl "Gages-06752260"
Options:
-l string
Layers to subset (default "divides,nexus,flowpaths,network,hydrolocations")
-o string
Output file name (default "hydrofabric.gpkg")
-r string
Hydrofabric version (default "pre-release")
-t string
One of: hf, hl, or comid (default "hf")
NextGen Needs GeoJSON: Final example
While GPKG support is the end goal for NextGen, it current requires GeoJSON and CSV inputs.
Fortunately, ogr2ogr
provides easy ways to extract these
sub layer/formats from the GPKG file.
Here is a full-stop example of extracting a subset for a hydrolocation, using the CLI, and generating the needed files for NextGen
cd <WHEREVER YOU WANT!>
mkdir poudre
cd poudre
hfsubset -l divides,nexus,flowpath_attributes -o ./poudre-subset.gpkg -r "pre-release" -t hl "Gages-06752260"
ogr2ogr -f GeoJSON flowpath.geojson poudre-subset.gpkg flowpath
ogr2ogr -f GeoJSON catchments.geojson poudre-subset.gpkg divides
ogr2ogr -f GeoJSON nexus.geojson poudre-subset.gpkg nexus
ogr2ogr -f CSV flowpath_attributes.csv poudre-subset.gpkg flowpath_attributes
ls poudre