
We are now at the point where we have a common understanding of how hydrofabrics are sourced, manipulated for a given modeling task, and generated on a VPU basis.

The aim of this section is that each of us can build a subset network for a hydrolocation of interest. Doing this requires an understanding of the following:

  1. The CONUS network file
  2. General topological sorts and pitfalls
  3. The R-based hydrofabric::subset_network function
  4. The Go-based CLI utility for microservice subsets

The CONUS network file

  • Remember: reference, refactor, and NextGen products are distributed by VPU.

  • While the data products are segmented, they do describe a holistic network.

  • To ease data discoverability, cross-walking, and indexing, the network layers of each VPU are joined into a single Parquet file (~80 MB in size).

net = arrow::read_parquet("../data/conus_net.parquet")

glimpse(net)
## Rows: 3,178,347
## Columns: 16
## $ id                    <chr> "wb-282", "wb-282", "wb-282", "wb-282", "wb-2462…
## $ toid                  <chr> "tnx-1000000171", "tnx-1000000171", "tnx-1000000…
## $ divide_id             <chr> "cat-282", "cat-282", "cat-282", "cat-282", "cat…
## $ ds_id                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ mainstem              <int> 2238463, 2238463, 2238463, 2238463, 2243682, 224…
## $ hl_id                 <chr> NA, NA, NA, NA, "176", "176", "176", "183", "183…
## $ hydroseq              <int> 19980, 19980, 19980, 19980, 19979, 19979, 19979,…
## $ hl_uri                <chr> NA, NA, NA, NA, "HUC12-010100100101", "HUC12-010…
## $ hf_source             <chr> "NHDPlusV2", "NHDPlusV2", "NHDPlusV2", "NHDPlusV…
## $ hf_id                 <dbl> 4287171, 4287061, 4287169, 4287115, 816563, 8164…
## $ lengthkm              <dbl> 3.695258, 3.695258, 3.695258, 3.695258, 5.135779…
## $ areasqkm              <dbl> 6.786451, 6.786451, 6.786451, 6.786451, 13.28626…
## $ tot_drainage_areasqkm <dbl> 6.786451, 6.786451, 6.786451, 6.786451, 51.65776…
## $ type                  <chr> "terminal", "terminal", "terminal", "terminal", …
## $ vpu                   <chr> "01", "01", "01", "01", "01", "01", "01", "01", …
## $ hf_hydroseq           <dbl> 150061851, 150030537, 150035911, 150045246, 1500…

In the above table you’ll see that every relationship between the features in the current hydrofabric, the source hydrofabric, and all hydrolocations has been exploded into a “many-to-many” table.

For example, we can look at the flowpath wb-1002 and find that it is defined by the aggregation of NHDPlusV2 COMIDs 4290271 and 4290437.

glimpse(filter(net, id == "wb-1002"))
## Rows: 2
## Columns: 16
## $ id                    <chr> "wb-1002", "wb-1002"
## $ toid                  <chr> "nex-294", "nex-294"
## $ divide_id             <chr> "cat-1002", "cat-1002"
## $ ds_id                 <dbl> NA, NA
## $ mainstem              <int> 2239831, 2239831
## $ hl_id                 <chr> NA, NA
## $ hydroseq              <int> 1927, 1927
## $ hl_uri                <chr> NA, NA
## $ hf_source             <chr> "NHDPlusV2", "NHDPlusV2"
## $ hf_id                 <dbl> 4290271, 4290437
## $ lengthkm              <dbl> 6.291308, 6.291308
## $ areasqkm              <dbl> 23.06835, 23.06835
## $ tot_drainage_areasqkm <dbl> 54.84601, 54.84601
## $ type                  <chr> "nexus", "nexus"
## $ vpu                   <chr> "01", "01"
## $ hf_hydroseq           <dbl> 150045065, 150035814

Or, that the terminal outflow of ‘HUC12-010100100101’ occurs at tnx-1000000019, which is fed by an aggregate flowpath made up of three source flowpaths (COMID = {816563, 816417, 816415}).

glimpse(filter(net, hl_uri == 'HUC12-010100100101'))
## Rows: 3
## Columns: 16
## $ id                    <chr> "wb-2462", "wb-2462", "wb-2462"
## $ toid                  <chr> "tnx-1000000019", "tnx-1000000019", "tnx-1000000…
## $ divide_id             <chr> "cat-2462", "cat-2462", "cat-2462"
## $ ds_id                 <dbl> NA, NA, NA
## $ mainstem              <int> 2243682, 2243682, 2243682
## $ hl_id                 <chr> "176", "176", "176"
## $ hydroseq              <int> 19979, 19979, 19979
## $ hl_uri                <chr> "HUC12-010100100101", "HUC12-010100100101", "HUC…
## $ hf_source             <chr> "NHDPlusV2", "NHDPlusV2", "NHDPlusV2"
## $ hf_id                 <dbl> 816563, 816417, 816415
## $ lengthkm              <dbl> 5.135779, 5.135779, 5.135779
## $ areasqkm              <dbl> 13.28626, 13.28626, 13.28626
## $ tot_drainage_areasqkm <dbl> 51.65776, 51.65776, 51.65776
## $ type                  <chr> "terminal", "terminal", "terminal"
## $ vpu                   <chr> "01", "01", "01"
## $ hf_hydroseq           <dbl> 150027523, 150024646, 150066278
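
Before asking why the file is organized this way, it is worth quantifying how much duplication this explosion introduces. The quick sketch below is illustrative only; it uses the `net` object read above and plain dplyr to count how many rows each flowpath id occupies.

# How many exploded rows does each flowpath id occupy? (most-duplicated first)
net %>% 
  count(id, sort = TRUE) %>% 
  head()

# Total rows vs. distinct flowpath ids in the network table
nrow(net)
n_distinct(net$id)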

Why do it this way?

1. Speed

system.time({
  net = read_parquet('s3://nextgen-hydrofabric/pre-release/conus_net.parquet')
})

#    user  system elapsed 
#  2.180   2.123  41.288 
Deferred Evaluation

The times you would want to open the entire network are limited; more often, we are interested in a small piece of the data. Opening the file as a dataset defers evaluation, so filters are applied before the data are collected:

system.time({
  net = open_dataset('s3://nextgen-hydrofabric/pre-release/conus_net.parquet') %>% 
    filter(id == "wb-1002") %>% 
    collect()
})

#   user  system elapsed 
#  2.294   2.588  24.261

2. Feature Identification

By location
here = AOI::geocode("National Water Center, Alabama", pt = TRUE)
  
x = dataRetrieval::findNLDI(location = here)

(filter(net, hf_id == x$origin$comid))
## # A tibble: 1 × 16
##   id       toid  divide_id ds_id mainstem hl_id hydroseq hl_uri hf_source  hf_id
##   <chr>    <chr> <chr>     <dbl>    <int> <chr>    <int> <chr>  <chr>      <dbl>
## 1 wb-1649… nex-… cat-1649…    NA  1717455 NA        4370 NA     NHDPlusV2 1.58e7
## # ℹ 6 more variables: lengthkm <dbl>, areasqkm <dbl>,
## #   tot_drainage_areasqkm <dbl>, type <chr>, vpu <chr>, hf_hydroseq <dbl>
For a gage in California
hl = filter(net, hl_uri == "Gages-11123000")

(slice_max(hl, hf_hydroseq))
## # A tibble: 1 × 16
##   id       toid  divide_id ds_id mainstem hl_id hydroseq hl_uri hf_source  hf_id
##   <chr>    <chr> <chr>     <dbl>    <int> <chr>    <int> <chr>  <chr>      <dbl>
## 1 wb-3400… nex-… cat-3400…    NA  1839328 3569     30250 Gages… NHDPlusV2 1.76e7
## # ℹ 6 more variables: lengthkm <dbl>, areasqkm <dbl>,
## #   tot_drainage_areasqkm <dbl>, type <chr>, vpu <chr>, hf_hydroseq <dbl>
For the HUC12 of Atascadero Creek in Santa Barbara:
hl = filter(net, hl_uri == "HUC12-180600130201")

(wbd = slice_max(hl, hf_hydroseq))
## # A tibble: 1 × 16
##   id       toid  divide_id ds_id mainstem hl_id hydroseq hl_uri hf_source  hf_id
##   <chr>    <chr> <chr>     <dbl>    <int> <chr>    <int> <chr>  <chr>      <dbl>
## 1 wb-3400… nex-… cat-3400…    NA  1841917 2281     16722 HUC12… NHDPlusV2 1.76e7
## # ℹ 6 more variables: lengthkm <dbl>, areasqkm <dbl>,
## #   tot_drainage_areasqkm <dbl>, type <chr>, vpu <chr>, hf_hydroseq <dbl>
For the dam on Horsetooth Reservoir
hl = filter(net, hl_uri == "NID-CO01659-1")

(wbd = slice_max(hl, hf_hydroseq))
## # A tibble: 1 × 16
##   id       toid  divide_id ds_id mainstem hl_id hydroseq hl_uri hf_source  hf_id
##   <chr>    <chr> <chr>     <dbl>    <int> <chr>    <int> <chr>  <chr>      <dbl>
## 1 wb-3314… nex-… cat-3314…    NA  1223227 7253     11610 NID-C… NHDPlusV2 2.90e6
## # ℹ 6 more variables: lengthkm <dbl>, areasqkm <dbl>,
## #   tot_drainage_areasqkm <dbl>, type <chr>, vpu <chr>, hf_hydroseq <dbl>
By known COMID
filter(net, hf_id == 101)
## # A tibble: 1 × 16
##   id        toid  divide_id ds_id mainstem hl_id hydroseq hl_uri hf_source hf_id
##   <chr>     <chr> <chr>     <dbl>    <int> <chr>    <int> <chr>  <chr>     <dbl>
## 1 wb-26052… nex-… cat-2605…    NA  2603469 281        120 HUC12… NHDPlusV2   101
## # ℹ 6 more variables: lengthkm <dbl>, areasqkm <dbl>,
## #   tot_drainage_areasqkm <dbl>, type <chr>, vpu <chr>, hf_hydroseq <dbl>

3. Topo sorting!

A national network provides an opportunity for rapid network navigation and subsetting. The structure of the CONUS network file also facilitates this, if a few precautions are taken.

While the “many-to-many” relationship of the network file allows for fast indexing and discovery, the duplication of feature IDs is not conducive to graph-based approaches.

# Most downstream feature associated with this HUC12 hydrolocation
(wbd = filter(net, hl_uri == "HUC12-180600130201") %>% 
   select(id, toid, hl_uri, vpu, hydroseq) %>% 
   slice_max(hydroseq, with_ties = TRUE)) 
## # A tibble: 8 × 5
##   id         toid        hl_uri             vpu   hydroseq
##   <chr>      <chr>       <chr>              <chr>    <int>
## 1 wb-3400759 nex-3400758 HUC12-180600130201 18       16731
## 2 wb-3400759 nex-3400758 HUC12-180600130201 18       16731
## 3 wb-3400759 nex-3400758 HUC12-180600130201 18       16731
## 4 wb-3400759 nex-3400758 HUC12-180600130201 18       16731
## 5 wb-3400759 nex-3400758 HUC12-180600130201 18       16731
## 6 wb-3400759 nex-3400758 HUC12-180600130201 18       16731
## 7 wb-3400759 nex-3400758 HUC12-180600130201 18       16731
## 8 wb-3400759 nex-3400758 HUC12-180600130201 18       16731

Selecting just the topology columns from these duplicated rows would be impractical (and wrong for graph operations):

(filter(net, vpu == wbd$vpu[1]) %>% 
  select(id, toid, divide_id))
## # A tibble: 170,349 × 3
##    id         toid           divide_id  
##    <chr>      <chr>          <chr>      
##  1 wb-3378885 tnx-1000012409 cat-3378885
##  2 wb-3378885 tnx-1000012409 cat-3378885
##  3 wb-3378885 tnx-1000012409 cat-3378885
##  4 wb-3378885 tnx-1000012409 cat-3378885
##  5 wb-3378885 tnx-1000012409 cat-3378885
##  6 wb-3378885 tnx-1000012409 cat-3378885
##  7 wb-3378885 tnx-1000012409 cat-3378885
##  8 wb-3378885 tnx-1000012409 cat-3378885
##  9 wb-3378885 tnx-1000012409 cat-3378885
## 10 wb-3378885 tnx-1000012409 cat-3378885
## # ℹ 170,339 more rows

Thus, you NEED to make sure that operations running on the network file are working with rows that are distinct with respect to the task at hand.

In the case of hydrosequencing, these are unique id/toid pairs

(wbd_net = filter(net, vpu == wbd$vpu[1]) %>% 
  select(id, toid, divide_id) %>% 
  distinct())
## # A tibble: 65,280 × 3
##    id         toid           divide_id  
##    <chr>      <chr>          <chr>      
##  1 wb-3378885 tnx-1000012409 cat-3378885
##  2 wb-3378890 tnx-1000013027 cat-3378890
##  3 wb-3378892 tnx-1000012410 cat-3378892
##  4 wb-3378895 tnx-1000013028 cat-3378895
##  5 wb-3378974 tnx-1000013029 cat-3378974
##  6 wb-3380350 tnx-1000012411 cat-3380350
##  7 wb-3380601 tnx-1000012412 cat-3380601
##  8 wb-3380616 tnx-1000012414 cat-3380616
##  9 wb-3380617 tnx-1000013030 cat-3380617
## 10 wb-3380618 tnx-1000012415 cat-3380618
## # ℹ 65,270 more rows

With a complete, non-duplicated network, we can easily sort and subset based on one or more defined outlets:

(get_sorted(wbd_net, outlets = wbd$toid[1]))
## Warning in check_hy_outlets(x, fix = FALSE): Outlets don't follow hydroloom
## convention of 0 or '', not fixing.
## # A tibble: 20 × 3
##    id          toid        divide_id  
##    <chr>       <chr>       <chr>      
##  1 wb-3400765  nex-3400757 cat-3400765
##  2 nex-3400757 wb-3400757  NA         
##  3 wb-3400757  nex-3400758 cat-3400757
##  4 wb-3400769  nex-3400759 cat-3400769
##  5 nex-3400759 wb-3400759  NA         
##  6 wb-3400759  nex-3400758 cat-3400759
##  7 wb-3400760  nex-3400761 cat-3400760
##  8 nex-3400761 wb-3400761  NA         
##  9 wb-3400761  nex-3400758 cat-3400761
## 10 wb-3400768  nex-3400758 cat-3400768
## 11 wb-3400766  nex-3400762 cat-3400766
## 12 nex-3400762 wb-3400762  NA         
## 13 wb-3400762  nex-3400763 cat-3400762
## 14 wb-3400767  nex-3400764 cat-3400767
## 15 nex-3400764 wb-3400764  NA         
## 16 wb-3400764  nex-3400763 cat-3400764
## 17 wb-3400770  nex-3400763 cat-3400770
## 18 nex-3400763 wb-3400763  NA         
## 19 wb-3400763  nex-3400758 cat-3400763
## 20 nex-3400758 wb-3400758  NA

Complete Example:

system.time({
  wbd = filter(net, hl_uri == "HUC12-180600130201") %>% 
   select(toid,  vpu, hydroseq) %>% 
   slice_max(hydroseq, with_ties = FALSE)

 wbd_net = filter(net, vpu == wbd$vpu) %>% 
  select(id, toid, divide_id) %>% 
    distinct() %>% 
    get_sorted(outlets = wbd$toid)
})
## Warning in check_hy_outlets(x, fix = FALSE): Outlets don't follow hydroloom
## convention of 0 or '', not fixing.
##    user  system elapsed 
##   0.890   0.137   1.057
(ids = na.omit(unique(unlist(wbd_net))))
##  [1] "wb-3400765"  "nex-3400757" "wb-3400757"  "wb-3400769"  "nex-3400759"
##  [6] "wb-3400759"  "wb-3400760"  "nex-3400761" "wb-3400761"  "wb-3400768" 
## [11] "wb-3400766"  "nex-3400762" "wb-3400762"  "wb-3400767"  "nex-3400764"
## [16] "wb-3400764"  "wb-3400770"  "nex-3400763" "wb-3400763"  "nex-3400758"
## [21] "wb-3400758"  "cat-3400765" "cat-3400757" "cat-3400769" "cat-3400759"
## [26] "cat-3400760" "cat-3400761" "cat-3400768" "cat-3400766" "cat-3400762"
## [31] "cat-3400767" "cat-3400764" "cat-3400770" "cat-3400763"
## attr(,"na.action")
## [1] 23
## attr(,"class")
## [1] "omit"

4. Easy integration with SQL

Once you have a set of IDs, the SQL backing of a GPKG allows features to be extracted quickly based on that ID set.

# Download (or reuse a cached copy of) the NextGen geopackage for this VPU
(gpkg = get_fabric(VPU = wbd$vpu, cache_dir = "cihro-data"))
## cihro-data/nextgen_18.gpkg
db <- dbConnect(SQLite(), gpkg)

t = tbl(db, "divides") %>%
      filter(if_any(any_of(c('divide_id', 'id', 'ds_id')), ~ . %in% ids)) %>%
      collect()

st_as_sf(t, crs = 5070) %>%  
  mapview()

Subsetting!

Now, while that is conceptually straightforward, doing it repeatedly is annoying. Equally, writing out the logic for every layer is a loop that is prone to easy errors (a sketch of that loop follows).
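
For reference, a hand-rolled version of that per-layer loop might look like the sketch below. It is illustrative only: it assumes the open `db` connection and the `ids` vector from the SQL example above, and that the ID columns checked cover every layer of interest.

# Sketch of the per-layer loop that subset_network wraps
# (illustrative only; reuses `db` and `ids` from the SQL example above)
lyrs = c("divides", "nexus", "flowpaths", "flowpath_attributes")

subsets = lapply(lyrs, function(l) {
  tbl(db, l) %>%
    # keep rows whose id columns (whichever exist in this layer) match our set
    filter(if_any(any_of(c("divide_id", "id", "ds_id")), ~ . %in% ids)) %>%
    collect()
})
names(subsets) = lyrs
# spatial layers still need st_as_sf(), as shown above

Keeping the layer-specific ID columns and geometry handling straight across every table is exactly the bookkeeping that subset_network takes care of.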

In these cases, we provide a subset_network function that automates this process:

system.time({
  xx = subset_network(id = wbd$toid, 
                      lyrs = c("divides", "nexus", "flowpaths", "lakes", "flowpath_attributes"),
                      cache_dir = "cihro-data")
})
## Warning: There was 1 warning in `dplyr::filter()`.
##  In argument: `hf_id == comid`.
## Caused by warning in `hf_id == comid`:
## ! longer object length is not a multiple of shorter object length
## Starting from: `wb-3400757`
## Subsetting: divides (1/5)
## Subsetting: nexus (2/5)
## Subsetting: flowpaths (3/5)
## Subsetting: lakes (4/5)
## Subsetting: flowpath_attributes (5/5)
##    user  system elapsed 
##   5.082   1.045  15.140
lapply(xx, names)
## $divides
##  [1] "fid"                   "geom"                  "divide_id"            
##  [4] "toid"                  "type"                  "ds_id"                
##  [7] "areasqkm"              "id"                    "lengthkm"             
## [10] "tot_drainage_areasqkm" "has_flowline"         
## 
## $nexus
## [1] "fid"    "geom"   "id"     "toid"   "hl_id"  "hl_uri" "type"  
## 
## $flowpaths
##  [1] "fid"                   "geom"                  "id"                   
##  [4] "toid"                  "mainstem"              "order"                
##  [7] "hydroseq"              "lengthkm"              "areasqkm"             
## [10] "tot_drainage_areasqkm" "has_divide"            "divide_id"            
## 
## $lakes
##  [1] "fid"          "geom"         "id"           "toid"         "hl_id"       
##  [6] "hl_reference" "hl_link"      "hl_uri"       "Dam_Length"   "ifd"         
## [11] "LkArea"       "LkMxE"        "OrificeA"     "OrificeC"     "OrificeE"    
## [16] "time"         "WeirC"        "WeirE"        "WeirL"       
## 
## $flowpath_attributes
##  [1] "fid"                  "id"                   "rl_gages"            
##  [4] "rl_NHDWaterbodyComID" "Qi"                   "MusK"                
##  [7] "MusX"                 "n"                    "So"                  
## [10] "ChSlp"                "BtmWdth"              "time"                
## [13] "Kchan"                "nCC"                  "TopWdthCC"           
## [16] "TopWdth"              "length_m"

Example for the Poudre River!

system.time({
  xx = subset_network(hl_uri = 'Gages-06752260', 
                      lyrs = c("divides", "nexus", "flowpaths"),
                      cache_dir = "cihro-data",
                      outfile = "cihro-data/poudre-subset.gpkg")
})
## Starting from: `nex-278699`
## Subsetting: divides (1/3)
## Subsetting: nexus (2/3)
## Subsetting: flowpaths (3/3)
## Deleting layer `layer_styles' using driver `GPKG'
##    user  system elapsed 
##   3.882   1.330  14.254
st_layers(xx)
## Driver: GPKG 
## Available layers:
##     layer_name     geometry_type features fields             crs_name
## 1      divides     Multi Polygon      419      9 NAD83 / Conus Albers
## 2        nexus             Point      171      5 NAD83 / Conus Albers
## 3    flowpaths Multi Line String      419     10 NAD83 / Conus Albers
## 4 layer_styles                NA        3     12                 <NA>

Open in QGIS

CLI Option

For those interested in using the NOAA NextGen fabric as-is, we have provided a Go-based CLI here.

This utility has the following syntax:

hfsubset - Hydrofabric Subsetter

Usage:
  hfsubset [OPTIONS] identifiers...
  hfsubset (-h | --help)

Example:
  hfsubset -l divides -o ./poudre-divides.gpkg -r "pre-release" -t hl "Gages-06752260"
  hfsubset -o ./poudre-all.gpkg -t hl "Gages-06752260"

Options:
  -l string
        Layers to subset (default "divides,nexus,flowpaths,network,hydrolocations")
  -o string
        Output file name (default "hydrofabric.gpkg")
  -r string
        Hydrofabric version (default "pre-release")
  -t string
        One of: hf, hl, or comid (default "hf")

NextGen Needs GeoJSON: Final example

While GPKG support is the end goal for NextGen, it currently requires GeoJSON and CSV inputs.

Fortunately, ogr2ogr provides an easy way to extract these layers, in the required formats, from the GPKG file.

Here is an end-to-end example of extracting a subset for a hydrolocation using the CLI and generating the files NextGen needs:

cd <WHEREVER YOU WANT!>
mkdir poudre
cd poudre

hfsubset -l divides,nexus,flowpaths,flowpath_attributes -o ./poudre-subset.gpkg -r "pre-release" -t hl "Gages-06752260"

ogr2ogr -f GeoJSON flowpaths.geojson poudre-subset.gpkg flowpaths
ogr2ogr -f GeoJSON catchments.geojson poudre-subset.gpkg divides
ogr2ogr -f GeoJSON nexus.geojson poudre-subset.gpkg nexus
ogr2ogr -f CSV flowpath_attributes.csv poudre-subset.gpkg flowpath_attributes

ls