Commit 10ac793

updated ccss notebook example based on the new structure (#25)
Merging Irene's changes with the plotting function.
1 parent e248e86 commit 10ac793

3 files changed

Lines changed: 61 additions & 150 deletions

examples/collect_observations/ccss_swe_collect_observations.ipynb

Lines changed: 59 additions & 148 deletions
@@ -4,11 +4,35 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-  "![NWM](img/NWM.png)\n",
+  " <p align=\"center\">\n",
+  " <img\n",
+  " src=\"../img/2025_01_02_NS_173_Snow_Survey.jpg\"\n",
+  " alt=\"CCSS snow survey at Phillips Station\"\n",
+  " width=\"1000\"\n",
+  " height=\"600\"\n",
+  " />\n",
+  " </p>\n",
   "\n",
-  "# Retrieve and Analyze National Water Model Snow Data for a Watershed of Interest\n",
-  "Authors: Irene Garousi-Nejad (igarousi@cuahsi.org)\n",
-  "Last updated:Oct 16, 2025"
+  " <p align=\"center\">\n",
+  " <span style=\"color: gray;\"><em>\n",
+  " Source: California Department of Water Resources. California Department of Water Resources staff members Manon von Kaenel (left), Jordan Thoennes, and Andy Reising conduct the first media snow survey of the 2025 season at Phillips Station in the Sierra Nevada, about 90 miles east of Sacramento off Highway 50 in El Dorado County. Photo taken January 2, 2025. Image obtained from\n",
+  " <a href=\"https://pixel-ca-dwr.photoshelter.com/galleries/C0000lSOVLm_Sxso/G0000CdIWl5a2_xA/I0000OkEmK_kC73A/2025-01-02-NS-173-Snow-Survey-jpg\">this link</a>.\n",
+  " </em></span>\n",
+  " </p>\n",
+  "\n",
+  "# Retrieve California Cooperative Snow Surveys (CCSS) Data for a Watershed of Interest\n",
+  "\n",
+  " **Author:** Irene Garousi-Nejad ([igarousi@cuahsi.org](mailto:igarousi@cuahsi.org))\n",
+  " **Last updated:** March 24, 2026"
+ ]
+},
+{
+ "cell_type": "markdown",
+ "id": "622da967",
+ "metadata": {},
+ "source": [
+  "#### Introduction\n",
+  "This notebook demonstrates how to access California Cooperative Snow Surveys (CCSS) data. CCSS is a program led by the California Department of Water Resources (DWR) to support its water supply forecasting and flood management missions."
  ]
 },
 {
@@ -22,7 +46,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-  "Ensure that the `nwm_env` conda environment is selected as your Jupyter kernel. This environment should already be created if you followed the instructions under section \"Creating your HydroLearnEnv Virtual Environment\" in the `getting_started.md` file."
+  "Ensure that the `cssi_evaluation` conda environment is selected as your Jupyter kernel. This environment should already exist if you followed the instructions under the section \"Creating your HydroLearnEnv Virtual Environment\" in the `GettingStarted.md` file."
  ]
 },
 {
@@ -43,55 +67,29 @@
  "import os\n",
  "import time\n",
  "import sys\n",
+ "from pathlib import Path\n",
  "\n",
  "prefix = os.environ['CONDA_PREFIX']\n",
  "os.environ['PROJ_LIB'] = os.path.join(prefix, 'share', 'proj')\n",
  "\n",
  "# add the src directory to the path so we can import evaluation modules\n",
- "sys.path.append('../../src/')\n",
+ "repo_root = Path.cwd().resolve().parents[1]\n",
+ "sys.path.insert(0, str(repo_root / \"src\"))\n",
  "\n",
  "import pyproj\n",
  "import pandas as pd\n",
  "import xarray as xr\n",
  "import geopandas as gpd\n",
  "from dask.distributed import Client\n",
  "\n",
- "from cssi_evaluation.utils import plot_utils\n",
- "from cssi_evaluation.utils import dataPrep_utils\n",
+ "from cssi_evaluation.utils import plot_utils, dataPrep_utils\n",
  "from cssi_evaluation.external_data_access import observation_utils\n",
  "\n",
  "%load_ext autoreload\n",
  "%autoreload 2"
 ]
},
-{
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-  "We'll use dask to parallelize our code. To manage parallel computation and visualize progress of long-running tasks, we initialize a Dask “cluster,” which defines how many workers are used and how much computing power each worker has. \n",
-  "\n",
-  "In this setup, we create a Dask client with `Client(n_workers=6, threads_per_worker=1, memory_limit='2GB')`, which launches a cluster with 6 workers. Each worker uses a single thread, typically mapped to one CPU core, allowing for efficient parallel processing across 6 cores. Each worker also has a memory limit of 2 GB, for a total of up to 12 GB across the cluster.\n"
- ]
-},
-{
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
-  "tags": []
- },
- "outputs": [],
- "source": [
-  "# use a try accept loop so we only instantiate the client\n",
-  "# if it doesn't already exist.\n",
-  "try:\n",
-  "    print('Dashboard link:', client.dashboard_link)\n",
-  "except: \n",
-  "    # The client should be customized to your workstation resources.\n",
-  "    client = Client(n_workers=6, threads_per_worker=1, memory_limit='2GB') \n",
-  "    print('Dashboard link:', client.dashboard_link)\n",
-  "print(client)"
- ]
-},
 {
  "cell_type": "markdown",
  "metadata": {},
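The `repo_root = Path.cwd().resolve().parents[1]` idiom added in this hunk hard-codes the notebook's depth in the tree. As a sketch only (the `find_repo_root` helper below is hypothetical, not part of this commit), walking upward until a `src/` directory appears is less brittle if the notebook ever moves:

```python
import sys
from pathlib import Path


def find_repo_root(start: Path, marker: str = "src") -> Path:
    """Walk upward from `start` and return the first directory that
    contains `marker`; fall back to the fixed two-levels-up assumption."""
    resolved = start.resolve()
    for candidate in [resolved, *resolved.parents]:
        if (candidate / marker).is_dir():
            return candidate
    return resolved.parents[1]


# Usage mirroring the notebook cell:
# repo_root = find_repo_root(Path.cwd())
# sys.path.insert(0, str(repo_root / "src"))
```

Either way, `sys.path.insert(0, ...)` makes the in-repo `cssi_evaluation` package win over any installed copy, which is presumably why the commit switches from `append` to `insert`.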
@@ -107,25 +105,21 @@
 },
 "outputs": [],
 "source": [
+ "# repository path\n",
+ "repo_root = Path.cwd().resolve().parents[1]\n",
+ "\n",
  "# path to the model domain data\n",
- "domain_data_path = 'examples/nwm/domain_data/' \n",
+ "domain_data_path = f\"{repo_root}/examples/nwm/domain_data/\" \n",
  "\n",
  "# Path to the watershed shapefile\n",
- "watershed = f\"{domain_data_path}TolumneRiver_18040009.shp\"\n",
- "\n",
- "# file retrieved using `curl -L https://raw.githubusercontent.com/egagli/snotel_ccss_stations/main/all_stations.geojson -o all_stations.geojson`\n",
- "snotel_geojson = f\"{domain_data_path}all_stations.geojson\"\n",
- "\n",
- "# Path to NWM snow data\n",
- "conus_bucket_url = 's3://noaa-nwm-retrospective-3-0-pds/CONUS/zarr/ldasout.zarr'\n",
+ "watershed_path = f\"{domain_data_path}TolumneRiver_18040009.shp\"\n",
  "\n",
  "# Start and end times of a water year (note that this code currently works for one water year)\n",
  "StartDate = '2018-10-01'\n",
  "EndDate = '2020-09-30'\n",
  "\n",
  "# Path to save results (obs and mod stands for observation and modeled, respectively)\n",
- "OBS_OutputFolder = 'examples/nwm/obs_outputs' \n",
- "MOD_OutputFolder = 'examples/nwm/mod_outputs'"
+ "OBS_OutputFolder = './cssi_outputs' "
 ]
},
{
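The commit drops the old `os.path.exists`/`os.mkdir` cell but keeps `OBS_OutputFolder` as a bare string. A minimal sketch of the configuration step with `pathlib` (names mirror the notebook; the idempotent `mkdir` is an assumption about how later cells use the folder, not code from this commit):

```python
from datetime import date
from pathlib import Path

# Same analysis window as the notebook cell.
StartDate = "2018-10-01"
EndDate = "2020-09-30"

# Sanity-check the window before any downloads kick off.
start, end = date.fromisoformat(StartDate), date.fromisoformat(EndDate)
assert start < end, "StartDate must precede EndDate"

# Idempotent output-folder creation: safe to re-run the cell.
obs_dir = Path("./cssi_outputs")
obs_dir.mkdir(parents=True, exist_ok=True)
```

Note that the comment in the diff says the code works for one water year, while the dates above span two (WY2019 and WY2020); worth reconciling in a follow-up.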
@@ -159,8 +153,8 @@
  "all_stations_gdf = all_stations_gdf[all_stations_gdf['csvData']==True]\n",
  "filtered_all_stations_gdf = all_stations_gdf[all_stations_gdf.state.str.contains('California')] \n",
  "\n",
- "# Extract the bounding box coordinates of a watershed\n",
- "watershed_gdf = gpd.read_file(os.path.join(os.getcwd(), watershed)).to_crs(epsg=4326)\n",
+ "# Read the watershed shapefile and standardize to WGS84\n",
+ "watershed_gdf = gpd.read_file(watershed_path).to_crs(epsg=4326)\n",
  "\n",
  "# Combine all polygons into a single MultiPolygon\n",
  "watershed_union = watershed_gdf.geometry.unary_union\n",
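The cell above reads the watershed with geopandas and dissolves it before selecting stations; the spatial selection itself is outside this hunk. Conceptually it is a bounding-box test like the stdlib sketch below (the `stations_in_bbox` helper and the station tuples are hypothetical, for illustration only):

```python
def stations_in_bbox(stations, bounds):
    """Keep (code, lon, lat) tuples whose point falls inside
    bounds = (minx, miny, maxx, maxy), the layout returned by
    GeoDataFrame.total_bounds in EPSG:4326."""
    minx, miny, maxx, maxy = bounds
    return [
        (code, lon, lat)
        for code, lon, lat in stations
        if minx <= lon <= maxx and miny <= lat <= maxy
    ]
```

A point-in-bbox pass is a cheap pre-filter; the notebook's `gdf_in_bbox` can then be refined with an exact point-in-polygon test against `watershed_union` if needed.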
@@ -178,6 +172,16 @@
  "Plot these sites on a map. Then, hover over the pins to see the site names."
 ]
},
+{
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c535a38c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+  "gdf_in_bbox"
+ ]
+},
{
 "cell_type": "code",
 "execution_count": null,
@@ -186,9 +190,8 @@
 },
 "outputs": [],
 "source": [
- "## TODO: REPLACE WITH CSSI_EVALUATION.PLOTS FUNCTIONS\n",
- "\n",
- "m = plot_utils.plot_sites_within_domain(gdf_in_bbox, watershed_gdf, zoom_start=9)\n",
+ "# Plot the sites within the watershed boundary using the plot_utils function\n",
+ "m = plot_utils.map_sites_within_watershed(gdf_in_bbox, watershed_gdf, zoom_start=9) \n",
  "m"
 ]
},
@@ -215,7 +218,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "The following uses the `nwm_utils.py` script to download observed data for the sites within the domain. Since all the sites are from the (California Cooperative Snow Survey) CCSS network, we use the `getCCSSData` function from the module to get data. "
+ "The following uses the `observation_utils.py` script to download observed data for the sites within the domain. Since all the sites belong to the California Cooperative Snow Surveys (CCSS) network, we use the module's `getCCSSData` function to retrieve the data. "
 ]
},
{
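The exact signature of `getCCSSData` is not visible in this diff, so the sketch below hides it behind a generic `fetch` callable; only the per-station download-with-retry pattern is the point (hypothetical helper, not from the repository):

```python
import time


def download_observations(codes, fetch, retries=3, delay=1.0):
    """Fetch observations for each station code with a simple retry loop.

    `fetch` stands in for a network call such as
    observation_utils.getCCSSData (exact signature not shown in this
    commit). Transient failures are retried `retries` times."""
    results = {}
    for code in codes:
        for attempt in range(1, retries + 1):
            try:
                results[code] = fetch(code)
                break
            except Exception:
                if attempt == retries:
                    raise  # give up after the final attempt
                time.sleep(delay)
    return results
```

Wrapping the real call this way keeps one flaky station from aborting the whole loop on its first hiccup.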
@@ -259,109 +262,17 @@
 },
 {
  "cell_type": "markdown",
- "metadata": {
-  "tags": []
- },
- "source": [
-  "### 4. Retrieve Snow Model Outputs"
- ]
-},
-{
- "cell_type": "markdown",
+ "id": "7e4506d2",
  "metadata": {},
- "source": [
-  "NOAA shares inputs and outputs to the National Water Model retrospective simulations version 3 at <a href=\"https://noaa-nwm-retrospective-3-0-pds.s3.amazonaws.com/index.html\" style=\"color: blue; background-color: snow;\">https://noaa-nwm-retrospective-3-0-pds.s3.amazonaws.com/index.html</a>. The following code uses `fsspec` and `xarray` Python libraries to load the Zarr metadata of snow outputs (**ldasout.zarr**) into memory. Once the code is executed, you can see the wall time, which includes time spent waiting for I/O operations, such as reading data from a remote server. In our case, it took about 12 seconds to load the metadata into memory. Set up `Dask`, a parallel computing library, to enable performing operations on large datasets that don't fit into memory by breaking them into smaller, manageable pieces called chunks."
- ]
+ "source": []
},
{
 "cell_type": "code",
 "execution_count": null,
- "metadata": {
-  "tags": []
- },
- "outputs": [],
- "source": [
-  "%%time \n",
-  "ds = xr.open_zarr(\n",
-  "    store=conus_bucket_url,\n",
-  "    consolidated=True,\n",
-  "    storage_options={\n",
-  "        \"anon\": True,\n",
-  "        \"client_kwargs\": {\"region_name\": \"us-east-1\"}\n",
-  "    }\n",
-  ")"
- ]
-},
-{
- "cell_type": "markdown",
+ "id": "208439e6",
 "metadata": {},
- "source": [
-  "The following code retrieves NWM SWE output for each SNOTEL site within our watershed. For each site, it first converts latitude and longitude of the site to the projected coordinates used by the NWM. Then, it extracts the NWM SWE output for the site and the period of interest, saving the result as a DataFrame. Since NWM timestamps are in UTC, the DataFrame is converted to the local time zone to match SNOTEL observations for later comparison purposes. To fairly compare with SNOTEL, which reports SWE once daily at the start of the local day, the data is grouped by date, and the earliest record of each day is selected. Finally, the processed data is saved as a CSV file for each site."
- ]
-},
-{
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
-  "tags": []
- },
- "outputs": [],
- "source": [
-  "# Create a folder to save model outputs\n",
-  "isExist = os.path.exists(MOD_OutputFolder)\n",
-  "if isExist == True:\n",
-  "    exit\n",
-  "else:\n",
-  "    os.mkdir(MOD_OutputFolder)"
- ]
-},
-{
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
-  "tags": []
- },
 "outputs": [],
- "source": [
-  "# Retrieve model outputs for the location of snotel sites\n",
-  "input_crs = 'EPSG:4269'\n",
-  "output_crs = pyproj.CRS(ds.crs.esri_pe_string) \n",
-  "\n",
-  "for i in range(0, gdf_in_bbox.shape[0]):\n",
-  "    \n",
-  "    site_name = gdf_in_bbox.iloc[i][\"name\"]\n",
-  "    print(f'[{i+1}/{len(gdf_in_bbox)}] Retrieving model output for site: {site_name}')\n",
-  "    \n",
-  "    snotel_y, snotel_x = dataPrep_utils.convert_latlon_to_yx(\n",
-  "        gdf_in_bbox.iloc[i].latitude,\n",
-  "        gdf_in_bbox.iloc[i].longitude,\n",
-  "        input_crs,\n",
-  "        output_crs\n",
-  "    )\n",
-  "    \n",
-  "    dl_start_time = time.time()\n",
-  "    ds_subset = ds[['SNEQV']].sel(y=snotel_y, x=snotel_x, method='nearest'\n",
-  "        ).sel(time=slice(StartDate, EndDate)).compute()\n",
-  "    dl_elapsed = time.time() - dl_start_time\n",
-  "    print(f'✅ Retrieved model outputs for {site_name} in {dl_elapsed:.2f} seconds\\n')\n",
-  "    \n",
-  "    df = ds_subset.to_dataframe()\n",
-  "    df=df.drop(columns=['x', 'y'])\n",
-  "    df.reset_index(inplace=True)\n",
-  "    df[\"time\"] = pd.to_datetime(df[\"time\"])\n",
-  "    df.rename(columns={df.columns[0]:'Date', df.columns[1]:'NWM_SWE_meters'}, inplace=True)\n",
-  "    df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: pd.to_numeric(x)/1000) # convert mm to m \n",
-  "\n",
-  "    # convert utc to local time zone\n",
-  "    df_local = dataPrep_utils.convert_utc_to_local(gdf_in_bbox.iloc[i].state, df) \n",
-  "    \n",
-  "    # groupby the data and select the first item from each group \n",
-  "    df_local.index = pd.to_datetime(df_local['Date_Local'])\n",
-  "    df_local = df_local.groupby(pd.Grouper(freq='D')).first()\n",
-  "\n",
-  "    # save\n",
-  "    df_local.to_csv(f'./{MOD_OutputFolder}/df_{gdf_in_bbox.iloc[i].code}_{gdf_in_bbox.iloc[i].state}_NWM.csv', index=False)"
- ]
+ "source": []
 }
],
"metadata": {

src/cssi_evaluation/utils/plot_utils.py

Lines changed: 2 additions & 2 deletions
@@ -12,7 +12,7 @@
 # nwm_utils.plot_custom_scatter_SWE()
 # nwm_utils.comparison_plots()
 # nwm_utils.plot_grid_vector_monthly_data()
-# nwm_utils.plot_sites_within_domain() overlaps with plot_obs_locations
+# nwm_utils.map_sites_within_watershed() overlaps with plot_obs_locations. # IG, Mar 24, 2026: I think plot_obs_locations() and plot_sites_within_domain() serve different purposes.
 # nwm_utils.plot_grid_vector_data()
 # plots.plot_metric_map()
 # plots.plot_obs_locations()
@@ -515,7 +515,7 @@ def plot_condon_diagram(metrics_df, variable, output_dir="."):
 # from Irene's nwm_utils.py


-def plot_sites_within_domain(gdf_sites, domain_gdf, zoom_start=10):
+def map_sites_within_watershed(gdf_sites, domain_gdf, zoom_start=10):
     """
     Create and return a folium map showing observation sites within a given watershed boundary.
