diff --git a/gedi/README.md b/gedi/README.md
index 6bdbcb3..20da90e 100644
--- a/gedi/README.md
+++ b/gedi/README.md
@@ -1,4 +1,126 @@
 # GEDI
-This folder is intended for code relating to data from the [GEDI](https://gedi.umd.edu/) mission. The Level 1B, Level 2A, and Level 2B data from GEDI are of particular interest for our work here.
+This folder is intended for code relating to data from the [GEDI](https://gedi.umd.edu/) mission.
+The Level 2A data from GEDI are of particular interest for our work here.
+This dataset allows us to measure canopy heights for large swaths of land at approximately 30m resolution.
 
-Currently, we are using the rGEDI package (see: [CRAN](https://cran.r-project.org/web/packages/rGEDI/index.html), [tutorial](https://cran.r-project.org/web/packages/rGEDI/vignettes/tutorial.html), [paper](https://www.researchgate.net/publication/339971349_rGEDI_An_R_Package_for_NASA%27s_Global_Ecosystem_Dynamics_Investigation_GEDI_Data_Visualizing_and_Processing), [source](https://github.com/carlos-alberto-silva/rGEDI)) to download and process the data from GEDI. As a result, a considerable amount of code here will likely be in R.
+## GEDI Retrieval
+The GEDI mission is a relatively new addition to the story of Earth observation, and it is still collecting data on the Earth's surface from aboard the International Space Station.
+As a result, the GEDI data are still in a fairly raw form.
+When downloaded, the entire Level 2A dataset available as of 3 February 2021 is nearly 40 TB.
+This is too big for many individuals to work with on their personal machines, and the time needed to download that much data would be too great for the highly adaptive workflow we are implementing.
+
+We must therefore develop a methodology that can quickly find, download, and assemble the appropriate data for our purposes.
+Here, we present such a methodology, which utilizes the [NASA LP DAAC GEDI Data Finder service](https://lpdaac.usgs.gov/news/release-gedi-finder-web-service/) and the [NASA EarthData Search application](https://earthdata.nasa.gov/search) to accomplish this task.
+
+## Dependencies
+* `io`
+* `os`
+* `h5py`
+* `numpy`
+* `pandas`
+* `zipfile`
+* `argparse`
+* `requests`
+* `geopandas`
+* `validators`
+* `List from typing`
+* `rmtree from shutil`
+* `Point from shapely.geometry`
+* `urlencode from urllib.parse`
+
+### GEDI Finder
+
+The GEDI Finder program (gediFinder.py) builds on NASA's LP DAAC GEDI Finder service and provides a fast, easy way to determine the appropriate GEDI granules for a specific bounding box.
+
+The user supplies the output directory and the bounding box of interest, and may optionally supply the GEDI data level and the name of the file created.
+If the optional arguments for the GEDI data level and the name of the output file are not supplied, the program uses the defaults of L2A and granule_list, respectively.
+
+A comma-separated list of every GEDI granule that intersects the bounding box of interest is written to the output file as a .txt file.
+This list of granules is used to select the appropriate granules to download in the NASA EarthData Search application.
+**To use this list in future steps, the user must open the file and copy the entire contents.**
+
+#### To Run
+`python gediFinder.py -d DirectoryPath -b ul_lat,ul_lon,lr_lat,lr_lon -l 2a` ([source](gediFinder.py))
+
+#### Arguments
+- `-d,--dir` : The directory to which the output file is written, formatted with a trailing slash, such that {dir}{fname} is a valid path, for fname a valid file name.
+- `-b,--bbox` : The bounding box of the region of interest, in the format ul_lat,ul_lon,lr_lat,lr_lon.
+- `-l,--level` *(optional)* : The data product level of interest. Acceptable values at this time are 1B, 2A, or 2B. The default if none is given is *2A*.
+- `-o,--outfile` *(optional)* : The stem of the name of the output file, without file extension. The default if none is given is *granule_list*.
+
+#### Output
+Text file with a comma-separated list of GEDI granules that intersect the user-defined bounding box.
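+
+Under the hood, the script simply issues an HTTP GET request to the LP DAAC GEDI Finder service; for an illustrative bounding box, the constructed request takes the form `https://lpdaacsvc.cr.usgs.gov/services/gedifinder?product=GEDI02_A&version=001&bbox=44.0,-124.0,43.0,-123.0&output=json`.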
+
+### EarthData Search
+
+Pages 2-4 of the [GEDI Spatial Querying and Subsetting Quick Guide](https://lpdaac.usgs.gov/documents/635/GEDI_Quick_Guide.pdf) provide a great overview of the process we implement for data collection with the [NASA EarthData Search application](https://earthdata.nasa.gov/search).
+The following is pasted **directly** from the [GEDI Spatial Querying and Subsetting Quick Guide](https://lpdaac.usgs.gov/documents/635/GEDI_Quick_Guide.pdf). Some modifications have been made for readability and context and are *emphasized*.
+
+> 1. **Access Earthdata Search**
+>
+> After *obtaining a comma-separated list of GEDI granules with GEDI Finder*, open [NASA Earthdata Search](https://search.earthdata.nasa.gov/). Sign in with Earthdata Login credentials or [register](https://urs.earthdata.nasa.gov/users/new) for a new account.
+>
+> Note: Currently there are no spatial searching capabilities for the GEDI Version 1 datasets in Earthdata Search.
+>
+> 2. **Search for Dataset**
+>
+> Search for a collection by entering the dataset short name *(e.g. GEDI02_A)* into the search box, then select the desired product from the list of matching collections.
+> All available granules for the product will be included in the list of matching granules.
+>
+> 3. **Search for Granules**
+>
+> Copy the list of comma-separated granule IDs *obtained with GEDI Finder* and paste it into the Granule Search box in Earthdata Search. Use the Enter key to initiate the search.
+>
+> 4. **Select spatial and/or layer parameters for GEDI granules**
+>
+> Click on the green Download All button to open the download and order menu. Under “Select Data Access Method,” select Customize.
+>
+> To set up the parameters for clipping out a smaller area of the granule, scroll down to the Spatial Subsetting section.
+> Check the box next to Click to Enable and enter the coordinates of the bounding box for the ROI.
+>
+> To select specific science dataset layers, scroll down to the Band Subsetting section.
+> Expand the directories and select the desired layers.
+> Additional information for each of the data layers can be found on the [GEDI01_B](https://doi.org/10.5067/GEDI/GEDI01_B.001), [GEDI02_A](https://doi.org/10.5067/GEDI/GEDI02_A.001), or [GEDI02_B](https://doi.org/10.5067/GEDI/GEDI02_B.001) Digital Object Identifier (DOI) landing page.
+>
+> 5. **Place Order**
+>
+> After the desired parameters for spatial and/or layer subsetting have been selected, click Done to complete the custom order form, then click Download Data to initiate the order.
+> When the data request is submitted, an order confirmation email is sent to the email address associated with the Earthdata login credentials or specified in the custom order form.
+>
+> 6. **Retrieve Data**
+>
+> A status update email for the data processing request will be delivered when the order has completed. The order completion email contains URLs for accessing the data outputs.
+> Please note that the URLs have an expiration date and are only valid for one week.
+
+There is one more step we implement to prepare the data for download and extraction with GEDI Combine.
+
+7. **Get List of Zip File URLs**
+
+Copy the URL ending with `.zip?1` from the order completion email and paste it into the web browser of your choosing.
+The ZIP file should begin downloading.
+Once the download has completed, unzip the file and locate the README file in the unzipped folder.
+This README file will be used as the `--textfile` argument when running the GEDI Combine script.
+
+### GEDI Combine
+The NASA EarthData Search application collects, clips, and processes the GEDI granules, creating a number of zip files of processed data to download.
+The gediCombine program we present here takes the list of urls for these zip files, downloads them, and extracts the data.
+
+This program cycles through the list of urls supplied, conducting the following operations to extract the data:
+* Download the zip file and unzip its contents
+* Extract the useful data from the files
+* Delete the HDF5 files
+* Append the data to a pandas DataFrame
+
+Once every zip file has been processed and deleted, the DataFrame is written to a csv, parquet, or GeoJSON file, depending on inputs from the user.
+
+#### To Run
+`python gediCombine.py -d DirectoryPath -t FilePath -b ul_lat,ul_lon,lr_lat,lr_lon -o gedi_output -f csv` ([source](gediCombine.py))
+
+#### Arguments
+- `-d,--dir` : The working directory where the zip files are downloaded and extracted and the output file is written, formatted with a trailing slash, such that {dir}{fname} is a valid path, for fname a valid file name.
+- `-t,--textfile` : The file path for the txt file containing the urls of the zip files, supplied by EarthData Search.
+- `-b,--bbox` : The bounding box of the region of interest, in the format ul_lat,ul_lon,lr_lat,lr_lon.
+- `-o,--outfile` *(optional)* : The stem of the name of the output file, without file extension. The default if none is given is *gedi_output*.
+- `-f,--filetype` *(optional)* : The type of file to output. Acceptable formats are: csv, parquet, GeoJSON. The default if none is given is *csv*.
+
+#### Output
+Generates a file of the chosen filetype with the Level 2A data retrieved from GEDI servers.
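+
+A quick way to inspect the combined output in Python (a minimal sketch, assuming the default csv filetype and the directory and output name from the run above):
+
+```python
+import pandas as pd
+
+# Load the combined Level 2A shots and their canopy height metrics.
+df = pd.read_csv("DirectoryPath/gedi_output.csv")
+print(len(df), "shots with columns:", list(df.columns))
+```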
diff --git a/gedi/gediFinder.py b/gedi/gediFinder.py
new file mode 100644
index 0000000..d4ae69d
--- /dev/null
+++ b/gedi/gediFinder.py
@@ -0,0 +1,141 @@
+#!/usr/bin/env python
+"""
+This script is meant to create a GediFinder URL and get the corresponding list of granules
+within a user-defined bounding box. This list can then be used in the Earthdata Search Tool
+to pull data from within that bounding box.
+
+To Run
+----------
+python gediFinder.py -d DirectoryPath -b ul_lat,ul_lon,lr_lat,lr_lon -l level -o granule_list
+
+Arguments
+-------
+ -d,--dir : str
+    The directory to which the output file is written, formatted with a trailing slash,
+    such that {dir}{fname} is a valid path, for fname a valid file name.
+ -b,--bbox : str
+    The bounding box of the region of interest, in the format
+    ul_lat,ul_lon,lr_lat,lr_lon
+ -l,--level : str (optional)
+    The data product level of interest. Acceptable values at this time
+    are 1B, 2A, or 2B.
+    default = 2A
+ -o,--outfile : str (optional)
+    The stem of the name of the output file, without file extension.
+    default = granule_list
+
+Outputs
+-------
+Writes the granule list to a txt file in the given directory
+"""
+
+import os
+import requests
+import argparse
+
+from urllib.parse import urlencode
+
+
+def gediFinder(
+    level: str,
+    bbox: str,
+    version: str = "001",
+    output: str = "json",
+    lpdaac: str = "https://lpdaacsvc.cr.usgs.gov/services/gedifinder",
+) -> str:
+    """
+    Query the LP DAAC GEDI Finder service for all granules of a product
+    that intersect a bounding box.
+
+    Parameters
+    ----------
+    level : str
+        The GEDI product short name, e.g. GEDI02_A.
+    bbox : str
+        The bounding box of the region of interest, in the format
+        ul_lat,ul_lon,lr_lat,lr_lon.
+    version : str
+        The product version to request.
+    output : str
+        The response format requested from the service.
+    lpdaac : str
+        The base URL of the GEDI Finder service.
+
+    Returns
+    -------
+    str of all HDF5 granule names separated by commas, or an error message
+    if the request failed.
+    """
+    payload = {
+        'product': level,
+        'version': version,
+        'bbox': bbox,
+        'output': output,
+    }
+    payload_str = urlencode(payload, safe=',')
+
+    r = requests.get(lpdaac, params=payload_str)
+    if r.status_code == requests.codes.ok:
+        print('Granules successfully gathered!')
+
+        granules = [g.split('/')[-1] for g in r.json()['data']]  # take the filename from each url
+        return ','.join(granules)
+
+    else:
+        print(f'Error {r.status_code} has occurred.')
+
+        error = (
+            f'URL below failed to retrieve GEDI data. '
+            f'Error {r.status_code} has occurred. \n {r.url}'
+        )
+        return error
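+
+# Illustrative usage from another script (hypothetical bounding box values):
+#   granules = gediFinder("GEDI02_A", "44.0,-124.0,43.0,-123.0")
+#   print(granules.split(",")[:3])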
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "-d",
+        "--dir",
+        help=(
+            "The directory to which the output file is written, formatted with a"
+            " trailing slash, such that {dir}{fname} is a valid path, for fname a"
+            " valid file name."
+        ),
+    )
+    parser.add_argument(
+        "-b",
+        "--bbox",
+        help=(
+            "The bounding box of the region of interest, in the format"
+            " ul_lat,ul_lon,lr_lat,lr_lon"
+        ),
+    )
+    parser.add_argument(
+        "-l",
+        "--level",
+        help=(
+            "The data product level of interest. Acceptable values at this time"
+            " are 1B, 2A, or 2B, with a default of 2A."
+        ),
+        default="2A",
+    )
+    parser.add_argument(
+        "-o",
+        "--outfile",
+        help=(
+            "The stem of the name of the output file, without file extension, "
+            "optional. Default value is 'granule_list'."
+        ),
+        default="granule_list",
+    )
+    args = parser.parse_args()
+
+    if args.level.upper() == '1B':
+        level = 'GEDI01_B'
+    elif args.level.upper() == '2A':
+        level = 'GEDI02_A'
+    elif args.level.upper() == '2B':
+        level = 'GEDI02_B'
+    else:
+        raise ValueError(
+            f"Received unsupported data level {args.level}."
+            " Please provide 1B, 2A, or 2B"
+        )
+
+    print("------------------------------------------")
+
+    with open(os.path.join(args.dir, args.outfile + ".txt"), 'w+') as file:
+        file.write(gediFinder(level, args.bbox))
+
+    print("------------------------------------------")
diff --git a/gedi/process_l2a.py b/gedi/process_l2a.py
index 54a2ce2..d4c2be6 100644
--- a/gedi/process_l2a.py
+++ b/gedi/process_l2a.py
@@ -1,16 +1,63 @@
+#!/usr/bin/env python
+"""
+This program goes through the following process:
+    Download GEDI HDF5 files
+    Retrieve the appropriate data from the files
+    Remove the GEDI HDF5 files
+    Write the data to file
+The following layers are retrieved from the raw GEDI data:
+    shot_number, lat_lowestmode, lon_lowestmode, elev_lowestmode, elev_highestreturn,
+    sensitivity, quality_flag, rh
+rh is converted to the following canopy height metrics:
+    canopy height max, min, std, avg
+    canopy height percentiles 10, 25, 50, 75, 90
+    number of canopy height measurements in set height ranges (h<5m, ..., 20m<=h<=30m)
+"""
@@ ... @@ def gedi_L2A_to_df
                     & (_f[beam]["lon_lowestmode"][()] >= ul_lon)
                 )
-                tmp_df["beam"] = [beam] * np.count_nonzero(mask)
-                for layer in layers:
-                    tmp_df[layer] = _f[beam][layer][()][mask].tolist()
+                if np.count_nonzero(mask) == 0:
+                    print(f"{beam} in {os.path.basename(_f_name)} does not contain any usable data")
+                    continue
 
-                df = df.append(tmp_df)
+                else:
+                    tmp_df["beam"] = [beam] * np.count_nonzero(mask)
+
+                    for layer in layers:
+                        tmp_df[layer] = _f[beam][layer][()][mask].tolist()
+
+                    _append_canopy_metrics(tmp_df, canopy_threshold=2)
+                    del tmp_df["rh"]
+
+                    df = df.append(tmp_df)
     except KeyError as e:
         print(f"Encountered file read error with file {_f_name}")
+
+    return df.reset_index(drop=True)
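+
+# Illustrative call (hypothetical file path; bbox is [ul_lat, ul_lon, lr_lat, lr_lon]):
+#   df = gedi_L2A_to_df(["/data/GEDI02_A_granule.h5"], bbox=[44.0, -124.0, 43.0, -123.0])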
 
 def _compute_nan_percentile(a: np.ndarray, q: float) -> np.array:
-    """This code is taken directly from StackOverflow at:
+    """
+    This code is taken directly from StackOverflow at:
     https://stackoverflow.com/questions/60015245/numpy-nanpercentile-is-extremely-slow
-
     For 2d masked array of roughly 10^5 x 100, this performs ~3x faster than
     np.nanpercentile (~10s vs. 3s) and produces results which are identical.
-
     Would be good to add two tests to verify: (1) the results of this function are the
     same as np.nanpercentile, and (2) that this function takes less time on a reasonably
     sized array, in case of future performance gains in nanpercentile.
-
     Parameters
     ----------
     a : array_like
         2d array for which to compute a row-wise percentile.
     q : float
         Percentile to compute -- allowed values are [0, 100]
-
     Returns
     -------
     np.array of the q'th percentile for each row of a. Dimensions of [len(a), 1].
     """
     if q < 0 or q > 100:
         raise ValueError(f"Expected a value between 0 and 100; received {q} instead.")
 
-    mask = (a >= np.nanmin(a)).astype(int)
-
-    count = mask.sum(axis=1)
-    groups = np.unique(count)
-    groups = groups[groups > 0]
+    a = np.sort(a, axis=1)
+    count = (~np.isnan(a)).sum(axis=1)  # count the non-nan values in each row
+    groups = np.unique(count)  # returns sorted unique values
+    groups = groups[groups > 0]  # only keep groups with at least 1 non-nan value
 
     p = np.zeros((a.shape[0]))
-    for g in range(len(groups)):
-        pos = np.where(count == groups[g])
+    for group in groups:
+        pos = np.where(count == group)
         values = a[pos]
-        values = np.nan_to_num(values, nan=(np.nanmin(a) - 1))
-        values = np.sort(values, axis=1)
-        values = values[:, -groups[g] :]
+        values = values[:, :group]  # the first `group` sorted entries are the non-nans
        p[pos] = np.percentile(values, q, axis=1)
+
     return p
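+
+# A minimal correctness check along the lines suggested in the docstring
+# (illustrative sketch, not run as part of the pipeline):
+#   rng = np.random.default_rng(0)
+#   arr = rng.random((1000, 100))
+#   arr[rng.random(arr.shape) < 0.3] = np.nan
+#   arr[:, 0] = 0.5  # ensure each row has at least one non-nan value
+#   assert np.allclose(_compute_nan_percentile(arr, 75),
+#                      np.nanpercentile(arr, 75, axis=1))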
""" if q < 0 or q > 100: raise ValueError(f"Expected a value between 0 and 100; received {q} instead.") - mask = (a >= np.nanmin(a)).astype(int) - - count = mask.sum(axis=1) - groups = np.unique(count) - groups = groups[groups > 0] + a = np.sort(a, axis =1) + count = (~np.isnan(a)).sum(axis=1) # count number of non-nans in row + groups = np.unique(count) # returns sorted unique values + groups = groups[groups > 0] # only returns groups with at least 1 non-nan value\n", p = np.zeros((a.shape[0])) - for g in range(len(groups)): - pos = np.where(count == groups[g]) + for group in groups: + pos = np.where(count == group) values = a[pos] - values = np.nan_to_num(values, nan=(np.nanmin(a) - 1)) - values = np.sort(values, axis=1) - values = values[:, -groups[g] :] + values = values[:, :group] p[pos] = np.percentile(values, q, axis=1) + return p -def append_canopy_metrics(df: pd.DataFrame, canopy_threshold: float) -> None: +def _append_canopy_metrics(df: pd.DataFrame, canopy_threshold: float) -> None: """ This function takes a pd.DataFrame object and a numerical value corresponding to the height threshold to consider a return as coming from the forest canopy. It extracts the "rh" column from the provided DataFrame as an np.ndarray, applies a mask, and computes the relevant canopy metrics which are then appended to the provided DataFrame. - Note: the least bad option for computing these canopy metrics seemed to be using np.ma module to work with masked arrays. Not converting out of Pandas is untenable due to computation time, and the varying dimension of rows when subset to canopy-only observations means np converts the result of applying the mask to the 2d array to a 1d array. - Parameters ---------- df : pd.DataFrame @@ -162,11 +209,9 @@ def append_canopy_metrics(df: pd.DataFrame, canopy_threshold: float) -> None: labeled "rh" which is array-valued. canopy_threshold : float The minimum value for a return to be considered a "canopy" return. - Side Effects ------------ Modifies the provided pd.DataFrame in-place to add canopy metrics. - Returns ------- None @@ -203,11 +248,10 @@ def append_canopy_metrics(df: pd.DataFrame, canopy_threshold: float) -> None: ) # % >= 20m, <=30m -def df_to_geojson(df, outfile): +def df_to_geojson(df : pd.DataFrame, outfile : str) -> None: """ Convert pandas dataframe to GeoJSON file and save as given file name and path. This file type can be useful when working locally in GIS software. - Parameters ---------- df : pd.DataFrame @@ -215,7 +259,9 @@ def df_to_geojson(df, outfile): canopy height metrics appended. outfile : str The location and filename of output json file. - + Side Effects + ------- + Writes dataframe to geojson filetype Returns ------- None @@ -227,22 +273,86 @@ def df_to_geojson(df, outfile): GeoDF = GeoDF.drop(columns=["lat_lowestmode", "lon_lowestmode"]) GeoDF.to_file(outfile, driver="GeoJSON") +def download_url(url : str, dir : str, chunk_size : int = 128) -> None: + """ + Download the zip files generated by NASA EarthData Search. 
-def df_to_geojson(df, outfile):
+def df_to_geojson(df: pd.DataFrame, outfile: str) -> None:
     """
     Convert pandas dataframe to GeoJSON file and save as given file name and
     path. This file type can be useful when working locally in GIS software.
-
     Parameters
     ----------
     df : pd.DataFrame
@@ -215,7 +259,9 @@
         canopy height metrics appended.
     outfile : str
         The location and filename of output json file.
-
+    Side Effects
+    ------------
+    Writes the dataframe to a GeoJSON file
     Returns
     -------
     None
@@ -227,22 +273,86 @@
     GeoDF = GeoDF.drop(columns=["lat_lowestmode", "lon_lowestmode"])
     GeoDF.to_file(outfile, driver="GeoJSON")
 
+
+def download_url(url: str, dir: str, chunk_size: int = 128) -> bool:
+    """
+    Download and unpack a zip file generated by NASA EarthData Search.
+    Parameters
+    ----------
+    url : str
+        URL used to download the data zip package
+    dir : str
+        Directory used to store the zip file and its unzipped contents
+    chunk_size : int
+        The size of each chunk written to file during data streaming
+    Side Effects
+    ------------
+    Downloads and unpacks the zip file from the provided url
+    Returns
+    -------
+    bool indicating whether the download and extraction succeeded
+    """
+    filepath = os.path.join(dir, "granuleData.zip")
+    r = requests.get(url, stream=True)
+
+    if r.status_code == requests.codes.ok:
+        print(f"Downloading files from {url}")
+
+        # Download the zip file and save it to dir
+        with open(filepath, 'wb') as fd:
+            for chunk in r.iter_content(chunk_size=chunk_size):
+                fd.write(chunk)
+
+        print("Unzipping files")
+        # Extract all the contents of the zip file into dir
+        with zipfile.ZipFile(filepath, 'r') as zipObj:
+            zipObj.extractall(path=dir)
+
+        # Remove the zip file (os.remove, not rmtree, since this is a single file)
+        os.remove(filepath)
+
+        return True
+
+    else:
+        print(
+            f'Error {r.status_code} has occurred.\n'
+            f'{url} cannot be downloaded at this time.'
+        )
+        return False
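+
+# Illustrative usage (hypothetical url and directory):
+#   ok = download_url("https://example.com/orders/granuleData.zip?1", "/tmp/gedi/")
+#   if not ok:
+#       ...  # skip this order and move on, as the main loop below does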
 
 
 if __name__ == "__main__":
+    # TODO: display the argparse help when no arguments are given; add docstrings with proper usage
     parser = argparse.ArgumentParser()
     parser.add_argument(
         "-d",
         "--dir",
         help=(
-            "The directory containing hd5 files, formatted with a trailing slash,"
-            " such that {dir}{fname} is a valid path, for fname a valid file name."
+            "The working directory where the zip files are downloaded and"
+            " extracted, and where the output file is written."
         ),
     )
+    parser.add_argument(
+        "-t",
+        "--textfile",
+        help=(
+            "The txt file containing the urls for the zip files, supplied by"
+            " EarthData Search."
+        ),
+    )
+    parser.add_argument(
+        "-b",
+        "--bbox",
+        help=(
+            "The bounding box of the region of interest, in the format"
+            " ul_lat,ul_lon,lr_lat,lr_lon"
+        ),
+    )
     parser.add_argument(
         "-o",
         "--outfile",
         help=(
             "The stem of the name of the output file, without file extension, "
             "optional. Default value is 'gedi_output'."
         ),
         default="gedi_output",
@@ -250,34 +360,68 @@ def df_to_geojson(df, outfile):
     )
     parser.add_argument(
         "-f",
         "--filetype",
-        help="The type of file to output. Acceptable formats are: csv, parquet, GeoJSON.",
-        default="parquet",
+        help=(
+            "The type of file to output. Acceptable formats are: csv, "
+            "parquet, GeoJSON. Default value is csv."
+        ),
+        default="csv",
     )
     args = parser.parse_args()
 
-    ul_lat = 43.38
-    ul_lon = -124.1
-    lr_lat = 42.88
-    lr_lon = -123.6
-    bbox = [ul_lat, ul_lon, lr_lat, lr_lon]
-
-    files = [
-        _f
-        for _f in os.listdir(args.dir)
-        if os.path.isfile(os.path.join(args.dir, _f)) and _f.endswith(".h5")
-    ]
-
-    df = gedi_L2A_to_df(args.dir, files, bbox)
-    append_canopy_metrics(df, canopy_threshold=2)
-    del df["rh"]
-    if args.filetype.lower() == "csv":
-        df.to_csv(f"{args.dir}{args.outfile}.csv", index=False)
-    elif args.filetype.lower() == "parquet":
-        df.to_parquet(f"{args.dir}{args.outfile}.parquet.gzip", compression="gzip")
-    elif args.filetype.lower() == "geojson":
-        df_to_geojson(df, f"{args.dir}{args.outfile}.geojson")
+    print("------------------------------------------")
+
+    bbox = [float(b) for b in args.bbox.split(',')]
+
+    df = pd.DataFrame()
+
+    with open(args.textfile, 'r') as texturls:
+        for url in texturls:
+            url = url.strip()
+            if validators.url(url):
+
+                status = download_url(url, args.dir)
+                if not status:
+                    print("------------------------------------------")
+                    continue
+
+                print("    ----------------------    ")
+                # Make a list of the extracted HDF5 file paths
+                filepaths = [
+                    os.path.join(root, file)
+                    for root, dirs, files in os.walk(args.dir, topdown=False)
+                    for file in files
+                    if file.endswith('.h5')
+                ]
+
+                tmp_df = gedi_L2A_to_df(filepaths, bbox=bbox)
+                print("    ----------------------    ")
+                # Append the extracted data to the running dataframe
+                df = df.append(tmp_df)
+                print(f"Appended {len(tmp_df)} lines to dataframe; dataframe has {len(df)} lines")
+
+                # Delete the extracted HDF5 files (removing each parent directory once)
+                for dirpath in {os.path.dirname(fp) for fp in filepaths}:
+                    rmtree(dirpath)
+
+                print("------------------------------------------")
+
+    if df.empty:
+        print("DataFrame is empty. Not writing data to file")
+        print("------------------------------------------")
     else:
-        raise ValueError(
-            f"Received unsupported file type {args.filetype}. Please provide one "
-            "of: csv, parquet, or GeoJSON."
-        )
+        if args.filetype.lower() == "csv":
+            filename = os.path.join(args.dir, args.outfile + ".csv")
+            print(f'Writing to file {filename}')
+            df.to_csv(filename, index=False)
+        elif args.filetype.lower() == "parquet":
+            filename = os.path.join(args.dir, args.outfile + ".parquet.gzip")
+            print(f'Writing to file {filename}')
+            df.to_parquet(filename, compression="gzip")
+        elif args.filetype.lower() == "geojson":
+            filename = os.path.join(args.dir, args.outfile + ".geojson")
+            print(f'Writing to file {filename}')
+            df_to_geojson(df, filename)
+        else:
+            raise ValueError(
+                f"Received unsupported file type {args.filetype}."
+                " Please provide one of: csv, parquet, or GeoJSON."
+            )
+        print("Job Complete!")
+        print("------------------------------------------")