Using 'open virtual dataset' capability to work with TEMPO Level 3 data¶
Summary¶
In this tutorial, we will use the earthaccess.open_virtual_mfdataset() function to open a week's worth of granules from the Nitrogen Dioxide (NO2) Level-3 data collection of the TEMPO air quality mission.
About TEMPO: The Tropospheric Emissions: Monitoring of Pollution (TEMPO) instrument observes from geostationary orbit and provides hourly daytime measurements of air quality over North America. It measures key pollutants including nitrogen dioxide (NO2), formaldehyde, and ozone at high spatial resolution (~2 by 4.75 km at the center of its field of regard).
We will calculate temporal and spatial means for a subset of the data and visualize the results. This approach demonstrates cloud-optimized data access patterns that can scale from days to years of data.
Learn more: For comprehensive documentation on the earthaccess package, visit the earthaccess documentation.
Note that this same approach can be used for a date range of any length, within the mission's duration. Running this notebook for a year's worth of TEMPO Level-3 data took approximately 15 minutes.
Prerequisites¶
AWS US-West-2 Environment: This tutorial has been designed to run on a cloud compute instance in AWS region us-west-2. It should also run on a laptop or workstation, but without the speed benefits of in-cloud access.
Earthdata Account: A (free!) Earthdata Login account is required to access data from the NASA Earthdata system. Before requesting TEMPO data, we first need to set up our Earthdata Login authentication, as described in the Earthdata Cookbook's earthaccess tutorial.
Packages:
cartopy
dask
earthaccess (version 0.14.0 or greater)
matplotlib
numpy
xarray
Setup¶
import cartopy.crs as ccrs
import earthaccess
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
from matplotlib import rcParams
%config InlineBackend.figure_format = 'jpeg'
# Reduce figure resolution to keep the saved size of this notebook low.
rcParams["figure.dpi"] = 80
Login using the Earthdata Login¶
auth = earthaccess.login()
if not auth.authenticated:
    # Ask for credentials and persist them in a .netrc file.
    auth.login(strategy="interactive", persist=True)
print(earthaccess.__version__)
0.14.0
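Since persist=True stores the credentials in a .netrc file, it can be handy to check whether such an entry already exists before logging in. The following is a minimal sketch using only the Python standard library. The helper name has_earthdata_credentials is hypothetical and not part of earthaccess. The sketch assumes the Earthdata Login host is urs.earthdata.nasa.gov and that the file lives at ~/.netrc (Windows typically uses a different file name).

```python
import netrc
from pathlib import Path

EDL_HOST = "urs.earthdata.nasa.gov"  # Earthdata Login host (assumed)


def has_earthdata_credentials(netrc_path=None):
    """Return True if the netrc file has an explicit entry for Earthdata Login.

    Hypothetical helper for illustration; not part of earthaccess.
    """
    path = Path(netrc_path) if netrc_path is not None else Path.home() / ".netrc"
    try:
        hosts = netrc.netrc(str(path)).hosts
    except (OSError, netrc.NetrcParseError):
        # Missing, unreadable, or malformed file: treat as "no credentials".
        return False
    return EDL_HOST in hosts


print(has_earthdata_credentials())
```

If this prints False, earthaccess.login() has no persisted .netrc entry to fall back on and may need another credential source, such as the interactive prompt.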
Search for data granules¶
We search for TEMPO Nitrogen Dioxide (NO2) data for a week-long period (note: times are in UTC), between January 11th and 18th, 2024.
results = earthaccess.search_data(
    # TEMPO NO₂ Level-3 product
    short_name="TEMPO_NO2_L3",
    # Version 3 of the data product
    version="V03",
    # Time period: One week in January 2024 (times are in UTC)
    temporal=("2024-01-11 12:00", "2024-01-18 12:00"),
)
print(f"Number of granules found: {len(results)}")
Number of granules found: 81
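Because the same approach works for a date range of any length, it can help to generate the temporal tuple instead of typing it by hand. This is a small sketch under the assumption that the "YYYY-MM-DD HH:MM" string format used in the search above is kept. The function temporal_range is a hypothetical helper, not an earthaccess function.

```python
from datetime import datetime, timedelta


def temporal_range(start, days, fmt="%Y-%m-%d %H:%M"):
    """Return a (start, end) tuple of timestamp strings spanning `days` days.

    Hypothetical helper for illustration; times are treated as UTC.
    """
    begin = datetime.strptime(start, fmt)
    end = begin + timedelta(days=days)
    return begin.strftime(fmt), end.strftime(fmt)


print(temporal_range("2024-01-11 12:00", days=7))
# ('2024-01-11 12:00', '2024-01-18 12:00')
```

The returned tuple can be passed as the temporal argument of earthaccess.search_data, for example with days=365 for a year-long search.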
Opening Virtual Multifile Datasets¶
Understanding Virtual Datasets¶
Virtual datasets allow us to work with multiple files as if they were a single dataset without downloading all the data to local storage. This is achieved through:
- Kerchunk: Creates lightweight reference files that point to data chunks in cloud storage
- VirtualiZarr: Combines multiple reference files into a single virtual dataset
- Lazy Loading: Data is only accessed when needed for computations
For TEMPO data, we need to handle the hierarchical netCDF4 structure by opening each group separately, then merging them.
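To make the idea of reference files and lazy loading more concrete, here is a toy sketch in plain Python. It is not how Kerchunk or VirtualiZarr are implemented. It only illustrates the principle that a small mapping from chunk keys to byte ranges is enough to read one chunk on demand without copying the rest. All names (fake_file, references, read_chunk, granule_001.nc) are made up for this illustration.

```python
import io

# A stand-in for one remote granule: three 4-byte chunks stored back to back.
fake_file = io.BytesIO(b"AAAABBBBCCCC")

# Reference manifest: chunk key -> (location, byte offset, byte length).
# Only this small mapping is built up front; no chunk data is copied.
references = {
    "no2/0": ("granule_001.nc", 0, 4),
    "no2/1": ("granule_001.nc", 4, 4),
    "no2/2": ("granule_001.nc", 8, 4),
}


def read_chunk(key, opener):
    """Fetch a single chunk by seeking to its byte range (a lazy read)."""
    location, offset, length = references[key]
    handle = opener(location)
    handle.seek(offset)
    return handle.read(length)


# Here every location resolves to the same in-memory stand-in file.
chunk = read_chunk("no2/1", opener=lambda location: fake_file)
print(chunk)  # b'BBBB'
```

In the real workflow, the byte ranges point into files in cloud storage, and the reads happen only when a computation asks for the corresponding values.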
First we set the argument options to be used by earthaccess.open_virtual_mfdataset.
load argument considerations:
- load=True works. Within earthaccess.open_virtual_mfdataset, a temporary virtual reference file (a "virtual dataset") is created and then immediately loaded with kerchunk. This is because the function assumes the user is making this request for the first time and the combined manifest file needs to be generated first. In the future, however, earthaccess.open_virtual_mfdataset may provide a way to save the combined manifest file, at which point you could then avoid repeating these steps, and proceed directly to loading with kerchunk/virtualizarr.
- load=False results in KeyError: "no index found for coordinate 'longitude'" because it creates ManifestArrays without indexes (see the earthaccess documentation).
open_options = {
    "access": "direct",  # Direct access to cloud data (faster in AWS)
    "load": True,  # Load metadata immediately (required for indexing)
    "concat_dim": "time",  # Concatenate files along the time dimension
    "data_vars": "minimal",  # Only concatenate data variables that already contain concat_dim
    "coords": "minimal",  # Only concatenate coordinate variables that already contain concat_dim
    "compat": "override",  # Avoid coordinate conflicts by picking the first
    "combine_attrs": "override",  # Avoid attribute conflicts by picking the first
    "parallel": True,  # Enable parallel processing with Dask
}
Because TEMPO data are processed and archived in a netCDF4 format using a group hierarchy, we open each group (i.e., 'root', 'product', and 'geolocation') and then merge them together.
%%time
result_root = earthaccess.open_virtual_mfdataset(granules=results, **open_options)
result_product = earthaccess.open_virtual_mfdataset(
    granules=results, group="product", **open_options
)
result_geolocation = earthaccess.open_virtual_mfdataset(
    granules=results, group="geolocation", **open_options
)
CPU times: user 1.28 ms, sys: 5.05 ms, total: 6.34 ms
Wall time: 6.32 ms
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/checkouts/readthedocs.org/user_builds/earthaccess/envs/976/lib/python3.11/site-packages/earthaccess/dmrpp_zarr.py:136, in open_virtual_mfdataset(granules, group, access, preprocess, parallel, **xr_combine_nested_kwargs)
--> 136 vmfdataset = vz.open_virtual_mfdataset(
    137     urls=granule_dmrpp_urls,
    138     registry=obstore_registry,
    139     parser=DMRPPParser(group=group),
    140     preprocess=preprocess,
    141     parallel=parallel,
    142     **xr_combine_nested_kwargs,
    143 )
File ~/checkouts/readthedocs.org/user_builds/earthaccess/envs/976/lib/python3.11/site-packages/virtualizarr/xarray.py:210, in open_virtual_mfdataset(urls, registry, parser, concat_dim, compat, preprocess, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
    209 elif concat_dim is not None:
--> 210     raise ValueError(
    211         "When combine='by_coords', passing a value for `concat_dim` has no "
    212         "effect. To manually combine along a specific dimension you should "
    213         "instead specify combine='nested' along with a value for `concat_dim`.",
    214     )
ValueError: When combine='by_coords', passing a value for `concat_dim` has no effect. To manually combine along a specific dimension you should instead specify combine='nested' along with a value for `concat_dim`.
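The ValueError above is raised because concat_dim is passed while the combine mode is 'by_coords', and the message itself recommends combine='nested'. Below is a minimal sketch of that adjustment. It assumes the installed earthaccess forwards extra keyword arguments to VirtualiZarr, as the traceback suggests, and it has not been checked against every earthaccess release. The helper nested_combine_options is hypothetical, and the base dictionary is a shortened stand-in for open_options.

```python
def nested_combine_options(base_options, concat_dim="time"):
    """Return a copy of base_options that requests a nested combine.

    Hypothetical helper: it only edits a dictionary of keyword arguments
    and leaves the original dictionary untouched.
    """
    options = dict(base_options)
    options["combine"] = "nested"  # the mode the error message asks for
    options["concat_dim"] = concat_dim
    return options


# Shortened stand-in for the open_options dictionary defined earlier.
base = {"access": "direct", "concat_dim": "time", "compat": "override"}
fixed = nested_combine_options(base)
print(fixed["combine"], fixed["concat_dim"])  # nested time
```

In the notebook, this would amount to running open_options = nested_combine_options(open_options) and then re-running the three open_virtual_mfdataset calls.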
Merge root groups with subgroups.
result_merged = xr.merge([result_root, result_product, result_geolocation])
result_merged
Understanding the Data¶
- vertical_column_troposphere: Tropospheric vertical column amount of NO₂ (units: molecules/cm²)
- main_data_quality_flag: Quality indicator (0 = good quality data)
- Geographic region: We'll focus on the Mid-Atlantic region (Washington DC area)
  - Longitude: 78° W to 74° W (-78° to -74°)
  - Latitude: 35° N to 39° N
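The analysis in the following sections keeps only values whose quality flag is 0 and then averages either over time or over space. As a dependency-free illustration of what those two averages mean, here is a toy sketch in plain Python on a made-up 2 × 2 × 2 cube. The numbers and names are invented, and the real notebook does this with xarray's where and mean rather than with loops.

```python
from statistics import fmean

# Toy cube indexed as values[time][lat][lon], with matching quality flags.
values = [
    [[1.0, 2.0], [3.0, 4.0]],  # time step 0
    [[5.0, 6.0], [7.0, 8.0]],  # time step 1
]
flags = [
    [[0, 0], [0, 1]],  # the value 4.0 is flagged as not good
    [[0, 0], [0, 0]],
]
n_time, n_lat, n_lon = 2, 2, 2


def good(t, i, j):
    """True where the quality flag marks good data (flag == 0)."""
    return flags[t][i][j] == 0


# Temporal mean: one value per grid cell, averaged over good time steps.
# (fmean raises an error for a cell with no good data; xarray would give NaN.)
temporal_mean = [
    [fmean(values[t][i][j] for t in range(n_time) if good(t, i, j))
     for j in range(n_lon)]
    for i in range(n_lat)
]

# Spatial mean: one value per time step, averaged over good grid cells.
spatial_mean = [
    fmean(values[t][i][j]
          for i in range(n_lat) for j in range(n_lon) if good(t, i, j))
    for t in range(n_time)
]

print(temporal_mean)  # [[3.0, 4.0], [5.0, 8.0]]
print(spatial_mean)  # [2.0, 6.5]
```

The temporal mean yields a map (one value per location), while the spatial mean yields a time series (one value per scan).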
# Define our region of interest (Mid-Atlantic/Washington DC area)
lon_bounds = (-78, -74) # Western to Eastern longitude
lat_bounds = (35, 39) # Southern to Northern latitude
print(
    f"Analyzing region: {lat_bounds[0]}°N to {lat_bounds[1]}°N, {abs(lon_bounds[0])}°W to {abs(lon_bounds[1])}°W"
)
Analyzing region: 35°N to 39°N, 78°W to 74°W
Temporal Mean - a map showing the average over the week¶
# Define temporal mean (average over time) calculation
temporal_mean_ds = (
    result_merged.sel(
        {
            "longitude": slice(lon_bounds[0], lon_bounds[1]),
            "latitude": slice(lat_bounds[0], lat_bounds[1]),
        }
    )
    .where(result_merged["main_data_quality_flag"] == 0)  # Filter for good quality data
    .mean(dim="time")
)
print(f"Dataset shape after subsetting: {temporal_mean_ds.dims}")
temporal_mean_ds
%%time
# Compute the temporal mean
mean_vertical_column_trop = temporal_mean_ds["vertical_column_troposphere"].compute()
fig, ax = plt.subplots(subplot_kw={"projection": ccrs.PlateCarree()})
mean_vertical_column_trop.squeeze().plot.contourf(ax=ax)
# Add geographic features
ax.coastlines()
ax.gridlines(
    draw_labels=True,
    dms=True,
    x_inline=False,
    y_inline=False,
)
plt.show()
Spatial mean - a time series of area averages¶
# Define spatial mean (average over longitude/latitude) calculation
spatial_mean_ds = (
    result_merged.sel(
        {
            "longitude": slice(lon_bounds[0], lon_bounds[1]),
            "latitude": slice(lat_bounds[0], lat_bounds[1]),
        }
    )
    .where(result_merged["main_data_quality_flag"] == 0)  # Filter for good quality data
    .mean(dim=("longitude", "latitude"))
)
spatial_mean_ds
%%time
# Compute the spatial mean
spatial_mean_vertical_column_trop = spatial_mean_ds[
    "vertical_column_troposphere"
].compute()
spatial_mean_vertical_column_trop.plot()
plt.show()
Single scan subset¶
# Select a single scan time for detailed analysis
scan_time_start = np.datetime64("2024-01-11T13:00:00") # 1 PM UTC
scan_time_end = np.datetime64("2024-01-11T14:00:00") # 2 PM UTC
print(f"Analyzing single scan: {scan_time_start} to {scan_time_end} UTC")
print("Note: This corresponds to ~8-9 AM local time on the US East Coast")
subset_ds = result_merged.sel(
    {
        "longitude": slice(lon_bounds[0], lon_bounds[1]),
        "latitude": slice(lat_bounds[0], lat_bounds[1]),
        "time": slice(scan_time_start, scan_time_end),
    }
).where(result_merged["main_data_quality_flag"] == 0)
subset_ds
Analyzing single scan: 2024-01-11T13:00:00 to 2024-01-11T14:00:00 UTC
Note: This corresponds to ~8-9 AM local time on the US East Coast
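The UTC-to-local conversion mentioned in the printed note can be checked with the standard library. This sketch assumes Eastern Standard Time as a fixed UTC-5 offset, which holds in January but ignores daylight saving time. The variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset for Eastern Standard Time (assumption: no daylight saving).
EST = timezone(timedelta(hours=-5), name="EST")

scan_start_utc = datetime(2024, 1, 11, 13, 0, tzinfo=timezone.utc)
scan_end_utc = datetime(2024, 1, 11, 14, 0, tzinfo=timezone.utc)

local_start = scan_start_utc.astimezone(EST)
local_end = scan_end_utc.astimezone(EST)
print(local_start.strftime("%H:%M %Z"), "to", local_end.strftime("%H:%M %Z"))
# 08:00 EST to 09:00 EST
```

For dates during daylight saving time, a region-aware time zone (for example via the zoneinfo module) would be the more robust choice.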
%%time
# Compute the single scan's values
subset_vertical_column_trop = subset_ds["vertical_column_troposphere"].compute()
fig, ax = plt.subplots(subplot_kw={"projection": ccrs.PlateCarree()})
subset_vertical_column_trop.squeeze().plot.contourf(ax=ax)
# Add geographic features
ax.coastlines()
ax.gridlines(
    draw_labels=True,
    dms=True,
    x_inline=False,
    y_inline=False,
)
plt.show()