
aerosol table#74

Open
larsbuntemeyer wants to merge 6 commits into main from aerosol

Conversation

@larsbuntemeyer
Contributor

@larsbuntemeyer larsbuntemeyer commented Jul 29, 2025

see #34

The initial aerosol table was created using:

import pandas as pd

def sheet_url(url, sheet_name):
    """create google spreadsheet url based on sheet name"""
    sheet_name = sheet_name.replace(" ", "%20")
    return url.format(sheet_id=sheet_id, sheet_name=sheet_name)


def retrieve_google_sheet(url, sheet_name, skiprows=4):
    """retrieve single sheet of data request"""
    return pd.read_csv(sheet_url(url, sheet_name), skiprows=skiprows, dtype=str)

def handle_inconsistencies(df):
    """handle some random inconsistencies"""
    df.loc[df["priority"] == "TIER 2", "priority"] = "TIER2"
    df.loc[df["priority"] == "TIER 1", "priority"] = "TIER1"
    return df

def freq_list(row):
    """create list of frequencies from boolean entries ('x')"""
    if row["mon"] == "fx":
        return ["fx"]
    return [f for f in freqs if row[f] == "x"]

def update_cell_methods(df):
    # special fx cases
    df.loc[df.frequency == "fx", "cell_methods"] = "area: mean"

    # flux units, see https://github.com/WCRP-CORDEX/cordex-cmip6-data-request/issues/23
    df.loc[df.units == "W m-2", "cell_methods"] = "area: time: mean"

    return df

def handle_special_cell_methods(df):
    for var, v in df.cell_methods.items():
        for f, cm in v.items():
            df.loc[(df.out_name == var) & (df.frequency == f), "cell_methods"] = cm
    return df

def clean_df(df, drop=True):
    """tidy up dataframe"""
    # remove unnamed columns
    # copy so the column assignments below do not act on a slice of the input
    df = df.loc[:, ~df.columns.str.contains("Unnamed")].copy()

    df["standard_name"] = df["standard_name"].fillna("")

    # lower case column names and renaming to cmip6 formats
    df.columns = df.columns.str.lower()
    df.rename(
        columns={"output variable name": "out_name", "comments": "comment"},
        inplace=True,
    )

    # frequency columns to tidy data
    df["frequency"] = df.apply(lambda row: freq_list(row), axis=1)
    df = df.explode("frequency", ignore_index=True)

    df = handle_inconsistencies(df)

    # flag instantaneous ("i") values at subdaily frequencies
    subdaily_pt = (df["frequency"].isin(["1hr", "3hr", "6hr"])) & (df["ag"] == "i")
    # set frequency, we don't do that anymore,
    # see https://github.com/WCRP-CORDEX/cordex-cmip6-data-request/issues/24
    # df.loc[subdaily_pt, "frequency"] = df[subdaily_pt].frequency + "Pt"

    # set cell methods depending on frequency
    df["cell_methods"] = "area: time: mean"
    df.loc[subdaily_pt, "cell_methods"] = "area: mean time: point"

    # update some more cell_methods
    df = update_cell_methods(df)
    # remove trailing formatters
    df.replace(r"\n", " ", regex=True, inplace=True)
    strip_cols = ["standard_name", "long_name"]
    for col in strip_cols:
        df[col] = df[col].str.strip()
    if drop is True:
        df.drop(columns=freqs, inplace=True)
        #df.drop(columns=["ag"], inplace=True)
        df = df.dropna(subset=["out_name", "frequency"], how="all")

    # handle min max cell_methods
    # na=False keeps the mask boolean if an out_name is missing
    df.loc[df.out_name.str.contains("min", na=False), "cell_methods"] = "area: mean time: minimum"
    df.loc[df.out_name.str.contains("max", na=False), "cell_methods"] = "area: mean time: maximum"

    # handle special cases
    #df = handle_special_cell_methods(df)

    # set these to lowercase
    lowercase = ["CAPE", "LI", "CIN", "CAPEmax", "LImax", "CINmax"]
    lc = df.out_name.isin(lowercase)

    df.loc[lc, "out_name"] = df[lc].out_name.str.lower()

    # set positive values
    up = ["outgoing", "upward", "upwelling"]
    down = ["incoming", "downward", "downwelling", "sinking"]
    ups = df.loc[df.standard_name.str.contains("|".join(up), case=False)]
    downs = df.loc[df.standard_name.str.contains("|".join(down), case=False)]
    df.loc[ups.index, "positive"] = "up"
    df.loc[downs.index, "positive"] = "down"

    return df

freqs = ["mon", "day", "6hr", "3hr", "1hr"]

sheet_names = ["Aersol CORE", "Aerosol Tier 1", "Aerosol Tier 2"]
#url = "https://docs.google.com/spreadsheets/d/1_KLWJuVdxryyq3DsB5NIJwoneuVqSUVN/edit?pli=1&gid=1672965248#gid=1672965248"

sheet_id = "1_KLWJuVdxryyq3DsB5NIJwoneuVqSUVN"
url = (
    "https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/"
    "tq?tqx=out:csv&sheet={sheet_name}"
)

def retrieve_data_request():
    data = []
    for sheet_name in sheet_names:
        df = retrieve_google_sheet(url, sheet_name, skiprows=0).rename(columns={"Output frequency mon": "mon"})
        df.columns.values[1] = "units"
        #df = clean_df(df)
        data.append(df)
    return data

df = pd.concat(retrieve_data_request(), ignore_index=True)
df = clean_df(df)
df.to_csv("aerosol.csv", index=False)
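
The URL construction in `sheet_url` can be checked without any network access. The following is a minimal sketch, not the script's actual code: `build_sheet_url` is a hypothetical helper that takes the sheet ID as an argument instead of reading the module-level `sheet_id`, and it uses `urllib.parse.quote` in place of the manual space replacement, which is an assumption about what would be preferable here.

```python
from urllib.parse import quote

# URL template copied from the script above
URL_TEMPLATE = (
    "https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/"
    "tq?tqx=out:csv&sheet={sheet_name}"
)


def build_sheet_url(sheet_id, sheet_name, template=URL_TEMPLATE):
    """Return the CSV export URL for one sheet (hypothetical helper)."""
    # quote() percent-encodes spaces as %20, like the manual replace above
    return template.format(sheet_id=sheet_id, sheet_name=quote(sheet_name))


example_url = build_sheet_url("1_KLWJuVdxryyq3DsB5NIJwoneuVqSUVN", "Aerosol Tier 1")
print(example_url)
```

Passing the sheet ID explicitly keeps the helper independent of the order in which module-level names are defined.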

@larsbuntemeyer
Contributor Author

larsbuntemeyer commented Jul 29, 2025

@pierrenabat I added a table in this PR containing all requested aerosol variables and some metadata derived from the information provided. I kept the "ag" column for now to check cell methods.

The default for cell methods (all frequencies are averaged values) is

  • "area: time: mean"

In case of "i" in the aggregation column, for subdaily frequencies it's

  • "area: mean time: point"

However, I'm unsure how to handle the "c" (cumulative). Should the subdaily cell method be something like "area: mean time: sum"? I couldn't find anything in CMIP6 to go on, e.g., there are no cumulative subdaily frequencies.
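
The defaults described above can be sketched as a small function. This is an illustration only: `default_cell_methods` is a hypothetical helper, not part of the script, and it deliberately gives no special treatment to "c" rows, since that is the open question here.

```python
SUBDAILY = {"1hr", "3hr", "6hr"}


def default_cell_methods(frequency, ag):
    """Return the default cell_methods for one table row (sketch)."""
    if frequency == "fx":
        # fixed fields have no time dimension
        return "area: mean"
    if frequency in SUBDAILY and ag == "i":
        # instantaneous subdaily values are point samples in time
        return "area: mean time: point"
    # everything else is treated as a time average
    return "area: time: mean"


for freq, ag in [("mon", "i"), ("1hr", "i"), ("3hr", ""), ("fx", "")]:
    print(freq, repr(ag), "->", default_cell_methods(freq, ag))
```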

@larsbuntemeyer
Contributor Author

@jesusff for aerosols, I now see in the comments that many pressure levels are requested for aerosol variables, e.g.,

  • List of levels: 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 925, 950, 975, 1000 hPa (all in 1 file if possible)

The default data request splits pressure levels into individual datasets with scalar coordinates. Should we stick with one approach or support both? I'm unsure...
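
To make the two options concrete, here is a small sketch of the bookkeeping only. The variable name `conc`, the dimension names and the `<name><level>` naming pattern for per-level datasets are placeholders chosen for illustration, not names taken from the data request.

```python
# Pressure levels (hPa) as listed in the comment column of the request
LEVELS_HPA = [200, 250, 300, 350, 400, 450, 500, 550, 600, 650,
              700, 750, 800, 850, 900, 925, 950, 975, 1000]


def single_file_layout(name, levels):
    """One dataset per variable, with the levels as a vertical dimension."""
    return {name: {"dims": ("time", "plev", "y", "x"), "plev": list(levels)}}


def per_level_layout(name, levels):
    """One dataset per level, with the level as a scalar coordinate."""
    return {
        f"{name}{level}": {"dims": ("time", "y", "x"), "plev": level}
        for level in levels
    }


combined = single_file_layout("conc", LEVELS_HPA)
split = per_level_layout("conc", LEVELS_HPA)
print(len(combined), "dataset with", len(LEVELS_HPA), "levels")
print(len(split), "datasets, e.g.", sorted(split)[:3])
```

The first layout keeps one entry per variable, while the second multiplies the number of datasets by the number of levels.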

@pierrenabat

@larsbuntemeyer thanks for creating the table.
The variables with ag="cumulative" can have a cell_methods equal to "area: time: mean". These variables are equivalent to fluxes, such as existing variables like precipitation, evapotranspiration or radiation fluxes.
For the pressure levels, if it is not possible to have the different levels in the same file, you can split them in individual datasets if you prefer.
I will also complete the missing information for the variables concerned.
Thanks!

@larsbuntemeyer
Contributor Author

larsbuntemeyer commented Aug 1, 2025

variables with ag="cumulative" can have a cell_methods equal to "area: time: mean"

Alright, I'll update that.
Edit: No update required, see also WCRP-CORDEX/cordex-cmip6-data-request#23

For the pressure levels, if it is not possible to have the different levels in the same file, you can split them in individual datasets if you prefer.

It should be possible, but it would not be consistent with the default data request. I think we need more opinions on this.

@jesusff
Member

jesusff commented Aug 8, 2025

I've added a separate discussion on the model levels issue in #76

For the moment, I'd leave this aerosol request as it is now, with 3D variables including the vertical dimension. @pierrenabat, how standard is the set of levels you propose here?

@larsbuntemeyer
Contributor Author

OK, we can keep 3D variables and I will add a coordinate. However, we still have to decide about the invalid standard names, see #34 (comment)

About half of the variables have invalid standard names. Should we remove them for now?
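
One way to defer the decision would be to split the table rather than delete rows. A minimal sketch, assuming a set of valid CF standard names is already available: `valid_names` below is a tiny stand-in, not the real CF standard name table, and the second example row is made up for illustration.

```python
import pandas as pd

# Stand-in for the CF standard name table (assumption, not the real list)
valid_names = {"atmosphere_optical_thickness_due_to_ambient_aerosol_particles"}

# Example rows that mimic the layout of aerosol.csv
df = pd.DataFrame(
    {
        "out_name": ["od550aer", "examplevar"],
        "standard_name": [
            "atmosphere_optical_thickness_due_to_ambient_aerosol_particles",
            "not_a_cf_standard_name",
        ],
    }
)

is_valid = df["standard_name"].isin(valid_names)
valid = df[is_valid]
invalid = df[~is_valid]  # keep these aside for review instead of dropping them

print(invalid[["out_name", "standard_name"]])
```

Keeping the rejected rows in a separate frame makes it easy to restore them once their standard names are settled.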

pierrenabat and others added 3 commits December 4, 2025 10:49
Update aerosol standard_names for CF compliance
Update aerosol data request

3 participants