diff --git a/docs/index.rst b/docs/index.rst
index 4ba93df..ee35e28 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -22,7 +22,7 @@ User Guide
    Installing astrodb-utils
    pages/loading/index
    pages/querying_existing_db/index
-   pages/ingesting/getting_started_ingesting
+   pages/ingesting/index
    pages/modifying/index
    pages/make_new_db/index
    pages/template_repo/index
diff --git a/docs/pages/ingesting/getting_started_ingesting.rst b/docs/pages/ingesting/index.rst
similarity index 89%
rename from docs/pages/ingesting/getting_started_ingesting.rst
rename to docs/pages/ingesting/index.rst
index 2b61d48..099c862 100644
--- a/docs/pages/ingesting/getting_started_ingesting.rst
+++ b/docs/pages/ingesting/index.rst
@@ -6,10 +6,10 @@ Ingesting and Modifying Data
    :maxdepth: 2
    :titlesonly:
 
+   ingest_scripts/index
    ingesting_publications
    spectra/*
-   ingest_scripts
-
+
 
 .. note::
@@ -26,4 +26,4 @@ API Documentation
     Function to ingest alternative name data
 
 :py:mod:`astrodb_utils.utils.ingest_instrument`
-    Function to ingest instrument data
\ No newline at end of file
+    Function to ingest instrument data
diff --git a/docs/pages/ingesting/ingest_scripts.rst b/docs/pages/ingesting/ingest_scripts.rst
deleted file mode 100644
index b788950..0000000
--- a/docs/pages/ingesting/ingest_scripts.rst
+++ /dev/null
@@ -1,66 +0,0 @@
-Ingest Scripts
-==============
-Ingest scripts can be used to add a bunch of data to the database at once.
-Often ingests are performed by reading in a file (e.g., csv) that contains
-a table of data and then ingesting each row of the table into the database.
-Below is an example script for ingesting sources discovered by
-Rojas et al. 2012 into the SIMPLE Archive from a .csv file
-that has columns named `name`, `ra`, `dec`.
-
-.. code-block:: python
-
-    from astropy.io import ascii
-    from simple.schema import REFERENCE_TABLES
-    from astrodb_utils import load_astrodb, logger, AstroDBError
-    from astrodb_utils.sources import ingest_source
-    from astrodb_utils.publications import ingest_publication
-
-    SAVE_DB = False  # Set to True to write out the JSON files at the end of the script
-    RECREATE_DB = True  # Set to True to recreate the database from the JSON files
-
-    # Load the database
-    db = load_astrodb("SIMPLE.sqlite",
-                      recreatedb=RECREATE_DB,
-                      reference_tables=REFERENCE_TABLES,
-                      felis_schema="simple/schema.yaml",
-                      )
-
-
-    def ingest_pubs(db):
-        # Ingest discovery publication
-        ingest_publication(
-            db,
-            doi="10.1088/0004-637X/748/2/93"
-        )
-
-    def ingest_sources(db):
-        # read the csv data into an astropy table
-        data_table = ascii.read(file.csv, format="csv")
-
-        n_added = 0
-        n_skipped = 0
-
-        for source in data_table:
-            try:
-                ingest_source(
-                    db,
-                    source=data_table['name'],
-                    ra=data_table['ra'],
-                    dec=data_table['dec'],
-                    reference="Roja12",
-                )
-                logger.info(f"Source {source['name']} ingested.")
-                n_added += 1
-            except AstroDBError as e:
-                logger.warning(f"Error ingesting source {source['name']}: {e}")
-                n_skipped += 1
-                continue
-
-
-    ingest_pubs(db)
-    ingest_sources(db)
-
-    logger.info(f"Added {n_added} sources, skipped {n_skipped} sources.")
-
-    if DB_SAVE:
-        db.save()
diff --git a/docs/pages/ingesting/ingest_scripts/index.rst b/docs/pages/ingesting/ingest_scripts/index.rst
new file mode 100644
index 0000000..52b261e
--- /dev/null
+++ b/docs/pages/ingesting/ingest_scripts/index.rst
@@ -0,0 +1,136 @@
+Ingest Scripts
+==============
+
+.. toctree::
+   :glob:
+   :maxdepth: 1
+
+   writing_scripts
+
+Ingest scripts can be used to add a bunch of data to the database at once.
+Ingest scripts also aid in reproducibility since they document exactly how
+data was added to the database.
+They can also be reused later to add similar data.
+
+
+Loading the Database
+--------------------
+
+.. code-block:: python
+
+    from astrodb_utils import build_db_from_json
+
+    db = build_db_from_json(settings_file="path/to/database.toml")
+
+First, we need to load our database using the
+:py:func:`astrodb_utils.loaders.build_db_from_json` function.
+This function takes a settings file (in TOML format) that contains
+information about our database, including its name.
+The ``build_db_from_json`` function performs a full rebuild of the
+database from the JSON data files,
+essentially reconstructing it from scratch.
+
+
+Setting Up Your Data
+--------------------
+
+Often ingests are performed by reading in a file (e.g., csv) that contains a
+table of data and then ingesting each row of the table into the database.
+It is therefore important to read your data into a format that is easy
+to work with, such as an `Astropy Table <https://docs.astropy.org/en/stable/table/>`_
+or pandas DataFrame.
+
+Here is an example of reading in a csv file using Astropy's ascii module:
+
+.. code-block:: python
+
+    from astropy.io import ascii
+
+    L6T6_link = (
+        "scripts/ingests/zjzhang/L6_to_T6_benchmarks08062025.csv"
+    )
+
+    L6T6_table = ascii.read(
+        L6T6_link,
+        format="csv",
+        data_start=1,
+        header_start=0,
+        guess=False,
+        fast_reader=False,
+        delimiter=",",
+    )
+
+First, we define a variable that points to the location of our data file,
+which we then use to read the file in as an Astropy Table.
+Here, we specify that our file is in csv format and provide additional
+parameters to ensure the file is read correctly.
+For example, ``data_start`` and ``header_start`` specify which rows contain
+the data and the header, respectively, while ``delimiter`` indicates that
+the file is comma-separated.
+The resulting ``L6T6_table`` variable is now an Astropy Table object that
+contains all the data from the csv file, which we can then loop through
+to ingest each row into the database.
+
+There are many ways to read in data files in Python, so feel free to use
+other libraries or methods that you are comfortable with, such as pandas.
+
+Another Example Ingest Script
+-----------------------------
+Below is an example script for ingesting sources discovered by
+Rojas et al. 2012 into the SIMPLE Archive from a .csv file
+that has columns named `name`, `ra`, `dec`.
+
+.. code-block:: python
+
+    from astropy.io import ascii
+    from astrodb_utils import AstroDBError
+    from astrodb_utils.loaders import build_db_from_json
+    from astrodb_utils.sources import ingest_source
+    from astrodb_utils.publications import ingest_publication
+
+    DB_SAVE = False  # Set to True once the script runs without errors and all sources can be ingested
+
+    # Load the database
+    db = build_db_from_json(settings_file="path/to/database.toml")
+
+    # Set the logger level to control how much output is shown
+    import logging
+    logger = logging.getLogger("astrodb_utils")
+    logger.setLevel(logging.INFO)  # Set to DEBUG for more verbosity
+
+    def ingest_pubs(db):
+        # Ingest discovery publication
+        ingest_publication(
+            db,
+            doi="10.1088/0004-637X/748/2/93"
+        )
+
+
+    def ingest_sources(db):
+        # Read the csv data into an astropy table
+        data_table = ascii.read("file.csv", format="csv")
+
+        n_added = 0
+        n_skipped = 0
+
+        for source in data_table:
+            try:
+                ingest_source(
+                    db,
+                    source=source['name'],
+                    ra=source['ra'],
+                    dec=source['dec'],
+                    reference="Roja12",
+                    raise_error=True,
+                )
+                n_added += 1
+            except AstroDBError as e:
+                logger.warning(f"Error ingesting source {source['name']}: {e}")
+                n_skipped += 1
+                continue
+
+        print(f"Added {n_added} sources, skipped {n_skipped} sources.")
+
+    ingest_pubs(db)
+    ingest_sources(db)
+
+    if DB_SAVE:
+        db.save()
diff --git a/docs/pages/ingesting/ingest_scripts/writing_scripts.rst b/docs/pages/ingesting/ingest_scripts/writing_scripts.rst
new file mode 100644
index 0000000..400f494
--- /dev/null
+++ b/docs/pages/ingesting/ingest_scripts/writing_scripts.rst
@@ -0,0 +1,122 @@
+Writing Scripts
+===============
+
+When writing ingest scripts, there are two ways to go about it:
+using the existing ingest functions from `astrodb_utils` or using SQLAlchemy
+commands directly.
+
+
+Using Existing Ingest Functions
+-------------------------------
+Using the existing ingest functions helps streamline the process of writing
+an ingest script.
+However, only a few ingest functions exist, namely for sources, names, and
+instruments.
+If your data fits into one of these categories, it is recommended
+to use the existing functions.
+
+Below is an example of how to use the
+:py:func:`astrodb_utils.sources.ingest_source` function to ingest source
+data into the database:
+
+.. code-block:: python
+
+    for source in bones_sheet_table:
+        ingest_source(
+            db,
+            source=source["NAME"],
+            reference=reference[1],
+            ra=source["RA"],
+            dec=source["DEC"],
+            raise_error=True,
+            search_db=True,
+            comment="Discovery reference from the BONES archive",
+        )
+
+Note that the basic structure for any ingest is looping through each row of
+your data table and ingesting each row into the database with
+the relevant parameters.
+Each ingest function has different required and optional parameters,
+so be sure to check the API documentation for details.
+
+
+Using SQLAlchemy Commands
+-------------------------
+If there is no existing ingest function for your data type, you can use
+SQLAlchemy commands to insert data directly into the database.
+
+Below is an example of how to ingest modeled parameters data into the
+database using SQLAlchemy commands:
+
+.. code-block:: python
+
+    with db.engine.connect() as conn:
+        for row in L6T6_table:
+            conn.execute(
+                db.ModeledParameters.insert().values(
+                    {
+                        "source": row["NAME"],
+                        "model": row["MODEL"],
+                        "parameter": row["PARAM"],
+                        "value": row["VAL"],
+                        "upper_error": row["UPP_ERR"],
+                        "lower_error": row["LOW_ERR"],
+                        "unit": row["UNIT"],
+                        "comments": "Ingested from compilation by Zhang et al. (2020ApJ...891..171Z)",
+                        "reference": row["REF"],
+                    }
+                )
+            )
+        conn.commit()
+
+Here, we follow the same pattern of looping through each row of our data
+table, using an insert command to add each row to the database.
+Note that we open the connection once, outside the loop, and use the row
+variable (``row``) rather than the full table when building each insert.
+
+Since there is no existing ingest function, there are a few things to keep
+in mind. For example, make sure to change the table name after ``db.`` to
+the table you are ingesting into.
+
+It is also important to reference the schema to ensure your code matches the
+database structure. For example, make sure that the column names inside the
+``values()`` method match exactly with the column names in the database schema.
+Additionally, the schema, which is available in your code under the utils
+folder, indicates which columns are required versus optional (check
+``nullable`` for the column you are referencing), so be sure to include all
+required columns in your code to avoid errors. Finally, make sure to commit
+the changes to the database after executing the commands with
+``conn.commit()``.
+
+Logging Setup
+-------------
+
+When working with data ingestion scripts or database-building workflows,
+it is important to have a reliable way to understand what the script is
+doing internally.
+Python's built-in logging module provides a structured system for
+reporting events, progress updates, and errors during execution.
+
+.. code-block:: python
+
+    import logging
+
+    logger = logging.getLogger("AstroDB")
+    logger.setLevel(logging.INFO)
+
+Instantiating a logger for your script gives you an easy way to track what
+the script is doing: database loading, ingest errors, warnings, etc.
+
+The line ``logger.setLevel(logging.INFO)`` configures the logger to display
+only log messages at level INFO or higher.
+Python provides multiple logging levels, including:
+
+* DEBUG: extremely detailed diagnostic output
+* INFO: general runtime information
+* WARNING: unexpected events that do not stop execution
+* ERROR: serious problems that prevent part of the script from running
+* CRITICAL: errors severe enough to stop execution entirely
+
+Database ingestion often involves many operations happening quickly, so
+setting the level prevents you from being flooded with low-level
+debugging messages.
+This filters out unimportant information, making the output easier to read
+and ingestion problems easier to diagnose.
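The level filtering described above can be seen directly: with the logger set to INFO, DEBUG messages are suppressed while INFO and above pass through. This is a minimal sketch; the handler setup and message text are illustrative.

```python
import logging

logger = logging.getLogger("AstroDB")
logger.setLevel(logging.INFO)

# A handler controls where messages go; here, standard error
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
logger.addHandler(handler)

logger.debug("Row 17: parsed ra=10.5 dec=-23.1")  # suppressed at INFO level
logger.info("Loaded sources from file.csv")       # shown
logger.warning("Skipping duplicate source")       # shown
```

Switching the level to ``logging.DEBUG`` during development makes the suppressed per-row messages visible again without touching any other code.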