diff --git a/ModelOps Training.ipynb b/ModelOps Training.ipynb index fb62f9d..6ff1bef 100644 --- a/ModelOps Training.ipynb +++ b/ModelOps Training.ipynb @@ -2,18 +2,10 @@ "cells": [ { "cell_type": "markdown", - "id": "f6008b6e", "metadata": {}, "source": [ "## Setup\n", "\n", - "\n", - "Ensure you have the following packages and python libraries installed \n", - "\n", - "```code\n", - "pip install teradataml==17.0.0.4 aoa==6.1.0 pandas==1.1.5\n", - "```\n", - "\n", "The remainder of the notebook runs through the following steps\n", "\n", "- Connect to Vantage\n", @@ -21,10 +13,19 @@ "- Import Data\n" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "!{sys.executable} -m pip install \"aoa>=7.0.0rc3\" \"pandas>=1.1.5\"" + ] + }, { "cell_type": "code", "execution_count": 1, - "id": "0528bd6a", "metadata": {}, "outputs": [ { @@ -50,28 +51,35 @@ "host = input(\"Host:\")\n", "username = input(\"Username:\")\n", "password = getpass.getpass(\"Password:\")\n", + "database = input(\"Database (defaults to user):\")\n", + "\n", + "if not database:\n", + " database = username\n", "\n", "\n", - "engine = create_context(host=host, username=username, password=urllib.parse.quote(password), logmech=\"TDNEGO\")" + "engine = create_context(host=host, \n", + " username=username, \n", + " password=urllib.parse.quote(password), \n", + " logmech=\"TDNEGO\",\n", + " database=database)" ] }, { "cell_type": "markdown", - "id": "4eed19e0", "metadata": {}, "source": [ "### Create DDLs\n", "\n", "Create the following tables \n", "\n", - "- aoa_feature_metadata \n", + "- aoa_statistics_metadata \n", "- aoa_byom_models\n", "- pima_patient_predictions\n", "\n", - "`aoa_feature_metadata` is used to store the profiling metadata for the features so that we can consistently compute the data drift and model drift statistics. 
This table can also be created via the CLI by executing \n", + "`aoa_statistics_metadata` is used to store the profiling metadata for the features so that we can consistently compute the data drift and model drift statistics. This table can also be created via the CLI by executing \n", "\n", "```bash\n", - "aoa feature create-stats-table -m .\n", + "aoa feature create-stats-table -e -m .\n", "```\n", "\n", "`pima_patient_predictions` is used for storing the predictions of the model scoring for the demo use case" @@ -80,7 +88,6 @@ { "cell_type": "code", "execution_count": 2, - "id": "9875d156", "metadata": {}, "outputs": [ { @@ -102,7 +109,7 @@ "# Also note, if a shared datalab is being used, only one user should execute the following DDL/DML commands\n", "database = username\n", "\n", - "create_features_stats_table(f\"{database}.aoa_feature_metadata\")\n", + "create_features_stats_table(f\"{database}.aoa_statistics_metadata\")\n", "\n", "get_context().execute(f\"\"\"\n", "CREATE MULTISET TABLE {database}.aoa_byom_models\n", @@ -131,7 +138,6 @@ }, { "cell_type": "markdown", - "id": "b237d537", "metadata": {}, "source": [ "### Import Data\n", @@ -146,19 +152,30 @@ "\n", "`pima_patient_diagnoses` contains the diabetes diagnostic results for the patients.\n", "\n", - "`aoa_feature_metadata` contains the feature statistics data for the `pima_patient_features` and `pima_patient_diagnoses`\n", + "`aoa_statistics_metadata` contains the feature statistics metadata for the `pima_patient_features` and `pima_patient_diagnoses`\n", "\n", "Note the `pima_patient_feature` can be populated via the CLI by executing \n", "\n", + "Compute the statistics metadata for the continuous variables\n", "```bash\n", - "aoa feature compute-stats -s .PIMA -m . -t continuous -c numtimesprg,plglcconc,bloodp,skinthick,twohourserins,bmi,dipedfunc,age \n", + "aoa feature compute-stats \\\n", + " -s . \\\n", + " -m . 
\\\n", + " -t continuous -c numtimesprg,plglcconc,bloodp,skinthick,twohourserins,bmi,dipedfunc,age\n", + "```\n", + "\n", + "Compute the statistics metadata for the categorical variables\n", + "```bash\n", + "aoa feature compute-stats \\\n", + " -s . \\\n", + " -m . \\\n", + " -t categorical -c hasdiabetes\n", "```" ] }, { "cell_type": "code", "execution_count": 3, - "id": "07461699", "metadata": {}, "outputs": [], "source": [ @@ -196,16 +213,15 @@ " })\n", "\n", "# we can compute this from the CLI also - but lets import pre-computed for now.\n", - "df = pd.read_csv(\"data/aoa_feature_metadata.csv\")\n", + "df = pd.read_csv(\"data/aoa_statistics_metadata.csv\")\n", "copy_to_sql(df=df, \n", - " table_name=\"aoa_feature_metadata\", \n", + " table_name=\"aoa_statistics_metadata\", \n", " schema_name=database,\n", " if_exists=\"append\")\n" ] }, { "cell_type": "markdown", - "id": "2b0cdd53", "metadata": {}, "source": [ "## ModelOps UI\n", @@ -242,7 +258,7 @@ " - Description: PIMA Diabetes\n", " - Feature Catalog: Vantage\n", " - Database: {your-db}\n", - " - Table: aoa_feature_metadata\n", + " - Table: aoa_statistics_metadata\n", " - Features\n", " - Query: `SELECT * FROM {your-db}.pima_patient_features`\n", " - Entity Key: PatientId\n", @@ -255,7 +271,6 @@ " - Database: {your-db}\n", " - Table: pima_patient_predictions\n", " - Entity Selection: `SELECT * FROM pima_patient_features WHERE patientid MOD 5 = 0`\n", - " - BYOM Target Column: `CAST(CAST(json_report AS JSON).JSONExtractValue('$.predicted_HasDiabetes') AS INT)`\n", " \n", " \n", "- create training dataset\n", @@ -307,84 +322,24 @@ }, { "cell_type": "markdown", - "id": "17a64068", "metadata": {}, "source": [ "#### View Predictions\n", "\n", - "In the next version of ModelOps, you will be able to view the predictions that follow the standard pattern directly via the UI. However, for now, we can view it here. 
As the same predictions table contains the predictions for all the jobs, we filter by the `airflow_job_id`. You can find this id in the UI under deployment executions." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "904b2fb9", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
job_idPatientIdHasDiabetesjson_report
\n", - "
" - ], - "text/plain": [ - "Empty DataFrame\n", - "Columns: [job_id, PatientId, HasDiabetes, json_report]\n", - "Index: []" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import pandas as pd\n", - "from teradataml import get_connection\n", - "\n", - "pd.options.display.max_colwidth = 250\n", - "\n", - "airflow_job_id = \"5761d5c1-bf57-456b-8076-c3062be0b544-scheduled__2022-07-11T00:00:00+00:00\"\n", + "In the UI, select a deployment from the deployments left-hand navigation. Go to the Jobs tab and, on the right-hand side of each job execution, select \"View Predictions\". This shows a sample of the predictions for that particular job execution.\n", "\n", - "pd.read_sql(f\"SELECT TOP 5 * FROM pima_patient_predictions WHERE job_id='{airflow_job_id}'\", get_connection())" + "Note: your predictions table must have a `job_id` column that matches the execution job ID. If using BYOM, this is done automatically. For your own `scoring.py`, check out the demo models." 
] }, { "cell_type": "markdown", - "id": "d479c9cb", "metadata": {}, "source": [ "## CLI \n", "\n", "\n", "```bash\n", - "pip install aoa==6.1.0\n", + "pip install \"aoa>=7.0.0rc3\"\n", "```\n", "\n", "##### Copy CLI Config\n", @@ -437,7 +392,6 @@ { "cell_type": "code", "execution_count": null, - "id": "b63bd4d5", "metadata": {}, "outputs": [], "source": [] @@ -445,7 +399,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -459,7 +413,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.10" + "version": "3.9.12" } }, "nbformat": 4, diff --git a/data/aoa_feature_metadata.csv b/data/aoa_feature_metadata.csv deleted file mode 100644 index 23c50eb..0000000 --- a/data/aoa_feature_metadata.csv +++ /dev/null @@ -1,9 +0,0 @@ -column_name,stats,update_ts -twohourserins,"{""edges"": [0.0, 84.6, 169.2, 253.79999999999998, 338.4, 423.0, 507.59999999999997, 592.1999999999999, 676.8, 761.4, 846.0]}",2022-05-04 07:53:04.790 -skinthick,"{""edges"": [0.0, 9.9, 19.8, 29.700000000000003, 39.6, 49.5, 59.400000000000006, 69.3, 79.2, 89.10000000000001, 99.0]}",2022-05-04 07:53:04.790 -age,"{""edges"": [21.0, 27.0, 33.0, 39.0, 45.0, 51.0, 57.0, 63.0, 69.0, 75.0, 81.0]}",2022-05-04 07:53:04.790 -numtimesprg,"{""edges"": [0.0, 1.7, 3.4, 5.1, 6.8, 8.5, 10.2, 11.9, 13.6, 15.299999999999999, 17.0]}",2022-05-04 07:53:04.790 -plglcconc,"{""edges"": [0.0, 19.9, 39.8, 59.699999999999996, 79.6, 99.5, 119.39999999999999, 139.29999999999998, 159.2, 179.1, 199.0]}",2022-05-04 07:53:04.790 -bmi,"{""edges"": [0.0, 6.71, 13.42, 20.13, 26.84, 33.55, 40.26, 46.97, 53.68, 60.39, 67.1]}",2022-05-04 07:53:04.790 -bloodp,"{""edges"": [0.0, 12.2, 24.4, 36.599999999999994, 48.8, 61.0, 73.19999999999999, 85.39999999999999, 97.6, 109.8, 122.0]}",2022-05-04 07:53:04.790 -dipedfunc,"{""edges"": [0.07, 0.31, 0.55, 0.78, 1.01, 1.25, 1.48, 1.72, 1.95, 2.19, 
2.42]}",2022-05-04 07:53:04.790 diff --git a/data/aoa_statistics_metadata.csv b/data/aoa_statistics_metadata.csv new file mode 100644 index 0000000..4b9be58 --- /dev/null +++ b/data/aoa_statistics_metadata.csv @@ -0,0 +1,10 @@ +"column_name","column_type","stats","update_ts" +twohourserins,continuous,"{""edges"": [0.0, 84.6, 169.2, 253.79999999999998, 338.4, 423.0, 507.59999999999997, 592.1999999999999, 676.8, 761.4, 846.0]}","2022-11-16 18:09:48.220000" +skinthick,continuous,"{""edges"": [0.0, 9.9, 19.8, 29.700000000000003, 39.6, 49.5, 59.400000000000006, 69.3, 79.2, 89.10000000000001, 99.0]}","2022-11-16 18:09:48.220000" +age,continuous,"{""edges"": [21.0, 27.0, 33.0, 39.0, 45.0, 51.0, 57.0, 63.0, 69.0, 75.0, 81.0]}","2022-11-16 18:09:48.220000" +hasdiabetes,categorical,"{""categories"": [""0"", ""1""]}","2022-11-16 20:01:01.790000" +plglcconc,continuous,"{""edges"": [0.0, 19.9, 39.8, 59.699999999999996, 79.6, 99.5, 119.39999999999999, 139.29999999999998, 159.2, 179.1, 199.0]}","2022-11-16 18:09:48.220000" +bmi,continuous,"{""edges"": [0.0, 6.71, 13.42, 20.13, 26.84, 33.55, 40.26, 46.97, 53.68, 60.39, 67.1]}","2022-11-16 18:09:48.220000" +numtimesprg,continuous,"{""edges"": [0.0, 1.7, 3.4, 5.1, 6.8, 8.5, 10.2, 11.9, 13.6, 15.299999999999999, 17.0]}","2022-11-16 18:09:48.220000" +dipedfunc,continuous,"{""edges"": [0.07, 0.31, 0.55, 0.78, 1.01, 1.25, 1.48, 1.72, 1.95, 2.19, 2.42]}","2022-11-16 18:09:48.220000" +bloodp,continuous,"{""edges"": [0.0, 12.2, 24.4, 36.599999999999994, 48.8, 61.0, 73.19999999999999, 85.39999999999999, 97.6, 109.8, 122.0]}","2022-11-16 18:09:48.220000"
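
Review note: the new `data/aoa_statistics_metadata.csv` adds a `column_type` column, and the JSON in `stats` now holds either `edges` (continuous) or `categories` (categorical). The sketch below is one way to sanity-check that layout with the Python standard library only. The `load_stats` helper and the `SAMPLE` constant are hypothetical names introduced for illustration and are not part of the `aoa` package; the two data rows are copied from the file added in this change.

```python
import csv
import io
import json

# Header plus two rows copied from data/aoa_statistics_metadata.csv in this change.
SAMPLE = '''"column_name","column_type","stats","update_ts"
age,continuous,"{""edges"": [21.0, 27.0, 33.0, 39.0, 45.0, 51.0, 57.0, 63.0, 69.0, 75.0, 81.0]}","2022-11-16 18:09:48.220000"
hasdiabetes,categorical,"{""categories"": [""0"", ""1""]}","2022-11-16 20:01:01.790000"
'''


def load_stats(text):
    """Hypothetical helper: map column_name -> (column_type, parsed stats dict)."""
    rows = csv.DictReader(io.StringIO(text))
    return {
        row["column_name"]: (row["column_type"], json.loads(row["stats"]))
        for row in rows
    }


stats = load_stats(SAMPLE)
for name, (kind, payload) in stats.items():
    # Assumption drawn from the rows in this diff: continuous columns carry
    # "edges", categorical columns carry "categories".
    key = "edges" if kind == "continuous" else "categories"
    print(name, kind, len(payload[key]))
```

Running the same helper over the full file contents would flag any row whose `stats` payload does not match its declared `column_type`, since the lookup of the expected key would raise a `KeyError`.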