From b18864a780022caa9a6841ff83958919fc9d9eb9 Mon Sep 17 00:00:00 2001 From: Todd Pihl Date: Tue, 30 Sep 2025 16:06:01 -0400 Subject: [PATCH] minor --- .../DataHubAPIDemo-checkpoint.ipynb | 1492 +++++++++++++++++ __pycache__/DH_Queries.cpython-312.pyc | Bin 4674 -> 4659 bytes 2 files changed, 1492 insertions(+) create mode 100644 .ipynb_checkpoints/DataHubAPIDemo-checkpoint.ipynb diff --git a/.ipynb_checkpoints/DataHubAPIDemo-checkpoint.ipynb b/.ipynb_checkpoints/DataHubAPIDemo-checkpoint.ipynb new file mode 100644 index 0000000..e563747 --- /dev/null +++ b/.ipynb_checkpoints/DataHubAPIDemo-checkpoint.ipynb @@ -0,0 +1,1492 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "203a18cb-807b-426a-8dff-14463bef2438", + "metadata": { + "jp-MarkdownHeadingCollapsed": true + }, + "source": [ + "\n", + "# Quick Links\n", + "[Introduction](#introduction)\\\n", + "[Prerequisites](#prerequisites)\\\n", + "[Security Note](#securityNote)\\\n", + "[Pagination Note](#paginationNote)\\\n", + "[Step 1: Understanding the Landscape](#step1)\\\n", + "[Step 2: Creating a new submission or using an existing submission](#step2)\n", + "- [Creating a new submission](#newSubmission)\n", + "- [Working with existing submissions](#existingSubmission)\n", + "\n", + "[Step 3: Uploading Submission templates](#step3)\\\n", + "[Step 4: Running the Validations](#step4)\\\n", + "[Step 5: Submitting, Canceling, or Withdrawing](#step5)\n" + ] + }, + { + "cell_type": "markdown", + "id": "796de190-d963-41ec-8378-a9b6cec3b4b1", + "metadata": {}, + "source": [ + "# Introduction\n", + "This notebook walks through the basics of using the Data Submission Portal API to work on, validate, and submit your data. These APIs are designed to allow users to perform all the actions that can be done via the [Data Submission Portal](https://hub.datacommons.cancer.gov/) from a notebook or script. 
The intent is to allow submitters to operate directly from their own environments if they so choose rather than work through the graphical submission interface.\n", + "\n", + "There are a few prerequisites that you have to meet before you can use this API:\n", + "\n", + "\n", + "# Prerequisites\n", + "## GraphQL\n", + "The Data Submission Portal API uses [GraphQL](https://graphql.org/) and a good understanding of how to use GraphQL is required. Since GraphQL can be complex, a tutorial is beyond the scope of this document; however, the [GraphQL Documentation](https://graphql.org/learn/) can be very useful.\n", + "\n", + "## Login.gov account\n", + "Use of the Data Submission Portal in general requires that a user have an account registered with [Login.gov](https://www.login.gov/) (NIH users can use their NIH account and PIV card). Note that a Login.gov account is distinct from an eRA Commons identity that is frequently used at NIH. They are not the same thing.\n", + "\n", + "## Approved Submission\n", + "You must receive approval to submit data to CRDC prior to using the Data Submission Portal APIs. If you need approval, please read and follow the [Submissions Request Instructions](https://datacommons.cancer.gov/submit). Instructions for using the graphical data submission process are on the same page.\n", + "\n", + "## An API Token\n", + "If you are an approved submitter with a Login.gov or NIH account, you can generate an API token from the graphical interface. Log into the system, then click on your user name and select the **API Token** menu option. This will bring up a dialog box that allows you to create an API token and copy it to your clipboard. There are two things to note about API tokens:\n", + "- The token is tied to your user identity and can be used on any submission that you're approved to work on.\n", + "- You can have only one token at a time. 
Generating a new token will revoke the previous token.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "id": "43cfe135-2892-4989-a90b-3560512ccebf", + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "import os\n", + "from sys import platform" + ] + }, + { + "cell_type": "markdown", + "id": "1f1454ae-b351-44ac-82c0-de89600e4094", + "metadata": {}, + "source": [ + "The imports below are just used for display purposes in this notebook; they're not required to interact with the Data Submission Portal API" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "id": "f39da5f4-1a17-4ee8-bf6b-1925631ef734", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "from IPython.display import display, Markdown, Latex" + ] + }, + { + "cell_type": "markdown", + "id": "a47e1712-b36a-445f-9822-3946d9e5626f", + "metadata": {}, + "source": [ + "\n", + "# Security Note\n", + "It is ***highly*** recommended that you keep your API token secure and not include it in any code. While there are many ways to do this, for the purposes of this notebook it's been set in an environment variable named \"STAGEAPI\"." + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "id": "01bdddfe-4b2f-4bb8-8762-2bc9411054fe", + "metadata": {}, + "outputs": [], + "source": [ + "def apiQuery(tier, query, variables):\n", + " if tier == 'prod':\n", + " url = 'https://hub.datacommons.cancer.gov/api/graphql'\n", + " token = os.environ['PRODAPI']\n", + " elif tier == 'stage':\n", + " #Note that use of Stage is for example purposes only; actual submissions should use the production URL. 
If you wish to run tests on Stage, please contact the helpdesk.\n", + " url = 'https://hub-stage.datacommons.cancer.gov/api/graphql'\n", + " token = os.environ['STAGEAPI']\n", + " else:\n", + " return('Please provide either \"stage\" or \"prod\" as tier values')\n", + " headers = {\"Authorization\": f\"Bearer {token}\"}\n", + " try:\n", + " if variables is None:\n", + " result = requests.post(url = url, headers = headers, json={\"query\": query})\n", + " else:\n", + " result = requests.post(url = url, headers = headers, json = {\"query\":query, \"variables\":variables})\n", + " if result.status_code == 200:\n", + " return result.json()\n", + " else:\n", + " print(f\"Error: {result.status_code}\")\n", + " return result.content\n", + " except requests.exceptions.HTTPError as e:\n", + " return(f\"HTTP Error: {e}\")" + ] + }, + { + "cell_type": "markdown", + "id": "c81a661f-9ced-4724-8004-351617bd1c79", + "metadata": {}, + "source": [ + "\n", + "# Pagination Note\n", + "Most queries that return results are paginated and need to be checked to make sure all results are retrieved. The number of available results from a query is found in the **total** field, which can be returned if requested, and pagination can be done using the **first** and **offset** fields in queries. We won't be highlighting pagination in this notebook, but it can be a critical tool for fully understanding your submissions. If the **first** field is not included in a query, the system defaults to returning the first 10 results.\n", + "\n", + "- **first**: The number of records to be returned. If first is set to -1, the API will return all results.\n", + "- **offset**: The number of records to be skipped when returning results."
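The **first**/**offset** fields above can be combined into a simple retrieval loop. A minimal, generic sketch — the `paginate` helper and `fake_fetch` stand-in are illustrative only, not part of the Data Submission Portal API; in practice each page would come from `apiQuery` with **first** and **offset** passed as GraphQL variables:

```python
# Sketch: generic first/offset pagination. 'fetch_page' stands in for a
# call to apiQuery with the real query; names here are illustrative.
def paginate(fetch_page, page_size=10):
    """Collect every record from a paginated endpoint.

    fetch_page(first, offset) must return a dict with 'total' (int)
    and 'records' (list), mirroring the 'total' field described above.
    """
    records = []
    offset = 0
    while True:
        page = fetch_page(first=page_size, offset=offset)
        records.extend(page["records"])
        offset += page_size
        if offset >= page["total"]:
            break
    return records

# Tiny in-memory stand-in for the API so the sketch is runnable:
data = list(range(25))
def fake_fetch(first, offset):
    return {"total": len(data), "records": data[offset:offset + first]}

all_records = paginate(fake_fetch)  # collects all 25 stand-in records
```

Setting **first** to -1 (as documented above) avoids the loop entirely when the result set is small enough to fetch in one call.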
+ ] + }, + { + "cell_type": "markdown", + "id": "e929828f-fef2-4fb7-8ffd-6e3ec7d73d2b", + "metadata": {}, + "source": [ + "\n", + "# Step 1: Understanding the Landscape\n", + "Let's assume that this is our first submission using the API, so the first thing we need to do is list the studies we're approved to submit to. That's done with the *getMyUser* query. This query can return more information about an account (such as the status); however, for this situation we'll focus on the studies that are available." + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "id": "d6802fa1-7b99-4ce3-a867-b0ad6b83b503", + "metadata": {}, + "outputs": [], + "source": [ + "study_query = \"\"\"\n", + "{\n", + " getMyUser {\n", + " userStatus\n", + " studies {\n", + " _id\n", + " controlledAccess\n", + " createdAt\n", + " dbGaPID\n", + " studyName\n", + " studyAbbreviation\n", + " }\n", + " }\n", + "}\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "39d8036d-5474-4bbb-b2fc-1d4f91c63fad", + "metadata": {}, + "source": [ + "Note that the actual results returned by this query will vary for each organization. These are example results only and shouldn't be used."
+ ] + }, + { + "cell_type": "code", + "execution_count": 62, + "id": "122289b5-c6c0-4a34-b865-a1fc82f57595", + "metadata": {}, + "outputs": [], + "source": [ + "study_res = apiQuery('stage', study_query, None)" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "id": "aab6c9cf-0724-4961-a0de-a5ee8f636030", + "metadata": {}, + "outputs": [ + { + "data": { + "text/markdown": [ + "| | _id | controlledAccess | createdAt | dbGaPID | studyName | studyAbbreviation |\n", + "|---:|:-------------------------------------|:-------------------|:-------------------------|:----------|:------------------|:--------------------|\n", + "| 0 | 0f92fd6d-3a0e-4f0f-b057-182bdc04cc6f | True | 2025-03-05T16:05:00.556Z | phs001234 | UAT Studies | UATS |\n", + "| 1 | 5d0d0213-7358-4ef1-8355-abba01b2cc3a | True | 2025-03-05T17:40:37.755Z | phs0002 | API Example Study | AES |" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "study_df = pd.DataFrame(study_res['data']['getMyUser']['studies'])\n", + "display(Markdown(study_df.to_markdown()))" + ] + }, + { + "cell_type": "markdown", + "id": "4a086991-e887-4d34-a85e-b4c4fdd3bf67", + "metadata": {}, + "source": [ + "And just as a check, let's have a quick look at the user status. If the status is **Active**, the submission can proceed. 
If the status is **Inactive**, you will have to contact the Help Desk (or your submission contact) to get the status set to **Active**." + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "id": "2fd7906f-94fd-4217-b246-c1246d76c1fb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "User status is: Active\n" + ] + } + ], + "source": [ + "print(f\"User status is: {study_res['data']['getMyUser']['userStatus']}\")" + ] + }, + { + "cell_type": "markdown", + "id": "f08abf99-9543-46ad-abfa-1cca90c88a1b", + "metadata": {}, + "source": [ + "\n", + "# Step 2: Creating a new submission or using an existing submission\n", + "The next step in the process is to either create a new submission or to use one of your existing submissions. It is not necessary to create a new submission every time; if you have an existing submission that you need to continue working on, simply start using that submission. " + ] + }, + { + "cell_type": "markdown", + "id": "397c83f4-c77f-4951-833d-4053a422d08b", + "metadata": {}, + "source": [ + "\n", + "## Step 2, Alternate 1: Creating a new submission\n", + "For the purposes of this demonstration, we'll use the **API Example Study** as the example. \n", + "\n", + "In order to submit data, your first step is to create a new submission within the study. **Do not do this if you're continuing with an existing submission.**\n", + " \n", + "From the data we obtained in the first query, we'll have to parse out the information that's relevant to the **API Example Study**. 
We'll need these to construct the query that creates the new submission." + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "id": "c140b0b8-598a-4a00-999f-0750f52a9384", + "metadata": {}, + "outputs": [], + "source": [ + "abbrev = 'AES'\n", + "for study in study_res['data']['getMyUser']['studies']:\n", + " if study['studyAbbreviation'] == abbrev:\n", + " dbgap = study['dbGaPID']\n", + " name = study['studyName']\n", + " studyid = study['_id']\n", + "dc = \"CDS\"\n", + "name = \"Stage API Sub Test 3\" # submission name; intentionally overwrites the studyName parsed above\n", + "intention = \"New/Update\"\n", + "datatype = \"Metadata and Data Files\"\n" + ] + }, + { + "cell_type": "markdown", + "id": "b4fbfeb7-5de1-45ab-a692-5e44f83f5719", + "metadata": {}, + "source": [ + "### createSubmission mutation\n", + "\n", + "Creating submissions requires the use of a mutation that calls createSubmission. There are multiple required variables that have to be provided in a GraphQL-compatible way:\n", + "- **studyID**: This is the assigned Study ID that can be obtained from the **_id** field in the *getMyUser* query\n", + "- **dbGaPID**: Obtained when registering the study at dbGaP. This is required for all controlled access studies\n", + "- **dataCommons**: This is the CRDC Data Commons the submissions will be deposited into\n", + "- **name**: This can be anything that allows you to identify this specific submission\n", + "- **intention**: Can be *New/Update* if you are adding information to the submission or *Delete* if you are removing information from an earlier, completed submission\n", + "- **dataType**: Can be either *Metadata and Data Files* or *Metadata Only*. Which one is selected depends on whether or not data files will be included in the submission\n", + "\n", + "This query will return the **_id** field, which will be the newly created submission ID. It will also return a number of other fields that can be checked to make sure the submission was created properly."
+ ] + }, + { + "cell_type": "code", + "execution_count": 66, + "id": "a68dec46-8ee1-4942-a7a3-8b3298acd530", + "metadata": {}, + "outputs": [], + "source": [ + "create_submission_query = \"\"\"\n", + "mutation CreateNewSubmission(\n", + " $studyID: String!,\n", + " $dataCommons: String!,\n", + " $name: String!,\n", + " $intention:String!,\n", + " $dataType: String!,\n", + "){\n", + " createSubmission(\n", + " studyID: $studyID,\n", + " dataCommons: $dataCommons,\n", + " name: $name,\n", + " intention: $intention,\n", + " dataType: $dataType\n", + " ){\n", + " _id\n", + " studyID\n", + " dbGaPID\n", + " dataCommons\n", + " name\n", + " intention\n", + " dataType\n", + " status\n", + " }\n", + "}\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "id": "632a8907-6fa0-456d-878b-c01da43acd34", + "metadata": {}, + "outputs": [], + "source": [ + "variables = {\"studyID\":studyid, \"dataCommons\":dc, \"name\":name, \"intention\":intention,\"dataType\":datatype}" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "id": "5116aa6d-b0ca-43ef-bf1e-080cd35155f9", + "metadata": {}, + "outputs": [], + "source": [ + "create_res = apiQuery('stage',create_submission_query, variables)" + ] + }, + { + "cell_type": "markdown", + "id": "4187daa2-dba6-4508-aee8-ae19ab831154", + "metadata": {}, + "source": [ + "Parse out the submission ID since we'll need it later" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "id": "0ecd2d52-cf09-4f92-9a6b-f128c5d58cef", + "metadata": {}, + "outputs": [], + "source": [ + "submissionid = create_res['data']['createSubmission'][\"_id\"]\n", + "subname = create_res['data']['createSubmission']['name']" + ] + }, + { + "cell_type": "markdown", + "id": "261fc49a-6a3c-4f17-a7c3-78944ff3d79b", + "metadata": {}, + "source": [ + "#### Side trip\n", + "\n", + "At this point if you go to the graphical interface you should see that a new submission has been created using the name provided in the query" + ] + }, + { + 
"cell_type": "markdown", + "id": "2320bc97-b1dd-4c20-8073-97f48a6c204c", + "metadata": {}, + "source": [ + "\n", + "## Step 2, Alternate 2: Working with existing submissions\n", + "If you already have submissions in the Data Submission Portal that you've been working with, you can continue to work with them instead of creating a new submission. To continue work on a submission, you will first have to identify the submissions using the *listSubmissions* query.\n", + "\n", + "The listSubmissions query requires that **status** be provided as a parameter. The status can be any combination of:\n", + "- All\n", + "- New\n", + "- In Progress\n", + "- Submitted\n", + "- Released\n", + "- Completed\n", + "- Archived\n", + "- Canceled\n", + "- Rejected\n", + "- Withdrawn\n", + "- Deleted\n", + "\n", + "**All** returns all submission statuses.\n", + "\n", + "Details about what each of these states means can be found in the Submission Documentation. For most submitters, the important states are **New**, **In Progress**, and **Submitted**, as those will be the states that allow work to be done on the submission.\n", + "\n", + "This allows queries to bring back information about specific states. For the purposes of the demonstration, we'll request the **New**, **Deleted**, and **In Progress** statuses. We'll also return some additional information about each submission so we can identify the ones we want to work with.\n", + "\n", + "For long lists, the *listSubmissions* query also allows the list to be paginated using the **first** and **offset** fields, sorted in ascending or descending order with the **sortDirection** field, and sorted by a specific field with the **orderBy** field. Please see the API documentation for additional information."
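Since *listSubmissions* can return submissions in many states, it can be handy to filter the returned list down to the states that still allow work. A minimal client-side sketch — the helper name and sample records are illustrative stand-ins, not API output:

```python
# Sketch: keep only submissions that can still be worked on.
# The status values follow the list above; the sample records mimic
# the shape of listSubmissions results for illustration only.
WORKABLE_STATUSES = {"New", "In Progress"}

def workable(submissions):
    """Filter listSubmissions records down to actionable states."""
    return [s for s in submissions if s["status"] in WORKABLE_STATUSES]

sample = [
    {"_id": "a", "name": "Sub 1", "status": "New"},
    {"_id": "b", "name": "Sub 2", "status": "Deleted"},
    {"_id": "c", "name": "Sub 3", "status": "In Progress"},
]
print([s["name"] for s in workable(sample)])  # ['Sub 1', 'Sub 3']
```

The same effect can of course be achieved server-side by passing only the desired statuses in the **status** variable.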
+ ] + }, + { + "cell_type": "code", + "execution_count": 70, + "id": "98a8e86c-bceb-4d81-b8cb-46822a5c0673", + "metadata": {}, + "outputs": [], + "source": [ + "list_sub_query = \"\"\"\n", + " query ListSubmissions(\n", + " $status:[String],\n", + " $first: Int,\n", + " $offset: Int,\n", + " $orderBy: String,\n", + " $sortDirection: String){\n", + " listSubmissions(\n", + " status: $status,\n", + " first: $first,\n", + " offset: $offset,\n", + " orderBy: $orderBy,\n", + " sortDirection: $sortDirection){\n", + " total\n", + " submissions{\n", + " _id\n", + " name\n", + " submitterName\n", + " dataCommons\n", + " studyAbbreviation\n", + " dbGaPID\n", + " modelVersion\n", + " status\n", + " conciergeName\n", + " createdAt\n", + " updatedAt\n", + " intention\n", + " }\n", + " }\n", + " }\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "id": "a5287198-dd55-4c5a-8a64-85a1698f59ab", + "metadata": {}, + "outputs": [], + "source": [ + "statusvariables = {\"status\":[\"New\", \"Deleted\", \"In Progress\"], \"first\": -1, \"offset\": 0, \"orderBy\": \"updatedAt\", \"sortDirection\": \"desc\"}" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "id": "e16543c9-4044-4855-813c-4c824a3c60a2", + "metadata": {}, + "outputs": [], + "source": [ + "list_sub_res = apiQuery('stage', list_sub_query, statusvariables)" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "id": "7dee0bce-0a33-4443-af03-a2f0a7017554", + "metadata": {}, + "outputs": [ + { + "data": { + "text/markdown": [ + "| | _id | name | submitterName | dataCommons | studyAbbreviation | dbGaPID | modelVersion | status | conciergeName | createdAt | updatedAt | intention |\n", + "|---:|:-------------------------------------|:--------------------------|:----------------|:--------------|:------------------------|:----------------|:---------------|:------------|:----------------|:-------------------------|:-------------------------|:------------|\n", + "| 0 | 
0623921e-926f-4e2a-b0f7-6f1d442d855d | Stage API Sub Test 3 | Todd Pihl | CDS | AES | phs0002 | 6.0.2 | New | | 2025-03-07T19:33:28.186Z | 2025-03-07T19:33:28.186Z | New/Update |\n", + "| 1 | a4e4f3f6-6d47-4174-a687-b71ba925a558 | Stage API Sub Test 2 | Todd Pihl | CDS | AES | phs0002 | 6.0.2 | In Progress | | 2025-03-06T15:27:02.009Z | 2025-03-06T18:33:56.857Z | New/Update |\n", + "| 2 | c477eeb1-53b9-45f3-873b-9fea9e242267 | Stage API Submission Test | Todd Pihl | CDS | AES | phs0002 | 6.0.2 | New | | 2025-03-05T18:28:19.846Z | 2025-03-05T19:01:54.803Z | New/Update |\n", + "| 3 | 4ce43a10-4669-40ce-949f-3fbbfc9f2513 | Stage API Submission Test | Todd Pihl | CDS | G_controlledStudy_Stage | 34424 | 4.0.4 | Deleted | | 2024-10-18T20:02:36.282Z | 2025-03-05T13:30:02.351Z | New/Update |\n", + "| 4 | 451744e8-3c12-4782-ba6b-0d97ef585929 | Key 3 | Todd Pihl | CDS | HTAN image | phs002371_image | 5.0.4 | In Progress | | 2025-02-05T18:59:16.955Z | 2025-02-07T21:28:15.395Z | New/Update |" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "submissions_df = pd.DataFrame(list_sub_res['data']['listSubmissions']['submissions'])\n", + "display(Markdown(submissions_df.to_markdown()))" + ] + }, + { + "cell_type": "markdown", + "id": "43777d02-2c97-4622-9aa5-89403e2d3a83", + "metadata": {}, + "source": [ + "Since we're working with the **AES** study, we need to work on one of the submissions related to that" + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "id": "e7dcfe5c-5a17-450b-9219-8bc28722ec39", + "metadata": {}, + "outputs": [], + "source": [ + "for submission in list_sub_res['data']['listSubmissions']['submissions']:\n", + " if submission['name'] == 'Stage API Sub Test 3':\n", + " submissionid = submission['_id']" + ] + }, + { + "cell_type": "markdown", + "id": "8dc62457-e4c8-4975-beef-1ff40e5d4b1b", + "metadata": {}, + "source": [ + "\n", + "# Step 3: Uploading Submission templates\n", + "Once the 
submission is created, the next step is to start uploading metadata submission templates and data files. There are two ways of accomplishing this upload:\n", + "1) Using the Upload CLI Tool: This is generally the easiest method and can be used to upload both the metadata templates and the data files. The use of the Uploader CLI Tool [is documented elsewhere](https://github.com/CBIIT/crdc-datahub-cli-uploader/tree/master).\n", + "2) Using the API: If you wish to provide metadata only via a program, the API can be used as will be demonstrated in this notebook. While it is possible to upload data files using the API, it is **strongly** recommended that the Upload CLI Tool be used instead.\n", + "\n", + "Uploading data files using the API will be covered in a separate notebook." + ] + }, + { + "cell_type": "markdown", + "id": "d9a2c287-4c8c-4b92-b173-26d908437777", + "metadata": {}, + "source": [ + "## Collecting information about the metadata files to upload\n", + "Let's set up the list of metadata files we want to upload. This will be a list of **FileInput** objects. 
A FileInput object consists of a dictionary with *fileName* and *size* as the keys.\n", + "\n", + "- **fileName**: Just the name of the file, not including the path.\n", + "- **size**: The size of the file in bytes\n", + "\n", + "The last field required for the query is *type*, which is either \"metadata\" or \"data file\". Since \"data file\" isn't allowed outside of the Upload CLI Tool, we'll set it to \"metadata\"" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "id": "7b46f29b-d503-436d-b8d4-99d0055966fc", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['Demo_file.tsv', 'Demo_treatment.tsv', 'Demo_sampleFIXED.tsv', 'Demo_participant.tsv', 'Demo_program.tsv', 'Demo_genomic_info.tsv', 'Demo_sample.tsv', 'Demo_study.tsv', 'Demo_diagnosis.tsv', 'Demo_image.tsv']\n" + ] + } + ], + "source": [ + "if platform == 'linux' or platform == 'linux2':\n", + " datadir = '/testdata/'\n", + "elif platform == \"win32\":\n", + " datadir = r\"C:\\Users\\pihltd\\Documents\\datadir\"\n", + "elif platform == \"darwin\":\n", + " datadir = \"/testdata/\"\n", + "filelist = os.listdir(datadir)\n", + "metadatafiles = []\n", + "for file in filelist:\n", + " metadatafiles.append(file)\n", + "print(metadatafiles)" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "id": "8f23cf46-d7a6-4951-bc1a-a3991f8ea10d", + "metadata": {}, + "outputs": [], + "source": [ + "submissiontype = \"metadata\"" + ] + }, + { + "cell_type": "markdown", + "id": "1a1a4aef-cbd1-4613-9e78-1841188869bb", + "metadata": {}, + "source": [ + "## The createBatch mutation\n", + "Now that we've got credentials and the list of files, we create a \"batch\", which is the term for one or more files uploaded at the same time. We do this by using the createBatch mutation as shown below. \n", + "\n", + "One of the critical pieces of information returned is the signed URL that is used to actually transfer the files to the Data Submission Portal." 
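Although this notebook passes bare file names to the batch, the *FileInput* shape described above (fileName plus size in bytes) can be assembled from a directory listing. A minimal sketch with an illustrative helper name (not part of the API):

```python
import os

def buildFileInputs(datadir):
    """Sketch: build FileInput-style dicts (fileName + size in bytes)
    for every regular file directly inside datadir."""
    inputs = []
    for name in os.listdir(datadir):
        full = os.path.join(datadir, name)
        if os.path.isfile(full):
            # fileName excludes the path, per the FileInput description above
            inputs.append({"fileName": name, "size": os.path.getsize(full)})
    return inputs
```

Note that *size* comes straight from the filesystem, so the files must be readable locally when the list is built.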
+ ] + }, + { + "cell_type": "code", + "execution_count": 77, + "id": "50e5d7eb-87c2-4793-9700-f7782c277b69", + "metadata": {}, + "outputs": [], + "source": [ + "create_batch_query = \"\"\"\n", + "mutation CreateBatch(\n", + " $submissionID: ID!, \n", + " $type: String, \n", + " $files: [String!]!) {\n", + " createBatch(submissionID: $submissionID, type: $type, files: $files) {\n", + " _id\n", + " submissionID\n", + " bucketName\n", + " filePrefix\n", + " type\n", + " status\n", + " createdAt\n", + " updatedAt\n", + " files {\n", + " fileName\n", + " signedURL\n", + " }\n", + " }\n", + "}\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "id": "3d5977e3-b5a8-4d6f-bee7-a6d226a81f19", + "metadata": {}, + "outputs": [], + "source": [ + "create_batch_variables = {\"submissionID\":submissionid, \"type\":submissiontype, \"files\":metadatafiles}" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "id": "83bb786e-cb8f-432c-9d45-594f52dceeac", + "metadata": {}, + "outputs": [], + "source": [ + "create_batch_res = apiQuery('stage', create_batch_query, create_batch_variables)" + ] + }, + { + "cell_type": "markdown", + "id": "e27465b4-0281-499d-a473-2e9f3e6b4c39", + "metadata": {}, + "source": [ + "The results from this mutation will have the signed URLs (again, for security reasons it's a good idea to not print them out). We'll use these to upload the files. Make sure that you're using the correct signed URL for each file. We'll also need the batch ID, so that should be parsed out." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 80, + "id": "7cc4a8ac-4d6e-43ce-8b3c-7a26f054e1a3", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "batchid = create_batch_res['data']['createBatch']['_id']" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "id": "7725b40a-32fb-4d85-8902-fd2f7d7a9f37", + "metadata": {}, + "outputs": [], + "source": [ + "def awsFileUpload(file, signedurl, datadir):\n", + " #https://docs.aws.amazon.com/AmazonS3/latest/userguide/example_s3_Scenario_PresignedUrl_section.html\n", + " headers = {'Content-Type': 'text/tab-separated-values'}\n", + " try:\n", + " fullFileName = datadir+file\n", + " with open(fullFileName, 'rb') as f:\n", + " filetext = f.read()\n", + " res = requests.put(signedurl, data=filetext, headers=headers)\n", + " if res.status_code == 200:\n", + " return res\n", + " else:\n", + " print(f\"Error: {res.status_code}\")\n", + " return res.content\n", + " except requests.exceptions.HTTPError as e:\n", + " return(f\"HTTP error: {e}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 87, + "id": "bfe57e96-17eb-4bac-95c5-0a103b4dd97e", + "metadata": {}, + "outputs": [], + "source": [ + "def processFilesForUpload(metadatafiles, datadir,batch_creation_results):\n", + " file_upload_result = []\n", + " for entry in metadatafiles:\n", + " for metadatafile in batch_creation_results['data']['createBatch']['files']:\n", + " if entry == metadatafile['fileName']:\n", + " metares = awsFileUpload(metadatafile['fileName'], metadatafile['signedURL'], datadir)\n", + " if metares.status_code == 200:\n", + " succeeded = True\n", + " else:\n", + " succeeded = False\n", + " file_upload_result.append({'fileName':entry, 'succeeded': succeeded, 'errors':[], 'skipped':False})\n", + " return file_upload_result" + ] + }, + { + "cell_type": "markdown", + "id": "6243889d-4a1a-49af-8efd-29bd482c267d", + "metadata": {}, + "source": [ + "As each file is uploaded, an *UploadResult* object has to be 
constructed. This will get used in the batch update step." + ] + }, + { + "cell_type": "code", + "execution_count": 88, + "id": "267d2d69-e5db-4af5-b2e9-e18611c3a932", + "metadata": {}, + "outputs": [], + "source": [ + "datadir = \"/testdata/\"\n", + "file_upload_result = processFilesForUpload(metadatafiles, datadir, create_batch_res)" + ] + }, + { + "cell_type": "markdown", + "id": "a860953a-2e91-4435-b593-fd90b50dc7bc", + "metadata": {}, + "source": [ + "After files have been uploaded, the next step is to update the batch by calling the *updateBatch* mutation. This mutation uses the *UploadResult* object that we created in the previous step" + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "id": "1ee67d21-c4d6-453e-be84-a84a37cdce58", + "metadata": {}, + "outputs": [], + "source": [ + "update_batch_query = \"\"\"\n", + " mutation UpdateBatch(\n", + " $batchID: ID!\n", + " $files: [UploadResult]\n", + " ){\n", + " updateBatch(batchID:$batchID, files:$files){\n", + " _id\n", + " displayID\n", + " }\n", + " }\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 90, + "id": "d830262f-92c4-421b-b8c0-a99f800164fd", + "metadata": {}, + "outputs": [], + "source": [ + "update_variables = {'batchID':batchid, 'files':file_upload_result}" + ] + }, + { + "cell_type": "code", + "execution_count": 91, + "id": "3bebc154-061a-42aa-948e-11a4fc2b38e9", + "metadata": {}, + "outputs": [], + "source": [ + "update_res = apiQuery('stage', update_batch_query, update_variables)" + ] + }, + { + "cell_type": "markdown", + "id": "d2fcc0fe-3ae1-48dd-9849-75eb619d3441", + "metadata": {}, + "source": [ + "#### Side Trip\n", + "If you log into the Data Submission Portal interface, at this point you should see the files that have been uploaded along with any errors that were detected." 
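Before (or after) calling *updateBatch*, it can be useful to summarize the *UploadResult* list to see which files actually transferred. A minimal sketch — the helper name and sample records are illustrative stand-ins for the list built by `processFilesForUpload`:

```python
# Sketch: summarize an UploadResult list. The dict shape mirrors the
# records built earlier ({'fileName', 'succeeded', 'errors', 'skipped'});
# the sample data is for illustration only.
def summarize_upload(results):
    """Return (succeeded, failed) file-name lists from UploadResult dicts."""
    ok = [r["fileName"] for r in results if r["succeeded"]]
    bad = [r["fileName"] for r in results if not r["succeeded"]]
    return ok, bad

sample = [
    {"fileName": "Demo_sample.tsv", "succeeded": True, "errors": [], "skipped": False},
    {"fileName": "Demo_image.tsv", "succeeded": False, "errors": [], "skipped": False},
]
ok, bad = summarize_upload(sample)
print(f"uploaded: {ok}, failed: {bad}")
```

Any files in the failed list should be retried before the batch is updated, since *updateBatch* records the success flag for each file.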
+ ] + }, + { + "cell_type": "markdown", + "id": "e3b8a2a1-b5e4-4f23-9fe1-0c4f72c420e8", + "metadata": {}, + "source": [ + "### Checking the upload\n", + "Before going any further, it's a good idea to make sure that the upload went as expected. The best way to check for upload errors is with the *listBatches* query. Since this returns all of the batches in a submission, you'll have to do a little parsing to see if there are any issues with the batch you just sent." + ] + }, + { + "cell_type": "code", + "execution_count": 92, + "id": "13365683-4395-427b-a341-6d24b1aa7a6f", + "metadata": {}, + "outputs": [], + "source": [ + "list_batches_query = \"\"\"\n", + "query ListBatches($submissionID: ID!) {\n", + " listBatches(submissionID: $submissionID) {\n", + " total\n", + " batches {\n", + " _id\n", + " submissionID\n", + " displayID\n", + " type\n", + " fileCount\n", + " files {\n", + " fileName\n", + " }\n", + " status\n", + " errors\n", + " }\n", + " }\n", + "}\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 93, + "id": "3db57f7e-5e0a-4d77-848e-ec867303541e", + "metadata": {}, + "outputs": [], + "source": [ + "batches_variables = {'submissionID':submissionid}" + ] + }, + { + "cell_type": "code", + "execution_count": 94, + "id": "d0891979-ebf6-40ae-882d-3adf2a9c753a", + "metadata": {}, + "outputs": [], + "source": [ + "batch_error_res = apiQuery('stage', list_batches_query, batches_variables)" + ] + }, + { + "cell_type": "code", + "execution_count": 95, + "id": "9d3bb947-0f47-40ba-b5b3-d3f3528f0a6e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/markdown": [ + "| | _id | submissionID | displayID | type | fileCount | files | status | errors |\n", + 
"|---:|:-------------------------------------|:-------------------------------------|------------:|:---------|------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n", + "| 0 | 3da22432-1711-4ae3-ae62-75d5258478f6 | 0623921e-926f-4e2a-b0f7-6f1d442d855d | 1 | metadata | 10 | [{'fileName': 'Demo_file.tsv'}, {'fileName': 'Demo_treatment.tsv'}, {'fileName': 'Demo_sampleFIXED.tsv'}, {'fileName': 'Demo_participant.tsv'}, {'fileName': 'Demo_program.tsv'}, {'fileName': 'Demo_genomic_info.tsv'}, {'fileName': 'Demo_sample.tsv'}, {'fileName': 'Demo_study.tsv'}, {'fileName': 'Demo_diagnosis.tsv'}, {'fileName': 'Demo_image.tsv'}] | Failed | ['“Demo_treatment.tsv:2”: Key property “treatment_id” value is required.', '“Demo_participant.tsv”: Property \"sex\" is required.', '“Demo_sample.tsv: 38”: conflict data detected: “sample_type”: \"RNA\".', '“Demo_sample.tsv: 74”: conflict data detected: “sample_type”: \"DNA\".', '“Demo_image.tsv:2”: Key property “study_link_id” value is required.'] |" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "batch_df = pd.DataFrame(batch_error_res['data']['listBatches']['batches'])\n", + "display(Markdown(batch_df.to_markdown()))" + ] + }, + { + "cell_type": "markdown", + "id": "880d32d2-1e45-4403-bd55-942fe7a51225", + 
"metadata": {}, + "source": [ + "Clearly there were some issues that will have to be corrected before the submission can proceed.\n", + "\n", + "### A Note on metadata uploads\n", + "When a metadata upload fails, all of the files in the upload are marked as failed, regardless of which files actually have errors. While in this demonstration all of the metadata files are being uploaded as a group (and therefore all have to be re-uploaded), the Submission Portal does allow metadata files to be submitted, and evaluated, individually. When submitted individually, only files that fail the initial upload validation need to be corrected; any files that have already passed will remain in the system." + ] + }, + { + "cell_type": "markdown", + "id": "0e1fd473-0f5e-4540-afd2-f2bc04f94c70", + "metadata": {}, + "source": [ + "For this demo, there is a second set of files that have the errors fixed and are in a different directory." + ] + }, + { + "cell_type": "code", + "execution_count": 96, + "id": "26477b99-c10e-4dd4-b0a1-6a3457444620", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['Demo_file.tsv', 'Demo_participant.tsv', 'Demo_program.tsv', 'Demo_genomic_info.tsv', 'Demo_sample.tsv', 'Demo_study.tsv', 'Demo_diagnosis.tsv']\n" + ] + } + ], + "source": [ + "# Pick the platform-appropriate directory holding the corrected files\n", + "if platform == 'linux' or platform == 'linux2':\n", + " datadir = '/fixedtestdata/'\n", + "elif platform == \"win32\":\n", + " datadir = r\"C:\\Users\\pihltd\\Documents\\datadir\"\n", + "elif platform == \"darwin\":\n", + " datadir = \"/fixedtestdata/\"\n", + "new_metadatafiles = os.listdir(datadir)\n", + "print(new_metadatafiles)" + ] + }, + { + "cell_type": "markdown", + "id": "31d950ad-8a03-4ca8-9e59-cd7b5416379c", + "metadata": {}, + "source": [ + "With that in place, we'll go through the same steps to add the files:" + ] + }, + { + "cell_type": "markdown", + "id": 
"6dbfa812-4cf7-4e17-a80f-a27d68ed199e", + "metadata": {}, + "source": [ + "1. Create a new batch and grab the batch ID" + ] + }, + { + "cell_type": "code", + "execution_count": 97, + "id": "32e05f87-d574-48f8-bf78-d6bd87edc56c", + "metadata": {}, + "outputs": [], + "source": [ + "create_batch_variables = {\"submissionID\":submissionid, \"type\":submissiontype, \"files\":new_metadatafiles}\n", + "create_batch_res = apiQuery('stage', create_batch_query, create_batch_variables)\n", + "batchid = create_batch_res['data']['createBatch']['_id']" + ] + }, + { + "cell_type": "markdown", + "id": "265cd1b5-0ecc-424a-8d39-447a4430406b", + "metadata": {}, + "source": [ + "2. Upload the files using the pre-signed URLs" + ] + }, + { + "cell_type": "code", + "execution_count": 98, + "id": "500d2528-f48c-4d02-8b0c-2195a5ab8867", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "file_upload_result = processFilesForUpload(new_metadatafiles, datadir, create_batch_res)" + ] + }, + { + "cell_type": "markdown", + "id": "06efe76c-2548-49aa-83ab-5d5bdabf2c2d", + "metadata": {}, + "source": [ + "3. 
Update the batch" + ] + }, + { + "cell_type": "code", + "execution_count": 99, + "id": "2064b476-6243-405d-8e56-c108de3710d7", + "metadata": {}, + "outputs": [], + "source": [ + "update_variables = {'batchID':batchid, 'files':file_upload_result}\n", + "update_res = apiQuery('stage', update_batch_query, update_variables)" + ] + }, + { + "cell_type": "markdown", + "id": "0850d755-7964-4260-9372-da242b392e73", + "metadata": {}, + "source": [ + "And lastly, check the batch for errors" + ] + }, + { + "cell_type": "code", + "execution_count": 101, + "id": "7eff2291-109f-4c26-9805-7cfef36cda08", + "metadata": {}, + "outputs": [ + { + "data": { + "text/markdown": [ + "| | _id | submissionID | displayID | type | fileCount | files | status | errors |\n", + "|---:|:-------------------------------------|:-------------------------------------|------------:|:---------|------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n", + "| 0 | 242a9da4-8a26-420a-88bd-732e85aac752 | 0623921e-926f-4e2a-b0f7-6f1d442d855d | 2 | metadata | 7 | [{'fileName': 'Demo_file.tsv'}, {'fileName': 'Demo_participant.tsv'}, {'fileName': 'Demo_program.tsv'}, {'fileName': 'Demo_genomic_info.tsv'}, {'fileName': 'Demo_sample.tsv'}, {'fileName': 'Demo_study.tsv'}, {'fileName': 'Demo_diagnosis.tsv'}] | Uploaded | [] |\n", + "| 1 | 
3da22432-1711-4ae3-ae62-75d5258478f6 | 0623921e-926f-4e2a-b0f7-6f1d442d855d | 1 | metadata | 10 | [{'fileName': 'Demo_file.tsv'}, {'fileName': 'Demo_treatment.tsv'}, {'fileName': 'Demo_sampleFIXED.tsv'}, {'fileName': 'Demo_participant.tsv'}, {'fileName': 'Demo_program.tsv'}, {'fileName': 'Demo_genomic_info.tsv'}, {'fileName': 'Demo_sample.tsv'}, {'fileName': 'Demo_study.tsv'}, {'fileName': 'Demo_diagnosis.tsv'}, {'fileName': 'Demo_image.tsv'}] | Failed | ['“Demo_treatment.tsv:2”: Key property “treatment_id” value is required.', '“Demo_participant.tsv”: Property \"sex\" is required.', '“Demo_sample.tsv: 38”: conflict data detected: “sample_type”: \"RNA\".', '“Demo_sample.tsv: 74”: conflict data detected: “sample_type”: \"DNA\".', '“Demo_image.tsv:2”: Key property “study_link_id” value is required.'] |" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "batch_error_res = apiQuery('stage', list_batches_query, batches_variables)\n", + "batch_df = pd.DataFrame(batch_error_res['data']['listBatches']['batches'])\n", + "display(Markdown(batch_df.to_markdown()))" + ] + }, + { + "cell_type": "markdown", + "id": "f4b8a405-a7f9-4311-b5c7-5a68e89d46c6", + "metadata": {}, + "source": [ + "The status is now **Uploaded** and no errors are reported, so all seven files are now successfully added to the submission.\n", + "\n", + "#### Side Trip\n", + "\n", + "If you log into the Submission Portal, you should see that all files have uploaded and passed." + ] + }, + { + "cell_type": "markdown", + "id": "d793dc93-8aa6-4a78-82da-d3ee9ee011d6", + "metadata": {}, + "source": [ + "\n", + "# Step 4: Running the Validations\n", + "Once you have either metadata templates or data files successfully uploaded to the Submission Portal, you can start running validations. Validations can be run at any time, you don't have to complete all uploads before running validations. 
However, if you do run validations on incomplete submissions, you will see errors relating to the missing information.\n", + "\n", + "It's important to remember that validations are run against everything in the submission, not just against a specific file or subset of files." + ] + }, + { + "cell_type": "markdown", + "id": "9e376a4b-8dbd-4ada-8c2d-71d7ab311fd5", + "metadata": {}, + "source": [ + "Validations are triggered by running the *validateSubmission* mutation, which requires the submission ID, the types of validation to run, and the scope of the validation.\n", + "#### Types\n", + "- **metadata** - run the validations for the uploaded metadata files\n", + "- **data file** - run the validations for the uploaded data files\n", + "- Note that both values can be used in a single validation run\n", + "\n", + "#### Scope\n", + "- **New** - Run validations only against newly uploaded files. Any files that have previously been validated will be ignored.\n", + "- **All** - Run validations against all the files, both new and previously uploaded."
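Throughout this notebook, the HTTP work is handled by an *apiQuery* helper that is defined outside of this document. For readers building their own scripts, below is a minimal sketch of what such a helper might look like. Note that the endpoint URL, the tier-to-URL mapping, and the exact header format here are assumptions for illustration only; consult the API documentation for the authoritative values.

```python
import json
import urllib.request

# Hypothetical tier-to-endpoint mapping; the real URLs come from the API documentation.
API_URLS = {"stage": "https://example-submission-portal/api/graphql"}

def build_graphql_payload(query, variables):
    """Serialize a GraphQL query and its variables into a JSON request body."""
    return json.dumps({"query": query, "variables": variables})

def apiQuery(tier, query, variables, token="<your API token>"):
    """Post a GraphQL query to the chosen tier and return the parsed JSON response."""
    req = urllib.request.Request(
        API_URLS[tier],
        data=build_graphql_payload(query, variables).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The same helper works for every query and mutation in this walkthrough, since GraphQL sends everything as a POST with a `query` string and a `variables` object.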
+ ] + }, + { + "cell_type": "code", + "execution_count": 102, + "id": "2a4d180e-c83f-4864-8a27-079e09643a23", + "metadata": {}, + "outputs": [], + "source": [ + "run_validation_query = \"\"\"\n", + " mutation ValidateSubmission(\n", + " $id: ID!\n", + " $types: [String]\n", + " $scope: String\n", + "){\n", + " validateSubmission(_id: $id, types: $types, scope: $scope){\n", + " success\n", + " message\n", + " }\n", + "}\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 103, + "id": "9742f109-f386-4cfc-b211-31a6049bc674", + "metadata": {}, + "outputs": [], + "source": [ + "validation_variables = {\"id\":submissionid, \"types\":\"metadata\", \"scope\":\"All\"}" + ] + }, + { + "cell_type": "code", + "execution_count": 104, + "id": "cccfee52-da30-4564-be97-82979fda3195", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "True\n" + ] + } + ], + "source": [ + "validation_res = apiQuery('stage', run_validation_query, validation_variables)\n", + "print(validation_res['data']['validateSubmission']['success'])" + ] + }, + { + "cell_type": "markdown", + "id": "1e645280-f0d0-4290-982b-b1c591f22ae1", + "metadata": {}, + "source": [ + "The **success** value simply indicates that the validation process has successfully launched; it *does not* indicate that the validation results are successful. \n", + "\n", + "To check the validation results, there are two queries that can be run:\n", + "\n", + "- **aggregatedSubmissionQCResults**: This query returns a summary of the errors that have been found. Running this first is good practice, as systemic issues can produce hundreds or thousands of lines of errors, and this report summarizes those into a more easily understood format.\n", + "- **submissionQCResults**: This query returns detailed results on each of the errors found during validation. Note that the results from this query can be numerous and are a good use case for pagination. 
In the examples below we'll set **first** to -1 to return all results at once, but checking the returned **total** field is still good practice to confirm that all results have been returned.\n", + "\n", + "For the purposes of this example, we'll use just two of the available arguments: **_id**, which is the submission ID and will pull back results for the entire submission, and **severity**, which can be set to one of the following:\n", + "\n", + "- **All** - Return all errors regardless of severity\n", + "- **Error** - Return only Error level errors. These will block submission of the study.\n", + "- **Warnings** - Return only Warning level errors. Warnings will not block submission; however, they should be corrected if possible." + ] + }, + { + "cell_type": "code", + "execution_count": 105, + "id": "1e9a2ca4-e939-48aa-b1dc-988754739c62", + "metadata": {}, + "outputs": [], + "source": [ + "summaryQuery = \"\"\"\n", + " query SummaryQueryQCResults(\n", + " $submissionID: ID!,\n", + " $severity: String,\n", + " $first: Int,\n", + " $offset: Int,\n", + " $orderBy: String,\n", + " $sortDirection: String\n", + " ){\n", + " aggregatedSubmissionQCResults(\n", + " submissionID: $submissionID,\n", + " severity: $severity,\n", + " first: $first,\n", + " offset: $offset,\n", + " orderBy: $orderBy,\n", + " sortDirection: $sortDirection\n", + " ){\n", + " total\n", + " results{\n", + " title\n", + " severity\n", + " count\n", + " code\n", + " }\n", + " }\n", + " }\n", + "\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 106, + "id": "9d78b640-c070-46d1-94c9-16f0e031cae4", + "metadata": {}, + "outputs": [], + "source": [ + "summary_variables = {\"submissionID\":submissionid, \"severity\":\"All\", \"first\":-1, \"offset\":0, \"sortDirection\": \"desc\", \"orderBy\": \"displayID\"}" + ] + }, + { + "cell_type": "code", + "execution_count": 110, + "id": "669ac53d-50cb-496a-9411-38e57cf4b323", + "metadata": {}, + "outputs": [], + "source": [ + "summary_res = apiQuery('stage', summaryQuery, summary_variables)" + 
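Since **first** and **offset** drive pagination for both QC queries, the loop below sketches the general pattern for collecting every page. Here *fetch_page* is a stand-in for a call to *apiQuery* with the appropriate variables; the assumed response shape (a **total** count plus a **results** list) mirrors the queries shown in this section.

```python
def fetch_all_results(fetch_page, page_size=100):
    """Collect every result from a paginated query.

    fetch_page(first, offset) stands in for an apiQuery call and must
    return a dict with 'total' and 'results' keys, mirroring the shape of
    the aggregatedSubmissionQCResults / submissionQCResults responses.
    """
    results = []
    offset = 0
    total = None
    while total is None or offset < total:
        page = fetch_page(first=page_size, offset=offset)
        total = page["total"]
        results.extend(page["results"])
        if not page["results"]:  # defensive stop if the server returns an empty page
            break
        offset += page_size
    return results
```

Requesting moderate page sizes this way avoids pulling thousands of error records in one response while still guaranteeing, via the **total** check, that nothing is missed.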
] + }, + { + "cell_type": "code", + "execution_count": 111, + "id": "a22e3a73-e012-49be-904d-0428053ea321", + "metadata": {}, + "outputs": [ + { + "data": { + "text/markdown": [ + "| | title | severity | count | code |\n", + "|---:|:----------------------------------|:-----------|--------:|:-------|\n", + "| 0 | Related node not found | Error | 2 | M014 |\n", + "| 1 | Value not permitted | Error | 497 | M010 |\n", + "| 2 | Invalid Property | Warning | 152 | M017 |\n", + "| 3 | Missing required property | Error | 3 | M003 |\n", + "| 4 | Relationship not specified | Error | 152 | M013 |\n", + "| 5 | Many-to-one relationship conflict | Error | 4 | M025 |\n", + "| 6 | Updating existing data | Warning | 279 | M018 |" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "summary_df = pd.DataFrame(summary_res['data']['aggregatedSubmissionQCResults']['results'])\n", + "display(Markdown(summary_df.to_markdown()))" + ] + }, + { + "cell_type": "markdown", + "id": "e3f0c04e-52ab-4498-b3d3-2bda2de22203", + "metadata": {}, + "source": [ + "#### Side Trip\n", + "As with other results, these can also be viewed in the Data Submission Portal graphical interface. The Submission Portal also allows download of a .csv file if that is preferred." + ] + }, + { + "cell_type": "markdown", + "id": "c977cc8f-80ec-4a67-9d56-9a975f0fa9c7", + "metadata": {}, + "source": [ + "To get a detailed breakdown of each entry, the **submissionQCResults** query should be used. This query has a number of options that can be used to fine-tune the results that are returned, so please refer to the documentation for the full list. \n", + "Filtering the results in this way allows large numbers of errors to be handled in a more digestible manner.\n", + "\n", + "In the example below, we'll use the *M003* code to limit the errors returned to just those classified as *Missing required property*."
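Because Error-severity issues block submission while Warnings do not, a quick programmatic check of the aggregated results can tell you whether a submission is still blocked. The small sketch below is a convenience over the result structure returned by *aggregatedSubmissionQCResults*, not part of the API itself; the sample rows are shaped like the summary table above.

```python
def blocking_error_count(agg_results):
    """Total count of Error-severity issues from aggregatedSubmissionQCResults.

    agg_results is the list found at
    res['data']['aggregatedSubmissionQCResults']['results'].
    """
    return sum(row["count"] for row in agg_results if row["severity"] == "Error")

# Sample rows mirroring the shape of the aggregated QC results
sample = [
    {"title": "Related node not found", "severity": "Error", "count": 2, "code": "M014"},
    {"title": "Invalid Property", "severity": "Warning", "count": 152, "code": "M017"},
    {"title": "Missing required property", "severity": "Error", "count": 3, "code": "M003"},
]
print(blocking_error_count(sample))  # 5 Error-level issues still block this submission
```

A return value of zero means only Warnings (if anything) remain, so the submission would not be blocked on validation grounds.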
+ ] + }, + { + "cell_type": "code", + "execution_count": 112, + "id": "8f0fc780-f8c5-428c-be13-fe6e79b8cb3f", + "metadata": {}, + "outputs": [], + "source": [ + "detailedQCQuery = \"\"\"\n", + " query DetailedQueryQCResults(\n", + " $id: ID!,\n", + " $severities: String,\n", + " $first: Int,\n", + " $offset: Int,\n", + " $orderBy: String,\n", + " $sortDirection: String\n", + " $issueCode: String\n", + " ){\n", + " submissionQCResults(\n", + " _id:$id,\n", + " severities: $severities,\n", + " first: $first,\n", + " offset: $offset,\n", + " orderBy: $orderBy,\n", + " sortDirection: $sortDirection,\n", + " issueCode: $issueCode\n", + " ){\n", + " total\n", + " results{\n", + " submissionID\n", + " type\n", + " validationType\n", + " batchID\n", + " displayID\n", + " submittedID\n", + " severity\n", + " uploadedDate\n", + " validatedDate\n", + " errors{\n", + " title\n", + " description\n", + " }\n", + " warnings{\n", + " title\n", + " description\n", + " }\n", + " }\n", + " }\n", + " }\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 113, + "id": "37a59630-37b1-42cf-bcc2-dcad98b85ece", + "metadata": {}, + "outputs": [], + "source": [ + "detail_variables = {\"id\": submissionid, \"severities\":\"All\", \"first\": -1, \"offset\": 0, \"orderBy\":\"displayID\", \"sortDirection\":\"desc\", \"issueCode\":\"M003\"}" + ] + }, + { + "cell_type": "code", + "execution_count": 114, + "id": "c6f3c4a2-6450-44d6-818d-5b254e81d0d6", + "metadata": {}, + "outputs": [], + "source": [ + "detail_res = apiQuery('stage', detailedQCQuery, detail_variables)" + ] + }, + { + "cell_type": "code", + "execution_count": 115, + "id": "5903b781-eeae-457e-bb9f-5ac3c8b38b9a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/markdown": [ + "| | submissionID | type | validationType | batchID | displayID | submittedID | severity | uploadedDate | validatedDate | errors | warnings |\n", + 
"|---:|:-------------------------------------|:-------|:-----------------|:-------------------------------------|------------:|:--------------|:-----------|:-------------------------|:-------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------|\n", + "| 0 | 0623921e-926f-4e2a-b0f7-6f1d442d855d | study | metadata | 242a9da4-8a26-420a-88bd-732e85aac752 | 2 | phs0002 | Error | 2025-03-07T19:37:39.381Z | 2025-03-07T19:39:51.818Z | [{'title': 'Missing required property', 'description': '[Demo_study.tsv: line 2] Required property \"file_types_and_format\" is empty.'}, {'title': 'Missing required property', 'description': '[Demo_study.tsv: line 2] Required property \"study_access\" is empty.'}, {'title': 'Missing required property', 'description': '[Demo_study.tsv: line 2] Required property \"study_version\" is empty.'}, {'title': 'Value not permitted', 'description': '[Demo_study.tsv: line 2] \"Genomic\" is not a permissible value for property \"study_data_types\". 
It is recommended to use \"Genomic Structural Variation\", as it is semantically equivalent to \"Genomic\"'}] | [] |" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "detail_df = pd.DataFrame(detail_res['data']['submissionQCResults']['results'])\n", + "display(Markdown(detail_df.to_markdown()))" + ] + }, + { + "cell_type": "markdown", + "id": "3ac579c9-f4de-44df-8dd7-976495fc50d1", + "metadata": {}, + "source": [ + "Since the actual errors and warnings are buried in lists, we'll parse them out to make them more visible" + ] + }, + { + "cell_type": "code", + "execution_count": 116, + "id": "1240f105-e3c4-4e83-a235-747a1c644f25", + "metadata": {}, + "outputs": [ + { + "data": { + "text/markdown": [ + "| | title | description |\n", + "|---:|:--------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n", + "| 0 | Missing required property | [Demo_study.tsv: line 2] Required property \"file_types_and_format\" is empty. |\n", + "| 1 | Missing required property | [Demo_study.tsv: line 2] Required property \"study_access\" is empty. |\n", + "| 2 | Missing required property | [Demo_study.tsv: line 2] Required property \"study_version\" is empty. |\n", + "| 3 | Value not permitted | [Demo_study.tsv: line 2] \"Genomic\" is not a permissible value for property \"study_data_types\". 
It is recommended to use \"Genomic Structural Variation\", as it is semantically equivalent to \"Genomic\" |" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "columns = ['title', 'description']\n", + "error_df = pd.DataFrame(columns=columns)\n", + "for result in detail_res['data']['submissionQCResults']['results']:\n", + " for error in result['errors']:\n", + " error_df.loc[len(error_df)] = error\n", + "display(Markdown(error_df.to_markdown()))" + ] + }, + { + "cell_type": "markdown", + "id": "8600c9a7-e073-4e64-badd-53b2341442e3", + "metadata": {}, + "source": [ + "\n", + "# Step 5: Submitting, Canceling, or Withdrawing\n", + "The last step of this process is, technically, the submission to CRDC; however, the same query is also used to cancel or withdraw a submission. Let's quickly go over what each of those means:\n", + "\n", + "- **Submit** : Once all of the validation errors have been corrected and the validation results are either completely clean or only have warnings, the study is ready to be submitted. Sending a submit request will hand over control of the files and data to the CRDC Data Team for final checks. Note that once you submit a submission, no further edits are allowed.\n", + " \n", + "- **Cancel** : If you want to abandon a submission *that has not been submitted to CRDC yet*, sending a cancellation request will lock the submission and withdraw it from the system. **Further work is not allowed on cancelled submissions, so be sure that you want to cancel before you issue this query.**\n", + " \n", + "- **Withdraw** : Withdraw is similar to cancel, except it is used on submissions that have already been submitted to CRDC. 
So if you find that a study was submitted before everything was complete, or if other errors are found that necessitate stopping the submission process, sending a **Withdraw** query will prevent the release of the submitted data to the data commons and return the submission to its previous, unsubmitted state.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 110, + "id": "f1aff9fb-342e-4474-9407-98d413863ffe", + "metadata": {}, + "outputs": [], + "source": [ + "submission_query = \"\"\"\n", + "mutation Submit(\n", + " $id: ID!\n", + " $action: String!\n", + " $comment: String\n", + "){\n", + " submissionAction(submissionID: $id, action: $action, comment: $comment){\n", + " name\n", + " submitterID\n", + " submitterName\n", + " dataCommons\n", + " modelVersion\n", + " studyAbbreviation\n", + " dbGaPID\n", + " status\n", + " }\n", + "}\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 111, + "id": "62e52e92-09f4-4cab-b479-790bcf00aaf2", + "metadata": {}, + "outputs": [], + "source": [ + "submission_variables = {\"id\":submissionid, \"action\": \"Submit\", \"comment\":\"Example submission\"}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a322e4ba-6750-4fcc-974b-7a1ea2d57607", + "metadata": {}, + "outputs": [], + "source": [ + "submission_res = apiQuery('stage', submission_query, submission_variables)" + ] + }, + { + "cell_type": "markdown", + "id": "64beff91-859a-4653-8511-c09f2e8a90e5", + "metadata": {}, + "source": [ + "# Conclusions\n", + "\n", + "At this point, we've walked through the basics of creating a submission, uploading, validating, and submitting (or not) data using the API system. There are more queries and mutations available that provide additional information and capabilities for integrating with your systems, and we suggest reading the API documentation for further details. 
And while this example is in Python, any language that can use GraphQL queries is suitable for interaction with this API.\n", + "\n", + "If you have any questions about using this API, please contact the [CRDC Helpdesk](mailto:NCICRDC%40mail.nih.gov?subject=Data%20Submission%20API%20Question)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "da8a6c08-46f4-4d10-81f2-c4317f2323ee", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/__pycache__/DH_Queries.cpython-312.pyc b/__pycache__/DH_Queries.cpython-312.pyc index 89093ceddf03e92368274bec00843577b7d92075..229247b547f0b7fc10cc6f50188a1560683a84fa 100644 GIT binary patch delta 62 zcmX@4vRQ@iG%qg~0}woUe>)>ta3f#4ppu?`Mt*LpenDnNj(&P(Nk(aszDr_BqDN_x QV}PeiYHq&%&F$jrzoNr`dEPcF?(%_}L6 fNzW|FC{2oSNi0e9C{1z<@N`Md&5xP9U9biKu|XPy