diff --git a/pages/architecture.md b/pages/architecture.md
index 5f025253e..a6f214c33 100644
--- a/pages/architecture.md
+++ b/pages/architecture.md
@@ -362,7 +362,7 @@ performance from actual user sessions in near real time.
 Environments are workspaces where one or multiple teams can work. They are the door between our users in data.all and
 AWS, that is why we say that we "link" environments because we link each environment to **ONE** AWS account, in **ONE** specific region.
-Under each environment we create other data.all resources, such as datasets, pipelines and notebooks.
+Under each environment we create other data.all resources, such as datasets and notebooks.
 
 For the deployment of CloudFormation stacks we call upon a CDK trust policy between the Deployment account and the
 Environment account. As for the SDK calls, from the deployment account we assume a certain IAM role in the environment accounts, the pivotRole.
@@ -396,7 +396,7 @@ manually update the pivot role template. IAM policies cannot be scoped down to n
 ### CDK bootstrap
 We need to bootstrap the environment account to provision resources the AWS CDK needs to perform the deployment of
-environments, datasets, pipelines and other data.all resources.
+environments, datasets, and other data.all resources.
 
 Run the following command with AWS credentials of the environment account:
 
 ```bash
@@ -578,8 +578,9 @@ and underlying access to the data in AWS.
 
 ### Pipelines
 
-A Pipeline has a pipeline-Team with UI permissions on the Pipeline
-and underlying access to the data in AWS.
+> [!WARNING]
+> As of v2.7.0, data.all Pipelines is no longer a supported resource because the CodeCommit service is restricted to existing AWS customers. To learn more about CodeCommit's availability, please refer to the documentation [here](https://aws.amazon.com/blogs/devops/how-to-migrate-your-aws-codecommit-repository-to-another-git-provider/).
+
 
 ### Dashboards
 
diff --git a/pages/code.md b/pages/code.md
index 05d9ad183..abb6a7e3d 100644
--- a/pages/code.md
+++ b/pages/code.md
@@ -171,21 +171,12 @@ The data.all `base.api` package contains the `gql` sub-package to support GraphQ
 
 #### cdkproxy
 
 This package contains the code associated with the deployment of CDK stacks that correspond to data.all resources.
-`cdkproxy` is a package that exposes a REST API to run registered cloudformation stacks using AWS CDK. It is deployed as a docker container running on AWS ECS.
+`cdkproxy` is a package that runs registered CloudFormation stacks using AWS CDK. It is bundled as a Docker image and run as an AWS ECS task that is triggered by infrastructure-as-code (IaC) operations in data.all (e.g. CRUD of data.all resources).
 
-When a data.all resource is created, the API sends an HTTP request
-to the docker service and the code runs the appropriate stack using `cdk` cli.
+When an API request is made to create a data.all resource, such as a new dataset, the data.all backend sends a message to an SQS queue; the message is read off the queue asynchronously and triggers a new cdkproxy ECS task.
+The code uses a `cdk` cli wrapper to register infrastructure and manage cdk commands, and runs the appropriate stack using the `cdk` cli to deploy the IaC of the respective data.all resource.
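+
+A minimal sketch of this flow is shown below. It is illustrative only: the queue name and message fields are hypothetical and do not reflect data.all's actual message schema. The idea is simply that the backend enqueues a deployment request which a cdkproxy ECS task later picks up and turns into a `cdk deploy`.
+
+```python
+# Hypothetical sketch: enqueue a stack-deployment request for cdkproxy.
+# Queue name and message fields are illustrative, not data.all's real schema.
+import json
+import boto3
+
+sqs = boto3.client("sqs")
+
+def enqueue_stack_deployment(stack_uri: str, target_uri: str) -> None:
+    """Ask a cdkproxy ECS task to create or update the stack for a resource."""
+    queue_url = sqs.get_queue_url(QueueName="dataall-cdkproxy-queue")["QueueUrl"]
+    sqs.send_message(
+        QueueUrl=queue_url,
+        MessageBody=json.dumps({"stackUri": stack_uri, "targetUri": target_uri}),
+    )
+```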
-These stacks are deployed with the `cdk` cli wrapper
-The API itself consists of 4 actions/paths:
-
-- GET / : checks if the server is running
-- POST /stack/{stackid} : creates or updates the stack
-- DELETE /stack/{stackid} : deletes the stack
-- GET /stack/{stackid] : returns stack status
-
-The webserver is running on docker, using Python's [FASTAPI](https://fastapi.tiangolo.com/)
-web framework and running using [uvicorn](https://www.uvicorn.org/) ASGI server.
+For local data.all deployments, a webserver runs on Docker using Python's [FASTAPI](https://fastapi.tiangolo.com/) web framework and the [uvicorn](https://www.uvicorn.org/) ASGI server. Subsequently, data.all sends POST API requests to the `cdkproxy` web server to start the data.all infrastructure task.
 
 ### core/
 Core contains those functionalities that are indispensable to run data.all. Customization of the core should be limited
@@ -855,7 +846,7 @@ when `Datasets` or `Dashboards` modules are disabled.
     "active": true
   },
   "datapipelines": {
-    "active": true
+    "active": false
   },
   "datasets": {
     "active": false
diff --git a/pages/deploy/deploy_aws.md b/pages/deploy/deploy_aws.md
index c4b0c4137..c3255ab44 100644
--- a/pages/deploy/deploy_aws.md
+++ b/pages/deploy/deploy_aws.md
@@ -9,7 +9,7 @@ permalink: /deploy-aws/
 - [Pre-requisites](#pre-reqs)
 - [1. Clone data.all code](#clone)
 - [2. Setup Python virtualenv](#env)
-- [3. Mirror the code to a CodeCommit or CodeStar Connections repository](#code)
+- [3. Mirror the code to a CodeCommit or CodeConnections repository](#code)
 - [4. Bootstrap tooling account](#boot)
 - [5. Bootstrap deployment account(s)](#boot2)
 - [6. Configure the deployment options in the cdk.json file](#cdkjson)
@@ -79,27 +79,13 @@ source venv/bin/activate
 pip install -r ./deploy/requirements.txt
 pip install git-remote-codecommit
 ```
-## 3. Mirror the code to a CodeCommit or CodeStar Connections repository
-### Using CodeCommit:
-Assuming AWS tooling account Administrator credentials, create an AWS CodeCommit repository, mirror the data.all code
-and push your changes:
-Run the following to check your credentials:
-```bash
-aws sts get-caller-identity
-```
-```bash
-aws codecommit create-repository --repository-name dataall
-git remote rm origin
-git remote add origin codecommit::://dataall
-git init
-git add .
-git commit -m "First commit"
-git push --set-upstream origin main
-```
-### Using CodeStar Connection to GitHub, GitHub Enterprise, GitLab or Bitbucket:
-If you choose to use a GitHub, GitLab or Bitbucket repository, it's important to note that you need to set up an AWS CodeStar connection to your repository for seamless integration.
+## 3. Mirror the code to a CodeCommit or CodeConnections repository
+
+### Using CodeConnections to GitHub, GitHub Enterprise, GitLab or Bitbucket:
+If you choose to use a GitHub, GitLab or Bitbucket repository, it's important to note that you need to set up an AWS CodeConnections connection to your repository for seamless integration.
+
 This connection allows AWS CodePipeline to interact securely with GitHub, GitHub Enterprise, GitLab or Bitbucket.
-Before mirroring the data.all code and pushing any changes, make sure to set up the CodeStar connection by following
+Before mirroring the data.all code and pushing any changes, make sure to set up the CodeConnections connection by following
 the steps detailed in the [documentation](https://docs.aws.amazon.com/dtconsole/latest/userguide/connections-create.html):
 1. Log in to the AWS Management Console.
 2.
Navigate to the AWS Developer tools > Settings > Connections. @@ -117,6 +103,29 @@ git add . git commit -m "First commit" git push --set-upstream origin main ``` + +### Using CodeCommit: + +> [!WARNING] +> CodeCommit is no longer an active service for new AWS Customers. It is recommended to use the above method of CodeConnections when deploying a new instance of data.all. To learn more about CodeCommit's availability, please refer to the documentation [here](https://aws.amazon.com/blogs/devops/how-to-migrate-your-aws-codecommit-repository-to-another-git-provider/). + +Assuming AWS tooling account Administrator credentials, create an AWS CodeCommit repository, mirror the data.all code +and push your changes: +Run the following to check your credentials: +```bash +aws sts get-caller-identity +``` +```bash +aws codecommit create-repository --repository-name dataall +git remote rm origin +git remote add origin codecommit::://dataall +git init +git add . +git commit -m "First commit" +git push --set-upstream origin main +``` + + ## 4. Bootstrap tooling account The **Tooling** account is where the code repository, and the CI/CD pipeline are deployed. It needs to be bootstrapped with CDK in 2 regions, your selected region and us-east-1. @@ -170,7 +179,8 @@ of our repository. Open it, you should be seen something like: "resource_prefix": "string_PREFIX_FOR_ALL_RESOURCES_CREATED_BY_THIS_APP|DEFAULT=dataall", "repository_source": "string_VERSION_CONTROL_SERVICE|(codecommit, codestar_connection) DEFAULT=codecommit", "repo_string": "string_REPOSITORY_IN_GITHUB_OWNER/REPOSITORY|DEFAULT=awslabs/aws-dataall, REQUIRED if repository_source=codestar_connection", - "repo_connection_arn": "string_CODESTAR_SOURCE_CONNECTION_ARN_FOR_GITHUB_arn:aws:codestar-connections:region:account-id:connection/connection-id|DEFAULT=None, REQUIRED if repository_source=codestar_connection", + "repo_connection_arn": "string_CODESTAR_SOURCE_CONNECTION_ARN_FOR_GITHUB_arn:aws:codeconnections:region:account-id:connection/connection-id|DEFAULT=None, REQUIRED if repository_source=codestar_connection", + "log_retention_duration": "string_LOG_RETENTION_DURATION|DEFAULT=TWO_YEARS", "DeploymentEnvironments": [ { "envname": "string_ENVIRONMENT_NAME|REQUIRED", @@ -194,6 +204,7 @@ of our repository. Open it, you should be seen something like: "enable_cw_canaries": "boolean_SET_CLOUDWATCH_CANARIES_FOR_FRONTEND_TESTING|DEFAULT=false", "shared_dashboards_sessions": "string_TYPE_SESSION_SHARED_DASHBOARDS|(reader, anonymous) DEFAULT=anonymous", "enable_pivot_role_auto_create": "boolean_ENABLE_PIVOT_ROLE_AUTO_CREATE_IN_ENVIRONMENT|DEFAULT=false", + "allowed_origins": "string_TYPE_DOMAIN_ORIGIN|DEFAULT=*", "enable_update_dataall_stacks_in_cicd_pipeline": "boolean_ENABLE_UPDATE_DATAALL_STACKS_IN_CICD_PIPELINE|DEFAULT=false", "enable_opensearch_serverless": "boolean_USE_OPENSEARCH_SERVERLESS|DEFAULT=false", "cognito_user_session_timeout_inmins": "integer_COGNITO_USER_SESSION_TIMEOUT_INMINS|DEFAULT=43200", @@ -234,7 +245,9 @@ and find 2 examples of cdk.json files. | resource_prefix | Optional | The prefix used for AWS created resources. It must be in lower case without any special character. (default: dataall) | | source | Optional | The version control source for the repository. It can take 2 values 'codecommit' or 'codestar_connection'. (default: 'codecommit') | | repo_string | Optional | The repository path as string. 
Required if source='codestar_connection' (default: 'awslabs/aws-dataall') |
-| repo_connection_arn | Optional | The arn of the CodeStar connection connecting with the source repository. Required if source='codestar_connection'(default: None) |
+| repo_connection_arn | Optional | The arn of the CodeConnections connection to the source repository. Required if source='codestar_connection' (default: None) |
+| log_retention_duration | Optional | The CloudWatch log retention period for all data.all compute log groups (e.g. Lambda and ECS tasks), VPC flow logs, and API activity logs - this parameter is specified as a string value of one of the AWS CDK `RetentionDays` enum members (default: `TWO_YEARS`) |
+ 
 | **Deployment environments Parameters** | **Optional/Required** | **Definition** |
 | ---------------------------- | --------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | envname | REQUIRED | The name of the deployment environment (e.g dev, qa, prod,...). It must be in lower case without any special character. |
@@ -259,6 +272,7 @@ and find 2 examples of cdk.json files.
 | cognito_user_session_timeout_inmins | Optional | The number of minutes to set the refresh token validity time for user session's in Cognito before a user must re-login to the data.all UI (default: 43200 - i.e. 30 days) |
 | reauth_config | Optional | A dictionary containing a list of API operations that require a user to re-authenticate before proceedind (`reauth_apis`) and a time to live (`ttl`) for how long a user's re-auth session is valid to perform re-auth APIs before having to re-authenticate again |
 | custom_auth | Optional | A dictionary containing set of parameters to setup external IDP ( Authentication and Authorization) in data.all. Custom Auth Configuration : `provider`, `url`, `redirect_url`, `client_id`, `response_types`, `scopes`, `claims_mapping` (Nested dictionary containing configuration : `user_id`, `email`). All the configurations are required if setting data.all with an external OIDC supported IDP |
+| allowed_origins | Optional | A string origin to be specified as the `Access-Control-Allow-Origin` response header when returning responses from the backend (default: `'*'`) |
 
 **Example 1**: Basic deployment: this is an example of a minimum configured cdk.json file.
 
@@ -301,6 +315,7 @@ deploy to 2 deployments accounts.
     "git_release": true,
     "quality_gate": false,
     "resource_prefix": "da",
+    "log_retention_duration": "SIX_YEARS",
     "DeploymentEnvironments": [
         {
             "envname": "dev",
@@ -338,6 +353,7 @@ deploy to 2 deployments accounts.
             "enable_update_dataall_stacks_in_cicd_pipeline": true,
             "enable_opensearch_serverless": true,
             "cognito_user_session_timeout_inmins": 240,
+            "allowed_origins": "https://example.com",
             "reauth_config": {
                 "reauth_apis": ["CreateDataset", "ImportDataset", "deleteDataset"],
                 "ttl": 10
@@ -449,13 +465,22 @@ the different configuration options.
 {
     "modules": {
         "mlstudio": {
-            "active": true
+            "active": true,
+            "features": {
+                "show_stack_logs": "enabled|disabled|admin-only"
+            }
         },
         "notebooks": {
-            "active": true
+            "active": true,
+            "features": {
+                "show_stack_logs": "enabled|disabled|admin-only"
+            }
         },
         "datapipelines": {
-            "active": true
+            "active": true,
+            "features": {
+                "show_stack_logs": "enabled|disabled|admin-only"
+            }
         },
         "omics": {
             "active": false
@@ -483,7 +508,7 @@ the different configuration options.
"Secret" : true } } - }, + }, "s3_datasets": { "active": true, "features": { @@ -491,14 +516,27 @@ the different configuration options. "file_actions": true, "aws_actions": true, "preview_data": true, - "glue_crawler": true + "glue_crawler": true, + "show_stack_logs": "enabled|disabled|admin-only" } - }, + }, "s3_datasets_shares": { "active": true - }, + }, + "shares_base": { + "active": true, + "features": { + "show_share_logs": "enabled|disabled|admin-only" + } + }, "worksheets": { - "active": true + "active": true, + "features": { + "nlq": { + "active": true, + "max_count_per_day": 25 + } + } }, "dashboards": { "active": true @@ -511,8 +549,10 @@ the different configuration options. "features": { "env_aws_actions": true, "cdk_pivot_role_multiple_environments_same_account": false, - "enable_quicksight_monitoring": false - } + "enable_quicksight_monitoring": false, + "show_stack_logs": "enabled|disabled|admin-only" + }, + "log_query_period_days": 1 } } ``` @@ -537,14 +577,12 @@ functionality. If you want to know more about each module, check the [UserGuide] | s3_datasets_shares | datasets_base, notifications | Sub-module that allows sharing of Datasets through Lake Formation and S3 | | datasets_base | None | Shared code related to Datasets (not exposed on `config.json`). | | worksheets | datasets | Athena query editor integrated in data.all UI | -| datapipelines | feed | CICD pipelines that deploy [AWS DDK](https://awslabs.github.io/aws-ddk/) applications | -| omics | None | adds the capability to view and instantiate HealthOmics Ready2Run workflows as runs that can output and save omic data as data.all Datasets. | -| mlstudio | None | SageMaker Studio users that can open a session directly from data.all UI | -| notebooks | None | SageMaker Notebooks created and accessible from data.all UI | -| dashboards | catalog, vote, feed | Start a Quicksight session or import and share a Quicksight Dashboard. | -| notifications | None | Construct to notify users on dataset sharing updates in data.all | -| maintenance | None | Admin control to start/stop data.all maintenance mode to restrict user actions in data.all and allow a stable window for deploying new updates | -| metadata | None ( Registers data.all entities if they are enabled ) | Allows users to attach additional metadata to their data.all entities (datasets, environment, etc ). Enables users to create a metadata form & enforce attaching it to various entities | +| omics | None | adds the capability to view and instantiate HealthOmics Ready2Run workflows as runs that can output and save omic data as data.all Datasets.| +| mlstudio | None | SageMaker Studio users that can open a session directly from data.all UI | +| notebooks | None | SageMaker Notebooks created and accessible from data.all UI | +| dashboards | catalog, vote, feed | Start a Quicksight session or import and share a Quicksight Dashboard. | +| notifications | None | Construct to notify users on dataset sharing updates in data.all | +| maintenance | None | Admin control to start/stop data.all maintenance mode to restrict user actions in data.all and allow a stable window for deploying new updates | | metadata | None ( Registers data.all entities if they are enabled ) | Allows users to attach additional metadata to their data.all entities (datasets, environment, etc ). 
Enables users to create a metadata form & enforce attaching it to various entities | ### Disable module features @@ -558,21 +596,25 @@ In the example config.json, the feature that enables file upload from data.all U "features": { "file_uploads": false, } - }, + } ``` -| **Feature** | **Module** | **Description** | -|---------------------|------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| file_uploads | s3_datasets | Upload files in a Dataset in the Upload tab | -| file_actions | s3_datasets | Create, Read, Update, Delete on Dataset Folders | -| aws_actions | s3_datasets | Get AWS Credentials and assume Dataset IAM role from data.all's UI | -| preview_data | s3_datasets | Enable previews of dataset tables for users in data.all UI | -| glue_crawler | s3_datasets | Allow running Glue Crawler to catalog new data for data.all datasets directly from the UI | -| share_notifications | s3_datasets | Allow additional notifications (on top of data.all's built in UI notifications) to be sent to data.all users when a dataset sharing operation occurs (currently only type `email` notifications is supported and requires `custom_domain` hosted zone parameters be specified in `cdk.json`) | -| confidentiality_dropdown | s3_datasets | Disable / Enable use of confidentiality levels for a dataset. Please note - when this drop down is set to false each dataset is treated as if it is Official or Secret | -| topics_dropdown | s3_datasets | Disable / Enable use of topics for a dataset | -|auto_approval_for_confidentiality_level | s3_datasets | Specify if auto-approval for share requests should be enabled for each confidentiality level in data.all | +| **Feature** | **Module** | **Description** | +|-----------------------------------------|-------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| file_uploads | s3_datasets | Upload files in a Dataset in the Upload tab | +| file_actions | s3_datasets | Create, Read, Update, Delete on Dataset Folders | +| aws_actions | s3_datasets | Get AWS Credentials and assume Dataset IAM role from data.all's UI | +| preview_data | s3_datasets | Enable previews of dataset tables for users in data.all UI | +| glue_crawler | s3_datasets | Allow running Glue Crawler to catalog new data for data.all datasets directly from the UI | +| share_notifications | s3_datasets | Allow additional notifications (on top of data.all's built in UI notifications) to be sent to data.all users when a dataset sharing operation occurs (currently only type `email` notifications is supported and requires `custom_domain` hosted zone parameters be specified in `cdk.json`) | +| confidentiality_dropdown | s3_datasets | Disable / Enable use of confidentiality levels for a dataset. 
Please note - when this drop down is set to false each dataset is treated as if it is Official or Secret | +| topics_dropdown | s3_datasets | Disable / Enable use of topics for a dataset | +| auto_approval_for_confidentiality_level | s3_datasets | Specify if auto-approval for share requests should be enabled for each confidentiality level in data.all | +| show_stack_logs | s3_datasets | Enable / Disable showing stack logs to users or only allow admins to view stack logs. When "Enabled", users who have access to stack can view the logs. With "admin-only", only Data.all admins can view the stack logs and when "Disabled", no user can see the stack logs. Please check out the config.json in step 7 for more details | +| show_share_logs | shares_base | Enable / Disable showing share logs to users or only allow admins to view share logs. When "Enabled", users who have access to those shares can view the logs. With "admin-only", only Data.all admins can view the share logs and when "Disabled", no user can see the share logs. Please check out the config.json in step 7 for more details | +| nlq.active | worksheets | Disable / Enable natural language querying powered by genAI in worksheets (experimental feature - default: False) | + ### Customizing Module Features @@ -595,6 +637,7 @@ In addition to disabling / enabling, some module features allow for additional c | **Customization** | **Module** | **Description** | |--------------------------------|------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | custom_confidentiality_mapping | s3_datasets | Provides custom confidentiality mapping json which maps your custom confidentiality levels to existing data.all confidentiality
For e.g. ```custom_confidentiality_mapping : { "Public" : "Unclassified", "Private" : "Official", "Confidential" : "Secret", "Very Highly Confidential" : "Secret"}```
This will display confidentiality levels - Public, Private, Confidential & Very Highly Confidential - in the confidentiality drop down and maps it existing confidentiality levels in data.all - Unclassified, Official and Secret |
+| nlq.max_count_per_day | worksheets | Set a limit on the number of invocations allowed per user per day for the genAI NLQ worksheets feature (default: 10) |
+
 
 ### Disable and customize core features
 
@@ -607,17 +650,32 @@ a particular feature in the core is to add it to the core section of the `config
       "env_aws_actions": true,
       "cdk_pivot_role_multiple_environments_same_account": false,
       "enable_quicksight_monitoring": false
-    }
+    },
+    "log_query_period_days": 1
   }
 ```
 
 This is the list of core features that can currently be customized. Take it as an example if you need to disable or
 modify the bahavior any other core feature.
 
-| **Feature** | **Module** | **Description** |
-|-----------------------|----------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| env_aws_actions | environments | If set to True, users can get AWS Credentials and assume Environment Group IAM roles from data.all's UI |
-| cdk_pivot_role_multiple_environments_same_account | environments | If set to True, the CDK-created pivot role as part of the environment stack will be region specific (`dataallPivotRole-cdk-`). This feature allows users to create multiple data.all environments in the same account but multiple regions. |
-| enable_quicksight_monitoring | environments | If set to **true**, RDS security groups and VPC NACL rules are modified to allow connection of the RDS metadata database with Quicksight in the infrastructure account (default: false) |
+| **Feature** | **Module** | **Description** |
+|-----------------------|--------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| env_aws_actions | environments | If set to True, users can get AWS Credentials and assume Environment Group IAM roles from data.all's UI |
+| cdk_pivot_role_multiple_environments_same_account | environments | If set to True, the CDK-created pivot role as part of the environment stack will be region specific (`dataallPivotRole-cdk-`). This feature allows users to create multiple data.all environments in the same account but multiple regions. |
+| enable_quicksight_monitoring | environments | If set to **true**, RDS security groups and VPC NACL rules are modified to allow connection of the RDS metadata database with Quicksight in the infrastructure account (default: false) |
+| log_query_period_days | global | Specify the time frame for querying the log history. This log history is used for Stacks view and shared log views. |
+| show_stack_logs | environments | Enable / Disable showing stack logs to users or only allow admins to view stack logs. When "Enabled", users who have access to stack can view the logs. With "admin-only", only Data.all admins can view the stack logs and when "Disabled", no user can see the stack logs. Please check out the config.json in step 7 for more details |
+
+### (Optional) Additional Set Up For Worksheets GenAI Natural Language Query (NLQ) Features
+
+To use these features, your data.all admin team must enable access to the Claude 3.5 Sonnet model hosted in Amazon Bedrock in the Deployment Account of data.all. To do so, the data.all admin team can follow these steps:
+
+1. Navigate to the Amazon Bedrock console and select Model Access in the left navigation pane
+2. Choose Modify Model Access and select the Claude 3.5 Sonnet model
+   1. Be sure to review the End User License Agreement (EULA) for terms and conditions of using the model before requesting access to it
+3. Select Next, review any additional terms documents, and when ready submit the request
+4. If the request is successful, the Access status changes to Access granted or Available to request (this may take a few minutes)
+
+For more information about enabling foundation model access in Amazon Bedrock, please refer to the [AWS Documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access-modify.html).
 
 ## 8. Run CDK synth and check cdk.context.json
 
diff --git a/pages/userguide/img/worksheets/ws_analyze_txt_doc.png b/pages/userguide/img/worksheets/ws_analyze_txt_doc.png
new file mode 100644
index 000000000..ba000b7bb
Binary files /dev/null and b/pages/userguide/img/worksheets/ws_analyze_txt_doc.png differ
diff --git a/pages/userguide/img/worksheets/ws_text_to_sql.png b/pages/userguide/img/worksheets/ws_text_to_sql.png
new file mode 100644
index 000000000..26a12c211
Binary files /dev/null and b/pages/userguide/img/worksheets/ws_text_to_sql.png differ
diff --git a/pages/userguide/worksheets.md b/pages/userguide/worksheets.md
index 0f0bef047..c756a9d72 100644
--- a/pages/userguide/worksheets.md
+++ b/pages/userguide/worksheets.md
@@ -54,7 +54,7 @@ coming from Athena will pop-up automatically.
 
 ![worksheets](img/worksheets/ws_joins.png#zoom#shadow)
 
-If you want to save the current query for later or for other users, click on the **save** icon (between the edit and the
+If you want to save the current query for later or for other users, click on the **save** icon (next to the edit and the
 delete buttons).
 
 > **✅ More than just SELECT**
@@ -65,4 +65,59 @@ delete buttons).
 > for more information on AWS Athena SQL syntax.
 
 
+## 🆕 **Experimental Features: GenAI Powered Worksheets**
+
+As part of data.all >= v2.7, we introduced support for generative AI-powered worksheet features. These features include both:
+
+1. Natural Language Querying (NLQ) of Structured Data
+2. Text Document Analysis of Unstructured Data
+
+These features are optionally enabled/disabled via feature flags specified in data.all's configuration.
+
+More details on how to use each of these features are provided below.
+
+### Natural Language Querying (NLQ) of Structured Data
+
+data.all offers an NLQ feature to significantly reduce the barrier to entry for non-technical business users who need to quickly and easily query data to make informed decisions.
+
+Given a prompt and a selection of tables, data.all's NLQ feature will generate the corresponding SQL statement that data.all users can execute against the data they have access to in data.all's Worksheets module.
+
+To start generating SQL, data.all users can select the TextToSQL Tab in the Worksheets View:
+
+![worksheets_nlq](img/worksheets/ws_text_to_sql.png#zoom#shadow)
+
+Users select the Worksheet environment, database and one or more tables where the data of interest is stored. Then they introduce a prompt describing the operation they want to perform. For example, they could type something like "Give me the top 3 clients in the last 10 months". Once they send the request to generate the query, data.all will invoke the Claude 3.5 Sonnet model using Amazon Bedrock to generate a response.
+
+To enrich the context of the genAI request, data.all fetches the Glue metadata of the tables and database and passes it to the LLM. Access to Glue is limited to the tables the user has access to; in other words, only accessible Glue tables are fetched.
+
+In addition, there are built-in guardrails to avoid mutating SQL statements (e.g. WRITE, UPSERT, DELETE).
+
+data.all Admins can additionally limit the number of invocations run against these LLMs by specifying a `max_count_per_day` feature flag in data.all's configuration (please reference data.all's Deployment Guide for more information).
+
+
+### Text Document Analysis of Unstructured Data
+
+For unstructured text documents, data.all offers a feature to start analyzing your data using natural language.
+
+Given a prompt and a selected text document in an S3 Dataset, data.all's Document Analyzer feature will generate a response displayed in the data.all Worksheet Editor.
+
+> **⚠️ Limitations of Document Analysis**
+>
+> Currently data.all's Worksheet Document Analyzer is limited only to `.txt` and `.pdf` file extensions. Additionally, the feature is limited only to
+> text documents which are explicitly owned by one of the user's teams (documents to which access is granted via data.all shares are not yet supported).
+
+
+To start analyzing your text documents, data.all users can select the Document Analyzer Tab in the Worksheets View:
+
+![worksheets_unstructured](img/worksheets/ws_analyze_txt_doc.png#zoom#shadow)
+
+Users select the Worksheet environment, S3 dataset bucket and S3 object key (.txt or .pdf file) where the data of interest is stored. Then they introduce a prompt describing the information they want from the text document. For example, they could type something like "Give me the most prevalent 3 themes across this document". Once they send the request, data.all will invoke the Claude 3.5 Sonnet model using Amazon Bedrock to generate a response.
+
+data.all fetches the content of the S3 Object and passes it to the LLM along with the user prompt. Access to S3 is limited to the buckets the user owns.
+
+There are built-in guardrails to reduce hallucinations by ensuring the selected S3 Object contains information pertaining to the user's prompt.
+
+data.all Admins can additionally limit the number of invocations run against these LLMs by specifying a `max_count_per_day` feature flag in data.all's configuration (please reference data.all's Deployment Guide for more information).
+
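+Conceptually, the document-analysis request resembles a single Amazon Bedrock `Converse` call that combines the user's prompt with the document text. The sketch below is illustrative only: the model ID, bucket/key handling and prompt wrapping are assumptions, not data.all's actual implementation.
+
+```python
+# Illustrative sketch of the kind of Bedrock call behind the Document Analyzer.
+# Model ID, prompt wrapping and error handling are assumptions; a plain-text
+# document is assumed (PDF text extraction is omitted for brevity).
+import boto3
+
+s3 = boto3.client("s3")
+bedrock = boto3.client("bedrock-runtime")
+
+def analyze_document(bucket: str, key: str, user_prompt: str) -> str:
+    # Fetch the document content from the user's own S3 bucket.
+    document_text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
+    # Send prompt + document to Claude 3.5 Sonnet via the Bedrock Converse API.
+    response = bedrock.converse(
+        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
+        messages=[{
+            "role": "user",
+            "content": [{"text": f"{user_prompt}\n\nDocument:\n{document_text}"}],
+        }],
+    )
+    return response["output"]["message"]["content"][0]["text"]
+```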