Commit 87bd5e6

erindru and treysp authored
Docs: Scheduler facade docs (#3558)
Co-authored-by: Trey Spiller <1831878+treysp@users.noreply.github.com>
1 parent ed63bab commit 87bd5e6

13 files changed

+233
-1
lines changed

Lines changed: 226 additions & 0 deletions
@@ -0,0 +1,226 @@
# Airflow

Tobiko Cloud's Airflow integration allows you to combine Airflow system monitoring with the powerful debugging tools in Tobiko Cloud.

## Setup

Your SQLMesh project must be configured and connected to Tobiko Cloud before using the Airflow integration.

Learn more about connecting to Tobiko Cloud in the [Getting Started page](../../tcloud_getting_started.md).

### Install libraries

After connecting your project to Tobiko Cloud, you're ready to set up the Airflow integration.

Start by installing the `tobiko-cloud-scheduler-facade` library in your Airflow runtime environment.

Make sure to include the `[airflow]` extra in the installation command (quoting the package name keeps shells such as zsh from interpreting the square brackets):

``` bash
$ pip install "tobiko-cloud-scheduler-facade[airflow]"
```

### Connect Airflow to Tobiko Cloud

Next, add an Airflow [connection](https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html#creating-a-connection-with-the-ui) containing your Tobiko Cloud credentials.

Specify these fields when adding the connection:

- **Connection ID**: connection name of your choice
    - May not contain spaces, single quotes `'`, or double quotes `"`
- **Connection Type**: always HTTP
- **Host**: URL for your Tobiko Cloud project
- **Password**: your Tobiko Cloud API token

The host URL and password values will be provided to you during your Tobiko Cloud onboarding.

It is convenient to specify the connection in the Airflow UI, as in this example with the name `tobiko_cloud`:

![Add a connection in the Airflow UI](./airflow/add_connection.png)

If the connection is successful, it will appear in the connection list:

![List of connections in the Airflow UI](./airflow/connection_list.png)

!!! info "Remember the connection name!"

    Name the connection whatever you like, but remember that name because it's used for the `conn_id` parameter below.

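If you prefer not to use the UI, Airflow can also read connections from environment variables named `AIRFLOW_CONN_<CONNECTION ID IN UPPER CASE>`, with a JSON value in recent Airflow 2 releases. The sketch below builds such a variable for the fields listed above and enforces the connection ID rule. The helper name `tobiko_connection_env`, the host URL, and the token are placeholders for illustration; whether your deployment should define connections this way is an assumption to check against your Airflow setup.

```python
import json


def tobiko_connection_env(conn_id: str, host: str, token: str) -> tuple[str, str]:
    """Return the (environment variable name, JSON value) for an HTTP connection."""
    # The guide says a Connection ID may not contain spaces or quote characters.
    if not conn_id or any(ch in " '\"" for ch in conn_id):
        raise ValueError(f"invalid connection ID: {conn_id!r}")
    # Airflow looks up connections in AIRFLOW_CONN_<CONN_ID in upper case>.
    name = f"AIRFLOW_CONN_{conn_id.upper()}"
    value = json.dumps({"conn_type": "http", "host": host, "password": token})
    return name, value


# Placeholder values: use the host URL and API token from your onboarding.
name, value = tobiko_connection_env(
    "tobiko_cloud", "https://cloud.example.com/my_project", "<your-api-token>"
)
print(f"export {name}='{value}'")
```

Set the printed variable in the environment of every Airflow component that loads the DAG. Connections defined this way do not appear in the UI connection list.
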
## Create a DAG

You are now ready to create an Airflow DAG that connects to Tobiko Cloud.

This example code demonstrates the creation process, which requires:

- Importing the `SQLMeshEnterpriseAirflow` class
- Creating a `SQLMeshEnterpriseAirflow` instance with your Airflow connection ID (the name from [above](#connect-airflow-to-tobiko-cloud))
- Creating the DAG object with the `create_cadence_dag()` method

```python linenums="1"
# folder: dags/
# file name: tobiko_cloud_airflow_integration.py

# Import SQLMeshEnterpriseAirflow class
from tobikodata.scheduler_facades.airflow import SQLMeshEnterpriseAirflow

# Create SQLMeshEnterpriseAirflow instance with connection ID
tobiko_cloud = SQLMeshEnterpriseAirflow(conn_id="tobiko_cloud")

# Create DAG for `prod` environment from SQLMeshEnterpriseAirflow instance
first_task, last_task, dag = tobiko_cloud.create_cadence_dag(environment="prod")
```

This is all that's needed to integrate with Tobiko Cloud!

## Monitor Tobiko Cloud actions

Once your DAG is loaded by Airflow, it will be populated with the SQLMesh models for the specified `environment` and will automatically trigger when the next Cloud Scheduler run happens.

You will see an entry in the DAG list:

![Airflow UI list of DAGs](./airflow/dag_list.png)

You can browse the DAG just like any other DAG. Each node is a SQLMesh model:

![Airflow UI DAG view](./airflow/dag_view.png)

## How it works

Tobiko Cloud uses a custom approach to Airflow integration. This section describes how it works.

The local Airflow DAG mirrors the progress of the Tobiko Cloud scheduler run: each local task reflects the outcome of its corresponding remote task.

This allows you to observe at a glance how your data pipeline is progressing, displayed alongside your other pipelines in Airflow. No need to navigate to Tobiko Cloud!

### Why a custom approach?

Tobiko Cloud's scheduler performs multiple optimizations to ensure that your pipelines run correctly and efficiently. Those optimizations are only possible within our SQLMesh-aware scheduler.

Our approach allows you to benefit from those optimizations while retaining the flexibility to attach extra tasks or logic to the DAG in your broader pipeline orchestration context.

Because `run`s are still triggered by the Tobiko Cloud scheduler and tasks in the local DAG just reflect their remote equivalents in Tobiko Cloud, we call our custom approach a *facade*.

## Debugging

Each task in the local DAG writes logs that include a link to its corresponding remote task in Tobiko Cloud.

In the Airflow UI, find these logs in the task's Logs tab:

![Airflow UI task logs](./airflow/task_logs.png)

Clicking the link opens the remote task in the Tobiko Cloud [Debugger View](../debugger_view.md), which provides information and tools to aid debugging:

![Tobiko Cloud UI debugger view](./airflow/cloud_debugger.png)

## Extending the DAG

You may extend the local DAG by passing arguments to the `create_cadence_dag()` method and by attaching tasks to the references it returns.

This section describes how to extend your local DAG and demonstrates some simple extensions.

### Base DAG structure

The local DAG represents your SQLMesh project's models and their activity in Tobiko Cloud. This section describes how the DAG is structured.

The DAG is composed of SQLMesh models, but there must be a boundary around those models to separate them from your broader Airflow pipeline. The boundary consists of two tasks that serve as entry and exit nodes for the entire Tobiko Cloud run.

The first and last tasks in the DAG are the boundary tasks. The tasks are the same in every local DAG instance:

- First task: `Sensor` task that synchronizes with Tobiko Cloud
- Last task: `DummyOperator` task that ensures all models without downstream dependencies have completed before declaring the DAG completed

![Airflow DAG boundary tasks](./airflow/boundary_tasks.png)

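To make "models without downstream dependencies" concrete, the sketch below finds those models in a small dependency graph. It is a plain-Python illustration of the idea only, not the library's implementation; the `leaf_models` helper and the model names are hypothetical.

```python
from typing import Dict, List, Set


def leaf_models(upstreams: Dict[str, List[str]]) -> Set[str]:
    """Return the models that no other model reads from."""
    has_downstream = {parent for parents in upstreams.values() for parent in parents}
    return set(upstreams) - has_downstream


# Hypothetical project: each model maps to the models it reads from.
upstreams = {
    "sushi.orders": [],
    "sushi.customers": ["sushi.orders"],
    "sushi.revenue": ["sushi.orders"],
    "sushi.top_customers": ["sushi.customers", "sushi.revenue"],
    "sushi.audit_log": [],
}

print(sorted(leaf_models(upstreams)))  # ['sushi.audit_log', 'sushi.top_customers']
```

In this picture, the last boundary task sits downstream of `sushi.audit_log` and `sushi.top_customers`, so it can only complete after every model in the run has finished.
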
### Using `create_cadence_dag()`

The local DAG is extended at the time of creation via arguments to the `create_cadence_dag()` method.

Each DAG corresponds to a specific SQLMesh project environment (`prod` by default). Specify another environment by passing its name to the `environment` argument of `create_cadence_dag()`.

The `create_cadence_dag()` method returns a tuple of three references:

- `first_task` - a reference to the first task in the DAG (always the `Sensor` boundary task)
- `last_task` - a reference to the last task in the DAG (always the `DummyOperator` boundary task)
- `dag` - a reference to the Airflow `DAG` object

Use these references to manipulate the DAG and attach extra behavior.

### Examples

#### Slack notification when run begins

Attach a task to the `first_task` to send a Slack notification when a `run` begins:

```python
# Import path for the Slack provider package in Airflow 2
from airflow.providers.slack.operators.slack import SlackAPIPostOperator

# Create DAG
first_task, last_task, dag = tobiko_cloud.create_cadence_dag(environment="prod")

# Attach Slack operator to first_task
first_task >> SlackAPIPostOperator(task_id="notify_slack", channel="#notifications", ...)
```

Airflow DAG view:

![Airflow DAG with Slack notification task](./airflow/add_task_at_start.png)

#### Send email and trigger DAG when run completes

Attach tasks to the `last_task` to send an email and trigger another DAG on `run` completion:

```python
# Import paths for Airflow 2
from airflow.operators.email import EmailOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# Create DAG
first_task, last_task, dag = tobiko_cloud.create_cadence_dag(environment="prod")

# Attach Email operator to last_task
# (EmailOperator requires `html_content`; the message here is a placeholder)
last_task >> EmailOperator(task_id="notify_admin", to="admin@example.com", subject="SQLMesh run complete", html_content="The SQLMesh run has completed.")

# Attach DAG trigger operator to last_task
last_task >> TriggerDagRunOperator(task_id="trigger_job", trigger_dag_id="some_downstream_job")
```

Airflow DAG view:

![Airflow DAG with email and DAG trigger tasks on run completion](./airflow/add_task_at_end.png)

#### Trigger DAG when specific model completes

Trigger another DAG after a specific model has completed, without waiting for the entire run to complete:

```python
# Import path for Airflow 2
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# Create DAG
first_task, last_task, dag = tobiko_cloud.create_cadence_dag(environment="prod")

# Get `sushi.customers` model task
customers_task = dag.get_task("sushi.customers")

# Attach DAG trigger operator to `sushi.customers` model task
customers_task >> TriggerDagRunOperator(task_id="customers_updated", trigger_dag_id="some_other_pipeline", ...)
```

Airflow DAG view:

![Airflow DAG with DAG trigger task on a specific model's completion](./airflow/add_task_after_specific_model.png)

!!! info "Model task names"

    Each model's Airflow `task_id` is the SQLMesh fully qualified model name. View a task's `task_id` by hovering over its node in the Airflow DAG view.

    Each model's display name in the Airflow DAG view is just the *table* portion of the fully qualified model name. For example, a SQLMesh model named `foo.model_a` will be labeled `model_a` in the Airflow DAG view.

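The relationship between the two names can be sketched with a small helper. This is an illustration only: `display_name` is a hypothetical function, and it assumes the table portion is everything after the last dot, which would not hold for quoted identifiers that themselves contain dots.

```python
def display_name(task_id: str) -> str:
    """Return the part of a fully qualified model name after the last dot."""
    return task_id.rpartition(".")[2]


print(display_name("foo.model_a"))  # model_a
```

When looking up a task with `dag.get_task()`, pass the full `task_id`, not the shortened display name.
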
## Configuration

### `SQLMeshEnterpriseAirflow` parameters

| Option    | Description                                                              | Type | Required |
|-----------|--------------------------------------------------------------------------|:----:|:--------:|
| `conn_id` | The Airflow connection ID containing the Tobiko Cloud connection details | str  |    Y     |

### `create_cadence_dag()` parameters

| Option               | Description                                                                            | Type | Required |
|----------------------|----------------------------------------------------------------------------------------|:----:|:--------:|
| `environment`        | Which SQLMesh environment to target. Default: `prod`                                   | str  |    N     |
| `dag_kwargs`         | A dict of arguments to pass to the Airflow DAG object when it is created               | dict |    N     |
| `common_task_kwargs` | A dict of kwargs to pass to all task operators in the DAG                              | dict |    N     |
| `sensor_task_kwargs` | A dict of kwargs to pass to just the sensor task operators in the DAG                  | dict |    N     |
| `report_task_kwargs` | A dict of kwargs to pass to just the model / progress report task operators in the DAG | dict |    N     |

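
As a sketch of how these options might be combined, the example below passes `dag_kwargs` and `common_task_kwargs` when creating the DAG. The specific keys are standard Airflow `DAG` and operator arguments chosen for illustration; which keys are safe to override depends on what the library already sets, so treat the values as assumptions. The wrapper function `build_cadence_dag` is hypothetical.

```python
from datetime import timedelta

# Passed to the Airflow DAG constructor (`tags` and `max_active_runs` are standard DAG arguments).
DAG_KWARGS = {"tags": ["tobiko_cloud"], "max_active_runs": 1}

# Passed to every task operator (`retries` and `retry_delay` are standard operator arguments).
COMMON_TASK_KWARGS = {"retries": 2, "retry_delay": timedelta(minutes=1)}


def build_cadence_dag(conn_id: str = "tobiko_cloud", environment: str = "prod"):
    """Create the facade DAG and return (first_task, last_task, dag)."""
    # Imported inside the function so the sketch can be read without the library installed.
    from tobikodata.scheduler_facades.airflow import SQLMeshEnterpriseAirflow

    tobiko_cloud = SQLMeshEnterpriseAirflow(conn_id=conn_id)
    return tobiko_cloud.create_cadence_dag(
        environment=environment,
        dag_kwargs=DAG_KWARGS,
        common_task_kwargs=COMMON_TASK_KWARGS,
    )


# In a real DAG file, call this at module level so Airflow can discover the DAG:
# first_task, last_task, dag = build_cadence_dag()
```
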