feat: add a new ternary contour plot operator #4193
Open: ELin2025 wants to merge 7 commits into apache:main from ELin2025:ternary-contour
…thon based operators (apache#4189)

### What changes were proposed in this PR?

This PR introduces a PythonTemplateBuilder mechanism for creating Texera's Python native operators. It refactors how Python code is generated using a new template concept, addressing prior issues with string formatting.

Previously, Python-based operators were created via raw string formatting, which is fragile: user text can contain `{}`, `%`, quotes, or newlines that break formatting. This PR makes codegen deterministic and safer by treating interpolated values as data segments.

#### Design

**Diagram 1** (compile-time `pyb` expansion and validation)

This diagram describes the Scala compile-time flow when a developer writes a `pyb"..."` template: the `pyb` macro receives the literal parts and argument trees, verifies that literal segments are safe, classifies each interpolated argument (plain text vs. encodable vs. nested builder), and applies boundary validation to ensure encodable content cannot "break out" of its intended Python context. Each argument is evaluated once, runtime guards are injected when a nested builder is spliced in, and the pieces are concatenated into a `PythonTemplateBuilder`, which compacts adjacent text chunks and renders an `encode()` output where encodable values become decode-at-runtime segments before the generated Python is embedded into the operator payload.

```mermaid
sequenceDiagram
  participant Dev as Scala code
  participant SC as StringContext
  participant M as pyb macro
  participant EI as EncodableInspector
  participant BV as BoundaryValidator
  participant PTB as PythonTemplateBuilder

  Dev->>SC: pyb"t0 $a0 t1 $a1 t2"
  SC->>M: parts + arg trees
  M->>M: verify literal parts
  M->>EI: classify args
  loop each direct encodable arg
    M->>BV: validateCompileTime(left,right,prefixLine)
    BV-->>M: ok / abort
  end
  M->>M: eval each arg once into __pyb_argN
  loop each nested builder arg
    M->>BV: runtimeChecksForNestedBuilder(ctx,__pyb_argN)
    BV-->>M: injected guard if unsafe
  end
  M->>PTB: concat parts + __pyb_argN
  PTB-->>Dev: returns PythonTemplateBuilder
  PTB->>PTB: compact adjacent Text chunks
  PTB->>PTB: render Encode (encodable -> decode(base64))
  PTB-->>Dev: encode() returns python source string
  Dev->>Dev: embed generated python into operator payload
```

**Diagram 2** (end-to-end runtime flow: UI → descriptor → worker decoding with cache)

This diagram illustrates the end-to-end pipeline from UI input to execution: the UI submits parameters (including user-controlled strings) to the Scala descriptor, where `pyb` expansion and `PythonTemplateBuilder` assembly produce a deterministic Python source string in "encode mode." The encoded Python is embedded into the workflow plan payload, dispatched by the workflow service to the Python worker, and executed by the operator; during execution, the operator uses `PythonTemplateDecoder` to recover user text by decoding each encoded segment. An LRU cache (size 256) backs the decoder so repeated encoded strings decode once and subsequently reuse cached UTF-8 strings, reducing overhead while preserving strict decoding semantics.

```mermaid
sequenceDiagram
  autonumber
  participant UI as UI Web
  participant DESC as Descriptor (Scala)
  participant MAC as pyb macro (compile time)
  participant PTB as PythonTemplateBuilder
  participant PLAN as Plan payload
  participant SVC as Workflow service
  participant WK as Python worker
  participant OP as Python Operator
  participant DEC as PythonTemplateDecoder
  participant CACHE as lru_cache 256

  note over DESC,PTB: PyB related (Scala compile time codegen)
  UI->>DESC: submit params + code strings
  DESC->>MAC: pyb interpolation expands
  MAC-->>DESC: expanded builder + validation logic
  DESC->>PTB: assemble chunks (Text + Value)
  PTB-->>DESC: rendered python source (encode mode)

  note over DESC,WK: Plan + dispatch
  DESC->>PLAN: embed python source into payload
  PLAN->>SVC: submit workflow plan
  SVC->>WK: dispatch operator payload

  note over WK,DEC: Python runtime (worker executes generated source)
  WK->>OP: start operator with python source
  loop each encoded segment
    OP->>DEC: decode(base64)
    DEC->>CACHE: lookup(base64)
    alt cache hit
      CACHE-->>DEC: cached str
    else cache miss
      CACHE-->>DEC: miss
      DEC->>DEC: base64 decode + utf8 strict
      DEC->>CACHE: store(base64,str)
    end
    DEC-->>OP: recovered user text
  end
  OP-->>WK: execution continued
```

**Diagram 3** (test harness: generate code, reject raw-invalid, `py_compile`)

This diagram shows the automated verification path for Python native operators: ScalaTest uses ClassGraph to discover every `PythonOperatorDescriptor`, instantiates each descriptor, injects invalid raw strings into class fields marked with `Json` properties, and calls `generatePythonCode()` to produce the final Python source string. The test asserts that no "RawInvalid" marker appears in the generated output (indicating unsafe raw text did not leak), writes the source to a temporary `source.py`, and runs `python -m py_compile` to ensure the code is syntactically valid and compilable. Any raw-invalid leakage, compile error, or timeout causes the test to fail, enforcing consistent template-based code generation across operators.

```mermaid
sequenceDiagram
  autonumber
  participant TS as ScalaTest
  participant CG as ClassGraph scanner
  participant DESC as PythonOperatorDescriptor
  participant GEN as generatePythonCode
  participant SPEC as PythonCodeRawInvalidTextSpec
  participant PY as python -m py_compile
  participant FS as temp file (source.py)

  TS->>CG: scan descriptors in packages
  CG-->>TS: list of PythonOperatorDescriptor classes
  loop each descriptor class
    TS->>DESC: instantiate descriptor
    TS->>GEN: call generatePythonCode(descriptor)
    GEN-->>TS: python source string
    TS->>SPEC: assert RawInvalid marker not present
    alt marker leaked
      SPEC-->>TS: FAIL (invalid raw text leaked)
    else marker clean
      SPEC-->>TS: OK
      TS->>FS: write source to temp file
      TS->>PY: py_compile(temp file)
      alt compile error or timeout
        PY-->>TS: FAIL (compile/timeout)
      else compile ok
        PY-->>TS: PASS
      end
    end
  end
```

#### As a developer, how to use `pyb` to create your Python-based operators

1. **Use `EncodableString` for any UI/user-controlled text**

Before (raw `String`)

```scala
@JsonSchemaTitle("Ground Truth Attribute Column")
@AutofillAttributeName
var groundTruthAttribute: String = ""

@JsonSchemaTitle("Selected Features")
@AutofillAttributeNameList
var selectedFeatures: List[String] = _
```

After (`EncodableString`)

```scala
import org.apache.texera.amber.pybuilder.PyStringTypes.EncodableString

@JsonSchemaTitle("Ground Truth Attribute Column")
@AutofillAttributeName
var groundTruthAttribute: EncodableString = ""

@JsonSchemaTitle("Selected Features")
@AutofillAttributeNameList
var selectedFeatures: List[EncodableString] = _
```

---

2. **Write Python using `pyb"""..."""` and interpolate values with `$param`**

Before (string interpolation with manual quoting)

```scala
val code =
  s"""
     |y_train = self.dataset[\"$groundTruthAttribute\"]
     |""".stripMargin
```

After (template + data: no manual quoting)

```scala
import org.apache.texera.amber.pybuilder.PythonTemplateBuilder.PythonTemplateBuilderStringContext

val code =
  pyb"""
     |y_train = self.dataset[$groundTruthAttribute]
     |""".encode // Automatic stripMargin applied inside the builder
```

---

3. **For optional arguments, represent them as small `pyb` fragments, then put them in the code template**

Before (manual string concatenation + quote juggling)

```scala
val colorArg = if (color.nonEmpty) s", color='$color'" else ""
val patternArg = if (pattern.nonEmpty) s", pattern_shape='$pattern'" else ""
val fig = s"fig = px.timeline(table, x_start='start', x_end='finish', y='task'$colorArg$patternArg)"
```

After (optional fragments are builders too)

```scala
val colorArg = if (color.nonEmpty) pyb", color=$color" else pyb""
val patternArg = if (pattern.nonEmpty) pyb", pattern_shape=$pattern" else pyb""
val fig = pyb"""fig = px.timeline(table, x_start=$start, x_end=$finish, y=$task$colorArg$patternArg)"""
```

---

4. **Return `.encode` from `generatePythonCode()`**

Before (returns raw string)

```scala
override def generatePythonCode(): String = {
  val finalCode =
    s"""
       |from pytexera import *
       |y_train = self.dataset[\"$groundTruthAttribute\"]
       |""".stripMargin
  finalCode
}
```

After (returns encoded output from builder)

```scala
override def generatePythonCode(): String = {
  val finalCode =
    pyb"""
       |from pytexera import *
       |y_train = self.dataset[$groundTruthAttribute]
       |"""
  finalCode.encode
}
```

---

5. **Try to avoid the use of `s"..."`, `.format`, or `%` formatting for Python codegen**

Before (`s` / `String.format` / `.format` patterns)

```scala
// s"..."
return s"""table[\"${ele.attribute}\"].values.shape[0]"""

// String.format / "{}" placeholders
workflowParam = workflowParam + String.format("%s = {},", ele.parameter.getName)
portParam = portParam + String.format("%s(table['%s'].values[i]),", ele.parameter.getType, ele.attribute)
```

After (`pyb` templates end-to-end)

```scala
return pyb"""table[${ele.attribute}].values.shape[0]"""

workflowParam = pyb"$workflowParam${ele.parameter.getName} = {},"
portParam = pyb"$portParam${ele.parameter.getType}(table[${ele.attribute}].values[i]),"
```

---

6. **Develop the unit tests in the new way**

Before (expects quoted literals like `'start'`)

```scala
assert(
  opDesc.createPlotlyFigure().plain.contains(
    "fig = px.timeline(table, x_start='start', x_end='finish', y='task' , color='color' )"
  )
)
```

After (expects template output using variables, no embedded quotes)

```scala
assert(
  opDesc.createPlotlyFigure().plain.contains(
    "fig = px.timeline(table, x_start=start, x_end=finish, y=task , color=color )"
  )
)
```

### Any related issues, documentation, discussions?

No

### How was this PR tested?

The PR includes a set of tests to ensure the new functionality works and that it doesn't break existing workflows:

Unit tests for PythonTemplateBuilder: New unit tests were added to verify that PythonTemplateBuilder correctly classifies and encodes segments. For example, tests likely feed in code strings with various edge cases (braces, percentage signs, quotes, etc.) and assert that the builder produces the expected spec output.

Unit tests for PythonCodeRawInvalidTextSpec: Two new unit tests instantiate each Python native operator, call its `generatePythonCode` method, and check that the Python code compiles and the string format is consistent.

### Was this PR authored or co-authored using generative AI tooling?

Reviewed by ChatGPT 5.2
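The worker-side decode step from Diagram 2 (base64 decode, strict UTF-8, LRU cache of size 256) could look roughly like the sketch below. This is only an illustration of the idea, not the actual `PythonTemplateDecoder`: the names `encode_segment` and `decode_segment` are hypothetical, and only the Python standard library is used.

```python
import base64
from functools import lru_cache


def encode_segment(text: str) -> str:
    """Hypothetical stand-in for the encode side: user text -> base64 ASCII."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


@lru_cache(maxsize=256)  # cache size taken from the diagram; the rest is assumed
def decode_segment(encoded: str) -> str:
    """Recover the original user text from one base64-encoded segment."""
    # validate=True rejects characters outside the base64 alphabet
    raw = base64.b64decode(encoded, validate=True)
    # errors="strict" raises UnicodeDecodeError on malformed UTF-8
    return raw.decode("utf-8", errors="strict")


# User text with characters that would break naive string formatting
tricky = 'col "{}" % \'x\'\nnext'
encoded = encode_segment(tricky)
first = decode_segment(encoded)   # cache miss: decoded and stored
second = decode_segment(encoded)  # cache hit: reused from the cache
```

Because the encoded segment contains only base64 characters, braces, percent signs, quotes, and newlines in the user text never reach the generated Python source as literal syntax.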
What changes were proposed in this PR?
This change relates to the addition of a ternary contour plot operator, which visualizes how a scalar value varies as a function of three normalized components that sum to a constant (typically 1 or 100%).
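In a ternary contour plot, each vertex of the triangle represents 100% of one component, each interior point represents a mixture of the three, and contour lines connect mixtures that share the same output value.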
This visualization is useful for identifying regions where the output is optimized or insensitive to changes in the component proportions, as well as for understanding trade-offs between the three variables.
The operator takes four inputs. The first three are the component variables, and the fourth is the output value that corresponds to each combination of the three component proportions.
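To make the geometry concrete, here is a minimal sketch of how three component values can be normalized to sum to 1 and mapped to a point inside an equilateral triangle. It is not the operator's implementation; the function names and the choice of triangle vertices (A at (0, 0), B at (1, 0), C at (0.5, √3/2)) are assumptions made for illustration.

```python
import math


def normalize(a: float, b: float, c: float) -> tuple[float, float, float]:
    """Scale three non-negative components so that they sum to 1."""
    total = a + b + c
    if total <= 0:
        raise ValueError("components must have a positive sum")
    return a / total, b / total, c / total


def ternary_to_cartesian(a: float, b: float, c: float) -> tuple[float, float]:
    """Map normalized fractions (a, b, c) to 2D coordinates.

    Assumed vertices: A = (0, 0), B = (1, 0), C = (0.5, sqrt(3) / 2).
    The point is the weighted average a*A + b*B + c*C.
    """
    x = b + 0.5 * c
    y = (math.sqrt(3) / 2) * c
    return x, y


# One input row: three raw component amounts plus the measured output value
row = {"a": 2.0, "b": 3.0, "c": 5.0, "output": 0.8}
fractions = normalize(row["a"], row["b"], row["c"])
point = ternary_to_cartesian(*fractions)
```

Each input row then becomes one `(x, y, output)` sample, and contour lines are drawn over those samples inside the triangle.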
Any related issues, documentation, discussions?
Requires the Python library scikit-image, which can be installed with: pip install scikit-image
How was this PR tested?
Tested with existing test cases
Was this PR authored or co-authored using generative AI tooling?
No