
🚀 Feat: template conversational data augmentation + e2e tests + CodeRabbit #57

Merged
nicofretti merged 31 commits into develop from feat/e2e-tests on Jan 24, 2026

Conversation

@nicofretti (Owner) commented on Jan 20, 2026

Description

  • Add data augmentation pipeline
  • Add e2e tests
  • Add Claude skills
  • Add CodeRabbit

Related Issue

Checklist

  • Code follows project style guidelines
  • Comments explain "why" not "what"
  • Documentation updated (if needed)
  • No debug code or console statements
  • make format passes
  • make pre-merge passes
  • PR updated from develop branch
  • Copilot review run and addressed

Summary by CodeRabbit

  • New Features

    • Data Augmentation template and three new pipeline blocks for sampling, semantic field generation, and duplicate detection.
    • Block editor: per-field JSON vs template mode and model/embedding dropdowns.
    • Generator UI: auto-switch between Markdown and JSON input modes based on pipeline.
  • Documentation

    • Guides for data augmentation, block implementation, PR review workflows, and webapp E2E testing added.
  • Tests

    • Extensive unit, integration, and end-to-end test suites and fixtures added.
  • Bug Fixes

    • Safer template rendering; pipeline output now filters out input metadata.

✏️ Tip: You can customize this high-level summary in your review settings.

coderabbitai bot commented Jan 20, 2026

Walkthrough

Adds a data-augmentation feature set (StructureSampler, SemanticInfiller, DuplicateRemover), JSON-or-template config support across blocks, frontend editor improvements (json-or-template toggle and model dropdowns), API/pipeline output changes, template utilities, E2E Playwright test infra + tests, various docs/skills, CI/build updates, and helper scripts for tests and PR comment inspection.

Changes

  • New Blocks & Templates
    Files: lib/blocks/builtin/structure_sampler.py, lib/blocks/builtin/semantic_infiller.py, lib/blocks/builtin/duplicate_remover.py, lib/templates/data_augmentation.yaml, lib/templates/seeds/seed_data_augmentation.json
    Three new production blocks implementing skeleton sampling, LLM-based semantic infill with diversity checks, and embedding-based duplicate detection, plus a composed data_augmentation template and seed fixture.

  • Block Utilities & Config Schema
    Files: lib/blocks/commons/template_utils.py, lib/blocks/config.py, lib/template_renderer.py, lib/workflow.py
    New template/LLM-response utilities, safe tojson behavior, per-field format hints (_config_formats) surfaced to the schema, and pipeline output filtering that strips input metadata before persisting.

  • JSON-or-Template Support (Blocks)
    Files: lib/blocks/builtin/json_validator.py, lib/blocks/builtin/validator.py, lib/blocks/builtin/field_mapper.py, lib/blocks/builtin/ragas_metrics.py, lib/blocks/builtin/structured_generator.py
    Refactors several config fields to accept JSON or Jinja templates, adds runtime rendering/validation, raises BlockExecutionError on invalid JSON, and adds _config_formats descriptors.

  • Frontend: Editor & Preview
    Files: frontend/src/components/pipeline-editor/BlockConfigPanel.tsx, frontend/src/components/pipeline-editor/BlockNode.tsx, frontend/src/pages/Generator.tsx, app.py
    BlockConfigPanel: per-field json-or-template toggle, Monaco editors, LLM/embedding model dropdowns. BlockNode: expanded preview logic and return-type signature change. Generator: replaces isMultiplierPipeline with needsMarkdown driven by first_block_type (exposed via an app.py change).

  • Templates, Docs & Skills
    Files: docs/template_data_augmentation.md, docs/how_to_create_blocks.md, llm/state-backend.md, llm/state-project.md, llm/state-frontend.md, .claude/skills/*
    New template guide, updated block docs/state files, and multiple .claude skills (code-review, debugging, implementing blocks, webapp-testing, address-pr-review) with examples.

  • E2E & Integration Tests + Helpers
    Files: tests/e2e/*, tests/integration/*, tests/blocks/*, tests/*, tests/integration/conftest.py, tests/blocks/commons/test_template_utils.py
    Adds Playwright-based E2E test suites, server orchestration helpers, integration tests for the data-augmentation pipeline, and extensive unit tests for the new blocks and template utilities.

  • Test Orchestration Scripts
    Files: scripts/with_server.py, tests/e2e/run_all_tests.sh, scripts/inspect_db_configs.py, scripts/fetch_comments.py
    Helpers to run servers for tests, run E2E suites, inspect DB model configs, and fetch/format PR comments via the gh CLI.

  • Build, CI & Config
    Files: pyproject.toml, Makefile, .coderabbit.yaml, .github/pull_request_template.md, .gitignore
    Adds scikit-learn, bumps ragas, introduces dependency-groups, pytest ignores for e2e, new Makefile e2e targets, CodeRabbit review config, a PR template update, and .gitignore adjustments to preserve .claude/skills.

  • Frontend Templates & Seeds
    Files: lib/templates/seeds/*, lib/templates/data_augmentation.yaml
    Seed fixtures and template wiring for data augmentation e2e/integration tests.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant BlockConfigPanel as BlockConfigPanel.tsx
    participant API as /api/llm-models\n/api/embedding-models
    participant Monaco as Monaco Editor

    Client->>BlockConfigPanel: Select block node
    BlockConfigPanel->>BlockConfigPanel: Reset jsonMode state
    BlockConfigPanel->>API: GET models
    API-->>BlockConfigPanel: model lists
    BlockConfigPanel->>BlockConfigPanel: Store model names

    Client->>BlockConfigPanel: Toggle json-or-template
    alt JSON Mode
        BlockConfigPanel->>Monaco: Init language=json
        Client->>Monaco: Edit JSON
        Monaco-->>BlockConfigPanel: JSON content
        BlockConfigPanel->>BlockConfigPanel: Validate JSON on save
    else Template Mode
        BlockConfigPanel->>Monaco: Init language=jinja2
        Client->>Monaco: Edit template
        Monaco-->>BlockConfigPanel: Template content
    end

    Client->>BlockConfigPanel: Select model / Save
    BlockConfigPanel->>BlockConfigPanel: Preserve custom model values
    BlockConfigPanel->>API: Persist config
sequenceDiagram
    participant Pipeline
    participant StructureSampler as StructureSampler
    participant SemanticInfiller as SemanticInfiller
    participant DuplicateRemover as DuplicateRemover
    participant LLM
    participant Embedding as Embedding Model

    Pipeline->>StructureSampler: execute(seed_data)
    StructureSampler->>StructureSampler: Analyze distributions\n(topological dependencies)
    StructureSampler-->>Pipeline: {skeletons, _seed_samples}

    Pipeline->>SemanticInfiller: execute(skeletons)
    SemanticInfiller->>LLM: Generate fields_to_generate
    LLM-->>SemanticInfiller: JSON output
    alt Diversity enabled
        SemanticInfiller->>Embedding: Embed existing samples
        SemanticInfiller->>Embedding: Embed candidate sample
        Embedding-->>SemanticInfiller: embeddings
        SemanticInfiller->>SemanticInfiller: Cosine similarity check
        alt Too similar
            SemanticInfiller->>LLM: Retry with negative exemplars/raised temp
        end
    end
    SemanticInfiller-->>Pipeline: {samples, _usage}

    Pipeline->>DuplicateRemover: execute(samples)
    DuplicateRemover->>Embedding: Embed seed samples (cached per trace)
    DuplicateRemover->>Embedding: Embed generated samples
    Embedding-->>DuplicateRemover: embeddings
    DuplicateRemover->>DuplicateRemover: Compute similarities\nflag is_duplicate
    DuplicateRemover-->>Pipeline: {generated_samples, similarity metrics}

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 46.58%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check (❓ Inconclusive): The PR description is incomplete: it is missing a detailed explanation of the implementation, does not cover how data augmentation works, and several checklist items remain unchecked without justification. Resolution: expand the description with technical details on the augmentation approach, explain how coherence and diversity are maintained, and address or explain the unchecked checklist items.

✅ Passed checks (3 passed)
  • Title check (✅ Passed): The PR title clearly describes the main features: template-based data augmentation, e2e tests, Claude skills, and CodeRabbit integration, matching the changeset contents.
  • Linked Issues check (✅ Passed): The PR addresses both linked issues: #52 (CodeRabbit integration via .coderabbit.yaml) and #38 (data augmentation pipeline via the StructureSampler, SemanticInfiller, and DuplicateRemover blocks).
  • Out of Scope Changes check (✅ Passed): All major changes align with the stated objectives: data augmentation blocks, e2e tests, Claude skills documentation, CodeRabbit config, and supporting infrastructure (templates, fixtures, utilities).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

@nicofretti (Owner, Author) commented:

@coderabbitai help

coderabbitai bot commented Jan 20, 2026

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
    • @coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
    • @coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai generate configuration to create a PR that adds the current resolved configuration as .coderabbit.yaml (or show it if already present).
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, support, documentation and community

  • Visit our status page to check the current availability of CodeRabbit.
  • Create a ticket on our support page for assistance with any issues or questions.
  • Visit our documentation site for detailed information on how to use CodeRabbit.
  • Join our Discord community to connect with other users and get help from the community.
  • Follow us on X/Twitter for updates and announcements.

@nicofretti (Owner, Author) commented:

@coderabbitai review

coderabbitai bot commented Jan 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@nicofretti changed the title from "Feat/e2e tests" to "🚀 Feat: template conversational data augmentation + e2e tests + CodeRabbit" on Jan 20, 2026
@nicofretti self-assigned this on Jan 20, 2026
@nicofretti marked this pull request as ready for review on January 20, 2026 at 22:41

coderabbitai bot left a comment

Actionable comments posted: 5

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
frontend/src/pages/Generator.tsx (1)

355-363: Static description text doesn't reflect markdown mode.

The description still says "Upload a JSON seed file" but the component now supports markdown files too. Consider making this dynamic based on needsMarkdown.

Proposed fix
         <Text sx={{ color: "fg.default" }}>
-          Upload a JSON seed file with input data. Each seed will be executed through your pipeline
-          multiple times based on repetitions.
+          {needsMarkdown
+            ? "Upload a Markdown file with your content. The file will be processed through your pipeline."
+            : "Upload a JSON seed file with input data. Each seed will be executed through your pipeline multiple times based on repetitions."}
         </Text>
Makefile (1)

1-1: Add new targets to .PHONY declaration.

The new test-e2e and test-e2e-ui targets should be declared as phony to prevent conflicts with potential files of the same name.

Proposed fix
-.PHONY: check-deps install dev dev-ui dev-backend run-dev build-ui run mock-llm clean lint format lint-frontend format-frontend format-all lint-all typecheck typecheck-frontend typecheck-all test test-integration pre-merge setup
+.PHONY: check-deps install dev dev-ui dev-backend run-dev build-ui run mock-llm clean lint format lint-frontend format-frontend format-all lint-all typecheck typecheck-frontend typecheck-all test test-integration test-e2e test-e2e-ui pre-merge setup
llm/state-project.md (1)

55-60: Directory structure omits tests/e2e.

Lines 55-60 list test directories, but the PR adds e2e tests; please include tests/e2e/ (or mention it in the testing section). As per coding guidelines, docs should reflect the actual codebase.

lib/blocks/config.py (1)

41-67: Document _config_formats metadata pattern in block configuration docs.

The _config_formats attribute is extracted in schema generation (line 49) and applied to JSON schema format hints (lines 35-36), but is not documented in llm/state-backend.md or llm/state-project.md. Update the BaseBlock interface sections to include this pattern alongside _config_enums, _field_references, and _config_descriptions.
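
A minimal sketch of the pattern to document, assuming a hypothetical block (field names are illustrative, not taken from the diff):

from typing import ClassVar


class ExampleBlock:  # stands in for a real BaseBlock subclass
    # existing, already-documented metadata attributes
    _config_descriptions: ClassVar[dict[str, str]] = {
        "mappings": "JSON object or Jinja template that renders to a JSON object",
    }
    # new pattern: per-field format hints picked up during schema generation
    # and surfaced as JSON-schema format hints for the frontend editor
    _config_formats: ClassVar[dict[str, str]] = {
        "mappings": "json-or-template",
    }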

lib/blocks/builtin/ragas_metrics.py (1)

295-310: Update RAGAS dependency constraint to match implemented API version.

Code uses RAGAS 0.4.x API (ascore(**kwargs) returning MetricResult object with .value attribute, parameter names like reference), but pyproject.toml allows ragas>=0.2.0. RAGAS 0.2.x has incompatible API (returns float directly, uses ground_truths instead of reference, different import paths). Update the constraint to ragas>=0.4.0 to prevent runtime failures.

🤖 Fix all issues with AI agents
In @.coderabbit.yaml:
- Around line 13-19: The auto_review block uses the invalid key 'branches' which
the schema ignores; replace the 'branches' field with the schema-expected
'base_branches' (keeping values/regex patterns) so auto_incremental_review
targets the intended branches—update the 'auto_review' mapping in
.coderabbit.yaml by renaming 'branches' to 'base_branches' while preserving
other keys like 'enabled' and 'auto_incremental_review'.

In `@frontend/src/components/pipeline-editor/BlockConfigPanel.tsx`:
- Around line 254-288: The model dropdowns in BlockConfigPanel.tsx (the "model"
and "embedding_model" branches) can drop saved custom model names that aren't
returned by llmModels/embeddingModels; update the rendering so if value is
non-empty and not present in the respective model array you add a Select.Option
for that exact value (e.g., a disabled or clearly labeled "Custom: {value}"
option or a normal option) before mapping the API models, so the current value
remains selectable and handleChange will preserve it on save; keep the existing
empty/default option and the onChange handler (handleChange) unchanged.

In `@lib/blocks/builtin/structured_generator.py`:
- Around line 113-120: The except block catching json.JSONDecodeError should
re-raise BlockExecutionError using exception chaining to preserve the original
traceback; update the raise in the except json.JSONDecodeError as e handler so
it raises BlockExecutionError(...) from e, keeping the same message and detail
payload (including self.json_schema_template and schema_rendered).
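
A short sketch of the requested pattern, reusing the names mentioned above (schema_rendered and self.json_schema_template come from the block; the message text is illustrative):

try:
    schema = json.loads(schema_rendered)
except json.JSONDecodeError as e:
    # "raise ... from e" keeps the original JSONDecodeError in the traceback
    raise BlockExecutionError(
        f"json_schema must be valid JSON: {e}",
        detail={
            "template": self.json_schema_template,
            "rendered": schema_rendered,
        },
    ) from e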

In `@pyproject.toml`:
- Around line 97-100: Consolidate dev dependencies by removing the duplicate
[project.optional-dependencies].dev or [dependency-groups].dev section and
migrating all dev packages into a single dev group using the preferred style
(use [dependency-groups].dev if adopting PEP 735); locate entries under
[project.optional-dependencies].dev and [dependency-groups].dev, combine their
package strings into the chosen section (ensuring versions like
"playwright>=1.57.0" are preserved) and delete the other section so only one dev
dependency block remains.

In `@scripts/with_server.py`:
- Around line 35-41: The subprocess.Popen invocation that starts the
long-running server (the proc created from cmd with preexec_fn=os.setsid)
currently pipes stdout/stderr which can deadlock when buffers fill; change the
Popen call to not use subprocess.PIPE for stdout and stderr (redirect them to
subprocess.DEVNULL or leave as inherited/None) so output is not buffered into a
fixed-size pipe while still keeping the process group cleanup (os.setsid /
os.killpg) logic intact.
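
A hedged sketch of the change (the command list and variable names are assumptions; the Popen arguments are the point):

import os
import signal
import subprocess

cmd = ["uv", "run", "uvicorn", "app:app", "--port", "8000"]  # placeholder command

# discard or inherit output instead of PIPE so a full pipe buffer can never block the server
proc = subprocess.Popen(
    cmd,
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
    preexec_fn=os.setsid,  # own process group, so killpg still tears down children
)

# ... run the tests ...

os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
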
🟡 Minor comments (21)
.gitignore-22-23 (1)

22-23: Unignore contents of .claude/skills/ explicitly.
Without the /** rule, files under the directory may remain ignored.

Suggested fix
 .claude/*
 !.claude/skills/
+!.claude/skills/**
.github/pull_request_template.md-26-26 (1)

26-26: Clarify checklist wording.
Current phrasing is awkward; consider tightening it for clarity.

Suggested wording
-- [ ] Ask CodeRabbit review addressed (comment `@coderabbitai review`)
+- [ ] CodeRabbit review requested and addressed (comment `@coderabbitai review`)
tests/e2e/README.md-270-270 (1)

270-270: Capitalize "GitHub" properly.

"github" should be "GitHub" (official spelling).

Proposed fix
-example github actions workflow:
+example GitHub Actions workflow:
tests/e2e/README.md-27-35 (1)

27-35: Add language specifier to fenced code block.

The code block is missing a language identifier. Add bash for syntax highlighting consistency.

Proposed fix
 make sure both backend and frontend can start:

-```
+```bash
 # test backend (port 8000)
 uv run uvicorn app:app --reload --host 0.0.0.0 --port 8000
frontend/src/components/pipeline-editor/BlockConfigPanel.tsx-393-465 (1)

393-465: Document the new json-or-template editor behavior.
This introduces a new editing mode + model selector UX. Please update llm/state-frontend.md accordingly. As per coding guidelines, document the new UI patterns.

scripts/inspect_db_configs.py-16-51 (1)

16-51: Ensure DB connections close on error paths.
If any query raises, the current cleanup is skipped. Wrap the body in try/finally to always await storage.close().

💡 Suggested fix
 async def main():
     storage = Storage("data/qa_records.db")
-    await storage.init_db()
-
-    # get LLM models
-    print("=== LLM Models ===")
-    llm_models = []
-
-    async def get_llm_models(db):
-        db.row_factory = aiosqlite.Row
-        cursor = await db.execute("SELECT * FROM llm_models")
-        return await cursor.fetchall()
-
-    llm_rows = await storage._execute_with_connection(get_llm_models)
-    for row in llm_rows:
-        model_dict = {key: row[key] for key in row.keys()}
-        print(model_dict)
-        llm_models.append(model_dict)
-
-    print("\n=== Embedding Models ===")
-    embedding_models = []
-
-    async def get_embedding_models(db):
-        db.row_factory = aiosqlite.Row
-        cursor = await db.execute("SELECT * FROM embedding_models")
-        return await cursor.fetchall()
-
-    emb_rows = await storage._execute_with_connection(get_embedding_models)
-    for row in emb_rows:
-        model_dict = {key: row[key] for key in row.keys()}
-        print(model_dict)
-        embedding_models.append(model_dict)
-
-    await storage.close()
-
-    return llm_models, embedding_models
+    try:
+        await storage.init_db()
+
+        # get LLM models
+        print("=== LLM Models ===")
+        llm_models = []
+
+        async def get_llm_models(db):
+            db.row_factory = aiosqlite.Row
+            cursor = await db.execute("SELECT * FROM llm_models")
+            return await cursor.fetchall()
+
+        llm_rows = await storage._execute_with_connection(get_llm_models)
+        for row in llm_rows:
+            model_dict = {key: row[key] for key in row.keys()}
+            print(model_dict)
+            llm_models.append(model_dict)
+
+        print("\n=== Embedding Models ===")
+        embedding_models = []
+
+        async def get_embedding_models(db):
+            db.row_factory = aiosqlite.Row
+            cursor = await db.execute("SELECT * FROM embedding_models")
+            return await cursor.fetchall()
+
+        emb_rows = await storage._execute_with_connection(get_embedding_models)
+        for row in emb_rows:
+            model_dict = {key: row[key] for key in row.keys()}
+            print(model_dict)
+            embedding_models.append(model_dict)
+
+        return llm_models, embedding_models
+    finally:
+        await storage.close()
tests/e2e/test_helpers.py-57-65 (1)

57-65: Log errors when pipeline count fails.
Silent failures make flaky E2E diagnostics harder. The cleanup_database() function in the same file already follows this pattern—catch exceptions and print diagnostics instead of silently passing.

 def get_pipeline_count():
     """get number of pipelines in database"""
     try:
         response = httpx.get("http://localhost:8000/api/pipelines", timeout=10.0)
         if response.status_code == 200:
             return len(response.json())
-    except Exception:
-        pass
+    except Exception as e:
+        print(f"get_pipeline_count warning: {e}")
     return -1
tests/e2e/test_helpers.py-17-37 (1)

17-37: Surface HTTP errors from cleanup operations instead of silently ignoring them.
Without checking response status codes on API calls, errors go undetected and tests can run against dirty database state. Currently, line 23 ignores the delete response entirely, and lines 27–32 skip pipeline cleanup if the list API doesn't return 200 but don't raise an error. Raise an exception on non-2xx responses so the test suite fails loudly if cleanup fails.

💡 Suggested fix
 def cleanup_database():
     """delete all pipelines, jobs, and records from the database"""
     base_url = "http://localhost:8000"
 
     try:
         # delete all records
-        httpx.delete(f"{base_url}/api/records", timeout=10.0)
+        resp = httpx.delete(f"{base_url}/api/records", timeout=10.0)
+        if resp.status_code >= 400:
+            raise RuntimeError(f"failed to delete records: {resp.status_code}")
 
         # get all pipelines
         response = httpx.get(f"{base_url}/api/pipelines", timeout=10.0)
-        if response.status_code == 200:
-            pipelines = response.json()
+        if response.status_code >= 400:
+            raise RuntimeError(f"failed to list pipelines: {response.status_code}")
+        pipelines = response.json()
 
-            # delete each pipeline
-            for pipeline in pipelines:
-                httpx.delete(f"{base_url}/api/pipelines/{pipeline['id']}", timeout=10.0)
+        # delete each pipeline
+        for pipeline in pipelines:
+            resp = httpx.delete(f"{base_url}/api/pipelines/{pipeline['id']}", timeout=10.0)
+            if resp.status_code >= 400:
+                raise RuntimeError(f"failed to delete pipeline {pipeline['id']}: {resp.status_code}")
 
         time.sleep(0.5)  # wait for cleanup to complete
 
     except Exception as e:
-        print(f"cleanup warning: {e}")
+        raise RuntimeError(f"cleanup failed: {e}") from e
tests/e2e/test_review_e2e.py-23-25 (1)

23-25: Locator result unused - no actual verification.

The .or_() chain returns a locator but the result is never used or asserted. This test doesn't actually verify the page loaded correctly.

Suggested fix
 # verify we're on review page
-# look for job selector or records section
-page.get_by_text("Select Job", exact=False).or_(page.get_by_text("Records", exact=False))
+job_or_records = page.get_by_text("Select Job", exact=False).or_(page.get_by_text("Records", exact=False))
+assert job_or_records.count() > 0, "Review page should show job selector or records section"
tests/e2e/test_review_e2e.py-180-184 (1)

180-184: Another unused locator - verification incomplete.

Same issue: locator chain returns result but no assertion verifies the trace content is visible.

Suggested fix
 # verify trace content is visible
-# look for block type, execution time, or trace data
-page.get_by_text("block_type", exact=False).or_(
-    page.get_by_text("execution_time", exact=False)
-)
+trace_content = page.get_by_text("block_type", exact=False).or_(
+    page.get_by_text("execution_time", exact=False)
+)
+assert trace_content.count() > 0, "Trace should show block_type or execution_time"
tests/blocks/test_field_mapper.py-16-28 (1)

16-28: Align test names with required pattern.

Names like test_init_with_mappings (line 17) don't follow test_<method>_<scenario>_<expected>. Please align naming across this file. As per coding guidelines, keep test naming consistent.

tests/blocks/commons/test_template_utils.py-15-46 (1)

15-46: Test names don't follow required pattern.

Line 16 (test_render_and_parse_list) and similar names don’t follow the required test_<method>_<scenario>_<expected> convention. Please align across this file (e.g., test_render_and_parse_json_list_success). As per coding guidelines, please keep test naming consistent.

tests/e2e/test_generator_e2e.py-176-178 (1)

176-178: Locators created but never asserted.

Lines 176-178 and 201-205 create locators without any assertions. These are effectively no-ops.

-            # verify job progress appears
-            # look for progress indicators
-            page.get_by_text("Progress", exact=False).or_(
-                page.get_by_text("Generated", exact=False)
-            )
+            # verify job progress appears
+            progress_indicator = page.get_by_text("Progress", exact=False).or_(
+                page.get_by_text("Generated", exact=False)
+            )
+            expect(progress_indicator).to_be_visible(timeout=5000)

Also applies to: 201-205

tests/integration/test_data_augmentation_pipeline.py-148-169 (1)

148-169: Missing @pytest.mark.integration marker.

First test has @pytest.mark.integration but this test doesn't. For consistent test filtering in CI, add the marker.

+@pytest.mark.integration
 @pytest.mark.asyncio
 async def test_data_augmentation_pipeline_missing_fields_error():
tests/e2e/test_generator_e2e.py-49-70 (1)

49-70: Test passes silently when no selectors found.

If len(selectors) > 0 is false, the test completes without any assertions. This hides failures when the UI structure changes.

Add assertion or skip
         selectors = page.locator('select, [role="combobox"]').all()

-        if len(selectors) > 0:
+        assert len(selectors) > 0, "No pipeline selector found on page"
+        if len(selectors) > 0:

Or use pytest.skip() if the element is optional.

tests/integration/test_data_augmentation.py-81-83 (1)

81-83: Remove double JSON encoding in save_pipeline call.

save_pipeline expects a dict[str, Any] and internally calls json.dumps(definition). Passing json.dumps(pipeline_def) causes the string to be encoded again, storing escaped JSON in the database.

Pass dict directly
         # save pipeline to database
-        pipeline_id = await storage.save_pipeline("test_augmentation", json.dumps(pipeline_def))
+        pipeline_id = await storage.save_pipeline("test_augmentation", pipeline_def)
         assert pipeline_id > 0
tests/e2e/test_pipelines_e2e.py-38-69 (1)

38-69: Test lacks assertions - only prints debug info.

test_view_templates doesn't assert anything meaningful. The test will always pass regardless of whether templates are displayed.

🔧 Add assertion
         # if templates exist, there should be use template buttons
         # otherwise page should at least load without error
         print(f"found {use_template_buttons.count()} template buttons")
+        
+        # At minimum, verify page loaded without error
+        expect(page.get_by_text("Pipelines", exact=True)).to_be_visible()

         browser.close()
tests/e2e/test_pipelines_e2e.py-114-172 (1)

114-172: Overly broad locator for delete buttons may match unintended elements.

Line 142 page.locator("button:has(svg)") matches any button with an SVG icon, not specifically delete buttons. This could lead to clicking the wrong button.

🔧 Tighten the selector
         delete_buttons = (
             page.get_by_role("button")
             .filter(has_text="Delete")
             .or_(page.locator('button[aria-label*="Delete"]'))
-            .or_(page.locator("button:has(svg)"))
+            .or_(page.locator('button[aria-label*="delete" i]'))
         )
tests/e2e/test_pipelines_e2e.py-71-112 (1)

71-112: Test silently passes when no templates exist.

If create_buttons.count() == 0, the test prints a message and passes without testing anything. Consider using pytest.skip() or making template availability a precondition.

🔧 Suggested fix
         if create_buttons.count() > 0:
             # ... existing logic ...
         else:
-            print("no template buttons found, skipping test")
+            pytest.skip("no template buttons found - templates may not be loaded")

         browser.close()
lib/blocks/builtin/semantic_infiller.py-228-235 (1)

228-235: Add length check or strict=True to prevent silent truncation.

If vec1 and vec2 have different lengths, zip silently truncates, producing incorrect similarity scores. Embedding dimensions should match, but defensive coding is warranted.

Proposed fix
 def _cosine_similarity(self, vec1: list[float], vec2: list[float]) -> float:
     """compute cosine similarity between two vectors"""
+    if len(vec1) != len(vec2):
+        logger.warning(f"Vector length mismatch: {len(vec1)} vs {len(vec2)}")
+        return 0.0
     dot = sum(a * b for a, b in zip(vec1, vec2))
     norm1 = sum(a * a for a in vec1) ** 0.5
     norm2 = sum(b * b for b in vec2) ** 0.5
lib/blocks/builtin/structure_sampler.py-370-373 (1)

370-373: The instance attribute assignment pattern is safe in the current sequential execution model, but represents poor design practice that creates technical debt.

The pipeline executes blocks sequentially with await block.execute(context) in a for loop, which guarantees no concurrent access to the same block instance. However, mutating instance state during execute() violates idempotency and creates fragility if the execution model ever changes to support concurrent block execution.

Pass categorical_fields, numeric_fields, and dependencies as method parameters or store them in the execution context instead of instance attributes. This makes the method stateless and maintains single responsibility.
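
A hedged illustration of the parameter-passing idea (class, method, and field names here are hypothetical, not StructureSampler's actual API):

from typing import Any


class StatelessSamplerSketch:
    """analysis results flow through arguments instead of instance attributes"""

    def execute(self, samples: list[dict[str, Any]], target_count: int) -> list[dict[str, Any]]:
        categorical = [k for k, v in samples[0].items() if isinstance(v, str)]
        numeric = [k for k, v in samples[0].items() if isinstance(v, (int, float))]
        return self._sample_skeletons(samples, categorical, numeric, target_count)

    def _sample_skeletons(
        self,
        samples: list[dict[str, Any]],
        categorical_fields: list[str],
        numeric_fields: list[str],
        target_count: int,
    ) -> list[dict[str, Any]]:
        # no self.categorical_fields / self.numeric_fields mutation needed
        return [dict.fromkeys(categorical_fields + numeric_fields) for _ in range(target_count)]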

🧹 Nitpick comments (39)
.claude/skills/implementing-datagenflow-blocks/SKILL.md (1)

144-151: Minor: format shadows Python builtin.

The example uses format as a parameter name, which shadows the built-in format() function. While functional, consider noting this or using output_format in the example to avoid confusion.

tests/e2e/README.md (1)

281-281: Update to actions/checkout@v6 (current latest version).

actions/checkout@v3 is outdated. Version 6 is the current recommended release.

tests/e2e/test_helpers.py (1)

40-54: Validate URL scheme in server wait helper.
Restrict to http/https to avoid accidental file: or other schemes.

💡 Suggested fix
 def wait_for_server(url: str = "http://localhost:8000/health", timeout: int = 30):
     """wait for server to be ready"""
     import urllib.error
     import urllib.request
+    from urllib.parse import urlparse
+
+    parsed = urlparse(url)
+    if parsed.scheme not in ("http", "https"):
+        raise ValueError("wait_for_server only supports http/https URLs")
frontend/src/pages/Generator.tsx (1)

43-44: Consider removing or documenting the relationship between isMultiplierPipeline and needsMarkdown.

Both states are set from the same API response but serve slightly different purposes. If isMultiplierPipeline is no longer used elsewhere, consider removing it to reduce state complexity.

#!/bin/bash
# Check if isMultiplierPipeline is used elsewhere in the codebase
rg -n "isMultiplierPipeline" frontend/src/ -g '*.ts' -g '*.tsx'
lib/blocks/builtin/structured_generator.py (1)

37-39: Add ClassVar annotation for mutable class attribute.

Per Ruff RUF012, mutable class attributes should be annotated with typing.ClassVar to make the intent explicit.

Proposed fix
+from typing import Any, ClassVar
-from typing import Any

-    _config_formats = {
+    _config_formats: ClassVar[dict[str, str]] = {
         "json_schema": "json-or-template",
     }
tests/e2e/run_all_tests.sh (2)

39-44: Consider adding Playwright browser installation check.

The script checks if playwright module is installed but doesn't verify the browser (chromium) is installed. Missing browser will cause cryptic failures.

Proposed enhancement
 # check if playwright is installed
 if ! uv run python -c "import playwright" 2>/dev/null; then
     echo "❌ Playwright not installed"
     echo "Install with: uv pip install playwright && uv run playwright install chromium"
     exit 1
 fi
+
+# check if chromium browser is installed
+if ! uv run python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); p.chromium.launch(headless=True).close(); p.stop()" 2>/dev/null; then
+    echo "❌ Chromium browser not installed"
+    echo "Install with: uv run playwright install chromium"
+    exit 1
+fi

53-76: Test suites restart servers for each suite.

Each test suite spawns fresh backend/frontend instances. This is safe but slow. If tests are independent, consider running all suites under a single server session.
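
One possible shape for that is a session-scoped pytest fixture that starts the servers once (start commands, ports, and the frontend invocation are assumptions; wait_for_server is the helper from tests/e2e/test_helpers.py discussed above):

import subprocess

import pytest

from tests.e2e.test_helpers import wait_for_server


@pytest.fixture(scope="session", autouse=True)
def e2e_servers():
    # start backend and frontend once for the whole e2e session
    backend = subprocess.Popen(
        ["uv", "run", "uvicorn", "app:app", "--port", "8000"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    frontend = subprocess.Popen(
        ["npm", "run", "dev", "--prefix", "frontend"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    wait_for_server("http://localhost:8000/health")
    yield
    backend.terminate()
    frontend.terminate()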

tests/integration/conftest.py (2)

24-34: Hardcoded endpoint IP reduces portability.

The endpoint http://172.20.160.1:11434 is hardcoded and will fail in different environments. Consider using an environment variable with a fallback:

Suggested fix
+import os
+
+OLLAMA_ENDPOINT = os.getenv("OLLAMA_ENDPOINT", "http://localhost:11434")
+
 # add default LLM model (ollama gemma3:1b)
 llm_config = LLMModelConfig(
     name="default",
     provider=LLMProvider.OLLAMA,
-    endpoint="http://172.20.160.1:11434/v1/chat/completions",
+    endpoint=f"{OLLAMA_ENDPOINT}/v1/chat/completions",
     api_key="",
     model_name="gemma3:1b",
 )
 ...
 embedding_config = EmbeddingModelConfig(
     name="ollama-nomic",
     provider=LLMProvider.OLLAMA,
-    endpoint="http://172.20.160.1:11434/v1/embeddings",
+    endpoint=f"{OLLAMA_ENDPOINT}/v1/embeddings",
     api_key="",
     model_name="nomic-embed-text",
     dimensions=768,
 )

43-51: SQLite WAL files may persist after cleanup.

SQLite can create .db-wal and .db-shm files that won't be cleaned up. Also, move os import to the top of the file.

Suggested fix
 # cleanup
 try:
     await storage.close()
 finally:
-    # ensure cleanup happens even if close() fails
-    import os
-
-    if os.path.exists("data/test_e2e_records.db"):
-        os.remove("data/test_e2e_records.db")
+    db_path = "data/test_e2e_records.db"
+    for suffix in ("", "-wal", "-shm"):
+        path = db_path + suffix
+        if os.path.exists(path):
+            os.remove(path)
.claude/skills/debugging-pipelines/SKILL.md (1)

90-97: Line number references will become stale.

Hardcoded line ranges like lib/workflow.py:85-224 and lib/workflow.py:305-449 will drift as code evolves. Consider referencing function/method names instead, or adding a note that line numbers are approximate.

Example: "Normal pipeline execution flow (see execute() method)" instead of specific line numbers.

tests/e2e/test_review_e2e.py (2)

41-41: Prefer Playwright wait methods over time.sleep().

time.sleep() is brittle and slows tests unnecessarily. Use Playwright's built-in waiting:

# Instead of: time.sleep(2)
page.wait_for_selector('select, [role="combobox"]', timeout=5000)

This pattern repeats throughout the file (lines 49, 56, 72, 80, 107, 115, 128, 140, 156, 164, 178, 200, 208, 220, 231, 247, 255).


282-313: Main block bypasses pytest.

Running tests via if __name__ == "__main__" bypasses pytest's test collection, fixtures, and reporting. This sequential execution also masks failures (all tests print "✓ passed" regardless of exceptions in previous tests).

Consider removing or using pytest's CLI invocation instead:

pytest tests/e2e/test_review_e2e.py -v
docs/template_data_augmentation.md (1)

106-108: Blank line between blockquotes creates parsing issues.

Markdown linters flag this (MD028). Combine into a single blockquote or use different formatting.

Suggested fix
 > **Note:** `price` appears in both `numeric_fields` and `fields_to_generate`. This provides range hints to guide the LLM while letting it generate contextually appropriate prices.
-
-> **Tip:** Use 4-10 diverse samples for best results. More samples = better distribution learning.
+>
+> **Tip:** Use 4-10 diverse samples for best results. More samples = better distribution learning.
lib/blocks/builtin/validator.py (1)

1-8: Deduplicate forbidden_words parsing via template_utils.

Lines 43-69 re-implement JSON render/parse + string-list validation that already exists in lib.blocks.commons.template_utils. Reusing the helper centralizes error handling and keeps block logic slimmer.

♻️ Suggested refactor
-from lib.template_renderer import render_template
+from lib.blocks.commons.template_utils import render_and_parse_json, validate_string_list
@@
-        if self.forbidden_words_template:
-            words_rendered = render_template(
-                self.forbidden_words_template, context.accumulated_state
-            )
-            try:
-                words_list = json.loads(words_rendered)
-                if not isinstance(words_list, list):
-                    raise BlockExecutionError(
-                        "forbidden_words must be a JSON array",
-                        detail={"rendered_value": words_rendered},
-                    )
-                if not all(isinstance(w, str) for w in words_list):
-                    raise BlockExecutionError(
-                        "All items in forbidden_words must be strings",
-                        detail={"forbidden_words": words_list},
-                    )
-                forbidden_words = words_list
-            except json.JSONDecodeError as e:
-                raise BlockExecutionError(
-                    f"forbidden_words must be valid JSON: {str(e)}",
-                    detail={
-                        "template": self.forbidden_words_template,
-                        "rendered": words_rendered,
-                    },
-                )
+        if self.forbidden_words_template:
+            forbidden_words = render_and_parse_json(
+                self.forbidden_words_template,
+                context.accumulated_state,
+                "forbidden_words",
+                list,
+            )
+            validate_string_list(forbidden_words, "forbidden_words")

Also applies to: 43-69

tests/blocks/test_field_mapper.py (1)

44-66: Make xfail strict to avoid silent XPASS.

For the known failures, consider strict=True so a future fix fails the test and forces cleanup.

🔧 Suggested adjustment
-    @pytest.mark.xfail(
+    @pytest.mark.xfail(
+        strict=True,
         reason="tojson produces pretty-printed JSON with newlines causing parse error"
     )
@@
-    @pytest.mark.xfail(
+    @pytest.mark.xfail(
+        strict=True,
         reason="tojson produces pretty-printed JSON with newlines causing parse error"
     )
@@
-    @pytest.mark.xfail(reason="StrictUndefined raises error instead of returning empty string")
+    @pytest.mark.xfail(
+        strict=True,
+        reason="StrictUndefined raises error instead of returning empty string"
+    )
lib/workflow.py (1)

380-393: Potential issue: filtering applied twice for saved results.

filtered_result is computed at line 381, then _save_seed_result is called at line 384 which internally calls _filter_output_data again on context.accumulated_state. This is redundant but not harmful since both produce the same result.

Consider passing filtered_result directly or removing the redundant filter call inside _save_seed_result to avoid confusion:

Option: Pass filtered data directly to save
             if storage and pipeline_id > 0 and job_id > 0:
                 await self._save_seed_result(
                     initial_data,
-                    context.accumulated_state,
-                    initial_data_keys,
+                    filtered_result,
                     context.trace,
                     pipeline_id,
                     job_id,

This would require updating _save_seed_result to accept already-filtered data.

lib/blocks/builtin/field_mapper.py (2)

29-31: ClassVar annotation would silence linter, but optional.

The static analysis flags mutable class attributes. Adding typing.ClassVar would be more explicit, but this is a common pattern in the codebase for block metadata.


65-72: Add exception chaining for better tracebacks.

When re-raising a different exception from a caught exception, use from e to preserve the original traceback. This aids debugging JSON parse failures.

Fix exception chaining
         except json.JSONDecodeError as e:
             raise BlockExecutionError(
-                f"mappings must be valid JSON: {str(e)}",
+                f"mappings must be valid JSON: {e!s}",
                 detail={
                     "template": self.mappings_template,
                     "rendered": mappings_rendered,
                 },
-            )
+            ) from e
tests/e2e/test_generator_e2e.py (3)

44-52: Prefer explicit waits over time.sleep.

Hardcoded sleeps make tests flaky and slow. Playwright provides wait_for_selector and expect with auto-waiting.

Replace sleep with explicit wait
         # find pipeline selector (dropdown or select)
         selectors = page.locator('select, [role="combobox"]').all()

         if len(selectors) > 0:
             # click first selector
             selectors[0].click()
-            time.sleep(0.5)
+            page.wait_for_timeout(500)  # or better: wait for dropdown to appear

239-273: Main block bypasses pytest test discovery.

Running tests via if __name__ == "__main__" means pytest won't collect them properly for CI. The print("✓ passed") after each call doesn't verify the test actually passed - exceptions would halt execution.

Consider using pytest markers and fixtures for proper test orchestration instead of manual sequencing.

As per coding guidelines, tests should follow naming convention test_<method>_<scenario>_<expected> and use proper fixtures.


90-117: Seed file cleanup in finally block for reliability.

If the test fails before os.remove(seed_path), the file remains. Use a fixture or try/finally.

Ensure cleanup on failure
+        try:
             # go to generator page
             page.goto("http://localhost:5173")
             # ... rest of test ...
-
-        # cleanup
-        os.remove(seed_path)
-
-        browser.close()
+        finally:
+            # cleanup
+            if os.path.exists(seed_path):
+                os.remove(seed_path)
+            browser.close()
tests/integration/test_data_augmentation_pipeline.py (2)

28-32: Monkey patching works but consider pytest fixture.

Direct monkey patching with try/finally restoration is functional. A pytest fixture with monkeypatch would be cleaner and auto-restore.

Alternative using monkeypatch fixture
@pytest.fixture
def patched_llm_config(e2e_storage, monkeypatch):
    manager = LLMConfigManager(e2e_storage)
    import app
    monkeypatch.setattr(app, "llm_config_manager", manager)
    return manager

165-169: Error assertion could be more specific.

Checking for "undefined" or "target_count" is loose. Consider a more precise error match.

     error_msg = str(exc_info.value)
-    assert "undefined" in error_msg.lower() or "target_count" in error_msg
+    assert "undefined" in error_msg.lower() or "target_count" in error_msg.lower(), (
+        f"Expected error about undefined/target_count, got: {error_msg}"
+    )
tests/blocks/test_structure_sampler.py (1)

43-119: Tests rely on internal state manipulation - acceptable for unit testing private methods.

Setting block.categorical_fields, block.dependencies, and block.numeric_fields directly couples tests to implementation details. This is fine for testing computation methods in isolation, but consider adding integration-style tests that go through execute() for these scenarios to ensure the full flow works.
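
A hedged sketch of what an execute()-level test could look like (the constructor arguments, make_context() signature, and output key are assumptions based on this review, not the block's verified API):

import pytest


@pytest.mark.asyncio
async def test_execute_with_seed_samples_returns_requested_skeletons():
    block = StructureSampler(target_count="3")  # hypothetical constructor
    context = make_context(
        {"samples": [{"topic": "math", "difficulty": 2}, {"topic": "history", "difficulty": 5}]}
    )

    result = await block.execute(context)

    # fields are inferred inside execute(), so the test never touches
    # block.categorical_fields / block.numeric_fields directly
    assert len(result["skeletons"]) == 3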

tests/blocks/test_duplicate_remover.py (1)

180-214: Verify the cache test actually tests caching behavior.

The test creates two contexts with the same trace_id and checks mock_embedding.call_count == 3. However, both contexts are created with make_context() which sets trace_id="test-trace", and line 209 redundantly sets it again. The assertion verifies embedding is called 3 times (1 seed + 2 batches), which confirms seed embeddings aren't recomputed.

Consider adding a comment clarifying the expected call pattern.
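
For example, a short comment next to the assertion would make the expectation explicit (the counts come from the description above):

# expected embedding calls when both contexts share a trace_id:
#   1 call  -> seed samples (computed once, then reused from the per-trace cache)
# + 2 calls -> the two generated-sample batches
assert mock_embedding.call_count == 3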

lib/blocks/builtin/duplicate_remover.py (1)

43-57: Instance-level cache without eviction could cause memory growth.

_embeddings_cache stores embeddings keyed by trace_id but is never cleared. For long-running processes with many pipeline executions, this could accumulate significant memory.

Consider adding a cache size limit or using weakref for automatic cleanup when trace contexts are garbage collected.

💡 Suggested improvement
from collections import OrderedDict

# In __init__:
self._embeddings_cache: OrderedDict[str, list[list[float]]] = OrderedDict()
self._max_cache_size = 100  # configurable

# In _get_seed_embeddings, after caching:
if len(self._embeddings_cache) > self._max_cache_size:
    self._embeddings_cache.popitem(last=False)  # evict oldest
tests/e2e/test_pipelines_e2e.py (2)

13-36: Replace time.sleep(1) with Playwright's built-in waiting.

Hard-coded sleeps are flaky. Use page.wait_for_selector() or expect(...).to_be_visible() instead.

♻️ Suggested fix
         pipelines_link.click()
         page.wait_for_load_state("networkidle")
-        time.sleep(1)
+        # Wait for content to be visible instead of arbitrary sleep
+        page.wait_for_selector("text=Pipelines", state="visible")

174-216: Use context manager or try/finally for browser cleanup.

If an assertion fails, browser.close() won't be called, potentially leaving browser processes orphaned.

♻️ Use fixture or context manager pattern
def test_pipeline_editor_opens():
    """test that pipeline editor modal opens"""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=get_headless_mode())
        try:
            page = browser.new_page()
            # ... test logic ...
        finally:
            browser.close()

Or better, extract browser setup to a pytest fixture that handles cleanup automatically.

lib/blocks/builtin/json_validator.py (1)

53-81: Chain the exception for better debugging.

When re-raising BlockExecutionError from JSONDecodeError, use from e to preserve the original traceback.

♻️ Fix exception chaining
             except json.JSONDecodeError as e:
                 raise BlockExecutionError(
-                    f"required_fields must be valid JSON: {str(e)}",
+                    f"required_fields must be valid JSON: {e}",
                     detail={
                         "template": self.required_fields_template,
                         "rendered": fields_rendered,
                     },
-                )
+                ) from e
lib/blocks/commons/template_utils.py (3)

26-35: Add exception chaining to preserve traceback.

Missing from e loses the original JSON parse error context, making debugging harder.

Proposed fix
     try:
         parsed = json.loads(rendered)
     except json.JSONDecodeError as e:
         raise BlockExecutionError(
-            f"{field_name} must be valid JSON: {str(e)}",
+            f"{field_name} must be valid JSON: {e!s}",
             detail={
                 "template": template,
                 "rendered": rendered,
             },
-        )
+        ) from e

85-91: Greedy regex may over-match in edge cases.

The regex \{.*\} with re.DOTALL matches from the first { to the last } in the content. If the LLM returns JSON followed by text containing braces, this could grab extra content.

Consider non-greedy matching, though this may break legitimate nested JSON:

Alternative approach (trade-off)
-    json_match = re.search(r"\{.*\}", content, re.DOTALL)
+    # Non-greedy - may fail on deeply nested JSON but avoids over-matching
+    json_match = re.search(r"\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}", content, re.DOTALL)

The current approach is reasonable as a last-resort fallback since strategies 1 and 2 handle most cases.


107-122: Hardcoded metadata fields may drift out of sync.

This set is defined locally but metadata fields are used across multiple blocks. Consider defining METADATA_FIELDS as a module-level constant or in a shared location to ensure consistency.

Proposed improvement
+# module-level constant for reuse
+METADATA_FIELDS = frozenset({
+    "samples",
+    "target_count",
+    "categorical_fields",
+    "numeric_fields",
+    "dependencies",
+    "comparison_fields",
+    "similarity_threshold",
+    "fields_to_generate",
+})
+
+
 def clean_metadata_fields(state: dict[str, Any]) -> dict[str, Any]:
     """
     remove pipeline metadata fields from state
     returns new dict without mutation
     """
-    metadata_fields = {
-        "samples",
-        "target_count",
-        "categorical_fields",
-        "numeric_fields",
-        "dependencies",
-        "comparison_fields",
-        "similarity_threshold",
-        "fields_to_generate",
-    }
-    return {key: value for key, value in state.items() if key not in metadata_fields}
+    return {key: value for key, value in state.items() if key not in METADATA_FIELDS}
lib/blocks/builtin/ragas_metrics.py (2)

74-78: Consider using normalize_template_param for consistency.

Other blocks use normalize_template_param(metrics, list) from template_utils.py. This inline implementation duplicates that logic.

Proposed refactor
+from lib.blocks.commons.template_utils import normalize_template_param
+
 # in __init__:
-        if isinstance(metrics, list):
-            self.metrics_template = json.dumps(metrics)
-        else:
-            self.metrics_template = metrics
+        self.metrics_template = normalize_template_param(metrics, list)

101-108: Add exception chaining and consider using shared utilities.

  1. Missing from e loses traceback context.
  2. This parsing logic duplicates render_and_parse_json + validate_string_list from template_utils.py.
Proposed fix
+from lib.blocks.commons.template_utils import render_and_parse_json, validate_string_list

 # In execute(), replace lines 86-108 with:
-        metrics_rendered = render_template(self.metrics_template, context.accumulated_state)
-        try:
-            metrics_list = json.loads(metrics_rendered)
-            if not isinstance(metrics_list, list):
-                raise BlockExecutionError(
-                    "metrics must be a JSON array",
-                    detail={"rendered_value": metrics_rendered},
-                )
-            if not all(isinstance(m, str) for m in metrics_list):
-                raise BlockExecutionError(
-                    "All items in metrics must be strings",
-                    detail={"metrics": metrics_list},
-                )
-            metrics = metrics_list
-        except json.JSONDecodeError as e:
-            raise BlockExecutionError(
-                f"metrics must be valid JSON: {str(e)}",
-                detail={
-                    "template": self.metrics_template,
-                    "rendered": metrics_rendered,
-                },
-            )
+        metrics = render_and_parse_json(
+            self.metrics_template,
+            context.accumulated_state,
+            "metrics",
+            expected_type=list,
+        )
+        validate_string_list(metrics, "metrics")
lib/blocks/builtin/structure_sampler.py (1)

319-326: Add exception chaining for better debugging.

Missing from e loses the original ValueError context.

Proposed fix
         except ValueError as e:
             raise BlockExecutionError(
-                f"target_count must be a valid integer: {str(e)}",
+                f"target_count must be a valid integer: {e!s}",
                 detail={
                     "template": self.target_count_template,
                     "rendered": target_count_rendered,
                 },
-            )
+            ) from e
lib/blocks/builtin/semantic_infiller.py (4)

172-182: Add exception chaining for LLM errors.

Missing from e loses the original exception context.

Proposed fix
         try:
             response = await litellm.acompletion(**llm_params)
         except Exception as e:
             raise BlockExecutionError(
-                f"LLM call failed: {str(e)}",
+                f"LLM call failed: {e!s}",
                 detail={
                     "skeleton": skeleton,
                     "prompt_preview": prompt[:200],
                     "error": str(e),
                 },
-            )
+            ) from e

265-268: Unused loop variable sim.

The similarity score is unpacked but never used. Prefix with _ to indicate intentionally unused.

Proposed fix
-        for sim, sample in similar_samples:
+        for _sim, sample in similar_samples:
             fields_str = json.dumps({f: sample.get(f, "") for f in fields_to_generate})
             negative_lines.append(f"  - {fields_str}")

330-336: Add exception chaining for LLM retry errors.

Same issue as in _process_skeleton - missing from e.

Proposed fix
             try:
                 response = await litellm.acompletion(**llm_params)
             except Exception as e:
                 raise BlockExecutionError(
-                    f"LLM call failed: {str(e)}",
+                    f"LLM call failed: {e!s}",
                     detail={"skeleton": skeleton, "attempt": attempt, "error": str(e)},
-                )
+                ) from e

387-388: Misleading comment - this code path is reachable.

This return executes when all retries complete but similarity remains above threshold. The comment suggests it's unreachable, but it's actually the fallback for exhausted retries.

Proposed fix
-        # should not reach here, but return last result
+        # fallback: return last result after exhausting retries
         return result, embedding

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 11

🤖 Fix all issues with AI agents
In `@frontend/src/components/pipeline-editor/BlockConfigPanel.tsx`:
- Around line 436-440: The Editor is using defaultLanguage which only applies on
mount, so toggling isJsonMode doesn't update highlighting; replace the
defaultLanguage prop with the dynamic language prop (e.g., language={isJsonMode
? "json" : "python"}) on the <Editor> (and remove any stale usage of
defaultLanguage or unnecessary key remount hacks), ensuring the Editor's
language prop is driven by isJsonMode so syntax highlighting updates when
toggled.

In `@llm/state-backend.md`:
- Around line 441-443: The docs for DuplicateRemover list an output named
similarity_score which is inconsistent with tests; update the DuplicateRemover
output specification to match the actual outputs used in tests by replacing
similarity_score with similarity_to_seeds and similarity_to_generated (or
explicitly list both fields) and ensure the outputs line reads: "outputs: *,
is_duplicate, similarity_to_seeds, similarity_to_generated" so the doc matches
the DuplicateRemover behavior and tests.

In `@llm/state-project.md`:
- Around line 117-121: The documentation for the validators lists
DuplicateRemover output as "similarity_score" but the implementation/tests
expect "similarity_to_seeds" and "similarity_to_generated"; update the docs in
the validators section (DuplicateRemover) to list the actual output fields
"is_duplicate", "similarity_to_seeds", and "similarity_to_generated" (or
alternatively change the DuplicateRemover implementation/tests to emit a single
"similarity_score" if you prefer that public contract) so the public contract
matches the tests and implementation; reference the DuplicateRemover symbol and
ensure the validators list reflects the exact output field names used by tests.

In `@tests/e2e/README.md`:
- Around line 264-266: Replace the direct instruction to delete the default DB
file ("remove test database: `rm data/qa_records.db`") with a safer alternative:
either point to a dedicated test DB path or add a clear warning that this
command only applies to throwaway/test-only databases (and not
production/default data); update the README example text around the "remove test
database" bullet so it references a test-specific file name (e.g.,
`data/qa_records_test.db`) or prepends a cautionary note explaining to back up
or confirm the file is test-only before running removal.
- Around line 111-123: The README.md's directory-tree fenced code block is
missing a language tag which breaks lint rules; update the opening fence to
include a language (e.g., change the triple backticks that start the tree block
to ```text) so the block becomes a valid "text" fenced code block, ensuring the
closing ``` remains unchanged; locate the fenced block in README.md that begins
with the directory listing (the tree block shown) and add the language tag to
the opening fence.

In `@tests/e2e/test_generator_e2e.py`:
- Around line 100-104: The test is calling selectors[0].select_option(index=1)
without ensuring a real pipeline option exists, which can throw if only a
placeholder option is present; after you build selectors =
page.locator("select").all(), guard by asserting selectors is non-empty and then
check options_count = selectors[0].locator("option").count() (or equivalent) and
only call selectors[0].select_option(index=1) if options_count > 1, otherwise
fail fast with a clear assertion/error message; apply the same guard to the
other occurrence where select_option(index=1) is used.
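
A minimal sketch of that guard, assuming the `page` object and selector shape used in the test above:

```python
# hypothetical guard before picking a pipeline: fail fast instead of silently passing
selectors = page.locator("select").all()
assert selectors, "No pipeline selector found on page"

options_count = selectors[0].locator("option").count()
assert options_count > 1, "Only the placeholder option is present; no pipeline to select"

selectors[0].select_option(index=1)  # safe: at least one real option exists
```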

In `@tests/e2e/test_pipelines_e2e.py`:
- Around line 223-255: The main-block that calls wait_for_server() and
cleanup_database() is not run under pytest, causing flaky CI; remove the entire
if __name__ == "__main__" block and add an autouse, module-scoped pytest fixture
named _e2e_setup_teardown that calls wait_for_server() (skip tests if it returns
False), runs cleanup_database() before yielding, and runs cleanup_database()
again after yield to teardown; reference symbols to update: wait_for_server,
cleanup_database, and add pytest.fixture(scope="module", autouse=True) on
_e2e_setup_teardown.
- Around line 54-70: The test currently collects use_template_buttons (variable
use_template_buttons) but only prints the count and asserts the page loaded;
update test_view_templates so it validates template rendering: call
use_template_buttons.count() and if it is zero call pytest.skip with a clear
message (or alternatively assert that count() > 0), otherwise assert that at
least one button is visible (e.g.,
expect(use_template_buttons.first()).to_be_visible()) to fail on regressions;
ensure you import pytest if using pytest.skip and keep the existing page load
expectation (expect(page.get_by_text("Pipelines", exact=True)).to_be_visible()).
- Around line 150-176: The tests test_delete_pipeline() and
test_pipeline_editor_opens() should explicitly skip when required UI buttons are
missing instead of silently passing: import pytest if not already, check
delete_buttons.count() and the edit/open button query (the same
page.get_by_role("button") filter used in test_pipeline_editor_opens) and call
pytest.skip("reason") when count == 0 before performing any clicks; likewise,
after opening the confirmation dialog check confirm_buttons.count() and call
pytest.skip("confirmation dialog not present") instead of proceeding silently,
and ensure these skips are placed immediately after the corresponding .count()
checks (refer to delete_buttons, confirm_buttons, and the
page.get_by_role("button").filter(...) calls).
- Around line 33-34: Update each test in tests/e2e/test_pipelines_e2e.py that
calls page.screenshot so it uses pytest's tmp_path fixture instead of hardcoded
"/tmp" paths: add tmp_path as a parameter to the test functions and replace
occurrences like page.screenshot(path="/tmp/pipelines_page.png", ...) with
page.screenshot(path=(tmp_path / "pipelines_page.png"), ...). Do this for all
page.screenshot calls in the file (there are five instances) so Playwright
receives a pathlib.Path and each test gets an isolated temp directory.
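
A minimal sketch of the tmp_path pattern for one of these tests; the base URL and the test body are placeholders for what the real tests do:

```python
from playwright.sync_api import sync_playwright


def test_pipelines_page_loads(tmp_path):
    """tmp_path gives each test an isolated directory managed by pytest"""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("http://localhost:5173")  # placeholder base URL
        page.wait_for_load_state("networkidle")
        # Playwright accepts a pathlib.Path directly, so no "/tmp" hardcoding is needed
        page.screenshot(path=tmp_path / "pipelines_page.png", full_page=True)
        browser.close()
```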

In `@tests/e2e/test_review_e2e.py`:
- Line 8: Import TimeoutError from playwright.sync_api and update the export
test so the entire page.expect_download() context manager is wrapped in a
try/except that catches TimeoutError (not a blind Exception). Specifically, in
tests/e2e/test_review_e2e.py add "from playwright.sync_api import TimeoutError"
and wrap the with page.expect_download(...) as download: block (and the code
that enters that context) in a try/except TimeoutError to handle the case where
expect_download times out when no records exist, replacing the generic Exception
handler with a TimeoutError-specific handler.
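
A minimal sketch of the narrower handler; `page`, `export_button`, and the choice to skip (rather than pass) when nothing is exportable are assumptions, not the test's actual behavior:

```python
import pytest
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

try:
    # only reaches the download branch if the app actually produces a file
    with page.expect_download(timeout=5000) as download_info:
        export_button.click()
    download = download_info.value
    assert download.suggested_filename  # basic sanity check on the export
except PlaywrightTimeoutError:
    # no records to export yet; a timeout here is expected, anything else should fail
    pytest.skip("no records available for export")
```
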
🧹 Nitpick comments (9)
scripts/with_server.py (2)

46-63: Unused proc in health check loop.

The proc variable from the tuple unpacking isn't used - only port is needed for the health check URL construction.

Suggested fix
-        for proc, port in self.processes:
+        for _proc, port in self.processes:

148-148: Add strict=True to zip() for argument validation.

Since the code already validates len(args.servers) != len(args.ports) at line 136, adding strict=True provides runtime safety if that check is ever bypassed.

Suggested fix
-    servers = list(zip(args.servers, args.ports))
+    servers = list(zip(args.servers, args.ports, strict=True))
tests/e2e/test_helpers.py (1)

45-59: Move imports to module level.

The urllib.error and urllib.request imports are inside the function. While functional, module-level imports are idiomatic Python and avoid repeated import overhead on each call.

Suggested fix

Add imports at module level (lines 6-9):

 import os
 import time
+import urllib.error
+import urllib.request

 import httpx

Then remove lines 47-48:

 def wait_for_server(url: str = "http://localhost:8000/health", timeout: int = 30):
     """wait for server to be ready"""
-    import urllib.error
-    import urllib.request
-
     start_time = time.time()
lib/blocks/builtin/semantic_infiller.py (5)

28-29: Add ClassVar annotation for mutable class attributes.

Consistency with StructuredGenerator which uses ClassVar for _config_formats. Mutable class attributes without ClassVar annotation can cause unexpected sharing between instances.

Suggested fix
+from typing import Any, ClassVar, cast
-from typing import Any, cast
 
 # ... then in the class:
-    inputs = ["skeletons"]
-    outputs = ["samples"]
+    inputs: ClassVar[list[str]] = ["skeletons"]
+    outputs: ClassVar[list[str]] = ["samples"]

And for _config_descriptions and _config_formats:

-    _config_descriptions = {
+    _config_descriptions: ClassVar[dict[str, str]] = {
         ...
     }
 
-    _config_formats = {
+    _config_formats: ClassVar[dict[str, str]] = {

172-182: Use exception chaining with raise ... from e.

Preserves the original traceback for debugging. Also use {e!s} conversion flag for cleaner f-string.

Suggested fix
         try:
             response = await litellm.acompletion(**llm_params)
         except Exception as e:
             raise BlockExecutionError(
-                f"LLM call failed: {str(e)}",
+                f"LLM call failed: {e!s}",
                 detail={
                     "skeleton": skeleton,
                     "prompt_preview": prompt[:200],
                     "error": str(e),
                 },
-            )
+            ) from e

332-338: Use exception chaining with raise ... from e.

Same pattern as above - preserves traceback for LLM failures during diversity retry loop.

Suggested fix
             try:
                 response = await litellm.acompletion(**llm_params)
             except Exception as e:
                 raise BlockExecutionError(
-                    f"LLM call failed: {str(e)}",
+                    f"LLM call failed: {e!s}",
                     detail={"skeleton": skeleton, "attempt": attempt, "error": str(e)},
-                )
+                ) from e

247-252: Add strict=True to zip().

Ensures embeddings and samples lists have matching lengths at runtime.

Suggested fix
-        for emb, sample in zip(embeddings, samples):
+        for emb, sample in zip(embeddings, samples, strict=True):

267-270: Rename unused loop variable sim to _sim.

The similarity score sim isn't used in the loop body - only the sample is extracted for the negative example.

Suggested fix
-        for sim, sample in similar_samples:
+        for _sim, sample in similar_samples:
tests/e2e/test_pipelines_e2e.py (1)

21-28: Prefer configurable base URL and event-driven waits over fixed sleeps.

Hard-coded http://localhost:5173 plus time.sleep(...) makes these e2e tests flaky across environments and slower in CI. Use a configurable base URL and wait for a stable UI element instead.

♻️ Proposed fix (example in this test)
+import os
 ...
-def test_pipelines_page_loads():
+BASE_URL = os.getenv("E2E_BASE_URL", "http://localhost:5173")
+
+def test_pipelines_page_loads():
     """verify pipelines page loads successfully"""
     with sync_playwright() as p:
         browser = p.chromium.launch(headless=get_headless_mode())
         page = browser.new_page()
 ...
-        page.goto("http://localhost:5173")
+        page.goto(BASE_URL)
         page.wait_for_load_state("networkidle")
 ...
-        time.sleep(1)
+        expect(page.get_by_text("Pipelines", exact=True)).to_be_visible()

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@frontend/src/components/pipeline-editor/BlockConfigPanel.tsx`:
- Around line 405-429: Update the frontend state docs (llm/state-frontend.md) to
document the new JSON-mode toggle and model selection state: describe the
jsonMode object keyed by config keys (accessed as jsonMode[key] with default
true), how setJsonMode updates it (setJsonMode(prev => ({ ...prev, [key]: value
}))), the UI toggle label/behavior in BlockConfigPanel (JSON mode vs Jinja2
template) and how jsonValue is derived (string value or JSON.stringify(value,
null, 2)); also note where model selection state lives and how it should be
persisted/updated in the same frontend state conventions.
- Around line 155-162: The code currently skips all fields with
fieldSchema.format === "json-or-template", allowing invalid JSON to be saved;
update the logic in BlockConfigPanel so that when jsonMode[key] is true you
attempt to parse/validate processedData[key] (but do not mutate processedData)
and surface an error if parsing fails, while still returning early for
non-JSON-mode json-or-template fields; also ensure jsonMode is included in the
relevant dependency array (e.g., the useEffect or validator hook that iterates
over Object.entries(schema)) so validation re-runs when JSON mode toggles.
- Around line 35-36: The jsonMode state (initialized via useState in
BlockConfigPanel.tsx) is keyed only by field name and can leak between nodes;
add an effect that watches node.id (or the node object used in this component)
and calls setJsonMode({}) to reset jsonMode whenever the node changes so toggles
don’t bleed across nodes; locate the jsonMode/setJsonMode usage in
BlockConfigPanel and add a useEffect referencing node.id to clear the map on
change.

In `@llm/state-backend.md`:
- Around line 420-431: The doc entries for StructureSampler and SemanticInfiller
need explicit outputs and missing parameters: update the StructureSampler entry
(symbol: StructureSampler) to use outputs ["skeletons", "_seed_samples"] instead
of "*" and enumerate its parameters (target_count, categorical_fields,
numeric_fields, dependencies, seed) as already present; update the
SemanticInfiller entry (symbol: SemanticInfiller) to use outputs ["samples"] and
add the five optional parameters system_prompt, embedding_model,
diversity_threshold, negative_examples_count, and max_diversity_retries to its
parameter list along with the existing fields_to_generate, model, temperature,
max_tokens; ensure both entries include full parameter lists and correct
categories (seeders or generators) and remove the "*" shorthand.

In `@tests/e2e/README.md`:
- Line 316: Replace the broken placeholder link target used for "[webapp-testing
skill]" that currently points to
"/.claude/plugins/cache/anthropic-agent-skills/document-skills/.../webapp-testing/"
with a valid relative or absolute documentation URL (or remove the link if no
public location exists); locate the markdown entry for "[webapp-testing skill]"
in the README and update the link target to the correct path to the
webapp-testing skill documentation or to an appropriate repository/README
anchor.
🧹 Nitpick comments (8)
tests/e2e/test_helpers.py (2)

47-61: Consider using httpx consistently instead of urllib.

cleanup_database() and get_pipeline_count() use httpx, but wait_for_server() uses urllib. Using httpx throughout would be more consistent.

♻️ Optional: Use httpx for consistency
 def wait_for_server(url: str = "http://localhost:8000/health", timeout: int = 30):
     """wait for server to be ready"""
-    import urllib.error
-    import urllib.request
-
     start_time = time.time()
     while time.time() - start_time < timeout:
         try:
-            with urllib.request.urlopen(url, timeout=2) as response:
-                if response.status == 200:
-                    return True
-        except (urllib.error.URLError, TimeoutError):
+            response = httpx.get(url, timeout=2.0)
+            if response.status_code == 200:
+                return True
+        except (httpx.RequestError, httpx.TimeoutException):
             time.sleep(1)
 
     return False

64-72: Catching bare Exception obscures errors.

The broad except Exception in get_pipeline_count silently swallows all errors. Consider catching specific exceptions or at least logging the exception type.

♻️ Suggested fix
 def get_pipeline_count():
     """get number of pipelines in database"""
     try:
         response = httpx.get("http://localhost:8000/api/pipelines", timeout=10.0)
         if response.status_code == 200:
             return len(response.json())
-    except Exception as e:
-        print(f"get_pipeline_count warning: {e}")
+        return -1
+    except httpx.RequestError as e:
+        print(f"get_pipeline_count warning: request failed - {e}")
     return -1
tests/e2e/test_review_e2e.py (2)

17-37: Use tmp_path fixture for screenshot paths (consistency).

test_pipelines_e2e.py uses pytest's tmp_path fixture for screenshots, but this file uses hardcoded /tmp paths. This causes S108 warnings and potential path collisions in parallel runs.

♻️ Pattern to apply to all test functions
-def test_review_page_loads():
+def test_review_page_loads(tmp_path):
     """verify review page loads successfully"""
     with sync_playwright() as p:
         ...
-        page.screenshot(path="/tmp/review_page.png", full_page=True)
+        page.screenshot(path=str(tmp_path / "review_page.png"), full_page=True)

Apply this pattern to all 7 test functions in this file.


1-15: Add autouse fixture for server readiness and cleanup.

Unlike test_pipelines_e2e.py, this module lacks a setup/teardown fixture. This causes inconsistent behavior between pytest runs and direct execution.

♻️ Add fixture after imports
@pytest.fixture(scope="module", autouse=True)
def _e2e_setup_teardown():
    """setup and teardown for e2e tests"""
    from test_helpers import wait_for_server
    if not wait_for_server():
        pytest.skip("server not ready for e2e tests")
    yield

Also add import pytest to imports.

tests/e2e/test_generator_e2e.py (3)

51-53: Redundant conditional after assertion.

Line 51 asserts len(selectors) > 0, so the if len(selectors) > 0 check on line 53 is always true.

♻️ Remove redundant check
         selectors = page.locator('select, [role="combobox"]').all()
         assert len(selectors) > 0, "No pipeline selector found on page"
 
-        if len(selectors) > 0:
-            # click first selector
-            selectors[0].click()
+        # click first selector
+        selectors[0].click()

18-36: Use tmp_path fixture and add autouse setup fixture.

Same consistency issues as test_review_e2e.py:

  1. Hardcoded /tmp paths for screenshots
  2. Missing autouse fixture for server readiness

Follow the pattern from test_pipelines_e2e.py for consistency across the e2e test suite.


94-96: Seed file creation in /tmp should use tmp_path fixture.

Seed files created at /tmp/test_seed.json can collide in parallel runs. Use tmp_path fixture for test isolation.

♻️ Example fix
-def test_upload_seed_file():
+def test_upload_seed_file(tmp_path):
     """test uploading a seed JSON file"""
     ...
-        seed_path = "/tmp/test_seed.json"
+        seed_path = tmp_path / "test_seed.json"
         with open(seed_path, "w") as f:
             json.dump(seed_data, f)

Apply to both test_upload_seed_file and test_start_generation_job.

pyproject.toml (1)

16-17: Move test dependencies to dev group.

pytest>=8.4.2 and pytest-asyncio>=0.24.0 are in the main project dependencies but should be in [dependency-groups].dev. End users don't need test frameworks installed.

♻️ Suggested fix
 [project]
 dependencies = [
   "fastapi>=0.115.0",
   "uvicorn>=0.32.0",
   "pydantic>=2.9.0",
   "aiosqlite>=0.20.0",
   "httpx>=0.27.0",
   "python-dotenv>=1.0.0",
   "python-multipart>=0.0.20",
   "loguru>=0.7.0",
   "tqdm>=4.66.0",
-  "pytest>=8.4.2",
-  "pytest-asyncio>=0.24.0",
   "jinja2>=3.1.6",
   ...
 ]

 [dependency-groups]
 dev = [
     "ruff>=0.7.0",
     "mypy>=1.13.0",
     "playwright>=1.57.0",
+    "pytest>=8.4.2",
+    "pytest-asyncio>=0.24.0",
+    "pytest-timeout>=2.4.0",
 ]

Note: pytest-timeout>=2.4.0 (line 27) should also move to dev.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 11

🤖 Fix all issues with AI agents
In @.claude/skills/webapp-testing/examples/console_logging.py:
- Line 1: Ensure the output directory '/mnt/user-data/outputs' exists before
writing logs: add an import for os if missing and call
os.makedirs('/mnt/user-data/outputs', exist_ok=True) just before the code that
opens/writes the log file (the open(...) / write(...) calls around lines 30-32 /
any usage of output_path or output_file); this guarantees no FileNotFoundError
when the script writes logs.
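
A minimal sketch of that change; the log file name and the `console_logs` list are assumptions standing in for whatever the example script actually writes:

```python
import os

console_logs: list[str] = []  # assumed to be filled earlier via page.on("console", ...)

output_dir = "/mnt/user-data/outputs"
os.makedirs(output_dir, exist_ok=True)  # create the directory before any write

log_path = os.path.join(output_dir, "console_logs.txt")  # hypothetical file name
with open(log_path, "w") as f:
    f.write("\n".join(console_logs))
```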

In @.claude/skills/webapp-testing/examples/element_discovery.py:
- Around line 36-38: Replace the fixed "/tmp/page_discovery.png" usage in
page.screenshot and the subsequent print with a unique temporary file: import
and use the tempfile module (e.g., tempfile.NamedTemporaryFile(delete=False,
suffix=".png") or tempfile.mkstemp) to create a one-off path, close the handle
if necessary, pass that temp file's name to page.screenshot, and print that
generated filename; update any surrounding logic in element_discovery.py that
references the old hard-coded path.
- Around line 5-40: Wrap the browser/page interaction in a try/finally so
browser.close() always runs even on errors: after obtaining the Playwright
instance (sync_playwright()) and launching the browser (p.chromium.launch), put
everything that uses page (page.goto, page.wait_for_load_state, page.locator,
page.screenshot, etc.) inside a try block and call browser.close() in the
finally block; guard the finally close with a safety check (e.g., if browser is
not None or "browser" in locals()) to avoid referencing an uninitialized
variable. Ensure you keep the outer sync_playwright() context manager but move
cleanup of the browser into the finally to avoid leaking the browser process in
CI.
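
A minimal sketch combining both suggestions for this example (unique screenshot path, guaranteed browser cleanup); the URL is a placeholder:

```python
import tempfile
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = None
    try:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("http://localhost:5173")  # placeholder URL
        page.wait_for_load_state("networkidle")

        # unique, non-predictable path instead of a fixed /tmp file (Ruff S108)
        with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
            screenshot_path = tmp.name
        page.screenshot(path=screenshot_path, full_page=True)
        print(f"screenshot saved to {screenshot_path}")
    finally:
        if browser is not None:
            browser.close()  # runs even if navigation or discovery fails
```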

In @.claude/skills/webapp-testing/examples/static_html_automation.py:
- Around line 18-30: The screenshots use a hardcoded path
"/mnt/user-data/outputs" which may not exist; update the code around
page.screenshot (both calls) to use a repo-relative or configurable output
directory (e.g., "./outputs" or an env var) and ensure the directory exists
before saving (create it with os.makedirs or
pathlib.Path(...).mkdir(parents=True, exist_ok=True)); keep the same filenames
("static_page.png" and "after_submit.png") but build their full paths from the
validated/created output dir so writes cannot fail.
- Around line 7-9: The file URL is built in a way that breaks on Windows and
doesn't percent-encode paths: replace the os.path.abspath +
f"file://{html_file_path}" construction by using
pathlib.Path(html_file_path).resolve().as_uri() to produce a correct,
cross-platform file URL; update the variables html_file_path and file_url
accordingly and add the pathlib import if missing.
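
A minimal sketch of both fixes for this example; the input file name is a placeholder and the screenshot call is shown only as a comment since it belongs inside the Playwright session:

```python
from pathlib import Path

html_file_path = "page.html"  # placeholder input file
file_url = Path(html_file_path).resolve().as_uri()  # valid file:// URL on any OS

output_dir = Path("./outputs")  # repo-relative, configurable output location
output_dir.mkdir(parents=True, exist_ok=True)

screenshot_path = output_dir / "static_page.png"
# page.screenshot(path=screenshot_path, full_page=True)  # inside the browser session
```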

In @.claude/skills/webapp-testing/scripts/with_server.py:
- Around line 87-89: The subprocess.Popen call that creates the test server
currently sets stdout=subprocess.PIPE and stderr=subprocess.PIPE which will
deadlock if those buffers fill; update the subprocess.Popen invocation that
launches server["cmd"] (the process variable) to not pipe these streams — either
remove the stdout/stderr args so the child inherits the parent streams, or set
stdout=subprocess.DEVNULL and stderr=subprocess.DEVNULL if output should be
discarded — and ensure any subsequent code that expects to read output from
process is adjusted accordingly (look for the subprocess.Popen(...) call that
assigns to process in with_server.py).
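
A minimal sketch of the non-piping launch; the command string is a hypothetical stand-in for whatever `server["cmd"]` contains:

```python
import subprocess

server = {"cmd": "uvicorn app:app --port 8000"}  # hypothetical server definition

# discard child output instead of piping it, so a full pipe buffer can never block it
process = subprocess.Popen(
    server["cmd"],
    shell=True,
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
```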

In @.claude/skills/webapp-testing/SKILL.md:
- Line 18: The fenced code block in .claude/skills/webapp-testing/SKILL.md is
missing a language identifier; update the opening backticks to include a
language (e.g., change ``` to ```text) so markdownlint passes and readability
improves, leaving the block contents unchanged.
- Line 14: Fix the typo "abslutely" to "absolutely" in the guidance sentence
that starts with "Always run scripts with `--help` first" (replace the
misspelled word in the same sentence in SKILL.md); ensure the corrected sentence
reads "...find that a customized solution is absolutely necessary." and keep
surrounding formatting/backticks unchanged.

In `@llm/state-backend.md`:
- Around line 420-431: Update the documented output field names for the
StructureSampler and SemanticInfiller blocks to match their implementations:
replace the dynamic "*" outputs with explicit arrays — for StructureSampler list
["skeletons", "_seed_samples"] (note the leading underscore on "_seed_samples")
and for SemanticInfiller list ["samples"]; ensure the entries appear under the
corresponding StructureSampler and SemanticInfiller headings in the doc so the
documented outputs match the actual block declarations.

In `@tests/e2e/README.md`:
- Around line 127-133: Update the "pipelines tests" README paragraph to
explicitly warn that the tests will delete pipelines, jobs, and records and
therefore must be run against a dedicated test database or isolated environment
(e.g., separate DB instance, schema, or test-only config), and mention that you
should set test-specific credentials/connection (CI or local env) before running
the pipelines tests to avoid impacting production data.
- Around line 264-266: Replace the blanket "killall uvicorn" / "killall node"
instructions with targeted cleanup: instruct tests to capture server PIDs when
starting (e.g., write PID files or export variables) and then use "kill <PID>"
or "kill -TERM $(cat /path/to/pidfile)" to stop only those processes, or use
"pkill -f '<unique_start_command_or_pattern>'" scoped to the test-runner
pattern; keep the "rm data/test_qa_records.db" note but add a caution to only
remove the DB if using a dedicated test path and ensure the README warns that
indiscriminate killall may terminate unrelated developer processes.
🧹 Nitpick comments (3)
.claude/skills/webapp-testing/examples/static_html_automation.py (1)

21-27: Consider waiting for selectors before interactions.

Direct clicks/fills can be flaky if elements render after load. A small wait (or locator().wait_for()) makes the example more reliable.

♻️ Suggested tweak
-    page.click("text=Click Me")
-    page.fill("#name", "John Doe")
-    page.fill("#email", "john@example.com")
+    page.locator("text=Click Me").wait_for()
+    page.click("text=Click Me")
+    page.locator("#name").wait_for()
+    page.fill("#name", "John Doe")
+    page.fill("#email", "john@example.com")
.claude/skills/webapp-testing/examples/console_logging.py (1)

9-28: Ensure the browser closes even when a step fails.

If page.goto() or page.click() throws, the browser stays open. Wrap the interaction in try/finally so cleanup always runs.

♻️ Suggested change
 with sync_playwright() as p:
     browser = p.chromium.launch(headless=True)
     page = browser.new_page(viewport={"width": 1920, "height": 1080})
 
-    # Set up console log capture
-    def handle_console_message(msg):
-        console_logs.append(f"[{msg.type}] {msg.text}")
-        print(f"Console: [{msg.type}] {msg.text}")
-
-    page.on("console", handle_console_message)
-
-    # Navigate to page
-    page.goto(url)
-    page.wait_for_load_state("networkidle")
-
-    # Interact with the page (triggers console logs)
-    page.click("text=Dashboard")
-    page.wait_for_timeout(1000)
-
-    browser.close()
+    try:
+        # Set up console log capture
+        def handle_console_message(msg):
+            console_logs.append(f"[{msg.type}] {msg.text}")
+            print(f"Console: [{msg.type}] {msg.text}")
+
+        page.on("console", handle_console_message)
+
+        # Navigate to page
+        page.goto(url)
+        page.wait_for_load_state("networkidle")
+
+        # Interact with the page (triggers console logs)
+        page.click("text=Dashboard")
+        page.wait_for_timeout(1000)
+    finally:
+        browser.close()
.claude/skills/webapp-testing/scripts/with_server.py (1)

17-21: Terminate server process groups to avoid orphans.

With shell=True, terminate() may only kill the shell while the actual server keeps running. Consider starting a new session and killing the process group.

♻️ Suggested change (POSIX)
-import argparse
+import argparse
+import os
+import signal
@@
-            process = subprocess.Popen(server["cmd"], shell=True)
+            process = subprocess.Popen(
+                server["cmd"],
+                shell=True,
+                preexec_fn=os.setsid,
+            )
@@
-            try:
-                process.terminate()
-                process.wait(timeout=5)
-            except subprocess.TimeoutExpired:
-                process.kill()
-                process.wait()
+            try:
+                os.killpg(process.pid, signal.SIGTERM)
+                process.wait(timeout=5)
+            except subprocess.TimeoutExpired:
+                os.killpg(process.pid, signal.SIGKILL)
+                process.wait()

Also applies to: 87-89, 112-117

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @.claude/skills/address-pr-review/scripts/fetch_comments.py:
- Around line 111-123: Validate CLI args at program start: ensure pr_number is a
numeric value before calling fetch_comments and, when mode == "--id", verify a
third argument exists and is numeric (set target_id only after this validation);
if validation fails print __doc__ and exit(1). Update the top-level block around
pr_number, mode, and target_id to perform these checks so missing or non-numeric
inputs cannot fall through to unintended API calls (references: variables
pr_number, mode, target_id and function fetch_comments).
- Around line 51-80: parse_comment currently only populates desc when
TITLE_PATTERN matches and ignores <details> at position 0, causing comments
without bold titles to lose their bodies; update parse_comment so that if
title_match is missing you set after_title = body (use the full body as the
source), and change the <details> handling to consider details_pos >= 0 (or ==
0) when slicing so content before a details block is kept; keep the existing
length cutoff (len(after_title) < 500) and the markdown-cleaning steps (removing
<!-- --> and collapsing extra newlines) after computing desc; refer to
parse_comment, title_match, TITLE_PATTERN, after_title, details_pos, and desc
when making the change.
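
A minimal sketch of the fallback, assuming `TITLE_PATTERN` is a compiled regex and `body` holds the raw comment text; the later markdown-cleaning steps are omitted here:

```python
title_match = TITLE_PATTERN.search(body)
# fall back to the full body when a comment has no bold title
after_title = body[title_match.end():] if title_match else body

details_pos = after_title.find("<details>")
if details_pos >= 0:  # also covers a <details> block at position 0
    after_title = after_title[:details_pos]

desc = ""
if len(after_title) < 500:  # keep the existing length cutoff
    desc = after_title.strip()
```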

In `@llm/state-backend.md`:
- Around line 420-461: The document's block category labels are inconsistent
with the BaseBlock category list; update either the BaseBlock category comment
or the block entries so they match. Specifically, edit the BaseBlock category
list to include "multipliers", "observability", and "utilities" (or rename
MarkdownMultiplierBlock, LangfuseBlock, FieldMapper categories to ones already
in BaseBlock) and ensure each block entry (StructureSampler, TextGenerator,
StructuredGenerator, SemanticInfiller, MarkdownMultiplierBlock, ValidatorBlock,
JSONValidatorBlock, DuplicateRemover, DiversityScore, CoherenceScore,
RougeScore, LangfuseBlock, FieldMapper, RagasMetrics) uses a category label that
exactly matches the canonical list; keep naming consistent (singular vs plural)
across BaseBlock and all block definitions.

In `@tests/e2e/README.md`:
- Around line 266-268: Update the pkill pattern used in the cleanup notes:
replace the hardcoded "uvicorn.*5000" reference with the actual documented port
"uvicorn.*8000" or make it a placeholder like "uvicorn.*<PORT>" and instruct
users to substitute their configured port so the pkill/kill command targets the
correct uvicorn process.
♻️ Duplicate comments (1)
.claude/skills/webapp-testing/examples/element_discovery.py (1)

37-38: Hardcoded /tmp path still present.

This was already flagged in a previous review. The static analysis (Ruff S108) also flags this as a potential security issue due to predictable temp file paths. Consider using tempfile.NamedTemporaryFile for unique paths.

🧹 Nitpick comments (1)
.claude/skills/webapp-testing/scripts/with_server.py (1)

76-77: Add strict=True to zip() for Python 3.10+.

The manual length validation at lines 71-73 makes this safe, but strict=True is idiomatic Python 3.10+ and self-documenting.

♻️ Suggested change
-    for cmd, port in zip(args.servers, args.ports):
+    for cmd, port in zip(args.servers, args.ports, strict=True):

@nicofretti nicofretti changed the base branch from main to develop January 22, 2026 22:21

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@docs/how_to_create_blocks.md`:
- Line 42: The description for the MarkdownMultiplierBlock uses a lowercase
"markdown" — update the text for the symbol MarkdownMultiplierBlock so the
phrase reads "Split Markdown documents into chunks for batch processing"
(capitalize "Markdown" only in the human-readable description).

In `@docs/template_data_augmentation.md`:
- Around line 106-108: Remove the stray empty line between the two blockquote
lines so the "**Note:** `price` appears in both `numeric_fields` and
`fields_to_generate`..." and the "**Tip:** Use 4-10 diverse samples..." lines
form a single contiguous blockquote (i.e., no blank line between the lines
starting with >), or alternatively explicitly prefix the blank line with > to
make it a valid blockquote paragraph; update the blockquote content accordingly.
♻️ Duplicate comments (4)
.claude/skills/address-pr-review/scripts/fetch_comments.py (2)

67-75: Comment bodies dropped when no bold title - not addressed.

The previous review flagged that desc remains empty when title_match is None, and details_pos > 0 ignores <details> at position 0. Human review comments (often without bold titles) will have empty descriptions.


113-131: CLI arg validation - not addressed.

Previous review flagged: pr_number should be validated as numeric, and --id mode needs proper validation. Additionally, int(sys.argv[3]) on line 125 will raise an unhandled ValueError if the argument isn't numeric.

frontend/src/components/pipeline-editor/BlockConfigPanel.tsx (2)

266-312: LGTM!

Both dropdowns correctly preserve custom model values not returned by the API, preventing silent data loss on save.


417-489: Document the new JSON-mode toggle in frontend state docs.

Per coding guidelines, new UI flows should be reflected in llm/state-frontend.md. The jsonMode state and JSON/Jinja2 toggle behavior should be documented.

🧹 Nitpick comments (12)
.claude/skills/address-pr-review/scripts/fetch_comments.py (3)

23-31: Silent exit on failure obscures the root cause.

When gh repo view fails, the script exits without any error message, making it hard to diagnose issues (e.g., not in a repo, gh not authenticated).

Suggested improvement
     if result.returncode != 0:
+        print(f"Failed to get repo: {result.stderr.strip()}", file=sys.stderr)
         sys.exit(1)

34-43: Same silent failure pattern.

Add stderr output on API failure for debuggability.

Suggested improvement
     if result.returncode != 0:
+        print(f"Failed to fetch comments: {result.stderr.strip()}", file=sys.stderr)
         sys.exit(1)

147-151: Unrecognized mode silently defaults to --unresolved.

Line 147's condition mode not in ["--all", "--id", "--summary"] causes any typo (e.g., --unresloved) to silently behave as --unresolved. Consider warning or erroring on unknown modes.

Suggested improvement
-    if mode == "--unresolved" or mode not in ["--all", "--id", "--summary"]:
+    if mode not in ["--unresolved", "--all", "--id", "--summary"]:
+        print(f"Unknown mode '{mode}', use --unresolved, --all, --id, or --summary", file=sys.stderr)
+        sys.exit(1)
+    if mode == "--unresolved":
         top_level = [c for c in top_level if not is_resolved(c)]
         print(f"showing {len(top_level)} unresolved comments")
frontend/src/pages/Generator.tsx (3)

50-50: Remove unused validated state.

The validated value is discarded ([, setValidated]) and never read. If validation gating is needed, use validationResult instead. Otherwise, remove this state entirely to reduce noise.

-  const [, setValidated] = useState(false);

Then remove the setValidated(...) calls at lines 78, 137, 248, and 275.


125-127: Magic string for block type detection.

The hardcoded "MarkdownMultiplierBlock" string is fragile. If the backend renames this block type, the check fails silently. Consider having the API return a format hint (e.g., expected_input_format: "markdown" | "json") or extracting this to a constant.

+const MARKDOWN_BLOCK_TYPE = "MarkdownMultiplierBlock";
+
 const fetchPipelineDetails = async () => {
   // ...
   const firstBlockType = data.first_block_type || "";
-  const needsMd = firstBlockType === "MarkdownMultiplierBlock";
+  const needsMd = firstBlockType === MARKDOWN_BLOCK_TYPE;

732-732: Minor redundancy in verify button condition.

The !needsMarkdown && file.name.endsWith(".json") check is defensive but redundant—if needsMarkdown is false and a file passed validation, it's already JSON. Safe to keep for robustness, or simplify to just !needsMarkdown.

lib/blocks/builtin/duplicate_remover.py (1)

56-57: Potential memory leak in embeddings cache.

The _embeddings_cache dictionary grows unbounded as new trace_id values are added but never evicted. In long-running processes with many pipeline executions, this could lead to memory exhaustion.

Consider adding cache eviction (e.g., LRU cache, max size limit, or explicit cleanup after pipeline completion).

♻️ Suggested approach using functools.lru_cache or bounded dict
# Option 1: Add a max cache size
MAX_CACHE_SIZE = 100

def _get_seed_embeddings(self, ...):
    # Evict oldest entries if cache too large
    if len(self._embeddings_cache) >= MAX_CACHE_SIZE:
        oldest_key = next(iter(self._embeddings_cache))
        del self._embeddings_cache[oldest_key]
    # ... rest of method
lib/blocks/builtin/semantic_infiller.py (5)

174-182: Use raise ... from err to preserve exception chain.

When re-raising exceptions, use raise ... from err to maintain the exception chain for better debugging.

📝 Suggested fix
         try:
             response = await litellm.acompletion(**llm_params)
         except Exception as e:
             raise BlockExecutionError(
-                f"LLM call failed: {str(e)}",
+                f"LLM call failed: {e!s}",
                 detail={
                     "skeleton": skeleton,
                     "prompt_preview": prompt[:200],
                     "error": str(e),
                 },
-            )
+            ) from e

247-247: Add strict=True to zip for safety.

Adding strict=True ensures embeddings and samples lists have matching lengths, catching potential bugs early.

📝 Suggested fix
-        for emb, sample in zip(embeddings, samples):
+        for emb, sample in zip(embeddings, samples, strict=True):

268-270: Rename unused loop variable sim to _sim.

The similarity value isn't used in the loop body - only the sample is needed. Prefix with underscore to indicate intentional non-use.

📝 Suggested fix
-        for sim, sample in similar_samples:
+        for _sim, sample in similar_samples:
             fields_str = json.dumps({f: sample.get(f, "") for f in fields_to_generate})
             negative_lines.append(f"  - {fields_str}")

322-322: Temperature could exceed valid range on retries.

With max_diversity_retries=5 and base temperature=0.9, the adjusted temperature becomes 1.4. LLMs typically accept 0.0-2.0, but values >1.0 may produce unexpected results with some providers.

Consider capping the temperature:

📝 Suggested fix
-                temperature=self.temperature + (attempt * 0.1),  # increase temp on retry
+                temperature=min(self.temperature + (attempt * 0.1), 1.0),  # increase temp on retry, cap at 1.0

334-338: Use raise ... from e here as well.

Same issue as line 175 - preserve exception chain.

📝 Suggested fix
             except Exception as e:
                 raise BlockExecutionError(
-                    f"LLM call failed: {str(e)}",
+                    f"LLM call failed: {e!s}",
                     detail={"skeleton": skeleton, "attempt": attempt, "error": str(e)},
-                )
+                ) from e

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @.claude/skills/address-pr-review/scripts/fetch_comments.py:
- Around line 34-43: The gh API call in fetch_comments currently uses ["gh",
"api", f"repos/{repo}/pulls/{pr_number}/comments", "--paginate"] which can emit
multiple JSON documents that json.loads cannot parse; update the subprocess.run
invocation in fetch_comments to include the "--slurp" flag (e.g., add "--slurp"
to the argument list) so gh will wrap all pages into a single JSON array and
json.loads(result.stdout) will succeed.
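
A minimal sketch of the adjusted call, assuming the `repo` and `pr_number` variables from fetch_comments; note that with `--slurp` each element of the parsed result is one page, so a flatten step may still be needed:

```python
import json
import subprocess

result = subprocess.run(
    ["gh", "api", f"repos/{repo}/pulls/{pr_number}/comments", "--paginate", "--slurp"],
    capture_output=True,
    text=True,
)
pages = json.loads(result.stdout)  # a single JSON array wrapping every page
comments = [comment for page in pages for comment in page]
```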

In `@tests/e2e/test_generator_e2e.py`:
- Around line 196-219: The test_job_progress_monitoring currently asserts any
presence of progress or upload UI via the progress_or_upload locator, which is
too permissive; either rename the test to reflect it's only checking UI
visibility (e.g., test_generator_page_shows_expected_ui) or modify it to start a
generation job and then assert real progress updates: interact with the page to
initiate a job (e.g., upload/select input and click the start button), then wait
for the progress bar locator ('[role="progressbar"]' / ".progress, .Progress")
or percentage text to appear and assert that its value or text changes over time
(e.g., percentage increases or status transitions), and keep using the
progress_or_upload locator only after the job start to verify actual progress
updates in test_job_progress_monitoring.
🧹 Nitpick comments (3)
tests/e2e/test_generator_e2e.py (2)

167-185: Consider using Playwright's auto-waiting instead of fixed sleep.

The 3-second time.sleep(3) at line 177 is a code smell. Playwright's expect with timeout or wait_for_selector would be more reliable and faster.

♻️ Suggested improvement
         if generate_buttons.count() > 0:
             generate_buttons.first.click()

-            # wait for job to start
-            time.sleep(3)
-            page.wait_for_load_state("networkidle")
-
-            # verify job progress appears
-            # look for progress indicators
-            progress_indicator = page.get_by_text("Progress", exact=False).or_(
+            # wait for progress indicator to appear
+            progress_indicator = page.get_by_text("Progress", exact=False).or_(
                 page.get_by_text("Generated", exact=False)
             )
+            expect(progress_indicator.first).to_be_visible(timeout=10000)
-            assert progress_indicator.count() > 0, "Progress indicator should be visible"

247-281: Main block runs tests sequentially without pytest.

This works for manual execution but tests won't benefit from pytest's fixtures, parallel execution, or reporting when run this way. Consider adding pytest markers for proper test discovery.

♻️ Suggested improvement

Add a conftest.py with shared fixtures and use pytest markers:

# In conftest.py or at module level
import pytest

@pytest.fixture(scope="module", autouse=True)
def setup_and_cleanup():
    """Setup pipeline and cleanup after all tests in module."""
    wait_for_server()
    cleanup_database()
    setup_test_pipeline()
    yield
    cleanup_database()

Then run with: pytest tests/e2e/test_generator_e2e.py -v

.claude/skills/address-pr-review/scripts/fetch_comments.py (1)

23-31: Silent failure on gh repo view error.

When gh fails, the script exits without telling the user why. A brief stderr message helps debugging.

Proposed fix
     if result.returncode != 0:
+        print(f"failed to get repo: {result.stderr.strip()}", file=sys.stderr)
         sys.exit(1)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@tests/e2e/test_generator_e2e.py`:
- Around line 160-175: The test currently skips the flow when the file input or
Generate/Start button is missing; add explicit assertions so the test fails when
required UI elements are absent: assert that file_inputs
(page.locator('input[type="file"]').all()) has length > 0 before calling
set_input_files(seed_path) or, if file upload is optional, document that and
only assert presence of the generate/start control; similarly assert
generate_buttons.count() > 0 (or call pytest.fail with a clear message) before
invoking generate_buttons.first.click(); reference the variables file_inputs and
generate_buttons and the methods set_input_files, count, and first.click when
making the checks.
- Around line 110-116: The test currently attempts to upload without verifying
the file input exists; before calling file_inputs[0].set_input_files(seed_path)
assert that the locator 'input[type="file"]' has at least one element (e.g.,
assert len(file_inputs) > 0 or use Playwright's
expect(file_inputs[0]).to_be_visible()) so the test fails clearly when the file
input is missing; keep the existing variables file_inputs, seed_path and the
page.locator call and only proceed to set_input_files and sleep after the
existence assertion.
- Around line 239-272: Remove the guarded "__main__" block and replace its
setup/teardown logic with a module-scoped autouse pytest fixture that calls
wait_for_server(), cleanup_database(), and setup_test_pipeline() before yielding
and then calls cleanup_database() after yield; keep the existing test_*
functions (test_generator_page_loads, test_select_pipeline,
test_upload_seed_file, test_start_generation_job,
test_generator_shows_upload_ui) as normal tests so pytest will run them,
matching the pattern used in test_pipelines_e2e.py.
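
A minimal sketch of that fixture, assuming wait_for_server, cleanup_database, and setup_test_pipeline are importable from test_helpers (the comment above names all three):

```python
import pytest
from test_helpers import cleanup_database, setup_test_pipeline, wait_for_server


@pytest.fixture(scope="module", autouse=True)
def _generator_e2e_setup_teardown():
    """start from a clean database and a known pipeline, clean up afterwards"""
    if not wait_for_server():
        pytest.skip("server not ready for e2e tests")
    cleanup_database()
    setup_test_pipeline()
    yield
    cleanup_database()
```
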
♻️ Duplicate comments (1)
tests/e2e/test_generator_e2e.py (1)

58-67: Fail fast when no real pipeline option exists.
Right now the test can pass even if only a placeholder option exists (or no custom dropdown items). Make it assert the availability of a selectable option.

🔧 Suggested fix
             if selectors[0].evaluate("el => el.tagName") == "SELECT":
                 options = selectors[0].locator("option").all()
-                if len(options) > 1:  # skip "select pipeline" placeholder
-                    selectors[0].select_option(index=1)
+                assert len(options) > 1, "No pipelines available; create one before running e2e tests"
+                selectors[0].select_option(index=1)
             else:
                 # for custom dropdowns, click first item
                 items = page.locator('[role="option"]').all()
-                if len(items) > 0:
-                    items[0].click()
+                assert len(items) > 0, "No pipeline options found in custom dropdown"
+                items[0].click()
🧹 Nitpick comments (2)
.claude/skills/address-pr-review/scripts/fetch_comments.py (2)

23-31: Silent failure makes debugging harder.

fetch_comments() prints stderr on failure, but get_repo() exits silently. If gh isn't authenticated or there's no repo context, users get no feedback.

♻️ Suggested fix
 def get_repo() -> str:
     result = subprocess.run(
         ["gh", "repo", "view", "--json", "owner,name", "-q", '.owner.login + "/" + .name'],
         capture_output=True,
         text=True,
     )
     if result.returncode != 0:
+        print(f"failed to get repo: {result.stderr.strip()}", file=sys.stderr)
         sys.exit(1)
     return result.stdout.strip()

162-166: Unknown modes silently treated as --unresolved.

A typo like --unreslved would be silently accepted. Consider warning when an unrecognized mode is used.

♻️ Optional: warn on unknown modes
-    if mode == "--unresolved" or mode not in ["--all", "--id", "--summary"]:
+    known_modes = ["--unresolved", "--all", "--id", "--summary"]
+    if mode not in known_modes:
+        print(f"unknown mode '{mode}', defaulting to --unresolved", file=sys.stderr)
+    if mode == "--unresolved" or mode not in known_modes:
         top_level = [c for c in top_level if not is_resolved(c)]
         print(f"showing {len(top_level)} unresolved comments")

@nicofretti nicofretti merged commit aadd197 into develop Jan 24, 2026
2 checks passed
@nicofretti nicofretti deleted the feat/e2e-tests branch January 24, 2026 13:47
Successfully merging this pull request may close these issues:

- 🚀 Feat: template conversational data augmentation
- 🚀 Feat: include CodeRabbit as reviewer for the repository
