
🚀 Feat: Add DataGenFlow skills #64

Merged
nicofretti merged 3 commits into develop from 63-feat-add-datagenflow-skills
Feb 1, 2026

Conversation

@nicofretti (Owner) commented Feb 1, 2026

Related Issue

Checklist

  • Code follows project style guidelines
  • Comments explain "why" not "what"
  • Documentation updated (if needed)
  • No debug code or console statements
  • make format passes
  • make pre-merge passes
  • PR update from develop branch
  • Copilot review run and addressed

Summary by CodeRabbit

  • New Features

    • Added data augmentation pipeline template for generating synthetic records while preserving distributions
    • Enhanced block configuration with dynamic JSON-or-template field support
    • New built-in blocks: StructureSampler, SemanticInfiller, DuplicateRemover, FieldMapper, and RagasMetrics
  • Documentation

    • Comprehensive guides for code skills, model configuration, block implementation, debugging, and testing
  • Tests

    • Expanded end-to-end testing framework for web application validation
    • New integration tests for data augmentation workflows
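
To make "preserving distributions" concrete, here is a minimal sketch, not the StructureSampler implementation from this PR. It counts how often each value of one categorical field appears in the seed records and draws new skeleton values with matching weights. The function name `sample_skeletons`, the `plan` field, and the seed data are all hypothetical.

```python
import random
from collections import Counter


def sample_skeletons(seeds, field, n, rng=None):
    """Draw n values for `field` following its empirical frequency in seeds.

    Hypothetical helper for illustration; not the StructureSampler API.
    """
    rng = rng or random.Random()
    counts = Counter(record[field] for record in seeds)
    values = list(counts)
    weights = [counts[v] for v in values]
    # random.choices samples with replacement, proportionally to the weights.
    return [{field: v} for v in rng.choices(values, weights=weights, k=n)]


# Toy seed set (assumed data): 75% "free", 25% "pro".
seeds = [
    {"plan": "free"},
    {"plan": "free"},
    {"plan": "free"},
    {"plan": "pro"},
]
skeletons = sample_skeletons(seeds, "plan", n=1000, rng=random.Random(0))
free_share = sum(s["plan"] == "free" for s in skeletons) / len(skeletons)
```

With enough samples, `free_share` should land close to the 0.75 share seen in the seeds. A real sampler would presumably also handle dependencies between fields, which this sketch ignores.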


@nicofretti nicofretti linked an issue Feb 1, 2026 that may be closed by this pull request
@nicofretti nicofretti changed the base branch from main to develop February 1, 2026 20:31
@coderabbitai (bot) commented Feb 1, 2026

Caution

Review failed

Failed to post review comments

Walkthrough

This PR introduces a comprehensive suite of DataGenFlow features: three new data augmentation blocks (StructureSampler, SemanticInfiller, DuplicateRemover), a data augmentation pipeline template, end-to-end testing infrastructure via Playwright with server orchestration, Claude Code skills documentation, frontend JSON-templating support, and extensive test coverage.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Claude Code Skills Documentation: `.claude/skills/address-pr-review/SKILL.md`, `.claude/skills/code-review/SKILL.md`, `.claude/skills/configuring-models/SKILL.md`, `.claude/skills/creating-pipeline-templates/SKILL.md`, `.claude/skills/debugging-pipelines/SKILL.md`, `.claude/skills/implementing-datagenflow-blocks/SKILL.md`, `.claude/skills/testing-pipeline-templates/SKILL.md`, `.claude/skills/webapp-testing/SKILL.md`, `.claude/skills/writing-e2e-tests/SKILL.md` | New comprehensive skill guides for PR review workflows, code review processes, model configuration, pipeline creation, debugging, block implementation patterns, testing, webapp automation, and e2e tests. |
| Webapp Testing Scripts & Examples: `.claude/skills/address-pr-review/scripts/fetch_comments.py`, `.claude/skills/webapp-testing/scripts/with_server.py`, `.claude/skills/webapp-testing/examples/*` | New Python utilities for fetching PR comments, managing server lifecycle during tests, and capturing browser console logs and performing element discovery via Playwright. |
| Data Augmentation Blocks: `lib/blocks/builtin/structure_sampler.py`, `lib/blocks/builtin/semantic_infiller.py`, `lib/blocks/builtin/duplicate_remover.py` | Three new blocks for synthetic data generation: StructureSampler (statistical skeleton generation), SemanticInfiller (LLM-based field completion with diversity checks), DuplicateRemover (embedding-based duplicate filtering). |
| Block Configuration & Template Utilities: `lib/blocks/builtin/field_mapper.py`, `lib/blocks/builtin/json_validator.py`, `lib/blocks/builtin/ragas_metrics.py`, `lib/blocks/builtin/structured_generator.py`, `lib/blocks/builtin/validator.py`, `lib/blocks/commons/template_utils.py`, `lib/blocks/config.py` | Enhanced blocks with JSON-or-template support for dynamic field configuration; new template utilities for rendering, JSON parsing, and error handling; config schema updates to support per-field format hints. |
| Frontend UI Components: `frontend/src/components/pipeline-editor/BlockConfigPanel.tsx`, `frontend/src/components/pipeline-editor/BlockNode.tsx`, `frontend/src/pages/Generator.tsx` | Added JSON-or-template editor with Monaco, LLM/embedding model dropdowns, and state handling; extended block preview logic for new block types; replaced pipeline type detection from first_block_is_multiplier to needsMarkdown flag. |
| Backend Core & Templates: `lib/workflow.py`, `lib/template_renderer.py`, `app.py`, `lib/templates/data_augmentation.yaml`, `lib/templates/seeds/seed_data_augmentation.json` | Added output filtering logic to exclude input metadata from results; improved tojson filter with undefined variable error messages; exposed first_block_type in pipeline API; introduced data augmentation template with configuration seeds. |
| E2E Testing Infrastructure: `scripts/with_server.py`, `tests/e2e/run_all_tests.sh`, `tests/e2e/test_*.py`, `tests/e2e/fixtures/*`, `tests/e2e/README.md` | Comprehensive e2e testing suite with Playwright: ServerManager for multi-server orchestration, three test modules (pipelines, generator, review), helper utilities for database cleanup and server readiness, and test fixtures. |
| Unit & Integration Tests: `tests/blocks/test_*.py`, `tests/blocks/commons/test_template_utils.py`, `tests/integration/conftest.py`, `tests/integration/test_data_augmentation*.py`, `tests/test_template_renderer.py`, `tests/test_templates.py` | Comprehensive test coverage for new blocks, template utilities, data augmentation pipeline, and template rendering; includes fixtures for e2e storage setup with Ollama integration. |
| Project Configuration & Documentation: `.coderabbit.yaml`, `.github/pull_request_template.md`, `.gitignore`, `Makefile`, `pyproject.toml`, `docs/claude_code_skills.md`, `docs/how_to_create_blocks.md`, `docs/template_data_augmentation.md`, `llm/state-*.md`, `scripts/inspect_db_configs.py` | Updated CodeRabbit review settings, PR template to reference CodeRabbit, gitignore rules for .claude skills, Makefile e2e targets, dependencies (scikit-learn, ragas upgrade), and extensive documentation for new blocks, data augmentation template, and Claude Code skills integration. |
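
As a rough illustration of what "JSON-or-template support" for a config field could mean, the sketch below accepts either a literal JSON string or a string containing `{{ name }}` placeholders, substitutes the placeholders from a context, and parses the result as JSON. This is an assumption-laden sketch: `resolve_json_or_template` and its regex-based rendering are hypothetical, and the PR's own utilities in `lib/blocks/commons/template_utils.py` may work differently.

```python
import json
import re

# Matches simple placeholders such as {{ fields }} (assumed syntax).
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def resolve_json_or_template(raw, context):
    """Return the parsed JSON value of `raw`, rendering placeholders first.

    Hypothetical helper for illustration only.
    """
    def substitute(match):
        name = match.group(1)
        if name not in context:
            raise KeyError(f"undefined template variable: {name}")
        # json.dumps keeps the rendered text valid JSON (quotes, escapes).
        return json.dumps(context[name])

    rendered = _PLACEHOLDER.sub(substitute, raw)
    try:
        return json.loads(rendered)
    except json.JSONDecodeError as exc:
        raise ValueError(f"field is not valid JSON after rendering: {exc}") from exc


literal = resolve_json_or_template('["name", "email"]', {})
templated = resolve_json_or_template("{{ fields }}", {"fields": ["name", "email"]})
```

Both calls yield the same list, so a block can treat a hard-coded value and a value taken from upstream pipeline state the same way. An undefined variable fails early with a named error rather than producing malformed JSON.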

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant StructureSampler
    participant SemanticInfiller
    participant DuplicateRemover
    participant LLM as LLM Service
    participant EmbeddingService

    User->>StructureSampler: Execute with seed samples
    StructureSampler->>StructureSampler: Analyze distributions & dependencies
    StructureSampler-->>User: Return skeletons + hints

    User->>SemanticInfiller: Execute with skeletons
    SemanticInfiller->>SemanticInfiller: Build generation prompts from hints
    SemanticInfiller->>LLM: Request field completion
    LLM-->>SemanticInfiller: Generated fields
    SemanticInfiller->>EmbeddingService: Get embeddings for diversity check
    EmbeddingService-->>SemanticInfiller: Embeddings
    SemanticInfiller->>SemanticInfiller: Check similarity & retry if needed
    SemanticInfiller-->>User: Return filled samples

    User->>DuplicateRemover: Execute with samples
    DuplicateRemover->>EmbeddingService: Embed seed & generated samples
    EmbeddingService-->>DuplicateRemover: Embeddings
    DuplicateRemover->>DuplicateRemover: Compute cosine similarity
    DuplicateRemover-->>User: Return samples with duplicate flags
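
The last stage of the diagram above (embed, compute cosine similarity, flag duplicates) can be sketched as follows. This is a minimal illustration under stated assumptions, not the DuplicateRemover code: the embeddings are hard-coded toy vectors instead of calls to an embedding service, and `flag_duplicates` and the 0.9 threshold are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def flag_duplicates(seed_vectors, sample_vectors, threshold=0.9):
    """Return one boolean per sample: True if too close to any seed vector.

    Hypothetical helper; the threshold value is an assumption.
    """
    return [
        any(cosine_similarity(vec, seed) >= threshold for seed in seed_vectors)
        for vec in sample_vectors
    ]


# Toy 2-D "embeddings" standing in for real model output.
seed_vectors = [[1.0, 0.0], [0.0, 1.0]]
sample_vectors = [[0.99, 0.05], [0.7, 0.7]]
flags = flag_duplicates(seed_vectors, sample_vectors)
```

Here the first sample is nearly parallel to the first seed and is flagged, while the second sits between both seeds and is kept. Returning flags instead of dropping rows matches the diagram's "Return samples with duplicate flags" step.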
sequenceDiagram
    participant TestRunner as Test Runner
    participant ServerManager
    participant Backend as Backend Server
    participant Frontend as Frontend Server
    participant Browser as Playwright Browser
    participant TestScript as Test Script

    TestRunner->>ServerManager: Start servers (backend, frontend)
    ServerManager->>Backend: Launch uvicorn process
    ServerManager->>Frontend: Launch yarn dev process
    ServerManager->>ServerManager: Poll /health endpoints
    Backend-->>ServerManager: Ready
    Frontend-->>ServerManager: Ready
    ServerManager-->>TestRunner: All servers ready

    TestRunner->>TestScript: Execute test suite
    TestScript->>Browser: Launch headless/visible
    Browser->>Frontend: Navigate to http://localhost:5173
    Frontend->>Backend: API requests (pipelines, generation)
    Backend-->>Frontend: Response data
    Browser->>Browser: Interact with UI & assertions
    TestScript-->>TestRunner: Test results

    TestRunner->>ServerManager: Cleanup (terminate processes)
    ServerManager-->>TestRunner: Success
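
The "poll until all servers are ready" step in the diagram above can be sketched as a small loop. This is not the ServerManager from this PR: `wait_until_ready` is a hypothetical helper, and the readiness checks are injected as plain callables so the example runs without a network.

```python
import time


def wait_until_ready(checks, timeout=30.0, interval=0.5, sleep=time.sleep):
    """Poll each named check until all pass or the timeout expires.

    `checks` maps a server name to a zero-argument callable returning bool.
    Hypothetical helper; not the ServerManager API.
    """
    deadline = time.monotonic() + timeout
    pending = dict(checks)
    while pending:
        # Iterate over a copy so ready servers can be removed safely.
        for name, check in list(pending.items()):
            if check():
                del pending[name]
        if not pending:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"servers not ready: {sorted(pending)}")
        sleep(interval)


# Stand-in checks: the backend answers on the third poll, the frontend at once.
attempts = {"backend": 0}


def backend_ready():
    attempts["backend"] += 1
    return attempts["backend"] >= 3


wait_until_ready({"backend": backend_ready, "frontend": lambda: True}, interval=0.01)
```

In a real setup each check would probably issue an HTTP request to the server's health endpoint and return whether it succeeded. Tracking servers individually lets the timeout error name exactly which process failed to come up.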

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The PR spans heterogeneous changes: three substantial block implementations with complex logic (embeddings, LLM integration, similarity calculations), frontend state refactoring with new UI modes, extensive test infrastructure (e2e + unit), template utilities, configuration schema updates, and documentation. While individual cohorts follow consistent patterns, the variety across backend, frontend, tests, and config requires separate reasoning per area.

Possibly related PRs

🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description is incomplete; only the Related Issue and Checklist sections are present. The Description section explaining what the PR does is entirely missing. | Add a Description section explaining the skills added and the key changes (e.g., new blocks, UI updates, e2e testing). |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 46.58% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title clearly indicates the main feature: adding DataGenFlow skills for agents. |
| Linked Issues check | ✅ Passed | The PR successfully implements the objectives from issue #63: it adds multiple skills for DataGenFlow (address-pr-review, code-review, configuring-models, creating-pipeline-templates, debugging-pipelines, implementing-datagenflow-blocks, testing-pipeline-templates, writing-e2e-tests, webapp-testing), introduces three new blocks (StructureSampler, SemanticInfiller, DuplicateRemover) for data augmentation, and adds e2e testing infrastructure. |
| Out of Scope Changes check | ✅ Passed | Minor out-of-scope changes detected: PR template update (CodeRabbit wording), .gitignore restructuring, Makefile e2e targets, and pyproject.toml dependency updates are not mentioned in issue #63 objectives but are supportive and reasonable. |



@nicofretti nicofretti self-assigned this Feb 1, 2026
@nicofretti nicofretti merged commit d5e45f9 into develop Feb 1, 2026
1 check passed

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

🚀 Feat: Add DataGenFlow skills

1 participant