🚀 Feat: Add DataGenFlow skills by nicofretti · Pull Request #64 · nicofretti/DataGenFlow

nicofretti · 2026-02-01T20:30:48Z

Related Issue

closes 🚀 Feat: Add DataGenFlow skills #63

Checklist

- Code follows project style guidelines
- Comments explain "why" not "what"
- Documentation updated (if needed)
- No debug code or console statements
- make format passes
- make pre-merge passes
- PR update from develop branch
- Copilot review run and addressed

Summary by CodeRabbit

New Features
- Added data augmentation pipeline template for generating synthetic records while preserving distributions
- Enhanced block configuration with dynamic JSON-or-template field support
- New built-in blocks: StructureSampler, SemanticInfiller, DuplicateRemover, FieldMapper, and RagasMetrics
Documentation
- Comprehensive guides for code skills, model configuration, block implementation, debugging, and testing
Tests
- Expanded end-to-end testing framework for web application validation
- New integration tests for data augmentation workflows
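
As a rough illustration of what "generating synthetic records while preserving distributions" can mean, here is a minimal sketch that samples a categorical field in proportion to its frequency in the seed records. The function name `sample_skeletons`, the `intent` field, and the seed data are hypothetical; this is not the StructureSampler implementation from this PR.

```python
import random
from collections import Counter


def sample_skeletons(seeds, field, n, rng=None):
    """Draw n values for `field`, weighted by how often each value occurs in the seeds.

    Hypothetical helper for illustration only; not part of DataGenFlow.
    """
    rng = rng or random.Random()
    counts = Counter(record[field] for record in seeds)
    values = list(counts)
    weights = [counts[value] for value in values]
    return [{field: value} for value in rng.choices(values, weights=weights, k=n)]


# Made-up seed records: 75% "refund", 25% "shipping".
seeds = [
    {"intent": "refund"},
    {"intent": "refund"},
    {"intent": "refund"},
    {"intent": "shipping"},
]
skeletons = sample_skeletons(seeds, "intent", 1000, random.Random(0))
share = sum(s["intent"] == "refund" for s in skeletons) / len(skeletons)
print(f"refund share: {share:.2f}")  # should land close to the 0.75 seen in the seeds
```

A real sampler would presumably also need to handle dependencies between fields, which this sketch ignores.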


coderabbitai · 2026-02-01T20:31:08Z

Caution

Review failed

Failed to post review comments

Walkthrough

This PR introduces a comprehensive suite of DataGenFlow features: three new data augmentation blocks (StructureSampler, SemanticInfiller, DuplicateRemover), a data augmentation pipeline template, end-to-end testing infrastructure via Playwright with server orchestration, Claude Code skills documentation, frontend JSON-templating support, and extensive test coverage.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Claude Code Skills Documentation `.claude/skills/address-pr-review/SKILL.md`, `.claude/skills/code-review/SKILL.md`, `.claude/skills/configuring-models/SKILL.md`, `.claude/skills/creating-pipeline-templates/SKILL.md`, `.claude/skills/debugging-pipelines/SKILL.md`, `.claude/skills/implementing-datagenflow-blocks/SKILL.md`, `.claude/skills/testing-pipeline-templates/SKILL.md`, `.claude/skills/webapp-testing/SKILL.md`, `.claude/skills/writing-e2e-tests/SKILL.md` | New skill guides for PR review workflows, code review processes, model configuration, pipeline creation, debugging, block implementation patterns, testing, webapp automation, and e2e tests. |
| Webapp Testing Scripts & Examples `.claude/skills/address-pr-review/scripts/fetch_comments.py`, `.claude/skills/webapp-testing/scripts/with_server.py`, `.claude/skills/webapp-testing/examples/*` | New Python utilities for fetching PR comments, managing server lifecycle during tests, capturing browser console logs, and discovering page elements via Playwright. |
| Data Augmentation Blocks `lib/blocks/builtin/structure_sampler.py`, `lib/blocks/builtin/semantic_infiller.py`, `lib/blocks/builtin/duplicate_remover.py` | Three new blocks for synthetic data generation: StructureSampler (statistical skeleton generation), SemanticInfiller (LLM-based field completion with diversity checks), DuplicateRemover (embedding-based duplicate filtering). |
| Block Configuration & Template Utilities `lib/blocks/builtin/field_mapper.py`, `lib/blocks/builtin/json_validator.py`, `lib/blocks/builtin/ragas_metrics.py`, `lib/blocks/builtin/structured_generator.py`, `lib/blocks/builtin/validator.py`, `lib/blocks/commons/template_utils.py`, `lib/blocks/config.py` | Enhanced blocks with JSON-or-template support for dynamic field configuration; new template utilities for rendering, JSON parsing, and error handling; config schema updates to support per-field format hints. |
| Frontend UI Components `frontend/src/components/pipeline-editor/BlockConfigPanel.tsx`, `frontend/src/components/pipeline-editor/BlockNode.tsx`, `frontend/src/pages/Generator.tsx` | Added a JSON-or-template editor with Monaco, LLM/embedding model dropdowns, and state handling; extended block preview logic for the new block types; switched pipeline type detection from `first_block_is_multiplier` to a `needsMarkdown` flag. |
| Backend Core & Templates `lib/workflow.py`, `lib/template_renderer.py`, `app.py`, `lib/templates/data_augmentation.yaml`, `lib/templates/seeds/seed_data_augmentation.json` | Added output filtering logic to exclude input metadata from results; improved the tojson filter with error messages for undefined variables; exposed `first_block_type` in the pipeline API; introduced the data augmentation template with configuration seeds. |
| E2E Testing Infrastructure `scripts/with_server.py`, `tests/e2e/run_all_tests.sh`, `tests/e2e/test_.py`, `tests/e2e/fixtures/`, `tests/e2e/README.md` | E2E testing suite with Playwright: ServerManager for multi-server orchestration, three test modules (pipelines, generator, review), helper utilities for database cleanup and server readiness, and test fixtures. |
| Unit & Integration Tests `tests/blocks/test_.py`, `tests/blocks/commons/test_template_utils.py`, `tests/integration/conftest.py`, `tests/integration/test_data_augmentation.py`, `tests/test_template_renderer.py`, `tests/test_templates.py` | Test coverage for the new blocks, template utilities, data augmentation pipeline, and template rendering; includes fixtures for e2e storage setup with Ollama integration. |
| Project Configuration & Documentation `.coderabbit.yaml`, `.github/pull_request_template.md`, `.gitignore`, `Makefile`, `pyproject.toml`, `docs/claude_code_skills.md`, `docs/how_to_create_blocks.md`, `docs/template_data_augmentation.md`, `llm/state-*.md`, `scripts/inspect_db_configs.py` | Updated CodeRabbit review settings, the PR template (now referencing CodeRabbit), gitignore rules for .claude skills, Makefile e2e targets, and dependencies (scikit-learn, ragas upgrade); added extensive documentation for the new blocks, the data augmentation template, and Claude Code skills integration. |
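
To make the "JSON-or-template" idea concrete, here is a minimal sketch of one way such a field could be resolved: accept the raw value if it is already valid JSON, otherwise render it as a template and parse the result. It uses the standard library's `string.Template` as a stand-in for the project's real template engine, and `resolve_json_or_template` is a hypothetical name, not a function from `lib/blocks/commons/template_utils.py`.

```python
import json
from string import Template


def resolve_json_or_template(raw, context):
    """Return the JSON value described by `raw`.

    Hypothetical helper: plain JSON is returned as-is; anything else is treated
    as a $placeholder template whose variables are JSON-encoded before parsing.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass  # not literal JSON, fall through to template rendering

    encoded = {key: json.dumps(value) for key, value in context.items()}
    try:
        rendered = Template(raw).substitute(encoded)
    except KeyError as exc:
        raise ValueError(f"undefined template variable: {exc.args[0]}") from exc

    try:
        return json.loads(rendered)
    except json.JSONDecodeError as exc:
        raise ValueError(f"rendered template is not valid JSON: {rendered!r}") from exc


print(resolve_json_or_template('["name", "age"]', {}))              # literal JSON
print(resolve_json_or_template("$fields", {"fields": ["name"]}))    # template
print(resolve_json_or_template('{"limit": $n}', {"n": 3}))          # mixed
```

One consequence of trying JSON first is that a value that is already valid JSON is never rendered, even if it contains `$` placeholders inside strings.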

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant StructureSampler
    participant SemanticInfiller
    participant DuplicateRemover
    participant LLM as LLM Service
    participant EmbeddingService

    User->>StructureSampler: Execute with seed samples
    StructureSampler->>StructureSampler: Analyze distributions & dependencies
    StructureSampler-->>User: Return skeletons + hints

    User->>SemanticInfiller: Execute with skeletons
    SemanticInfiller->>SemanticInfiller: Build generation prompts from hints
    SemanticInfiller->>LLM: Request field completion
    LLM-->>SemanticInfiller: Generated fields
    SemanticInfiller->>EmbeddingService: Get embeddings for diversity check
    EmbeddingService-->>SemanticInfiller: Embeddings
    SemanticInfiller->>SemanticInfiller: Check similarity & retry if needed
    SemanticInfiller-->>User: Return filled samples

    User->>DuplicateRemover: Execute with samples
    DuplicateRemover->>EmbeddingService: Embed seed & generated samples
    EmbeddingService-->>DuplicateRemover: Embeddings
    DuplicateRemover->>DuplicateRemover: Compute cosine similarity
    DuplicateRemover-->>User: Return samples with duplicate flags
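
The last step of the diagram above (compute cosine similarity, return samples with duplicate flags) can be sketched roughly as follows. The embeddings are made-up two-dimensional vectors, and the `0.9` threshold and the `similarity` / `is_duplicate` keys are assumptions for illustration, not values taken from `duplicate_remover.py`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def flag_duplicates(seed_vecs, generated_vecs, threshold=0.9):
    """Flag each generated vector whose best match among the seeds reaches `threshold`.

    Hypothetical helper; the threshold and output keys are illustrative.
    """
    flags = []
    for vec in generated_vecs:
        best = max((cosine_similarity(vec, seed) for seed in seed_vecs), default=0.0)
        flags.append({"similarity": best, "is_duplicate": best >= threshold})
    return flags


# Toy embeddings standing in for what an embedding service would return.
seeds = [[1.0, 0.0], [0.0, 1.0]]
generated = [[0.99, 0.05], [0.7, 0.7]]
flags = flag_duplicates(seeds, generated)
for flag in flags:
    print(flag)
```

In this toy data the first generated vector points almost exactly along the first seed and is flagged, while the second sits halfway between the two seeds and is kept.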

sequenceDiagram
    participant TestRunner as Test Runner
    participant ServerManager
    participant Backend as Backend Server
    participant Frontend as Frontend Server
    participant Browser as Playwright Browser
    participant TestScript as Test Script

    TestRunner->>ServerManager: Start servers (backend, frontend)
    ServerManager->>Backend: Launch uvicorn process
    ServerManager->>Frontend: Launch yarn dev process
    ServerManager->>ServerManager: Poll /health endpoints
    Backend-->>ServerManager: Ready
    Frontend-->>ServerManager: Ready
    ServerManager-->>TestRunner: All servers ready

    TestRunner->>TestScript: Execute test suite
    TestScript->>Browser: Launch headless/visible
    Browser->>Frontend: Navigate to http://localhost:5173
    Frontend->>Backend: API requests (pipelines, generation)
    Backend-->>Frontend: Response data
    Browser->>Browser: Interact with UI & assertions
    TestScript-->>TestRunner: Test results

    TestRunner->>ServerManager: Cleanup (terminate processes)
    ServerManager-->>TestRunner: Success
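
The "poll until ready" step in the diagram above might look something like the sketch below. `http_ok` and `wait_until_ready` are hypothetical names and the timeouts are arbitrary; this is not the ServerManager code from `scripts/with_server.py`.

```python
import time
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET on `url` answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False


def wait_until_ready(probe, timeout=30.0, interval=0.5):
    """Call `probe()` every `interval` seconds until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example call (not executed here), using the frontend URL shown in the diagram:
# wait_until_ready(lambda: http_ok("http://localhost:5173"))
```

Passing the probe in as a callable keeps the waiting loop independent of HTTP, so it can be exercised in tests without starting a server.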

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The PR spans heterogeneous changes: three substantial block implementations with complex logic (embeddings, LLM integration, similarity calculations), frontend state refactoring with new UI modes, extensive test infrastructure (e2e + unit), template utilities, configuration schema updates, and documentation. While individual cohorts follow consistent patterns, the variety across backend, frontend, tests, and config requires separate reasoning per area.

Possibly related PRs

🚀 Feat: template conversational data augmentation + e2e tests + CodeRabbit #57: Directly related. Both PRs implement the same data augmentation blocks (StructureSampler, SemanticInfiller, DuplicateRemover), template utilities, e2e testing infrastructure, frontend BlockConfigPanel enhancements, and Claude Code skills.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description is incomplete; only the Related Issue and Checklist sections are present. The Description section explaining what the PR does is entirely missing. | Add a Description section explaining the skills added and the key changes (e.g., new blocks, UI updates, e2e testing). |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 46.58% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
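
For anyone wanting to check docstring coverage locally before pushing, here is a minimal sketch using the standard `ast` module. It counts functions and classes only; the way CodeRabbit arrives at its 46.58% figure is not described here and may differ, so treat `docstring_coverage` as a hypothetical approximation.

```python
import ast


def docstring_coverage(source):
    """Return the percentage of functions and classes in `source` that have a docstring."""
    nodes = [
        node
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    if not nodes:
        return 100.0  # nothing to document
    documented = sum(1 for node in nodes if ast.get_docstring(node))
    return 100.0 * documented / len(nodes)


# Made-up module with one documented and one undocumented function.
sample = '''
def documented():
    """Explain why this exists."""

def undocumented():
    return 1
'''
print(f"{docstring_coverage(sample):.2f}%")  # 50.00%
```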

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title clearly indicates the main feature: adding DataGenFlow skills for agents. |
| Linked Issues check | ✅ Passed | The PR successfully implements the objectives from issue `#63`: it adds multiple skills for DataGenFlow (address-pr-review, code-review, configuring-models, creating-pipeline-templates, debugging-pipelines, implementing-datagenflow-blocks, testing-pipeline-templates, writing-e2e-tests, webapp-testing), introduces three new blocks (StructureSampler, SemanticInfiller, DuplicateRemover) for data augmentation, and adds e2e testing infrastructure. |
| Out of Scope Changes check | ✅ Passed | Minor out-of-scope changes detected: PR template update (CodeRabbit wording), .gitignore restructuring, Makefile e2e targets, and pyproject.toml dependency updates are not mentioned in issue `#63` objectives but are supportive and reasonable. |




nicofretti added 3 commits February 1, 2026 21:12

add: base skills

78fd10f

add: docs

6f05290

fix: signature

0119dcb

nicofretti linked an issue Feb 1, 2026 that may be closed by this pull request

🚀 Feat: Add DataGenFlow skills #63

Closed

nicofretti changed the base branch from main to develop February 1, 2026 20:31

nicofretti self-assigned this Feb 1, 2026

nicofretti merged commit d5e45f9 into develop Feb 1, 2026
1 check passed
