
🚀 Feat: template conversational data augmentation #53

Closed
nicofretti wants to merge 13 commits into develop from 38-feat-template-conversational-data-augmentation

Conversation

@nicofretti (Owner) commented Jan 10, 2026

Description

This PR introduces a data augmentation pipeline feature with three new Python blocks (StructureSampler, SemanticInfiller, DuplicateRemover) for generating and validating synthetic records. It includes frontend model selection UI, comprehensive test coverage, documentation, and CodeRabbit configuration.
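At a glance, the three blocks compose as in the toy sketch below. This illustrates the flow only: the real blocks live in lib/blocks/builtin/, operate on a BlockExecutionContext, and call an actual LLM and embedding model.

```python
import asyncio
import random
from typing import Any

def sample_skeletons(seeds: list[dict[str, Any]], n: int) -> list[dict[str, Any]]:
    """Toy StructureSampler: draw categorical values at their seed frequency."""
    plans = [s["plan"] for s in seeds]
    return [{"plan": random.choice(plans), "bio": None} for _ in range(n)]

async def infill(record: dict[str, Any]) -> dict[str, Any]:
    """Toy SemanticInfiller: the real block renders a prompt and calls an LLM."""
    record["bio"] = f"A {record['plan']} user."
    return record

def is_duplicate(record: dict[str, Any], seen: set[str]) -> bool:
    """Toy DuplicateRemover: the real block compares embeddings, not raw text."""
    return record["bio"] in seen

async def main() -> None:
    seeds = [{"plan": "Free"}, {"plan": "Pro"}, {"plan": "Enterprise"}]
    seen: set[str] = set()
    for record in sample_skeletons(seeds, n=5):
        record = await infill(record)
        record["is_duplicate"] = is_duplicate(record, seen)
        seen.add(record["bio"])
        print(record)

asyncio.run(main())
```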

Related Issue

Checklist

  • Code follows project style guidelines
  • Comments explain "why" not "what"
  • Documentation updated (if needed)
  • No debug code or console statements
  • make format passes
  • make pre-merge passes
  • PR updated from the develop branch
  • Copilot review run and addressed

Summary by CodeRabbit

Release Notes

  • New Features

    • Added data augmentation pipeline with semantic field generation and duplicate detection capabilities.
    • Introduced dynamic model selection dropdowns in the pipeline editor for configuring LLM and embedding models.
    • Enhanced block preview display with improved formatting for arrays and objects.
  • Documentation

    • Added comprehensive guide for implementing and extending DataGenFlow blocks.
    • Added data augmentation template documentation with examples and customization options.
  • Chores

    • Added scikit-learn dependency.
    • Updated Docker and code review configurations.


@nicofretti nicofretti linked an issue Jan 10, 2026 that may be closed by this pull request
coderabbitai bot commented Jan 10, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR introduces a data augmentation pipeline feature with three new Python blocks (StructureSampler, SemanticInfiller, DuplicateRemover) for generating and validating synthetic records. It includes frontend model selection UI, comprehensive test coverage, documentation, and CodeRabbit configuration.

Changes

**Infrastructure & Configuration** (`docker/docker-compose.yml`, `.coderabbit.yaml`, `pyproject.toml`, `.gitignore`): Build context/Dockerfile paths updated; healthcheck URL and start_period added; CodeRabbit review config added with language-specific rules and tone guidelines; scikit-learn dependency added; `.claude/skills/` now tracked.
**Documentation** (`.claude/skills/implementing-datagenflow-blocks/SKILL.md`, `docs/template_data_augmentation.md`, `llm/state-backend.md`, `llm/state-project.md`): New comprehensive skill guide for block implementation with templates and patterns; new data augmentation template documentation covering pipeline stages, customization, and examples; updated reference docs with the new blocks and features.
**Backend: Data Augmentation Blocks** (`lib/blocks/builtin/structure_sampler.py`, `lib/blocks/builtin/semantic_infiller.py`, `lib/blocks/builtin/duplicate_remover.py`): StructureSampler learns distributions from seeds and generates skeletons with dependency-aware categorical sampling. SemanticInfiller performs LLM-based field completion with template rendering, JSON parsing, and locked-field restoration. DuplicateRemover performs embedding-based similarity detection with per-trace caching and graceful error handling.
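The "dependency-aware categorical sampling" mentioned here can be pictured as drawing a child field from a distribution conditioned on the parent's already-sampled value. A minimal sketch of that idea, with field names borrowed from the seed example (the real block derives these tables from the configured dependencies and seeds its own RNG):

```python
import random
from collections import Counter, defaultdict

rng = random.Random(42)  # deterministic, mirroring the template's seeding

seeds = [
    {"plan": "Free", "role": "Viewer"},
    {"plan": "Pro", "role": "Editor"},
    {"plan": "Pro", "role": "Admin"},
    {"plan": "Enterprise", "role": "Admin"},
]

# marginal distribution of the parent field
plan_counts = Counter(s["plan"] for s in seeds)

# conditional distribution of the child field given the parent value
role_given_plan: defaultdict[str, Counter] = defaultdict(Counter)
for s in seeds:
    role_given_plan[s["plan"]][s["role"]] += 1

def sample_skeleton() -> dict[str, str]:
    plan = rng.choices(list(plan_counts), weights=list(plan_counts.values()))[0]
    roles = role_given_plan[plan]
    role = rng.choices(list(roles), weights=list(roles.values()))[0]
    return {"plan": plan, "role": role}

print([sample_skeleton() for _ in range(3)])
```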
**Templates** (`lib/templates/data_augmentation.yaml`, `lib/templates/seeds/seed_data_augmentation.json`): YAML pipeline definition for the three-block augmentation flow; seed data example with 6 sample records, categorical/numeric field configs, and dependencies.
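Judging from the review comments further down (6 samples across Free/Pro/Enterprise, target count 20, role depending on plan, bio used for duplicate comparison), the seed file plausibly has a shape like the following. Every key and value here is a reconstruction for illustration, not a copy of the actual file:

```python
# Hypothetical shape of lib/templates/seeds/seed_data_augmentation.json,
# rendered as a Python literal; consult the real file for the actual keys.
seed_data = {
    "samples": [  # 6 records in the real file
        {"plan": "Free", "role": "Viewer", "storage": 5, "bio": "Hobbyist trying out the product."},
        {"plan": "Pro", "role": "Editor", "storage": 100, "bio": "Designer on a small product team."},
        # ...
    ],
    "metadata": {
        "target_count": 20,                  # ~3.3x expansion from 6 seeds
        "categorical_fields": ["plan", "role"],
        "numeric_fields": ["storage"],
        "dependencies": {"role": ["plan"]},  # role is sampled conditionally on plan
        "duplicate_field": "bio",            # text compared by DuplicateRemover
    },
}
```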
**Frontend: Model Integration** (`frontend/src/components/pipeline-editor/BlockConfigPanel.tsx`, `frontend/src/components/pipeline-editor/BlockNode.tsx`): BlockConfigPanel fetches and populates llmModels/embeddingModels via new API endpoints and renders dropdowns for the "model" and "embedding_model" fields. BlockNode gains extended block-type categorization (sampler, infiller, remover, multiplier, etc.) and enhanced value formatting (arrays as [N items], objects as {N keys}), with truncation increased to 25 characters.
**Unit Tests** (`tests/blocks/test_structure_sampler.py`, `tests/blocks/test_semantic_infiller.py`, `tests/blocks/test_duplicate_remover.py`): Comprehensive coverage for each block: initialization, distribution calculations, prompt construction, JSON parsing, embedding caching, error handling, schema validation, and edge cases (missing data, circular dependencies, embedding failures).
**Integration Tests** (`tests/integration/test_data_augmentation.py`): End-to-end pipeline test with a mocked LLM; validates result structure, trace integrity, field constraints, deterministic seeding, and graceful handling of missing embeddings.
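The "mocked LLM" part presumably follows the standard AsyncMock pattern; a self-contained sketch of just that pattern (this is not the repo's actual test code):

```python
import asyncio
import json
from unittest.mock import AsyncMock

async def infill(record: dict, llm) -> dict:
    """Stand-in for an LLM-backed block; `llm` is injected so tests can fake it."""
    raw = await llm(prompt=f"Fill the empty fields: {json.dumps(record)}")
    record.update(json.loads(raw))
    return record

def test_infill_with_mocked_llm():
    fake_llm = AsyncMock(return_value='{"bio": "Generated text."}')
    result = asyncio.run(infill({"plan": "Pro", "bio": None}, fake_llm))
    assert result["bio"] == "Generated text."  # structure checked, no real API call
    fake_llm.assert_awaited_once()

test_infill_with_mocked_llm()
```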

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Pipeline
    participant StructureSampler as Structure<br/>Sampler
    participant SemanticInfiller as Semantic<br/>Infiller
    participant LLM as LLM API
    participant DuplicateRemover as Duplicate<br/>Remover
    participant EmbedModel as Embedding<br/>Model

    User->>Pipeline: Trigger pipeline with seed
    Pipeline->>StructureSampler: Read seed samples
    StructureSampler->>StructureSampler: Analyze distributions & dependencies
    StructureSampler->>Pipeline: Return skeleton records (N items)

    loop For each skeleton
        Pipeline->>SemanticInfiller: Process skeleton
        SemanticInfiller->>SemanticInfiller: Build constrained prompt
        SemanticInfiller->>LLM: Request field generation
        LLM-->>SemanticInfiller: Generated JSON with fields
        SemanticInfiller->>SemanticInfiller: Parse & restore locked fields
        SemanticInfiller->>Pipeline: Return enriched record

        Pipeline->>DuplicateRemover: Check against reference dataset
        DuplicateRemover->>EmbedModel: Embed current record text
        DuplicateRemover->>DuplicateRemover: Compute similarity (cached refs)
        DuplicateRemover->>Pipeline: Return record + is_duplicate flag
    end

    Pipeline->>User: Return augmented dataset
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • **Docstring Coverage** ⚠️ Warning: Docstring coverage is 33.78%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • **Description check** ❓ Inconclusive: The PR description is vague and incomplete; it provides a general overview but lacks specific implementation details, context, or clarity about the changes made. Resolution: expand the description to explain (1) what each block does and why it is needed, (2) how the data augmentation pipeline works, (3) key architectural decisions, (4) the purpose of the frontend changes, and (5) how to use the new feature, with specific examples where possible.

✅ Passed checks (3 passed)

  • **Title check** ✅: The title is partially related to the changeset. It mentions "template conversational data augmentation", but the PR primarily implements three new blocks (StructureSampler, SemanticInfiller, DuplicateRemover), a data augmentation template, documentation, and frontend changes, which is broader than just conversational data.
  • **Linked Issues check** ✅: The changeset addresses issue #38 by implementing a data augmentation pipeline with three blocks that maintain coherence and diversity. Issue #52 is addressed by adding the .coderabbit.yaml configuration enabling CodeRabbit reviews.
  • **Out of Scope Changes check** ✅: Changes are well-scoped to the linked issues. Docker/gitignore updates support the infrastructure, frontend changes enhance model selection, and all modifications serve the core data augmentation and CodeRabbit integration objectives.




@nicofretti nicofretti changed the base branch from main to develop January 10, 2026 20:21
@nicofretti nicofretti self-assigned this Jan 10, 2026
@nicofretti nicofretti added the enhancement New feature or request label Jan 10, 2026
@nicofretti nicofretti marked this pull request as ready for review January 10, 2026 20:49
@nicofretti nicofretti marked this pull request as draft January 10, 2026 20:51
@nicofretti nicofretti marked this pull request as ready for review January 10, 2026 20:51
@nicofretti (Owner, Author) commented:

@coderabbitai review

coderabbitai bot commented Jan 10, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai bot commented Jan 10, 2026

Note

Docstrings generation - SUCCESS
Generated docstrings for this pull request at #54

coderabbitai bot added a commit that referenced this pull request Jan 10, 2026
Docstrings generation was requested by @nicofretti.

* #53 (comment)

The following files were modified:

* `frontend/src/components/pipeline-editor/BlockConfigPanel.tsx`
* `frontend/src/components/pipeline-editor/BlockNode.tsx`
* `lib/blocks/builtin/duplicate_remover.py`
* `lib/blocks/builtin/semantic_infiller.py`
* `lib/blocks/builtin/structure_sampler.py`
* `tests/blocks/test_duplicate_remover.py`
* `tests/blocks/test_semantic_infiller.py`
* `tests/blocks/test_structure_sampler.py`
* `tests/integration/test_data_augmentation.py`
coderabbitai bot left a comment

Actionable comments posted: 12

🤖 Fix all issues with AI agents
In @.claude/skills/implementing-datagenflow-blocks/SKILL.md:
- Around line 417-419: The SKILL example for the StructureSampler class has the
wrong category string; update the class attribute on StructureSampler (the name
= "Structure Sampler" class) to use category = "seeders" instead of "generators"
so it matches the implementation in structure_sampler.py and other references.
- Around line 414-427: The example for the multiplier block uses the wrong execute signature: change StructureSampler.execute to accept a BlockExecutionContext (import it) instead of initial_data: dict[str, Any]; update any references inside the method to use the context (e.g., context.input or context.state) and keep the return type as list[dict[str, Any]] to match BaseMultiplierBlock's actual implementation and the real structure_sampler.py. (A sketch of this corrected shape appears after this list.)

In @docs/template_data_augmentation.md:
- Around line 361-399: Replace the bold "Step X" lines that trigger MD036 with
proper Markdown headings: change lines like "**Step 1: Prepare samples (6
examples)**", "**Step 2: Create pipeline from template**", "**Step 3: Start
generation**", "**Step 4: Monitor progress**", and "**Step 5: Review and
export**" into heading syntax (for example "### Step 1: Prepare samples (6
examples)" etc.), ensuring consistent heading level across all steps and leaving
the surrounding code blocks and prose unchanged.
- Around line 41-55: The fenced diagram block in
docs/template_data_augmentation.md triggers markdownlint MD040 because it lacks
a language tag; update the opening fence for the ASCII diagram to include a
language (e.g., change the opening "```" to "```text") so the block becomes a
language-specified fenced code block and markdownlint MD040 is satisfied; locate
the ASCII diagram block (the box/arrow diagram under the Structure → Semantic →
Duplicate heading) and replace its opening fence accordingly.

In @lib/blocks/builtin/duplicate_remover.py:
- Around line 17-24: The "_config_descriptions" entry for "embedding_model" in
duplicate_remover.py is longer than 100 chars; shorten or split that description
so no line exceeds 100 chars — e.g., rephrase the description to be under 100
characters or break it into concatenated shorter strings for the
"embedding_model" value in the _config_descriptions dict (keep the same key name
and meaning).

In @lib/blocks/builtin/semantic_infiller.py:
- Around line 27-35: The block currently mutates the instance attribute
self.fields_to_generate inside execute(), making it non-reentrant; instead, keep
the original config immutable by reading self.fields_to_generate (and any
related attrs) and working on a local variable (e.g., fields_to_generate_local)
for parsing, template rendering, validation and any conversions; update only
local state and return results without assigning back to
self.fields_to_generate, and apply the same change to the code regions
referenced around execute() (lines ~139-252) to ensure thread-safety and avoid
side effects.

In @llm/state-project.md:
- Around line 31-33: Update llm/state-project.md to reflect the correct number
and list of builtin blocks: change the count from 12 to 14 wherever it appears
and replace the partial list with the full set of block names to match
llm/state-backend.md — include StructureSampler, TextGenerator,
StructuredGenerator, SemanticInfiller, MarkdownMultiplierBlock, ValidatorBlock,
JSONValidatorBlock, DuplicateRemover, DiversityScore, CoherenceScore,
RougeScore, LangfuseBlock, FieldMapper, and RagasMetrics; ensure the header
comment near builtin/ and the block listing sections all consistently show "14
blocks" and enumerate these 14 block implementations.

In @tests/blocks/test_semantic_infiller.py:
- Around line 92-124: Tests instantiate SemanticInfiller with fields_to_generate
as Python lists, but the real config expects a JSON string; update each test to
pass a JSON string (e.g., fields_to_generate='["bio"]' or json.dumps(["bio"]))
when creating SemanticInfiller in test_parse_valid_json,
test_parse_json_with_markdown, test_parse_json_embedded_in_text, and
test_parse_invalid_json_raises_error so the block receives the same shape as
production and _parse_json_safely is exercised correctly.

In @tests/integration/test_data_augmentation.py:
- Around line 81-90: Ruff errors F841 (assigned-but-unused) and F541 (f-string
missing placeholders) are caused by unused locals and incorrectly formatted
f-strings in this test; fix by either using or discarding the assigned variables
and by removing the stray f-prefix from literal strings or adding proper
{placeholders}. Specifically, for the assignments pipeline_id, pipeline =
Pipeline(...), and initial_data in the shown block (and the other occurrences at
164-176, 205-207, 268-271) either (a) use the values in an assertion or
subsequent call (e.g., assert pipeline_id is not None or assert pipeline is
instance of Pipeline) or (b) rename them to _pipeline_id/_pipeline or prefix
with an underscore to indicate intentional unused values; for any f"..." strings
flagged F541, remove the leading f if there is no interpolation or replace with
f"...{var}..." including the correct variable names if interpolation was
intended.
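
For the two execute()-related items above (the BaseMultiplierBlock signature and the fields_to_generate mutation), the corrected shape presumably looks roughly like this. BlockExecutionContext is a minimal stand-in for the real class in lib/entities/block_execution_context.py, and the method bodies are placeholders rather than the blocks' actual logic:

```python
import asyncio
import json
from typing import Any

class BlockExecutionContext:
    """Stand-in for lib/entities/block_execution_context.py."""
    def __init__(self, state: dict[str, Any]):
        self.accumulated_state = state
    def get_state(self, key: str) -> Any:
        return self.accumulated_state.get(key)

class StructureSampler:
    # corrected signature: context in, list of skeleton records out
    async def execute(self, context: BlockExecutionContext) -> list[dict[str, Any]]:
        samples = context.get_state("samples") or []
        return [dict(s) for s in samples]  # the real block samples new skeletons

class SemanticInfiller:
    def __init__(self, fields_to_generate: str = "[]"):
        # config arrives as a JSON string (see the test fix above) and stays immutable
        self.fields_to_generate = fields_to_generate

    async def execute(self, context: BlockExecutionContext) -> dict[str, Any]:
        # parse into a local variable instead of assigning back to self.*,
        # so concurrent executions of the same instance cannot interfere
        fields = json.loads(self.fields_to_generate)
        record = dict(context.get_state("record") or {})
        for field in fields:
            record.setdefault(field, "<generated>")  # the real block calls the LLM here
        return record

ctx = BlockExecutionContext({"record": {"plan": "Pro"}})
print(asyncio.run(SemanticInfiller('["bio"]').execute(ctx)))
```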
🧹 Nitpick comments (10)
lib/blocks/builtin/duplicate_remover.py (2)

117-130: Missing _usage tracking for embedding API calls.

Per coding guidelines, LLM/embedding calls should track and return usage metrics. The embedding responses from litellm include usage data that should be captured.

Track embedding usage:

```diff
                 response = await litellm.aembedding(**embedding_params)

                 self._embeddings_cache[trace_id] = [
                     item["embedding"] for item in response.data
                 ]
+
+                # track reference embedding usage
+                if hasattr(response, 'usage') and response.usage:
+                    logger.info(
+                        f"Reference embeddings usage: {response.usage.total_tokens} tokens"
+                    )
```

36-37: Instance-level cache may persist across pipeline jobs.

The _embeddings_cache dict is instance-level. If the same block instance is reused across different pipeline jobs, old trace_ids will accumulate. Consider clearing stale entries or using a bounded cache.
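
One bounded-cache option, as a sketch (eviction policy and size are design choices the PR would have to make; clearing on job completion would work just as well):

```python
from collections import OrderedDict

class BoundedEmbeddingsCache(OrderedDict):
    """Per-trace embedding cache that evicts the oldest trace beyond max_traces."""

    def __init__(self, max_traces: int = 8):
        super().__init__()
        self.max_traces = max_traces

    def __setitem__(self, trace_id, embeddings):
        super().__setitem__(trace_id, embeddings)
        if len(self) > self.max_traces:
            self.popitem(last=False)  # drop the oldest trace's embeddings

cache = BoundedEmbeddingsCache(max_traces=2)
cache["trace-1"] = [[0.1, 0.2]]
cache["trace-2"] = [[0.3, 0.4]]
cache["trace-3"] = [[0.5, 0.6]]  # evicts trace-1
assert "trace-1" not in cache and "trace-3" in cache
```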

lib/blocks/builtin/structure_sampler.py (3)

101-103: Add strict=True to zip for safety.

If parent_fields and parent_key have mismatched lengths, silent truncation occurs. Same issue at line 215.

Proposed fix:

```diff
-                parent_str = ",".join(f"{p}={v}" for p, v in zip(parent_fields, parent_key))
+                parent_str = ",".join(
+                    f"{p}={v}" for p, v in zip(parent_fields, parent_key, strict=True)
+                )
```

198-205: Replace global random with instance-level RNG.

random.choices uses global state. After fixing __init__, update this to use self._rng.choices.

Proposed fix:

```diff
     def _sample_from_distribution(self, probs: dict[str, float]) -> Any:
         """weighted random choice from probability distribution"""
         if not probs:
             return None

         values = list(probs.keys())
         weights = list(probs.values())
-        return random.choices(values, weights=weights, k=1)[0]
+        return self._rng.choices(values, weights=weights, k=1)[0]
```

133-140: Replace global random.sample with instance-level RNG.

Same issue - use self._rng.sample after the __init__ fix.

Proposed fix:

```diff
     def _select_exemplars(
         self, samples: list[dict[str, Any]], max_count: int | None = None
     ) -> list[dict]:
         """randomly select exemplar samples for reference"""
         if max_count is None:
             max_count = self.MAX_EXEMPLARS
         num_exemplars = min(max_count, len(samples))
-        return random.sample(samples, num_exemplars)
+        return self._rng.sample(samples, num_exemplars)
```
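
The __init__ fix both comments build on presumably amounts to holding a seeded random.Random on the instance (the _rng attribute name is taken from the diffs above; the seed parameter is assumed):

```python
import random

class StructureSampler:
    def __init__(self, seed: int | None = 42):
        # instance-level RNG: deterministic per block and isolated from the
        # global random module, which other code may reseed or consume
        self._rng = random.Random(seed)

sampler = StructureSampler(seed=42)
print(sampler._rng.sample(["Free", "Pro", "Enterprise", "Trial"], 2))  # reproducible
```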
tests/blocks/test_duplicate_remover.py (1)

9-21: Simplify make_context helper - redundant copy logic.

Lines 11-12 copy state if initial_state is provided, but then it's overwritten anyway by update(). The conditional copy is unnecessary.

Simplified version
 def make_context(state: dict, initial_state: dict | None = None) -> BlockExecutionContext:
     """helper to create test context"""
-    if initial_state:
-        state = {**state}  # don't mutate
     context = BlockExecutionContext(
         trace_id="test-trace",
         pipeline_id=1,
-        accumulated_state=state,
+        accumulated_state={**state},  # always copy to avoid mutation
     )
     if initial_state:
-        # add initial state items to accumulated_state
         context.accumulated_state.update(initial_state)
     return context
frontend/src/components/pipeline-editor/BlockConfigPanel.tsx (1)

38-39: Consider caching + lifecycle safety for fetched model lists.

State additions are fine, but the current approach will refetch on every mount; if this panel mounts/unmounts frequently, consider lifting/caching at a higher level (or at least guarding setState on unmount).

tests/integration/test_data_augmentation.py (1)

164-176: Consider removing print(...) noise from tests unless debugging is intentional.

Also applies to: 230-231, 286-287

tests/blocks/test_semantic_infiller.py (1)

126-290: Consider pytest fixtures for repeated LLM/mock wiring to keep tests focused.
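
A fixture along these lines would centralize the wiring; the patch target and the pytest-asyncio marker are assumptions (adjust the target to wherever the block resolves litellm):

```python
from unittest.mock import AsyncMock, patch

import litellm  # already a dependency of the blocks under test
import pytest

@pytest.fixture
def mock_llm():
    """Wire the LLM mock once; each test only declares its canned reply."""
    with patch("litellm.acompletion", new_callable=AsyncMock) as mock:
        # real litellm responses are objects; a plain dict is enough for a sketch
        mock.return_value = {"choices": [{"message": {"content": '{"bio": "stub"}'}}]}
        yield mock

@pytest.mark.asyncio  # assumes pytest-asyncio, which async block tests need anyway
async def test_reply_is_canned(mock_llm):
    response = await litellm.acompletion(model="any", messages=[])
    assert response["choices"][0]["message"]["content"] == '{"bio": "stub"}'
```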

lib/blocks/builtin/semantic_infiller.py (1)

104-137: _parse_json_safely: consider non-greedy fallback regex to reduce over-capture.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 67dc447 and 5aa20e7.

📒 Files selected for processing (19)
  • .claude/skills/implementing-datagenflow-blocks/SKILL.md
  • .coderabbit.yaml
  • .gitignore
  • docker/docker-compose.yml
  • docs/template_data_augmentation.md
  • frontend/src/components/pipeline-editor/BlockConfigPanel.tsx
  • frontend/src/components/pipeline-editor/BlockNode.tsx
  • lib/blocks/builtin/duplicate_remover.py
  • lib/blocks/builtin/semantic_infiller.py
  • lib/blocks/builtin/structure_sampler.py
  • lib/templates/data_augmentation.yaml
  • lib/templates/seeds/seed_data_augmentation.json
  • llm/state-backend.md
  • llm/state-project.md
  • pyproject.toml
  • tests/blocks/test_duplicate_remover.py
  • tests/blocks/test_semantic_infiller.py
  • tests/blocks/test_structure_sampler.py
  • tests/integration/test_data_augmentation.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{yaml,yml,json,toml}

⚙️ CodeRabbit configuration file

**/*.{yaml,yml,json,toml}: Review configuration changes:

  • No secrets committed
  • Valid syntax
  • Changes documented if needed
  • Backwards compatible or migration documented

Files:

  • lib/templates/data_augmentation.yaml
  • lib/templates/seeds/seed_data_augmentation.json
  • pyproject.toml
  • docker/docker-compose.yml
frontend/**/*.{ts,tsx,js,jsx}

⚙️ CodeRabbit configuration file

frontend/**/*.{ts,tsx,js,jsx}: Apply frontend code review checklist from llm/rules-frontend.md:
Identify which llm/*.md files need updates:

  • New pages/components → update llm/state-frontend.md
  • Changed UI flow → update llm/state-frontend.md
  • New patterns → update llm/state-frontend.md
    Identify if the docs need updates.
    Golden rule: keep components focused and maintainable.

Files:

  • frontend/src/components/pipeline-editor/BlockConfigPanel.tsx
  • frontend/src/components/pipeline-editor/BlockNode.tsx
llm/**/*.md

⚙️ CodeRabbit configuration file

llm/**/*.md: Review documentation updates:

  • Changes reflect actual code (not aspirational designs)
  • Updates are gradual and incremental (not complete rewrites)
  • Technical and concise
  • Explain what changed and why
  • Note any breaking changes

Files:

  • llm/state-project.md
  • llm/state-backend.md
**/*.py

⚙️ CodeRabbit configuration file

**/*.py: Apply backend code review checklist from llm/rules-backend.md:
Identify which llm/*.md files need updates:

  • New API endpoints → update llm/state-backend.md
  • New blocks → update llm/state-backend.md and llm/state-project.md
  • Changed patterns → update relevant llm/state-*.md
    Identify if the docs need updates.
    Golden rule: if code cannot be explained in one sentence, it's too complex.

Files:

  • tests/integration/test_data_augmentation.py
  • lib/blocks/builtin/semantic_infiller.py
  • tests/blocks/test_duplicate_remover.py
  • tests/blocks/test_semantic_infiller.py
  • tests/blocks/test_structure_sampler.py
  • lib/blocks/builtin/duplicate_remover.py
  • lib/blocks/builtin/structure_sampler.py
tests/**/*.py

⚙️ CodeRabbit configuration file

tests/**/*.py: Review test quality:

  • One behavior per test
  • Test names: test_
  • Error cases tested (not just happy path)
  • Proper use of fixtures
  • Mocks used appropriately
  • Tests are focused and maintainable

Files:

  • tests/integration/test_data_augmentation.py
  • tests/blocks/test_duplicate_remover.py
  • tests/blocks/test_semantic_infiller.py
  • tests/blocks/test_structure_sampler.py
lib/blocks/**/*.py

⚙️ CodeRabbit configuration file

lib/blocks/**/*.py: Apply block implementation checklist from .claude/skills/implementing-datagenflow-blocks/SKILL.md:
Identify which llm/*.md files need updates:

  • New blocks → update llm/state-backend.md and llm/state-project.md
  • Changed block behavior → update relevant llm/state-*.md
    Identify if the docs need updates.
    Golden rule: blocks should be single-responsibility and reusable.

Files:

  • lib/blocks/builtin/semantic_infiller.py
  • lib/blocks/builtin/duplicate_remover.py
  • lib/blocks/builtin/structure_sampler.py
🧬 Code graph analysis (2)
lib/blocks/builtin/duplicate_remover.py (5)
lib/blocks/base.py (1)
  • BaseBlock (8-43)
lib/entities/block_execution_context.py (3)
  • BlockExecutionContext (8-51)
  • copy (44-51)
  • get_state (36-38)
lib/blocks/builtin/semantic_infiller.py (1)
  • execute (139-252)
lib/blocks/builtin/structure_sampler.py (1)
  • execute (284-304)
lib/llm_config.py (1)
  • _prepare_embedding_call (233-272)
lib/blocks/builtin/structure_sampler.py (3)
lib/blocks/base.py (1)
  • BaseMultiplierBlock (46-55)
lib/entities/block_execution_context.py (2)
  • BlockExecutionContext (8-51)
  • get_state (36-38)
lib/errors.py (1)
  • ValidationError (29-32)
🪛 GitHub Actions: Pre-Merge Checks
tests/integration/test_data_augmentation.py

[error] 82-82: F841 Local variable pipeline_id is assigned to but never used.


[error] 169-169: F541 [*] f-string without any placeholders.


[error] 205-205: F841 Local variable pipeline_id is assigned to but never used.


[error] 268-268: F841 Local variable pipeline_id is assigned to but never used.

lib/blocks/builtin/semantic_infiller.py

[error] 29-29: E501 Line too long (109 > 100).


[error] 166-166: E501 Line too long (108 > 100).


[error] 230-230: E501 Line too long (102 > 100).

lib/blocks/builtin/duplicate_remover.py

[error] 23-23: E501 Line too long (115 > 100).


[error] 148-148: E501 Line too long (105 > 100).

🪛 LanguageTool
llm/state-project.md

[grammar] ~101-~101: Ensure spelling is correct
Context: ...init_ signature pass ``` ### builtin blocks (12 total) seeders: - Struc...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.18.1)
docs/template_data_augmentation.md

41-41: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


363-363: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


375-375: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


382-382: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


389-389: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


395-395: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🪛 Ruff (0.14.10)
tests/integration/test_data_augmentation.py

82-82: Local variable pipeline_id is assigned to but never used

Remove assignment to unused variable pipeline_id

(F841)


169-169: f-string without any placeholders

Remove extraneous f prefix

(F541)


205-205: Local variable pipeline_id is assigned to but never used

Remove assignment to unused variable pipeline_id

(F841)


268-268: Local variable pipeline_id is assigned to but never used

Remove assignment to unused variable pipeline_id

(F841)

lib/blocks/builtin/semantic_infiller.py

21-21: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


22-22: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


27-35: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


131-137: Avoid specifying long messages outside the exception class

(TRY003)


154-157: Avoid specifying long messages outside the exception class

(TRY003)


159-162: Avoid specifying long messages outside the exception class

(TRY003)


164-167: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


164-167: Avoid specifying long messages outside the exception class

(TRY003)


165-165: Use explicit conversion flag

Replace with conversion flag

(RUF010)


208-208: Do not catch blind exception: Exception

(BLE001)


209-216: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


209-216: Avoid specifying long messages outside the exception class

(TRY003)


210-210: Use explicit conversion flag

Replace with conversion flag

(RUF010)


223-223: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

lib/blocks/builtin/duplicate_remover.py

17-17: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


18-18: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


20-24: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


151-151: Do not catch blind exception: Exception

(BLE001)

lib/blocks/builtin/structure_sampler.py

17-17: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


18-18: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


24-30: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


36-36: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


37-37: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


52-58: Avoid specifying long messages outside the exception class

(TRY003)


101-101: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


181-184: Avoid specifying long messages outside the exception class

(TRY003)


205-205: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


215-215: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🔇 Additional comments (24)
docker/docker-compose.yml (2)

3-5: LGTM: Build context updated for project structure.

The build context change to parent directory with explicit dockerfile path aligns with organizing Docker files in a dedicated docker/ subdirectory.


17-22: No issues found with the /health endpoint.

The /health endpoint is properly implemented in app.py (line 108) and returns {"status": "healthy"}. The healthcheck configuration is valid and will function correctly. The 40s start period is reasonable for service initialization.

.gitignore (1)

22-23: LGTM: Refined .claude directory tracking.

The pattern change correctly allows tracking .claude/skills/ while ignoring other .claude/ contents. This aligns with the new skills documentation structure.

.coderabbit.yaml (4)

1-11: LGTM: Core review settings configured appropriately.

The configuration enables request changes workflow and disables poems, which is suitable for a technical project.


13-16: LGTM: Auto-review enabled with incremental updates.

Standard configuration for automatic code reviews on PR creation and subsequent pushes.


81-94: LGTM: Chat and knowledge base settings configured.

Auto-reply enabled and learnings scope set to "auto" for appropriate privacy handling. Tone instructions define clear priority levels for review feedback.


18-79: All referenced documentation files exist and are properly linked.

The path instructions in .coderabbit.yaml correctly reference three documentation files that all exist in the repository:

  • llm/rules-backend.md
  • llm/rules-frontend.md
  • .claude/skills/implementing-datagenflow-blocks/SKILL.md

CodeRabbit will be able to enforce these rules as configured.

lib/templates/seeds/seed_data_augmentation.json (2)

1-13: LGTM: Well-structured seed samples.

The 6 seed samples provide good coverage across plan tiers (Free, Pro, Enterprise) with varied roles, storage values, and realistic bios. This should enable effective augmentation to reach the target count of 20.


14-21: LGTM: Metadata configuration is well-defined.

The augmentation parameters are appropriate:

  • Target count (20) provides 3.3x expansion from 6 seeds
  • Field classifications are correct (categorical, numeric)
  • The dependency `role: ["plan"]` is logical
  • Using bio for duplicate comparison is sensible
frontend/src/components/pipeline-editor/BlockNode.tsx (3)

64-87: LGTM: Preview support for new block types.

The implementation cleanly extends preview field selection for the new data augmentation blocks (sampler, infiller, remover) and other block types. Priority keys are appropriate for each category.

Based on coding guidelines, verify if llm/state-frontend.md needs updates for these new block type patterns.


92-99: LGTM: Enhanced validator and score preview fields.

Adding field_name and metric to priority keys improves preview informativeness for validators and score blocks.


108-118: LGTM: Improved preview value formatting.

The special formatting for arrays and objects ([N items], {N keys}) provides cleaner previews. The truncation increase to 25 characters is reasonable.

lib/blocks/builtin/structure_sampler.py (1)

284-304: LGTM - execute method follows the expected pattern.

Reads samples from context, validates, analyzes, and generates skeletons. Clean implementation with appropriate logging.

pyproject.toml (1)

22-22: LGTM - dependency addition for embedding similarity.

scikit-learn>=1.3.0 is required for cosine_similarity in DuplicateRemover. The version constraint allows compatible updates.

tests/blocks/test_duplicate_remover.py (2)

117-234: Good embedding mock tests with cache verification.

Tests properly verify:

  • Duplicate detection above/below threshold
  • Per-trace_id cache behavior with correct call counts
  • Different embedding vectors for similarity computation

236-272: Good error handling coverage.

Tests verify graceful degradation when embedding model is missing or fails, returning is_duplicate=False without raising exceptions.

lib/templates/data_augmentation.yaml (1)

1-23: LGTM - well-structured data augmentation pipeline template.

The three-block sequence (StructureSampler → SemanticInfiller → DuplicateRemover) correctly implements the data augmentation workflow. Template variables are properly referenced using Jinja2 syntax.

Consider making `seed` configurable via a template variable instead of hardcoding 42, if reproducibility should be user-controlled.

tests/blocks/test_structure_sampler.py (3)

43-108: Good distribution calculation tests.

Tests verify probability computation, conditional probabilities, and numeric statistics with appropriate tolerance checks for floating-point comparisons.


189-227: Good edge case coverage.

Tests properly verify:

  • Empty samples raises ValidationError
  • Missing samples raises ValidationError
  • Circular dependencies are detected and raise ValidationError

156-159: Test verifies conditional dependency constraint.

Good assertion that all Free plans have Viewer role (100% conditional probability in sample data).

.claude/skills/implementing-datagenflow-blocks/SKILL.md (1)

1-611: Well-structured skill documentation.

Comprehensive coverage of block implementation patterns including UI integration, LLM calls, state management, caching, and testing. The common mistakes table and implementation checklist are particularly useful.

llm/state-backend.md (1)

12-19: State-backend.md is accurate—contains 14 builtin blocks with correct names and detailed descriptions (lines 417-460). The inconsistency exists elsewhere: state-project.md incorrectly claims 12 blocks (outdated). No changes needed in this file.

Likely an incorrect or invalid review comment.

frontend/src/components/pipeline-editor/BlockConfigPanel.tsx (1)

237-271: The model and embedding_model selection semantics are correctly implemented. The frontend sends null for "Use default model", and the backend properly interprets None as a fallback trigger to use the default model (via get_llm_model(None) and get_embedding_model(None)). Both blocks are properly typed with str | None, templates use null, and tests confirm this pattern works throughout the codebase. No changes needed.
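
The described fallback reduces to a None-coalescing lookup; a sketch with an assumed default constant (get_llm_model is named in the comment, but the body here is illustrative):

```python
DEFAULT_LLM_MODEL = "gpt-4o-mini"  # hypothetical default

def get_llm_model(model: str | None) -> str:
    """None (sent by the frontend for 'Use default model') falls back to the default."""
    return model if model is not None else DEFAULT_LLM_MODEL

assert get_llm_model(None) == DEFAULT_LLM_MODEL
assert get_llm_model("claude-3-haiku") == "claude-3-haiku"
```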

docs/template_data_augmentation.md (1)

405-412: Verify "Related Documentation" link targets resolve in your docs site/router.

Ensure all referenced pages (templates, how_to_use, how_to_create_blocks) and anchors (#structuresampler, #semanticinfiller, #duplicateremover) exist as link targets, and that relative paths are correctly formatted.

@nicofretti nicofretti closed this Jan 20, 2026
@nicofretti nicofretti deleted the 38-feat-template-conversational-data-augmentation branch January 24, 2026 13:47

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • 🚀 Feat: template conversational data augmentation
  • 🚀 Feat: include CodeRabbit as reviewer for the repository

2 participants