
🚀 Feat: template conversational data augmentation #53

Closed
nicofretti wants to merge 13 commits into develop from 38-feat-template-conversational-data-augmentation

Conversation

@nicofretti (Owner) commented Jan 10, 2026

Description

This PR introduces a data augmentation pipeline feature with three new Python blocks (StructureSampler, SemanticInfiller, DuplicateRemover) for generating and validating synthetic records. It includes frontend model selection UI, comprehensive test coverage, documentation, and CodeRabbit configuration.
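At a glance, the three blocks compose as in the toy sketch below. This illustrates the flow only: the real blocks live in lib/blocks/builtin/, operate on a BlockExecutionContext, and call an actual LLM and embedding model.

```python
import asyncio
import random
from typing import Any

def sample_skeletons(seeds: list[dict[str, Any]], n: int) -> list[dict[str, Any]]:
    """Toy StructureSampler: draw categorical values at their seed frequency."""
    plans = [s["plan"] for s in seeds]
    return [{"plan": random.choice(plans), "bio": None} for _ in range(n)]

async def infill(record: dict[str, Any]) -> dict[str, Any]:
    """Toy SemanticInfiller: the real block renders a prompt and calls an LLM."""
    record["bio"] = f"A {record['plan']} user."
    return record

def is_duplicate(record: dict[str, Any], seen: set[str]) -> bool:
    """Toy DuplicateRemover: the real block compares embeddings, not raw text."""
    return record["bio"] in seen

async def main() -> None:
    seeds = [{"plan": "Free"}, {"plan": "Pro"}, {"plan": "Enterprise"}]
    seen: set[str] = set()
    for record in sample_skeletons(seeds, n=5):
        record = await infill(record)
        record["is_duplicate"] = is_duplicate(record, seen)
        seen.add(record["bio"])
        print(record)

asyncio.run(main())
```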

Related Issue

Checklist

  • Code follows project style guidelines
  • Comments explain "why" not "what"
  • Documentation updated (if needed)
  • No debug code or console statements
  • make format passes
  • make pre-merge passes
  • PR updated from the develop branch
  • Copilot review run and addressed

Summary by CodeRabbit

Release Notes

  • New Features

    • Added data augmentation pipeline with semantic field generation and duplicate detection capabilities.
    • Introduced dynamic model selection dropdowns in the pipeline editor for configuring LLM and embedding models.
    • Enhanced block preview display with improved formatting for arrays and objects.
  • Documentation

    • Added comprehensive guide for implementing and extending DataGenFlow blocks.
    • Added data augmentation template documentation with examples and customization options.
  • Chores

    • Added scikit-learn dependency.
    • Updated Docker and code review configurations.


@nicofretti nicofretti linked an issue Jan 10, 2026 that may be closed by this pull request
coderabbitai bot commented Jan 10, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR introduces a data augmentation pipeline feature with three new Python blocks (StructureSampler, SemanticInfiller, DuplicateRemover) for generating and validating synthetic records. It includes frontend model selection UI, comprehensive test coverage, documentation, and CodeRabbit configuration.

Changes

**Infrastructure & Configuration** (`docker/docker-compose.yml`, `.coderabbit.yaml`, `pyproject.toml`, `.gitignore`): Build context/Dockerfile paths updated; healthcheck URL and start_period added; CodeRabbit review config added with language-specific rules and tone guidelines; scikit-learn dependency added; `.claude/skills/` now tracked.
**Documentation** (`.claude/skills/implementing-datagenflow-blocks/SKILL.md`, `docs/template_data_augmentation.md`, `llm/state-backend.md`, `llm/state-project.md`): New comprehensive skill guide for block implementation with templates and patterns; new data augmentation template documentation covering pipeline stages, customization, and examples; updated reference docs with the new blocks and features.
**Backend: Data Augmentation Blocks** (`lib/blocks/builtin/structure_sampler.py`, `lib/blocks/builtin/semantic_infiller.py`, `lib/blocks/builtin/duplicate_remover.py`): StructureSampler learns distributions from seeds and generates skeletons with dependency-aware categorical sampling. SemanticInfiller performs LLM-based field completion with template rendering, JSON parsing, and locked-field restoration. DuplicateRemover performs embedding-based similarity detection with per-trace caching and graceful error handling.
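The "dependency-aware categorical sampling" mentioned here can be pictured as drawing a child field from a distribution conditioned on the parent's already-sampled value. A minimal sketch of that idea, with field names borrowed from the seed example (the real block derives these tables from the configured dependencies and seeds its own RNG):

```python
import random
from collections import Counter, defaultdict

rng = random.Random(42)  # deterministic, mirroring the template's seeding

seeds = [
    {"plan": "Free", "role": "Viewer"},
    {"plan": "Pro", "role": "Editor"},
    {"plan": "Pro", "role": "Admin"},
    {"plan": "Enterprise", "role": "Admin"},
]

# marginal distribution of the parent field
plan_counts = Counter(s["plan"] for s in seeds)

# conditional distribution of the child field given the parent value
role_given_plan: defaultdict[str, Counter] = defaultdict(Counter)
for s in seeds:
    role_given_plan[s["plan"]][s["role"]] += 1

def sample_skeleton() -> dict[str, str]:
    plan = rng.choices(list(plan_counts), weights=list(plan_counts.values()))[0]
    roles = role_given_plan[plan]
    role = rng.choices(list(roles), weights=list(roles.values()))[0]
    return {"plan": plan, "role": role}

print([sample_skeleton() for _ in range(3)])
```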
**Templates** (`lib/templates/data_augmentation.yaml`, `lib/templates/seeds/seed_data_augmentation.json`): YAML pipeline definition for the three-block augmentation flow; seed data example with 6 sample records, categorical/numeric field configs, and dependencies.
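Judging from the review comments further down (6 samples across Free/Pro/Enterprise, target count 20, role depending on plan, bio used for duplicate comparison), the seed file plausibly has a shape like the following. Every key and value here is a reconstruction for illustration, not a copy of the actual file:

```python
# Hypothetical shape of lib/templates/seeds/seed_data_augmentation.json,
# rendered as a Python literal; consult the real file for the actual keys.
seed_data = {
    "samples": [  # 6 records in the real file
        {"plan": "Free", "role": "Viewer", "storage": 5, "bio": "Hobbyist trying out the product."},
        {"plan": "Pro", "role": "Editor", "storage": 100, "bio": "Designer on a small product team."},
        # ...
    ],
    "metadata": {
        "target_count": 20,                  # ~3.3x expansion from 6 seeds
        "categorical_fields": ["plan", "role"],
        "numeric_fields": ["storage"],
        "dependencies": {"role": ["plan"]},  # role is sampled conditionally on plan
        "duplicate_field": "bio",            # text compared by DuplicateRemover
    },
}
```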
**Frontend: Model Integration** (`frontend/src/components/pipeline-editor/BlockConfigPanel.tsx`, `frontend/src/components/pipeline-editor/BlockNode.tsx`): BlockConfigPanel fetches and populates llmModels/embeddingModels via new API endpoints and renders dropdowns for the "model" and "embedding_model" fields. BlockNode gains extended block-type categorization (sampler, infiller, remover, multiplier, etc.) and enhanced value formatting (arrays as [N items], objects as {N keys}), with truncation increased to 25 characters.
**Unit Tests** (`tests/blocks/test_structure_sampler.py`, `tests/blocks/test_semantic_infiller.py`, `tests/blocks/test_duplicate_remover.py`): Comprehensive coverage for each block: initialization, distribution calculations, prompt construction, JSON parsing, embedding caching, error handling, schema validation, and edge cases (missing data, circular dependencies, embedding failures).
**Integration Tests** (`tests/integration/test_data_augmentation.py`): End-to-end pipeline test with a mocked LLM; validates result structure, trace integrity, field constraints, deterministic seeding, and graceful handling of missing embeddings.
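The "mocked LLM" part presumably follows the standard AsyncMock pattern; a self-contained sketch of just that pattern (this is not the repo's actual test code):

```python
import asyncio
import json
from unittest.mock import AsyncMock

async def infill(record: dict, llm) -> dict:
    """Stand-in for an LLM-backed block; `llm` is injected so tests can fake it."""
    raw = await llm(prompt=f"Fill the empty fields: {json.dumps(record)}")
    record.update(json.loads(raw))
    return record

def test_infill_with_mocked_llm():
    fake_llm = AsyncMock(return_value='{"bio": "Generated text."}')
    result = asyncio.run(infill({"plan": "Pro", "bio": None}, fake_llm))
    assert result["bio"] == "Generated text."  # structure checked, no real API call
    fake_llm.assert_awaited_once()

test_infill_with_mocked_llm()
```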

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Pipeline
    participant StructureSampler as Structure<br/>Sampler
    participant SemanticInfiller as Semantic<br/>Infiller
    participant LLM as LLM API
    participant DuplicateRemover as Duplicate<br/>Remover
    participant EmbedModel as Embedding<br/>Model

    User->>Pipeline: Trigger pipeline with seed
    Pipeline->>StructureSampler: Read seed samples
    StructureSampler->>StructureSampler: Analyze distributions & dependencies
    StructureSampler->>Pipeline: Return skeleton records (N items)

    loop For each skeleton
        Pipeline->>SemanticInfiller: Process skeleton
        SemanticInfiller->>SemanticInfiller: Build constrained prompt
        SemanticInfiller->>LLM: Request field generation
        LLM-->>SemanticInfiller: Generated JSON with fields
        SemanticInfiller->>SemanticInfiller: Parse & restore locked fields
        SemanticInfiller->>Pipeline: Return enriched record

        Pipeline->>DuplicateRemover: Check against reference dataset
        DuplicateRemover->>EmbedModel: Embed current record text
        DuplicateRemover->>DuplicateRemover: Compute similarity (cached refs)
        DuplicateRemover->>Pipeline: Return record + is_duplicate flag
    end

    Pipeline->>User: Return augmented dataset
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • **Docstring Coverage** ⚠️ Warning: Docstring coverage is 33.78%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • **Description check** ❓ Inconclusive: The PR description is vague and incomplete; it provides a general overview but lacks specific implementation details, context, or clarity about the changes made. Resolution: expand the description to explain (1) what each block does and why it is needed, (2) how the data augmentation pipeline works, (3) key architectural decisions, (4) the purpose of the frontend changes, and (5) how to use the new feature, with specific examples where possible.

✅ Passed checks (3 passed)

  • **Title check** ✅: The title is partially related to the changeset. It mentions "template conversational data augmentation", but the PR primarily implements three new blocks (StructureSampler, SemanticInfiller, DuplicateRemover), a data augmentation template, documentation, and frontend changes, which is broader than just conversational data.
  • **Linked Issues check** ✅: The changeset addresses issue #38 by implementing a data augmentation pipeline with three blocks that maintain coherence and diversity. Issue #52 is addressed by adding the .coderabbit.yaml configuration enabling CodeRabbit reviews.
  • **Out of Scope Changes check** ✅: Changes are well-scoped to the linked issues. Docker/gitignore updates support the infrastructure, frontend changes enhance model selection, and all modifications serve the core data augmentation and CodeRabbit integration objectives.




@nicofretti nicofretti changed the base branch from main to develop January 10, 2026 20:21
@nicofretti nicofretti self-assigned this Jan 10, 2026
@nicofretti nicofretti added the enhancement New feature or request label Jan 10, 2026
@nicofretti nicofretti marked this pull request as ready for review January 10, 2026 20:49
@nicofretti nicofretti marked this pull request as draft January 10, 2026 20:51
@nicofretti nicofretti marked this pull request as ready for review January 10, 2026 20:51
@nicofretti (Owner, Author) commented:

@coderabbitai review

coderabbitai bot commented Jan 10, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai bot commented Jan 10, 2026

Note

Docstrings generation - SUCCESS
Generated docstrings for this pull request at #54

coderabbitai bot added a commit that referenced this pull request Jan 10, 2026
Docstrings generation was requested by @nicofretti.

* #53 (comment)

The following files were modified:

* `frontend/src/components/pipeline-editor/BlockConfigPanel.tsx`
* `frontend/src/components/pipeline-editor/BlockNode.tsx`
* `lib/blocks/builtin/duplicate_remover.py`
* `lib/blocks/builtin/semantic_infiller.py`
* `lib/blocks/builtin/structure_sampler.py`
* `tests/blocks/test_duplicate_remover.py`
* `tests/blocks/test_semantic_infiller.py`
* `tests/blocks/test_structure_sampler.py`
* `tests/integration/test_data_augmentation.py`
coderabbitai bot left a comment

Actionable comments posted: 12

🤖 Fix all issues with AI agents
In @.claude/skills/implementing-datagenflow-blocks/SKILL.md:
- Around line 417-419: The SKILL example for the StructureSampler class has the
wrong category string; update the class attribute on StructureSampler (the name
= "Structure Sampler" class) to use category = "seeders" instead of "generators"
so it matches the implementation in structure_sampler.py and other references.
- Around line 414-427: The example for the multiplier block uses the wrong execute signature: change StructureSampler.execute to accept a BlockExecutionContext (import it) instead of initial_data: dict[str, Any]; update any references inside the method to use the context (e.g., context.input or context.state) and keep the return type as list[dict[str, Any]] to match BaseMultiplierBlock's actual implementation and the real structure_sampler.py. (A sketch of this corrected shape appears after this list.)

In @docs/template_data_augmentation.md:
- Around line 361-399: Replace the bold "Step X" lines that trigger MD036 with
proper Markdown headings: change lines like "**Step 1: Prepare samples (6
examples)**", "**Step 2: Create pipeline from template**", "**Step 3: Start
generation**", "**Step 4: Monitor progress**", and "**Step 5: Review and
export**" into heading syntax (for example "### Step 1: Prepare samples (6
examples)" etc.), ensuring consistent heading level across all steps and leaving
the surrounding code blocks and prose unchanged.
- Around line 41-55: The fenced diagram block in
docs/template_data_augmentation.md triggers markdownlint MD040 because it lacks
a language tag; update the opening fence for the ASCII diagram to include a
language (e.g., change the opening "```" to "```text") so the block becomes a
language-specified fenced code block and markdownlint MD040 is satisfied; locate
the ASCII diagram block (the box/arrow diagram under the Structure → Semantic →
Duplicate heading) and replace its opening fence accordingly.

In @lib/blocks/builtin/duplicate_remover.py:
- Around line 17-24: The "_config_descriptions" entry for "embedding_model" in
duplicate_remover.py is longer than 100 chars; shorten or split that description
so no line exceeds 100 chars — e.g., rephrase the description to be under 100
characters or break it into concatenated shorter strings for the
"embedding_model" value in the _config_descriptions dict (keep the same key name
and meaning).

In @lib/blocks/builtin/semantic_infiller.py:
- Around line 27-35: The block currently mutates the instance attribute
self.fields_to_generate inside execute(), making it non-reentrant; instead, keep
the original config immutable by reading self.fields_to_generate (and any
related attrs) and working on a local variable (e.g., fields_to_generate_local)
for parsing, template rendering, validation and any conversions; update only
local state and return results without assigning back to
self.fields_to_generate, and apply the same change to the code regions
referenced around execute() (lines ~139-252) to ensure thread-safety and avoid
side effects.

In @llm/state-project.md:
- Around line 31-33: Update llm/state-project.md to reflect the correct number
and list of builtin blocks: change the count from 12 to 14 wherever it appears
and replace the partial list with the full set of block names to match
llm/state-backend.md — include StructureSampler, TextGenerator,
StructuredGenerator, SemanticInfiller, MarkdownMultiplierBlock, ValidatorBlock,
JSONValidatorBlock, DuplicateRemover, DiversityScore, CoherenceScore,
RougeScore, LangfuseBlock, FieldMapper, and RagasMetrics; ensure the header
comment near builtin/ and the block listing sections all consistently show "14
blocks" and enumerate these 14 block implementations.

In @tests/blocks/test_semantic_infiller.py:
- Around line 92-124: Tests instantiate SemanticInfiller with fields_to_generate
as Python lists, but the real config expects a JSON string; update each test to
pass a JSON string (e.g., fields_to_generate='["bio"]' or json.dumps(["bio"]))
when creating SemanticInfiller in test_parse_valid_json,
test_parse_json_with_markdown, test_parse_json_embedded_in_text, and
test_parse_invalid_json_raises_error so the block receives the same shape as
production and _parse_json_safely is exercised correctly.

In @tests/integration/test_data_augmentation.py:
- Around line 81-90: Ruff errors F841 (assigned-but-unused) and F541 (f-string
missing placeholders) are caused by unused locals and incorrectly formatted
f-strings in this test; fix by either using or discarding the assigned variables
and by removing the stray f-prefix from literal strings or adding proper
{placeholders}. Specifically, for the assignments pipeline_id, pipeline =
Pipeline(...), and initial_data in the shown block (and the other occurrences at
164-176, 205-207, 268-271) either (a) use the values in an assertion or
subsequent call (e.g., assert pipeline_id is not None or assert pipeline is
instance of Pipeline) or (b) rename them to _pipeline_id/_pipeline or prefix
with an underscore to indicate intentional unused values; for any f"..." strings
flagged F541, remove the leading f if there is no interpolation or replace with
f"...{var}..." including the correct variable names if interpolation was
intended.
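
For the two execute()-related items above (the BaseMultiplierBlock signature and the fields_to_generate mutation), the corrected shape presumably looks roughly like this. BlockExecutionContext is a minimal stand-in for the real class in lib/entities/block_execution_context.py, and the method bodies are placeholders rather than the blocks' actual logic:

```python
import asyncio
import json
from typing import Any

class BlockExecutionContext:
    """Stand-in for lib/entities/block_execution_context.py."""
    def __init__(self, state: dict[str, Any]):
        self.accumulated_state = state
    def get_state(self, key: str) -> Any:
        return self.accumulated_state.get(key)

class StructureSampler:
    # corrected signature: context in, list of skeleton records out
    async def execute(self, context: BlockExecutionContext) -> list[dict[str, Any]]:
        samples = context.get_state("samples") or []
        return [dict(s) for s in samples]  # the real block samples new skeletons

class SemanticInfiller:
    def __init__(self, fields_to_generate: str = "[]"):
        # config arrives as a JSON string (see the test fix above) and stays immutable
        self.fields_to_generate = fields_to_generate

    async def execute(self, context: BlockExecutionContext) -> dict[str, Any]:
        # parse into a local variable instead of assigning back to self.*,
        # so concurrent executions of the same instance cannot interfere
        fields = json.loads(self.fields_to_generate)
        record = dict(context.get_state("record") or {})
        for field in fields:
            record.setdefault(field, "<generated>")  # the real block calls the LLM here
        return record

ctx = BlockExecutionContext({"record": {"plan": "Pro"}})
print(asyncio.run(SemanticInfiller('["bio"]').execute(ctx)))
```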
🧹 Nitpick comments (10)
lib/blocks/builtin/duplicate_remover.py (2)

117-130: Missing _usage tracking for embedding API calls.

Per coding guidelines, LLM/embedding calls should track and return usage metrics. The embedding responses from litellm include usage data that should be captured.

Track embedding usage:

```diff
                 response = await litellm.aembedding(**embedding_params)

                 self._embeddings_cache[trace_id] = [
                     item["embedding"] for item in response.data
                 ]
+
+                # track reference embedding usage
+                if hasattr(response, 'usage') and response.usage:
+                    logger.info(
+                        f"Reference embeddings usage: {response.usage.total_tokens} tokens"
+                    )
```

36-37: Instance-level cache may persist across pipeline jobs.

The _embeddings_cache dict is instance-level. If the same block instance is reused across different pipeline jobs, old trace_ids will accumulate. Consider clearing stale entries or using a bounded cache.
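
One bounded-cache option, as a sketch (eviction policy and size are design choices the PR would have to make; clearing on job completion would work just as well):

```python
from collections import OrderedDict

class BoundedEmbeddingsCache(OrderedDict):
    """Per-trace embedding cache that evicts the oldest trace beyond max_traces."""

    def __init__(self, max_traces: int = 8):
        super().__init__()
        self.max_traces = max_traces

    def __setitem__(self, trace_id, embeddings):
        super().__setitem__(trace_id, embeddings)
        if len(self) > self.max_traces:
            self.popitem(last=False)  # drop the oldest trace's embeddings

cache = BoundedEmbeddingsCache(max_traces=2)
cache["trace-1"] = [[0.1, 0.2]]
cache["trace-2"] = [[0.3, 0.4]]
cache["trace-3"] = [[0.5, 0.6]]  # evicts trace-1
assert "trace-1" not in cache and "trace-3" in cache
```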

lib/blocks/builtin/structure_sampler.py (3)

101-103: Add strict=True to zip for safety.

If parent_fields and parent_key have mismatched lengths, silent truncation occurs. Same issue at line 215.

Proposed fix:

```diff
-                parent_str = ",".join(f"{p}={v}" for p, v in zip(parent_fields, parent_key))
+                parent_str = ",".join(
+                    f"{p}={v}" for p, v in zip(parent_fields, parent_key, strict=True)
+                )
```

198-205: Replace global random with instance-level RNG.

random.choices uses global state. After fixing __init__, update this to use self._rng.choices.

Proposed fix:

```diff
     def _sample_from_distribution(self, probs: dict[str, float]) -> Any:
         """weighted random choice from probability distribution"""
         if not probs:
             return None

         values = list(probs.keys())
         weights = list(probs.values())
-        return random.choices(values, weights=weights, k=1)[0]
+        return self._rng.choices(values, weights=weights, k=1)[0]
```

133-140: Replace global random.sample with instance-level RNG.

Same issue - use self._rng.sample after the __init__ fix.

Proposed fix:

```diff
     def _select_exemplars(
         self, samples: list[dict[str, Any]], max_count: int | None = None
     ) -> list[dict]:
         """randomly select exemplar samples for reference"""
         if max_count is None:
             max_count = self.MAX_EXEMPLARS
         num_exemplars = min(max_count, len(samples))
-        return random.sample(samples, num_exemplars)
+        return self._rng.sample(samples, num_exemplars)
```
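
The __init__ fix both comments build on presumably amounts to holding a seeded random.Random on the instance (the _rng attribute name is taken from the diffs above; the seed parameter is assumed):

```python
import random

class StructureSampler:
    def __init__(self, seed: int | None = 42):
        # instance-level RNG: deterministic per block and isolated from the
        # global random module, which other code may reseed or consume
        self._rng = random.Random(seed)

sampler = StructureSampler(seed=42)
print(sampler._rng.sample(["Free", "Pro", "Enterprise", "Trial"], 2))  # reproducible
```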
tests/blocks/test_duplicate_remover.py (1)

9-21: Simplify make_context helper - redundant copy logic.

Lines 11-12 copy state if initial_state is provided, but then it's overwritten anyway by update(). The conditional copy is unnecessary.

Simplified version
 def make_context(state: dict, initial_state: dict | None = None) -> BlockExecutionContext:
     """helper to create test context"""
-    if initial_state:
-        state = {**state}  # don't mutate
     context = BlockExecutionContext(
         trace_id="test-trace",
         pipeline_id=1,
-        accumulated_state=state,
+        accumulated_state={**state},  # always copy to avoid mutation
     )
     if initial_state:
-        # add initial state items to accumulated_state
         context.accumulated_state.update(initial_state)
     return context
frontend/src/components/pipeline-editor/BlockConfigPanel.tsx (1)

38-39: Consider caching + lifecycle safety for fetched model lists.

State additions are fine, but the current approach will refetch on every mount; if this panel mounts/unmounts frequently, consider lifting/caching at a higher level (or at least guarding setState on unmount).

tests/integration/test_data_augmentation.py (1)

164-176: Consider removing print(...) noise from tests unless debugging is intentional.

Also applies to: 230-231, 286-287

tests/blocks/test_semantic_infiller.py (1)

126-290: Consider pytest fixtures for repeated LLM/mock wiring to keep tests focused.
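
A fixture along these lines would centralize the wiring; the patch target and the pytest-asyncio marker are assumptions (adjust the target to wherever the block resolves litellm):

```python
from unittest.mock import AsyncMock, patch

import litellm  # already a dependency of the blocks under test
import pytest

@pytest.fixture
def mock_llm():
    """Wire the LLM mock once; each test only declares its canned reply."""
    with patch("litellm.acompletion", new_callable=AsyncMock) as mock:
        # real litellm responses are objects; a plain dict is enough for a sketch
        mock.return_value = {"choices": [{"message": {"content": '{"bio": "stub"}'}}]}
        yield mock

@pytest.mark.asyncio  # assumes pytest-asyncio, which async block tests need anyway
async def test_reply_is_canned(mock_llm):
    response = await litellm.acompletion(model="any", messages=[])
    assert response["choices"][0]["message"]["content"] == '{"bio": "stub"}'
```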

lib/blocks/builtin/semantic_infiller.py (1)

104-137: _parse_json_safely: consider non-greedy fallback regex to reduce over-capture.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 67dc447 and 5aa20e7.

📒 Files selected for processing (19)
  • .claude/skills/implementing-datagenflow-blocks/SKILL.md
  • .coderabbit.yaml
  • .gitignore
  • docker/docker-compose.yml
  • docs/template_data_augmentation.md
  • frontend/src/components/pipeline-editor/BlockConfigPanel.tsx
  • frontend/src/components/pipeline-editor/BlockNode.tsx
  • lib/blocks/builtin/duplicate_remover.py
  • lib/blocks/builtin/semantic_infiller.py
  • lib/blocks/builtin/structure_sampler.py
  • lib/templates/data_augmentation.yaml
  • lib/templates/seeds/seed_data_augmentation.json
  • llm/state-backend.md
  • llm/state-project.md
  • pyproject.toml
  • tests/blocks/test_duplicate_remover.py
  • tests/blocks/test_semantic_infiller.py
  • tests/blocks/test_structure_sampler.py
  • tests/integration/test_data_augmentation.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{yaml,yml,json,toml}

⚙️ CodeRabbit configuration file

**/*.{yaml,yml,json,toml}: Review configuration changes:

  • No secrets committed
  • Valid syntax
  • Changes documented if needed
  • Backwards compatible or migration documented

Files:

  • lib/templates/data_augmentation.yaml
  • lib/templates/seeds/seed_data_augmentation.json
  • pyproject.toml
  • docker/docker-compose.yml
frontend/**/*.{ts,tsx,js,jsx}

⚙️ CodeRabbit configuration file

frontend/**/*.{ts,tsx,js,jsx}: Apply frontend code review checklist from llm/rules-frontend.md:
Identify which llm/*.md files need updates:

  • New pages/components → update llm/state-frontend.md
  • Changed UI flow → update llm/state-frontend.md
  • New patterns → update llm/state-frontend.md
    Identify if the docs need updates.
    Golden rule: keep components focused and maintainable.

Files:

  • frontend/src/components/pipeline-editor/BlockConfigPanel.tsx
  • frontend/src/components/pipeline-editor/BlockNode.tsx
llm/**/*.md

⚙️ CodeRabbit configuration file

llm/**/*.md: Review documentation updates:

  • Changes reflect actual code (not aspirational designs)
  • Updates are gradual and incremental (not complete rewrites)
  • Technical and concise
  • Explain what changed and why
  • Note any breaking changes

Files:

  • llm/state-project.md
  • llm/state-backend.md
**/*.py

⚙️ CodeRabbit configuration file

**/*.py: Apply backend code review checklist from llm/rules-backend.md:
Identify which llm/*.md files need updates:

  • New API endpoints → update llm/state-backend.md
  • New blocks → update llm/state-backend.md and llm/state-project.md
  • Changed patterns → update relevant llm/state-*.md
    Identify if the docs need updates.
    Golden rule: if code cannot be explained in one sentence, it's too complex.

Files:

  • tests/integration/test_data_augmentation.py
  • lib/blocks/builtin/semantic_infiller.py
  • tests/blocks/test_duplicate_remover.py
  • tests/blocks/test_semantic_infiller.py
  • tests/blocks/test_structure_sampler.py
  • lib/blocks/builtin/duplicate_remover.py
  • lib/blocks/builtin/structure_sampler.py
tests/**/*.py

⚙️ CodeRabbit configuration file

tests/**/*.py: Review test quality:

  • One behavior per test
  • Test names: test_
  • Error cases tested (not just happy path)
  • Proper use of fixtures
  • Mocks used appropriately
  • Tests are focused and maintainable

Files:

  • tests/integration/test_data_augmentation.py
  • tests/blocks/test_duplicate_remover.py
  • tests/blocks/test_semantic_infiller.py
  • tests/blocks/test_structure_sampler.py
lib/blocks/**/*.py

⚙️ CodeRabbit configuration file

lib/blocks/**/*.py: Apply block implementation checklist from .claude/skills/implementing-datagenflow-blocks/SKILL.md:
Identify which llm/*.md files need updates:

  • New blocks → update llm/state-backend.md and llm/state-project.md
  • Changed block behavior → update relevant llm/state-*.md
    Identify if the docs need updates.
    Golden rule: blocks should be single-responsibility and reusable.

Files:

  • lib/blocks/builtin/semantic_infiller.py
  • lib/blocks/builtin/duplicate_remover.py
  • lib/blocks/builtin/structure_sampler.py
🧬 Code graph analysis (2)
lib/blocks/builtin/duplicate_remover.py (5)
lib/blocks/base.py (1)
  • BaseBlock (8-43)
lib/entities/block_execution_context.py (3)
  • BlockExecutionContext (8-51)
  • copy (44-51)
  • get_state (36-38)
lib/blocks/builtin/semantic_infiller.py (1)
  • execute (139-252)
lib/blocks/builtin/structure_sampler.py (1)
  • execute (284-304)
lib/llm_config.py (1)
  • _prepare_embedding_call (233-272)
lib/blocks/builtin/structure_sampler.py (3)
lib/blocks/base.py (1)
  • BaseMultiplierBlock (46-55)
lib/entities/block_execution_context.py (2)
  • BlockExecutionContext (8-51)
  • get_state (36-38)
lib/errors.py (1)
  • ValidationError (29-32)
🪛 GitHub Actions: Pre-Merge Checks
tests/integration/test_data_augmentation.py

[error] 82-82: F841 Local variable pipeline_id is assigned to but never used.


[error] 169-169: F541 [*] f-string without any placeholders.


[error] 205-205: F841 Local variable pipeline_id is assigned to but never used.


[error] 268-268: F841 Local variable pipeline_id is assigned to but never used.

lib/blocks/builtin/semantic_infiller.py

[error] 29-29: E501 Line too long (109 > 100).


[error] 166-166: E501 Line too long (108 > 100).


[error] 230-230: E501 Line too long (102 > 100).

lib/blocks/builtin/duplicate_remover.py

[error] 23-23: E501 Line too long (115 > 100).


[error] 148-148: E501 Line too long (105 > 100).

🪛 LanguageTool
llm/state-project.md

[grammar] ~101-~101: Ensure spelling is correct
Context: ...init_ signature pass ``` ### builtin blocks (12 total) seeders: - Struc...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.18.1)
docs/template_data_augmentation.md

41-41: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


363-363: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


375-375: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


382-382: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


389-389: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


395-395: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🪛 Ruff (0.14.10)
tests/integration/test_data_augmentation.py

82-82: Local variable pipeline_id is assigned to but never used

Remove assignment to unused variable pipeline_id

(F841)


169-169: f-string without any placeholders

Remove extraneous f prefix

(F541)


205-205: Local variable pipeline_id is assigned to but never used

Remove assignment to unused variable pipeline_id

(F841)


268-268: Local variable pipeline_id is assigned to but never used

Remove assignment to unused variable pipeline_id

(F841)

lib/blocks/builtin/semantic_infiller.py

21-21: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


22-22: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


27-35: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


131-137: Avoid specifying long messages outside the exception class

(TRY003)


154-157: Avoid specifying long messages outside the exception class

(TRY003)


159-162: Avoid specifying long messages outside the exception class

(TRY003)


164-167: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


164-167: Avoid specifying long messages outside the exception class

(TRY003)


165-165: Use explicit conversion flag

Replace with conversion flag

(RUF010)


208-208: Do not catch blind exception: Exception

(BLE001)


209-216: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


209-216: Avoid specifying long messages outside the exception class

(TRY003)


210-210: Use explicit conversion flag

Replace with conversion flag

(RUF010)


223-223: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

lib/blocks/builtin/duplicate_remover.py

17-17: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


18-18: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


20-24: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


151-151: Do not catch blind exception: Exception

(BLE001)

lib/blocks/builtin/structure_sampler.py

17-17: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


18-18: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


24-30: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


36-36: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


37-37: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


52-58: Avoid specifying long messages outside the exception class

(TRY003)


101-101: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


181-184: Avoid specifying long messages outside the exception class

(TRY003)


205-205: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


215-215: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🔇 Additional comments (24)
docker/docker-compose.yml (2)

3-5: LGTM: Build context updated for project structure.

The build context change to parent directory with explicit dockerfile path aligns with organizing Docker files in a dedicated docker/ subdirectory.


17-22: No issues found with the /health endpoint.

The /health endpoint is properly implemented in app.py (line 108) and returns {"status": "healthy"}. The healthcheck configuration is valid and will function correctly. The 40s start period is reasonable for service initialization.

.gitignore (1)

22-23: LGTM: Refined .claude directory tracking.

The pattern change correctly allows tracking .claude/skills/ while ignoring other .claude/ contents. This aligns with the new skills documentation structure.

.coderabbit.yaml (4)

1-11: LGTM: Core review settings configured appropriately.

The configuration enables request changes workflow and disables poems, which is suitable for a technical project.


13-16: LGTM: Auto-review enabled with incremental updates.

Standard configuration for automatic code reviews on PR creation and subsequent pushes.


81-94: LGTM: Chat and knowledge base settings configured.

Auto-reply enabled and learnings scope set to "auto" for appropriate privacy handling. Tone instructions define clear priority levels for review feedback.


18-79: All referenced documentation files exist and are properly linked.

The path instructions in .coderabbit.yaml correctly reference three documentation files that all exist in the repository:

  • llm/rules-backend.md
  • llm/rules-frontend.md
  • .claude/skills/implementing-datagenflow-blocks/SKILL.md

CodeRabbit will be able to enforce these rules as configured.

lib/templates/seeds/seed_data_augmentation.json (2)

1-13: LGTM: Well-structured seed samples.

The 6 seed samples provide good coverage across plan tiers (Free, Pro, Enterprise) with varied roles, storage values, and realistic bios. This should enable effective augmentation to reach the target count of 20.


14-21: LGTM: Metadata configuration is well-defined.

The augmentation parameters are appropriate:

  • Target count (20) provides 3.3x expansion from 6 seeds
  • Field classifications are correct (categorical, numeric)
  • The dependency `role: ["plan"]` is logical
  • Using bio for duplicate comparison is sensible
frontend/src/components/pipeline-editor/BlockNode.tsx (3)

64-87: LGTM: Preview support for new block types.

The implementation cleanly extends preview field selection for the new data augmentation blocks (sampler, infiller, remover) and other block types. Priority keys are appropriate for each category.

Based on coding guidelines, verify if llm/state-frontend.md needs updates for these new block type patterns.


92-99: LGTM: Enhanced validator and score preview fields.

Adding field_name and metric to priority keys improves preview informativeness for validators and score blocks.


108-118: LGTM: Improved preview value formatting.

The special formatting for arrays and objects ([N items], {N keys}) provides cleaner previews. The truncation increase to 25 characters is reasonable.

lib/blocks/builtin/structure_sampler.py (1)

284-304: LGTM - execute method follows the expected pattern.

Reads samples from context, validates, analyzes, and generates skeletons. Clean implementation with appropriate logging.

pyproject.toml (1)

22-22: LGTM - dependency addition for embedding similarity.

scikit-learn>=1.3.0 is required for cosine_similarity in DuplicateRemover. The version constraint allows compatible updates.

tests/blocks/test_duplicate_remover.py (2)

117-234: Good embedding mock tests with cache verification.

Tests properly verify:

  • Duplicate detection above/below threshold
  • Per-trace_id cache behavior with correct call counts
  • Different embedding vectors for similarity computation

236-272: Good error handling coverage.

Tests verify graceful degradation when embedding model is missing or fails, returning is_duplicate=False without raising exceptions.

lib/templates/data_augmentation.yaml (1)

1-23: LGTM - well-structured data augmentation pipeline template.

The three-block sequence (StructureSampler → SemanticInfiller → DuplicateRemover) correctly implements the data augmentation workflow. Template variables are properly referenced using Jinja2 syntax.

Consider making `seed` configurable via a template variable instead of hardcoding 42, if reproducibility should be user-controlled.

tests/blocks/test_structure_sampler.py (3)

43-108: Good distribution calculation tests.

Tests verify probability computation, conditional probabilities, and numeric statistics with appropriate tolerance checks for floating-point comparisons.


189-227: Good edge case coverage.

Tests properly verify:

  • Empty samples raises ValidationError
  • Missing samples raises ValidationError
  • Circular dependencies are detected and raise ValidationError

156-159: Test verifies conditional dependency constraint.

Good assertion that all Free plans have Viewer role (100% conditional probability in sample data).

.claude/skills/implementing-datagenflow-blocks/SKILL.md (1)

1-611: Well-structured skill documentation.

Comprehensive coverage of block implementation patterns including UI integration, LLM calls, state management, caching, and testing. The common mistakes table and implementation checklist are particularly useful.

llm/state-backend.md (1)

12-19: State-backend.md is accurate—contains 14 builtin blocks with correct names and detailed descriptions (lines 417-460). The inconsistency exists elsewhere: state-project.md incorrectly claims 12 blocks (outdated). No changes needed in this file.

Likely an incorrect or invalid review comment.

frontend/src/components/pipeline-editor/BlockConfigPanel.tsx (1)

237-271: The model and embedding_model selection semantics are correctly implemented. The frontend sends null for "Use default model", and the backend properly interprets None as a fallback trigger to use the default model (via get_llm_model(None) and get_embedding_model(None)). Both blocks are properly typed with str | None, templates use null, and tests confirm this pattern works throughout the codebase. No changes needed.
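
The described fallback reduces to a None-coalescing lookup; a sketch with an assumed default constant (get_llm_model is named in the comment, but the body here is illustrative):

```python
DEFAULT_LLM_MODEL = "gpt-4o-mini"  # hypothetical default

def get_llm_model(model: str | None) -> str:
    """None (sent by the frontend for 'Use default model') falls back to the default."""
    return model if model is not None else DEFAULT_LLM_MODEL

assert get_llm_model(None) == DEFAULT_LLM_MODEL
assert get_llm_model("claude-3-haiku") == "claude-3-haiku"
```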

docs/template_data_augmentation.md (1)

405-412: Verify "Related Documentation" link targets resolve in your docs site/router.

Ensure all referenced pages (templates, how_to_use, how_to_create_blocks) and anchors (#structuresampler, #semanticinfiller, #duplicateremover) exist as link targets, and that relative paths are correctly formatted.

@nicofretti nicofretti closed this Jan 20, 2026
@nicofretti nicofretti deleted the 38-feat-template-conversational-data-augmentation branch January 24, 2026 13:47

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • 🚀 Feat: template conversational data augmentation
  • 🚀 Feat: include CodeRabbit as reviewer for the repository

2 participants