🚀 Feat: template conversational data augmentation #53

nicofretti wants to merge 13 commits into develop from
Conversation
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI. You can disable this status message in the settings.

Walkthrough

This PR introduces a data augmentation pipeline feature with three new Python blocks (StructureSampler, SemanticInfiller, DuplicateRemover) for generating and validating synthetic records. It includes frontend model selection UI, comprehensive test coverage, documentation, and CodeRabbit configuration.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Pipeline
    participant StructureSampler as Structure<br/>Sampler
    participant SemanticInfiller as Semantic<br/>Infiller
    participant LLM as LLM API
    participant DuplicateRemover as Duplicate<br/>Remover
    participant EmbedModel as Embedding<br/>Model
    User->>Pipeline: Trigger pipeline with seed
    Pipeline->>StructureSampler: Read seed samples
    StructureSampler->>StructureSampler: Analyze distributions & dependencies
    StructureSampler->>Pipeline: Return skeleton records (N items)
    loop For each skeleton
        Pipeline->>SemanticInfiller: Process skeleton
        SemanticInfiller->>SemanticInfiller: Build constrained prompt
        SemanticInfiller->>LLM: Request field generation
        LLM-->>SemanticInfiller: Generated JSON with fields
        SemanticInfiller->>SemanticInfiller: Parse & restore locked fields
        SemanticInfiller->>Pipeline: Return enriched record
        Pipeline->>DuplicateRemover: Check against reference dataset
        DuplicateRemover->>EmbedModel: Embed current record text
        DuplicateRemover->>DuplicateRemover: Compute similarity (cached refs)
        DuplicateRemover->>Pipeline: Return record + is_duplicate flag
    end
    Pipeline->>User: Return augmented dataset
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
@coderabbitai review
✅ Actions performed: Review triggered.
Note: Docstrings generation - SUCCESS
Docstrings generation was requested by @nicofretti.

* #53 (comment)

The following files were modified:

* `frontend/src/components/pipeline-editor/BlockConfigPanel.tsx`
* `frontend/src/components/pipeline-editor/BlockNode.tsx`
* `lib/blocks/builtin/duplicate_remover.py`
* `lib/blocks/builtin/semantic_infiller.py`
* `lib/blocks/builtin/structure_sampler.py`
* `tests/blocks/test_duplicate_remover.py`
* `tests/blocks/test_semantic_infiller.py`
* `tests/blocks/test_structure_sampler.py`
* `tests/integration/test_data_augmentation.py`
Actionable comments posted: 12
🤖 Fix all issues with AI agents
In @.claude/skills/implementing-datagenflow-blocks/SKILL.md:
- Around line 417-419: The SKILL example for the StructureSampler class has the
wrong category string; update the class attribute on StructureSampler (the name
= "Structure Sampler" class) to use category = "seeders" instead of "generators"
so it matches the implementation in structure_sampler.py and other references.
- Around line 414-427: The example for the multiplier block uses the wrong
execute signature: change StructureSampler.execute to accept a
BlockExecutionContext (import it) instead of initial_data: dict[str, Any];
update any references inside the method to use the context (e.g., context.input
or context.state) and keep the return type as list[dict[str, Any]] to match
BaseMultiplierBlock’s actual implementation and the real structure_sampler.py.
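To make the corrected signature concrete, here is a minimal, self-contained sketch. The `BlockExecutionContext` stand-in mirrors only the attributes mentioned in this review (`trace_id`, `pipeline_id`, `accumulated_state`, `get_state`); the real classes in `lib/blocks/base.py` and `lib/entities/block_execution_context.py` do far more, and the `execute` body here is purely illustrative.

```python
from typing import Any

# stand-in for lib/entities/block_execution_context.py (illustrative only)
class BlockExecutionContext:
    def __init__(self, trace_id: str, pipeline_id: int, accumulated_state: dict):
        self.trace_id = trace_id
        self.pipeline_id = pipeline_id
        self.accumulated_state = accumulated_state

    def get_state(self, key: str, default: Any = None) -> Any:
        return self.accumulated_state.get(key, default)

class StructureSampler:
    name = "Structure Sampler"
    category = "seeders"  # matches structure_sampler.py, not "generators"

    def execute(self, context: BlockExecutionContext) -> list[dict[str, Any]]:
        # read seed samples from the execution context, not from initial_data
        samples = context.get_state("samples", [])
        # a real multiplier block would analyze distributions here;
        # this placeholder just emits one skeleton per seed sample
        return [{"skeleton_of": s} for s in samples]
```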
In @docs/template_data_augmentation.md:
- Around line 361-399: Replace the bold "Step X" lines that trigger MD036 with
proper Markdown headings: change lines like "**Step 1: Prepare samples (6
examples)**", "**Step 2: Create pipeline from template**", "**Step 3: Start
generation**", "**Step 4: Monitor progress**", and "**Step 5: Review and
export**" into heading syntax (for example "### Step 1: Prepare samples (6
examples)" etc.), ensuring consistent heading level across all steps and leaving
the surrounding code blocks and prose unchanged.
- Around line 41-55: The fenced diagram block in
docs/template_data_augmentation.md triggers markdownlint MD040 because it lacks
a language tag; update the opening fence for the ASCII diagram to include a
language (e.g., change the opening "```" to "```text") so the block becomes a
language-specified fenced code block and markdownlint MD040 is satisfied; locate
the ASCII diagram block (the box/arrow diagram under the Structure → Semantic →
Duplicate heading) and replace its opening fence accordingly.
In @lib/blocks/builtin/duplicate_remover.py:
- Around line 17-24: The "_config_descriptions" entry for "embedding_model" in
duplicate_remover.py is longer than 100 chars; shorten or split that description
so no line exceeds 100 chars — e.g., rephrase the description to be under 100
characters or break it into concatenated shorter strings for the
"embedding_model" value in the _config_descriptions dict (keep the same key name
and meaning).
In @lib/blocks/builtin/semantic_infiller.py:
- Around line 27-35: The block currently mutates the instance attribute
self.fields_to_generate inside execute(), making it non-reentrant; instead, keep
the original config immutable by reading self.fields_to_generate (and any
related attrs) and working on a local variable (e.g., fields_to_generate_local)
for parsing, template rendering, validation and any conversions; update only
local state and return results without assigning back to
self.fields_to_generate, and apply the same change to the code regions
referenced around execute() (lines ~139-252) to ensure thread-safety and avoid
side effects.
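The reentrancy fix can be sketched with a stripped-down class; the real `SemanticInfiller` config and `execute` logic are far richer, and the generated values here are placeholders.

```python
import json

class SemanticInfiller:
    def __init__(self, fields_to_generate: str = '["bio"]'):
        # config is set once at construction and never reassigned
        self.fields_to_generate = fields_to_generate

    def execute(self, record: dict) -> dict:
        # parse into a local variable instead of writing back to
        # self.fields_to_generate, so concurrent executions cannot race
        fields_local = json.loads(self.fields_to_generate)
        return {field: f"generated {field}" for field in fields_local}
```

After `execute` returns, the original config string is untouched, which is what makes the block safe to reuse across records and threads.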
In @llm/state-project.md:
- Around line 31-33: Update llm/state-project.md to reflect the correct number
and list of builtin blocks: change the count from 12 to 14 wherever it appears
and replace the partial list with the full set of block names to match
llm/state-backend.md — include StructureSampler, TextGenerator,
StructuredGenerator, SemanticInfiller, MarkdownMultiplierBlock, ValidatorBlock,
JSONValidatorBlock, DuplicateRemover, DiversityScore, CoherenceScore,
RougeScore, LangfuseBlock, FieldMapper, and RagasMetrics; ensure the header
comment near builtin/ and the block listing sections all consistently show "14
blocks" and enumerate these 14 block implementations.
In @tests/blocks/test_semantic_infiller.py:
- Around line 92-124: Tests instantiate SemanticInfiller with fields_to_generate
as Python lists, but the real config expects a JSON string; update each test to
pass a JSON string (e.g., fields_to_generate='["bio"]' or json.dumps(["bio"]))
when creating SemanticInfiller in test_parse_valid_json,
test_parse_json_with_markdown, test_parse_json_embedded_in_text, and
test_parse_invalid_json_raises_error so the block receives the same shape as
production and _parse_json_safely is exercised correctly.
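The shape mismatch is easy to see in isolation: `json.dumps` produces exactly the JSON-string form the block's config expects, and a round-trip recovers the original list.

```python
import json

fields = ["bio"]

# production passes the config as a JSON string, not a Python list
config_value = json.dumps(fields)

assert config_value == '["bio"]'           # what the tests should pass
assert json.loads(config_value) == fields  # what JSON parsing recovers
```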
In @tests/integration/test_data_augmentation.py:
- Around line 81-90: Ruff errors F841 (assigned-but-unused) and F541 (f-string
missing placeholders) are caused by unused locals and incorrectly formatted
f-strings in this test; fix by either using or discarding the assigned variables
and by removing the stray f-prefix from literal strings or adding proper
{placeholders}. Specifically, for the assignments pipeline_id, pipeline =
Pipeline(...), and initial_data in the shown block (and the other occurrences at
164-176, 205-207, 268-271) either (a) use the values in an assertion or
subsequent call (e.g., assert pipeline_id is not None or assert pipeline is
instance of Pipeline) or (b) rename them to _pipeline_id/_pipeline or prefix
with an underscore to indicate intentional unused values; for any f"..." strings
flagged F541, remove the leading f if there is no interpolation or replace with
f"...{var}..." including the correct variable names if interpolation was
intended.
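Both Ruff fixes can be illustrated in a few lines; the trivial `create_pipeline` helper below is a stand-in, not the real test setup.

```python
def create_pipeline():
    # stand-in returning an id and a pipeline-like object
    return 42, object()

# F841 option (b): underscore-prefixed names mark intentionally unused values
_pipeline_id, _pipeline = create_pipeline()

# F841 option (a): actually assert on the values
pipeline_id, pipeline = create_pipeline()
assert pipeline_id is not None
assert pipeline is not None

# F541: drop the f prefix when there is no placeholder...
message = "pipeline created"  # was: f"pipeline created"
# ...and keep it only when something is interpolated
detailed = f"pipeline {pipeline_id}"
```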
🧹 Nitpick comments (10)
lib/blocks/builtin/duplicate_remover.py (2)
117-130: Missing `usage` tracking for embedding API calls. Per coding guidelines, LLM/embedding calls should track and return usage metrics. The embedding responses from litellm include usage data that should be captured.

Track embedding usage:

```diff
  response = await litellm.aembedding(**embedding_params)
  self._embeddings_cache[trace_id] = [
      item["embedding"] for item in response.data
  ]
+
+ # track reference embedding usage
+ if hasattr(response, 'usage') and response.usage:
+     logger.info(
+         f"Reference embeddings usage: {response.usage.total_tokens} tokens"
+     )
```
36-37: Instance-level cache may persist across pipeline jobs. The `_embeddings_cache` dict is instance-level. If the same block instance is reused across different pipeline jobs, old trace_ids will accumulate. Consider clearing stale entries or using a bounded cache.

lib/blocks/builtin/structure_sampler.py (3)
101-103: Add `strict=True` to zip for safety. If `parent_fields` and `parent_key` have mismatched lengths, silent truncation occurs. Same issue at line 215.

Proposed fix:

```diff
- parent_str = ",".join(f"{p}={v}" for p, v in zip(parent_fields, parent_key))
+ parent_str = ",".join(
+     f"{p}={v}" for p, v in zip(parent_fields, parent_key, strict=True)
+ )
```
198-205: Replace global random with instance-level RNG. `random.choices` uses global state. After fixing `__init__`, update this to use `self._rng.choices`.

Proposed fix:

```diff
 def _sample_from_distribution(self, probs: dict[str, float]) -> Any:
     """weighted random choice from probability distribution"""
     if not probs:
         return None
     values = list(probs.keys())
     weights = list(probs.values())
-    return random.choices(values, weights=weights, k=1)[0]
+    return self._rng.choices(values, weights=weights, k=1)[0]
```
133-140: Replace global random.sample with instance-level RNG. Same issue - use `self._rng.sample` after the `__init__` fix.

Proposed fix:

```diff
 def _select_exemplars(
     self, samples: list[dict[str, Any]], max_count: int | None = None
 ) -> list[dict]:
     """randomly select exemplar samples for reference"""
     if max_count is None:
         max_count = self.MAX_EXEMPLARS
     num_exemplars = min(max_count, len(samples))
-    return random.sample(samples, num_exemplars)
+    return self._rng.sample(samples, num_exemplars)
```

tests/blocks/test_duplicate_remover.py (1)
9-21: Simplify `make_context` helper - redundant copy logic. Lines 11-12 copy state if `initial_state` is provided, but then it's overwritten anyway by `update()`. The conditional copy is unnecessary.

Simplified version:

```diff
 def make_context(state: dict, initial_state: dict | None = None) -> BlockExecutionContext:
     """helper to create test context"""
-    if initial_state:
-        state = {**state}  # don't mutate
     context = BlockExecutionContext(
         trace_id="test-trace",
         pipeline_id=1,
-        accumulated_state=state,
+        accumulated_state={**state},  # always copy to avoid mutation
     )
     if initial_state:
-        # add initial state items to accumulated_state
         context.accumulated_state.update(initial_state)
     return context
```

frontend/src/components/pipeline-editor/BlockConfigPanel.tsx (1)
38-39: Consider caching + lifecycle safety for fetched model lists. State additions are fine, but the current approach will refetch on every mount; if this panel mounts/unmounts frequently, consider lifting/caching at a higher level (or at least guarding setState on unmount).
tests/integration/test_data_augmentation.py (1)
164-176: Consider removing `print(...)` noise from tests unless debugging is intentional. Also applies to: 230-231, 286-287
tests/blocks/test_semantic_infiller.py (1)
126-290: Consider pytest fixtures for repeated LLM/mock wiring to keep tests focused.

lib/blocks/builtin/semantic_infiller.py (1)

104-137: `_parse_json_safely`: consider non-greedy fallback regex to reduce over-capture.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (19)

* `.claude/skills/implementing-datagenflow-blocks/SKILL.md`
* `.coderabbit.yaml`
* `.gitignore`
* `docker/docker-compose.yml`
* `docs/template_data_augmentation.md`
* `frontend/src/components/pipeline-editor/BlockConfigPanel.tsx`
* `frontend/src/components/pipeline-editor/BlockNode.tsx`
* `lib/blocks/builtin/duplicate_remover.py`
* `lib/blocks/builtin/semantic_infiller.py`
* `lib/blocks/builtin/structure_sampler.py`
* `lib/templates/data_augmentation.yaml`
* `lib/templates/seeds/seed_data_augmentation.json`
* `llm/state-backend.md`
* `llm/state-project.md`
* `pyproject.toml`
* `tests/blocks/test_duplicate_remover.py`
* `tests/blocks/test_semantic_infiller.py`
* `tests/blocks/test_structure_sampler.py`
* `tests/integration/test_data_augmentation.py`
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{yaml,yml,json,toml}
⚙️ CodeRabbit configuration file
**/*.{yaml,yml,json,toml}: Review configuration changes:
- No secrets committed
- Valid syntax
- Changes documented if needed
- Backwards compatible or migration documented
Files:
* `lib/templates/data_augmentation.yaml`
* `lib/templates/seeds/seed_data_augmentation.json`
* `pyproject.toml`
* `docker/docker-compose.yml`
frontend/**/*.{ts,tsx,js,jsx}
⚙️ CodeRabbit configuration file
frontend/**/*.{ts,tsx,js,jsx}: Apply frontend code review checklist from llm/rules-frontend.md:
Identify which llm/*.md files need updates:
- New pages/components → update llm/state-frontend.md
- Changed UI flow → update llm/state-frontend.md
- New patterns → update llm/state-frontend.md
Identify if the docs needs updates.
Golden rule: keep components focused and maintainable.
Files:
* `frontend/src/components/pipeline-editor/BlockConfigPanel.tsx`
* `frontend/src/components/pipeline-editor/BlockNode.tsx`
llm/**/*.md
⚙️ CodeRabbit configuration file
llm/**/*.md: Review documentation updates:
- Changes reflect actual code (not aspirational designs)
- Updates are gradual and incremental (not complete rewrites)
- Technical and concise
- Explain what changed and why
- Note any breaking changes
Files:
* `llm/state-project.md`
* `llm/state-backend.md`
**/*.py
⚙️ CodeRabbit configuration file
**/*.py: Apply backend code review checklist from llm/rules-backend.md:
Identify which llm/*.md files need updates:
- New API endpoints → update llm/state-backend.md
- New blocks → update llm/state-backend.md and llm/state-project.md
- Changed patterns → update relevant llm/state-*.md
Identify if the docs needs updates.
Golden rule: if code cannot be explained in one sentence, it's too complex.
Files:
* `tests/integration/test_data_augmentation.py`
* `lib/blocks/builtin/semantic_infiller.py`
* `tests/blocks/test_duplicate_remover.py`
* `tests/blocks/test_semantic_infiller.py`
* `tests/blocks/test_structure_sampler.py`
* `lib/blocks/builtin/duplicate_remover.py`
* `lib/blocks/builtin/structure_sampler.py`
tests/**/*.py
⚙️ CodeRabbit configuration file
tests/**/*.py: Review test quality:
- One behavior per test
- Test names: test_
- Error cases tested (not just happy path)
- Proper use of fixtures
- Mocks used appropriately
- Tests are focused and maintainable
Files:
* `tests/integration/test_data_augmentation.py`
* `tests/blocks/test_duplicate_remover.py`
* `tests/blocks/test_semantic_infiller.py`
* `tests/blocks/test_structure_sampler.py`
lib/blocks/**/*.py
⚙️ CodeRabbit configuration file
lib/blocks/**/*.py: Apply block implementation checklist from .claude/skills/implementing-datagenflow-blocks/SKILL.md:
Identify which llm/*.md files need updates:
- New blocks → update llm/state-backend.md and llm/state-project.md
- Changed block behavior → update relevant llm/state-*.md
Identify if the docs needs updates.
Golden rule: blocks should be single-responsibility and reusable.
Files:
* `lib/blocks/builtin/semantic_infiller.py`
* `lib/blocks/builtin/duplicate_remover.py`
* `lib/blocks/builtin/structure_sampler.py`
🧬 Code graph analysis (2)
lib/blocks/builtin/duplicate_remover.py (5)

* lib/blocks/base.py (1): `BaseBlock` (8-43)
* lib/entities/block_execution_context.py (3): `BlockExecutionContext` (8-51), `copy` (44-51), `get_state` (36-38)
* lib/blocks/builtin/semantic_infiller.py (1): `execute` (139-252)
* lib/blocks/builtin/structure_sampler.py (1): `execute` (284-304)
* lib/llm_config.py (1): `_prepare_embedding_call` (233-272)
lib/blocks/builtin/structure_sampler.py (3)

* lib/blocks/base.py (1): `BaseMultiplierBlock` (46-55)
* lib/entities/block_execution_context.py (2): `BlockExecutionContext` (8-51), `get_state` (36-38)
* lib/errors.py (1): `ValidationError` (29-32)
🪛 GitHub Actions: Pre-Merge Checks
tests/integration/test_data_augmentation.py
[error] 82-82: F841 Local variable pipeline_id is assigned to but never used.
[error] 169-169: F541 [*] f-string without any placeholders.
[error] 205-205: F841 Local variable pipeline_id is assigned to but never used.
[error] 268-268: F841 Local variable pipeline_id is assigned to but never used.
lib/blocks/builtin/semantic_infiller.py
[error] 29-29: E501 Line too long (109 > 100).
[error] 166-166: E501 Line too long (108 > 100).
[error] 230-230: E501 Line too long (102 > 100).
lib/blocks/builtin/duplicate_remover.py
[error] 23-23: E501 Line too long (115 > 100).
[error] 148-148: E501 Line too long (105 > 100).
🪛 LanguageTool
llm/state-project.md
[grammar] ~101-~101: Ensure spelling is correct
Context: ...init_ signature pass ``` ### builtin blocks (12 total) seeders: - Struc...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
docs/template_data_augmentation.md
41-41: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
363-363: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
375-375: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
382-382: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
389-389: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
395-395: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🪛 Ruff (0.14.10)
tests/integration/test_data_augmentation.py
82-82: Local variable pipeline_id is assigned to but never used
Remove assignment to unused variable pipeline_id
(F841)
169-169: f-string without any placeholders
Remove extraneous f prefix
(F541)
205-205: Local variable pipeline_id is assigned to but never used
Remove assignment to unused variable pipeline_id
(F841)
268-268: Local variable pipeline_id is assigned to but never used
Remove assignment to unused variable pipeline_id
(F841)
lib/blocks/builtin/semantic_infiller.py
21-21: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
22-22: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
27-35: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
131-137: Avoid specifying long messages outside the exception class
(TRY003)
154-157: Avoid specifying long messages outside the exception class
(TRY003)
159-162: Avoid specifying long messages outside the exception class
(TRY003)
164-167: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
164-167: Avoid specifying long messages outside the exception class
(TRY003)
165-165: Use explicit conversion flag
Replace with conversion flag
(RUF010)
208-208: Do not catch blind exception: Exception
(BLE001)
209-216: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
209-216: Avoid specifying long messages outside the exception class
(TRY003)
210-210: Use explicit conversion flag
Replace with conversion flag
(RUF010)
223-223: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
lib/blocks/builtin/duplicate_remover.py
17-17: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
18-18: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
20-24: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
151-151: Do not catch blind exception: Exception
(BLE001)
lib/blocks/builtin/structure_sampler.py
17-17: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
18-18: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
24-30: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
36-36: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
37-37: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
52-58: Avoid specifying long messages outside the exception class
(TRY003)
101-101: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
181-184: Avoid specifying long messages outside the exception class
(TRY003)
205-205: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
215-215: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🔇 Additional comments (24)
docker/docker-compose.yml (2)
3-5: LGTM: Build context updated for project structure. The build context change to parent directory with explicit dockerfile path aligns with organizing Docker files in a dedicated `docker/` subdirectory.
17-22: No issues found with the `/health` endpoint. The `/health` endpoint is properly implemented in `app.py` (line 108) and returns `{"status": "healthy"}`. The healthcheck configuration is valid and will function correctly. The 40s start period is reasonable for service initialization.

.gitignore (1)
22-23: LGTM: Refined .claude directory tracking. The pattern change correctly allows tracking `.claude/skills/` while ignoring other `.claude/` contents. This aligns with the new skills documentation structure.

.coderabbit.yaml (4)
1-11: LGTM: Core review settings configured appropriately. The configuration enables request changes workflow and disables poems, which is suitable for a technical project.
13-16: LGTM: Auto-review enabled with incremental updates. Standard configuration for automatic code reviews on PR creation and subsequent pushes.
81-94: LGTM: Chat and knowledge base settings configured. Auto-reply enabled and learnings scope set to "auto" for appropriate privacy handling. Tone instructions define clear priority levels for review feedback.
18-79: All referenced documentation files exist and are properly linked. The path instructions in `.coderabbit.yaml` correctly reference three documentation files that all exist in the repository:

* `llm/rules-backend.md`
* `llm/rules-frontend.md`
* `.claude/skills/implementing-datagenflow-blocks/SKILL.md`

CodeRabbit will be able to enforce these rules as configured.
lib/templates/seeds/seed_data_augmentation.json (2)
1-13: LGTM: Well-structured seed samples. The 6 seed samples provide good coverage across plan tiers (Free, Pro, Enterprise) with varied roles, storage values, and realistic bios. This should enable effective augmentation to reach the target count of 20.
14-21: LGTM: Metadata configuration is well-defined. The augmentation parameters are appropriate:

* Target count (20) provides 3.3x expansion from 6 seeds
* Field classifications are correct (categorical, numeric)
* Dependency `role: ["plan"]` is logical
* Using `bio` for duplicate comparison is sensible

frontend/src/components/pipeline-editor/BlockNode.tsx (3)
64-87: LGTM: Preview support for new block types. The implementation cleanly extends preview field selection for the new data augmentation blocks (sampler, infiller, remover) and other block types. Priority keys are appropriate for each category.
Based on coding guidelines, verify if `llm/state-frontend.md` needs updates for these new block type patterns.
92-99: LGTM: Enhanced validator and score preview fields. Adding `field_name` and `metric` to priority keys improves preview informativeness for validators and score blocks.
108-118: LGTM: Improved preview value formatting. The special formatting for arrays and objects (`[N items]`, `{N keys}`) provides cleaner previews. The truncation increase to 25 characters is reasonable.

lib/blocks/builtin/structure_sampler.py (1)
284-304: LGTM - execute method follows the expected pattern. Reads samples from context, validates, analyzes, and generates skeletons. Clean implementation with appropriate logging.
pyproject.toml (1)
22-22: LGTM - dependency addition for embedding similarity. `scikit-learn>=1.3.0` is required for `cosine_similarity` in `DuplicateRemover`. The version constraint allows compatible updates.

tests/blocks/test_duplicate_remover.py (2)
117-234: Good embedding mock tests with cache verification. Tests properly verify:
- Duplicate detection above/below threshold
- Per-trace_id cache behavior with correct call counts
- Different embedding vectors for similarity computation
236-272: Good error handling coverage. Tests verify graceful degradation when embedding model is missing or fails, returning `is_duplicate=False` without raising exceptions.

lib/templates/data_augmentation.yaml (1)
1-23: LGTM - well-structured data augmentation pipeline template. The three-block sequence (StructureSampler → SemanticInfiller → DuplicateRemover) correctly implements the data augmentation workflow. Template variables are properly referenced using Jinja2 syntax.
Consider making `seed` configurable via template variable instead of hardcoded `42` if reproducibility should be user-controlled.

tests/blocks/test_structure_sampler.py (3)
43-108: Good distribution calculation tests. Tests verify probability computation, conditional probabilities, and numeric statistics with appropriate tolerance checks for floating-point comparisons.
189-227: Good edge case coverage. Tests properly verify:

* Empty samples raises `ValidationError`
* Missing samples raises `ValidationError`
* Circular dependencies are detected and raise `ValidationError`
156-159: Test verifies conditional dependency constraint. Good assertion that all Free plans have Viewer role (100% conditional probability in sample data).
.claude/skills/implementing-datagenflow-blocks/SKILL.md (1)
1-611: Well-structured skill documentation. Comprehensive coverage of block implementation patterns including UI integration, LLM calls, state management, caching, and testing. The common mistakes table and implementation checklist are particularly useful.
llm/state-backend.md (1)
12-19: State-backend.md is accurate: it contains 14 builtin blocks with correct names and detailed descriptions (lines 417-460). The inconsistency exists elsewhere: state-project.md incorrectly claims 12 blocks (outdated). No changes needed in this file. Likely an incorrect or invalid review comment.
frontend/src/components/pipeline-editor/BlockConfigPanel.tsx (1)
237-271: The model and embedding_model selection semantics are correctly implemented. The frontend sends `null` for "Use default model", and the backend properly interprets `None` as a fallback trigger to use the default model (via `get_llm_model(None)` and `get_embedding_model(None)`). Both blocks are properly typed with `str | None`, templates use `null`, and tests confirm this pattern works throughout the codebase. No changes needed.
405-412: Verify "Related Documentation" link targets resolve in your docs site/router. Ensure all referenced pages (`templates`, `how_to_use`, `how_to_create_blocks`) and anchors (`#structuresampler`, `#semanticinfiller`, `#duplicateremover`) exist as link targets, and that relative paths are correctly formatted.
Description
This PR introduces a data augmentation pipeline feature with three new Python blocks (StructureSampler, SemanticInfiller, DuplicateRemover) for generating and validating synthetic records. It includes frontend model selection UI, comprehensive test coverage, documentation, and CodeRabbit configuration.
Related Issue
Checklist

* `make format` passes
* `make pre-merge` passes

Summary by CodeRabbit
Release Notes
New Features
Documentation
Chores
✏️ Tip: You can customize this high-level summary in your review settings.