Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 20 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
## Description
<!-- Brief description of changes -->

## Type of Change
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update

## Testing
- [ ] Unit tests pass (`go test ./...`)
- [ ] Race detector clean (`go test -race ./...`)
- [ ] Linter passes (`golangci-lint run`)
- [ ] Manual testing completed

## Checklist
- [ ] Code follows project style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] CHANGELOG.md updated
66 changes: 66 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
# Changelog

All notable changes to DocLoom CLI will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.2.0] - 2025-10-15

### πŸŽ‰ Added
- **Batch Analysis**: New `analyze-batch` command processes multiple files with `[N/Total]` progress
- **Mixed-Input Batch**: Supports `.csv`, `.tsv`, `.xlsx` (analyzed) + `.yaml`, `.md`, `.txt`, `.docx` (added as docs)
- **Project-Level Sample Control**: `--sample-rows-project` flag to override samples in all summaries (set `0` to disable)
- **Memory Safety**: Hard limits prevent OOM (200k tokens, 20 summaries per project)
- **Context Validation**: Blocks oversized prompts for local LLMs with actionable error messages
- **Timeout Configuration**: `--timeout-sec` flag for generation requests (default 180s)
- **TSV Auto-Delimiter**: Automatically sets tab delimiter for `.tsv` files

### πŸ› Fixed
- **CRITICAL**: XLSX parser returning 0 columns due to absolute relationship paths in ZIP archives
- Unbounded memory accumulation with multiple large files (9.3GB β†’ <2GB peak)
- Duplicate document detection (no more silent overwrites)
- Memory leaks in outlier computation
- Context window overflow causing silent truncation in Ollama
- RAG chunker producing oversized chunks exceeding token limits
- Prompt instruction duplication (40% token reduction)
- Dataset summary basename collisions with disambiguation logic
- Invalid `--sheet-name` silently falling back to first sheet

### ⚑ Performance
- Reduced memory usage by 78% for multi-file projects
- Batched embedding prevents API timeout failures (100 chunks/batch)
- 40% reduction in prompt tokens via deduplication
- Immediate memory release after outlier computation

### πŸ’₯ Breaking Changes
- Context overflow now **blocks** execution for Ollama (was warning-only)
- Duplicate files now **error** instead of silently overwriting
- Invalid `--sheet-name` now errors with available sheet list
- Projects enforce maximum 200k token limit (hard cap at 200k)
- Maximum 20 dataset summaries per project (prevents context bloat)

### πŸ“š Documentation
- Added [docs/examples/analyze-batch.md](cci:7://file:///home/jeremiah/Projects/docloom-cli/docs/examples/analyze-batch.md:0:0-0:0) with batch processing examples
- Updated README with mixed-input batch behavior
- Added XLSX parser fix details and regression test
- Updated quickstart with batch analysis tips

### πŸ§ͺ Testing
- Added regression test for XLSX relationship path normalization
- Added integration test for batch analysis with sample suppression
- Memory profiling tests ensure <2GB peak for 10x100k-row files
- Race detector clean across all packages

## [0.1.0] - 2025-10-01

### Added
- Initial release
- Basic project management (`init`, `add`, `list`)
- CSV/TSV/XLSX analysis with schema inference
- OpenRouter, Ollama, and major provider support
- RAG with embedding indexes
- Model catalog management

[0.2.0]: https://github.com/KaramelBytes/docloom-cli/releases/tag/v0.2.0
[0.1.0]: https://github.com/KaramelBytes/docloom-cli/releases/tag/v0.1.0
14 changes: 14 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -161,6 +161,10 @@ docloom analyze <file> [-p <project-name>] [--output <file>] [--delimiter ','|'t
# Analyzes CSV/TSV/XLSX and produces a compact Markdown summary; can attach to a project
# Extras: --group-by <col1,col2> --correlations --corr-per-group --outliers --outlier-threshold 3.5 --sheet-name <name> --sheet-index N

docloom analyze-batch <files...> [-p <project-name>] [--delimiter ...] [--decimal ...] [--thousands ...] [--sample-rows N] [--max-rows N] [--quiet]
# Analyze multiple CSV/TSV/XLSX files with progress [N/Total]. Supports globs. Mirrors flags from 'analyze'.
# When attaching (-p), you can override sample rows for all summaries using --sample-rows-project (0 disables samples).

docloom list --projects | --docs -p <project-name>
# Lists projects or documents

Expand Down Expand Up @@ -188,6 +192,15 @@ docloom models fetch --provider openrouter [--merge] [--output models.json]
- Behavior in projects: When you `add` CSV/TSV/XLSX to a project, the parser stores a summary (not the raw table) to keep prompts concise and token‑efficient.
- Standalone analysis: Use `docloom analyze <file>` to generate a report and optionally save it to a file or attach it to a project with `-p`.

Batch analysis with progress

- Use `docloom analyze-batch "data/*.csv"` (supports globs) to process multiple files with `[N/Total]` progress.
- Supports mixed inputs: `.csv`, `.tsv`, `.xlsx` are analyzed; other formats (`.yaml`, `.md`, `.txt`, `.docx`) are added as regular documents when `-p` is provided.
- When attaching (`-p`), you can override sample rows for all summaries using `--sample-rows-project`. Set it to `0` to disable sample tables in reports.
- When writing summaries into a project (`dataset_summaries/`), filenames are disambiguated:
- If `--sheet-name` is used, the sheet slug is included: `name__sheet-sales.summary.md`
- On collision, a numeric suffix is appended: `name__2.summary.md`

Examples

```bash
Expand Down Expand Up @@ -284,6 +297,7 @@ See `docs/api.md` for request/response details.

- `--print-prompt`: prints the prompt even for real runs.
- `--prompt-limit N`: truncates the built prompt to N tokens before sending.
- `--timeout-sec N`: sets the request timeout (default 180 seconds).
- `--budget-limit USD`: fails early if estimated max cost (prompt + max-tokens) exceeds the budget.
- `--quiet`: suppresses non-essential console output.
- `--json`: emit response as JSON to stdout.
Expand Down
68 changes: 67 additions & 1 deletion cmd/analyze.go
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ var (
anaThousands string
anaOutliers bool
anaOutlierThr float64
anaSampleRowsProject int
)

var analyzeCmd = &cobra.Command{
Expand Down Expand Up @@ -87,6 +88,9 @@ var analyzeCmd = &cobra.Command{
if anaOutlierThr > 0 {
opt.OutlierThreshold = anaOutlierThr
}
if anaProject != "" && anaSampleRowsProject >= 0 {
opt.SampleRows = anaSampleRowsProject
}
// choose analyzer by extension
lower := strings.ToLower(path)
var md string
Expand Down Expand Up @@ -126,6 +130,35 @@ var analyzeCmd = &cobra.Command{
if err != nil {
return err
}

// Count existing dataset summaries
datasetCount := 0
totalDatasetTokens := 0
for _, doc := range p.Documents {
desc := strings.ToLower(doc.Description)
if strings.Contains(desc, "dataset") || strings.Contains(desc, "summary") ||
strings.HasSuffix(doc.Name, ".summary.md") {
datasetCount++
totalDatasetTokens += doc.Tokens
}
}

// Enforce limits
const maxDatasetSummaries = 20
const maxDatasetTokens = 150000

if datasetCount >= maxDatasetSummaries {
return fmt.Errorf("project already has %d dataset summaries (limit: %d).\n"+
" Consider: (1) Removing old summaries, (2) Using --retrieval mode, or (3) Creating a new project",
datasetCount, maxDatasetSummaries)
}

if totalDatasetTokens >= maxDatasetTokens {
fmt.Printf("⚠ WARNING: Project has %d tokens of dataset summaries (recommended max: %d)\n",
totalDatasetTokens, maxDatasetTokens)
fmt.Printf(" Continuing will likely exceed local LLM context windows. Consider using --retrieval mode.\n\n")
}

// Write summary as a doc file in project folder
outDir := filepath.Join(p.RootDir(), "dataset_summaries")
if err := os.MkdirAll(outDir, 0o755); err != nil {
Expand All @@ -134,7 +167,39 @@ var analyzeCmd = &cobra.Command{
base := filepath.Base(path)
// ensure safe base for filename
safe := strings.TrimSuffix(base, filepath.Ext(base))
outFile := filepath.Join(outDir, safe+".summary.md")
// disambiguate with sheet name if provided
sheetBase := safe
if anaSheetName != "" {
s := strings.ToLower(strings.TrimSpace(anaSheetName))
var b strings.Builder
for _, r := range s {
if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
b.WriteRune(r)
} else if r == ' ' || r == '-' || r == '_' {
b.WriteRune('-')
}
}
ss := strings.Trim(b.String(), "-")
if ss == "" {
ss = "sheet"
}
sheetBase = safe + "__sheet-" + ss
}
outFile := filepath.Join(outDir, sheetBase+".summary.md")
if _, statErr := os.Stat(outFile); statErr == nil {
idx := 2
for {
cand := filepath.Join(outDir, fmt.Sprintf("%s__%d.summary.md", sheetBase, idx))
if _, err := os.Stat(cand); os.IsNotExist(err) {
if !cmd.Flags().Changed("quiet") {
fmt.Printf("⚠ Detected existing summary, writing to %s to avoid overwrite.\n", filepath.Base(cand))
}
outFile = cand
break
}
idx++
}
}
if err := os.WriteFile(outFile, []byte(md), 0o644); err != nil {
return fmt.Errorf("write project summary: %w", err)
}
Expand Down Expand Up @@ -175,4 +240,5 @@ func init() {
analyzeCmd.Flags().Float64Var(&anaOutlierThr, "outlier-threshold", 3.5, "robust |z| threshold for outliers (MAD-based)")
analyzeCmd.Flags().StringVar(&anaSheetName, "sheet-name", "", "XLSX: sheet name to analyze")
analyzeCmd.Flags().IntVar(&anaSheetIndex, "sheet-index", 1, "XLSX: 1-based sheet index (used if --sheet-name not provided)")
analyzeCmd.Flags().IntVar(&anaSampleRowsProject, "sample-rows-project", -1, "when attaching (-p), override sample rows for dataset summaries (0 disables samples)")
}
Loading
Loading