otoTree/human-text

Human-Text DSL Compiler

A powerful compiler that converts human-readable text into a structured DSL (Domain-Specific Language), supporting both controlled scripts and free-form natural language input with LLM enhancement.

Features

  • Dual Input Modes:

    • Controlled scripts with explicit directives (@task, @tool, etc.)
    • Free-form natural language with LLM-powered structuring
  • Multi-format Output: YAML, JSON, and Protocol Buffers

  • Advanced Processing: Lexical analysis, semantic validation, optimization

  • LLM Integration: Support for multiple LLM providers (DashScope, OpenAI)

  • Debug & Analysis: Save intermediate DSL code generated by LLM for debugging and optimization

  • CLI & Library: Both command-line tool and Python library interface

  • Structured Representation: Complex conditionals, tool calls, agent invocations, and flow control

Quick Start

Prerequisites

  • Python 3.12+
  • uv package manager

Installation

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository (replace with actual repository URL)
git clone <your-repository-url>
cd human-text

# Install dependencies and create virtual environment
# Mirror sources for mainland China are preconfigured; users there get faster downloads
uv sync

# Activate the virtual environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install in development mode
uv pip install -e .

Mirror Sources for Mainland China

The project ships with mainland-China mirror sources preconfigured in the [tool.uv] section of pyproject.toml.

To use a custom mirror, edit the [tool.uv] section of pyproject.toml, or create ~/.config/uv/uv.toml in your home directory.

Basic Usage

Python Library

from dsl_compiler import compile, CompilerConfig

# Create configuration
config = CompilerConfig(
    llm_enabled=True,
    output_format="yaml"
)

# Compile a file
result = compile("input.txt", config)
print(result.to_yaml())

# Compile from string
source_code = """
@task data_processing
    Process user data from database
    Validate and clean the data
    Generate comprehensive report

@var user_id = 12345
@tool data_validator
    Tool for validating data integrity
"""

result = compile(source_code, config)

Command Line Interface

# Basic compilation
uv run dslc input.txt -o output.yaml

# Different output formats
uv run dslc input.txt -f json -o output.json

# Disable LLM for faster processing
uv run dslc input.txt --no-llm

# Syntax validation only
uv run dslc validate input.txt

# Show configuration
uv run dslc config --show

# Or use the traditional Python module syntax
uv run python -m dsl_compiler.cli input.txt -o output.yaml

Syntax Guide

Basic Directives

Task Definition

@task task_name
    Task description
    
    Detailed steps and instructions...

Variable Declaration

@var variable_name = value
@var user_id = 12345
@var debug_mode = true
@var config_file = "settings.json"

Tool Definition

@tool tool_name
    Tool description and usage instructions

Agent Invocation

@agent AgentName(param1=value1, param2=value2)

Flow Control

@next target_task

@if condition_expression
    Actions when condition is true
@else
    Actions when condition is false
@endif
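As an aside, directive lines of this shape can be recognized with a single regular expression. The sketch below is purely illustrative (a hypothetical helper, not the project's actual lexer):

```python
import re

# Matches lines such as "@task data_processing" or "@var user_id = 12345".
DIRECTIVE_RE = re.compile(r"^@(?P<name>\w+)(?:\s+(?P<rest>.*))?$")

def parse_directive(line: str):
    """Split a directive line into (directive, argument), or return None for plain text."""
    match = DIRECTIVE_RE.match(line.strip())
    if not match:
        return None  # not a directive; treated as free-form content
    return match.group("name"), (match.group("rest") or "").strip()

print(parse_directive("@task order_validation"))  # ('task', 'order_validation')
print(parse_directive("@var user_id = 12345"))    # ('var', 'user_id = 12345')
print(parse_directive("Plain description text"))  # None
```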

Advanced Features

Conditional Statements

@task order_validation
    Validate customer order
    
    @tool check_order
        Order validation tool
    
    @if result.valid == false
        Order is invalid, terminate process
        @next END
    @else
        Proceed with order processing
        @next process_payment
    @endif

Structured Output Example

The above compiles to:

version: "1.0"
tasks:
  - id: order_validation
    title: Order validation
    body:
      - type: text
        content: "Validate customer order"
        line_number: 2
      - type: tool_call
        tool_call:
          name: check_order
          description: "Order validation tool"
        line_number: 4
      - type: conditional
        conditional:
          branches:
            - condition: "result.valid == false"
              actions:
                - type: text
                  content: "Order is invalid, terminate process"
                - type: jump
                  jump:
                    target: END
            - condition: null  # else branch
              actions:
                - type: text
                  content: "Proceed with order processing"
                - type: jump
                  jump:
                    target: process_payment
        line_number: 6

Configuration

Environment Variables

Copy dsl_compiler/env.example to .env and configure:

# Output format
DSL_OUTPUT_FORMAT=yaml

# LLM configuration
DSL_LLM_ENABLED=true
DSL_LLM_PROVIDER=dashscope
DSL_LLM_API_KEY=your_api_key_here
DSL_LLM_MODEL=qwen-turbo
DSL_LLM_SAVE_INTERMEDIATE=false
DSL_LLM_INTERMEDIATE_DIR=

# Performance settings
DSL_MAX_FILE_SIZE=10485760
DSL_PARSE_TIMEOUT=60

# Debug settings
DSL_DEBUG=false
DSL_LOG_LEVEL=INFO

Configuration Options

Option                  Default     Description
output_format           yaml        Output format (yaml/json/proto)
llm_enabled             true        Enable LLM enhancement
llm_provider            dashscope   LLM provider
llm_save_intermediate   false       Save intermediate DSL code
llm_intermediate_dir    null        Directory for intermediate files
strict_mode             true        Strict validation mode
compact_mode            false       Compact output format
max_file_size           10MB        Maximum file size
parse_timeout           60s         Parse timeout
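For reference, the defaults above can be sketched as a plain dataclass. The field names follow the table, but this is illustrative only; the real CompilerConfig may differ:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CompilerDefaults:
    """Mirror of the documented defaults; illustrative, not the real class."""
    output_format: str = "yaml"            # yaml / json / proto
    llm_enabled: bool = True
    llm_provider: str = "dashscope"
    llm_save_intermediate: bool = False
    llm_intermediate_dir: Optional[str] = None
    strict_mode: bool = True
    compact_mode: bool = False
    max_file_size: int = 10 * 1024 * 1024  # 10 MB, matching DSL_MAX_FILE_SIZE
    parse_timeout: int = 60                # seconds, matching DSL_PARSE_TIMEOUT

defaults = CompilerDefaults()
print(defaults.output_format, defaults.max_file_size)
```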

LLM Integration

The compiler supports multiple LLM providers for natural language processing:

DashScope (Alibaba Cloud)

export DSL_LLM_PROVIDER=dashscope
export DSL_LLM_API_KEY=your_dashscope_key
export DSL_LLM_MODEL=qwen-turbo

OpenAI

export DSL_LLM_PROVIDER=openai
export DSL_LLM_API_KEY=your_openai_key
export DSL_LLM_MODEL=gpt-3.5-turbo
Save Intermediate Results

To debug and analyze the LLM conversion process, you can save the intermediate DSL code generated by the LLM:

# Enable intermediate result saving
export DSL_LLM_SAVE_INTERMEDIATE=true

# Specify the save directory (optional; defaults to the llm_intermediate subdirectory under the source file's directory)
export DSL_LLM_INTERMEDIATE_DIR=./intermediate_results

Once enabled, each LLM conversion generates a timestamped .dsl file that includes:

  • The original DSL code
  • Generation time and source information
  • The LLM provider and model used

Example generated file name: level_2_medium_natural_llm_generated_20250714_162839.dsl

Configuration Example

from dsl_compiler import CompilerConfig

config = CompilerConfig(
    llm_enabled=True,
    llm_save_intermediate=True,              # Enable intermediate result saving
    llm_intermediate_dir="./debug_results",  # Specify the save directory
    debug=True  # Enable debug mode to see where results are saved
)

Architecture

The compiler follows a multi-stage pipeline:

Input Text → Preprocessor → Lexer → Parser → Semantic Analyzer
                                              ↓
Output ← Serializer ← Optimizer ← Validator ← LLM Augmentor

Components

  • Preprocessor: BOM removal, line normalization, tab expansion
  • Lexer: Tokenization with indentation tracking
  • Parser: AST construction with directive parsing
  • Semantic Analyzer: Symbol table building, type checking, scope validation
  • LLM Augmentor: Natural language enhancement (optional)
  • Validator: DAG validation, reference checking, conflict detection
  • Optimizer: Dead code elimination, constant folding, text compression
  • Serializer: Multi-format output generation
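To make the stage order concrete, here is a toy sketch of the pipeline as a chain of functions. The stage functions are hypothetical placeholders that only record the order of execution, not the project's actual internals:

```python
# Each toy stage tags the value it receives, to show the order of execution.
def make_stage(name):
    return lambda trace: trace + [name]

STAGE_NAMES = ["preprocess", "lex", "parse", "semantic_analyze",
               "llm_augment", "validate", "optimize", "serialize"]
STAGES = [make_stage(name) for name in STAGE_NAMES]

def run_pipeline(source: str):
    """Thread the input through every stage, in pipeline order."""
    trace = []  # records which stages ran, and in what order
    for stage in STAGES:
        trace = stage(trace)
    return trace

print(run_pipeline("@task demo"))
```

In the real compiler each stage transforms the previous stage's output (text, tokens, AST, and so on) rather than a trace list, but the ordering is the same.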

Output Formats

YAML (Default)

version: "1.0"
tasks:
  - id: "data_processing"
    title: "Data Processing Task"
    body:
      - type: "text"
        content: "Process user data"
        line_number: 2

JSON

{
  "version": "1.0",
  "tasks": [
    {
      "id": "data_processing",
      "title": "Data Processing Task",
      "body": [
        {
          "type": "text",
          "content": "Process user data",
          "line_number": 2
        }
      ]
    }
  ]
}
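Because the JSON output is plain data, it is easy to post-process with standard tooling. A stdlib-only sketch, using the sample document above:

```python
import json

# Sample compiler output, copied from the JSON example above.
output = '''
{
  "version": "1.0",
  "tasks": [
    {
      "id": "data_processing",
      "title": "Data Processing Task",
      "body": [
        {"type": "text", "content": "Process user data", "line_number": 2}
      ]
    }
  ]
}
'''

workflow = json.loads(output)
task_ids = [task["id"] for task in workflow["tasks"]]
print(task_ids)  # ['data_processing']
```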

Protocol Buffers

syntax = "proto3";
package dsl;

message DSLWorkflow {
  string version = 1;
  map<string, string> metadata = 2;
  repeated Task tasks = 3;
}

Development

Project Structure

src/dsl_compiler/
├── __init__.py          # Main interface
├── config.py            # Configuration management
├── compiler.py          # Main compiler logic
├── preprocessor.py      # Text preprocessing
├── lexer.py             # Lexical analyzer
├── parser.py            # Syntax parser
├── semantic_analyzer.py # Semantic analysis
├── llm_augmentor.py     # LLM enhancement
├── validator.py         # Validation engine
├── optimizer.py         # Code optimization
├── serializer.py        # Output serialization
├── cli.py               # Command-line interface
├── models.py            # Data models
├── exceptions.py        # Exception classes
└── requirements.txt     # Dependencies

Running Tests

# Install development dependencies
pip install pytest pytest-asyncio black flake8 mypy

# Run tests
python -m pytest tests/

# Run with coverage
python -m pytest --cov=src/dsl_compiler tests/

Code Quality

# Format code
black src/

# Lint code
flake8 src/

# Type checking
mypy src/

Error Handling

The compiler provides detailed error information:

from dsl_compiler import compile
from dsl_compiler.exceptions import CompilerError, ValidationError

try:
    result = compile("input.txt")
except ValidationError as e:
    print(f"Validation error: {e}")
    print(f"Rule: {e.rule}")
    print(f"Suggestions: {e.suggestions}")
except CompilerError as e:
    print(f"Compilation error: {e}")
    print(f"File: {e.source_file}")
    print(f"Line: {e.line}")

Performance Features

  • Dead Code Elimination: Remove unreachable code blocks
  • Constant Folding: Evaluate constant expressions at compile time
  • Text Compression: Optimize text content while preserving meaning
  • Structure Optimization: Flatten unnecessary nesting
  • Duplicate Removal: Eliminate redundant definitions
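As an illustration of constant folding in general (not this compiler's optimizer), Python's own ast module can fold literal sub-expressions at "compile time":

```python
import ast

def fold_constants(expr: str) -> str:
    """Evaluate constant sub-expressions ahead of time and re-emit the source."""
    tree = ast.parse(expr, mode="eval")

    class Folder(ast.NodeTransformer):
        def visit_BinOp(self, node):
            self.generic_visit(node)  # fold children first, bottom-up
            if isinstance(node.left, ast.Constant) and isinstance(node.right, ast.Constant):
                # Both operands are literals: compute the result now.
                value = eval(compile(ast.Expression(node), "<fold>", "eval"))
                return ast.copy_location(ast.Constant(value), node)
            return node

    return ast.unparse(Folder().visit(tree))

print(fold_constants("2 + 3 * 4"))      # 14
print(fold_constants("x * (60 * 60)"))  # x * 3600
```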

Troubleshooting

Common Issues

  1. LLM Call Failures

    • Check API key configuration
    • Verify network connectivity
    • Check LLM service status
  2. Parse Errors

    • Validate directive format
    • Check file encoding (should be UTF-8)
    • Review detailed error messages
  3. Performance Issues

    • Disable LLM with --no-llm flag
    • Reduce file size
    • Adjust timeout settings

Debug Mode

# Enable debug output
python -m dsl_compiler.cli input.txt --debug

# Set environment variable
export DSL_DEBUG=true

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite
  6. Submit a pull request

📚 Documentation

Complete documentation is available in the doc/ directory:

For quick start, see the sections above. For detailed development information, visit the documentation directory.

License

MIT License

Changelog

v0.1.1 (2025-07-14)

✨ New Features

  • LLM intermediate result saving: added the ability to save intermediate DSL code generated by the LLM, making it easy to debug and analyze the conversion process
  • Added the configuration options llm_save_intermediate and llm_intermediate_dir
  • Timestamped .dsl files are generated automatically and contain complete metadata
  • Both options can also be set via the environment variables DSL_LLM_SAVE_INTERMEDIATE and DSL_LLM_INTERMEDIATE_DIR

🛠️ Improvements

  • Enhanced configuration management: added the LLM intermediate result saving options to config.py
  • Improved debugging experience: in debug mode, the intermediate result save path is displayed
  • Documentation update: updated the README and related documents with instructions for the intermediate result saving feature

📖 Documentation Updates

  • Updated the environment variable example file env.example
  • Added a detailed explanation of the LLM intermediate result saving feature to the README
  • Removed references to non-existent documents to keep documentation links accurate

v0.1.0 (2025-07-14)

🚀 Major Updates

  • Complete LLM Augmentor Refactoring: Transformed from complex JSON structure analysis to direct DSL code output, significantly simplifying the processing pipeline
  • Critical Error Fix: Resolved the "Expecting value: line 1 column 1 (char 0)" error in LLM response parsing
  • Direct Natural Language to DSL Conversion: Implemented complete natural language content detection and conversion workflow

✨ New Features

  • Intelligent Content Detection: Automatically identifies natural language content that requires LLM enhancement
  • Multi-LLM Provider Support: Enhanced integration with DashScope (Alibaba Cloud) and OpenAI APIs
  • Response Cleaning Mechanism: Added Markdown code block cleaning and JSON extraction functionality
  • Usage Examples and Documentation: Added example_llm_usage.py comprehensive usage guide

🛠️ Improvements

  • Enhanced Error Handling: Added response validation, fallback mechanisms, and detailed error information
  • Code Extraction Logic: Implemented algorithm for accurately extracting DSL code from LLM responses
  • Re-parsing Workflow: Generated DSL code is reprocessed through the complete compiler pipeline
  • Configuration Validation: Strengthened LLM configuration validation and error handling

🐛 Bug Fixes

  • Fixed JSON parsing failures causing compilation interruption
  • Resolved null value handling issues in natural language detection logic
  • Fixed parsing errors caused by inconsistent LLM response formats
  • Improved node handling logic in AST structure conversion

📖 Documentation Updates

  • Updated LLM integration usage instructions and configuration examples
  • Added complete configuration guides for DashScope and OpenAI
  • Provided practical examples of natural language to DSL conversion
  • Enhanced troubleshooting and debugging guides
