feat: Add Markdown parser and CI validation workflow #101

Draft
edonadei wants to merge 2 commits into safe-agentic-framework:main from edonadei:schema-parser

edonadei commented Nov 4, 2025

Summary

This PR operationalizes the JSON schemas established in PR #<NUMBER_OF_FIRST_PR> by introducing a Python-based Markdown parser and an automated GitHub Actions CI workflow.

The core script (scripts/schema_parser.py) reads the YAML frontmatter from all technique and mitigation Markdown files, validates their structure against the corresponding JSON schemas, and (on success) generates complete techniques.json and mitigations.json index files.
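As a rough illustration of the first step described above, the sketch below separates a Markdown file's frontmatter from its body using only the standard library. The function name `split_frontmatter`, the `---` delimiter convention, and the sample ID are assumptions made for illustration; they are not taken from scripts/schema_parser.py. In a real run, the returned frontmatter string would presumably be passed to PyYAML's `yaml.safe_load` before validation.

```python
from typing import Optional, Tuple


def split_frontmatter(text: str) -> Tuple[Optional[str], str]:
    """Return (raw_frontmatter, body); raw_frontmatter is None if absent."""
    lines = text.splitlines()
    # Assumption: frontmatter opens on the very first line with a bare '---'.
    if not lines or lines[0].strip() != "---":
        return None, text
    # Look for the closing '---' delimiter.
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            raw = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return raw, body
    # Opening delimiter without a closing one: treat as no frontmatter.
    return None, text


# Hypothetical sample document; the ID and fields are placeholders.
sample = """---
id: SAFE-T0000
name: Example technique
---
# Example technique

Body text.
"""

raw, body = split_frontmatter(sample)
print(raw)
```

Returning `None` for a missing or unterminated block lets the caller report the file as invalid rather than silently validating an empty mapping.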

The CI workflow (.github/workflows/main.yml) runs this parser on every pull request to main, ensuring that all contributed content is valid before it can be merged.

This is Part 2 of a multi-PR initiative to address issue #48.

Type of Contribution

  • Infrastructure/Tooling
  • New feature

What's Included

  1. scripts/schema_parser.py
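    • A new Python script that:
      • Reads the YAML frontmatter from all technique and mitigation Markdown files.
      • Validates each file's structure against the corresponding JSON schema.
      • On success, generates the techniques.json and mitigations.json index files.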
  2. .github/workflows/main.yml
    • A new GitHub Actions workflow that:
      • Triggers on push and pull_request events to the main branch.
      • Sets up Python and installs dependencies from requirements.txt.
      • Runs the schema_parser.py script to validate all content.
  3. requirements.txt
    • Adds Python dependencies: jsonschema (for validation) and PyYAML (for parsing frontmatter).

Key Features

  • CI Validation: All pull requests are now automatically checked for schema compliance.
  • Data Integrity: Enforces the data contract, ensuring all techniques and mitigations have valid metadata.
  • Automated Parsing: Provides the core logic for turning the Markdown-based content into structured JSON data.
  • Error Reporting: When a file's metadata is invalid, the script reports the errors and fails the CI build, which blocks the merge.
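
The error-reporting behavior listed above could look roughly like the following sketch: collect errors per file, print them all, and derive a process exit code that a CI job can act on. The `report` function, the file paths, and the error message are hypothetical and are not taken from the actual script.

```python
import sys
from typing import Dict, List


def report(errors_by_file: Dict[str, List[str]]) -> int:
    """Print every validation error and return a process exit code."""
    failed = {path: errs for path, errs in errors_by_file.items() if errs}
    for path, errs in sorted(failed.items()):
        for err in errs:
            # One line per problem so the CI log is easy to scan.
            print(f"{path}: {err}", file=sys.stderr)
    if failed:
        print(f"{len(failed)} file(s) failed validation", file=sys.stderr)
        return 1
    print("All files passed validation")
    return 0


# Hypothetical results; a real run would collect these from the validator.
results = {
    "techniques/example-a/README.md": [],
    "techniques/example-b/README.md": ["'id' is a required property"],
}
exit_code = report(results)
```

A script entry point would typically end with `sys.exit(report(results))`, since a non-zero exit status is what causes a GitHub Actions step to fail.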

Benefits

This PR provides the critical feedback loop for the data contract. It:

  • Ensures that technique and mitigation metadata in the repository conforms to the schemas and stays consistent.
  • Automates the validation process, removing the need for manual review of metadata.
  • Lays the foundation for all future tooling (dashboards, APIs, etc.) by providing a reliable way to generate structured JSON from the content.

Multi-PR Roadmap

Checklist

  • DCO sign-off included in commits (git commit -s)
  • CI workflow triggers correctly on pull requests to main.
  • Parser script successfully validates all existing technique and mitigation files.
  • requirements.txt is complete.
  • Script handles file read/write operations and directory creation.
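
For the last checklist item, a minimal sketch of writing an index file while creating any missing parent directories is shown below. The helper name `write_index`, the `data/` output location, and the sample record are assumptions for illustration only.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def write_index(path: Path, records: List[Dict[str, Any]]) -> None:
    """Write records as a JSON array, creating parent directories first."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        json.dump(records, fh, indent=2, ensure_ascii=False)
        fh.write("\n")


# Demonstrate against a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "data" / "techniques.json"  # hypothetical location
    write_index(out, [{"id": "SAFE-T0000", "name": "Example technique"}])
    loaded = json.loads(out.read_text(encoding="utf-8"))
```

Using `exist_ok=True` keeps the call idempotent, so repeated CI runs do not fail when the output directory already exists.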

Related Issues

Related: #48 - Create JSON/YAML index of SAFE-MCP techniques

Add comprehensive JSON Schema (Draft 7) definitions to establish the data
structure contract for SAFE-MCP techniques, mitigations, and tactics.

This enables:
- Automated tooling integration
- Programmatic data access
- Validation and consistency checking
- Type-safe development

Schemas added:
- schemas/technique-schema.json (557 lines)
  Covers attack techniques with metadata, impact assessment, detection
  methods, mitigations, and MITRE ATT&CK mappings

- schemas/mitigation-schema.json (399 lines)
  Covers security controls with implementation details, deployment
  considerations, and effectiveness ratings

- schemas/tactic-schema.json (45 lines)
  Covers MITRE ATT&CK-aligned tactics

Key features:
- Required fields enforce core metadata presence
- Enum values provide controlled vocabularies
- Pattern matching validates ID formats (SAFE-T####, SAFE-M-#)
- Extensible design allows future additions
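
To make the first three features concrete, here is a small standard-library sketch of how required fields, enum values, and an ID pattern constrain a record. It is a hand-rolled stand-in, not the jsonschema library and not the real schema: the reduced schema, the `severity` field with its values, and the regex reading of SAFE-T#### as "SAFE-T plus four digits" are all assumptions.

```python
import re
from typing import Any, Dict, List

# Hypothetical, heavily reduced schema; the real definitions live in
# schemas/technique-schema.json and are far more detailed.
TECHNIQUE_SCHEMA = {
    "required": ["id", "name", "severity"],
    "properties": {
        "id": {"pattern": r"^SAFE-T\d{4}$"},  # assumed reading of SAFE-T####
        "severity": {"enum": ["low", "medium", "high", "critical"]},
    },
}


def check(record: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    for field in schema.get("required", []):
        if field not in record:
            errors.append(f"'{field}' is a required property")
    for field, rules in schema.get("properties", {}).items():
        if field not in record:
            continue
        value = record[field]
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"'{field}': {value!r} is not an allowed value")
        # As in JSON Schema, 'pattern' only constrains string values.
        if "pattern" in rules and isinstance(value, str):
            if not re.search(rules["pattern"], value):
                errors.append(f"'{field}': {value!r} does not match the ID format")
    return errors


good = check({"id": "SAFE-T1001", "name": "x", "severity": "high"}, TECHNIQUE_SCHEMA)
bad = check({"id": "T1001", "severity": "urgent"}, TECHNIQUE_SCHEMA)
```

Here `good` is empty, while `bad` collects three problems: the missing `name`, the malformed `id`, and the unknown `severity` value.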

Related: safe-agentic-framework#48

Next PRs will add:
- Parser tooling (markdown → JSON)
- CI automation via GitHub Actions
- TOON format for LLM optimization
- Documentation and integration guides

Signed-off-by: Emrick Donadei <emrick.donadei@gmail.com>

Add comprehensive tooling to parse markdown files into structured JSON format
and automate generation via GitHub Actions.

Parser Features:
- scripts/parse_markdown.py (872 lines)
  Parses all techniques and mitigations from markdown to JSON
  Extracts complete data including metadata, descriptions, attack vectors,
  impact assessments, detection methods, mitigations, and version history

- scripts/validate_schema.py (107 lines)
  Validates generated JSON against schema definitions
  Provides detailed error reporting for data quality issues

CI/CD Automation:
- .github/workflows/generate-schema.yml
  Triggers on markdown changes, script updates, or manual dispatch
  Generates JSON index from markdown sources
  Validates against schemas (non-blocking)
  Uploads artifacts for PRs
  Publishes to GitHub Releases on main branch push
  Posts statistics as PR comments

Key Features:
- Markdown remains source of truth
- Generated JSON excluded from git (recreated by CI)
- Data distributed via GitHub Releases (tag: latest-data)
- Comprehensive parsing with error handling
- Support for Sigma detection rules, test files, and references

Generated data structure:
- 14 tactics parsed from README
- 31 techniques with full metadata
- 35 mitigations with implementation details
- ~1.3 MB JSON output

Updated .gitignore to exclude generated files while preserving
directory structure with data/.gitkeep

Related: safe-agentic-framework#48

Next PR will add:
- TOON format generator for LLM optimization

Signed-off-by: Emrick Donadei <emrick.donadei@gmail.com>