feat: Add Markdown parser and CI validation workflow by edonadei · Pull Request #101 · safe-agentic-framework/safe-mcp

edonadei · 2025-11-04T03:55:24Z

Summary

This PR operationalizes the JSON schemas established in PR #<NUMBER_OF_FIRST_PR> by introducing a Python-based Markdown parser and an automated GitHub Actions CI workflow.

The core script (scripts/schema_parser.py) reads the YAML frontmatter from all technique and mitigation Markdown files, validates their structure against the corresponding JSON schemas, and (on success) generates complete techniques.json and mitigations.json index files.
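As a rough illustration of the first step described above, the following sketch separates a YAML frontmatter block from the Markdown body using only the standard library. The function name `split_frontmatter` and the assumption that frontmatter is delimited by `---` lines are illustrative and not taken from scripts/schema_parser.py; the returned string would typically then be passed to PyYAML's `yaml.safe_load`.

```python
from typing import Optional, Tuple


def split_frontmatter(text: str) -> Tuple[Optional[str], str]:
    """Split a Markdown document into (frontmatter_yaml, body).

    Hypothetical helper: assumes frontmatter is fenced by '---' lines.
    Returns (None, text) when no complete frontmatter block is found.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None, text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            frontmatter = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return frontmatter, body
    # Opening fence without a closing one: treat as no frontmatter.
    return None, text
```

Returning `None` rather than raising lets the caller decide whether a missing frontmatter block is an error or simply a file to skip.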

The CI workflow (.github/workflows/main.yml) runs this parser on every pull request to main, ensuring that all contributed content is valid before it can be merged.

This is Part 2 of a multi-PR initiative to address issue #48.

Type of Contribution

Infrastructure/Tooling
New feature

What's Included

scripts/schema_parser.py
- A new Python script that:
  - Loads the JSON schemas introduced earlier in this series (the Schema definitions PR, #100 in the roadmap below).
  - Validates the schemas themselves against the Draft 7 standard.
  - Parses the YAML frontmatter from all .md files in the techniques/ and mitigations/ directories.
  - Validates the extracted frontmatter against the loaded schemas.
  - Generates output/techniques.json and output/mitigations.json.
.github/workflows/main.yml
- A new GitHub Actions workflow that:
  - Triggers on push and pull_request events to the main branch.
  - Sets up Python and installs dependencies from requirements.txt.
  - Runs the schema_parser.py script to validate all content.
requirements.txt
- Adds Python dependencies: jsonschema (for validation) and PyYAML (for parsing frontmatter).
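The validation steps listed for the parser can be sketched with the jsonschema library roughly as follows. The schema here is a heavily reduced, hypothetical stand-in for the real files under schemas/, and `validate_frontmatter` is an illustrative name, not the function used in the PR.

```python
from typing import Dict, List

from jsonschema import Draft7Validator

# Hypothetical, minimal schema for illustration only.
TECHNIQUE_SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "required": ["id", "name"],
    "properties": {
        "id": {"type": "string", "pattern": "^SAFE-T[0-9]{4}$"},
        "name": {"type": "string", "minLength": 1},
    },
}


def validate_frontmatter(metadata: Dict, schema: Dict) -> List[str]:
    """Return readable error strings; an empty list means the metadata is valid."""
    # Raises jsonschema.exceptions.SchemaError if the schema itself
    # is not a valid Draft 7 schema.
    Draft7Validator.check_schema(schema)
    validator = Draft7Validator(schema)
    errors = []
    for err in sorted(
        validator.iter_errors(metadata),
        key=lambda e: [str(p) for p in e.absolute_path],
    ):
        location = "/".join(str(p) for p in err.absolute_path) or "<root>"
        errors.append(f"{location}: {err.message}")
    return errors
```

Collecting every error with `iter_errors` instead of stopping at the first one means a contributor sees all problems in a file in a single CI run.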

Key Features

CI Validation: All pull requests are now automatically checked for schema compliance.
Data Integrity: Enforces the data contract, ensuring all techniques and mitigations have valid metadata.
Automated Parsing: Provides the core logic for turning the Markdown-based content into structured JSON data.
Error Reporting: The script will fail the CI build and report errors if a file's metadata is invalid, blocking bad merges.
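One plausible way to get the fail-the-build behavior described above is to collect errors per file and turn them into a process exit code, since a GitHub Actions step fails when its command exits non-zero. The `report` function and the shape of its input are assumptions made for this sketch.

```python
import sys
from typing import Dict, List


def report(results: Dict[str, List[str]]) -> int:
    """Print per-file errors to stderr and return a process exit code.

    `results` maps a file path to its list of validation errors
    (hypothetical structure; an empty list means the file is valid).
    """
    failed = {path: errs for path, errs in results.items() if errs}
    for path, errs in sorted(failed.items()):
        for msg in errs:
            print(f"ERROR {path}: {msg}", file=sys.stderr)
    print(f"{len(results) - len(failed)}/{len(results)} files valid")
    # Non-zero exit code makes the CI step, and therefore the check, fail.
    return 1 if failed else 0
```

A script would typically end with `sys.exit(report(results))` so that the exit code reaches the CI runner.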

Benefits

This PR provides the critical feedback loop for the data contract. It:

Ensures that technique and mitigation metadata merged into main conforms to the JSON schemas.
Automates metadata validation, reducing the need for manual review of frontmatter.
Lays the foundation for all future tooling (dashboards, APIs, etc.) by providing a reliable way to generate structured JSON from the content.

Multi-PR Roadmap

feat: add JSON Schema definitions for programmatic access #100 : Schema definitions
feat: Add Markdown parser and CI validation workflow #101 (this) : Markdown parser + CI automation (GitHub Actions)
Not yet: TOON token-efficient format generator for LLM optimization
Not yet: Documentation and README updates

Checklist

DCO sign-off included in commits (git commit -s)
CI workflow triggers correctly on pull requests to main.
Parser script successfully validates all existing technique and mitigation files.
requirements.txt is complete.
Script handles file read/write operations and directory creation.
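The last checklist item (file writes and directory creation) might look something like the sketch below. `write_index` is a hypothetical helper, and sorting records by `id` is an assumption added here to keep the output stable between runs.

```python
import json
from pathlib import Path
from typing import Dict, List


def write_index(records: List[Dict], out_path: Path) -> Path:
    """Write validated records to a JSON index file, creating parent dirs."""
    # Create e.g. output/ if it does not exist yet; no error if it does.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: sort by "id" so regenerated files have a stable order.
    ordered = sorted(records, key=lambda r: r.get("id", ""))
    with out_path.open("w", encoding="utf-8") as fh:
        json.dump(ordered, fh, indent=2, ensure_ascii=False)
        fh.write("\n")
    return out_path
```

For example, `write_index(techniques, Path("output/techniques.json"))` would produce the techniques index named in the PR description.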

Related Issues

Related: #48 - Create JSON/YAML index of SAFE-MCP techniques

Add comprehensive JSON Schema (Draft 7) definitions to establish the data structure contract for SAFE-MCP techniques, mitigations, and tactics.

This enables:
- Automated tooling integration
- Programmatic data access
- Validation and consistency checking
- Type-safe development

Schemas added:
- schemas/technique-schema.json (557 lines)
  Covers attack techniques with metadata, impact assessment, detection methods, mitigations, and MITRE ATT&CK mappings
- schemas/mitigation-schema.json (399 lines)
  Covers security controls with implementation details, deployment considerations, and effectiveness ratings
- schemas/tactic-schema.json (45 lines)
  Covers MITRE ATT&CK-aligned tactics

Key features:
- Required fields enforce core metadata presence
- Enum values provide controlled vocabularies
- Pattern matching validates ID formats (SAFE-T####, SAFE-M-#)
- Extensible design allows future additions

Related: safe-agentic-framework#48

Next PRs will add:
- Parser tooling (markdown → JSON)
- CI automation via GitHub Actions
- TOON format for LLM optimization
- Documentation and integration guides

Signed-off-by: Emrick Donadei <emrick.donadei@gmail.com>
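The ID formats mentioned in the commit message above (SAFE-T####, SAFE-M-#) could be checked with regular expressions along these lines. The exact patterns in the schemas are not shown in this PR, so the ones below are assumptions: four digits for techniques, one or more digits for mitigations.

```python
import re
from typing import Optional

# Assumed patterns; the authoritative ones live in the schema files.
TECHNIQUE_ID = re.compile(r"SAFE-T\d{4}")
MITIGATION_ID = re.compile(r"SAFE-M-\d+")


def classify_id(identifier: str) -> Optional[str]:
    """Return 'technique', 'mitigation', or None for an unrecognized ID."""
    # fullmatch requires the whole string to match, not just a prefix.
    if TECHNIQUE_ID.fullmatch(identifier):
        return "technique"
    if MITIGATION_ID.fullmatch(identifier):
        return "mitigation"
    return None
```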

Add comprehensive tooling to parse markdown files into structured JSON format and automate generation via GitHub Actions.

Parser Features:
- scripts/parse_markdown.py (872 lines)
  Parses all techniques and mitigations from markdown to JSON
  Extracts complete data including metadata, descriptions, attack vectors, impact assessments, detection methods, mitigations, and version history
- scripts/validate_schema.py (107 lines)
  Validates generated JSON against schema definitions
  Provides detailed error reporting for data quality issues

CI/CD Automation:
- .github/workflows/generate-schema.yml
  Triggers on markdown changes, script updates, or manual dispatch
  Generates JSON index from markdown sources
  Validates against schemas (non-blocking)
  Uploads artifacts for PRs
  Publishes to GitHub Releases on main branch push
  Posts statistics as PR comments

Key Features:
- Markdown remains source of truth
- Generated JSON excluded from git (recreated by CI)
- Data distributed via GitHub Releases (tag: latest-data)
- Comprehensive parsing with error handling
- Support for Sigma detection rules, test files, and references

Generated data structure:
- 14 tactics parsed from README
- 31 techniques with full metadata
- 35 mitigations with implementation details
- ~1.3 MB JSON output

Updated .gitignore to exclude generated files while preserving directory structure with data/.gitkeep

Related: safe-agentic-framework#48

Next PR will add:
- TOON format generator for LLM optimization

Signed-off-by: Emrick Donadei <emrick.donadei@gmail.com>

edonadei added 2 commits November 3, 2025 22:34

edonadei mentioned this pull request Nov 4, 2025

feat: add JSON Schema definitions for programmatic access #100

Open

7 tasks

edonadei wants to merge 2 commits into safe-agentic-framework:main from edonadei:schema-parser
