Version-cv is a research-driven deep learning repository focused on mathematical problem solving, image recognition, and mathematical reasoning in large language models (LLMs). It builds on the foundation of version-tab, which emphasizes mathematical symbolic reasoning, math-based vectorization, and tabular LLM development.
Extending this work into the visual domain, version-cv targets vision-based tasks and multimodal understanding. It integrates PyFlink for distributed data processing, Apache Atlas for metadata and lineage tracking, Apache Airflow for workflow orchestration, PyArrow for efficient in-memory columnar data interchange, and Mojo for high-performance ML/DL development. Together, these technologies are intended to enable scalable, reproducible research across structured and unstructured data pipelines.
Due to time constraints during the project's development, many of these tools were not fully leveraged, but they are included as a contribution to the open-source community for continued research, experimentation, and advancement in this space.
Research Publications/References:
- Gervais et al., MathWriting: A Dataset for Handwritten Mathematical Expression Recognition
- Saxton et al., Analyzing Mathematical Reasoning Abilities of Neural Models
See the Research & References section below for the broader scope of research behind this project.
Initial benchmarks were considered but not implemented because of time constraints on research and builds. The benchmark materials that were considered are provided in the data/ and docs/ directories.
version-cv/
├── cloud
├── data
├── docs
├── models
├── notebooks
├── sandbox
├── .gitattributes
├── .gitignore
├── CITATION.cff
├── LICENSE
├── README.md
├── install_pixi.sh
├── pixi.lock
└── pixi.toml
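As a quick sanity check after cloning, the layout above can be verified with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name `missing_entries` is hypothetical, and the expected entries are copied from the tree above.

```python
from pathlib import Path

# Entries copied from the directory tree above.
EXPECTED_DIRS = ["cloud", "data", "docs", "models", "notebooks", "sandbox"]
EXPECTED_FILES = ["CITATION.cff", "LICENSE", "README.md",
                  "install_pixi.sh", "pixi.lock", "pixi.toml"]


def missing_entries(repo_root: Path) -> list[str]:
    """Return the expected directories and files that are absent under repo_root."""
    # Directories are reported with a trailing slash to tell them apart from files.
    missing = [d + "/" for d in EXPECTED_DIRS if not (repo_root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (repo_root / f).is_file()]
    return missing


if __name__ == "__main__":
    # Assumes the script is run from the root of a version-cv checkout.
    gaps = missing_entries(Path.cwd())
    print("layout OK" if not gaps else "missing: " + ", ".join(gaps))
```

An empty result means every directory and file listed in the tree is present.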
This project is built with Pixi to manage environments and Python dependencies.
# Install Pixi if not already installed
curl -sSf https://pixi.sh/install.sh | bash
# Or run the installation script
./install_pixi.sh
# Initialize Pixi (creates pixi.toml; only needed for a new project, since this repository already ships pixi.toml and pixi.lock)
pixi init
# Install dependencies
pixi install
# Enter Pixi environment
pixi shell
# Pixi environment information
pixi info
On machines with low compute these may not run as fast. Run Jupyter notebooks to see how models were designed.
python models/basemodel.py
jupyter lab
pixi run python models/basemodel.py
pixi run jupyter lab
- Place your images and data in the data/ directory (e.g., data/handwriting, data/formulas).
- Run training/inference:
python models/basemodel.py
Jupyter Lab:
jupyter lab
Run Jupyter notebooks basemodel.ipynb or mathwriting.ipynb from the notebooks/ folder for exploratory workflows.
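Before running training or inference, it can help to confirm that the data landed where the steps above expect it. The sketch below counts image files in each subfolder of data/. The function name `count_images` and the list of file extensions are assumptions for illustration; the repository may accept other formats.

```python
from pathlib import Path

# Assumption: these extensions cover the image formats in use; adjust as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def count_images(data_dir: Path) -> dict[str, int]:
    """Count image files (searched recursively) in each immediate subdirectory of data_dir."""
    counts = {}
    for sub in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        counts[sub.name] = sum(
            1 for f in sub.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    data_dir = Path("data")
    if data_dir.is_dir():
        for name, n in count_images(data_dir).items():
            print(f"data/{name}: {n} image(s)")
    else:
        print("data/ not found; run this from the repository root")
```

A count of zero for a folder such as data/handwriting suggests the images were placed elsewhere or use an extension outside the assumed set.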
- Gervais et al., MathWriting: A Dataset for Handwritten Mathematical Expression Recognition, arXiv:2404.10690
- Saxton et al., Analyzing Mathematical Reasoning Abilities of Neural Models, arXiv:1904.01557
- OpenAI, Improving Mathematical Reasoning with Process Supervision (2023), blog post
- Hendrycks et al., Measuring Mathematical Problem Solving With the MATH Dataset, arXiv:2103.03874
Additional implementation notes are in docs/ and data usage info is in data/.
version-cv uses two security and dependency-auditing tools: Bandit and pip-audit.
Security and reproducibility improvements are important and welcome via pull requests.
Bandit is a static analysis tool used to identify common security issues in Python code.
To run manually:
bandit -r models/ notebooks/
pip-audit is a tool for scanning Python dependencies and packages in your environment for vulnerabilities.
To run:
pixi run pip-audit
Or:
pip-audit
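The two checks above can also be driven from a single Python script, for example as a local pre-commit step. This is a sketch under stated assumptions: `available_scans` and `run_scans` are hypothetical helpers, the commands are the ones shown above, and a non-zero exit code from either tool is treated as a failure.

```python
import shutil
import subprocess

# Commands copied from the instructions above.
SCANS = {
    "bandit": ["bandit", "-r", "models/", "notebooks/"],
    "pip-audit": ["pip-audit"],
}


def available_scans(which=shutil.which) -> dict[str, list[str]]:
    """Return the scans whose executable can be found on PATH."""
    return {name: cmd for name, cmd in SCANS.items() if which(cmd[0])}


def run_scans() -> int:
    """Run every available scan and return how many exited with a non-zero code."""
    failures = 0
    for name, cmd in available_scans().items():
        print(f"running {name}: {' '.join(cmd)}")
        if subprocess.run(cmd).returncode != 0:
            failures += 1
    return failures
```

Calling `run_scans()` from the repository root, inside the Pixi environment, runs whichever of the two tools is installed and skips the other.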