rcghpge/version-cv

version-cv

version-cv is a research-driven deep learning repository focused on mathematical problem solving, image recognition, and mathematical reasoning in large language models (LLMs). It builds on version-tab, which emphasizes symbolic mathematical reasoning, math-based vectorization, and tabular LLM development.

Extending this work into the visual domain, version-cv targets vision-based tasks and multimodal understanding. It integrates PyFlink for distributed data processing, Apache Atlas for metadata and lineage tracking, Apache Airflow for workflow orchestration, PyArrow for efficient in-memory columnar data interchange, and Mojo for high-performance ML/DL development. Together, these technologies are intended to enable scalable, reproducible research across structured and unstructured data pipelines.

Because of time constraints during development, many of these tools were not fully leveraged. They remain in the repository as a contribution to the open-source community for continued research, experimentation, and advancement in this space.

Research Publications/References:

See the Research & References section below for the broader research context of this project.


📊 Key Datasets

Initial benchmarks were considered but not implemented because of time constraints on research and builds. They are provided in the data/ and docs/ directories.


📁 Project Structure

version-cv/
├── cloud
├── data
├── docs
├── models
├── notebooks
├── sandbox
├── .gitattributes
├── .gitignore
├── CITATION.cff
├── LICENSE
├── README.md
├── install_pixi.sh
├── pixi.lock
└── pixi.toml

⚡ Setup

This project uses Pixi to manage environments and Python dependencies.

# Install Pixi if it is not already installed
curl -fsSL https://pixi.sh/install.sh | bash

# Or run the installation script
./install_pixi.sh

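# Only when starting a brand-new project: `pixi init` creates pixi.toml.
# Skip this in a clone of version-cv, which already ships pixi.toml and pixi.lock.
# pixi init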

# Install dependencies (resolves against pixi.lock)
pixi install

# Enter Pixi environment
pixi shell

# Show Pixi environment information
pixi info
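
Before running the commands above, it can help to confirm that the checkout is ready for Pixi. The following is a minimal sketch, not part of the repository: the helper name check_pixi_setup is hypothetical, and it only checks that a pixi executable is on PATH and that pixi.toml and pixi.lock exist in the given directory.

```python
"""Preflight check for a Pixi-managed checkout (illustrative sketch)."""
import shutil
from pathlib import Path


def check_pixi_setup(repo_root="."):
    """Return a list of human-readable problems; an empty list means ready.

    check_pixi_setup is a hypothetical helper, not something shipped in
    version-cv.
    """
    root = Path(repo_root)
    problems = []
    # Pixi must be installed and discoverable on PATH.
    if shutil.which("pixi") is None:
        problems.append("pixi executable not found on PATH")
    # Both files are listed in the project structure of this README.
    for name in ("pixi.toml", "pixi.lock"):
        if not (root / name).is_file():
            problems.append(f"missing {name} in {root}")
    return problems


if __name__ == "__main__":
    issues = check_pixi_setup()
    print("ready" if not issues else "\n".join(issues))
```

Run it from the repository root; any line it prints other than "ready" names something to fix before pixi install.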

🚀 Quick Start

On machines with limited compute, the model script and notebooks may run slowly. Open the Jupyter notebooks to see how the models were designed.

Option 1: Inside the Pixi shell (after running pixi shell)

python models/basemodel.py
jupyter lab

Option 2: One-liners via pixi run (no shell activation needed)

pixi run python models/basemodel.py
pixi run jupyter lab

📊 Running & Viewing Results

  1. Place your images and data in the data/ directory (e.g., data/handwriting, data/formulas).

  2. Run training/inference:

python models/basemodel.py
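
To check that files placed under data/ are laid out as expected before running the model, a small inventory script can help. This is a sketch under assumptions: index_images is a hypothetical helper, the extension list is a guess at common image formats, and nothing here reflects how models/basemodel.py actually loads its inputs.

```python
from collections import defaultdict
from pathlib import Path

# Assumed set of image extensions; adjust to whatever the datasets use.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def index_images(data_dir="data"):
    """Map each top-level subdirectory of data_dir to its sorted image paths.

    index_images is a hypothetical helper for illustration only.
    """
    root = Path(data_dir)
    if not root.is_dir():
        return {}
    index = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            rel = path.relative_to(root)
            # Images sitting directly in data_dir are grouped under ".".
            group = rel.parts[0] if len(rel.parts) > 1 else "."
            index[group].append(path)
    return dict(index)


if __name__ == "__main__":
    for group, paths in index_images().items():
        print(f"{group}: {len(paths)} image(s)")
```

With the example layout above, the script would print one line each for handwriting and formulas with the number of images found.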

🧪 Notebooks

Jupyter Lab:

jupyter lab

Open basemodel.ipynb or mathwriting.ipynb from the notebooks/ folder for exploratory workflows.


📃 Research & References

  • Gervais et al., MathWriting: A Dataset for Handwritten Mathematical Expression Recognition, arXiv:2404.10690
  • Saxton et al., Analyzing Mathematical Reasoning Abilities of Neural Models, arXiv:1904.01557
  • OpenAI, Improving Mathematical Reasoning with Process Supervision (2023), blog post
  • Hendrycks et al., Measuring Mathematical Problem Solving With the MATH Dataset, arXiv:2103.03874

Additional implementation notes are in docs/, and data usage information is in data/.


🛡️ Security Note

version-cv integrates two security tools: Bandit for static analysis of Python code and pip-audit for scanning Python dependencies. Security and reproducibility improvements are welcome via pull requests.

Bandit

Bandit is a static analysis tool that identifies common security issues in Python code.

To run manually:

bandit -r models/ notebooks/
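
When Bandit is run with its JSON formatter (for example by adding -f json -o bandit-report.json to the command above), the report can be summarized with a few lines of Python. This is a minimal sketch: the file name bandit-report.json and the helper names are assumptions for illustration, and the code relies only on the results list and the issue_severity field of Bandit's JSON output.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_bandit(report):
    """Count findings per severity level in a parsed Bandit JSON report."""
    results = report.get("results", [])
    return Counter(item.get("issue_severity", "UNDEFINED") for item in results)


def load_report(path):
    """Read a JSON report from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    report_path = Path("bandit-report.json")  # hypothetical output file name
    if report_path.is_file():
        counts = summarize_bandit(load_report(report_path))
        for severity, count in sorted(counts.items()):
            print(f"{severity}: {count}")
    else:
        print("no report found; run Bandit with -f json -o bandit-report.json first")
```

A per-severity count like this is one simple way to decide whether a change should block a pull request; the threshold itself is a project decision.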

pip-audit

pip-audit scans the Python packages installed in your environment for known vulnerabilities.

To run:

pixi run pip-audit

Or:

pip-audit
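
pip-audit can also write a JSON report (for example with -f json -o audit.json), which makes it easy to list only the affected packages. The sketch below assumes the report layout described in its comments (a dependencies list whose entries carry name and vulns); the helper name vulnerable_packages and the file name audit.json are hypothetical, so check the layout against the output of your installed pip-audit version.

```python
import json
from pathlib import Path


def vulnerable_packages(report):
    """Return {package name: [advisory ids]} for dependencies with findings."""
    # Assumed layout: {"dependencies": [{"name": ..., "vulns": [{"id": ...}]}]}.
    # A bare top-level list of dependency entries is accepted as well.
    deps = report.get("dependencies", []) if isinstance(report, dict) else report
    found = {}
    for dep in deps:
        ids = [vuln.get("id", "unknown") for vuln in dep.get("vulns", [])]
        if ids:
            found[dep.get("name", "unknown")] = ids
    return found


if __name__ == "__main__":
    report_path = Path("audit.json")  # hypothetical output file name
    if report_path.is_file():
        report = json.loads(report_path.read_text(encoding="utf-8"))
        for name, ids in sorted(vulnerable_packages(report).items()):
            print(f"{name}: {', '.join(ids)}")
    else:
        print("no report found; run pip-audit with -f json -o audit.json first")
```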

About

Machine Learning and Deep Learning @ UTA
