
PROVE: Probabilistic Reasoning Over Visual Evidence

Neuro-symbolic visual question answering using agentic evidence collection and probabilistic logic programming.


Installation

Requirements

  • Python 3.9+
  • CUDA-compatible GPU (recommended: 24GB+ VRAM)
  • AWS Bedrock access (Llama 3.3 70B or other supported LLMs)

Setup

# Clone repository
git clone https://github.com/your-repo/PROVE.git
cd PROVE

# Install dependencies
pip install -r requirements.txt

# Configure AWS credentials for Bedrock
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-west-2

Quick Start

Run on a Single Example

# Random NLVR2 test1 example
python src/eval/run_example.py

# Specific NLVR2 example
python src/eval/run_example.py --identifier test1-366-0-0

# GQA or VQAv2 example
python src/eval/run_example.py --dataset gqa
python src/eval/run_example.py --dataset vqav2

# With logging
python src/eval/run_example.py --save-logs

Programmatic Usage

from src import PROVE

model = PROVE(threshold=0.5)

# Paired images (NLVR2)
result = model.predict(
    {"image_a": "img1.jpg", "image_b": "img2.jpg"},
    "Is there a white bird on top of another animal in both images?"
)

# Single image (GQA, VQAv2)
result = model.predict(
    {"image_a": "photo.jpg"},
    "Is the dog sitting on the couch?"
)

print(result.probabilistic.final_answer)   # True or False
print(result.deterministic.final_answer)   # True or False

Architecture

Question + Images → Detection → Agent (Perceive/Verify) → ProbLog → Answer
                        ↓              ↓                      ↓
                   Entities    Probabilistic Evidence    True/False

Key Principle: Pass the question directly to a ReAct agent that collects visual evidence through investigation and verification, then compose results through probabilistic logic programming.


Pipeline (3 Steps)

Step 1: Object Detection

Purpose: Detect entities mentioned in the question

Process:

  1. Entity Extraction: Llama 3.3 70B extracts nouns from the question (e.g., ["bird", "buffalo"])
  2. Open Vocabulary Detection: Florence-2 detects each entity with bounding boxes
  3. Calibration: Anchored sigmoid transforms raw scores to operational probabilities

Output: ObjectDetection(object_id, label, bbox, confidence) per entity
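The calibration step can be sketched as follows. This is a minimal illustration of an anchored sigmoid, not the repository's actual implementation; the `anchor` and `scale` constants are hypothetical.

```python
import math

def anchored_sigmoid(score: float, anchor: float = 0.3, scale: float = 10.0) -> float:
    """Map a raw detector score to an operational probability.

    A score equal to `anchor` maps to p = 0.5; `scale` controls how
    sharply scores above/below the anchor saturate toward 1.0/0.0.
    Both constants here are illustrative, not PROVE's actual values.
    """
    return 1.0 / (1.0 + math.exp(-scale * (score - anchor)))
```

Anchoring lets a detector whose raw scores cluster in an arbitrary range still produce usable probabilities for ProbLog.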


Step 2: Evidence Collection

Purpose: Collect probabilistic evidence through agentic VLM reasoning

Architecture: ReAct agent loop (max 20 iterations)

Agent Actions (Pydantic-validated):

| Action | Purpose | Returns |
| --- | --- | --- |
| perceive | Ask open-ended question about entity | Text answer (context gathering) |
| verify_attribute | Check if entity has specific attribute | Probability (BLIP-ITM + Qwen logits) |
| verify_relationship | Check spatial relationship between entities | Probability (BLIP-ITM + Qwen logits) |
| verify_count | Count objects of a class | Poisson-Binomial distribution |
| done | Evidence collection complete | — |
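The repository validates these actions with Pydantic; the stdlib dataclass below is a hedged stand-in sketching the payload shape of one action type (field names follow the prompt example, but the class itself is illustrative).

```python
from dataclasses import dataclass

# Hedged stand-in for the repository's Pydantic action models:
# sketches the payload of a single verify_attribute action.
@dataclass(frozen=True)
class VerifyAttribute:
    image_id: str    # e.g., "image_a"
    entity_id: str   # e.g., "bird_a_1"
    attribute: str   # e.g., "color"
    value: str       # e.g., "white"
```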

Agent Prompt Structure:

QUESTION: "Is there a white bird on top of another animal in both images?"

DETECTED OBJECTS:
Image A, image_id: image_a
  - object_id: buffalo_a_0, object_class: buffalo
  - object_id: bird_a_1, object_class: bird

Image B, image_id: image_b
  - object_id: cow_b_0, object_class: cow

ACTION HISTORY:
[Turn 1]
Thought: I need to check if the bird in image A is white
Action: verify_attribute(image_id=image_a, entity_id=bird_a_1, attribute=color, value=white)
Result: p=0.787

Evidence Types:

  1. Attributes: Dual-model verification on cropped entity — BLIP-ITM score + Qwen VL logit probability (e.g., "an orange dog")
  2. Relationships: Dual-model verification on union bbox (e.g., "a bird on top of a buffalo")
  3. Counts: Poisson-Binomial distribution from detection confidences

Output: EvidenceCollection(attributes, relationships, counts, action_history)


Step 3: ProbLog Reasoning

Purpose: Execute probabilistic logic to compute answer probability

Process:

  1. Build ProbLog facts from collected evidence
  2. LLM generates rules and query matching the question
  3. Execute ProbLog program
  4. Return probability and convert to True/False

Dual Mode Execution:

  • Probabilistic: Original probabilities preserved (e.g., 0.874, 0.623)
  • Deterministic: Thresholded (p < threshold → 0.0, p >= threshold → 1.0)

Example ProbLog Program:

% Facts
0.874::entity(image_a, buffalo_a_0, buffalo).
0.938::entity(image_a, bird_a_1, bird).
0.906::relation(image_a, bird_a_1, buffalo_a_0, on_top_of).
0.787::attribute(image_a, bird_a_1, white).

% Generated rule
white_bird_on_animal(I) :-
    entity(I, B, bird),
    entity(I, A, buffalo),
    relation(I, B, A, on_top_of),
    attribute(I, B, white).

query(white_bird_on_animal(image_a)).
% Result: P=0.5847

Output: ModeResult(probability, final_answer, problog_program)
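Because the generated rule is a single conjunction over independent facts, the query probability in this example reduces to the product of the four fact weights. A quick sanity check (ProbLog computes this exactly; the small difference from the printed 0.5847 comes from rounding of the displayed weights):

```python
# Fact weights copied from the example program above
weights = [0.874, 0.938, 0.906, 0.787]

p = 1.0
for w in weights:
    p *= w  # conjunction of independent probabilistic facts
```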


Unified Execution Mode

PROVE runs both probabilistic and deterministic modes with shared evidence to isolate the effect of perception uncertainty.

How It Works

  1. Shared Evidence Collection: Detection and verification run ONCE with probabilistic confidences
  2. Dual Fact Generation: Same evidence generates two fact sets
  3. Dual ProbLog Execution: Same queries run against both fact sets
  4. Two Answers: Returns both probabilistic and deterministic final answers

Threshold Parameter

model = PROVE(threshold=0.5)  # Default
model = PROVE(threshold=0.7)  # More conservative

The threshold determines how probabilities map to binary values in deterministic mode:

  • p < threshold → 0.0 (false)
  • p >= threshold → 1.0 (true)
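The mapping above can be sketched as a one-line helper (the function name is illustrative, not the repository's API):

```python
def deterministic_weight(p: float, threshold: float = 0.5) -> float:
    """Hard 0/1 fact weight used in deterministic mode."""
    return 1.0 if p >= threshold else 0.0
```

With the default threshold, the evidence from the running example maps to 1.0 for p=0.787 and 0.0 for p=0.234.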

Models

| Model | Purpose | Notes |
| --- | --- | --- |
| Florence-2-large | Object detection | Open vocabulary, BF16 |
| LLM (AWS Bedrock) | Entity extraction, agent reasoning, rule generation | Llama 3.3 70B, Maverick, Mistral Large, etc. |
| BLIP-ITM-large | Attribute & relationship verification | Well-calibrated ITM head |
| Qwen-2.5-VL-7B | Perception + attribute/relationship verification | Open-ended + logit-based True/False scoring |

Data Structures

Evidence Collection:

EvidenceCollection
├── question: str
├── attributes: List[(entity_id, attr_class, value, prob)]
├── relationships: List[(subj_id, obj_id, relation, prob)]
├── counts: Dict[str, Dict[int, float]]
└── action_history: List[{thought, action, result}]

ProbLog Predicates:

entity(image_id, entity_id, category)
attribute(image_id, entity_id, value)
relation(image_id, subject_id, object_id, relation_type)
count(image_id, category, count_value)
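Turning collected evidence into weighted fact strings of these shapes can be sketched as below; the function name and tuple layouts are illustrative, not the repository's problog_builder API.

```python
# Hedged sketch: serialize evidence tuples into weighted ProbLog facts.
def build_facts(entities, attributes, relations):
    facts = []
    for image_id, entity_id, category, p in entities:
        facts.append(f"{p}::entity({image_id}, {entity_id}, {category}).")
    for image_id, entity_id, value, p in attributes:
        facts.append(f"{p}::attribute({image_id}, {entity_id}, {value}).")
    for image_id, subj, obj, rel, p in relations:
        facts.append(f"{p}::relation({image_id}, {subj}, {obj}, {rel}).")
    return "\n".join(facts)
```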

Unified Result:

UnifiedResult
├── threshold: float
├── shared: SharedEvidence
├── probabilistic: ModeResult
└── deterministic: ModeResult

Repository Structure

src/
├── prove.py                    # Main PROVE model class
├── __init__.py                 # Package exports
├── core/
│   ├── knowledge_base.py       # KB management
│   ├── model_manager.py        # Singleton model loading
│   ├── types.py                # Data structures
│   ├── probability.py          # Detector confidence calibration
│   └── image_utils.py          # Image loading utilities
├── language/
│   ├── llm_client.py           # LLM client (AWS Bedrock)
│   └── output_models.py        # Pydantic models for agent actions
├── pipeline/
│   ├── detector.py             # Question-based detection
│   ├── unified_agent.py        # ReAct evidence collection agent
│   ├── problog_builder.py      # Evidence to ProbLog facts
│   └── problog_executor.py     # ProbLog execution
├── vision/
│   ├── florence2.py            # Florence-2 wrapper
│   ├── blip_verifier.py        # BLIP-ITM verification
│   ├── qwen_vl.py              # Qwen VL model wrapper
│   └── qwen_verifier.py        # Qwen logit-based True/False verification
└── eval/
    ├── configs.json            # Scoring config presets (v5_perlm, shared_best)
    ├── run_eval.py             # Batch evaluation with dual-model scoring
    ├── run_example.py          # Run on a single example
    ├── run_prove_sweep.py      # Post-hoc scoring config sweep
    ├── analyze_subsets.py      # Subset analysis
    ├── problog_utils.py        # ProbLog helpers (semiring, fact rebuilding)
    └── eval_gaussian_ablation.py  # Gaussian noise ablation

Usage

Basic Usage

from src import PROVE

model = PROVE(threshold=0.5)

result = model.predict(
    {"image_a": "img1.jpg", "image_b": "img2.jpg"},
    "Are there more birds in image A than image B?"
)

print(f"Probabilistic: {result.probabilistic.final_answer}")
print(f"Deterministic: {result.deterministic.final_answer}")

With Logging

result = model.predict_with_details(
    image_paths={"image_a": "img1.jpg", "image_b": "img2.jpg"},
    question="Are there more birds in image A than image B?",
    save_logs=True,
    log_dir="logs"
)

# Access ProbLog programs
print(result.probabilistic.problog_program)
print(result.deterministic.problog_program)

Log Directory Structure:

logs/20250112_143022_abc123/
├── images/
│   ├── image_a.jpg
│   └── image_b.jpg
├── probabilistic.pl
├── deterministic.pl
└── results.json

Example Output

Question: "Is there a white bird on top of another animal in both images?"
Threshold: 0.5

Step 1: Object Detection...
  image_a: 2 objects detected
  image_b: 2 objects detected

Step 2: Evidence Collection...
  [Verify Attribute] bird_a_1.color=white
    → p=0.787
  [Verify Relationship] bird_a_1 on_top_of buffalo_a_0
    → p=0.906
  [Verify Attribute] bird_b_0.color=white
    → p=0.234

Step 3: ProbLog Reasoning (dual mode)...

============================================================
RESULTS SUMMARY
============================================================

Probabilistic Mode:
  Probability: 0.167
  → Final Answer: False

Deterministic Mode (threshold=0.5):
  Probability: 0.000
  → Final Answer: False

Modes AGREE
============================================================

Key Technical Details

Dual-Model Verification

Both BLIP-ITM and Qwen VL scores are collected for every verification action. The scoring config (selected post-hoc) determines which scores are used.

BLIP-ITM: Image-text matching score on cropped region

cropped = crop_with_padding(image, bbox, padding=0.15)
prompt = f"a {attr_value} {object_class}"  # e.g., "a white bird"
probability = softmax(model(cropped, prompt).itm_score)[1]

Qwen VL: Logit-based True/False probability

statement = f"The {object_class} is {attr_value}"  # "The cat is orange"
probability = softmax([logits["True"], logits["False"]])[0]  # P(True)
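Restricting the softmax to just the two answer tokens can be written out explicitly; this is a minimal sketch of that scoring step, with the max-subtraction added for numerical stability.

```python
import math

def true_probability(logit_true: float, logit_false: float) -> float:
    """Softmax over only the "True"/"False" logits; returns P(True)."""
    m = max(logit_true, logit_false)  # subtract max for numerical stability
    e_t = math.exp(logit_true - m)
    e_f = math.exp(logit_false - m)
    return e_t / (e_t + e_f)
```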

Poisson-Binomial Counting

Computes probability distribution over counts from detection confidences:

Detections: [0.9, 0.8, 0.7]
Distribution: {0: 0.006, 1: 0.092, 2: 0.398, 3: 0.504}
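The distribution above can be computed with a standard dynamic-programming pass over the confidences; this sketch reproduces the example numbers.

```python
def poisson_binomial(confidences):
    """Distribution over the number of true detections, assuming each
    detection is independently real with its stated confidence."""
    dist = [1.0]  # with no detections considered, P(count = 0) = 1
    for p in confidences:
        new = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            new[k] += q * (1 - p)   # this detection is a false positive
            new[k + 1] += q * p     # this detection is real
        dist = new
    return {k: round(v, 3) for k, v in enumerate(dist)}
```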

ReAct Agent Loop

Pattern: Think → Act → Observe

  1. Agent sees: question, detected objects, action history
  2. Agent outputs: thought + action (Pydantic-validated)
  3. Execute action and record result
  4. Repeat until done or max iterations
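The loop above can be sketched as follows. All names here (Step, decide, execute) are hypothetical stand-ins for the repository's agent interfaces, and the toy run at the bottom uses the evidence values from the running example.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    thought: str
    action: str              # e.g., "verify_attribute" or "done"
    args: dict = field(default_factory=dict)

def react_loop(decide: Callable[[list], Step],
               execute: Callable[[Step], object],
               max_iterations: int = 20) -> list:
    """Think → Act → Observe until the agent emits done or the budget runs out."""
    history = []
    for _ in range(max_iterations):
        step = decide(history)            # Think: agent proposes the next action
        if step.action == "done":         # Stop when evidence collection is complete
            break
        result = execute(step)            # Act + Observe: e.g., a verification probability
        history.append({"thought": step.thought, "action": step.action,
                        "args": step.args, "result": result})
    return history

# Toy run: verify one attribute, then stop.
def toy_decide(history):
    if not history:
        return Step("check bird colour", "verify_attribute",
                    {"entity_id": "bird_a_1", "attribute": "color", "value": "white"})
    return Step("enough evidence", "done")

history = react_loop(toy_decide, lambda step: 0.787)
```

The `max_iterations=20` budget mirrors the agent's cap noted in Step 2.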

Summary

PROVE transforms visual questions into probabilistic answers through:

  1. Detection: Question-guided object detection
  2. Agentic Evidence: ReAct agent collects verification evidence
  3. Probabilistic Logic: ProbLog composes evidence mathematically

Key Innovation: Neuro-symbolic architecture combining dual-model neural perception (BLIP-ITM + Qwen VL) with symbolic reasoning (ProbLog) via agentic orchestration.
