Welcome to evalverse_engine — your plug-and-play, multi-agent LLM framework built for
evaluating, extracting, and enhancing AI-generated outputs across text, JSON, audio, and more.
This isn't just another LLM wrapper. This is LangChain x CrewAI x Groq/OpenAI/ChatLite, reimagined into a
modular system with decorator-powered orchestration, automatic agent registration, and support for multi-modal I/O pipelines.
Orchestrate your whole crew with a single decorator. `@LLM_Agent` injects runtime logic to auto-register and execute agents, with zero boilerplate. Supports:
🌐 Groq, 🔓 OpenAI, and ⚡ ChatLite
🔁 Auto method injection: `run_<agent_name>()`
✅ Built-in `run_all()` for batch execution
Just slap this on your agent methods and boom: the agent is registered, initialized, and LLM-bound automatically.
```python
@LLM_Agent("json_extractor")
def json_extractor(self):
    return {
        "role": "Parser",
        "goal": "Extract valid JSON from noisy LLM outputs",
        "backstory": "Trained on corrupted prompts and StackOverflow answers.",
        "description": lambda: "Clean and parse the given blob to return valid JSON.",
        "expected_output": "JSON object",
    }
```
Agents get dynamically hooked with:
- LLMs (via LangChain)
- Custom toolsets (functions, extractors, converters, etc.)
- Backstories, roles, and goals (for context-aware interaction)
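The registration-plus-injection pattern described above can be sketched in plain Python. The names `LLM_Agent`, `LLM_Driver`, `run_<agent_name>()`, and `run_all()` follow this README; the body below is an illustrative assumption, not the actual implementation:

```python
# Minimal sketch of decorator-based agent registration (illustrative only).
_REGISTRY = {}

def LLM_Agent(name):
    """Register the decorated agent factory under `name`."""
    def decorator(func):
        _REGISTRY[name] = func
        return func
    return decorator

class LLM_Driver:
    def __getattr__(self, attr):
        # Auto-injected run_<agent_name>() methods resolve via the registry.
        if attr.startswith("run_") and attr[4:] in _REGISTRY:
            return lambda: _REGISTRY[attr[4:]](self)
        raise AttributeError(attr)

    def run_all(self):
        # Batch-execute every registered agent.
        return {name: factory(self) for name, factory in _REGISTRY.items()}

@LLM_Agent("json_extractor")
def json_extractor(driver):
    return {"role": "Parser", "goal": "Extract valid JSON"}
```

With this sketch, `LLM_Driver().run_json_extractor()` resolves dynamically even though no such method was ever written by hand.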
Extract clean text from PDFs. Comes with:
- `.to_lower()` / `.to_upper()` for post-processing
- `__getitem__` support to index characters directly
- `.append_front()` and `.append_back()` to modify text dynamically
- `.write_to_path()` to save output
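A hypothetical sketch of the text-utility API listed above; the real class in `pdf2text.py` and its internals may differ, this only illustrates the chained-call style:

```python
class ExtractedText:
    """Toy stand-in for the pdf2text result object (illustrative only)."""

    def __init__(self, text):
        self.text = text

    def to_lower(self):
        self.text = self.text.lower()
        return self  # return self so calls can be chained

    def to_upper(self):
        self.text = self.text.upper()
        return self

    def __getitem__(self, index):
        # Direct character access, e.g. t[0]
        return self.text[index]

    def append_front(self, prefix):
        self.text = prefix + self.text
        return self

    def append_back(self, suffix):
        self.text = self.text + suffix
        return self

    def write_to_path(self, path):
        with open(path, "w", encoding="utf-8") as f:
            f.write(self.text)
```

Usage: `ExtractedText("Resume").to_lower().append_front(">> ")` yields the text `">> resume"`.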
Record audio on the fly and transcribe using Whisper (Groq-powered). Perfect for converting voice notes to structured text.
```python
# Records 10 seconds of audio
record_audio("voice.wav", duration=10)

# Uses Whisper-large-v3 to transcribe
transcript = whisper_transcribe("voice.wav")
```
Extracts structured data from noisy LLM blobs. Includes:
- `extract_all_json_objects()`: returns a list of all detected JSON objects
- `extract_first_json_safe()`: only gets clean top-level JSON
- `extract_first_any_json()`: deeply scans for valid nested JSON
- `pretty_print_json()`: beautifies output for logs or responses
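As an illustration of the scanning approach such helpers typically use (an assumed implementation, not the actual `str2json.py` code), a noisy blob can be walked with `json.JSONDecoder.raw_decode`:

```python
import json

def extract_all_json_objects(blob):
    """Return every decodable top-level JSON object found in a noisy string."""
    decoder = json.JSONDecoder()
    results, i = [], 0
    while True:
        start = blob.find("{", i)
        if start == -1:
            return results
        try:
            obj, end = decoder.raw_decode(blob, start)
        except json.JSONDecodeError:
            i = start + 1  # not valid JSON here; keep scanning
        else:
            results.append(obj)
            i = end  # resume after the parsed object
```

`raw_decode` is handy here because it tolerates trailing junk after the object, which is exactly what chatty LLM output looks like.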
🔌 Plug in any LLM: Groq, OpenAI, ChatLite — all play nice.
🧩 Dynamic Agent Framework: No hardcoding, everything is modular.
📜 Config Driven: Easily switch models, keys, and endpoints via `llm_config.yaml`.
🧪 Perfect for Testing & Evaluation Pipelines: Especially in multi-agent flows.
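A plausible shape for `llm_config.yaml`; the key names and model string below are illustrative guesses, not the file's actual schema:

```yaml
# Illustrative llm_config.yaml shape (actual keys may differ)
provider: groq                      # groq | openai | chatlite
model: llama-3.1-70b-versatile      # example model id
api_key: "YOUR_API_KEY"             # placeholder, keep out of version control
endpoint: "https://api.groq.com/openai/v1"
```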
```
evalverse_engine/
├── core/
│   ├── driver.py          # Main orchestration logic (LLM_Driver, @LLM_Agent, config loader)
├── modules/
│   ├── pdf2text.py        # PDF text extractor with rich text utils
│   ├── str2json.py        # JSON cleanup from noisy blobs
│   ├── voice2text.py      # Whisper-based voice transcription
├── llm_config.yaml        # API keys + model settings
└── README.md              # You're reading it
```
Role: Extracts structured profile information from resumes or unstructured documents.
- Inputs: Raw candidate text (from PDF, etc.)
- Outputs: JSON with name, email, experience, skills, education, etc.
- Use Case: Resume screening, profile normalization
Role: Parses job descriptions to extract key technical and soft skill requirements.
- Inputs: Job description text
- Outputs: Structured JSON of job criteria
- Use Case: Job-candidate matching, interview prep
Role: Generates relevant interview questions based on the candidate profile and job requirements.
- Inputs: Candidate profile JSON, job requirement JSON
- Outputs: List of context-aware interview questions
- Use Case: Automated interview design, skill assessment
Role: Evaluates candidate answers using rubric-based criteria: correctness, depth, clarity, and conciseness.
- Inputs: Candidate answer, expected answer/rubric
- Outputs: Evaluation score + qualitative feedback
- Use Case: Technical interview grading, soft skill evaluation
Role: Dynamically generates new questions based on candidate's past performance and skill gaps.
- Inputs: Candidate response history, evaluation metrics
- Outputs: Follow-up or challenge questions with increased or decreased difficulty
- Use Case: Adaptive testing, personalized questioning
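The adaptive loop above boils down to simple difficulty bookkeeping. A sketch, where the 0-1 score scale, thresholds, and step size are assumptions for illustration:

```python
def next_difficulty(current, score, min_d=1, max_d=5,
                    pass_mark=0.7, fail_mark=0.4):
    """Raise difficulty after strong answers, lower it after weak ones.

    `score` is assumed to be a normalized evaluation score in [0, 1].
    """
    if score >= pass_mark:
        return min(current + 1, max_d)   # challenge question next
    if score < fail_mark:
        return max(current - 1, min_d)   # ease off
    return current  # borderline answers keep difficulty unchanged
```

The same scheme extends naturally to per-skill difficulty tracking by keeping one counter per skill gap.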
Role: Rates questions for quality (relevance, difficulty, clarity) before they're used in interviews.
- Inputs: Generated questions
- Outputs: Quality rating and suggestions
- Use Case: Question curation, QA testing for LLM-generated prompts
The Security Layer of evalverse_engine is designed to filter, guard, and protect your LLM outputs from unwanted or dangerous content.
It provides plug-and-play decorators and class-level utilities that prevent toxic, irrelevant, or inappropriate responses from leaking into your system.
Prevents offensive or sensitive keywords from appearing in LLM responses.
✅ Features:
- Loads a pickled list of banned words from `filter.sys_dump.key`.
- Allows dynamic addition of extra words.
- Compiles regex patterns for detection.
- Can be used as a decorator with fallback behavior.
```python
@ContentFilter.static_guard(fallback="Blocked for policy reasons")
def generate_response():
    return "some potentially offensive output"
```
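A minimal sketch of how such a keyword guard can work. The class and decorator names mirror the README, but the body (and the inline word list standing in for the pickled one) is an assumption:

```python
import re
from functools import wraps

class ContentFilter:
    BANNED = ["offensive", "hack"]  # stand-in for the pickled banned-word list

    @classmethod
    def static_guard(cls, fallback="", extra_words=()):
        words = cls.BANNED + list(extra_words)
        # Compile one alternation pattern over all banned words.
        pattern = re.compile(
            r"\b(" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE
        )

        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                result = func(*args, **kwargs)
                # Replace the whole response if any banned word matches.
                return fallback if pattern.search(result) else result
            return wrapper
        return decorator

@ContentFilter.static_guard(fallback="Blocked for policy reasons")
def generate_response():
    return "some potentially offensive output"
```

Here `generate_response()` returns the fallback string, since the output contains a banned word.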
Uses a transformer-based hate speech model to evaluate how toxic a response is.
✅ Features:
- Uses `facebook/roberta-hate-speech-dynabench-r1-target` via HuggingFace Transformers.
- Supports threshold-based filtering.
- Provides decorators for automated guarding.

```python
@ToxicityFilter.static_guard(fallback="Sorry, that response wasn't appropriate.")
def toxic_response():
    return "you suck"
```
Ensures generated questions or text are contextually relevant using cosine similarity.
✅ Features:
- Uses `BAAI/bge-small-en-v1.5` with SentenceTransformer.
- Can filter out unrelated (context, question) pairs.
- Supports decorators for guarding question generators and other tools.

```python
@ContextRelevanceFilter.static_guard(context="Operating Systems", fallback="Not related.")
def generate_question():
    return "How do trees photosynthesize?"
```
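The cosine-similarity gate itself is straightforward; here it is on toy embedding vectors (in real usage, both strings would first be embedded with the sentence-transformer model named above):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_relevant(context_vec, question_vec, threshold=0.5):
    # Pass only pairs whose embeddings point in similar directions.
    return cosine_similarity(context_vec, question_vec) >= threshold
```

An "Operating Systems" context and a photosynthesis question would embed far apart, so their similarity falls below the floor and the fallback fires.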
The SecurityFilter is your all-in-one defensive wall that combines all three filters into one neat decorator.
🔧 Configurable Options:
- `fallback`: What to return when a check fails.
- `extra_words`: Add custom banned words.
- `toxic_threshold`: Set the toxicity limit.
- `similarity_threshold`: Set the semantic similarity floor.
- `context`: What the generated output should be relevant to.
```python
@SecurityFilter(
    fallback="Response blocked by security.",
    extra_words=["hack", "kill"],
    toxic_threshold=0.6,
    similarity_threshold=0.5,
    context="Software Engineering"
)
def generate_question():
    return "How to exploit a system using a buffer overflow?"
```
When wrapped around a function, the SecurityFilter runs the output through the following steps in order:
1. `ContentFilter` – banned keywords? ❌ blocked.
2. `ToxicityFilter` – hate/violence? ❌ blocked.
3. `ContextRelevanceFilter` – not related? ❌ blocked.

✅ If all checks pass, the response is allowed to proceed!
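The short-circuiting chain can be sketched as an ordered list of predicate checks, where the first failure swaps in the fallback. The check functions below are simple placeholders, not the real filter classes:

```python
from functools import wraps

def security_filter(checks, fallback):
    """Run `checks` on the wrapped function's output, in order.

    Each check is a predicate: True means the output passes. The first
    failing check replaces the output with `fallback` (short-circuit).
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            for check in checks:
                if not check(result):
                    return fallback
            return result
        return wrapper
    return decorator

# Placeholder predicates standing in for ContentFilter, ToxicityFilter,
# and ContextRelevanceFilter.
no_banned = lambda text: "hack" not in text.lower()
not_toxic = lambda text: "kill" not in text.lower()

@security_filter(checks=[no_banned, not_toxic],
                 fallback="Response blocked by security.")
def generate_question():
    return "How to hack a server?"
```

Running checks in increasing order of cost (keyword regex before transformer inference) keeps the common blocked case cheap.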
```
Security/
├── Components/
│   ├── content_filter.py
│   ├── toxicity_filter.py
│   └── context_relevance_filter.py
└── security_filter.py   ← Unified guard interface
```
🛡️ Securing Question Generators from bias or off-topic generation
🧼 Sanitizing User or Agent-generated content
🤖 Integrating into LLM chatbots for policy compliance
🔒 Locking down models used in education or enterprise
WebSearcher is a modular tool in the evalverse_engine framework used for searching the web using the Serper API. It's designed to be plugged into agent workflows for quick and structured web lookups.
- Takes a natural language search query.
- Uses Serper's web search API.
- Returns top search result snippets with title, link, and description.
- Handles API failures gracefully.
The tool is built on the `BaseTool` interface from `crewai.tools` and takes a single natural-language query string as its input schema.
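Graceful result handling along the lines described above might look like this; the `organic`/`title`/`link`/`snippet` fields follow Serper's documented response shape, while the function itself is an illustrative assumption:

```python
def parse_serper_results(response_json, limit=5):
    """Pull title/link/description triples out of a Serper search response.

    Missing fields and malformed payloads degrade gracefully to an
    empty list or empty strings instead of raising.
    """
    results = []
    for item in response_json.get("organic", [])[:limit]:
        results.append({
            "title": item.get("title", ""),
            "link": item.get("link", ""),
            "description": item.get("snippet", ""),
        })
    return results
```

A failed or empty API call simply yields `[]`, so downstream agents never see a raised exception.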
This module provides two main functionalities:
- One-Question Interview Simulation — Using voice input and adaptive LLM evaluation.
- Online Assessment Session (OA) — With difficulty-adaptive MCQs based on user performance.
Simulates a single interview question with:
- Random category
- LLM-generated question
- Voice-based answer input
- LLM-based evaluation + rationale
- Random category picked
- Securely generate a unique question via `InterviewQGen`
- Candidate gives spoken answer (recorded and transcribed)
- Answer evaluated via `InterviewQEval`
- Print results with rating + rationale
- Uses `SecurityFilter` to block toxic/redundant/questionable outputs

```shell
python one_question_interview.py
```