InternScience/IdeaMiner


# IdeaMiner

A two-stage LLM pipeline for generating and evaluating novel scientific research questions.

Official Site · GitHub

IdeaMiner is a two-stage pipeline for generating and evaluating novel scientific research questions using LLM agents. It covers a broad taxonomy of academic disciplines and produces ranked, deduplicated research questions scored on novelty, feasibility, and significance.


## 🌐 Try It Online

Visit our official platform to explore AI-generated research ideas across disciplines, with no setup required.

IdeaMiner – My Library

Browse and save ideas from your personal library. Each card shows the research question along with its key topic tags.


Idea Detail View
Idea Detail View: Each ranked question comes with Background, Significance, Methodology, and Rationale, alongside novelty, feasibility, and significance scores.
User Profile
Personalized Profile: Set your research domain and experience level so the platform surfaces the most relevant ideas for you.

Interaction Buttons

Quick-action buttons let you skip, dislike, like, copy, or navigate between ideas with a single click.


## ⚙️ How It Works

```mermaid
flowchart TD
    A["📄 Config File<br>field · keywords · research_type · granularity"]
    A --> B["🤖 Step 1 · Generator<br>agents/step_1_generator.py"]
    B --> C["📝 30 Raw Research Questions<br>data/raw_questions/*.json"]
    C --> D["🔍 Step 2 · Evaluator<br>agents/step_2_evaluator.py"]
    D --> E["🧹 Deduplication<br>Embedding-based Cosine Similarity"]
    E --> F["⭐ Group-Based Scoring<br>novelty · feasibility · significance"]
    F --> G["🏆 Ranked Questions<br>data/evaluated_questions/"]
```

**Step 1 – Generation** (`agents/step_1_generator.py`): Each config file specifies a scientific field, a set of keywords, a research type, and a granularity level. The generator prompts an LLM to produce 30 diverse and novel research questions.

**Step 2 – Evaluation** (`agents/step_2_evaluator.py`): The evaluator first deduplicates questions using embedding-based cosine similarity, then scores the remaining questions across multiple rounds using a group-based approach. Each group is assessed by one or more LLM models that can invoke a `web_search` tool to ground their evaluations in current literature.
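The deduplication step can be sketched as follows. This is a minimal illustration, not the code in `agents/step_2_evaluator.py`; the function name and the greedy keep-first strategy are assumptions. The idea is that a question is dropped when its embedding's cosine similarity to any already-kept question exceeds the threshold:

```python
import numpy as np

def deduplicate(questions, embeddings, threshold=0.85):
    """Greedily keep questions whose embedding is not too similar to any kept one."""
    kept, kept_vecs = [], []
    for question, vec in zip(questions, embeddings):
        v = np.asarray(vec, dtype=float)
        v = v / np.linalg.norm(v)  # unit-normalize so the dot product is cosine similarity
        if all(float(v @ k) < threshold for k in kept_vecs):
            kept.append(question)
            kept_vecs.append(v)
    return kept

# Toy embeddings: the 2nd vector is nearly identical to the 1st, so "Q2" is dropped.
qs = ["Q1", "Q2", "Q3"]
vecs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(deduplicate(qs, vecs))  # → ['Q1', 'Q3']
```

In practice the embeddings would come from the custom embedding utilities in `utils/langchain_utils.py` rather than hand-written vectors.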


## 📂 Project Structure

```
IdeaMiner/
├── agents/
│   ├── step_1_generator.py   # Question generation agent
│   └── step_2_evaluator.py   # Question evaluation and ranking agent
├── utils/
│   ├── langchain_agent.py    # Async LangChain agent with tool support
│   ├── langchain_tools.py    # web_search and paper_search tools
│   ├── langchain_utils.py    # Custom embeddings with HuggingFace tokenizer support
│   └── tools.py              # Standalone Semantic Scholar search function
├── configs/
│   └── subject.py            # Academic discipline taxonomy and config generator
├── sh/
│   ├── 1_gen.sh              # Batch generation script
│   └── 2_eval.sh             # Batch evaluation script
├── assets/                   # Images for README and documentation
├── data/
│   ├── raw_questions/        # Output of Step 1 (git-ignored)
│   └── evaluated_questions/  # Output of Step 2 (git-ignored)
├── logs/                     # Runtime logs (git-ignored)
├── .env.example              # Environment variable template
├── requirements.txt          # Python dependencies
└── LICENSE                   # MIT License
```

## 📦 Dependencies

This project uses StructAI as its core utility library, which provides the `LLMAgent`, `load_file`, `save_file`, and other helpers used throughout the codebase.


## 🚀 Setup

1. Install dependencies

```bash
pip install -r requirements.txt
```

2. Configure environment variables

```bash
cp .env.example .env
# Edit .env and fill in your API keys
```

Required variables:

| Variable | Description |
|---|---|
| `LLM_API_KEY` | API key for your OpenAI-compatible LLM provider |
| `LLM_BASE_URL` | Base URL of the API (default: `https://api.openai.com/v1`) |
| `TAVILY_API_KEYS` | Comma-separated Tavily search API keys (or use `TAVILY_API_KEY`) |

Optional variables:

| Variable | Description |
|---|---|
| `SEMANTIC_SCHOLAR_API_KEY` | Increases the Semantic Scholar API rate limit |

3. Generate config files

The `configs/subject.py` script generates random experiment configs and writes them to `configs/`:

```bash
python configs/subject.py
```

Or write your own JSON config:

```json
{
    "field": "Life Sciences",
    "keywords": ["Genomics", "CRISPR", "Epigenetics"],
    "research_type": "Experiment",
    "granularity_level": "Microscopic"
}
```

## 💻 Usage

Run the full pipeline

```bash
# Step 1: Generate questions for all configs
./sh/1_gen.sh

# Step 2: Evaluate and rank the generated questions
./sh/2_eval.sh
```

Run individual steps

```bash
# Generate questions for a single config
python agents/step_1_generator.py --config_path configs/my_config.json

# Evaluate a single raw question file
python agents/step_2_evaluator.py \
    --input_file data/raw_questions/my_config.json \
    --output_dir data/evaluated_questions/my_config/ \
    --field "Life Sciences" \
    --models gpt-4o-mini \
    --comparison_rounds 3 \
    --group_size 5
```

Key parameters for evaluation

| Parameter | Default | Description |
|---|---|---|
| `--similarity_threshold` | 0.85 | Cosine similarity threshold for duplicate removal |
| `--filter_batch_size` | 50 | Questions per filtering batch |
| `--comparison_rounds` | 3 | Number of scoring rounds per question |
| `--group_size` | 5 | Questions per scoring group |
| `--models` | gpt-4o-mini | Space-separated list of scorer models |
| `--max_concurrent_tasks` | 32 | Maximum parallel async scoring tasks |
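To give intuition for how `--group_size` and `--comparison_rounds` interact, here is a hypothetical sketch (the function name and per-round shuffling strategy are assumptions, not the repository's implementation): each round reshuffles the questions and splits them into groups of at most `group_size` for the LLM scorers.

```python
import random

def make_groups(questions, group_size=5, comparison_rounds=3, seed=0):
    """For each round, shuffle the questions and split them into groups of at most group_size."""
    rng = random.Random(seed)
    rounds = []
    for _ in range(comparison_rounds):
        qs = list(questions)
        rng.shuffle(qs)  # a new ordering per round gives each question varied group-mates
        rounds.append([qs[i:i + group_size] for i in range(0, len(qs), group_size)])
    return rounds

# 12 questions, groups of 5, 3 rounds -> each round has groups of sizes 5, 5, 2
rounds = make_groups(list(range(12)), group_size=5, comparison_rounds=3)
```

Reshuffling between rounds means a question's average score is less dependent on any single group composition.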

## 📊 Output Format

After evaluation, each output directory contains:

| File | Description |
|---|---|
| `filtered_questions.json` | Questions after deduplication |
| `evaluation_results.json` | Full results including per-model scores |
| `ranked_questions.json` | Questions sorted by consensus score (best first) |
| `summary.json` | Statistics and top-10 questions |

Each ranked question includes:

```json
{
    "question": "...",
    "background": "...",
    "average_scores": {
        "novelty": 8.2,
        "feasibility": 7.5,
        "significance": 8.8,
        "total": 8.17
    },
    "rank": 1
}
```
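A minimal sketch for consuming this output with the standard library; `load_top` is a hypothetical helper (not part of the repository), and the path shown matches the example evaluation command's output directory:

```python
import json

def load_top(path, n=5):
    """Load a ranked_questions.json file and return the top-n (rank, total score, question)."""
    with open(path) as f:
        ranked = json.load(f)
    return [(q["rank"], q["average_scores"]["total"], q["question"])
            for q in ranked[:n]]

# e.g. load_top("data/evaluated_questions/my_config/ranked_questions.json")
```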

## 📬 Contact

- GitHub Issues: Please open an issue for bug reports or feature requests.
- WeChat Mini Program:

WeChat Mini Program


## 🌟 Star History

If you find this work helpful, please consider starring ⭐ this repo. Thanks for your support! 🤩



## 📜 License

MIT License. See LICENSE for details.

