A two-stage LLM pipeline for generating and evaluating novel scientific research questions.
IdeaMiner is a two-stage pipeline for generating and evaluating novel scientific research questions using LLM agents. It covers a broad taxonomy of academic disciplines and produces ranked, deduplicated research questions scored on novelty, feasibility, and significance.
Visit our official platform to explore AI-generated research ideas across disciplines – no setup required.
Browse and save ideas from your personal library. Each card shows the research question along with its key topic tags.
Quick-action buttons let you skip, dislike, like, copy, or navigate between ideas with a single click.
```mermaid
flowchart TD
    A["Config File<br>field · keywords · research_type · granularity"]
    A --> B["Step 1 · Generator<br>agents/step_1_generator.py"]
    B --> C["30 Raw Research Questions<br>data/raw_questions/*.json"]
    C --> D["Step 2 · Evaluator<br>agents/step_2_evaluator.py"]
    D --> E["Deduplication<br>Embedding-based Cosine Similarity"]
    E --> F["Group-Based Scoring<br>novelty · feasibility · significance"]
    F --> G["Ranked Questions<br>data/evaluated_questions/"]
```
Step 1 – Generation (`agents/step_1_generator.py`):
Each config file specifies a scientific field, a set of keywords, a research type, and a granularity level. The generator prompts an LLM to produce 30 diverse and novel research questions.
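The config fields map directly onto the generation prompt. A minimal sketch of how such a prompt might be assembled (the `build_prompt` helper and its wording are illustrative, not the project's actual prompt):

```python
def build_prompt(config, n_questions=30):
    """Turn a config dict (field, keywords, research_type,
    granularity_level) into a generation prompt for the LLM."""
    return (
        f"You are a research ideation assistant for {config['field']}.\n"
        f"Propose {n_questions} diverse, novel research questions about "
        f"{', '.join(config['keywords'])}, suited to {config['research_type']} "
        f"research at a {config['granularity_level']} level of granularity."
    )

config = {
    "field": "Life Sciences",
    "keywords": ["Genomics", "CRISPR", "Epigenetics"],
    "research_type": "Experiment",
    "granularity_level": "Microscopic",
}
print(build_prompt(config))
```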
Step 2 – Evaluation (`agents/step_2_evaluator.py`):
The evaluator first deduplicates questions using embedding-based cosine similarity, then scores the remaining questions across multiple rounds using a group-based approach. Each group is assessed by one or more LLM models that can invoke a web_search tool to ground their evaluations in current literature.
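The deduplication step can be sketched with plain NumPy, assuming an embedding vector has already been computed for each question (the greedy `deduplicate` helper below is illustrative, not the evaluator's exact code):

```python
import numpy as np

def deduplicate(questions, embeddings, threshold=0.85):
    """Greedily keep questions whose embedding has cosine similarity
    below `threshold` against every already-kept question."""
    # Normalize rows so a dot product equals cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept, kept_vecs = [], []
    for q, v in zip(questions, normed):
        if all(float(v @ k) < threshold for k in kept_vecs):
            kept.append(q)
            kept_vecs.append(v)
    return kept

questions = [
    "How does CRISPR editing alter methylation?",
    "How does CRISPR editing change methylation?",  # near-duplicate
    "What drives enhancer-promoter looping?",
]
# Toy 2-D embeddings: the first two vectors point almost the same way
emb = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
print(deduplicate(questions, emb))  # the near-duplicate is dropped
```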
```
IdeaMiner/
├── agents/
│   ├── step_1_generator.py      # Question generation agent
│   └── step_2_evaluator.py      # Question evaluation and ranking agent
├── utils/
│   ├── langchain_agent.py       # Async LangChain agent with tool support
│   ├── langchain_tools.py       # web_search and paper_search tools
│   ├── langchain_utils.py       # Custom embeddings with HuggingFace tokenizer support
│   └── tools.py                 # Standalone Semantic Scholar search function
├── configs/
│   └── subject.py               # Academic discipline taxonomy and config generator
├── sh/
│   ├── 1_gen.sh                 # Batch generation script
│   └── 2_eval.sh                # Batch evaluation script
├── assets/                      # Images for README and documentation
├── data/
│   ├── raw_questions/           # Output of Step 1 (git-ignored)
│   └── evaluated_questions/     # Output of Step 2 (git-ignored)
├── logs/                        # Runtime logs (git-ignored)
├── .env.example                 # Environment variable template
├── requirements.txt             # Python dependencies
└── LICENSE                      # MIT License
```
This project uses StructAI as its core utility library, which provides the `LLMAgent`, `load_file`, `save_file`, and other helpers used throughout the codebase.
```bash
pip install -r requirements.txt
cp .env.example .env
# Edit .env and fill in your API keys
```

Required variables:
| Variable | Description |
|---|---|
| `LLM_API_KEY` | API key for your OpenAI-compatible LLM provider |
| `LLM_BASE_URL` | Base URL of the API (default: `https://api.openai.com/v1`) |
| `TAVILY_API_KEYS` | Comma-separated Tavily search API keys (or use `TAVILY_API_KEY`) |
Optional variables:
| Variable | Description |
|---|---|
| `SEMANTIC_SCHOLAR_API_KEY` | Increases the Semantic Scholar API rate limit |
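A minimal sketch of how these variables might be read at startup (the `load_settings` helper is hypothetical; it only illustrates the defaulting and comma-splitting behavior described above):

```python
import os

def load_settings(env=os.environ):
    """Read the environment variables described above.
    LLM_BASE_URL falls back to the OpenAI default; TAVILY_API_KEYS
    may hold several comma-separated keys (TAVILY_API_KEY as fallback)."""
    keys = env.get("TAVILY_API_KEYS") or env.get("TAVILY_API_KEY", "")
    return {
        "llm_api_key": env["LLM_API_KEY"],
        "llm_base_url": env.get("LLM_BASE_URL", "https://api.openai.com/v1"),
        "tavily_api_keys": [k.strip() for k in keys.split(",") if k.strip()],
    }

settings = load_settings({"LLM_API_KEY": "sk-test", "TAVILY_API_KEYS": "tvly-a, tvly-b"})
print(settings["tavily_api_keys"])  # ['tvly-a', 'tvly-b']
```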
The `configs/subject.py` script generates random experiment configs and writes them to `configs/`:

```bash
python configs/subject.py
```

Or write your own JSON config:
```json
{
  "field": "Life Sciences",
  "keywords": ["Genomics", "CRISPR", "Epigenetics"],
  "research_type": "Experiment",
  "granularity_level": "Microscopic"
}
```

```bash
# Step 1: Generate questions for all configs
./sh/1_gen.sh

# Step 2: Evaluate and rank the generated questions
./sh/2_eval.sh
```

```bash
# Generate questions for a single config
python agents/step_1_generator.py --config_path configs/my_config.json

# Evaluate a single raw question file
python agents/step_2_evaluator.py \
  --input_file data/raw_questions/my_config.json \
  --output_dir data/evaluated_questions/my_config/ \
  --field "Life Sciences" \
  --models gpt-4o-mini \
  --comparison_rounds 3 \
  --group_size 5
```

| Parameter | Default | Description |
|---|---|---|
| `--similarity_threshold` | `0.85` | Cosine similarity threshold for duplicate removal |
| `--filter_batch_size` | `50` | Questions per filtering batch |
| `--comparison_rounds` | `3` | Number of scoring rounds per question |
| `--group_size` | `5` | Questions per scoring group |
| `--models` | `gpt-4o-mini` | Space-separated list of scorer models |
| `--max_concurrent_tasks` | `32` | Maximum parallel async scoring tasks |
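How `--group_size` and `--comparison_rounds` interact can be sketched as follows (the grouping logic here is illustrative, not the evaluator's exact code): each round reshuffles the question pool and splits it into chunks of `group_size`, so every question is scored once per round.

```python
import random

def make_groups(questions, group_size=5, comparison_rounds=3, seed=0):
    """Yield (round_index, group) pairs: each round shuffles the pool
    and chunks it, so every question appears exactly once per round."""
    rng = random.Random(seed)
    for r in range(comparison_rounds):
        pool = questions[:]
        rng.shuffle(pool)
        for i in range(0, len(pool), group_size):
            yield r, pool[i:i + group_size]

qs = [f"Q{i}" for i in range(12)]
groups = list(make_groups(qs, group_size=5, comparison_rounds=3))
# 3 rounds x ceil(12/5) = 3 chunks -> 9 groups in total
print(len(groups))  # 9
```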
After evaluation, each output directory contains:
| File | Description |
|---|---|
| `filtered_questions.json` | Questions after deduplication |
| `evaluation_results.json` | Full results including per-model scores |
| `ranked_questions.json` | Questions sorted by consensus score (best first) |
| `summary.json` | Statistics and top-10 questions |
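The consensus ranking can be reproduced with a few lines (a sketch, assuming `total` is the mean of the three criteria rounded to two decimals; the `rank_questions` helper is illustrative, not the evaluator's code):

```python
def rank_questions(records):
    """Sort scored questions by the mean of novelty, feasibility,
    and significance, then assign 1-based ranks (best first)."""
    for rec in records:
        s = rec["average_scores"]
        s["total"] = round((s["novelty"] + s["feasibility"] + s["significance"]) / 3, 2)
    ranked = sorted(records, key=lambda r: r["average_scores"]["total"], reverse=True)
    for i, rec in enumerate(ranked, start=1):
        rec["rank"] = i
    return ranked

records = [
    {"question": "A", "average_scores": {"novelty": 8.2, "feasibility": 7.5, "significance": 8.8}},
    {"question": "B", "average_scores": {"novelty": 6.0, "feasibility": 9.0, "significance": 7.0}},
]
ranked = rank_questions(records)
print(ranked[0]["question"], ranked[0]["average_scores"]["total"])  # A 8.17
```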
Each ranked question includes:
```json
{
  "question": "...",
  "background": "...",
  "average_scores": {
    "novelty": 8.2,
    "feasibility": 7.5,
    "significance": 8.8,
    "total": 8.17
  },
  "rank": 1
}
```

- GitHub Issues: Please open an issue for bug reports or feature requests
- WeChat Mini Program:
If you find this work helpful, please consider starring this repo. Thanks for your support!
MIT License. See LICENSE for details.




