A URL type classification model based on Qwen2.5-1.5B + LoRA, used to determine whether a URL points to a list page or a detail page.

- Classifies a URL as a list page or a detail page
- Supports both GPU and CPU inference
- Fine-tuned with LoRA, so the trained adapter stays lightweight
Requirements:
- Python 3.10+
- PyTorch 2.5+
- Transformers
- PEFT
- Recommended: NVIDIA GPU (RTX 4060 or better)
```shell
# Create a conda environment
conda create -n url-classifier python=3.11
conda activate url-classifier

# Install dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers peft datasets accelerate
```

Download the dataset from HuggingFace:
```python
from datasets import load_dataset
import json
import random

random.seed(42)

ds = load_dataset('IowaCat/page_type_inference_dataset')
train_data = ds['train']

# Sample 5,000 URLs per class
list_pages = [url for url, label in zip(train_data['url'], train_data['label']) if label == 0]
detail_pages = [url for url, label in zip(train_data['url'], train_data['label']) if label == 1]
sampled_list = random.sample(list_pages, 5000)
sampled_detail = random.sample(detail_pages, 5000)

# Convert to the instruction-tuning format
data = []
for url in sampled_list:
    data.append({
        "instruction": "请判断以下URL是列表页还是详情页。",
        "input": url,
        "output": "列表页 (List Page)"
    })
for url in sampled_detail:
    data.append({
        "instruction": "请判断以下URL是列表页还是详情页。",
        "input": url,
        "output": "详情页 (Detail Page)"
    })
random.shuffle(data)

with open('data/data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
```

Run training:

```shell
python src/train.py
```

Training configuration:
- Model: Qwen2.5-1.5B
- LoRA rank: 16
- Epochs: 3
- Batch size: 2
- Gradient accumulation: 8
- Learning rate: 2e-4
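The full `train.py` is not reproduced here; as a configuration sketch, the hyperparameters above would map onto PEFT and Transformers roughly as follows (`target_modules`, `lora_alpha`, and `lora_dropout` are assumptions not stated in this README):

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Values from the list above; alpha, dropout, and target_modules are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,                        # assumed: a common choice is 2 * r
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    lora_dropout=0.05,                    # assumed
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="output",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,        # effective batch size: 2 * 8 = 16
    learning_rate=2e-4,
)
```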
Run inference (GPU):

```shell
python src/infer.py "https://example.com/product/12345"
```

Run inference (CPU):

```shell
python src/infer.py "https://example.com/products/list" --cpu
```

Project structure:

```
url-classifier/
├── data/
│   └── data.json        # training data
├── src/
│   ├── __init__.py
│   ├── train.py         # training script
│   ├── infer.py         # inference script
│   └── utils.py         # utility functions
├── output/              # model output directory (not committed)
├── .gitignore
├── README.md
└── requirements.txt
```
| Test set | Accuracy |
|---|---|
| 100-sample validation set | 99% |
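The evaluation script is not shown; a hypothetical helper for scoring generations, which maps the model's output text onto one of the two classes (the function name and keyword-matching rule are assumptions), might look like:

```python
def parse_label(generated: str) -> str:
    """Map generated text to a class label by keyword matching.

    Hypothetical helper; assumes the model emits the same label strings
    it was trained on ("列表页 (List Page)" / "详情页 (Detail Page)").
    """
    if "列表页" in generated or "List Page" in generated:
        return "list"
    if "详情页" in generated or "Detail Page" in generated:
        return "detail"
    return "unknown"
```

Accuracy is then the fraction of validation URLs whose parsed label matches the gold label.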
After training completes, the model checkpoint is saved under the `output/checkpoint-300/` directory.

The model has also been uploaded to HuggingFace and can be downloaded directly:
Full merged model:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("windlx/url-classifier-model")
```

- Size: ~3 GB
- URL: https://huggingface.co/windlx/url-classifier-model
Base model + LoRA adapter:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
model = PeftModel.from_pretrained(base_model, "windlx/url-classifier-lora")
```

- Size: ~233 MB (LoRA adapter: ~72 MB)
- URL: https://huggingface.co/windlx/url-classifier-lora
To merge the LoRA adapter into the base model and save a standalone copy:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-1.5B')
model = PeftModel.from_pretrained(base_model, 'output/checkpoint-300')

# merge_and_unload() returns the merged model; keep the return value
model = model.merge_and_unload()
model.save_pretrained('output/merged-model')
```
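The prompt format used by `infer.py` is not shown in this README; assuming it mirrors the training records above (instruction followed by the URL), a hypothetical prompt builder would be:

```python
def build_prompt(url: str) -> str:
    """Build an inference prompt matching the training record format.

    Hypothetical helper; assumes infer.py concatenates the instruction
    and the URL the same way the training data pairs them.
    """
    instruction = "请判断以下URL是列表页还是详情页。"
    return f"{instruction}\n{url}\n"

prompt = build_prompt("https://example.com/product/12345")
```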