5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
__pycache__/
*/__pycache__/
**/__pycache__/
*.pyc
conversion_tools/logs/
37 changes: 20 additions & 17 deletions README.md
@@ -48,6 +48,8 @@ This dataset is a collection of anonymized customer sessions containing products
- Yelp-full: This is a combination dataset including four versions of yelp datasets mentioned above, where the duplicates are dropped and the number of total reviews is 28,908,240.
- [Tmall](https://tianchi.aliyun.com/dataset/dataDetail?dataId=53):
This dataset is provided by Ant Financial Services and was used in the IJCAI16 contest.
- [Tmall2014](https://tianchi.aliyun.com/dataset/140281):
This is a large-scale e-commerce dataset from Tmall.com containing user behavior logs from 2013. The dataset includes multiple types of user-item interactions: clicks, add-to-cart, favorites (collect), and purchases (alipay).
- [DIGINETICA](https://competitions.codalab.org/competitions/11161):
The dataset includes user sessions extracted from e-commerce search engine logs, with anonymized user ids,
hashed queries, hashed query terms, hashed product descriptions and meta-data, log-scaled prices, clicks, and purchases.
@@ -204,23 +206,24 @@ These datasets contain measurements of clothing fit from [RentTheRunway](https:/
| 17 | [Ta Feng](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/TaFeng) | 32,266 | 23,812 | 817,741 | 99\.89% | Click | √ | √ | √ | √ |
| 18 | [Foursquare](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Foursquare) | \- | \- | \- | \- | Check-in | √ | | √ | |
| 19 | [Tmall](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Tmall) | 963,923 | 2,353,207 | 44,528,127 | 99.99% | Click/Buy | √ | | | √ |
| 20 | [YOOCHOOSE](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/YOOCHOOSE) | 9,249,729 | 52,739 | 34,154,697 | 99.99% | Click/Buy | √ | | | √ |
| 21 | [Retailrocket](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Retailrocket) | 1,407,580 | 247,085 | 2,756,101 | 99.99% | View/Addtocart/Transaction | √ | | | |
| 22 | [LFM-1b](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/LFM-1b) | 120,322 | 3,123,496 | 1,088,161,692 | 99\.71% | Click | √ | √ | √ | √ |
| 23 | [MIND](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/MIND) | - | - | - | - | Click | √ | | | |
| 24 | BeerAdvocate | 33,388 | 66,055 | 1,586,614 | 99\.9281% | Rating<br/> \[0,5\] | √ | | √ | |
| 25 | Behance | 63,497 | 178,788 | 1,000,000 | 99\.9912% | Likes | √ | | √ | |
| 26 | DianPing | 542,706 | 243,247 | 4,422,473 | 99\.9967% | Rating<br/> \[0,5\] | √ | | √ | √ |
| 27 | EndoMondo | 1,104 | 253,020 | 253,020 | 99\.9094% | Workout Logs | √ | √ | | √ |
| 28 | Food | 226,570 | 231,637 | 1,132,367 | 99\.9978% | Rating<br/> \[0,5\] | √ | | √ | |
| 29 | GoodReads | 876,145 | 2,360,650 | 228,648,342 | 99\.9889% | Rating<br/> \[0,5\] | √ | | √ | |
| 30 | [KGRec](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/KGRec) | - | - | - | - | Click | | | √ | |
| 31 | ModCloth | 47,958 | 1,378 | 82,790 | 99\.8747% | Rating<br/> \[0,5\] | | √ | √ | √ |
| 32 | RateBeer | 29,265 | 110,369 | 2,924,163 | 99\.9095% | Overall Rating<br/> \[0,20\] | √ | | √ | √ |
| 33 | RentTheRunway | 105,571 | 5,850 | 192,544 | 99\.9688% | Rating<br/> \[0,10\] | √ | √ | √ | √ |
| 34 | [Twitch](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Twitch) | 15,524,309 | 6,161,666 | 474,676,929 | 99\.9995% | Click | | | | √ |
| 35 | Amazon_M2 | 3,606,349 | 1,410,675 | 15,306,183 | \- | Click | | | √ | √ |
| 36 | Music4All-Onion | 119,140 | 109,269 | 252,984,396 | \- | Click | √ | | √ | √ |
| 20 | [Tmall2014](dataset_info/Tmall2014) | ~1,500,000 | ~8,000,000 | ~22,400,000 (click) | 99.99% | Click/Cart/Collect/Alipay | √ | | | |
| 21 | [YOOCHOOSE](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/YOOCHOOSE) | 9,249,729 | 52,739 | 34,154,697 | 99.99% | Click/Buy | √ | | | √ |
| 22 | [Retailrocket](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Retailrocket) | 1,407,580 | 247,085 | 2,756,101 | 99.99% | View/Addtocart/Transaction | √ | | | |
| 23 | [LFM-1b](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/LFM-1b) | 120,322 | 3,123,496 | 1,088,161,692 | 99\.71% | Click | √ | √ | √ | √ |
| 24 | [MIND](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/MIND) | - | - | - | - | Click | √ | | | |
| 25 | BeerAdvocate | 33,388 | 66,055 | 1,586,614 | 99\.9281% | Rating<br/> \[0,5\] | √ | | √ | |
| 26 | Behance | 63,497 | 178,788 | 1,000,000 | 99\.9912% | Likes | √ | | √ | |
| 27 | DianPing | 542,706 | 243,247 | 4,422,473 | 99\.9967% | Rating<br/> \[0,5\] | √ | | √ | √ |
| 28 | EndoMondo | 1,104 | 253,020 | 253,020 | 99\.9094% | Workout Logs | √ | √ | | √ |
| 29 | Food | 226,570 | 231,637 | 1,132,367 | 99\.9978% | Rating<br/> \[0,5\] | √ | | √ | |
| 30 | GoodReads | 876,145 | 2,360,650 | 228,648,342 | 99\.9889% | Rating<br/> \[0,5\] | √ | | √ | |
| 31 | [KGRec](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/KGRec) | - | - | - | - | Click | | | √ | |
| 32 | ModCloth | 47,958 | 1,378 | 82,790 | 99\.8747% | Rating<br/> \[0,5\] | | √ | √ | √ |
| 33 | RateBeer | 29,265 | 110,369 | 2,924,163 | 99\.9095% | Overall Rating<br/> \[0,20\] | √ | | √ | √ |
| 34 | RentTheRunway | 105,571 | 5,850 | 192,544 | 99\.9688% | Rating<br/> \[0,10\] | √ | √ | √ | √ |
| 35 | [Twitch](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Twitch) | 15,524,309 | 6,161,666 | 474,676,929 | 99\.9995% | Click | | | | √ |
| 36 | Amazon_M2 | 3,606,349 | 1,410,675 | 15,306,183 | \- | Click | | | √ | √ |
| 37 | Music4All-Onion | 119,140 | 109,269 | 252,984,396 | \- | Click | √ | | √ | √ |
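
The percentage column in the table above appears to be sparsity. Assuming the usual definition (1 minus interactions divided by users × items), which this excerpt does not state, the figure can be recomputed from the three counts. A minimal sketch; the function name `sparsity` is hypothetical and not part of the repository:

```python
def sparsity(n_users: int, n_items: int, n_interactions: int) -> float:
    """Fraction of the user-item matrix with no recorded interaction.

    Assumes the common definition 1 - interactions / (users * items);
    the README excerpt does not spell out its own formula.
    """
    if n_users <= 0 or n_items <= 0:
        raise ValueError("n_users and n_items must be positive")
    return 1.0 - n_interactions / (n_users * n_items)


# Ta Feng row from the table: 32,266 users, 23,812 items, 817,741 interactions.
ta_feng = sparsity(32_266, 23_812, 817_741)
print(f"{ta_feng:.2%}")  # the table lists 99.89% for this row
```

If a dataset contains repeated user-item pairs, this formula slightly understates sparsity, since repeats are counted as separate interactions.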

### CTR Datasets

11 changes: 6 additions & 5 deletions conversion_tools/README.md
@@ -22,11 +22,12 @@
| 17 | Ta Feng |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/TaFeng.md)|
| 18 | Foursquare |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/Foursquare.md)|
| 19 | Tmall |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/Tmall.md)|
| 20 | YOOCHOOSE |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/YOOCHOOSE.md)|
| 21 | Retailrocket |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/Retailrocket.md)|
| 22 | LFM\-1b |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/LFM-1b.md)|
| 23 | MIND |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/MIND.md)|
| 24 | Music4All_Onion |[Link](https://github.com/RUCAIBox/RecSysDatasets/blob/master/conversion_tools/usage/Onion.md)|
| 20 | Tmall2014 |[Link](usage/Tmall2014.md)|
| 21 | YOOCHOOSE |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/YOOCHOOSE.md)|
| 22 | Retailrocket |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/Retailrocket.md)|
| 23 | LFM\-1b |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/LFM-1b.md)|
| 24 | MIND |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/MIND.md)|
| 25 | Music4All_Onion |[Link](https://github.com/RUCAIBox/RecSysDatasets/blob/master/conversion_tools/usage/Onion.md)|


### CTR Datasets
73 changes: 72 additions & 1 deletion conversion_tools/run.py
@@ -5,8 +5,11 @@

import argparse
import importlib
import time
from datetime import datetime

from src.utils import dataset2class, click_dataset, multiple_dataset, multiple_item_features
from src.logger import logger, format_to_str_box


if __name__ == '__main__':
@@ -29,21 +32,89 @@
    assert args.input_path is not None, 'input_path can not be None, please specify the input_path'
    assert args.output_path is not None, 'output_path can not be None, please specify the output_path'

    # Build the run configuration summary
    config_info = {
        "Dataset type": args.dataset,
        "Input path": args.input_path,
        "Output path": args.output_path,
    }

    if args.interaction_type:
        config_info["Interaction type"] = args.interaction_type

    config_info["Duplicate removal"] = "enabled" if args.duplicate_removal else "disabled"
    config_info["Start time"] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

    # Log the configuration summary
    logger.info("=" * 80)
    logger.info("📊 Dataset conversion tool started")
    logger.info(format_to_str_box(config_info))
    logger.info("=" * 80)

    start_time = time.time()

    input_args = [args.input_path, args.output_path]
    dataset_class_name = dataset2class[args.dataset.lower()]
    dataset_class = getattr(importlib.import_module('src.extended_dataset'), dataset_class_name)
    if dataset_class_name in multiple_dataset:
        # Pass interaction_type only when it is set; otherwise pass 'all' to process every behavior type
        if args.interaction_type is not None:
            input_args.append(args.interaction_type)
        else:
            input_args.append('all')
    if dataset_class_name in click_dataset:
        input_args.append(args.duplicate_removal)
    if dataset_class_name in multiple_item_features:
        input_args.append(args.item_feature_name)

    logger.info(f"🔧 Initializing dataset class: {dataset_class_name}")
    datasets = dataset_class(*input_args)
    logger.info("✅ Dataset class initialized")

    if args.convert_inter:
        logger.info("")
        logger.info("=" * 80)
        logger.info("🚀 Converting interaction data (Inter Data)")
        logger.info("=" * 80)
        datasets.convert_inter()
        logger.info("=" * 80)
        logger.info("✅ Interaction data conversion finished")
        logger.info("=" * 80)

    if args.convert_item:
        logger.info("")
        logger.info("=" * 80)
        logger.info("🚀 Converting item features (Item Features)")
        logger.info("=" * 80)
        datasets.convert_item()
        logger.info("=" * 80)
        logger.info("✅ Item feature conversion finished")
        logger.info("=" * 80)

    if args.convert_user:
        logger.info("")
        logger.info("=" * 80)
        logger.info("🚀 Converting user features (User Features)")
        logger.info("=" * 80)
        datasets.convert_user()
        logger.info("=" * 80)
        logger.info("✅ User feature conversion finished")
        logger.info("=" * 80)

    # Compute the total elapsed time
    end_time = time.time()
    elapsed_time = end_time - start_time

    # Build the completion summary
    completion_info = {
        "Status": "All tasks completed",
        "Total time": f"{elapsed_time:.2f} s ({elapsed_time/60:.2f} min)",
        "End time": datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        "Output directory": args.output_path
    }

    logger.info("")
    logger.info("=" * 80)
    logger.info("🎉 Conversion run finished")
    logger.info(format_to_str_box(completion_info))
    logger.info("=" * 80)
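
The argument-assembly step in `run.py` can be isolated into a small function to make the new `'all'` fallback easy to check. This is a sketch under stated assumptions: `build_input_args`, the three example sets, and the class names inside them are hypothetical stand-ins for `multiple_dataset`, `click_dataset`, and `multiple_item_features` from `src.utils`, whose real contents are not shown in this diff.

```python
from typing import Any, List, Optional

# Hypothetical stand-ins for the sets imported from src.utils in run.py.
MULTIPLE_DATASET = {"TMALL2014Dataset"}
CLICK_DATASET = {"YOOCHOOSEDataset"}
MULTIPLE_ITEM_FEATURES = {"ExampleItemFeatureDataset"}


def build_input_args(
    class_name: str,
    input_path: str,
    output_path: str,
    interaction_type: Optional[str] = None,
    duplicate_removal: bool = False,
    item_feature_name: Optional[str] = None,
) -> List[Any]:
    """Assemble positional constructor arguments in the same order as run.py."""
    args: List[Any] = [input_path, output_path]
    if class_name in MULTIPLE_DATASET:
        # Fall back to 'all' when no interaction type is given.
        args.append(interaction_type if interaction_type is not None else "all")
    if class_name in CLICK_DATASET:
        args.append(duplicate_removal)
    if class_name in MULTIPLE_ITEM_FEATURES:
        args.append(item_feature_name)
    return args


print(build_input_args("TMALL2014Dataset", "in/", "out/"))
# ['in/', 'out/', 'all']
```

Keeping the fallback in one place means the dataset class always receives a string for the interaction type, so it never has to handle `None` itself.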