diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..5c0f323
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,3 @@
+__pycache__/
+*.pyc
+conversion_tools/logs/
diff --git a/README.md b/README.md
index ad819d0..88c71d1 100644
--- a/README.md
+++ b/README.md
@@ -48,6 +48,8 @@ This dataset is a collection of anonymized customer sessions containing products
- Yelp-full: This is a combination dataset including four versions of yelp datasets mentioned above, where the duplicates are dropped and the number of total reviews is 28,908,240.
- [Tmall](https://tianchi.aliyun.com/dataset/dataDetail?dataId=53):
 This dataset is provided by Ant Financial Services and was used in the IJCAI16 contest.
+- [Tmall2014](https://tianchi.aliyun.com/dataset/140281):
+ This is a large-scale e-commerce dataset from Tmall.com containing user behavior logs from 2013. The dataset includes multiple types of user-item interactions: clicks, add-to-cart, favorites (collect), and purchases (alipay).
- [DIGINETICA](https://competitions.codalab.org/competitions/11161):
The dataset includes user sessions extracted from an e-commerce search engine logs, with anonymized user ids,
hashed queries, hashed query terms, hashed product descriptions and meta-data, log-scaled prices, clicks, and purchases.
@@ -204,23 +206,24 @@ These datasets contain measurements of clothing fit from [RentTheRunway](https:/
| 17 | [Ta Feng](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/TaFeng) | 32,266 | 23,812 | 817,741 | 99\.89% | Click | √ | √ | √ | √ |
| 18 | [Foursquare](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Foursquare) | \- | \- | \- | \- | Check-in | √ | | √ | |
| 19 | [Tmall](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Tmall) | 963,923 | 2,353,207 | 44,528,127 | 99.99% | Click/Buy | √ | | | √ |
-| 20 | [YOOCHOOSE](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/YOOCHOOSE) | 9,249,729 | 52,739 | 34,154,697 | 99.99% | Click/Buy | √ | | | √ |
-| 21 | [Retailrocket](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Retailrocket) | 1,407,580 | 247,085 | 2,756,101 | 99.99% | View/Addtocart/Transaction | √ | | | |
-| 22 | [LFM-1b](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/LFM-1b) | 120,322 | 3,123,496 | 1,088,161,692 | 99\.71% | Click | √ | √ | √ | √ |
-| 23 | [MIND](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/MIND) | - | - | - | - | Click | √ | | | |
-| 24 | BeerAdvocate | 33,388 | 66,055 | 1,586,614 | 99\.9281% | Rating \[0,5\] | √ | | √ | |
-| 25 | Behance | 63,497 | 178,788 | 1,000,000 | 99\.9912% | Likes | √ | | √ | |
-| 26 | DianPing | 542,706 | 243,247 | 4,422,473 | 99\.9967% | Rating \[0,5\] | √ | | √ | √ |
-| 27 | EndoMondo | 1,104 | 253,020 | 253,020 | 99\.9094% | Workout Logs | √ | √ | | √ |
-| 28 | Food | 226,570 | 231,637 | 1,132,367 | 99\.9978% | Rating \[0,5\] | √ | | √ | |
-| 29 | GoodReads | 876,145 | 2,360,650 | 228,648,342 | 99\.9889% | Rating \[0,5\] | √ | | √ | |
-| 30 | [KGRec](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/KGRec) | - | - | - | - | Click | | | √ | |
-| 31 | ModCloth | 47,958 | 1,378 | 82,790 | 99\.8747% | Rating \[0,5\] | | √ | √ | √ |
-| 32 | RateBeer | 29,265 | 110,369 | 2,924,163 | 99\.9095% | Overall Rating \[0,20\] | √ | | √ | √ |
-| 33 | RentTheRunway | 105,571 | 5,850 | 192,544 | 99\.9688% | Rating \[0,10\] | √ | √ | √ | √ |
-| 34 | [Twitch](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Twitch) | 15,524,309 | 6,161,666 | 474,676,929 | 99\.9995% | Click | | | | √ |
-| 35 | Amazon_M2 | 3,606,349 | 1,410,675 | 15,306,183 | \- | Click | | | √ | √ |
-| 36 | Music4All-Onion | 119,140 | 109,269 | 252,984,396 | \- | Click | √ | | √ | √ |
+| 20 | [Tmall2014](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Tmall2014) | ~1,500,000 | ~8,000,000 | ~22,400,000 (click) | 99.99% | Click/Cart/Collect/Alipay | √ | | | |
+| 21 | [YOOCHOOSE](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/YOOCHOOSE) | 9,249,729 | 52,739 | 34,154,697 | 99.99% | Click/Buy | √ | | | √ |
+| 22 | [Retailrocket](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Retailrocket) | 1,407,580 | 247,085 | 2,756,101 | 99.99% | View/Addtocart/Transaction | √ | | | |
+| 23 | [LFM-1b](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/LFM-1b) | 120,322 | 3,123,496 | 1,088,161,692 | 99\.71% | Click | √ | √ | √ | √ |
+| 24 | [MIND](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/MIND) | - | - | - | - | Click | √ | | | |
+| 25 | BeerAdvocate | 33,388 | 66,055 | 1,586,614 | 99\.9281% | Rating \[0,5\] | √ | | √ | |
+| 26 | Behance | 63,497 | 178,788 | 1,000,000 | 99\.9912% | Likes | √ | | √ | |
+| 27 | DianPing | 542,706 | 243,247 | 4,422,473 | 99\.9967% | Rating \[0,5\] | √ | | √ | √ |
+| 28 | EndoMondo | 1,104 | 253,020 | 253,020 | 99\.9094% | Workout Logs | √ | √ | | √ |
+| 29 | Food | 226,570 | 231,637 | 1,132,367 | 99\.9978% | Rating \[0,5\] | √ | | √ | |
+| 30 | GoodReads | 876,145 | 2,360,650 | 228,648,342 | 99\.9889% | Rating \[0,5\] | √ | | √ | |
+| 31 | [KGRec](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/KGRec) | - | - | - | - | Click | | | √ | |
+| 32 | ModCloth | 47,958 | 1,378 | 82,790 | 99\.8747% | Rating \[0,5\] | | √ | √ | √ |
+| 33 | RateBeer | 29,265 | 110,369 | 2,924,163 | 99\.9095% | Overall Rating \[0,20\] | √ | | √ | √ |
+| 34 | RentTheRunway | 105,571 | 5,850 | 192,544 | 99\.9688% | Rating \[0,10\] | √ | √ | √ | √ |
+| 35 | [Twitch](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Twitch) | 15,524,309 | 6,161,666 | 474,676,929 | 99\.9995% | Click | | | | √ |
+| 36 | Amazon_M2 | 3,606,349 | 1,410,675 | 15,306,183 | \- | Click | | | √ | √ |
+| 37 | Music4All-Onion | 119,140 | 109,269 | 252,984,396 | \- | Click | √ | | √ | √ |
### CTR Datasets
diff --git a/conversion_tools/README.md b/conversion_tools/README.md
index fa9de21..d2d9ef9 100644
--- a/conversion_tools/README.md
+++ b/conversion_tools/README.md
@@ -22,11 +22,13 @@
| 17 | Ta Feng |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/TaFeng.md)|
| 18 | Foursquare |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/Foursquare.md)|
| 19 | Tmall |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/Tmall.md)|
-| 20 | YOOCHOOSE |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/YOOCHOOSE.md)|
-| 21 | Retailrocket |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/Retailrocket.md)|
-| 22 | LFM\-1b |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/LFM-1b.md)|
-| 23 | MIND |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/MIND.md)|
-| 24 | Music4All_Onion |[Link](https://github.com/RUCAIBox/RecSysDatasets/blob/master/conversion_tools/usage/Onion.md)|
+| 20 | Tmall2014 |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/Tmall2014.md)|
+| 21 | YOOCHOOSE |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/YOOCHOOSE.md)|
+| 22 | Retailrocket |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/Retailrocket.md)|
+| 23 | LFM\-1b |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/LFM-1b.md)|
+| 24 | MIND |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/MIND.md)|
+| 25 | Music4All_Onion |[Link](https://github.com/RUCAIBox/RecSysDatasets/blob/master/conversion_tools/usage/Onion.md)|
+| 26 | Taobao |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/Taobao.md)|
### CTR Datasets
diff --git a/conversion_tools/run.py b/conversion_tools/run.py
index 3d0c40b..cc36750 100644
--- a/conversion_tools/run.py
+++ b/conversion_tools/run.py
@@ -5,8 +5,11 @@
import argparse
import importlib
+import time
+from datetime import datetime
from src.utils import dataset2class, click_dataset, multiple_dataset, multiple_item_features
+from src.logger import logger, format_to_str_box
if __name__ == '__main__':
@@ -29,21 +32,89 @@
assert args.input_path is not None, 'input_path can not be None, please specify the input_path'
assert args.output_path is not None, 'output_path can not be None, please specify the output_path'
+    # Build a configuration summary for logging
+    config_info = {
+        "Dataset": args.dataset,
+        "Input path": args.input_path,
+        "Output path": args.output_path,
+    }
+
+    if args.interaction_type:
+        config_info["Interaction type"] = args.interaction_type
+
+    config_info["Duplicate removal"] = "enabled" if args.duplicate_removal else "disabled"
+    config_info["Start time"] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
+
+    # Print the configuration summary through the logger
+    logger.info("=" * 80)
+    logger.info("📊 Dataset conversion tool started")
+    logger.info(format_to_str_box(config_info))
+    logger.info("=" * 80)
+
+    start_time = time.time()
+
input_args = [args.input_path, args.output_path]
dataset_class_name = dataset2class[args.dataset.lower()]
dataset_class = getattr(importlib.import_module('src.extended_dataset'), dataset_class_name)
if dataset_class_name in multiple_dataset:
- input_args.append(args.interaction_type)
+        # Only append interaction_type when it was given; otherwise pass 'all'
+        # so that every behavior type is processed.
+        if args.interaction_type is not None:
+            input_args.append(args.interaction_type)
+        else:
+            input_args.append('all')
if dataset_class_name in click_dataset:
input_args.append(args.duplicate_removal)
if dataset_class_name in multiple_item_features:
input_args.append(args.item_feature_name)
+
+ logger.info(f"🔧 初始化数据集类: {dataset_class_name}")
datasets = dataset_class(*input_args)
+ logger.info("✅ 数据集类初始化完成")
if args.convert_inter:
+ logger.info("")
+ logger.info("=" * 80)
+ logger.info("🚀 开始转换交互数据 (Inter Data)")
+ logger.info("=" * 80)
datasets.convert_inter()
+ logger.info("=" * 80)
+ logger.info("✅ 交互数据转换完成")
+ logger.info("=" * 80)
+
if args.convert_item:
+ logger.info("")
+ logger.info("=" * 80)
+ logger.info("🚀 开始转换物品特征 (Item Features)")
+ logger.info("=" * 80)
datasets.convert_item()
+ logger.info("=" * 80)
+ logger.info("✅ 物品特征转换完成")
+ logger.info("=" * 80)
if args.convert_user:
+ logger.info("")
+ logger.info("=" * 80)
+ logger.info("🚀 开始转换用户特征 (User Features)")
+ logger.info("=" * 80)
datasets.convert_user()
+ logger.info("=" * 80)
+ logger.info("✅ 用户特征转换完成")
+ logger.info("=" * 80)
+
+    # Compute the total elapsed time
+    end_time = time.time()
+    elapsed_time = end_time - start_time
+
+    # Build the completion summary
+    completion_info = {
+        "Status": "all tasks finished",
+        "Total time": f"{elapsed_time:.2f} s ({elapsed_time/60:.2f} min)",
+        "End time": datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
+        "Output directory": args.output_path
+    }
+
+    logger.info("")
+    logger.info("=" * 80)
+    logger.info("🎉 Conversion finished")
+    logger.info(format_to_str_box(completion_info))
+    logger.info("=" * 80)
diff --git a/conversion_tools/src/extended_dataset.py b/conversion_tools/src/extended_dataset.py
index 2fdabd5..4965fac 100644
--- a/conversion_tools/src/extended_dataset.py
+++ b/conversion_tools/src/extended_dataset.py
@@ -526,6 +526,176 @@ def merge_duplicate(self, inter_table):
return inter_dict
+class TMALL2014Dataset(BaseDataset):
+    def __init__(self, input_path, output_path, interaction_type, duplicate_removal):
+        super(TMALL2014Dataset, self).__init__(input_path, output_path)
+        self.dataset_name = 'tmall2014'
+        self.interaction_type = interaction_type
+        self.duplicate_removal = duplicate_removal
+
+        # validate the interaction type (mirrors TaobaoDataset below)
+        valid_types = ['click', 'cart', 'collect', 'alipay', 'all']
+        assert self.interaction_type in valid_types, f'interaction_type must be in {valid_types}'
+
+        # output file path (aligned with the TMALLDataset style)
+        if self.interaction_type == 'all':
+            # all behavior types merged into one file
+            self.dataset_name = self.dataset_name + '-merged'
+        else:
+            # a single behavior type
+            self.dataset_name = self.dataset_name + '-' + self.interaction_type
+
+        self.output_path = os.path.join(self.output_path, self.dataset_name)
+        self.check_output_path()
+        self.output_inter_file = os.path.join(self.output_path, self.dataset_name + '.inter')
+
+        # input file: use the given path directly (absolute or relative)
+        self.inter_file = self.input_path
+
+        self.sep = ','
+
+        # selected feature fields, depending on whether behavior types are merged
+        if self.interaction_type == 'all':
+            # merged mode: include the action type field
+            if self.duplicate_removal:
+                self.inter_fields = {
+                    0: 'user_id:token',
+                    1: 'item_id:token',
+                    2: 'timestamp:float',
+                    3: 'action_type:token',
+                    4: 'interactions:float'
+                }
+            else:
+                self.inter_fields = {
+                    0: 'user_id:token',
+                    1: 'item_id:token',
+                    2: 'timestamp:float',
+                    3: 'action_type:token'
+                }
+        else:
+            # single-type mode: no action type field
+            if self.duplicate_removal:
+                self.inter_fields = {
+                    0: 'user_id:token',
+                    1: 'item_id:token',
+                    2: 'timestamp:float',
+                    3: 'interactions:float'
+                }
+            else:
+                self.inter_fields = {
+                    0: 'user_id:token',
+                    1: 'item_id:token',
+                    2: 'timestamp:float'
+                }
+
+    def load_inter_data_streaming(self):
+        r"""Stream the raw log file and yield rows one by one to keep memory usage low.
+
+        Raw format (\x01-separated):
+            item_id\x01user_id\x01action\x01timestamp
+        Example: 3903192\x01u6276408\x01click\x012013-08-26 10:41:11
+        """
+        from datetime import datetime
+
+        with open(self.inter_file, 'r') as fin:
+            file_size = os.path.getsize(self.inter_file)
+
+            # update the progress bar in batches of lines instead of per byte
+            processed_bytes = 0
+            update_interval = 10000  # refresh the progress bar every 10000 lines
+            line_count = 0
+
+            with tqdm(total=file_size, unit='B', unit_scale=True) as pbar:
+                for line in fin:
+                    line_count += 1
+                    processed_bytes += len(line)  # approximate: counts characters, not bytes
+
+                    # throttle progress-bar updates
+                    if line_count % update_interval == 0:
+                        pbar.update(processed_bytes)
+                        processed_bytes = 0
+
+                    line = line.strip()
+                    if not line:
+                        continue
+
+                    try:
+                        # fields are separated by \x01
+                        fields = line.split('\x01')
+                        if len(fields) != 4:
+                            continue
+
+                        item_id, user_id, action, vtime = fields
+
+                        # filter by the requested interaction type
+                        if self.interaction_type == 'all':
+                            # merged mode: keep all four behavior types
+                            if action in ['click', 'cart', 'collect', 'alipay']:
+                                dt = datetime.strptime(vtime, '%Y-%m-%d %H:%M:%S')
+                                ts = int(dt.timestamp())
+                                yield [user_id, item_id, str(ts), action]
+                        else:
+                            # single-type mode: keep only the requested type
+                            if action == self.interaction_type:
+                                dt = datetime.strptime(vtime, '%Y-%m-%d %H:%M:%S')
+                                ts = int(dt.timestamp())
+                                yield [user_id, item_id, str(ts)]
+                    except Exception:
+                        continue
+
+                # flush the remaining progress
+                if processed_bytes > 0:
+                    pbar.update(processed_bytes)
+
+    def convert_inter(self):
+        try:
+            with open(self.output_inter_file, 'w', buffering=1024*1024) as fp:  # 1MB write buffer
+                fp.write('\t'.join([self.inter_fields[i] for i in range(len(self.inter_fields))]) + '\n')
+
+                if self.duplicate_removal:
+                    # count repeated interactions, keeping the latest timestamp
+                    inter_dict = {}
+                    for line in self.load_inter_data_streaming():
+                        if self.interaction_type == 'all':
+                            # merged mode: deduplicate per (user, item, action)
+                            user_id, item_id, ts, action = line
+                            key = (user_id, item_id, action)
+                        else:
+                            user_id, item_id, ts = line
+                            key = (user_id, item_id)
+                        if key in inter_dict:
+                            inter_dict[key][0] = ts
+                            inter_dict[key][1] += 1
+                        else:
+                            inter_dict[key] = [ts, 1]
+
+                    for k, v in tqdm(inter_dict.items()):
+                        if self.interaction_type == 'all':
+                            user_id, item_id, action = k
+                            row = [user_id, item_id, v[0], action, v[1]]
+                        else:
+                            row = list(k) + v
+                        fp.write('\t'.join([str(item) for item in row]) + '\n')
+                else:
+                    # batch the writes through a small buffer
+                    buffer = []
+                    buffer_size = 10000
+
+                    for line in self.load_inter_data_streaming():
+                        buffer.append('\t'.join(line))
+                        if len(buffer) >= buffer_size:
+                            fp.write('\n'.join(buffer) + '\n')
+                            buffer.clear()
+
+                    # flush the remaining rows
+                    if buffer:
+                        fp.write('\n'.join(buffer) + '\n')
+
+        except NotImplementedError:
+            print('This dataset can\'t be converted to inter file\n')
+        except Exception as e:
+            print(f'TMALL2014Dataset convert_inter error: {e}')
+
+ def merge_duplicate(self, inter_table):
+ inter_dict = {}
+ for line in inter_table:
+ key = tuple(line[:-1])
+ t = line[-1]
+ if key in inter_dict:
+ inter_dict[key][0] = t
+ inter_dict[key][1] += 1
+ else:
+ inter_dict[key] = [t, 1]
+ return inter_dict
+
class NETFLIXDataset(BaseDataset):
def __init__(self, input_path, output_path):
super(NETFLIXDataset, self).__init__(input_path, output_path)
@@ -5314,3 +5484,162 @@ def convert_inter(self):
fout.write('\t'.join([current_list[0], item, rating, timestamp]) + '\n')
fin.close()
fout.close()
+
+
+class TaobaoDataset(BaseDataset):
+    def __init__(self, input_path, output_path, interaction_type, duplicate_removal):
+        super(TaobaoDataset, self).__init__(input_path, output_path)
+        self.dataset_name = 'taobao'
+        self.interaction_type = interaction_type
+        self.duplicate_removal = duplicate_removal
+
+        # validate the interaction type
+        valid_types = ['pv', 'cart', 'fav', 'buy', 'all']
+        assert self.interaction_type in valid_types, f'interaction_type must be in {valid_types}'
+
+        # output file path, following the same structure as Rec_Tmall
+        if self.interaction_type == 'all':
+            # all behavior types merged into one file
+            self.dataset_name = self.dataset_name + '-merged'
+        else:
+            # a single behavior type
+            self.dataset_name = self.dataset_name + '-' + self.interaction_type
+
+        # create the Rec_Taobao/processed/taobao-{type}/ directory structure
+        self.output_path = os.path.join(self.output_path, 'Rec_Taobao', 'processed', self.dataset_name)
+        self.check_output_path()
+        self.output_inter_file = os.path.join(self.output_path, self.dataset_name + '.inter')
+
+        # input file
+        self.inter_file = self.input_path
+        self.sep = ','
+
+        # selected feature fields, depending on whether behavior types are merged
+        if self.interaction_type == 'all':
+            # merged mode: include the action type field
+            if self.duplicate_removal:
+                self.inter_fields = {
+                    0: 'user_id:token',
+                    1: 'item_id:token',
+                    2: 'timestamp:float',
+                    3: 'action_type:token',
+                    4: 'interactions:float'
+                }
+            else:
+                self.inter_fields = {
+                    0: 'user_id:token',
+                    1: 'item_id:token',
+                    2: 'timestamp:float',
+                    3: 'action_type:token'
+                }
+        else:
+            # single-type mode: no action type field
+            if self.duplicate_removal:
+                self.inter_fields = {
+                    0: 'user_id:token',
+                    1: 'item_id:token',
+                    2: 'timestamp:float',
+                    3: 'interactions:float'
+                }
+            else:
+                self.inter_fields = {
+                    0: 'user_id:token',
+                    1: 'item_id:token',
+                    2: 'timestamp:float'
+                }
+
+    def load_inter_data_streaming(self):
+        """Stream the raw CSV file and yield rows one by one to keep memory usage low.
+
+        Raw format (comma-separated, with a header row):
+            user_id,item_id,category_id,behavior_type,timestamp
+        Example: 1,2268318,2520377,pv,1511544070
+        """
+        with open(self.inter_file, 'r') as fin:
+            file_size = os.path.getsize(self.inter_file)
+
+            # skip the header row
+            next(fin)
+
+            # update the progress bar in batches of lines instead of per byte
+            processed_bytes = 0
+            update_interval = 10000  # refresh the progress bar every 10000 lines
+            line_count = 0
+
+            with tqdm(total=file_size, unit='B', unit_scale=True) as pbar:
+                for line in fin:
+                    line_count += 1
+                    processed_bytes += len(line)  # approximate: counts characters, not bytes
+
+                    # throttle progress-bar updates
+                    if line_count % update_interval == 0:
+                        pbar.update(processed_bytes)
+                        processed_bytes = 0
+
+                    line = line.strip()
+                    if not line:
+                        continue
+
+                    try:
+                        # fields are comma-separated
+                        fields = line.split(',')
+                        if len(fields) != 5:
+                            continue
+
+                        user_id, item_id, category_id, behavior_type, timestamp = fields
+
+                        # filter by the requested interaction type
+                        if self.interaction_type == 'all':
+                            # merged mode: keep all four behavior types
+                            if behavior_type in ['pv', 'cart', 'fav', 'buy']:
+                                yield [user_id, item_id, timestamp, behavior_type]
+                        else:
+                            # single-type mode: keep only the requested type
+                            if behavior_type == self.interaction_type:
+                                yield [user_id, item_id, timestamp]
+                    except Exception:
+                        continue
+
+                # flush the remaining progress
+                if processed_bytes > 0:
+                    pbar.update(processed_bytes)
+
+    def convert_inter(self):
+        try:
+            with open(self.output_inter_file, 'w', buffering=1024*1024) as fp:  # 1MB write buffer
+                fp.write('\t'.join([self.inter_fields[i] for i in range(len(self.inter_fields))]) + '\n')
+
+                if self.duplicate_removal:
+                    # count repeated interactions, keeping the latest timestamp
+                    inter_dict = {}
+                    for line in self.load_inter_data_streaming():
+                        if self.interaction_type == 'all':
+                            # merged mode: deduplicate per (user, item, action)
+                            user_id, item_id, ts, action = line
+                            key = (user_id, item_id, action)
+                        else:
+                            user_id, item_id, ts = line
+                            key = (user_id, item_id)
+                        if key in inter_dict:
+                            inter_dict[key][0] = ts
+                            inter_dict[key][1] += 1
+                        else:
+                            inter_dict[key] = [ts, 1]
+
+                    for k, v in tqdm(inter_dict.items()):
+                        if self.interaction_type == 'all':
+                            user_id, item_id, action = k
+                            row = [user_id, item_id, v[0], action, v[1]]
+                        else:
+                            row = list(k) + v
+                        fp.write('\t'.join([str(item) for item in row]) + '\n')
+                else:
+                    # batch the writes through a small buffer
+                    buffer = []
+                    buffer_size = 10000
+
+                    for line in self.load_inter_data_streaming():
+                        buffer.append('\t'.join(line))
+                        if len(buffer) >= buffer_size:
+                            fp.write('\n'.join(buffer) + '\n')
+                            buffer.clear()
+
+                    # flush the remaining rows
+                    if buffer:
+                        fp.write('\n'.join(buffer) + '\n')
+
+        except NotImplementedError:
+            print('This dataset can\'t be converted to inter file\n')
+        except Exception as e:
+            print(f'TaobaoDataset convert_inter error: {e}')
diff --git a/conversion_tools/src/logger.py b/conversion_tools/src/logger.py
new file mode 100644
index 0000000..9c487be
--- /dev/null
+++ b/conversion_tools/src/logger.py
@@ -0,0 +1,273 @@
+"""
+一个全面的日志模块,提供彩色控制台输出和轮转文件日志功能.
+
+该模块实现了单例模式的日志记录器,支持控制台和文件日志记录,
+当日志文件达到10MB大小时会自动进行轮转.
+"""
+
+import logging
+import sys
+from logging.handlers import RotatingFileHandler
+from pathlib import Path
+from typing import Optional, Union
+
+from colorama import Fore, Style, init # type: ignore
+
+# Log format
+LOG_FORMAT = "%(asctime)s [%(levelname)s] [%(module)s.%(funcName)s] - %(message)s"
+# Variant with full file path and line number:
+# LOG_FORMAT = "%(asctime)s [%(levelname)s] [%(pathname)s:%(lineno)d] - %(message)s"
+# Variant with file name and line number:
+# LOG_FORMAT = "%(asctime)s [%(levelname)s] [%(filename)s:%(lineno)d] - %(message)s"
+
+BERTOPIC_LOG_FORMAT = "%(asctime)s [%(levelname)s] [BERTopic] - %(message)s"
+DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
+
+# Initialize colorama
+init(autoreset=True)
+
+# Colors for each log level
+LOG_COLORS = {
+ "DEBUG": Fore.CYAN,
+ "INFO": Fore.GREEN,
+ "WARNING": Fore.YELLOW,
+ "ERROR": Fore.RED,
+ "CRITICAL": Fore.RED + Style.BRIGHT,
+}
+
+# Log file configuration
+MAX_LOG_SIZE = 10 * 1024 * 1024  # 10MB
+BACKUP_COUNT = 5  # keep 5 backup files
+
+
+class ColoredFormatter(logging.Formatter):
+    """Custom formatter that colors the log level name."""
+
+    def format(self, record):
+        """Format the record, coloring the level name."""
+        # Get the plain formatted message first
+        message = super().format(record)
+
+        # Color the level name
+        level_color = LOG_COLORS.get(record.levelname, "")
+        if level_color:
+            # only the level keyword itself is colored
+            level_name = record.levelname
+            colored_level = f"{level_color}{level_name}{Style.RESET_ALL}"
+            message = message.replace(level_name, colored_level)
+
+        return message
+
+
+def setup_bertopic_logger(log_dir: Path):
+    """
+    Configure the logger used by BERTopic.
+
+    Args:
+        log_dir (Path): directory for the log files.
+    """
+    # Formatter dedicated to BERTopic records
+    bertopic_formatter = ColoredFormatter(BERTOPIC_LOG_FORMAT, datefmt=DATE_FORMAT)
+
+    # Configure the BERTopic logger
+    bertopic_logger = logging.getLogger("BERTopic")
+    bertopic_logger.setLevel(logging.INFO)
+    bertopic_logger.propagate = False  # do not propagate to the root logger
+
+    # Remove existing handlers, if any
+    for handler in bertopic_logger.handlers[:]:
+        bertopic_logger.removeHandler(handler)
+
+    # Dedicated console handler for BERTopic
+    bertopic_console_handler = logging.StreamHandler(sys.stdout)
+    bertopic_console_handler.setFormatter(bertopic_formatter)
+    bertopic_logger.addHandler(bertopic_console_handler)
+
+    # Rotating file handler
+    bertopic_file_handler = RotatingFileHandler(
+        log_dir / "pipeline.log",
+        maxBytes=MAX_LOG_SIZE,
+        backupCount=BACKUP_COUNT,
+        encoding="utf-8",
+    )
+    bertopic_file_handler.setFormatter(bertopic_formatter)
+    bertopic_logger.addHandler(bertopic_file_handler)
+
+
+class Logger:
+    """Singleton logger wrapper."""
+
+    _instance: Optional["Logger"] = None
+    _initialized: bool = False
+
+    def __new__(cls):
+        """Create the singleton instance."""
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+        return cls._instance
+
+    def __init__(self):
+        """Initialize the logger."""
+        if self._initialized:
+            return
+
+        self._initialized = True
+
+        # Configure the root logger
+        root_logger = logging.getLogger()
+        root_logger.setLevel(logging.INFO)
+
+        # Remove all existing handlers
+        for handler in root_logger.handlers[:]:
+            root_logger.removeHandler(handler)
+
+        # Create the log directory
+        log_dir = Path("logs")
+        log_dir.mkdir(exist_ok=True)
+
+        # Create the formatters
+        colored_formatter = ColoredFormatter(LOG_FORMAT, datefmt=DATE_FORMAT)
+        plain_formatter = logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT)
+
+        # Console handler (colored)
+        console_handler = logging.StreamHandler(sys.stdout)
+        console_handler.setFormatter(colored_formatter)
+        root_logger.addHandler(console_handler)
+
+        # File handler (plain text, rotating)
+        file_handler = RotatingFileHandler(
+            log_dir / "pipeline.log",
+            maxBytes=MAX_LOG_SIZE,
+            backupCount=BACKUP_COUNT,
+            encoding="utf-8",
+        )
+        file_handler.setFormatter(plain_formatter)
+        root_logger.addHandler(file_handler)
+
+        # Configure the BERTopic logger
+        setup_bertopic_logger(log_dir)
+
+        # Project-specific logger
+        self.logger = logging.getLogger("RecDatasets")
+        self.logger.setLevel(logging.INFO)
+
+        self.logger.propagate = True
+
+    def get_logger(self) -> logging.Logger:
+        """Return the logger instance."""
+        return self.logger
+
+    def set_level(self, level: int):
+        """Set the log level."""
+        self.logger.setLevel(level)
+        logging.getLogger().setLevel(level)
+
+
+# Global logger instance
+logger = Logger().get_logger()
+
+
+def format_to_str_box(data: Union[dict[str, str], str], max_width: int = 80) -> str:
+    """
+    Format a dict or string as a boxed string, wrapping long lines automatically.
+
+    A CJK character counts as two columns of display width.
+
+    Args:
+        data: either a dict or a string
+            - dict: rendered line by line as "key: value"
+            - str: split into lines and rendered inside the box
+        max_width: maximum display width per line (excluding the border), default 80.
+
+    Returns:
+        The formatted boxed string.
+    """
+    # Display width of a string (CJK characters count as 2 columns)
+    def get_display_width(text):
+        return sum(2 if "\u4e00" <= char <= "\u9fff" else 1 for char in text)
+
+    def wrap_text(text: str, available_width: int) -> list[str]:
+        """Wrap long text to the given display width."""
+        if get_display_width(text) <= available_width:
+            return [text]
+
+        words = text.split()
+        lines = []
+        current_line: list[str] = []
+        current_width = 0
+
+        for word in words:
+            word_width = get_display_width(word)
+            if (
+                current_width + word_width + (1 if current_line else 0)
+                <= available_width
+            ):
+                if current_line:
+                    current_width += 1  # width of the separating space
+                current_line.append(word)
+                current_width += word_width
+            else:
+                if current_line:
+                    lines.append(" ".join(current_line))
+                current_line = [word]
+                current_width = word_width
+
+        if current_line:
+            lines.append(" ".join(current_line))
+        return lines
+
+    result = ""
+    border_length = max_width + 4  # account for the left/right margins
+
+    if isinstance(data, str):
+        lines = []
+        for line in data.split("\n"):
+            lines.extend(wrap_text(line, max_width))
+
+        # Top border
+        result = "+" + "-" * (border_length - 2) + "+\n"
+
+        # Content lines
+        for line in lines:
+            display_width = get_display_width(line)
+            padding = border_length - 4 - display_width  # -4 for "| " and " |"
+            result += f"| {line}" + " " * padding + " |\n"
+
+        # Bottom border
+        result += "+" + "-" * (border_length - 2) + "+"
+
+    elif isinstance(data, dict):
+        # Top border
+        result = "+" + "-" * (border_length - 2) + "+\n"
+
+        # Content lines
+        for key, value in data.items():
+            prefix = f"{key}: "
+            prefix_width = get_display_width(prefix)
+            available_width = max_width - prefix_width
+
+            # Wrap the value if needed
+            value_lines = wrap_text(str(value), available_width)
+
+            # First line (with the key)
+            first_line = prefix + value_lines[0]
+            display_width = get_display_width(first_line)
+            padding = border_length - 4 - display_width
+            result += f"| {first_line}" + " " * padding + " |\n"
+
+            # Continuation lines (if any)
+            for line in value_lines[1:]:
+                display_width = get_display_width(line)
+                # align under the value of the first line
+                indent = " " * prefix_width
+                padding = border_length - 4 - display_width - prefix_width
+                result += f"| {indent}{line}" + " " * padding + " |\n"
+
+        # Bottom border
+        result += "+" + "-" * (border_length - 2) + "+"
+
+    else:
+        raise TypeError("data must be a string or a dict")
+
+    return "\n" + result
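+
+
+# Example usage (illustrative sketch of this module's own API):
+#
+#   from src.logger import logger, format_to_str_box
+#
+#   logger.info("conversion started")
+#   logger.info(format_to_str_box({"Dataset": "tmall_2014", "Output path": "output_data"}))
+#
+# Console output is colored by level; the same records are appended to
+# logs/pipeline.log, which rotates at 10MB with up to 5 backups.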
diff --git a/conversion_tools/src/utils.py b/conversion_tools/src/utils.py
index ea4882c..de46008 100644
--- a/conversion_tools/src/utils.py
+++ b/conversion_tools/src/utils.py
@@ -12,6 +12,7 @@
'avazu': 'AVAZUDataset',
'adult': 'ADULTDataset',
'tmall': 'TMALLDataset',
+ 'tmall_2014': 'TMALL2014Dataset',
'netflix': 'NETFLIXDataset',
'criteo': 'CRITEODataset',
'foursquare': 'FOURSQUAREDataset',
@@ -63,20 +64,23 @@
'mind_large_dev': 'MINDLargeDevDataset',
'mind_small_train': 'MINDSmallTrainDataset',
'mind_small_dev': 'MINDSmallDevDataset',
- 'cosmetics': 'CosmeticsDataset'
+ 'cosmetics': 'CosmeticsDataset',
+ 'taobao': 'TaobaoDataset'
}
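+# Note: run.py builds each dataset's constructor arguments from the sets
+# below: datasets in multiple_dataset receive an interaction_type argument,
+# and datasets in click_dataset receive a duplicate_removal flag.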
click_dataset = {
'YOOCHOOSEDataset',
'RETAILROCKETDataset',
'TMALLDataset',
+ 'TMALL2014Dataset',
'IPINYOUDataset',
'TAFENGDataset',
'LFM1bDataset',
'GOWALLADataset',
'DIGINETICADataset',
'FOURSQUAREDataset',
- 'STEAMDataset'
+ 'STEAMDataset',
+ 'TaobaoDataset'
}
multiple_dataset = {
@@ -85,8 +89,10 @@
'RETAILROCKETDataset',
'TAFENGDataset',
'TMALLDataset',
+ 'TMALL2014Dataset',
'IPINYOUDataset',
- 'LFM1bDataset'
+ 'LFM1bDataset',
+ 'TaobaoDataset'
}
multiple_item_features = {
diff --git a/conversion_tools/usage/Taobao.md b/conversion_tools/usage/Taobao.md
new file mode 100644
index 0000000..f8649c5
--- /dev/null
+++ b/conversion_tools/usage/Taobao.md
@@ -0,0 +1,117 @@
+# Taobao Dataset
+
+## Dataset Information
+
+**For detailed dataset information, please visit:** [Taobao User Behavior Dataset](https://tianchi.aliyun.com/dataset/dataDetail?dataId=649)
+
+## Prerequisites
+
+```bash
+git clone https://github.com/RUCAIBox/RecDatasets
+cd RecDatasets/conversion_tools
+pip install -r requirements.txt
+```
+
+## Data Conversion
+
+### Basic Usage
+
+```bash
+python run.py --dataset taobao \
+ --input_path /path/to/Taobao.csv \
+ --output_path output_data/taobao \
+ --interaction_type pv \
+ --convert_inter
+```
+
+### Parameters
+
+- `--dataset`: `taobao` (required)
+- `--input_path`: Path to the input data file (required)
+- `--output_path`: Directory to store converted files (required)
+- `--interaction_type`: `pv`, `cart`, `fav`, `buy`, or omit to merge all types (optional)
+- `--convert_inter`: Enable conversion (required)
+- `--duplicate_removal`: Enable deduplication (optional)
+
+**Note**: When `--interaction_type` is omitted, all four interaction types (pv, cart, fav, buy) will be merged into a single file with an additional `action_type` column.
+
+### Convert All Interaction Types
+
+#### Method 1: Convert Separately
+```bash
+for type in pv cart fav buy; do
+ python run.py --dataset taobao \
+ --input_path /path/to/Taobao.csv \
+ --output_path output_data/taobao \
+ --interaction_type $type \
+ --convert_inter
+done
+```
+
+#### Method 2: Convert All Types in One File (Recommended)
+```bash
+python run.py --dataset taobao \
+ --input_path /path/to/Taobao.csv \
+ --output_path output_data/taobao \
+ --convert_inter
+```
+
+## Output Format
+
+### Single Interaction Type
+Output file: `output_data/taobao/Rec_Taobao/processed/taobao-{interaction_type}/taobao-{interaction_type}.inter`
+
+```
+user_id:token item_id:token timestamp:float
+1 2268318 1511544070
+```
+
+### All Interaction Types (Merged)
+Output file: `output_data/taobao/Rec_Taobao/processed/taobao-merged/taobao-merged.inter`
+
+```
+user_id:token item_id:token timestamp:float action_type:token
+1 2268318 1511544070 pv
+1 2268318 1511544071 cart
+1 2268318 1511544072 fav
+1 2268318 1511544073 buy
+```
+
+### With `--duplicate_removal`
+
+#### Single Type:
+```
+user_id:token item_id:token timestamp:float interactions:float
+1 2268318 1511544070 3
+```
+
+#### Merged Types:
+```
+user_id:token item_id:token timestamp:float action_type:token interactions:float
+1 2268318 1511544070 pv 1
+1 2268318 1511544071 cart 2
+```
+
+## Dataset Statistics
+
+- **Total interactions**: ~100 million
+- **Behavior types**: pv (page view), cart (add to cart), fav (favorite), buy (purchase)
+- **Time period**: 2017-11-25 to 2017-12-03
+- **Users**: ~1 million
+- **Items**: ~4 million
+
+## Input Format
+
+The input CSV file should have the following format:
+```
+user_id,item_id,category_id,behavior_type,timestamp
+1,2268318,2520377,pv,1511544070
+1,2333346,2520771,pv,1511561733
+```
+
+Where:
+- `user_id`: User identifier
+- `item_id`: Item identifier
+- `category_id`: Category identifier
+- `behavior_type`: One of `pv`, `cart`, `fav`, `buy`
+- `timestamp`: Unix timestamp
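+
+As a rough illustration of what the converter does per row (a minimal sketch, not the full streaming implementation in `src/extended_dataset.py`):
+
+```python
+# One raw CSV row from the Taobao behavior log
+line = "1,2268318,2520377,pv,1511544070"
+user_id, item_id, category_id, behavior_type, timestamp = line.split(",")
+
+# Single-type mode keeps (user_id, item_id, timestamp);
+# merged mode additionally appends behavior_type as action_type.
+print("\t".join([user_id, item_id, timestamp]))                 # 1	2268318	1511544070
+print("\t".join([user_id, item_id, timestamp, behavior_type]))  # 1	2268318	1511544070	pv
+```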
diff --git a/conversion_tools/usage/Tmall2014.md b/conversion_tools/usage/Tmall2014.md
new file mode 100644
index 0000000..ee6aae7
--- /dev/null
+++ b/conversion_tools/usage/Tmall2014.md
@@ -0,0 +1,94 @@
+# Tmall2014
+
+## Dataset Information
+
+**For detailed dataset information, please visit:** [Tianchi Tmall Recommendation Dataset](https://tianchi.aliyun.com/dataset/140281)
+
+## Prerequisites
+
+```bash
+git clone https://github.com/RUCAIBox/RecDatasets
+cd RecDatasets/conversion_tools
+pip install -r requirements.txt
+```
+
+## Data Conversion
+
+### Basic Usage
+
+```bash
+python run.py --dataset tmall_2014 \
+ --input_path /path/to/tianchi_2014002_rec_tmall_log_partc.txt \
+ --output_path output_data/tmall2014 \
+ --interaction_type click \
+ --convert_inter
+```
+
+### Parameters
+
+- `--dataset`: `tmall_2014` (required)
+- `--input_path`: Path to the input data file (required)
+- `--output_path`: Directory to store converted files (required)
+- `--interaction_type`: `click`, `cart`, `collect`, or `alipay` (optional, omit to merge all types)
+- `--convert_inter`: Enable conversion (required)
+- `--duplicate_removal`: Enable deduplication (optional)
+
+**Note**: When `--interaction_type` is omitted, all four interaction types (click, cart, collect, alipay) will be merged into a single file with an additional `action_type` column.
+
+### Convert All Interaction Types
+
+#### Method 1: Convert Separately
+```bash
+for type in click cart collect alipay; do
+ python run.py --dataset tmall_2014 \
+ --input_path /path/to/data.txt \
+ --output_path output_data/tmall2014 \
+ --interaction_type $type \
+ --convert_inter
+done
+```
+
+#### Method 2: Convert All Types in One File (Recommended)
+```bash
+python run.py --dataset tmall_2014 \
+ --input_path /path/to/data.txt \
+ --output_path output_data/tmall2014 \
+ --convert_inter
+```
+
+## Output Format
+
+### Single Interaction Type
+Output file: `output_data/tmall2014/tmall2014-{interaction_type}/tmall2014-{interaction_type}.inter`
+
+```
+user_id:token item_id:token timestamp:float
+u6276408 3903192 1377496871
+```
+
+### All Interaction Types (Merged)
+Output file: `output_data/tmall2014/tmall2014-merged/tmall2014-merged.inter`
+
+```
+user_id:token item_id:token timestamp:float action_type:token
+u6276408 3903192 1377496871 click
+u6276408 3903192 1377496872 cart
+u6276408 3903192 1377496873 collect
+u6276408 3903192 1377496874 alipay
+```
+
+### With `--duplicate_removal`
+
+#### Single Type:
+```
+user_id:token item_id:token timestamp:float interactions:float
+u6276408 3903192 1377496871 3
+```
+
+#### Merged Types:
+```
+user_id:token item_id:token timestamp:float action_type:token interactions:float
+u6276408 3903192 1377496871 click 1
+u6276408 3903192 1377496872 cart 2
+```
+
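+## Input Format
+
+The raw log file uses `\x01` (the ASCII SOH control character) as the field separator:
+
+```
+item_id\x01user_id\x01action\x01timestamp
+3903192\x01u6276408\x01click\x012013-08-26 10:41:11
+```
+
+Where `action` is one of `click`, `cart`, `collect`, `alipay`. A minimal sketch of how the converter parses one line (the exact epoch value depends on the local timezone, since `datetime.strptime` is used without timezone information):
+
+```python
+from datetime import datetime
+
+raw = "3903192\x01u6276408\x01click\x012013-08-26 10:41:11"
+item_id, user_id, action, vtime = raw.split("\x01")
+ts = int(datetime.strptime(vtime, "%Y-%m-%d %H:%M:%S").timestamp())
+# written out as one tab-separated row: user_id, item_id, ts
+```
+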
diff --git a/dataset_info/Tmall2014/README.md b/dataset_info/Tmall2014/README.md
new file mode 100644
index 0000000..7f327ea
--- /dev/null
+++ b/dataset_info/Tmall2014/README.md
@@ -0,0 +1,20 @@
+# Tmall2014
+
+## Dataset Overview
+
+Tmall2014 is a large-scale e-commerce dataset collected from Tmall.com (formerly Taobao Mall), containing user behavior logs from 2013. The dataset includes multiple types of user-item interactions: clicks, add-to-cart, favorites (collect), and purchases (alipay).
+
+**For detailed dataset information, please visit:** [Tianchi Tmall Recommendation Dataset](https://tianchi.aliyun.com/dataset/140281)
+
+## Data Format
+
+The original data file uses `\x01` (the ASCII SOH control character) as the field separator. After conversion, the data is in the RecBole atomic file format (tab-separated):
+
+```
+user_id:token item_id:token timestamp:float
+u6276408 3903192 1377496871
+```
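+
+For reference, a raw log line (before conversion, as documented by the conversion tool) looks like:
+
+```
+3903192\x01u6276408\x01click\x012013-08-26 10:41:11
+```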
+
+## Usage
+
+Please refer to the [conversion tool documentation](../../conversion_tools/usage/Tmall2014.md) for instructions on how to convert this dataset to RecBole format.