diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..5c0f323
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,5 @@
+__pycache__/
+*/__pycache__/
+**/__pycache__/
+*.pyc
+conversion_tools/logs/
diff --git a/README.md b/README.md
index ad819d0..88c71d1 100644
--- a/README.md
+++ b/README.md
@@ -48,6 +48,8 @@ This dataset is a collection of anonymized customer sessions containing products
 - Yelp-full: This is a combination dataset including four versions of yelp datasets mentioned above, where the duplicates are dropped and the number of total reviews is 28,908,240.
 - [Tmall](https://tianchi.aliyun.com/dataset/dataDetail?dataId=53): This dataset is provided by Ant Financial Services, using in the IJCAI16 contest.
+- [Tmall2014](https://tianchi.aliyun.com/dataset/140281):
+  This is a large-scale e-commerce dataset from Tmall.com containing user behavior logs from 2013. The dataset includes multiple types of user-item interactions: clicks, add-to-cart, favorites (collect), and purchases (alipay).
 - [DIGINETICA](https://competitions.codalab.org/competitions/11161): The dataset includes user sessions extracted from an e-commerce search engine logs, with anonymized user ids, hashed queries, hashed query terms, hashed product descriptions and meta-data, log-scaled prices, clicks, and purchases.
@@ -204,23 +206,24 @@ These datasets contain measurements of clothing fit from [RentTheRunway](https:/
 | 17 | [Ta Feng](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/TaFeng) | 32,266 | 23,812 | 817,741 | 99\.89% | Click | √ | √ | √ | √ |
 | 18 | [Foursquare](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Foursquare) | \- | \- | \- | \- | Check-in | √ | | √ | |
 | 19 | [Tmall](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Tmall) | 963,923 | 2,353,207 | 44,528,127 | 99.99% | Click/Buy | √ | | | √ |
-| 20 | [YOOCHOOSE](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/YOOCHOOSE) | 9,249,729 | 52,739 | 34,154,697 | 99.99% | Click/Buy | √ | | | √ |
-| 21 | [Retailrocket](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Retailrocket) | 1,407,580 | 247,085 | 2,756,101 | 99.99% | View/Addtocart/Transaction | √ | | | |
-| 22 | [LFM-1b](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/LFM-1b) | 120,322 | 3,123,496 | 1,088,161,692 | 99\.71% | Click | √ | √ | √ | √ |
-| 23 | [MIND](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/MIND) | - | - | - | - | Click | √ | | | |
-| 24 | BeerAdvocate | 33,388 | 66,055 | 1,586,614 | 99\.9281% | Rating<br>\[0,5\] | √ | | √ | |
-| 25 | Behance | 63,497 | 178,788 | 1,000,000 | 99\.9912% | Likes | √ | | √ | |
-| 26 | DianPing | 542,706 | 243,247 | 4,422,473 | 99\.9967% | Rating<br>\[0,5\] | √ | | √ | √ |
-| 27 | EndoMondo | 1,104 | 253,020 | 253,020 | 99\.9094% | Workout Logs | √ | √ | | √ |
-| 28 | Food | 226,570 | 231,637 | 1,132,367 | 99\.9978% | Rating<br>\[0,5\] | √ | | √ | |
-| 29 | GoodReads | 876,145 | 2,360,650 | 228,648,342 | 99\.9889% | Rating<br>\[0,5\] | √ | | √ | |
-| 30 | [KGRec](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/KGRec) | - | - | - | - | Click | | | √ | |
-| 31 | ModCloth | 47,958 | 1,378 | 82,790 | 99\.8747% | Rating<br>\[0,5\] | | √ | √ | √ |
-| 32 | RateBeer | 29,265 | 110,369 | 2,924,163 | 99\.9095% | Overall Rating<br>\[0,20\] | √ | | √ | √ |
-| 33 | RentTheRunway | 105,571 | 5,850 | 192,544 | 99\.9688% | Rating<br>\[0,10\] | √ | √ | √ | √ |
-| 34 | [Twitch](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Twitch) | 15,524,309 | 6,161,666 | 474,676,929 | 99\.9995% | Click | | | | √ |
-| 35 | Amazon_M2 | 3,606,349 | 1,410,675 | 15,306,183 | \- | Click | | | √ | √ |
-| 36 | Music4All-Onion | 119,140 | 109,269 | 252,984,396 | \- | Click | √ | | √ | √ |
+| 20 | [Tmall2014](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Tmall2014) | ~1,500,000 | ~8,000,000 | ~22,400,000 (click) | 99.99% | Click/Cart/Collect/Alipay | √ | | | |
+| 21 | [YOOCHOOSE](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/YOOCHOOSE) | 9,249,729 | 52,739 | 34,154,697 | 99.99% | Click/Buy | √ | | | √ |
+| 22 | [Retailrocket](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Retailrocket) | 1,407,580 | 247,085 | 2,756,101 | 99.99% | View/Addtocart/Transaction | √ | | | |
+| 23 | [LFM-1b](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/LFM-1b) | 120,322 | 3,123,496 | 1,088,161,692 | 99\.71% | Click | √ | √ | √ | √ |
+| 24 | [MIND](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/MIND) | - | - | - | - | Click | √ | | | |
+| 25 | BeerAdvocate | 33,388 | 66,055 | 1,586,614 | 99\.9281% | Rating<br>\[0,5\] | √ | | √ | |
+| 26 | Behance | 63,497 | 178,788 | 1,000,000 | 99\.9912% | Likes | √ | | √ | |
+| 27 | DianPing | 542,706 | 243,247 | 4,422,473 | 99\.9967% | Rating<br>\[0,5\] | √ | | √ | √ |
+| 28 | EndoMondo | 1,104 | 253,020 | 253,020 | 99\.9094% | Workout Logs | √ | √ | | √ |
+| 29 | Food | 226,570 | 231,637 | 1,132,367 | 99\.9978% | Rating<br>\[0,5\] | √ | | √ | |
+| 30 | GoodReads | 876,145 | 2,360,650 | 228,648,342 | 99\.9889% | Rating<br>\[0,5\] | √ | | √ | |
+| 31 | [KGRec](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/KGRec) | - | - | - | - | Click | | | √ | |
+| 32 | ModCloth | 47,958 | 1,378 | 82,790 | 99\.8747% | Rating<br>\[0,5\] | | √ | √ | √ |
+| 33 | RateBeer | 29,265 | 110,369 | 2,924,163 | 99\.9095% | Overall Rating<br>\[0,20\] | √ | | √ | √ |
+| 34 | RentTheRunway | 105,571 | 5,850 | 192,544 | 99\.9688% | Rating<br>\[0,10\] | √ | √ | √ | √ |
+| 35 | [Twitch](https://github.com/RUCAIBox/RecommenderSystems-Datasets/tree/master/dataset_info/Twitch) | 15,524,309 | 6,161,666 | 474,676,929 | 99\.9995% | Click | | | | √ |
+| 36 | Amazon_M2 | 3,606,349 | 1,410,675 | 15,306,183 | \- | Click | | | √ | √ |
+| 37 | Music4All-Onion | 119,140 | 109,269 | 252,984,396 | \- | Click | √ | | √ | √ |
 
 ### CTR Datasets
diff --git a/conversion_tools/README.md b/conversion_tools/README.md
index fa9de21..d2d9ef9 100644
--- a/conversion_tools/README.md
+++ b/conversion_tools/README.md
@@ -22,11 +22,12 @@
 | 17 | Ta Feng |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/TaFeng.md)|
 | 18 | Foursquare |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/Foursquare.md)|
 | 19 | Tmall |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/Tmall.md)|
-| 20 | YOOCHOOSE |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/YOOCHOOSE.md)|
-| 21 | Retailrocket |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/Retailrocket.md)|
-| 22 | LFM\-1b |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/LFM-1b.md)|
-| 23 | MIND |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/MIND.md)|
-| 24 | Music4All_Onion |[Link](https://github.com/RUCAIBox/RecSysDatasets/blob/master/conversion_tools/usage/Onion.md)|
+| 20 | Tmall2014 |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/Tmall2014.md)|
+| 21 | YOOCHOOSE |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/YOOCHOOSE.md)|
+| 22 | Retailrocket |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/Retailrocket.md)|
+| 23 | LFM\-1b |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/LFM-1b.md)|
+| 24 | MIND |[Link](https://github.com/RUCAIBox/RecDatasets/blob/master/conversion_tools/usage/MIND.md)|
+| 25 | Music4All_Onion |[Link](https://github.com/RUCAIBox/RecSysDatasets/blob/master/conversion_tools/usage/Onion.md)|
 
 ### CTR Datasets
diff --git a/conversion_tools/run.py b/conversion_tools/run.py
index 3d0c40b..cc36750 100644
--- a/conversion_tools/run.py
+++ b/conversion_tools/run.py
@@ -5,8 +5,11 @@
 import argparse
 import importlib
+import time
+from datetime import datetime
 
 from src.utils import dataset2class, click_dataset, multiple_dataset, multiple_item_features
+from src.logger import logger, format_to_str_box
 
 
 if __name__ == '__main__':
@@ -29,21 +32,89 @@
     assert args.input_path is not None, 'input_path can not be None, please specify the input_path'
     assert args.output_path is not None, 'output_path can not be None, please specify the output_path'
 
+    # Collect the run configuration for the startup banner
+    config_info = {
+        "Dataset": args.dataset,
+        "Input path": args.input_path,
+        "Output path": args.output_path,
+    }
+
+    if args.interaction_type:
+        config_info["Interaction type"] = args.interaction_type
+
+    config_info["Duplicate removal"] = "enabled" if args.duplicate_removal else "disabled"
+    config_info["Start time"] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
+
+    # Emit the configuration through the logger
+    logger.info("=" * 80)
+    logger.info("📊 Dataset conversion tool started")
+    logger.info(format_to_str_box(config_info))
+    logger.info("=" * 80)
+
+    start_time = time.time()
+
     input_args = [args.input_path, args.output_path]
     dataset_class_name = dataset2class[args.dataset.lower()]
     dataset_class = getattr(importlib.import_module('src.extended_dataset'), dataset_class_name)
     if dataset_class_name in multiple_dataset:
-        input_args.append(args.interaction_type)
+        # Append interaction_type only when it is given; otherwise pass 'all'
+        # so that every behavior type is processed
+        if args.interaction_type is not None:
+            input_args.append(args.interaction_type)
+        else:
+            input_args.append('all')
     if dataset_class_name in click_dataset:
         input_args.append(args.duplicate_removal)
     if dataset_class_name in multiple_item_features:
         input_args.append(args.item_feature_name)
+
+    logger.info(f"🔧 Initializing dataset class: {dataset_class_name}")
     datasets = dataset_class(*input_args)
+    logger.info("✅ Dataset class initialized")
 
     if args.convert_inter:
+        logger.info("")
+        logger.info("=" * 80)
+        logger.info("🚀 Converting interaction data (inter file)")
+        logger.info("=" * 80)
         datasets.convert_inter()
+        logger.info("=" * 80)
+        logger.info("✅ Interaction data conversion finished")
+        logger.info("=" * 80)
+
     if args.convert_item:
+        logger.info("")
+        logger.info("=" * 80)
+        logger.info("🚀 Converting item features")
+        logger.info("=" * 80)
         datasets.convert_item()
+        logger.info("=" * 80)
+        logger.info("✅ Item feature conversion finished")
+        logger.info("=" * 80)
+
     if args.convert_user:
+        logger.info("")
+        logger.info("=" * 80)
+        logger.info("🚀 Converting user features")
+        logger.info("=" * 80)
         datasets.convert_user()
+        logger.info("=" * 80)
+        logger.info("✅ User feature conversion finished")
+        logger.info("=" * 80)
+
+    # Compute the total elapsed time
+    end_time = time.time()
+    elapsed_time = end_time - start_time
+
+    # Collect the completion summary
+    completion_info = {
+        "Status": "all tasks completed",
+        "Total time": f"{elapsed_time:.2f} s ({elapsed_time/60:.2f} min)",
+        "End time": datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
+        "Output directory": args.output_path
+    }
+
+    logger.info("")
+    logger.info("=" * 80)
+    logger.info("🎉 All conversion tasks finished")
+    logger.info(format_to_str_box(completion_info))
+    logger.info("=" * 80)
diff --git a/conversion_tools/src/extended_dataset.py b/conversion_tools/src/extended_dataset.py
index 2fdabd5..4965fac 100644
--- a/conversion_tools/src/extended_dataset.py
+++ b/conversion_tools/src/extended_dataset.py
@@ -526,6 +526,176 @@ def merge_duplicate(self, inter_table):
         return inter_dict
 
+class TMALL2014Dataset(BaseDataset):
+    def __init__(self, input_path, output_path, interaction_type, duplicate_removal):
+        super(TMALL2014Dataset, self).__init__(input_path, output_path)
+        self.dataset_name = 'tmall2014'
+        self.interaction_type = interaction_type
+        self.duplicate_removal = duplicate_removal
+
+        # output file path (align with TMALLDataset style)
+        if self.interaction_type == 'all':
+            # merged mode: all behavior types go into one file
+            self.dataset_name = self.dataset_name + '-merged'
+        else:
+            # single behavior type
+            self.dataset_name = self.dataset_name + '-' + self.interaction_type
+
+        self.output_path = os.path.join(self.output_path, self.dataset_name)
+        self.check_output_path()
+        self.output_inter_file = os.path.join(self.output_path, self.dataset_name + '.inter')
+
+        # input file: use the given path as-is (absolute or relative)
+        self.inter_file = self.input_path
+
+        self.sep = ','
+
+        # selected feature fields - depend on whether all behavior types are merged
+        if self.interaction_type == 'all':
+            # merged mode: include the action-type field
+            if self.duplicate_removal:
+                self.inter_fields = {
+                    0: 'user_id:token',
+                    1: 'item_id:token',
+                    2: 'timestamp:float',
+                    3: 'action_type:token',
+                    4: 'interactions:float'
+                }
+            else:
+                self.inter_fields = {
+                    0: 'user_id:token',
+                    1: 'item_id:token',
+                    2: 'timestamp:float',
+                    3: 'action_type:token'
+                }
+        else:
+            # single-type mode: no action-type field
+            if self.duplicate_removal:
+                self.inter_fields = {
+                    0: 'user_id:token',
+                    1: 'item_id:token',
+                    2: 'timestamp:float',
+                    3: 'interactions:float'
+                }
+            else:
+                self.inter_fields = {
+                    0: 'user_id:token',
+                    1: 'item_id:token',
+                    2: 'timestamp:float'
+                }
+
+    def load_inter_data_streaming(self):
+        """Stream the raw file and yield records one at a time to keep memory usage low.
+
+        Raw format (\x01-separated):
+            item_id\x01user_id\x01action\x01timestamp
+        Example: 3903192\x01u6276408\x01click\x012013-08-26 10:41:11
+        """
+        import os
+        from datetime import datetime
+
+        with open(self.inter_file, 'r') as fin:
+            file_size = os.path.getsize(self.inter_file)
+
+            # update the progress bar per batch of lines rather than per byte
+            processed_bytes = 0
+            update_interval = 10000  # refresh the progress bar every 10000 lines
+            line_count = 0
+
+            with tqdm(total=file_size, unit='B', unit_scale=True) as pbar:
+                for line in fin:
+                    line_count += 1
+                    line_bytes = len(line)
+                    processed_bytes += line_bytes
+
+                    # throttle progress-bar updates
+                    if line_count % update_interval == 0:
+                        pbar.update(processed_bytes)
+                        processed_bytes = 0
+
+                    line = line.strip()
+                    if not line:
+                        continue
+
+                    try:
+                        # \x01 is the field separator
+                        fields = line.split('\x01')
+                        if len(fields) != 4:
+                            continue
+
+                        item_id, user_id, action, vtime = fields
+
+                        # filter by the configured interaction type
+                        if self.interaction_type == 'all':
+                            # merged mode: keep all four behavior types
+                            if action in ['click', 'cart', 'collect', 'alipay']:
+                                dt = datetime.strptime(vtime, '%Y-%m-%d %H:%M:%S')
+                                ts = int(dt.timestamp())
+                                yield [user_id, item_id, str(ts), action]
+                        else:
+                            # single-type mode: keep only the requested type
+                            if action == self.interaction_type:
+                                dt = datetime.strptime(vtime, '%Y-%m-%d %H:%M:%S')
+                                ts = int(dt.timestamp())
+                                yield [user_id, item_id, str(ts)]
+                    except Exception:
+                        continue
+
+                # flush the remaining progress
+                if processed_bytes > 0:
+                    pbar.update(processed_bytes)
+
+    def convert_inter(self):
+        try:
+            with open(self.output_inter_file, 'w', buffering=1024*1024) as fp:  # 1MB write buffer
+                fp.write('\t'.join([self.inter_fields[i] for i in range(len(self.inter_fields))]) + '\n')
+
+                if self.duplicate_removal:
+                    # group on every field except the last one; keep the last value
+                    # of that field plus an occurrence count
+                    inter_dict = {}
+                    for line in self.load_inter_data_streaming():
+                        key = tuple(line[:-1])
+                        t = line[-1]
+                        if key in inter_dict:
+                            inter_dict[key][0] = t
+                            inter_dict[key][1] += 1
+                        else:
+                            inter_dict[key] = [t, 1]
+
+                    for k, v in tqdm(inter_dict.items()):
+                        fp.write('\t'.join([str(item) for item in list(k) + v]) + '\n')
+                else:
+                    # batched writes for speed
+                    buffer = []
+                    buffer_size = 10000
+
+                    for line in self.load_inter_data_streaming():
+                        buffer.append('\t'.join(line))
+                        if len(buffer) >= buffer_size:
+                            fp.write('\n'.join(buffer) + '\n')
+                            buffer.clear()
+
+                    # write any remaining buffered rows
+                    if buffer:
+                        fp.write('\n'.join(buffer) + '\n')
+
+        except NotImplementedError:
+            print('This dataset can\'t be converted to inter file\n')
+        except Exception as e:
+            print(f'TMALL2014Dataset convert_inter error: {e}')
+
+    def merge_duplicate(self, inter_table):
+        inter_dict = {}
+        for line in inter_table:
+            key = tuple(line[:-1])
+            t = line[-1]
+            if key in inter_dict:
+                inter_dict[key][0] = t
+                inter_dict[key][1] += 1
+            else:
+                inter_dict[key] = [t, 1]
+        return inter_dict
+
 class NETFLIXDataset(BaseDataset):
     def __init__(self, input_path, output_path):
         super(NETFLIXDataset, self).__init__(input_path, output_path)
@@ -5314,3 +5484,162 @@ def convert_inter(self):
             fout.write('\t'.join([current_list[0], item, rating, timestamp]) + '\n')
         fin.close()
         fout.close()
+
+
+class TaobaoDataset(BaseDataset):
+    def __init__(self, input_path, output_path, interaction_type, duplicate_removal):
+        super(TaobaoDataset, self).__init__(input_path, output_path)
+        self.dataset_name = 'taobao'
+        self.interaction_type = interaction_type
+        self.duplicate_removal = duplicate_removal
+
+        # validate the interaction type
+        valid_types = ['pv', 'cart', 'fav', 'buy', 'all']
+        assert self.interaction_type in valid_types, f'interaction_type must be in {valid_types}'
+
+        # output file path - mirror the Rec_Tmall directory structure
+        if self.interaction_type == 'all':
+            # merged mode: all behavior types go into one file
+            self.dataset_name = self.dataset_name + '-merged'
+        else:
+            # single behavior type
+            self.dataset_name = self.dataset_name + '-' + self.interaction_type
+
+        # create the Rec_Taobao/processed/taobao-{type}/ layout
+        self.output_path = os.path.join(self.output_path, 'Rec_Taobao', 'processed', self.dataset_name)
+        self.check_output_path()
+        self.output_inter_file = os.path.join(self.output_path, self.dataset_name + '.inter')
+
+        # input file
+        self.inter_file = self.input_path
+        self.sep = ','
+
+        # selected feature fields - depend on whether all behavior types are merged
+        if self.interaction_type == 'all':
+            # merged mode: include the action-type field
+            if self.duplicate_removal:
+                self.inter_fields = {
+                    0: 'user_id:token',
+                    1: 'item_id:token',
+                    2: 'timestamp:float',
+                    3: 'action_type:token',
+                    4: 'interactions:float'
+                }
+            else:
+                self.inter_fields = {
+                    0: 'user_id:token',
+                    1: 'item_id:token',
+                    2: 'timestamp:float',
+                    3: 'action_type:token'
+                }
+        else:
+            # single-type mode: no action-type field
+            if self.duplicate_removal:
+                self.inter_fields = {
+                    0: 'user_id:token',
+                    1: 'item_id:token',
+                    2: 'timestamp:float',
+                    3: 'interactions:float'
+                }
+            else:
+                self.inter_fields = {
+                    0: 'user_id:token',
+                    1: 'item_id:token',
+                    2: 'timestamp:float'
+                }
+
+    def load_inter_data_streaming(self):
+        """Stream the raw file and yield records one at a time to keep memory usage low.
+
+        Raw format (comma-separated CSV):
+            user_id,item_id,category_id,behavior_type,timestamp
+        Example: 1,2268318,2520377,pv,1511544070
+        """
+        import os
+
+        with open(self.inter_file, 'r') as fin:
+            file_size = os.path.getsize(self.inter_file)
+
+            # skip the header row
+            next(fin)
+
+            # update the progress bar per batch of lines rather than per byte
+            processed_bytes = 0
+            update_interval = 10000  # refresh the progress bar every 10000 lines
+            line_count = 0
+
+            with tqdm(total=file_size, unit='B', unit_scale=True) as pbar:
+                for line in fin:
+                    line_count += 1
+                    line_bytes = len(line)
+                    processed_bytes += line_bytes
+
+                    # throttle progress-bar updates
+                    if line_count % update_interval == 0:
+                        pbar.update(processed_bytes)
+                        processed_bytes = 0
+
+                    line = line.strip()
+                    if not line:
+                        continue
+
+                    try:
+                        # comma is the field separator
+                        fields = line.split(',')
+                        if len(fields) != 5:
+                            continue
+
+                        user_id, item_id, category_id, behavior_type, timestamp = fields
+
+                        # filter by the configured interaction type
+                        if self.interaction_type == 'all':
+                            # merged mode: keep all four behavior types
+                            if behavior_type in ['pv', 'cart', 'fav', 'buy']:
+                                yield [user_id, item_id, timestamp, behavior_type]
+                        else:
+                            # single-type mode: keep only the requested type
+                            if behavior_type == self.interaction_type:
+                                yield [user_id, item_id, timestamp]
+                    except Exception:
+                        continue
+
+                # flush the remaining progress
+                if processed_bytes > 0:
+                    pbar.update(processed_bytes)
+
+    def convert_inter(self):
+        try:
+            with open(self.output_inter_file, 'w', buffering=1024*1024) as fp:  # 1MB write buffer
+                fp.write('\t'.join([self.inter_fields[i] for i in range(len(self.inter_fields))]) + '\n')
+
+                if self.duplicate_removal:
+                    # group on every field except the last one; keep the last value
+                    # of that field plus an occurrence count
+                    inter_dict = {}
+                    for line in self.load_inter_data_streaming():
+                        key = tuple(line[:-1])
+                        t = line[-1]
+                        if key in inter_dict:
+                            inter_dict[key][0] = t
+                            inter_dict[key][1] += 1
+                        else:
+                            inter_dict[key] = [t, 1]
+
+                    for k, v in tqdm(inter_dict.items()):
+                        fp.write('\t'.join([str(item) for item in list(k) + v]) + '\n')
+                else:
+                    # batched writes for speed
+                    buffer = []
+                    buffer_size = 10000
+
+                    for line in self.load_inter_data_streaming():
+                        buffer.append('\t'.join(line))
+                        if len(buffer) >= buffer_size:
+                            fp.write('\n'.join(buffer) + '\n')
+                            buffer.clear()
+
+                    # write any remaining buffered rows
+                    if buffer:
+                        fp.write('\n'.join(buffer) + '\n')
+
+        except NotImplementedError:
+            print('This dataset can\'t be converted to inter file\n')
diff --git a/conversion_tools/src/logger.py b/conversion_tools/src/logger.py
new file mode 100644
index 0000000..9c487be
--- /dev/null
+++ b/conversion_tools/src/logger.py
@@ -0,0 +1,273 @@
+"""
+A logging module that provides colored console output and rotating file logs.
+
+It implements a singleton logger that writes to both the console and a file,
+rotating the log file automatically once it reaches 10MB.
+"""
+
+import logging
+import sys
+from logging.handlers import RotatingFileHandler
+from pathlib import Path
+from typing import Optional, Union
+
+from colorama import Fore, Style, init  # type: ignore
+
+# log line format
+LOG_FORMAT = "%(asctime)s [%(levelname)s] [%(module)s.%(funcName)s] - %(message)s"
+# variant with the full file path and line number:
+# LOG_FORMAT = "%(asctime)s [%(levelname)s] [%(pathname)s:%(lineno)d] - %(message)s"
+# variant with the file name and line number:
+# LOG_FORMAT = "%(asctime)s [%(levelname)s] [%(filename)s:%(lineno)d] - %(message)s"
+
+BERTOPIC_LOG_FORMAT = "%(asctime)s [%(levelname)s] [BERTopic] - %(message)s"
+DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
+
+# initialize colorama
+init(autoreset=True)
+
+# color for each log level
+LOG_COLORS = {
+    "DEBUG": Fore.CYAN,
+    "INFO": Fore.GREEN,
+    "WARNING": Fore.YELLOW,
+    "ERROR": Fore.RED,
+    "CRITICAL": Fore.RED + Style.BRIGHT,
+}
+
+# log file configuration
+MAX_LOG_SIZE = 10 * 1024 * 1024  # 10MB
+BACKUP_COUNT = 5  # keep 5 backup files
+
+
+class ColoredFormatter(logging.Formatter):
+    """Formatter that colorizes the log level name."""
+
+    def format(self, record):
+        """Format the record, coloring only the level keyword."""
+        # render the plain message first
+        message = super().format(record)
+
+        # colorize the level name
+        level_color = LOG_COLORS.get(record.levelname, "")
+        if level_color:
+            # color only the level keyword itself
+            level_name = record.levelname
+            colored_level = f"{level_color}{level_name}{Style.RESET_ALL}"
+            message = message.replace(level_name, colored_level)
+
+        return message
+
+
+def setup_bertopic_logger(log_dir: Path):
+    """
+    Configure the dedicated BERTopic logger.
+
+    Args:
+        log_dir (Path): directory that holds the log files.
+    """
+    # BERTopic-specific formatter
+    bertopic_formatter = ColoredFormatter(BERTOPIC_LOG_FORMAT, datefmt=DATE_FORMAT)
+
+    # configure the BERTopic logger
+    bertopic_logger = logging.getLogger("BERTopic")
+    bertopic_logger.setLevel(logging.INFO)
+    bertopic_logger.propagate = False  # do not propagate to the root logger
+
+    # remove any existing handlers
+    for handler in bertopic_logger.handlers[:]:
+        bertopic_logger.removeHandler(handler)
+
+    # dedicated console handler for BERTopic
+    bertopic_console_handler = logging.StreamHandler(sys.stdout)
+    bertopic_console_handler.setFormatter(bertopic_formatter)
+    bertopic_logger.addHandler(bertopic_console_handler)
+
+    # rotating file handler
+    bertopic_file_handler = RotatingFileHandler(
+        log_dir / "pipeline.log",
+        maxBytes=MAX_LOG_SIZE,
+        backupCount=BACKUP_COUNT,
+        encoding="utf-8",
+    )
+    bertopic_file_handler.setFormatter(bertopic_formatter)
+    bertopic_logger.addHandler(bertopic_file_handler)
+
+
+class Logger:
+    """Singleton logger wrapper."""
+
+    _instance: Optional["Logger"] = None
+    _initialized: bool = False
+
+    def __new__(cls):
+        """Create the singleton instance."""
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+        return cls._instance
+
+    def __init__(self):
+        """Initialize the logger exactly once."""
+        if self._initialized:
+            return
+
+        self._initialized = True
+
+        # configure the root logger
+        root_logger = logging.getLogger()
+        root_logger.setLevel(logging.INFO)
+
+        # remove all existing handlers
+        for handler in root_logger.handlers[:]:
+            root_logger.removeHandler(handler)
+
+        # create the log directory
+        log_dir = Path("logs")
+        log_dir.mkdir(exist_ok=True)
+
+        # formatters
+        colored_formatter = ColoredFormatter(LOG_FORMAT, datefmt=DATE_FORMAT)
+        plain_formatter = logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT)
+
+        # console handler (colored)
+        console_handler = logging.StreamHandler(sys.stdout)
+        console_handler.setFormatter(colored_formatter)
+        root_logger.addHandler(console_handler)
+
+        # file handler (plain text, rotating)
+        file_handler = RotatingFileHandler(
+            log_dir / "pipeline.log",
+            maxBytes=MAX_LOG_SIZE,
+            backupCount=BACKUP_COUNT,
+            encoding="utf-8",
+        )
+        file_handler.setFormatter(plain_formatter)
+        root_logger.addHandler(file_handler)
+
+        # configure the BERTopic logger
+        setup_bertopic_logger(log_dir)
+
+        # project-specific logger
+        self.logger = logging.getLogger("TextMiningPipeline")
+        self.logger.setLevel(logging.INFO)
+
+        self.logger.propagate = True
+
+    def get_logger(self) -> logging.Logger:
+        """Return the logger instance."""
+        return self.logger
+
+    def set_level(self, level: int):
+        """Set the log level."""
+        self.logger.setLevel(level)
+        logging.getLogger().setLevel(level)
+
+
+# Global logger instance
+logger = Logger().get_logger()
+
+
+def format_to_str_box(data: Union[dict[str, str], str], max_width: int = 80) -> str:
+    """
+    Format a dict or string as a boxed string, wrapping long lines automatically.
+
+    A CJK character counts as two ASCII characters of display width.
+
+    Args:
+        data: either a dict or a string
+            - dict: rendered line by line as "key: value"
+            - string: split on newlines and rendered inside the box
+        max_width: maximum display width per line (excluding the border), default 80.
+
+    Returns:
+        The formatted box as a string.
+    """
+    # display width of a string
+    def get_display_width(text):
+        return sum(2 if "\u4e00" <= char <= "\u9fff" else 1 for char in text)
+
+    def wrap_text(text: str, available_width: int) -> list[str]:
+        """Wrap long text to the given display width."""
+        if get_display_width(text) <= available_width:
+            return [text]
+
+        words = text.split()
+        lines = []
+        current_line: list[str] = []
+        current_width = 0
+
+        for word in words:
+            word_width = get_display_width(word)
+            if (
+                current_width + word_width + (1 if current_line else 0)
+                <= available_width
+            ):
+                if current_line:
+                    current_width += 1  # width of the joining space
+                current_line.append(word)
+                current_width += word_width
+            else:
+                if current_line:
+                    lines.append(" ".join(current_line))
+                current_line = [word]
+                current_width = word_width
+
+        if current_line:
+            lines.append(" ".join(current_line))
+        return lines
+
+    result = ""
+    border_length = max_width + 4  # add left and right margins
+
+    if isinstance(data, str):
+        lines = []
+        for line in data.split("\n"):
+            lines.extend(wrap_text(line, max_width))
+
+        # top border
+        result = "+" + "-" * (border_length - 2) + "+\n"
+
+        # content lines
+        for line in lines:
+            display_width = get_display_width(line)
+            padding = border_length - 4 - display_width  # -4 accounts for "| " and " |"
+            result += f"| {line}" + " " * padding + " |\n"
+
+        # bottom border
+        result += "+" + "-" * (border_length - 2) + "+"
+
+    elif isinstance(data, dict):
+        # top border
+        result = "+" + "-" * (border_length - 2) + "+\n"
+
+        # content lines
+        for key, value in data.items():
+            prefix = f"{key}: "
+            prefix_width = get_display_width(prefix)
+            available_width = max_width - prefix_width
+
+            # wrap the value if needed
+            value_lines = wrap_text(str(value), available_width)
+
+            # first line carries the key
+            first_line = prefix + value_lines[0]
+            display_width = get_display_width(first_line)
+            padding = border_length - 4 - display_width
+            result += f"| {first_line}" + " " * padding + " |\n"
+
+            # continuation lines, if any
+            for line in value_lines[1:]:
+                display_width = get_display_width(line)
+                # align with the value on the previous line
+                indent = " " * prefix_width
+                padding = border_length - 4 - display_width - prefix_width
+                result += f"| {indent}{line}" + " " * padding + " |\n"
+
+        # bottom border
+        result += "+" + "-" * (border_length - 2) + "+"
+
+    else:
+        raise TypeError("data must be a string or a dict")
+
+    return "\n" + result
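+
+
+if __name__ == "__main__":
+    # Minimal usage sketch (an illustrative addition, not required by run.py):
+    # renders a small config dict as the ASCII box that run.py prints in its
+    # startup banner. The keys below are arbitrary example values.
+    demo_config = {
+        "Dataset": "tmall_2014",
+        "Status": "dry run",
+    }
+    logger.info(format_to_str_box(demo_config))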
diff --git a/conversion_tools/src/utils.py b/conversion_tools/src/utils.py
index ea4882c..de46008 100644
--- a/conversion_tools/src/utils.py
+++ b/conversion_tools/src/utils.py
@@ -12,6 +12,7 @@
     'avazu': 'AVAZUDataset',
     'adult': 'ADULTDataset',
     'tmall': 'TMALLDataset',
+    'tmall_2014': 'TMALL2014Dataset',
     'netflix': 'NETFLIXDataset',
     'criteo': 'CRITEODataset',
     'foursquare': 'FOURSQUAREDataset',
@@ -63,20 +64,23 @@
     'mind_large_dev': 'MINDLargeDevDataset',
     'mind_small_train': 'MINDSmallTrainDataset',
     'mind_small_dev': 'MINDSmallDevDataset',
-    'cosmetics': 'CosmeticsDataset'
+    'cosmetics': 'CosmeticsDataset',
+    'taobao': 'TaobaoDataset'
 }
 
 click_dataset = {
     'YOOCHOOSEDataset',
     'RETAILROCKETDataset',
     'TMALLDataset',
+    'TMALL2014Dataset',
     'IPINYOUDataset',
     'TAFENGDataset',
     'LFM1bDataset',
     'GOWALLADataset',
     'DIGINETICADataset',
     'FOURSQUAREDataset',
-    'STEAMDataset'
+    'STEAMDataset',
+    'TaobaoDataset'
 }
 
 multiple_dataset = {
@@ -85,8 +89,10 @@
     'RETAILROCKETDataset',
     'TAFENGDataset',
     'TMALLDataset',
+    'TMALL2014Dataset',
     'IPINYOUDataset',
-    'LFM1bDataset'
+    'LFM1bDataset',
+    'TaobaoDataset'
 }
 
 multiple_item_features = {
diff --git a/conversion_tools/usage/Taobao.md b/conversion_tools/usage/Taobao.md
new file mode 100644
index 0000000..f8649c5
--- /dev/null
+++ b/conversion_tools/usage/Taobao.md
@@ -0,0 +1,117 @@
+# Taobao Dataset
+
+## Dataset Information
+
+**For detailed dataset information, please visit:** [Taobao User Behavior Dataset](https://tianchi.aliyun.com/dataset/dataDetail?dataId=649)
+
+## Prerequisites
+
+```bash
+git clone https://github.com/RUCAIBox/RecDatasets
+cd RecDatasets/conversion_tools
+pip install -r requirements.txt
+```
+
+## Data Conversion
+
+### Basic Usage
+
+```bash
+python run.py --dataset taobao \
+    --input_path /path/to/Taobao.csv \
+    --output_path output_data/taobao \
+    --interaction_type pv \
+    --convert_inter
+```
+
+### Parameters
+
+- `--dataset`: `taobao` (required)
+- `--input_path`: Path to the input data file (required)
+- `--output_path`: Directory to store converted files (required)
+- `--interaction_type`: `pv`, `cart`, `fav`, `buy`, or omit to merge all types (optional)
+- `--convert_inter`: Enable conversion (required)
+- `--duplicate_removal`: Enable deduplication (optional)
+
+**Note**: When `--interaction_type` is omitted, all four interaction types (pv, cart, fav, buy) will be merged into a single file with an additional `action_type` column.
+
+### Convert All Interaction Types
+
+#### Method 1: Convert Separately
+```bash
+for type in pv cart fav buy; do
+    python run.py --dataset taobao \
+        --input_path /path/to/Taobao.csv \
+        --output_path output_data/taobao \
+        --interaction_type $type \
+        --convert_inter
+done
+```
+
+#### Method 2: Convert All Types in One File (Recommended)
+```bash
+python run.py --dataset taobao \
+    --input_path /path/to/Taobao.csv \
+    --output_path output_data/taobao \
+    --convert_inter
+```
+
+## Output Format
+
+### Single Interaction Type
+Output file: `output_data/taobao/Rec_Taobao/processed/taobao-{interaction_type}/taobao-{interaction_type}.inter` (the converter nests its output under `Rec_Taobao/processed/`)
+
+```
+user_id:token    item_id:token    timestamp:float
+1                2268318          1511544070
+```
+
+### All Interaction Types (Merged)
+Output file: `output_data/taobao/Rec_Taobao/processed/taobao-merged/taobao-merged.inter`
+
+```
+user_id:token    item_id:token    timestamp:float    action_type:token
+1                2268318          1511544070         pv
+1                2268318          1511544071         cart
+1                2268318          1511544072         fav
+1                2268318          1511544073         buy
+```
+
+### With `--duplicate_removal`
+
+#### Single Type:
+```
+user_id:token    item_id:token    timestamp:float    interactions:float
+1                2268318          1511544070         3
+```
+
+#### Merged Types:
+```
+user_id:token    item_id:token    timestamp:float    action_type:token    interactions:float
+1                2268318          1511544070         pv                   1
+1                2268318          1511544071         cart                 2
+```
+
+## Dataset Statistics
+
+- **Total interactions**: ~100 million
+- **Behavior types**: pv (page view), cart (add to cart), fav (favorite), buy (purchase)
+- **Time period**: 2017-11-25 to 2017-12-03
+- **Users**: ~1 million
+- **Items**: ~4 million
+
+## Input Format
+
+The input CSV file should have the following format:
+```
+user_id,item_id,category_id,behavior_type,timestamp
+1,2268318,2520377,pv,1511544070
+1,2333346,2520771,pv,1511561733
+```
+
+Where:
+- `user_id`: User identifier
+- `item_id`: Item identifier
+- `category_id`: Category identifier
+- `behavior_type`: One of `pv`, `cart`, `fav`, `buy`
+- `timestamp`: Unix timestamp
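+
+## Reading the Converted File
+
+The converted `.inter` files are plain tab-separated text with a single header row, so they can be inspected with standard tools. The following is a minimal sketch (assuming pandas is installed and the merged output path from the examples above; it is an illustration, not part of the converter):
+
+```python
+import pandas as pd
+
+# Load the merged atomic file (tab-separated, one header row).
+df = pd.read_csv('output_data/taobao/Rec_Taobao/processed/taobao-merged/taobao-merged.inter', sep='\t')
+
+# Header fields follow RecBole's `name:type` convention; strip the type suffix.
+df.columns = [col.split(':')[0] for col in df.columns]
+
+# Keep only page views, mirroring `--interaction_type pv`.
+pv = df[df['action_type'] == 'pv']
+print(pv.head())
+```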
diff --git a/conversion_tools/usage/Tmall2014.md b/conversion_tools/usage/Tmall2014.md
new file mode 100644
index 0000000..ee6aae7
--- /dev/null
+++ b/conversion_tools/usage/Tmall2014.md
@@ -0,0 +1,94 @@
+# Tmall2014
+
+## Dataset Information
+
+**For detailed dataset information, please visit:** [Tianchi Tmall Recommendation Dataset](https://tianchi.aliyun.com/dataset/140281)
+
+## Prerequisites
+
+```bash
+git clone https://github.com/RUCAIBox/RecDatasets
+cd RecDatasets/conversion_tools
+pip install -r requirements.txt
+```
+
+## Data Conversion
+
+### Basic Usage
+
+```bash
+python run.py --dataset tmall_2014 \
+    --input_path /path/to/tianchi_2014002_rec_tmall_log_partc.txt \
+    --output_path output_data/tmall2014 \
+    --interaction_type click \
+    --convert_inter
+```
+
+### Parameters
+
+- `--dataset`: `tmall_2014` (required)
+- `--input_path`: Path to the input data file (required)
+- `--output_path`: Directory to store converted files (required)
+- `--interaction_type`: `click`, `cart`, `collect`, or `alipay` (optional, omit to merge all types)
+- `--convert_inter`: Enable conversion (required)
+- `--duplicate_removal`: Enable deduplication (optional)
+
+**Note**: When `--interaction_type` is omitted, all four interaction types (click, cart, collect, alipay) will be merged into a single file with an additional `action_type` column.
+
+### Convert All Interaction Types
+
+#### Method 1: Convert Separately
+```bash
+for type in click cart collect alipay; do
+    python run.py --dataset tmall_2014 \
+        --input_path /path/to/data.txt \
+        --output_path output_data/tmall2014 \
+        --interaction_type $type \
+        --convert_inter
+done
+```
+
+#### Method 2: Convert All Types in One File (Recommended)
+```bash
+python run.py --dataset tmall_2014 \
+    --input_path /path/to/data.txt \
+    --output_path output_data/tmall2014 \
+    --convert_inter
+```
+
+## Output Format
+
+### Single Interaction Type
+Output file: `output_data/tmall2014/tmall2014-{interaction_type}/tmall2014-{interaction_type}.inter`
+
+```
+user_id:token    item_id:token    timestamp:float
+u6276408         3903192          1377496871
+```
+
+### All Interaction Types (Merged)
+Output file: `output_data/tmall2014/tmall2014-merged/tmall2014-merged.inter`
+
+```
+user_id:token    item_id:token    timestamp:float    action_type:token
+u6276408         3903192          1377496871         click
+u6276408         3903192          1377496872         cart
+u6276408         3903192          1377496873         collect
+u6276408         3903192          1377496874         alipay
+```
+
+### With `--duplicate_removal`
+
+#### Single Type:
+```
+user_id:token    item_id:token    timestamp:float    interactions:float
+u6276408         3903192          1377496871         3
+```
+
+#### Merged Types:
+```
+user_id:token    item_id:token    timestamp:float    action_type:token    interactions:float
+u6276408         3903192          1377496871         click                1
+u6276408         3903192          1377496872         cart                 2
+```
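+
+## Timestamp Note
+
+The converter turns the raw `YYYY-MM-DD HH:MM:SS` strings into Unix epoch seconds via `datetime.timestamp()`. A minimal sketch of the same conversion (the resulting value depends on the timezone of the machine running the conversion):
+
+```python
+from datetime import datetime
+
+# A raw event time as it appears in the \x01-separated source file.
+raw_time = '2013-08-26 10:41:11'
+
+# Same conversion applied by the TMALL2014Dataset streaming loader.
+ts = int(datetime.strptime(raw_time, '%Y-%m-%d %H:%M:%S').timestamp())
+print(ts)  # epoch seconds; the exact value depends on the local timezone
+```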
diff --git a/dataset_info/Tmall2014/README.md b/dataset_info/Tmall2014/README.md
new file mode 100644
index 0000000..7f327ea
--- /dev/null
+++ b/dataset_info/Tmall2014/README.md
@@ -0,0 +1,20 @@
+# Tmall2014
+
+## Dataset Overview
+
+Tmall2014 is a large-scale e-commerce dataset collected from Tmall.com (formerly Taobao Mall), containing user behavior logs from 2013. The dataset includes multiple types of user-item interactions: clicks, add-to-cart, favorites (collect), and purchases (alipay).
+
+**For detailed dataset information, please visit:** [Tianchi Tmall Recommendation Dataset](https://tianchi.aliyun.com/dataset/140281)
+
+## Data Format
+
+The original data file uses `\x01` (an ASCII control character) as the field separator. After conversion, the data is in RecBole atomic file format (tab-separated):
+
+```
+user_id:token    item_id:token    timestamp:float
+u6276408         3903192          1377496871
+```
+
+## Usage
+
+Please refer to the [conversion tool documentation](../../conversion_tools/usage/Tmall2014.md) for instructions on how to convert this dataset to RecBole format.
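+
+For reference, the `\x01`-separated raw format described under Data Format can be split in plain Python like this (a minimal sketch; the sample line mirrors the example used in the conversion tool):
+
+```python
+# Raw Tmall2014 log lines use the \x01 control character as the field separator.
+line = '3903192\x01u6276408\x01click\x012013-08-26 10:41:11'
+
+item_id, user_id, action, event_time = line.split('\x01')
+print(user_id, item_id, action, event_time)
+# -> u6276408 3903192 click 2013-08-26 10:41:11
+```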