From 3782c99f54ff261b0bce6269610bdac165816de7 Mon Sep 17 00:00:00 2001
From: "Shuhao Zhang (Tony)"
Date: Wed, 14 Jun 2023 20:41:14 +0800
Subject: [PATCH 1/2] Update README.md

---
 README.md | 50 --------------------------------------------------
 1 file changed, 50 deletions(-)

diff --git a/README.md b/README.md
index 9f15756..8b13789 100644
--- a/README.md
+++ b/README.md
@@ -1,51 +1 @@
-# PLStream: A Framework for Fast Polarity Labelling of Massive Data Streams
-## Motivation
-* When dataset freshness is critical, the annotating of high speed unlabelled data streams becomes critical but remains an open problem.
-* We propose PLStream, a novel Apache Flink-based framework for fast polarity labelling of massive data streams, like Twitter tweets or online product reviews.
-
-## Environment Requirements
-relative python packages are summerized in `requirements.txt`
-1. Python 3.7
-2. Java 8
-3. Redis server v7.0.0
-
-## DataSource
-* Dataset quick access in https://course.fast.ai/datasets#nlp
-### Tweets
-* 1.6 million labeled Tweets:
-* Source:[Sentiment140](http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip)
-### Yelp Reviews
-* 280,000 training and 19,000 test samples in each polarity
-* Source:[Yelp Review Polarity](https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz)
-### Amazon Reviews
-* 1,800,000 training and 200,000 testing samples in each polarity
-* Source:[Amazon product review polarity](https://s3.amazonaws.com/fast-ai-nlp/amazon_review_polarity_csv.tgz)
-
-## Quick Start
-quick try PLStream on yelp review dataset
-### Data Prepare
-```
-cd PLStream
-wget https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz
-tar zxvf yelp_review_polarity_csv.tgz
-mv yelp_review_polarity_csv/train.csv train.csv
-```
-### 1. Install required environment of PLStream
-```
-pip install -r requirements.txt
-```
-### 2. Start Redis-Server in a terminal
-```
-redis-server
-```
-### 3. Run PLStream
-```
-python PLStream.py
-```
-* The outputs' form is "original text" + "label" + "@@@@":
-* With help of a split("@@@@") function we can further reorganize the labelled dataset.
-
-### Optional
-to see the labelling accuracy, simply run:
-`python PLStream_acc.py`

From 82a809b0b41a1fb0ce9aa7b48a02f93ad1ddf6b7 Mon Sep 17 00:00:00 2001
From: "Shuhao Zhang (Tony)"
Date: Wed, 14 Jun 2023 20:44:38 +0800
Subject: [PATCH 2/2] Update README.md

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index 8b13789..87b42f4 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,3 @@
+We are a young and dynamic research group based at the Singapore University of Technology and Design in the Information Systems Technology and Design Pillar. From a high-level point of view, our research goal is to optimize and utilize distributed and parallel stream processing technology to better support existing and emerging big data applications. This is important to improve performance and reduce resource consumption, especially for the network connected world by 5G, IoT, etc. Our work spanning around database management, networking systems, machine learning, and data mining.
+SentiStream is one of the four project teams of IntelliStream.