52 changes: 2 additions & 50 deletions README.md
@@ -1,51 +1,3 @@
# PLStream: A Framework for Fast Polarity Labelling of Massive Data Streams
We are a young and dynamic research group based at the Singapore University of Technology and Design, in the Information Systems Technology and Design Pillar. At a high level, our research goal is to optimize and utilize distributed and parallel stream processing technology to better support existing and emerging big data applications. This is important for improving performance and reducing resource consumption, especially in a world connected by 5G, IoT, etc. Our work spans database management, networking systems, machine learning, and data mining.

## Motivation
* When dataset freshness matters, annotating high-speed unlabelled data streams becomes critical, but it remains an open problem.
* We propose PLStream, a novel Apache Flink-based framework for fast polarity labelling of massive data streams, like Twitter tweets or online product reviews.

## Environment Requirements
The required Python packages are listed in `requirements.txt`.
1. Python 3.7
2. Java 8
3. Redis server v7.0.0

## DataSource
* Quick access to the datasets: https://course.fast.ai/datasets#nlp
### Tweets
* 1.6 million labeled Tweets
* Source: [Sentiment140](http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip)
### Yelp Reviews
* 280,000 training and 19,000 test samples in each polarity
* Source: [Yelp Review Polarity](https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz)
### Amazon Reviews
* 1,800,000 training and 200,000 testing samples in each polarity
* Source: [Amazon product review polarity](https://s3.amazonaws.com/fast-ai-nlp/amazon_review_polarity_csv.tgz)

## Quick Start
Try PLStream quickly on the Yelp review dataset.
### Data Preparation
```
cd PLStream
wget https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz
tar zxvf yelp_review_polarity_csv.tgz
mv yelp_review_polarity_csv/train.csv train.csv
```
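Before starting the pipeline it can help to look at the first few rows of the extracted file. The following is a minimal sketch, assuming `train.csv` has no header row and two columns (a class index, `1` for negative and `2` for positive, followed by the review text). `peek_reviews` is a hypothetical helper for illustration and is not part of PLStream.

```python
import csv
from itertools import islice


def peek_reviews(lines, n=3):
    """Return the first n (label, text) rows from an iterable of CSV lines.

    Assumption: column 0 is an integer class index and column 1 is the
    review text, with no header row.
    """
    reader = csv.reader(lines)
    return [(int(row[0]), row[1]) for row in islice(reader, n)]


# Usage (assumes train.csv is in the current directory):
#
#   with open("train.csv", newline="", encoding="utf-8") as f:
#       for label, text in peek_reviews(f):
#           print(label, text[:80])
```

Passing any iterable of lines rather than a file path keeps the helper easy to try on a few hand-written rows.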
### 1. Install the required environment for PLStream
```
pip install -r requirements.txt
```
### 2. Start Redis-Server in a terminal
```
redis-server
```
### 3. Run PLStream
```
python PLStream.py
```
* Each output record has the form "original text" + "label" + "@@@@".
* Splitting the output with `split("@@@@")` lets you reorganize the labelled dataset further.
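The splitting step can be sketched as follows. This is only an illustration under stated assumptions: records are concatenated with `@@@@` as the terminator, and the label is the single final character of each record. The README does not confirm how the label is attached to the text, so adjust the slicing to the actual output. `parse_labelled_output` is a hypothetical name.

```python
def parse_labelled_output(raw, sep="@@@@"):
    """Split raw PLStream-style output into (text, label) pairs.

    Assumption: each record is text immediately followed by a
    one-character label, and records are terminated by `sep`.
    """
    records = []
    for chunk in raw.split(sep):
        chunk = chunk.strip()
        if not chunk:
            # Skip the empty piece left after the final separator.
            continue
        text, label = chunk[:-1].rstrip(), chunk[-1]
        records.append((text, label))
    return records
```

The resulting list of pairs can then be written to CSV or loaded into a data frame for further processing.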

### Optional
To see the labelling accuracy, simply run:
`python PLStream_acc.py`
SentiStream is one of the four project teams of IntelliStream.