Alexander O. Smith edited this page Oct 26, 2018 · 16 revisions

Installing STACK

This documentation assumes the following:

  • You know how to use ssh.
  • Your server has MongoDB already installed.
  • You understand how to edit files using vim (“vi”).
  • You have rights and know how to install Python libraries.

In addition, this doc is geared towards working on a Linux system (for testing we use Ubuntu). We've tried to link to external documentation where installation diverges if you are using other systems.

Step 1) Download STACK

First, clone this repo to your local machine:

sudo git clone https://github.com/bitslabsyr/stack.git

Next, make sure to install the required Python libraries outlined in the requirements.txt file. Check to see if pip is installed:

sudo pip --version

If you don't get a version back, install pip:

sudo apt-get update
sudo apt-get install python-pip

We use pip to install and manage dependencies. Run this from the directory that contains requirements.txt (the root of the cloned repo):

sudo pip install -r requirements.txt

Note - We use Python 2.7.6 for STACK.
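Before installing dependencies, it can help to confirm which interpreter the python command resolves to. The check below is a minimal sketch; the helper name is_python_27 is hypothetical and not part of STACK.

```python
import sys


def is_python_27(version_info=None):
    """Return True when the (major, minor) version is 2.7."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == (2, 7)


if not is_python_27():
    # STACK is documented against Python 2.7.6, so warn on anything else.
    print("Warning: expected Python 2.7, found %d.%d"
          % (sys.version_info[0], sys.version_info[1]))
```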

Step 2) Configuration & Setup

STACK is built to work with MongoDB. The app stores configuration information in Mongo. Before getting started with STACK, you'll need to do the following:

  • Modify the main config file
  • Set up a project account
  • Create & start a collector

These steps are detailed below.

Modify the main config file

STACK is closely integrated with Mongo, so the first step is to make sure STACK can connect to Mongo with read and write permissions. If you don't have user authentication enabled for Mongo, you should strongly consider enabling it. Note that the version of pymongo used by STACK requires Mongo to use MONGODB-CR as the authentication schema. As of Mongo 3.0, the default authentication schema is SCRAM, which is incompatible with STACK. If you use authentication on Mongo, you must ensure that the authentication schema is correct:

  1. Comment out the security and authorization lines in the config file, /etc/mongod.conf
  2. Restart Mongo with sudo service mongod restart
  3. Enter the Mongo shell using "mongo"
  4. Switch to the admin db with the "use admin" command
  5. Enter the following command: db.system.version.save({ "_id" : "authSchema", "currentVersion" : 3 })
  6. Create a Mongo user with appropriate permissions: db.createUser({user: "YOUR_USERNAME", pwd: "YOUR_PASSWORD", roles: [{role: "root", db: "admin"}]})
  7. If the user already exists, drop it with "use admin" followed by "db.dropUser('YOUR_USERNAME')", then repeat step 6.
  8. Re-enable authentication by uncommenting the lines from step 1.
  9. Restart Mongo again with sudo service mongod restart
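Once the user exists, you may want to check the credentials independently of STACK. The sketch below builds a standard MongoDB connection URI that authenticates against the admin db and requests the MONGODB-CR mechanism. The helper name build_mongo_uri, the localhost host, and port 27017 are assumptions for illustration; this guide does not say how STACK itself builds its connection.

```python
try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus  # Python 2.7


def build_mongo_uri(username, password, host="localhost", port=27017):
    """Build a URI that authenticates against the admin db with MONGODB-CR.

    Username and password are percent-encoded so characters such as
    '@' or '/' do not break the URI.
    """
    return "mongodb://%s:%s@%s:%d/admin?authMechanism=MONGODB-CR" % (
        quote_plus(username), quote_plus(password), host, port)


print(build_mongo_uri("YOUR_USERNAME", "YOUR_PASSWORD"))
```

The resulting string can be handed to a Mongo client that accepts connection URIs, such as pymongo's MongoClient, to confirm that the login works.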

After cloning the STACK repo to your local machine, move into the main directory:

cd stack

Then, open the config file using vim (or another command line text editor):

vi config.py

The first section of this file contains information that STACK uses to connect to Mongo. If you have user authentication enabled, "AUTH" should equal "True" (this is the default setting). Replace the fields for "USERNAME" and "PASSWORD" with the actual username and corresponding password used to authenticate in Mongo. If you do not have user authentication enabled and you decide not to enable it, change "AUTH = True" to "AUTH = False".
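As a rough picture of what you are editing, the Mongo section might look something like the excerpt below. This is a hypothetical sketch: only the names AUTH, USERNAME, and PASSWORD come from this guide, and the real config.py may contain other settings or a different layout.

```python
# Hypothetical excerpt of the Mongo section of config.py.
# Only AUTH, USERNAME and PASSWORD are named in this guide.
AUTH = True                  # set to False if Mongo authentication is disabled
USERNAME = "YOUR_USERNAME"   # the Mongo user STACK should log in as
PASSWORD = "YOUR_PASSWORD"   # that user's password
```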

Project Account Setup

STACK uses "project accounts" to maintain ownership over collection processes. A project account can own multiple collection processes that run concurrently.

To create a project, run the setup script:

sudo python __main__.py db create_project

The setup script initializes the Mongo database with important configuration information and creates your user account. The script will prompt you for the following information:

  • Project Name: A unique account name for your project. (We recommend NOT using underscores in your project name as it could cause trouble with STACKzip.) STACK calls all login accounts "projects" and allows for multiple projects at once.
  • Password: A password for your project account.
  • Email: An email address used for project notifications.
  • Description: A short description for your project account.

If the script returns a successful execution notice, you will be able to start creating and running collection processes for that account. You can rerun the setup script to create additional accounts.

Creating a Collector

Each project account can instantiate multiple collectors that will scrape data. A collector is defined as a singular instance that collects data for a specific set of user-provided terms. A project can have multiple collectors running for a given network.

To create a collector, first run the following command from the main STACK directory:

sudo python __main__.py db set_collector_detail

You will then be prompted to provide the following configuration information for the collector:

  • Project Account Name (required): The name of your project account.
  • Collector Name (required): Non-unique name to identify your collector instance.
  • Network (required): The network (e.g., Facebook or Twitter) the collector will collect from. The network name must be in lower case.
  • Language(s) (required): A list of BCP-47 language codes. If this is used, the collector will only grab tweets in these languages. Learn more here about Twitter language parameters. If no language parameter is needed, enter "none".
  • Location(s) (required): A list of location coordinates. If used, we will collect all geocoded tweets within the location bounding box. Bounding boxes must consist of four lat/long pairs. Learn more here about location formatting for the Twitter API. If no location parameter is needed, enter "none".
  • Terms (optional): A list of terms for the collector to stream. Multiple terms can be entered on a single line, separated by commas, with no spaces between the comma and the term (e.g., "list,of,terms" rather than "list, of, terms").
  • API (required): Three options: track, follow, or none. The option must be entered lower case. Each collector can stream from one part of Twitter's Streaming API:
    • Track: Collects tweets that contain any of a given list of terms in the tweet text (including hashtags, mentions, and URLs).
    • Follow: Collects all tweets, retweets, and replies for a given user handle. Each term must be a valid Twitter screen name.
    • None: Only choose this option if you have not inputted a terms list and are collecting for a given set of language(s) and/or location(s). If you do not track a terms list, make sure you are tracking at least one language or location.
  • OAuth Information: Four keys used to authenticate with the Twitter API. To get consumer & access tokens, first register your app on https://dev.twitter.com/apps/new. Navigate to Keys and Access Tokens and click "Create my access token." NOTE - Each collector needs to have a unique set of access keys, or else the Streaming API will limit your connection. The four keys include:
    • Consumer Key
    • Consumer Secret
    • Access Token
    • Access Token Secret
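Because the terms prompt expects commas without surrounding spaces, it can be convenient to normalize a term list before pasting it in. The helper below is a minimal sketch; format_terms is a hypothetical name and not part of STACK.

```python
def format_terms(terms):
    """Join terms with commas and no surrounding spaces, dropping empties."""
    cleaned = [term.strip() for term in terms]
    return ",".join(term for term in cleaned if term)


print(format_terms(["list", " of", "terms "]))  # prints list,of,terms
```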

A note on location tracking: Location tracking with Twitter is an OR filter. We will collect all tweets that match other filters (such as a terms list or a language identifier) OR tweets in the given location. Please plan accordingly.
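The OR behavior can be illustrated with a small simulation. This is a sketch under stated assumptions, not Twitter's or STACK's actual matching logic: the function names are hypothetical, term matching is simplified to a case-insensitive substring test, and the bounding box is assumed to be laid out as (west, south, east, north).

```python
def in_bounding_box(lon, lat, box):
    """box is (west, south, east, north); an assumed layout for illustration."""
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


def would_collect(text, coords, terms, box):
    """Mimic an OR filter: a term match or a location match is enough."""
    term_hit = any(term.lower() in text.lower() for term in terms)
    location_hit = (coords is not None
                    and in_bounding_box(coords[0], coords[1], box))
    return term_hit or location_hit


box = (-74.3, 40.5, -73.7, 40.9)  # rough, illustrative coordinates
print(would_collect("election news", None, ["election"], box))          # True
print(would_collect("lunch time", (-74.0, 40.7), ["election"], box))    # True
print(would_collect("lunch time", (2.35, 48.85), ["election"], box))    # False
```

The second call is the case to plan for: a tweet with no matching term is still collected because it falls inside the box.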

Step 3) Starting STACK

There are three processes to start to have STACK running in full: collector, processor, and inserter. As noted above, multiple instances of each process can run at the same time. Conversely, STACK can operate without an instance of every process running.

  • Collectors: A process that scrapes data for a given set of filters. Multiple collectors can be created and run for each project account.
  • Processors: A process that parses the raw tweet files written by a collector. Only one processor can be run for a given project account.
  • Inserters: A process that takes processed tweets and inserts them into MongoDB. Only one inserter can be run for a given project account.

Starting a Collector

To start a collector, you'll need to pass both a project_id and a collector_id to STACK via the console. First, get your project account's ID:

$ sudo python __main__.py db auth [project_name] [password]
{"status": 1, "message": "Success", "project_id": "your_id_value"}

Then, using the project_id returned above, find a list of your collectors and their ID values:

$ sudo python __main__.py db get_collector_ids [project_id]
{"status": 1, "collectors": [{"collector_name": [your_collector_name], "collector_id": [your_collector_id]}]}
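If you are scripting these steps, the JSON printed by get_collector_ids can be parsed rather than copied by hand. The sketch below assumes the response has the shape shown above; the helper name collector_ids and the sample values are hypothetical. It returns a list of pairs instead of a dictionary because collector names are not unique.

```python
import json


def collector_ids(raw_output):
    """Return (collector_name, collector_id) pairs from the JSON response."""
    response = json.loads(raw_output)
    if response.get("status") != 1:
        raise ValueError("request failed: %r" % (response,))
    return [(c["collector_name"], c["collector_id"])
            for c in response.get("collectors", [])]


# Sample response with made-up values in place of the placeholders.
sample = ('{"status": 1, "collectors": '
          '[{"collector_name": "demo", "collector_id": "abc123"}]}')
for name, cid in collector_ids(sample):
    print("%s -> %s" % (name, cid))
```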

Finally, using the project_id and collector_id values returned above, start the given collector for the project account of your choice:

sudo python __main__.py controller collect start [project_id] [collector_id]

Your collector is now running!

Starting a Processor

To start a processor, the syntax is very similar to the collector start command above. Here, though, you only need to pass a project account ID, followed by the network name (twitter in this example):

sudo python __main__.py controller process start [project_id] twitter

Your processor is now running!

Starting an Inserter

To start an inserter, follow the syntax for starting a processor, but call the "insert" command instead:

sudo python __main__.py controller insert start [project_id] twitter

Your inserter is now running!
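The three start commands differ only in the process name and the final argument, so they can be generated from one helper when scripting a deployment. This is a minimal sketch; start_command is a hypothetical name, and the ID strings are placeholders for the values returned by the auth and get_collector_ids commands.

```python
def start_command(process, project_id, collector_id=None, network="twitter"):
    """Return the argument list for starting a collector, processor, or inserter."""
    base = ["sudo", "python", "__main__.py", "controller",
            process, "start", project_id]
    if process == "collect":
        if collector_id is None:
            raise ValueError("a collector_id is required to start a collector")
        return base + [collector_id]
    if process in ("process", "insert"):
        return base + [network]
    raise ValueError("unknown process: %r" % (process,))


for cmd in (start_command("collect", "PROJECT_ID", "COLLECTOR_ID"),
            start_command("process", "PROJECT_ID"),
            start_command("insert", "PROJECT_ID")):
    print(" ".join(cmd))
```

Each returned list can be passed to subprocess.call from the main STACK directory, which avoids shell quoting issues.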
