Scraping and analysis of Arxiv.org database, comparing performance of PostgreSQL, Hive and Spark

jaimeps/distributed-computing-arxiv

Analysis of ArXiv.org database of scientific papers

MSAN 694 - Distributed Computing
Team: D. Wen, A. Romriell, J. Pastor, J. Pollard

Data Description:

Source: ArXiv Electronic Archive of Scientific Papers
We analyzed the entire database of arXiv.org (1.6GB):

  • 1.26 million papers
  • 600,000 authors
  • 86,262,827 words

Goal:

Exploratory Data Analysis and Community Detection, implemented with three different technologies (PostgreSQL, Hive and Spark) for performance comparison.

Experimental environment:

Local:

MacBook Pro, 2.7 GHz Intel Core i5, 16 GB 1867 MHz DDR3

Distributed:

4-node cluster of r3.xlarge instances (160GB), release emr-4.6.0
Hadoop distribution: Amazon 2.7.2
Applications: Hive 1.0.0, Pig 0.14.0, Spark 1.6.1

Summary results:

The following table summarizes the running time (in seconds) of each task on each platform (PostgreSQL, Hive and SparkSQL).

As the queries grew in complexity (queries 4 and 5), Hive and SparkSQL performed drastically better than PostgreSQL. In particular, we observed a reduction in running times of between 80% and 97% using SparkSQL.
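As a minimal sketch of how a running-time reduction like the one quoted above can be computed, the helper below expresses a faster time as a percentage drop from a baseline time. The function name `percent_reduction` and the timings are hypothetical and serve only to illustrate the arithmetic. They are not the measurements from this project, and the choice of baseline platform is an assumption.

```python
def percent_reduction(baseline_seconds: float, new_seconds: float) -> float:
    """Return the percentage by which new_seconds is lower than baseline_seconds."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return 100.0 * (baseline_seconds - new_seconds) / baseline_seconds


# Hypothetical timings in seconds, chosen only to illustrate the arithmetic;
# they are NOT the measurements from this project.
hypothetical_baseline = 100.0   # e.g. the slower platform
hypothetical_faster = 20.0      # e.g. the faster platform
print(percent_reduction(hypothetical_baseline, hypothetical_faster))  # 80.0
```

A reduction of 80% means the faster platform took one fifth of the baseline time, and a reduction of 97% means it took about one thirty-third.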

We concluded that, in the context of this problem, SparkSQL was the optimal tool given its speed, ease of use and flexibility.