GitHub - jaimeps/distributed-computing-arxiv: Scraping and analysis of Arxiv.org database, comparing performance of PostgreSQL, Hive and Spark

Analysis of ArXiv.org database of scientific papers

MSAN 694 - Distributed Computing
Team: D. Wen, A. Romriell, J. Pastor, J. Pollard

Data Description:

Source: ArXiv Electronic Archive of Scientific Papers
We analyzed the entire database of arXiv.org (1.6GB):

1.26 million papers
600,000 authors
86,262,827 words

Goal:

Exploratory data analysis and community detection, implemented with three different technologies (PostgreSQL, Hive and Spark) to compare their performance.
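The README does not say how the author network was built for community detection. As a minimal sketch, assuming a co-authorship graph in which two authors are linked whenever they share a paper, the edge weights could be counted as below. The function name `coauthor_edges` and the toy author lists are hypothetical and are not taken from the repository.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(papers):
    """Count how often each pair of authors appears on the same paper."""
    edges = Counter()
    for authors in papers:
        # Deduplicate and sort so (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical toy input: one author list per paper.
papers = [
    ["Author A", "Author B"],
    ["Author B", "Author A", "Author C"],
    ["Author D"],  # single-author papers contribute no edges
]
edges = coauthor_edges(papers)
```

A weighted edge list of this kind is a typical input for a community detection algorithm. The same pair counting can also be written as a self-join in SQL, which is the sort of query whose cost grows quickly with the data size.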

Experimental environment:

Local:

MacBook Pro 2.7 GHz Intel Core i5 16 GB 1867 MHz DDR3

Distributed:

4-node cluster of r3.xlarge (160GB) emr-4.6.0
Hadoop distribution: Amazon 2.7.2
Applications: Hive 1.0.0, Pig 0.14.0, Spark 1.6.1

Summary results:

The following table summarizes the running time (in seconds) of each task on each platform (PostgreSQL, Hive and SparkSQL).

As the queries grew in complexity (queries 4 and 5), Hive and SparkSQL performed drastically better than PostgreSQL. In particular, we observed a reduction in running times of between 80% and 97% using SparkSQL.

We concluded that, in the context of this problem, SparkSQL was the optimal tool given its speed, ease of use and flexibility.
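For reference, a percentage reduction in running time like the 80% to 97% figures above can be computed as in this small sketch. The timings used here are made-up placeholders, not the measured results, and the baseline is assumed to be the slower platform's running time.

```python
def percent_reduction(baseline_seconds, new_seconds):
    """Return the reduction in running time as a percentage of the baseline."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return 100.0 * (baseline_seconds - new_seconds) / baseline_seconds


# Placeholder timings in seconds (hypothetical, for illustration only).
example_small = percent_reduction(100.0, 20.0)
example_large = percent_reduction(200.0, 6.0)
```

With these placeholder values, a query that drops from 100 s to 20 s is an 80% reduction, and one that drops from 200 s to 6 s is a 97% reduction.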

Repository contents (7 commits):

0_scraping
1_parsing
2_postgresql
3_hive
4_sparksql
5_network_analysis
images
README.md
