This repository contains the solution for plotting exercise 1 of the "Exploratory Data Analysis" course. Each plot has its own R script, named plot1.R through plot4.R, and the diagrams generated by the scripts are stored as PNG files in the my_plots subfolder.
-
Dataset: Electric power consumption [20 MB]
-
Description: Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.
The following descriptions of the 9 variables in the dataset are taken from the UCI web site:
- Date: Date in format dd/mm/yyyy
- Time: Time in format hh:mm:ss
- Global_active_power: household global minute-averaged active power (in kilowatt)
- Global_reactive_power: household global minute-averaged reactive power (in kilowatt)
- Voltage: minute-averaged voltage (in volt)
- Global_intensity: household global minute-averaged current intensity (in ampere)
- Sub_metering_1: energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).
- Sub_metering_2: energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing-machine, a tumble-drier, a refrigerator and a light.
- Sub_metering_3: energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner.
The dataset has 2,075,259 rows and 9 columns. The first two columns are strings of 10 and 8 characters, and the remaining 7 columns are floating-point numbers. Read into R, the dataset would consume approximately the following amount of RAM:

2,075,259 * (10 bytes/string + 8 bytes/string + 7 * 8 bytes/numeric) = 153,569,166 bytes = ~146.5 MB

Note that this estimate is quite close to the size of the unpacked dataset TXT file (129 MB). Even though it would be nice not to read all of this data into memory, a table of this size should be fine for any modern computer.
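The estimate above can be reproduced with a few lines of R. This is only a sketch of the same back-of-the-envelope arithmetic; the variable names are illustrative, and R's real per-string overhead is ignored, as it is in the estimate itself.

```r
# Rough RAM estimate using the per-row byte counts from the text above.
n_rows <- 2075259
bytes_per_row <- 10 + 8 + 7 * 8        # Date string + Time string + 7 numeric columns

total_bytes <- n_rows * bytes_per_row  # 153,569,166 bytes
total_mb <- total_bytes / 1024^2       # about 146.5 when 1 MB is taken as 1024^2 bytes

cat(format(total_bytes, big.mark = ","), "bytes, about", round(total_mb, 1), "MB\n")
```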
Since the exercise uses only the data from 2007-02-01 and 2007-02-02, rows for all other dates are excluded from the filteredData table.
While reading the dataset into R, all numeric columns are converted to the "numeric" type, while the first two columns (date and time) remain "character" strings. In addition, all empty rows are ignored and all missing values, originally denoted by question marks, are converted into R's own "NA".
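The reading and filtering steps described above might look roughly like the sketch below. It runs on a tiny inline sample rather than the real file. The `;` separator, the header row, the sample values and the names `sample_text`, `rawData` and `keep` are assumptions made for illustration; only `filteredData` is named in this README.

```r
# Tiny inline sample standing in for the real TXT file
# (assumed: ';'-separated, with a header row; values are made up).
sample_text <- "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
31/01/2007;23:59:00;0.326;0.128;243.150;1.400;0.000;0.000;0.000
01/02/2007;00:00:00;0.326;0.128;243.150;1.400;0.000;0.000;0.000

01/02/2007;00:01:00;?;?;?;?;?;?;?
02/02/2007;23:59:00;3.680;0.224;240.370;15.200;0.000;2.000;18.000
03/02/2007;00:00:00;3.614;0.106;240.990;15.000;0.000;1.000;18.000"

# Date and Time stay character; the other 7 columns become numeric.
# "?" is read as NA and blank lines are skipped.
rawData <- read.table(text = sample_text, header = TRUE, sep = ";",
                      na.strings = "?", blank.lines.skip = TRUE,
                      colClasses = c("character", "character", rep("numeric", 7)))

# Keep only the two dates required by the exercise.
wanted <- as.Date(c("2007-02-01", "2007-02-02"))
keep <- as.Date(rawData$Date, format = "%d/%m/%Y") %in% wanted
filteredData <- rawData[keep, ]
```

Parsing the date only for the comparison leaves the original "Date" column untouched as a character string, which matches the description above.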
To make plotting a bit easier, an additional column "Date_Time" is added to the filteredData table. That column contains the merged date and time values converted into R's internal date-time format. NB: The original "Date" and "Time" columns remain character strings and cannot be used as dates or times until they are converted.
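One possible way to build such a column is sketched below. It assumes `as.POSIXct` as the internal format and a fixed UTC time zone; the actual scripts may use a different class (for example the result of `strptime`) or time zone. The two sample rows are made up.

```r
# Two made-up rows with the character Date and Time columns.
filteredData <- data.frame(Date = c("01/02/2007", "02/02/2007"),
                           Time = c("00:00:00", "23:59:00"),
                           stringsAsFactors = FALSE)

# Merge date and time, then parse them into a date-time value.
# tz = "UTC" is an assumption that keeps the result independent of the local time zone.
filteredData$Date_Time <- as.POSIXct(paste(filteredData$Date, filteredData$Time),
                                     format = "%d/%m/%Y %H:%M:%S", tz = "UTC")
```

The "Date_Time" column can then be used directly as the x axis of a time-series plot, while "Date" and "Time" stay unchanged.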
Each R script first checks whether the source dataset is available locally. If not, the script downloads it and unpacks it for further use. After that, the dataset is read, filtered and converted into a form suitable for the plots.
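The check-then-download step could be sketched as follows. The function name `ensure_dataset`, its parameters, and the placeholder URL are hypothetical and do not come from the scripts in this repository. The demo at the end deliberately uses a file that already exists, so no network access happens.

```r
# Hypothetical helper: make sure the unpacked dataset exists at txt_path.
ensure_dataset <- function(txt_path, zip_path, url) {
  if (!file.exists(txt_path)) {
    if (!file.exists(zip_path)) {
      # Binary mode keeps the ZIP archive intact on Windows.
      download.file(url, destfile = zip_path, mode = "wb")
    }
    unzip(zip_path, exdir = dirname(txt_path))
  }
  invisible(txt_path)
}

# Demo without network access: the "dataset" is already present,
# so neither download.file() nor unzip() is called.
demo_txt <- tempfile(fileext = ".txt")
writeLines("Date;Time", demo_txt)
result <- ensure_dataset(demo_txt,
                         zip_path = tempfile(fileext = ".zip"),
                         url = "https://example.invalid/dataset.zip")  # placeholder URL
```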
Generated plots are saved to PNG files with dimensions of 480 by 480 pixels and a transparent background. The PNG files can be found in the my_plots subfolder of the directory from which the R script was run.
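A minimal sketch of writing such a PNG file is shown below. The plotted data and title are placeholders, and the output goes under a temporary directory here so that the example leaves no files behind. The real scripts write to my_plots next to the working directory.

```r
# Placeholder output location; the real scripts would use "my_plots" in the working directory.
out_dir <- file.path(tempdir(), "my_plots")
dir.create(out_dir, showWarnings = FALSE)
png_path <- file.path(out_dir, "plot1.png")

# Open a 480 x 480 pixel PNG device with a transparent background.
png(filename = png_path, width = 480, height = 480, units = "px", bg = "transparent")
plot(1:10, main = "Placeholder plot")  # stand-in for the actual plotting code
dev.off()                              # close the device so the file is written
```

Calling `dev.off()` is what flushes the plot to disk; without it the PNG file may be empty or incomplete.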