3 Visualization using R Programming
R programming
Exercise with Solution
#1.Look at Orange using either head or as.tibble() (you’ll have to run library(tidyverse) for that second option). What type of data are each of the columns?
Solution:
dataSet=Orange
head(dataSet)
## Tree age circumference
## 1 1 118 30
## 2 1 484 58
## 3 1 664 87
## 4 1 1004 115
## 5 1 1231 120
## 6 1 1372 142
#The Orange dataset has 3 columns: Tree (an ordered factor) plus age and circumference (both numeric)
#2.Find the mean, standard deviation, and standard error of tree circumference
Solution:
mean(dataSet$circumference)
## [1] 115.8571
sd(dataSet$circumference)
## [1] 57.48818
sd(dataSet$circumference)/sqrt(length(dataSet$circumference))
## [1] 9.717276
#3.Make a linear model which describes circumference (the response) as a function of age (the predictor). Save it as an object with <-, then print the object out by typing its name. What do those coefficients mean?
Solution:
linearM <- lm(dataSet$circumference ~ dataSet$age)
linearM
##
## Call:
## lm(formula = dataSet$circumference ~ dataSet$age)
##
## Coefficients:
## (Intercept) dataSet$age
## 17.3997 0.1068
#The intercept is where the fitted line crosses the y-axis (the predicted circumference at age 0); the other coefficient is the slope, i.e. the increase in circumference per day of age
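As a quick check of that interpretation (not part of the original exercise), the fitted coefficients can be used to predict circumference at a chosen age, for example 1000 days:
coef(linearM)[1] + coef(linearM)[2] * 1000   #roughly 17.4 + 0.1068 * 1000, i.e. about 124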
#4.Make another linear model describing age as a function of circumference. Save this as a different object.
Solution:
linearAge <- lm(dataSet$age ~ dataSet$circumference)
linearAge
##
## Call:
## lm(formula = dataSet$age ~ dataSet$circumference)
##
## Coefficients:
## (Intercept) dataSet$circumference
## 16.604 7.816
#5.Call summary() on both of your model objects. What do you notice?
summary(linearM)
##
## Call:
## lm(formula = dataSet$circumference ~ dataSet$age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.310 -14.946 -0.076 19.697 45.111
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.399650 8.622660 2.018 0.0518 .
## dataSet$age 0.106770 0.008277 12.900 1.93e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.74 on 33 degrees of freedom
## Multiple R-squared: 0.8345, Adjusted R-squared: 0.8295
## F-statistic: 166.4 on 1 and 33 DF, p-value: 1.931e-14
summary(linearAge)
##
## Call:
## lm(formula = dataSet$age ~ dataSet$circumference)
##
## Residuals:
## Min 1Q Median 3Q Max
## -317.88 -140.90 -17.20 96.54 471.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.6036 78.1406 0.212 0.833
## dataSet$circumference 7.8160 0.6059 12.900 1.93e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 203.1 on 33 degrees of freedom
## Multiple R-squared: 0.8345, Adjusted R-squared: 0.8295
## F-statistic: 166.4 on 1 and 33 DF, p-value: 1.931e-14
#6.Does this mean that trees growing makes them get older? Does a tree getting older make it grow larger? Or are these just correlations?
Solution:
plot(dataSet$circumference ~ dataSet$age)
abline(linearM)
#This is a correlation: older trees are generally larger, but the regression alone does not establish causation
#7.Does the significant p value prove that trees growing makes them get older? Why not?
Solution:
cor.test(dataSet$age, dataSet$circumference, method = "pearson")
##
## Pearson’s product-moment correlation
##
## data: dataSet$age and dataSet$circumference
## t = 12.9, df = 33, p-value = 1.931e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8342364 0.9557955
## sample estimates:
## cor
## 0.9135189
#The significant p-value only tells us that the association between age and circumference is very unlikely to be due to chance; it says nothing about the direction of causation. Hence a significant p-value does not prove that trees growing makes them get older.
Practice Problems
Forecasting
Problem 1:
Auto sales at Carmen’s Chevrolet are shown below. Develop a 3-week moving average.
Week | Auto Sales
1 | 8
2 | 10
3 | 9
4 | 11
5 | 10
6 | 13
7 | –
Problem 2:
Carmen’s decides to forecast auto sales by weighting the three weeks as follows:
Weights Applied | Period
3 | Last week
2 | Two weeks ago
1 | Three weeks ago
6 | Total
Problem 3:
A firm uses simple exponential smoothing with smoothing constant α to forecast demand. The forecast for the week of January 1 was 500 units, whereas the actual demand turned out to be 450 units. Calculate the demand forecast for the week of January 8.
Problem 4:
Exponential smoothing is used to forecast automobile battery sales. Two values of α are examined, α = 0.8 and α = 0.5. Evaluate the accuracy of each smoothing constant. Which is preferable? (Assume the forecast for January was 22 batteries.) Actual sales are given below:
Month | Actual Battery Sales | Forecast
January | 20 | 22
February | 21 |
March | 15 |
April | 14 |
May | 13 |
June | 16 |
Problem 5:
Use the sales data given below to determine: (a) the least squares trend line, and (b) the predicted value for 2013 sales.
Year | Sales (Units)
2006 | 100
2007 | 110
2008 | 122
2009 | 130
2010 | 139
2011 | 152
2012 | 164
To minimize computations, transform the value of x (time) to simpler numbers. In this case, designate year 2006 as year 1, 2007 as year 2, etc.
Problem 6:
Given the forecast demand and actual demand for 10-foot fishing boats, compute the tracking signal and MAD.
Year | Forecast Demand | Actual Demand
1 | 78 | 71
2 | 75 | 80
3 | 83 | 101
4 | 84 | 84
5 | 88 | 60
6 | 85 | 73
Problem 7:
Over the past year Meredith and Smunt Manufacturing had annual sales of 10,000 portable water pumps. Average quarterly sales for the past 5 years have been: spring 4,000, summer 3,000, fall 2,000, and winter 1,000. Compute the quarterly index.
Problem 8:
Using the data in Problem 7, Meredith and Smunt Manufacturing expects sales of pumps to grow by 10% next year. Compute next year’s sales and the sales for each quarter.
Solutions
Problem 1:
Week | Auto Sales | Three-Week Moving Average
1 | 8 |
2 | 10 |
3 | 9 |
4 | 11 | (8 + 9 + 10) / 3 = 9
5 | 10 | (10 + 9 + 11) / 3 = 10
6 | 13 | (9 + 11 + 10) / 3 = 10
7 | – | (11 + 10 + 13) / 3 = 11 1/3
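The same 3-week moving average can be reproduced with a minimal base-R sketch (the shift by one week aligns each average with the week it forecasts):
sales <- c(8, 10, 9, 11, 10, 13)
ma3 <- stats::filter(sales, rep(1/3, 3), sides = 1)   #average of the three most recent weeks
c(NA, ma3)[1:7]                                       #forecasts for weeks 1-7: NA NA NA 9 10 10 11.33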
Problem 2:
Week | Auto Sales | Three-Week Weighted Moving Average
1 | 8 |
2 | 10 |
3 | 9 |
4 | 11 | [(3*9) + (2*10) + (1*8)] / 6 = 9 1/6
5 | 10 | [(3*11) + (2*9) + (1*10)] / 6 = 10 1/6
6 | 13 | [(3*10) + (2*11) + (1*9)] / 6 = 10 1/6
7 | – | [(3*13) + (2*10) + (1*11)] / 6 = 11 2/3
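The weighted version changes only the filter coefficients; a minimal base-R sketch using the same weights as above:
sales <- c(8, 10, 9, 11, 10, 13)
wma <- stats::filter(sales, c(3, 2, 1)/6, sides = 1)   #weights 3, 2, 1 on last week, two and three weeks ago
c(NA, wma)[1:7]                                        #forecasts for weeks 1-7: NA NA NA 9.17 10.17 10.17 11.67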
Problem 3:
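The problem's smoothing constant is not given above; assuming α = 0.20 purely for illustration, the exponential-smoothing update in R is:
alpha <- 0.20                                  #assumed value; the original problem's smoothing constant is missing
F_jan1 <- 500; A_jan1 <- 450
F_jan8 <- F_jan1 + alpha * (A_jan1 - F_jan1)   #new forecast = old forecast + alpha * (actual - old forecast)
F_jan8                                         #490 under the assumed alpha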
Problem 4:
Month | Actual Battery Sales | Rounded Forecast (α = 0.8) | Absolute Deviation (α = 0.8) | Rounded Forecast (α = 0.5) | Absolute Deviation (α = 0.5)
January | 20 | 22 | 2 | 22 | 2
February | 21 | 20 | 1 | 21 | 0
March | 15 | 21 | 6 | 21 | 6
April | 14 | 16 | 2 | 18 | 4
May | 13 | 14 | 1 | 16 | 3
June | 16 | 13 | 3 | 14.5 | 1.5
Sum of absolute deviations | | | ∑ = 15 | | ∑ = 16.5
MAD | | | 2.5 | | 2.75
Based on this analysis, a smoothing constant of α = 0.8 is preferred to α = 0.5 because it has a smaller MAD.
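For readers following along in R, a minimal base-R sketch that reproduces the exponential-smoothing forecasts and the MAD comparison:
actual <- c(20, 21, 15, 14, 13, 16)
smooth_forecast <- function(actual, alpha, f1 = 22) {
  f <- numeric(length(actual)); f[1] <- f1
  for (t in 2:length(actual)) f[t] <- f[t - 1] + alpha * (actual[t - 1] - f[t - 1])
  f
}
mad_for <- function(alpha) mean(abs(actual - smooth_forecast(actual, alpha)))
mad_for(0.8)   #about 2.47 with unrounded forecasts; 2.5 with the rounded forecasts above
mad_for(0.5)   #2.75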
Problem 5:
Year | Time Period (X) | Sales (Units) (Y) | X² | XY
2006 | 1 | 100 | 1 | 100
2007 | 2 | 110 | 4 | 220
2008 | 3 | 122 | 9 | 366
2009 | 4 | 130 | 16 | 520
2010 | 5 | 139 | 25 | 695
2011 | 6 | 152 | 36 | 912
2012 | 7 | 164 | 49 | 1148
Totals | ∑X = 28 | ∑Y = 917 | ∑X² = 140 | ∑XY = 3961
With x̄ = 28/7 = 4 and ȳ = 917/7 = 131:
b = (∑XY − n·x̄·ȳ) / (∑X² − n·x̄²) = (3961 − 7·4·131) / (140 − 7·4²) = 293/28 ≈ 10.46
a = ȳ − b·x̄ = 131 − 10.46(4) ≈ 89.14
Therefore, the least squares trend equation is Ŷ = 89.14 + 10.46X.
To project demand in 2013, we denote the year 2013 as X = 8:
Sales in 2013 = 89.14 + 10.46(8) ≈ 172.9, or about 173 units.
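A quick check of the same trend line in R (base R only):
x <- 1:7
y <- c(100, 110, 122, 130, 139, 152, 164)
fit <- lm(y ~ x)
coef(fit)                                   #intercept about 89.14, slope about 10.46
predict(fit, newdata = data.frame(x = 8))   #about 173 units for 2013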
Problem 6:
Year | Forecast Demand | Actual Demand | Error | RSFE
1 | 78 | 71 | -7 | -7
2 | 75 | 80 | 5 | -2
3 | 83 | 101 | 18 | 16
4 | 84 | 84 | 0 | 16
5 | 88 | 60 | -28 | -12
6 | 85 | 73 | -12 | -24
Year | Forecast Demand | Actual Demand | Absolute Forecast Error | Cumulative Absolute Error | MAD | Tracking Signal
1 | 78 | 71 | 7 | 7 | 7.0 | -1.0
2 | 75 | 80 | 5 | 12 | 6.0 | -0.3
3 | 83 | 101 | 18 | 30 | 10.0 | +1.6
4 | 84 | 84 | 0 | 30 | 7.5 | +2.1
5 | 88 | 60 | 28 | 58 | 11.6 | -1.0
6 | 85 | 73 | 12 | 70 | 11.7 | -2.1
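The same error, RSFE, MAD and tracking-signal calculations can be reproduced with a short base-R sketch:
forecast_d <- c(78, 75, 83, 84, 88, 85)
actual_d <- c(71, 80, 101, 84, 60, 73)
err <- actual_d - forecast_d
rsfe <- cumsum(err)                            #running sum of forecast errors
mad_run <- cumsum(abs(err)) / seq_along(err)   #cumulative mean absolute deviation
tracking_signal <- rsfe / mad_run
round(cbind(err, rsfe, mad_run, tracking_signal), 2)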
Problem 7:
Sales of 10,000 units annually divided equally over the 4 quarters is 10,000 / 4 = 2,500 units per quarter, and the seasonal index for each quarter is: spring 4,000 / 2,500 = 1.6; summer 3,000 / 2,500 = 1.2; fall 2,000 / 2,500 = 0.8; winter 1,000 / 2,500 = 0.4.
Problem 8:
Next year's sales should be 10,000 × 1.10 = 11,000 pumps. Sales for each quarter should be 1/4 of the annual sales times the quarterly index: spring 2,750 × 1.6 = 4,400; summer 2,750 × 1.2 = 3,300; fall 2,750 × 0.8 = 2,200; winter 2,750 × 0.4 = 1,100.
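A minimal R sketch of the Problem 7 and 8 calculations:
quarter_avg <- c(spring = 4000, summer = 3000, fall = 2000, winter = 1000)
seasonal_index <- quarter_avg / mean(quarter_avg)   #1.6, 1.2, 0.8, 0.4
next_year_total <- 10000 * 1.10                     #10% growth gives 11,000 pumps
(next_year_total / 4) * seasonal_index              #4400, 3300, 2200, 1100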
Exercise 1. DRINKING WATER MONITORING AND FORECASTING USING R
Aim:
To perform analysis on drinking water dataset and use R to do time series forecasting on the data by analyzing, monitoring and plotting the obtained forecast
Problem Statement:
Getting enough water every day is important for one’s health. Drinking water can prevent dehydration, a condition that can cause unclear thinking, result in mood change, cause your body to overheat, and lead to constipation and kidney stones. It is critical to examine the amount of water consumed on a regular basis in order to determine how much water has been consumed and to enhance water consumption if it is too low or vice versa.
Dataset:
The dataset at https://raw.githubusercontent.com/jbrownlee/Datasets/master/yearly-water-usage.csv, which consists of annual water consumption in Baltimore from 1885 to 1963 (in liters per capita per day), is used to analyze and monitor drinking water. Time series forecasting using SMA, Holt-Winters filtering, the Mann-Kendall test, and data visualization are performed on it.
Procedure:
- Install necessary libraries like Kendall, wql, etc.
- Import the dataset downloaded from https://raw.githubusercontent.com/jbrownlee/Datasets/master/yearly-water-usage.csv
- Plot the data as a time series
- Plot the logarithmic time series
- Plot the SMA (Simple Moving Average) and view the time series output
- Use Holt-Winters filtering and view the time series output
- Forecast based on Holt-Winters
- Calculate the Mann-Kendall test of trend on the time series and visualize the output
- Perform decomposition of the additive time series
- Plot the decomposition of the additive time series
- Convert the time series to a dataframe using ts2df.
CODE:
#R version 4.1.2 (2021-11-01)
#RStudio version 1.2.1335
#Program Execution
#1. Importing dataset and plotting values as a timeseries
df1 <- read.csv("C:\\Users\\Lenovo\\waterdata.csv")
time_series <- ts(df1$Water,frequency=1, start=c(1885))
time_series
plot.ts(time_series)
#2 Plotting Logarithmic timeseries
log_series <- log(time_series)
log_series
plot.ts(log_series)
#3 Simple Moving Average(SMA)
library("TTR")
SMA_series <- SMA(time_series,n=3)
plot.ts(SMA_series)
#4 Holt-Winters filtering
time_series_forecasts <- HoltWinters(time_series, beta=FALSE, gamma=FALSE)
time_series_forecasts
time_series_forecasts$fitted
plot(time_series_forecasts)
#5 Forecasting
time_series_forecasts$SSE
HoltWinters(time_series, beta=FALSE, gamma=FALSE, l.start=23.56)
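#A possible extension (not shown in the original listing): forecast ahead from the
#Holt-Winters fit with base R's predict(), e.g. the next 8 years
predict(time_series_forecasts, n.ahead = 8)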
#6 Mann-Kendall trend test
library(Kendall)
MannKendall(time_series)
plot(time_series)
lines(lowess(time(time_series), time_series), col='blue')
#7 Decomposition of Additive timeseries
time_series <- ts(df1$Water, frequency=12, start=c(1885))
time_series_components <- decompose(time_series)
time_series_components$seasonal
plot(time_series_components)
OUTPUT:
Exercise 2: IoT BASED HEALTH MONITORING USING R
Aim:
To formulate an IoT-based healthcare application – prediction of the possibility of a heart attack using a Generalized Linear Model, Random Forest and Decision Trees in R
Problem Statement:
Heart disease has received a lot of attention in medical research as one of many life-threatening conditions. Diagnosing heart disease is a difficult task which, when automated, can offer better predictions about a patient's heart condition so that further treatment can be made effective. The diagnosis is usually based on the patient's signs, symptoms, and physical examination. Resting blood pressure, cholesterol, age, sex, type of chest pain, fasting blood sugar, ST depression, and exercise-induced angina can all help to predict the likelihood of a heart attack. Models such as Decision Trees, Random Forest and GLM are trained on the given dataset to predict the class: 0 = less chance of heart attack, 1 = more chance of heart attack.
Procedure:
Import the packages rpart, rpart.plot, RColorBrewer, rattle and randomForest
Download and read the dataset from Kaggle: https://www.kaggle.com/datasets/nareshbhat/health-care-data-set-on-heart-attack-possibility
View the statistics of the variables in the dataset using function “summary”.
Analyse the data, specific to resting blood pressure
Using the “cor” function, find the correlation between resting blood pressure and age
Construct a Logistic regression model using GLM and view the output plots
Encode the target values into categorical values
Split the dataset into training and testing data in the ratio 70 : 30
Construct a decision tree model
The target variable is categorised based on resting blood pressure, serum cholesterol and maximum heart rate achieved
Plot the decision tree and view the output
Devise a Random forest model based on the relationship between resting blood pressure, old peak and chest pain type
View the confusion matrix and importance of each predictor
CODE:
#Ex2 - IoT based healthcare application using Generalized Linear model, Random forest and Decision trees in R
#R version 3.3.2 (2016-10-31)
#RStudio version 1.2.1335
# Loading all the necessary Libraries
library(rpart) #used for building classification and regression trees.
library(rpart.plot)
library(RColorBrewer) #helps choose sensible colour schemes for figures
library(rattle) #provides a collection of utility functions for a data scientist
library(randomForest) #used to create and analyse random forests
# Loading the dataset
data = read.csv("heart_health.csv")
# Analyzing the data in the dataset
print("Minimum resting blood pressure")
min(data$trestbps)
print("Maximum resting blood pressure")
max(data$trestbps)
print("Summary of Dataset")
summary(data)
print("Range of resting blood pressure")
max(data$trestbps) - min(data$trestbps)
quantile(data$trestbps, c(0.25, 0.5, 0.75))
print("Column names of the Data")
names(data)
print("Attributes of the Data")
str(data)
print("Number of Rows and Columns:")
dim(data)
# Analyze the Correlation between resting BP and age
print("Correlation between the resting blood pressure and the age")
cor(data$trestbps, data$age, method = "pearson")
cor.test(data$trestbps, data$age, method = "pearson")
# Constructing the GLM
print("Constructing the Logistic regression Model")
glm(target ~ trestbps + restecg + fbs, data = data, family = binomial())
model <- glm(target ~ trestbps + chol + thalach, data = data, family = binomial())
plot(model)
# Make the dependent variable a factor (categorical)
data$target = as.factor(data$target)
# Splitting the dataset into train and test (70/30 split)
print("Train Test Split")
dt = sort(sample(nrow(data), nrow(data)*.7))
train <- data[dt,]
val <- data[-dt,]
nrow(train)
nrow(val)
# Constructing the Decision Tree Model
print("Construction of the Decision Tree Model")
mtree <- rpart(target ~ trestbps + chol + thalach, data = train, method = "class",
               control = rpart.control(minsplit = 20, minbucket = 7, maxdepth = 10,
                                       usesurrogate = 2, xval = 10))
mtree
# Plotting the Decision Tree for the dataset
print("Plotting the Decision Tree")
plot(mtree)
text(mtree)
par(xpd = NA, mar = rep(0.7, 4))
plot(mtree, compress = TRUE)
text(mtree, cex = 0.7, use.n = TRUE, fancy = FALSE, all = TRUE)
prp(mtree, faclen = 0, box.palette = "Reds", cex = 0.8, extra = 1)
# Constructing the Random Forest model
rf <- randomForest(target ~ trestbps + oldpeak + cp, data = data)
# View the forest results
print("Random Forest Results:")
print(rf)
# Importance of each predictor
print("Importance of each predictor:")
print(importance(rf, type = 2))
# Plot the Random Forest
plot(rf)
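#A possible extension (not part of the original listing): view the out-of-bag
#confusion matrix stored by randomForest and check predictions on the 30% validation split
print(rf$confusion)
table(predicted = predict(rf, val), actual = val$target)
#Note: rf was fitted on the full data, so this table is optimistic; refitting the
#model on `train` before predicting on `val` would give an honest estimate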
#Conclusion
#Models like Decision trees, Random forest and GLM were trained on the given dataset and the predictions were visualised successfully
Exercise 3: TRAFFIC PATTERN RECOGNITION USING R
Aim:
To formulate IoT-based traffic pattern recognition using Decision Trees, a correlation study, Naïve Bayes classification and time series forecasting in R
Problem Statement:
The term “traffic patterns recognition” refers to the process of recognising a user’s current traffic pattern, which can be applicable to transportation planning, location-based services, social networks, and a range of other applications.
Dataset:
Using the Kaggle dataset "Smart City Traffic Patterns" (https://www.kaggle.com/datasets/utathya/smart-city-traffic-patterns), perform pattern recognition and prediction using Decision Tree classifiers and Naïve Bayes classification. In addition, time series analysis with a simple moving average, ARIMA and exponential smoothing is performed.
Procedure:
Import required packages after installing
Load and read the data set
Pre-process the data appropriately
Use summary method to see the characteristics of the data set
Use the Simple Moving Average forecasting model and visualize the output
Use the Exponential smoothing forecasting model and see the output
Use the Arima forecasting model and view the output
Get the correlation between the columns
Split the data set into training and testing in the ratio of 70:30
Perform Decision tree classification and view the results in tree format
Perform Naïve Bayes and view the results in confusion matrix
CODE:
#R version 3.6.1
#RStudio version 1.2.1335
#Import required packages after installing
library("e1071")
library("caTools")
library("caret")
library("party")
library("dplyr")
library("magrittr")
library("TTR")
library("data.table")
#Load the data set
data <- read.csv("traffic.csv")
#Pre-process the data appropriately
data$DateTime = strtrim(data$DateTime,15)
data
#Use summary method to see the characteristics of the data set
print("Summary of Dataset")
summary(data)
# correlation study
print("Correlation between Traffic and Junction")
cor(data$Vehicles, data$Junction, method = "pearson")
cor.test(data$Vehicles, data$Junction, method = "pearson")
#Use the Simple Moving Average forecasting model and visualize the output
t_col1 <- fread("traffic.csv", select = c("Vehicles"))
t_col1series <- ts(t_col1,frequency=12, start=c(2015,1))
t_col1series[is.na(t_col1series)]<-mean(t_col1series,na.rm=TRUE) #Replace NA with mean
t_col1seriesSMA3 <- SMA(t_col1series,n=12)
plot.ts(t_col1seriesSMA3)
#Use the Exponential smoothing forecasting model and visualize the output
t_col1 <- fread("traffic.csv", select = c("Junction"))
t_col1series <- ts(t_col1,frequency=12, start=c(2015,1))
t_col1series[is.na(t_col1series)]<-mean(t_col1series,na.rm=TRUE) #Replace NA with mean
t_col1seriesforecasts <- HoltWinters(t_col1series, beta=FALSE, gamma=FALSE)
t_col1seriesforecasts
t_col1seriesforecasts$SSE
HoltWinters(t_col1series, beta=FALSE, gamma=FALSE, l.start=23.56)
#Use the Arima forecasting model and view the output
library("TTR")
v1 <- data[[4]]
datats <- ts(v1)
#Partition into train and test
train_series = datats[1:40]
test_series = datats[41:50]
#Make arima models
arimaModel_1=arima(train_series, order=c(0,1,2))
arimaModel_2=arima(train_series, order=c(1,1,0))
arimaModel_3=arima(train_series, order=c(1,1,2))
## look at the parameters
print(arimaModel_1);print(arimaModel_2);print(arimaModel_3)
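#A possible extension (not part of the original listing): forecast the next 10 points
#from each fitted ARIMA model and compare them with the held-out test_series
fc1 <- predict(arimaModel_1, n.ahead = 10)$pred
fc2 <- predict(arimaModel_2, n.ahead = 10)$pred
fc3 <- predict(arimaModel_3, n.ahead = 10)$pred
cbind(test_series, fc1 = as.numeric(fc1), fc2 = as.numeric(fc2), fc3 = as.numeric(fc3))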
#Split the data set into training and testing in the ratio of 70:30
split <- sample.split(data$Junction, SplitRatio = 0.7)
train_cl <- subset(data, split == "TRUE")
test_cl <- subset(data, split == "FALSE")
#Perform Decision tree classification and view the results in tree format
model <- ctree(Vehicles ~ Junction, train_cl)
plot(model)
#Perform Naïve Bayes and view the results in a confusion matrix
set.seed(120) # Setting seed
classifier_cl <- naiveBayes(Junction ~ ., data = train_cl)
# Predicting on test data
y_pred <- predict(classifier_cl, newdata = test_cl)
# Confusion Matrix
cm <- table(test_cl$Junction, y_pred)
cm
confusionMatrix(cm)
plot(cm)
#Conclusion
#Traffic pattern recognition with Decision trees, correlation study, Naïve Bayes classification and Time series forecasting was successfully implemented and visualised using R
Exercise 4: POWER DATA ANALYSIS AND VISUALISATION FOR AN IN-HOME POWER PROJECT ON RASPBERRY PI USING R
AIM:
To perform analysis on power data collected using Raspberry Pi and use R to do time series forecasting on the data by analyzing it and plotting the obtained forecast
PROBLEM STATEMENT:
The data on household power consumption can not only show the current state of household power consumption, but it can also bring awareness to the power sector, assisting in the understanding of power supply. With the proliferation of smart electricity metres and the widespread deployment of electricity producing technology such as solar panels, there is a lot of data about electricity usage available.
The Kaggle dataset named "Household Electric Power Consumption", available at https://www.kaggle.com/datasets/uciml/electric-power-consumption-data-set, is a multivariate time series of power-related variables that can be used to model and even forecast future electricity consumption. Time series forecasting, exploratory data analytics and data visualization are performed using the same.
PROCEDURE:
Import the data.table, dplyr, lubridate, plotly and forecast packages
Import the data and print a short summary of it using the head(), glimpse() and summary() functions.
Check for the missing values in the dataset and remove if there are any.
Convert the date and time to a standard format.
Extract the Year, Week and Day from the dataset.
Visualize the granularity of the submetering using the plot() function
Filter the data for any particular year and visualize the submetering for that particular year.
Use the plotly package to plot the submetering across a day of usage.
Reduce the number of observations for that day and again plot the graph using plotly.
For the time series analysis extract the weekly time series data for all the submeters and plot them.
Fit the time series data into a time series linear regression model.
View the summary of the model.
Plot the forecast for all 3 submeters.
CODE:
#R version 3.6.1
#RStudio version 1.2.1335
# Import Packages
library(data.table)
library(dplyr)   #needed for filter() and glimpse() used below
library(lubridate)
library(plotly)
library(forecast)
# Import Data
data <- fread("household_power_consumption.txt")
head(data)
glimpse(data)
summary(data)
# Data Preprocessing
data <- data[complete.cases(data)]
sum(is.na(data))
data$datetime <- paste(data$Date,data$Time)
data$datetime <- as.POSIXct(data$datetime, format="%d/%m/%Y %H:%M:%S")
attr(data$datetime, "tzone") <- "Europe/Paris"
str(data)
data$year <- year(data$datetime)
data$week <- week(data$datetime)
data$day <- day(data$datetime)
data$month <- month(data$datetime)
data$minute <- minute(data$datetime)
# Data Visualization
plot(data$Sub_metering_1)
ann <- filter(data, year == 2006)
plot(ann$Sub_metering_1)
plot(ann$Sub_metering_2)
plot(ann$Sub_metering_3)
houseDay <- filter(data, year == 2008 & day == 10 & month==1)
plot_ly(houseDay, x = ~houseDay$datetime, y = ~houseDay$Sub_metering_1, type = 'scatter', mode = 'lines')
dtDay <- filter(data, year == 2009 & day == 2 & month==2)
plot_ly(dtDay, x = ~dtDay$datetime, y = ~dtDay$Sub_metering_1, name = 'Kitchen', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~dtDay$Sub_metering_2, name = 'Laundry Room', mode = 'lines') %>%
  add_trace(y = ~dtDay$Sub_metering_3, name = 'Water Heater & AC', mode = 'lines') %>%
  layout(title = "Power Consumption Feb 2nd, 2009",
         xaxis = list(title = "Time"),
         yaxis = list(title = "Power (watt-hours)"))
houseDay10 <- filter(data, year == 2009 & month == 2 & day == 2 & (minute == 0 | minute == 10 | minute == 20 | minute == 30 | minute == 40 | minute == 50))
plot_ly(houseDay10, x = ~houseDay10$datetime, y = ~houseDay10$Sub_metering_1, name = 'Kitchen', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~houseDay10$Sub_metering_2, name = 'Laundry Room', mode = 'lines') %>%
  add_trace(y = ~houseDay10$Sub_metering_3, name = 'Water Heater & AC', mode = 'lines') %>%
  layout(title = "Power Consumption Feb 2nd, 2009",
         xaxis = list(title = "Time"),
         yaxis = list(title = "Power (watt-hours)"))
data$minute <- minute(data$datetime)
houseDay10 <- filter(data, year == 2008 & month == 5 & day == 10 & (minute == 0 |
minute == 10 | minute == 20 | minute == 30 | minute == 40 | minute == 50))
plot_ly(houseDay10, x = ~houseDay10$datetime, y = ~houseDay10$Sub_metering_1,
        name = 'Kitchen', type = 'scatter', mode = 'lines') %>%
  add_trace(y = ~houseDay10$Sub_metering_2, name = 'Laundry Room', mode = 'lines') %>%
  add_trace(y = ~houseDay10$Sub_metering_3, name = 'Water Heater & AC', mode = 'lines') %>%
  layout(title = "Power Consumption May 10th, 2008",
         xaxis = list(title = "Time"),
         yaxis = list(title = "Power (watt-hours)"))
# Time Series Analysis
data$hour <- hour(data$datetime)
houseweekly <- filter(data, week == 2 & hour == 20 & minute == 1)
tsSM3_weekly <- ts(houseweekly$Sub_metering_3, frequency=52, start=c(2007,1))
plot(tsSM3_weekly, xlab = "Time", ylab = "Watt Hours", main = "Sub-meter 3")
tsSM3_weekly <- ts(houseweekly$Sub_metering_1, frequency=52, start=c(2007,1))
plot(tsSM3_weekly, xlab = "Time", ylab = "Watt Hours", main = "Sub-meter 1")
tsSM3_weekly <- ts(houseweekly$Sub_metering_2, frequency=52, start=c(2007,1))
plot(tsSM3_weekly, xlab = "Time", ylab = "Watt Hours", main = "Sub-meter 2")
house070809weekly <- filter(data, year==2008, hour == 20 & minute == 1)
tsSM3_070809weekly <- ts(house070809weekly$Sub_metering_3, frequency=52,
start=c(2008,3))
tsSM2_070809weekly <- ts(house070809weekly$Sub_metering_2, frequency=52, start=c(2008,3))
tsSM1_070809weekly <- ts(house070809weekly$Sub_metering_1, frequency=52, start=c(2008,3))
fit3 <- tslm(tsSM3_070809weekly ~ trend + season)
fit2 <- tslm(tsSM2_070809weekly ~ trend + season)
fit1 <- tslm(tsSM1_070809weekly ~ trend + season)
summary(fit3)
forecastfitSM3c <- forecast(fit3, h=20, level=c(80,90))
forecastfitSM2c <- forecast(fit2, h=20, level=c(80,90))
forecastfitSM1c <- forecast(fit1, h=20, level=c(80,90))
plot(forecastfitSM3c, ylim = c(0, 20), ylab = "Watt-Hours", xlab = "Time")
plot(forecastfitSM2c, ylim = c(0, 20), ylab = "Watt-Hours", xlab = "Time")
plot(forecastfitSM1c, ylim = c(0, 20), ylab = "Watt-Hours", xlab = "Time")
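#A possible check (not part of the original listing): training-set accuracy of each
#forecast via forecast::accuracy()
accuracy(forecastfitSM3c)
accuracy(forecastfitSM2c)
accuracy(forecastfitSM1c)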
#Conclusion
#Time series forecasting, exploratory data analytics and data visualization using the Household Electric Power Consumption dataset were successfully implemented and visualised using R
Case Studies 1: Use ggplot2 in R to visualize the distribution of petal length with respect to Species in the default iris dataset. Then apply correlation plots (Marginal Histogram / Boxplot, Correlogram, Diverging Bars, Diverging Lollipop Chart, Diverging Dot Plot and Area Chart) in R to show the correlation among the petal width and sepal width attributes in the iris dataset.
Create sheets with one dashboard in Tableau to explore the above dataset. Apply R and Tableau integration to group the species type based on the attributes and show the results with visual impact.
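A minimal R sketch for the first part of the case study (assuming the ggplot2 package is installed) might look like the following; the remaining correlation plots and the Tableau dashboard would be built on the same iris data.
library(ggplot2)
#Distribution of petal length by species in the built-in iris dataset
ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) +
  geom_boxplot(alpha = 0.6) +
  labs(title = "Petal length by species", y = "Petal length (cm)")
#Relationship between petal width and sepal width, coloured by species
ggplot(iris, aes(x = Sepal.Width, y = Petal.Width, colour = Species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)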