This indicates that as the portland cement (“C”) content increases, fly ash (“F”) content increases, water (“W”) content decreases, or aggregate (“CA” and “FA”) content decreases, the concrete compressive strength increases, accounting for the effects of the other variables in the model. This model accounts for 88% of the variability in concrete compressive strength and is more appropriate than using the mean compressive strength. Furthermore, upon validation, the model accounts for 90% of the response variability when predicting new concrete strengths and has a low error rate of 8.1%, indicating the model is appropriate for predicting the concrete compressive strength from the amounts of concrete components. Using this linear model in practice will allow the concrete components to be optimized for increased compressive strength, resulting in less waste of materials and concrete with desirable properties.
The data set was retrieved from UCI Machine Learning Repository on November 12, 2019. Data Reference: Yeh, I-Cheng, “Modeling slump flow of concrete using second-order regressions and artificial neural networks,” Cement and Concrete Composites, Vol.29, No. 6, 474-480, 2007.
In this day and age, we are surrounded by a concrete jungle: our buildings, our roads, and our pipelines are all made possible thanks to this wonderful material!
Concrete is a composite material composed of aggregates and cement or, simply put, “rocks glued together”. The beauty of composites is that they have unique properties that the individual components do not possess on their own. The aggregates reinforce the surrounding cement, creating a strong material. But what makes concrete so strong? Is there an optimum mixture of components?
The Concrete Jungle
Image Source: Edward Burtynsky, twistedsifter.com/2012/03/picture-of-the-day-the-concrete-jungle/
The following components have an effect on the compressive strength (“strength”), flow (“Fl”), and slump (“Sl”) of the concrete material:
The data represents the amount of each component (kilograms) in a cubic meter of concrete.
Image Source: Paulo Monteiro, UC Berkeley
Three concrete properties were investigated given various amounts of concrete components:
Concrete compressive strength (“Strength”)
Slump (“Sl”)
Flow (“Fl”)
Concrete Slump Test
Image Source: theconstructor.org/concrete/concrete-slump-test/1558/
Compressive strength appears approximately normally distributed and will be investigated using linear regression.
Slump has a left-skewed distribution, and will not be investigated using linear regression.
Flow has a bimodal distribution, and will not be investigated using linear regression.
The concrete data was separated into two data sets: 80% of the data was used to train the model (black) and 20% of the data was used to test the model (red).
A linear model relating portland cement (“C”), blast furnace slag (“S”), fly ash (“F”), water (“W”), superplasticizer (“SP”), coarse aggregates (“CA”), and fine aggregates (“FA”) to concrete compressive strength (“Strength”) was created.
Call:
lm(formula = Strength ~ C + S + F + W + SP + CA + FA, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-5.5113 -1.4818 -0.1948 1.2946 7.5805
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 123.25103 79.20910 1.556 0.12397
C 0.06870 0.02504 2.743 0.00763 **
S -0.01870 0.03529 -0.530 0.59787
F 0.05611 0.02566 2.187 0.03190 *
W -0.22442 0.08030 -2.795 0.00661 **
SP 0.10348 0.15451 0.670 0.50512
CA -0.05005 0.03045 -1.644 0.10450
FA -0.03017 0.03224 -0.936 0.35240
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.59 on 74 degrees of freedom
Multiple R-squared: 0.8901, Adjusted R-squared: 0.8797
F-statistic: 85.62 on 7 and 74 DF, p-value: < 2.2e-16
Testing for variable contribution to the model, using results from the t-tests:
Using the t-test, the p-value for blast furnace slag (“S”) was the largest and was greater than \(\alpha=0.05\). Therefore, there is not sufficient evidence to reject the null hypothesis (\(H_0\)), and blast furnace slag (“S”) is not considered to contribute significantly to the model. For the following iteration, blast furnace slag (“S”) was removed from the model.
Call:
lm(formula = Strength ~ C + F + W + SP + CA + FA, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-5.6174 -1.3097 -0.2674 1.3338 7.7172
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 81.819591 12.487931 6.552 6.34e-09 ***
C 0.081721 0.004776 17.112 < 2e-16 ***
F 0.069480 0.004592 15.129 < 2e-16 ***
W -0.183353 0.020813 -8.809 3.43e-13 ***
SP 0.157830 0.114974 1.373 0.1739
CA -0.034176 0.005407 -6.321 1.69e-08 ***
FA -0.013401 0.006079 -2.205 0.0306 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.578 on 75 degrees of freedom
Multiple R-squared: 0.8897, Adjusted R-squared: 0.8809
F-statistic: 100.8 on 6 and 75 DF, p-value: < 2.2e-16
This process was continued and the variables were evaluated using t-tests. In this iteration, superplasticizer (“SP”) was removed next due to the p-value being greater than \(\alpha=0.05\). This resulted in the final model depicting compressive strength with five (5) independent variables that all have significant p-values at a significance level of \(\alpha=0.05\). These five (5) variables were confirmed using the variable selection from the “leaps” package.
Call:
lm(formula = Strength ~ C + F + W + CA + FA, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-5.8679 -1.5570 -0.2748 1.0369 7.2812
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 89.530519 11.218012 7.981 1.20e-11 ***
C 0.080206 0.004673 17.163 < 2e-16 ***
F 0.067584 0.004405 15.342 < 2e-16 ***
W -0.194009 0.019424 -9.988 1.75e-15 ***
CA -0.036741 0.005103 -7.200 3.68e-10 ***
FA -0.015249 0.005962 -2.558 0.0125 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.593 on 76 degrees of freedom
Multiple R-squared: 0.8869, Adjusted R-squared: 0.8795
F-statistic: 119.2 on 5 and 76 DF, p-value: < 2.2e-16
Analysis of Variance Table
Response: Strength
Df Sum Sq Mean Sq F value Pr(>F)
C 1 909.91 909.91 135.362 < 2.2e-16 ***
F 1 2405.94 2405.94 357.917 < 2.2e-16 ***
W 1 316.65 316.65 47.107 1.586e-09 ***
CA 1 330.07 330.07 49.103 8.506e-10 ***
FA 1 43.97 43.97 6.541 0.01253 *
Residuals 76 510.88 6.72
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Subset selection object
Call: regsubsets.formula(Strength ~ C + S + F + W + SP + CA + FA, data = train_data,
nvmax = 7)
7 Variables (and intercept)
Forced in Forced out
C FALSE FALSE
S FALSE FALSE
F FALSE FALSE
W FALSE FALSE
SP FALSE FALSE
CA FALSE FALSE
FA FALSE FALSE
1 subsets of each size up to 7
Selection Algorithm: exhaustive
C S F W SP CA FA
1 ( 1 ) "*" " " " " " " " " " " " "
2 ( 1 ) "*" " " "*" " " " " " " " "
3 ( 1 ) "*" " " "*" "*" " " " " " "
4 ( 1 ) "*" " " "*" "*" " " "*" " "
5 ( 1 ) "*" " " "*" "*" " " "*" "*"
6 ( 1 ) "*" " " "*" "*" "*" "*" "*"
7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
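The final model was determined by removing the variable with the highest p-value from the t-test until all remaining variables were significant at the \(\alpha=0.05\) level. The following variables were determined to have a significant effect on the compressive strength (“Strength”) of concrete (from the t-tests): portland cement (“C”), fly ash (“F”), water (“W”), coarse aggregates (“CA”), and fine aggregates (“FA”).
The resulting linear model is: \(\hat{strength}=\) 89.5305187 + 0.0802058\(C\) + 0.0675839\(F\) -0.1940091\(W\) -0.0367414\(CA\) -0.0152489\(FA\).
Concrete strength increases as portland cement (“C”) content increases, fly ash (“F”) content increases, water (“W”) content decreases, coarse aggregates (“CA”) content decreases, and fine aggregates (“FA”) content decreases.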
Using the training data, the adjusted \(R^2\) value is 0.8795. This indicates that 87.95% of the variability in strength is accounted for using the model.
Using the test data to validate the model,
For the regression model to be appropriate, the following assumptions or conditions must be valid:
1. Linear
Using the Residuals vs. Fitted Plot, there are no distinct patterns, which is an indication of a linear relationship between the concrete compressive strength (\(strength\)) and the regressors: portland cement (\(C\)), fly ash (\(F\)), water (\(W\)), coarse aggregates (\(CA\)), and fine aggregates (\(FA\)).
2. Zero Mean
In Ordinary Least Squares (OLS) Regression with an intercept, the residuals have a mean of zero by construction and thus, this condition is always satisfied.
3. Independent
Assuming the data is indexed in the order it was collected, the Indexed Residuals Plot indicates the residuals have no specific patterns, suggesting the residuals are independent. Without knowing the order in which the data was collected, the independence condition cannot be confirmed.
4. Equal Variance
Using the Scale-Location Plot, the points appear to be random with no indication of a pattern, suggesting the constant variance condition is satisfied.
5. Normality
Using the Q-Q Plot, the standardized residuals lie along the 45 degree line, indicating the normality condition is satisfied.
Additionally, influential points and collinearity should be assessed.
Influential Points
Using the Leverage vs. Residuals plot, there are no points that have leverage values greater than 1, suggesting there are no suspected influential points. Influential points can be an issue because they affect the model estimates. It is important to determine whether such data points are “bad” values from mistakes in data collection or whether they represent actual values. If outliers are determined to be from mistakes, they can be removed from the sample data, but the removal should be mentioned in the analysis.
Collinearity
From the Variance Inflation Factors (VIF) for the five regressors, none of the square roots of the VIF values is greater than 2, suggesting collinearity is not present.
Conclusion
From reviewing the conditions, the linear model is appropriate as the conditions are valid and there are no influential points or collinearity between variables.
In addition to linearity between the regressors and the response, it is also important to assess near-linear dependence among the regressors.
C F W CA FA
1.287230 1.280103 1.375916 1.611531 1.277268
From the Variance Inflation Factors (VIF) for the five regressors, none of the square roots of the VIF values is greater than 2, suggesting collinearity is not present.
Discussion
The significant variables for the concrete compressive strength model are also confirmed in literature:
increasing cement content increases compressive strength due to the chemical reaction between cement and water. The result is an extremely strong rock-like material (Ref [2])
fly ash content and aggregate also have a significant effect on the compressive strength of concrete (Ref [3])
excess water content decreases the compressive strength due to the formation of voids upon evaporation (Ref [2])
Conclusions
Which concrete components have a significant effect on the concrete compressive strength?
The concrete components blast furnace slag (“S”) and superplasticizer (“SP”) were determined to not have a significant effect on concrete compressive strength and were not included in the model describing strength.
What is the linear model describing the concrete compressive strength in terms of the significant concrete components?
\(\hat{strength}=\) 89.5305187 + 0.0802058\(C\) + 0.0675839\(F\) -0.1940091\(W\) -0.0367414\(CA\) -0.0152489\(FA\).
The model was developed using the training data set (80% of the original data) and accounts for 87.95% of the variability in strength. Additionally, the model was validated using the test data set (20% of the original data) and had an error rate of 7.9589768%, which is relatively low. Thus, the model is decent at predicting the compressive strength given the amount of each significant concrete component.
Why is this model important? This linear model describes the effect of concrete components on the compressive strength, which allows the concrete materials to be optimized. This optimization results in less waste of materials and concrete with the desired properties.
Future Work
References
[1] Yeh, I-Cheng, “Modeling slump flow of concrete using second-order regressions and artificial neural networks,” Cement and Concrete Composites, Vol.29, No. 6, 474-480, 2007.
[2] Sami W. Tabsh, Akmal S. Abdelfatah, Influence of recycled concrete aggregates on strength properties of concrete, Construction and Building Materials, 2009, ISSN 0950-0618, https://doi.org/10.1016/j.conbuildmat.2008.06.007.
[3] J. Wongpa, K. Kiattikomol, C. Jaturapitakkul, P. Chindaprasirt, Compressive strength, modulus of elasticity, and water permeability of inorganic polymer concrete, Materials & Design, 2010, ISSN 0261-3069, https://doi.org/10.1016/j.matdes.2010.05.012. Retrieved from: http://www.sciencedirect.com/science/article/pii/S0261306910002931
---
title: "MTH 543 Prj Concrete"
author: "K.M. Burzynski"
output:
flexdashboard::flex_dashboard:
theme: cosmo
orientation: columns
social: ["facebook", "twitter", "linkedin"]
source_code: embed
---
```{r setup, include=FALSE}
# load necessary packages
library(caret)
library(car)
library(ggplot2)
library(plotly)
library(plyr)
library(flexdashboard) ## you need this package to create dashboard
# read the concrete data set (adjust the path to your local copy of the CSV)
concretedata <- read.csv("/Users/katherineburzynski/Documents/MTH 543 - Linear Regression/Project/Concrete_test.csv")
```
Introduction
=======================================================================
Column {data-width=600}
-----------------------------------------------------------------------
### Abstract
#### **Determining the Effect of Concrete Components on Concrete Properties**
Concrete is the most used building material throughout the world today. Concrete is commonly composed of portland cement (“C”), blast furnace slag (“S”), fly ash (“F”), water (“W”), superplasticizer (“SP”), coarse aggregates (“CA”), and fine aggregates (“FA”). The purpose of this research is to determine which concrete components have a significant effect on the compressive strength. Using the Ordinary Least Squares (OLS) Method, the regression model was determined to be:
$\hat{strength}$ = 89.5305187 + 0.0802058 $C$ + 0.0675839 $F$ -0.1940091 $W$ -0.0367414 $CA$ -0.0152489 $FA$
This indicates that as the portland cement (“C”) content increases, fly ash (“F”) content increases, water (“W”) content decreases, or aggregate ("CA" and "FA") content decreases, the concrete compressive strength increases, accounting for the effects of the other variables in the model. This model accounts for 88% of the variability in concrete compressive strength and is more appropriate than using the mean compressive strength. Furthermore, upon validation, the model accounts for 90% of the response variability when predicting new concrete strengths and has a low error rate of 8.1%, indicating the model is appropriate for predicting the concrete compressive strength from the amounts of concrete components. Using this linear model in practice will allow the concrete components to be optimized for increased compressive strength, resulting in less waste of materials and concrete with desirable properties.
The data set was retrieved from *UCI Machine Learning Repository* on November 12, 2019.
**Data Reference:**
Yeh, I-Cheng, "Modeling slump flow of concrete using second-order regressions and artificial neural networks," Cement and Concrete Composites, Vol.29, No. 6, 474-480, 2007.
### What does the data look like?
```{r}
pairs(concretedata)
#histstrength <- plot_ly(concretedata, x=~Strength)%>%
#layout(title = "Concrete Compressive Strength",
#xaxis = list(title="Compressive Strength",
#font=list(size=14)),
#yaxis = list(title="Frequency", font=list(size=14)))
#ggplotly(histstrength)
```
Column {.tabset data-width=400}
-----------------------------------------------------------------------
### What makes concrete strong?
In this day and age, we are surrounded by a concrete jungle: our buildings, our roads, and our pipelines are all made possible thanks to this wonderful material!
Concrete is a composite material composed of aggregates and cement or, simply put, *"rocks glued together"*. The beauty of composites is that they have unique properties that the individual components do not possess on their own. The aggregates reinforce the surrounding cement, creating a strong material. But what makes concrete so strong? Is there an optimum mixture of components?

*Image Source: Edward Burtynsky, twistedsifter.com*
twistedsifter.com/2012/03/picture-of-the-day-the-concrete-jungle/
### Concrete Components
The following components have an effect on the compressive strength ("strength"), flow ("Fl"), and slump ("Sl") of the concrete material:
* Portland Cement (“C”)
* Blast furnace slag (“S”)
* Fly ash (“F”)
* Water (“W”)
* Superplasticizer (“SP”)
* Coarse aggregates (“CA”)
* Fine aggregates (“FA”)
The data represents the amount of each component (kilograms) in a cubic meter of concrete.

*Image Source: Paulo Monteiro, UC Berkeley*
### Concrete Properties
Three concrete properties were investigated given various amounts of concrete components:
**Concrete compressive strength ("Strength")**
* concrete samples were tested in compression until they failed
* determines the strength of the cured composite
* reported in megapascals (MPa)
**Slump ("Sl")**
* slump is how much drop there is in the wet concrete during the "Slump-Cone Test"
* helps understand how easy the wet concrete is to work with
* measured in meters (m)
**Flow ("Fl")**
* flow is the diameter of the wet concrete cone during the "Slump-Cone Test"
* helps understand how easy the wet concrete is to work with
* measured in meters (m)

*Image Source: theconstructor.org/concrete/concrete-slump-test/1558/*
Response Variable Exploration
=======================================================================
Column {.tabset data-width=400}
-----------------------------------------------------------------------
### Compressive Strength of Concrete ("Strength")
* determines the strength of the cured composite
* reported in megapascals (MPa)
```{r}
boxstrength <- plot_ly(concretedata, y=~Strength, type="box")%>%
layout(title = "Concrete Compressive Strength (MPa)",
xaxis = list(title=" ", font=list(size=14)),
yaxis = list(title=" ", font=list(size=14)))
ggplotly(boxstrength)
```
### Strength Data
Compressive strength appears approximately normally distributed and will be investigated using linear regression.
```{r}
histstrength <- plot_ly(concretedata, x=~Strength)%>%
layout(xaxis = list(title="Concrete Compressive Strength (MPa)", font=list(size=14)),
yaxis = list(title="Frequency", font=list(size=14)))
ggplotly(histstrength)
```
Column {.tabset data-width=400}
-----------------------------------------------------------------------
### Slump ("Sl")
* slump is how much drop there is in the cone
* measured in meters (m)
```{r}
boxslump <- plot_ly(concretedata, y=~Sl, type="box")%>%
layout(title = "Concrete Slump (m)",
xaxis = list(title=" ", font=list(size=14)),
yaxis = list(title=" ", font=list(size=14)))
ggplotly(boxslump)
```
### Slump Data
Slump has a left-skewed distribution, and will not be investigated using linear regression.
```{r}
histslump <- plot_ly(concretedata, x=~Sl)%>%
layout(xaxis = list(title="Concrete Slump (m)", font=list(size=14)),
yaxis = list(title="Frequency", font=list(size=14)))
ggplotly(histslump)
```
Column {.tabset data-width=400}
-----------------------------------------------------------------------
### Flow ("Fl")
* flow is the diameter of the cone
* measured in meters (m)
```{r}
boxflow <- plot_ly(concretedata, y=~Fl, type="box")%>%
layout(title = "Concrete Flow (m)",
xaxis = list(title=" ", font=list(size=14)),
yaxis = list(title=" ", font=list(size=14)))
ggplotly(boxflow)
```
### Flow Data
Flow has a bimodal distribution, and will not be investigated using linear regression.
```{r}
histflow <- plot_ly(concretedata, x=~Fl)%>%
layout(xaxis = list(title="Concrete Flow (m)", font=list(size=14)),
yaxis = list(title="Frequency", font=list(size=14)))
ggplotly(histflow)
```
Linear Model of Strength
=======================================================================
Column {.tabset data-width=400}
-----------------------------------------------------------------------
### Strength Data Sets
```{r}
set.seed(2019)
train_index=sample(1:103,82)
train_data=concretedata[train_index,]
test_data=concretedata[-train_index,]
```
```{r}
boxplot(train_data$Strength,test_data$Strength,main="Concrete Strength for Validation",names = c("train","test"), horizontal = TRUE, xlab="Concrete Strength (MPa)", col=c("black","red")
)
```
The concrete data was separated into two data sets: 80% of the data was used to train the model (black) and 20% of the data was used to test the model (red).
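The split above hard-codes 103 rows and 82 training rows. As a minimal sketch of the same idea, the sizes can be derived from the data instead. The data frame `toy` and its values are hypothetical stand-ins, not the concrete data.

```r
# Hypothetical stand-in for the concrete data (103 rows, made-up values)
set.seed(2019)
toy <- data.frame(C = runif(103, 137, 374), Strength = rnorm(103, 36, 8))

n <- nrow(toy)
n_train <- floor(0.8 * n)                  # 80% of the rows, rounded down
train_index <- sample(seq_len(n), n_train) # random row numbers for training

train_toy <- toy[train_index, ]            # rows used to fit the model
test_toy  <- toy[-train_index, ]           # held-out rows used for validation

c(train = nrow(train_toy), test = nrow(test_toy))
```

Deriving the sizes from `nrow()` keeps the 80/20 proportion intact if rows are added to or removed from the data set.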
### Significant Variables Selection
A linear model relating portland cement (“C”), blast furnace slag (“S”), fly ash (“F”), water (“W”), superplasticizer (“SP”), coarse aggregates (“CA”), and fine aggregates (“FA”) to concrete compressive strength ("Strength") was created.
```{r}
cstrength=lm(Strength~C+S+F+W+SP+CA+FA,train_data)
summary(cstrength)
```
Testing for variable contribution to the model, using results from the t-tests:
Null hypothesis ($H_0$): $\beta_i = 0$, $i = 1, 2, \dots, 7$, versus alternative hypothesis ($H_A$): $\beta_i \neq 0$.
Using the t-test, the p-value for blast furnace slag (“S”) was the largest and was greater than $\alpha=0.05$. Therefore, there is not sufficient evidence to reject the null hypothesis ($H_0$), and blast furnace slag (“S”) is not considered to contribute significantly to the model. For the following iteration, blast furnace slag (“S”) was removed from the model.
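As a worked check of how each t-test is formed, the statistic is the coefficient estimate divided by its standard error, compared against a t distribution with $n - p - 1$ degrees of freedom. Taking the portland cement row of the full-model summary as reported for this training split ($n = 82$ training rows, $p = 7$ regressors):

$$
t_C = \frac{\hat{\beta}_C}{SE(\hat{\beta}_C)} = \frac{0.06870}{0.02504} \approx 2.74, \qquad df = n - p - 1 = 82 - 7 - 1 = 74
$$

This agrees with the t value and the residual degrees of freedom printed by `summary()` for that split.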
```{r}
cstrength=lm(Strength~C+F+W+SP+CA+FA,train_data)
summary(cstrength)
```
This process was continued and the variables were evaluated using t-tests. In this iteration, superplasticizer (“SP”) was removed next due to the p-value being greater than $\alpha=0.05$. This resulted in the final model depicting compressive strength with five (5) independent variables that all have significant p-values at a significance level of $\alpha=0.05$. These five (5) variables were confirmed using the variable selection from the "leaps" package.
```{r}
library(MASS)
library(leaps)
cstrength=lm(Strength~C+F+W+CA+FA,train_data)
summary(cstrength)
anova(cstrength)
fit.subset=regsubsets(Strength~C+S+F+W+SP+CA+FA,data=train_data,nvmax=7)
summary(fit.subset)
```
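The manual procedure used here (refit, drop the term with the largest p-value, repeat) can be sketched as a loop. This is only an illustration on hypothetical data (`toy`, `x1` to `x4` are made-up names), not the code used for the concrete model.

```r
# Hypothetical data: y depends on x1 and x2 only; x3 and x4 are pure noise
set.seed(1)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- 2 + 3 * toy$x1 - 2 * toy$x2 + rnorm(n)

alpha <- 0.05
vars <- c("x1", "x2", "x3", "x4")

repeat {
  fit <- lm(reformulate(vars, response = "y"), data = toy)
  coefs <- summary(fit)$coefficients
  # p-values of the t-tests, with the intercept row dropped
  pvals <- setNames(coefs[-1, "Pr(>|t|)"], rownames(coefs)[-1])
  worst <- which.max(pvals)
  # stop once every remaining term is significant (or only one term is left)
  if (pvals[worst] <= alpha || length(vars) == 1) break
  vars <- setdiff(vars, names(pvals)[worst])  # drop the least significant term
}

vars  # regressors kept by the backward elimination
```

Removing one term at a time matters because the p-values of the remaining terms change after each refit, as the standard errors in the summaries above illustrate.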
Column
-----------------------------------------------------------------------
### Linear Model of Concrete Compressive Strength
The final model was determined by removing the variable with the highest p-value from the t-test until all remaining variables were significant at the $\alpha=0.05$ level. The following variables were determined to have a significant effect on the compressive strength ("Strength") of concrete (from the t-tests):
* portland cement (“C”)
* fly ash (“F”)
* water (“W”)
* coarse aggregates (“CA”)
* fine aggregates (“FA”)
```{r}
css=lm(Strength~C+F+W+CA+FA,train_data)
```
**The resulting linear model is:**
#### $\hat{strength}=$ `r css$coefficients[1]` + `r css$coefficients[2]`$C$ + `r css$coefficients[3]`$F$ `r css$coefficients[4]`$W$ `r css$coefficients[5]`$CA$ `r css$coefficients[6]`$FA$.
**Concrete strength increases as:**
* portland cement (“C”) content increases
* fly ash (“F”) content increases
* water (“W”) content decreases
* coarse aggregates (“CA”) decreases
* fine aggregates (“FA”) decreases
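To illustrate how the fitted equation is used, the sketch below plugs one mix into the coefficients reported in the abstract. The mix quantities are hypothetical values chosen for illustration and are not taken from the data set.

```r
# Coefficients as reported for the final model in this report
b <- c(Intercept = 89.5305187, C = 0.0802058, F = 0.0675839,
       W = -0.1940091, CA = -0.0367414, FA = -0.0152489)

# Hypothetical mix in kilograms per cubic meter (illustrative values only)
mix <- c(C = 300, F = 100, W = 200, CA = 900, FA = 750)

# Intercept plus the sum of coefficient * amount for each component
strength_hat <- unname(b["Intercept"] + sum(b[names(mix)] * mix))
strength_hat  # predicted compressive strength in MPa
```

For this hypothetical mix the prediction is about 37.0 MPa. Predictions are only meaningful for mixes within the range of the component amounts in the training data.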
#### **How good is this linear model at predicting the test data?**
Using the training data, the adjusted $R^2$ value is 0.8795. This indicates that 87.95% of the variability in strength is accounted for using the model.
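As a worked check, the adjusted $R^2$ follows from the multiple $R^2$ of 0.8869 reported in the final model summary, with $n = 82$ training rows and $p = 5$ regressors:

$$
R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - 0.8869)\,\frac{81}{76} \approx 0.8795
$$

The adjustment penalizes the number of regressors, which is why the adjusted value is slightly below the multiple $R^2$.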
```{r}
prd=predict(css,test_data)
Rsq=R2(prd,test_data$Strength)
Rootmean=RMSE(prd,test_data$Strength)/mean(test_data$Strength)
```
Using the test data to validate the model,
* The $R^2$ value is `r Rsq`. This indicates that `r Rsq*100`% of the variability in strength is accounted for by the model.
* The error rate (RMSE divided by the mean test strength) is `r Rootmean*100`% when predicting the test data using the model created from the training data.
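The two validation quantities can also be written out in base R. This is a sketch assuming that `R2()` from caret is used with its default squared-correlation formula. The `obs` and `pred` vectors are hypothetical values, not results from the concrete model.

```r
# Hypothetical observed and predicted strengths (MPa), for illustration only
obs  <- c(30.2, 41.5, 35.0, 26.8, 44.1)
pred <- c(31.0, 39.9, 36.2, 28.1, 42.5)

r_squared  <- cor(pred, obs)^2            # squared correlation of predicted vs observed
rmse       <- sqrt(mean((pred - obs)^2))  # root mean squared error, in MPa
error_rate <- rmse / mean(obs)            # RMSE relative to the mean observed response

round(c(R2 = r_squared, RMSE = rmse, error_rate = error_rate), 4)
```

Dividing the RMSE by the mean response expresses the typical prediction error as a fraction of a typical strength value, which is what the percentage error rate reports.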
Model Conditions
=======================================================================
Column {.tabset data-width=400}
-----------------------------------------------------------------------
### Summary
For the regression model to be appropriate, the following assumptions or conditions must be valid:
***1. Linear***
Using the ***Residuals vs. Fitted Plot***, there are no distinct patterns, which is an indication of a linear relationship between the concrete compressive strength ($strength$) and the regressors: portland cement ($C$), fly ash ($F$), water ($W$), coarse aggregates ($CA$), and fine aggregates ($FA$).
***2. Zero Mean***
In *Ordinary Least Squares* (OLS) Regression with an intercept, the residuals have a mean of zero by construction and thus, this condition is always satisfied.
***3. Independent***
Assuming the data is indexed in the order it was collected, the ***Indexed Residuals Plot*** indicates the residuals have no specific patterns, suggesting the residuals are independent. Without knowing the order in which the data was collected, the independence condition cannot be confirmed.
***4. Equal Variance***
Using the ***Scale-Location Plot***, the points appear to be random with no indication of a pattern, suggesting the constant variance condition is satisfied.
***5. Normality***
Using the ***Q-Q Plot***, the standardized residuals lie along the 45 degree line, indicating the normality condition is satisfied.
*Additionally, influential points and collinearity should be assessed.*
***Influential Points***
Using the *Leverage vs. Residuals* plot, there are no points that have leverage values greater than 1, suggesting there are no suspected influential points. Influential points can be an issue because they affect the model estimates. It is important to determine whether such data points are "bad" values from mistakes in data collection or whether they represent actual values. If outliers are determined to be from mistakes, they can be removed from the sample data, but the removal should be mentioned in the analysis.
***Collinearity***
From the *Variance Inflation Factors* (VIF) for the five regressors, none of the square roots of the VIF values is greater than 2, suggesting collinearity is not present.
***Conclusion***
From reviewing the conditions, the linear model is appropriate as the conditions are valid and there are no influential points or collinearity between variables.
### 1. Linearity Condition
```{r}
css=lm(Strength~C+F+W+CA+FA,train_data)
Fitted.Values <- css$fitted.values
Residuals <- css$residuals
Standardized.Residuals <- rstandard(css)  # standardized residuals from the fitted model
Theoretical.Quantiles <- qqnorm(Standardized.Residuals, plot.it = FALSE)$x
Root.Residuals <- sqrt(abs(Standardized.Residuals))
Leverage <- lm.influence(css)$hat
diagnostics <- data.frame(Fitted.Values, Residuals, Standardized.Residuals, Theoretical.Quantiles, Root.Residuals, Leverage)
m <- list(l = 100, r = 100, b = 100, t = 100, pad = 4)
p1 <- plot_ly(diagnostics, x = Fitted.Values, y = Residuals,
type = "scatter", mode = "markers",
hoverinfo = "x+y", name = "Data",
marker = list(size = 10, opacity = 0.5))%>%
layout(title = "Residuals v.s. Fitted Values",
xaxis = list(title="Fitted Values", font=list(size=14)),
yaxis = list(title="Residuals",
font=list(size=14),margin=m))
ggplotly(p1)
plot(css,1)
```
Using the ***Residuals vs. Fitted Plot***, there are no distinct patterns, which is an indication of a linear relationship between the concrete compressive strength ($strength$) and the regressors: portland cement ($C$), fly ash ($F$), water ($W$), coarse aggregates ($CA$), and fine aggregates ($FA$).
### 2. Zero Mean Condition
In *Ordinary Least Squares* (OLS) Regression with an intercept, the residuals have a mean of zero by construction and thus, this condition is always satisfied.
### 3. Independent Condition
```{r}
plot(css$residuals, main="Indexed Residuals",ylab="residual")
```
Assuming the data is indexed in the order it was collected, the ***Indexed Residuals Plot*** indicates the residuals have no specific patterns, suggesting the residuals are independent. Without knowing the order in which the data was collected, the independent condition cannot be confirmed.
### 4. Equal Variance (aka Constant Variance) Condition
```{r}
p3 <- plot_ly(diagnostics, x = Fitted.Values, y = Root.Residuals, type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data", marker = list(size = 10, opacity = 0.5), showlegend = F)%>%
layout(title = "Scale-Location", xaxis = list(title="Fitted Values", font=list(size=14)), yaxis = list(title=expression(sqrt("|Standardized Residuals|")), font=list(size=14)), font=list(size=14), margin=m)
ggplotly(p3)
plot(css,3)
```
Using the ***Scale-Location Plot***, the points appear to be random with no indication of a pattern, suggesting the constant variance condition is satisfied.
### 5. Normality Condition
```{r}
p2 <- plot_ly(diagnostics, x = Theoretical.Quantiles, y = Standardized.Residuals, type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data", marker = list(size = 10, opacity = 0.5), showlegend = F)%>%
add_trace(x = Theoretical.Quantiles, y = Theoretical.Quantiles, type = "scatter", mode = "lines", name = "", line = list(width = 2))%>%
layout(title = "Q-Q Plot",xaxis = list(title="Theoretical Quantiles", font=list(size=14)), yaxis = list(title="Standardized Residuals", font=list(size=14)),font=list(size=14), margin=m)
ggplotly(p2)
plot(css,2)
```
Using the ***Q-Q Plot***, the standardized residuals lie along the 45 degree line, indicating the normality condition is satisfied.
### Influential Points
```{r}
s <- loess.smooth(Leverage, Residuals)
p4 <- plot_ly(diagnostics, x = Leverage, y = Residuals, type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data", marker = list(size = 10, opacity = 0.5), showlegend = F) %>%
add_trace(x = s$x, y = s$y, type = "scatter", mode = "lines", name = "Smooth", line = list(width = 2)) %>%
layout(title = "Leverage vs Residuals", xaxis = list(title="Leverage", font=list(size=14)), yaxis = list(title="Residuals", font=list(size=14)), font=list(size=14), margin=m)
ggplotly(p4)
plot(css,4)
```
Using the *Leverage vs. Residuals* plot, there are no points that have leverage values greater than 1, suggesting there are no suspected influential points. Influential points can be an issue because they affect the model estimates. It is important to determine whether such data points are "bad" values from mistakes in data collection or whether they represent actual values. If outliers are determined to be from mistakes, they can be removed from the sample data, but the removal should be mentioned in the analysis.
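Because hat values (leverage) always lie between 0 and 1, numerical screens are often based on Cook's distance or on leverage relative to $2p/n$. The sketch below shows both on hypothetical data with one deliberately unusual point. The thresholds are common rules of thumb, not criteria taken from this analysis.

```r
# Hypothetical data with one deliberately unusual observation (row 30)
set.seed(42)
toy <- data.frame(x = c(rnorm(29), 6))
toy$y <- 1 + 2 * toy$x + rnorm(30)
toy$y[30] <- toy$y[30] - 10   # pull the high-leverage point off the line

fit <- lm(y ~ x, data = toy)
h <- hatvalues(fit)           # leverage of each observation
d <- cooks.distance(fit)      # combines leverage and residual size

p <- length(coef(fit))        # number of estimated coefficients
n <- nrow(toy)
which(h > 2 * p / n)          # rule of thumb for high leverage
which(d > 1)                  # rule of thumb for influential points
```

Any observation flagged this way would still need the follow-up described above: checking whether it is a data collection mistake or a genuine value.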
### Collinearity
In addition to linearity between the regressors and the response, it is also important to assess near-linear dependence among the regressors.
```{r}
sqrt(vif(css))
```
From the *Variance Inflation Factors* (VIF) for the five regressors, none of the square roots of the VIF values is greater than 2, suggesting collinearity is not present.
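For intuition about what `vif()` reports, the VIF of a regressor is $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing that regressor on all the others. A minimal sketch on hypothetical regressors (`x1` to `x3` are made-up names, with `x3` built to be correlated with `x1`):

```r
# Hypothetical regressors; x3 is constructed to be correlated with x1
set.seed(7)
n <- 80
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
X$x3 <- 0.8 * X$x1 + rnorm(n, sd = 0.5)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other regressors
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

sqrt(vif_manual)  # the quantity compared against 2 in this report
```

A regressor that is well explained by the others has $R_j^2$ close to 1 and therefore a large VIF, which is the situation the threshold is meant to flag.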
Result Discussion
=======================================================================
Column {.tabset data-width=400}
-----------------------------------------------------------------------
### Discussion
***Discussion***
The significant variables for the concrete compressive strength model are also confirmed in literature:
* increasing cement content increases compressive strength due to the chemical reaction between cement and water. The result is an extremely strong rock-like material (Ref [2])
* fly ash content and aggregate also have a significant effect on the compressive strength of concrete (Ref [3])
* excess water content decreases the compressive strength due to the formation of voids upon evaporation (Ref [2])
### Conclusions
***Conclusions***
*Which concrete components have a significant effect on the concrete compressive strength?*
* portland cement (“C”)
* fly ash (“F”)
* water (“W”)
* coarse aggregates (“CA”)
* fine aggregates (“FA”)
The concrete components blast furnace slag (“S”) and superplasticizer (“SP”) were determined to not have a significant effect on concrete compressive strength and were not included in the model describing strength.
*What is the linear model describing the concrete compressive strength in terms of the significant concrete components?*
$\hat{strength}=$ `r css$coefficients[1]` + `r css$coefficients[2]`$C$ + `r css$coefficients[3]`$F$ `r css$coefficients[4]`$W$ `r css$coefficients[5]`$CA$ `r css$coefficients[6]`$FA$.
The model was developed using the training data set (80% of the original data) and accounts for 87.95% of the variability in strength. Additionally, the model was validated using the test data set (20% of the original data) and had an error rate of `r Rootmean*100`%, which is relatively low. Thus, the model is decent at predicting the compressive strength given the amount of each significant concrete component.
*Why is this model important?*
This linear model describes the effect of concrete components on the compressive strength, which allows the concrete materials to be optimized. This optimization results in less waste of materials and concrete with the desired properties.
### Future Work
***Future Work***
* the data set is limited to 103 observations; more data may provide a clearer understanding of the factors that affect concrete compressive strength
* other factors in addition to compressive strength are important to understanding how concrete behaves during use and what factors are important, specifically for recyclability purposes
* investigating transformations of concrete slump and flow for the cone test
* looking into other models that may be more appropriate to model the concrete compression data
### References
***References***
[1] Yeh, I-Cheng, "Modeling slump flow of concrete using second-order regressions and artificial neural networks," Cement and Concrete Composites, Vol.29, No. 6, 474-480, 2007.
[2] Sami W. Tabsh, Akmal S. Abdelfatah,
Influence of recycled concrete aggregates on strength properties of concrete, Construction and Building Materials, 2009, ISSN 0950-0618, https://doi.org/10.1016/j.conbuildmat.2008.06.007.
[3] J. Wongpa, K. Kiattikomol, C. Jaturapitakkul, P. Chindaprasirt,
Compressive strength, modulus of elasticity, and water permeability of inorganic polymer concrete, Materials & Design,
2010, ISSN 0261-3069, https://doi.org/10.1016/j.matdes.2010.05.012. Retrieved from: http://www.sciencedirect.com/science/article/pii/S0261306910002931