Introduction

Abstract

Determining the Effect of Concrete Components on Concrete Properties

Concrete is the most widely used building material in the world today. It is commonly composed of portland cement (“C”), blast furnace slag (“S”), fly ash (“F”), water (“W”), superplasticizer (“SP”), coarse aggregates (“CA”), and fine aggregates (“FA”). The purpose of this research is to determine which concrete components have a significant effect on compressive strength. Using the Ordinary Least Squares (OLS) method, the regression model was determined to be: \(\hat{strength}\) = 89.5305187 + 0.0802058 \(C\) + 0.0675839 \(F\) - 0.1940091 \(W\) - 0.0367414 \(CA\) - 0.0152489 \(FA\)

This indicates that the predicted compressive strength increases as the portland cement (“C”) content increases, the fly ash (“F”) content increases, the water (“W”) content decreases, or the aggregate (“CA” and “FA”) content decreases, holding the other variables in the model fixed. The model accounts for 88% of the variability in compressive strength in the training data, so it is more informative than predicting with the mean compressive strength alone. Furthermore, upon validation the model accounts for 90% of the response variability when predicting new concrete strengths and has a low error rate of about 8%, indicating that the model is appropriate for predicting compressive strength from the amounts of the concrete components. Using this linear model in practice will allow the concrete components to be optimized for increased compressive strength, resulting in less wasted material and concrete with desirable properties.

The data set was retrieved from UCI Machine Learning Repository on November 12, 2019. Data Reference: Yeh, I-Cheng, “Modeling slump flow of concrete using second-order regressions and artificial neural networks,” Cement and Concrete Composites, Vol.29, No. 6, 474-480, 2007.

What does the data look like?

What makes concrete strong?

In this day and age, we are surrounded by a concrete jungle: our buildings, our roads, and our pipelines are all made possible thanks to this wonderful material!

Concrete is a composite material composed of aggregates and cement or, simply put, “rocks glued together”. The beauty of composites is that they have unique properties that the individual components do not possess on their own. The aggregates reinforce the surrounding cement, creating a strong material. But what makes concrete so strong? Is there an optimum mixture of components?

The Concrete Jungle

Image Source: Edward Burtynsky, twistedsifter.com/2012/03/picture-of-the-day-the-concrete-jungle/

Concrete Components

The following components have an effect on the compressive strength (“strength”), flow (“Fl”), and slump (“Sl”) of the concrete material:

Portland Cement (“C”)
Blast furnace slag (“S”)
Fly ash (“F”)
Water (“W”)
Superplasticizer (“SP”)
Coarse aggregates (“CA”)
Fine aggregates (“FA”)

The data represents the amount of each component (kilograms) in a cubic meter of concrete.

Image Source: Paulo Monteiro, UC Berkeley

Concrete Properties

Three concrete properties were investigated given various amounts of concrete components:

Concrete compressive strength (“Strength”)

concrete samples were tested in compression until they failed
determines the strength of the cured composite
reported in megapascals (MPa)

Slump (“Sl”)

slump is how much drop there is in the wet concrete during the “Slump-Cone Test”
helps understand how easy the wet concrete is to work with
measured in meters (m)

Flow (“Fl”)

flow is the diameter of the wet concrete cone during the “Slump-Cone Test”
helps understand how easy the wet concrete is to work with
measured in meters (m)

Concrete Slump Test

Image Source: theconstructor.org/concrete/concrete-slump-test/1558/

Response Variable Exploration

Compressive Strength of Concrete (“Strength”)

determines the strength of the cured composite
reported in megapascals (MPa)

Strength Data

Compressive strength appears approximately normally distributed and will be investigated using linear regression.

Slump (“Sl”)

slump is how much drop there is in the cone
measured in meters (m)

Slump Data

Slump has a left-skewed distribution, and will not be investigated using linear regression.

Flow (“Fl”)

flow is the diameter of the cone
measured in meters (m)

Flow Data

Flow has a bimodal distribution, and will not be investigated using linear regression.

Linear Model of Strength

Strength Data Sets

The concrete data was separated into two data sets: 80% of the data was used to train the model (black) and 20% was used to test the model (red).

Significant Variables Selection

A linear model was created relating portland cement (“C”), blast furnace slag (“S”), fly ash (“F”), water (“W”), superplasticizer (“SP”), coarse aggregates (“CA”), and fine aggregates (“FA”) to concrete compressive strength (“Strength”).


Call:
lm(formula = Strength ~ C + S + F + W + SP + CA + FA, data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5113 -1.4818 -0.1948  1.2946  7.5805 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 123.25103   79.20910   1.556  0.12397   
C             0.06870    0.02504   2.743  0.00763 **
S            -0.01870    0.03529  -0.530  0.59787   
F             0.05611    0.02566   2.187  0.03190 * 
W            -0.22442    0.08030  -2.795  0.00661 **
SP            0.10348    0.15451   0.670  0.50512   
CA           -0.05005    0.03045  -1.644  0.10450   
FA           -0.03017    0.03224  -0.936  0.35240   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.59 on 74 degrees of freedom
Multiple R-squared:  0.8901,    Adjusted R-squared:  0.8797 
F-statistic: 85.62 on 7 and 74 DF,  p-value: < 2.2e-16

Testing for variable contribution to the model, using results from the t-tests:

Null hypothesis (\(H_0\)): \(\beta_i\) = 0, i = 1,2,3,4,5,6,7, versus alternative hypothesis (\(H_A\)): \(\beta_i\) does not equal 0

Using the t-test, the p-value for blast furnace slag (“S”) was the largest of all the variables and was greater than \(\alpha=0.05\). Therefore, there is not sufficient evidence to reject the null hypothesis (\(H_0\)), and blast furnace slag (“S”) is concluded not to contribute significantly to the model. For the following iteration, blast furnace slag (“S”) was removed from the model.


Call:
lm(formula = Strength ~ C + F + W + SP + CA + FA, data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6174 -1.3097 -0.2674  1.3338  7.7172 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 81.819591  12.487931   6.552 6.34e-09 ***
C            0.081721   0.004776  17.112  < 2e-16 ***
F            0.069480   0.004592  15.129  < 2e-16 ***
W           -0.183353   0.020813  -8.809 3.43e-13 ***
SP           0.157830   0.114974   1.373   0.1739    
CA          -0.034176   0.005407  -6.321 1.69e-08 ***
FA          -0.013401   0.006079  -2.205   0.0306 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.578 on 75 degrees of freedom
Multiple R-squared:  0.8897,    Adjusted R-squared:  0.8809 
F-statistic: 100.8 on 6 and 75 DF,  p-value: < 2.2e-16

This process was continued, and the remaining variables were re-evaluated using t-tests. In this iteration, superplasticizer (“SP”) had a p-value greater than \(\alpha=0.05\) and was removed next. This resulted in the final model describing compressive strength with five (5) independent variables that all have significant p-values at a significance level of \(\alpha=0.05\). These five (5) variables were confirmed using the variable selection from the “leaps” package.


Call:
lm(formula = Strength ~ C + F + W + CA + FA, data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.8679 -1.5570 -0.2748  1.0369  7.2812 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 89.530519  11.218012   7.981 1.20e-11 ***
C            0.080206   0.004673  17.163  < 2e-16 ***
F            0.067584   0.004405  15.342  < 2e-16 ***
W           -0.194009   0.019424  -9.988 1.75e-15 ***
CA          -0.036741   0.005103  -7.200 3.68e-10 ***
FA          -0.015249   0.005962  -2.558   0.0125 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.593 on 76 degrees of freedom
Multiple R-squared:  0.8869,    Adjusted R-squared:  0.8795 
F-statistic: 119.2 on 5 and 76 DF,  p-value: < 2.2e-16

Analysis of Variance Table

Response: Strength
          Df  Sum Sq Mean Sq F value    Pr(>F)    
C          1  909.91  909.91 135.362 < 2.2e-16 ***
F          1 2405.94 2405.94 357.917 < 2.2e-16 ***
W          1  316.65  316.65  47.107 1.586e-09 ***
CA         1  330.07  330.07  49.103 8.506e-10 ***
FA         1   43.97   43.97   6.541   0.01253 *  
Residuals 76  510.88    6.72                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Subset selection object
Call: regsubsets.formula(Strength ~ C + S + F + W + SP + CA + FA, data = train_data, 
    nvmax = 7)
7 Variables  (and intercept)
   Forced in Forced out
C      FALSE      FALSE
S      FALSE      FALSE
F      FALSE      FALSE
W      FALSE      FALSE
SP     FALSE      FALSE
CA     FALSE      FALSE
FA     FALSE      FALSE
1 subsets of each size up to 7
Selection Algorithm: exhaustive
         C   S   F   W   SP  CA  FA 
1  ( 1 ) "*" " " " " " " " " " " " "
2  ( 1 ) "*" " " "*" " " " " " " " "
3  ( 1 ) "*" " " "*" "*" " " " " " "
4  ( 1 ) "*" " " "*" "*" " " "*" " "
5  ( 1 ) "*" " " "*" "*" " " "*" "*"
6  ( 1 ) "*" " " "*" "*" "*" "*" "*"
7  ( 1 ) "*" "*" "*" "*" "*" "*" "*"

Linear Model of Concrete Compressive Strength

The final model was determined by repeatedly removing the variable with the highest p-value from the t-tests until all remaining variables were significant at a significance level of \(\alpha=0.05\). The following variables were determined to have a significant effect on the compressive strength (“Strength”) of concrete (from the t-tests):

portland cement (“C”)
fly ash (“F”)
water (“W”)
coarse aggregates (“CA”)
fine aggregates (“FA”)

The resulting linear model is:

\(\hat{strength}=\) 89.5305187 + 0.0802058\(C\) + 0.0675839\(F\) -0.1940091\(W\) -0.0367414\(CA\) -0.0152489\(FA\).

Concrete strength increases as:

portland cement (“C”) content increases
fly ash (“F”) content increases
water (“W”) content decreases
coarse aggregates (“CA”) decreases
fine aggregates (“FA”) decreases

How good is this linear model at predicting the test data?

Using the training data, the adjusted \(R^2\) value is 0.8795. This indicates that 87.95% of the variability in strength is accounted for using the model.

Using the test data to validate the model,

The \(R^2\) value is 0.9036583. This indicates that 90.3658291 % of the variability in strength is accounted for by the model.
The error rate is 7.9589768 %, measured as the root-mean-square error divided by the mean test strength, when predicting the test data using the model created from the training data.

Model Conditions

Summary

For the regression model to be appropriate, the following assumptions or conditions must be valid:

1. Linear

Using the Residuals vs. Fitted Plot, there are no distinct patterns, which is an indication of a linear relationship between the concrete compressive strength (\(strength\)) and the regressors: portland cement (\(C\)), fly ash (\(F\)), water (\(W\)), coarse aggregates (\(CA\)), and fine aggregates (\(FA\)).

2. Zero Mean

In Ordinary Least Squares (OLS) regression with an intercept term, the residuals sum to zero by construction, so their mean is exactly zero and this condition is always satisfied.

3. Independent

Assuming the data is indexed in the order it was collected, the Indexed Residuals Plot shows no specific patterns, suggesting the residuals are independent. Without knowing the order in which the data was collected, the independence condition cannot be confirmed.

4. Equal Variance

Using the Scale-Location Plot, the points appear to be random with no indication of a pattern, suggesting the constant variance condition is satisfied.

5. Normality

Using the Q-Q Plot, the standardized residuals lie along the 45 degree line, indicating the normality condition is satisfied.

Additionally, influential points and collinearity should be assessed.

Influential Points

Using the Leverage vs. Residuals plot, there are no points that have leverage values greater than 1, suggesting there are not any suspected influential points. Influential points can be an issue because they affect the model estimates. It is important to determine whether such data points are “bad” values from mistakes in data collection or whether they represent actual values. If outliers are determined to be from mistakes, they can be removed from the sample data, but the removal should be mentioned in the analysis.

Collinearity

From the Variance Inflation Factors (VIF) for the five regressors, none of the square roots of the VIF values is greater than 2, suggesting collinearity is not present.

Conclusion

From reviewing the conditions, the linear model is appropriate as the conditions are valid and there are no influential points or collinearity between variables.

1. Linearity Condition

2. Zero Mean Condition

In Ordinary Least Squares (OLS) regression with an intercept term, the residuals sum to zero by construction, so their mean is exactly zero and this condition is always satisfied.

3. Independent Condition

4. Equal Variance (aka Constant Variance) Condition

Using the Scale-Location Plot, the points appear to be random with no indication of a pattern, suggesting the constant variance condition is satisfied.

5. Normality Condition

Using the Q-Q Plot, the standardized residuals lie along the 45 degree line, indicating the normality condition is satisfied.

Influential Points

Collinearity

In addition to linearity between the regressors and the response, it is also important to assess near-linear dependence among the regressors.

       C        F        W       CA       FA 
1.287230 1.280103 1.375916 1.611531 1.277268

From the Variance Inflation Factors (VIF) for the five regressors, none of the square roots of the VIF values is greater than 2, suggesting collinearity is not present.

Result Discussion

Discussion

The significant variables for the concrete compressive strength model are also confirmed in literature:

increasing cement content increases compressive strength due to the chemical reaction between cement and water; the result is an extremely strong rock-like material (Ref [2])
fly ash content and aggregate content also have a significant effect on the compressive strength of concrete (Ref [3])
excess water content decreases the compressive strength due to the formation of voids upon evaporation (Ref [2])

Conclusions

Which concrete components have a significant effect on the concrete compressive strength?

portland cement (“C”)
fly ash (“F”)
water (“W”)
coarse aggregates (“CA”)
fine aggregates (“FA”)

The concrete components blast furnace slag (“S”) and superplasticizer (“SP”) were determined to not have a significant effect on concrete compressive strength and were not included in the model describing strength.

What is the linear model describing the concrete compressive strength with the significant concrete components?

\(\hat{strength}=\) 89.5305187 + 0.0802058\(C\) + 0.0675839\(F\) -0.1940091\(W\) -0.0367414\(CA\) -0.0152489\(FA\).

The model was developed using the training data set (80% of the original data), and 87.95% of the variability in strength is accounted for by the model. Additionally, the model was validated using the test data set (20% of the original data) and had an error rate of 7.9589768 %, which is relatively low. Thus, the model is reasonably good at predicting the compressive strength given the amount of each significant concrete component.

Why is this model important? This linear model describes the effect of concrete components on the compressive strength, which allows the concrete materials to be optimized. This optimization results in less wasted material and concrete with the desired properties.

Future Work

the data set is limited to 103 observations; more data may provide a clearer understanding of the factors that affect concrete compressive strength
properties other than compressive strength are also important for understanding how concrete behaves during use, specifically for recyclability purposes
investigating transformations of concrete slump and flow from the slump-cone test
looking into other models that may be more appropriate for the concrete compressive strength data

References

[1] Yeh, I-Cheng, “Modeling slump flow of concrete using second-order regressions and artificial neural networks,” Cement and Concrete Composites, Vol.29, No. 6, 474-480, 2007.

[2] Sami W. Tabsh, Akmal S. Abdelfatah, Influence of recycled concrete aggregates on strength properties of concrete, Construction and Building Materials, 2009, ISSN 0950-0618, https://doi.org/10.1016/j.conbuildmat.2008.06.007.

[3] J. Wongpa, K. Kiattikomol, C. Jaturapitakkul, P. Chindaprasirt, Compressive strength, modulus of elasticity, and water permeability of inorganic polymer concrete, Materials & Design, 2010, ISSN 0261-3069, https://doi.org/10.1016/j.matdes.2010.05.012. Retrieved from: http://www.sciencedirect.com/science/article/pii/S0261306910002931

---
title: "MTH 543 Prj Concrete"
author: "K.M. Burzynski"
output: 
  flexdashboard::flex_dashboard:
    theme: cosmo
    orientation: columns
    social: ["facebook", "twitter", "linkedin"]
    source_code: embed
---

```{r setup, include=FALSE}
# load necessary packages
library(caret)
library(car)
library(ggplot2)
library(plotly)
library(plyr)
library(flexdashboard)  ## you need this package to create dashboard

# read the concrete slump test data set
concretedata <- read.csv("/Users/katherineburzynski/Documents/MTH 543 - Linear Regression/Project/Concrete_test.csv")

```

Introduction
=======================================================================

Column {data-width=600}
-----------------------------------------------------------------------

### Abstract

#### **Determining the Effect of Concrete Components on Concrete Properties**

Concrete is the most widely used building material in the world today. It is commonly composed of portland cement (“C”), blast furnace slag (“S”), fly ash (“F”), water (“W”), superplasticizer (“SP”), coarse aggregates (“CA”), and fine aggregates (“FA”). The purpose of this research is to determine which concrete components have a significant effect on compressive strength. Using the Ordinary Least Squares (OLS) method, the regression model was determined to be: 

$\hat{strength}$ = 89.5305187 + 0.0802058 $C$ + 0.0675839 $F$ -0.1940091 $W$ -0.0367414 $CA$ -0.0152489 $FA$ 

This indicates that the predicted compressive strength increases as the portland cement (“C”) content increases, the fly ash (“F”) content increases, the water (“W”) content decreases, or the aggregate ("CA" and "FA") content decreases, holding the other variables in the model fixed. The model accounts for 88% of the variability in compressive strength in the training data, so it is more informative than predicting with the mean compressive strength alone. Furthermore, upon validation the model accounts for 90% of the response variability when predicting new concrete strengths and has a low error rate of about 8%, indicating that the model is appropriate for predicting compressive strength from the amounts of the concrete components. Using this linear model in practice will allow the concrete components to be optimized for increased compressive strength, resulting in less wasted material and concrete with desirable properties. 

The data set was retrieved from *UCI Machine Learning Repository* on November 12, 2019.
**Data Reference:**
Yeh, I-Cheng, "Modeling slump flow of concrete using second-order regressions and artificial neural networks," Cement and Concrete Composites, Vol.29, No. 6, 474-480, 2007.


### What does the data look like?

```{r}
pairs(concretedata)
#histstrength <- plot_ly(concretedata, x=~Strength)%>% 
  #layout(title = "Concrete Compressive Strength", 
         #xaxis = list(title="Compressive Strength",
                      #font=list(size=14)),
         #yaxis = list(title="Frequency", font=list(size=14))) 
#ggplotly(histstrength)
```

Column {.tabset data-width=400} 
-----------------------------------------------------------------------

### What makes concrete strong?
In this day and age, we are surrounded by a concrete jungle: our buildings, our roads, and our pipelines are all made possible thanks to this wonderful material!

Concrete is a composite material composed of aggregates and cement or, simply put, *"rocks glued together"*. The beauty of composites is that they have unique properties that the individual components do not possess on their own. The aggregates reinforce the surrounding cement, creating a strong material. But what makes concrete so strong? Is there an optimum mixture of components?


![The Concrete Jungle](/Users/katherineburzynski/Documents/MTH 543 - Linear Regression/Project/judge-harry-pregerson-inerchange.jpg)


*Image Source: Edward Burtynsky, twistedsifter.com*
twistedsifter.com/2012/03/picture-of-the-day-the-concrete-jungle/


### Concrete Components
The following components have an effect on the compressive strength ("strength"), flow ("Fl"), and slump ("Sl") of the concrete material:

* Portland Cement (“C”)
* Blast furnace slag (“S”)
* Fly ash (“F”)
* Water (“W”)
* Superplasticizer (“SP”)
* Coarse aggregates (“CA”)
* Fine aggregates (“FA”)

The data represents the amount of each component (kilograms) in a cubic meter of concrete.



![Concrete Components](/Users/katherineburzynski/Documents/MTH 543 - Linear Regression/Project/concrete_microstructure.jpg)


*Image Source: Paulo Monteiro, UC Berkeley*

### Concrete Properties

Three concrete properties were investigated given various amounts of concrete components:

**Concrete compressive strength ("Strength")**

* concrete samples were tested in compression until they failed
* determines the strength of the cured composite 
* reported in megapascals (MPa)

**Slump ("Sl")**

* slump is how much drop there is in the wet concrete during the "Slump-Cone Test"
* helps understand how easy the wet concrete is to work with
* measured in meters (m)

**Flow ("Fl")**

* flow is the diameter of the wet concrete cone during the "Slump-Cone Test"
* helps understand how easy the wet concrete is to work with
* measured in meters (m)



![Concrete Slump Test](/Users/katherineburzynski/Documents/MTH 543 - Linear Regression/Project/concrete-slum-test-procedure-results.jpg)

*Image Source: theconstructor.org/concrete/concrete-slump-test/1558/*



Response Variable Exploration
=======================================================================
Column {.tabset data-width=400} 
-----------------------------------------------------------------------

### Compressive Strength of Concrete ("Strength")

* determines the strength of the cured composite 
* reported in megapascals (MPa)

``` {r}
boxstrength <- plot_ly(concretedata, y=~Strength, type="box")%>%
  layout(title = "Concrete Compressive Strength (MPa)", 
         xaxis = list(title=" ", font=list(size=14)),
         yaxis = list(title=" ", font=list(size=14)))
ggplotly(boxstrength)
```


### Strength Data

Compressive strength appears approximately normally distributed and will be investigated using linear regression.

```{r}
histstrength <- plot_ly(concretedata, x=~Strength)%>%
  layout(xaxis = list(title="Concrete Compressive Strength (MPa)", font=list(size=14)),
         yaxis = list(title="Frequency", font=list(size=14)))
ggplotly(histstrength)
```
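
The normality statement above is a visual reading of the histogram. As a minimal sketch of a numerical cross-check, the Shapiro-Wilk test from base R could be applied alongside a Q-Q plot. The vector `strength_sim` is simulated stand-in data (an assumption for illustration), not the real `concretedata$Strength` column, and this test is not part of the original analysis.

```r
# Hypothetical stand-in for concretedata$Strength (simulated, not the real data)
set.seed(1)
strength_sim <- rnorm(103, mean = 36, sd = 8)

# Shapiro-Wilk test: the null hypothesis is that the sample is normally distributed
sw <- shapiro.test(strength_sim)
sw$p.value   # a large p-value gives no evidence against normality

# Q-Q plot as a visual companion to the histogram
qqnorm(strength_sim)
qqline(strength_sim)
```

A p-value above the chosen significance level only means normality is not rejected; it does not prove the response is normal.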

Column {.tabset data-width=400} 
-----------------------------------------------------------------------
### Slump ("Sl")

* slump is how much drop there is in the cone
* measured in meters (m)

```{r}
boxslump <- plot_ly(concretedata, y=~Sl, type="box")%>%
  layout(title = "Concrete Slump (m)", 
         xaxis = list(title=" ", font=list(size=14)),
         yaxis = list(title=" ", font=list(size=14)))
ggplotly(boxslump)
```

### Slump Data

Slump has a left-skewed distribution, and will not be investigated using linear regression.

```{r}
histslump <- plot_ly(concretedata, x=~Sl)%>%
  layout(xaxis = list(title="Concrete Slump (m)", font=list(size=14)),
         yaxis = list(title="Frequency", font=list(size=14)))
ggplotly(histslump)
```

Column {.tabset data-width=400} 
-----------------------------------------------------------------------
### Flow ("Fl")

* flow is the diameter of the cone
* measured in meters (m)

```{r}
boxflow <- plot_ly(concretedata, y=~Fl, type="box")%>%
  layout(title = "Concrete Flow (m)", 
         xaxis = list(title=" ", font=list(size=14)),
         yaxis = list(title=" ", font=list(size=14)))
ggplotly(boxflow)
```

### Flow Data

Flow has a bimodal distribution, and will not be investigated using linear regression.

```{r}
histflow <- plot_ly(concretedata, x=~Fl)%>%
  layout(xaxis = list(title="Concrete Flow (m)", font=list(size=14)),
         yaxis = list(title="Frequency", font=list(size=14)))
ggplotly(histflow)
```

Linear Model of Strength
=======================================================================

Column {.tabset data-width=400} 
-----------------------------------------------------------------------

### Strength Data Sets

```{r}
set.seed(2019)
train_index=sample(1:103,82)
train_data=concretedata[train_index,]
test_data=concretedata[-train_index,]
```

```{r}
boxplot(train_data$Strength,test_data$Strength,main="Concrete Strength for Validation",names = c("train","test"), horizontal = TRUE, xlab="Concrete Strength (MPa)", col=c("black","red")
        )
```


The concrete data was separated into two data sets: 80% of the data was used to train the model (black) and 20% was used to test the model (red).
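
The split above hard-codes 103 rows and 82 training rows. A minimal sketch of the same 80/20 split that derives both numbers from the data could look like the following. The data frame `concrete_sim` and its values are simulated placeholders, not the real data set.

```r
# Hypothetical stand-in for concretedata: 103 rows of simulated values
set.seed(2019)
concrete_sim <- data.frame(C = runif(103, 137, 374),
                           Strength = rnorm(103, mean = 36, sd = 8))

n       <- nrow(concrete_sim)
n_train <- floor(0.8 * n)                    # 82 when n = 103

train_index <- sample(seq_len(n), n_train)   # row numbers used for training
train_sim   <- concrete_sim[train_index, ]
test_sim    <- concrete_sim[-train_index, ]  # remaining rows are held out

c(train = nrow(train_sim), test = nrow(test_sim))
```

Deriving the sizes from `nrow()` keeps the split valid if observations are added or removed later.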


### Significant Variables Selection

A linear model was created relating portland cement (“C”), blast furnace slag (“S”), fly ash (“F”), water (“W”), superplasticizer (“SP”), coarse aggregates (“CA”), and fine aggregates (“FA”) to concrete compressive strength ("Strength"). 

```{r}
cstrength=lm(Strength~C+S+F+W+SP+CA+FA,train_data)
summary(cstrength)
```

Testing for variable contribution to the model, using results from the t-tests:

 Null hypothesis ($H_0$): $\beta_i$ = 0, i = 1,2,3,4,5,6,7,   versus   alternative hypothesis ($H_A$): $\beta_i$ does not equal 0   


Using the t-test, the p-value for blast furnace slag (“S”) was the largest of all the variables and was greater than $\alpha=0.05$. Therefore, there is not sufficient evidence to reject the null hypothesis ($H_0$), and blast furnace slag (“S”) is concluded not to contribute significantly to the model. For the following iteration, blast furnace slag (“S”) was removed from the model.

```{r}
cstrength=lm(Strength~C+F+W+SP+CA+FA,train_data)
summary(cstrength)
```

This process was continued, and the remaining variables were re-evaluated using t-tests. In this iteration, superplasticizer (“SP”) had a p-value greater than $\alpha=0.05$ and was removed next. This resulted in the final model describing compressive strength with five (5) independent variables that all have significant p-values at a significance level of $\alpha=0.05$. These five (5) variables were confirmed using the variable selection from the "leaps" package.

```{r}
library(MASS)
library(leaps)
cstrength=lm(Strength~C+F+W+CA+FA,train_data)
summary(cstrength)
anova(cstrength)
fit.subset=regsubsets(Strength~C+S+F+W+SP+CA+FA,data=train_data,nvmax=7)
summary(fit.subset)
```
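
The manual procedure described above (refit, drop the variable with the largest p-value, repeat until every p-value is at most $\alpha$) can be sketched as a loop. This is an illustration under stated assumptions, not the code used for the analysis: `backward_eliminate()` is a hypothetical helper, and `sim` is simulated data in which `S` is generated as pure noise.

```r
# Simulated stand-in data (hypothetical, not the concrete data set):
# Strength depends on C and W; S is pure noise.
set.seed(42)
n <- 100
sim <- data.frame(C = runif(n, 140, 370),
                  S = runif(n, 0, 190),
                  W = runif(n, 160, 240))
sim$Strength <- 60 + 0.08 * sim$C - 0.19 * sim$W + rnorm(n, sd = 2.5)

# backward_eliminate() is a hypothetical helper written for this sketch.
backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit   <- lm(reformulate(predictors, response), data = data)
    coefs <- summary(fit)$coefficients
    # p-values of the slope t-tests (intercept row dropped)
    p <- setNames(coefs[-1, "Pr(>|t|)"], rownames(coefs)[-1])
    # stop when every remaining variable is significant, or only one is left
    if (max(p) <= alpha || length(predictors) == 1) return(fit)
    # otherwise drop the variable with the largest p-value and refit
    predictors <- setdiff(predictors, names(which.max(p)))
  }
}

final_fit <- backward_eliminate(sim, "Strength")
names(coef(final_fit))   # predictors that survive the elimination
```

Because `S` carries no signal in the simulated data, it will usually be dropped, although a noise variable can pass a 5% test by chance.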



Column
-----------------------------------------------------------------------

### Linear Model of Concrete Compressive Strength

The final model was determined by repeatedly removing the variable with the highest p-value from the t-tests until all remaining variables were significant at a significance level of $\alpha=0.05$. The following variables were determined to have a significant effect on the compressive strength ("Strength") of concrete (from the t-tests):

* portland cement (“C”)
* fly ash (“F”)
* water (“W”)
* coarse aggregates (“CA”)
* fine aggregates (“FA”)


```{r}
css=lm(Strength~C+F+W+CA+FA,train_data)
```

**The resulting linear model is:**

#### $\hat{strength}=$ `r css$coefficients[1]` + `r css$coefficients[2]`$C$ + `r css$coefficients[3]`$F$ `r css$coefficients[4]`$W$ `r css$coefficients[5]`$CA$ `r css$coefficients[6]`$FA$.


**Concrete strength increases as:**

* portland cement (“C”) content increases
* fly ash (“F”) content increases
* water (“W”) content decreases
* coarse aggregates (“CA”) decreases
* fine aggregates (“FA”) decreases
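
As a worked example of the fitted equation, the sketch below plugs one mix into the coefficients reported in the abstract. The mix quantities are hypothetical values chosen for illustration (kilograms per cubic meter), not an observation from the data set.

```r
# Coefficients copied from the fitted model reported in the abstract
b <- c(Intercept = 89.5305187, C = 0.0802058, F = 0.0675839,
       W = -0.1940091, CA = -0.0367414, FA = -0.0152489)

# Hypothetical mix in kg per cubic meter (illustrative values only)
mix <- c(C = 300, F = 100, W = 200, CA = 900, FA = 750)

# Intercept plus the sum of coefficient * amount for each component
strength_hat <- unname(b["Intercept"] + sum(b[names(mix)] * mix))
strength_hat   # about 37.04 MPa
```

Under this model, adding 10 kg of water per cubic meter to the same mix lowers the predicted strength by about 1.94 MPa (10 times the water coefficient), with the other components held fixed.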

#### **How good is this linear model at predicting the test data?**

Using the training data, the adjusted $R^2$ value is 0.8795. This indicates that 87.95% of the variability in strength is accounted for using the model.


```{r}
prd=predict(css,test_data)
Rsq=R2(prd,test_data$Strength)
Rootmean=RMSE(prd,test_data$Strength)/mean(test_data$Strength)
```

Using the test data to validate the model,

* The $R^2$ value is `r Rsq`. This indicates that `r Rsq*100` % of the variability in strength is accounted for by the model.
* The error rate is `r Rootmean*100` %, measured as the root-mean-square error divided by the mean test strength, when predicting the test data using the model created from the training data. 
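
The two validation numbers above come from `caret::R2()` and `caret::RMSE()`. A minimal base-R sketch of the same quantities follows, assuming caret's default `formula = "corr"` for `R2()` (the squared correlation between predictions and observations). The vectors `obs` and `pred` are simulated placeholders, not the real test set.

```r
# Hypothetical observed test strengths and predictions (simulated)
set.seed(7)
obs  <- rnorm(21, mean = 36, sd = 8)
pred <- obs + rnorm(21, sd = 2.5)

# Squared correlation between predictions and observations
r2_corr <- cor(pred, obs)^2

# Traditional R^2 = 1 - SSE/SST, shown for comparison
r2_trad <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

# Root-mean-square error, and the same value relative to the mean response
rmse  <- sqrt(mean((obs - pred)^2))
nrmse <- rmse / mean(obs)   # the "error rate" reported in the text

round(c(r2_corr = r2_corr, r2_trad = r2_trad, rmse = rmse, nrmse = nrmse), 4)
```

The squared correlation is never smaller than the traditional $R^2$, so reporting both shows how much of the fit depends on a linear recalibration of the predictions.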

Model Conditions
=======================================================================

Column {.tabset data-width=400} 
-----------------------------------------------------------------------
### Summary

For the regression model to be appropriate, the following assumptions or conditions must be valid: 

***1. Linear***

Using the ***Residuals vs. Fitted Plot***, there are no distinct patterns, which is an indication of a linear relationship between the concrete compressive strength ($strength$) and the regressors: portland cement ($C$), fly ash ($F$), water ($W$), coarse aggregates ($CA$), and fine aggregates ($FA$).

***2. Zero Mean***

In *Ordinary Least Squares* (OLS) regression with an intercept term, the residuals sum to zero by construction, so their mean is exactly zero and this condition is always satisfied.

***3. Independent***

Assuming the data is indexed in the order it was collected, the ***Indexed Residuals Plot*** shows no specific patterns, suggesting the residuals are independent. Without knowing the order in which the data was collected, the independence condition cannot be confirmed.

***4. Equal Variance***

Using the ***Scale-Location Plot***, the points appear to be random with no indication of a pattern, suggesting the constant variance condition is satisfied.

***5. Normality***

Using the ***Q-Q Plot***, the standardized residuals lie along the 45 degree line, indicating the normality condition is satisfied.

*Additionally, influential points and collinearity should be assessed.*

***Influential Points***

Using the *Leverage vs. Residuals* plot, there are no points that have leverage values greater than 1, suggesting there are not any suspected influential points. Influential points can be an issue because they affect the model estimates. It is important to determine whether such data points are "bad" values from mistakes in data collection or whether they represent actual values. If outliers are determined to be from mistakes, they can be removed from the sample data, but the removal should be mentioned in the analysis.

***Collinearity***

From the *Variance Inflation Factors* (VIF) for the five regressors, none of the square roots of the VIF values is greater than 2, suggesting collinearity is not present.
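
For reference, the VIF of a regressor is $1/(1-R_j^2)$, where $R_j^2$ comes from regressing that regressor on all the others. The sketch below computes this directly in base R on simulated data; the data frame `sim` and its column ranges are assumptions for illustration, and for numeric regressors like these the values should match those from `car::vif()`.

```r
# Simulated regressors (hypothetical); CA is built to depend mildly on the others
set.seed(3)
n <- 82
sim <- data.frame(C = runif(n, 140, 370),
                  F = runif(n, 0, 260),
                  W = runif(n, 160, 240))
sim$CA <- 1900 - sim$C - sim$F - sim$W + rnorm(n, sd = 60)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest
vif_manual <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  r2 <- summary(lm(reformulate(others, v), data = sim))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 3)
sqrt(vif_manual) > 2   # the rule of thumb used in the text
```

A VIF of 1 means a regressor is uncorrelated with the others; the square-root rule flags regressors whose coefficient standard errors are at least doubled by collinearity.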


***Conclusion***

From reviewing the conditions, the linear model is appropriate as the conditions are valid and there are no influential points or collinearity between variables.

### 1. Linearity Condition

```{r}
# Fit the five-regressor model on the training data
css <- lm(Strength ~ C + F + W + CA + FA, data = train_data)

# Quantities shared by the diagnostic plots in the following tabs
Fitted.Values <- css$fitted.values
Residuals <- css$residuals
# as.numeric() drops the one-column matrix that scale() returns
Standardized.Residuals <- as.numeric(scale(css$residuals))
Theoretical.Quantiles <- qqnorm(Residuals, plot.it = FALSE)$x
Root.Residuals <- sqrt(abs(Standardized.Residuals))
Leverage <- lm.influence(css)$hat
diagnostics <- data.frame(Fitted.Values, Residuals, Standardized.Residuals, Theoretical.Quantiles, Root.Residuals, Leverage)
m <- list(l = 100, r = 100, b = 100, t = 100, pad = 4)

# Interactive residuals vs. fitted plot; margin is a layout-level argument
p1 <- plot_ly(diagnostics, x = Fitted.Values, y = Residuals, 
              type = "scatter", mode = "markers", 
              hoverinfo = "x+y", name = "Data", 
              marker = list(size = 10, opacity = 0.5)) %>%
  layout(title = "Residuals vs. Fitted Values", 
         xaxis = list(title = "Fitted Values", font = list(size = 14)), 
         yaxis = list(title = "Residuals", font = list(size = 14)),
         margin = m)
ggplotly(p1)
# Base R version of the same diagnostic
plot(css, 1)
```


Using the ***Residuals vs. Fitted Plot***, there are no distinct patterns, which indicates a linear relationship between the concrete compressive strength ($strength$) and the regressors: portland cement ($C$), fly ash ($F$), water ($W$), coarse aggregates ($CA$) and fine aggregates ($FA$).


### 2. Zero Mean Condition

In *Ordinary Least Squares* (OLS) regression with an intercept, the residuals average to zero by construction, so this condition is always satisfied.
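
The zero-mean property can be checked numerically. The sketch below is only an illustration on simulated data (the data frame `sim` and its coefficients are hypothetical, not the concrete data set); it shows that the residuals of a fit with an intercept average to zero up to floating-point error, while a fit without an intercept carries no such guarantee.

```r
# Hypothetical simulated data, used only to illustrate the property.
set.seed(1)
n <- 80
sim <- data.frame(C = runif(n, 140, 370), W = runif(n, 160, 240))
sim$Strength <- 40 + 0.08 * sim$C - 0.19 * sim$W + rnorm(n, sd = 3)

fit <- lm(Strength ~ C + W, data = sim)             # includes an intercept
fit_no_int <- lm(Strength ~ 0 + C + W, data = sim)  # intercept removed

# Mean residual of each fit
mean_with_intercept <- mean(resid(fit))
mean_without_intercept <- mean(resid(fit_no_int))

print(c(with_intercept = mean_with_intercept,
        without_intercept = mean_without_intercept))
```

The same check on the dashboard's model would be `mean(resid(css))`.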

### 3. Independent Condition

```{r}
plot(css$residuals, main="Indexed Residuals",ylab="residual")
```

Assuming the data is indexed in the order it was collected, the ***Indexed Residuals Plot*** indicates the residuals have no specific patterns, suggesting the residuals are independent. Without knowing the order in which the data was collected, the independent condition cannot be confirmed.
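
A numeric companion to the indexed plot is the Durbin-Watson statistic, which measures first-order correlation between consecutive residuals. The sketch below computes it by hand in base R on simulated data (the data frame `sim` is hypothetical). Like the plot, it is only meaningful if the row order reflects the collection order, which is an assumption here.

```r
# Hypothetical simulated data, used only to illustrate the calculation.
set.seed(2)
n <- 80
sim <- data.frame(C = runif(n, 140, 370))
sim$Strength <- 20 + 0.08 * sim$C + rnorm(n, sd = 3)

fit <- lm(Strength ~ C, data = sim)
e <- resid(fit)

# Durbin-Watson statistic: sum of squared successive differences of the
# residuals divided by the residual sum of squares. It lies between 0 and 4;
# values near 2 suggest no first-order autocorrelation.
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 correlation of the residuals, for comparison
r1 <- cor(e[-1], e[-n])

print(c(durbin_watson = dw, lag1_correlation = r1))
```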

### 4. Equal Variance (aka Constant Variance) Condition

```{r}
p3 <- plot_ly(diagnostics, x = Fitted.Values, y = Root.Residuals, type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data", marker = list(size = 10, opacity = 0.5), showlegend = F)%>%
  layout(title = "Scale-Location", xaxis = list(title="Fitted Values", font=list(size=14)), yaxis = list(title="sqrt(|Standardized Residuals|)", font=list(size=14)), font=list(size=14), margin=m)

ggplotly(p3)
plot(css,3)
```

Using the ***Scale-Location Plot***, the points appear to be random with no indication of a pattern, suggesting the constant variance condition is satisfied.
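
The visual check can be supplemented with a formal test. The sketch below hand-computes a studentized (Koenker) Breusch-Pagan statistic in base R on simulated data (the data frame `sim` is hypothetical). It regresses the squared residuals on the regressors; a small p-value would suggest the variance changes with the regressors. It is offered as one possible check, not as part of the original analysis.

```r
# Hypothetical simulated data, used only to illustrate the calculation.
set.seed(3)
n <- 80
sim <- data.frame(C = runif(n, 140, 370), W = runif(n, 160, 240))
sim$Strength <- 40 + 0.08 * sim$C - 0.19 * sim$W + rnorm(n, sd = 3)

fit <- lm(Strength ~ C + W, data = sim)

# Auxiliary regression of the squared residuals on the same regressors
sim$e2 <- resid(fit)^2
aux <- lm(e2 ~ C + W, data = sim)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared against a chi-squared distribution with one degree of freedom
# per regressor (two here).
bp_stat <- n * summary(aux)$r.squared
bp_df <- 2
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(statistic = bp_stat, p_value = bp_p))
```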

### 5. Normality Condition

```{r}
p2 <- plot_ly(diagnostics, x = Theoretical.Quantiles, y = Standardized.Residuals, type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data", marker = list(size = 10, opacity = 0.5), showlegend = F)%>%
  add_trace(x = Theoretical.Quantiles, y = Theoretical.Quantiles, type = "scatter", mode = "lines", name = "", line = list(width = 2))%>%
  layout(title = "Q-Q Plot",xaxis = list(title="Theoretical Quantiles", font=list(size=14)), yaxis = list(title="Standardized Residuals", font=list(size=14)),font=list(size=14), margin=m)

ggplotly(p2)
plot(css,2)
```

Using the ***Q-Q Plot***, the standardized residuals lie along the 45-degree line, indicating the normality condition is satisfied.
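
A Q-Q plot can be paired with the Shapiro-Wilk test from base R's `stats` package. The sketch below applies `shapiro.test()` to the standardized residuals of a model fitted to simulated data (the data frame `sim` is hypothetical). A large p-value is consistent with normal residuals, though it does not prove normality.

```r
# Hypothetical simulated data, used only to illustrate the call.
set.seed(4)
n <- 80
sim <- data.frame(C = runif(n, 140, 370), W = runif(n, 160, 240))
sim$Strength <- 40 + 0.08 * sim$C - 0.19 * sim$W + rnorm(n, sd = 3)

fit <- lm(Strength ~ C + W, data = sim)

# Shapiro-Wilk test on the internally studentized residuals
sw <- shapiro.test(rstandard(fit))
sw_stat <- unname(sw$statistic)  # W statistic, at most 1
sw_p <- sw$p.value

print(c(W = sw_stat, p_value = sw_p))
```

For the dashboard's model, the corresponding call would be `shapiro.test(rstandard(css))`.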


### Influential Points

```{r}

s <- loess.smooth(Leverage, Residuals)
p4 <- plot_ly(diagnostics, x = Leverage, y = Residuals, type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data", marker = list(size = 10, opacity = 0.5), showlegend = F) %>% 
  add_trace(x = s$x, y = s$y, type = "scatter", mode = "lines", name = "Smooth", line = list(width = 2)) %>% 
  layout(title = "Leverage vs Residuals", xaxis = list(title="Leverage", font=list(size=14)), yaxis = list(title="Residuals", font=list(size=14)), font=list(size=14), margin=m)
ggplotly(p4) 
plot(css,4)
```

Using the *Leverage vs. Residuals* plot, there are no points that have leverage values greater than 1, suggesting there are no suspected influential points. Influential points can be an issue because they affect the model estimates. It is important to determine whether such data points are "bad" values from mistakes in data collection or whether they represent actual values. If they are determined to be mistakes, they can be removed from the sample data, but the removal should be mentioned in the analysis.
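
Because hat values always lie between 0 and 1, a cutoff of 1 on leverage alone cannot flag any point. Two common rules of thumb are sketched below on simulated data (the data frame `sim` is hypothetical): leverage above $2p/n$, where $p$ is the number of estimated coefficients, and Cook's distance above 1. These cutoffs are conventions rather than part of the original analysis.

```r
# Hypothetical simulated data, used only to illustrate the screening rules.
set.seed(5)
n <- 80
sim <- data.frame(C = runif(n, 140, 370), W = runif(n, 160, 240))
sim$Strength <- 40 + 0.08 * sim$C - 0.19 * sim$W + rnorm(n, sd = 3)

fit <- lm(Strength ~ C + W, data = sim)

h <- hatvalues(fit)       # leverage of each observation, bounded by 1
d <- cooks.distance(fit)  # combines leverage and residual size
p <- length(coef(fit))    # number of estimated coefficients

# Rule-of-thumb screens (conventions, not strict tests)
high_leverage <- which(h > 2 * p / n)
high_cook <- which(d > 1)

print(list(leverage_cutoff = 2 * p / n,
           high_leverage = high_leverage,
           high_cook = high_cook))
```

For the dashboard's model, `cooks.distance(css)` gives the values drawn by `plot(css, 4)`.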

### Collinearity
In addition to linearity between the regressors and the response, it is also important to assess near-linear dependence among the regressors.

```{r}
sqrt(vif(css))
```
From the *Variance Inflation Factors* (VIF) for the five regressors, none of the square roots of the VIF values is greater than 2, suggesting collinearity is not present.
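
To make the VIF concrete: for regressor $j$, $VIF_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing that regressor on all the others. The sketch below computes this by hand in base R on simulated data (the data frame `sim` is hypothetical, with `FA` deliberately made correlated with `C`) and cross-checks it against the diagonal of the inverse correlation matrix.

```r
# Hypothetical simulated regressors; FA is built to correlate with C.
set.seed(6)
n <- 80
sim <- data.frame(C = runif(n, 140, 370), W = runif(n, 160, 240))
sim$FA <- 900 - 0.5 * sim$C + rnorm(n, sd = 30)

X <- sim[, c("C", "W", "FA")]

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing X_j on the others
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

# Equivalent shortcut: diagonal of the inverse correlation matrix
vif_from_cor <- diag(solve(cor(X)))

print(rbind(vif_manual, vif_from_cor, sqrt_vif = sqrt(vif_manual)))
```

A square root of VIF above 2 means the standard error of that coefficient is more than twice what it would be with uncorrelated regressors, which is the rule of thumb applied above.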

Result Discussion
=======================================================================
Column {.tabset data-width=400} 
-----------------------------------------------------------------------
### Discussion

***Discussion***

The significant variables for the concrete compressive strength model are also confirmed in literature:

* increasing cement content increases compressive strength due to the chemical reaction between cement and water, which produces an extremely strong, rock-like material (Ref [2])

* fly ash content and aggregate content also have a significant effect on the compressive strength of concrete (Ref [3])

* excess water content decreases the compressive strength due to the formation of voids upon evaporation (Ref [2])

### Conclusions

***Conclusions***

*Which concrete components have a significant effect on the concrete compressive strength?*

* portland cement (“C”)
* fly ash (“F”)
* water (“W”)
* coarse aggregates (“CA”)
* fine aggregates (“FA”)

The concrete components blast furnace slag (“S”) and superplasticizer (“SP”) were determined to not have a significant effect on concrete compressive strength and were not included in the model describing strength.


*What is the linear model describing the concrete compressive strength in terms of the significant concrete components?*

$\hat{strength}=$ `r css$coefficients[1]` + `r css$coefficients[2]`$C$ + `r css$coefficients[3]`$F$ `r css$coefficients[4]`$W$ `r css$coefficients[5]`$CA$ `r css$coefficients[6]`$FA$.

The model was developed using the training data set (80% of the original data), and 87.95% of the variability in strength is accounted for by the model. Additionally, the model was validated using the test data set (20% of the original data) and had an error rate of `r Rootmean*100`%, which is relatively low. Thus, the model is decent at predicting the compressive strength given the amount of each significant concrete component.
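
The train/validate workflow can be sketched as follows on simulated data (the data frame `sim` is hypothetical). The definition of `Rootmean` is not shown in this section, so the relative error below, the root mean squared error divided by the mean observed test strength, is an assumed normalization and may differ from the one used in the analysis.

```r
# Hypothetical simulated data, used only to illustrate hold-out validation.
set.seed(7)
n <- 100
sim <- data.frame(C = runif(n, 140, 370), W = runif(n, 160, 240))
sim$Strength <- 40 + 0.08 * sim$C - 0.19 * sim$W + rnorm(n, sd = 3)

# 80/20 split into training and test sets
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- sim[train_idx, ]
test <- sim[-train_idx, ]

# Fit on the training set, predict on the held-out test set
fit <- lm(Strength ~ C + W, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared prediction error on the test set
rmse <- sqrt(mean((test$Strength - pred)^2))
# Assumed normalization: RMSE relative to the mean observed test strength
rel_rmse <- rmse / mean(test$Strength)
# Share of test-set variability explained by the predictions
r2_test <- 1 - sum((test$Strength - pred)^2) /
  sum((test$Strength - mean(test$Strength))^2)

print(c(rmse = rmse, relative_rmse = rel_rmse, r2_test = r2_test))
```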

*Why is this model important?*

This linear model describes the effect of concrete components on the compressive strength, which allows the concrete materials to be optimized. This optimization results in less waste of materials and concrete with the desired properties.


### Future Work

***Future Work***

* the data set is limited to 103 observations; more data may provide a clearer understanding of the factors that affect concrete compressive strength
* properties other than compressive strength are also important for understanding how concrete behaves during use and which factors matter, specifically for recyclability purposes
* investigating transformations of the concrete slump and flow responses from the cone test
* looking into other models that may be more appropriate for the concrete compressive strength data


### References

***References***

[1] Yeh, I-Cheng, "Modeling slump flow of concrete using second-order regressions and artificial neural networks," Cement and Concrete Composites, Vol. 29, No. 6, 474-480, 2007.

[2] Tabsh, Sami W. and Abdelfatah, Akmal S., "Influence of recycled concrete aggregates on strength properties of concrete," Construction and Building Materials, 2009, ISSN 0950-0618, https://doi.org/10.1016/j.conbuildmat.2008.06.007.

[3] Wongpa, J., Kiattikomol, K., Jaturapitakkul, C. and Chindaprasirt, P., "Compressive strength, modulus of elasticity, and water permeability of inorganic polymer concrete," Materials & Design, 2010, ISSN 0261-3069, https://doi.org/10.1016/j.matdes.2010.05.012. Retrieved from: http://www.sciencedirect.com/science/article/pii/S0261306910002931