Stats 195 Final Project

Names: Brian Wang, Greses Perez, Jugal Gala

Load data into R

getwd()
## [1] "/Users/brianwang/Desktop/Education Data Project"
setwd("/Users/brianwang/Desktop/Education Data Project")
district.means <- read.csv("district means grade-subject std (cs) by year grade subject (long file)_v1_1.csv")
district.covariates <- read.csv("district by year by grade covariates from acs and ccd master_v1_1.csv")

I want to merge the two datasets into one based on the LEA ID (Local Education Agency ID), year, and grade.

district.combined <- merge(district.means, district.covariates, by = c("leaid", "year", "grade"))

This dataset crashed on me a few times when I was plotting it, so I’m going to just work with California data.

district.ca.subset <- subset(district.combined, district.combined$stateabb.x == "CA")

To view how the socioeconomic status of a district is correlated with the performance of that district, we can plot the district English language arts means vs. the SES composite variable provided by the data set. We can also do the same thing for the district math means.

Load ggplot2 and plotly libraries

library(ggplot2); library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Plot ELA means vs. SES composite variable

ggplot(district.ca.subset, aes(x = sesall, y = cstdmn_ela)) + geom_point(alpha = 0.6) + labs(title = "CA District ELA Means vs. SES Composite Variable", x = "Socioeconomic Status", y = "California District ELA Means")
## Warning: Removed 1411 rows containing missing values (geom_point).

Plot math means vs. SES composite variable

ggplot(district.ca.subset, aes(x = sesall, y = cstdmn_math)) + geom_point(alpha = 0.6) + labs(title = "CA District Math Means vs. SES Composite Variable", x = "Socioeconomic Status", y = "California District Math Means")
## Warning: Removed 7863 rows containing missing values (geom_point).

Next, the NYT article provides some visualization based on demographic / racial characteristics of districts. I will try to do something similar by coloring the previous graphs on a scale based on the ‘perwht’ variable, which gives the percentage of white students per grade, to see if there’s any correlation between the percentage of white students and the performance of a district. Racial percentages are not used to calculate the SES composite variable according to the dataset so this should not be an issue.

Plot ELA means vs. SES composite variable, coloring based on % White variable

ggplot(district.ca.subset, aes(x = sesall, y = cstdmn_ela)) + geom_point(alpha = 0.6, aes(color = perwht)) + labs(title = "CA District ELA Means vs. SES Composite Variable", x = "Socioeconomic Status", y = "California District ELA Means") + scale_colour_continuous(name = "% White Students In Grade") + theme(legend.position = "top")
## Warning: Removed 1411 rows containing missing values (geom_point).

Plot math means vs. SES composite variable, coloring based on % White variable

ggplot(district.ca.subset, aes(x = sesall, y = cstdmn_math)) + geom_point(alpha = 0.6, aes(color = perwht)) + labs(title = "CA District Math Means vs. SES Composite Variable", x = "Socioeconomic Status", y = "California District Math Means") + scale_colour_continuous(name = "% White Students In Grade") + theme(legend.position = "top")
## Warning: Removed 7863 rows containing missing values (geom_point).

From the data, it seems like there is a high correlation between mostly white areas and higher socioeconomic status. There also seems to be some correlation between predominantly white areas and the strongest mean district performance in ELA and math, though there are definitely many not predominantly white district that perform above the average.

Next, I’m interested in learning if there are any differences in ELA and Mathematics means regionally. So, I will add a Region column to the original combined dataset to assign regions to different states.

westRegion <- c("CA", "OR", "WA", "NV", "ID", "UT", "WY", "MT", "CO", "HI", "AK")
southwestRegion <- c("AZ", "NM", "TX", "OK")
midwestRegion <- c("ND", "SD", "NE", "KS", "MO", "IA", "IL", "MN", "WI", "IN", "MI", "OH")
southeastRegion <- c("LA", "AR", "MS", "AL", "GA", "TN", "KY", "WV", "VA", "NC", "SC", "FL")
northeastRegion <- c("PA", "MD", "DE", "NJ", "NY","CT", "RI", "MA", "NH", "ME", "VT")
district.combined$Region[district.combined$stateabb.x %in% westRegion] <- "West"
district.combined$Region[district.combined$stateabb.x %in% southwestRegion] <- "Southwest"
district.combined$Region[district.combined$stateabb.x %in% midwestRegion] <- "Midwest"
district.combined$Region[district.combined$stateabb.x %in% southeastRegion] <- "Southeast"
district.combined$Region[district.combined$stateabb.x %in% northeastRegion] <- "Northeast"

Because there are an extremely large number of points, I want to subset the data by region to accurately and best visualize the data clearly.

Subset ‘means’ dataset into five subsets based on Region

district.combined.west <- subset(district.combined, district.combined$Region == "West")
district.combined.sw <- subset(district.combined, district.combined$Region == "Southwest")
district.combined.mw <- subset(district.combined, district.combined$Region == "Midwest")
district.combined.se <- subset(district.combined, district.combined$Region == "Southeast")
district.combined.ne <- subset(district.combined, district.combined$Region == "Northeast")

Plot ELA and math means for West region, coloring by state

ggplot(district.combined.west, aes(cstdmn_ela, cstdmn_math)) + geom_point(aes(color = stateabb.x), alpha = 0.5) + labs(title = "Scatterplot of District ELA/Math Means for States in West Region", x = "District ELA Means", y = "District Math Means")
## Warning: Removed 7107 rows containing missing values (geom_point).

This doesn’t look that nice. There are still too many datapoints to visualize neatly on one graph, so I will try plotting boxplots of district means for each state to analyze state performance within the region a different way.

(p1 <- ggplot(district.combined.west, aes(x = stateabb.x, y = cstdmn_ela)) + geom_boxplot() + labs(title = "Boxplots of District ELA Means for States in West Region", x = "State", y = "District ELA Means"))
## Warning: Removed 110 rows containing non-finite values (stat_boxplot).

This is more helpful for looking at the data at the state level. From the data, we can see that Colorado districts are the strongest performing, on average, in the West region. We can now plot the same thing for math district means for each state in the West region.

(p2 <- ggplot(district.combined.west, aes(x = stateabb.x, y = cstdmn_math)) + geom_boxplot() + labs(title = "Boxplots of District Math Means for States in West Region", x = "State", y = "District Math Means"))
## Warning: Removed 6997 rows containing non-finite values (stat_boxplot).

I think now I want to try building a predictive model based off of the combined dataset. I will try predicting the math performance of a district based off of the following variables:

baplus_all - % of adults with ba+ (all) poverty517_all - % of hh with 5-17 yr olds in poverty (all) singmom_all - % hh with children, female head (all) snap_all - % of hh receiving snap benefits (all) samehouse_all - % living in same house as last year (all) unemp_all - unemployed (all) teenbirth_all - percent of 15-19 year olds giving birth

Based off of general stereotypes of what characteristics a “better” school district possesses, I would expect ba_plus_all to have a positive effect on the math performance of a district. I’m not sure what effect same_house_all (living in the same house) will have, but I will guess positive as it seems to indicate household stability. For the rest of the variables I would guess a negative relation.

Fit a linear model to the data

fit = lm(cstdmn_math ~ baplus_all + poverty517_all + singmom_all + snap_all + samehouse_all + unemp_all + teenbirth_all, data = district.combined.west)

Check on the details of the fit

summary(fit, correlation = TRUE)
## 
## Call:
## lm(formula = cstdmn_math ~ baplus_all + poverty517_all + singmom_all + 
##     snap_all + samehouse_all + unemp_all + teenbirth_all, data = district.combined.west)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0225 -0.1950  0.0045  0.1981  1.9338 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.12831    0.02222   5.775 7.76e-09 ***
## baplus_all      1.10362    0.01366  80.766  < 2e-16 ***
## poverty517_all -0.56226    0.02584 -21.761  < 2e-16 ***
## singmom_all    -0.38877    0.02356 -16.500  < 2e-16 ***
## snap_all        0.10105    0.02466   4.097 4.19e-05 ***
## samehouse_all  -0.13449    0.02380  -5.650 1.62e-08 ***
## unemp_all      -4.02143    0.07465 -53.873  < 2e-16 ***
## teenbirth_all  -0.17443    0.03932  -4.436 9.17e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3131 on 35892 degrees of freedom
##   (8916 observations deleted due to missingness)
## Multiple R-squared:  0.4122, Adjusted R-squared:  0.4121 
## F-statistic:  3596 on 7 and 35892 DF,  p-value: < 2.2e-16
## 
## Correlation of Coefficients:
##                (Intercept) baplus_all poverty517_all singmom_all snap_all
## baplus_all     -0.19                                                     
## poverty517_all -0.03        0.21                                         
## singmom_all    -0.28       -0.11      -0.19                              
## snap_all       -0.06        0.23      -0.49          -0.24               
## samehouse_all  -0.95       -0.03      -0.05           0.18        0.05   
## unemp_all      -0.12        0.11      -0.08          -0.10       -0.22   
## teenbirth_all  -0.04        0.05      -0.08           0.02       -0.13   
##                samehouse_all unemp_all
## baplus_all                            
## poverty517_all                        
## singmom_all                           
## snap_all                              
## samehouse_all                         
## unemp_all       0.01                  
## teenbirth_all   0.02          0.03

From the model, the strongest predictor of stronger district math performance is the % of adults with bachelor’s degrees, as predicted. The strongest predictors of worse district math performance are the unemployment rate (by far), the poverty variable, and the single mom variable. It’s interesting that the other variables don’t have much of an effect on the district performance, at least in this model. Also, clearly there is a lot of variability in the data that isn’t explained by this model, as the R-squared value is pretty low.

Fit the same model to California data only

fit2 = lm(cstdmn_math ~ baplus_all + poverty517_all + singmom_all + snap_all + samehouse_all + unemp_all + teenbirth_all, data = district.ca.subset)
summary(fit2)
## 
## Call:
## lm(formula = cstdmn_math ~ baplus_all + poverty517_all + singmom_all + 
##     snap_all + samehouse_all + unemp_all + teenbirth_all, data = district.ca.subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.78445 -0.17473  0.00961  0.17708  1.91569 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.46769    0.03788 -12.346   <2e-16 ***
## baplus_all      1.37710    0.01907  72.225   <2e-16 ***
## poverty517_all -0.38984    0.04416  -8.827   <2e-16 ***
## singmom_all    -0.57983    0.03609 -16.067   <2e-16 ***
## snap_all        0.02567    0.04651   0.552   0.5810    
## samehouse_all   0.09141    0.03969   2.303   0.0213 *  
## unemp_all      -0.34318    0.13434  -2.555   0.0106 *  
## teenbirth_all  -0.17913    0.07249  -2.471   0.0135 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2939 on 13255 degrees of freedom
##   (7863 observations deleted due to missingness)
## Multiple R-squared:  0.5246, Adjusted R-squared:  0.5243 
## F-statistic:  2089 on 7 and 13255 DF,  p-value: < 2.2e-16

It’s interesting to see that when we fit the same model to California data only, not all the predictors are significant. I think it’s especially interesting that unemployment rate is not as strongly related to the math performance of a district for California.

Plot histograms of demographic composition for school districts

district.covariates.long <- reshape(district.covariates,
  varying = c("perind", "perasn", "perhsp", "perblk", "perwht"),
  v.names = "percent",
  timevar = "per.race",
  times = c("% Native American", "% Asian", "% Hispanic", "% Black", "% White"),
  direction = "long")
district.covariates.long.per.race <- district.covariates.long[,c("leaid", "year", "grade", "per.race", "percent")]
ggplot(district.covariates.long, aes(percent,  fill = per.race)) + geom_histogram(alpha = 0.7) + xlim(0, 0.25)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 458561 rows containing non-finite values (stat_bin).

For this graph, we wanted to show percentage of racial demographics breakdown for grade 3 for three districts, 100005, 100006, 100007. To do this, first we subset the data into a smaller dataset with the appropriate datapoints. Then we plotted the graph.

newdata <- subset(district.covariates.long.per.race, leaid <= 100007 & year == 2009 & grade <= 5)
plot.data <- newdata[,c(1,4,5)]
library(latticeExtra)
## Loading required package: lattice
## Loading required package: RColorBrewer
## 
## Attaching package: 'latticeExtra'
## The following object is masked from 'package:ggplot2':
## 
##     layer
plot.data$leaid <- as.factor(plot.data$leaid)
plot.data$per.race <- as.factor(plot.data$per.race)
cloud(percent ~ leaid + per.race, plot.data, panel.3d.cloud=panel.3dbars, 
      xbase = 0.4, ybase = 0.4, scales = list(arrows=FALSE, col=1, cex=0.5), 
      par.settings = list(axis.line = list(col = "transparent")), xlab = NULL, ylab = NULL, zlab = NULL)