Is there a correlation between population density (people per square mile) and depression rates (per capita) in the State of California between the years 2000 and 2020?
Depression is a serious and widespread mental health issue that affects people all over the world. Worldwide, the depression rate is currently about 4.4%; in the U.S. it is about 5.9% [1], and rates appear to be increasing, especially among younger populations [2].
There have been many scientific inquiries into the causes of depression and the environmental risk factors that may play a role in its development. Of particular interest to us is how population density may be associated with depression. As college students, we have all recently had to move and adapt to a very specific and highly populated environment that, in many ways, differs substantially from what we are used to.
As such, we are interested in investigating how population density, which may affect factors like social interactions, loneliness, emotional health, and more, might influence mental health. There are a few studies that have already looked at the related topic of urbanization, though many of them have disagreeing perspectives on how it affects depression.
For example, one such study collected medical data from patients in Spain and conducted a linear regression analysis, which found support for urbanization as one environmental factor that may contribute to a greater prevalence of depression (Llorente et al., 2018) [3]. A longitudinal study of depression among older people in China (He et al., 2020) [4] and a study of the Swedish population (Sundquist et al., 2018) [5] reached similar conclusions: namely, that urbanization and high population density are associated with higher rates of depression and other mental health disorders.
Conversely, a UCLA study used data from U.S. cities to build a multidimensional statistical model (much like the Spanish study mentioned above) to predict depression rates, and found evidence that larger cities were actually associated with lower depression rates (Stier et al., 2021) [6]. Overall, a literature review of studies on cities and mental health concluded that while cities tend to see higher rates of depression than rural areas, they are also associated with other factors, such as socioeconomic status and better healthcare access, that may confound the results (Gruebner et al., 2017) [7].
For the purposes of our study, we hypothesized that depression rates (i.e., proportion of a population affected by depression) would increase as population density increases.
As anecdotal evidence, we have noticed that socialization often takes a less personal and less fulfilling form when interacting with a larger group of people. Thus, living in a higher population density might worsen feelings of isolation or dissatisfaction in life, which could contribute to higher rates of depression (and other mental health issues).
From our background research, we know that there is conflicting evidence on this topic. Different researchers have associated urbanization with both higher and lower rates of depression. However, on the whole, we still predicted that an environment with high population density (as opposed to a city with high total population or development) negatively impacts individuals' mental health on a collective scale.
To further examine our hypothesis and to explore some of the meaningful relationships between depression and population density, we decided to include a temporal dimension to our data. Examining how depression and population change over time can provide more insight on changing environments and may help to predict how these factors may affect each other in the future.
To help answer our question, we needed datasets that had a lot of structure and consistency over a large area and long period of time. This led us to search specifically for government census data and regional medical data.
The primary variables we needed for our analysis were time, region, population density (or population size plus region area), and rates or incidences of depression. Datasets from SAMHSA, the U.S. Census, and LA County medical data supplied this information. Because medical data is not consistently available nationwide, we limited our analysis to LA County and its health districts over the past couple of decades.
To begin, let's import libraries to help us with webscraping, data wrangling, visualizations, and statistical analyses.
# importing libraries
# data handling
import pandas as pd
import numpy as np
import io
import csv
from tabulate import tabulate
# plotting
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter
# stats
from scipy import stats
# webscraping
import requests
from bs4 import BeautifulSoup
from pdfminer.high_level import extract_text
from functools import reduce
import os
import warnings
In this section, each dataset is standardized by making values and headers lowercase and stripping them of whitespace and punctuation. Empty values are dropped where necessary and justified.
We begin by defining functions to remove whitespace and punctuation:
# Standardization functions
def remove_comma(output):
    if "," in output:
        output = output.replace(',', '')
    return output

def remove_period(str_in):
    output = str_in
    try:
        output = str_in.lower()
        output = output.replace('.', '')
    except AttributeError:
        output = str_in
        print("error")
    return output

def remove_dash(output):
    if "-" in output:
        output = output.replace('-', '')
    return output

def remove_asterisk(output):
    if "*" in output:
        output = output.replace('*', '')
    return output

def remove_date(astr):
    return int(astr.replace("-01-01 00:00:00", ""))
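As a quick sanity check, here is a minimal sketch of how these helpers behave; the inputs below are hypothetical strings chosen only to illustrate each function.
# Hypothetical inputs to illustrate the cleaning helpers defined above
print(remove_comma("9,818,605"))                          # -> "9818605"
print(remove_period("Los Angeles County, California."))   # -> "los angeles county, california"
print(remove_dash("view park-windsor hills"))             # -> "view parkwindsor hills"
print(remove_asterisk("Los Angeles*"))                    # -> "Los Angeles"
print(remove_date("2018-01-01 00:00:00"))                 # -> 2018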
Let's start working with the first dataset.
# DATASET 1: reading Historical Population Density Data (1910-2020) data from github
population_url = 'https://raw.githubusercontent.com/jennynunez/032-DATA/main/apportionment.csv'
df1 = pd.read_csv(population_url) # Original shape of this data, starting from 1910, is (684,10)
# Drop all years except for 2000-2020
pop = df1.loc[(df1['Year'] >= 2000)]
#pop.shape # After filtering it to only contain the 2000s the shape is (171,10)
# Focus on states and drop observations with Regions & Nations
population = pop[pop['Geography Type'].str.contains("State")]
# Create a dataframe that focuses on the states & rename 'Geography Type' column to 'State'
population = population.rename(columns={'Name': 'name', 'Geography Type':'state', 'Year': 'year', 'Resident Population': 'res_pop_count', 'Percent Change in Resident Population': 'perc_change_res_pop', 'Resident Population Density': 'res_pop_density', 'Resident Population Density Rank': 'res_pop_density_rank'})
population = population.drop(population.columns[7:10], axis=1)
population = population.replace(',', '', regex=True)
population = population.astype({"res_pop_count": int, "perc_change_res_pop": float, "res_pop_density": float, "res_pop_density_rank": float})
# Filtered all states except California
population = population[population.name == 'California']
population = population.reset_index(drop=True)
population['state'] = population['state'].str.lower()
population['name'] = population['name'].str.lower()
population.head()
Now to the second dataset.
# DATASET 2: California counties population totals from 2010-2019 from github
counties_url = 'https://raw.githubusercontent.com/jennynunez/032-DATA/main/CA_counties_pop.csv'
df3 = pd.read_csv(counties_url)
# rename columns
df3 = df3.rename(columns={'table with row headers in column A and column headers in rows 3 through 4 (leading dots indicate sub-parts)': 'county',
'Unnamed: 1':'census', 'Unnamed: 2': 'estimates',
'Unnamed: 3': '2010',
'Unnamed: 4': '2011',
'Unnamed: 5': '2012',
'Unnamed: 6': '2013', 'Unnamed: 7': '2014',
'Unnamed: 8': '2015', 'Unnamed: 9': '2016',
'Unnamed: 10': '2017', 'Unnamed: 11': '2018',
'Unnamed: 12': '2019'})
# drop unneeded rows
df3.drop(
labels = [0,1,2,62,63,64,65,66],
axis = 0,
inplace=True
)
df3 = df3.reset_index(drop=True)
# standardize county names to lowercase and remove unnecessary punctuation
df3['county'] = df3['county'].apply(remove_period)
#Remove state
df3['county'] = df3['county'].apply(lambda str_in: str_in.replace(', california', ''))
#Remove commas
df3 = df3.applymap(lambda str_in: str_in.replace(',', ''))
df3.drop(labels = 0, axis=0, inplace = True)
# apply the remove_comma function to the year columns
for i in range(0, 10):
    temp = str(2010 + i)
    df3[temp] = df3[temp].apply(remove_comma)
df3['census'] = df3['census'].apply(remove_comma)
df3['estimates'] = df3['estimates'].apply(remove_comma)
# store la county information in its own variable
la_pop = df3.loc[df3['county'] == 'los angeles county']
la_pop
For the third dataset:
# DATASET 3: California individuals counted by type of mental health diagnosis through the years 2014-2019
# converting pdf to csv
box = [1, 2, 7, 9]   # Table margins from the PDF document [top, left, bottom, width], in cm
cm_to_pdf = 28.28    # Converts our margin measurements from cm to PDF points
for i in range(0, len(box)):
    box[i] *= cm_to_pdf
mh_url = 'https://www.samhsa.gov/data/sites/default/files/reports/rpt35253/MHCLD-2019-R-FINAL.pdf'
# mh_disorders = tabula.read_pdf(mh_url, pages='438')  # Using tabula to read the pdf
# Convert this tabulated data to CSV (commented out for faster shell processing)
# tabula.convert_into(mh_url, "mh.csv", pages="438", output_format="csv", stream=True)
# !cat mh.csv
mh_url2 = 'https://raw.githubusercontent.com/jennynunez/032-DATA/main/mh.csv' #to dataframe
mh_disorders = pd.read_csv(mh_url2)
# rename columns
mh_disorders = mh_disorders.rename(columns={'Selected characteristics': 'disorder_type',
'Unnamed: 1':'totals', '2014': '2014',
'2015': '2015', '2016': '2016',
'2017': '2017', '2018': '2018',
'2019': '2019'})
# Split the gender and age breakdowns into their own dataframes first (*may delete)
mh_genders = mh_disorders.iloc[13:17, :]
mh_ages = mh_disorders.iloc[18:33, :]
# Then keep only the disorder-type rows
mh_disorders = mh_disorders.iloc[0:13, :]
# Clean NaN values for disorder types df
mh_disorders = mh_disorders.drop('totals', axis=1)
mh_disorders = mh_disorders.dropna(axis=0)
mh_disorders = mh_disorders.reset_index(drop=True)
# apply the remove_comma function to the year columns
for i in range(0, 6):
    temp = str(2014 + i)
    mh_disorders[temp] = mh_disorders[temp].apply(remove_comma)
mh_disorders['disorder_type'] = mh_disorders['disorder_type'].str.lower()
mh_disorders['disorder_type'] = mh_disorders['disorder_type'].apply(remove_dash)
mh_disorders
The fourth dataset:
# DATASET 4: Los Angeles mental health data of people diagnosed with depression (ever)
la_url = 'https://raw.githubusercontent.com/jennynunez/032-DATA/main/la_demos_cleaned.csv'
df4 = pd.read_csv(la_url, index_col=0)
# We want to drop the index name because it is not useful
df4.index.name = None
# We will now reset the index and remove unhelpful values
df4 = df4.T.reset_index().reset_index(drop=True)
# Renaming columns that have incorrect labels
df4 = df4.rename(columns={'index': 'year'})
# Removing percentage symbol throughout the DataFrame
df4 = df4.replace('%', '', regex=True)
# Converting DataFrame to numeric values
df4 = df4.apply(pd.to_numeric)
# Properly formatting years in datetime format so Seaborn doesn't convert the years to floats
df4['year'] = pd.to_datetime(df4.year.astype(str), format="%Y")
# Renaming 'LA Overall' column to be more descriptive about what the data is telling us
overall = df4.rename(columns={'LA Overall': 'population_depression_diagnosis_percent'})
# The overall LA county data is in the first few rows so we are using a subset of the data
overall = overall.iloc[:, 0:2]
df4
Our fifth dataset required a bit more work:
# DATASET 5: Districts in LA county populations
# First use BeautifulSoup to scrape the table data and convert to pandas dataframe
la_pop_url = 'http://www.laalmanac.com/population/po03.php'
r = requests.get(la_pop_url)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find_all('table')[0]
# remove multi-headers and set relevant column names
pop_est = pd.read_html(str(table), header=None)[0]
pop_est.columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990']
# For newer cities that didn't exist in 1990, we'll remove values that appear as "---(2)"
pop_est = pop_est[(pop_est['pop_2000'] != "---(2)")]
pop_est = pop_est[(pop_est['pop_1990'] != "---(3)")] # reduced size to 129 cities
pop_est = pop_est.iloc[1: , :]
pop_est = pop_est.reset_index(drop=True)
# Standardize: remove "*" and "-" characters and make city names lowercase
pop_est['city'] = pop_est['city'].apply(remove_asterisk)
pop_est['city'] = pop_est['city'].str.lower()
pop_est['city'] = pop_est['city'].apply(remove_dash)
# Make all estimates numeric
pop_est[['pop_2020', 'pop_2010', 'pop_2000', 'pop_1990']] = pop_est[['pop_2020', 'pop_2010', 'pop_2000', 'pop_1990']].apply(pd.to_numeric)
pop_est
After standardizing dataset 5, we wanted to combine it with dataset 4 (depression diagnosis rates in LA). We merged the dataframes together and dealt with data from missing cities individually.
# Now we want to filter only the cities that are also recorded from our mental health dataset
# First flip df4 using transpose so cities are row indexes
flipped = df4.transpose()
flipped = flipped.reset_index()
flipped = flipped.iloc[1: , :]
flipped.columns = ['city','mh_2018', 'mh_2015', 'mh_2011', 'mh_2007', 'mh_2005', 'mh_2002', 'mh_1999']
flipped['city'] = flipped['city'].str.lower()
flipped[['mh_2018', 'mh_2015', 'mh_2011', 'mh_2007', 'mh_2005', 'mh_2002', 'mh_1999']] = flipped[['mh_2018', 'mh_2015', 'mh_2011', 'mh_2007', 'mh_2005', 'mh_2002', 'mh_1999']].apply(pd.to_numeric)
# Merge once all values have been made numeric and city names have been standardized
# First attempt: merging them with an inner join
# df6 = pd.merge(flipped, pop_est, on='city', how='inner')
# This only matches 11 of the 26 districts we're looking for
# Several missing districts appear in pop_est under different labels (or as groups of cities);
# once we identify them, we'll manually extract their populations and append them to:
missing = pd.DataFrame()
# Antelope Valley comprises 5 cities. We will make a new df with those cities, sum their populations, and assign the total to this district.
# Antelope Valley (includes 'acton', 'palmdale', 'lancaster', 'lake los angeles', 'quartz hill')
ant_v = pd.DataFrame(columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
ant_v = ant_v.append({'city':'acton', 'pop_2020':7431,'pop_2010':7596, 'pop_2000':2390, 'pop_1990':1471}, ignore_index=True)
ant_v = ant_v.append({'city':'palmdale', 'pop_2020':169450,'pop_2010':152750, 'pop_2000':116670, 'pop_1990':68842}, ignore_index=True)
ant_v = ant_v.append({'city':'lancaster', 'pop_2020':173516,'pop_2010':156633, 'pop_2000':118718, 'pop_1990':97291}, ignore_index=True)
ant_v = ant_v.append({'city':'lake los angeles', 'pop_2020':13187,'pop_2010':12328, 'pop_2000':11523, 'pop_1990':7977}, ignore_index=True)
ant_v = ant_v.append({'city':'quartz hill', 'pop_2020':11447,'pop_2010':10912, 'pop_2000':9890, 'pop_1990':9626}, ignore_index=True)
ant_v = ant_v.set_index('city')
# Now that Antelope Valley's cities are in dataframe, we sum value of columns, and assign values to the missing dataframe
t1 = ant_v.sum()
t1.name = 'antelope valley'
ant_v = ant_v.append(t1) # assigning totals to a row
missing = missing.append({'city':'antelope valley', 'pop_2020':375031,'pop_2010':340219, 'pop_2000':259191, 'pop_1990':185207}, ignore_index=True)
# Now we will do the same for ALL missing cities (Central, East LA, East Valley, Foothill, Harbor, Hollywood/Wilshire, Northeast, San Antonio, South, Southeast, Southwest, west, west valley)
# Central LA (includes: 'los angeles')
cla = pd.DataFrame(columns=['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
cla = cla.append({'city':'central', 'pop_2020':3898747,'pop_2010':3792621, 'pop_2000':3694820, 'pop_1990':3485398}, ignore_index=True)
missing = missing.append({'city':'central', 'pop_2020':3898747,'pop_2010':3792621, 'pop_2000':3694820, 'pop_1990':3485398}, ignore_index=True)
# East LA (includes: 'east los angeles')
ela = pd.DataFrame(columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
ela = ela.append({'city':'east los angeles', 'pop_2020':118786,'pop_2010':126496, 'pop_2000':124283, 'pop_1990':126379}, ignore_index=True)
missing = missing.append({'city':'east la', 'pop_2020':118786,'pop_2010':126496, 'pop_2000':124283, 'pop_1990':126379}, ignore_index=True)
# East Valley (includes: 'san fernando','burbank', 'hidden hills')
eva = pd.DataFrame(columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
eva = eva.append({'city':'san fernando', 'pop_2020':23946,'pop_2010':23645, 'pop_2000':23564, 'pop_1990':22580}, ignore_index=True)
eva = eva.append({'city':'burbank', 'pop_2020':107337,'pop_2010':103340, 'pop_2000':100316, 'pop_1990':93643}, ignore_index=True)
eva = eva.append({'city':'hidden hills', 'pop_2020':1725,'pop_2010':1856, 'pop_2000':1875, 'pop_1990':1729}, ignore_index=True)
eva = eva.set_index('city')
t2 = eva.sum()
t2.name = 'east valley'
eva = eva.append(t2)
missing = missing.append({'city':'east valley', 'pop_2020':133008,'pop_2010':128841, 'pop_2000':125755, 'pop_1990':117952},ignore_index=True)
# Foothill (includes: 'san gabriel', 'south san gabriel', 'east san gabriel', 'south pasadena', 'san marino', 'east pasadena', 'sierra madre', 'glendora',
# 'san dimas', 'covina', 'west covina', 'charter oak', 'walnut', 'la puente', 'valinda', 'vincent', 'monterey park', 'rosemead', 'temple city', 'hacienda heights',
# 'rowland heights', 'citrus', 'south el monte', 'avocado heights', 'baldwin park', 'south san jose hills')
f = pd.DataFrame(columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
f = f.append({'city':'san gabriel', 'pop_2020':39568,'pop_2010':39718, 'pop_2000':39804, 'pop_1990':37120}, ignore_index=True)
f = f.append({'city':'south san gabriel', 'pop_2020':7920,'pop_2010':8070, 'pop_2000':7595, 'pop_1990':7700}, ignore_index=True)
f = f.append({'city':'east san gabriel', 'pop_2020':22769,'pop_2010':14874, 'pop_2000':14512, 'pop_1990':12736}, ignore_index=True)
f = f.append({'city':'south pasadena', 'pop_2020':26943,'pop_2010':25619, 'pop_2000':24292, 'pop_1990':23936}, ignore_index=True)
f = f.append({'city':'san marino', 'pop_2020':12513,'pop_2010':13147, 'pop_2000':12945, 'pop_1990':12959}, ignore_index=True)
f = f.append({'city':'east pasadena', 'pop_2020':6021,'pop_2010':6144, 'pop_2000':6045, 'pop_1990':5910}, ignore_index=True)
f = f.append({'city':'sierra madre', 'pop_2020':11268,'pop_2010':10917, 'pop_2000':10578, 'pop_1990':10762}, ignore_index=True)
f = f.append({'city':'glendora', 'pop_2020':52558,'pop_2010':50073, 'pop_2000':49415, 'pop_1990':47828}, ignore_index=True)
f = f.append({'city':'san dimas', 'pop_2020':34924,'pop_2010':33371, 'pop_2000':34980, 'pop_1990':32397}, ignore_index=True)
f = f.append({'city':'covina', 'pop_2020':51268,'pop_2010':47796, 'pop_2000':46837, 'pop_1990':43207}, ignore_index=True)
f = f.append({'city':'west covina', 'pop_2020':109501,'pop_2010':106098, 'pop_2000':105080, 'pop_1990':96086}, ignore_index=True)
f = f.append({'city':'charter oak', 'pop_2020':9739,'pop_2010':9310, 'pop_2000':9027, 'pop_1990':8858}, ignore_index=True)
f = f.append({'city':'walnut', 'pop_2020':28430,'pop_2010':29172, 'pop_2000':30004, 'pop_1990':29105}, ignore_index=True)
f = f.append({'city':'la puente', 'pop_2020':38062,'pop_2010':39816, 'pop_2000':41063, 'pop_1990':36955}, ignore_index=True)
f = f.append({'city':'valinda', 'pop_2020':22437,'pop_2010':22822, 'pop_2000':46837, 'pop_1990':18735}, ignore_index=True)
f = f.append({'city':'vincent', 'pop_2020':15714,'pop_2010':15922, 'pop_2000':15097, 'pop_1990':13713}, ignore_index=True)
f = f.append({'city':'monterey park', 'pop_2020':61096,'pop_2010':60269, 'pop_2000':60051, 'pop_1990':60738}, ignore_index=True)
f = f.append({'city':'rosemead', 'pop_2020':51185,'pop_2010':53764, 'pop_2000':53505, 'pop_1990':51638}, ignore_index=True)
f = f.append({'city':'temple city', 'pop_2020':36494,'pop_2010':35558, 'pop_2000':33377, 'pop_1990':31100}, ignore_index=True)
f = f.append({'city':'hacienda heights', 'pop_2020':54191,'pop_2010':54038, 'pop_2000':53122, 'pop_1990':52354}, ignore_index=True)
f = f.append({'city':'rowland heights', 'pop_2020':54191,'pop_2010':54038, 'pop_2000':53122, 'pop_1990':52354}, ignore_index=True)
f = f.append({'city':'citrus', 'pop_2020':10243,'pop_2010':10866, 'pop_2000':10581, 'pop_1990':9481}, ignore_index=True)
f = f.append({'city':'south el monte', 'pop_2020':19567,'pop_2010':20116, 'pop_2000':21144, 'pop_1990':20850}, ignore_index=True)
f = f.append({'city':'avocado heights', 'pop_2020':13317,'pop_2010':15411, 'pop_2000':15148, 'pop_1990':14232}, ignore_index=True)
f = f.append({'city':'baldwin park', 'pop_2020':72176,'pop_2010':75390, 'pop_2000':75837, 'pop_1990':69330}, ignore_index=True)
f = f.append({'city':'south san jose hills', 'pop_2020':19855,'pop_2010':20551, 'pop_2000':20218, 'pop_1990':17814}, ignore_index=True)
f = f.set_index('city')
t3 = f.sum()
t3.name = 'foothill'
f = f.append(t3)
missing = missing.append({'city':'foothill', 'pop_2020':881950,'pop_2010':872870, 'pop_2000':890216, 'pop_1990':817898},ignore_index=True)
# Harbor (includes: 'carson','west carson', 'long beach', 'lakewood','hawaiian gardens', 'signal hill')
h = pd.DataFrame(columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
h = h.append({'city':'carson', 'pop_2020':95558,'pop_2010':91714, 'pop_2000':89730, 'pop_1990':83995}, ignore_index=True)
h = h.append({'city':'west carson', 'pop_2020':22870,'pop_2010':21699, 'pop_2000':21138, 'pop_1990':20143}, ignore_index=True)
h = h.append({'city':'long beach', 'pop_2020':466742,'pop_2010':462257, 'pop_2000':461522, 'pop_1990':429433}, ignore_index=True)
h = h.append({'city':'lakewood', 'pop_2020':82496,'pop_2010':80048, 'pop_2000':79345, 'pop_1990':73557}, ignore_index=True)
h = h.append({'city':'hawaiian gardens', 'pop_2020':14149,'pop_2010':14254, 'pop_2000':14779, 'pop_1990':13639}, ignore_index=True)
h = h.append({'city':'signal hill', 'pop_2020':11848,'pop_2010':11016, 'pop_2000':9333, 'pop_1990':8371}, ignore_index=True)
h = h.set_index('city')
t4 = h.sum()
t4.name = 'harbor'
h = h.append(t4)
missing = missing.append({'city':'harbor', 'pop_2020':693663,'pop_2010':680988, 'pop_2000':675847, 'pop_1990':629138},ignore_index=True)
# West (include: 'west hollywood', 'beverly hills', 'santa monica', 'ladera heights', 'marina del rey','culver city',)
w = pd.DataFrame(columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
w = w.append({'city':'west hollywood', 'pop_2020':35757,'pop_2010':34399, 'pop_2000':35716, 'pop_1990':36118}, ignore_index=True)
w = w.append({'city':'beverly hills', 'pop_2020':32701,'pop_2010':34109, 'pop_2000':33784, 'pop_1990':31971}, ignore_index=True)
w = w.append({'city':'santa monica', 'pop_2020':93076,'pop_2010':89736, 'pop_2000':84084, 'pop_1990':86905}, ignore_index=True)
w = w.append({'city':'ladera heights', 'pop_2020':6654,'pop_2010':6498, 'pop_2000':6568, 'pop_1990':6316}, ignore_index=True)
w = w.append({'city':'marina del rey', 'pop_2020':11373,'pop_2010':8866, 'pop_2000':8176, 'pop_1990':7431}, ignore_index=True)
w = w.append({'city':'culver city', 'pop_2020':40779,'pop_2010':38883, 'pop_2000':38816, 'pop_1990':38793}, ignore_index=True)
w = w.set_index('city')
t5 = w.sum()
t5.name = 'west'
w = w.append(t5)
missing = missing.append({'city':'west', 'pop_2020':220340,'pop_2010':212491, 'pop_2000':207144, 'pop_1990':207534},ignore_index=True)
# South (includes: 'view park windsor hills', 'westmont','west athens', 'willowbrook','florencegraham')
s = pd.DataFrame(columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
s = s.append({'city':'view park windsor hills', 'pop_2020':11419,'pop_2010':11075, 'pop_2000':10958, 'pop_1990':11769}, ignore_index=True)
s = s.append({'city':'westmont', 'pop_2020':33913,'pop_2010':31853, 'pop_2000':31623, 'pop_1990':31044}, ignore_index=True)
s = s.append({'city':'west athens', 'pop_2020':9393,'pop_2010':8729, 'pop_2000':9101, 'pop_1990':8859}, ignore_index=True)
s = s.append({'city':'willowbrook', 'pop_2020':24295,'pop_2010':35983, 'pop_2000':34138, 'pop_1990':32772}, ignore_index=True)
s = s.append({'city':'florence graham', 'pop_2020':61983,'pop_2010':63387, 'pop_2000':60197, 'pop_1990':57147}, ignore_index=True)
s = s.set_index('city')
t6 = s.sum()
t6.name = 'south'
s = s.append(t6)
missing = missing.append({'city':'south', 'pop_2020':141003,'pop_2010':151027, 'pop_2000':146017, 'pop_1990':141591},ignore_index=True)
# Southeast (includes: 'artesia', 'vernon', 'bellflower', 'downey', 'south gate', 'bell', 'cudahy', 'bell gardens', 'huntington park',
# 'la mirada', 'santa fe springs', 'maywood', 'norwalk', 'paramount', 'south whittier', 'pico rivera', 'montebello', 'commerce', 'cerritos',
# 'east whittier (formerly east la mirada)')
se = pd.DataFrame(columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
se = se.append({'city':'artesia', 'pop_2020':16395,'pop_2010':16522, 'pop_2000':16380, 'pop_1990':15464}, ignore_index=True)
se = se.append({'city':'vernon', 'pop_2020':222,'pop_2010':112, 'pop_2000':91, 'pop_1990':152}, ignore_index=True)
se = se.append({'city':'bellflower', 'pop_2020':79190,'pop_2010':76616, 'pop_2000':72878, 'pop_1990':61815}, ignore_index=True)
se = se.append({'city':'downey', 'pop_2020':114355,'pop_2010':111772, 'pop_2000':107323, 'pop_1990':91444}, ignore_index=True)
se = se.append({'city':'south gate', 'pop_2020':92726,'pop_2010':94396, 'pop_2000':96375, 'pop_1990':86284}, ignore_index=True)
se = se.append({'city':'bell', 'pop_2020':33559,'pop_2010':35477, 'pop_2000':36664, 'pop_1990':34365}, ignore_index=True)
se = se.append({'city':'cudahy', 'pop_2020':22811,'pop_2010':23805, 'pop_2000':24208, 'pop_1990':22817}, ignore_index=True)
se = se.append({'city':'bell gardens', 'pop_2020':39501,'pop_2010':42072, 'pop_2000':44054, 'pop_1990':42355}, ignore_index=True)
se = se.append({'city':'huntington park', 'pop_2020':54883,'pop_2010':58114, 'pop_2000':61348, 'pop_1990':56065}, ignore_index=True)
se = se.append({'city':'la mirada', 'pop_2020':48008,'pop_2010':48527, 'pop_2000':46783, 'pop_1990':40452}, ignore_index=True)
se = se.append({'city':'santa fe springs', 'pop_2020':19219,'pop_2010':16223, 'pop_2000':17438, 'pop_1990':15520}, ignore_index=True)
se = se.append({'city':'maywood', 'pop_2020':25138,'pop_2010':27395, 'pop_2000':28083, 'pop_1990':27850 }, ignore_index=True)
se = se.append({'city':'norwalk', 'pop_2020':102773,'pop_2010':105549, 'pop_2000':103298, 'pop_1990':94279 }, ignore_index=True)
se = se.append({'city':'paramount', 'pop_2020':53733,'pop_2010':54098, 'pop_2000':55266, 'pop_1990':47669 }, ignore_index=True)
se = se.append({'city':'south whittier', 'pop_2020':56415,'pop_2010':57156, 'pop_2000':55193, 'pop_1990':49514 }, ignore_index=True)
se = se.append({'city':'pico rivera', 'pop_2020':62088,'pop_2010':62942, 'pop_2000':63428, 'pop_1990':59177 }, ignore_index=True)
se = se.append({'city':'montebello', 'pop_2020':62640,'pop_2010':62500, 'pop_2000':62150, 'pop_1990':59564 }, ignore_index=True)
se = se.append({'city':'commerce', 'pop_2020':12378,'pop_2010':12823, 'pop_2000':12568, 'pop_1990':12135 }, ignore_index=True)
se = se.append({'city':'cerritos', 'pop_2020':49578,'pop_2010':49041, 'pop_2000':51488, 'pop_1990':53240 }, ignore_index=True)
se = se.append({'city':'east whittier (formerly east la mirada)', 'pop_2020':10394,'pop_2010':9757, 'pop_2000':9538, 'pop_1990':9367 }, ignore_index=True)
se = se.set_index('city')
t7 = se.sum()
t7.name = 'southeast'
se = se.append(t7)
missing = missing.append({'city':'southeast', 'pop_2020':956006,'pop_2010':964897, 'pop_2000':964554, 'pop_1990':879528},ignore_index=True)
# Southwest (includes: 'manhattan beach', 'torrance', 'inglewood', 'gardena', 'lennox', 'el segundo', 'hawthorne', 'redondo beach', 'lawndale',
# 'hermosa beach', 'rolling hills estates', 'palos verdes estates', 'rancho palos verdes', 'lomita', 'alondra park')
sw = pd.DataFrame(columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
sw = sw.append({'city':'manhattan beach', 'pop_2020':35506,'pop_2010':35135, 'pop_2000':33852, 'pop_1990':32063 }, ignore_index=True)
sw = sw.append({'city':'torrance', 'pop_2020':147067,'pop_2010':145438, 'pop_2000':137946, 'pop_1990':133107}, ignore_index=True)
sw = sw.append({'city':'inglewood', 'pop_2020':107762,'pop_2010':109673, 'pop_2000':112580, 'pop_1990':109602 }, ignore_index=True)
sw = sw.append({'city':'gardena', 'pop_2020':61027,'pop_2010':58829, 'pop_2000':57746, 'pop_1990':49847 }, ignore_index=True)
sw = sw.append({'city':'lennox', 'pop_2020':20323,'pop_2010':22753, 'pop_2000':22950, 'pop_1990':22757 }, ignore_index=True)
sw = sw.append({'city':'el segundo', 'pop_2020':17272,'pop_2010':16654, 'pop_2000':16033, 'pop_1990':15223 }, ignore_index=True)
sw = sw.append({'city':'hawthorne', 'pop_2020':88083,'pop_2010':84293, 'pop_2000':84112 , 'pop_1990':71349 }, ignore_index=True)
sw = sw.append({'city':'redondo beach', 'pop_2020':71576,'pop_2010':66748, 'pop_2000':63261, 'pop_1990':60167 }, ignore_index=True)
sw = sw.append({'city':'lawndale', 'pop_2020':31807,'pop_2010':32769, 'pop_2000':31711, 'pop_1990':27331 }, ignore_index=True)
sw = sw.append({'city':'hermosa beach', 'pop_2020':19728,'pop_2010':19506, 'pop_2000':18566, 'pop_1990':18219 }, ignore_index=True)
sw = sw.append({'city':'rolling hills estates', 'pop_2020':8280,'pop_2010':8067, 'pop_2000':7676, 'pop_1990':7789 }, ignore_index=True)
sw = sw.append({'city':'palos verdes estates', 'pop_2020':13347,'pop_2010':13438, 'pop_2000':13340, 'pop_1990':13512 }, ignore_index=True)
sw = sw.append({'city':'rancho palos verdes', 'pop_2020':42287,'pop_2010':41643, 'pop_2000':41145, 'pop_1990':41659 }, ignore_index=True)
sw = sw.append({'city':'lomita', 'pop_2020':20921,'pop_2010':20256, 'pop_2000':20046, 'pop_1990':19382 }, ignore_index=True)
sw = sw.append({'city':'alondra park', 'pop_2020':8569,'pop_2010':8592, 'pop_2000':8622, 'pop_1990':12215 }, ignore_index=True)
sw = sw.set_index('city')
t8 = sw.sum()
t8.name = 'southwest'
sw = sw.append(t8)
missing = missing.append({'city':'southwest', 'pop_2020':693555,'pop_2010':683794, 'pop_2000':669586, 'pop_1990':634222},ignore_index=True)
# Make all estimates numeric
missing[['pop_2020', 'pop_2010', 'pop_2000', 'pop_1990']] = missing[['pop_2020', 'pop_2010', 'pop_2000', 'pop_1990']].apply(pd.to_numeric)
# DROPPING: HOLLYWOOD/WILSHIRE, WEST VALLEY, SAN ANTONIO, NORTHEAST (alternative is to find data from the census through zipcodes)
warnings.filterwarnings('ignore') # Many warning errors arose later on from using append method since pandas is deprecating it and moving to concat
missing
Now that the data from the missing cities has been found and placed in a dataframe, we concatenate it to the dataframe with all of the cities' population estimates (pop_est; dataset 5). This updated list is then merged with the mental health cases per region (flipped; dataset 4).
This combined dataset will be used to check whether a relationship exists between mental health disorders and population, and how that may have changed over the years. We also renamed the city label to district, since the mental health data is broken up into health districts.
Note that some regions were dropped during the merge because of insufficient information: Hollywood/Wilshire, West Valley, San Antonio, and Northeast. It is unfortunate that our datasets are limited in this way, but the drop is justified because these regions are only a small subset of the health districts and should not affect our analyses.
# Concatenate missing data to combined dataset
pop_est = pd.concat([missing,pop_est])
pop_est = pop_est.reset_index(drop=True)
# Merge at the end: a single left merge keeps every health district from the mental health data
districts_health = pd.merge(flipped, pop_est, on='city', how='left')
districts_health = districts_health.rename(columns={'city':'district'})
districts_health = districts_health.dropna()
districts_health = districts_health.reset_index(drop=True)
districts_health
Our sixth and final dataset provides information about the areas of LA's health districts. We will use this later to determine the population density in each district.
# DATASET 6: A Closer Look at the Health Districts
# Scrape the health districts webpage
webpage = requests.get('https://www.lacounty.hiv/health-districts/').text
# Parse webpage HTML
html = BeautifulSoup(webpage, 'html.parser')
# Select the .icon-dl HTML class
district_links = html.select('.icon-dl')
district_names = str(html.select('[data-district]')[0].select('.district-name')[0]).split('</span>')[1]
# Open the counteez.csv file
with open('counteez.csv', 'r') as districts_file:
    # Read lines from the csv file
    districts = districts_file.readlines()
# Extract district names
districts = list(map(lambda district: district.split(',')[1].strip(), districts))
districts_data = list()
# Iterate over districts
for district in districts:
    # Extract text from the district PDF files
    text = extract_text('district_pdfs/document_' + district.capitalize() + '.pdf')
    text = text.split('\n')
    area_square_miles = ''
    # population = ''
    # Find the area for this district in the extracted text
    for index, line in enumerate(text):
        if 'square miles' in line:
            area_square_miles = line
            # population = text[index + 1]
            break
    area_square_miles = area_square_miles.split('s')[0].strip()
    area_square_miles = area_square_miles.replace(',', '')
    area_square_miles = int(area_square_miles)
    # population = population.split('(')[0]
    # population = population.replace(',', '')
    # population = int(population)
    districts_data.append((district, area_square_miles))
districts_data = np.array(districts_data)
The main focuses of our analysis were to examine the trend of depression rates over time, and to look at the association between depression rates and population density. We start here with the former.
The first part of our analysis looks at how depression rates have changed over time. As mentioned before, data availability constrains us to look specifically at the LA region over the past couple decades.
# Graphing Percentage of Population with Depression in Los Angeles, CA using Seaborn lineplot and renaming axes to be more descriptive
sns.lineplot(data=overall, x='year', y='population_depression_diagnosis_percent').set(
title="Percentage of Population with Depression in Los Angeles, CA", xlabel="Year", ylabel="Percent of LA Population with Depression")
sns.set(style='white', font_scale=2)
plt.rcParams['figure.figsize']=(8,5)
plt.show()
The figure above is a line chart of the percentage of the population in Los Angeles, CA with a depression diagnosis in a given year. Although there was a brief decline between 2007 and 2011, the overall trend is a substantial increase, from about 9% of the population in 2000 to over 16% in 2018.
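To put a rough number on this trend, here is a minimal sketch using the overall dataframe built above; the slope is in percentage points per year.
# Sketch: linear trend of the county-wide depression rate over time
years = overall['year'].dt.year                                # datetime years -> integers
rates = overall['population_depression_diagnosis_percent']
slope, intercept, r, p, se = stats.linregress(years, rates)
print(f"Trend: ~{slope:.2f} percentage points per year (r = {r:.2f}, p = {p:.3f})")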
Below we have graphed a more thorough breakdown of depression rates by district in LA over time.
# Graphing Depression Rates in LA Health Districts using Seaborn lineplot and renaming axes to be more descriptive
# Setting years as indices to make graphing wideform data easier
df4_graph = df4.iloc[:, 2:]
df4_graph.index = df4.year
# Creating figure
sns.set(style='white', font_scale=1)
plt.rcParams['figure.figsize']=(8,5)
sns.lineplot(data=df4_graph, legend=False).set(
title="Percentage of Population with Depression", xlabel="Year", ylabel="Percent with Depression")
plt.show()
The legend for this plot is not provided because it would be cumbersome with the number of districts in the data; the main point we see from this graph is that depression rates have generally been increasing over time for nearly all of LA's healthcare districts. The dip between 2007 and 2011 seen in the first plot is also visible in the plot above, albeit subtly.
Of course, there is a decent amount of variability between health regions, and depression rates over the plotted time period increased very little in some health regions, while increasing dramatically in others.
To get a better idea of how depression rates have changed over time in individual health districts, we plotted a regression line for each health district.
# remove_date (defined earlier) strips the placeholder timestamp and returns the year as an int
df4_graph = df4.drop(columns=['LA Overall'])
# Regression line for each district over time
df4_graph['year'] = df4_graph['year'].astype(object).astype(str).apply(remove_date)
sns.set(rc={'figure.figsize': (40, 30)})
curr = 1
plt.figure(0)
temp = df4_graph.iloc[:, 2:]
for i in temp:
    ax = plt.subplot(5, 5, curr)
    ax = sns.regplot(x="year", y=i, data=df4_graph)
    ax.xaxis.set_major_formatter(FuncFormatter(lambda x, _: int(x)))
    ax.set_ylabel("Depression in " + i)
    curr = curr + 1
plt.show()
From the figures above, we can confirm our earlier observation that there is a positive relationship between time and depression cases in every LA health district, meaning that depression rates are generally increasing everywhere. However, there is a lot of variability between districts, with some depression-time trends being steeper or stronger than others.
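To quantify that variability, a small sketch over the same df4_graph frame can tabulate the slope and correlation of each district's trend (illustrative only, not part of the main analysis):
# Sketch: per-district trend of depression rate over time
district_trends = []
for col in df4_graph.columns.drop('year'):
    pair = df4_graph[['year', col]].dropna()                # drop years with missing values for this district
    slope, intercept, r, p, se = stats.linregress(pair['year'], pair[col])
    district_trends.append({'district': col, 'slope': slope, 'r': r, 'p': p})
district_trends = pd.DataFrame(district_trends).sort_values('slope', ascending=False)
district_trends.head()                                       # steepest upward trends first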
As preparation for examining how depression and population density are related, we did some preliminary analyses on population data. First, we looked at how population was distributed over LA's health districts in 2000 and 2020 by creating histograms:
# Graph style and sizing
sns.set(style='white', font_scale=2)
plt.rcParams['figure.figsize']=(17,7)
sns.despine()
# Plotting 2020 and 2000 graphs side by side
plt.subplot(1,2,1)
sns.histplot(districts_health['pop_2020'],color='teal',bins=10,kde=True).set(
title='LA District Population Counts(2020)', xlabel='District populations (millions)', ylabel='Frequency')
plt.subplot(1,2,2)
sns.histplot(districts_health['pop_2000'],color='blue',bins=10,kde=True).set(
title='LA District Population Counts(2000)', xlabel='District populations (millions)', ylabel='Frequency')
plt.show()
Our histograms show that LA district populations have a non-normal, right-skewed distribution, with most districts having fewer than 1 million people. Comparing the two histograms shows that the population has remained similar in most districts, with the exception of a few that either grew or shrank noticeably.
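To back up what we see visually, a quick sketch using scipy's sample skewness on the same districts_health columns can quantify the skew; positive values indicate a right-skewed distribution.
# Sketch: sample skewness of the district population distributions
print("2020 skewness:", round(float(stats.skew(districts_health['pop_2020'])), 2))
print("2000 skewness:", round(float(stats.skew(districts_health['pop_2000'])), 2))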
The outlier is a district with over 3 million people, making it the district that likely has the highest population density. To find our outlier, we'll take a look at the district counts with a barplot.
sns.set(rc={'figure.figsize':(20,12)}, font_scale = 1.5)
city_cts = sns.barplot(x=districts_health['pop_2020'], y='district', data=districts_health, order=districts_health.sort_values('pop_2020',ascending=False).district)
Based on the bar chart above, we can see that the Central district has the highest population count, followed by Southeast, Foothill, and then Harbor. The least populous district is San Fernando.
Now that we know what the most populous districts are, we'll take a look at the districts with the highest depression cases. It is important to note that the population data is from 2020, while our mental health data goes up to 2018.
sns.set(rc={'figure.figsize':(20,12)}, font_scale = 1.5)
city_cts = sns.barplot(x=districts_health['mh_2018'], y='district', data=districts_health, order=districts_health.sort_values('mh_2018',ascending=False).district)
Based on the bar chart above, we can see that Harbor, West, and Torrance have the highest depression rates. Central, our most populous (and likely densest) district, is actually on the lower end of depression rates.
One important limitation of our population data is that it comes from the government census, which only happens once per decade. In particular, we have data about LA city populations in 2020, 2010, 2000, and 1990.
In order to have more accurate data for analysis, we decided to use linear regression to interpolate data points for 2018, 2015, 2011, 2007, 2005, 2002, and 1999 in each district. (These are the years for which we have mental health data.)
We will save the depression rates and interpolated population of each district in the DataFrame pop_predictions. The slope, intercept, r, p, and standard error of each regression line will be saved in the DataFrame RL_Cities.
Although it is true that population tends to grow exponentially rather than linearly, the sequences of four data points that we have for each city can often be approximated quite closely with linear regression. Where linear regression is not ideal, the data do not show exponential relationships, meaning random variability is likely responsible for strange-looking sequences.
pop_predictions = pd.DataFrame(np.array([[2018],[2015],[2011],[2007],[2005],[2002],[1999]]),columns=['year'])
RL_Cities = pd.DataFrame(columns = ['district', 'slope', 'intercept', 'r', 'p', 'std_err'])
for index, row in districts_health.iterrows():
    district = row['district']
    mh_2018 = row['mh_2018']
    mh_2015 = row['mh_2015']
    mh_2011 = row['mh_2011']
    mh_2007 = row['mh_2007']
    mh_2005 = row['mh_2005']
    mh_2002 = row['mh_2002']
    mh_1999 = row['mh_1999']
    # Fit a regression line to the census data so we can predict the population of each district in a given year
    y = [row['pop_2020'], row['pop_2010'], row['pop_2000'], row['pop_1990']]
    x = [2020, 2010, 2000, 1990]
    slope, intercept, r, p, std_err = stats.linregress(x, y)
    pop_2018 = slope * 2018 + intercept
    pop_2015 = slope * 2015 + intercept
    pop_2011 = slope * 2011 + intercept
    pop_2007 = slope * 2007 + intercept
    pop_2005 = slope * 2005 + intercept
    pop_2002 = slope * 2002 + intercept
    pop_1999 = slope * 1999 + intercept
    RL_Cities = RL_Cities.append({'district': district, 'slope': slope, 'intercept': intercept, 'r': r, 'std_err': std_err, 'p': p}, ignore_index=True)
    pop_predictions[district + "_Mh"] = [mh_2018, mh_2015, mh_2011, mh_2007, mh_2005, mh_2002, mh_1999]
    pop_predictions[district + "_Pop"] = [pop_2018, pop_2015, pop_2011, pop_2007, pop_2005, pop_2002, pop_1999]
pop_predictions
RL_Cities
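Since RL_Cities stores the fit statistics, we can sanity-check the linearity assumption made above with a small sketch; districts whose |r| is far from 1 are the ones where linear interpolation is roughest.
# Sketch: inspect how well a straight line fits each district's four census points
fit_check = RL_Cities[['district', 'r']].copy()
fit_check['abs_r'] = fit_check['r'].astype(float).abs()
fit_check.sort_values('abs_r').head()                        # weakest linear fits first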
# Population changes in each district in LA county over time
sns.set(rc={'figure.figsize': (40, 30)})
curr = 1
plt.figure(0)
# graphing each district individually; RL_Cities rows are in the same order as districts_health
for index, row in districts_health.iterrows():
    district = row['district']
    y = [row['pop_2020'], row['pop_2010'], row['pop_2000'], row['pop_1990']]
    x = [2020, 2010, 2000, 1990]
    ax = plt.subplot(5, 5, curr)
    ax = plt.scatter(x, y)
    x = np.linspace(1990, 2020, 7)
    slope = RL_Cities['slope'][index]
    intercept = RL_Cities['intercept'][index]
    y = slope * x + intercept
    plt.plot(x, y)
    curr = curr + 1
    plt.xlabel('Year')
    plt.ylabel('Population')
    plt.title("The population change in " + district)
plt.show()
The figures above show the population regression lines for each LA health district. There appears to be a positive relationship for most districts, meaning that populations have generally been increasing (with the exceptions of Alhambra, Inglewood, and East LA).
With individual analyses of depression rates and population out of the way, we can now begin examining the relationship between depression and population density.
LA County's health page provides the area of each health district. We combine this information (dataset 6) with population data to obtain population density information for census years.
df_districts = pd.DataFrame(districts_data)
# Rename columns
df_districts = df_districts.rename(columns={0: 'district', 1: 'area_square_miles'})
# Convert area_square_miles column to numeric values
df_districts['area_square_miles'] = df_districts['area_square_miles'].apply(pd.to_numeric)
# Sort DataFrame in alphabetical order
df_districts = df_districts.sort_values('district', ascending=True).reset_index(drop=True)
districts_health['area_square_miles'] = df_districts['area_square_miles']
# Compute population density (people per square mile) for each census year
for year in range(1990, 2030, 10):
    districts_health['pop_density_{}'.format(year)] = districts_health['pop_{}'.format(year)] / districts_health['area_square_miles']
# Round decimals
districts_health = districts_health.round(decimals=1)
districts_health
We also need to calculate population density with the interpolated population data from earlier.
#Concatenate population that we get from regresssion line to the data variable districts_health
temp_2018 = []
temp_2015 = []
temp_2011 = []
temp_2007 = []
temp_2005 = []
temp_2002 = []
temp_1999 = []
temp_2018_density = []
temp_2015_density = []
temp_2011_density = []
temp_2007_density = []
temp_2005_density = []
temp_2002_density = []
temp_1999_density = []
area_list = districts_health['area_square_miles'].to_numpy()
index = 0
# The interpolated populations are stored in pop_predictions, so we iterate over its "_Pop" columns
for i in pop_predictions:
    if "_Pop" in i:
        temp_2018.append(pop_predictions[i].to_numpy()[0])
        temp_2015.append(pop_predictions[i].to_numpy()[1])
        temp_2011.append(pop_predictions[i].to_numpy()[2])
        temp_2007.append(pop_predictions[i].to_numpy()[3])
        temp_2005.append(pop_predictions[i].to_numpy()[4])
        temp_2002.append(pop_predictions[i].to_numpy()[5])
        temp_1999.append(pop_predictions[i].to_numpy()[6])
        temp_2018_density.append((pop_predictions[i].to_numpy()[0] / area_list[index]).round(decimals=1))
        temp_2015_density.append((pop_predictions[i].to_numpy()[1] / area_list[index]).round(decimals=1))
        temp_2011_density.append((pop_predictions[i].to_numpy()[2] / area_list[index]).round(decimals=1))
        temp_2007_density.append((pop_predictions[i].to_numpy()[3] / area_list[index]).round(decimals=1))
        temp_2005_density.append((pop_predictions[i].to_numpy()[4] / area_list[index]).round(decimals=1))
        temp_2002_density.append((pop_predictions[i].to_numpy()[5] / area_list[index]).round(decimals=1))
        temp_1999_density.append((pop_predictions[i].to_numpy()[6] / area_list[index]).round(decimals=1))
        index = index + 1
districts_health["pop_2018"] = temp_2018
districts_health["pop_2015"] = temp_2015
districts_health["pop_2011"] = temp_2011
districts_health["pop_2007"] = temp_2007
districts_health["pop_2005"] = temp_2005
districts_health["pop_2002"] = temp_2002
districts_health["pop_1999"] = temp_1999
districts_health["pop_density_2018"] = temp_2018_density
districts_health["pop_density_2015"] = temp_2015_density
districts_health["pop_density_2011"] = temp_2011_density
districts_health["pop_density_2007"] = temp_2007_density
districts_health["pop_density_2005"] = temp_2005_density
districts_health["pop_density_2002"] = temp_2002_density
districts_health["pop_density_1999"] = temp_1999_density
POPULATION = districts_health[["district", "pop_2020", "pop_2018", "pop_2015", "pop_2011", "pop_2010", "pop_2007", "pop_2005", "pop_2002", "pop_2000", "pop_1999", "pop_1990"]]
POPULATION
DENSITY = districts_health[["district", "pop_density_2020", "pop_density_2018", "pop_density_2015", "pop_density_2011", "pop_density_2010", "pop_density_2007", "pop_density_2005", "pop_density_2002", "pop_density_2000", "pop_density_1999", "pop_density_1990"]]
DENSITY
The DENSITY table is a subset of the districts_health data; it displays just the population density columns for easier viewing.
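For plotting, it can also help to reshape this wide table into long form; here is a minimal sketch using pandas' melt (the column names follow the pop_density_YYYY pattern above). This long format is what seaborn's lineplot expects if we wanted to plot density over time for each district.
# Sketch: reshape DENSITY from wide (one column per year) to long (district, year, density)
density_long = DENSITY.melt(id_vars='district', var_name='year', value_name='pop_density')
density_long['year'] = density_long['year'].str.replace('pop_density_', '').astype(int)
density_long.head()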
We now examine the relationship between population density and depression rates by performing linear regressions for each year in our data.
#2018
sns.lmplot(y = 'pop_density_2018',
x = 'mh_2018',
data = districts_health)
plt.xlabel('Mental Health in 2018')
plt.ylabel('Population density in 2018')
plt.show()
The graph above shows the relationship between population density and depression rate across LA County's health districts in 2018. The correlation is weak, partly because of a few high-density outliers.
#2015
sns.lmplot(y = 'pop_density_2015',
x = 'mh_2015',
data = districts_health)
plt.xlabel('Mental Health in 2015')
plt.ylabel('Population density in 2015')
plt.show()
The graph above shows the same relationship for 2015. Here the association appears somewhat stronger, but the outliers are still influential.
sns.lmplot(y = 'pop_density_2011',
x = 'mh_2011',
data = districts_health)
plt.xlabel('Mental Health in 2011')
plt.ylabel('Population density in 2011')
plt.show()
The graph above shows the relationship between population density and depression rate for 2011.
sns.lmplot(y = 'pop_density_2007',
x = 'mh_2007',
data = districts_health)
plt.xlabel('Mental Health in 2007')
plt.ylabel('Population density in 2007')
plt.show()
The graph above shows the relationship for 2007. The positive slope is driven largely by the single high outlier on the right.
sns.lmplot(y = 'pop_density_2005',
x = 'mh_2005',
data = districts_health)
plt.xlabel('Mental Health in 2005')
plt.ylabel('Population density in 2005')
plt.show()
The graph above shows the relationship for 2005. The slope is somewhat negative, which is again partly due to outliers.
sns.lmplot(y = 'pop_density_2002',
x = 'mh_2002',
data = districts_health)
plt.xlabel('Mental Health in 2002')
plt.ylabel('Population density in 2002')
plt.show()
The graph above shows the relationship for 2002; there is little association.
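The scatterplots above give a qualitative impression; to attach numbers to them, here is a small sketch that computes the Pearson correlation between depression rate and population density for each year in districts_health (illustrative only):
# Sketch: Pearson correlation between depression rate and population density, per year
for yr in ['2018', '2015', '2011', '2007', '2005', '2002', '1999']:
    r, p = stats.pearsonr(districts_health['mh_' + yr], districts_health['pop_density_' + yr])
    print(f"{yr}: r = {r:.2f}, p = {p:.3f}")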
As a supplementary analysis, we create a heatmap here to look at associations with all variables considered.
overallheatmap = districts_health.copy()

def heatmap(x, y, **kwargs):
    if 'color' in kwargs:
        color = kwargs['color']
    else:
        color = [1] * len(x)

    if 'palette' in kwargs:
        palette = kwargs['palette']
        n_colors = len(palette)
    else:
        n_colors = 256  # Use 256 colors for the diverging color palette
        palette = sns.color_palette("Blues", n_colors)

    if 'color_range' in kwargs:
        color_min, color_max = kwargs['color_range']
    else:
        color_min, color_max = min(color), max(color)  # Range of values that will be mapped to the palette

    def value_to_color(val):
        if color_min == color_max:
            return palette[-1]
        else:
            val_position = float((val - color_min)) / (color_max - color_min)  # position of value in the input range, relative to the length of the input range
            val_position = min(max(val_position, 0), 1)  # bound the position between 0 and 1
            ind = int(val_position * (n_colors - 1))  # target index in the color palette
            return palette[ind]

    if 'size' in kwargs:
        size = kwargs['size']
    else:
        size = [1] * len(x)

    if 'size_range' in kwargs:
        size_min, size_max = kwargs['size_range'][0], kwargs['size_range'][1]
    else:
        size_min, size_max = min(size), max(size)

    size_scale = kwargs.get('size_scale', 500)

    def value_to_size(val):
        if size_min == size_max:
            return 1 * size_scale
        else:
            val_position = (val - size_min) * 0.99 / (size_max - size_min) + 0.01  # position of value in the input range, relative to the length of the input range
            val_position = min(max(val_position, 0), 1)  # bound the position between 0 and 1
            return val_position * size_scale

    if 'x_order' in kwargs:
        x_names = [t for t in kwargs['x_order']]
    else:
        x_names = [t for t in sorted(set([v for v in x]))]
    x_to_num = {p[1]: p[0] for p in enumerate(x_names)}

    if 'y_order' in kwargs:
        y_names = [t for t in kwargs['y_order']]
    else:
        y_names = [t for t in sorted(set([v for v in y]))]
    y_to_num = {p[1]: p[0] for p in enumerate(y_names)}

    plot_grid = plt.GridSpec(1, 15, hspace=0.2, wspace=0.1)  # Set up a 1x15 grid
    ax = plt.subplot(plot_grid[:, :-1])  # Use the left 14/15ths of the grid for the main plot

    marker = kwargs.get('marker', 's')

    kwargs_pass_on = {k: v for k, v in kwargs.items() if k not in [
        'color', 'palette', 'color_range', 'size', 'size_range', 'size_scale', 'marker', 'x_order', 'y_order'
    ]}

    ax.scatter(
        x=[x_to_num[v] for v in x],
        y=[y_to_num[v] for v in y],
        marker=marker,
        s=[value_to_size(v) for v in size],
        c=[value_to_color(v) for v in color],
        **kwargs_pass_on
    )
    ax.set_xticks([v for k, v in x_to_num.items()])
    ax.set_xticklabels([k for k in x_to_num], rotation=45, horizontalalignment='right')
    ax.set_yticks([v for k, v in y_to_num.items()])
    ax.set_yticklabels([k for k in y_to_num])
    ax.grid(False, 'major')
    ax.grid(True, 'minor')
    ax.set_xticks([t + 0.5 for t in ax.get_xticks()], minor=True)
    ax.set_yticks([t + 0.5 for t in ax.get_yticks()], minor=True)
    ax.set_xlim([-0.5, max([v for v in x_to_num.values()]) + 0.5])
    ax.set_ylim([-0.5, max([v for v in y_to_num.values()]) + 0.5])
    ax.set_facecolor('#F1F1F1')

    # Add color legend on the right side of the plot
    if color_min < color_max:
        ax = plt.subplot(plot_grid[:, -1])  # Use the rightmost column of the plot
        col_x = [0] * len(palette)  # Fixed x coordinate for the bars
        bar_y = np.linspace(color_min, color_max, n_colors)  # y coordinates for each of the n_colors bars
        bar_height = bar_y[1] - bar_y[0]
        ax.barh(
            y=bar_y,
            width=[5] * len(palette),  # Make bars 5 units wide
            left=col_x,  # Make bars start at 0
            height=bar_height,
            color=palette,
            linewidth=0
        )
        ax.set_xlim(1, 2)  # Bars are going from 0 to 5, so let's crop the plot somewhere in the middle
        ax.grid(False)  # Hide grid
        ax.set_facecolor('white')  # Make background white
        ax.set_xticks([])  # Remove horizontal ticks
        ax.set_yticks(np.linspace(min(bar_y), max(bar_y), 3))  # Show vertical ticks for min, middle and max
        ax.yaxis.tick_right()  # Show vertical ticks on the right

def corrplot(data, size_scale=500, marker='s'):
    corr = pd.melt(data.reset_index(), id_vars='index')
    corr.columns = ['x', 'y', 'value']
    heatmap(
        corr['x'], corr['y'],
        color=corr['value'], color_range=[-1, 1],
        palette=sns.diverging_palette(20, 220, n=256),
        size=corr['value'].abs(), size_range=[0, 1],
        marker=marker,
        x_order=data.columns,
        y_order=data.columns[::-1],
        size_scale=size_scale
    )

corr = overallheatmap.corr()
plt.figure(figsize=(13, 13))
corrplot(corr)
The correlation heatmap above shows the pairwise correlations of all of our variables. Color encodes the sign of each correlation (blue for positive, red for negative), and stronger colors indicate larger magnitudes; the weakest relationships are almost invisible. This heatmap goes a step further than the usual heatmap by adding a size parameter, so that the area of each square also corresponds to the magnitude of the correlation (size(c1, c2) ~ abs(corr(c1, c2))), which makes it easier to read. The darkest squares are the variables' correlations with themselves and with closely related measures, and they form large blocks for the population estimates and densities. We'll have to look at these more closely.
This heatmap shows that population density in 2018 and 2005 had weak negative correlations with depression rates in those years, while 2015 and 2007 show somewhat stronger positive relationships. In 2002 and 1999 there was hardly any relationship between depression and density.
It is ethical to collect data from government reports because the data does not contain personally identifiable records and it is in the public domain. Any data that we used from LA County is publicly accessible and is not associated with singular individuals.
Our research question does not cause harm to any person and was not an unethical question to ask. In particular, the data was not analyzed for the purpose of making a healthcare or policy decision that would affect anyone. The data analysis is only being conducted as an exploration of the research question.
The objective of our project was to see if there was a relationship between population density and depression. We limited our analysis to Los Angeles county because of data availability. Our original prediction was that there would be a significant relationship between population density and depression rates; however, the analyses we conducted show weak or insignificant correlations between these variables. As a result, we have no evidence to support our hypothesis (i.e., we fail to reject the null hypothesis) of a relationship between population density and depression.
One important trend we noticed is that depression rates seem to be increasing over time. This could be partially a reflection of the growing priority society has put on mental health in recent years (i.e., leading to more diagnoses), but it is likely also a sign of depression becoming more likely—particularly among younger populations.
There are several limitations with our data analysis: namely, in our final regressions, outliers (e.g. the Central district) greatly affected our results. Perhaps performing statistical analyses with these outliers removed would lead to more conclusive results.
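As an illustration of that follow-up (a sketch only, not part of our reported analysis), one could drop the Central district, the largest outlier, and re-check the 2018 relationship:
# Sketch: re-run the 2018 correlation with the Central district removed
no_central = districts_health[districts_health['district'] != 'central']
r, p = stats.pearsonr(no_central['mh_2018'], no_central['pop_density_2018'])
print(f"2018 without Central: r = {r:.2f}, p = {p:.3f}")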