COGS 108 - Final Project

Names

  • Naval Patel
  • Ray Zhou
  • Nathaniel Satriya
  • Yoomin Oh
  • Jennifer Nunez

Research Question

Is there a correlation between population density (people per square mile) and depression rates (cases per capita and per square mile) in the State of California between the years 2000 and 2020?

Background & Prior Work

Depression is a serious and widespread mental health issue that affects people all over the world. Currently, the depression rate worldwide is about 4.4%; in the U.S., it is about 5.9% [1], and rates appear to be increasing, especially among younger populations [2].

There have been many scientific inquiries into the causes of depression and the environmental risk factors that may play a role in its development. Of particular interest to us is how population density may be associated with depression. As college students, we have all recently had to move and adapt to a very specific and highly populated environment that, in many ways, differs substantially from what we are used to.

As such, we are interested in investigating how population density, which may affect factors like social interactions, loneliness, emotional health, and more, might influence mental health. A few studies have already looked at the related topic of urbanization, though they reach conflicting conclusions about how it affects depression.

For example, one such study collected medical data from patients in Spain and conducted a linear regression analysis that found support for urbanization as one environmental factor that may contribute to a greater prevalence of depression (Llorente et al., 2018) [3]. A longitudinal study of depression among older people in China (He et al., 2020) [4] and a study of the Swedish population (Sundquist et al., 2004) [5] reached similar conclusions; namely, that urbanization and high population density are associated with higher rates of depression and other mental health disorders.

Conversely, a UCLA study used data from U.S. cities to create a multidimensional statistical model (much like the Spanish study mentioned above) to predict depression rates, and found evidence that larger cities were actually associated with lower depression rates (Stier et al., 2021) [6]. Overall, a literature review of studies on cities and mental health concluded that while cities tend to see higher rates of depression than rural areas, they are also associated with other factors, such as socioeconomic status and better healthcare access, that may confound the results (Gruebner et al., 2017) [7].

For the purposes of our study, we hypothesized that depression rates (i.e., proportion of a population affected by depression) would increase as population density increases.

As anecdotal evidence, we have noticed that socialization often takes a less personal and less fulfilling form when interacting with a larger group of people. Thus, living in an area with higher population density might worsen feelings of isolation or dissatisfaction with life, which could contribute to higher rates of depression (and other mental health issues).

From our background research, we know that there is conflicting evidence on this topic. Different researchers have associated urbanization with both higher and lower rates of depression. However, on the whole, we still predicted that an environment with high population density (as opposed to a city with high total population or development) negatively impacts individuals' mental health on a collective scale.

To further examine our hypothesis and to explore some of the meaningful relationships between depression and population density, we decided to include a temporal dimension to our data. Examining how depression and population change over time can provide more insight on changing environments and may help to predict how these factors may affect each other in the future.

References:

  1. https://worldpopulationreview.com/country-rankings/depression-rates-by-country
  2. https://www.publichealth.columbia.edu/public-health-now/news/depression-rise-us-especially-among-young-teens
  3. https://doi.org/10.3389/fpsyg.2018.02182
  4. https://doi.org/10.1177/0046958020965470
  5. https://doi.org/10.1192/bjp.184.4.293
  6. https://doi.org/10.1073/pnas.2022472118
  7. https://doi.org/10.3238/arztebl.2017.0121

Dataset(s)

To help answer our question, we needed datasets that had a lot of structure and consistency over a large area and long period of time. This led us to search specifically for government census data and regional medical data.

The primary variables we needed for our analysis were time, region, population density (or population size + area of region), and rates or incidences of depression. Datasets from SAMHSA, the census, and LA County medical data supplied us with this information. Because of the limitations on the availability of medical data nationwide, we limited our analysis to look at LA County and its health districts over the past couple decades.

  1. Dataset name: Historical Population Density Data (1910-2020)
  • Link: https://www.census.gov/data/tables/time-series/dec/density-data-text.html
  • Number of Observations: 684
  • This data provides the changing resident population of each state in the US over 12 decades and ranks states by population density. Density is measured as the average population per square mile. According to the US Census Bureau, density levels have been higher along the Eastern Seaboard and the Pacific coastline and lower in most of the West.

  2. Dataset name: County Population Totals: 2010-2019 (California)
  3. Dataset name: California Individuals by mental health diagnosis, gender, age, race, ethnicity and veteran status: Number, 2014-2019
  • Link: https://www.samhsa.gov/data/sites/default/files/reports/rpt35253/MHCLD-2019-R-FINAL.pdf
  • Number of Observations: 10
  • This table was pulled from the 2019 Mental Health Client-Level Data (MH-CLD) Annual Report by SAMHSA, the agency within the U.S. Department of Health and Human Services. Specifically, it provides information reported from state administrative data systems on mental health diagnoses, mental health treatment settings, and the demographic and substance use characteristics of individuals in mental health treatment facilities. Since our state of interest is California, we pulled the relevant table from this report.

  4. Dataset name: Diagnosed Depression (Ever) in LA County
  • Link: http://publichealth.lacounty.gov/ha/HA_DATA_TRENDS.htm
  • Number of Observations: 27
  • This dataset provides estimates of depression diagnoses in health districts under Los Angeles County's jurisdiction, from 1999 to 2018. It also contains information about race, ethnicity, age group, poverty level, gender, and education. The county of LA notes: "We are happy to provide the following data from the 2018 Los Angeles County Health Survey. Please note that the disparities seen in the results for the different groups shown in the table are to a large degree driven by inequities in the social, economic, and environmental conditions experienced by these groups."

  5. Dataset name: General Population Cities & Unincorporated Communities Los Angeles County (1990 through 2020 Census)
  6. Dataset name: A Closer Look at the Health Districts

Setup

To begin, let's import libraries to help us with web scraping, data wrangling, visualization, and statistical analysis.

In [1]:
# importing libraries

# data handling
import pandas as pd
import numpy as np
import io
import csv
from tabulate import tabulate

# plotting
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter

# stats
from scipy import stats

# webscraping
import requests
from bs4 import BeautifulSoup
from pdfminer.high_level import extract_text
from functools import reduce
import os
import warnings

Data Cleaning

In this section, each dataset is standardized by making values and headers lowercase and stripping them of whitespace and punctuation. Empty values are dropped where necessary, with justification provided.

We begin by defining functions to remove whitespace and punctuation:

In [2]:
# Standardization functions
def remove_comma(output):
  if "," in output:
    output = output.replace(',', '')   
  return output

def remove_period(str_in):
  output = str_in
  try:
    output = str_in.lower()
    output = output.replace('.', '')
  except:
    output = str_in
    print("error")
  return output

def remove_dash(output):
  if "-" in output:
    output = output.replace('-','')
  return output

def remove_asterik(output):
  if "*" in output:
    output = output.replace('*', '')  
  return output

def remove_date(astr):
  return int(astr.replace("-01-01 00:00:00",""))
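
Before moving on, here is a quick, hypothetical illustration of how these helpers are used in the cleaning cells below (a minimal sketch; the Series here is made up, while the real cells apply the same functions to dataframe columns):

# Hypothetical usage: cleaning a small column of comma-formatted counts
example_counts = pd.Series(['1,234', '56,789'])
example_counts = example_counts.apply(remove_comma).astype(int)   # -> 1234, 56789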

Let's start working with the first dataset.

  • The census data was fairly tidy already, but it included a lot of unnecessary variables. To narrow our focus, we decided to extract only California population counts and estimates. We will use this dataset to compare California mental health case estimates over the years 2000 to 2020.
In [3]:
# DATASET 1: reading Historical Population Density Data (1910-2020) data from github 

population_url = 'https://raw.githubusercontent.com/jennynunez/032-DATA/main/apportionment.csv'
df1 = pd.read_csv(population_url)   # Original shape of this data, starting from 1910, is (684,10)                                   

# Drop all years except for 2000-2020
pop = df1.loc[(df1['Year'] >= 2000)]
#pop.shape                                        # After filtering it to only contain the 2000s the shape is (171,10)

# Focus on states and drop observations with Regions & Nations 
population = pop[pop['Geography Type'].str.contains("State")]

# Rename columns to shorter, lowercase names
population = population.rename(columns={'Name': 'name', 'Geography Type':'state', 'Year': 'year', 'Resident Population': 'res_pop_count', 'Percent Change in Resident Population': 'perc_change_res_pop', 'Resident Population Density': 'res_pop_density', 'Resident Population Density Rank': 'res_pop_density_rank'})
population = population.drop(population.columns[7:10], axis=1)
population = population.replace(',', '', regex=True)
population = population.astype({"res_pop_count": int, "perc_change_res_pop": float, "res_pop_density": float, "res_pop_density_rank": float})
# Keep only California
population = population[population.name == 'California']
population = population.reset_index(drop=True)
population['state'] = population['state'].str.lower()
population['name'] = population['name'].str.lower()
population.head()
Out[3]:
name state year res_pop_count perc_change_res_pop res_pop_density res_pop_density_rank
0 california state 2000 33871648 13.8 217.4 14.0
1 california state 2010 37253956 10.0 239.1 13.0
2 california state 2020 39538223 6.1 253.7 13.0

Now to the second dataset.

  • The counties dataset needed its columns renamed to meaningful variable names.
  • Initially we aimed to work with all counties in California, but based on the available medical data, we narrowed our scope to Los Angeles County.
In [4]:
# DATASET 2: California counties population totals from 2010-2019 from github

counties_url = 'https://raw.githubusercontent.com/jennynunez/032-DATA/main/CA_counties_pop.csv'
df3 = pd.read_csv(counties_url)

# rename columns
df3 = df3.rename(columns={'table with row headers in column A and column headers in rows 3 through 4 (leading dots indicate sub-parts)': 'county',
                          'Unnamed: 1':'census', 'Unnamed: 2': 'estimates',
                          'Unnamed: 3': '2010',
                          'Unnamed: 4': '2011',
                          'Unnamed: 5': '2012',
                          'Unnamed: 6': '2013', 'Unnamed: 7': '2014',
                          'Unnamed: 8': '2015', 'Unnamed: 9': '2016',
                          'Unnamed: 10': '2017', 'Unnamed: 11': '2018',
                          'Unnamed: 12': '2019'})
# drop unneeded rows
df3.drop(
    labels = [0,1,2,62,63,64,65,66],
    axis = 0,
    inplace=True
)
df3 = df3.reset_index(drop=True)

# standardize county names to lowercase and remove unnecessary punctuation
df3['county'] = df3['county'].apply(remove_period)
#Remove state
df3['county'] = df3['county'].apply(lambda str_in: str_in.replace(', california', ''))
#Remove commas
df3 = df3.applymap(lambda str_in: str_in.replace(',', ''))
df3.drop(labels = 0, axis=0, inplace = True)

# apply remove_comma to each year column
for i in range(0,10):
  temp = 2010 + i
  temp = str(temp)
  df3[temp] = df3[temp].apply(remove_comma)
df3['census'] = df3['census'].apply(remove_comma)
df3['estimates'] = df3['estimates'].apply(remove_comma)

# store la county information in its own variable
la_pop = df3.loc[df3['county'] == 'los angeles county']
la_pop
Out[4]:
county census estimates 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
19 los angeles county 9818605 9819968 9823246 9876482 9935375 9992484 10040072 10085416 10105708 10103711 10073906 10039107

For the third dataset:

  • Mental health data for the state of California was extracted from the SAMHSA report and converted to a CSV file using the tabula library (a sketch of this step is shown after this list).
  • Columns were renamed for readability and values were standardized.
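
Since the actual tabula calls are commented out in the next cell (the resulting CSV was pre-generated and hosted on GitHub), here is a minimal sketch of how the PDF-to-CSV step could be reproduced, assuming the tabula-py package is installed; the page number and output file name follow the commented-out code:

# Sketch: pulling the SAMHSA table on page 438 of the PDF with tabula-py (assumed installed)
import tabula

mh_pdf = 'https://www.samhsa.gov/data/sites/default/files/reports/rpt35253/MHCLD-2019-R-FINAL.pdf'
tables = tabula.read_pdf(mh_pdf, pages='438', stream=True)                             # list of DataFrames found on the page
tabula.convert_into(mh_pdf, 'mh.csv', pages='438', output_format='csv', stream=True)   # write the same page out as a CSV
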
In [5]:
# DATASET 3: California individuals counted by type of mental health diagnosis through the years 2014-2019

# converting pdf to csv
box = [1,2,7,9]                            # Table measurements from PDF document [top, left, bottom, width]
cm_to_pdf = 28.28                          # Converts our margin measurements from cm to pdf points
for i in range(0,len(box)):
  box[i] *= cm_to_pdf
mh_url = 'https://www.samhsa.gov/data/sites/default/files/reports/rpt35253/MHCLD-2019-R-FINAL.pdf'
# mh_disorders = tabula.read_pdf(mh_url, pages='438')       # Using tabula to read the pdf

# *Convert this tabulated data to CSV (This portion will be commented out for faster shell processing)
# tabula.convert_into(mh_url, "mh.csv", pages="438", output_format="csv", stream=True)
# !cat mh.csv 
mh_url2 = 'https://raw.githubusercontent.com/jennynunez/032-DATA/main/mh.csv' #to dataframe
mh_disorders = pd.read_csv(mh_url2)

# rename columns
mh_disorders = mh_disorders.rename(columns={'Selected characteristics': 'disorder_type',
                          'Unnamed: 1':'totals', '2014': '2014',
                          '2015': '2015', '2016': '2016',
                          '2017': '2017', '2018': '2018',
                          '2019': '2019'})
mh_disorders = mh_disorders.iloc[0:13, :]

# Make gender into its own df  *may delete
mh_genders = mh_disorders.iloc[13:17, :] 

# Make age its own df *may delete
mh_ages = mh_disorders.iloc[18:33, :]

# Clean NaN values for disorder types df
mh_disorders = mh_disorders.drop('totals', axis=1)
mh_disorders = mh_disorders.dropna(axis=0)
mh_disorders = mh_disorders.reset_index(drop=True)

# apply remove_comma to each year column
for i in range(0,6):
  temp = 2014 + i
  temp = str(temp)
  mh_disorders[temp] = mh_disorders[temp].apply(remove_comma)

mh_disorders['disorder_type'] = mh_disorders['disorder_type'].str.lower()
mh_disorders['disorder_type'] = mh_disorders['disorder_type'].apply(remove_dash)
mh_disorders
Out[5]:
disorder_type 2014 2015 2016 2017 2018 2019
0 total individuals 697180 695815 647328 638394 708521 742725
1 trauma and stressorrelated disorders 107725 111807 105263 95751 110463 119469
2 anxiety disorders 75243 82479 81052 89225 109679 114954
3 attention deficit/hyperactivity disorder 52189 51489 47165 46369 49613 48632
4 mood disorders 308348 309996 271061 196965 243123 261427
5 bipolar disorders 113612 112390 99012 51718 60772 62227
6 depressive disorders 196785 199698 173975 146363 183711 200791
7 personality disorders 24234 23816 21994 14570 12654 10520
8 psychotic disorders 126411 126516 116156 116861 137630 137726
9 alcohol or substancerelated disorders 113129 112757 92537 64358 70782 68319
10 other mental disorders2 538234 530421 476321 140099 148930 139990

The fourth dataset:

  • Irrelevant headers and information (e.g. confidence intervals) were removed, and columns were renamed for better understanding.
  • Data (originally in percentages) was standardized and converted to numerical values.
In [6]:
# DATASET 4: Los Angeles mental health data of people diagnosed with depression (ever)

la_url = 'https://raw.githubusercontent.com/jennynunez/032-DATA/main/la_demos_cleaned.csv'
df4 = pd.read_csv(la_url, index_col=0)

# We want to drop the index name because it is not useful
df4.index.name = None
# Reset the index and remove unhelpful values
df4 = df4.T.reset_index().reset_index(drop=True)
# Renaming columns that have incorrect labels
df4 = df4.rename(columns={'index': 'year'})
# Removing percentage symbol throughout the DataFrame
df4 = df4.replace('%', '', regex=True)
# Converting DataFrame to numeric values
df4 = df4.apply(pd.to_numeric)
# Properly formatting years in datetime format so Seaborn doesn't convert the years to floats
df4['year'] = pd.to_datetime(df4.year.astype(str), format="%Y")
# Renaming 'LA Overall' column to be more descriptive about what the data is telling us
overall = df4.rename(columns={'LA Overall': 'population_depression_diagnosis_percent'})
# The overall LA county data is in the first few rows so we are using a subset of the data
overall = overall.iloc[:, 0:2]
df4
Out[6]:
year LA Overall Alhambra Antelope Valley Bellflower Central Compton East LA East Valley El Monte ... Pomona San Antonio San Fernando South Southeast Southwest Torrance West West Valley Whittier
0 2018-01-01 16.5 11.2 18.4 18.2 13.9 15.5 12.5 14.0 14.6 ... 10.9 21.5 15.7 13.1 14.5 19.7 19.9 23.0 15.8 13.2
1 2015-01-01 13.0 9.7 19.2 13.8 20.0 12.2 12.4 13.5 6.2 ... 10.2 11.6 14.6 14.8 12.0 10.3 11.6 15.2 10.8 12.7
2 2011-01-01 12.2 7.1 14.6 12.8 10.6 10.1 16.3 11.9 5.4 ... 14.9 8.6 13.7 16.3 9.5 9.4 9.6 13.4 12.8 11.4
3 2007-01-01 13.6 9.1 17.1 10.7 15.4 12.7 15.2 13.6 12.3 ... 14.2 14.2 11.2 12.8 18.7 12.4 13.7 13.2 12.4 14.5
4 2005-01-01 12.9 10.2 15.7 11.3 13.0 13.2 16.6 13.9 14.2 ... 9.7 11.4 10.7 16.8 6.9 11.7 10.5 16.6 14.0 15.3
5 2002-01-01 9.7 8.3 9.2 8.8 8.8 5.4 8.1 12.4 5.4 ... 8.6 9.5 8.9 6.7 10.1 7.1 10.4 11.7 11.2 12.0
6 1999-01-01 8.8 5.7 9.7 7.0 9.2 4.8 9.0 8.5 5.7 ... 6.7 9.2 5.7 8.9 6.2 7.9 10.0 8.8 13.0 9.7

7 rows × 28 columns

Our fifth dataset required a bit more work:

  • For LA cities' populations, we scraped a web page that compiles a table sourced from Census Bureau data. This required reading the data in with the BeautifulSoup and requests libraries in order to create a clean pandas dataframe.
  • We filtered the dataset to only include cities that have existed since 1990, since our mental health data covers these cities over this timeframe.
  • Values were standardized using the previously defined functions, and all observations were converted to numeric values.
In [7]:
# DATASET 5:  Districts in LA county populations

# First use BeautifulSoup to scrape the table data and convert to pandas dataframe
la_pop_url = 'http://www.laalmanac.com/population/po03.php'
r = requests.get(la_pop_url)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find_all('table')[0]
# remove multi-headers and set relevant column names
pop_est = pd.read_html(str(table), header=None)[0]
pop_est.columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990']

# For newer cities that didn't exist in 1990, we'll remove values that appear as "---(2)"
pop_est = pop_est[(pop_est['pop_2000'] != "---(2)")]
pop_est = pop_est[(pop_est['pop_1990'] != "---(3)")] # reduced size to 129 cities
pop_est = pop_est.iloc[1: , :]
pop_est = pop_est.reset_index(drop=True)


# Standardize to remove " * or - " and make cities lowercase
pop_est['city'] = pop_est['city'].apply(remove_asterik)
pop_est['city'] = pop_est['city'].str.lower()
pop_est['city'] = pop_est['city'].apply(remove_dash)
# Make all estimates numeric
pop_est[['pop_2020', 'pop_2010', 'pop_2000', 'pop_1990']] = pop_est[['pop_2020', 'pop_2010', 'pop_2000', 'pop_1990']].apply(pd.to_numeric)

pop_est
Out[7]:
city pop_2020 pop_2010 pop_2000 pop_1990
0 acton 7431 7596 2390 1471
1 agoura hills 20299 20330 20537 20390
2 alhambra 82868 83089 85804 82106
3 alondra park 8569 8592 8622 12215
4 altadena 42846 42777 42610 42658
... ... ... ... ... ...
121 west whittierlos nietos 25325 25540 25129 24164
122 westlake village 8029 8270 8368 7455
123 westmont 33913 31853 31623 31044
124 whittier 87306 85331 83680 77671
125 willowbrook 24295 35983 34138 32772

126 rows × 5 columns

After standardizing dataset 5, we wanted to combine it with dataset 4 (depression diagnosis rates in LA). We merged the dataframes together and dealt with the data from missing cities individually.

In [8]:
# Now we want to filter only the cities that are also recorded from our mental health dataset

# First flip df4 using transpose so cities are row indexes
flipped = df4.transpose()
flipped = flipped.reset_index()
flipped = flipped.iloc[1: , :]
flipped.columns = ['city','mh_2018', 'mh_2015', 'mh_2011', 'mh_2007', 'mh_2005', 'mh_2002', 'mh_1999']
flipped['city'] = flipped['city'].str.lower()
flipped[['mh_2018', 'mh_2015', 'mh_2011', 'mh_2007', 'mh_2005', 'mh_2002', 'mh_1999']] = flipped[['mh_2018', 'mh_2015', 'mh_2011', 'mh_2007', 'mh_2005', 'mh_2002', 'mh_1999']].apply(pd.to_numeric)
# Merge once all values have been made numeric, and city names have been standardized
# Trying to merge them together with the reduce method 
# df6 = reduce(lambda l, r: pd.merge(flipped, pop_est, on='city', how='inner'), pop_est)
# This method only matches 11 of the 26 cities we're looking for

# Some missing cities are included in pop_est, but under different labels. Once we identify the missing cities, we'll manually extract their data and append it to:
missing = pd.DataFrame()

# The Antelope Valley health district comprises 5 cities/communities. We will make a new df with these cities, then sum their populations and assign the totals to the district.
# Antelope Valley (includes 'acton', 'palmdale', 'lancaster', 'lake los angeles', 'quartz hill')
ant_v = pd.DataFrame(columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
ant_v = ant_v.append({'city':'acton', 'pop_2020':7431,'pop_2010':7596, 'pop_2000':2390, 'pop_1990':1471}, ignore_index=True)
ant_v = ant_v.append({'city':'palmdale', 'pop_2020':169450,'pop_2010':152750, 'pop_2000':116670, 'pop_1990':68842}, ignore_index=True)
ant_v = ant_v.append({'city':'lancaster', 'pop_2020':173516,'pop_2010':156633, 'pop_2000':118718, 'pop_1990':97291}, ignore_index=True)
ant_v = ant_v.append({'city':'lake los angeles', 'pop_2020':13187,'pop_2010':12328, 'pop_2000':11523, 'pop_1990':7977}, ignore_index=True)
ant_v = ant_v.append({'city':'quartz hill', 'pop_2020':11447,'pop_2010':10912, 'pop_2000':9890, 'pop_1990':9626}, ignore_index=True)
ant_v = ant_v.set_index('city')
# Now that Antelope Valley's cities are in dataframe, we sum value of columns, and assign values to the missing dataframe
t1 = ant_v.sum()
t1.name = 'antelope valley'
ant_v = ant_v.append(t1) # assigning totals to a row
missing = missing.append({'city':'antelope valley', 'pop_2020':375031,'pop_2010':340219, 'pop_2000':259191, 'pop_1990':185207}, ignore_index=True)
# Now we will do the same for ALL missing cities (Central, East LA, East Valley, Foothill, Harbor, Hollywood/Wilshire, Northeast, San Antonio, South, Southeast, Southwest, west, west valley)

# Central LA (includes: 'los angeles')
cla = pd.DataFrame(columns=['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
cla = cla.append({'city':'central', 'pop_2020':3898747,'pop_2010':3792621, 'pop_2000':3694820, 'pop_1990':3485398}, ignore_index=True)
missing = missing.append({'city':'central', 'pop_2020':3898747,'pop_2010':3792621, 'pop_2000':3694820, 'pop_1990':3485398}, ignore_index=True)

# East LA (includes: 'east los angeles')
ela = pd.DataFrame(columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
ela = ela.append({'city':'east los angeles', 'pop_2020':118786,'pop_2010':126496, 'pop_2000':124283, 'pop_1990':126379}, ignore_index=True)
missing = missing.append({'city':'east la', 'pop_2020':118786,'pop_2010':126496, 'pop_2000':124283, 'pop_1990':126379}, ignore_index=True)

# East Valley (includes: 'san fernando','burbank', 'hidden hills')
eva = pd.DataFrame(columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
eva = eva.append({'city':'san fernando', 'pop_2020':23946,'pop_2010':23645, 'pop_2000':23564, 'pop_1990':22580}, ignore_index=True)
eva = eva.append({'city':'burbank', 'pop_2020':107337,'pop_2010':103340, 'pop_2000':100316, 'pop_1990':93643}, ignore_index=True)
eva = eva.append({'city':'hidden hills', 'pop_2020':1725,'pop_2010':1856, 'pop_2000':1875, 'pop_1990':1729}, ignore_index=True)
eva = eva.set_index('city')
t2 = eva.sum()
t2.name = 'east valley'
eva = eva.append(t2)
missing = missing.append({'city':'east valley', 'pop_2020':133008,'pop_2010':128841, 'pop_2000':125755, 'pop_1990':117952},ignore_index=True)

# Foothills (includes: 'san gabriel', 'south san gabriel', 'east san gabriel', 'south pasadena', 'san marino', 'east pasadena', 'sierra madre', 'glendora',
# 'san dimas', 'covina', 'west covina', 'charter oak', 'walnut', 'la puente', 'valinda', 'vincent', 'monterey park', 'rosemead', 'temple city', 'hacienda heights',
# 'rowland heights', 'citrus', 'south el monte', 'avocado heights', 'baldwin park', 'south san jose hills')
f = pd.DataFrame(columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
f = f.append({'city':'san gabriel', 'pop_2020':39568,'pop_2010':39718, 'pop_2000':39804, 'pop_1990':37120}, ignore_index=True)
f = f.append({'city':'south san gabriel', 'pop_2020':7920,'pop_2010':8070, 'pop_2000':7595, 'pop_1990':7700}, ignore_index=True)
f = f.append({'city':'east san gabriel', 'pop_2020':22769,'pop_2010':14874, 'pop_2000':14512, 'pop_1990':12736}, ignore_index=True)
f = f.append({'city':'south pasadena', 'pop_2020':26943,'pop_2010':25619, 'pop_2000':24292, 'pop_1990':23936}, ignore_index=True)
f = f.append({'city':'san marino', 'pop_2020':12513,'pop_2010':13147, 'pop_2000':12945, 'pop_1990':12959}, ignore_index=True)
f = f.append({'city':'east pasadena', 'pop_2020':6021,'pop_2010':6144, 'pop_2000':6045, 'pop_1990':5910}, ignore_index=True)
f = f.append({'city':'sierra madre', 'pop_2020':11268,'pop_2010':10917, 'pop_2000':10578, 'pop_1990':10762}, ignore_index=True)
f = f.append({'city':'glendora', 'pop_2020':52558,'pop_2010':50073, 'pop_2000':49415, 'pop_1990':47828}, ignore_index=True)
f = f.append({'city':'san dimas', 'pop_2020':34924,'pop_2010':33371, 'pop_2000':34980, 'pop_1990':32397}, ignore_index=True)
f = f.append({'city':'covina', 'pop_2020':51268,'pop_2010':47796, 'pop_2000':46837, 'pop_1990':43207}, ignore_index=True)
f = f.append({'city':'west covina', 'pop_2020':109501,'pop_2010':106098, 'pop_2000':105080, 'pop_1990':96086}, ignore_index=True)
f = f.append({'city':'charter oak', 'pop_2020':9739,'pop_2010':9310, 'pop_2000':9027, 'pop_1990':8858}, ignore_index=True)
f = f.append({'city':'walnut', 'pop_2020':28430,'pop_2010':29172, 'pop_2000':30004, 'pop_1990':29105}, ignore_index=True)
f = f.append({'city':'la puente', 'pop_2020':38062,'pop_2010':39816, 'pop_2000':41063, 'pop_1990':36955}, ignore_index=True)
f = f.append({'city':'valinda', 'pop_2020':22437,'pop_2010':22822, 'pop_2000':46837, 'pop_1990':18735}, ignore_index=True)
f = f.append({'city':'vincent', 'pop_2020':15714,'pop_2010':15922, 'pop_2000':15097, 'pop_1990':13713}, ignore_index=True)
f = f.append({'city':'monterey park', 'pop_2020':61096,'pop_2010':60269, 'pop_2000':60051, 'pop_1990':60738}, ignore_index=True)
f = f.append({'city':'rosemead', 'pop_2020':51185,'pop_2010':53764, 'pop_2000':53505, 'pop_1990':51638}, ignore_index=True)
f = f.append({'city':'temple city', 'pop_2020':36494,'pop_2010':35558, 'pop_2000':33377, 'pop_1990':31100}, ignore_index=True)
f = f.append({'city':'hacienda heights', 'pop_2020':54191,'pop_2010':54038, 'pop_2000':53122, 'pop_1990':52354}, ignore_index=True)
f = f.append({'city':'rowland heights', 'pop_2020':54191,'pop_2010':54038, 'pop_2000':53122, 'pop_1990':52354}, ignore_index=True)
f = f.append({'city':'citrus', 'pop_2020':10243,'pop_2010':10866, 'pop_2000':10581, 'pop_1990':9481}, ignore_index=True)
f = f.append({'city':'south el monte', 'pop_2020':19567,'pop_2010':20116, 'pop_2000':21144, 'pop_1990':20850}, ignore_index=True)
f = f.append({'city':'avocado heights', 'pop_2020':13317,'pop_2010':15411, 'pop_2000':15148, 'pop_1990':14232}, ignore_index=True)
f = f.append({'city':'baldwin park', 'pop_2020':72176,'pop_2010':75390, 'pop_2000':75837, 'pop_1990':69330}, ignore_index=True)
f = f.append({'city':'south san jose hills', 'pop_2020':19855,'pop_2010':20551, 'pop_2000':20218, 'pop_1990':17814}, ignore_index=True)
f = f.set_index('city')
t3 = f.sum()
t3.name = 'foothill'
f = f.append(t3)
missing = missing.append({'city':'foothill', 'pop_2020':881950,'pop_2010':872870, 'pop_2000':890216, 'pop_1990':817898},ignore_index=True)

# Harbor (includes: 'carson','west carson', 'long beach', 'lakewood','hawaiian gardens', 'signal hill')
h = pd.DataFrame(columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
h = h.append({'city':'carson', 'pop_2020':95558,'pop_2010':91714, 'pop_2000':89730, 'pop_1990':83995}, ignore_index=True)
h = h.append({'city':'west carson', 'pop_2020':22870,'pop_2010':21699, 'pop_2000':21138, 'pop_1990':20143}, ignore_index=True)
h = h.append({'city':'long beach', 'pop_2020':466742,'pop_2010':462257, 'pop_2000':461522, 'pop_1990':429433}, ignore_index=True)
h = h.append({'city':'lakewood', 'pop_2020':82496,'pop_2010':80048, 'pop_2000':79345, 'pop_1990':73557}, ignore_index=True)
h = h.append({'city':'hawaiian gardens', 'pop_2020':14149,'pop_2010':14254, 'pop_2000':14779, 'pop_1990':13639}, ignore_index=True)
h = h.append({'city':'signal hill', 'pop_2020':11848,'pop_2010':11016, 'pop_2000':9333, 'pop_1990':8371}, ignore_index=True)
h = h.set_index('city')
t4 = h.sum()
t4.name = 'harbor'
h = h.append(t4)
missing = missing.append({'city':'harbor', 'pop_2020':693663,'pop_2010':680988, 'pop_2000':675847, 'pop_1990':629138},ignore_index=True)

# West (include: 'west hollywood', 'beverly hills', 'santa monica', 'ladera heights', 'marina del rey','culver city',)
w = pd.DataFrame(columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
w = w.append({'city':'west hollywood', 'pop_2020':35757,'pop_2010':34399, 'pop_2000':35716, 'pop_1990':36118}, ignore_index=True)
w = w.append({'city':'beverly hills', 'pop_2020':32701,'pop_2010':34109, 'pop_2000':33784, 'pop_1990':31971}, ignore_index=True)
w = w.append({'city':'santa monica', 'pop_2020':93076,'pop_2010':89736, 'pop_2000':84084, 'pop_1990':86905}, ignore_index=True)
w = w.append({'city':'ladera heights', 'pop_2020':6654,'pop_2010':6498, 'pop_2000':6568, 'pop_1990':6316}, ignore_index=True)
w = w.append({'city':'marina del rey', 'pop_2020':11373,'pop_2010':8866, 'pop_2000':8176, 'pop_1990':7431}, ignore_index=True)
w = w.append({'city':'culver city', 'pop_2020':40779,'pop_2010':38883, 'pop_2000':38816, 'pop_1990':38793}, ignore_index=True)
w = w.set_index('city')
t5 = w.sum()
t5.name = 'west'
w = w.append(t5)
missing = missing.append({'city':'west', 'pop_2020':220340,'pop_2010':212491, 'pop_2000':207144, 'pop_1990':207534},ignore_index=True)

# South (includes: 'view park windsor hills', 'westmont','west athens', 'willowbrook','florencegraham')
s = pd.DataFrame(columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
s = s.append({'city':'view park windsor hills', 'pop_2020':11419,'pop_2010':11075, 'pop_2000':10958, 'pop_1990':11769}, ignore_index=True)
s = s.append({'city':'westmont', 'pop_2020':33913,'pop_2010':31853, 'pop_2000':31623, 'pop_1990':31044}, ignore_index=True)
s = s.append({'city':'west athens', 'pop_2020':9393,'pop_2010':8729, 'pop_2000':9101, 'pop_1990':8859}, ignore_index=True)
s = s.append({'city':'willowbrook', 'pop_2020':24295,'pop_2010':35983, 'pop_2000':34138, 'pop_1990':32772}, ignore_index=True)
s = s.append({'city':'florence graham', 'pop_2020':61983,'pop_2010':63387, 'pop_2000':60197, 'pop_1990':57147}, ignore_index=True)
s = s.set_index('city')
t6 = s.sum()
t6.name = 'south'
s = s.append(t6)
missing = missing.append({'city':'south', 'pop_2020':141003,'pop_2010':151027, 'pop_2000':146017, 'pop_1990':141591},ignore_index=True)

# Southeast (include:'artesia','vernon', 'bellflower','downey', 'south gate', 'bell','cudahy','bell gardens', 'huntington park',
# 'la mirada', 'santa fe springs','maywood','norwalk','paramount','south whittier', 'pico rivera','montebello','commerce','cerritos',
# 'east whittier (formerly east la mirada)')
se = pd.DataFrame(columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
se = se.append({'city':'artesia', 'pop_2020':16395,'pop_2010':16522, 'pop_2000':16380, 'pop_1990':15464}, ignore_index=True)
se = se.append({'city':'vernon', 'pop_2020':222,'pop_2010':112, 'pop_2000':91, 'pop_1990':152}, ignore_index=True)
se = se.append({'city':'bellflower', 'pop_2020':79190,'pop_2010':76616, 'pop_2000':72878, 'pop_1990':61815}, ignore_index=True)
se = se.append({'city':'downey', 'pop_2020':114355,'pop_2010':111772, 'pop_2000':107323, 'pop_1990':91444}, ignore_index=True)
se = se.append({'city':'south gate', 'pop_2020':92726,'pop_2010':94396, 'pop_2000':96375, 'pop_1990':86284}, ignore_index=True)
se = se.append({'city':'bell', 'pop_2020':33559,'pop_2010':35477, 'pop_2000':36664, 'pop_1990':34365}, ignore_index=True)
se = se.append({'city':'cudahy', 'pop_2020':22811,'pop_2010':23805, 'pop_2000':24208, 'pop_1990':22817}, ignore_index=True)
se = se.append({'city':'bell gardens', 'pop_2020':39501,'pop_2010':42072, 'pop_2000':44054, 'pop_1990':42355}, ignore_index=True)
se = se.append({'city':'huntington park', 'pop_2020':54883,'pop_2010':58114, 'pop_2000':61348, 'pop_1990':56065}, ignore_index=True)
se = se.append({'city':'la mirada', 'pop_2020':48008,'pop_2010':48527, 'pop_2000':46783, 'pop_1990':40452}, ignore_index=True)
se = se.append({'city':'santa fe springs', 'pop_2020':19219,'pop_2010':16223, 'pop_2000':17438, 'pop_1990':15520}, ignore_index=True)
se = se.append({'city':'maywood', 'pop_2020':25138,'pop_2010':27395, 'pop_2000':28083, 'pop_1990':27850	}, ignore_index=True)
se = se.append({'city':'norwalk', 'pop_2020':102773,'pop_2010':105549, 'pop_2000':103298, 'pop_1990':94279	}, ignore_index=True)
se = se.append({'city':'paramount', 'pop_2020':53733,'pop_2010':54098, 'pop_2000':55266, 'pop_1990':47669	}, ignore_index=True)
se = se.append({'city':'south whittier', 'pop_2020':56415,'pop_2010':57156, 'pop_2000':55193, 'pop_1990':49514	}, ignore_index=True)
se = se.append({'city':'pico rivera', 'pop_2020':62088,'pop_2010':62942, 'pop_2000':63428, 'pop_1990':59177	}, ignore_index=True)
se = se.append({'city':'montebello', 'pop_2020':62640,'pop_2010':62500, 'pop_2000':62150, 'pop_1990':59564	}, ignore_index=True)
se = se.append({'city':'commerce', 'pop_2020':12378,'pop_2010':12823, 'pop_2000':12568, 'pop_1990':12135	}, ignore_index=True)
se = se.append({'city':'cerritos', 'pop_2020':49578,'pop_2010':49041, 'pop_2000':51488, 'pop_1990':53240	}, ignore_index=True)
se = se.append({'city':'east whitter (formerly east la mirada)', 'pop_2020':10394,'pop_2010':9757, 'pop_2000':9538, 'pop_1990':9367	}, ignore_index=True)
se = se.set_index('city')
t7 = se.sum()
t7.name = 'southeast'
se = se.append(t7)
missing = missing.append({'city':'southeast', 'pop_2020':956006,'pop_2010':964897, 'pop_2000':964554, 'pop_1990':879528},ignore_index=True)

# Southwest (include: 'manhattan beach', 'torrance', 'inglewood', 'gardena', 'lennox', 'el segundo', 'hawthorne', 'redondo beach', 'lawndale',
# 'hermosa beach', 'rolling hills estates', 'palos verdes estates', 'rancho palos verdes', 'lomita', 'alondra park')
sw = pd.DataFrame(columns = ['city', 'pop_2020', 'pop_2010', 'pop_2000', 'pop_1990'])
sw = sw.append({'city':'manhattan beach', 'pop_2020':35506,'pop_2010':35135, 'pop_2000':33852, 'pop_1990':32063	}, ignore_index=True)
sw = sw.append({'city':'torrance', 'pop_2020':147067,'pop_2010':145438, 'pop_2000':137946, 'pop_1990':133107}, ignore_index=True)
sw = sw.append({'city':'inglewood', 'pop_2020':107762,'pop_2010':109673, 'pop_2000':112580, 'pop_1990':109602	}, ignore_index=True)
sw = sw.append({'city':'gardena', 'pop_2020':61027,'pop_2010':58829, 'pop_2000':57746, 'pop_1990':49847	}, ignore_index=True)
sw = sw.append({'city':'lennox', 'pop_2020':20323,'pop_2010':22753, 'pop_2000':22950, 'pop_1990':22757	}, ignore_index=True)
sw = sw.append({'city':'el segundo', 'pop_2020':17272,'pop_2010':16654, 'pop_2000':16033, 'pop_1990':15223	}, ignore_index=True)
sw = sw.append({'city':'hawthorne', 'pop_2020':88083,'pop_2010':84293, 'pop_2000':84112	, 'pop_1990':71349	}, ignore_index=True)
sw = sw.append({'city':'redondo beach', 'pop_2020':71576,'pop_2010':66748, 'pop_2000':63261, 'pop_1990':60167	}, ignore_index=True)
sw = sw.append({'city':'lawndale', 'pop_2020':31807,'pop_2010':32769, 'pop_2000':31711, 'pop_1990':27331	}, ignore_index=True)
sw = sw.append({'city':'hermosa beach', 'pop_2020':19728,'pop_2010':19506, 'pop_2000':18566, 'pop_1990':18219	}, ignore_index=True)
sw = sw.append({'city':'rolling hills estates', 'pop_2020':8280,'pop_2010':8067, 'pop_2000':7676, 'pop_1990':7789	}, ignore_index=True)
sw = sw.append({'city':'palos verdes estates', 'pop_2020':13347,'pop_2010':13438, 'pop_2000':13340, 'pop_1990':13512	}, ignore_index=True)
sw = sw.append({'city':'rancho palos verdes', 'pop_2020':42287,'pop_2010':41643, 'pop_2000':41145, 'pop_1990':41659	}, ignore_index=True)
sw = sw.append({'city':'lomita', 'pop_2020':20921,'pop_2010':20256, 'pop_2000':20046, 'pop_1990':19382	}, ignore_index=True)
sw = sw.append({'city':'alondra park', 'pop_2020':8569,'pop_2010':8592, 'pop_2000':8622, 'pop_1990':12215	}, ignore_index=True)
sw = sw.set_index('city')
t8 = sw.sum()
t8.name = 'southwest'
sw = sw.append(t8)
missing = missing.append({'city':'southwest', 'pop_2020':693555,'pop_2010':683794, 'pop_2000':669586, 'pop_1990':634222},ignore_index=True)

# Make all estimates numeric
missing[['pop_2020', 'pop_2010', 'pop_2000', 'pop_1990']] = missing[['pop_2020', 'pop_2010', 'pop_2000', 'pop_1990']].apply(pd.to_numeric)
# DROPPING: HOLLYWOOD/WILSHIRE, WEST VALLEY, SAN ANTONIO, NORTHEAST (an alternative would be to find data from the census through zip codes)
warnings.filterwarnings('ignore') # Suppress the many FutureWarnings raised later by the append method, which pandas is deprecating in favor of concat
missing
Out[8]:
city pop_2020 pop_2010 pop_2000 pop_1990
0 antelope valley 375031.0 340219.0 259191.0 185207.0
1 central 3898747.0 3792621.0 3694820.0 3485398.0
2 east la 118786.0 126496.0 124283.0 126379.0
3 east valley 133008.0 128841.0 125755.0 117952.0
4 foothill 881950.0 872870.0 890216.0 817898.0
5 harbor 693663.0 680988.0 675847.0 629138.0
6 west 220340.0 212491.0 207144.0 207534.0
7 south 141003.0 151027.0 146017.0 141591.0
8 southeast 956006.0 964897.0 964554.0 879528.0
9 southwest 693555.0 683794.0 669586.0 634222.0

Now that the data from the missing cities has been found and placed in a dataframe, we concatenate it with the dataframe of all the cities' population estimates (pop_est; dataset 5). This updated table is then merged with the mental health cases per region (flipped; dataset 4).

This combined dataset will be used to check whether a relationship exists between mental health disorders and population, and how that relationship may have changed over the years. We also renamed the city column to district, since the mental health data is broken up into health districts.

Note that some regions were dropped during the merge because of insufficient information. These regions were: Hollywood/Wilshire, West Valley, San Antonio, and Northeast. It is unfortunate that our datasets are limited in this way, but the drop is justified because these regions are only a small subset of the health districts and should not substantially affect our analyses.

In [9]:
# Concatenate missing data to combined dataset
pop_est = pd.concat([missing,pop_est])
pop_est = pop_est.reset_index(drop=True)
# Merge the population estimates with the depression data
districts_health = pd.merge(flipped, pop_est, on='city', how='left')
districts_health = districts_health.rename(columns={'city':'district'})
districts_health = districts_health.dropna()
districts_health = districts_health.reset_index(drop=True)
districts_health
Out[9]:
district mh_2018 mh_2015 mh_2011 mh_2007 mh_2005 mh_2002 mh_1999 pop_2020 pop_2010 pop_2000 pop_1990
0 alhambra 11.2 9.7 7.1 9.1 10.2 8.3 5.7 82868.0 83089.0 85804.0 82106.0
1 antelope valley 18.4 19.2 14.6 17.1 15.7 9.2 9.7 375031.0 340219.0 259191.0 185207.0
2 bellflower 18.2 13.8 12.8 10.7 11.3 8.8 7.0 79190.0 76616.0 72878.0 61815.0
3 central 13.9 20.0 10.6 15.4 13.0 8.8 9.2 3898747.0 3792621.0 3694820.0 3485398.0
4 compton 15.5 12.2 10.1 12.7 13.2 5.4 4.8 95740.0 96455.0 93493.0 90454.0
5 east la 12.5 12.4 16.3 15.2 16.6 8.1 9.0 118786.0 126496.0 124283.0 126379.0
6 east valley 14.0 13.5 11.9 13.6 13.9 12.4 8.5 133008.0 128841.0 125755.0 117952.0
7 el monte 14.6 6.2 5.4 12.3 14.2 5.4 5.7 109450.0 113475.0 115965.0 106209.0
8 foothill 14.1 12.5 13.7 12.4 11.1 11.7 10.9 881950.0 872870.0 890216.0 817898.0
9 glendale 19.3 12.1 19.2 18.2 11.5 7.9 8.2 196543.0 191719.0 194973.0 180038.0
10 harbor 24.5 12.1 14.5 11.8 13.0 4.1 4.1 693663.0 680988.0 675847.0 629138.0
11 inglewood 13.4 12.1 8.3 10.9 14.1 9.4 10.1 107762.0 109673.0 112580.0 109602.0
12 long beach 14.5 16.0 12.1 17.0 17.0 10.6 11.9 466742.0 462257.0 461522.0 429433.0
13 pasadena 13.7 20.9 13.0 14.3 11.1 16.3 10.1 138699.0 137122.0 133936.0 131591.0
14 pomona 10.9 10.2 14.9 14.2 9.7 8.6 6.7 151713.0 149058.0 149473.0 131723.0
15 san fernando 15.7 14.6 13.7 11.2 10.7 8.9 5.7 23946.0 23645.0 23564.0 22580.0
16 south 13.1 14.8 16.3 12.8 16.8 6.7 8.9 141003.0 151027.0 146017.0 141591.0
17 southeast 14.5 12.0 9.5 18.7 6.9 10.1 6.2 956006.0 964897.0 964554.0 879528.0
18 southwest 19.7 10.3 9.4 12.4 11.7 7.1 7.9 693555.0 683794.0 669586.0 634222.0
19 torrance 19.9 11.6 9.6 13.7 10.5 10.4 10.0 147067.0 145438.0 137946.0 133107.0
20 west 23.0 15.2 13.4 13.2 16.6 11.7 8.8 220340.0 212491.0 207144.0 207534.0
21 whittier 13.2 12.7 11.4 14.5 15.3 12.0 9.7 87306.0 85331.0 83680.0 77671.0

Our sixth and final dataset provides information about the areas of LA's health districts. We will use this later to determine the population density in each district.
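
For reference, the density step applied later is simply population divided by district area; a minimal sketch using Alhambra's figures from the tables shown further below:

# Sketch of the population density calculation used later (values taken from the tables below)
alhambra_pop_2020 = 82868                                          # residents (pop_2020 in districts_health)
alhambra_area_sq_mi = 39                                           # square miles (from the health district PDFs)
alhambra_density_2020 = alhambra_pop_2020 / alhambra_area_sq_mi    # roughly 2125 people per square mile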

In [10]:
# DATASET 6: A Closer Look at the Health Districts
#Scrape health districts webpage
webpage = requests.get('https://www.lacounty.hiv/health-districts/').text

# Parse webpage HTML
html = BeautifulSoup(webpage, 'html.parser')
# Select .icon-dl HTML class
district_links = html.select('.icon-dl')
district_names = str(html.select('[data-district]')[0].select('.district-name')[0]).split('</span>')[1]

#Open counteez.csv file
with open('counteez.csv', 'r') as districts_file:
    # Read lines from csv file
    districts = districts_file.readlines()
# Extract district names
districts = list(map(lambda district: district.split(',')[1].strip(), districts))
districts_data = list()
#Iterate over districts
for district in districts:
    # Extract text from district PDF files
    text = extract_text('district_pdfs/document_' + district.capitalize() + '.pdf')
    text = text.split('\n')
    area_square_miles = ''
    # population = ''

    # Find area for district from extracted text
    for index, line in enumerate(text):
        if 'square miles' in line:
            area_square_miles = line
            # population = text[index + 1]
            break

    area_square_miles = area_square_miles.split('s')[0].strip()
    area_square_miles = area_square_miles.replace(',', '')
    area_square_miles = int(area_square_miles)
    # population = population.split('(')[0]
    # population = population.replace(',', '')
    # population = int(population)
    districts_data.append((district, area_square_miles))

districts_data = np.array(districts_data)

Data Analysis & Results

The main focuses of our analysis were to examine the trend of depression rates over time, and to look at the association between depression rates and population density. We start here with the former.

Depression Rates Over Time

The first part of our analysis looks at how depression rates have changed over time. As mentioned before, data availability constrains us to look specifically at the LA region over the past couple decades.

In [11]:
# Graphing Percentage of Population with Depression in Los Angeles, CA using Seaborn lineplot and renaming axes to be more descriptive
sns.lineplot(data=overall, x='year', y='population_depression_diagnosis_percent').set(
    title="Percentage of Population with Depression in Los Angeles, CA", xlabel="Year", ylabel="Percent of LA Population with Depression")
sns.set(style='white', font_scale=2)
plt.rcParams['figure.figsize']=(8,5)
plt.show()

The figure above is a line chart displaying the percent of the population in Los Angeles, CA that had ever been diagnosed with depression in a given survey year. Although there was a brief decline in the percent of the LA population diagnosed with depression between 2007 and 2011, the overall trend has been a substantial increase, from about 9% of the population in 1999 to over 16% in 2018.

Below we have graphed a more thorough breakdown of depression rates by district in LA over time.

In [12]:
# Graphing Depression Rates in LA Health Districts using Seaborn lineplot and renaming axes to be more descriptive
# Setting years as indices to make graphing wideform data easier
df4_graph = df4.iloc[:, 2:]
df4_graph.index = df4.year

# Creating figure
sns.set(style='white', font_scale=1)
plt.rcParams['figure.figsize']=(8,5)
sns.lineplot(data=df4_graph, legend=False).set(
    title="Percentage of Population with Depression", xlabel="Year", ylabel="Percent with Depression")
plt.show()

The legend for this plot is not provided because it would be cumbersome with the number of districts in the data; the main point we see from this graph is that depression rates have generally been increasing over time for nearly all of LA's healthcare districts. The dip between 2007 and 2011 seen in the first plot is also visible in the plot above, albeit subtly.

Of course, there is a decent amount of variability between health regions, and depression rates over the plotted time period increased very little in some health regions, while increasing dramatically in others.

To get a better idea of how depression rates have changed over time in individual health districts, we plotted a regression line for each health district.

In [13]:

df4_graph = df4.drop(columns = ['LA Overall'])

#Regression line for each city by time. 
df4_graph['year'] = df4_graph['year'].astype(object).astype(str).apply(remove_date)
sns.set(rc={'figure.figsize':(40,30)})
curr = 1;
plt.figure(0)
temp = df4_graph.iloc[:, 2:]
for i in temp:
  ax = plt.subplot(5, 5, curr)
  ax = sns.regplot(x="year", y= i , data = df4_graph);
  ax.xaxis.set_major_formatter(FuncFormatter(lambda x, _: int(x)))
  ax.set_ylabel("Depression in " + i)
  curr = curr + 1
plt.show()

From the above figures, we can confirm our earlier observation that there is a positive relationship between time and depression rates in all of LA's health districts, meaning that depression rates are generally increasing everywhere. However, there is a lot of variability between health districts, with some depression-time relationships being steeper or stronger than others.

  • Bellflower, San Fernando and Northeast appear to have the strongest linear relationships with lower standard error; meanwhile, Inglewood, El Monte, East LA and Southeast have weaker linear relationships and higher standard error.
  • Harbor and Northeast have steep increases in depression rates. Long Beach, West Valley, and Whittier appear to have very mild increases, though this is in part due to weak linear relationships.

Populations

As preparation for examining how depression and population density are related, we did some preliminary analyses on population data. First, we looked at how population was distributed over LA's health districts in 2000 and 2020 by creating histograms:

In [14]:
# Graph style and sizing
sns.set(style='white', font_scale=2)
plt.rcParams['figure.figsize']=(17,7)
sns.despine()

# Plotting 2020 and 2000 graphs side by side
plt.subplot(1,2,1)
sns.histplot(districts_health['pop_2020'],color='teal',bins=10,kde=True).set(
    title='LA District Population Counts(2020)', xlabel='District populations (millions)', ylabel='Frequency')
plt.subplot(1,2,2)
sns.histplot(districts_health['pop_2000'],color='blue',bins=10,kde=True).set(
    title='LA District Population Counts(2000)', xlabel='District populations (millions)', ylabel='Frequency')
plt.show()

Our histograms show that LA district populations have a non-normal, right-skewed distribution, with most districts having fewer than 1 million people. Comparing the two histograms shows that the population has remained similar in most districts, with the exception of a few that either grew significantly or decreased in population.

The outlier is a district with over 3 million people, which likely also makes it the district with the highest population density. To identify this outlier, we'll take a look at the district population counts with a bar plot.

In [15]:
sns.set(rc={'figure.figsize':(20,12)}, font_scale = 1.5)
city_cts = sns.barplot(x=districts_health['pop_2020'], y='district', data=districts_health, order=districts_health.sort_values('pop_2020',ascending=False).district)

Based on the bar chart above, we can see that the Central district has the highest population count, followed by Southeast, Foothill, and then Harbor. The least populous district is San Fernando.

Now that we know which districts are the most populous, we'll take a look at the districts with the highest depression rates. It is important to note that the population data is from 2020, while our mental health data goes up to 2018.

In [16]:
sns.set(rc={'figure.figsize':(20,12)}, font_scale = 1.5)
city_cts = sns.barplot(x=districts_health['mh_2018'], y='district', data=districts_health, order=districts_health.sort_values('mh_2018',ascending=False).district)

Based on the bar chart above, we can see that Harbor, West, and Torrance have the highest depression rates. Central, our most populous (and likely densest) district, is actually on the lower end of depression rates.

One important limitation of our population data is that it comes from the government census, which only happens once per decade. In particular, we have data about LA city populations in 2020, 2010, 2000, and 1990.

In order to have more accurate data for analysis, we decided to use linear regression to interpolate data points for 2018, 2015, 2011, 2007, 2005, 2002, and 1999 in each city. (These are the years for which we have mental health data.)

  • We will save depression rates and population predictions of each city in the DataFrame pop_predictions.

  • Slope, intercept, R, p and standard error of each regression line will be saved in the DataFrame RL_Cities.

Although population tends to grow exponentially rather than linearly, the four census data points we have for each district can usually be approximated quite closely with a line. Where the linear fit is poor, the data do not show an exponential pattern either, so random variability is likely responsible for the odd-looking sequences.
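
As a concrete single-district illustration of this interpolation (a minimal sketch using Alhambra's census figures from the merged table above; the loop in the next cell repeats this for every district and survey year):

# Fit a line through one district's four census points and evaluate it at a survey year
x = [2020, 2010, 2000, 1990]                      # census years
y = [82868, 83089, 85804, 82106]                  # Alhambra populations from districts_health
slope, intercept, r, p, std_err = stats.linregress(x, y)
alhambra_pop_2018 = slope * 2018 + intercept      # interpolated 2018 population estimate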

In [17]:
pop_predictions = pd.DataFrame(np.array([[2018],[2015],[2011],[2007],[2005],[2002],[1999]]),columns=['year'])
RL_Cities = pd.DataFrame(columns = ['district', 'slope', 'intercept', 'r', 'p', 'std_err'])
for index, row  in districts_health.iterrows():
  district = row['district']
  mh_2018 = row['mh_2018']
  mh_2015 = row['mh_2015']
  mh_2011	= row['mh_2011']
  mh_2007= 	row['mh_2007']
  mh_2005	= row['mh_2005']
  mh_2002	= row['mh_2002']
  mh_1999	= row['mh_1999']

  #Getting regression line with current datas so that I can predict population in each city in the certain year
  y = [row['pop_2020'],row['pop_2010'],row['pop_2000'],row['pop_1990']]
  x = [2020,2010,2000,1990]
  slope, intercept, r, p, std_err = stats.linregress(x, y)

  pop_2018=	slope * 2018 + intercept
  pop_2015= slope * 2015 + intercept
  pop_2011=	slope * 2011 + intercept
  pop_2007=	slope * 2007 + intercept
  pop_2005	= slope * 2005 + intercept
  pop_2002	= slope * 2002 + intercept
  pop_1999	= slope * 1999 + intercept
  RL_Cities = RL_Cities.append({'district':district, 'slope': slope, 'intercept':intercept,'r':r, 'std_err':std_err, 'p':p	}, ignore_index=True)
  pop_predictions[district + "_Mh"] = [mh_2018,mh_2015,mh_2011,mh_2007 ,mh_2005,mh_2002,mh_1999 ]
  pop_predictions[district + "_Pop"] = [pop_2018,pop_2015,pop_2011,pop_2007 ,pop_2005,pop_2002,pop_1999 ]
pop_predictions
Out[17]:
year alhambra_Mh alhambra_Pop antelope valley_Mh antelope valley_Pop bellflower_Mh bellflower_Pop central_Mh central_Pop compton_Mh ... southeast_Mh southeast_Pop southwest_Mh southwest_Pop torrance_Mh torrance_Pop west_Mh west_Pop whittier_Mh whittier_Pop
0 2018 11.2 83410.98 18.4 374477.0 18.2 79886.94 13.9 3891816.74 15.5 ... 14.5 971117.26 19.7 695276.16 19.9 147307.86 23.0 217566.70 13.2 87469.28
1 2015 9.7 83423.85 19.2 354962.0 13.8 78211.05 20.0 3851681.30 12.2 ... 12.0 964223.95 10.3 689509.95 11.6 145826.70 15.2 216253.75 12.7 86552.60
2 2011 7.1 83441.01 14.6 328942.0 12.8 75976.53 10.6 3798167.38 10.1 ... 9.5 955032.87 9.4 681821.67 9.6 143851.82 13.4 214503.15 11.4 85330.36
3 2007 9.1 83458.17 17.1 302922.0 10.7 73742.01 15.4 3744653.46 12.7 ... 18.7 945841.79 12.4 674133.39 13.7 141876.94 13.2 212752.55 14.5 84108.12
4 2005 10.2 83466.75 15.7 289912.0 11.3 72624.75 13.0 3717896.50 13.2 ... 6.9 941246.25 11.7 670289.25 10.5 140889.50 16.6 211877.25 15.3 83497.00
5 2002 8.3 83479.62 9.2 270397.0 8.8 70948.86 8.8 3677761.06 5.4 ... 10.1 934352.94 7.1 664523.04 10.4 139408.34 11.7 210564.30 12.0 82580.32
6 1999 5.7 83492.49 9.7 250882.0 7.0 69272.97 9.2 3637625.62 4.8 ... 6.2 927459.63 7.9 658756.83 10.0 137927.18 8.8 209251.35 9.7 81663.64

7 rows × 45 columns

In [33]:
RL_Cities
Out[33]:
district slope intercept r p std_err
0 alhambra -4.29 92068.2 -0.034313 0.965687 88.353680
1 antelope valley 6505.00 -12752613.0 0.987809 0.012191 724.874317
2 bellflower 558.63 -1047428.4 0.941711 0.058289 141.115826
3 central 13378.48 -23105955.9 0.981598 0.018402 1840.344546
4 compton 188.20 -283305.5 0.899650 0.100350 64.584278
5 east la -205.66 536334.3 -0.734924 0.265076 134.189196
6 east valley 482.54 -841103.7 0.979251 0.020749 70.611360
7 el monte 72.33 -33746.9 0.216454 0.783546 230.684343
8 foothill 1748.10 -2639207.0 0.690832 0.309168 1293.679082
9 glendale 462.61 -736714.8 0.800321 0.199679 245.062940
10 harbor 1987.16 -3314346.8 0.909948 0.090052 640.412843
11 inglewood -84.27 278865.6 -0.546386 0.453614 91.339960
12 long beach 1126.62 -1803884.6 0.845985 0.154015 502.105312
13 pasadena 245.10 -356088.5 0.992632 0.007368 21.156110
14 pomona 595.55 -1048586.0 0.830927 0.169073 281.976980
15 san fernando 41.79 -60355.2 0.910684 0.089316 13.404428
16 south 32.46 79827.2 0.090081 0.909919 253.765091
17 southeast 2297.77 -3665782.6 0.717381 0.282619 1577.887291
18 southwest 1922.07 -3183461.1 0.955081 0.044919 421.708355
19 torrance 493.72 -849019.1 0.975488 0.024512 78.753653
20 west 437.65 -665611.0 0.919537 0.080463 132.263152
21 whittier 305.56 -529150.8 0.948891 0.051109 71.863379
In [18]:
#Population changes in each district in LA county over time 
sns.set(rc={'figure.figsize':(40,30)})
curr = 1
index = 0
plt.figure(0)
x = [2020,2018,2015,2011,2010,2007,2005,2002,2000,1999,1990]
plt.figure(0)

#graphing each district individually
for index, row  in districts_health.iterrows():
    
    district = row['district']
    sns.set(rc={'figure.figsize':(40,30)})
    y = [row['pop_2020'],row['pop_2010'],row['pop_2000'],row['pop_1990']]
    x = [2020,2010,2000,1990]
    ax = plt.subplot(5, 5, curr)
    ax = plt.scatter(x, y)
    x = np.linspace(1990, 2020, 7)
    slope = RL_Cities['slope'][index]
    intercept = RL_Cities['intercept'][index]
    index = index + 1
    y = slope*x + intercept
    plt.plot(x, y)
    curr = curr + 1
    
    plt.xlabel('Year') 
    plt.ylabel('Population')
    plt.title("The population change in  " + district)
    
plt.show() 

The above figures show the population regression lines for each LA health district. Most districts show a positive relationship, meaning their populations generally increased over this period (the exceptions being Alhambra, Inglewood, and East LA).
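For reference, one way to tally these trends is to count the positive and negative slopes stored in RL_Cities; a minimal sketch:

# Tally population trends (sketch) using the per-district regression results in RL_Cities
increasing = RL_Cities.loc[RL_Cities['slope'] > 0, 'district'].tolist()
decreasing = RL_Cities.loc[RL_Cities['slope'] < 0, 'district'].tolist()
print(len(increasing), 'districts with increasing population:', increasing)
print(len(decreasing), 'districts with decreasing population:', decreasing)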

Depression & Population Density

With individual analyses of depression rates and population out of the way, we can now begin examining the relationship between depression and population density.

LA County's health page provides the area of each health district. We combine this information (dataset 6) with population data to obtain population density information for census years.

In [19]:
df_districts = pd.DataFrame(districts_data)
# Rename columns
df_districts = df_districts.rename(columns={0: 'district', 1: 'area_square_miles'})
# Convert area_square_miles column to numeric values
df_districts['area_square_miles'] = df_districts['area_square_miles'].apply(pd.to_numeric)

# Sort DataFrame in alphabetical order so it lines up with districts_health
df_districts = df_districts.sort_values('district', ascending=True).reset_index(drop=True)
districts_health['area_square_miles'] = df_districts['area_square_miles']

# Population density for each census year (1990, 2000, 2010, 2020)
for year in range(1990, 2030, 10):
    districts_health['pop_density_{}'.format(year)] = districts_health['pop_{}'.format(year)] / districts_health['area_square_miles']
# Round decimals
districts_health = districts_health.round(decimals=1)
districts_health
Out[19]:
district mh_2018 mh_2015 mh_2011 mh_2007 mh_2005 mh_2002 mh_1999 pop_2020 pop_2010 pop_2000 pop_1990 area_square_miles pop_density_1990 pop_density_2000 pop_density_2010 pop_density_2020
0 alhambra 11.2 9.7 7.1 9.1 10.2 8.3 5.7 82868.0 83089.0 85804.0 82106.0 39 2105.3 2200.1 2130.5 2124.8
1 antelope valley 18.4 19.2 14.6 17.1 15.7 9.2 9.7 375031.0 340219.0 259191.0 185207.0 1384 133.8 187.3 245.8 271.0
2 bellflower 18.2 13.8 12.8 10.7 11.3 8.8 7.0 79190.0 76616.0 72878.0 61815.0 39 1585.0 1868.7 1964.5 2030.5
3 central 13.9 20.0 10.6 15.4 13.0 8.8 9.2 3898747.0 3792621.0 3694820.0 3485398.0 32 108918.7 115463.1 118519.4 121835.8
4 compton 15.5 12.2 10.1 12.7 13.2 5.4 4.8 95740.0 96455.0 93493.0 90454.0 26 3479.0 3595.9 3709.8 3682.3
5 east la 12.5 12.4 16.3 15.2 16.6 8.1 9.0 118786.0 126496.0 124283.0 126379.0 23 5494.7 5403.6 5499.8 5164.6
6 east valley 14.0 13.5 11.9 13.6 13.9 12.4 8.5 133008.0 128841.0 125755.0 117952.0 192 614.3 655.0 671.0 692.8
7 el monte 14.6 6.2 5.4 12.3 14.2 5.4 5.7 109450.0 113475.0 115965.0 106209.0 62 1713.0 1870.4 1830.2 1765.3
8 foothill 14.1 12.5 13.7 12.4 11.1 11.7 10.9 881950.0 872870.0 890216.0 817898.0 548 1492.5 1624.5 1592.8 1609.4
9 glendale 19.3 12.1 19.2 18.2 11.5 7.9 8.2 196543.0 191719.0 194973.0 180038.0 60 3000.6 3249.6 3195.3 3275.7
10 harbor 24.5 12.1 14.5 11.8 13.0 4.1 4.1 693663.0 680988.0 675847.0 629138.0 176 3574.6 3840.0 3869.2 3941.3
11 inglewood 13.4 12.1 8.3 10.9 14.1 9.4 10.1 107762.0 109673.0 112580.0 109602.0 38 2884.3 2962.6 2886.1 2835.8
12 long beach 14.5 16.0 12.1 17.0 17.0 10.6 11.9 466742.0 462257.0 461522.0 429433.0 52 8258.3 8875.4 8889.6 8975.8
13 pasadena 13.7 20.9 13.0 14.3 11.1 16.3 10.1 138699.0 137122.0 133936.0 131591.0 24 5483.0 5580.7 5713.4 5779.1
14 pomona 10.9 10.2 14.9 14.2 9.7 8.6 6.7 151713.0 149058.0 149473.0 131723.0 135 975.7 1107.2 1104.1 1123.8
15 san fernando 15.7 14.6 13.7 11.2 10.7 8.9 5.7 23946.0 23645.0 23564.0 22580.0 597 37.8 39.5 39.6 40.1
16 south 13.1 14.8 16.3 12.8 16.8 6.7 8.9 141003.0 151027.0 146017.0 141591.0 11 12871.9 13274.3 13729.7 12818.5
17 southeast 14.5 12.0 9.5 18.7 6.9 10.1 6.2 956006.0 964897.0 964554.0 879528.0 8 109941.0 120569.2 120612.1 119500.8
18 southwest 19.7 10.3 9.4 12.4 11.7 7.1 7.9 693555.0 683794.0 669586.0 634222.0 26 24393.2 25753.3 26299.8 26675.2
19 torrance 19.9 11.6 9.6 13.7 10.5 10.4 10.0 147067.0 145438.0 137946.0 133107.0 67 1986.7 2058.9 2170.7 2195.0
20 west 23.0 15.2 13.4 13.2 16.6 11.7 8.8 220340.0 212491.0 207144.0 207534.0 189 1098.1 1096.0 1124.3 1165.8
21 whittier 13.2 12.7 11.4 14.5 15.3 12.0 9.7 87306.0 85331.0 83680.0 77671.0 61 1273.3 1371.8 1398.9 1431.2

We also need to calculate population density with the interpolated population data from earlier.

In [20]:
#Concatenate the populations estimated from the regression lines (pop_predictions)
#onto districts_health, along with the corresponding population densities.
years = [2018, 2015, 2011, 2007, 2005, 2002, 1999]
area_list = districts_health['area_square_miles'].to_numpy()

#The "_Pop" columns in pop_predictions follow the same (alphabetical) district
#order as districts_health, so they can be aligned by position.
pop_cols = [col for col in pop_predictions.columns if "_Pop" in col]

for row_idx, year in enumerate(years):
    pops = np.array([pop_predictions[col].to_numpy()[row_idx] for col in pop_cols])
    districts_health['pop_{}'.format(year)] = pops
    districts_health['pop_density_{}'.format(year)] = (pops / area_list).round(decimals=1)
        
In [21]:
POPULATION =  districts_health[["district", "pop_2020", "pop_2018", "pop_2015", "pop_2011", "pop_2010", "pop_2007", "pop_2005", "pop_2002", "pop_2000", "pop_1999", "pop_1990"]]
POPULATION
Out[21]:
district pop_2020 pop_2018 pop_2015 pop_2011 pop_2010 pop_2007 pop_2005 pop_2002 pop_2000 pop_1999 pop_1990
0 alhambra 82868.0 83410.98 83423.85 83441.01 83089.0 83458.17 83466.75 83479.62 85804.0 83492.49 82106.0
1 antelope valley 375031.0 374477.00 354962.00 328942.00 340219.0 302922.00 289912.00 270397.00 259191.0 250882.00 185207.0
2 bellflower 79190.0 79886.94 78211.05 75976.53 76616.0 73742.01 72624.75 70948.86 72878.0 69272.97 61815.0
3 central 3898747.0 3891816.74 3851681.30 3798167.38 3792621.0 3744653.46 3717896.50 3677761.06 3694820.0 3637625.62 3485398.0
4 compton 95740.0 96482.10 95917.50 95164.70 96455.0 94411.90 94035.50 93470.90 93493.0 92906.30 90454.0
5 east la 118786.0 121312.42 121929.40 122752.04 126496.0 123574.68 123986.00 124602.98 124283.0 125219.96 126379.0
6 east valley 133008.0 132662.02 131214.40 129284.24 128841.0 127354.08 126389.00 124941.38 125755.0 123493.76 117952.0
7 el monte 109450.0 112215.04 111998.05 111708.73 113475.0 111419.41 111274.75 111057.76 115965.0 110840.77 106209.0
8 foothill 881950.0 888458.80 883214.50 876222.10 872870.0 869229.70 865733.50 860489.20 890216.0 855244.90 817898.0
9 glendale 196543.0 196832.18 195444.35 193593.91 191719.0 191743.47 190818.25 189430.42 194973.0 188042.59 180038.0
10 harbor 693663.0 695742.08 689780.60 681831.96 680988.0 673883.32 669909.00 663947.52 675847.0 657986.04 629138.0
11 inglewood 107762.0 108808.74 109061.55 109398.63 109673.0 109735.71 109904.25 110157.06 112580.0 110409.87 109602.0
12 long beach 466742.0 469634.56 466254.70 461748.22 462257.0 457241.74 454988.50 451608.64 461522.0 448228.78 429433.0
13 pasadena 138699.0 138523.30 137788.00 136807.60 137122.0 135827.20 135337.00 134601.70 133936.0 133866.40 131591.0
14 pomona 151713.0 153233.90 151447.25 149065.05 149058.0 146682.85 145491.75 143705.10 149473.0 141918.45 131723.0
15 san fernando 23946.0 23977.02 23851.65 23684.49 23645.0 23517.33 23433.75 23308.38 23564.0 23183.01 22580.0
16 south 141003.0 145331.48 145234.10 145104.26 151027.0 144974.42 144909.50 144812.12 146017.0 144714.74 141591.0
17 southeast 956006.0 971117.26 964223.95 955032.87 964897.0 945841.79 941246.25 934352.94 964554.0 927459.63 879528.0
18 southwest 693555.0 695276.16 689509.95 681821.67 683794.0 674133.39 670289.25 664523.04 669586.0 658756.83 634222.0
19 torrance 147067.0 147307.86 145826.70 143851.82 145438.0 141876.94 140889.50 139408.34 137946.0 137927.18 133107.0
20 west 220340.0 217566.70 216253.75 214503.15 212491.0 212752.55 211877.25 210564.30 207144.0 209251.35 207534.0
21 whittier 87306.0 87469.28 86552.60 85330.36 85331.0 84108.12 83497.00 82580.32 83680.0 81663.64 77671.0
In [22]:
DENSITY =  districts_health[["district", "pop_density_2020", "pop_density_2018", "pop_density_2015", "pop_density_2011", "pop_density_2010", "pop_density_2007", "pop_density_2005", "pop_density_2002", "pop_density_2000", "pop_density_1999", "pop_density_1990"]]
DENSITY
Out[22]:
district pop_density_2020 pop_density_2018 pop_density_2015 pop_density_2011 pop_density_2010 pop_density_2007 pop_density_2005 pop_density_2002 pop_density_2000 pop_density_1999 pop_density_1990
0 alhambra 2124.8 2138.7 2139.1 2139.5 2130.5 2140.0 2140.2 2140.5 2200.1 2140.8 2105.3
1 antelope valley 271.0 270.6 256.5 237.7 245.8 218.9 209.5 195.4 187.3 181.3 133.8
2 bellflower 2030.5 2048.4 2005.4 1948.1 1964.5 1890.8 1862.2 1819.2 1868.7 1776.2 1585.0
3 central 121835.8 121619.3 120365.0 118692.7 118519.4 117020.4 116184.3 114930.0 115463.1 113675.8 108918.7
4 compton 3682.3 3710.8 3689.1 3660.2 3709.8 3631.2 3616.8 3595.0 3595.9 3573.3 3479.0
5 east la 5164.6 5274.5 5301.3 5337.0 5499.8 5372.8 5390.7 5417.5 5403.6 5444.3 5494.7
6 east valley 692.8 690.9 683.4 673.4 671.0 663.3 658.3 650.7 655.0 643.2 614.3
7 el monte 1765.3 1809.9 1806.4 1801.8 1830.2 1797.1 1794.8 1791.3 1870.4 1787.8 1713.0
8 foothill 1609.4 1621.3 1611.7 1598.9 1592.8 1586.2 1579.8 1570.2 1624.5 1560.7 1492.5
9 glendale 3275.7 3280.5 3257.4 3226.6 3195.3 3195.7 3180.3 3157.2 3249.6 3134.0 3000.6
10 harbor 3941.3 3953.1 3919.2 3874.0 3869.2 3828.9 3806.3 3772.4 3840.0 3738.6 3574.6
11 inglewood 2835.8 2863.4 2870.0 2878.9 2886.1 2887.8 2892.2 2898.9 2962.6 2905.5 2884.3
12 long beach 8975.8 9031.4 8966.4 8879.8 8889.6 8793.1 8749.8 8684.8 8875.4 8619.8 8258.3
13 pasadena 5779.1 5771.8 5741.2 5700.3 5713.4 5659.5 5639.0 5608.4 5580.7 5577.8 5483.0
14 pomona 1123.8 1135.1 1121.8 1104.2 1104.1 1086.5 1077.7 1064.5 1107.2 1051.2 975.7
15 san fernando 40.1 40.2 40.0 39.7 39.6 39.4 39.3 39.0 39.5 38.8 37.8
16 south 12818.5 13212.0 13203.1 13191.3 13729.7 13179.5 13173.6 13164.7 13274.3 13155.9 12871.9
17 southeast 119500.8 121389.7 120528.0 119379.1 120612.1 118230.2 117655.8 116794.1 120569.2 115932.5 109941.0
18 southwest 26675.2 26741.4 26519.6 26223.9 26299.8 25928.2 25780.4 25558.6 25753.3 25336.8 24393.2
19 torrance 2195.0 2198.6 2176.5 2147.0 2170.7 2117.6 2102.8 2080.7 2058.9 2058.6 1986.7
20 west 1165.8 1151.1 1144.2 1134.9 1124.3 1125.7 1121.0 1114.1 1096.0 1107.1 1098.1
21 whittier 1431.2 1433.9 1418.9 1398.9 1398.9 1378.8 1368.8 1353.8 1371.8 1338.7 1273.3

The variable "DENSITY" is part of the variable "districts_health" data. It visually shows the population density nicely.

We now examine the relationship between population density and depression rates by performing linear regressions for each year in our data.

In [23]:
#2018
sns.lmplot(y = 'pop_density_2018',
           x =  'mh_2018',
          data = districts_health)
plt.xlabel('Mental Health in 2018') 
plt.ylabel('Population density in 2018') 
plt.show()

The above graph shows the correlation between population density and mental health in LA County in 2018, which is relatively weak, partly due to the extreme high-density outliers.

In [24]:
#2015
sns.lmplot(y = 'pop_density_2015',
           x =  'mh_2015',
          data = districts_health)
plt.xlabel('Mental Health in 2015') 
plt.ylabel('Population density in 2015') 
plt.show()

The above graph shows the association between population density and mental health in LA County in 2015. Here we see a stronger relationship, though the outliers are still present.

In [25]:
sns.lmplot(y = 'pop_density_2011',
           x =  'mh_2011',
          data = districts_health)
plt.xlabel('Mental Health in 2011') 
plt.ylabel('Population density in 2011')
plt.show() 

The above graph shows the correlation between population density and mental health in LA County in 2011.

In [26]:
sns.lmplot(y = 'pop_density_2007',
           x =  'mh_2007',
          data = districts_health)
plt.xlabel('Mental Health in 2007') 
plt.ylabel('Population density in 2007') 
plt.show()

The above graph shows the correlation between population density and mental health in LA County in 2007. It appears quite positive, largely driven by the high-density outlier on the right.

In [27]:
sns.lmplot(y = 'pop_density_2005',
           x =  'mh_2005',
          data = districts_health)
plt.xlabel('Mental Health in 2005') 
plt.ylabel('Population density in 2005') 
plt.show()

The above graph shows the correlation between population density and mental health in LA County in 2005. This one is somewhat negative, partially due to the outliers.

In [28]:
sns.lmplot(y = 'pop_density_2002',
           x =  'mh_2002',
          data = districts_health)
plt.xlabel('Mental Health in 2002') 
plt.ylabel('Population density in 2002') 
plt.show()

The above graph shows the correlation between population density and mental health in LA County in 2002. There is little association.
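To put numbers behind the visual impressions above, one could also compute the correlation coefficient and p-value between depression rate and population density for each survey year; a rough sketch using scipy.stats.pearsonr:

# Per-year Pearson correlation between depression rate and population density (sketch)
from scipy import stats

for year in [2018, 2015, 2011, 2007, 2005, 2002, 1999]:
    r, p = stats.pearsonr(districts_health['mh_{}'.format(year)],
                          districts_health['pop_density_{}'.format(year)])
    print('{}: r = {:.2f}, p = {:.3f}'.format(year, r, p))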

As a supplementary analysis, we create a heatmap to look at the associations among all of the variables at once.

In [31]:
overallheatmap = districts_health.copy()

def heatmap(x, y, **kwargs):
    if 'color' in kwargs:
        color = kwargs['color']
    else:
        color = [1]*len(x)

    if 'palette' in kwargs:
        palette = kwargs['palette']
        n_colors = len(palette)
    else:
        n_colors = 256 # Use 256 colors for the diverging color palette
        palette = sns.color_palette("Blues", n_colors) 

    if 'color_range' in kwargs:
        color_min, color_max = kwargs['color_range']
    else:
        color_min, color_max = min(color), max(color) # Range of values that will be mapped to the palette

    def value_to_color(val):
        if color_min == color_max:
            return palette[-1]
        else:
            val_position = float((val - color_min)) / (color_max - color_min) # position of value in the input range, relative to the length of the input range
            val_position = min(max(val_position, 0), 1) # bound the position between 0 and 1
            ind = int(val_position * (n_colors - 1)) # target index in the color palette
            return palette[ind]

    if 'size' in kwargs:
        size = kwargs['size']
    else:
        size = [1]*len(x)

    if 'size_range' in kwargs:
        size_min, size_max = kwargs['size_range'][0], kwargs['size_range'][1]
    else:
        size_min, size_max = min(size), max(size)

    size_scale = kwargs.get('size_scale', 500)

    def value_to_size(val):
        if size_min == size_max:
            return 1 * size_scale
        else:
            val_position = (val - size_min) * 0.99 / (size_max - size_min) + 0.01 # position of value in the input range, relative to the length of the input range
            val_position = min(max(val_position, 0), 1) # bound the position between 0 and 1
            return val_position * size_scale
    if 'x_order' in kwargs: 
        x_names = [t for t in kwargs['x_order']]
    else:
        x_names = [t for t in sorted(set([v for v in x]))]
    x_to_num = {p[1]:p[0] for p in enumerate(x_names)}

    if 'y_order' in kwargs: 
        y_names = [t for t in kwargs['y_order']]
    else:
        y_names = [t for t in sorted(set([v for v in y]))]
    y_to_num = {p[1]:p[0] for p in enumerate(y_names)}

    plot_grid = plt.GridSpec(1, 15, hspace=0.2, wspace=0.1) # Set up a 1x15 grid
    ax = plt.subplot(plot_grid[:,:-1]) # Use the left 14/15ths of the grid for the main plot

    marker = kwargs.get('marker', 's')

    kwargs_pass_on = {k:v for k,v in kwargs.items() if k not in [
         'color', 'palette', 'color_range', 'size', 'size_range', 'size_scale', 'marker', 'x_order', 'y_order'
    ]}

    ax.scatter(
        x=[x_to_num[v] for v in x],
        y=[y_to_num[v] for v in y],
        marker=marker,
        s=[value_to_size(v) for v in size], 
        c=[value_to_color(v) for v in color],
        **kwargs_pass_on
    )
    ax.set_xticks([v for k,v in x_to_num.items()])
    ax.set_xticklabels([k for k in x_to_num], rotation=45, horizontalalignment='right')
    ax.set_yticks([v for k,v in y_to_num.items()])
    ax.set_yticklabels([k for k in y_to_num])

    ax.grid(False, 'major')
    ax.grid(True, 'minor')
    ax.set_xticks([t + 0.5 for t in ax.get_xticks()], minor=True)
    ax.set_yticks([t + 0.5 for t in ax.get_yticks()], minor=True)

    ax.set_xlim([-0.5, max([v for v in x_to_num.values()]) + 0.5])
    ax.set_ylim([-0.5, max([v for v in y_to_num.values()]) + 0.5])
    ax.set_facecolor('#F1F1F1')

    # Add color legend on the right side of the plot
    if color_min < color_max:
        ax = plt.subplot(plot_grid[:,-1]) # Use the rightmost column of the plot

        col_x = [0]*len(palette) # Fixed x coordinate for the bars
        bar_y=np.linspace(color_min, color_max, n_colors) # y coordinates for each of the n_colors bars

        bar_height = bar_y[1] - bar_y[0]
        ax.barh(
            y=bar_y,
            width=[5]*len(palette), # Make bars 5 units wide
            left=col_x, # Make bars start at 0
            height=bar_height,
            color=palette,
            linewidth=0
        )
        ax.set_xlim(1, 2) # Bars go from 0 to 5, so let's crop the plot somewhere in the middle
        ax.grid(False) # Hide grid
        ax.set_facecolor('white') # Make background white
        ax.set_xticks([]) # Remove horizontal ticks
        ax.set_yticks(np.linspace(min(bar_y), max(bar_y), 3)) # Show vertical ticks for min, middle and max
        ax.yaxis.tick_right() # Show vertical ticks on the right 


def corrplot(data, size_scale=500, marker='s'):
    corr = pd.melt(data.reset_index(), id_vars='index')
    corr.columns = ['x', 'y', 'value']
    heatmap(
        corr['x'], corr['y'],
        color=corr['value'], color_range=[-1, 1],
        palette=sns.diverging_palette(20, 220, n=256),
        size=corr['value'].abs(), size_range=[0,1],
        marker=marker,
        x_order=data.columns,
        y_order=data.columns[::-1],
        size_scale=size_scale
    )
# Compute pairwise correlations between all numeric columns and plot the heatmap
corr = overallheatmap.corr()
plt.figure(figsize=(13, 13))
corrplot(corr)

The above correlation heatmap shows the pairwise correlations among all of our variables. Each relationship is encoded with both color and square size: blue represents a positive correlation, red a negative one, and stronger colors indicate larger correlation magnitudes, so the weakest relationships are almost invisible. This heatmap goes a step further than the usual heatmap by adding a size parameter in addition to color, so that the size of each square also reflects the magnitude of the correlation (size(c1, c2) ~ abs(corr(c1, c2))), which makes it easier to interpret. The darkest squares correspond to variables that essentially measure the same thing, and they appear as large blocks for the population estimates and densities; we'll have to look at these more closely.

This heatmap shows us that population densities in 2018 and 2005 had negative correlations with depression rates in those years, while 2015 and 2007 show a somewhat stronger positive relationship. In 2002 and 1999 there was hardly any relationship between depression and density.

Ethics & Privacy

It is ethical to collect data from government reports because the data does not contain personally identifiable records and it is in the public domain. Any data that we used from LA County is publicly accessible and is not associated with singular individuals.

Our research question does not cause harm to any person, nor is it an unethical question to ask. In particular, the data was not analyzed for the purpose of making a healthcare or policy decision that would affect anyone; the analysis was conducted solely as an exploration of the research question.

Conclusion & Discussion

The objective of our project was to see if there is a relationship between population density and depression. We limited our analysis to Los Angeles County because of data availability. Our original prediction was that there would be a significant relationship between population density and depression rates; however, the analyses we conducted show weak or insignificant correlations between these variables. As a result, we found no evidence to support our hypothesis of a relationship between population density and depression (i.e., we fail to reject the null hypothesis).

One important trend we noticed is that depression rates seem to be increasing over time. This could partially reflect the growing priority society has placed on mental health in recent years (i.e., leading to more diagnoses), but it may also be a sign that depression is becoming more prevalent, particularly among younger populations.
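One way to see this trend in our data is to average the depression rates across districts for each survey year; a minimal sketch:

# Average depression rate across all LA districts per survey year (sketch)
for year in [1999, 2002, 2005, 2007, 2011, 2015, 2018]:
    print(year, round(districts_health['mh_{}'.format(year)].mean(), 1))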

There are several limitations with our data analysis: namely, in our final regressions, outliers (e.g. the Central district) greatly affected our results. Perhaps performing statistical analyses with these outliers removed would lead to more conclusive results.
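As a sketch of what such a follow-up might look like, one could drop the extreme high-density districts (Central and Southeast, the clear outliers in the density tables above) and recompute, for example, the 2018 correlation:

# Recompute the 2018 density-depression correlation without the outlier districts (sketch)
from scipy import stats

trimmed = districts_health[~districts_health['district'].isin(['central', 'southeast'])]
r, p = stats.pearsonr(trimmed['mh_2018'], trimmed['pop_density_2018'])
print('2018 without outliers: r = {:.2f}, p = {:.3f}'.format(r, p))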

Team Contributions

  • Naval Patel: Project proposal research, finding and extracting data from web and PDF files, data cleaning, generating graphs & analyzing results
  • Jennifer Nunez: Project proposal research, finding datasets, data cleaning, generating graphs & analyzing results
  • Yoomin Oh: Data cleaning, generating graphs & analyzing results
  • Nathaniel Satriya: Project proposal research, project structure management, video production and editing, and project writeup
  • Ray Zhou: Data cleaning, generating graphs & analyzing results