In this assignment, you'll analyze a collection of data sets from the San Francisco Open Data Portal and Zillow. The data sets have been stored in the SQLite database sf_data.sqlite, which you can download here. The database contains the following tables:
| Table | Description |
|---|---|
| crime | Crime reports dating back to 2010. |
| mobile_food_locations | List of all locations where mobile food vendors sell. |
| mobile_food_permits | List of all mobile food vendor permits. More details here. |
| mobile_food_schedule | Schedules for mobile food vendors. |
| noise | Noise complaints dating back to August 2015. |
| parking | List of all parking lots. |
| parks | List of all parks. |
| schools | List of all schools. |
| zillow | Zillow rent and housing statistics dating back to 1996. More details here. |
The mobile_food_ tables are explicitly connected through the locationid and permit columns. The other tables are not connected, but you may be able to connect them using dates, latitude/longitude, or postal codes.
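For example, the three mobile_food_ tables can be joined on those keys. This is a minimal sketch, assuming only the documented keys plus the Applicant column (which appears in the permit queries below):

import sqlite3
import pandas as pd

conn = sqlite3.connect('sf_data.sqlite')

# Join schedules to permits via `permit` and to locations via `locationid`.
query = """
select p.Applicant, s.permit, s.locationid
from mobile_food_schedule s
join mobile_food_permits p on s.permit = p.permit
join mobile_food_locations l on s.locationid = l.locationid
"""
pd.read_sql_query(query, conn).head()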
Shapefiles for US postal codes are available here. These may be useful for converting latitude/longitude to postal codes.
Shapefiles for San Francisco Neighborhoods are available here.
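As a sketch of how a latitude/longitude-to-postal-code conversion might look with geopandas: the shapefile path and its attribute columns below are placeholders, so check the actual downloaded file before using this.

import geopandas as gpd
import shapely.geometry as geom

# Placeholder filename -- substitute the actual ZIP-code shapefile.
zips = gpd.read_file('zip_codes.shp')

# One example point (lon, lat) in WGS84, reprojected to match the shapefile.
pts = gpd.GeoDataFrame(geometry=[geom.Point(-122.42, 37.77)], crs={'init': 'epsg:4326'})
pts = pts.to_crs(zips.crs)

# Spatial join: each point picks up the attributes of the ZIP polygon
# containing it (requires the rtree package for the spatial index).
gpd.sjoin(pts, zips, how='left', op='within')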
Exercise 1.1. Which mobile food vendor(s) sells at the most locations?
import pandas as pd
import matplotlib.pyplot as plt
import sqlalchemy as sqla
import sqlite3

%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12, 12)  # make plots larger

sqlite_file = 'sf_data.sqlite'
conn = sqlite3.connect(sqlite_file)                     # plain sqlite3 connection
conn2 = sqla.create_engine('sqlite:///' + sqlite_file)  # SQLAlchemy engine for pandas
pd.read_sql_query("select permit, count(locationid) AS places from mobile_food_schedule group by permit order by places desc",conn2).head()
From here we see that the top three vendors sell at 340, 221, and 212 places.
pd.read_sql_query("select Applicant from mobile_food_permits where permit = '17MFF-0110' OR permit = '13MFF-0102' OR permit = '17MFF-0111'",conn2)
These vendors are Natan's Catering, May Catering, and Anas Goodies Catering.
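The two steps above can also be combined into a single query. Here is a sketch using a join; it uses count(distinct locationid) to guard against a permit being scheduled at the same location more than once, which is an assumption about the schedule table's granularity:

pd.read_sql_query(
    "select p.Applicant, count(distinct s.locationid) as places "
    "from mobile_food_schedule s "
    "join mobile_food_permits p on s.permit = p.permit "
    "group by s.permit order by places desc limit 3",
    conn2)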
Exercise 1.2. Ask and use the database to analyze 5 questions about San Francisco. For each question, write at least 150 words and support your answer with plots. Make a map for at least 2 of the 5 questions.
You should try to come up with some questions on your own, but examples of reasonable questions are provided in the assignment prompt.
Please make sure to clearly state each of your questions in your submission.
pd.read_sql_query("select DISTINCT CATEGORY from crime",conn2)
pd.read_sql_query("select * from crime",conn2).head()
crimesByDistrict = pd.read_sql_query("select PdDistrict, count(Category) as totCrimes from crime group by PdDistrict", conn2)
districtCrimes = crimesByDistrict[1:]  # skip the first row, which has an empty PdDistrict
maxCrimes = max(districtCrimes.totCrimes)
minCrimes = min(districtCrimes.totCrimes)
maxDistrict = districtCrimes[districtCrimes.totCrimes == maxCrimes]
minDistrict = districtCrimes[districtCrimes.totCrimes == minCrimes]
maxDistrict, minDistrict
From here we see that the district with the most reported crimes is the Southern district and the district with the fewest is Richmond.
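A quick bar chart of the districtCrimes frame built above makes the spread across districts easier to compare:

# Reported crimes per police district since 2010.
districtCrimes.plot(kind='bar', x='PdDistrict', y='totCrimes', legend=False)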
southern = pd.read_sql_query("select Category, count(Category) as occurrences from crime where PdDistrict = 'SOUTHERN' group by Category", conn2)
maxCrime = max(southern.occurrences)
southern[southern.occurrences == maxCrime]
From here we see that the most common crime in the Southern district is larceny/theft, with over 60,000 occurrences since 2010.
richmond = pd.read_sql_query("select Category, count(Category) as occurrences from crime where PdDistrict = 'RICHMOND' group by Category", conn2)
minCrime = min(richmond.occurrences)
richmond[richmond.occurrences == minCrime]
To answer this question (which police district is the most dangerous, and which the least?), we first check how many different categories of crime there are; there are 39. For this problem we assume every category counts as dangerous. The district with the fewest reported crimes is Richmond, and the crime that occurred least often there is pornography. The district with the most reported crimes is the Southern district, and its most common crime is larceny/theft.
from datetime import datetime

# One row per distinct timestamp in the Southern district.
southernTime = pd.read_sql_query("select Category, Datetime, count(Datetime) as occurrences from crime where PdDistrict = 'SOUTHERN' group by Datetime", conn2)

# Extract the hour from each timestamp ('YYYY-MM-DD HH:MM:SS').
hour = []
for i in range(0, len(southernTime)):
    date = southernTime.Datetime[i].split(" ")[1]
    date = datetime.strptime(date, '%H:%M:%S')
    hour.append(date.hour)
df = pd.DataFrame(hour)
df.describe()
Lastly, we check what time of day crimes most commonly occurred in the most dangerous district. We see that crimes in the Southern district most often occurred at around 11 PM.
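As an aside, the hour extraction can be done without an explicit loop. This is a minimal sketch using pandas' datetime accessor on the southernTime frame above, assuming the Datetime column stays in the 'YYYY-MM-DD HH:MM:SS' format seen so far:

# Vectorized equivalent of the loop above: parse every timestamp at once.
southernHours = pd.to_datetime(southernTime['Datetime']).dt.hour
southernHours.describe()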
# City-wide this time: one row per distinct timestamp.
crimeTime = pd.read_sql_query("select Category, Datetime, count(Datetime) as occurrences from crime group by Datetime", conn2)

# Extract the hour from each timestamp, as before.
hour = []
for i in range(0, len(crimeTime)):
    date = crimeTime.Datetime[i].split(" ")[1]
    date = datetime.strptime(date, '%H:%M:%S')
    hour.append(date.hour)

# Count how many crimes fall in each hour of the day and plot the result.
timedf = pd.DataFrame(hour)
timedf.columns = ['hour']
groups = pd.DataFrame(timedf.groupby(['hour']).size())
groups.columns = ['count']
groups.plot(y='count', kind='pie', legend=False)
This is a pie chart of the number of crimes per hour (not broken out by district). We see from here that fewer crimes occur early in the day (the early-morning hours are roughly similar to one another), and that more crimes occur as the day goes on, peaking at 11 PM (hour 23).
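Hour-by-hour counts are arguably easier to compare on a bar chart than a pie chart; a one-line alternative using the same groups frame:

# Same data as the pie chart, but heights are directly comparable.
groups.plot(kind='bar', y='count', legend=False)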
zillow = pd.read_sql_query("select * from zillow",conn2)
zillow.head()
zillow = zillow.dropna()
# SQLite returns the row that achieves the aggregate when a bare column
# accompanies max()/min(), so these give the region at each extreme.
priceMax = pd.read_sql_query("select RegionName, max(MedianSoldPricePerSqft_AllHomes) from zillow", conn2)
priceMin = pd.read_sql_query("select RegionName, min(MedianSoldPricePerSqft_AllHomes) from zillow", conn2)
priceMax, priceMin
Since there are many rows with NAs in this dataset, I chose to drop them so they wouldn't interfere with my analysis of the maximum and minimum median home prices (the SQL max/min aggregates ignore NULLs in any case). From here, we see the most expensive region is 94104 (median sold price per sqft of \$1336) and the least expensive region is 94124 (median sold price per sqft of \$103).
price = pd.read_sql_query("select RegionName, max(MedianSoldPricePerSqft_AllHomes), min(MedianSoldPricePerSqft_AllHomes) from zillow group by RegionName",conn2)
price.plot(kind = 'bar', y = ['max(MedianSoldPricePerSqft_AllHomes)', 'min(MedianSoldPricePerSqft_AllHomes)'], x = 'RegionName')
From the plot above we see that there is a large difference between the max and min sold prices per square foot within each region. The only region where the max and min are roughly equal is 94104; if even the minimum is close to the maximum, it is presumably a uniformly expensive/nice area.
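The gap can also be ranked directly in SQL. Here is a small sketch computing the max-min spread per region, using the same zillow column as above:

# Rank regions by the spread between their highest and lowest median sold price per sqft.
spread = pd.read_sql_query(
    "select RegionName, "
    "max(MedianSoldPricePerSqft_AllHomes) - min(MedianSoldPricePerSqft_AllHomes) as spread "
    "from zillow group by RegionName order by spread desc",
    conn2)
spread.head()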
# Noise complaints per neighborhood; Lat/Lon come from an arbitrary
# complaint in each group (SQLite keeps one row per group).
areas = pd.read_sql_query("select Neighborhood, count(Neighborhood) as occurrences, Lat, Lon from noise group by Neighborhood", conn2)
areas = areas[1:len(areas)]  # skip the first row, which has an empty Neighborhood
areas = areas.reset_index(drop=True)

# How many neighborhoods share each complaint count?
group = pd.DataFrame(areas.groupby(['occurrences']).size())
group.columns = ['count']
group.plot(kind='bar', y='count')
From the graph above, we see that small complaint counts are far more common than large ones. The maximum number of noise complaints in a single neighborhood is 480, with 386 the next highest, and each of those counts occurs in only one neighborhood.
min(areas.occurrences), max(areas.occurrences)
minArea = areas[areas.occurrences == 1].reset_index(drop=True)
maxAreas = areas[areas.occurrences == 480].reset_index(drop=True)
minArea
maxAreas
From here we see that seven neighborhoods are tied for the fewest noise complaints (one each): Balboa Terrace, Candlestick Point, India Basin, Laguna Honda, Lincoln Park, St. Francis Wood, and Sunnydale. The neighborhood with the most noise complaints is South of Market. We will now plot these locations to see whether the quiet neighborhoods are far from the noisiest one.
import shapely.geometry as geom
import geopandas as gpd

# Point geometries for the quietest neighborhoods (one complaint each)...
lonlat = [geom.Point(lon, lat) for lon, lat in zip(minArea.Lon, minArea.Lat)]
minArea = gpd.GeoDataFrame(minArea.drop(["Lon", "Lat"], axis=1), geometry=lonlat, crs={'init': 'epsg:4326'})
minArea.geometry.head()

# ...and for the noisiest neighborhood.
lonlat = [geom.Point(lon, lat) for lon, lat in zip(maxAreas.Lon, maxAreas.Lat)]
maxAreas = gpd.GeoDataFrame(maxAreas.drop(["Lon", "Lat"], axis=1), geometry=lonlat, crs={'init': 'epsg:4326'})
maxAreas.geometry.head()
sf = gpd.read_file("sf/geo_export_c9c00fb7-1521-4238-9257-b6176bf1cd62.shp")
sf.head()
base = sf.plot()
minArea.plot(ax=base, marker="x", color="red")
maxAreas.plot(ax=base, marker="x", color="blue")
Although the markers are small, we can see that the neighborhoods with the fewest noise complaints (red) are quite far from the neighborhood with the most (blue). Overall, all of these areas are well separated from one another.
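For the record, the marker size can be increased: geopandas' plot method accepts a markersize argument. A sketch of a more legible version of the map above:

# Same map with larger markers so the points are easier to see.
base = sf.plot()
minArea.plot(ax=base, marker="x", color="red", markersize=80)
maxAreas.plot(ax=base, marker="x", color="blue", markersize=80)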
schools = pd.read_sql_query("select * from schools",conn2)
schoolsGroup = pd.read_sql_query("select GradeRange, count(GradeRange) as count from schools group by GradeRange",conn2)
schoolsGroup
schoolsGroup.plot(kind = 'pie', y = 'count')
maxCount = max(schoolsGroup['count'])
schoolsGroup[schoolsGroup['count'] == maxCount]  # the most common grade range (165 schools)
From here we see that there are more pre-kindergarten-only schools than any other kind. This leads me to believe there probably isn't a correlation between schools and noise complaints, since most noise complaints occurred at 5 PM, which is typically after school lets out (this value is calculated a few steps below).
noise = pd.read_sql_query("select * from noise",conn2)
noise = noise.dropna()
# Point geometries for schools and for all noise complaints.
lonlat = [geom.Point(lon, lat) for lon, lat in zip(schools.Lon, schools.Lat)]
s = gpd.GeoDataFrame(schools, geometry=lonlat, crs={'init': 'epsg:4326'})
s.geometry.head()
lonlat = [geom.Point(lon, lat) for lon, lat in zip(noise.Lon, noise.Lat)]
n = gpd.GeoDataFrame(noise, geometry=lonlat, crs={'init': 'epsg:4326'})
n.geometry.head()

# Schools in red, noise complaints in blue.
sfp = sf.plot()
s.plot(ax=sfp, marker="x", color="red")
n.plot(ax=sfp, marker="x", color="blue")
Since there are so many data points, it is hard to tell whether there is a correlation. Next, I will restrict the noise complaints to school hours (8 AM to 3 PM, thinking only of elementary, middle, and high schools).
noise = pd.read_sql_query("select Datetime, Lat, Lon from noise", conn2)
noise = noise.dropna()
noise = noise.reset_index(drop=True)

# Extract the hour of each complaint.
hour = []
for i in range(0, len(noise)):
    date = noise.Datetime[i].split(" ")[1]
    date = datetime.strptime(date, '%H:%M:%S')
    hour.append(date.hour)

# Find the hour with the most complaints.
from collections import defaultdict
d = defaultdict(int)
for i in hour:
    d[i] += 1
result = max(d.items(), key=lambda x: x[1])
print(result)
noise['hour'] = hour

# Keep complaints that occurred during school hours (8 AM to 3 PM).
newNoise = noise[(noise['hour'] >= 8) & (noise['hour'] <= 15)].reset_index(drop=True)
lonlat = [geom.Point(lon, lat) for lon, lat in zip(newNoise.Lon, newNoise.Lat)]
n = gpd.GeoDataFrame(newNoise, geometry=lonlat, crs={'init': 'epsg:4326'})
n.geometry.head()

# Schools in red, school-hours noise complaints in blue.
sfp = sf.plot()
s.plot(ax=sfp, marker="x", color="red")
n.plot(ax=sfp, marker="x", color="blue")
It is still fairly difficult to determine whether there is a correlation between noise complaints and school locations, but most schools do seem to be fairly close to at least one noise complaint. This would make sense, since areas with kids and younger people tend to be loud.
parks = pd.read_sql_query("select Type, count(Type) as occurrences, avg(Acreage) from parks group by Type", conn2)
parks
From here we see that there are more neighborhood parks/playgrounds than any other type of park in San Francisco, averaging about 4.4 acres each. This is displayed visually in the graph below.
parks.plot(kind='pie', y='occurrences', labels=['Civic Plaza', 'Community Garden', 'Concession', 'Family Camp', 'Mini Park', 'Neighborhood Park', 'Parkway', 'Regional Park', 'Zoological Garden'])
Although there are more neighborhood parks than any other kind of park in San Francisco, the statistics above show that family camps have the largest average acreage per unit. This is surprising to me because I had never heard of a family camp. The second-largest average acreage belongs to the zoological garden, which makes sense: the SF Zoo is quite large, whereas a neighborhood park can be as small as a patch of grass.
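To back up the acreage claim visually, here is a small sketch plotting average acreage by park type; it assumes the unaliased aggregate column comes back named avg(Acreage), as SQLite typically names expression columns:

# Average acreage by park type; family camps and the zoo should dominate.
parks.plot(kind='bar', x='Type', y='avg(Acreage)', legend=False)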