Assignment 6

In this assignment, you'll analyze a collection of data sets from the San Francisco Open Data Portal and Zillow. The data sets have been stored in the SQLite database sf_data.sqlite, which you can download here. The database contains the following tables:

Table                    Description
crime                    Crime reports dating back to 2010.
mobile_food_locations    List of all locations where mobile food vendors sell.
mobile_food_permits      List of all mobile food vendor permits. More details here.
mobile_food_schedule     Schedules for mobile food vendors.
noise                    Noise complaints dating back to August 2015.
parking                  List of all parking lots.
parks                    List of all parks.
schools                  List of all schools.
zillow                   Zillow rent and housing statistics dating back to 1996. More details here.

The mobile_food_ tables are explicitly connected through the locationid and permit columns. The other tables are not connected, but you may be able to connect them using dates, latitude/longitude, or postal codes.

Shapefiles for US postal codes are available here. These may be useful for converting latitude/longitude to postal codes.
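
For example, points can be tagged with postal codes via a geopandas spatial join. A minimal sketch, assuming a downloaded zip-code shapefile at a hypothetical path (the shapefile's column names vary by vintage):

import sqlite3
import pandas as pd
import geopandas as gpd
import shapely.geometry as geom

zips = gpd.read_file("zcta/zcta.shp")  # hypothetical path to the download
noise = pd.read_sql_query("SELECT * FROM noise",
                          sqlite3.connect("sf_data.sqlite")).dropna()
pts = gpd.GeoDataFrame(
    noise,
    geometry = [geom.Point(lon, lat) for lon, lat in zip(noise.Lon, noise.Lat)],
    crs = {'init': 'epsg:4326'})
# Tag each point with the postal-code polygon that contains it.
with_zip = gpd.sjoin(pts, zips.to_crs(pts.crs), how = "left", op = "within")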

Shapefiles for San Francisco Neighborhoods are available here.

Exercise 1.1. Which mobile food vendor(s) sells at the most locations?

In [72]:
import pandas as pd
%matplotlib inline
# Make plots larger.
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12, 12)
import sqlalchemy as sqla
import sqlite3
plt.style.use('ggplot')
In [73]:
sqlite_file = 'sf_data.sqlite'
conn = sqlite3.connect(sqlite_file)
In [74]:
conn2 = sqla.create_engine('sqlite:///' + sqlite_file)
In [75]:
conn2
Out[75]:
Engine(sqlite:///sf_data.sqlite)
In [76]:
pd.read_sql_query("select permit, count(locationid) AS places from mobile_food_schedule group by permit order by places desc",conn2).head()
Out[76]:
permit places
0 17MFF-0110 340
1 13MFF-0102 221
2 17MFF-0111 212
3 17MFF-0090 182
4 14MFF-0109 165

From here we see that the top three permits have 340, 221, and 212 scheduled selling locations, respectively.

In [77]:
pd.read_sql_query("select Applicant from mobile_food_permits where permit = '17MFF-0110' OR permit = '13MFF-0102' OR permit = '17MFF-0111'",conn2)
Out[77]:
Applicant
0 Natan's Catering
1 May Catering
2 Anas Goodies Catering

These vendors are Natan's Catering, May Catering, and Anas Goodies Catering.
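
The two steps can also be combined into a single join that returns names and counts together; a sketch, using COUNT(DISTINCT locationid) since a permit may have several schedule entries at one location:

# Join schedules to permits and count each applicant's distinct locations.
pd.read_sql_query(
    "SELECT p.Applicant, COUNT(DISTINCT s.locationid) AS places "
    "FROM mobile_food_schedule s "
    "JOIN mobile_food_permits p ON s.permit = p.permit "
    "GROUP BY p.Applicant ORDER BY places DESC", conn2).head()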

Exercise 1.2. Ask and use the database to analyze 5 questions about San Francisco. For each question, write at least 150 words and support your answer with plots. Make a map for at least 2 of the 5 questions.

You should try to come up with some questions on your own, but these are examples of reasonable questions:

  • Which parts of the city are the most and least expensive?
  • Which parts of the city are the most dangerous (and at what times)?
  • Are noise complaints and mobile food vendors related?
  • What are the best times and places to find food trucks?
  • Is there a relationship between housing prices and any of the other tables?

Please make sure to clearly state each of your questions in your submission.

Question 1

Which parts of the city are the most dangerous and at what times?

In [78]:
pd.read_sql_query("select DISTINCT CATEGORY from crime",conn2)
Out[78]:
Category
0 NON-CRIMINAL
1 ROBBERY
2 ASSAULT
3 SECONDARY CODES
4 VANDALISM
5 BURGLARY
6 LARCENY/THEFT
7 DRUG/NARCOTIC
8 WARRANTS
9 VEHICLE THEFT
10 OTHER OFFENSES
11 WEAPON LAWS
12 ARSON
13 MISSING PERSON
14 DRIVING UNDER THE INFLUENCE
15 SUSPICIOUS OCC
16 RECOVERED VEHICLE
17 DRUNKENNESS
18 TRESPASS
19 FRAUD
20 DISORDERLY CONDUCT
21 SEX OFFENSES, FORCIBLE
22 FORGERY/COUNTERFEITING
23 KIDNAPPING
24 EMBEZZLEMENT
25 STOLEN PROPERTY
26 LIQUOR LAWS
27 FAMILY OFFENSES
28 LOITERING
29 BAD CHECKS
30 TREA
31 GAMBLING
32 RUNAWAY
33 BRIBERY
34 PROSTITUTION
35 PORNOGRAPHY/OBSCENE MAT
36 SEX OFFENSES, NON FORCIBLE
37 SUICIDE
38 EXTORTION
In [79]:
pd.read_sql_query("select * from crime",conn2).head()
Out[79]:
IncidntNum Category Descript DayOfWeek Datetime PdDistrict Resolution Address Lon Lat
0 150060275 NON-CRIMINAL LOST PROPERTY Monday 2015-01-19 14:00:00 MISSION NONE 18TH ST / VALENCIA ST -122.421582 37.761701
1 150098210 ROBBERY ROBBERY, BODILY FORCE Sunday 2015-02-01 15:45:00 TENDERLOIN NONE 300 Block of LEAVENWORTH ST -122.414406 37.784191
2 150098210 ASSAULT AGGRAVATED ASSAULT WITH BODILY FORCE Sunday 2015-02-01 15:45:00 TENDERLOIN NONE 300 Block of LEAVENWORTH ST -122.414406 37.784191
3 150098210 SECONDARY CODES DOMESTIC VIOLENCE Sunday 2015-02-01 15:45:00 TENDERLOIN NONE 300 Block of LEAVENWORTH ST -122.414406 37.784191
4 150098226 VANDALISM MALICIOUS MISCHIEF, VANDALISM OF VEHICLES Tuesday 2015-01-27 19:00:00 NORTHERN NONE LOMBARD ST / LAGUNA ST -122.431119 37.800469
In [80]:
crimesByDistrict = pd.read_sql_query("select PdDistrict, count(Category) as totCrimes from crime group by PdDistrict",conn2)
In [81]:
# Drop the first row, which has a NULL district.
districtCrimes = crimesByDistrict[1:]
maxCrimes = max(districtCrimes.totCrimes)
minCrimes = min(districtCrimes.totCrimes)
maxDistrict = districtCrimes[districtCrimes.totCrimes == maxCrimes]
minDistrict = districtCrimes[districtCrimes.totCrimes == minCrimes]
In [82]:
maxDistrict, minDistrict
Out[82]:
(  PdDistrict  totCrimes
 8   SOUTHERN     196178,   PdDistrict  totCrimes
 7   RICHMOND      56258)

From here we see that the district with the most crimes is the Southern district and the district with the fewest crimes is Richmond.
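
A quick bar chart of the per-district totals supports this (a sketch using the frame already queried above):

# Crimes per police district, sorted so the extremes stand out.
districtCrimes.sort_values('totCrimes').plot(
    kind = 'bar', x = 'PdDistrict', y = 'totCrimes', legend = False)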

In [83]:
southern = pd.read_sql_query("select Category, count(Category) as occurance from crime where PdDistrict='SOUTHERN' group by Category",conn2)
In [84]:
maxCrime = max(southern.occurance)
southern[southern.occurance == maxCrime]
Out[84]:
Category occurance
16 LARCENY/THEFT 60068

From here we see that the most common crime in the Southern district is larceny/theft, with over 60,000 occurrences since 2010.

In [85]:
richmond = pd.read_sql_query("select Category, count(Category) as occurance from crime where PdDistrict='RICHMOND' group by Category",conn2)
In [86]:
minCrime = min(richmond.occurance)
richmond[richmond.occurance == minCrime]
Out[86]:
Category occurance
21 PORNOGRAPHY/OBSCENE MAT 3

To answer this question, we first check how many different categories of crime there are: 39. For this problem we assume every category counts as dangerous. The district with the fewest reported crimes is Richmond, where the least common crime is pornography/obscene material, with only 3 reports. The district with the most crimes is the Southern district, where the most common crime is larceny/theft.

In [87]:
southernTime = pd.read_sql_query("select Category, Datetime, count(Datetime) as occurance from crime where PdDistrict='SOUTHERN' group by Datetime",conn2)
In [88]:
from datetime import datetime
In [89]:
hour = []
for i in range(0, len(southernTime)):
    date = southernTime.Datetime[i].split(" ")[1]
    date = datetime.strptime(date, '%H:%M:%S')
    hour.append(date.hour)
In [90]:
df = pd.DataFrame(hour)
In [91]:
df.describe()
Out[91]:
0
count 128378.000000
mean 13.538683
std 6.399302
min 0.000000
25% 9.000000
50% 14.000000
75% 19.000000
max 23.000000

Lastly, we check what times crimes occur in the most dangerous district. The summary statistics of the hour values show a median of 14 (2 PM) for the Southern district, with half of all incident times falling between 9 AM and 7 PM.
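
As an aside, the hour-extraction loop above can be replaced by a single vectorized line (a sketch, assuming the Datetime strings all parse cleanly):

# Parse the whole column at once and pull out the hour component.
hours = pd.to_datetime(southernTime.Datetime).dt.hour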

In [92]:
crimeTime = pd.read_sql_query("select Category, Datetime, count(Datetime) as occurance from crime group by Datetime",conn2)
In [93]:
hour = []
for i in range(0, len(crimeTime)):
    date = crimeTime.Datetime[i].split(" ")[1]
    date = datetime.strptime(date, '%H:%M:%S')
    hour.append(date.hour)
In [94]:
timedf = pd.DataFrame(hour)
In [95]:
timedf.columns = ['hour']
In [96]:
groups = timedf.groupby(['hour']).size()
In [97]:
groups = pd.DataFrame(groups)
In [98]:
groups.columns = ['count']
In [99]:
times = []
for i in range(0, 24):
    times.append(groups.iloc[i])
In [100]:
times = pd.DataFrame(times)
In [101]:
times.plot(y = 'count', kind = 'pie', legend = False)
Out[101]:
<matplotlib.axes._subplots.AxesSubplot at 0x142fd71d0>

This is a pie chart of the number of crimes per hour (not split by district). We see from it that fewer crimes occur in the early-morning hours (which look roughly similar to one another), and that counts climb as the day goes on, peaking at 11 PM (hour 23).
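
A pie chart with 24 slices is somewhat hard to read; the same counts as a bar chart (a sketch using the frame built above) make the evening peak easier to see:

# Hourly crime counts as a bar chart, hours 0-23 along the x-axis.
times.plot(y = 'count', kind = 'bar', legend = False)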

Question 2

Which regions are the most and least expensive?

In [102]:
zillow = pd.read_sql_query("select  * from zillow",conn2)
In [103]:
zillow.head()
Out[103]:
RegionName Date ZriPerSqft_AllHomes MedianSoldPricePerSqft_AllHomes PriceToRentRatio_AllHomes Turnover_AllHomes
0 94109 2010-11-01 00:00:00 3.156 675.1913 19.14 6.0771
1 94110 2010-11-01 00:00:00 2.566 599.6785 18.10 5.4490
2 94122 2010-11-01 00:00:00 2.168 495.4432 17.99 2.4198
3 94080 2010-11-01 00:00:00 1.666 369.5538 18.31 3.8757
4 94112 2010-11-01 00:00:00 2.322 422.4538 14.44 3.1288
In [104]:
zillow = zillow.dropna()
In [105]:
priceMax = pd.read_sql_query("select RegionName, max(MedianSoldPricePerSqft_AllHomes) from zillow",conn2)
priceMin = pd.read_sql_query("select RegionName, min(MedianSoldPricePerSqft_AllHomes) from zillow",conn2)
In [106]:
priceMax, priceMin
Out[106]:
(   RegionName  max(MedianSoldPricePerSqft_AllHomes)
 0       94104                           1336.128581,
    RegionName  min(MedianSoldPricePerSqft_AllHomes)
 0       94124                               103.938)

Since many rows in this dataset contain NAs, I dropped them so they wouldn't interfere with the analysis (note that SQL's max and min ignore NULLs, so the two queries above are unaffected). From here, we see the most expensive region is 94104 (maximum median sold price of $1,336 per sqft) and the least expensive region is 94124 (minimum of about $104 per sqft).
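
Note that these queries pick out the single highest and lowest monthly values across all dates. A sketch that instead ranks regions by their average median price over the whole period (same table and columns) would be:

# Rank regions by average median sold price across all dates.
pd.read_sql_query(
    "SELECT RegionName, AVG(MedianSoldPricePerSqft_AllHomes) AS avg_price "
    "FROM zillow GROUP BY RegionName ORDER BY avg_price DESC", conn2).head()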

In [107]:
price = pd.read_sql_query("select RegionName, max(MedianSoldPricePerSqft_AllHomes), min(MedianSoldPricePerSqft_AllHomes) from zillow group by RegionName",conn2)
In [108]:
price.plot(kind = 'bar', y = ['max(MedianSoldPricePerSqft_AllHomes)', 'min(MedianSoldPricePerSqft_AllHomes)'], x = 'RegionName')
Out[108]:
<matplotlib.axes._subplots.AxesSubplot at 0x158bdc590>

From the plot above we see a large difference between the max and min home prices in most regions. The only region where the max and min are about the same is 94104, which suggests it is a uniformly expensive area: even its minimum price is close to its maximum.

Question 3

Which areas are the most and least quiet?

In [109]:
# Lat/Lon are arbitrary representatives: SQLite returns one row's
# coordinates for each Neighborhood group.
areas = pd.read_sql_query("select Neighborhood, count(neighborhood) as occurances, Lat, Lon from noise group by Neighborhood",conn2)
In [110]:
# Drop the first row, which has a NULL neighborhood.
areas = areas[1:]
In [111]:
areas = areas.reset_index()
In [112]:
del areas['index']
In [113]:
group = areas.groupby(['occurances']).size()
group = pd.DataFrame(group)
In [114]:
group.columns = ['count']
In [115]:
group.plot(kind = 'bar', y = 'count')
Out[115]:
<matplotlib.axes._subplots.AxesSubplot at 0x1621e4390>

From the graph above, we see that small complaint counts are far more common than large ones. The maximum number of complaints for a single neighborhood is 480, with 386 next, and each of those counts occurs in only one neighborhood.

In [116]:
min(areas.occurances), max(areas.occurances)
Out[116]:
(1, 480)
In [117]:
minArea = areas[areas.occurances == 1]
In [118]:
maxAreas = areas[areas.occurances == 480]
In [119]:
minArea = minArea.reset_index()
del minArea['index']
In [120]:
maxAreas = maxAreas.reset_index()
del maxAreas['index']
In [121]:
minArea
Out[121]:
Neighborhood occurances Lat Lon
0 Balboa Terrace 1 37.731599 -122.463938
1 Candlestick Point SRA 1 37.716298 -122.388131
2 India Basin 1 37.733627 -122.378361
3 Laguna Honda 1 37.747639 -122.458641
4 Lincoln Park / Ft. Miley 1 37.784330 -122.498916
5 St. Francis Wood 1 37.735891 -122.466476
6 Sunnydale 1 37.708415 -122.416303
In [122]:
maxAreas
Out[122]:
Neighborhood occurances Lat Lon
0 South of Market 480 37.781046 -122.409397

From here we see that there are 7 neighborhoods with the fewest noise complaints (one each): Balboa Terrace, Candlestick Point, India Basin, Laguna Honda, Lincoln Park, St. Francis Wood, and Sunnydale. The area with the most noise complaints is South of Market. We will now plot these locations to see whether the quiet neighborhoods are far from the noisiest one.

In [123]:
import shapely.geometry as geom
import geopandas as gpd
In [124]:
lonlat = [geom.Point(lon, lat) for lon, lat in zip(minArea.Lon, minArea.Lat)]
# drop() returns a new frame, so assign the result back.
minArea = minArea.drop(["Lon", "Lat"], axis = 1)

areas = gpd.GeoDataFrame(minArea, geometry = lonlat, crs = {'init' :'epsg:4326'})
areas.geometry.head()
Out[124]:
0      POINT (-122.463938474175 37.731598603694)
1    POINT (-122.388130559067 37.71629808383071)
2      POINT (-122.378361489645 37.733626733232)
3      POINT (-122.458640998954 37.747639025578)
4      POINT (-122.498916234597 37.784330141308)
Name: geometry, dtype: object
In [125]:
lonlat = [geom.Point(lon, lat) for lon, lat in zip(maxAreas.Lon, maxAreas.Lat)]
# drop() returns a new frame, so assign the result back.
maxAreas = maxAreas.drop(["Lon", "Lat"], axis = 1)
maxAreas = gpd.GeoDataFrame(maxAreas, geometry = lonlat, crs = {'init' :'epsg:4326'})
maxAreas.geometry.head()
Out[125]:
0    POINT (-122.409397210188 37.781046453006)
Name: geometry, dtype: object
In [127]:
sf = gpd.read_file("sf/geo_export_c9c00fb7-1521-4238-9257-b6176bf1cd62.shp")
sf.head()
Out[127]:
geometry link name
0 POLYGON ((-122.4934552679999 37.78351817100008... http://en.wikipedia.org/wiki/Sea_Cliff,_San_Fr... Seacliff
1 POLYGON ((-122.4871507149999 37.78378542700005... None Lake Street
2 POLYGON ((-122.4775801709999 37.81099311300005... http://www.nps.gov/prsf/index.htm Presidio National Park
3 POLYGON ((-122.4724105299999 37.78734653900005... None Presidio Terrace
4 POLYGON ((-122.4726257899999 37.78631480600006... http://www.sfgate.com/neighborhoods/sf/innerri... Inner Richmond
In [128]:
base = sf.plot()
areas.plot(ax = base, marker = "x", color = "red")
maxAreas.plot(ax = base, marker = "x", color = "blue")
Out[128]:
<matplotlib.axes._subplots.AxesSubplot at 0x1684f27d0>

Although the markers are small, we can see that the areas with the fewest noise complaints (red) are quite far from the area with the most (blue). Overall, though, all of these areas are spread far apart from one another.
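
For reference, geopandas passes a markersize argument through to matplotlib, so a sketch like this should make the points easier to spot:

# Same map, larger markers.
base = sf.plot()
areas.plot(ax = base, marker = "x", color = "red", markersize = 80)
maxAreas.plot(ax = base, marker = "x", color = "blue", markersize = 80)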

Question 4

Are noise complaints and school locations related?

In [129]:
schools = pd.read_sql_query("select * from schools",conn2)
In [131]:
schoolsGroup = pd.read_sql_query("select GradeRange, count(GradeRange) as count from schools group by GradeRange",conn2)
In [132]:
schoolsGroup
Out[132]:
GradeRange count
0 1-4 1
1 1-6 1
2 1-8 1
3 13-14 10
4 2-12 1
5 3-8 2
6 4-8 1
7 5-12 1
8 5-8 1
9 6-12 6
10 6-8 16
11 7-12 2
12 8-12 4
13 9-12 36
14 INF 3
15 K-1 1
16 K-12 11
17 K-2 1
18 K-4 1
19 K-5 51
20 K-6 2
21 K-8 65
22 PK 165
23 PK-12 1
24 PK-4 6
25 PK-5 26
26 PK-6 1
27 PK-8 8
28 PK-K 13
29 PK-TK 2
30 TK-2 2
31 TK-8 3
In [133]:
schoolsGroup.plot(kind = 'pie', y = 'count')
Out[133]:
<matplotlib.axes._subplots.AxesSubplot at 0x148abb6d0>
In [134]:
max(schoolsGroup['count'])
Out[134]:
165
In [135]:
schoolsGroup[schoolsGroup['count'] == 165]
Out[135]:
GradeRange count
22 PK 165

From here we see that there are more pre-kindergarten-only schools than anything else. This leads me to believe that there probably isn't a correlation between schools and noise complaints, since the most common hour for noise complaints is 11 PM (this value is calculated a couple of steps below), well after school is over.

In [136]:
noise = pd.read_sql_query("select * from noise",conn2)
In [137]:
noise = noise.dropna()
In [138]:
lonlat = [geom.Point(lon, lat) for lon, lat in zip(schools.Lon, schools.Lat)]

s = gpd.GeoDataFrame(schools, geometry = lonlat, crs = {'init' :'epsg:4326'})
s.geometry.head()
Out[138]:
0    POINT (-122.4822998 37.78300479999999)
1           POINT (-122.4381943 37.7536812)
2           POINT (-122.4657822 37.7296715)
3           POINT (-122.4740601 37.7739677)
4           POINT (-122.4763107 37.7753067)
Name: geometry, dtype: object
In [139]:
lonlat = [geom.Point(lon, lat) for lon, lat in zip(noise.Lon, noise.Lat)]

n = gpd.GeoDataFrame(noise, geometry = lonlat, crs = {'init' :'epsg:4326'})
n.geometry.head()
Out[139]:
1    POINT (-122.424474850717 37.7691482064722)
2     POINT (-122.508365256702 37.762047023755)
3     POINT (-122.414943216196 37.787982901643)
4      POINT (-122.502232253358 37.76138549134)
5     POINT (-122.420087435923 37.742435250569)
Name: geometry, dtype: object
In [140]:
sfp = sf.plot()
s.plot(ax = sfp, marker = "x", color = "red")
n.plot(ax=sfp,marker = "x", color = "blue")
Out[140]:
<matplotlib.axes._subplots.AxesSubplot at 0x14155fd90>

Since there are so many data points it is hard to tell whether there is a correlation. Next, I will restrict the noise complaints to school hours (8 AM to 3 PM), considering only elementary, middle, and high schools.

In [141]:
noise = pd.read_sql_query("select Datetime, Lat, Lon from noise",conn2)
In [142]:
noise = noise.dropna()
In [143]:
noise = noise.reset_index()
In [144]:
del noise['index']
In [145]:
hour = []
for i in range(0, len(noise)):
    date = noise.Datetime[i].split(" ")[1]
    date = datetime.strptime(date, '%H:%M:%S')
    hour.append(date.hour)
In [146]:
from collections import defaultdict
d = defaultdict(int)
for i in hour:
    d[i] += 1
# Most common complaint hour and its count.
result = max(d.items(), key=lambda x: x[1])
print(result)
(23, 354)
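
The same mode can be read straight off pandas (a sketch):

# value_counts sorts descending, so the first row is the modal hour.
pd.Series(hour).value_counts().head(1)
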
In [147]:
noise['hour'] = hour
In [148]:
g8 = noise.where(noise['hour']>=8)
In [149]:
g8 = g8.dropna()
In [150]:
l15 = noise.where(noise['hour']<=15)
In [151]:
l15 = l15.dropna()
In [152]:
newNoise = pd.merge(g8, l15)
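
For reference, a single boolean mask selects the same school-hours subset (assuming the frame has no duplicate rows, which would make the merge multiply entries):

# Keep complaints logged between 8 AM and 3 PM, inclusive.
newNoise = noise[(noise.hour >= 8) & (noise.hour <= 15)]
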
In [153]:
lonlat = [geom.Point(lon, lat) for lon, lat in zip(newNoise.Lon, newNoise.Lat)]

n = gpd.GeoDataFrame(newNoise, geometry = lonlat, crs = {'init' :'epsg:4326'})
n.geometry.head()
Out[153]:
0    POINT (-122.424474850717 37.7691482064722)
1     POINT (-122.508365256702 37.762047023755)
2      POINT (-122.502232253358 37.76138549134)
3     POINT (-122.418177369469 37.805615941191)
4     POINT (-122.510211620936 37.773766740399)
Name: geometry, dtype: object
In [154]:
sfp = sf.plot()
s.plot(ax = sfp, marker = "x", color = "red")
n.plot(ax=sfp,marker = "x", color = "blue")
Out[154]:
<matplotlib.axes._subplots.AxesSubplot at 0x153543b50>

It is still fairly difficult to determine whether there is a correlation between noise complaints and school locations, but most schools do appear to be fairly close to at least one noise complaint. That would make sense, since areas with kids and younger people tend to be loud.

Question 5

Which type of park is the most popular in San Francisco?

In [155]:
parks = pd.read_sql_query("select Type, count(Type) as occurance, avg(Acreage) from parks group by Type",conn2)
In [156]:
parks
Out[156]:
Type occurance avg(Acreage)
0 Civic Plaza or Square 10 4.373000
1 Community Garden 18 0.316111
2 Concession 2 43.620000
3 Family Camp 1 337.510000
4 Mini Park 45 0.276000
5 Neighborhood Park or Playground 131 4.422748
6 Parkway 2 20.740000
7 Regional Park 19 155.626842
8 Zoological Garden 1 131.520000

From here we see that there are more neighborhood parks or playgrounds than any other type of park in San Francisco, averaging about 4.4 acres each. This is displayed visually in the chart below.
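
Occurrence counts and average acreage can also be combined into total acreage per type directly in SQL (a sketch against the same table):

# Total acreage per park type, largest first.
pd.read_sql_query(
    "SELECT Type, SUM(Acreage) AS total_acres FROM parks "
    "GROUP BY Type ORDER BY total_acres DESC", conn2)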

In [157]:
parks.plot(kind = 'pie', y = 'occurance', labels = ['Civic Plaza', 'Community Garden', 'Concession', 'Family Camp', 'Mini Park', 'Neighborhood Park', 'Parkway', 'Regional Park', 'Zoological Garden'])
Out[157]:
<matplotlib.axes._subplots.AxesSubplot at 0x189056990>

Although there are more neighborhood parks than any other kind of park in San Francisco, the stats above show that a family camp covers the most acreage for a single unit. This is surprising to me because I had never heard of a family camp. The second largest per unit is the zoological garden, which makes sense: the SF Zoo is quite large, whereas a neighborhood park can be as small as a patch of grass.

In [ ]: