In this assignment, you'll analyze a collection of data sets from the San Francisco Open Data Portal and Zillow. The data sets have been stored in the SQLite database sf_data.sqlite, which you can download here. The database contains the following tables:
| Table | Description |
|---|---|
| crime | Crime reports dating back to 2010. |
| mobile_food_locations | List of all locations where mobile food vendors sell. |
| mobile_food_permits | List of all mobile food vendor permits. More details here. |
| mobile_food_schedule | Schedules for mobile food vendors. |
| noise | Noise complaints dating back to August 2015. |
| parking | List of all parking lots. |
| parks | List of all parks. |
| schools | List of all schools. |
| zillow | Zillow rent and housing statistics dating back to 1996. More details here. |
The mobile_food_ tables are explicitly connected through the locationid and permit columns. The other tables are not connected, but you may be able to connect them using dates, latitude/longitude, or postal codes.
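For example, the three mobile_food_ tables can be joined on those keys. This is a minimal sketch, assuming only the documented keys plus the Applicant column (which appears in the permit queries below):

import sqlite3
import pandas as pd

conn = sqlite3.connect('sf_data.sqlite')

# Join schedules to permits via `permit` and to locations via `locationid`.
query = """
select p.Applicant, s.permit, s.locationid
from mobile_food_schedule s
join mobile_food_permits p on s.permit = p.permit
join mobile_food_locations l on s.locationid = l.locationid
"""
pd.read_sql_query(query, conn).head()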
Shapefiles for US postal codes are available here. These may be useful for converting latitude/longitude to postal codes.
Shapefiles for San Francisco Neighborhoods are available here.
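As a sketch of how a latitude/longitude-to-postal-code conversion might look with geopandas: the shapefile path and its attribute columns below are placeholders, so check the actual downloaded file before using this.

import geopandas as gpd
import shapely.geometry as geom

# Placeholder filename -- substitute the actual ZIP-code shapefile.
zips = gpd.read_file('zip_codes.shp')

# One example point (lon, lat) in WGS84, reprojected to match the shapefile.
pts = gpd.GeoDataFrame(geometry=[geom.Point(-122.42, 37.77)], crs={'init': 'epsg:4326'})
pts = pts.to_crs(zips.crs)

# Spatial join: each point picks up the attributes of the ZIP polygon
# containing it (requires the rtree package for the spatial index).
gpd.sjoin(pts, zips, how='left', op='within')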
Exercise 1.1. Which mobile food vendor(s) sells at the most locations?
import pandas as pd
import matplotlib.pyplot as plt
import sqlalchemy as sqla
import sqlite3

%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12, 12)  # make plots larger

sqlite_file = 'sf_data.sqlite'
conn = sqlite3.connect(sqlite_file)                     # plain sqlite3 connection
conn2 = sqla.create_engine('sqlite:///' + sqlite_file)  # SQLAlchemy engine for pandas
pd.read_sql_query("select permit, count(locationid) AS places from mobile_food_schedule group by permit order by places desc",conn2).head()
From here we see that the top three vendors sell at 340, 221, and 212 places.
pd.read_sql_query("select Applicant from mobile_food_permits where permit = '17MFF-0110' OR permit = '13MFF-0102' OR permit = '17MFF-0111'",conn2)
These vendors are Natan's Catering, May Catering, and Anas Goodies Catering.
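The two steps above can also be combined into a single query. Here is a sketch using a join; it uses count(distinct locationid) to guard against a permit being scheduled at the same location more than once, which is an assumption about the schedule table's granularity:

pd.read_sql_query(
    "select p.Applicant, count(distinct s.locationid) as places "
    "from mobile_food_schedule s "
    "join mobile_food_permits p on s.permit = p.permit "
    "group by s.permit order by places desc limit 3",
    conn2)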
Exercise 1.2. Ask and use the database to analyze 5 questions about San Francisco. For each question, write at least 150 words and support your answer with plots. Make a map for at least 2 of the 5 questions.
You should try to come up with some questions on your own, but examples of reasonable questions are provided in the assignment prompt.
Please make sure to clearly state each of your questions in your submission.
pd.read_sql_query("select DISTINCT CATEGORY from crime",conn2)
pd.read_sql_query("select * from crime",conn2).head()
crimesByDistrict = pd.read_sql_query("select PdDistrict, count(Category) as totCrimes from crime group by PdDistrict", conn2)
districtCrimes = crimesByDistrict[1:]  # skip the first row, which has an empty PdDistrict
maxCrimes = max(districtCrimes.totCrimes)
minCrimes = min(districtCrimes.totCrimes)
maxDistrict = districtCrimes[districtCrimes.totCrimes == maxCrimes]
minDistrict = districtCrimes[districtCrimes.totCrimes == minCrimes]
maxDistrict, minDistrict
From here we see that the district with the most reported crimes is the Southern district and the district with the fewest is Richmond.
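A quick bar chart of the districtCrimes frame built above makes the spread across districts easier to compare:

# Reported crimes per police district since 2010.
districtCrimes.plot(kind='bar', x='PdDistrict', y='totCrimes', legend=False)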
southern = pd.read_sql_query("select Category, count(Category) as occurrences from crime where PdDistrict = 'SOUTHERN' group by Category", conn2)
maxCrime = max(southern.occurrences)
southern[southern.occurrences == maxCrime]
From here we see that the most common crime in the Southern district is larceny/theft, with over 60,000 occurrences since 2010.
richmond = pd.read_sql_query("select Category, count(Category) as occurrences from crime where PdDistrict = 'RICHMOND' group by Category", conn2)
minCrime = min(richmond.occurrences)
richmond[richmond.occurrences == minCrime]
To answer this question (which police district is the most dangerous, and which the least?), we first check how many different categories of crime there are; there are 39. For this problem we assume every category counts as dangerous. The district with the fewest reported crimes is Richmond, and the crime that occurred least often there is pornography. The district with the most reported crimes is the Southern district, and its most common crime is larceny/theft.
from datetime import datetime

# One row per distinct timestamp in the Southern district.
southernTime = pd.read_sql_query("select Category, Datetime, count(Datetime) as occurrences from crime where PdDistrict = 'SOUTHERN' group by Datetime", conn2)

# Extract the hour from each timestamp ('YYYY-MM-DD HH:MM:SS').
hour = []
for i in range(0, len(southernTime)):
    date = southernTime.Datetime[i].split(" ")[1]
    date = datetime.strptime(date, '%H:%M:%S')
    hour.append(date.hour)
df = pd.DataFrame(hour)
df.describe()
Lastly, we check what time of day crimes most commonly occurred in the most dangerous district. We see that crimes in the Southern district most often occurred at around 11 PM.
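As an aside, the hour extraction can be done without an explicit loop. This is a minimal sketch using pandas' datetime accessor on the southernTime frame above, assuming the Datetime column stays in the 'YYYY-MM-DD HH:MM:SS' format seen so far:

# Vectorized equivalent of the loop above: parse every timestamp at once.
southernHours = pd.to_datetime(southernTime['Datetime']).dt.hour
southernHours.describe()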
# City-wide this time: one row per distinct timestamp.
crimeTime = pd.read_sql_query("select Category, Datetime, count(Datetime) as occurrences from crime group by Datetime", conn2)

# Extract the hour from each timestamp, as before.
hour = []
for i in range(0, len(crimeTime)):
    date = crimeTime.Datetime[i].split(" ")[1]
    date = datetime.strptime(date, '%H:%M:%S')
    hour.append(date.hour)

# Count how many crimes fall in each hour of the day and plot the result.
timedf = pd.DataFrame(hour)
timedf.columns = ['hour']
groups = pd.DataFrame(timedf.groupby(['hour']).size())
groups.columns = ['count']
groups.plot(y='count', kind='pie', legend=False)
This is a pie chart of the number of crimes per hour (not broken out by district). We see from here that fewer crimes occur early in the day (the early-morning hours are roughly similar to one another), and that more crimes occur as the day goes on, peaking at 11 PM (hour 23).
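Hour-by-hour counts are arguably easier to compare on a bar chart than a pie chart; a one-line alternative using the same groups frame:

# Same data as the pie chart, but heights are directly comparable.
groups.plot(kind='bar', y='count', legend=False)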
zillow = pd.read_sql_query("select * from zillow",conn2)
zillow.head()
zillow = zillow.dropna()
# SQLite returns the row that achieves the aggregate when a bare column
# accompanies max()/min(), so these give the region at each extreme.
priceMax = pd.read_sql_query("select RegionName, max(MedianSoldPricePerSqft_AllHomes) from zillow", conn2)
priceMin = pd.read_sql_query("select RegionName, min(MedianSoldPricePerSqft_AllHomes) from zillow", conn2)
priceMax, priceMin
Since there are many rows with NAs in this dataset, I chose to drop them so they wouldn't interfere with my analysis of the maximum and minimum median home prices (the SQL max/min aggregates ignore NULLs in any case). From here, we see the most expensive region is 94104 (median sold price per sqft of \$1336) and the least expensive region is 94124 (median sold price per sqft of \$103).
price = pd.read_sql_query("select RegionName, max(MedianSoldPricePerSqft_AllHomes), min(MedianSoldPricePerSqft_AllHomes) from zillow group by RegionName",conn2)
price.plot(kind = 'bar', y = ['max(MedianSoldPricePerSqft_AllHomes)', 'min(MedianSoldPricePerSqft_AllHomes)'], x = 'RegionName')
From the plot above we see that there is a large difference between the max and min sold prices per square foot within each region. The only region where the max and min are roughly equal is 94104; if even the minimum is close to the maximum, it is presumably a uniformly expensive/nice area.
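The gap can also be ranked directly in SQL. Here is a small sketch computing the max-min spread per region, using the same zillow column as above:

# Rank regions by the spread between their highest and lowest median sold price per sqft.
spread = pd.read_sql_query(
    "select RegionName, "
    "max(MedianSoldPricePerSqft_AllHomes) - min(MedianSoldPricePerSqft_AllHomes) as spread "
    "from zillow group by RegionName order by spread desc",
    conn2)
spread.head()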
# Noise complaints per neighborhood; Lat/Lon come from an arbitrary
# complaint in each group (SQLite keeps one row per group).
areas = pd.read_sql_query("select Neighborhood, count(Neighborhood) as occurrences, Lat, Lon from noise group by Neighborhood", conn2)
areas = areas[1:len(areas)]  # skip the first row, which has an empty Neighborhood
areas = areas.reset_index(drop=True)

# How many neighborhoods share each complaint count?
group = pd.DataFrame(areas.groupby(['occurrences']).size())
group.columns = ['count']
group.plot(kind='bar', y='count')
From the graph above, we see that small complaint counts are far more common than large ones. The maximum number of noise complaints in a single neighborhood is 480, with 386 the next highest, and each of those counts occurs in only one neighborhood.
min(areas.occurrences), max(areas.occurrences)
minArea = areas[areas.occurrences == 1].reset_index(drop=True)
maxAreas = areas[areas.occurrences == 480].reset_index(drop=True)
minArea
maxAreas
From here we see that seven neighborhoods are tied for the fewest noise complaints (one each): Balboa Terrace, Candlestick Point, India Basin, Laguna Honda, Lincoln Park, St. Francis Wood, and Sunnydale. The neighborhood with the most noise complaints is South of Market. We will now plot these locations to see whether the quiet neighborhoods are far from the noisiest one.
import shapely.geometry as geom
import geopandas as gpd

# Point geometries for the quietest neighborhoods (one complaint each)...
lonlat = [geom.Point(lon, lat) for lon, lat in zip(minArea.Lon, minArea.Lat)]
minArea = gpd.GeoDataFrame(minArea.drop(["Lon", "Lat"], axis=1), geometry=lonlat, crs={'init': 'epsg:4326'})
minArea.geometry.head()

# ...and for the noisiest neighborhood.
lonlat = [geom.Point(lon, lat) for lon, lat in zip(maxAreas.Lon, maxAreas.Lat)]
maxAreas = gpd.GeoDataFrame(maxAreas.drop(["Lon", "Lat"], axis=1), geometry=lonlat, crs={'init': 'epsg:4326'})
maxAreas.geometry.head()
sf = gpd.read_file("sf/geo_export_c9c00fb7-1521-4238-9257-b6176bf1cd62.shp")
sf.head()
base = sf.plot()
minArea.plot(ax=base, marker="x", color="red")
maxAreas.plot(ax=base, marker="x", color="blue")
Although the markers are small, we can see that the neighborhoods with the fewest noise complaints (red) are quite far from the neighborhood with the most (blue). Overall, all of these areas are well separated from one another.
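For the record, the marker size can be increased: geopandas' plot method accepts a markersize argument. A sketch of a more legible version of the map above:

# Same map with larger markers so the points are easier to see.
base = sf.plot()
minArea.plot(ax=base, marker="x", color="red", markersize=80)
maxAreas.plot(ax=base, marker="x", color="blue", markersize=80)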
schools = pd.read_sql_query("select * from schools",conn2)
schoolsGroup = pd.read_sql_query("select GradeRange, count(GradeRange) as count from schools group by GradeRange",conn2)
schoolsGroup
schoolsGroup.plot(kind = 'pie', y = 'count')
maxCount = max(schoolsGroup['count'])
schoolsGroup[schoolsGroup['count'] == maxCount]  # the most common grade range (165 schools)
From here we see that there are more pre-kindergarten-only schools than any other kind. This leads me to believe there probably isn't a correlation between schools and noise complaints, since most noise complaints occurred at 5 PM, which is typically after school lets out (this value is calculated a few steps below).
noise = pd.read_sql_query("select * from noise",conn2)
noise = noise.dropna()
# Point geometries for schools and for all noise complaints.
lonlat = [geom.Point(lon, lat) for lon, lat in zip(schools.Lon, schools.Lat)]
s = gpd.GeoDataFrame(schools, geometry=lonlat, crs={'init': 'epsg:4326'})
s.geometry.head()
lonlat = [geom.Point(lon, lat) for lon, lat in zip(noise.Lon, noise.Lat)]
n = gpd.GeoDataFrame(noise, geometry=lonlat, crs={'init': 'epsg:4326'})
n.geometry.head()

# Schools in red, noise complaints in blue.
sfp = sf.plot()
s.plot(ax=sfp, marker="x", color="red")
n.plot(ax=sfp, marker="x", color="blue")
Since there are so many data points, it is hard to tell whether there is a correlation. Next, I will restrict the noise complaints to school hours (8 AM to 3 PM, thinking only of elementary, middle, and high schools).
noise = pd.read_sql_query("select Datetime, Lat, Lon from noise", conn2)
noise = noise.dropna()
noise = noise.reset_index(drop=True)

# Extract the hour of each complaint.
hour = []
for i in range(0, len(noise)):
    date = noise.Datetime[i].split(" ")[1]
    date = datetime.strptime(date, '%H:%M:%S')
    hour.append(date.hour)

# Find the hour with the most complaints.
from collections import defaultdict
d = defaultdict(int)
for i in hour:
    d[i] += 1
result = max(d.items(), key=lambda x: x[1])
print(result)
noise['hour'] = hour

# Keep complaints that occurred during school hours (8 AM to 3 PM).
newNoise = noise[(noise['hour'] >= 8) & (noise['hour'] <= 15)].reset_index(drop=True)
lonlat = [geom.Point(lon, lat) for lon, lat in zip(newNoise.Lon, newNoise.Lat)]
n = gpd.GeoDataFrame(newNoise, geometry=lonlat, crs={'init': 'epsg:4326'})
n.geometry.head()

# Schools in red, school-hours noise complaints in blue.
sfp = sf.plot()
s.plot(ax=sfp, marker="x", color="red")
n.plot(ax=sfp, marker="x", color="blue")
It is still fairly difficult to determine whether there is a correlation between noise complaints and school locations, but most schools do seem to be fairly close to at least one noise complaint. This would make sense, since areas with kids and younger people tend to be loud.
parks = pd.read_sql_query("select Type, count(Type) as occurrences, avg(Acreage) from parks group by Type", conn2)
parks
From here we see that there are more neighborhood parks/playgrounds than any other type of park in San Francisco, averaging about 4.4 acres each. This is displayed visually in the graph below.
parks.plot(kind='pie', y='occurrences', labels=['Civic Plaza', 'Community Garden', 'Concession', 'Family Camp', 'Mini Park', 'Neighborhood Park', 'Parkway', 'Regional Park', 'Zoological Garden'])
Although there are more neighborhood parks than any other kind of park in San Francisco, the statistics above show that family camps have the largest average acreage per unit. This is surprising to me because I had never heard of a family camp. The second-largest average acreage belongs to the zoological garden, which makes sense: the SF Zoo is quite large, whereas a neighborhood park can be as small as a patch of grass.
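To back up the acreage claim visually, here is a small sketch plotting average acreage by park type; it assumes the unaliased aggregate column comes back named avg(Acreage), as SQLite typically names expression columns:

# Average acreage by park type; family camps and the zoo should dominate.
parks.plot(kind='bar', x='Type', y='avg(Acreage)', legend=False)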