In this assignment, you'll combine the assignment 3 data set with nutrition data from the USDA Food Composition Databases. The CSV file fresh.csv contains the fresh fruits and vegetables data you extracted in assignment 3.
The USDA Food Composition Databases have a documented web API that returns data in JSON format. You need a key to use the API. Only 1000 requests are allowed per hour, so it is a good idea to cache responses.
Sign up for an API key here. The key will work with any Data.gov API. You may need the key again later in the quarter, so make sure you save it.
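Since the hourly request limit makes caching worthwhile, here is a minimal sketch of an in-memory cache. The names make_cached_fetch, fetch_json, and cached_fetch_json are hypothetical helpers introduced for illustration; they are not part of the assignment or of the USDA API.

```python
import json


def make_cached_fetch(fetch):
    """Wrap fetch(url, params) so repeated identical requests are served from memory."""
    cache = {}

    def cached_fetch(url, params):
        # Serialize the parameters with sorted keys so the same query
        # always maps to the same cache key, whatever the dict order.
        cache_key = (url, json.dumps(params, sort_keys=True))
        if cache_key not in cache:
            cache[cache_key] = fetch(url, params)
        return cache[cache_key]

    return cached_fetch


def fetch_json(url, params):
    """Hypothetical fetcher: issue a GET request and decode the JSON body."""
    import requests  # imported lazily so the cache has no third-party dependency
    return requests.get(url, params=params).json()


# Only the first call for a given (url, params) pair reaches the network.
cached_fetch_json = make_cached_fetch(fetch_json)
```

This cache lives only as long as the Python session. Writing the responses to disk would be one way to keep them across notebook restarts.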
These modules may be useful:
Exercise 1.1. Read the search request documentation, then write a function called ndb_search() that makes a search request. The function should accept the search term as an argument. The function should return the search result items as a list (for 0 items, return an empty list).
Note that the search url is: https://api.nal.usda.gov/ndb/search
As an example, a search for "quail eggs" should return this list:
[{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'CHAOKOH, QUAIL EGG IN BRINE, UPC: 044738074186',
u'ndbno': u'45094707',
u'offset': 0},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'L&W, QUAIL EGGS, UPC: 024072000256',
u'ndbno': u'45094890',
u'offset': 1},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'BUDDHA, QUAIL EGGS IN BRINE, UPC: 761934535098',
u'ndbno': u'45099560',
u'offset': 2},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'GRAN SABANA, QUAIL EGGS, UPC: 819140010103',
u'ndbno': u'45169279',
u'offset': 3},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u"D'ARTAGNAN, QUAIL EGGS, UPC: 736622102630",
u'ndbno': u'45178254',
u'offset': 4},
{u'ds': u'SR',
u'group': u'Dairy and Egg Products',
u'name': u'Egg, quail, whole, fresh, raw',
u'ndbno': u'01140',
u'offset': 5}]
As usual, make sure you document and test your function.
import pandas as pd
import requests

key = "IcqWz29klKjRfiCAGy2AZvLbt5COAgmqpWy7CbAP"

def ndb_search(term):
    """Search the USDA NDB for term.

    Returns the search result items as a list (an empty list for 0 items).
    """
    url = "https://api.nal.usda.gov/ndb/search/"
    # Passing params lets requests URL-encode the search term (e.g. spaces).
    params = {"format": "json", "max": 500, "q": term, "api_key": key}
    response = requests.get(url, params=params).json()
    # An 'errors' key signals a failed search; return an empty list,
    # as the exercise requires, rather than 0.
    if "errors" in response:
        return []
    return response["list"]["item"]

ndb_search("quail eggs")
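To test the response handling without spending API requests, the parsing step can be separated from the network call. The sketch below assumes the response shape used in ndb_search() above (items under 'list' then 'item', failures under 'errors'). The helper name extract_items and both sample dictionaries are made up for illustration; in particular, the contents of the 'errors' value are a placeholder, not the API's real error format.

```python
def extract_items(response):
    """Return the search result items from a decoded response, or [] if there are none."""
    if "errors" in response:
        return []
    return response.get("list", {}).get("item", [])


# Hand-written stand-ins for decoded JSON responses (not real API output).
sample_ok = {"list": {"item": [{"name": "Egg, quail, whole, fresh, raw",
                                "ndbno": "01140"}]}}
sample_error = {"errors": {"error": []}}

print(len(extract_items(sample_ok)))
print(extract_items(sample_error))
```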
Exercise 1.2. Use your search function to get NDB numbers for the foods in the fresh.csv file. It's okay if you don't get an NDB number for every food, but try to come up with a strategy that gets most of them. Discuss your strategy in a short paragraph.
Hints:
Convert the output of ndb_search() to a data frame with pd.DataFrame().
Combine data frames with pd.merge().
data = pd.read_csv("fresh.csv")
data
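# Strategy: fresh.csv holds fresh produce, so for each food take the first
# search result whose name contains 'raw'. Foods with no such result are skipped.
l = {}
for food in data['food']:
    results = ndb_search(food) or []  # treat a failed search as no results
    for item in results:
        if 'raw' in item['name']:
            l[food] = item['ndbno']
            break
df = pd.DataFrame(list(l.items()), columns=['food', 'ndb_num'])
df = df.merge(data)
df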
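The matching rule used above (first result whose name contains 'raw') can be pulled into a small function, which makes it easier to test and refine. The sketch below is a variation, not the approach in the cell above: it matches case-insensitively and prefers entries with ds equal to 'SR'. That preference rests on an assumption drawn from the quail-egg sample, where the only non-branded entry has ds 'SR'. The name pick_ndbno is hypothetical.

```python
def pick_ndbno(results):
    """Pick an NDB number for a fresh food from search results, or None if nothing fits."""
    raw = [r for r in results if "raw" in r["name"].lower()]
    # Assumption: ds == 'SR' marks a non-branded reference entry,
    # as in the quail-egg sample output.
    standard = [r for r in raw if r.get("ds") == "SR"]
    candidates = standard or raw
    return candidates[0]["ndbno"] if candidates else None


# Two entries taken from the quail-egg example output.
sample = [
    {"ds": "BL", "name": "L&W, QUAIL EGGS, UPC: 024072000256", "ndbno": "45094890"},
    {"ds": "SR", "name": "Egg, quail, whole, fresh, raw", "ndbno": "01140"},
]
print(pick_ndbno(sample))
```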
Exercise 1.3. Read the food reports V2 documentation, then write a function called ndb_report() that requests a basic food report. The function should accept the NDB number as an argument and return the list of nutrients for the food.
Note that the report url is: https://api.nal.usda.gov/ndb/V2/reports
For example, for "09279" (raw plums) the first element of the returned list should be:
{u'group': u'Proximates',
u'measures': [{u'eqv': 165.0,
u'eunit': u'g',
u'label': u'cup, sliced',
u'qty': 1.0,
u'value': u'143.93'},
{u'eqv': 66.0,
u'eunit': u'g',
u'label': u'fruit (2-1/8" dia)',
u'qty': 1.0,
u'value': u'57.57'},
{u'eqv': 151.0,
u'eunit': u'g',
u'label': u'NLEA serving',
u'qty': 1.0,
u'value': u'131.72'}],
u'name': u'Water',
u'nutrient_id': u'255',
u'unit': u'g',
u'value': u'87.23'}
Be sure to document and test your function.
def ndb_report(num):
    """Request a basic food report and return the food's list of nutrients.

    num is the NDB number as a string, e.g. "09279".
    """
    url = "https://api.nal.usda.gov/ndb/V2/reports"
    params = {"ndbno": num, "type": "b", "format": "json", "api_key": key}
    response = requests.get(url, params=params).json()
    return response['foods'][0]['food']['nutrients']

ndb_report("09279")
# Check that the report really contains nutrients by inspecting one entry.
listcal = ndb_report("09279")[4]
listcal
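Picking a nutrient by its position in the list works only while every report lists nutrients in the same order. A lookup by name is one possible alternative. The sketch below relies only on the 'name', 'value', and 'unit' keys visible in the example entry for raw plums; nutrient_value is a hypothetical helper.

```python
def nutrient_value(nutrients, name):
    """Return (value, unit) for the nutrient called name, or None if it is absent."""
    for nutrient in nutrients:
        if nutrient["name"] == name:
            # The API returns values as strings, so convert to float.
            return float(nutrient["value"]), nutrient["unit"]
    return None


# Abbreviated copy of the 'Water' entry from the raw plums example.
sample = [{"group": "Proximates", "name": "Water", "nutrient_id": "255",
           "unit": "g", "value": "87.23"}]
print(nutrient_value(sample, "Water"))  # (87.23, 'g')
```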
Exercise 1.4. Which foods provide the best combination of price, yield, and nutrition? You can use kilocalories as a measure of "nutrition" here, but a more detailed analysis is better. Use plots to support your analysis.
def calc_nutrients(num):
    """Return {ndb_num: value} for the nutrient at position num in each food's report."""
    values = {}
    for ndb_num in df['ndb_num']:
        nutrient = ndb_report(ndb_num)[num]
        # The API returns values as strings; float() handles whole numbers too.
        values[ndb_num] = float(nutrient[u'value'])
    return values

# Positions used in the report: 1 = kcal, 2 = protein, 3 = fat, 4 = carbs.
for position, column in [(1, 'kcal'), (2, 'protein'), (3, 'fat'), (4, 'carbs')]:
    l = pd.DataFrame(list(calc_nutrients(position).items()), columns=['ndb_num', column])
    df = df.merge(l)
df
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df["price_per_lb"], df["yield"], df["kcal"])
ax.set_xlabel('Price Per lb')
ax.set_ylabel('Yield')
ax.set_zlabel('Nutrition in Kcal')
ax.set_title('Price per lb vs. Yield vs. Nutrition (kcal)')
plt.show()
The plot shows that one of the most expensive fruits also has one of the highest yields and a low kcal value. The next cell identifies that fruit.
x = max(df['price_per_lb'])
df[df['price_per_lb'] == x]
The most expensive item is raspberries, which match the point described above.
plt.scatter(df['kcal'], df['yield'])
plt.xlabel("Nutrition (kcal)")
plt.ylabel("Yield")
plt.title("Nutrition vs. Yield")
plt.show()
This graph shows little correlation between nutrition and yield, so kcal content does not appear to predict percent yield. One point is an outlier, with kcal approximately equal to 160. The next cell identifies that fruit.
x = max(df['kcal'])
df[df['kcal'] == x]
From here we can see that avocados have the highest calorie content of all the fruits.
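The impression that nutrition and yield are only weakly related can be checked with a number as well as a plot. The sketch below computes the Pearson correlation coefficient in plain Python; pearson is a hypothetical helper, and the toy lists are invented for illustration, not taken from the data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data: a perfectly linear relationship gives a coefficient of 1.
print(pearson([1, 2, 3], [2, 4, 6]))
```

With the notebook's data frame, pearson(list(df['kcal']), list(df['yield'])) would give the coefficient for the plot above, and pandas offers the same calculation as df['kcal'].corr(df['yield']). A value near 0 would support the reading of the scatter plot.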
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df["price_per_lb"], df["yield"], df["protein"])
ax.set_xlabel('Price Per lb')
ax.set_ylabel('Yield')
ax.set_zlabel('Nutrition in Protein')
ax.set_title('Price per lb vs. Yield vs. Nutrition (Protein)')
plt.show()
The plot shows that the fruits and vegetables with the most protein have a mid-range price per pound and a high yield. The one with the highest amount of protein is kale, which has a moderate price per pound and a very high yield.
x = max(df['protein'])
df[df['protein'] == x]
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df["price_per_lb"], df["yield"], df["fat"])
ax.set_xlabel('Price Per lb')
ax.set_ylabel('Yield')
ax.set_zlabel('Nutrition in fat')
ax.set_title('Price per lb vs. Yield vs. Nutrition (fat)')
plt.show()
The plot shows that most of the fruits and vegetables are low in fat, which makes sense. Only one point looks like an outlier, with far more fat than the rest. It is fairly inexpensive per pound and has a somewhat high yield. The output below shows that it is avocado, which makes sense since avocados are known to be high in fat.
x = max(df['fat'])
df[df['fat']== x]
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df["price_per_lb"], df["yield"], df["carbs"])
ax.set_xlabel('Price Per lb')
ax.set_ylabel('Yield')
ax.set_zlabel('Nutrition in carbs')
ax.set_title('Price per lb vs. Yield vs. Nutrition (carbs)')
plt.show()
Lastly, we have price vs. yield vs. carbs. One point has the highest amount of carbs; it is fairly cheap and has a moderately low yield compared with the other points. The block below finds the food with the highest carb content, which turns out to be bananas.
x = max(df['carbs'])
df[df['carbs']== x]