The US Department of Agriculture publishes price estimates for fruits and vegetables online. The most recent estimates are based on a 2013 survey of US retail stores.
The estimates are provided as a collection of MS Excel files, with one file per fruit or vegetable. The assignment3_data.zip file contains the fruit and vegetable files in the directories fruit and vegetables, respectively.
Exercise 1.1. Use pandas to extract the "Fresh" row(s) from the fruit Excel files. Combine the data into a single data frame. Your data frame should look something like this:
| type | food | form | price_per_lb | yield | lb_per_cup | price_per_cup |
|---|---|---|---|---|---|---|
| fruit | watermelon | Fresh1 | 0.333412 | 0.52 | 0.330693 | 0.212033 |
| fruit | cantaloupe | Fresh1 | 0.535874 | 0.51 | 0.374786 | 0.3938 |
| vegetables | onions | Fresh1 | 1.03811 | 0.9 | 0.35274 | 0.406868 |
| ... |
It's okay if the rows and columns of your data frame are in a different order. These modules are especially relevant:
str methodsosos.pathread_excel(), concat(), .fillna(), .str, plotting methodsAsk questions and search the documentation/web to find the functions you need.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import os
def getThemDataFrames(path):
#path = 'assignment3_data/fruit/'
files = os.listdir(path)
#print files
x = [pd.read_excel(path+y) for y in files]
#x = pd.DataFrame(x)
for i in range(len(files)):
x[i].iloc[:,5] = files[i]
for i in range(len(files)): #get the fruits
x[i].iloc[:,2] = "fruit"
for i in range(len(files)):
x[i].iloc[:,5] = [t.rstrip(".xlsx") for t in x[i].iloc[:,5]]
for i in range(len(files)):
x[i] = x[i].iloc[:,:7]
for i in range(len(files)):
x[i].columns = ['form','price_per_lb','type','yield','lb_per_cup','food','price_per_cup']
df = []
for i in range(len(files)):
df.append(x[i])
new = pd.concat(df)
fnon = new[~new['form'].isnull()]
fnon = new[~new['price_per_lb'].isnull()]
fnon.iloc[:,0] = fnon.iloc[:,0].str.lower()
fruit = fnon[fnon['form'].str.contains("fresh")]
return fruit
fruit = getThemDataFrames('assignment3_data/fruit/')
fruit
Exercise 1.2. Reuse your code from exercise 1.1 to extract the "Fresh" row(s) from the vegetable Excel files.
Does your code produce the correct prices for tomatoes? If not, why not? Do any other files have the same problem as the tomatoes file?
You don't need to extract the prices for these problem files. However, make sure the prices are extracted for files like asparagus that don't have this problem.
veggies = getThemDataFrames('assignment3_data/vegetables/')
veggies
No it does not produce the correct price for tomatoes. This is because it is not on the same line as the word "fresh" thus the price does not get extracted. This occurs for beets, black beans, black eyed peas, carrots, celery, great northern beans, green peas,kidney beans, lentils, lima beans, mixed vegetables, mushrooms, navy beans, olives, pinto beans, pumpkin, and spinach.
Exercise 1.3. Remove rows without a price from the vegetable data frame and then combine the fruit and vegetable data frames. Make sure all columns of numbers are numeric (not strings).
veggies = veggies[~veggies.price_per_cup.isnull()]
veggies
df_new = pd.concat([veggies, fruit])
df_new
df_new = df_new.reset_index()
del df_new['index']
df_new
Exercise 1.4. Discuss the questions below (a paragraph each is sufficient). Use plots to support your ideas.
fruits = df_new[df_new.type == 'fruit']
plt.bar(range(25, 25+len(fruits)),fruits['price_per_lb'])
plt.xlabel("index")
plt.ylabel('Price per pound')
plt.title("index vs price per pound for fruits")
plt.show()
From here we can see that the most expensive fruit is the one at around index 45 and the cheapest fruit per pound is at index 48. The next cell will print out the fruit name. Most expensive fruit per pound is raspberries and cheapest is watermelon.
print fruits["food"][45], fruits["food"][48]
veg = df_new[df_new.type == 'vegetable']
plt.hist(veg['price_per_lb'], alpha=.4, bins=15, label="veggies")
plt.hist(fruits['price_per_lb'], alpha=.2, bins=15, label="fruit")
plt.title("Historgram of prices for veggies and fruit")
plt.legend()
plt.show()
plt.scatter(range(0, len(veg)),veg.price_per_lb, label="veggies")
plt.scatter(range(0, len(fruits)),fruits.price_per_lb,c='r', label="fruits")
plt.plot((0, len(veg)), (veg["price_per_lb"].mean(), veg["price_per_lb"].mean()), 'blue', label="veg means")
plt.plot((0, len(fruit)), (fruit["price_per_lb"].mean(), fruit["price_per_lb"].mean()), 'red', label = "fruit means")
plt.plot((0, len(fruit)), (fruit["price_per_lb"].median(), fruit["price_per_lb"].median()), 'green', label="median of fruits")
plt.plot((0, len(fruit)), (veg["price_per_lb"].median(), veg["price_per_lb"].median()), 'yellow', label="median of veggies")
plt.legend(loc=2,prop={'size':8})
plt.xlabel("index")
plt.ylabel("price per pound")
plt.title("index vs price per pound for veggies and fruit")
plt.show()
From the scatter plot we see that fruits are more expensive than vegetables on average. However, using the median, vegetables are more expensive than fruits. This is because there are some fruits that way a lot more thus they skew the mean. In this example, looking at the median is a better estimate for price distribution of fruits and vegetables.
print "vegetable mean is ", veg["price_per_lb"].mean(), "fruit mean is", fruit["price_per_lb"].mean()
print "vegetable median is ", veg["price_per_lb"].median(), "fruit median is", fruit["price_per_lb"].median()
plt.bar(range(0, len(veg)),veg['price_per_lb'])
plt.xlabel("index")
plt.ylabel('Price per pound')
plt.title("index vs price per pound for veggies")
plt.show()
From the bar graph above we see that the most expensive vegetable per pound is at either index 2 or 17.
expensive = veg['price_per_lb'].max()
veg[veg['price_per_lb'] == expensive]
From the code above we see that the most expensive vegetable per pound is okra. We do the same thing to find the cheapest
cheap = veg['price_per_lb'].min()
veg[veg['price_per_lb'] == cheap]
cheapFruit = fruit['price_per_lb'].min()
fruit[fruit['price_per_lb'] == cheapFruit]
From the outputs above we see that the most vegetable and fruit you get per pound are potatoes and watermelon respectively.
One thing thats interesting about this dataset is that some of the yields are greater than 1. After looking up the meaning of "yield" it says online that yield refers to the percent thats edible from the total amount of the fruit/vegetable. The following code will show the foods that have an error in their yield value (greater than 1).
df_new[df_new['yield']>1]
Here we see that only the vegetables have a yield greater than 1. These vegetables are brussel sprouts, collard greens, and kale.
#taken from stackoverflow
def best_fit(X, Y):
xbar = sum(X)/len(X)
ybar = sum(Y)/len(Y)
n = len(X) # or len(Y)
numer = sum([xi*yi for xi,yi in zip(X, Y)]) - n * xbar * ybar
denum = sum([xi**2 for xi in X]) - n * xbar**2
b = numer / denum
a = ybar - b * xbar
#print('best fit line:\ny = {:.2f} + {:.2f}x'.format(a, b))
return a, b
plt.scatter(df_new['yield'],df_new['price_per_lb'])
plt.xlabel("yield")
a,b = a, b = best_fit(df_new['yield'], df_new['price_per_lb'])
yfit = [a + b * xi for xi in df_new['yield']]
plt.plot(df_new['yield'], yfit)
plt.ylabel("price per pound")
plt.title("Yield vs. Price per pound")
plt.show()
From here we see that there is a somewhat positive trend between yield and price per pound. However, we can not state that a higher yield means that the price will be more expensive because there are some yields that are close to 100% that are somewhat less expensive than lesser yields.
With this dataset it is hard to make a conclusive report on nutrition for these food items.