Changes from all commits
89 commits
ce28924
test
May 14, 2011
dc19e8c
Test commit.
May 14, 2011
85c2d7a
test
lbling May 14, 2011
4cb0203
Added initial folders.
May 15, 2011
48c1662
Reference Docs
May 15, 2011
6701f75
Made eclipse compatible project.
May 15, 2011
4d8689d
Added files for pydev project.
May 15, 2011
1d616bc
Fixed up readme.
May 15, 2011
8ae30fa
Add 100K data for generation.
May 15, 2011
5370266
Added inputGen, can read and dump as json. Need to do rest still.
May 15, 2011
3d680b0
partially working with arrays, but cannot search.
May 16, 2011
a2179d4
Now working for 1000 randomly selected data points
May 16, 2011
32b6873
Added Zip Codes to lat, long
May 16, 2011
6141e27
Now with lat long.
May 16, 2011
033e9b5
Now working with lat long.
May 16, 2011
0729fec
Updated to remove zip code.
May 16, 2011
e1ca0ce
reduce min node count to 100 to reduce tree size.
May 16, 2011
872aeb6
test
May 18, 2011
cda7e7b
Here's what I have so far. This is completely untested. mr-equi-his…
May 18, 2011
22f919f
Cleaned up results.
May 19, 2011
fdf478e
Made data numeric.
May 19, 2011
180eeaf
Added planet model lib for read/writing to file.
May 19, 2011
a3118fd
Now we can print node model to file, however, need to get applicable …
May 20, 2011
8de893e
almost working, but results have changed.
May 20, 2011
2a3a7c6
Now works with our iterative model file approach. Can try mapreduce …
May 20, 2011
944a873
expected result, for testing.
May 20, 2011
c71fdf7
Added project files to access Scott's work.
May 20, 2011
9509d1a
Now we can graph.
May 20, 2011
9d285e2
Working with digraph, need to balance out tree.
May 20, 2011
9e30e72
Works!
May 20, 2011
3bfd3c0
refactored code. No duplicates
May 20, 2011
9db67fd
Now working well with graphviz!
May 20, 2011
5b5266a
Initial presentation.
May 20, 2011
7692933
Now working again. Looks like rowmath needs help
Feb 20, 2012
9b9330f
latest changes
Feb 20, 2012
c17dd7c
removed compiled python
Feb 20, 2012
deeb762
Upgraded to latest pygraph routine.
Feb 20, 2012
dc619c9
Initial data set and results thus far. RMLSE 0.95 :(
jsrawan-mobo May 26, 2012
3671944
now lm in a loop
jsrawan-mobo May 26, 2012
81ed583
Campaign.r with mikes original data.
jsrawan-mobo May 26, 2012
36701d4
Campaign.r with mikes original data.
jsrawan-mobo May 26, 2012
4934b7b
now with cleaning.
jsrawan-mobo May 26, 2012
4946b3f
Added mike's gbm model and update the linear model to match with subm…
jsrawan-mobo May 29, 2012
e72b5d0
The original source.
jsrawan-mobo May 29, 2012
267d526
Checked with default output.
jsrawan-mobo May 29, 2012
57f6d30
For inputs columns 13:29. Only first column thus far.
jsrawan-mobo May 29, 2012
b4c5c21
Only column 13:29
jsrawan-mobo May 29, 2012
2af5d3f
Does not work. Validation error goes off the hook once all params a…
jsrawan-mobo May 30, 2012
d560e54
Mike gbm, without any output
jsrawan-mobo May 31, 2012
ce6d349
From Eugene's. Submitted to kaggle already.
jsrawan-mobo Jun 16, 2012
4105376
Now RMLSE equals the first submission that we made with RMLSE 0.76 (p…
jsrawan-mobo Jun 17, 2012
9d5fa2e
from competition
jsrawan-mobo Jun 17, 2012
d8175e7
This is running as per the submitted values. Noticed small differenc…
jsrawan-mobo Jun 17, 2012
75aa521
Whoops. Fixed now.
jsrawan-mobo Jun 17, 2012
431e98f
Modified but worse results.
jsrawan-mobo Jun 17, 2012
5ed4396
temp not working
jsrawan-mobo Jun 17, 2012
3409c3d
unvalidated.
jsrawan-mobo Jun 17, 2012
db1c7ae
fix xtest.
jsrawan-mobo Jun 17, 2012
c5e8b43
works on ubuntu12 with latest.
jsrawan-mobo Aug 18, 2012
8f01bf2
Now checkin working copy.
jsrawan-mobo Aug 18, 2012
9e49bbf
titanic submits
jsrawan-mobo Jan 26, 2013
1963883
added inputs
jsrawan-mobo Jan 26, 2013
75d601b
yelp readme
jsrawan-mobo Feb 2, 2013
d499e23
initial yelp normalization
jsrawan-mobo Feb 9, 2013
c31ec09
Updated instructions.
jsrawan-mobo Feb 10, 2013
926c64d
now reads tab delimited format...
jsrawan-mobo Feb 10, 2013
98bf484
Canopies are generated.
jsrawan-mobo Feb 10, 2013
aa17903
working but slow.
jsrawan-mobo Feb 10, 2013
fc8af7e
Clusters 3K canopies. Further inspection shows the locations ar dis…
jsrawan-mobo Feb 10, 2013
c6d87a9
Now we get canopy centers in dictoary, to speed up mappers in iterate…
jsrawan-mobo Feb 10, 2013
1582408
Change keys, so they can be quickly looked up in hash.
jsrawan-mobo Feb 10, 2013
58acc58
Now clusters are index by business_id, for fast lookup.
jsrawan-mobo Feb 11, 2013
155b4ec
Runs through, now we need to validate, somehow, by taking average ove…
jsrawan-mobo Feb 11, 2013
8447c15
Clean up error handling.
jsrawan-mobo Feb 11, 2013
23bba4a
Validation Works!! we have 130K users in 3k clusters..
jsrawan-mobo Feb 11, 2013
6f654eb
added auxilary files.
jsrawan-mobo Feb 11, 2013
c0153a3
Now user'ids included in cluster assignments.
jsrawan-mobo Feb 11, 2013
38356f6
dump stats fails on reducer!
jsrawan-mobo Feb 11, 2013
de33d6d
walmart version 1
jsrawan-mobo Apr 7, 2014
f2819fa
remove lock files
jsrawan-mobo Apr 7, 2014
2502325
submission on small trees
jsrawan-mobo Apr 7, 2014
60d8dad
submission 2
jsrawan-mobo Apr 7, 2014
a257239
submission full trees and data
jsrawan-mobo Apr 7, 2014
404e9a8
added WMAE
jsrawan-mobo Apr 8, 2014
1b26981
WMAE is much better, however, only after exluding negative and small …
jsrawan-mobo Apr 8, 2014
880cbc3
submission 7
jsrawan-mobo Apr 8, 2014
fb78064
submission 8
jsrawan-mobo Apr 8, 2014
a8eee4e
leave in reload state
jsrawan-mobo May 20, 2014
179165e
walmart
jsrawan-mobo Sep 17, 2014
418 changes: 418 additions & 0 deletions src/Titanic/genderbasedmodelpy.csv

Large diffs are not rendered by default.

64 changes: 64 additions & 0 deletions src/Titanic/gendermodel.py
@@ -0,0 +1,64 @@
""" This simple code is desinged to teach a basic user to read in the files in python, simply find what proportion of males and females survived and make a predictive model based on this
Author : AstroDave
Date : 18th September, 2012

"""


import csv as csv
import numpy as np

csv_file_object = csv.reader(open('train.csv', 'rb')) #Load in the csv file
header = csv_file_object.next()         #Skip the first line as it is a header
data = []                               #Create a variable called 'data'
for row in csv_file_object:             #Step through each row in the csv file
    data.append(row)                    #adding each row to the data variable
data = np.array(data)                   #Then convert from a list to an array

#Now I have an array of 11 columns and 891 rows
#I can access any element I want, so the entire first column would
#be data[0::,0].astype(np.float). This means all of the rows in column 0.
#I have to add the astype command because when reading in, the values were
#treated as strings, so they need to be converted to floats.

number_passengers = np.size(data[0::,0].astype(np.float))
number_survived = np.sum(data[0::,0].astype(np.float))
proportion_survivors = number_survived / number_passengers

# I can now find the stats of all the women on board
women_only_stats = data[0::,3] == "female" #This finds where all the women are
men_only_stats = data[0::,3] != "female" #This finds where all the men are
# != means not equal

#I can now find for example the ages of all the women by just placing
#women_only_stats in the '0::' part of the array index. You can test it by
#placing it in the 4th column and it should all read 'female'

women_onboard = data[women_only_stats,0].astype(np.float)
men_onboard = data[men_only_stats,0].astype(np.float)

proportion_women_survived = np.sum(women_onboard) / np.size(women_onboard)
proportion_men_survived = np.sum(men_onboard) / np.size(men_onboard)

print 'Proportion of women who survived is %s' % proportion_women_survived
print 'Proportion of men who survived is %s' % proportion_men_survived

#Now I have my indicator I can read in the test file and write out
#if a woman then survived (1), if a man then did not survive (0)
#1st Read in test
test_file_object = csv.reader(open('test.csv', 'rb'))
header = test_file_object.next()

#Now also open a new file so we can write to it; call it something
#descriptive

open_file_object = csv.writer(open("genderbasedmodelpy.csv", "wb"))

for row in test_file_object:
    if row[2] == 'female':
        row.insert(0,'1')                  #Insert the prediction at the start of the row
        open_file_object.writerow(row)     #Write the row to the file
    else:
        row.insert(0,'0')
        open_file_object.writerow(row)

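Editor's note: for readers on a current interpreter, here is a minimal Python 3 sketch of the same gender rule. It is an illustration, not part of the PR, and it assumes the 2012-era positional layout used above (train column 0 = survived, column 3 = sex; test column 2 = sex) and the same train.csv / test.csv file names.

```python
# Minimal Python 3 sketch of the gender-based model above (editor's illustration).
# Assumes the 2012-era positional layout: train col 0 = Survived, col 3 = Sex; test col 2 = Sex.
import csv
import numpy as np

with open('train.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)                               # skip the header row
    data = np.array([row for row in reader])

survived = data[:, 0].astype(float)
is_female = data[:, 3] == 'female'

print('Proportion of women who survived:', survived[is_female].mean())
print('Proportion of men who survived:', survived[~is_female].mean())

with open('test.csv', newline='') as fin, \
     open('genderbasedmodelpy.csv', 'w', newline='') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    next(reader)                               # skip the test header
    for row in reader:
        row.insert(0, '1' if row[2] == 'female' else '0')   # predict survival for women only
        writer.writerow(row)
```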
418 changes: 418 additions & 0 deletions src/Titanic/myfirstforest.csv

Large diffs are not rendered by default.

157 changes: 157 additions & 0 deletions src/Titanic/myfirstforest.py
@@ -0,0 +1,157 @@
""" Writing my first randomforest code.
Author : AstroDave
Date : 23rd September, 2012
please see packages.python.org/milk/randomforests.html for more

"""

import numpy as np
import csv as csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor

csv_file_object = csv.reader(open('train.csv', 'rb')) #Load in the training csv file
header = csv_file_object.next()         #Skip the first line as it is a header
train_data = []                         #Create a variable called 'train_data'
for row in csv_file_object:             #Step through each row in the csv file
    train_data.append(row)              #adding each row to the data variable
train_data = np.array(train_data)       #Then convert from a list to an array

#I need to convert all strings to integer classifiers:
#Male = 1, female = 0:
train_data[train_data[0::,3]=='male',3] = 1
train_data[train_data[0::,3]=='female',3] = 0
#embark c=0, s=1, q=2
train_data[train_data[0::,10] =='C',10] = 0
train_data[train_data[0::,10] =='S',10] = 1
train_data[train_data[0::,10] =='Q',10] = 2

train_data = np.delete(train_data,[2,7,9],1) #remove the name data, cabin and ticket
#I need to do the same with the test data now so that the columns are in the
#same order as the training data

#I need to fill in the gaps of the data and make it complete.
#So where there is no price, I will assume the median price for that class
#Where there is no age I will give a predicted value for the age
age_predict = train_data[0::,3]
age_data = np.delete(train_data,[3],1)            #drop the age column along axis 1
#Keep only the rows where age is known for fitting
has_age = age_predict != ''
age_predict_paired = age_predict[has_age]
age_data_paired = age_data[has_age]


raw_input(".")
gb_age = GradientBoostingRegressor().fit(age_data_paired,age_predict_paired)
age_output = gb_age.predict(age_data)
j = 0
for rows in train_data:
if(row[3]==''):
row[3] = age_output[j]
j += 1

#All missing embarks just predict where they're coming from:
em_predict = train_data[0::,7]
em_data = np.delete(train_data,[7],1)             #drop the embarked column along axis 1
#Keep only the rows where embarked is known for fitting
has_em = em_predict != ''
em_predict_paired = em_predict[has_em]
em_data_paired = em_data[has_em]
gb_em = GradientBoostingRegressor().fit(em_data_paired,em_predict_paired)
em_output = gb_em.predict(em_data)
j = 0
for row in train_data:
    if(row[7]==''):
        row[7] = em_output[j]
    j += 1


test_file_object = csv.reader(open('test.csv', 'rb')) #Load in the test csv file
header = test_file_object.next()        #Skip the first line as it is a header
test_data = []                          #Create a variable called 'test_data'
for row in test_file_object:            #Step through each row in the csv file
    test_data.append(row)               #adding each row to the data variable
test_data = np.array(test_data)         #Then convert from a list to an array

#I need to convert all strings to integer classifiers:
#Male = 1, female = 0:
test_data[test_data[0::,2]=='male',2] = 1
test_data[test_data[0::,2]=='female',2] = 0
#embark c=0, s=1, q=2
test_data[test_data[0::,9] =='C',9] = 0 #Note this coding is not ideal: embarked is a category, so 2 is not "twice" 1
test_data[test_data[0::,9] =='S',9] = 1
test_data[test_data[0::,9] =='Q',9] = 2

test_data = np.delete(test_data,[1,6,8],1) #remove the name data, cabin and ticket

#Fill in the missing ages by predicting them from the other columns
age_predict = test_data[0::,2]
age_data = np.delete(test_data,[2],1)             #drop the age column along axis 1
#Keep only the rows where age is known for fitting
has_age = age_predict != ''
age_predict_paired = age_predict[has_age]
age_data_paired = age_data[has_age]
gb_age = GradientBoostingRegressor().fit(age_data_paired,age_predict_paired)
age_output = gb_age.predict(age_data)
j = 0
for row in test_data:
    if(row[2]==''):
        row[2] = age_output[j]
    j += 1

#All missing embarks: just predict where they're coming from
em_predict = test_data[0::,6]
em_data = np.delete(test_data,[6],1)              #drop the embarked column along axis 1
#Keep only the rows where embarked is known for fitting
has_em = em_predict != ''
em_predict_paired = em_predict[has_em]
em_data_paired = em_data[has_em]
gb_em = GradientBoostingRegressor().fit(em_data_paired,em_predict_paired)
em_output = gb_em.predict(em_data)
j = 0
for row in test_data:
    if(row[6]==''):
        row[6] = em_output[j]
    j += 1




#The data is now ready to go. So let's train, then test!
# Random FOREST
# print 'Training '
# forest = RandomForestClassifier(n_estimators=100)

# forest = forest.fit(train_data[0::,1::],\
# train_data[0::,0])

# print 'Predicting'
# output = forest.predict(test_data)

gb = GradientBoostingClassifier().fit(train_data[0::,1::],
                                      train_data[0::,0])

output = gb.predict(test_data)

open_file_object = csv.writer(open("myfirstgb.csv", "wb"))
test_file_object = csv.reader(open('test.csv', 'rb')) #Load in the csv file


test_file_object.next()
i = 0
for row in test_file_object:
    row.insert(0,output[i].astype(np.uint8))
    open_file_object.writerow(row)
    i += 1

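Editor's note: the script above leaves its random-forest path commented out and imputes missing values with gradient boosting. Below is a minimal sketch of that commented-out random-forest path with the simpler median/mode imputation the comments mention. It is not the submitted model: it assumes pandas is available, uses the current Kaggle column names (Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, PassengerId) rather than the 2012-era positional layout, and writes a two-column submission file instead of prepending the prediction to every test row.

```python
# Editor's sketch (not part of the PR): RandomForestClassifier with median/mode imputation.
# Column names and the output format are assumptions; adjust to the actual files in use.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

def prepare(df):
    out = df[features].copy()
    out['Sex'] = out['Sex'].map({'male': 1, 'female': 0})
    out['Embarked'] = out['Embarked'].map({'C': 0, 'S': 1, 'Q': 2})
    # Simple median/mode imputation in place of the gradient-boosted imputation above
    out['Age'] = out['Age'].fillna(out['Age'].median())
    out['Fare'] = out['Fare'].fillna(out['Fare'].median())
    out['Embarked'] = out['Embarked'].fillna(out['Embarked'].mode()[0])
    return out

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

forest = RandomForestClassifier(n_estimators=100)      # as in the commented-out block
forest = forest.fit(prepare(train), train['Survived'])
predictions = forest.predict(prepare(test))

# Hypothetical output name; the script above writes "myfirstgb.csv" in a row-per-passenger format.
pd.DataFrame({'PassengerId': test['PassengerId'],
              'Survived': predictions}).to_csv('myfirstforest.csv', index=False)
```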