Creation date: 2015-09-15
It's time to combine our machine learning knowledge with real data. There is no need to teach the computer logical functions or other well-defined things. But do you know whether you would have survived the Titanic disaster? Probably not, and you will never know for sure, but we can make a prediction using machine learning and a neural net.
Several months ago I found this challenge on kaggle. They provide some information about the passengers on the Titanic. Kaggle always has at least two datasets for supervised learning challenges: one is the training set and the other one is the test set. This time the challenge is to predict whether a given passenger survived the disaster or not.
First of all, machine learning is not about knowing, it's about predicting and seeing whether the predictions are good or more or less random.
Let's have a look at the datasets. Both the train.csv and the test.csv have the following columns:
'PassengerId', 'Pclass', 'Name', 'Sex', 'Age',
'SibSp','Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'
The columns are explained at the datasets page as well. The train dataset has an additional column Survived. Our task is to predict that value for the test dataset.
You've learned the most basic strategy to predict using a simple neural network without the weird hidden layers.
This time we have the data inside a csv file so we have to read the file first and save only the data we need to predict.
Therefore we need to import csv and numpy.
import csv
import numpy as np
# load in the csv file and read it as text
csv_file_object = csv.reader(open('train.csv', 'r'))
# we want to store the csv data inside a numpy array
data=np.array(next(csv_file_object))
data=np.expand_dims(data,axis=0)
The last line is used to make it possible to add more rows into the data array.
# save each row to our new data array
for row in csv_file_object:
    data = np.append(data,[row],axis=0)
header = data[0]
data = np.delete(data,0,0)
print("header: ",header)
hd = {}
i = 0
for h in header:
    hd[h] = i
    i += 1
The first part is used to save the rest of the csv data into the data numpy array. We don't want to have the header inside data, but we want to use it to access the columns later. So we delete the first row after saving it inside header and construct a hd dictionary, which looks like this afterwards:
{'Fare': 9, 'Name': 3, 'Embarked': 11, 'Age': 5, 'Parch': 7, 'Pclass': 2,
'Sex': 4, 'Survived': 1, 'SibSp': 6, 'PassengerId': 0, 'Ticket': 8, 'Cabin': 10}
Now we can get the column number when we know the column name using something like hd['Fare'], which will output 9 (the tenth column).
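The same lookup table can also be built in one step. This is only a minimal sketch, assuming the header row of train.csv in the order shown in the dictionary above; the names example_header and hd_example are made up for illustration.

```python
# Column names in the order they appear in train.csv
# (taken from the dictionary shown above).
example_header = ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
                  'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

# enumerate() yields (index, name) pairs, so the dict maps name -> column index.
hd_example = {name: index for index, name in enumerate(example_header)}

print(hd_example['Fare'])      # 9
print(hd_example['Survived'])  # 1
```

The result is the same mapping as the loop with the counter i, just written more compactly.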
We can easily output all entries in the survived column using:
print(data[::,hd['Survived']])
And see how many passengers survived (only inside the training set).
survived = data[::,hd['Survived']].astype(float)
print("{0} % survived.".format((np.sum(survived)/len(survived))*100))
We have to convert the Survived column into float to get the sum.
38.3838383838 % survived.
Now we can use the same functions as in my earlier blog entry.
def sigmoid(x):
    return 1.0/(1+np.exp(-x))
def predict(x,w):
    """
    Predicts the outcome for the input x and the weights w
    x_0 is 1 and w_0 is the bias
    """
    return sigmoid(np.dot(x.T,w.T))
def gradient(x,w,y):
    """
    get the gradient for the inputs x,weights w and the outputs y
    """
    # create an empty gradient array which has the same length as the array of weights
    grads = np.empty((len(w)))
    # compute the gradient for each weight
    for j in range(len(w)):
        grads[j] = np.mean(np.sum([x[i,j]*(predict(x[i],w)-y[i]) for i in range(len(x))]))
    return grads
def getWeights(w,a,grads):
    """
    get the new weights based on the old ones w, learning rate a and the gradients
    """
    return w-a*grads
def cost(real,pred):
    """
    Determine the cost of the prediction (pred)
    """
    return np.sqrt(1.0/2*np.mean(np.power(real-pred,2)))
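The gradient function above loops over every weight and every row. As a hedged sketch (not part of the original script), the same values can be computed with one matrix product. Note that np.mean applied to a single summed value returns that value, so the loop effectively computes a sum over all rows. The names gradient_loop and gradient_vectorized and the tiny dataset are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

def gradient_loop(x, w, y):
    # Same computation as gradient() above: for each weight j,
    # sum x[i, j] * (prediction_i - y_i) over all rows i.
    grads = np.empty(len(w))
    for j in range(len(w)):
        grads[j] = np.sum([x[i, j] * (sigmoid(np.dot(x[i], w)) - y[i])
                           for i in range(len(x))])
    return grads

def gradient_vectorized(x, w, y):
    # errors holds one entry per row; x.T.dot(errors) performs the
    # sum over rows for every weight at once.
    errors = sigmoid(x.dot(w)) - y
    return x.T.dot(errors)

# Tiny made-up example: 3 rows, threshold input plus two features.
x = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0])
w = np.array([0.1, -0.2, 0.3])

print(np.allclose(gradient_loop(x, w, y), gradient_vectorized(x, w, y)))  # True
```

On a dataset of this size the loop is fast enough, so the vectorized form is mainly a matter of readability.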
I've changed the function predict, which is a short form of the one before:
def predict(x,w):
    # return sigmoid(np.sum([w[p]*x[p] for p in range(len(x))]))
    return sigmoid(np.dot(x.T,w.T))
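To see why the short form is equivalent, here is a small sketch with made-up numbers. For one-dimensional numpy arrays .T returns the array unchanged, so np.dot(x.T,w.T) is simply the sum of the element-wise products.

```python
import numpy as np

# Made-up input (threshold input first) and weights.
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, -1.0, 0.25])

explicit = sum(w[p] * x[p] for p in range(len(x)))  # the commented-out version
vectorized = np.dot(x.T, w.T)                       # the short form

print(explicit)                          # 0.75
print(np.isclose(explicit, vectorized))  # True
```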
We only need an input representation for the neural net.
def getInputRepresentation(m,entry,test=False):
    if test:
        t = 1
    else:
        t = 0
    inp = np.zeros(m+1)
    inp[0] = 1
    inp[1] = entry[hd['Sex']-t] == "female"
    pClass = entry[hd['Pclass']-t].astype(float)
    inp[2] = pClass == 1
    inp[3] = pClass == 2
    inp[4] = pClass == 3
    # we can also add the age column and divide it into 5 groups
    if entry[hd['Age']-t] == '':
        inp[5:10] = [0.2,0.2,0.2,0.2,0.2]
    else:
        inp[5] = (entry[hd['Age']-t].astype(float) <= 19)
        inp[6] = (entry[hd['Age']-t].astype(float) > 19) & (entry[hd['Age']-t].astype(float) <= 25)
        inp[7] = (entry[hd['Age']-t].astype(float) > 25) & (entry[hd['Age']-t].astype(float) <= 32)
        inp[8] = (entry[hd['Age']-t].astype(float) > 32) & (entry[hd['Age']-t].astype(float) <= 41)
        inp[9] = (entry[hd['Age']-t].astype(float) > 41)
    inp[10] = (entry[hd['SibSp']-t].astype(float) <= 0)
    inp[11] = (entry[hd['SibSp']-t].astype(float) > 0) & (entry[hd['SibSp']-t].astype(float) <= 1)
    inp[12] = (entry[hd['SibSp']-t].astype(float) > 1)
    inp[13] = (entry[hd['Parch']-t].astype(float) <= 0)
    inp[14] = (entry[hd['Parch']-t].astype(float) > 0) & (entry[hd['Parch']-t].astype(float) <= 1)
    inp[15] = (entry[hd['Parch']-t].astype(float) > 1)
    if entry[hd['Fare']-t] == '':
        inp[16:21] = [0.2,0.2,0.2,0.2,0.2]
    else:
        inp[16] = (entry[hd['Fare']-t].astype(float) <= 8)
        inp[17] = (entry[hd['Fare']-t].astype(float) > 8) & (entry[hd['Fare']-t].astype(float) <= 11)
        inp[18] = (entry[hd['Fare']-t].astype(float) > 11) & (entry[hd['Fare']-t].astype(float) <= 22)
        inp[19] = (entry[hd['Fare']-t].astype(float) > 22) & (entry[hd['Fare']-t].astype(float) <= 40)
        inp[20] = (entry[hd['Fare']-t].astype(float) > 40)
    title = entry[hd['Name']-t].split(", ")[1].split(" ")[0]
    inp[21] = title == "Mr."
    inp[22] = title == "Mrs."
    inp[23] = title == "Miss."
    inp[24] = title == "Master."
    inp[25] = title == "Dr."
    inp[26] = title == "Sir."
    if (np.count_nonzero(inp[21:27]) == 0):
        inp[27] = 1
    return inp
Okay wait what?
We have the data and the Survived column, so just feed the neural net. Unfortunately it's not that easy, but it's not complicated either.
Our network only works with float inputs, so we are not able to feed in the Name column directly, although it might be interesting. And there is probably no linear dependence between the first, second and third class, so it might be better to have three features for the classes, where each represents a single class and is either 0 or 1.
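The idea of one 0/1 feature per class can be sketched on its own. The helper name one_hot_class is made up for illustration and is not part of the script.

```python
import numpy as np

def one_hot_class(pclass):
    """Return [is_first, is_second, is_third] as floats for a class of 1, 2 or 3."""
    return np.array([pclass == 1, pclass == 2, pclass == 3], dtype=float)

print(one_hot_class(1).tolist())  # [1.0, 0.0, 0.0]
print(one_hot_class(3).tolist())  # [0.0, 0.0, 1.0]
```

With a single numeric feature (1, 2 or 3) the network could only learn that survival rises or falls steadily with the class number. With three separate features each class gets its own weight.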
Let's have a look at the first part of the function.
def getInputRepresentation(m,entry,test=False):
    if test:
        t = 1
    else:
        t = 0
    inp = np.zeros(m+1)
    inp[0] = 1
We want to get an input representation of an entry (a row of the dataset), which will be a numpy array. The first lines are needed later on, because the test dataset doesn't have a Survived column, which is the second column inside the training dataset. So we have to subtract 1 from each column number when we use the function to get an input representation for the test dataset.
With the last two rows we initialize a numpy array filled with m+1 zeros, where m is the number of features we need. We add one because of the threshold input, which is always set to 1.
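Why subtracting 1 works can be checked with a small sketch. It assumes, as described above, that test.csv has the same columns as train.csv in the same order, only without Survived. The names train_header, test_header, hd_train, hd_test and used_columns are made up for illustration.

```python
train_header = ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
                'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
# Assumption: test.csv has the same columns without 'Survived'.
test_header = [name for name in train_header if name != 'Survived']

hd_train = {name: i for i, name in enumerate(train_header)}
hd_test = {name: i for i, name in enumerate(test_header)}

# Every column the input representation reads comes after 'Survived',
# so its index in the test file is exactly one smaller.
used_columns = ['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
for name in used_columns:
    assert hd_train[name] - 1 == hd_test[name]

print(hd_train['Sex'], hd_test['Sex'])  # 4 3
```

PassengerId is column 0 in both files, so the offset would be wrong for it, but the input representation never reads that column. An alternative design would be to build a separate dictionary from the header of test.csv and skip the offset entirely.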
In the next section we construct the first input representations for the sex and the passenger class.
inp[1] = entry[hd['Sex']-t] == "female"
pClass = entry[hd['Pclass']-t].astype(float)
inp[2] = pClass == 1
inp[3] = pClass == 2
inp[4] = pClass == 3
The first row can be read like: "If the value inside the Sex column is female, inp[1] will be 1, else (male) it will be 0".
We use the same procedure for the class but here we need to convert the entry to float.
The second section:
# we can also add the age column and divide it into 5 groups
if entry[hd['Age']-t] == '':
    inp[5:10] = [0.2,0.2,0.2,0.2,0.2]
else:
    inp[5] = (entry[hd['Age']-t].astype(float) <= 19)
    inp[6] = (entry[hd['Age']-t].astype(float) > 19) & (entry[hd['Age']-t].astype(float) <= 25)
    inp[7] = (entry[hd['Age']-t].astype(float) > 25) & (entry[hd['Age']-t].astype(float) <= 32)
    inp[8] = (entry[hd['Age']-t].astype(float) > 32) & (entry[hd['Age']-t].astype(float) <= 41)
    inp[9] = (entry[hd['Age']-t].astype(float) > 41)
Not every passenger has a defined age, so we are not able to set proper values for each. I've decided to divide the passengers into five age groups.
x <= 19
19 < x <= 25
...
So we don't use a separate group for each year of age, which wouldn't be effective, because the neural network can't learn much if nearly every input is different.
I divided the passengers in about equally sized groups of age using:
sorted_age = np.sort(data[data[0::,hd['Age']] != ""][0::,hd['Age']].astype(float))
print(sorted_age)
print(sorted_age[len(sorted_age)//5])
print(sorted_age[(2*len(sorted_age))//5])
print(sorted_age[(3*len(sorted_age))//5])
print(sorted_age[(4*len(sorted_age))//5])
Which outputs:
19.0
25.0
32.0
41.0
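The same index-based split can be wrapped in a small helper. This is only a sketch on made-up ages, chosen so that the cut points match the ones above. The function name quintile_cut_points is hypothetical.

```python
import numpy as np

def quintile_cut_points(values):
    """Return the four values that split the sorted data into five
    roughly equally sized groups (index-based, as in the snippet above)."""
    sorted_values = np.sort(np.asarray(values, dtype=float))
    n = len(sorted_values)
    return [float(sorted_values[(k * n) // 5]) for k in range(1, 5)]

# Made-up ages, not taken from train.csv.
ages = [2, 9, 18, 19, 22, 24, 25, 28, 30, 32, 35, 38, 41, 50, 63]
print(quintile_cut_points(ages))  # [19.0, 25.0, 32.0, 41.0]
```

The integer division // keeps the index an integer, which numpy requires for indexing.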
I've applied the same procedure to SibSp,Parch,Fare as you can see above. The last section of the function splits the Name column to extract the title, which has values like "Mr.","Dr." and might be interesting as well.
title = entry[hd['Name']-t].split(", ")[1].split(" ")[0]
inp[21] = title == "Mr."
inp[22] = title == "Mrs."
inp[23] = title == "Miss."
inp[24] = title == "Master."
inp[25] = title == "Dr."
inp[26] = title == "Sir."
if (np.count_nonzero(inp[21:27]) == 0):
    inp[27] = 1
For all passengers who have a different title we use the last two lines, which set an extra feature for those.
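The title extraction can be tried on its own. The names below are hypothetical strings in the "Surname, Title. First names" format the split expects, and the helper extract_title is made up for illustration.

```python
def extract_title(name):
    # Take the part after ", " and keep only its first word.
    return name.split(", ")[1].split(" ")[0]

known_titles = ["Mr.", "Mrs.", "Miss.", "Master.", "Dr.", "Sir."]

for name in ["Doe, Mr. John", "Doe, Mrs. Jane", "Smith, the Countess. of Somewhere"]:
    title = extract_title(name)
    # Titles outside the list end up in the extra "none of the above" feature.
    print(title, title in known_titles)
# Mr. True
# Mrs. True
# the False
```

A title made of several words is cut down to its first word, so such passengers land in the extra feature.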
Now we need the representation for each passenger.
inputs = []
outputs = []
m = 27 # number of features without threshold
for entry in data:
    inp = getInputRepresentation(m,entry)
    inputs.append(inp)
    outputs.append(entry[hd['Survived']])
inputs = np.array(inputs).astype(float)
outputs = np.array(outputs).astype(float)
weights = np.random.rand(m+1) # one for the threshold
First of all we define the two arrays inputs and outputs and the number of features m without counting the threshold. In the for loop we call the getInputRepresentation function for each passenger and save the return value inside inputs, whereas outputs always gets the value inside the Survived column for the current passenger.
The last line initializes the weights for the neural network with random values.
Let's train our network!
alpha = 0.001
epochs = 100
train_size = int((3*len(inputs))/4)
trainX = inputs[0:train_size]
trainY = outputs[0:train_size]
testX = inputs[train_size:]
testY = outputs[train_size:]
for t in range(epochs):
    weights = getWeights(weights,alpha,gradient(trainX,weights,trainY))
    sum_costs = 0
    for inp,outp in zip(testX,testY):
        prediction = predict(inp,weights)
        last_cost = cost(outp,0 if prediction < 0.5 else 1)
        sum_costs += last_cost
    print(weights)
    print(sum_costs/(len(inputs)-train_size))
We have to define the learning rate alpha and the number of epochs. We also define trainX,trainY, which are ~75% of the training data, and testX,testY to check how good our predictions are.
For each row inside our internal test data (testX,testY) we have the correct output outp and the predicted output prediction. Using our cost function we can sum up the costs and estimate the accuracy. We have to use
last_cost = cost(outp,0 if prediction < 0.5 else 1)
because kaggle only allows us to predict "the passenger xy survived" or "the passenger xy died" but not "We predicted with a probability of 65% that passenger xy survived".
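The thresholding step and its effect on the cost can be sketched with a few made-up probabilities. The helper to_label is hypothetical, and cost uses the same formula as above. For a 0/1 prediction the cost is 0 when it is right and sqrt(1/2) when it is wrong, so the mean cost is sqrt(1/2) times the share of wrong predictions.

```python
import numpy as np

def cost(real, pred):
    # Same formula as the cost function above.
    return np.sqrt(1.0 / 2 * np.mean(np.power(real - pred, 2)))

def to_label(probability, threshold=0.5):
    # Same rule as in the training loop: below 0.5 means "died".
    return 0 if probability < threshold else 1

# Made-up predicted probabilities and true outcomes.
probabilities = [0.9, 0.2, 0.65, 0.4]
labels = [1.0, 0.0, 0.0, 1.0]

predicted = [to_label(p) for p in probabilities]
accuracy = np.mean([p == l for p, l in zip(predicted, labels)])
mean_cost = np.mean([cost(l, p) for p, l in zip(predicted, labels)])

print(predicted)  # [1, 0, 1, 0]
print(accuracy)   # 0.5
print(np.isclose(mean_cost, np.sqrt(0.5) * (1 - accuracy)))  # True
```

Computing the accuracy directly, as in this sketch, avoids having to convert the mean cost back into an error rate.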
Let's have a look at the output of our script. The last section inside your terminal should look something like this:
[-0.95382855 0.94469917 0.89858109 0.43102736 -0.65090014 0.33426738
0.33741184 0.42538308 0.31467975 -0.12740568 0.1063886 0.20141303
-0.71368804 0.20324538 0.22165528 -0.13517055 0.13118224 -0.16832063
-0.11203696 0.01962982 0.35274372 -1.30658701 0.59768098 0.66461036
0.80450757 0.5089403 0.8805734 -0.01431949]
0.142689709208
It doesn't have to look the same on your machine, because we initialized the weights using a random function.
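Okay, what does the output mean? The last row in a section (here: 0.142689709208) is the average cost on our internal test data. Our cost function returns 0 for a correct 0/1 prediction and sqrt(1/2) ≈ 0.707 for a wrong one, so this value is about 0.707 times the failure rate of our neural net. The failure rate is therefore 0.142689709208/0.707 ≈ 0.202, which corresponds to an accuracy of roughly 79.8% on our internal test data. The kaggle leaderboard displays the accuracy.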
How to interpret the array?
If a number inside the array is positive, then the feature was more or less good for the passenger. The first real feature (ignoring the weight for the threshold) is our "female" feature, which has a weight of 0.94469917. That means it was much better to be a woman on the Titanic. The following three weights are 0.89858109 0.43102736 -0.65090014, where the first value stands for the weight of being a first class passenger and so on. You can see that it was good to be a first class passenger, but being a woman was even slightly better than that. Being a third class passenger had a negative impact on the probability of survival.
Here you can see the whole table
Weight | Interpretation |
---|---|
0.94469917 | Female |
0.89858109 | First Class |
0.43102736 | Second Class |
-0.65090014 | Third Class |
0.33426738 | \(Age \leq 19\) |
0.33741184 | \(19 < Age \leq 25\) |
0.42538308 | \(25 < Age \leq 32\) |
0.31467975 | \(32 < Age \leq 41\) |
-0.12740568 | \(Age > 41\) |
0.1063886 | #Siblings/Spouse = 0 |
0.20141303 | #Siblings/Spouse = 1 |
-0.71368804 | #Siblings/Spouse > 1 |
0.20324538 | #Parents/Children = 0 |
0.22165528 | #Parents/Children = 1 |
-0.13517055 | #Parents/Children > 1 |
0.13118224 | \(Fare \leq 8\) |
-0.16832063 | \(8 < Fare \leq 11 \) |
-0.11203696 | \(11 < Fare \leq 22 \) |
0.01962982 | \(22 < Fare \leq 40 \) |
0.35274372 | \(Fare > 40\) |
-1.30658701 | Mr. |
0.59768098 | Mrs. |
0.66461036 | Miss. |
0.80450757 | Master |
0.5089403 | Dr. |
0.8805734 | Sir. |
-0.01431949 | None of the above titles |
To upload our prediction to kaggle we need to calculate it using the test.csv and write the prediction into a csv file.
# First, read in test.csv
test_file = open('test.csv', 'r')
test_file_object = csv.reader(test_file)
header = next(test_file_object)
The prediction file must have two columns: PassengerId and Survived.
# Write out the PassengerId, and my prediction.
predictions_file = open("prediction.csv", "w")
predictions_file_object = csv.writer(predictions_file)
# write the column headers
predictions_file_object.writerow(["PassengerId", "Survived"])
Now we have to make a prediction for each row inside the test dataset. Here we pass test=True as the last parameter for getInputRepresentation.
for row in test_file_object: # For each row in test file,
    inp = getInputRepresentation(m,np.array(row),test=True)
    prediction = predict(inp,weights)
    predictions_file_object.writerow([row[0], "0" if prediction < 0.5 else "1"])
test_file.close() # Close out the files.
predictions_file.close()
Our most basic approach has an accuracy of ~77.5%, which isn't perfect but not bad for such simple code.
Some steps for improvement would be to fine-tune the learning rate alpha or to use different age groups, which might work better. You should definitely have a look at the train.csv file to see how many passengers don't have a defined age. It would be possible to predict the age first to get better results.
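Counting the missing ages takes only a few lines. This is a sketch on a made-up column. The helper count_missing is hypothetical and assumes missing values are stored as empty strings, as in the checks inside getInputRepresentation.

```python
import numpy as np

def count_missing(column):
    """Count the empty-string entries in a column of csv text values."""
    return int(np.sum(np.asarray(column) == ''))

# Made-up example column, not taken from train.csv.
ages = ['22', '', '38', '', '26']
print(count_missing(ages))  # 2
```

In the script above the call would presumably look like count_missing(data[::,hd['Age']]).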
Maybe there is no "good simple" neural net for that task. By "simple" I mean a neural net without hidden layers, which are more complex in the mathematical sense.
You can download the complete file at Github
I hope you learned the basics about machine learning and Python in this and the last blog entries. There will be some more blog entries about machine learning in the next months as well as some other completely different topics in computer science, because I am working on several interesting projects at the moment.
Stay tuned!
If you are interested in Deep Neural Nets you might want to check out this article: Multilayer Perceptron (http://deeplearning.net/tutorial/mlp.html)