Titanic Disaster

Creation date: 2015-09-15

Tags: kaggle, neural-nets, machine-learning

It's time to combine our machine learning knowledge with real data. This time we won't teach the computer logical functions or other well-defined things. Do you know whether you would have survived the Titanic disaster? Probably not, and you will never know for sure, but we can make a prediction using machine learning and a neural net.

Several months ago I found this challenge on Kaggle. It provides various pieces of information about the passengers on the Titanic. For supervised learning challenges Kaggle always provides at least two datasets: a training set and a test set. This time the challenge is to predict whether a given passenger survived the disaster or not.

How can we know?

First of all, machine learning is not about knowing, it's about predicting and then checking whether the predictions are any good or more or less random.

Let's have a look at the datasets. Both the train.csv and the test.csv have the following columns:

'PassengerId', 'Pclass', 'Name', 'Sex', 'Age',
'SibSp','Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'

The columns are explained on the dataset's page as well. The training dataset has an additional column, Survived. Our task is to predict that value for the test dataset.

How to predict?

You've already learned the most basic strategy: predicting with a simple neural network without the weird hidden layers.

This time the data is inside a CSV file, so we have to read the file first and keep only the data we need for the prediction.

Therefore we need to import csv and numpy.

import csv
import numpy as np

# load in the csv file and read it as text
csv_file_object = csv.reader(open('train.csv', newline=''))
# we want to store the csv data inside a numpy array
data=np.array(next(csv_file_object))
data=np.expand_dims(data,axis=0)

The last line makes data two-dimensional, so that we can append more rows to it.

# save each row to our new data array
for row in csv_file_object:
    data = np.append(data,[row],axis=0)

header = data[0]
data = np.delete(data,0,0)
print("header: ",header)
hd = {}
i = 0
for h in header:
    hd[h] = i
    i += 1

The loop saves the rest of the CSV data into the data NumPy array. We don't want the header inside data, but we want to use it to access the columns by name later. So we save the first row as header, delete it from data, and construct a dictionary hd, which looks like this afterwards:

{'Fare': 9, 'Name': 3, 'Embarked': 11, 'Age': 5, 'Parch': 7, 'Pclass': 2,
'Sex': 4, 'Survived': 1, 'SibSp': 6, 'PassengerId': 0, 'Ticket': 8, 'Cabin': 10}

Now we can look up the column index by column name: hd['Fare'] evaluates to 9 (the tenth column).
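The same lookup table can be built in one line with enumerate. This is only a small sketch: the CSV text below is made up and merely stands in for the first lines of train.csv.

```python
import csv
import io

# Two made-up lines standing in for the start of train.csv (hypothetical values).
sample = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Doe, Mr. John",male,30,0,0,X 1,7.5,,S\n'
)
reader = csv.reader(sample)
header = next(reader)

# Map each column name to its position inside a row.
hd = {name: i for i, name in enumerate(header)}

row = next(reader)
print(hd['Fare'], row[hd['Fare']])
```

The csv module takes care of the quoted Name field, which contains a comma itself.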

We can easily output all entries in the survived column using:

print(data[::,hd['Survived']])

And we can see how many passengers survived (within the training set only):

survived = data[::,hd['Survived']].astype(float)
print("{0} % survived.".format((np.sum(survived)/len(survived))*100))

We have to convert the Survived column to float to compute the sum.

38.3838383838 % survived.

Now we can use the same functions as in my earlier blog entry.

def sigmoid(x):
    return 1.0/(1+np.exp(-x))

"""
    Predicts the outcome for the input x and the weights w
    x_0 is 1 and w_0 is the bias
"""
def predict(x,w):
    return sigmoid(np.dot(x.T,w.T))

"""
    get the gradient for the inputs x,weights w and the outputs y
"""
def gradient(x,w,y):
    # create an empty gradient array which has the same length as the array of weights
    grads = np.empty((len(w)))
    # compute the gradient for each weight
    for j in range(len(w)):
        grads[j] = np.mean(np.sum([x[i,j]*(predict(x[i],w)-y[i]) for i in range(len(x))]))
    return grads

"""
    get the new weights based on the old ones w, learning rate a and the gradients
"""
def getWeights(w,a,grads):
    return w-a*grads

"""
    Determine the cost of the prediction (pred)
"""
def cost(real,pred):
    return np.sqrt(1.0/2*np.mean(np.power(real-pred,2)))
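The gradient above loops over every weight and every row. As a sketch, the same sums can be written as one matrix product. The names gradient_loop and gradient_vectorized and the random data are assumptions made for this comparison, not part of the original script.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

def gradient_loop(x, w, y):
    # same computation as the loop-based gradient in the text
    grads = np.empty(len(w))
    for j in range(len(w)):
        grads[j] = np.sum([x[i, j] * (sigmoid(np.dot(x[i], w)) - y[i])
                           for i in range(len(x))])
    return grads

def gradient_vectorized(x, w, y):
    # prediction error per row, then one weighted sum per feature column
    errors = sigmoid(x @ w) - y
    return x.T @ errors

# made-up data: 6 rows, 3 features plus the threshold input in column 0
rng = np.random.default_rng(0)
x = rng.random((6, 4))
x[:, 0] = 1
w = rng.random(4)
y = np.array([0, 1, 1, 0, 1, 0], dtype=float)

same = np.allclose(gradient_loop(x, w, y), gradient_vectorized(x, w, y))
print(same)
```

Both versions return the sum over all rows, so the learning rate alpha has to stay small when the training set is large.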

I've changed the predict function; it is now a shorter form of the earlier one:

def predict(x,w):
    # return sigmoid(np.sum([w[p]*x[p] for p in range(len(x))]))
    return sigmoid(np.dot(x.T,w.T))
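To check that the short form really computes the same value, here is a minimal comparison of both variants. The input and weights are made up, and the names predict_sum and predict_dot are only used for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

def predict_sum(x, w):
    # explicit weighted sum, as in the commented-out line
    return sigmoid(np.sum([w[p] * x[p] for p in range(len(x))]))

def predict_dot(x, w):
    # the dot product computes the same weighted sum
    return sigmoid(np.dot(x.T, w.T))

x = np.array([1.0, 0.0, 1.0])   # made-up input, x[0] is the threshold input
w = np.array([-0.5, 0.8, 0.3])  # made-up weights

print(predict_sum(x, w), predict_dot(x, w))
```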

The only thing still missing is an input representation for the neural net.

def getInputRepresentation(m,entry,test=False):
    if test:
        t = 1
    else:
        t = 0

    inp = np.zeros(m+1)
    inp[0] = 1
    inp[1] = entry[hd['Sex']-t] == "female"
    pClass = float(entry[hd['Pclass']-t])
    inp[2] = pClass == 1
    inp[3] = pClass == 2
    inp[4] = pClass == 3

    # we can also add the age column and divide it into 5 groups
    if entry[hd['Age']-t] == '':
        # unknown age: spread the input evenly over the five groups
        inp[5:10] = [0.2,0.2,0.2,0.2,0.2]
    else:
        age = float(entry[hd['Age']-t])
        inp[5] = age <= 19
        inp[6] = 19 < age <= 25
        inp[7] = 25 < age <= 32
        inp[8] = 32 < age <= 41
        inp[9] = age > 41

    sibSp = float(entry[hd['SibSp']-t])
    inp[10] = sibSp <= 0
    inp[11] = 0 < sibSp <= 1
    inp[12] = sibSp > 1
    parch = float(entry[hd['Parch']-t])
    inp[13] = parch <= 0
    inp[14] = 0 < parch <= 1
    inp[15] = parch > 1
    if entry[hd['Fare']-t] == '':
        # unknown fare: spread the input evenly over the five groups
        inp[16:21] = [0.2,0.2,0.2,0.2,0.2]
    else:
        fare = float(entry[hd['Fare']-t])
        inp[16] = fare <= 8
        inp[17] = 8 < fare <= 11
        inp[18] = 11 < fare <= 22
        inp[19] = 22 < fare <= 40
        inp[20] = fare > 40

    title = entry[hd['Name']-t].split(", ")[1].split(" ")[0]
    inp[21] = title == "Mr."
    inp[22] = title == "Mrs."
    inp[23] = title == "Miss."
    inp[24] = title == "Master."
    inp[25] = title == "Dr."
    inp[26] = title == "Sir."
    if (np.count_nonzero(inp[21:27]) == 0):
        inp[27] = 1
    return inp

Okay wait what?

We have the data and the Survived column, so we could just feed the neural net, right? Unfortunately it's not that easy, but it's not complicated either.

Our network only works with float inputs, so we can't feed it a text column like Name directly, although it might be interesting. There is also probably no linear dependence between first, second and third class, so it is better to have three features for the class, where each represents a single class and is either 0 or 1.

Let's have a look at the first part of the function.

def getInputRepresentation(m,entry,test=False):
    if test:
        t = 1
    else:
        t = 0

    inp = np.zeros(m+1)
    inp[0] = 1

We want to get an input representation of an entry (a row of the dataset), which will be a NumPy array. The first lines are needed later on, because the test dataset doesn't have a Survived column, which is the second column inside the training dataset. So we have to subtract 1 from each column index when we use the function to get an input representation for the test dataset.

With the last two lines we initialize a NumPy array filled with m+1 zeros, where m is the number of features we need. We add one because of the threshold input, which is always set to 1.

In the next section we construct the first input features, for the sex and the passenger class.

inp[1] = entry[hd['Sex']-t] == "female"
pClass = float(entry[hd['Pclass']-t])
inp[2] = pClass == 1
inp[3] = pClass == 2
inp[4] = pClass == 3

The first line can be read as: "If the value inside the Sex column is female, inp[1] will be 1, else (male) it will be 0".

We use the same procedure for the class, but here we first need to convert the entry to float.
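This kind of representation is usually called one-hot encoding. A minimal sketch for the passenger class, using a hypothetical helper one_hot_class that is not part of the original script:

```python
import numpy as np

def one_hot_class(pclass):
    """Return [is_first, is_second, is_third] as floats (hypothetical helper)."""
    pclass = float(pclass)
    return np.array([pclass == 1, pclass == 2, pclass == 3], dtype=float)

print(one_hot_class("1"))
print(one_hot_class("3"))
```

Exactly one of the three inputs is 1 for every passenger, so the network can learn a separate weight for each class instead of assuming that third class is "three times" first class.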

The second section:

# we can also add the age column and divide it into 5 groups
if entry[hd['Age']-t] == '':
    inp[5:10] = [0.2,0.2,0.2,0.2,0.2]
else:
    age = float(entry[hd['Age']-t])
    inp[5] = age <= 19
    inp[6] = 19 < age <= 25
    inp[7] = 25 < age <= 32
    inp[8] = 32 < age <= 41
    inp[9] = age > 41

Not every passenger has a defined age, so we can't set proper values for everyone; when the age is missing, each of the five group inputs is set to 0.2. I've decided to divide the passengers into five age groups.

x <= 19
19 < x <= 25
...

So we don't have a separate input for each year of age, which wouldn't be effective, because the neural network can't learn much if nearly every input is different.

I divided the passengers into age groups of roughly equal size using:

sorted_age = np.sort(data[data[0::,hd['Age']] != ""][0::,hd['Age']].astype(float))
print(sorted_age)
print(sorted_age[len(sorted_age)//5])
print(sorted_age[(2*len(sorted_age))//5])
print(sorted_age[(3*len(sorted_age))//5])
print(sorted_age[(4*len(sorted_age))//5])

The last four print calls output:

19.0
25.0
32.0
41.0
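With these four thresholds the group of an age can also be looked up with np.digitize instead of a chain of comparisons. This is only a sketch: age_features and AGE_EDGES are names I made up here, and the 0.2 fallback for a missing age follows the code above.

```python
import numpy as np

AGE_EDGES = [19, 25, 32, 41]  # upper bounds of the first four groups, from the text

def age_features(age):
    """Return the five age-group inputs for one raw Age entry (a string)."""
    if age == '':
        # unknown age: spread the input evenly over the five groups
        return np.full(5, 0.2)
    # right=True means edges[i-1] < age <= edges[i], like the comparisons above
    group = np.digitize(float(age), AGE_EDGES, right=True)
    features = np.zeros(5)
    features[group] = 1
    return features

print(age_features("19"))
print(age_features("30"))
print(age_features(""))
```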

I've used the same approach for SibSp, Parch and Fare, as you can see above. The last section of the function extracts the title from the Name column, with values like "Mr." or "Dr.", which might be interesting as well.

title = entry[hd['Name']-t].split(", ")[1].split(" ")[0]
inp[21] = title == "Mr."
inp[22] = title == "Mrs."
inp[23] = title == "Miss."
inp[24] = title == "Master."
inp[25] = title == "Dr."
inp[26] = title == "Sir."
if (np.count_nonzero(inp[21:27]) == 0):
    inp[27] = 1

For all passengers who have a different title, the last two lines set an extra feature.
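To see what the two split calls do, here is the title extraction on its own. The names are made up, and title_features is a hypothetical helper; it assumes, like the code above, that every name has the form "Surname, Title. First names".

```python
KNOWN_TITLES = ["Mr.", "Mrs.", "Miss.", "Master.", "Dr.", "Sir."]

def title_features(name):
    # "Doe, Mr. John" -> ["Doe", "Mr. John"] -> "Mr."
    title = name.split(", ")[1].split(" ")[0]
    features = [float(title == known) for known in KNOWN_TITLES]
    # extra feature for every title that is not in the list
    features.append(float(sum(features) == 0))
    return title, features

print(title_features("Doe, Mr. John"))
print(title_features("Doe, Rev. John"))
```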

Now we need the representation for each passenger.

inputs = []
outputs = []
m = 27 # number of features without threshold

for entry in data:
    inp = getInputRepresentation(m,entry)

    inputs.append(inp)
    outputs.append(entry[hd['Survived']])

inputs = np.array(inputs).astype(float)
outputs = np.array(outputs).astype(float)

weights = np.random.rand(m+1) # one for the threshold

First of all we define the two lists inputs and outputs and the number of features m, not counting the threshold. In the for loop we call the getInputRepresentation function for each passenger and save the return value inside inputs. The matching entry in outputs is the value of the Survived column for the current passenger.

The last line initializes the weights for the neural network with random values.

Let's train our network!

alpha = 0.001
epochs = 100
train_size = int((3*len(inputs))/4)
trainX = inputs[0:train_size]
trainY = outputs[0:train_size]
testX = inputs[train_size:]
testY = outputs[train_size:]

for t in range(epochs):
    weights = getWeights(weights,alpha,gradient(trainX,weights,trainY))
    sum_costs = 0
    for inp,outp in zip(testX,testY):
        prediction = predict(inp,weights)
        last_cost = cost(outp,0 if prediction < 0.5 else 1)
        sum_costs += last_cost

    print(weights)
    print(sum_costs/(len(inputs)-train_size))

We have to define the learning rate alpha and the number of epochs. We also define trainX and trainY, which are ~75% of the training data, and testX and testY, the remaining ~25%, which we use to check how good our prediction is.

For each row of our internal test data (testX, testY) we have the correct output outp and the predicted output prediction. Using our cost function we can sum up the costs and measure how well the prediction works. We have to use

last_cost = cost(outp,0 if prediction < 0.5 else 1)

because Kaggle only allows us to predict "passenger xy survived" or "passenger xy died", but not "we predict with a probability of 65% that passenger xy survived".
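If you prefer to see the accuracy directly, you can threshold the predictions and count the matches. A minimal sketch with made-up values; the accuracy helper is my own addition and not part of the original script.

```python
import numpy as np

def accuracy(real, probabilities):
    # same rule as above: below 0.5 means "died" (0), otherwise "survived" (1)
    labels = (np.asarray(probabilities) >= 0.5).astype(float)
    return np.mean(labels == np.asarray(real))

real = np.array([1.0, 0.0, 1.0, 0.0])   # made-up correct outputs
probs = np.array([0.9, 0.2, 0.4, 0.6])  # made-up predictions
print(accuracy(real, probs))
```

Two of the four made-up predictions end up on the correct side of 0.5, so this prints 0.5.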

Let's have a look at the output of our script. The last section inside your terminal should look something like this:

[-0.95382855  0.94469917  0.89858109  0.43102736 -0.65090014  0.33426738
  0.33741184  0.42538308  0.31467975 -0.12740568  0.1063886   0.20141303
 -0.71368804  0.20324538  0.22165528 -0.13517055  0.13118224 -0.16832063
 -0.11203696  0.01962982  0.35274372 -1.30658701  0.59768098  0.66461036
  0.80450757  0.5089403   0.8805734  -0.01431949]
0.142689709208

It doesn't have to look the same on your machine, because we initialized the weights using a random function.
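If you want the same start weights on every run, you can seed the random number generator. This is a sketch using NumPy's Generator API; the seed 42 is an arbitrary choice, and the original script uses np.random.rand without a seed.

```python
import numpy as np

m = 27  # number of features without threshold, as above

rng = np.random.default_rng(42)  # fixed seed, arbitrary value
weights = rng.random(m + 1)      # uniform values in [0, 1), like np.random.rand

# a second generator with the same seed produces the same weights
weights_again = np.random.default_rng(42).random(m + 1)
print(np.array_equal(weights, weights_again))
```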

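Okay, what does the output mean? The last line of each block (here: 0.142689709208) is the average cost on our internal test data. Because our cost function returns sqrt(1/2) ≈ 0.707 for a wrong prediction and 0 for a correct one, this value is about 0.707 times the failure rate. So the failure rate here is roughly 0.1427/0.707 ≈ 0.202 and the accuracy roughly 79.8%. The Kaggle leaderboard displays the accuracy.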

How to interpret the array?

If a number inside the array is positive, then the feature is more or less good for the passenger. The first real feature (ignoring the weight for the threshold) is our "female" feature, which has a weight of 0.94469917. This means that it was much better to be a woman on the Titanic. The following three weights are 0.89858109, 0.43102736 and -0.65090014, where the first value is the weight for being a first class passenger and so on. You can see that it was good to be a first class passenger, but being a woman was even slightly better than that. Being a third class passenger had a negative impact on the probability of survival.

Here is the whole table:

Weight	Interpretation
0.94469917	Female
0.89858109	First Class
0.43102736	Second Class
-0.65090014	Third Class
0.33426738	\(Age \leq 19\)
0.33741184	\(19 < Age \leq 25\)
0.42538308	\(25 < Age \leq 32\)
0.31467975	\(32 < Age \leq 41\)
-0.12740568	\(Age > 41\)
0.1063886	#Siblings/Spouse = 0
0.20141303	#Siblings/Spouse = 1
-0.71368804	#Siblings/Spouse > 1
0.20324538	#Parents/Children = 0
0.22165528	#Parents/Children = 1
-0.13517055	#Parents/Children > 1
0.13118224	\(Fare \leq 8\)
-0.16832063	\(8 < Fare \leq 11 \)
-0.11203696	\(11 < Fare \leq 22 \)
0.01962982	\(22 < Fare \leq 40 \)
0.35274372	\(Fare > 40\)
-1.30658701	Mr.
0.59768098	Mrs.
0.66461036	Miss.
0.80450757	Master
0.5089403	Dr.
0.8805734	Sir.
-0.01431949	None of the above titles
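One way to read such a weight: with a sigmoid output p, the odds p/(1-p) are exp of the weighted sum, so switching a 0/1 feature on multiplies the odds by exp(weight). The sketch below applies this to three weights from the table. Keep in mind that the features inside one group exclude each other, so treat the factors as a rough comparison rather than exact effects.

```python
import numpy as np

# three weights copied from the table above
weights = {"Female": 0.94469917, "Third Class": -0.65090014, "Mr.": -1.30658701}

# factor by which the survival odds change when the feature is 1
odds_factor = {name: float(np.exp(w)) for name, w in weights.items()}

for name, factor in odds_factor.items():
    print("{0}: odds x {1:.2f}".format(name, factor))
```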

Save & Upload our Prediction

To upload our prediction to Kaggle we need to compute it for test.csv and write it into a CSV file.

# First, read in test.csv
test_file = open('test.csv', newline='')
test_file_object = csv.reader(test_file)
header = next(test_file_object)

The prediction file must have two columns: PassengerId and Survived.

# Write out the PassengerId, and my prediction.
predictions_file = open("prediction.csv", "w", newline='')
predictions_file_object = csv.writer(predictions_file)
# write the column headers
predictions_file_object.writerow(["PassengerId", "Survived"])

Now we have to make a prediction for each row inside the test dataset. Here we pass test=True as the last parameter to getInputRepresentation.

# for each row in the test file
for row in test_file_object:
    inp = getInputRepresentation(m,np.array(row),test=True)
    prediction = predict(inp,weights)
    predictions_file_object.writerow([row[0], "0" if prediction < 0.5 else "1"])

# close the files
test_file.close()
predictions_file.close()
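The writing part can also be wrapped in a small function, which makes it easy to try out without touching a real file. This is a sketch: write_predictions is a hypothetical helper, and the ids and probabilities are made up.

```python
import csv
import io

def write_predictions(out_file, passenger_ids, probabilities):
    writer = csv.writer(out_file)
    # write the column headers
    writer.writerow(["PassengerId", "Survived"])
    for pid, p in zip(passenger_ids, probabilities):
        writer.writerow([pid, "0" if p < 0.5 else "1"])

# an in-memory buffer instead of prediction.csv
buffer = io.StringIO()
write_predictions(buffer, ["892", "893"], [0.2, 0.7])
print(buffer.getvalue())
```

For the real upload you would pass a file opened with open("prediction.csv", "w", newline="") inside a with statement, so the file is closed automatically.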

Our most basic approach has an accuracy of ~77.5%, which isn't perfect but not bad for such simple code.

Some steps for improving it would be to fine-tune the learning rate alpha or to try different age groups, which might work better. You should definitely have a look at the train.csv file to see how many passengers don't have a defined age. It would be possible to predict the age first to get better results.
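As a starting point, here is a sketch that counts the missing ages and fills them in. Note that it uses the median of the known ages, which is a much simpler stand-in for actually predicting the age, and that the Age values below are made up.

```python
import numpy as np

ages = np.array(["22", "", "38", "", "26", "35"])  # made-up Age column

missing = ages == ""
print("{0} of {1} ages are missing".format(np.count_nonzero(missing), len(ages)))

# median of the known ages, used as a replacement for the missing ones
known = ages[~missing].astype(float)
median_age = np.median(known)

filled = np.full(len(ages), median_age)
filled[~missing] = known
print(filled)
```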

Maybe there is no "good simple" neural net for this task. By "simple" I mean a neural net without hidden layers, which are more complex in the mathematical sense.

You can download the complete file at GitHub.

I hope you learned the basics about machine learning and Python in this and the last blog entries. There will be some more blog entries about machine learning in the next months as well as some other completely different topics in computer science, because I am working on several interesting projects at the moment.

Stay tuned!

If you are interested in deep neural nets you might want to check out this article: Multilayer Perceptron (http://deeplearning.net/tutorial/mlp.html)
