TensorFlow/TFLearn Tutorial: Titanic

banq 16-07-27
    

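In this tutorial, you will learn to use TFLearn and TensorFlow to estimate the survival chances of Titanic passengers from their personal information (such as sex and age). To tackle this classic machine learning task, we will build a deep neural network classifier.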

Prerequisites: first install TensorFlow and TFLearn by following their installation guides.

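On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew on board. Although there was some element of luck in surviving the sinking, some groups, such as women, children, and upper-class passengers, were more likely to survive than others. In this tutorial, we carry out an analysis to find out who these people are.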

Dataset
TFLearn automatically downloads the Titanic dataset, which contains the following fields:

VARIABLE DESCRIPTIONS:
survived        Survived (0 = No; 1 = Yes)
pclass          Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare


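Building the Classifier
The dataset is stored in a CSV file, so it can be loaded with TFLearn's load_csv() function. The target_column argument identifies the label column, here the first column of the dataset, survived, which indicates whether the passenger survived. The function returns a tuple (data, labels).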

import numpy as np
import tflearn

# Download the Titanic dataset
from tflearn.datasets import titanic
titanic.download_dataset('titanic_dataset.csv')

# Load CSV file, indicate that the first column represents labels
from tflearn.data_utils import load_csv
data, labels = load_csv('titanic_dataset.csv', target_column=0,
                        categorical_labels=True, n_classes=2)
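
To make the categorical_labels=True, n_classes=2 arguments more concrete: they ask for each 0/1 survival label to be returned as a two-element one-hot vector, which lines up with the two-unit softmax output used later. The following is a minimal NumPy sketch of that encoding. The to_one_hot helper and the sample labels are hypothetical, written only for illustration, and are not TFLearn's own code.

```python
import numpy as np

def to_one_hot(labels, n_classes):
    """Hypothetical helper: turn integer class ids into one-hot rows."""
    labels = np.asarray(labels, dtype=np.int64)
    one_hot = np.zeros((labels.shape[0], n_classes), dtype=np.float32)
    # Set a 1 in the column that corresponds to each sample's class id
    one_hot[np.arange(labels.shape[0]), labels] = 1.0
    return one_hot

# 'survived' values for three made-up passengers: died, survived, died
encoded = to_one_hot([0, 1, 0], n_classes=2)
print(encoded)
```

Column 0 then stands for "did not survive" and column 1 for "survived", which is why the prediction code later reads index 1 of each output row.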


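Preprocessing
The data needs some preprocessing before it can be used. First, the name field is of little use for prediction, so we drop both the name and ticket fields. Second, a neural network can only work with numbers, so we convert the sex field from 'male'/'female' to 0 or 1.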


# Preprocessing function
def preprocess(data, columns_to_ignore):
    # Delete columns from the highest index down so that
    # earlier deletions do not shift the remaining indices
    for col_id in sorted(columns_to_ignore, reverse=True):
        for row in data:
            row.pop(col_id)
    for row in data:
        # Convert 'sex' field to float (index 1 once 'name' is removed)
        row[1] = 1. if row[1] == 'female' else 0.
    return np.array(data, dtype=np.float32)

# Ignore 'name' and 'ticket' columns (id 1 & 6 of data array)
to_ignore=[1, 6]

# Preprocess data
data = preprocess(data, to_ignore)
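
As a quick check of what the preprocessing step produces, the sketch below runs the same logic on two made-up rows laid out in the dataset's column order (pclass, name, sex, age, sibsp, parch, ticket, fare). The rows are invented for illustration and are not taken from the real CSV file.

```python
import numpy as np

def preprocess(data, columns_to_ignore):
    # Delete ignored columns from the highest index down so that
    # earlier deletions do not shift the remaining indices
    for col_id in sorted(columns_to_ignore, reverse=True):
        for row in data:
            row.pop(col_id)
    for row in data:
        # 'sex' sits at index 1 once 'name' has been removed
        row[1] = 1. if row[1] == 'female' else 0.
    return np.array(data, dtype=np.float32)

# Two made-up rows: pclass, name, sex, age, sibsp, parch, ticket, fare
rows = [
    ['1', 'Example, Miss. A', 'female', '29', '0', '0', 'T-001', '211.5'],
    ['3', 'Example, Mr. B', 'male', '40', '1', '2', 'T-002', '8.25'],
]
features = preprocess(rows, [1, 6])
print(features.shape)
print(features)
```

Each row ends up with six numeric features (pclass, sex, age, sibsp, parch, fare). Note that the function modifies the input lists in place, so the original rows lose their name and ticket entries.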


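Building a Deep Neural Network
We use TFLearn to build a 3-layer neural network. We need to specify the shape of the input data: each sample has 6 features, and because we process samples in batches to save memory, the input shape is [None, 6]. The None stands for an unknown dimension, so the number of samples processed in a batch can be changed.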


# Build neural network
net = tflearn.input_data(shape=[None, 6])
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net)
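
To illustrate what the [None, 6] shape means in practice, here is a rough NumPy sketch of a forward pass through a 6 -> 32 -> 32 -> 2 network with random, untrained weights. It is not TFLearn's implementation. The hidden layers are written without a nonlinearity, mirroring the snippet above where no activation argument is passed; whether that matches the library's default is an assumption worth checking against the TFLearn documentation. All names (W1, forward, and so on) are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialised weights for a 6 -> 32 -> 32 -> 2 network (untrained)
W1, b1 = rng.normal(scale=0.1, size=(6, 32)), np.zeros(32)
W2, b2 = rng.normal(scale=0.1, size=(32, 32)), np.zeros(32)
W3, b3 = rng.normal(scale=0.1, size=(32, 2)), np.zeros(2)

def softmax(z):
    # Subtract the row maximum for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(x):
    # No nonlinearity on the hidden layers (an assumption of this sketch)
    h1 = x @ W1 + b1
    h2 = h1 @ W2 + b2
    return softmax(h2 @ W3 + b3)

# The leading dimension is free: 1 sample or 16 samples both work
for batch_size in (1, 16):
    probs = forward(rng.normal(size=(batch_size, 6)))
    print(batch_size, probs.shape)
```

Whatever the batch size, the output has one row per sample and two columns whose values sum to 1, one probability per class.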


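Training
TFLearn provides a DNN model wrapper that automates neural network classifier tasks such as training, prediction, and saving/restoring. We train for 10 epochs, meaning the network sees the entire dataset 10 times, with a batch size of 16: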


# Define model
model = tflearn.DNN(net)
# Start training (apply gradient descent algorithm)
model.fit(data, labels, n_epoch=10, batch_size=16, show_metric=True)


Output:

---------------------------------
Run id: MG9PV8
Log directory: /tmp/tflearn_logs/
---------------------------------
Training samples: 1309
Validation samples: 0
--
Training Step: 82 | total loss: 0.64003
| Adam | epoch: 001 | loss: 0.64003 - acc: 0.6620 -- iter: 1309/1309
--
Training Step: 164 | total loss: 0.61915
| Adam | epoch: 002 | loss: 0.61915 - acc: 0.6614 -- iter: 1309/1309
--
Training Step: 246 | total loss: 0.56067
| Adam | epoch: 003 | loss: 0.56067 - acc: 0.7171 -- iter: 1309/1309
--
Training Step: 328 | total loss: 0.51807
| Adam | epoch: 004 | loss: 0.51807 - acc: 0.7799 -- iter: 1309/1309
--
Training Step: 410 | total loss: 0.47475
| Adam | epoch: 005 | loss: 0.47475 - acc: 0.7962 -- iter: 1309/1309
--
Training Step: 492 | total loss: 0.51677
| Adam | epoch: 006 | loss: 0.51677 - acc: 0.7701 -- iter: 1309/1309
--
Training Step: 574 | total loss: 0.48988
| Adam | epoch: 007 | loss: 0.48988 - acc: 0.7891 -- iter: 1309/1309
--
Training Step: 656 | total loss: 0.55073
| Adam | epoch: 008 | loss: 0.55073 - acc: 0.7427 -- iter: 1309/1309
--
Training Step: 738 | total loss: 0.50242
| Adam | epoch: 009 | loss: 0.50242 - acc: 0.7854 -- iter: 1309/1309
--
Training Step: 820 | total loss: 0.41557
| Adam | epoch: 010 | loss: 0.41557 - acc: 0.8110 -- iter: 1309/1309
--
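
The step counter in this log can be checked by hand: with 1309 training samples and a batch size of 16, each epoch needs 82 steps (the last batch being smaller), which gives 820 steps after 10 epochs. A small sketch of that arithmetic, using only the numbers shown in the log and the fit() call:

```python
import math

n_samples = 1309   # "Training samples" in the log
batch_size = 16
n_epoch = 10

# The last batch of each epoch is smaller, hence the ceiling
steps_per_epoch = math.ceil(n_samples / batch_size)
total_steps = steps_per_epoch * n_epoch
last_batch = n_samples - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch, total_steps, last_batch)  # 82 820 13
```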


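After training, the model reaches an accuracy of about 81%, meaning it predicts the correct outcome (survived or not) for roughly 81% of all passengers. Since the log shows zero validation samples, this figure is measured on the training data itself.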
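
For reference, a classification accuracy of this kind is the fraction of samples whose most probable predicted class matches the labelled class. The sketch below computes it from softmax outputs and one-hot labels. The arrays are made up, the accuracy helper is hypothetical, and this is not necessarily how TFLearn computes the running acc value it prints.

```python
import numpy as np

def accuracy(probs, one_hot_labels):
    """Fraction of rows whose most probable class equals the labelled class."""
    predicted = np.argmax(probs, axis=1)
    actual = np.argmax(one_hot_labels, axis=1)
    return float(np.mean(predicted == actual))

# Made-up softmax outputs and one-hot labels for four passengers
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
labels = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])
print(accuracy(probs, labels))  # 0.75
```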

Now let's try out the model by feeding it the details of Jack and Rose, the male and female leads of the Titanic film:


# Let's create some data for DiCaprio and Winslet
dicaprio = [3, 'Jack Dawson', 'male', 19, 0, 0, 'N/A', 5.0000]
winslet = [1, 'Rose DeWitt Bukater', 'female', 17, 1, 2, 'N/A', 100.0000]
# Preprocess data
dicaprio, winslet = preprocess([dicaprio, winslet], to_ignore)
# Predict surviving chances (class 1 results)
pred = model.predict([dicaprio, winslet])
print("DiCaprio Surviving Rate:", pred[0][1])
print("Winslet Surviving Rate:", pred[1][1])


Output:
DiCaprio Surviving Rate: 0.13849584758281708
Winslet Surviving Rate: 0.92201167345047

The model predicts that Rose has a high chance of surviving (about 92%), while Jack's chance is low (about 14%).
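
If a yes/no answer is wanted rather than a probability, one simple option is to apply a cut-off to the class-1 output. The sketch below uses the two probabilities printed above; the 0.5 threshold is an assumption of this example (with two softmax outputs it amounts to picking the larger one).

```python
# Class-1 (survived) probabilities copied from the output above
survival_prob = {
    "DiCaprio": 0.13849584758281708,
    "Winslet": 0.92201167345047,
}

THRESHOLD = 0.5  # assumed cut-off for this sketch

# True means the model's answer is "survives"
verdicts = {name: p >= THRESHOLD for name, p in survival_prob.items()}

for name, survives in verdicts.items():
    outcome = "survives" if survives else "does not survive"
    print(f"{name}: {survival_prob[name]:.1%} -> {outcome}")
```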

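More generally, this study shows that women and children in first class had the highest chance of surviving, while men in third class had the lowest.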

tflearn/quickstart.md at master · tflearn/tflearn

    
