Deep Learning – Highway Network in Keras and Lasagne – Significant Performance Difference

I implemented a highway network in both Keras and Lasagne, and the Keras version consistently scores worse than the Lasagne version, even though I use the same dataset and hyperparameters for both. Here is the Keras version:

import hacking_script
from keras.models import Sequential
from keras.layers import Dense, Dropout, Highway

X_train, y_train, X_test, y_test, X_all = hacking_script.load_all_data()
data_dim = 144
layer_count = 32
dropout = 0.04
hidden_units = 32
nb_epoch = 10

model = Sequential()
model.add(Dense(hidden_units, input_dim=data_dim))
model.add(Dropout(dropout))
# stack of highway layers, each followed by dropout
for index in range(layer_count):
    model.add(Highway(activation='relu'))
    model.add(Dropout(dropout))
model.add(Dropout(dropout))
model.add(Dense(2, activation='softmax'))

print('compiling...')
model.compile(loss='binary_crossentropy', optimizer='adagrad')
model.fit(X_train, y_train, batch_size=100, nb_epoch=nb_epoch,
          show_accuracy=True, validation_data=(X_test, y_test), shuffle=True, verbose=0)

predictions = model.predict_proba(X_test)

Here is the Lasagne version:

import numpy as np
from lasagne.layers import MergeLayer, DenseLayer, InputLayer, DropoutLayer
from lasagne.init import Orthogonal, Constant
from lasagne.nonlinearities import rectify, sigmoid, softmax
from lasagne.updates import adadelta
from lasagne.objectives import categorical_crossentropy
from nolearn.lasagne import NeuralNet, TrainSplit


class MultiplicativeGatingLayer(MergeLayer):
    def __init__(self, gate, input1, input2, **kwargs):
        incomings = [gate, input1, input2]
        super(MultiplicativeGatingLayer, self).__init__(incomings, **kwargs)
        assert gate.output_shape == input1.output_shape == input2.output_shape

    def get_output_shape_for(self, input_shapes):
        return input_shapes[0]

    def get_output_for(self, inputs, **kwargs):
        # highway combination: gate * transformed + (1 - gate) * carried input
        return inputs[0] * inputs[1] + (1 - inputs[0]) * inputs[2]


def highway_dense(incoming, Wh=Orthogonal(), bh=Constant(0.0),
                  Wt=Orthogonal(), bt=Constant(-4.0),
                  nonlinearity=rectify, **kwargs):
    num_inputs = int(np.prod(incoming.output_shape[1:]))

    l_h = DenseLayer(incoming, num_units=num_inputs, W=Wh, b=bh, nonlinearity=nonlinearity)
    l_t = DenseLayer(incoming, num_units=num_inputs, W=Wt, b=bt, nonlinearity=sigmoid)

    return MultiplicativeGatingLayer(gate=l_t, input1=l_h, input2=incoming)

# ==== Parameters ====

num_features = X_train.shape[1]
epochs = 10

hidden_layers = 32
hidden_units = 32
dropout_p = 0.04

# ==== Defining the neural network shape ====

l_in = InputLayer(shape=(None, num_features))
l_hidden1 = DenseLayer(l_in, num_units=hidden_units)
l_hidden2 = DropoutLayer(l_hidden1, p=dropout_p)
l_current = l_hidden2
for k in range(hidden_layers - 1):
    l_current = highway_dense(l_current)
    l_current = DropoutLayer(l_current, p=dropout_p)
l_dropout = DropoutLayer(l_current, p=dropout_p)
l_out = DenseLayer(l_dropout, num_units=2, nonlinearity=softmax)

# ==== Neural network definition ====

net1 = NeuralNet(layers=l_out,
                 update=adadelta, update_rho=0.95, update_learning_rate=1.0,
                 objective_loss_function=categorical_crossentropy,
                 train_split=TrainSplit(eval_size=0), verbose=0, max_epochs=1)

net1.fit(X_train, y_train)
predictions = net1.predict_proba(X_test)[:, 1]

Right now the Keras version is barely better than logistic regression, while the Lasagne version is by far my best-scoring model. Any ideas why?

Here are some suggestions (I'm not sure whether they will actually close the performance gap you are observing):

According to the Keras documentation, the Highway layer is initialized with Glorot uniform weights by default, while in your Lasagne code you use orthogonal weight initialization. Unless some other part of your code sets the Keras Highway layer's weight initialization to orthogonal, this may be the source of the performance gap. One way to check is sketched below.
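For example, assuming a Keras 1.x API in which Highway accepts an init argument naming one of the built-in initializations (verify this against your installed version), the highway stack could be built with orthogonal initialization to mirror the Orthogonal() defaults in the Lasagne code:

# Hypothetical tweak: orthogonal init for the highway layers,
# mirroring Wh=Orthogonal(), Wt=Orthogonal() in the Lasagne version.
for index in range(layer_count):
    model.add(Highway(activation='relu', init='orthogonal'))
    model.add(Dropout(dropout))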

It also looks like you are using Adagrad as the optimizer for your Keras model, but Adadelta for your Lasagne model. A sketch of how to align them follows.
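To rule this out, you could compile the Keras model with Adadelta using the same hyperparameters that the Lasagne version passes to its update rule (rho=0.95, learning rate 1.0). A minimal sketch, assuming the Keras 1.x optimizer API:

from keras.optimizers import Adadelta

# Match the Lasagne update: adadelta with rho=0.95 and learning rate 1.0
model.compile(loss='binary_crossentropy', optimizer=Adadelta(lr=1.0, rho=0.95))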

Also, and I am not 100% sure about this, you may want to verify that the transform gate's bias is initialized the same way in both models; your Lasagne code uses bt=Constant(-4.0). A possible adjustment is sketched below.
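If your Keras version's Highway layer exposes a transform_bias argument (some 1.x releases default it to -2), setting it to -4 would mirror bt=Constant(-4.0) in the Lasagne code. This is an assumption about the installed API, so check it against your version:

# Assumed signature: Highway(..., transform_bias=...); -4 mirrors the Lasagne bt init
model.add(Highway(activation='relu', init='orthogonal', transform_bias=-4))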
