How to potty train a Siamese Network

Time for an update on my One-Shot learning approach using a Siamese LSTM-based Deep Neural Network that we developed for telecommunication network fault identification through traffic analysis. A lot of small details had to change as we upgraded our machine to the latest TensorFlow and Keras. That alone introduced a few new behaviors… We also obtained new data for new examples and found some problems with our model. I don’t intend to go through all the changes, only some of the main ones, along with some interesting findings. It feels a lot like potty training a cat… If you are new to this series, you can refer to my previous posts: “Do Telecom Networks Dreams of Siamese Memories?” and “What Siamese Dreams are made of…”

First, Batch Normalization in Keras is now on my black magic list 😊. I’ll have to dig more into how it is implemented, especially the differences between training time and prediction time. For a long time, I wondered why I was getting an extremely good training loss and a poor validation loss, until I removed the Batch Normalization I had on the input layer. So, something to investigate there.
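To make the train-time versus prediction-time difference concrete, here is a minimal NumPy sketch of the general batch normalization idea. It is an illustration only, not Keras' actual implementation and not a diagnosis of the issue above. The function name `batch_norm` and the "stale" moving statistics are assumptions chosen for the example: at training time the layer normalizes with the statistics of the current mini-batch, while at prediction time it relies on stored moving averages, which may not match the data.

```python
import numpy as np

def batch_norm(x, moving_mean, moving_var, training, eps=1e-3):
    """Normalize x with batch statistics (training) or stored statistics (prediction)."""
    if training:
        # Training: statistics come from the current mini-batch
        mean, var = x.mean(axis=0), x.var(axis=0)
    else:
        # Prediction: statistics come from the stored moving averages
        mean, var = moving_mean, moving_var
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=2.0, size=(64, 4))  # hypothetical input features

# Hypothetical moving statistics that have not caught up with the data yet
stale_mean = np.zeros(4)
stale_var = np.ones(4)

train_out = batch_norm(batch, stale_mean, stale_var, training=True)
infer_out = batch_norm(batch, stale_mean, stale_var, training=False)

print("mean seen by the next layer while training:  ", train_out.mean())
print("mean seen by the next layer while predicting:", infer_out.mean())
```

With these made-up numbers, the same input looks centered around zero during training and far from zero during prediction, which is the kind of mismatch that can produce a good training loss alongside a poor validation loss.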

Secondly, I introduced data generators for training and validation data. For a Siamese network approach, where you must provide tons of similar and dissimilar pairs, generators are something you must master at some point! Once you get the gist of it, they are quite convenient. I found Shervine Amidi’s blog post, “A detailed example of how to use data generators with Keras”, to be a very well-explained example to build upon. I would recommend it to anyone learning about Keras data generators.
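As a rough illustration of the idea, here is a minimal sketch of a triplet generator written as a plain Python generator over NumPy arrays. It is not the generator used in this project, nor the class-based approach from the blog post mentioned above. The name `triplet_batches`, the array shapes, and the dummy zero targets are all assumptions made for the example, and every class is assumed to have at least two examples.

```python
import numpy as np

def triplet_batches(sequences, labels, batch_size, rng):
    """Yield ([anchors, positives, negatives], dummy_targets) batches forever."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Pre-compute which sample indices belong to each class
    index_by_class = {c: np.flatnonzero(labels == c) for c in classes}
    while True:
        anchors, positives, negatives = [], [], []
        for _ in range(batch_size):
            # Pick two distinct classes: one for anchor/positive, one for negative
            pos_class, neg_class = rng.choice(classes, size=2, replace=False)
            # Anchor and positive are two different samples of the same class
            a, p = rng.choice(index_by_class[pos_class], size=2, replace=False)
            n = rng.choice(index_by_class[neg_class])
            anchors.append(sequences[a])
            positives.append(sequences[p])
            negatives.append(sequences[n])
        inputs = [np.stack(anchors), np.stack(positives), np.stack(negatives)]
        # The targets are placeholders; a triplet loss only looks at the predictions
        yield inputs, np.zeros(batch_size)

rng = np.random.default_rng(0)
sequences = rng.normal(size=(12, 20, 5))  # 12 hypothetical windows, 20 steps, 5 features
labels = np.repeat([0, 1, 2], 4)          # 3 hypothetical traffic classes
generator = triplet_batches(sequences, labels, batch_size=8, rng=rng)
inputs, targets = next(generator)
print([x.shape for x in inputs], targets.shape)
```

Sampling triplets on the fly like this avoids materializing every possible similar/dissimilar combination in memory, which is the main reason generators pay off for Siamese training.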

Along the way I found that my triplet_loss function, as shown in the previous post, was flawed… Because of the way I am packing the output of the base neural network with Keras concatenate, I must explicitly specify the column ranges. Moreover, I painfully understood that a loss function in Keras is passed a mini-batch of y_true/y_pred values, not individual values. Well, that was not clear to me at first sight… I also took the opportunity to rework the logic to use more of a Keras approach than a TensorFlow one (subtle changes). Below is the new loss function.
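A minimal sketch of such a batch-wise triplet loss follows, written with NumPy so it can be run and checked without Keras. It should be read as an approximation of the idea, not as the exact function used here. The 3-column slicing follows the `anchor = y_pred[:,0:3]` line discussed in the comments; the names `EMBEDDING_DIM` and `ALPHA` and the margin value of 0.5 are assumptions. In a Keras loss, the `np.` calls would be replaced by their backend equivalents.

```python
import numpy as np

EMBEDDING_DIM = 3  # width of one embedding inside the concatenated output
ALPHA = 0.5        # margin between positive and negative distances (assumed value)

def triplet_loss(y_true, y_pred, dim=EMBEDDING_DIM, alpha=ALPHA):
    """Triplet loss over a mini-batch; y_true is ignored."""
    # Each row of y_pred is [anchor | positive | negative], so slice explicit ranges
    anchor = y_pred[:, 0:dim]
    positive = y_pred[:, dim:2 * dim]
    negative = y_pred[:, 2 * dim:3 * dim]

    # One squared distance per row of the mini-batch (axis=1 sums over embedding columns)
    pos_dist = np.sum(np.square(anchor - positive), axis=1)
    neg_dist = np.sum(np.square(anchor - negative), axis=1)

    # Hinge on each example, then average over the mini-batch
    return np.mean(np.maximum(pos_dist - neg_dist + alpha, 0.0))

# A hand-made mini-batch of two triplets
y_pred = np.array([
    # anchor         positive        negative
    [0.0, 0.0, 0.0,  0.0, 0.0, 0.0,  1.0, 1.0, 1.0],  # well separated -> contributes 0
    [0.0, 0.0, 0.0,  1.0, 0.0, 0.0,  0.0, 0.0, 0.0],  # badly separated -> contributes 1.5
])
loss = triplet_loss(np.zeros(len(y_pred)), y_pred)
print(loss)  # 0.75
```

Feeding the function a tiny hand-made batch like this is a quick way to confirm that the slicing and the axis of each reduction do what is intended before plugging the loss into a model.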

The fourth interesting thing to mention is that while I was debugging all those issues, I felt a need for a better way to visualize the results than simply looking at the prediction value. I reduced the output vector space from 10 dimensions to 3 dimensions, since I do not have that many different examples for now anyway, so 3D should be more than enough to separate them. Furthermore, I changed my output layer to use a sigmoid activation function to limit the output space to the [0,1] range. Those changes in turn enabled me to look at the location of the predicted point in the transformed space, i.e. a traffic pattern now corresponds to a 3D location in this output space.
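To illustrate what the sigmoid does to the output space, here is a small NumPy sketch. The pre-activation values are random numbers standing in for the last layer's raw output, which is an assumption for the example. Every output lands inside the unit cube, so the largest distance two traffic patterns can ever be from each other is the cube diagonal, the square root of 3.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
pre_activations = rng.normal(scale=4.0, size=(5, 3))  # hypothetical raw outputs, 5 samples
points = sigmoid(pre_activations)                     # 5 locations inside the unit cube

# Opposite corners of the unit cube give the largest possible separation
max_distance = np.linalg.norm(np.ones(3) - np.zeros(3))

print(points)
print(max_distance)
```

A bounded space like this also makes plots comparable from one training epoch to the next, since the axes never need rescaling.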

[Figure: SiameseSeparation]

Below I made a video of how this projection evolves through training. Initially, as the neural net is initialized with random values, the output points cluster together at the center. But quickly we see them being separated, each taking a corner of the space. Sure, there is a lot of bouncing back and forth as the neural net tries to find a better solution, but we can see that there is a sweet spot where the different traffic patterns are well separated. As a side note, we see three different traffic patterns here: normal traffic in green and two different error cases, one dramatic in red, where all traffic is blocked, and one subtler in orange, where we reach the capacity limit of the communication link.

Now, while acquiring more data from our test bed, we are trying out different loss functions to separate the traffic. One of my colleagues has just posted a comparison between different loss functions: “Lossless Triplet Loss”. I might also try some different loss functions and show my findings.

I hope this shows that One-Shot learning using Siamese networks can be used for purposes other than face recognition. In this case we are successfully using it for signalling traffic categorization and fault detection.


Cover photo by Jan-Mallander at Pixabay.

4 thoughts on “How to potty train a Siamese Network”

  1. Could you please explain the anchor = y_pred[:,0:3]? Why is anchor 3 columns? And in the previous version how many columns was anchor and why was it that many columns?
    Thank you, Jon

    1. This needs a little bit of explaining, I agree 🙂. So the outputs of the LSTM base network for anchor, positive and negative are passed as a flattened list to the loss function by the siamese network. In this instance I was using a 3D representation, so this is how I retrieved each “component” in the loss function. The other thing to notice is the passing of a “whole” column of them: the loss function is called on mini-batches, so it is not only one example at a time, but a batch of them. I must say that my loss function handling changed a lot in later iterations; I might come back to write about it later…
      So to summarize, Anchor, Positive and Negative are outputs from the same LSTM base network, which encodes the traffic pattern in a 3D space (a little like word embedding, if you want); this is why it is 3 columns. In the previous version it was buggy… hence the update.

  2. Hi,
    I read your amazing article on triplet loss. I am performing a similar experiment, for which I believe triplet loss should help, but I don’t understand why it is not working out for me. Instead of going to lossless triplet loss, I thought I would first make triplet loss work to some extent.
    Now the task is to classify similar and dissimilar items from a vector space and then do the clustering using Annoy with Euclidean distance, which will end in making clusters of similar items.

    The entire process –
    I have 300-dimensional word vectors, ran on the universal encyclopedia, and there could be a huge number of classes to classify, like animals, vegetables, fruits, etc.
    Now, what I did was to create the training set and testing set; the training set contains anchor, positive and negative vectors.
    My implementation of Triplet Loss –

    def triplet_loss(y_true, y_pred):
        anchor = y_pred[:, 0:300]
        positive = y_pred[:, 300:600]
        negative = y_pred[:, 600:900]

        # distance between the anchor and the positive
        pos_dist = K.sum(K.square(anchor - positive), axis=1)

        # distance between the anchor and the negative
        neg_dist = K.sum(K.square(anchor - negative), axis=1)

        # compute loss
        alpha = 0.5
        basic_loss = pos_dist - neg_dist + alpha
        loss = K.mean(K.maximum(basic_loss, 0.0))

        return loss

    My network –

    def create_base_network(input_dim=300):
        t = 'tanh'
        model = Sequential()
        model.add(Dense(input_dim, input_shape=(input_dim, ), activation=t))
        model.add(Dense(600, activation=t))
        model.add(Dense(input_dim, activation=t))

        return model

    def init_model(self):
        base_network = create_base_network()

        anchor_in = Input(shape=(300,))
        positive_in = Input(shape=(300,))
        negative_in = Input(shape=(300,))

        anchor_out = base_network(anchor_in)
        positive_out = base_network(positive_in)
        negative_out = base_network(negative_in)

        merged_vector = concatenate([anchor_out, positive_out, negative_out], axis=1)

        model = Model(inputs=[anchor_in, positive_in, negative_in], outputs=merged_vector)

        return model

    Training –
    model.fit(train_data, np.zeros(),
    epochs=100, shuffle=True, steps_per_epoch=None, batch_size=128,
    )

    Total Number of training samples is ~7k

    Note – the labels I am passing to model.fit are zeroes, because they don’t matter for the triplet loss.

    When training the model, the loss converges to 0 after 10 epochs on an average. But the actual separation is not even affected.

    The test is being performed in the following ways –
    To test the separation between two similar vectors and two dissimilar vectors, you just need two vectors. We cannot use this exact model, so what I did was extract the weights of the base network and, in a different script, build the exact same network so that I can set the learned weights. My expectation here is that the learned weights should know what separation to make between the two input vectors.

    But, it turns out that it does not make any separation.

    For measuring the separation I am using Cohen’s d as a metric:

    d = \frac{\text{mean difference}}{\text{standard deviation}}

    The separation between the samples in the input vector space is 1; ideally, after being transformed, this value should become 4, if only taking the -2-sigma, +2-sigma solution space.
    But, while testing the model to see the separation, the resulting value is 1.
    Even if you don’t train the model, the value is still 1, which is not expected, because the weights will be completely randomised.

    I have run out of ways to reason about what is happening with triplet loss and why it is not able to do the job.

    I have been dealing with this situation for a long time. Please share your thoughts on this. Any help is very much appreciated.

    Thank You,
    Shivam Srivastava


  3. My experience up to now is that the most probable culprit is the loss function. I suggest debugging it in isolation and making sure that what you get from the Keras framework is really what you think you get… Basically, getting the concatenation right and then de-concatenating it can be tricky. Once you get that right, make sure your TensorFlow operations (through the Keras backend, “K.” in your case) are operating on the right dimensions and performing the calculation you want (because what you get in the loss function is a batch of examples, not only a single one). That would be my 2 cents of advice. Hope it can help!
