Save and Restore Models


In the past few articles, we have been building machine learning models using TensorFlow APIs. There are situations where a model trains for a long time and we would like to store intermediate states of the model, either to assess how it is performing on the test data or to safeguard against unforeseen situations that prevent the training loop from completing.

After restoring the weights, the model can resume training where it left off, avoiding long training times. Saving the model also helps us share our work with others so that they can recreate it. When publishing research models and techniques, most machine learning practitioners share the code that creates the model along with the trained weights, or parameters, of the model.

Sharing this data helps others understand how the model works and try it out themselves on new data. In this module, we will learn how to store the model during or after training. A word of caution: TensorFlow models are code at the end of the day, so you should be careful and ascertain the origin of any untrusted code before using it.

There are different ways to save TensorFlow models depending on the API that you are using. Here we use tf.keras, which is a high-level API for building and training models. Let us begin by importing TensorFlow and the other dependencies. We install TensorFlow 2.0 and make sure that the right version is in place. We also import the os package because we want to write and read files on disk.
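A minimal sketch of this setup, assuming a notebook-style environment where the install line runs as a shell command:

```python
# In a notebook, the install would be run as a shell command, e.g.:
# !pip install -q tensorflow

import os            # for working with checkpoint paths and directories on disk

import tensorflow as tf
from tensorflow import keras

# Confirm that a TensorFlow 2.x version is installed.
print(tf.__version__)
```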

Let us load the MNIST dataset and take 1,000 examples each from the training and test sets, so that our model runs faster while still letting us demonstrate the save and restore functionality. Let us also define the model in a Python function so that we can call this function to create the model both before saving and again when restoring. We define a simple neural network model which has a single hidden layer with 512 units, uses relu as the activation function, and takes 784 values as input to that hidden layer.
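A sketch of the data-loading step described above; scaling the pixel values to [0, 1] is an assumption here, the essential part is keeping 1,000 examples and flattening each 28×28 image into 784 values:

```python
# Load MNIST and keep only the first 1,000 training and test examples
# so the model trains quickly for this demonstration.
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()

train_labels = train_labels[:1000]
test_labels = test_labels[:1000]

# Flatten each 28x28 image into a 784-value vector (scaling to [0, 1] is assumed).
train_images = train_images[:1000].reshape(-1, 28 * 28) / 255.0
test_images = test_images[:1000].reshape(-1, 28 * 28) / 255.0
```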

These 784 values come from the 28×28 images of digits stored in the MNIST dataset. In addition, we use dropout regularization with a dropout rate of 0.2. Finally, we have a dense layer with 10 units as the output layer, which uses softmax as its activation, since we want to output one of the 10 digits. We use Adam as the optimizer and sparse categorical crossentropy as the loss, because the labels are integers, and we track accuracy as the metric.
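The model function described above could be sketched like this:

```python
def create_model():
    """Build and compile the simple model used throughout this exercise."""
    model = tf.keras.Sequential([
        # Single hidden layer: 512 relu units over the 784 input pixels.
        keras.layers.Dense(512, activation='relu', input_shape=(784,)),
        # Dropout regularization with a rate of 0.2 on the hidden layer's output.
        keras.layers.Dropout(0.2),
        # Output layer: 10 units with softmax, one per digit class.
        keras.layers.Dense(10, activation='softmax')
    ])

    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```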

Let us create a model and examine it through the model.summary() method. You can see that the model has exactly two trainable layers: one hidden layer with 512 units, followed by a dropout layer (dropout is applied to the output of the first layer), and then an output layer with 10 units. In total there are 407,050 parameters in the model. We would like to automatically save checkpoints during training; this way we can use a trained model without having to retrain it, or pick up the training where we stopped last time in case the training process was interrupted or stopped for some reason.
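For example:

```python
# Build an instance and inspect the layers and parameter counts.
model = create_model()
model.summary()
```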

We use the ModelCheckpoint callback for this task; the callback takes a few arguments for configuring the checkpointing. Let us look at its usage. First we define and configure the checkpoint callback, which is done in the sketch shown after this explanation. We define the checkpoint path; this is the directory path where we want to store the checkpoint.

We then configure the callback with the checkpoint path and specify which part of the model we want to save. Here we save only the weights; we do not save the architecture or the optimizer configuration along with them. This is the simplest configuration of checkpointing.

We will see more advanced usage of checkpointing later in this exercise. Next we create a model with the create_model function; recall that create_model builds a TensorFlow model with one hidden layer of 512 units and an output layer with 10 units. We then fit the model by running the training loop for 10 epochs, and notice that we pass the callback into the training process.
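A sketch of this checkpointing setup; the directory name training_1 is just an illustrative choice:

```python
# Path where the checkpoint files will be written (illustrative name).
checkpoint_path = "training_1/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Save only the weights (not the architecture or optimizer state), once per epoch.
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)

model = create_model()

# Passing the callback to fit() writes a checkpoint at the end of every epoch.
model.fit(train_images, train_labels,
          epochs=10,
          validation_data=(test_images, test_labels),
          callbacks=[cp_callback])
```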

This callback creates a single collection of TensorFlow checkpoint files that are updated at the end of each epoch; in other words, this configuration checkpoints the model at the end of every epoch. Let us train the model; because we use very few examples, training finishes very quickly. Now let us look at the checkpoint directory.

An exclamation mark followed by a command is interpreted as a Unix shell command and runs as if we typed it on the command line. This code snippet prints the directory listing for the checkpoint directory; note that we trained the model for 10 epochs, and you can see that a few checkpoint files have been created in the directory.
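Outside a notebook, the same listing can be done directly from Python:

```python
# Equivalent to running "!ls training_1" in a notebook cell.
print(os.listdir(checkpoint_dir))
```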

To restore the weights, we first have to create a model with the same architecture as the original model and then load the saved weights into that new model. It is perfectly fine to reuse the weights from a previous run even though this is a different instance of the model. Before applying the weights, we create the model and evaluate its performance on the test data, even before restoring the parameters.

In this case the parameters hold random initial values, so the accuracy we get is just chance level. Here we get only about 10 percent accuracy, as against the 99 percent training accuracy and roughly 87 percent validation accuracy that we saw during training. Now let us load the weights from the checkpoint path, evaluate the model again, and check the accuracy.
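A sketch of evaluating a fresh model before and after restoring the checkpointed weights:

```python
# A fresh model with the same architecture but randomly initialised parameters.
model = create_model()

# Without the trained weights, accuracy is roughly chance level (~10% for 10 classes).
loss, acc = model.evaluate(test_images, test_labels, verbose=2)
print("Untrained model, accuracy: {:5.2f}%".format(100 * acc))

# Restore the weights written by the checkpoint callback and evaluate again.
model.load_weights(checkpoint_path)
loss, acc = model.evaluate(test_images, test_labels, verbose=2)
print("Restored model, accuracy: {:5.2f}%".format(100 * acc))
```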

We can see that we again get an accuracy of about 87 percent, just as we did earlier while training the model. So, simply by rebuilding the same architecture as the original model and restoring its weights, we get the same performance as the original model. This is very powerful: imagine you build a model and share its weights with a friend or a colleague; they can then recreate the same model and use it for the prediction task.

Let us look at the various options we have for creating a checkpoint callback. Instead of saving the checkpoint after every epoch, we can specify a period after which the model should be saved. We do that with the period argument; here we set period to 5, so the weights are saved every 5 epochs rather than after every epoch, and again we save only the weights.

We also give the checkpoint path and configure it to include the epoch number, so a unique name is created for each checkpoint that tells us which epoch it came from. Then we create the model, save the initial weights to the checkpoint path, and fit the model. Note that in the fit function we pass the callback as one of the arguments, and the model gets saved after every 5 epochs; here we train the model for 50 epochs.
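A sketch of this more advanced configuration; the directory name training_2 is illustrative, and note that newer TensorFlow releases replace the period argument with save_freq:

```python
# Include the epoch number in the file name so each checkpoint is uniquely identifiable.
checkpoint_path = "training_2/cp-{epoch:04d}.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# period=5 saves the weights every 5 epochs (newer releases use save_freq instead).
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1,
                                                 period=5)

model = create_model()

# Save the initial weights once, then train for 50 epochs with the callback attached.
model.save_weights(checkpoint_path.format(epoch=0))
model.fit(train_images, train_labels,
          epochs=50,
          validation_data=(test_images, test_labels),
          callbacks=[cp_callback],
          verbose=0)
```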

So we should see 10 checkpoints: one at the 5th epoch, one at the 10th epoch, and so on up to the 50th epoch. Let us look at the contents of the checkpoint directory; we can see that there are now 10 different checkpoints stored there. If we call the latest_checkpoint function and pass the checkpoint directory as an argument, we get the latest checkpoint.
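Retrieving the latest checkpoint and restoring it into a new model might look like this:

```python
# Find the most recent checkpoint in the directory.
latest = tf.train.latest_checkpoint(checkpoint_dir)
print(latest)

# Rebuild the architecture, load the latest weights, and evaluate.
model = create_model()
model.load_weights(latest)
loss, acc = model.evaluate(test_images, test_labels, verbose=2)
print("Restored model, accuracy: {:5.2f}%".format(100 * acc))
```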

By default, the TensorFlow format keeps only the 5 most recent checkpoints. As shown above, we retrieve the latest checkpoint and create a model with the weights restored from it. You must be wondering what these different files in the checkpoint directory are; let us take a look at them. The files follow the TensorFlow checkpoint format: an index file together with data files that hold the actual weight values.

Since we trained our model on a single machine, each checkpoint has all the weights stored in a single shard. If we had trained on multiple machines, there could have been multiple shards here. Apart from the callback, we can also save the weights manually; that is the other way of saving them.

We simply use the model.save_weights function and provide the directory path and file name where we want to store the weights. Let us run it to check. We save the weights to the my_checkpoint file and then load the weights back from that same file.
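A sketch of saving and restoring the weights manually; the path and file name are illustrative:

```python
# Manually save the current weights to a file.
model.save_weights('./checkpoints/my_checkpoint')

# Restore them into a brand-new instance of the same architecture.
model = create_model()
model.load_weights('./checkpoints/my_checkpoint')

loss, acc = model.evaluate(test_images, test_labels, verbose=2)
print("Restored model, accuracy: {:5.2f}%".format(100 * acc))
```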

You can see that we again get about 87 percent accuracy after saving the weights and restoring them in a new model. Instead of saving only the weights, we can also save the architecture of the model and the optimizer configuration; let us see how to do that. The entire model can be saved using the Hierarchical Data Format (HDF5); we specify it with the .h5 file extension.

Here we create the model, train it, and save it to an HDF5 file named my_model.h5. Later we can load the model from this HDF5 file and use it for prediction. I would like to point out the difference between the earlier checkpointing method, where we stored only the weights, and this method, where we store the entire model.

With checkpointing, we had to first create the model, then load the weights into it, and only then use it for the prediction task. In this case, we do not have to create the model, because the model itself has been saved in HDF5 format; we simply load it, which recreates the model, restores the weights, and leaves the model ready for the prediction task. It is important to note this difference. Let us load the model.
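A combined sketch of saving the entire model to HDF5 and loading it back; the number of training epochs here is just illustrative:

```python
# Create and train a model, then save everything (architecture, weights,
# and optimizer configuration) into a single HDF5 file.
model = create_model()
model.fit(train_images, train_labels, epochs=5)
model.save('my_model.h5')

# Loading recreates the model in one step; no call to create_model() is needed.
new_model = keras.models.load_model('my_model.h5')
new_model.summary()

loss, acc = new_model.evaluate(test_images, test_labels, verbose=2)
print("Restored model, accuracy: {:5.2f}%".format(100 * acc))
```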

We can see that this model has exactly the same summary as before, and when we check its accuracy it is almost the same, around 87 percent. This technique saves everything: the weights, the model configuration, and the optimizer configuration; Keras saves the model by inspecting its architecture. In this module, we studied how to store and restore TensorFlow models to and from disk.

These techniques are very handy when you have models that train for long periods of time, or when you want to export a model for deployment on different platforms. Hope you enjoyed learning these concepts; see you in the next article.
