TF’s Estimator API is very convenient to use: fill in a few functions and you get highly structured model code. But the deeper the encapsulation, the more troublesome it becomes to deal with problems once you run into them.
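As a reminder of the pattern, here is a minimal TF 1.x sketch; the toy model, data, and hyperparameters are my own illustration, not the code from this incident:

```python
import tensorflow as tf  # TF 1.x

def model_fn(features, labels, mode):
    # Build the model graph from the input features.
    logits = tf.layers.dense(features["x"], 1)
    loss = tf.losses.mean_squared_error(labels, logits)
    train_op = tf.train.AdamOptimizer().minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    # Feed data; note these tensors are also added to the graph.
    ds = tf.data.Dataset.from_tensor_slices(
        ({"x": [[1.0], [2.0]]}, [[2.0], [4.0]]))
    return ds.repeat().batch(2)

estimator = tf.estimator.Estimator(model_fn=model_fn)
estimator.train(input_fn, steps=100)
```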

I hit such a problem today during everyday alchemy (model training): training completed fine, but when exporting the saved model, a key could not find its corresponding variable:

NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

This error is usually caused by the graph being inconsistent between the save and the restore. The usual causes are:

  • Different API versions were used. Different APIs may add different prefixes or suffixes to variable names, or handle the naming logic inconsistently, so the graph built at load time is not the same as the one that was saved.
  • The variable was never stored in the checkpoint. This is uncommon: variables that require training are always saved, but non-trainable variables make it possible, as in the sketch after this list.
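A minimal TF 1.x sketch of how the second case can happen (the variable names here are my own illustration): a Saver constructed only over trainable variables silently skips everything else.

```python
import tensorflow as tf  # TF 1.x

w = tf.get_variable("w", shape=[3])                        # trainable, will be saved
step = tf.get_variable("step", shape=[], trainable=False)  # non-trainable

# Restricting var_list to trainable variables omits "step" from the
# checkpoint; restoring a graph that expects "step" then fails with
# NotFoundError.
saver = tf.train.Saver(var_list=tf.trainable_variables())
```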

These two cases are relatively easy to detect. You can inspect the variables stored in a checkpoint with the following tool:

python -m tensorflow.python.tools.inspect_checkpoint --file_name=checkpoint_file_name

In general, you get output listing each variable’s name, dtype, and shape, one per line.
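If you’d rather do this from Python, tf.train.list_variables gives the same information (the checkpoint path below is a placeholder):

```python
import tensorflow as tf

# Enumerate every variable stored in a checkpoint as (name, shape) pairs.
for name, shape in tf.train.list_variables("model.ckpt"):
    print(name, shape)
```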

My error this time was that the variable AGE_1/Embeddings could not be found. The variable was saved under the name AGE/Embeddings, but the load was looking for AGE_1/Embeddings: the name had changed. That points to the first cause, different graphs being used for save and export. However, my train and export logic live in the same code, use the same model_fn, and were tested in the same environment, so in theory the two graphs should be identical.

It took a long time to troubleshoot, but the problem did turn out to be graph inconsistency. An Estimator takes two kinds of functions: input_fn and model_fn. model_fn builds the model graph, but the tensors created in input_fn are added to the same graph, and the input_fn used in the training phase is not the one used in the export phase. The exported model is meant for online serving, so the serving input_fn defines a bunch of placeholders as inputs, and one of those placeholders is named after the feature, “age”, colliding with the scope model_fn uses for that feature. The collision makes TF rename AGE/Embeddings to AGE_1/Embeddings in the built graph, and looking up that name in the checkpoint finds nothing.
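The renaming is easy to reproduce (a minimal TF 1.x sketch; the names are illustrative, and I use the same casing for both to trigger the collision):

```python
import tensorflow as tf  # TF 1.x

g = tf.Graph()
with g.as_default():
    # The serving input_fn claims the op name "AGE" first.
    age = tf.placeholder(tf.int64, shape=[None], name="AGE")

    # model_fn then opens a name scope with the same name; TF silently
    # uniquifies it to "AGE_1", so the variable becomes AGE_1/Embeddings.
    with tf.name_scope("AGE"):
        emb = tf.Variable(tf.zeros([100, 16]), name="Embeddings")

    print(emb.op.name)  # AGE_1/Embeddings
```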

What misled me at the start was the error TF reported: I kept looking for the problem inside model_fn. The reported error is not necessarily where the real problem is, and TF 1.13 raises no complaint about the name collision itself. Most of the time we don’t name the tensors we create; TF names them automatically, and the automatic naming rules always generate a fresh name, so neither auto-generated names nor a mix of automatic and manual names will ever trigger a duplicate-name error. But once you assign a name yourself that collides with an existing one, TF silently renames it, and that is likely to cause problems.
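The silent renaming is easy to see for yourself (TF 1.x sketch, illustrative names):

```python
import tensorflow as tf  # TF 1.x

a = tf.constant(1.0, name="x")
b = tf.constant(2.0, name="x")  # same manual name: no error is raised
print(a.name, b.name)           # x:0 x_1:0 — the second op is renamed silently
```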