Training and Deployment Pipeline, Part 2
From Deep Learning Patterns and Practices by Andrew Ferlitsch
This article covers:
- Feeding models training data in a production environment.
- Scheduling for continuous retraining.
- Using version control and evaluating models before and after deployment.
- Deploying models for large scale on-demand and batch requests, in both monolithic and distributed deployments.
You can find the first part of this article here.
Model Feeding with TFX
In this section, we cover how TFX implements model feeding for the training pipeline components, as an alternative implementation. Figure 6 depicts the training pipeline components and their relationship to the data pipeline. The training pipeline consists of the following components:
- Trainer: Trains the model.
- Tuner: Tunes the hyperparameters (e.g., learning rate).
- Evaluator: Evaluates the model’s objective(s) (e.g., accuracy) and compares the results against a baseline (e.g., the previous version).
- Infra Evaluator: Tests the model in a sandbox serving environment before deployment.
Let’s review the benefits of TFX and pipelines in general. If we execute each step in training and deploying a model individually, we refer to this as a task-aware architecture. Each component is aware of itself, but unaware of the components it connects to and of the history of previous executions.
TFX implements orchestration. In orchestration, a management interface oversees the execution of each component, remembers the execution of past components, and maintains history. The outputs of each component are artifacts: the results and history of the execution. In orchestration, these artifacts, or references to them, are stored as metadata. For TFX, the metadata is stored in a relational format and can be stored and accessed via a SQL database.
Let’s dive a little deeper into the benefits of orchestration, and then we’ll cover how model feeding works within TFX. With orchestration, which is depicted in figure 7, one can:
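To make the idea of artifacts recorded as relational metadata concrete, here is a minimal sketch that uses Python’s built-in sqlite3 module. The table layout, the column names, and the record_artifact() and history() helpers are illustrative assumptions; they are not the schema or API that TFX’s metadata store actually uses.

```python
import sqlite3

# Hypothetical, simplified metadata store; TFX's real schema differs.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE artifacts (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        component TEXT NOT NULL,   -- component that produced the artifact
        name TEXT NOT NULL,        -- artifact name, e.g. 'schema'
        uri TEXT NOT NULL,         -- where the artifact itself is stored
        run_id INTEGER NOT NULL    -- which pipeline run produced it
    )
    """
)


def record_artifact(component, name, uri, run_id):
    """Store a reference to an artifact, not the artifact itself."""
    conn.execute(
        "INSERT INTO artifacts (component, name, uri, run_id) VALUES (?, ?, ?, ?)",
        (component, name, uri, run_id),
    )
    conn.commit()


def history(component):
    """Return (run_id, name, uri) for each artifact of a component, oldest first."""
    rows = conn.execute(
        "SELECT run_id, name, uri FROM artifacts "
        "WHERE component = ? ORDER BY run_id, id",
        (component,),
    )
    return rows.fetchall()


# Two pipeline runs, with made-up storage locations.
record_artifact("SchemaGen", "schema", "/tmp/pipeline/run1/schema", run_id=1)
record_artifact("Trainer", "model", "/tmp/pipeline/run1/model", run_id=1)
record_artifact("Trainer", "model", "/tmp/pipeline/run2/model", run_id=2)

trainer_history = history("Trainer")
```

Because every run leaves a row behind, a later component can query earlier executions, which is what makes comparison against previous versions and auditing possible.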
- Schedule execution of a component after one or more other components have completed. For example, scheduling the execution of data transformations after completion of generating a feature schema from training data.
- Schedule execution of components in parallel when the execution of the components is independent of each other. For example, scheduling in parallel hyperparameter tuning and training after completion of the data transformations.
- Reuse the artifacts from a previous execution of a component, i.e. cache, if nothing has changed. For example, if the training data hasn’t changed, the cached artifacts (i.e. transform graph) from the transformation component can be reused without re-execution.
- Provision different instances of compute engines for each component. For example, the data pipeline components may be provisioned on a CPU compute instance, and the training component on a GPU compute instance.
- Distribute a task across multiple compute instances, if the task supports distribution, as tuning and training do.
- Compare artifacts of a component to previous artifacts from previous executions of the component. For example, the evaluator component can compare the model’s objective, e.g. accuracy, to previously trained versions of the model.
- Debug and audit execution of the pipeline by being able to move forward and backwards through the generated artifacts.
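Two of the benefits above, scheduling a component after its upstream components and reusing cached artifacts when nothing has changed, can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how TFX’s orchestrators are implemented: the Orchestrator class, its add() and run() methods, and the three stand-in components are all hypothetical.

```python
import hashlib
import json


class Orchestrator:
    """Toy orchestrator: runs components in dependency order and caches results."""

    def __init__(self):
        self.components = {}  # name -> (function, list of upstream names)
        self.cache = {}       # (name, input fingerprint) -> artifact
        self.log = []         # history of (name, "executed" or "cached")

    def add(self, name, fn, upstream=()):
        self.components[name] = (fn, list(upstream))

    def _fingerprint(self, inputs):
        # Hash the inputs so that unchanged inputs map to the same cache key.
        payload = json.dumps(inputs, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def run(self, name, source_data, _results=None):
        """Run `name` after all of its upstream components have completed."""
        results = {} if _results is None else _results
        if name in results:
            return results[name]
        fn, upstream = self.components[name]
        if upstream:
            inputs = {up: self.run(up, source_data, results) for up in upstream}
        else:
            inputs = {"source": source_data}
        key = (name, self._fingerprint(inputs))
        if key in self.cache:
            self.log.append((name, "cached"))
        else:
            self.cache[key] = fn(inputs)
            self.log.append((name, "executed"))
        results[name] = self.cache[key]
        return results[name]


# Stand-in components; real ones would infer a schema, transform data, and train.
orch = Orchestrator()
orch.add("schema", lambda i: sorted(i["source"][0].keys()))
orch.add("transform", lambda i: {"columns": i["schema"], "scaled": True},
         upstream=["schema"])
orch.add("trainer",
         lambda i: "model trained on " + ",".join(i["transform"]["columns"]),
         upstream=["transform"])

data = [{"age": 31, "income": 52000}]
first = orch.run("trainer", data)
second = orch.run("trainer", data)  # same data, so each step is a cache hit
```

Requesting only the trainer is enough: the orchestrator walks the dependencies and runs schema and transform first. The log attribute doubles as a simple execution history for auditing.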
The Trainer component supports training TensorFlow estimators, TF.Keras models, and other custom training loops. Because TF 2.x recommends phasing out estimators, we focus only on configuring a Trainer component for TF.Keras models and feeding it data. The Trainer component takes the following minimum parameters:
- module_file: The Python script for custom training of the model. It must contain a run_fn() function as the entry point for training.
- examples: The examples used to train the model, which come from the output of the ExampleGen component, example_gen.outputs['examples'].
- schema: The dataset schema, which comes from the output of the SchemaGen component, schema_gen.outputs['schema'].
- custom_executor_spec: The executor for custom training, which invokes the run_fn() function in module_file.
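from tfx.components import Trainer
from tfx.components.base import executor_spec #A
from tfx.components.trainer.executor import GenericExecutor #A
trainer = Trainer(
    module_file=module_file, #B
    examples=example_gen.outputs['examples'], #C
    schema=schema_gen.outputs['schema'], #D
    custom_executor_spec=executor_spec.ExecutorClassSpec(GenericExecutor)) #E
#A Imports for custom training.
#B The custom training Python script.
#C The training data source for feeding the model during training.
#D The schema inferred from the dataset.
#E The custom executor for custom training.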
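The entry-point contract, where module_file must define run_fn() and an executor invokes it, can be illustrated without TFX. The sketch below is an assumption about the general mechanism (importing a script by path and calling a named function); it is not TFX’s executor code. The load_entry_point() helper and the stand-in script are hypothetical.

```python
import importlib.util
import os
import tempfile
from types import SimpleNamespace


def load_entry_point(module_file, fn_name="run_fn"):
    """Import a Python file by path and return the named entry-point function."""
    spec = importlib.util.spec_from_file_location("user_module", module_file)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    if not hasattr(module, fn_name):
        raise AttributeError(f"{module_file} must define {fn_name}()")
    return getattr(module, fn_name)


# A stand-in training script; a real one would build and fit a model.
script = '''
def run_fn(training_args):
    return "trained for %d steps" % training_args.train_steps
'''

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "trainer_module.py")
    with open(path, "w") as f:
        f.write(script)
    run_fn = load_entry_point(path)
    # The arguments arrive as properties of a single object.
    result = run_fn(SimpleNamespace(train_steps=1000))
```

A script that lacks run_fn() fails at load time with a clear error, before any training resources are used.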
If the training data is to be preprocessed by the Transform component, we need to set the following two parameters:
- transformed_examples: Set to the output of the Transform component, transform.outputs['transformed_examples'].
- transform_graph: The static transformation graph produced by the Transform component, transform.outputs['transform_graph'].
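trainer = Trainer(
    module_file=module_file,
    transformed_examples=transform.outputs['transformed_examples'], #A
    transform_graph=transform.outputs['transform_graph'], #A
    schema=schema_gen.outputs['schema'],
    custom_executor_spec=executor_spec.ExecutorClassSpec(GenericExecutor))
#A Training data is fed from the Transform component into the static transform graph.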
Generally, one wants to pass other hyperparameters into the training module. These can be passed as additional parameters train_args and eval_args to the Trainer component. These parameters are set as a list of key/value pairs converted to Google’s protobuf format. The code example below shows passing the number of steps for training and evaluation.
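from tfx.proto import trainer_pb2 #A
trainer = Trainer(
    module_file=module_file,
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    custom_executor_spec=executor_spec.ExecutorClassSpec(GenericExecutor),
    train_args=trainer_pb2.TrainArgs(num_steps=train_steps), #B
    eval_args=trainer_pb2.EvalArgs(num_steps=eval_steps)) #B
#A Import for the TFX protobuf format for passing hyperparameters.
#B Hyperparameters passed into the Trainer component as protobuf messages.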
Let’s look now at the basic requirements for the run_fn() function in the custom Python script. The arguments to run_fn() are constructed from the parameters passed into the Trainer component, and are accessed as properties. In the example implementation below, we:
- Extract the total number of steps for training, training_args.train_steps (#B).
- Extract the number of steps for validation after each epoch, training_args.eval_steps (#B).
- Get the TFRecord file paths for the training and eval data, training_args.train_files and training_args.eval_files (#C). Note that ExampleGen isn’t feeding in-memory tf.Examples, but on-disk TFRecords containing the tf.Examples.
- Get the transform graph, training_args.transform_output, and construct a transform execution function with tft.TFTransformOutput() (#D).
- Call the internal function _input_fn() to create the dataset iterators for the training and validation datasets (#D).
- Build, or load, a TF.Keras model with the internal function _build_model() (#E).
- Calculate the number of epochs (#F) and train the model with the fit() method (#G).
- Get the serving directory in which to store the trained model, training_args.output, which is optionally specified as the parameter output to the Trainer component (#H).
- Save the trained model to the specified serving output location, model.save(serving_dir) (#H).
from tfx.components.trainer.executor import TrainerFnArgs
import tensorflow_transform as tft
BATCH_SIZE = 64 #A
STEPS_PER_EPOCH = 250 #A
def run_fn(training_args: TrainerFnArgs):
    train_steps = training_args.train_steps #B
    eval_steps = training_args.eval_steps #B
    train_files = training_args.train_files #C
    eval_files = training_args.eval_files #C
    tf_transform_output = tft.TFTransformOutput(training_args.transform_output) #D
    train_dataset = _input_fn(train_files, tf_transform_output, BATCH_SIZE) #D
    eval_dataset = _input_fn(eval_files, tf_transform_output, BATCH_SIZE) #D
    model = _build_model() #E
    epochs = train_steps // STEPS_PER_EPOCH #F
    model.fit(train_dataset, epochs=epochs, validation_data=eval_dataset,
              steps_per_epoch=STEPS_PER_EPOCH, validation_steps=eval_steps) #G
    serving_dir = training_args.output #H
    model.save(serving_dir) #H
#A Hyperparameters set as constants.
#B Training/Validation steps passed as parameters to the Trainer component.
#C Training/Validation data passed as parameters to the Trainer component.
#D Create the dataset iterators for train and validation data.
#E Build or load the model to train.
#F Calculate the number of epochs.
#G Train the model.
#H Save the model in SavedModel format to the specified serving directory.
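The listing derives the number of epochs by floor division of the total training steps by STEPS_PER_EPOCH, which silently drops any remainder. The short sketch below works through that arithmetic; the epochs_from_steps() helper and the step counts passed to it are hypothetical values chosen for illustration.

```python
STEPS_PER_EPOCH = 250  # same constant as in the training script


def epochs_from_steps(train_steps, steps_per_epoch=STEPS_PER_EPOCH):
    """Return the whole epochs that fit in train_steps and the steps left over."""
    epochs, leftover = divmod(train_steps, steps_per_epoch)
    return epochs, leftover


full = epochs_from_steps(10000)     # (40, 0): 40 epochs of 250 steps
partial = epochs_from_steps(10100)  # (40, 100): the last 100 steps are dropped
tiny = epochs_from_steps(200)       # (0, 200): less than one epoch
```

The last case is worth guarding against: if train_steps is smaller than STEPS_PER_EPOCH, the computed number of epochs is zero, so it can be useful to validate the step count before calling fit().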
There are many fine details to work out, and many directions to take, when constructing the custom Python training script. For more details and directions, we recommend reviewing TFX’s guide for the Trainer component.
The Tuner component is an optional task in the training pipeline. You can either hardwire the hyperparameters for training in the custom Python training script, or use the Tuner to find the best values for the hyperparameters.
The parameters to the Tuner are similar to those of the Trainer. The Tuner does short training runs to find the best hyperparameters but, unlike the Trainer, which returns a trained model, the Tuner outputs the tuned hyperparameter values. The parameters that typically differ are train_args and eval_args: because these are shorter training runs, the number of steps for the Tuner is typically twenty percent or less of that for the full training. The other requirement is that the custom Python training script, module_file, contains the entry point function tuner_fn(). The typical practice is to have a single Python training script that has both the run_fn() and tuner_fn() functions.
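tuner = Tuner(
    module_file=module_file,
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    train_args=trainer_pb2.TrainArgs(num_steps=tuner_train_steps), #A
    eval_args=trainer_pb2.EvalArgs(num_steps=tuner_eval_steps)) #A
#A The number of steps for shorter training runs when tuning.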
Next, we look at an example implementation of tuner_fn(), where we use KerasTuner to do the hyperparameter tuning, although you can use any tuner compatible with your model framework. KerasTuner is a separate package from TensorFlow, and you need to install it, as follows:
pip install keras-tuner
Like the Trainer component, the parameters and defaults of the Tuner component are passed into tuner_fn() as properties of the parameter tuner_args. Note how the function starts the same way as run_fn(), but differs when we get to the training step. Instead of calling the fit() method and saving the trained model, we:
- Instantiate a KerasTuner, where:
  - We use _build_model() as our hyperparameter model argument.
  - We call an internal function _get_hyperparameters() to specify the hyperparameter search space.
  - The maximum number of trials is set to 6.
  - The objective for selecting the best values for the hyperparameters is validation accuracy.
- Pass the tuner and the remaining parameters for training to an instance of TunerFnResult(), which executes the tuner.
- Return the results from the tuning trials.
import kerastuner
from tfx.components.trainer.fn_args_utils import FnArgs
from tfx.components.tuner.component import TunerFnResult
def tuner_fn(tuner_args: FnArgs) -> TunerFnResult: #A
    train_steps = tuner_args.train_steps
    eval_steps = tuner_args.eval_steps
    train_files = tuner_args.train_files
    eval_files = tuner_args.eval_files
    tf_transform_output = tft.TFTransformOutput(tuner_args.transform_output)
    train_dataset = _input_fn(train_files, tf_transform_output, BATCH_SIZE)
    eval_dataset = _input_fn(eval_files, tf_transform_output, BATCH_SIZE)
    tuner = kerastuner.RandomSearch(_build_model, #B
                                    hyperparameters=_get_hyperparameters(), #C
                                    max_trials=6,
                                    objective='val_accuracy')
    result = TunerFnResult(tuner=tuner, #D
                           fit_kwargs={
                               'x': train_dataset, #E
                               'validation_data': eval_dataset, #E
                               'steps_per_epoch': train_steps, #E
                               'validation_steps': eval_steps #E
                           })
    return result
#A The entry point function for hyperparameter tuning.
#B Instantiate a KerasTuner for RandomSearch.
#C Retrieve the hyperparameter search space.
#D Instantiate and execute the tuning trials with the specified tuner instance.
#E Training parameters for the short training runs during tuning.
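To show what a random search over a hyperparameter space does across a fixed number of trials, here is a minimal sketch in plain Python. It is not KerasTuner’s implementation: the random_search() function, the search space, and the toy objective (which stands in for the validation accuracy of a short training run) are all assumptions made for illustration.

```python
import random


def random_search(search_space, objective, max_trials=6, seed=0):
    """Sample hyperparameters at random and keep the best-scoring trial."""
    rng = random.Random(seed)
    trials = []
    for _ in range(max_trials):
        # Draw one candidate value for each hyperparameter.
        params = {name: rng.choice(values) for name, values in search_space.items()}
        trials.append((objective(params), params))
    best_score, best_params = max(trials, key=lambda t: t[0])
    return best_params, best_score, trials


# Hypothetical search space.
space = {"learning_rate": [0.1, 0.01, 0.001], "units": [32, 64, 128]}


def toy_objective(params):
    # Pretend that accuracy peaks at learning_rate=0.01 and units=64.
    lr_penalty = abs(params["learning_rate"] - 0.01)
    units_penalty = abs(params["units"] - 64) / 1000
    return 1.0 - lr_penalty - units_penalty


best_params, best_score, trials = random_search(space, toy_objective)
```

With only six trials over nine possible combinations, the search may not visit the best combination; that is the trade-off of a small trial budget, and it is why the short tuning runs are kept cheap.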
Now let’s see how the Tuner and Trainer components are chained together to form an executable pipeline. In the example implementation below, the single modification we made to the instantiation of the Trainer component is the addition of the optional parameter hyperparameters, whose input we connect to the output of the Tuner component. Now when we execute the Trainer instance with context.run(), the orchestrator sees the dependency on the Tuner and schedules its execution prior to the full training by the Trainer component.
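tuner = Tuner(
    module_file=module_file,
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    train_args=trainer_pb2.TrainArgs(num_steps=tuner_train_steps),
    eval_args=trainer_pb2.EvalArgs(num_steps=tuner_eval_steps))
trainer = Trainer(
    module_file=module_file,
    transformed_examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=schema_gen.outputs['schema'],
    hyperparameters=tuner.outputs['best_hyperparameters'], #A
    custom_executor_spec=executor_spec.ExecutorClassSpec(GenericExecutor))
context.run(trainer) #B
#A Get the tuned hyperparameters from the Tuner component.
#B Execute the Tuner/Trainer pipeline.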
As with the Trainer, the Python hyperparameter tuning script can be customized. See TFX’s guide for the Tuner component for more.
That’s all for now. Thanks for reading.