ml.rl.training package

Submodules

ml.rl.training.c51_trainer module

class ml.rl.training.c51_trainer.C51Trainer(q_network, q_network_target, parameters: ml.rl.parameters.DiscreteActionModelParameters, use_gpu=False, metrics_to_score=None)

Bases: ml.rl.training.rl_trainer_pytorch.RLTrainer

Implementation of Categorical DQN (C51), which models the return distribution with 51 atoms.

See https://arxiv.org/abs/1707.06887 for details
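For orientation, a minimal sketch of the distributional representation C51 uses: the return is modeled as a categorical distribution over a fixed support of 51 atoms, and scalar Q-values are its expectations. The names qmin, qmax, and num_atoms below are illustrative, not this trainer's constructor arguments.

    import torch

    # C51 keeps a probability mass function over num_atoms fixed support points.
    num_atoms, qmin, qmax = 51, -10.0, 10.0
    support = torch.linspace(qmin, qmax, num_atoms)            # z_1 ... z_51

    # Suppose the network outputs per-action atom probabilities (batch x actions x atoms).
    pmf = torch.softmax(torch.randn(4, 2, num_atoms), dim=2)

    # Scalar Q-values used for greedy action selection are expectations under the pmf.
    q_values = (pmf * support).sum(dim=2)                      # batch x actions
    best_action = q_values.argmax(dim=1)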

argmax_with_mask(q_values, possible_actions_mask)
boost_rewards(rewards: torch.Tensor, actions: torch.Tensor) → torch.Tensor
internal_prediction(input)

Only used by Gym

train(training_batch)

ml.rl.training.cem_trainer module

The trainer for the Cross-Entropy Method (CEM). The idea is that an ensemble of world models is fitted to predict transitions and reward functions; a cross-entropy-method-based planner then plans the best next action based on simulation data generated by the fitted world models.

The idea is inspired by: https://arxiv.org/abs/1805.12114
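The planning step follows the generic cross-entropy method: sample candidate action sequences from a Gaussian, score them with the world-model ensemble, and refit the Gaussian to the highest-scoring (elite) sequences. A minimal sketch, assuming a hypothetical evaluate_plan function that rolls plans through the fitted world models; this is not CEMPlannerNetwork's actual interface.

    import torch

    def cem_plan(evaluate_plan, action_dim, horizon=10, pop_size=100,
                 num_elites=10, num_iterations=5):
        mean = torch.zeros(horizon, action_dim)
        std = torch.ones(horizon, action_dim)
        for _ in range(num_iterations):
            # Sample candidate action sequences from the current Gaussian.
            plans = mean + std * torch.randn(pop_size, horizon, action_dim)
            # Score each plan, e.g. by simulated cumulative reward.
            returns = evaluate_plan(plans)
            # Refit the Gaussian to the elite plans.
            elites = plans[returns.topk(num_elites).indices]
            mean, std = elites.mean(dim=0), elites.std(dim=0)
        # Execute the first action of the final mean plan.
        return mean[0]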

class ml.rl.training.cem_trainer.CEMTrainer(cem_planner_network: ml.rl.models.cem_planner.CEMPlannerNetwork, world_model_trainers: List[ml.rl.training.world_model.mdnrnn_trainer.MDNRNNTrainer], parameters: ml.rl.parameters.CEMParameters, use_gpu: bool = False)

Bases: ml.rl.training.rl_trainer_pytorch.RLTrainer

internal_prediction(state: torch.Tensor) → Union[ml.rl.types.SacPolicyActionSet, ml.rl.types.DqnPolicyActionSet]

Only used by Gym. Return the predicted next action

internal_reward_estimation(input)

Only used by Gym

train(training_batch, batch_first=False)

ml.rl.training.dqn_predictor module

class ml.rl.training.dqn_predictor.DQNPredictor(pem, ws, predict_net=None)

Bases: ml.rl.training.sandboxed_predictor.SandboxedRLPredictor

predict(float_state_features)

Returns values for each state.

:param float_state_features: A list of feature -> float value dict examples
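A minimal illustration of the expected input shape; the feature ids and values below are made up:

    # Each example is a dict mapping feature id -> float value; predict takes a list of them.
    float_state_features = [
        {0: 1.0, 1: 0.5},   # state 1
        {0: 0.0, 1: 2.3},   # state 2
    ]
    # values = predictor.predict(float_state_features)  # one output per input state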

ml.rl.training.dqn_trainer module

class ml.rl.training.dqn_trainer.DQNTrainer(q_network, q_network_target, reward_network, parameters: ml.rl.parameters.DiscreteActionModelParameters, use_gpu=False, q_network_cpe=None, q_network_cpe_target=None, metrics_to_score=None, imitator=None)

Bases: ml.rl.training.dqn_trainer_base.DQNTrainerBase

get_detached_q_values(state) → Tuple[ml.rl.types.AllActionQValues, Optional[ml.rl.types.AllActionQValues]]

Gets the q values from the model and target networks

internal_prediction(input)

Only used by Gym

internal_reward_estimation(input)

Only used by Gym

train(training_batch)
warm_start_components()

The trainer should specify what members to save and load

ml.rl.training.dqn_trainer_base module

class ml.rl.training.dqn_trainer_base.DQNTrainerBase(parameters, use_gpu, metrics_to_score=None, actions: Optional[List[str]] = None)

Bases: ml.rl.training.rl_trainer_pytorch.RLTrainer

boost_rewards(rewards: torch.Tensor, actions: torch.Tensor) → torch.Tensor
get_max_q_values(q_values, possible_actions_mask)

Used in Q-learning update.

Parameters
  • q_values – Tensor with shape (batch_size, action_dim). Each row contains the Q-values of all actions for one state.

  • possible_actions_mask – Tensor with shape (batch_size, action_dim). possible_actions_mask[i][j] = 1 iff the agent can take action j from state i.

get_max_q_values_with_target(q_values, q_values_target, possible_actions_mask)

Used in Q-learning update.

Parameters
  • q_values – Tensor with shape (batch_size, action_dim) holding the online network's Q-values for each action.

  • q_values_target – Tensor with shape (batch_size, action_dim) holding the target network's Q-values for each action.

  • possible_actions_mask – Tensor with shape (batch_size, action_dim). possible_actions_mask[i][j] = 1 iff the agent can take action j from state i.

With double Q-learning, the greedy action is chosen with q_values and evaluated with q_values_target; a sketch follows below.
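A minimal sketch of the masked max and the double Q-learning selection described above; the helper below is illustrative, not this class's code:

    import torch

    ACTION_NOT_POSSIBLE_VAL = -1e9  # mirrors RLTrainer.ACTION_NOT_POSSIBLE_VAL

    def max_q_values_double(q_values, q_values_target, possible_actions_mask):
        # Mask out impossible actions so they can never be selected.
        masked_q = q_values + (1 - possible_actions_mask) * ACTION_NOT_POSSIBLE_VAL
        # Double Q-learning: pick the argmax action with the online network ...
        best_actions = masked_q.argmax(dim=1, keepdim=True)
        # ... but evaluate it with the target network.
        return q_values_target.gather(1, best_actions)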

ml.rl.training.imitator_training module

class ml.rl.training.imitator_training.ImitatorTrainer(imitator, parameters: ml.rl.parameters.DiscreteActionModelParameters, use_gpu=False)

Bases: ml.rl.training.rl_trainer_pytorch.RLTrainer

train(training_batch, train=True)
ml.rl.training.imitator_training.get_valid_actions_from_imitator(imitator, input, drop_threshold)

Create mask for non-viable actions under the imitator.
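One plausible reading of drop_threshold, sketched below as an assumption about the semantics rather than this function's actual code: actions to which the imitator assigns less than drop_threshold probability are masked out.

    import torch

    def valid_action_mask(imitator_logits, drop_threshold):
        # Hypothetical illustration: keep only actions the imitator
        # assigns at least drop_threshold probability to.
        probs = torch.softmax(imitator_logits, dim=1)
        return (probs >= drop_threshold).float()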

ml.rl.training.loss_reporter module

class ml.rl.training.loss_reporter.BatchStats(td_loss, reward_loss, imitator_loss, logged_actions, logged_propensities, logged_rewards, logged_values, model_propensities, model_rewards, model_values, model_values_on_logged_actions, model_action_idxs)

Bases: tuple

static add_custom_scalars(action_names: Optional[List[str]])
property imitator_loss

Alias for field number 2

property logged_actions

Alias for field number 3

property logged_propensities

Alias for field number 4

property logged_rewards

Alias for field number 5

property logged_values

Alias for field number 6

property model_action_idxs

Alias for field number 11

property model_propensities

Alias for field number 7

property model_rewards

Alias for field number 8

property model_values

Alias for field number 9

property model_values_on_logged_actions

Alias for field number 10

property reward_loss

Alias for field number 1

property td_loss

Alias for field number 0

write_summary(actions: List[str])
class ml.rl.training.loss_reporter.LossReporter(action_names: Optional[List[str]] = None)

Bases: object

RECENT_WINDOW_SIZE = 100
static calculate_recent_window_average(arr, window_size, num_entries)
flush()
get_logged_action_distribution()
get_model_action_distribution()
get_recent_imitator_loss()
get_recent_reward_loss()
get_recent_rewards()
get_recent_td_loss()
get_td_loss_after_n(n)
log_to_tensorboard(epoch: int) → None
property num_batches
report(**kwargs)
class ml.rl.training.loss_reporter.StatsByAction(actions)

Bases: object

append(stats)
items()
ml.rl.training.loss_reporter.merge_tensor_namedtuple_list(l, cls)

ml.rl.training.off_policy_predictor module

class ml.rl.training.off_policy_predictor.RLPredictor(net, init_net, parameters, ws=None)

Bases: object

analyze(named_features)
get_predictor_export_meta()

Returns a PredictorExportMeta object

in_order_dense_to_sparse(dense)

Convert a dense observation to a sparse observation, assuming in-order feature ids.
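For example, with in-order feature ids the conversion is purely positional; the values below are made up and the key type is an assumption:

    dense = [0.3, 1.5, -0.2]
    sparse = {i: v for i, v in enumerate(dense)}   # {0: 0.3, 1: 1.5, 2: -0.2}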

classmethod load(db_path, db_type)

Creates Predictor by loading from a database

:param db_path: see load_from_db
:param db_type: see load_from_db

predict(float_state_features)

Returns values for each state.

:param float_state_features: A list of feature -> float value dict examples

property predict_net
save(db_path, db_type)

Saves network to db

:param db_path: see save_to_db
:param db_type: see save_to_db

ml.rl.training.on_policy_predictor module

class ml.rl.training.on_policy_predictor.CEMPlanningPredictor(trainer, action_dim: int, use_gpu: bool)

Bases: ml.rl.training.on_policy_predictor.OnPolicyPredictor

discrete_action() → bool

Return True if this predictor is for a discrete action network

policy(states: torch.Tensor, possible_actions_presence: Optional[torch.Tensor] = None) → Union[ml.rl.types.SacPolicyActionSet, ml.rl.types.DqnPolicyActionSet]
policy_net() → bool

Return True if this predictor is for a policy network

class ml.rl.training.on_policy_predictor.ContinuousActionOnPolicyPredictor(trainer, action_dim: int, use_gpu: bool)

Bases: ml.rl.training.on_policy_predictor.OnPolicyPredictor

policy(states: torch.Tensor) → ml.rl.types.SacPolicyActionSet
policy_net() → bool

Return True if this predictor is for a policy network

class ml.rl.training.on_policy_predictor.DiscreteDQNOnPolicyPredictor(trainer, action_dim: int, use_gpu: bool)

Bases: ml.rl.training.on_policy_predictor.OnPolicyPredictor

discrete_action() → bool

Return True if this predictor is for a discrete action network

estimate_reward(state)
policy(state: torch.Tensor, possible_actions_presence: torch.Tensor) → ml.rl.types.DqnPolicyActionSet
policy_net() → bool

Return True if this predictor is for a policy network

predict(state)
class ml.rl.training.on_policy_predictor.OnPolicyPredictor(trainer, action_dim: int, use_gpu: bool)

Bases: object

This class generates actions given a trainer and a state. It’s used for on-policy learning. If you have a TorchScript (i.e. serialized) model, use the classes in off_policy_predictor.py instead.
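A hedged sketch of how a Gym-style rollout might use an on-policy predictor; the environment wiring and the greedy attribute access below are illustrative assumptions, not part of this module:

    import torch

    def run_episode(env, predictor, possible_actions_presence=None):
        obs, done = env.reset(), False
        total_reward = 0.0
        while not done:
            state = torch.tensor(obs, dtype=torch.float32).unsqueeze(0)
            if predictor.discrete_action():
                # DqnPolicyActionSet: take e.g. its greedy choice.
                action = predictor.policy(state, possible_actions_presence).greedy
            else:
                # SacPolicyActionSet for continuous-action predictors.
                action = predictor.policy(state).greedy
            obs, reward, done, _ = env.step(action)
            total_reward += reward
        return total_reward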

discrete_action() → bool

Return True if this predictor is for a discrete action network

policy_net() → bool

Return True if this predictor is for a policy network

class ml.rl.training.on_policy_predictor.ParametricDQNOnPolicyPredictor(trainer, action_dim: int, use_gpu: bool)

Bases: ml.rl.training.on_policy_predictor.OnPolicyPredictor

discrete_action() → bool

Return True if this predictor is for a discrete action network

estimate_reward(states_tiled: torch.Tensor, possible_actions: torch.Tensor)
policy(states_tiled: torch.Tensor, possible_actions_with_presence: Tuple[torch.Tensor, torch.Tensor])
policy_net() → bool

Return True if this predictor is for a policy network

predict(states_tiled: torch.Tensor, possible_actions: torch.Tensor)

ml.rl.training.parametric_dqn_trainer module

class ml.rl.training.parametric_dqn_trainer.ParametricDQNTrainer(q_network, q_network_target, reward_network, parameters: ml.rl.parameters.ContinuousActionModelParameters, use_gpu: bool = False)

Bases: ml.rl.training.dqn_trainer_base.DQNTrainerBase

get_detached_q_values(state, action) → Tuple[ml.rl.types.SingleQValue, ml.rl.types.SingleQValue]

Gets the q values from the model and target networks

internal_prediction(state, action)

Only used by Gym

internal_reward_estimation(state, action)

Only used by Gym

train(training_batch) → None

ml.rl.training.qrdqn_trainer module

class ml.rl.training.qrdqn_trainer.QRDQNTrainer(q_network, q_network_target, parameters: ml.rl.parameters.DiscreteActionModelParameters, use_gpu=False, metrics_to_score=None, reward_network=None, q_network_cpe=None, q_network_cpe_target=None)

Bases: ml.rl.training.dqn_trainer_base.DQNTrainerBase

Implementation of QR-DQN (Quantile Regression Deep Q-Network)

See https://arxiv.org/abs/1710.10044 for details
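For reference, the per-sample loss QR-DQN minimizes is the quantile Huber loss from the paper; the huber helper listed under this class presumably corresponds to \(\mathcal{L}_\kappa\). For TD error \(u\) and quantile fraction \(\tau\):

    \mathcal{L}_\kappa(u) =
    \begin{cases}
      \tfrac{1}{2}u^2, & \text{if } |u| \le \kappa \\
      \kappa\left(|u| - \tfrac{1}{2}\kappa\right), & \text{otherwise}
    \end{cases}
    \qquad
    \rho_\tau^\kappa(u) = \left|\tau - \mathbb{1}\{u < 0\}\right|\,\frac{\mathcal{L}_\kappa(u)}{\kappa}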

argmax_with_mask(q_values, possible_actions_mask)
boost_rewards(rewards: torch.Tensor, actions: torch.Tensor) → torch.Tensor
get_detached_q_values(state)

Gets the q values from the model and target networks

huber(x)
internal_prediction(input)

Only used by Gym

train(training_batch)
warm_start_components()

The trainer should specify what members to save and load

ml.rl.training.rl_dataset module

class ml.rl.training.rl_dataset.RLDataset(file_path)

Bases: object

insert(**kwargs)
insert_pre_timeline_format(mdp_id, sequence_number, state, timeline_format_action, reward, terminal, possible_actions, time_diff, action_probability, possible_actions_mask)

Insert a new sample into the dataset in the pre-timeline JSON format. This format is needed for running the timeline operator and for uploading the dataset to Hive.

insert_replay_buffer_format(state, action, reward, next_state, next_action, terminal, possible_next_actions, possible_next_actions_mask, time_diff, possible_actions, possible_actions_mask, policy_id)

Insert a new sample into the dataset in the same format as the replay buffer.

load()

Load samples from a gzipped JSON file.

save()

Save samples as a pickle file or JSON file.
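A hedged usage sketch; the file path and argument values are made up for illustration, but the keyword names follow the signatures listed above:

    from ml.rl.training.rl_dataset import RLDataset

    dataset = RLDataset("/tmp/rl_dataset.json.gz")
    dataset.insert_pre_timeline_format(
        mdp_id="episode_0",
        sequence_number=0,
        state={0: 1.0},
        timeline_format_action="down",
        reward=1.0,
        terminal=False,
        possible_actions=["up", "down"],
        time_diff=1,
        action_probability=0.5,
        possible_actions_mask=[1, 1],
    )
    dataset.save()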

ml.rl.training.rl_trainer_pytorch module

class ml.rl.training.rl_trainer_pytorch.RLTrainer(parameters, use_gpu, metrics_to_score=None, actions: Optional[List[str]] = None)

Bases: object

ACTION_NOT_POSSIBLE_VAL = -1000000000.0
FINGERPRINT = 12345
internal_prediction(input)

Q-network forward pass method for internal domains.

:param input: input to the network

internal_reward_estimation(input)

Reward-network forward pass for internal domains.

load_state_dict(state_dict)
property num_actions
state_dict()
train(training_samples) → None
warm_start_components() → List[str]

The trainer should specify what members to save and load
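A minimal sketch of warm starting a trainer from a previous run, assuming the usual PyTorch state_dict conventions; the checkpoint path is illustrative:

    import torch

    def checkpoint_and_restore(trainer, new_trainer, path="/tmp/trainer_checkpoint.pt"):
        # Serialize the warm-start components of one trainer ...
        torch.save(trainer.state_dict(), path)
        # ... and restore them into a freshly constructed trainer.
        new_trainer.load_state_dict(torch.load(path))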

ml.rl.training.sac_trainer module

class ml.rl.training.sac_trainer.SACTrainer(q1_network, actor_network, parameters: ml.rl.parameters.SACModelParameters, use_gpu=False, value_network=None, q2_network=None, min_action_range_tensor_training=None, max_action_range_tensor_training=None, min_action_range_tensor_serving=None, max_action_range_tensor_serving=None)

Bases: ml.rl.training.rl_trainer_pytorch.RLTrainer

Soft Actor-Critic trainer as described in https://arxiv.org/pdf/1801.01290

The actor is assumed to implement the reparameterization trick.
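The reparameterization trick samples the action as a deterministic function of the policy outputs and independent noise, so gradients flow through the sample. A generic squashed-Gaussian sketch, not the actor network's actual code; mean and log_std below stand in for hypothetical per-state policy outputs:

    import torch

    # Hypothetical policy outputs for a batch of 4 states and a 2-dim action space.
    mean, log_std = torch.zeros(4, 2), torch.zeros(4, 2)

    eps = torch.randn_like(mean)               # noise independent of the parameters
    pre_tanh = mean + log_std.exp() * eps      # reparameterized Gaussian sample
    action = torch.tanh(pre_tanh)              # squashed into (-1, 1)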

internal_prediction(states, test=False)

Returns a list of actions output from the actor network.

:param states: list of states to produce actions for

train(training_batch) → None

IMPORTANT: the input action here is assumed to be preprocessed to match the range of the output of the actor.

warm_start_components()

The trainer should specify what members to save and load

ml.rl.training.sandboxed_predictor module

class ml.rl.training.sandboxed_predictor.SandboxedRLPredictor(pem, ws, predict_net=None)

Bases: ml.rl.training.off_policy_predictor.RLPredictor

classmethod load(db_path, db_type, *args, **kwargs)

Creates Predictor by loading from a database

:param db_path: see load_from_db
:param db_type: see load_from_db

property predict_net
save(db_path, db_type)

Saves network to db

:param db_path: see save_to_db
:param db_type: see save_to_db

ml.rl.training.td3_trainer module

class ml.rl.training.td3_trainer.TD3Trainer(q1_network, actor_network, parameters: ml.rl.parameters.TD3ModelParameters, use_gpu=False, q2_network=None, min_action_range_tensor_training=None, max_action_range_tensor_training=None, min_action_range_tensor_serving=None, max_action_range_tensor_serving=None)

Bases: ml.rl.training.rl_trainer_pytorch.RLTrainer

Twin Delayed Deep Deterministic Policy Gradient algorithm trainer as described in https://arxiv.org/pdf/1802.09477
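A generic sketch of TD3's target computation (clipped double Q-learning plus target-policy smoothing), as described in the paper; every name below is an illustrative placeholder, not an attribute of this trainer:

    import torch

    def td3_target(reward, not_terminal, next_state, actor_target,
                   q1_target, q2_target, gamma=0.99,
                   target_noise=0.2, noise_clip=0.5):
        # Target-policy smoothing: perturb the target actor's action with clipped noise.
        next_action = actor_target(next_state)
        noise = (torch.randn_like(next_action) * target_noise).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-1.0, 1.0)
        # Clipped double Q-learning: take the smaller of the two target critics.
        target_q = torch.min(q1_target(next_state, next_action),
                             q2_target(next_state, next_action))
        # TD target shared by both critics.
        return reward + gamma * not_terminal * target_q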

internal_prediction(states, test=False)

Returns a list of actions output from the actor network.

:param states: list of states to produce actions for

train(training_batch) → None

IMPORTANT: the input action here is assumed to be preprocessed to match the range of the output of the actor.

warm_start_components()

The trainer should specify what members to save and load

ml.rl.training.training_data_page module

class ml.rl.training.training_data_page.TrainingDataPage(mdp_ids: Optional[numpy.ndarray] = None, sequence_numbers: Optional[torch.Tensor] = None, states: Optional[torch.Tensor] = None, actions: Optional[torch.Tensor] = None, propensities: Optional[torch.Tensor] = None, rewards: Optional[torch.Tensor] = None, possible_actions_mask: Optional[torch.Tensor] = None, possible_actions_state_concat: Optional[torch.Tensor] = None, next_states: Optional[torch.Tensor] = None, next_actions: Optional[torch.Tensor] = None, possible_next_actions_mask: Optional[torch.Tensor] = None, possible_next_actions_state_concat: Optional[torch.Tensor] = None, not_terminal: Optional[torch.Tensor] = None, time_diffs: Optional[torch.Tensor] = None, metrics: Optional[torch.Tensor] = None, step: Optional[torch.Tensor] = None, max_num_actions: Optional[int] = None)

Bases: object

actions
as_cem_training_batch(batch_first=False)

Generate the one-step samples needed by the CEM trainer. The samples are used to train the ensemble of world models that CEM plans with.

If batch_first = True:

  state/next state shape: batch_size x 1 x state_dim
  action shape: batch_size x 1 x action_dim
  reward/terminal shape: batch_size x 1

else (default):

  state/next state shape: 1 x batch_size x state_dim
  action shape: 1 x batch_size x action_dim
  reward/terminal shape: 1 x batch_size
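An illustration of the two layouts for a batch of 32 transitions with state_dim = 8 (the numbers are arbitrary):

    import torch

    batch_first_states = torch.zeros(32, 1, 8)               # batch_size x 1 x state_dim
    time_first_states = batch_first_states.transpose(0, 1)   # 1 x batch_size x state_dim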

as_discrete_maxq_training_batch()
as_parametric_maxq_training_batch()
as_policy_network_training_batch()
max_num_actions
mdp_ids
metrics
next_actions
next_states
not_terminal
possible_actions_mask
possible_actions_state_concat
possible_next_actions_mask
possible_next_actions_state_concat
propensities
rewards
sequence_numbers
set_device(device)
set_type(dtype)
size() → int
states
step
time_diffs

Module contents