reagent.lite package

Submodules

reagent.lite.optimizer module

class reagent.lite.optimizer.BayesianMLPEnsemblerOptimizer(param: nevergrad.parametrization.container.Dict, start_temp: float = 1.0, min_temp: float = 0.0, obj_func: Optional[Callable[[Dict[str, torch.Tensor]], torch.Tensor]] = None, acq_type: str = 'its', mutation_type: str = 'random', anneal_rate: float = 0.9997, num_mutations: int = 50, epochs: int = 1, learning_rate: float = 0.001, batch_size: int = 512, obj_exp_offset_scale: Optional[Tuple[float, float]] = None, model_dim: int = 128, num_ensemble: int = 5)

Bases: reagent.lite.optimizer.BayesianOptimizer

Bayesian optimizer with an ensemble of MLP networks, random mutation, and ITS. The method is motivated by the BANANAS optimization method, White, 2019. https://arxiv.org/abs/1910.11858.

The mutation rate (temp) starts at start_temp and decreases over time according to anneal_rate; its lowest possible value is min_temp. Thus, the algorithm initially explores with a higher mutation rate (more variables are randomly mutated), and as time passes it exploits the best solutions recorded so far (fewer variables are mutated).
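
A minimal sketch of this annealing schedule, assuming the multiplicative update temp = max(temp * anneal_rate, min_temp) applied once per optimization step (the exact update is internal to the optimizer):

>>> temp, anneal_rate, min_temp = 1.0, 0.9997, 0.0
>>> for _ in range(1000):
...     temp = max(temp * anneal_rate, min_temp)
...
>>> round(temp, 4)
0.7408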

Parameters
  • param (ng.p.Dict) – a nevergrad dictionary for specifying input choices

  • obj_func (Callable[[Dict[str, torch.Tensor]], torch.Tensor]) –

    a function which consumes sampled solutions and returns rewards as tensors of shape (batch_size, 1).

    The input dictionary has choice names as the key and sampled choice indices as the value (of shape (batch_size, ))

  • acq_type (str) – type of acquisition function.

  • mutation_type (str) – type of mutation, e.g., random.

  • num_mutations (int) – number of best solutions recorded so far that will be mutated.

  • num_ensemble (int) – number of predictors.

  • start_temp (float) – initial temperature (ratio) for mutation, e.g., with 1.0 all variables will initially be mutated.

  • min_temp (float) – lowest temperature (ratio) for mutation, e.g., with 0.0 no mutation will occur.
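
A minimal usage sketch, following the pattern of the examples given for the other optimizers in this module. No assertion is made on the sampled result, since convergence depends on the seed and hyperparameters:

>>> _ = torch.manual_seed(0)
>>> np.random.seed(0)
>>> BATCH_SIZE = 4
>>> ng_param = ng.p.Dict(choice1=ng.p.Choice(["blue", "green", "red"]))
>>>
>>> def obj_func(sampled_sol: Dict[str, torch.Tensor]):
...     reward = torch.ones(BATCH_SIZE, 1)
...     for i in range(BATCH_SIZE):
...         # the best action is "red"
...         if sampled_sol['choice1'][i] == 2:
...             reward[i, 0] = 0.0
...     return reward
...
>>> optimizer = BayesianMLPEnsemblerOptimizer(
...     ng_param, obj_func=obj_func, batch_size=BATCH_SIZE
... )
>>> for i in range(10):
...     res = optimizer.optimize_step()
...
>>> best_reward, best_choice = optimizer.best_solutions(k=1)[0]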

sample_internal(batch_size: Optional[int] = None) Tuple[Dict[str, torch.Tensor]]

Record and return sampled solutions and any other important information for learning.

It samples self.batch_size solutions, unless batch_size is provided.

update_params(reward: torch.Tensor)

Update model parameters based on the reward, i.e., the objective function values evaluated on the solutions sampled by sample_internal().

update_predictor(sampled_solutions: Dict[str, torch.Tensor], sampled_reward: torch.Tensor) List[float]
class reagent.lite.optimizer.BayesianOptimizer(param: nevergrad.parametrization.container.Dict, start_temp: float, min_temp: float, obj_func: Optional[Callable[[Dict[str, torch.Tensor]], torch.Tensor]] = None, acq_type: str = 'its', mutation_type: str = 'random', anneal_rate: float = 0.9997, batch_size: int = 512, obj_exp_offset_scale: Optional[Tuple[float, float]] = None)

Bases: reagent.lite.optimizer.ComboOptimizerBase

Bayesian optimization with mutation-based search and an acquisition function. The method is motivated by BANANAS, White, 2019. https://arxiv.org/abs/1910.11858

In this method, the search is based on mutations of the current best solutions. The acquisition function, e.g., its (independent Thompson sampling), estimates the expected improvement.
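
A rough sketch of how an ITS-style acquisition can rank mutated candidates with an ensemble of reward predictors. This only illustrates the idea (draw one ensemble member's prediction per candidate, then prefer the lowest predicted reward); it is not necessarily the exact computation performed by acquisition() below, and preds is a hypothetical prediction matrix:

>>> import torch
>>> num_candidates, num_ensemble = 6, 5
>>> preds = torch.randn(num_ensemble, num_candidates)  # hypothetical predicted rewards
>>> member = torch.randint(num_ensemble, (num_candidates,))  # one random ensemble member per candidate
>>> sampled_preds = preds[member, torch.arange(num_candidates)]
>>> best_candidate = torch.argmin(sampled_preds)  # lower (predicted) reward is better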

Parameters
  • param (ng.p.Dict) – a nevergrad dictionary for specifying input choices

  • obj_func (Callable[[Dict[str, torch.Tensor]], torch.Tensor]) –

    a function which consumes sampled solutions and returns rewards as tensors of shape (batch_size, 1).

    The input dictionary has choice names as the key and sampled choice indices as the value (of shape (batch_size, ))

  • acq_type (str) – type of acquisition function.

  • mutation_type (str) – type of mutation, e.g., random.

  • temp (float) – mutation ratio, i.e., the fraction of variables that will be mutated.

acquisition(acq_type: str, sampled_sol: Dict[str, torch.Tensor], predictor: List[torch.nn.modules.module.Module]) torch.Tensor
sample(batch_size: int, temp: Optional[float] = None) Dict[str, torch.Tensor]

Apply a type of mutation, e.g., random mutation, to the best solutions recorded so far. For example, with random mutation, variables are randomly selected and their values are randomly reset within their domains.

class reagent.lite.optimizer.BestResultsQueue(max_len: int)

Bases: object

Maintain the max_len results with the lowest rewards seen so far

insert(reward: torch.Tensor, sol: Dict[str, torch.Tensor]) None
topk(k: int) List[Tuple[torch.Tensor, Dict[str, torch.Tensor]]]
class reagent.lite.optimizer.ComboOptimizerBase(param: nevergrad.parametrization.container.Dict, obj_func: Optional[Callable[[Dict[str, torch.Tensor]], torch.Tensor]] = None, batch_size: int = 512, obj_exp_offset_scale: Optional[Tuple[float, float]] = None)

Bases: object

best_solutions(k: int = 1) List[Tuple[torch.Tensor, Dict[str, torch.Tensor]]]

Return the k solutions with the smallest rewards, as a list of tuples (reward, solution).

indices_to_raw_choices(sampled_sol: Dict[str, torch.Tensor]) List[Dict[str, str]]
optimize_step() Tuple
sample(batch_size: int, temp: Optional[float] = None) Dict[str, torch.Tensor]

Return sampled solutions, keyed by parameter names. For discrete parameters, the values are choice indices; for continuous parameters, the values are sampled float vectors.

This function is usually called after learning is done.

abstract sample_internal(batch_size: Optional[int] = None) Tuple

Record and return sampled solutions and any other important information for learning.

It samples self.batch_size solutions, unless batch_size is provided.

abstract update_params(reward: torch.Tensor) None

Update model parameters based on the reward, i.e., the objective function values evaluated on the solutions sampled by sample_internal().
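
When obj_func is not given at construction time, the same protocol can be driven manually: call sample_internal(), evaluate the sampled solutions externally, then feed the rewards back via update_params(). A minimal sketch, assuming (as in the concrete optimizers documented on this page) that the sampled-solution dictionary is the first element of the returned tuple; the evaluation function here merely stands in for an external system:

>>> BATCH_SIZE = 4
>>> ng_param = ng.p.Dict(choice1=ng.p.Choice(["blue", "green", "red"]))
>>> def external_eval(sampled_sol):  # stand-in for an external evaluation system
...     return (sampled_sol['choice1'] != 2).float().unsqueeze(1)
...
>>> optimizer = RandomSearchOptimizer(ng_param, batch_size=BATCH_SIZE)
>>> for _ in range(10):
...     sampled_sol = optimizer.sample_internal()[0]
...     reward = external_eval(sampled_sol)
...     optimizer.update_params(reward)
...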

class reagent.lite.optimizer.GumbelSoftmaxOptimizer(param: nevergrad.parametrization.container.Dict, obj_func: Optional[Callable[[Dict[str, torch.Tensor]], torch.Tensor]] = None, start_temp: float = 1.0, min_temp: float = 0.1, learning_rate: float = 0.001, anneal_rate: float = 0.9997, batch_size: int = 512, update_params_within_optimizer: bool = True)

Bases: reagent.lite.optimizer.LogitBasedComboOptimizerBase

Minimize a differentiable objective function which takes in categorical inputs. The method is based on Categorical Reparameterization with Gumbel-Softmax, Jang, Gu, & Poole, 2016. https://arxiv.org/abs/1611.01144.

Parameters
  • param (ng.p.Dict) – a nevergrad dictionary for specifying input choices

  • obj_func (Callable[[Dict[str, torch.Tensor]], torch.Tensor]) –

    an analytical function which consumes sampled solutions and returns rewards as tensors of shape (batch_size, 1).

    The input dictionary has choice names as the key and sampled gumbel-softmax distributions of shape (batch_size, num_choices) as the value

  • start_temp – starting temperature

  • min_temp – minimal temperature (towards the end of learning) for sampling gumbel-softmax

  • update_params_within_optimizer (bool) – If False, skip updating parameters within this Optimizer. The Gumbel-softmax parameters will be updated in external systems.

Example

>>> _ = torch.manual_seed(0)
>>> np.random.seed(0)
>>> BATCH_SIZE = 4
>>> ng_param = ng.p.Dict(choice1=ng.p.Choice(["blue", "green", "red"]))
>>>
>>> def obj_func(sampled_sol: Dict[str, torch.Tensor]):
...     # best action is "red"
...     reward = torch.mm(sampled_sol['choice1'], torch.tensor([[1.], [1.], [0.]]))
...     return reward
...
>>> optimizer = GumbelSoftmaxOptimizer(
...     ng_param, obj_func, anneal_rate=0.9, batch_size=BATCH_SIZE, learning_rate=0.1
... )
...
>>> for i in range(30):
...     res = optimizer.optimize_step()
...
>>> assert optimizer.sample(1)['choice1'] == 2
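
As the temperature anneals from start_temp toward min_temp, the sampled relaxations move from soft distributions toward (nearly) one-hot vectors. A standalone sketch of this effect using torch.nn.functional.gumbel_softmax, shown purely for illustration (this module also provides its own gumbel_softmax helper, documented at the bottom of this page):

>>> import torch.nn.functional as F
>>> logits = torch.tensor([[1.0, 2.0, 3.0]])
>>> soft = F.gumbel_softmax(logits, tau=1.0)   # relatively spread-out sample
>>> sharp = F.gumbel_softmax(logits, tau=0.1)  # close to a one-hot sample
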
sample_internal(batch_size: Optional[int] = None) Tuple[Dict[str, torch.Tensor]]

Record and return sampled solutions and any other important information for learning.

It samples self.batch_size solutions, unless batch_size is provided.

update_params(reward: torch.Tensor) None

Update model parameters based on the reward, i.e., the objective function values evaluated on the solutions sampled by sample_internal().

class reagent.lite.optimizer.LogitBasedComboOptimizerBase(param: nevergrad.parametrization.container.Dict, start_temp: float, min_temp: float, obj_func: Optional[Callable[[Dict[str, torch.Tensor]], torch.Tensor]] = None, learning_rate: float = 0.001, anneal_rate: float = 0.9997, batch_size: int = 512, obj_exp_offset_scale: Optional[Tuple[float, float]] = None)

Bases: reagent.lite.optimizer.ComboOptimizerBase

sample(batch_size: int, temp: Optional[float] = 0.0001) Dict[str, torch.Tensor]

Return sampled solutions, keyed by parameter names. For discrete parameters, the values are choice indices; for continuous parameters, the values are sampled float vectors.

This function is usually called after learning is done.

class reagent.lite.optimizer.NeverGradOptimizer(param: nevergrad.parametrization.container.Dict, estimated_budgets: int, obj_func: Optional[Callable[[Dict[str, torch.Tensor]], torch.Tensor]] = None, batch_size: int = 512, optimizer_name: Optional[str] = None)

Bases: reagent.lite.optimizer.ComboOptimizerBase

Minimize a black-box function using NeverGrad, Rapin & Teytaud, 2018. https://facebookresearch.github.io/nevergrad/.

Parameters
  • param (ng.p.Dict) – a nevergrad dictionary for specifying input choices

  • estimated_budgets (int) – estimated budget (number of objective function evaluations) for nevergrad to perform auto-tuning.

  • obj_func (Callable[[Dict[str, torch.Tensor]], torch.Tensor]) –

    a function which consumes sampled solutions and returns rewards as tensors of shape (batch_size, 1).

    The input dictionary has choice names as the key and sampled choice indices as the value (of shape (batch_size, ))

  • optimizer_name (Optional[str]) – the specific nevergrad optimizer to use. All available nevergrad optimizers are listed at: https://facebookresearch.github.io/nevergrad/optimization.html#choosing-an-optimizer. If not specified, the meta optimizer NGOpt is used (see the sketch after the example below).

Example

>>> _ = torch.manual_seed(0)
>>> np.random.seed(0)
>>> BATCH_SIZE = 4
>>> ng_param = ng.p.Dict(choice1=ng.p.Choice(["blue", "green", "red"]))
>>>
>>> def obj_func(sampled_sol: Dict[str, torch.Tensor]):
...     reward = torch.ones(BATCH_SIZE, 1)
...     for i in range(BATCH_SIZE):
...         # the best action is "red"
...         if sampled_sol['choice1'][i] == 2:
...             reward[i, 0] = 0.0
...     return reward
...
>>> estimated_budgets = 40
>>> optimizer = NeverGradOptimizer(
...    ng_param, estimated_budgets, obj_func, batch_size=BATCH_SIZE,
... )
>>>
>>> for i in range(10):
...     res = optimizer.optimize_step()
...
>>> best_reward, best_choice = optimizer.best_solutions(k=1)[0]
>>> assert best_reward == 0
>>> assert best_choice['choice1'] == 2
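
To pin a specific nevergrad algorithm instead of relying on the NGOpt meta optimizer, the optimizer_name parameter described above can be set to any name from nevergrad's optimizer registry. A sketch reusing the names from the example above ("OnePlusOne" is one of the registered nevergrad optimizers):

>>> optimizer = NeverGradOptimizer(
...     ng_param, estimated_budgets, obj_func, batch_size=BATCH_SIZE,
...     optimizer_name="OnePlusOne",
... )
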
sample(batch_size: int, temp: Optional[float] = None) Dict[str, torch.Tensor]

Return sampled solutions, keyed by parameter names. For discrete parameters, the values are choice indices; for continuous parameters, the values are sampled float vectors.

This function is usually called after learning is done.

sample_internal(batch_size: Optional[int] = None) Tuple

Return sampled solutions in two formats:

  1. our own format, which is a dictionary consistent with the other optimizers. The dictionary has choice names as the key and sampled choice indices as the value (of shape (batch_size, ))

  2. the nevergrad format returned by optimizer.ask()
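
A sketch of consuming both formats, reusing the optimizer from the example above. The ordering is an assumption based on the description here (dictionary format first, nevergrad candidates second):

>>> sampled_sol, ng_sampled_sol = optimizer.sample_internal(batch_size=2)
>>> # sampled_sol: {'choice1': tensor of 2 sampled choice indices}
>>> # ng_sampled_sol: the corresponding candidates obtained from the wrapped optimizer's ask()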

update_params(reward: torch.Tensor) None

Update model parameters based on the reward, i.e., the objective function values evaluated on the solutions sampled by sample_internal().

class reagent.lite.optimizer.PolicyGradientOptimizer(param: nevergrad.parametrization.container.Dict, obj_func: Optional[Callable[[Dict[str, torch.Tensor]], torch.Tensor]] = None, start_temp: float = 1.0, min_temp: float = 1.0, learning_rate: float = 0.001, anneal_rate: float = 0.9997, batch_size: int = 512, obj_exp_offset_scale: Optional[Tuple[float, float]] = None)

Bases: reagent.lite.optimizer.LogitBasedComboOptimizerBase

Minimize a black-box objective function which takes in categorical inputs. The method is based on REINFORCE, Williams, 1992. https://link.springer.com/article/10.1007/BF00992696

In this method, the action distribution is a joint distribution of multiple independent softmax distributions, each corresponding to one discrete choice type.
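
A sketch of what this factorized distribution looks like for two hypothetical choice types; the optimizer learns one logit vector per choice name, and a joint sample is one independent categorical draw per choice:

>>> import torch
>>> logits = {
...     'choice1': torch.tensor([0.1, 0.2, 0.7]),
...     'choice2': torch.tensor([0.5, 0.5]),
... }
>>> sample = {
...     k: torch.multinomial(torch.softmax(v, dim=-1), num_samples=1)
...     for k, v in logits.items()
... }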

Parameters
  • param (ng.p.Dict) – a nevergrad dictionary for specifying input choices

  • obj_func (Callable[[Dict[str, torch.Tensor]], torch.Tensor]) –

    a function which consumes sampled solutions and returns rewards as tensors of shape (batch_size, 1).

    The input dictionary has choice names as the key and sampled choice indices as the value (of shape (batch_size, ))

Example

>>> _ = torch.manual_seed(0)
>>> np.random.seed(0)
>>> BATCH_SIZE = 16
>>> ng_param = ng.p.Dict(choice1=ng.p.Choice(["blue", "green", "red"]))
>>>
>>> def obj_func(sampled_sol: Dict[str, torch.Tensor]):
...     reward = torch.ones(BATCH_SIZE, 1)
...     for i in range(BATCH_SIZE):
...         # the best action is "red"
...         if sampled_sol['choice1'][i] == 2:
...             reward[i, 0] = 0.0
...     return reward
...
>>> optimizer = PolicyGradientOptimizer(
...     ng_param, obj_func, batch_size=BATCH_SIZE, learning_rate=0.1
... )
>>> for i in range(30):
...    res = optimizer.optimize_step()
...
>>> best_reward, best_choice = optimizer.best_solutions(k=1)[0]
>>> assert best_reward == 0
>>> assert best_choice['choice1'] == 2
>>> assert optimizer.sample(1)['choice1'] == 2
sample(batch_size: int, temp: Optional[float] = 0.0001) Dict[str, torch.Tensor]

Return sampled solutions, keyed by parameter names. For discrete parameters, the values are choice indices; for continuous parameters, the values are sampled float vectors.

This function is usually called after learning is done.

sample_internal(batch_size: Optional[int] = None) Tuple[Dict[str, torch.Tensor], torch.Tensor]

Record and return sampled solutions and any other important information for learning.

It samples self.batch_size solutions, unless batch_size is provided.

update_params(reward: torch.Tensor)

Update model parameters based on the reward, i.e., the objective function values evaluated on the solutions sampled by sample_internal().

class reagent.lite.optimizer.QLearningOptimizer(param: nevergrad.parametrization.container.Dict, obj_func: Optional[Callable[[Dict[str, torch.Tensor]], torch.Tensor]] = None, start_temp: float = 1.0, min_temp: float = 0.1, learning_rate: float = 0.001, anneal_rate: float = 0.9997, batch_size: int = 512, model_dim: int = 128, obj_exp_offset_scale: Optional[Tuple[float, float]] = None, num_batches_per_learning: int = 10, replay_size: int = 100)

Bases: reagent.lite.optimizer.ComboOptimizerBase

Treat the problem of minimizing a black-box function as a sequential decision problem, and solve it by Deep Q-Learning. See “Human-Level Control through Deep Reinforcement Learning”, Mnih et al., 2015. https://www.nature.com/articles/nature14236.

In each episode step, Q-learning makes a decision for one categorical input. The reward is given only at the end of the episode, which is the value of the black-box function at the input determined by the choices made at all steps.
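
Conceptually, a parameter dictionary with n choice types yields an episode of n steps. A runnable sketch of that episode structure (the random pick stands in for the epsilon-greedy decision of the q-network, and the dictionary of choice counts is hypothetical):

>>> import random
>>> num_choices = {'choice1': 3, 'choice2': 2}  # hypothetical problem with two choice types
>>> solution = {}
>>> for name, n in num_choices.items():  # one decision per step; intermediate rewards are 0
...     solution[name] = random.randrange(n)
...
>>> # only now, at the end of the episode, is the reward observed:
>>> # reward = black-box objective value at `solution`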

Parameters
  • param (ng.p.Dict) – a nevergrad dictionary for specifying input choices

  • start_temp (float) – the starting exploration rate in epsilon-greedy sampling

  • min_temp (float) – the minimal exploration rate in epsilon-greedy

  • obj_func (Callable[[Dict[str, torch.Tensor]], torch.Tensor]) –

    a function which consumes sampled solutions and returns rewards as tensors of shape (batch_size, 1).

    The input dictionary has choice names as the key and sampled choice indices as the value (of shape (batch_size, ))

  • model_dim (int) – hidden layer size for the q-network: input -> model_dim -> model_dim -> output

  • num_batches_per_learning (int) – the number of batches sampled from replay buffer for q-learning.

  • replay_size (int) – the maximum number of batches held in the replay buffer. Note that a problem instance with n choices will generate n batches in the replay buffer.

Example

>>> _ = torch.manual_seed(0)
>>> np.random.seed(0)
>>> BATCH_SIZE = 4
>>> ng_param = ng.p.Dict(choice1=ng.p.Choice(["blue", "green", "red"]))
>>>
>>> def obj_func(sampled_sol: Dict[str, torch.Tensor]):
...     reward = torch.ones(BATCH_SIZE, 1)
...     for i in range(BATCH_SIZE):
...         # the best action is "red"
...         if sampled_sol['choice1'][i] == 2:
...             reward[i, 0] = 0.0
...     return reward
...
>>> optimizer = QLearningOptimizer(ng_param, obj_func, batch_size=BATCH_SIZE)
>>> for i in range(10):
...     res = optimizer.optimize_step()
...
>>> best_reward, best_choice = optimizer.best_solutions(k=1)[0]
>>> assert best_reward == 0
>>> assert best_choice['choice1'] == 2
>>> assert optimizer.sample(1)['choice1'] == 2
sample(batch_size: int, temp: Optional[float] = 0.0001) Dict[str, torch.Tensor]

Return sampled solutions, keyed by parameter names. For discrete parameters, the values are choice indices; for continuous parameters, the values are sampled float vectors.

This function is usually called after learning is done.

sample_internal(batch_size: Optional[int] = None) Tuple[Dict[str, torch.Tensor], List[Any]]

Record and return sampled solutions and any other important information for learning.

It samples self.batch_size solutions, unless batch_size is provided.

update_params(reward: torch.Tensor) None

Update model parameters based on the reward, i.e., the objective function values evaluated on the solutions sampled by sample_internal().

class reagent.lite.optimizer.RandomSearchOptimizer(param: nevergrad.parametrization.container.Dict, obj_func: Optional[Callable[[Dict[str, torch.Tensor]], torch.Tensor]] = None, batch_size: int = 512, sampling_weights: Optional[Dict[str, numpy.ndarray]] = None)

Bases: reagent.lite.optimizer.ComboOptimizerBase

Find the best solution to minimize a black-box function by random search

Parameters
  • param (ng.p.Dict) – a nevergrad dictionary for specifying input choices

  • obj_func (Callable[[Dict[str, torch.Tensor]], torch.Tensor]) –

    a function which consumes sampled solutions and returns rewards as tensors of shape (batch_size, 1).

    The input dictionary has choice names as the key and sampled choice indices as the value (of shape (batch_size, ))

  • sampling_weights (Optional[Dict[str, np.ndarray]]) – Instead of uniform sampling, sample solutions with the given preference weights (see the sketch after the example below). Key: choice name; value: sampling weights.

Example

>>> _ = torch.manual_seed(0)
>>> np.random.seed(0)
>>> BATCH_SIZE = 4
>>> ng_param = ng.p.Dict(choice1=ng.p.Choice(["blue", "green", "red"]))
>>>
>>> def obj_func(sampled_sol: Dict[str, torch.Tensor]):
...     reward = torch.ones(BATCH_SIZE, 1)
...     for i in range(BATCH_SIZE):
...         # the best action is "red"
...         if sampled_sol['choice1'][i] == 2:
...             reward[i, 0] = 0.0
...     return reward
...
>>> optimizer = RandomSearchOptimizer(ng_param, obj_func, batch_size=BATCH_SIZE)
>>> for i in range(10):
...     res = optimizer.optimize_step()
...
>>> best_reward, best_choice = optimizer.best_solutions(k=1)[0]
>>> assert best_reward == 0
>>> assert best_choice['choice1'] == 2
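
To bias the search toward particular values rather than sampling uniformly, per-choice weights can be passed through the sampling_weights parameter described above. A sketch reusing the names from the example (the weights are arbitrary and favor "red"):

>>> weighted_optimizer = RandomSearchOptimizer(
...     ng_param,
...     obj_func,
...     batch_size=BATCH_SIZE,
...     sampling_weights={'choice1': np.array([0.1, 0.1, 0.8])},
... )
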
sample(batch_size: int, temp: Optional[float] = None) Dict[str, torch.Tensor]

Return sampled solutions, keyed by parameter names. For discrete parameters, the values are choice indices; for continuous parameters, the values are sampled float vectors.

This function is usually called after learning is done.

sample_internal(batch_size: Optional[int] = None) Tuple[Dict[str, torch.Tensor]]

Record and return sampled solutions and any other important information for learning.

It samples self.batch_size solutions, unless batch_size is provided.

update_params(reward: torch.Tensor)

Update model parameters based on the reward, i.e., the objective function values evaluated on the solutions sampled by sample_internal().

reagent.lite.optimizer.gumbel_softmax(logits: torch.Tensor, temperature: float) torch.Tensor
reagent.lite.optimizer.obj_func_scaler(obj_func: Optional[Callable[[Dict[str, torch.Tensor]], torch.Tensor]], exp_offset_and_scale: Optional[Tuple[float, float]]) Optional[Callable]

Scale the objective function to help optimizers escape local minima more easily.

The scaling formula is: exp((reward - offset) / scale)

If obj_exp_offset_scale is None, the objective function is not scaled (i.e., reward == scaled_reward).
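
A worked sketch of the scaling for a single reward value (plain arithmetic, independent of the optimizer internals):

>>> import math
>>> reward, offset, scale = 2.0, 1.0, 0.5
>>> scaled_reward = math.exp((reward - offset) / scale)
>>> round(scaled_reward, 4)
7.3891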

reagent.lite.optimizer.sample_from_logits(keyed_logits: Dict[str, torch.nn.parameter.Parameter], batch_size: int, temp: float) Tuple[Dict[str, torch.Tensor], torch.Tensor]

Return sampled solutions and sampled log probabilities

reagent.lite.optimizer.sample_gumbel(shape: Tuple[int, ...], eps: float = 1e-20) torch.Tensor
reagent.lite.optimizer.shuffle_exp_replay(exp_replay: List[Any]) Any
reagent.lite.optimizer.sol_to_tensors(sampled_sol: Dict[str, torch.Tensor], input_param: nevergrad.parametrization.container.Dict) torch.Tensor

Module contents