reagent.mab package

Submodules

reagent.mab.mab_algorithm module

class reagent.mab.mab_algorithm.GreedyAlgo(*, n_arms: Optional[int] = None, arm_ids: Optional[List[str]] = None)

Bases: reagent.mab.mab_algorithm.MABAlgo

Greedy algorithm, which always chooses the best arm played so far. Arms that haven’t been played yet are given priority by assigning an inf score. Ties are resolved in favor of the arm with the smallest index.
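
A minimal usage sketch (only the constructor and methods documented on this page are used; the arm IDs and rewards are made up for illustration):

    from reagent.mab.mab_algorithm import GreedyAlgo

    algo = GreedyAlgo(arm_ids=["arm_a", "arm_b"])
    algo.add_single_observation("arm_a", 1.0)
    # "arm_b" has never been played, so per the description above it gets an
    # inf score and should be chosen next
    chosen = algo.get_action()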

forward() torch.Tensor

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class reagent.mab.mab_algorithm.MABAlgo(*, n_arms: Optional[int] = None, arm_ids: Optional[List[str]] = None)

Bases: torch.nn.modules.module.Module, abc.ABC

add_batch_observations(n_obs_per_arm: torch.Tensor, sum_reward_per_arm: torch.Tensor, sum_reward_squared_per_arm: torch.Tensor, arm_ids: Optional[List[str]] = None)
add_single_observation(arm_id: str, reward: float)

Add a single observation (arm played, reward) to the bandit

Parameters
  • arm_id (str) – Which arm was played

  • reward (float) – Reward generated by the arm

abstract forward()

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_action() str

Get the ID of the action chosen by the MAB algorithm

Returns

The ID (string) of the chosen action

Return type

str

get_avg_reward_values() torch.Tensor
classmethod get_scores_from_batch(n_obs_per_arm: torch.Tensor, sum_reward_per_arm: torch.Tensor, sum_reward_squared_per_arm: torch.Tensor) torch.Tensor

A utility method used to create the bandit, feed in a batch of observations and get the scores in one function call

Parameters
  • n_obs_per_arm (Tensor) – A tensor of counts of per-arm numbers of observations

  • sum_reward_per_arm (Tensor) – A tensor of sums of rewards for each arm

  • sum_reward_squared_per_arm (Tensor) – A tensor of sums of squared rewards for each arm

Returns

Array of per-arm scores

Return type

Tensor
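
A hedged sketch of calling this classmethod on one of the concrete algorithms (UCB1 here); the statistics are made-up numbers for a Bernoulli-style reward, where the sum of squared rewards equals the sum of rewards:

    import torch

    from reagent.mab.ucb import UCB1

    n_obs_per_arm = torch.tensor([100.0, 50.0, 10.0])
    sum_reward_per_arm = torch.tensor([60.0, 20.0, 7.0])
    sum_reward_squared_per_arm = torch.tensor([60.0, 20.0, 7.0])
    # Creates the bandit, feeds in the batch, and returns one score per arm
    scores = UCB1.get_scores_from_batch(
        n_obs_per_arm, sum_reward_per_arm, sum_reward_squared_per_arm
    )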

reset()

Reset the MAB to the initial (empty) state.

training: bool
class reagent.mab.mab_algorithm.RandomActionsAlgo(*, n_arms: Optional[int] = None, arm_ids: Optional[List[str]] = None)

Bases: reagent.mab.mab_algorithm.MABAlgo

A MAB algorithm which samples actions uniformly at random

forward() torch.Tensor

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
reagent.mab.mab_algorithm.get_arm_indices(ids_of_all_arms: List[str], ids_of_arms_in_batch: List[str]) List[int]
reagent.mab.mab_algorithm.place_values_at_indices(values: torch.Tensor, idxs: List[int], total_len: int) torch.Tensor
We place the values provided in values at the indices provided in idxs. The values at indices not included in idxs are filled with zeros.

TODO: maybe replace with sparse-to-dense tensor function?

Example

place_values_at_indices(Tensor([4,5]), [2,0], 4) == Tensor([5, 0, 4, 0])

Parameters
  • values (Tensor) – The values

  • idxs (List[int]) – The indices at which the values have to be placed

  • total_len (int) – Length of the output tensor

Returns

The output tensor
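
A minimal re-implementation sketch of the documented behavior (not the library's actual code), useful for understanding the example above:

    import torch

    def place_values_at_indices_sketch(values, idxs, total_len):
        # Start from zeros and scatter the given values into the given positions
        out = torch.zeros(total_len, dtype=values.dtype)
        out[torch.tensor(idxs)] = values
        return out

    # place_values_at_indices_sketch(torch.tensor([4, 5]), [2, 0], 4)
    # -> tensor([5, 0, 4, 0]), matching the documented example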

reagent.mab.mab_algorithm.reindex_multiple_tensors(all_ids: List[str], batch_ids: Optional[List[str]], value_tensors: Tuple[torch.Tensor, ...]) Tuple[torch.Tensor, ...]
Each tensor from value_tensors is ordered by ids from batch_ids. In the output we return these tensors reindexed by all_ids, filling in zeros for missing entries.

Parameters
  • all_ids (List[str]) – The IDs that specify how to order the elements in the output

  • batch_ids (Optional[List[str]]) – The IDs that specify how the elements are ordered in the input

  • value_tensors (Tuple[Tensor]) – A tuple of tensors with elements ordered by batch_ids

Returns

A Tuple of reindexed tensors
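
A hedged usage sketch (made-up IDs and statistics; the expected outputs follow from the description above):

    import torch

    from reagent.mab.mab_algorithm import reindex_multiple_tensors

    all_ids = ["a", "b", "c"]
    batch_ids = ["c", "a"]                    # the batch saw only two of the arms
    n_obs = torch.tensor([3.0, 1.0])          # ordered as batch_ids: c, a
    sum_reward = torch.tensor([2.0, 1.0])
    n_obs_full, sum_reward_full = reindex_multiple_tensors(
        all_ids, batch_ids, (n_obs, sum_reward)
    )
    # Reindexed by all_ids with zeros for the missing arm "b":
    # n_obs_full ~ tensor([1., 0., 3.]), sum_reward_full ~ tensor([1., 0., 2.])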

reagent.mab.simulation module

class reagent.mab.simulation.BernoilliMAB(max_steps: int, probs: torch.Tensor, arm_ids: Optional[List[str]] = None)

Bases: reagent.mab.simulation.MAB

A class that simulates a bandit

Parameters
  • probs – A tensor of per-arm success probabilities

  • max_steps – Max number of steps to simulate. This has to be specified because we pre-generate all the rewards at initialization

act(arm_id: str) float

Sample a reward from a specific arm

Parameters

arm_id – ID of the arm from which the reward is sampled

Returns

Sampled reward
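
A minimal simulation sketch (made-up success probabilities and arm IDs):

    import torch

    from reagent.mab.simulation import BernoilliMAB

    bandit = BernoilliMAB(
        max_steps=100,
        probs=torch.tensor([0.3, 0.7]),
        arm_ids=["a", "b"],
    )
    reward = bandit.act("b")   # 0.0 or 1.0, drawn with success probability 0.7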

class reagent.mab.simulation.MAB(max_steps: int, expected_rewards: torch.Tensor, arm_ids: Optional[List[str]] = None)

Bases: abc.ABC

abstract act(arm_id: str) float
reagent.mab.simulation.compare_bandit_algos(algo_clss: List[Type[reagent.mab.mab_algorithm.MABAlgo]], bandit_cls: Type[reagent.mab.simulation.MAB], n_bandits: int, max_steps: int, algo_kwargs: Optional[Union[Dict, List[Dict]]] = None, bandit_kwargs: Optional[Dict] = None) Tuple[List[str], List[numpy.ndarray]]
Parameters
  • algo_clss – A list of MAB algorithm classes to be evaluated

  • bandit_cls – Bandit class on which we perform evaluations

  • n_bandits – Number of bandit instances among which the results are averaged

  • max_steps – Number of time steps to simulate

  • algo_kwargs – A dict (or list of dicts, one per algorithm class) of kwargs to pass to each algorithm class at initialization

  • bandit_kwargs – A dict of kwargs to pass to bandit_cls at initialization

Returns

A list of evaluated algorithm names (based on class names) and a list of cumulative regret trajectories (one per evaluated algorithm)
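
A hedged sketch of the call shape, assuming compare_bandit_algos forwards algo_kwargs and bandit_kwargs unchanged to the constructors and supplies max_steps to the bandit itself; if the helper already infers the arm set from the bandit, the arm_ids entries below are unnecessary:

    import torch

    from reagent.mab.mab_algorithm import GreedyAlgo
    from reagent.mab.simulation import BernoilliMAB, compare_bandit_algos
    from reagent.mab.ucb import UCB1

    arm_ids = ["a", "b", "c"]
    names, regret_trajectories = compare_bandit_algos(
        algo_clss=[UCB1, GreedyAlgo],
        bandit_cls=BernoilliMAB,
        n_bandits=20,
        max_steps=500,
        algo_kwargs={"arm_ids": arm_ids},
        bandit_kwargs={"probs": torch.tensor([0.2, 0.5, 0.8]), "arm_ids": arm_ids},
    )
    # names are derived from the class names; one cumulative-regret array per algorithm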

reagent.mab.simulation.multiple_evaluations_bandit_algo(algo_cls: Type[reagent.mab.mab_algorithm.MABAlgo], bandit_cls: Type[reagent.mab.simulation.MAB], n_bandits: int, max_steps: int, num_processes: Optional[int] = None, algo_kwargs: Optional[Dict] = None, bandit_kwargs: Optional[Dict] = None) numpy.ndarray

Perform evaluations on multiple bandit instances and aggregate (average) the result

Parameters
  • algo_cls – MAB algorithm class to be evaluated

  • bandit_cls – Bandit class on which we perform evaluations

  • n_bandits – Number of bandit instances among which the results are averaged

  • max_steps – Number of time steps to simulate

  • algo_kwargs – A dict of kwargs to pass to algo_cls at initialization

  • bandit_kwargs – A dict of kwargs to pass to bandit_cls at initialization

Returns

An array of cumulative pseudo-regret values (averaged across multiple bandit instances)

reagent.mab.simulation.single_evaluation_bandit_algo(bandit: reagent.mab.simulation.MAB, algo: reagent.mab.mab_algorithm.MABAlgo) numpy.ndarray

Evaluate a bandit algorithm on a single bandit instance. Pseudo-regret (the difference between the expected values of the best and the chosen actions) is used to minimize the variance of the evaluation

Parameters
  • bandit – Bandit instance on which we evaluate

  • algo – Bandit algorithm to be evaluated

Returns

An array of cumulative pseudo-regret values
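
A minimal evaluation sketch, constructing the bandit and the algorithm with matching arm IDs (an assumption; only the constructors and the call signature documented on this page are used):

    import torch

    from reagent.mab.simulation import BernoilliMAB, single_evaluation_bandit_algo
    from reagent.mab.ucb import UCB1

    arm_ids = ["a", "b", "c"]
    bandit = BernoilliMAB(
        max_steps=200, probs=torch.tensor([0.1, 0.4, 0.9]), arm_ids=arm_ids
    )
    algo = UCB1(arm_ids=arm_ids)
    cumulative_pseudo_regret = single_evaluation_bandit_algo(bandit, algo)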

reagent.mab.thompson_sampling module

class reagent.mab.thompson_sampling.BaseThompsonSampling(*, n_arms: Optional[int] = None, arm_ids: Optional[List[str]] = None)

Bases: reagent.mab.mab_algorithm.MABAlgo

forward()

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class reagent.mab.thompson_sampling.BernoulliBetaThompson(*, n_arms: Optional[int] = None, arm_ids: Optional[List[str]] = None)

Bases: reagent.mab.thompson_sampling.BaseThompsonSampling

The Thompson Sampling MAB with Bernoulli-Beta distribution for rewards. Appropriate for MABs with Bernoulli rewards (e.g., CTR).
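
A minimal usage sketch for binary (click / no-click) rewards; the arm IDs and observations are made up:

    from reagent.mab.thompson_sampling import BernoulliBetaThompson

    algo = BernoulliBetaThompson(arm_ids=["ad_1", "ad_2"])
    algo.add_single_observation("ad_1", 1.0)   # click
    algo.add_single_observation("ad_1", 0.0)   # no click
    algo.add_single_observation("ad_2", 1.0)
    # Thompson sampling draws from the per-arm Beta posteriors, so the choice is random
    chosen = algo.get_action()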

training: bool
class reagent.mab.thompson_sampling.NormalGammaThompson(*, n_arms: Optional[int] = None, arm_ids: Optional[List[str]] = None)

Bases: reagent.mab.thompson_sampling.BaseThompsonSampling

The Thompson Sampling MAB with Normal-Gamma distribution for rewards. Appropriate for MABs with normally distributed rewards. We use the posterior update equations for the Normal-Gamma conjugate prior.

add_batch_observations(n_obs_per_arm: torch.Tensor, sum_reward_per_arm: torch.Tensor, sum_reward_squared_per_arm: torch.Tensor, arm_ids: Optional[List[str]] = None)
add_single_observation(arm_id: str, reward: float)

Add a single observation (arm played, reward) to the bandit

Parameters
  • arm_id (str) – Which arm was played

  • reward (float) – Reward generated by the arm

training: bool

reagent.mab.ucb module

class reagent.mab.ucb.BaseUCB(estimate_variance: bool = True, alpha: float = 1.0, *, n_arms: Optional[int] = None, arm_ids: Optional[List[str]] = None)

Bases: reagent.mab.mab_algorithm.MABAlgo, abc.ABC

Base class for UCB-like Multi-Armed Bandits (MAB)

Parameters
  • estimate_variance – If True, per-arm reward variance is estimated and we multiply the confidence interval width by its square root

  • alpha – Scalar multiplier for confidence interval width. Values above 1.0 make exploration more aggressive, below 1.0 less aggressive
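
A short sketch of how these constructor arguments are typically used on a concrete subclass (UCB1 here); the arm IDs and rewards are made up:

    from reagent.mab.ucb import UCB1

    # alpha > 1.0 widens the confidence interval, making exploration more aggressive
    algo = UCB1(estimate_variance=False, alpha=1.5, arm_ids=["x", "y", "z"])
    algo.add_single_observation("x", 0.7)
    algo.add_single_observation("y", 0.2)
    algo.add_single_observation("z", 0.4)
    scores = algo.get_ucb_scores()   # per-arm UCB scores based on the running statistics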

forward() torch.Tensor

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

abstract get_ucb_scores() torch.Tensor
training: bool
property var
class reagent.mab.ucb.MetricUCB(estimate_variance: bool = True, alpha: float = 1.0, *, n_arms: Optional[int] = None, arm_ids: Optional[List[str]] = None)

Bases: reagent.mab.ucb.BaseUCB

This is an improvement over UCB1 that uses a more precise confidence radius, especially for small expected rewards. This algorithm was designed for Bernoulli reward distributions. Reference: https://arxiv.org/pdf/0809.4882.pdf

get_ucb_scores() torch.Tensor

Get per-arm UCB scores. The formula is UCB_i = AVG([rewards_i]) + SQRT(AVG([rewards_i]) * LN(T+1)/N_i) + LN(T+1)/N_i

Returns

An array of UCB scores (one per arm)

Return type

Tensor

training: bool
class reagent.mab.ucb.UCB1(estimate_variance: bool = True, alpha: float = 1.0, *, n_arms: Optional[int] = None, arm_ids: Optional[List[str]] = None)

Bases: reagent.mab.ucb.BaseUCB

Canonical implementation of UCB1. Reference: https://www.cs.bham.ac.uk/internal/courses/robotics/lectures/ucb1.pdf

get_ucb_scores() torch.Tensor

Get per-arm UCB scores. The formula is UCB_i = AVG([rewards_i]) + SQRT(2*LN(T)/N_i*VAR), where VAR=1 if estimate_variance==False, otherwise VAR=AVG([rewards_i**2]) - AVG([rewards_i])**2

Returns

An array of UCB scores (one per arm)

Return type

Tensor

training: bool
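
A minimal re-implementation sketch of the formula above (not the library's internal code), written directly from the docstring; the placement of alpha as a multiplier on the confidence term follows the BaseUCB description and is an assumption, and every arm is assumed to have at least one observation:

    import torch

    def ucb1_scores_sketch(
        n_obs_per_arm, sum_reward_per_arm, sum_reward_squared_per_arm,
        estimate_variance=True, alpha=1.0,
    ):
        # AVG([rewards_i]) per arm and total observation count T
        avg_reward = sum_reward_per_arm / n_obs_per_arm
        total_obs = n_obs_per_arm.sum()
        if estimate_variance:
            # VAR = AVG([rewards_i**2]) - AVG([rewards_i])**2
            var = sum_reward_squared_per_arm / n_obs_per_arm - avg_reward ** 2
        else:
            var = torch.ones_like(avg_reward)
        # UCB_i = AVG([rewards_i]) + alpha * SQRT(2 * LN(T) / N_i * VAR)
        return avg_reward + alpha * torch.sqrt(
            2 * torch.log(total_obs) / n_obs_per_arm * var
        )
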
reagent.mab.ucb.get_bernoulli_tuned_ucb_scores(n_obs_per_arm, num_success_per_arm)

Module contents