reagent.gym.envs package

Subpackages

Submodules

reagent.gym.envs.changing_arms module

The traditional MAB setup always has sequence length = 1. In this setup, the distributions of the arms' rewards change every round, and the agent is presented with some information about, and control over, how the arms will change. In particular, the observation includes "mu_changes", the possible changes to mu; only the arm picked by the agent has its mu_change applied. This way, the next state depends only on the previous state and action, so this is an MDP.

The reward for picking an arm is the change in mu corresponding to that arm. With this set-up (ARM_INIT_VALUE = 100 and NUM_ARMS = 5), the optimal policy can accumulate a reward of 500 per run. Note that if the policy picks an illegal action at any time, the game ends.
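For concreteness, the snippet below is a minimal rollout sketch against the underlying gym environment using the generic reset()/step() interface; it assumes only what is described above (5 arms by default, episode ends on an illegal pick), and the random policy is purely illustrative.

    # Minimal rollout sketch; the random policy and printout are illustrative only.
    from reagent.gym.envs.changing_arms import ChangingArmsEnv

    env = ChangingArmsEnv(num_arms=5)
    obs = env.reset()
    done, episode_return = False, 0.0
    while not done:
        # A random (possibly illegal) arm; picking an illegal arm ends the episode.
        action = env.action_space.sample()
        obs, reward, done, info = env.step(action)
        episode_return += reward
    # An optimal policy accumulates a return of about 500 with the defaults above.
    print("episode return:", episode_return)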

class reagent.gym.envs.changing_arms.ChangingArms(num_arms: int = 5)

Bases: reagent.gym.envs.env_wrapper.EnvWrapper

make() gym.core.Env
property normalization_data
num_arms: int = 5
obs_preprocessor(obs: numpy.ndarray) reagent.core.types.FeatureData
serving_obs_preprocessor(obs: numpy.ndarray) reagent.core.types.ServingFeatureData
split_state_transform(elem: torch.Tensor)

Used for generating data.

trainer_preprocessor(obs: torch.Tensor)
class reagent.gym.envs.changing_arms.ChangingArmsEnv(num_arms)

Bases: gym.core.Env

This is just the gym environment, without extra functionality

property action_space
property observation_space

It should really be a Dict space, but everything is returned stacked into a single array since that is more convenient for the replay buffer (RB).
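As a rough illustration of that stacking convention (the ordering follows the state property described below; everything else is illustrative), the three per-arm vectors end up in one array:

    # Illustrative only: three per-arm vectors returned as one stacked array
    # of shape (3, num_arms) instead of a gym.spaces.Dict observation.
    import numpy as np

    num_arms = 5
    mus = np.full(num_arms, 100.0)          # initial mus (ARM_INIT_VALUE = 100)
    legal_mask = np.ones(num_arms)          # legal_indices mask
    mu_changes = np.random.randn(num_arms)  # randomly-generated mu changes
    stacked_obs = np.stack([mus, legal_mask, mu_changes])  # shape (3, num_arms)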

reset()

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

seed(seed: int)

Sets the seed for this env’s random number generator(s).

Note

Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.

Returns

Returns the list of seeds used in this env's random number generators. The first value in the list should be the "main" seed, or the value which a reproducer should pass to 'seed'. Often, the main seed equals the provided 'seed', but this won't be true if seed=None, for example.

Return type

list<bigint>

property state

The state comprises: the initial mus, the legal_indices mask, and the randomly-generated mu changes.

step(action)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

a tuple (observation, reward, done, info), where:

observation (object): agent's observation of the current environment
reward (float): amount of reward returned after the previous action
done (bool): whether the episode has ended, in which case further step() calls will return undefined results
info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple

reagent.gym.envs.changing_arms.get_initial_mus(num_arms)
reagent.gym.envs.changing_arms.get_mu_changes(num_arms)

reagent.gym.envs.env_wrapper module

class reagent.gym.envs.env_wrapper.EnvWrapper

Bases: gym.core.Wrapper

Wrapper around its environment, to simplify configuration.

REGISTRY = {'ChangingArms': <class 'reagent.gym.envs.changing_arms.ChangingArms'>, 'Gym': <class 'reagent.gym.envs.gym.Gym'>, 'OraclePVM': <class 'reagent.gym.envs.oracle_pvm.OraclePVM'>, 'RecSim': <class 'reagent.gym.envs.recsim.RecSim'>, 'ToyVM': <class 'reagent.gym.envs.toy_vm.ToyVM'>}
REGISTRY_FROZEN = True
REGISTRY_NAME = 'EnvWrapper'
action_extractor(actor_output: reagent.core.types.ActorOutput) torch.Tensor
get_action_extractor()
get_obs_preprocessor(*ctor_args, **ctor_kwargs)
get_serving_action_extractor()
get_serving_obs_preprocessor()
abstract make() gym.core.Env
property max_steps: Optional[int]
abstract obs_preprocessor(obs: numpy.ndarray) reagent.core.types.FeatureData
property possible_actions_mask: Optional[numpy.ndarray]
serving_action_extractor(actor_output: reagent.core.types.ActorOutput) torch.Tensor
abstract serving_obs_preprocessor(obs: numpy.ndarray) reagent.core.types.ServingFeatureData
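To show how the abstract methods above fit together, here is a hedged sketch of a custom wrapper. In the actual codebase EnvWrapper subclasses are registered dataclasses (see REGISTRY), and the exact FeatureData/ServingFeatureData constructors should be checked against reagent.core.types, so treat the field names below as assumptions.

    # A hedged sketch of a custom EnvWrapper subclass; not part of ReAgent.
    import gym
    import numpy as np
    import torch
    import reagent.core.types as rlt
    from reagent.gym.envs.env_wrapper import EnvWrapper

    class MyCartPole(EnvWrapper):
        def make(self) -> gym.Env:
            # Build the underlying gym environment.
            return gym.make("CartPole-v0")

        def obs_preprocessor(self, obs: np.ndarray) -> rlt.FeatureData:
            # Convert the raw ndarray into the batched tensor format trainers expect;
            # float_features is an assumed field name.
            return rlt.FeatureData(float_features=torch.tensor(obs).float().unsqueeze(0))

        def serving_obs_preprocessor(self, obs: np.ndarray):
            # Serving-side preprocessing; the (value, presence) layout is an assumption.
            val = torch.tensor(obs).float().unsqueeze(0)
            return val, torch.ones_like(val)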

reagent.gym.envs.gym module

class reagent.gym.envs.gym.Gym(env_name: str, set_max_steps: Optional[int] = None)

Bases: reagent.gym.envs.env_wrapper.EnvWrapper

env_name: str
make() gym.core.Env
obs_preprocessor(obs: numpy.ndarray) reagent.core.types.FeatureData
serving_obs_preprocessor(obs: numpy.ndarray) Tuple[torch.Tensor, torch.Tensor]
set_max_steps: Optional[int] = None
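For example, the wrapper can be pointed at any environment already registered with Gym by name (CartPole-v0 below is just an illustrative choice):

    from reagent.gym.envs.gym import Gym

    env = Gym(env_name="CartPole-v0", set_max_steps=200)
    obs = env.reset()
    feature_data = env.obs_preprocessor(obs)  # reagent.core.types.FeatureData
    obs, reward, done, info = env.step(env.action_space.sample())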

reagent.gym.envs.oracle_pvm module

class reagent.gym.envs.oracle_pvm.OraclePVM(num_candidates: int, slate_size: int, resample_documents: bool = True, single_selection: bool = True, is_interest_exploration: bool = False, initial_seed: int = 1, user_feat_dim: int = 1, candidate_feat_dim: int = 3, num_weights: int = 3)

Bases: reagent.gym.envs.recsim.RecSim

Wrapper over RecSim for simulating (Personalized) VM Tuning. The state is the same as for RecSim (user features + candidate features). There are num_weights VM weights to tune, so the action space is a vector of length num_weights. OraclePVM hides:

  1. num_weights score_fns (akin to VM models), each taking the user + candidate_i features and producing a score for candidate_i;

  2. ground_truth_weights, which are used to produce "ground truth", a.k.a. "Oracle", rankings.

The reward is the Kendall-Tau correlation between the ground-truth ranking and the ranking created from the weights given by action. If the rankings match exactly, the reward is boosted to 3.

NOTE: This environment only tests whether the Agent can learn the hidden ground-truth weights, which may be far from optimal (in terms of RecSim's rewards, which we're ignoring). This is easier for unit tests, but in the real world we would be trying to learn the optimal weights, and the reward signal would reflect that.

TODO: make the environment easier to learn from by not using RecSim.
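As a rough illustration of the reward described above, the following sketch scores a proposed ranking against the oracle ranking with scipy's Kendall tau and applies the exact-match boost of 3; the function name and inputs are illustrative, not the environment's actual implementation.

    # Illustrative reward computation mirroring the description above.
    import numpy as np
    from scipy.stats import kendalltau

    def oracle_pvm_style_reward(oracle_ranking, proposed_ranking):
        if np.array_equal(oracle_ranking, proposed_ranking):
            return 3.0  # exact match is boosted to 3
        tau, _p_value = kendalltau(oracle_ranking, proposed_ranking)
        return float(tau)

    # e.g. oracle_pvm_style_reward([2, 0, 1, 3], [2, 1, 0, 3]) -> Kendall tau of the two rankings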

property action_space
candidate_feat_dim: int = 3
is_match(reward)
num_weights: int = 3
obs_preprocessor(obs: numpy.ndarray) reagent.core.types.FeatureData
reset()

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

serving_obs_preprocessor(obs: numpy.ndarray)
step(action)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

a tuple (observation, reward, done, info), where:

observation (object): agent's observation of the current environment
reward (float): amount of reward returned after the previous action
done (bool): whether the episode has ended, in which case further step() calls will return undefined results
info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple

user_feat_dim: int = 1
reagent.gym.envs.oracle_pvm.get_default_score_fns(num_weights)
reagent.gym.envs.oracle_pvm.get_ground_truth_weights(num_weights)
reagent.gym.envs.oracle_pvm.make_default_score_fn(fn_i: int) Callable[[numpy.ndarray, numpy.ndarray], float]

Makes the i-th score_fn (constructor of the i-th score).

reagent.gym.envs.recsim module

class reagent.gym.envs.recsim.MulticlickIEvUserModel(slate_size, choice_model_ctor=None, response_model_ctor=<class 'recsim.environments.interest_evolution.IEvResponse'>, user_state_ctor=<class 'recsim.environments.interest_evolution.IEvUserState'>, no_click_mass=1.0, seed=0, alpha_x_intercept=1.0, alpha_y_intercept=0.3)

Bases: recsim.environments.interest_evolution.IEvUserModel

simulate_response(documents)

Simulates the user’s response to a slate of documents with choice model.

Parameters

documents – a list of IEvVideo objects

Returns

a list of IEvResponse objects, one for each document

Return type

responses

class reagent.gym.envs.recsim.RecSim(num_candidates: int, slate_size: int, resample_documents: bool = True, single_selection: bool = True, is_interest_exploration: bool = False, initial_seed: int = 1)

Bases: reagent.gym.envs.env_wrapper.EnvWrapper

initial_seed: int = 1
is_interest_exploration: bool = False
make() gym.core.Env
num_candidates: int
obs_preprocessor(obs: numpy.ndarray) reagent.core.types.FeatureData
resample_documents: bool = True
reset(**kwargs)

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

serving_obs_preprocessor(obs: numpy.ndarray)
single_selection: bool = True
slate_size: int
step(action)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

a tuple (observation, reward, done, info), where:

observation (object): agent's observation of the current environment
reward (float): amount of reward returned after the previous action
done (bool): whether the episode has ended, in which case further step() calls will return undefined results
info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple
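A hedged construction example for the RecSim wrapper, using small illustrative values for the constructor arguments shown in the signature above (the recsim package must be installed):

    from reagent.gym.envs.recsim import RecSim

    env = RecSim(num_candidates=10, slate_size=3)
    obs = env.reset()
    action = env.action_space.sample()   # a slate of candidate indices
    obs, reward, done, info = env.step(action)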

class reagent.gym.envs.recsim.UserState(user_interests, time_budget=None, score_scaling=None, attention_prob=None, no_click_mass=None, keep_interact_prob=None, min_doc_utility=None, user_update_alpha=None, watched_videos=None, impressed_videos=None, liked_videos=None, step_penalty=None, min_normalizer=None, user_quality_factor=None, document_quality_factor=None)

Bases: recsim.environments.interest_evolution.IEvUserState

score_document(doc_obs)
reagent.gym.envs.recsim.create_multiclick_environment(env_config)

Creates an interest evolution environment.

reagent.gym.envs.recsim.dot_value_fn(user, doc)
reagent.gym.envs.recsim.multi_selection_value_fn(user, doc)

reagent.gym.envs.toy_vm module

class reagent.gym.envs.toy_vm.ToyVM(slate_size: int = 5, max_episode_steps: int = 100, initial_seed: Optional[int] = None)

Bases: reagent.gym.envs.env_wrapper.EnvWrapper

action_extractor(actor_output)
initial_seed: Optional[int] = None
make()
max_episode_steps: int = 100
obs_preprocessor(obs)
serving_obs_preprocessor(obs)
slate_size: int = 5
class reagent.gym.envs.toy_vm.ToyVMEnv(slate_size: int)

Bases: gym.core.Env

reset()

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

seed(seed: Optional[int] = None)

Sets the seed for this env’s random number generator(s).

Note

Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.

Returns

Returns the list of seeds used in this env's random number generators. The first value in the list should be the "main" seed, or the value which a reproducer should pass to 'seed'. Often, the main seed equals the provided 'seed', but this won't be true if seed=None, for example.

Return type

list<bigint>

step(action)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

a tuple (observation, reward, done, info), where:

observation (object): agent's observation of the current environment
reward (float): amount of reward returned after the previous action
done (bool): whether the episode has ended, in which case further step() calls will return undefined results
info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple

reagent.gym.envs.toy_vm.random_document(prng)
reagent.gym.envs.toy_vm.simulate_reward(slate: List[collections.Document], prng: numpy.random.mtrand.RandomState)
reagent.gym.envs.toy_vm.zero_augment(user, doc)

reagent.gym.envs.utils module

reagent.gym.envs.utils.register_if_not_exists(id, entry_point)

Prevents tests from failing when they try to re-register environments.
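A minimal sketch of how such a helper could be implemented against the classic (pre-0.21) gym registry this package targets; the actual implementation may differ.

    import gym
    from gym.envs.registration import register

    def register_if_not_exists(id, entry_point):
        # Register only if the id is not already known, so repeated imports
        # in tests do not raise a re-registration error.
        if id not in gym.envs.registry.env_specs:
            register(id=id, entry_point=entry_point)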

Module contents

class reagent.gym.envs.ChangingArms(num_arms: int = 5)

Bases: reagent.gym.envs.env_wrapper.EnvWrapper

make() gym.core.Env
property normalization_data
num_arms: int = 5
obs_preprocessor(obs: numpy.ndarray) reagent.core.types.FeatureData
serving_obs_preprocessor(obs: numpy.ndarray) reagent.core.types.ServingFeatureData
split_state_transform(elem: torch.Tensor)

Used for generating data.

trainer_preprocessor(obs: torch.Tensor)
class reagent.gym.envs.Env__Union(ChangingArms: Optional[reagent.gym.envs.changing_arms.ChangingArms] = None, Gym: Optional[reagent.gym.envs.gym.Gym] = None, RecSim: Optional[reagent.gym.envs.recsim.RecSim] = None, OraclePVM: Optional[reagent.gym.envs.oracle_pvm.OraclePVM] = None, ToyVM: Optional[reagent.gym.envs.toy_vm.ToyVM] = None)

Bases: reagent.core.tagged_union.TaggedUnion

ChangingArms: Optional[reagent.gym.envs.changing_arms.ChangingArms] = None
Gym: Optional[reagent.gym.envs.gym.Gym] = None
OraclePVM: Optional[reagent.gym.envs.oracle_pvm.OraclePVM] = None
RecSim: Optional[reagent.gym.envs.recsim.RecSim] = None
ToyVM: Optional[reagent.gym.envs.toy_vm.ToyVM] = None
make_union_instance(instance_class=None)
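Env__Union is used in configs to select exactly one environment; a hedged usage sketch follows, assuming the tagged union exposes the selected member via a value accessor (as ReAgent's TaggedUnion generally does):

    from reagent.gym.envs import Env__Union, Gym

    # Set exactly one field of the union.
    env_config = Env__Union(Gym=Gym(env_name="CartPole-v0"))
    env = env_config.value  # assumed accessor for the selected member
    obs = env.reset()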
class reagent.gym.envs.Gym(env_name: str, set_max_steps: Optional[int] = None)

Bases: reagent.gym.envs.env_wrapper.EnvWrapper

env_name: str
make() gym.core.Env
obs_preprocessor(obs: numpy.ndarray) reagent.core.types.FeatureData
serving_obs_preprocessor(obs: numpy.ndarray) Tuple[torch.Tensor, torch.Tensor]
set_max_steps: Optional[int] = None
class reagent.gym.envs.OraclePVM(num_candidates: int, slate_size: int, resample_documents: bool = True, single_selection: bool = True, is_interest_exploration: bool = False, initial_seed: int = 1, user_feat_dim: int = 1, candidate_feat_dim: int = 3, num_weights: int = 3)

Bases: reagent.gym.envs.recsim.RecSim

Wrapper over RecSim for simulating (Personalized) VM Tuning. The state is the same as for RecSim (user features + candidate features). There are num_weights VM weights to tune, so the action space is a vector of length num_weights. OraclePVM hides:

  1. num_weights score_fns (akin to VM models), each taking the user + candidate_i features and producing a score for candidate_i;

  2. ground_truth_weights, which are used to produce "ground truth", a.k.a. "Oracle", rankings.

The reward is the Kendall-Tau correlation between the ground-truth ranking and the ranking created from the weights given by action. If the rankings match exactly, the reward is boosted to 3.

NOTE: This environment only tests whether the Agent can learn the hidden ground-truth weights, which may be far from optimal (in terms of RecSim's rewards, which we're ignoring). This is easier for unit tests, but in the real world we would be trying to learn the optimal weights, and the reward signal would reflect that.

TODO: make the environment easier to learn from by not using RecSim.

property action_space
candidate_feat_dim: int = 3
is_match(reward)
num_candidates: int
num_weights: int = 3
obs_preprocessor(obs: numpy.ndarray) reagent.core.types.FeatureData
reset()

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

serving_obs_preprocessor(obs: numpy.ndarray)
slate_size: int
step(action)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

a tuple (observation, reward, done, info), where:

observation (object): agent's observation of the current environment
reward (float): amount of reward returned after the previous action
done (bool): whether the episode has ended, in which case further step() calls will return undefined results
info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple

user_feat_dim: int = 1
class reagent.gym.envs.RecSim(num_candidates: int, slate_size: int, resample_documents: bool = True, single_selection: bool = True, is_interest_exploration: bool = False, initial_seed: int = 1)

Bases: reagent.gym.envs.env_wrapper.EnvWrapper

initial_seed: int = 1
is_interest_exploration: bool = False
make() gym.core.Env
num_candidates: int
obs_preprocessor(obs: numpy.ndarray) reagent.core.types.FeatureData
resample_documents: bool = True
reset(**kwargs)

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

serving_obs_preprocessor(obs: numpy.ndarray)
single_selection: bool = True
slate_size: int
step(action)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

a tuple (observation, reward, done, info), where:

observation (object): agent's observation of the current environment
reward (float): amount of reward returned after the previous action
done (bool): whether the episode has ended, in which case further step() calls will return undefined results
info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple

class reagent.gym.envs.ToyVM(slate_size: int = 5, max_episode_steps: int = 100, initial_seed: Optional[int] = None)

Bases: reagent.gym.envs.env_wrapper.EnvWrapper

action_extractor(actor_output)
initial_seed: Optional[int] = None
make()
max_episode_steps: int = 100
obs_preprocessor(obs)
serving_obs_preprocessor(obs)
slate_size: int = 5