reagent.gym package

Subpackages

Submodules

reagent.gym.normalizers module

reagent.gym.normalizers.discrete_action_normalizer(feats)
reagent.gym.normalizers.normalizer_helper(feats, feature_type, min_value=None, max_value=None)
reagent.gym.normalizers.only_continuous_action_normalizer(feats, min_value=None, max_value=None)
reagent.gym.normalizers.only_continuous_normalizer(feats, min_value=None, max_value=None)
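
The sketch below shows one way these helpers might be called. The feature ids, action count, and value range are hypothetical and chosen purely for illustration; the exact structure of the returned normalization parameters is determined by the helpers themselves.

    from reagent.gym.normalizers import (
        discrete_action_normalizer,
        only_continuous_normalizer,
    )

    # Hypothetical feature ids and value range, for illustration only.
    STATE_FEATURE_IDS = [0, 1, 2, 3]
    NUM_ACTIONS = 2

    # Normalization parameters for continuous state features.
    state_normalization = only_continuous_normalizer(
        STATE_FEATURE_IDS, min_value=-1.0, max_value=1.0
    )

    # Normalization parameters for a discrete action space.
    action_normalization = discrete_action_normalizer(list(range(NUM_ACTIONS)))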

reagent.gym.types module

class reagent.gym.types.GaussianSamplerScore(loc: torch.Tensor, scale_log: torch.Tensor)

Bases: reagent.core.base_dataclass.BaseDataClass

loc: torch.Tensor
scale_log: torch.Tensor

reagent.gym.types.PostStep

Called after each env.step(action).

alias of Callable[[reagent.gym.types.Transition], None]

class reagent.gym.types.Sampler

Bases: abc.ABC

Given scores, select the action.

abstract log_prob(scores: Any, action: torch.Tensor) → torch.Tensor
abstract sample_action(scores: Any) → reagent.core.types.ActorOutput
update() → None

Call to update internal parameters (e.g. decay epsilon)
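
To illustrate the Sampler interface and GaussianSamplerScore together, here is a minimal sketch of a Gaussian sampler. The ActorOutput field names (action, log_prob) are assumed from the abstract signatures above; treat this as an illustration, not the library's own sampler.

    import torch

    from reagent.core import types as rlt
    from reagent.gym.types import GaussianSamplerScore, Sampler


    class IllustrativeGaussianSampler(Sampler):
        """Draws actions from Normal(loc, exp(scale_log))."""

        def sample_action(self, scores: GaussianSamplerScore) -> rlt.ActorOutput:
            dist = torch.distributions.Normal(scores.loc, scores.scale_log.exp())
            action = dist.sample()
            log_prob = dist.log_prob(action).sum(dim=-1)
            # Field names assumed from the abstract interface above.
            return rlt.ActorOutput(action=action, log_prob=log_prob)

        def log_prob(self, scores: GaussianSamplerScore, action: torch.Tensor) -> torch.Tensor:
            dist = torch.distributions.Normal(scores.loc, scores.scale_log.exp())
            return dist.log_prob(action).sum(dim=-1)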

reagent.gym.types.TrainerPreprocessor

Transforms a sampled transition batch into the input format expected by the trainer.

alias of Callable[[Any], Any]

class reagent.gym.types.Trajectory(transitions: List[reagent.gym.types.Transition] = <factory>)

Bases: reagent.core.base_dataclass.BaseDataClass

add_transition(transition: reagent.gym.types.Transition)
calculate_cumulative_reward(gamma: float = 1.0)

Return (discounted) sum of rewards.

to_dict()
transitions: List[reagent.gym.types.Transition]

class reagent.gym.types.Transition(mdp_id: int, sequence_number: int, observation: Any, action: Any, reward: float, terminal: bool, log_prob: Optional[float] = None, possible_actions_mask: Optional[numpy.ndarray] = None, info: Optional[Dict] = None)

Bases: reagent.core.base_dataclass.BaseDataClass

action: Any
asdict()
info: Optional[Dict] = None
log_prob: Optional[float] = None
mdp_id: int
observation: Any
possible_actions_mask: Optional[numpy.ndarray] = None
reward: float
sequence_number: int
terminal: bool
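
A short sketch tying Transition and Trajectory together. The observations, actions, and gamma are placeholders, and the expected values in the comments assume the conventional discounted sum of gamma**t * reward_t.

    from reagent.gym.types import Trajectory, Transition

    trajectory = Trajectory()
    for t, reward in enumerate([1.0, 1.0, 1.0]):
        trajectory.add_transition(
            Transition(
                mdp_id=0,
                sequence_number=t,
                observation=[0.0, 0.0],  # placeholder observation
                action=0,                # placeholder action
                reward=reward,
                terminal=(t == 2),
            )
        )

    # Undiscounted: 1 + 1 + 1 = 3.0
    print(trajectory.calculate_cumulative_reward())
    # Discounted with gamma=0.5 (assuming sum of gamma**t * r_t): 1 + 0.5 + 0.25 = 1.75
    print(trajectory.calculate_cumulative_reward(gamma=0.5))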

reagent.gym.types.get_optional_fields(cls) → List[str]

Return the list of Optional-annotated fields.
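
Applied to the Transition dataclass above, for example, it should report the three Optional-annotated fields listed there.

    from reagent.gym.types import Transition, get_optional_fields

    # Expected to contain log_prob, possible_actions_mask and info,
    # the Optional[...] fields listed on Transition above.
    print(get_optional_fields(Transition))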

reagent.gym.utils module

reagent.gym.utils.build_action_normalizer(env: reagent.gym.envs.env_wrapper.EnvWrapper)
reagent.gym.utils.build_normalizer(env: reagent.gym.envs.env_wrapper.EnvWrapper) → Dict[str, reagent.core.parameters.NormalizationData]
reagent.gym.utils.build_state_normalizer(env: reagent.gym.envs.env_wrapper.EnvWrapper)
reagent.gym.utils.create_df_from_replay_buffer(env, problem_domain: reagent.core.parameters.ProblemDomain, desired_size: int, multi_steps: Optional[int], ds: str, shuffle_df: bool = True) → pandas.core.frame.DataFrame
reagent.gym.utils.feature_transform(features, single_elem_transform, is_next_with_multi_steps=False, replace_when_terminal=None, terminal=None)

Applies single_elem_transform, a function that operates on a single row, across a batch. We assume features is List[features] (a batch of features). This can also be called for next_features with multi_steps, in which case features is List[List[features]]: the outer list denotes the batch, and the inner list denotes that a single row consists of a list of per-step features.
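
A minimal sketch of the shapes described above, using an identity function as the per-row transform; the feature values are placeholders, and the multi-step example only illustrates the List[List[features]] nesting.

    import numpy as np

    from reagent.gym.utils import feature_transform

    # Single-step batch: one feature object per row (List[features]).
    features = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
    transformed = feature_transform(features, single_elem_transform=lambda x: x)

    # Multi-step next_features: the outer list is the batch, the inner list
    # holds the per-step features that make up a single row (List[List[features]]).
    next_features_multi_step = [
        [np.array([1.0, 0.0]), np.array([0.5, 0.5])],
        [np.array([0.5, 0.5]), np.array([0.0, 1.0])],
    ]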

reagent.gym.utils.fill_replay_buffer(env, replay_buffer: reagent.replay_memory.circular_replay_buffer.ReplayBuffer, desired_size: int, agent: reagent.gym.agents.agent.Agent)

Fill replay buffer with transitions until size reaches desired_size.
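
A hedged end-to-end sketch of filling a buffer with a random agent. The env name is an example, and the ReplayBuffer keyword arguments (replay_capacity, batch_size) are assumptions about its constructor; check reagent.replay_memory.circular_replay_buffer.ReplayBuffer for the actual signature.

    from reagent.gym import Agent, Gym
    from reagent.gym.utils import fill_replay_buffer
    from reagent.replay_memory.circular_replay_buffer import ReplayBuffer

    env = Gym("CartPole-v0")  # example env name

    # NOTE: constructor kwargs are assumptions; consult ReplayBuffer for the real ones.
    replay_buffer = ReplayBuffer(replay_capacity=10000, batch_size=32)

    # With policy=None, create_for_env falls back to a random policy (see Agent below).
    agent = Agent.create_for_env(env, policy=None)

    fill_replay_buffer(env, replay_buffer, desired_size=1000, agent=agent)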

reagent.gym.utils.set_seed(env: gym.core.Env, seed: int)
reagent.gym.utils.validate_mdp_ids_seq_nums(df)
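
A small sketch of set_seed and build_normalizer on a concrete environment; the env name and seed are arbitrary, and the comment about the dictionary keys is an assumption based on the return annotation.

    from reagent.gym import Gym
    from reagent.gym.utils import build_normalizer, set_seed

    env = Gym("CartPole-v0")  # example env name
    set_seed(env, 42)

    # Dict[str, NormalizationData]; keys are assumed to identify feature groups
    # such as the state and, where applicable, the action.
    normalization = build_normalizer(env)
    print(list(normalization.keys()))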

Module contents

class reagent.gym.Agent(policy: reagent.gym.policies.policy.Policy, post_transition_callback: Optional[Callable[[reagent.gym.types.Transition], None]] = None, post_episode_callback: Optional[Callable[[reagent.gym.types.Trajectory, Dict], None]] = None, obs_preprocessor=<function _id>, action_extractor=<function _id>, device: Optional[torch.device] = None)

Bases: object

act(obs: Any, possible_actions_mask: Optional[numpy.ndarray] = None) → Tuple[Any, Optional[float]]

Act on a single observation

classmethod create_for_env(env: reagent.gym.envs.env_wrapper.EnvWrapper, policy: Optional[reagent.gym.policies.policy.Policy], *, device: Union[str, torch.device] = 'cpu', obs_preprocessor=None, action_extractor=None, **kwargs)

If policy is not given, we will try to create a random policy

classmethod create_for_env_with_serving_policy(env: reagent.gym.envs.env_wrapper.EnvWrapper, serving_policy: reagent.gym.policies.policy.Policy, *, obs_preprocessor=None, action_extractor=None, **kwargs)
post_episode(trajectory: reagent.gym.types.Trajectory, info: Dict)

to be called after the end of an episode

post_step(transition: reagent.gym.types.Transition)

to be called after step(action)
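
To illustrate the Agent API, here is a minimal rollout loop assuming the classic gym reset/step interface; whether the value returned by act can be fed straight to env.step depends on the configured action_extractor.

    from reagent.gym import Agent, Gym

    env = Gym("CartPole-v0")                        # example env name
    agent = Agent.create_for_env(env, policy=None)  # no policy given -> random policy

    obs = env.reset()
    terminal = False
    while not terminal:
        action, log_prob = agent.act(obs)
        obs, reward, terminal, info = env.step(action)
        # A runner would normally invoke agent.post_step(...) here and
        # agent.post_episode(...) once the episode ends.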

class reagent.gym.Gym(env_name: str, set_max_steps: Optional[int] = None)

Bases: reagent.gym.envs.env_wrapper.EnvWrapper

env_name: str
make() → gym.core.Env
obs_preprocessor(obs: numpy.ndarray) → reagent.core.types.FeatureData
serving_obs_preprocessor(obs: numpy.ndarray) → Tuple[torch.Tensor, torch.Tensor]
set_max_steps: Optional[int] = None
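
Finally, a sketch of the Gym wrapper's preprocessing hooks, assuming the classic gym reset API; the env name and step limit are examples.

    import numpy as np

    from reagent.gym import Gym

    env = Gym("CartPole-v0", set_max_steps=200)  # example name and step limit
    raw_obs = env.reset()

    # numpy observation -> reagent.core.types.FeatureData (training-time input).
    feature_data = env.obs_preprocessor(np.asarray(raw_obs))

    # numpy observation -> a pair of tensors (serving-time input, per the signature above).
    serving_obs = env.serving_obs_preprocessor(np.asarray(raw_obs))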