reagent.gym.envs.pomdp package

Submodules

reagent.gym.envs.pomdp.pocman module

The Pocman environment was first introduced in Monte-Carlo Planning in Large POMDPs by Silver and Veness (2010): https://papers.nips.cc/paper/4031-monte-carlo-planning-in-large-pomdps.pdf

class reagent.gym.envs.pomdp.pocman.Action

Bases: object

DOWN = 2
LEFT = 3
RIGHT = 1
UP = 0
class reagent.gym.envs.pomdp.pocman.Element

Bases: object

CLEAR_WALK_WAY = 1
FOOD_PELLET = 3
POWER = 2
WALL = 0
class reagent.gym.envs.pomdp.pocman.Ghost(env, pos, direction, ghost_range)

Bases: object

home: reagent.gym.envs.pomdp.pocman.Position
max_x: int
max_y: int
move(agent_pos, agent_in_power)
reset()
update(pos, direction)
class reagent.gym.envs.pomdp.pocman.InternalState

Bases: object

class reagent.gym.envs.pomdp.pocman.PocManEnv

Bases: gym.core.Env

next_pos(pos, action)
static print_action(action)
print_internal_state()
print_ob(ob)
static random_action()
reset()

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

seed(seed=None)

Sets the seed for this env’s random number generator(s).

Note

Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.

Returns

the list of seeds used in this env’s random number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.

Return type

list<bigint>

step(action)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

observation (object) – agent’s observation of the current environment
reward (float) – amount of reward returned after previous action
done (bool) – whether the episode has ended, in which case further step() calls will return undefined results
info (dict) – contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple (observation, reward, done, info)
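
Taken together, reset() and step() support the usual episode rollout loop. The following minimal sketch assumes an already-constructed PocManEnv (the class signature above lists no constructor parameters) and uses the random_action() helper listed above; construction details are omitted.

    def rollout(env, max_steps=100):
        # env is assumed to be an already-constructed PocManEnv (or any gym.Env
        # exposing a random_action() helper).
        ob = env.reset()                      # initial observation
        total_reward = 0.0
        for _ in range(max_steps):
            action = env.random_action()      # static helper listed above
            ob, reward, done, info = env.step(action)
            total_reward += reward
            if done:                          # episode over; a new one needs reset()
                break
        return total_reward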

class reagent.gym.envs.pomdp.pocman.Position(x: int, y: int)

Bases: NamedTuple

The index of the top-left corner is (0, 0); the index of the bottom-right corner is (height-1, width-1).

x: int

Alias for field number 0

y: int

Alias for field number 1

reagent.gym.envs.pomdp.pocman.manhattan_distance(c1, c2)
reagent.gym.envs.pomdp.pocman.opposite_direction(d)
reagent.gym.envs.pomdp.pocman.select_maze(maze)
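
The module-level helpers above are listed without docstrings. The re-implementations below are illustrative sketches of what they are expected to compute, based only on the Position and Action definitions above; the actual ReAgent implementations may differ.

    def manhattan_distance(c1, c2):
        # Sum of absolute coordinate differences between two Position values.
        return abs(c1.x - c2.x) + abs(c1.y - c2.y)

    def opposite_direction(d):
        # With UP=0, RIGHT=1, DOWN=2, LEFT=3, the opposite direction is two
        # quarter-turns away: UP <-> DOWN and RIGHT <-> LEFT.
        return (d + 2) % 4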

reagent.gym.envs.pomdp.state_embed_env module

This file shows an example of using embedded states to feed to RL models in partially observable environments (POMDPs). Embedded states are generated by a world model which learns how to encode the past n observations into a low-dimensional vector. Embedded states improve performance in POMDPs compared to just using one-step observations as states, because they encode more historical information than one-step observations.
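
As a rough illustration of the idea (not ReAgent’s MemoryNetwork, which is an MDN-RNN world model), the sketch below compresses a window of past observations into a single low-dimensional vector with a plain GRU; all names and sizes here are made up for the example.

    import torch
    import torch.nn as nn

    class ToyStateEmbedder(nn.Module):
        # Illustration only: encodes the last n one-step observations into one
        # low-dimensional "embedded state" vector.
        def __init__(self, obs_dim, embed_dim):
            super().__init__()
            self.rnn = nn.GRU(input_size=obs_dim, hidden_size=embed_dim, batch_first=True)

        def forward(self, past_obs):
            # past_obs: (batch, n, obs_dim) -- the last n observations.
            _, hidden = self.rnn(past_obs)
            return hidden.squeeze(0)          # (batch, embed_dim) embedded state

    # Usage: embed a window of the 5 most recent 4-dimensional observations.
    embedder = ToyStateEmbedder(obs_dim=4, embed_dim=8)
    embedded_state = embedder(torch.randn(1, 5, 4))   # shape (1, 8)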

class reagent.gym.envs.pomdp.state_embed_env.StateEmbedEnvironment(gym_env: reagent.gym.envs.env_wrapper.EnvWrapper, mdnrnn: reagent.models.world_model.MemoryNetwork, max_embed_seq_len: int, state_min_value: Optional[float] = None, state_max_value: Optional[float] = None)

Bases: gym.core.Env

embed_state(state)

Embed state after either reset() or step()

reset()

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

seed(seed)

Sets the seed for this env’s random number generator(s).

Note

Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.

Returns

the list of seeds used in this env’s random number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.

Return type

list<bigint>

step(action)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

observation (object) – agent’s observation of the current environment
reward (float) – amount of reward returned after previous action
done (bool) – whether the episode has ended, in which case further step() calls will return undefined results
info (dict) – contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple (observation, reward, done, info)
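
Based only on the constructor signature above, wrapping an environment might look like the sketch below; base_env (an EnvWrapper) and mdnrnn (a trained MemoryNetwork) are assumed to already exist, and the keyword-argument comments are interpretations rather than documented semantics.

    from reagent.gym.envs.pomdp.state_embed_env import StateEmbedEnvironment

    def make_state_embed_env(base_env, mdnrnn):
        # base_env: a reagent.gym.envs.env_wrapper.EnvWrapper instance
        # mdnrnn: a trained reagent.models.world_model.MemoryNetwork
        return StateEmbedEnvironment(
            gym_env=base_env,
            mdnrnn=mdnrnn,
            max_embed_seq_len=5,      # assumed: how many past observations to encode
            state_min_value=-1.0,     # optional bounds on the embedded state (assumed)
            state_max_value=1.0,
        )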

reagent.gym.envs.pomdp.string_game module

The agent observes one character at a time, but the reward is based on the observations from the last n (n > 1) steps (a string). In this environment, the agent observes a character (“A” or “B”) at each time step, while the reward it receives actually depends on the past 3 steps: if the agent observed “ABB” over the past 3 steps, it receives +5 reward; if it observed “BBB”, it receives -5 reward; otherwise, it receives 0. The action is the next character the agent wants to reveal, and the next state is exactly the action just taken (i.e., the transition function depends only on the action). Each episode is limited to 6 steps. The optimal policy is therefore to choose the actions “ABBABB” in sequence, which results in +10 total reward.
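
The reward rule just described can be restated compactly. The sketch below is an illustrative restatement (the environment’s actual observation encoding may differ) and confirms that the sequence “ABBABB” earns +10.

    def string_game_reward(last_three):
        # Reward rule from the description: depends only on the last 3 characters.
        if last_three == "ABB":
            return 5.0
        if last_three == "BBB":
            return -5.0
        return 0.0

    # The optimal 6-step sequence "ABBABB" collects +5 at step 3 and +5 at step 6.
    history = "ABBABB"
    total = sum(string_game_reward(history[max(0, t - 2):t + 1]) for t in range(len(history)))
    assert total == 10.0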

class reagent.gym.envs.pomdp.string_game.StringGameEnv(max_steps=6)

Bases: gym.core.Env

get_observation()

The function you can modify to customize transitions. In this specific environment, the next state is exactly the latest action taken. The initial observation is all zeros.

get_reward()

The function you can modify to customize rewards. In this specific environment, the reward depends only on the action history.

static print_action(action)
print_internal_state()
static print_ob(ob)
static random_action()
reset()

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

seed(seed=None)

Sets the seed for this env’s random number generator(s).

Note

Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.

Returns

the list of seeds used in this env’s random number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.

Return type

list<bigint>

step(action)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

observation (object) – agent’s observation of the current environment
reward (float) – amount of reward returned after previous action
done (bool) – whether the episode has ended, in which case further step() calls will return undefined results
info (dict) – contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple (observation, reward, done, info)

reagent.gym.envs.pomdp.string_game_v1 module

A game whose MDP has a stochastic length, but no longer than 3 steps.

An agent can choose one character to reveal (either “A” or “B”) as the action, and the next state is exactly the action just taken (i.e., the transition function only depends on the action). Each episode is limited to 3 steps.

The episode may terminate at any step (and must terminate after 3 steps). If the current state is “A”, the agent has probability 0.5 of making it to the next step; if the current state is “B”, the agent has probability 0.9 of making it to the next step. The reward is given at the terminal state, based on the accumulated observation (a string).

If the agent observes “AAA” (it survives the first 2 steps and terminates at the last step, no matter what action is taken), it receives +5 reward. If the agent observes “BA” (it survives the first step and terminates at the second step), it receives +4 reward. In all other scenarios, the agent receives 0 reward.

If we plan 3 steps ahead from the beginning (ignoring the chance of early termination), “A” is the better action to take first. If we plan with the termination probabilities taken into account, “B” is better, because:

The expected Q-value of “A” = 0.5 * 0 + 0.5 * max(0.5 * 0 + 0.5 * max(5, 0), 0) = 1.25

The expected Q-value of “B” = 0.1 * 0 + 0.9 * max(0.5 * 4 + 0.5 * max(0, 0), 0) = 1.8
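
The two Q-values quoted above can be re-derived by direct backward induction over the stated continuation probabilities and terminal rewards. The sketch below is a check of that arithmetic, not the environment’s actual code.

    # P(episode continues | revealed character) and terminal rewards, as described above.
    CONTINUE_PROB = {"A": 0.5, "B": 0.9}
    TERMINAL_REWARD = {"AAA": 5.0, "BA": 4.0}   # every other terminal string gives 0

    def q_value(history, action, max_steps=3):
        s = history + action
        r_terminate = TERMINAL_REWARD.get(s, 0.0)
        if len(s) == max_steps:                  # forced termination after 3 steps
            return r_terminate
        p_cont = CONTINUE_PROB[action]
        # With prob (1 - p_cont) the episode ends here; otherwise act greedily next step.
        best_next = max(q_value(s, a) for a in ("A", "B"))
        return (1.0 - p_cont) * r_terminate + p_cont * best_next

    print(q_value("", "A"))   # 1.25
    print(q_value("", "B"))   # 1.8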

class reagent.gym.envs.pomdp.string_game_v1.StringGameEnvV1(max_steps=3)

Bases: gym.core.Env

get_observation()

The function you can modify to customize transitions. In this specific environment, the next state is exactly the latest action taken. The initial observation is all zeros.

get_reward()

The function you can modify to customize rewards. In this specific environment, the reward depends only on the action history.

static print_action(action)
print_internal_state()
static print_ob(ob)
static random_action()
reset()

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

sample_terminal(action)
seed(seed=None)

Sets the seed for this env’s random number generator(s).

Note

Some environments use multiple pseudorandom number generators. We want to capture all such seeds used in order to ensure that there aren’t accidental correlations between multiple generators.

Returns

the list of seeds used in this env’s random number generators. The first value in the list should be the “main” seed, or the value which a reproducer should pass to ‘seed’. Often, the main seed equals the provided ‘seed’, but this won’t be true if seed=None, for example.

Return type

list<bigint>

step(action)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

observation (object) – agent’s observation of the current environment
reward (float) – amount of reward returned after previous action
done (bool) – whether the episode has ended, in which case further step() calls will return undefined results
info (dict) – contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple (observation, reward, done, info)

Module contents