basicgym.envs.synthetic.BasicEnv

class basicgym.envs.synthetic.BasicEnv(step_per_episode=10, state_dim=5, action_type='continuous', n_actions=10, action_dim=3, action_context=None, reward_type='continuous', reward_std=0.0, obs_std=0.0, StateTransitionFunction=<class 'basicgym.envs.simulator.function.StateTransitionFunction'>, RewardFunction=<class 'basicgym.envs.simulator.function.RewardFunction'>, random_state=None)

Class for a basic environment that a reinforcement learning (RL) agent can interact with.

Bases: gym.Env

Imported as: basicgym.BasicEnv

Note

BasicGym works with OpenAI Gym and Gymnasium-like interfaces. See the Examples below for usage.

The Markov Decision Process (MDP) is defined as follows:

timestep: int (> 0)

state: array-like of shape (state_dim, )

action: int, float, or array-like of shape (action_dim, )

reward: bool or continuous

discount_rate: float

Parameters:
  • step_per_episode (int, default=10 (> 0)) – Number of timesteps in an episode.

  • state_dim (int, default=5 (> 0)) – Dimension of the state.

  • action_type ({"discrete", "continuous"}, default="continuous") – Type of the action space.

  • action_dim (int, default=3 (> 0)) – Dimension of the action (context).

  • n_actions (int, default=10 (> 0)) – Number of actions in the discrete action case.

  • action_context (array-like of shape (n_actions, action_dim), default=None) – Feature vectors that characterize each action. Applicable only when action_type is “discrete”.

  • reward_type ({"continuous", "binary"}, default="continuous") – Reward type.

  • reward_std (float, default=0.0 (>=0)) – Noise level of the reward. Applicable only when reward_type is “continuous”.

  • obs_std (float, default=0.0 (>=0)) – Noise level of the state observation.

  • StateTransitionFunction (BaseStateTransitionFunction, default=StateTransitionFunction) – State transition function. Both class and instance are acceptable.

  • RewardFunction (BaseRewardFunction, default=RewardFunction) – Expected immediate reward function. Both class and instance are acceptable.

  • random_state (int, default=None (>= 0)) – Random state.
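As a rough illustration of how action_context and obs_std could be read, the sketch below looks up the feature vector of a discrete action and adds Gaussian noise to a state. The helpers action_features and noisy_observation are hypothetical names introduced here, and Gaussian noise is an assumption; neither is taken from the BasicEnv implementation.

```python
import random


def action_features(action, action_type, action_context=None):
    """Return the feature vector that represents `action` (hypothetical helper)."""
    if action_type == "discrete":
        # Each row of action_context characterizes one discrete action.
        return list(action_context[action])
    # A continuous action is already a vector of length action_dim.
    return list(action)


def noisy_observation(state, obs_std, rng):
    """Add zero-mean Gaussian noise to each state dimension (assumed noise model)."""
    return [s + rng.gauss(0.0, obs_std) for s in state]


rng = random.Random(12345)
action_context = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # n_actions=2, action_dim=3
features = action_features(1, "discrete", action_context)
clean = noisy_observation([0.5, -0.5], obs_std=0.0, rng=rng)
```

With obs_std=0.0, the default, the observation equals the underlying state.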

Examples

Setup:

# import necessary module from basicgym
from basicgym import BasicEnv
from scope_rl.policy import OnlineHead
from scope_rl.ope.online import calc_on_policy_policy_value

# import necessary module from other libraries
from d3rlpy.algos import RandomPolicy
from d3rlpy.preprocessing import MinMaxActionScaler

# initialize environment
env = BasicEnv(random_state=12345)

# the following commands also work
# import gym
# env = gym.make("BasicEnv-continuous-v0")

# define (RL) agent (i.e., policy)
agent = OnlineHead(
    RandomPolicy(
        action_scaler=MinMaxActionScaler(
            minimum=0.1,
            maximum=10,
        )
    ),
    name="random",
)
agent.build_with_env(env)

Interaction:

# OpenAI Gym and Gymnasium-like interaction with agent
for episode in range(1000):
    obs, info = env.reset()
    done = False

    while not done:
        action = agent.predict_online(obs)
        obs, reward, done, truncated, info = env.step(action)

Online Evaluation:

# calculate on-policy policy value
on_policy_performance = calc_on_policy_policy_value(
    env,
    agent,
    n_trajectories=100,
    random_state=12345
)

Output:

>>> on_policy_performance

27.59
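To make the reported number easier to interpret, the sketch below shows one plausible reading of an on-policy policy value: the average cumulative reward over several rollouts. It assumes an undiscounted return, which this page does not confirm for calc_on_policy_policy_value. ConstantRewardEnv and average_return are hypothetical names used only for illustration.

```python
class ConstantRewardEnv:
    """Hypothetical stub: every step pays reward 1.0 for a fixed number of steps."""

    def __init__(self, step_per_episode=4):
        self.step_per_episode = step_per_episode
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return [0.0], {}

    def step(self, action):
        self.t += 1
        done = self.t >= self.step_per_episode
        return [0.0], 1.0, done, False, {}


def average_return(env, policy, n_trajectories):
    """Average the undiscounted episode return over n_trajectories rollouts."""
    total = 0.0
    for _ in range(n_trajectories):
        obs, info = env.reset()
        done = False
        while not done:
            obs, reward, done, truncated, info = env.step(policy(obs))
            total += reward
    return total / n_trajectories


value = average_return(ConstantRewardEnv(), policy=lambda obs: 0.0, n_trajectories=3)
```

Here every episode returns exactly 4.0, so the average is also 4.0. With a stochastic environment or policy, the estimate varies with n_trajectories and the random seed.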

References

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. “OpenAI Gym.” 2016.

Methods

reset([seed])

Initialize the environment.

step(action)

Simulate an interaction with the environment for the given action.

step(action)

Simulate an interaction with the environment for the given action.

Note

The simulation procedure is given as follows.

  1. Sample reward for the given state-action pair.

  2. Update state with state transition function.

  3. Return the feedback to the RL agent.
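The three steps above can be sketched with a toy environment. ToyEnv is a hypothetical stand-in written for this illustration, not BasicEnv itself. Its reward and transition functions are arbitrary assumptions, and it takes a scalar action for brevity.

```python
import math
import random


class ToyEnv:
    """Hypothetical stand-in that mirrors the three-step procedure."""

    def __init__(self, step_per_episode=10, state_dim=5,
                 reward_std=0.0, obs_std=0.0, random_state=None):
        self.step_per_episode = step_per_episode
        self.state_dim = state_dim
        self.reward_std = reward_std
        self.obs_std = obs_std
        self.rng = random.Random(random_state)
        self.t = 0
        self.state = [0.0] * state_dim

    def _observe(self):
        # The observation is the state plus optional Gaussian noise.
        return [s + self.rng.gauss(0.0, self.obs_std) for s in self.state]

    def reset(self, seed=None):
        if seed is not None:
            self.rng = random.Random(seed)
        self.t = 0
        self.state = [self.rng.uniform(-1.0, 1.0) for _ in range(self.state_dim)]
        return self._observe(), {}

    def step(self, action):
        # 1. Sample the reward for the current state-action pair.
        expected = sum(s * action for s in self.state)  # assumed reward function
        reward = expected + self.rng.gauss(0.0, self.reward_std)
        # 2. Update the state with the (assumed) transition function.
        self.state = [math.tanh(s + 0.1 * action) for s in self.state]
        # 3. Return the feedback to the agent.
        self.t += 1
        done = self.t >= self.step_per_episode
        return self._observe(), reward, done, False, {}


env = ToyEnv(step_per_episode=3, random_state=0)
obs, info = env.reset()
done, n_steps = False, 0
while not done:
    obs, reward, done, truncated, info = env.step(0.5)
    n_steps += 1
```

The reward is computed from the state before the transition, which matches the ordering of steps 1 and 2.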

Parameters:

action ({int, array-like of shape (action_dim, )} (>= 0)) – Action chosen by the RL agent.

Returns:

feedbacks

obs: ndarray of shape (state_dim,)

State observation, which may be noisy.

reward: float

Observed immediate reward.

done: bool

Whether the episode has ended.

truncated: bool

Always False; returned for API consistency.

info: (empty) dict

Additional information that may be useful for the package users.

This is unavailable to the RL agent.

Return type:

Tuple
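Because step returns a plain five-element tuple, the fields are easy to mix up when unpacking. One option, sketched below, is to wrap the tuple in a named structure. StepFeedback and episode_over are hypothetical helpers introduced for this example and are not part of basicgym.

```python
from typing import NamedTuple, Sequence


class StepFeedback(NamedTuple):
    """Named view over the (obs, reward, done, truncated, info) tuple."""
    obs: Sequence[float]
    reward: float
    done: bool
    truncated: bool
    info: dict


def episode_over(feedback):
    """Treat either flag as the end of the episode."""
    return bool(feedback.done or feedback.truncated)


# A hand-written tuple stands in for the value returned by env.step(action).
raw = ([0.1, 0.2], 1.0, False, False, {})
fb = StepFeedback(*raw)
last = StepFeedback([0.0, 0.0], 0.0, True, False, {})
```

Checking both flags keeps a rollout loop correct even for environments that do set truncated, although this page documents it as always False.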

reset(seed=None)

Initialize the environment.

Returns:

  • obs (ndarray of shape (state_dim,)) – State observation, which may be noisy.

  • info ((empty) dict) – Additional information that may be useful for the package users. This is unavailable to the RL agent.

Return type:

Tuple
