basicgym.envs.synthetic.BasicEnv

class basicgym.envs.synthetic.BasicEnv(step_per_episode=10, state_dim=5, action_type='continuous', n_actions=10, action_dim=3, action_context=None, reward_type='continuous', reward_std=0.0, obs_std=0.0, StateTransitionFunction=<class 'basicgym.envs.simulator.function.StateTransitionFunction'>, RewardFunction=<class 'basicgym.envs.simulator.function.RewardFunction'>, random_state=None)

Class for a basic environment that a reinforcement learning (RL) agent interacts with.

Bases: gym.Env

Imported as: basicgym.BasicEnv

Note

BasicGym works with an OpenAI Gym and Gymnasium-like interface. See the Examples below for usage.

The Markov Decision Process (MDP) is defined as follows:

timestep: int (> 0)

state: array-like of shape (state_dim, )

action: int, float, or array-like of shape (action_dim, )

reward: bool or continuous

discount_rate: float

Parameters:

step_per_episode (int, default=10 (> 0)) – Number of timesteps in an episode.
state_dim (int, default=5 (> 0)) – Dimension of the state.
action_type ({"discrete", "continuous"}, default="continuous") – Type of the action space.
action_dim (int, default=3 (> 0)) – Dimension of the action (context).
n_actions (int, default=10 (> 0)) – Number of actions in the discrete action case.
action_context (array-like of shape (n_actions, action_dim), default=None) – Feature vectors that characterize each action. Applicable only when action_type is “discrete”.
reward_type ({"continuous", "binary"}, default="continuous") – Reward type.
reward_std (float, default=0.0 (>=0)) – Noise level of the reward. Applicable only when reward_type is “continuous”.
obs_std (float, default=0.0 (>=0)) – Noise level of the state observation.
StateTransitionFunction (BaseStateTransitionFunction, default=StateTransitionFunction) – State transition function. Both class and instance are acceptable.
RewardFunction (BaseRewardFunction, default=RewardFunction) – Expected immediate reward function. Both class and instance are acceptable.
random_state (int, default=None (>= 0)) – Random state.
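To illustrate how n_actions, action_dim, and action_context relate in the discrete case, the following sketch maps a discrete action index to its feature vector. It is plain Python and does not call basicgym: make_action_context and lookup_action are hypothetical helper names, and the Gaussian initialization is an assumption, not necessarily what BasicEnv does when action_context is None.

```python
import random


def make_action_context(n_actions, action_dim, seed=None):
    """Build a hypothetical (n_actions, action_dim) table of action feature vectors."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(n_actions)]


def lookup_action(action, action_context):
    """Map a discrete action index to its feature vector, validating the index."""
    n_actions = len(action_context)
    if not isinstance(action, int) or not 0 <= action < n_actions:
        raise ValueError(f"action must be an int in [0, {n_actions}), got {action!r}")
    return action_context[action]


# same sizes as the defaults listed above (n_actions=10, action_dim=3)
action_context = make_action_context(n_actions=10, action_dim=3, seed=12345)
features = lookup_action(4, action_context)  # one row: a list of 3 floats
```

Each row of the table plays the role of one action's feature vector, so a discrete action is simply a row index.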

Examples

Setup:

# import the necessary modules from basicgym and scope_rl
from basicgym import BasicEnv
from scope_rl.policy import OnlineHead
from scope_rl.ope.online import calc_on_policy_policy_value

# import the necessary modules from other libraries
from d3rlpy.algos import RandomPolicy
from d3rlpy.preprocessing import MinMaxActionScaler

# initialize environment
env = BasicEnv(random_state=12345)

# the following commands also work
# import gym
# env = gym.make("BasicEnv-continuous-v0")

# define (RL) agent (i.e., policy)
agent = OnlineHead(
    RandomPolicy(
        action_scaler=MinMaxActionScaler(
            minimum=0.1,
            maximum=10,
        )
    ),
    name="random",
)
agent.build_with_env(env)

Interaction:

# OpenAI Gym and Gymnasium-like interaction with agent
for episode in range(1000):
    obs, info = env.reset()
    done = False

    while not done:
        action = agent.predict_online(obs)
        obs, reward, done, truncated, info = env.step(action)

Online Evaluation:

# calculate on-policy policy value
on_policy_performance = calc_on_policy_policy_value(
    env,
    agent,
    n_trajectories=100,
    random_state=12345
)

Output:

>>> on_policy_performance

27.59
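This page does not show how calc_on_policy_policy_value arrives at this number. As a rough sketch, and assuming the on-policy value is the mean (optionally discounted) cumulative reward over n_trajectories rollouts, the computation could look like the following. The names estimate_on_policy_value and ConstantRewardEnv are hypothetical and not part of scope_rl or basicgym.

```python
def estimate_on_policy_value(env, policy, n_trajectories=100, discount_rate=1.0):
    """Monte Carlo estimate: mean discounted return over n_trajectories rollouts."""
    total = 0.0
    for _ in range(n_trajectories):
        obs, info = env.reset()
        done, weight, episode_return = False, 1.0, 0.0
        while not done:
            obs, reward, terminated, truncated, info = env.step(policy(obs))
            episode_return += weight * reward
            weight *= discount_rate
            done = terminated or truncated
        total += episode_return
    return total / n_trajectories


class ConstantRewardEnv:
    """Hypothetical 10-step environment that pays reward 1.0 at every step."""

    def reset(self):
        self.t = 0
        return [0.0], {}

    def step(self, action):
        self.t += 1
        return [0.0], 1.0, self.t == 10, False, {}


value = estimate_on_policy_value(
    ConstantRewardEnv(), policy=lambda obs: 0.0, n_trajectories=5
)
# value == 10.0: ten steps of reward 1.0, undiscounted
```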

References

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. “OpenAI Gym.” 2016.

Methods

`reset`([seed])	Initialize the environment.
`step`(action)	Simulate an action interaction with the environment.

step(action)

Simulate an action interaction with the environment.

Note

The simulation procedure is given as follows.

Sample reward for the given state-action pair.
Update state with state transition function.
Return the feedback to the RL agent.
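The three steps above can be sketched with a toy environment. This is not BasicEnv's implementation: ToyEnv, its tanh-based reward and transition functions, and the scalar action are all simplifying assumptions, chosen only to show the order of operations and the shape of the returned feedback.

```python
import math
import random


class ToyEnv:
    """Toy stand-in for the procedure above; not BasicEnv's actual dynamics."""

    def __init__(self, step_per_episode=10, state_dim=5,
                 reward_std=0.0, obs_std=0.0, random_state=None):
        self.step_per_episode = step_per_episode
        self.state_dim = state_dim
        self.reward_std = reward_std
        self.obs_std = obs_std
        self.rng = random.Random(random_state)

    def _observe(self):
        # add Gaussian observation noise to the true state
        return [s + self.rng.gauss(0.0, self.obs_std) for s in self.state]

    def reset(self):
        self.t = 0
        self.state = [self.rng.gauss(0.0, 1.0) for _ in range(self.state_dim)]
        return self._observe(), {}

    def step(self, action):
        # 1. sample the reward for the current state-action pair
        expected_reward = math.tanh(action * sum(self.state) / self.state_dim)
        reward = expected_reward + self.rng.gauss(0.0, self.reward_std)
        # 2. update the state with the (hypothetical) transition function
        self.state = [math.tanh(s + 0.1 * action) for s in self.state]
        self.t += 1
        # 3. return the feedback to the agent
        done = self.t == self.step_per_episode
        return self._observe(), reward, done, False, {}


env = ToyEnv(random_state=12345)
obs, info = env.reset()
done, n_steps = False, 0
while not done:
    obs, reward, done, truncated, info = env.step(action=0.5)
    n_steps += 1
```

Note that the reward is computed from the state before the transition is applied, which matches the order given in the procedure.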

Parameters:

action ({int, array-like of shape (action_dim, )}) – Action chosen by the agent: an action index (>= 0) when action_type is “discrete”, or an action vector when action_type is “continuous”.

Returns:

feedbacks –

obs: ndarray of shape (state_dim,): State observation, which may be noisy.
reward: float: Observed immediate reward.
done: bool: Whether the episode has ended.
truncated: False: Always False; returned for API consistency.
info: (empty) dict: Additional information that may be useful to package users. This is unavailable to the RL agent.

Return type:

Tuple

reset(seed=None)

Initialize the environment.

Returns:

obs (ndarray of shape (state_dim,)) – State observation, which may be noisy.
info ((empty) dict) – Additional information that may be useful to package users. This is unavailable to the RL agent.

Return type:

Tuple
