basicgym.envs.synthetic.BasicEnv
- class basicgym.envs.synthetic.BasicEnv(step_per_episode=10, state_dim=5, action_type='continuous', n_actions=10, action_dim=3, action_context=None, reward_type='continuous', reward_std=0.0, obs_std=0.0, StateTransitionFunction=<class 'basicgym.envs.simulator.function.StateTransitionFunction'>, RewardFunction=<class 'basicgym.envs.simulator.function.RewardFunction'>, random_state=None)
A basic environment for a reinforcement learning (RL) agent to interact with.
Bases: gym.Env
Imported as: basicgym.BasicEnv
Note
BasicGym works with the OpenAI Gym and Gymnasium-like interface. See Examples below for usage.
- The Markov Decision Process (MDP) is defined as follows:
timestep: int (> 0)
state: array-like of shape (state_dim, )
action: int, float, or array-like of shape (action_dim, )
reward: bool or continuous
discount_rate: float
- Parameters:
step_per_episode (int, default=10 (> 0)) – Number of timesteps in an episode.
state_dim (int, default=5 (> 0)) – Dimension of the state.
action_type ({"discrete", "continuous"}, default="continuous") – Type of the action space.
action_dim (int, default=3) – Dimension of the action (context).
n_actions (int, default=10 (> 0)) – Number of actions in the discrete action case.
action_context (array-like of shape (n_actions, action_dim), default=None) – Feature vectors that characterize each action. Applicable only when action_type is “discrete”.
reward_type ({"continuous", "binary"}, default="continuous") – Reward type.
reward_std (float, default=0.0 (>=0)) – Noise level of the reward. Applicable only when reward_type is “continuous”.
obs_std (float, default=0.0 (>=0)) – Noise level of the state observation.
StateTransitionFunction (BaseStateTransitionFunction, default=StateTransitionFunction) – State transition function. Both class and instance are acceptable.
RewardFunction (BaseRewardFunction, default=RewardFunction) – Expected immediate reward function. Both class and instance are acceptable.
random_state (int, default=None (>= 0)) – Random state.
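The constraints listed above can be checked before an environment is built. The following is a minimal sketch: `check_env_params` is a hypothetical helper written for illustration, not part of basicgym, and it encodes only the constraints stated in the parameter list.

```python
from typing import Optional, Sequence


def check_env_params(
    step_per_episode: int = 10,
    state_dim: int = 5,
    action_type: str = "continuous",
    n_actions: int = 10,
    action_dim: int = 3,
    action_context: Optional[Sequence[Sequence[float]]] = None,
    reward_type: str = "continuous",
    reward_std: float = 0.0,
    obs_std: float = 0.0,
) -> None:
    """Hypothetical helper (not part of basicgym): raise ValueError on a violated constraint."""
    # Integer parameters documented as strictly positive.
    for name, value in [
        ("step_per_episode", step_per_episode),
        ("state_dim", state_dim),
        ("n_actions", n_actions),
    ]:
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")

    # Categorical parameters with a fixed set of allowed values.
    if action_type not in ("discrete", "continuous"):
        raise ValueError(f"unknown action_type: {action_type!r}")
    if reward_type not in ("continuous", "binary"):
        raise ValueError(f"unknown reward_type: {reward_type!r}")

    # Noise levels documented as non-negative.
    for name, value in [("reward_std", reward_std), ("obs_std", obs_std)]:
        if value < 0:
            raise ValueError(f"{name} must be non-negative, got {value!r}")

    # action_context applies only to discrete actions and must have
    # shape (n_actions, action_dim).
    if action_type == "discrete" and action_context is not None:
        rows_ok = len(action_context) == n_actions
        cols_ok = all(len(row) == action_dim for row in action_context)
        if not (rows_ok and cols_ok):
            raise ValueError("action_context must have shape (n_actions, action_dim)")


# The documented defaults satisfy every check.
check_env_params()
```

Whether BasicEnv itself raises on these conditions, and with which exception type, is not stated here; the sketch only mirrors the documented ranges.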
Examples
Setup:
# import necessary modules from basicgym and scope_rl
from basicgym import BasicEnv
from scope_rl.policy import OnlineHead
from scope_rl.ope.online import calc_on_policy_policy_value

# import necessary modules from other libraries
from d3rlpy.algos import RandomPolicy
from d3rlpy.preprocessing import MinMaxActionScaler

# initialize environment
env = BasicEnv(random_state=12345)

# the following commands also work
# import gym
# env = gym.make("BasicEnv-continuous-v0")

# define (RL) agent (i.e., policy)
agent = OnlineHead(
    RandomPolicy(
        action_scaler=MinMaxActionScaler(
            minimum=0.1,
            maximum=10,
        )
    ),
    name="random",
)
agent.build_with_env(env)
Interaction:
# OpenAI Gym and Gymnasium-like interaction with agent
for episode in range(1000):
    obs, info = env.reset()
    done = False

    while not done:
        action = agent.predict_online(obs)
        obs, reward, done, truncated, info = env.step(action)
Online Evaluation:
# calculate on-policy policy value
on_policy_performance = calc_on_policy_policy_value(
    env,
    agent,
    n_trajectories=100,
    random_state=12345
)
Output:
>>> on_policy_performance
27.59
References
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. “OpenAI Gym.” 2016.
Methods
reset([seed]) – Initialize the environment.
step(action) – Simulate an action interaction with a context.
- step(action)
Simulate an action interaction with a context.
Note
The simulation procedure is given as follows.
Sample reward for the given state-action pair.
Update state with state transition function.
Return the feedback to the RL agent.
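The three steps above can be sketched as a standalone function. This is an illustration under stated assumptions, not the actual BasicEnv.step implementation: `simulate_step`, `reward_fn`, and `transition_fn` are hypothetical names, Gaussian noise is assumed for both `reward_std` and `obs_std`, and the episode is assumed to end once `step_per_episode` steps have been taken.

```python
import random
from typing import Callable, Dict, List, Tuple

State = List[float]
Feedback = Tuple[State, float, bool, bool, Dict]


def simulate_step(
    state: State,
    action: float,
    t: int,
    step_per_episode: int,
    reward_fn: Callable[[State, float], float],
    transition_fn: Callable[[State, float], State],
    rng: random.Random,
    reward_std: float = 0.0,
    obs_std: float = 0.0,
) -> Tuple[State, Feedback]:
    """Hypothetical sketch of one step; `t` is the number of steps already taken."""
    # 1. Sample the reward for the given state-action pair
    #    (expected reward plus assumed Gaussian noise).
    reward = reward_fn(state, action) + rng.gauss(0.0, reward_std)

    # 2. Update the state with the state transition function.
    next_state = transition_fn(state, action)

    # 3. Return the feedback: a possibly noisy observation of the new state,
    #    the reward, the done flag, truncated (always False), and an empty info dict.
    obs = [s + rng.gauss(0.0, obs_std) for s in next_state]
    done = (t + 1) >= step_per_episode
    return next_state, (obs, reward, done, False, {})


# Usage with toy (assumed) reward and transition functions.
rng = random.Random(12345)
next_state, feedback = simulate_step(
    state=[1.0, 2.0],
    action=0.5,
    t=0,
    step_per_episode=10,
    reward_fn=lambda s, a: sum(s) * a,
    transition_fn=lambda s, a: [x + a for x in s],
    rng=rng,
)
```

Passing the reward and transition functions as arguments mirrors the pluggable StateTransitionFunction and RewardFunction parameters of the class, although the real interfaces of those base classes may differ.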
- Parameters:
action ({int, array-like of shape (action_dim, )} (>= 0)) – Action to present to the context.
- Returns:
feedbacks –
- obs: ndarray of shape (state_dim,)
State observation, which may be noisy.
- reward: float
Observed immediate reward.
- done: bool
Whether the episode has ended.
- truncated: False
For API consistency.
- info: (empty) dict
Additional information that may be useful for the package users.
This is unavailable to the RL agent.
- Return type:
Tuple
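The returned tuple slots into a standard episode loop: `done` ends the episode, `truncated` is always False, and `info` can be ignored by the agent. The sketch below averages the undiscounted episode return over several rollouts. `ToyEnv` is a hypothetical stand-in that only mimics the documented reset/step signatures so the example runs without basicgym; its reward rule is invented for illustration.

```python
import random
from typing import Callable, Dict, List, Tuple


class ToyEnv:
    """Hypothetical stand-in (not BasicEnv) with the same reset/step signatures."""

    def __init__(self, step_per_episode: int = 10, state_dim: int = 5, seed: int = 0):
        self.step_per_episode = step_per_episode
        self.state_dim = state_dim
        self.rng = random.Random(seed)
        self.t = 0

    def _observe(self) -> List[float]:
        return [self.rng.gauss(0.0, 1.0) for _ in range(self.state_dim)]

    def reset(self) -> Tuple[List[float], Dict]:
        self.t = 0
        return self._observe(), {}

    def step(self, action: float) -> Tuple[List[float], float, bool, bool, Dict]:
        self.t += 1
        reward = 1.0 if action > 0 else 0.0  # invented toy reward rule
        done = self.t >= self.step_per_episode
        return self._observe(), reward, done, False, {}


def average_return(env, policy: Callable[[List[float]], float], n_episodes: int) -> float:
    """Average undiscounted return of `policy` over `n_episodes` rollouts."""
    total = 0.0
    for _ in range(n_episodes):
        obs, info = env.reset()
        done = False
        while not done:
            obs, reward, done, truncated, info = env.step(policy(obs))
            total += reward
    return total / n_episodes


# A policy that always plays a positive action earns 1.0 per step in ToyEnv.
value = average_return(ToyEnv(step_per_episode=10), lambda obs: 1.0, n_episodes=5)
```

With a real environment the same loop applies unchanged; only the environment and the policy callable need to be swapped in.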