BasicGym#
A Python-based configurative basic simulation environment for RL
Overview#
BasicGym is a basic simulation platform for RL. The simulator is particularly intended for reinforcement learning algorithms and follows the OpenAI Gym and Gymnasium interface. We design BasicGym as a configurative environment so that researchers and practitioners can customize the environmental modules, including StateTransitionFunction and RewardFunction.
Note that BasicGym is released as a sub-package of SCOPE-RL, which streamlines the implementation of offline reinforcement learning (offline RL) and off-policy evaluation and selection (OPE/OPS) procedures.
Basic Setting#
We formulate a (Partially Observable) Markov Decision Process ((PO)MDP) as \(\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, P_r \rangle\), which contains the following components.
state (\(s \in \mathcal{S}\))
action (\(a \in \mathcal{A}\))
reward (\(r \in \mathbb{R}\))
Note that \(\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{P}(\mathcal{S})\) is the state transition probability where \(\mathcal{T}(s'\mid s,a)\) is the probability of observing state \(s'\) after taking action \(a\) given state \(s\). \(P_r: \mathcal{S} \times \mathcal{A} \times \mathbb{R} \rightarrow [0,1]\) is the probability distribution of the immediate reward. Given \(P_r\), \(R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}\) is the expected reward function where \(R(s,a) := \mathbb{E}_{r \sim P_r (r \mid s, a)}[r]\) is the expected reward when taking action \(a\) for state \(s\). Finally, \(\pi: \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})\) denotes a policy (i.e., agent) where \(\pi(a | s)\) is the probability of taking action \(a\) at a given state \(s\).
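To make the data-generating process concrete, the following is a minimal sketch of a single (PO)MDP step. The linear-Gaussian transition and reward models are illustrative assumptions for exposition, not BasicGym's internal implementation.
import numpy as np

rng = np.random.default_rng(12345)
state_dim, action_dim = 3, 2

# illustrative (assumed) linear-Gaussian dynamics and reward, not BasicGym's internals
A = rng.normal(size=(state_dim, state_dim))   # transition coefficients
B = rng.normal(size=(state_dim, action_dim))  # action effect on the next state
w_s = rng.normal(size=state_dim)              # reward coefficients for the state
w_a = rng.normal(size=action_dim)             # reward coefficients for the action

s = rng.normal(size=state_dim)   # current state s
a = rng.normal(size=action_dim)  # action a sampled from some policy pi(a|s)

s_next = A @ s + B @ a + rng.normal(scale=0.1, size=state_dim)  # s' ~ T(s'|s,a)
r = w_s @ s + w_a @ a + rng.normal(scale=0.1)                   # r ~ P_r(r|s,a)
expected_r = w_s @ s + w_a @ a                                  # R(s,a) = E[r|s,a]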
Supported Implementation#
Standard Environment#
BasicEnv-discrete-v0
: Standard synthetic environment with discrete action space.
BasicEnv-continuous-v0
: Standard synthetic environment with continuous action space.
Custom Environment#
BasicEnv
: The configurative environment with either discrete or continuous action space.
Configurative Modules#
StateTransitionFunction
: Class to define the state transition of the synthetic simulation.
RewardFunction
: Class to define the reward function of the synthetic simulation.
Note that users can customize the above modules by subclassing the corresponding abstract base classes.
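Concretely, customization amounts to subclassing the abstract base class and implementing its core method. The skeleton below is a schematic sketch inferred from the worked examples later on this page (see the Customized BasicEnv section for complete implementations).
from dataclasses import dataclass
import numpy as np
from basicgym import BaseStateTransitionFunction, BaseRewardFunction

@dataclass
class MyStateTransitionFunction(BaseStateTransitionFunction):
    def step(self, state: np.ndarray, action: np.ndarray) -> np.ndarray:
        # return the next state given the current state and action
        return state  # placeholder: identity transition

@dataclass
class MyRewardFunction(BaseRewardFunction):
    def mean_reward_function(self, state: np.ndarray, action: np.ndarray) -> float:
        # return the expected immediate reward for the given (state, action) pair
        return 0.0  # placeholder: constant reward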
Quickstart and Configurations#
We provide example usages of the standard and customized environments.
Standard BasicEnv#
Our BasicEnv is available via gym.make(), following the OpenAI Gym and Gymnasium interface.
# import basicgym and gym
import basicgym
import gym
# (1) standard environment for continuous action space
env = gym.make('BasicEnv-continuous-v0')
The basic interaction is performed using only a few lines of code as follows.
obs, info = env.reset()
done = False
while not done:
    action = agent.act(obs)
    obs, reward, done, truncated, info = env.step(action)
Let’s interact with a uniform random policy.
from scope_rl.policy import OnlineHead
from d3rlpy.algos import RandomPolicy as ContinuousRandomPolicy
from d3rlpy.preprocessing import MinMaxActionScaler

# (1) define a random agent
agent = OnlineHead(
    ContinuousRandomPolicy(
        action_scaler=MinMaxActionScaler(
            minimum=0.1,  # minimum value that policy can take
            maximum=10,   # maximum value that policy can take
        )
    ),
    name="random",
)
agent.build_with_env(env)

# (2) basic interaction
obs, info = env.reset()
done = False
while not done:
    action = agent.predict_online(obs)
    obs, reward, done, truncated, info = env.step(action)
Note that while we use SCOPE-RL and d3rlpy here, BasicGym is compatible with any other library that supports the OpenAI Gym and Gymnasium interface.
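For example, a minimal interaction loop that relies only on the Gym/Gymnasium interface itself, sampling actions directly from the action space instead of using a SCOPE-RL or d3rlpy policy, might look as follows.
import gym
import basicgym

env = gym.make("BasicEnv-continuous-v0")

# interact with a uniform random policy defined by the action space itself
obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # sample a random action
    obs, reward, done, truncated, info = env.step(action)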
Customized BasicEnv#
Next, we describe how to customize the environment by instantiating it with the desired configuration.
The list of arguments is given as follows.
step_per_episode
: Number of timesteps in an episode.
state_dim
: Dimension of the state.
action_type
: Type of the action space.
n_actions
: Number of actions in the discrete action case.
action_dim
: Dimension of the action (context).
action_context
: Feature vectors that characterize each action. Applicable only when action_type is “discrete”.
reward_type
: Reward type.
reward_std
: Noise level of the reward. Applicable only when reward_type is “continuous”.
obs_std
: Noise level of the state observation.
StateTransitionFunction
: State transition function.
RewardFunction
: Expected immediate reward function.
random_state
: Random state.
Example:
from basicgym import BasicEnv
env = BasicEnv(
    state_dim=10,
    action_type="continuous",  # "discrete"
    action_dim=5,
    reward_type="continuous",  # "binary"
    reward_std=0.3,
    obs_std=0.3,
    step_per_episode=10,
    random_state=12345,
)
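For reference, a discrete-action configuration might look as follows; the particular argument values are illustrative assumptions, and action_context can additionally be supplied to characterize each action.
from basicgym import BasicEnv
env = BasicEnv(
    state_dim=10,
    action_type="discrete",
    n_actions=5,        # number of discrete actions (illustrative)
    reward_type="binary",
    obs_std=0.3,
    step_per_episode=10,
    random_state=12345,
)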
Specifically, users can define their own StateTransitionFunction and RewardFunction as follows.
Example of Custom State Transition Function:
# import basicgym modules
from basicgym import BaseStateTransitionFunction
# import other necessary modules
from dataclasses import dataclass
from typing import Optional
import numpy as np
from sklearn.utils import check_random_state

@dataclass
class CustomizedStateTransitionFunction(BaseStateTransitionFunction):
    state_dim: int
    action_dim: int
    random_state: Optional[int] = None

    def __post_init__(self):
        self.random_ = check_random_state(self.random_state)
        self.state_coef = self.random_.normal(loc=0.0, scale=1.0, size=(self.state_dim, self.state_dim))
        self.action_coef = self.random_.normal(loc=0.0, scale=1.0, size=(self.state_dim, self.action_dim))

    def step(
        self,
        state: np.ndarray,
        action: np.ndarray,
    ) -> np.ndarray:
        state = self.state_coef @ state / self.state_dim + self.action_coef @ action / self.action_dim
        state = state / np.linalg.norm(state, ord=2)
        return state
Example of Custom Reward Function:
# import basicgym modules
from basicgym import BaseRewardFunction
# import other necessary modules
from dataclasses import dataclass
from typing import Optional
import numpy as np
from sklearn.utils import check_random_state

@dataclass
class CustomizedRewardFunction(BaseRewardFunction):
    state_dim: int
    action_dim: int
    reward_type: str = "continuous"  # "binary"
    reward_std: float = 0.0
    random_state: Optional[int] = None

    def __post_init__(self):
        self.random_ = check_random_state(self.random_state)
        self.state_coef = self.random_.normal(loc=0.0, scale=1.0, size=(self.state_dim, ))
        self.action_coef = self.random_.normal(loc=0.0, scale=1.0, size=(self.action_dim, ))

    def mean_reward_function(
        self,
        state: np.ndarray,
        action: np.ndarray,
    ) -> float:
        reward = self.state_coef.T @ state / self.state_dim + self.action_coef.T @ action / self.action_dim
        return reward
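The customized classes can then be passed to BasicEnv via the StateTransitionFunction and RewardFunction arguments listed above. The following is a minimal sketch combining the two classes defined here, assuming the environment accepts the classes themselves; the remaining argument values are illustrative.
from basicgym import BasicEnv

env = BasicEnv(
    state_dim=10,
    action_type="continuous",
    action_dim=5,
    reward_type="continuous",
    reward_std=0.3,
    obs_std=0.3,
    step_per_episode=10,
    StateTransitionFunction=CustomizedStateTransitionFunction,
    RewardFunction=CustomizedRewardFunction,
    random_state=12345,
)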
Citation#
If you use our pipeline in your work, please cite our paper below.
@article{kiyohara2023towards,
    author = {Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
    title = {Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation},
    journal = {A GitHub repository},
    pages = {xxx--xxx},
    year = {2023},
}
Contact#
For any questions about the paper and pipeline, feel free to contact: hk844@cornell.edu
Contribution#
Any contributions to BasicGym are more than welcome! Please refer to CONTRIBUTING.md for general guidelines on how to contribute to the project.