BasicGym#
A Python-based configurative basic simulation environment for RL
Overview#
BasicGym is a basic simulation platform for RL. The simulator is particularly intended for reinforcement learning algorithms and follows the OpenAI Gym and Gymnasium interface. We design BasicGym as a configurative environment so that researchers and practitioners can customize the environmental modules, including StateTransitionFunction and RewardFunction.
Note that BasicGym is released as a sub-package of SCOPE-RL, which streamlines the implementation of offline reinforcement learning (offline RL) and off-policy evaluation and selection (OPE/OPS) procedures.
Basic Setting#
We formulate a (Partially Observable) Markov Decision Process ((PO)MDP) as \(\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, P_r \rangle\), which contains the following components.
state (\(s \in \mathcal{S}\))
action (\(a \in \mathcal{A}\))
reward (\(r \in \mathbb{R}\))
Note that \(\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{P}(\mathcal{S})\) is the state transition probability where \(\mathcal{T}(s'\mid s,a)\) is the probability of observing state \(s'\) after taking action \(a\) given state \(s\). \(P_r: \mathcal{S} \times \mathcal{A} \times \mathbb{R} \rightarrow [0,1]\) is the probability distribution of the immediate reward. Given \(P_r\), \(R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}\) is the expected reward function where \(R(s,a) := \mathbb{E}_{r \sim P_r (r \mid s, a)}[r]\) is the expected reward when taking action \(a\) for state \(s\). Finally, \(\pi: \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})\) denotes a policy (i.e., agent) where \(\pi(a | s)\) is the probability of taking action \(a\) at a given state \(s\).
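To make the data-generating process concrete, the following is a minimal sketch of a single (PO)MDP step. The linear-Gaussian transition and reward models are illustrative assumptions for exposition, not BasicGym's internal implementation.
import numpy as np

rng = np.random.default_rng(12345)
state_dim, action_dim = 3, 2

# illustrative (assumed) linear-Gaussian dynamics and reward, not BasicGym's internals
A = rng.normal(size=(state_dim, state_dim))   # transition coefficients
B = rng.normal(size=(state_dim, action_dim))  # action effect on the next state
w_s = rng.normal(size=state_dim)              # reward coefficients for the state
w_a = rng.normal(size=action_dim)             # reward coefficients for the action

s = rng.normal(size=state_dim)   # current state s
a = rng.normal(size=action_dim)  # action a sampled from some policy pi(a|s)

s_next = A @ s + B @ a + rng.normal(scale=0.1, size=state_dim)  # s' ~ T(s'|s,a)
r = w_s @ s + w_a @ a + rng.normal(scale=0.1)                   # r ~ P_r(r|s,a)
expected_r = w_s @ s + w_a @ a                                  # R(s,a) = E[r|s,a]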
Supported Implementation#
Standard Environment#
BasicEnv-discrete-v0
: Standard synthetic environment with discrete action space.
BasicEnv-continuous-v0
: Standard synthetic environment with continuous action space.
Custom Environment#
BasicEnv
: The configurative environment with either discrete or continuous action space.
Configurative Modules#
StateTransitionFunction
: Class to define the state transition of the synthetic simulation.
RewardFunction
: Class to define the reward function of the synthetic simulation.
Note that users can customize the above modules by subclassing the corresponding abstract base classes.
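Concretely, customization amounts to subclassing the abstract base class and implementing its core method. The skeleton below is a schematic sketch inferred from the worked examples later on this page (see the Customized BasicEnv section for complete implementations).
from dataclasses import dataclass
import numpy as np
from basicgym import BaseStateTransitionFunction, BaseRewardFunction

@dataclass
class MyStateTransitionFunction(BaseStateTransitionFunction):
    def step(self, state: np.ndarray, action: np.ndarray) -> np.ndarray:
        # return the next state given the current state and action
        return state  # placeholder: identity transition

@dataclass
class MyRewardFunction(BaseRewardFunction):
    def mean_reward_function(self, state: np.ndarray, action: np.ndarray) -> float:
        # return the expected immediate reward for the given (state, action) pair
        return 0.0  # placeholder: constant reward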
Quickstart and Configurations#
We provide example usages of the standard and customized environments.
Standard BasicEnv#
Our BasicEnv is available via gym.make(), following the OpenAI Gym and Gymnasium interface.
# import basicgym and gym
import basicgym
import gym
# (1) standard environment for continuous action space
env = gym.make('BasicEnv-continuous-v0')
The basic interaction is performed using only a few lines of code as follows.
obs, info = env.reset()
done = False
while not done:
    action = agent.act(obs)
    obs, reward, done, truncated, info = env.step(action)
Let’s interact with a uniform random policy.
from scope_rl.policy import OnlineHead
from d3rlpy.algos import RandomPolicy as ContinuousRandomPolicy
from d3rlpy.preprocessing import MinMaxActionScaler

# (1) define a random agent
agent = OnlineHead(
    ContinuousRandomPolicy(
        action_scaler=MinMaxActionScaler(
            minimum=0.1,  # minimum value that policy can take
            maximum=10,   # maximum value that policy can take
        )
    ),
    name="random",
)
agent.build_with_env(env)

# (2) basic interaction
obs, info = env.reset()
done = False
while not done:
    action = agent.predict_online(obs)
    obs, reward, done, truncated, info = env.step(action)
Note that while we use SCOPE-RL and d3rlpy here, BasicGym is compatible with any other library that supports the OpenAI Gym and Gymnasium interface.
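For example, a minimal interaction loop that relies only on the Gym/Gymnasium interface itself, sampling actions directly from the action space instead of using a SCOPE-RL or d3rlpy policy, might look as follows.
import gym
import basicgym

env = gym.make("BasicEnv-continuous-v0")

# interact with a uniform random policy defined by the action space itself
obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # sample a random action
    obs, reward, done, truncated, info = env.step(action)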
Customized BasicEnv#
Next, we describe how to customize the environment by instantiating it with the desired configuration.
The list of arguments is given as follows.
step_per_episode
: Number of timesteps in an episode.
state_dim
: Dimension of the state.
action_type
: Type of the action space.
n_actions
: Number of actions in the discrete action case.
action_dim
: Dimension of the action (context).
action_context
: Feature vectors that characterize each action. Applicable only when action_type is “discrete”.
reward_type
: Reward type.
reward_std
: Noise level of the reward. Applicable only when reward_type is “continuous”.
obs_std
: Noise level of the state observation.
StateTransitionFunction
: State transition function.
RewardFunction
: Expected immediate reward function.
random_state
: Random state.
Example:
from basicgym import BasicEnv
env = BasicEnv(
    state_dim=10,
    action_type="continuous",  # "discrete"
    action_dim=5,
    reward_type="continuous",  # "binary"
    reward_std=0.3,
    obs_std=0.3,
    step_per_episode=10,
    random_state=12345,
)
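For reference, a discrete-action configuration might look as follows; the particular argument values are illustrative assumptions, and action_context can additionally be supplied to characterize each action.
from basicgym import BasicEnv
env = BasicEnv(
    state_dim=10,
    action_type="discrete",
    n_actions=5,        # number of discrete actions (illustrative)
    reward_type="binary",
    obs_std=0.3,
    step_per_episode=10,
    random_state=12345,
)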
Specifically, users can define their own StateTransitionFunction and RewardFunction as follows.
Example of Custom State Transition Function:
# import basicgym modules
from basicgym import BaseStateTransitionFunction
# import other necessary modules
from dataclasses import dataclass
from typing import Optional
import numpy as np
from sklearn.utils import check_random_state

@dataclass
class CustomizedStateTransitionFunction(BaseStateTransitionFunction):
    state_dim: int
    action_dim: int
    random_state: Optional[int] = None

    def __post_init__(self):
        self.random_ = check_random_state(self.random_state)
        self.state_coef = self.random_.normal(loc=0.0, scale=1.0, size=(self.state_dim, self.state_dim))
        self.action_coef = self.random_.normal(loc=0.0, scale=1.0, size=(self.state_dim, self.action_dim))

    def step(
        self,
        state: np.ndarray,
        action: np.ndarray,
    ) -> np.ndarray:
        state = self.state_coef @ state / self.state_dim + self.action_coef @ action / self.action_dim
        state = state / np.linalg.norm(state, ord=2)
        return state
Example of Custom Reward Function:
# import basicgym modules
from basicgym import BaseRewardFunction
# import other necessary modules
from dataclasses import dataclass
from typing import Optional
import numpy as np
from sklearn.utils import check_random_state

@dataclass
class CustomizedRewardFunction(BaseRewardFunction):
    state_dim: int
    action_dim: int
    reward_type: str = "continuous"  # "binary"
    reward_std: float = 0.0
    random_state: Optional[int] = None

    def __post_init__(self):
        self.random_ = check_random_state(self.random_state)
        self.state_coef = self.random_.normal(loc=0.0, scale=1.0, size=(self.state_dim, ))
        self.action_coef = self.random_.normal(loc=0.0, scale=1.0, size=(self.action_dim, ))

    def mean_reward_function(
        self,
        state: np.ndarray,
        action: np.ndarray,
    ) -> float:
        reward = self.state_coef.T @ state / self.state_dim + self.action_coef.T @ action / self.action_dim
        return reward
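The customized classes can then be passed to BasicEnv via the StateTransitionFunction and RewardFunction arguments listed above. The following is a minimal sketch combining the two classes defined here, assuming the environment accepts the classes themselves; the remaining argument values are illustrative.
from basicgym import BasicEnv

env = BasicEnv(
    state_dim=10,
    action_type="continuous",
    action_dim=5,
    reward_type="continuous",
    reward_std=0.3,
    obs_std=0.3,
    step_per_episode=10,
    StateTransitionFunction=CustomizedStateTransitionFunction,
    RewardFunction=CustomizedRewardFunction,
    random_state=12345,
)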
Citation#
If you use our pipeline in your work, please cite our paper below.
@article{kiyohara2023towards,
    author = {Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
    title = {Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation},
    journal = {A GitHub repository},
    pages = {xxx--xxx},
    year = {2023},
}
Contact#
For any questions about the paper and pipeline, feel free to contact: hk844@cornell.edu
Contribution#
Any contributions to BasicGym are more than welcome! Please refer to CONTRIBUTING.md for general guidelines on how to contribute to the project.