RECGym#

A Python-based configurative simulation environment for recommender systems

Overview#

RECGym is an open-source simulation platform for recommender systems. The simulator is particularly intended for reinforcement learning algorithms and follows the OpenAI Gym and Gymnasium interface. We design RECGym as a configurative environment so that researchers and practitioners can customize the environmental modules, including UserModel.

Note that RECGym is published as a sub-package of SCOPE-RL, which streamlines the implementation of offline reinforcement learning (offline RL) and off-policy evaluation and selection (OPE/OPS) procedures.

Basic Setting#

The objective of recommender interactions is to maximize reward. We formulate this problem as the following (Partially Observable) Markov Decision Process ((PO)MDP) \(\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, P_r \rangle\):

  • state (\(s \in \mathcal{S}\)):

    • A vector representing user preference. The preference changes over time in an episode depending on the actions presented by the RL agent.

    • When the true state is unobservable, the agent receives an observation instead of the state.

  • action (\(a \in \mathcal{A}\)): Index of an item to present to the user.

  • reward (\(r \in \mathbb{R}\)): User engagement signal used as the reward. Either binary or continuous.

Note that \(\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{P}(\mathcal{S})\) is the state transition probability where \(\mathcal{T}(s'\mid s,a)\) is the probability of observing state \(s'\) after taking action \(a\) given state \(s\). \(P_r: \mathcal{S} \times \mathcal{A} \times \mathbb{R} \rightarrow [0,1]\) is the probability distribution of the immediate reward. Given \(P_r\), \(R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}\) is the expected reward function where \(R(s,a) := \mathbb{E}_{r \sim P_r (r \mid s, a)}[r]\) is the expected reward when taking action \(a\) for state \(s\). Finally, \(\pi: \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})\) denotes a policy (i.e., agent) where \(\pi(a | s)\) is the probability of taking action \(a\) at a given state \(s\).
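
Under this formulation, the agent aims to learn a policy that maximizes the expected cumulative reward over an episode. For reference, using a finite horizon \(T\) (corresponding to step_per_episode) and an optional discount factor \(\gamma \in (0, 1]\), which are our notation here rather than part of the tuple above, the objective can be written as

\(J(\pi) := \mathbb{E}_{\pi} \left[ \sum_{t=0}^{T-1} \gamma^t r_t \right],\)

where the expectation is taken over the trajectory distribution induced by \(\pi\), \(\mathcal{T}\), and \(P_r\).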

Supported Implementation#

Standard Environment#

  • RECEnv-v0: Standard recommender environment with discrete action space.

Custom Environment#

  • RECEnv: The configurative environment with discrete action space.

Configurative Modules#

  • UserModel: Class to define the user model of the recommender system.

Note that users can customize the above modules by implementing the corresponding abstract base class.
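
For instance, a custom user model only needs to subclass BaseUserModel and implement the two methods used by the environment. The skeleton below is a minimal sketch; it assumes the abstract methods are exactly those appearing in the full example later on this page (user_preference_dynamics and reward_function).

# minimal skeleton of a custom user model (see the full example in the Quickstart below)
from dataclasses import dataclass
import numpy as np

from recgym import BaseUserModel
from recgym.types import Action


@dataclass
class MyUserModel(BaseUserModel):
    """Skeleton of a custom user model."""

    def user_preference_dynamics(self, state: np.ndarray, action: Action) -> np.ndarray:
        # return the next user state after recommending `action`
        ...

    def reward_function(self, state: np.ndarray, action: Action) -> float:
        # return the reward observed when recommending `action` in `state`
        ...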

Quickstart and Configurations#

We provide example usages of both the standard and the customized environment.

Standard RECEnv#

Our RECEnv is available from gym.make(), following the OpenAI Gym and Gymnasium interface.

# import recgym and gym
import recgym
import gym

# (1) standard environment for discrete action space
env = gym.make('RECEnv-v0')

The basic interaction is performed with only a few lines of code, as follows.

obs, info = env.reset()
done = False

while not done:
    action = agent.act(obs)
    obs, reward, done, truncated, info = env.step(action)

Let’s run a uniform random policy in the discrete-action REC environment.

# import from other libraries
from scope_rl.policy import EpsilonGreedyHead
from d3rlpy.algos import RandomPolicy as DiscreteRandomPolicy
import matplotlib.pyplot as plt

# define a random agent
random_state = 12345
agent = EpsilonGreedyHead(
    base_policy=DiscreteRandomPolicy(),
    n_actions=env.n_items,
    epsilon=1.0,
    name='random',
    random_state=random_state,
)

# (2) basic interaction
obs, info = env.reset()
done = False

while not done:
    action = agent.predict_online(obs)
    obs, reward, done, truncated, info = env.step(action)

Note that while we use SCOPE-RL and d3rlpy here, RECGym is compatible with any other libraries that are compatible with the OpenAI Gym and Gymnasium interface.
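
Because RECEnv follows the standard interface, the same interaction loop also runs without any agent library at all. For example, one can sample actions uniformly at random directly from the environment's action space (a minimal sketch using only the Gym/Gymnasium API):

# interaction with uniformly random actions sampled from the action space
obs, info = env.reset()
done = False

while not done:
    action = env.action_space.sample()  # random item index
    obs, reward, done, truncated, info = env.step(action)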

Customized RECEnv#

Next, we describe how to customize the environment by instantiating it directly with your own configuration.

The list of arguments is given as follows.

  • UserModel: User model which defines user_preference_dynamics (e.g., [61]) and reward_function.

  • n_items: Number of items used for recommendation.

  • n_users: Number of users used for recommendation.

  • item_feature_dim: Dimensions of the item feature vectors.

  • user_feature_dim: Dimensions of the user feature vectors.

  • item_feature_vector: Feature vectors that characterize each item.

  • user_feature_vector: Feature vectors that characterize each user.

  • reward_type: Reward type (i.e., continuous / binary).

  • reward_std: Standard deviation of the reward distribution. Applicable only when reward_type is “continuous”.

  • obs_std: Standard deviation of the observation distribution.

  • step_per_episode: Number of timesteps in an episode.

  • random_state: Random state to ensure reproducibility.

Example:

from recgym import RECEnv
env = RECEnv(
    UserModel=UserModel,
    n_items=100,  # use 100 items
    n_users=100,  # 100 users exist
    item_feature_dim=5,  # each item has a 5-dimensional feature vector
    user_feature_dim=5,  # each user has a 5-dimensional feature vector
    item_feature_vector=None,  # determine item_feature_vector from n_items and item_feature_dim in RECEnv
    user_feature_vector=None,  # determine user_feature_vector from n_users and user_feature_dim in RECEnv
    reward_type="continuous",  # use continuous reward
    reward_std=0.0,
    obs_std=0.0,  # add no noise to the observation
    step_per_episode=10,
    random_state=12345,
)
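
As a quick sanity check, the configured environment can be inspected and rolled out just like the standard one. The following is a minimal sketch; it assumes the environment exposes the standard observation_space and action_space attributes of the Gym interface.

# inspect the configured environment
print(env.observation_space)  # user preference (observation) space
print(env.action_space)       # expected to be Discrete(n_items) for the discrete action space

# one step of interaction
obs, info = env.reset()
obs, reward, done, truncated, info = env.step(env.action_space.sample())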

Specifically, users can define their own UserModel as follows.

Example of Custom UserModel:

# import recgym modules
from recgym import BaseUserModel
from recgym.types import Action
# import other necessary modules
from dataclasses import dataclass
from typing import Optional
import numpy as np
from sklearn.utils import check_random_state

@dataclass
class UserModel(BaseUserModel):
    """Initialization."""
    reward_type: str = "continuous"  # "binary"
    reward_std: float = 0.0
    item_feature_vector: Optional[np.ndarray] = None
    random_state: Optional[int] = None

    def __post_init__(self):
        self.random_ = check_random_state(self.random_state)

    def user_preference_dynamics(
        self,
        state: np.ndarray,
        action: Action,
        alpha: float = 1.0,
    ) -> np.ndarray:
        """Function that determines the user state transition.
        """
        state = (state + alpha * state @ self.item_feature_vector[action] * self.item_feature_vector[action])
        state = state / np.linalg.norm(state, ord=2)
        return state

    def reward_function(
        self,
        state: np.ndarray,
        action: Action,
    ) -> float:
        """Reward function.
        """
        reward = state @ self.item_feature_vector[action]
        if self.reward_type == "continuous":
            reward = reward + self.random_.normal(loc=0.0, scale=self.reward_std)
        return reward
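
The custom UserModel defined above can then be passed to RECEnv exactly as in the configuration example earlier in this section, for example:

# plug the custom user model into the configurative environment
env = RECEnv(
    UserModel=UserModel,  # the custom model defined above
    n_items=100,
    n_users=100,
    item_feature_dim=5,
    user_feature_dim=5,
    item_feature_vector=None,
    user_feature_vector=None,
    reward_type="continuous",
    reward_std=0.0,
    obs_std=0.0,
    step_per_episode=10,
    random_state=12345,
)
obs, info = env.reset()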

Citation#

If you use our pipeline in your work, please cite our paper below.

Haruka Kiyohara, Kosuke Kawakami, Yuta Saito.
Accelerating Offline Reinforcement Learning Application in Real-Time Bidding and Recommendation: Potential Use of Simulation [arXiv]
@article{kiyohara2021accelerating,
    title={Accelerating Offline Reinforcement Learning Application in Real-Time Bidding and Recommendation: Potential Use of Simulation},
    author={Kiyohara, Haruka and Kawakami, Kosuke and Saito, Yuta},
    journal={arXiv preprint arXiv:2109.08331},
    year={2021}
}

Contact#

For any questions about the paper and pipeline, feel free to contact: hk844@cornell.edu

Contribution#

Any contributions to RECGym are more than welcome! Please refer to CONTRIBUTING.md for general guidelines on how to contribute to the project.
