RECGym#
A Python-based configurative simulation environment for recommender systems
Overview#
RECGym is an open-source simulation platform for recommender systems. The simulator is particularly intended for reinforcement learning algorithms and follows the OpenAI Gym and Gymnasium interface. We design RECGym as a configurative environment so that researchers and practitioners can customize the environmental modules, including UserModel.
Note that RECGym is published as a sub-package of SCOPE-RL, which streamlines the implementation of offline reinforcement learning (offline RL) and off-policy evaluation and selection (OPE/OPS) procedures.
Basic Setting#
The objective of the recommender interaction is to maximize the cumulative reward. We often formulate this problem as a (Partially Observable) Markov Decision Process ((PO)MDP), defined by the tuple \(\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, P_r \rangle\).
state (\(s \in \mathcal{S}\)):
A vector representing user preference. The preference changes over time in an episode depending on the actions presented by the RL agent.
When the true state is unobservable, the agent receives an observation instead of the state.
action (\(a \in \mathcal{A}\)):
Index of an item to present to the user.
reward (\(r \in \mathbb{R}\)):
User engagement signal as a reward. Either binary or continuous.
Note that \(\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{P}(\mathcal{S})\) is the state transition probability where \(\mathcal{T}(s'\mid s,a)\) is the probability of observing state \(s'\) after taking action \(a\) given state \(s\). \(P_r: \mathcal{S} \times \mathcal{A} \times \mathbb{R} \rightarrow [0,1]\) is the probability distribution of the immediate reward. Given \(P_r\), \(R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}\) is the expected reward function where \(R(s,a) := \mathbb{E}_{r \sim P_r (r \mid s, a)}[r]\) is the expected reward when taking action \(a\) for state \(s\). Finally, \(\pi: \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})\) denotes a policy (i.e., agent) where \(\pi(a | s)\) is the probability of taking action \(a\) at a given state \(s\).
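Concretely, the agent's objective is to maximize the expected cumulative reward over an episode of length \(T\) (a standard way of writing the goal stated above):
\[\max_{\pi} \; \mathbb{E}_{\pi}\left[ \sum_{t=0}^{T-1} r_t \right], \quad \text{where } a_t \sim \pi(\cdot \mid s_t), \; r_t \sim P_r(\cdot \mid s_t, a_t), \; s_{t+1} \sim \mathcal{T}(\cdot \mid s_t, a_t).\]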
Supported Implementation#
Standard Environment#
RECEnv-v0
: Standard recommender environment with discrete action space.
Custom Environment#
RECEnv
: The configurative environment with discrete action space.
Configurative Modules#
UserModel
: Class to define the user model of the recommender system.
Note that users can customize the above module by following its abstract base class (BaseUserModel).
Quickstart and Configurations#
We provide example usages of the standard and customized environments.
Standard RECEnv#
Our RECEnv is available via gym.make(), following the OpenAI Gym and Gymnasium interface.
# import recgym and gym
import recgym
import gym
# (1) standard environment for discrete action space
env = gym.make('RECEnv-v0')
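As a quick sanity check (not part of the original example), the created environment exposes its spaces through the standard Gym attributes.
# inspect the discrete action space and the observation space
print(env.action_space)
print(env.observation_space)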
The basic interaction is performed using only a few lines of code as follows.
obs, info = env.reset()
done = False
while not done:
    action = agent.act(obs)
    obs, reward, done, truncated, info = env.step(action)
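If no agent is defined yet, the same loop also runs with uniformly random actions drawn directly from the environment's action space; the following is a minimal sketch using the standard Gym action_space.sample() in place of agent.act().
# minimal rollout with actions sampled uniformly from the action space
obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # uniform random action
    obs, reward, done, truncated, info = env.step(action)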
Let’s interact with the discrete-action REC environment using a uniform random policy.
# import from other libraries
from scope_rl.policy import EpsilonGreedyHead
from d3rlpy.algos import DiscreteRandomPolicy
import matplotlib.pyplot as plt

# define a uniform random agent
random_state = 12345
agent = EpsilonGreedyHead(
    base_policy=DiscreteRandomPolicy(),
    n_actions=env.n_items,
    epsilon=1.0,
    name="random",
    random_state=random_state,
)
# (2) basic interaction
obs, info = env.reset()
done = False
while not done:
    action = agent.predict_online(obs)
    obs, reward, done, truncated, info = env.step(action)
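Building on the agent and environment defined above, the following sketch rolls the random policy out for several episodes, reports its average episode reward, and plots the per-episode totals (the episode count of 10 is an arbitrary choice for illustration).
import numpy as np
import matplotlib.pyplot as plt

# roll out the random agent and record the total reward of each episode
n_episodes = 10
episode_rewards = np.zeros(n_episodes)
for i in range(n_episodes):
    obs, info = env.reset()
    done = False
    while not done:
        action = agent.predict_online(obs)
        obs, reward, done, truncated, info = env.step(action)
        episode_rewards[i] += reward

print(f"average episode reward: {episode_rewards.mean():.3f}")
plt.plot(episode_rewards)
plt.xlabel("episode")
plt.ylabel("episode reward")
plt.show()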
Note that while we use SCOPE-RL and d3rlpy here, RECGym is compatible with any other libraries that are compatible with the OpenAI Gym and Gymnasium interface.
Customized RECEnv#
Next, we describe how to customize the environment by instantiating it directly.
The list of arguments is as follows.
UserModel
: User model that defines user_preference_dynamics (e.g., [61]) and reward_function.
n_items
: Number of items used for recommendation.
n_users
: Number of users used for recommendation.
item_feature_dim
: Dimension of the item feature vectors.
user_feature_dim
: Dimension of the user feature vectors.
item_feature_vector
: Feature vectors that characterize each item.
user_feature_vector
: Feature vectors that characterize each user.
reward_type
: Reward type (i.e., continuous / binary).
reward_std
: Standard deviation of the reward distribution. Applicable only when reward_type is "continuous".
obs_std
: Standard deviation of the observation distribution.
step_per_episode
: Number of timesteps in an episode.
random_state
: Random state.
Example:
from recgym import RECEnv

env = RECEnv(
    UserModel=UserModel,
    n_items=100,  # use 100 items
    n_users=100,  # 100 users exist
    item_feature_dim=5,  # each item has a 5-dimensional feature vector
    user_feature_dim=5,  # each user has a 5-dimensional feature vector
    item_feature_vector=None,  # determine item_feature_vector from n_items and item_feature_dim inside RECEnv
    user_feature_vector=None,  # determine user_feature_vector from n_users and user_feature_dim inside RECEnv
    reward_type="continuous",  # use continuous reward
    reward_std=0.0,  # no noise on the reward
    obs_std=0.0,  # no noise on the observation
    step_per_episode=10,
    random_state=12345,
)
Specifically, users can define their own UserModel
as follows.
Example of Custom UserModel:
# import recgym modules
from recgym import BaseUserModel
from recgym.types import Action

# import other necessary modules
from dataclasses import dataclass
from typing import Optional
import numpy as np
from sklearn.utils import check_random_state

@dataclass
class UserModel(BaseUserModel):
    """Initialization."""
    reward_type: str = "continuous"  # "binary"
    reward_std: float = 0.0
    item_feature_vector: Optional[np.ndarray] = None
    random_state: Optional[int] = None

    def __post_init__(self):
        self.random_ = check_random_state(self.random_state)

    def user_preference_dynamics(
        self,
        state: np.ndarray,
        action: Action,
        alpha: float = 1.0,
    ) -> np.ndarray:
        """Function that determines the user state transition."""
        # shift the user preference toward the feature vector of the recommended item
        state = state + alpha * (state @ self.item_feature_vector[action]) * self.item_feature_vector[action]
        # normalize the state to a unit vector
        state = state / np.linalg.norm(state, ord=2)
        return state

    def reward_function(
        self,
        state: np.ndarray,
        action: Action,
    ) -> float:
        """Reward function."""
        # expected reward is the inner product of the user state and the item feature
        reward = state @ self.item_feature_vector[action]
        if self.reward_type == "continuous":
            reward = reward + self.random_.normal(loc=0.0, scale=self.reward_std)
        return reward
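The custom UserModel can then be plugged into RECEnv in the same way as the default one; the following is a minimal sketch that reuses part of the configuration from the previous example and assumes the omitted arguments fall back to their defaults.
from recgym import RECEnv

# instantiate the environment with the custom user model defined above
# (assuming the remaining arguments keep their default values)
env = RECEnv(
    UserModel=UserModel,
    n_items=100,
    n_users=100,
    item_feature_dim=5,
    user_feature_dim=5,
    reward_type="continuous",
    random_state=12345,
)
obs, info = env.reset()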
Citation#
If you use our pipeline in your work, please cite our paper below.
@article{kiyohara2021accelerating,
title={Accelerating Offline Reinforcement Learning Application in Real-Time Bidding and Recommendation: Potential Use of Simulation},
author={Kiyohara, Haruka and Kawakami, Kosuke and Saito, Yuta},
journal={arXiv preprint arXiv:2109.08331},
year={2021}
}
Contact#
For any questions about the paper and pipeline, feel free to contact: hk844@cornell.edu
Contribution#
Any contributions to RECGym are more than welcome! Please refer to CONTRIBUTING.md for general guidelines on how to contribute to the project.