recgym.envs.rec
Reinforcement Learning (RL) Environment for Recommender System (REC).
Classes
RECEnv – Class for a recommender system (REC) environment that a reinforcement learning (RL) agent interacts with.
- class recgym.envs.rec.RECEnv(step_per_episode=10, n_items=5, n_users=100, item_feature_dim=5, user_feature_dim=5, item_feature_vector=None, user_feature_vector=None, reward_type='continuous', reward_std=0.0, obs_std=0.0, UserModel=<class 'recgym.envs.simulator.function.UserModel'>, random_state=None)
Class for a recommender system (REC) environment that a reinforcement learning (RL) agent interacts with.
Bases: gym.Env
Imported as: recgym.RECEnv
Note
RECGym works with OpenAI Gym and Gymnasium-like interfaces. See the Examples below for usage.
- The (Partially Observable) Markov Decision Process ((PO)MDP) is defined as follows:
- state: array-like of shape (user_feature_dim, )
A vector representing user preference. The preference changes over time within an episode depending on the items presented by the RL agent. When the true state is unobservable, an observation is returned to the RL agent instead of the state.
- action: int (>= 0)
Indicating which item to present to the user.
- reward: bool or float
User engagement signal as a reward. Either binary or continuous.
- Parameters:
step_per_episode (int, default=10 (> 0)) – Number of timesteps in an episode.
n_items (int, default=5 (> 0)) – Number of items used in the recommender system.
n_users (int, default=100 (> 0)) – Number of users used in the recommender system.
item_feature_dim (int, default=5 (> 0)) – Dimension of the item feature vectors.
user_feature_dim (int, default=5 (> 0)) – Dimension of the user feature vectors.
item_feature_vector (array-like of shape (n_items, item_feature_dim), default=None) – Feature vectors that characterize each item.
user_feature_vector (array-like of shape (n_users, user_feature_dim), default=None) – Feature vectors that characterize each user.
reward_type ({"continuous", "binary"}, default="continuous") – Reward type.
reward_std (float, default=0.0 (>=0)) – Noise level of the reward. Applicable only when reward_type is “continuous”.
obs_std (float, default=0.0 (>=0)) – Noise level of the state observation.
UserModel (BaseUserModel, default=UserModel) – User model that defines user_preference_dynamics (which simulates how the user preference changes through item interaction) and reward_function (which simulates how the user responds to the presented item). Both a class and an instance are acceptable.
random_state (int, default=None (>= 0)) – Random state.
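To make the role of the UserModel parameter concrete, the following is a minimal, standalone sketch of an object exposing the two methods named above. ToyUserModel, its method signatures, the inner-product reward, and the drift-and-normalize update are all illustrative assumptions. They are not the RECGym implementation, and the class does not subclass BaseUserModel. The sketch also assumes item_feature_dim equals user_feature_dim so that an inner product is defined.

```python
import math
import random


class ToyUserModel:
    """Hypothetical stand-in for a user model (not recgym's UserModel)."""

    def __init__(self, reward_type="continuous", reward_std=0.0, seed=None):
        if reward_type not in ("continuous", "binary"):
            raise ValueError("reward_type must be 'continuous' or 'binary'")
        self.reward_type = reward_type
        self.reward_std = reward_std
        self.rng = random.Random(seed)

    def user_preference_dynamics(self, state, action, item_feature_vector, alpha=0.5):
        """Assumed dynamics: drift toward the presented item, rescale to unit norm."""
        item = item_feature_vector[action]
        moved = [s + alpha * x for s, x in zip(state, item)]
        norm = math.sqrt(sum(v * v for v in moved)) or 1.0
        return [v / norm for v in moved]

    def reward_function(self, state, action, item_feature_vector):
        """Assumed response: inner product of preference and item features."""
        item = item_feature_vector[action]
        score = sum(s * x for s, x in zip(state, item))
        if self.reward_type == "binary":
            # Squash the score into a probability and draw a Bernoulli outcome.
            prob = 1.0 / (1.0 + math.exp(-score))
            return self.rng.random() < prob
        # Continuous reward: the score plus Gaussian noise of scale reward_std.
        return score + self.rng.gauss(0.0, self.reward_std)


items = [[1.0, 0.0], [0.0, 1.0]]  # shape (n_items, item_feature_dim)
state = [1.0, 0.0]                # shape (user_feature_dim,)
model = ToyUserModel(seed=0)
reward = model.reward_function(state, action=0, item_feature_vector=items)
next_state = model.user_preference_dynamics(state, action=1, item_feature_vector=items)
```

In this toy setup, presenting item 1 pulls the preference vector toward that item's features, so the same item would score higher on the next step.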
Examples
Setup:
# import necessary modules from recgym and scope_rl
from recgym import RECEnv
from scope_rl.policy import OnlineHead
from scope_rl.ope.online import calc_on_policy_policy_value

# import necessary modules from other libraries
from d3rlpy.algos import DiscreteRandomPolicy

# initialize the environment
env = RECEnv(random_state=12345)

# the following commands also work
# import gym
# env = gym.make("RECEnv-v0")

# define the (RL) agent (i.e., policy)
agent = OnlineHead(
    DiscreteRandomPolicy(),
    name="random",
)
agent.build_with_env(env)
Interaction:
# OpenAI Gym and Gymnasium-like interaction with the agent
for episode in range(1000):
    obs, info = env.reset()
    done = False

    while not done:
        action = agent.sample_action_online(obs)
        obs, reward, done, truncated, info = env.step(action)
Online Evaluation:
# calculate on-policy policy value
on_policy_performance = calc_on_policy_policy_value(
    env,
    agent,
    n_trajectories=100,
    random_state=12345
)
Output:
>>> on_policy_performance
-0.022
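As a rough illustration of what an on-policy value estimate measures, the sketch below averages the undiscounted episodic return over several rollouts. StubEnv and mean_episodic_return are hypothetical names introduced here. This is not the scope_rl implementation of calc_on_policy_policy_value, which may differ, for example in how it handles discounting or seeding.

```python
class StubEnv:
    """Hypothetical Gymnasium-like environment, used only to exercise the helper."""

    def __init__(self, step_per_episode=3):
        self.step_per_episode = step_per_episode
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return [0.0], {}

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 0 else 0.0  # only item 0 is rewarded
        done = self.t >= self.step_per_episode
        return [float(self.t)], reward, done, False, {}


def mean_episodic_return(env, policy, n_trajectories=100):
    """Monte Carlo estimate: average undiscounted return over rollouts."""
    total = 0.0
    for _ in range(n_trajectories):
        obs, info = env.reset()
        done = truncated = False
        while not (done or truncated):
            action = policy(obs)
            obs, reward, done, truncated, info = env.step(action)
            total += reward
    return total / n_trajectories


# A policy that always presents item 0 collects reward 1.0 at each of 3 steps.
value = mean_episodic_return(StubEnv(), policy=lambda obs: 0, n_trajectories=5)
```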
References
David Rohde, Stephen Bonner, Travis Dunlop, Flavian Vasile, Alexandros Karatzoglou. “RecoGym: A Reinforcement Learning Environment for the Problem of Product Recommendation in Online Advertising.” 2018.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. “OpenAI Gym.” 2016.
Methods
reset([seed]) – Initialize the environment.
step(action) – Simulate a recommender interaction with a user.
- step(action)
Simulate a recommender interaction with a user.
Note
The simulation procedure is as follows.
1. Sample reward (i.e., feedback on user engagement) for the given item.
2. Update user state with user_preference_dynamics.
3. Return the user feedback to the RL agent.
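The three steps of the procedure can be sketched as a single standalone function. toy_step, the inner-product reward, the linear drift update, and the additive Gaussian noise on rewards and observations are assumptions made for illustration. The actual behavior is determined by the configured UserModel and by RECEnv itself.

```python
import random


def toy_step(state, action, items, t, step_per_episode=10,
             reward_std=0.0, obs_std=0.0, alpha=0.5, rng=None):
    """Hypothetical one-step simulation mirroring the documented procedure."""
    if rng is None:
        rng = random.Random()
    item = items[action]

    # Step 1: sample the reward for the presented item
    # (assumed: inner product plus Gaussian noise of scale reward_std).
    reward = sum(s * x for s, x in zip(state, item)) + rng.gauss(0.0, reward_std)

    # Step 2: update the user state (assumed: linear drift toward the item).
    next_state = [(1 - alpha) * s + alpha * x for s, x in zip(state, item)]

    # Step 3: return feedback; the observation is the new state plus noise
    # of scale obs_std, and the true state is exposed only through info.
    obs = [s + rng.gauss(0.0, obs_std) for s in next_state]
    done = (t + 1) >= step_per_episode
    truncated = False
    info = {"state": next_state}
    return obs, reward, done, truncated, info


items = [[1.0, 0.0], [0.0, 1.0]]
obs, reward, done, truncated, info = toy_step(
    [1.0, 0.0], action=1, items=items, t=9, rng=random.Random(0)
)
```

With obs_std set to 0.0 the observation coincides with the true state. A positive obs_std gives the partially observable case described in the (PO)MDP definition.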
- Parameters:
action (int or array-like of shape (1, )) – Indicating which item to present to the user.
- Returns:
feedbacks –
- obs: ndarray of shape (user_feature_dim,)
A vector representing user preference. The preference changes over time in an episode depending on the actions presented by the RL agent. When the true state is unobservable, the agent uses observations instead of the state.
- reward: float
User engagement signal as a reward. Either binary or continuous.
- done: bool
Whether the episode has ended.
- truncated: False
Always False. Returned for consistency with the Gymnasium-like API.
- info: dict
Additional feedback (user_id, state) that may be useful for package users. It is unavailable to the agent.
- Return type:
Tuple
- reset(seed=None)
Initialize the environment.
- Returns:
obs (ndarray of shape (user_feature_dim,)) – A vector representing user preference. The preference changes over time in an episode depending on the actions presented by the RL agent. When the true state is unobservable, the agent uses observations instead of the state.
info (dict) – Additional feedback (user_id, state) that may be useful for package users. It is unavailable to the agent.
- Return type: