recgym.envs.rec

Reinforcement Learning (RL) Environment for Recommender System (REC).

Classes

RECEnv

Recommender system (REC) environment for a reinforcement learning (RL) agent to interact with.

class recgym.envs.rec.RECEnv(step_per_episode=10, n_items=5, n_users=100, item_feature_dim=5, user_feature_dim=5, item_feature_vector=None, user_feature_vector=None, reward_type='continuous', reward_std=0.0, obs_std=0.0, UserModel=<class 'recgym.envs.simulator.function.UserModel'>, random_state=None)

Recommender system (REC) environment for a reinforcement learning (RL) agent to interact with.

Bases: gym.Env

Imported as: recgym.RECEnv

Note

RECGym follows the OpenAI Gym and Gymnasium-like interface. See Examples below for usage.

The (Partially Observable) Markov Decision Process ((PO)MDP) is defined as follows:

state: array-like of shape (user_feature_dim, ): A vector representing user preference. The preference changes over time in an episode depending on the actions presented by the RL agent. When the true state is unobservable, observation will be returned to the RL agent instead of state.
action: int (>= 0): Indicating which item to present to the user.
reward: bool or float: User engagement signal as a reward. Either binary or continuous.
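The relationship between state and observation can be sketched as below. This is a minimal illustration, not the RECGym implementation: the helper name `observe` is hypothetical, plain lists stand in for arrays to keep the sketch dependency-free, and the noise is assumed to be zero-mean Gaussian with standard deviation `obs_std`.

```python
import random


def observe(state, obs_std=0.0, rng=None):
    """Return a (possibly noisy) copy of the state. Hypothetical helper."""
    rng = rng or random.Random()
    # Assumption: independent zero-mean Gaussian noise on each dimension.
    return [s + rng.gauss(0.0, obs_std) for s in state]


state = [0.6, 0.8]

# With obs_std=0.0 the observation equals the true state (MDP case).
exact_obs = observe(state)

# With obs_std > 0 the agent only sees a perturbed state (POMDP case).
noisy_obs = observe(state, obs_std=0.1, rng=random.Random(12345))
```

Under this reading, `obs_std=0.0` corresponds to a fully observable setting, and any positive value makes the problem partially observable.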

Parameters:

step_per_episode (int, default=10 (> 0)) – Number of timesteps in an episode.
n_items (int, default=5 (> 0)) – Number of items used in the recommender system.
n_users (int, default=100 (> 0)) – Number of users used in the recommender system.
item_feature_dim (int, default=5 (> 0)) – Dimension of the item feature vectors.
user_feature_dim (int, default=5 (> 0)) – Dimension of the user feature vectors.
item_feature_vector (array-like of shape (n_items, item_feature_dim), default=None) – Feature vectors that characterize each item.
user_feature_vector (array-like of shape (n_users, user_feature_dim), default=None) – Feature vectors that characterize each user.
reward_type ({"continuous", "binary"}, default="continuous") – Reward type.
reward_std (float, default=0.0 (>=0)) – Noise level of the reward. Applicable only when reward_type is “continuous”.
obs_std (float, default=0.0 (>=0)) – Noise level of the state observation.
UserModel (BaseUserModel, default=UserModel) – User model that defines user_preference_dynamics (which simulates how the user preference changes through item interaction) and reward_function (which simulates how the user responds to the presented item). Both a class and an instance are acceptable.
random_state (int, default=None (>= 0)) – Random state.
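To make the two roles of a user model concrete, here is a hedged, dependency-free sketch. `ToyUserModel` is a hypothetical stand-in: it does not subclass `BaseUserModel`, its method signatures are assumptions rather than the library's, and the inner-product reward and normalized preference update are illustrative choices only.

```python
import math
import random


class ToyUserModel:
    """Hypothetical user model sketch; not the RECGym implementation."""

    def __init__(self, reward_type="continuous", reward_std=0.0, seed=None):
        if reward_type not in ("continuous", "binary"):
            raise ValueError("reward_type must be 'continuous' or 'binary'")
        self.reward_type = reward_type
        self.reward_std = reward_std
        self.rng = random.Random(seed)

    @staticmethod
    def _affinity(state, item):
        # Inner product between user preference and item features.
        return sum(s * x for s, x in zip(state, item))

    def user_preference_dynamics(self, state, item, alpha=1.0):
        # Shift the preference along the item vector in proportion to the
        # current affinity, then rescale to unit length.
        affinity = self._affinity(state, item)
        shifted = [s + alpha * affinity * x for s, x in zip(state, item)]
        norm = math.sqrt(sum(v * v for v in shifted))
        return [v / norm for v in shifted] if norm > 0 else list(state)

    def reward_function(self, state, item):
        affinity = self._affinity(state, item)
        if self.reward_type == "continuous":
            # reward_std only affects the continuous reward.
            return affinity + self.rng.gauss(0.0, self.reward_std)
        # Binary reward: Bernoulli draw with a sigmoid click probability.
        prob = 1.0 / (1.0 + math.exp(-affinity))
        return int(self.rng.random() < prob)


model = ToyUserModel(seed=0)
state, item = [1.0, 0.0], [0.6, 0.8]
reward = model.reward_function(state, item)
next_state = model.user_preference_dynamics(state, item)
```

A real custom model would presumably subclass `BaseUserModel` and match its method signatures; consult the library source before relying on the shapes used here.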

Examples

Setup:

# import necessary modules from recgym and scope_rl
from recgym import RECEnv
from scope_rl.policy import OnlineHead
from scope_rl.ope.online import calc_on_policy_policy_value

# import necessary modules from other libraries
from d3rlpy.algos import DiscreteRandomPolicy

# initialize environment
env = RECEnv(random_state=12345)

# the following commands also work
# import gym
# env = gym.make("RECEnv-v0")

# define (RL) agent (i.e., policy)
agent = OnlineHead(
    DiscreteRandomPolicy(),
    name="random",
)
agent.build_with_env(env)

Interaction:

# OpenAI Gym and Gymnasium-like interaction with the agent
for episode in range(1000):
    obs, info = env.reset()
    done = False
    while not done:
        action = agent.sample_action_online(obs)
        obs, reward, done, truncated, info = env.step(action)

Online Evaluation:

# calculate on-policy policy value
on_policy_performance = calc_on_policy_policy_value(
    env,
    agent,
    n_trajectories=100,
    random_state=12345
)

Output:

>>> on_policy_performance

-0.022
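For intuition about what an on-policy policy value measures, the sketch below averages the (optionally discounted) episode return over several rollouts. It is an illustration under stated assumptions, not the `scope_rl` implementation: `rollout_policy_value`, its `gamma` argument, and the stub `ConstantRewardEnv` are all hypothetical names introduced here.

```python
def rollout_policy_value(env, policy, n_trajectories=100, gamma=1.0):
    """Average discounted return of `policy` over `n_trajectories` episodes."""
    total = 0.0
    for _ in range(n_trajectories):
        obs, info = env.reset()
        done, discount, episode_return = False, 1.0, 0.0
        while not done:
            action = policy(obs)
            obs, reward, done, truncated, info = env.step(action)
            episode_return += discount * reward
            discount *= gamma
            done = done or truncated
        total += episode_return
    return total / n_trajectories


class ConstantRewardEnv:
    """Hypothetical stub with the same reset/step return shapes as above."""

    def __init__(self, step_per_episode=10, reward=0.5):
        self.step_per_episode = step_per_episode
        self.reward = reward
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return [0.0], {}

    def step(self, action):
        self.t += 1
        done = self.t >= self.step_per_episode
        return [0.0], self.reward, done, False, {}


# Ten steps with reward 0.5 each give an undiscounted return of 5.0.
value = rollout_policy_value(ConstantRewardEnv(), policy=lambda obs: 0, n_trajectories=5)
```

The same function works with any environment that exposes the five-value `step` and two-value `reset` signatures shown in the Interaction example.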

References

David Rohde, Stephen Bonner, Travis Dunlop, Flavian Vasile, Alexandros Karatzoglou. “RecoGym: A Reinforcement Learning Environment for the Problem of Product Recommendation in Online Advertising.” 2018.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. “OpenAI Gym.” 2016.

Methods

`reset`([seed])	Initialize the environment.
`step`(action)	Simulate a recommender interaction with a user.

step(action)

Simulate a recommender interaction with a user.

Note

The simulation procedure is given as follows.

Sample reward (i.e., feedback on user engagement) for the given item.
Update user state with user_preference_dynamics.
Return the user feedback to the RL agent.
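The three steps above can be sketched as a single function. This is a simplified illustration under assumptions, not the library's code: `simulate_step` and its arguments are hypothetical, the reward is taken to be the inner product of state and item features, and the preference update is an illustrative normalized shift.

```python
import math
import random


def simulate_step(state, action, item_features, t, step_per_episode, rng,
                  reward_std=0.0):
    """One simulated interaction; returns (obs, reward, done, truncated, info)."""
    item = item_features[action]
    affinity = sum(s * x for s, x in zip(state, item))

    # 1. Sample the reward (user engagement) for the presented item.
    reward = affinity + rng.gauss(0.0, reward_std)

    # 2. Update the user state with an assumed preference dynamics.
    shifted = [s + affinity * x for s, x in zip(state, item)]
    norm = math.sqrt(sum(v * v for v in shifted)) or 1.0
    next_state = [v / norm for v in shifted]

    # 3. Return the feedback; the true state is exposed only through info.
    done = (t + 1) >= step_per_episode
    return list(next_state), reward, done, False, {"state": next_state}


rng = random.Random(12345)
items = [[1.0, 0.0], [0.0, 1.0]]
obs, reward, done, truncated, info = simulate_step(
    state=[0.6, 0.8], action=0, item_features=items,
    t=9, step_per_episode=10, rng=rng,
)
```

Note the ordering: the reward is computed from the state before the update, so the agent is rewarded for how well the item matched the preference at presentation time.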

Parameters:

action (int or array-like of shape (1, )) – Indicating which item to present to the user.

Returns:

feedbacks –

obs: ndarray of shape (user_feature_dim,): A vector representing user preference. The preference changes over time in an episode depending on the actions presented by the RL agent. When the true state is unobservable, the agent uses observations instead of the state.
reward: float: User engagement signal as a reward. Either binary or continuous.
done: bool: Whether the episode has ended.
truncated: bool: Always False; included for API consistency.
info: dict: Additional feedback (user_id, state) that may be useful for package users. This is unavailable to the agent.

Return type:

Tuple

reset(seed=None)

Initialize the environment.

Returns:

obs (ndarray of shape (user_feature_dim,)) – A vector representing user preference. The preference changes over time in an episode depending on the actions presented by the RL agent. When the true state is unobservable, the agent uses observations instead of the state.
info (dict) – Additional feedback (user_id, state) that may be useful for package users. This is unavailable to the agent.

Return type:

Tuple