rtbgym.envs.rtb.RTBEnv
- class rtbgym.envs.rtb.RTBEnv(objective='conversion', cost_indicator='click', step_per_episode=7, initial_budget=3000, n_ads=100, n_users=100, ad_feature_dim=5, user_feature_dim=5, ad_feature_vector=None, user_feature_vector=None, ad_sampling_rate=None, user_sampling_rate=None, WinningPriceDistribution=<class 'rtbgym.envs.simulator.function.WinningPriceDistribution'>, ClickThroughRate=<class 'rtbgym.envs.simulator.function.ClickThroughRate'>, ConversionRate=<class 'rtbgym.envs.simulator.function.ConversionRate'>, standard_bid_price_distribution=None, minimum_standard_bid_price=None, search_volume_distribution=None, minimum_search_volume=10, random_state=None)
A Real-Time Bidding (RTB) environment for a reinforcement learning (RL) agent to interact with.
Bases: gym.Env
Imported as: rtbgym.RTBEnv
Note
RTBGym works with OpenAI Gym and Gymnasium-like interfaces. See the Examples below for usage. This environment uses RTBSyntheticSimulator to collect auction results.
- The Constrained Markov Decision Process (CMDP) is defined as follows:
- timestep: int (> 0)
A timestep can represent, for instance, one hour of a 24-hour day or one day of a seven-day week. (search volume, ) auctions take place during each timestep. Note that a single auction does NOT correspond to a timestep.
- state: array-like of shape (7, )
Statistical feedback on the auctions during the timestep, consisting of the following values.
timestep
remaining budget
impression level features at the previous timestep (budget consumption rate, cost per mille of impressions, auction winning rate, and reward)
adjust rate (i.e., RL agent action) at the previous timestep
- action: {int, float, array-like of shape (1, )} (>= 0)
Adjust rate parameter used for determining the bid price as follows. (Bid price is individually determined for each auction.)
\[\text{bid price}_{t,i} = \text{adjust rate}_{t} \times \text{ground-truth reward}_{t,i} \, (\times \text{const.})\]
Note that you can also use the predicted reward instead of the ground-truth reward in the above equation. Please also refer to the CustomizedRTBEnv wrapper.
- reward: int (>= 0)
Total clicks/conversions gained during the timestep.
- discount_rate: float
Discount factor for cumulative reward calculation. Set discount_rate = 1 (i.e., no discount) in RTB.
- constraint: float (> 0)
Total cost should not exceed the initial budget.
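As a minimal sketch of the bid price rule in the action definition above, the following computes one bid per auction from a single adjust rate. The function name `compute_bid_prices`, the `scale` argument (standing in for the optional constant), and the sample numbers are illustrative assumptions, not part of RTBGym.

```python
from typing import List, Sequence


def compute_bid_prices(
    adjust_rate: float,
    rewards: Sequence[float],
    scale: float = 1.0,
) -> List[float]:
    """Return one bid price per auction: adjust_rate * reward * scale.

    `rewards` holds a per-auction reward estimate (ground-truth or predicted);
    `scale` is a hypothetical stand-in for the optional constant factor.
    """
    if adjust_rate < 0:
        raise ValueError("adjust_rate must be non-negative")
    return [adjust_rate * r * scale for r in rewards]


# One adjust rate per timestep, but a separate bid for each auction.
bids = compute_bid_prices(adjust_rate=2.0, rewards=[0.5, 0.25, 0.0], scale=100.0)
print(bids)
```

The agent chooses only the adjust rate; the spread of bids across auctions comes entirely from the per-auction reward estimates.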
- Parameters:
objective ({"click", "conversion"}, default="conversion") – Objective outcome (i.e., reward) of the auctions.
cost_indicator ({"click", "conversion"}, default="click") – Defines when the cost arises.
step_per_episode (int, default=7 (> 0)) – Number of timesteps in an episode.
initial_budget (int, default=3000 (> 0)) – Initial budget (i.e., constraint) for an episode.
n_ads (int, default=100 (> 0)) – Number of (candidate) ads used for auction bidding.
n_users (int, default=100 (> 0)) – Number of (candidate) users used for auction bidding.
ad_feature_dim (int, default=5 (> 0)) – Dimension of the ad feature vectors.
user_feature_dim (int, default=5 (> 0)) – Dimension of the user feature vectors.
ad_feature_vector (ndarray of shape (n_ads, ad_feature_dim), default=None) – Feature vectors that characterize each ad.
user_feature_vector (ndarray of shape (n_users, user_feature_dim), default=None) – Feature vectors that characterize each user.
ad_sampling_rate (ndarray of shape (step_per_episode, n_ads), default=None) – Sampling probabilities to determine which ad (id) is used in each auction.
user_sampling_rate (ndarray of shape (step_per_episode, n_users), default=None) – Sampling probabilities to determine which user (id) is used in each auction.
WinningPriceDistribution (BaseWinningPriceDistribution) – Winning price distribution of auctions. Both class and instance are acceptable.
ClickThroughRate (BaseClickAndConversionRate) – Click through rate (i.e., click / impression). Both class and instance are acceptable.
ConversionRate (BaseClickAndConversionRate) – Conversion rate (i.e., conversion / click). Both class and instance are acceptable.
standard_bid_price_distribution (NormalDistribution, default=None) – Distribution of the bid price whose average impression probability is expected to be 0.5.
minimum_standard_bid_price (int, default=None (> 0)) – Minimum value for the standard bid price. If None, minimum_standard_bid_price is set to standard_bid_price_distribution.mean / 2.
search_volume_distribution (NormalDistribution, default=None) – Search volume distribution for each timestep.
minimum_search_volume (int, default=10 (> 0)) – Minimum search volume at each timestep.
random_state (int, default=None (>= 0)) – Random state.
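The statement that both a class and an instance are acceptable can be read as the following pattern. This is only a sketch of one common way to support it; the helper `resolve_component` and the `ConstantClickThroughRate` class are hypothetical and do not reflect RTBGym's actual internals.

```python
from typing import Any


class ConstantClickThroughRate:
    """Hypothetical stand-in for a user-defined rate function."""

    def __init__(self, rate: float = 0.1) -> None:
        self.rate = rate


def resolve_component(component: Any, **init_kwargs: Any) -> Any:
    """Instantiate `component` if it is a class; otherwise return it unchanged."""
    if isinstance(component, type):
        return component(**init_kwargs)
    return component


# Passing the class lets the caller defer construction to the environment.
from_class = resolve_component(ConstantClickThroughRate)
# Passing an instance keeps whatever configuration the caller already chose.
from_instance = resolve_component(ConstantClickThroughRate(rate=0.3))
print(from_class.rate, from_instance.rate)
```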
Examples
Setup:
# import necessary module from rtbgym
from rtbgym import RTBEnv
from scope_rl.policy import OnlineHead
from scope_rl.ope.online import calc_on_policy_policy_value

# import necessary module from other libraries
from d3rlpy.algos import RandomPolicy
from d3rlpy.preprocessing import MinMaxActionScaler

# initialize environment
env = RTBEnv(random_state=12345)

# the following commands also work
# import gym
# env = gym.make("RTBEnv-continuous-v0")

# define (RL) agent (i.e., policy)
agent = OnlineHead(
    RandomPolicy(
        action_scaler=MinMaxActionScaler(
            minimum=0.1,
            maximum=10,
        )
    ),
    name="random",
)
agent.build_with_env(env)
Interaction:
# OpenAI Gym and Gymnasium-like interaction with agent
for episode in range(1000):
    obs, info = env.reset()
    done = False

    while not done:
        action = agent.predict_online(obs)
        obs, reward, done, truncated, info = env.step(action)
Online Evaluation:
# calculate on-policy policy value
on_policy_performance = calc_on_policy_policy_value(
    env,
    agent,
    n_trajectories=100,
    random_state=12345
)
Output:
>>> on_policy_performance
13.44
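For intuition, an on-policy policy value is typically estimated as the average episodic return over sampled trajectories. The sketch below illustrates that idea on a toy environment. `ToyEnv`, `constant_policy`, and `average_return` are hypothetical names, the toy reward rule is invented for illustration, and this is not the implementation of calc_on_policy_policy_value.

```python
import random
from typing import Callable, Dict, List, Tuple

Obs = List[float]


class ToyEnv:
    """Hypothetical stand-in with a Gymnasium-like reset/step interface."""

    def __init__(self, step_per_episode: int = 7, seed: int = 0) -> None:
        self.step_per_episode = step_per_episode
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self) -> Tuple[Obs, Dict]:
        self.t = 0
        return [float(self.t)], {}

    def step(self, action: float) -> Tuple[Obs, int, bool, bool, Dict]:
        self.t += 1
        # Toy reward: a larger action makes a unit reward more likely.
        reward = int(self.rng.random() < min(action / 10.0, 1.0))
        done = self.t >= self.step_per_episode
        return [float(self.t)], reward, done, False, {}


def average_return(
    env: ToyEnv,
    policy: Callable[[Obs], float],
    n_trajectories: int,
) -> float:
    """Average undiscounted episodic return (discount rate = 1)."""
    total = 0
    for _ in range(n_trajectories):
        obs, _ = env.reset()
        done = False
        while not done:
            obs, reward, done, _, _ = env.step(policy(obs))
            total += reward
    return total / n_trajectories


def constant_policy(obs: Obs) -> float:
    """Always return the same adjust rate, regardless of the observation."""
    return 5.0


value = average_return(ToyEnv(seed=12345), constant_policy, n_trajectories=100)
print(value)
```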
References
Di Wu, Xiujun Chen, Xun Yang, Hao Wang, Qing Tan, Xiaoxun Zhang, Jian Xu, and Kun Gai. “Budget Constrained Bidding by Model-free Reinforcement Learning in Display Advertising.” 2018.
Jun Zhao, Guang Qiu, Ziyu Guan, Wei Zhao, and Xiaofei He. “Deep Reinforcement Learning for Sponsored Search Real-time Bidding.” 2018.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. “OpenAI Gym.” 2016.
Methods
reset([seed]): Initialize the environment.
step(action): Roll out the auctions that arise during the timestep and return feedback to the agent.
- step(action)
Roll out the auctions that arise during the timestep and return feedback to the agent.
Note
The rollout procedure is as follows.
Sample ads and users for the (search volume, ) auctions that occur during the timestep. (In Simulator)
Determine the bid price. (In Bidder)
Calculate the outcome probabilities and stochastically determine the auction results. (In Simulator) The auction results include cost (i.e., second price), impression, click, and conversion.
Check whether the cumulative cost during the timestep exceeds the remaining budget. (If it does, cancel the corresponding auction results.)
Aggregate the auction results and return feedback to the RL agent.
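The five steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not RTBGym's simulator: the function `rollout_timestep`, the uniform winning-price draw, and the fixed click and conversion rates are all hypothetical, and cost is charged per click (matching the default cost_indicator='click').

```python
import random
from typing import Dict, Sequence


def rollout_timestep(
    adjust_rate: float,
    remaining_budget: float,
    search_volume: int,
    reward_estimates: Sequence[float],
    rng: random.Random,
    click_rate: float = 0.2,          # hypothetical fixed click-through rate
    conversion_rate: float = 0.1,     # hypothetical fixed conversion rate
    max_winning_price: float = 100.0  # hypothetical upper bound on market price
) -> Dict[str, float]:
    impressions = clicks = conversions = 0
    total_cost = 0.0
    for _ in range(search_volume):
        # 1. Sample an ad-user pair (represented here only by its reward estimate).
        reward_estimate = rng.choice(reward_estimates)
        # 2. Determine the bid price from the adjust rate.
        bid_price = adjust_rate * reward_estimate
        # 3. Stochastically determine the auction result.
        winning_price = rng.uniform(0.0, max_winning_price)
        impression = bid_price > winning_price
        click = impression and rng.random() < click_rate
        conversion = click and rng.random() < conversion_rate
        cost = winning_price if click else 0.0  # second price, charged per click
        # 4. Cancel the result if it would exceed the remaining budget.
        if total_cost + cost > remaining_budget:
            continue
        # 5. Aggregate the surviving results.
        impressions += int(impression)
        clicks += int(click)
        conversions += int(conversion)
        total_cost += cost
    return {
        "impressions": impressions,
        "clicks": clicks,
        "conversions": conversions,
        "cost": total_cost,
    }


feedback = rollout_timestep(
    adjust_rate=2.0,
    remaining_budget=500.0,
    search_volume=200,
    reward_estimates=[10.0, 30.0, 50.0],
    rng=random.Random(12345),
)
print(feedback)
```

A higher adjust rate wins more auctions but spends the budget sooner, which is the trade-off the agent has to learn across the timesteps of an episode.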
- Parameters:
action ({int, float, array-like of shape (1, )} (>= 0)) – RL agent action which corresponds to the adjust rate parameter used for bid price calculation.
- Returns:
feedbacks –
- obs: ndarray of shape (7, )
Statistical feedback on the auctions during the timestep. Corresponds to the RL state, which includes the following components.
timestep
remaining budget
impression level features at the previous timestep (budget consumption rate, cost per mille of impressions, auction winning rate, and reward)
adjust rate (i.e., agent action) at the previous timestep
- reward: int (>= 0)
Total clicks/conversions gained during the timestep.
- done: bool
Whether the episode has ended.
- info: dict
Additional feedback (total impressions, clicks, and conversions) that may be useful to package users. It is unavailable to the agent.
- Return type:
Tuple
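To make the seven-component observation easier to work with, it can be unpacked into named fields. The sketch below assumes the array follows the order in which the components are listed above; the `RTBState` tuple, the `unpack_obs` helper, and the sample values are hypothetical and not part of RTBGym.

```python
from typing import NamedTuple, Sequence


class RTBState(NamedTuple):
    """Named view of the 7-dimensional observation (order is an assumption)."""

    timestep: float
    remaining_budget: float
    budget_consumption_rate: float
    cost_per_mille: float
    winning_rate: float
    reward: float
    adjust_rate: float


def unpack_obs(obs: Sequence[float]) -> RTBState:
    """Convert a length-7 observation into an RTBState."""
    if len(obs) != 7:
        raise ValueError(f"expected 7 components, got {len(obs)}")
    return RTBState(*(float(x) for x in obs))


# Made-up numbers, only to show the field access.
state = unpack_obs([1, 2500.0, 0.17, 12.5, 0.4, 3, 2.0])
print(state.remaining_budget, state.winning_rate)
```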
- reset(seed=None)
Initialize the environment.
Note
Remaining budget is initialized to the initial budget of an episode.
- Parameters:
seed (Optional[int], default=None) – Random state.
- Returns:
obs (ndarray of shape (7, )) – Statistical feedback on the auctions during the timestep. Corresponds to the RL state, which includes the following components.
timestep
remaining budget
impression level features at the previous timestep (budget consumption rate, cost per mille of impressions, auction winning rate, and reward)
adjust rate (i.e., agent action) at the previous timestep
info ((empty) dict) – Additional information that may be useful for the package users. This is unavailable to the RL agent.
- Return type:
Tuple