rtbgym.envs.rtb.RTBEnv

class rtbgym.envs.rtb.RTBEnv(objective='conversion', cost_indicator='click', step_per_episode=7, initial_budget=3000, n_ads=100, n_users=100, ad_feature_dim=5, user_feature_dim=5, ad_feature_vector=None, user_feature_vector=None, ad_sampling_rate=None, user_sampling_rate=None, WinningPriceDistribution=<class 'rtbgym.envs.simulator.function.WinningPriceDistribution'>, ClickThroughRate=<class 'rtbgym.envs.simulator.function.ClickThroughRate'>, ConversionRate=<class 'rtbgym.envs.simulator.function.ConversionRate'>, standard_bid_price_distribution=None, minimum_standard_bid_price=None, search_volume_distribution=None, minimum_search_volume=10, random_state=None)

Real-Time Bidding (RTB) environment with which a reinforcement learning (RL) agent interacts.

Bases: gym.Env

Imported as: rtbgym.RTBEnv

Note

RTBGym follows the OpenAI Gym and Gymnasium-like interface. See the Examples below for usage. This environment uses RTBSyntheticSimulator to collect auction results.

The Constrained Markov Decision Process (CMDP) is defined as follows:

timestep: int (> 0)

A timestep can represent, for instance, one hour of a day (24 timesteps per episode) or one day of a week (7 timesteps per episode). (search volume, ) auctions take place during each timestep. Note that a single auction does NOT correspond to a timestep.

state: array-like of shape (7, )

Statistical feedback on the auctions during the timestep, consisting of the following values.

timestep
remaining budget
impression level features at the previous timestep (budget consumption rate, cost per mille of impressions, auction winning rate, and reward)
adjust rate (i.e., RL agent action) at the previous timestep
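
As a rough illustration of how these seven values could be assembled, the sketch below builds a state from one timestep's aggregate statistics. The helper `build_state` and its argument names are hypothetical, and the formulas for the budget consumption rate, cost per mille, and winning rate are plausible assumptions, not necessarily the ones RTBGym uses internally.

```python
def build_state(timestep, remaining_budget, prev_budget, cost,
                impressions, n_auctions, reward, adjust_rate):
    """Assemble a 7-dimensional state (hypothetical helper, not RTBGym API)."""
    # Assumed definition: share of the previously available budget that was spent.
    consumption_rate = cost / prev_budget if prev_budget > 0 else 0.0
    # Cost per mille: cost per 1000 impressions.
    cpm = 1000.0 * cost / impressions if impressions > 0 else 0.0
    # Assumed definition: fraction of auctions that resulted in an impression.
    winning_rate = impressions / n_auctions if n_auctions > 0 else 0.0
    return [
        float(timestep),
        float(remaining_budget),
        consumption_rate,
        cpm,
        winning_rate,
        float(reward),
        float(adjust_rate),
    ]


state = build_state(
    timestep=1, remaining_budget=2500, prev_budget=3000, cost=500,
    impressions=250, n_auctions=1000, reward=5, adjust_rate=1.2,
)
print(len(state))  # 7
```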

action: {int, float, array-like of shape (1, )} (>= 0)

Adjust rate parameter used to determine the bid price as follows. (The bid price is determined individually for each auction.)

\[{bid price}_{t, i} = {adjust rate}_{t} \times {predicted reward}_{t,i} ( \times {const.})\]

Note that you can also use predicted reward instead of the ground-truth reward in the above equation. Please also refer to CustomizedRTBEnv Wrapper.
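
The bid price equation can be sketched in a few lines of plain Python. The function `bid_prices`, the reward values, and the constant `100.0` are illustrative assumptions; in RTBGym the actual computation happens inside the Bidder.

```python
def bid_prices(adjust_rate, rewards, const=1.0):
    """bid_price[t, i] = adjust_rate[t] * reward[t, i] * const (illustrative)."""
    if adjust_rate < 0:
        raise ValueError("adjust_rate must be non-negative")
    # One bid price per auction, all sharing the timestep's adjust rate.
    return [adjust_rate * r * const for r in rewards]


prices = bid_prices(adjust_rate=2.0, rewards=[0.5, 0.25, 0.0], const=100.0)
print(prices)  # [100.0, 50.0, 0.0]
```

A single action per timestep therefore scales every auction's bid at once, while the per-auction reward estimate keeps the individual bids different.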

reward: int (>= 0)

Total clicks/conversions gained during the timestep.

discount_rate: float

Discount factor for cumulative reward calculation. Set discount_rate = 1 (i.e., no discount) in RTB.
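
To make the effect of the discount rate concrete, here is a minimal sketch of the cumulative reward calculation. The function name and the per-timestep rewards are made up for illustration.

```python
def discounted_return(rewards, discount_rate=1.0):
    """Sum of discount_rate**t * reward_t over the timesteps of an episode."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (discount_rate ** t) * r
    return total


rewards = [3, 0, 2, 5, 1, 0, 4]  # hypothetical rewards for a 7-step episode
print(discounted_return(rewards))       # 15.0
print(discounted_return(rewards, 0.5))  # 4.25
```

With discount_rate = 1 the return is simply the total number of clicks or conversions in the episode, which is the quantity of interest in RTB.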

constraint: float (> 0)

Total cost should not exceed the initial budget.
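
One way to picture the constraint is to track the remaining budget across timesteps. The helper below is a hypothetical sketch with made-up costs; it is not part of RTBGym.

```python
def remaining_budget_trace(costs, initial_budget):
    """Return the remaining budget after each timestep (hypothetical helper)."""
    remaining = initial_budget
    trace = []
    for cost in costs:
        if cost > remaining:
            # The constraint forbids spending more than what is left.
            raise ValueError("cost exceeds the remaining budget")
        remaining -= cost
        trace.append(remaining)
    return trace


trace = remaining_budget_trace([500, 1000, 700], initial_budget=3000)
print(trace)  # [2500, 1500, 800]
```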

Parameters:

objective ({"click", "conversion"}, default="conversion") – Objective outcome (i.e., reward) of the auctions.
cost_indicator ({"click", "conversion"}, default="click") – Defines when the cost arises.
step_per_episode (int, default=7 (> 0)) – Number of timesteps in an episode.
initial_budget (int, default=3000 (> 0)) – Initial budget (i.e., constraint) for an episode.
n_ads (int, default=100 (> 0)) – Number of (candidate) ads used for auction bidding.
n_users (int, default=100 (> 0)) – Number of (candidate) users used for auction bidding.
ad_feature_dim (int, default=5 (> 0)) – Dimension of the ad feature vectors.
user_feature_dim (int, default=5 (> 0)) – Dimension of the user feature vectors.
ad_feature_vector (ndarray of shape (n_ads, ad_feature_dim), default=None) – Feature vectors that characterize each ad.
user_feature_vector (ndarray of shape (n_users, user_feature_dim), default=None) – Feature vectors that characterize each user.
ad_sampling_rate (ndarray of shape (step_per_episode, n_ads), default=None) – Sampling probabilities to determine which ad (id) is used in each auction.
user_sampling_rate (ndarray of shape (step_per_episode, n_users), default=None) – Sampling probabilities to determine which user (id) is used in each auction.
WinningPriceDistribution (BaseWinningPriceDistribution) – Winning price distribution of auctions. Both class and instance are acceptable.
ClickThroughRate (BaseClickAndConversionRate) – Click through rate (i.e., click / impression). Both class and instance are acceptable.
ConversionRate (BaseClickAndConversionRate) – Conversion rate (i.e., conversion / click). Both class and instance are acceptable.
standard_bid_price_distribution (NormalDistribution, default=None) – Distribution of the bid price whose average impression probability is expected to be 0.5.
minimum_standard_bid_price (int, default=None (> 0)) – Minimum value for standard bid price. If None, minimum_standard_bid_price is set to standard_bid_price_distribution.mean / 2.
search_volume_distribution (NormalDistribution, default=None) – Search volume distribution for each timestep.
minimum_search_volume (int, default=10 (> 0)) – Minimum search volume at each timestep.
random_state (int, default=None (>= 0)) – Random state.

Examples

Setup:

# import necessary module from rtbgym
from rtbgym import RTBEnv
from scope_rl.policy import OnlineHead
from scope_rl.ope.online import calc_on_policy_policy_value

# import necessary module from other libraries
from d3rlpy.algos import RandomPolicy
from d3rlpy.preprocessing import MinMaxActionScaler

# initialize environment
env = RTBEnv(random_state=12345)

# the following commands also work
# import gym
# env = gym.make("RTBEnv-continuous-v0")

# define (RL) agent (i.e., policy)
agent = OnlineHead(
    RandomPolicy(
        action_scaler=MinMaxActionScaler(
            minimum=0.1,
            maximum=10,
        )
    ),
    name="random",
)
agent.build_with_env(env)

Interaction:

# OpenAI Gym and Gymnasium-like interaction with agent
for episode in range(1000):
    obs, info = env.reset()
    done = False

    while not done:
        action = agent.predict_online(obs)
        obs, reward, done, truncated, info = env.step(action)

Online Evaluation:

# calculate on-policy policy value
on_policy_performance = calc_on_policy_policy_value(
    env,
    agent,
    n_trajectories=100,
    random_state=12345
)

Output:

>>> on_policy_performance

13.44

References

Di Wu, Xiujun Chen, Xun Yang, Hao Wang, Qing Tan, Xiaoxun Zhang, Jian Xu, and Kun Gai. “Budget Constrained Bidding by Model-free Reinforcement Learning in Display Advertising.” 2018.

Jun Zhao, Guang Qiu, Ziyu Guan, Wei Zhao, and Xiaofei He. “Deep Reinforcement Learning for Sponsored Search Real-time Bidding.” 2018.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. “OpenAI Gym.” 2016.

Methods

`reset`([seed])	Initialize the environment.
`step`(action)	Roll out the auctions that arise during the timestep and return feedback to the agent.

step(action)

Roll out the auctions that arise during the timestep and return feedback to the agent.

Note

The rollout procedure is given as follows.

Sample ads and users for the (search volume, ) auctions that occur during the timestep. (In Simulator)
Determine the bid price. (In Bidder)
Calculate the outcome probabilities and stochastically determine the auction results. (In Simulator) The auction results include cost (i.e., second price), impression, click, and conversion.
Check whether the cumulative cost during the timestep exceeds the remaining budget. (If it does, cancel the corresponding auction results.)
Aggregate the auction results and return feedback to the RL agent.
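
The procedure above can be sketched as a toy, pure-Python rollout. This is not RTBGym's simulator: the function `rollout_timestep`, the uniform distributions, the click and conversion probabilities, and the choice to charge the cost on every impression are all simplifying assumptions made for illustration.

```python
import random


def rollout_timestep(adjust_rate, remaining_budget, search_volume=100, seed=0):
    """Toy rollout of one timestep (illustrative only, not RTBGym's simulator)."""
    rng = random.Random(seed)
    impressions = clicks = conversions = 0
    total_cost = 0.0
    for _ in range(search_volume):
        # 1. Sample an (ad, user) pair, summarized here by a reward estimate.
        reward_estimate = rng.uniform(0.0, 0.2)
        # 2. Determine the bid price from the adjust rate (assumed const. = 100).
        bid = adjust_rate * reward_estimate * 100.0
        # 3. Draw a market price and resolve the auction.
        market_price = rng.uniform(1.0, 20.0)
        if bid < market_price:
            continue  # auction lost, nothing is paid
        # 4. Cancel this result if paying would exceed the remaining budget.
        if total_cost + market_price > remaining_budget:
            continue
        total_cost += market_price  # second price: the winner pays the market price
        impressions += 1
        if rng.random() < 0.3:  # assumed click-through rate
            clicks += 1
            if rng.random() < 0.2:  # assumed conversion rate
                conversions += 1
    # 5. Aggregate the auction results into the feedback for the agent.
    return {
        "cost": total_cost,
        "impressions": impressions,
        "clicks": clicks,
        "conversions": conversions,
    }


result = rollout_timestep(adjust_rate=1.0, remaining_budget=50.0)
print(result)
```

Raising the adjust rate wins more auctions but spends the budget faster, which is the trade-off the agent has to learn.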

Parameters:

action ({int, float, array-like of shape (1, )} (>= 0)) – RL agent action which corresponds to the adjust rate parameter used for bid price calculation.

Returns:

feedbacks –

obs: ndarray of shape (7, )

Statistical feedback on the auctions during the timestep. Corresponds to the RL state, which includes the following components.

timestep

remaining budget

impression level features at the previous timestep (budget consumption rate, cost per mille of impressions, auction winning rate, and reward)

adjust rate (i.e., agent action) at the previous timestep

reward: int (>= 0)

Total clicks/conversions gained during the timestep.

done: bool

Whether the episode has ended.

info: dict

Additional feedback (total impressions, clicks, and conversions) that may be useful for package users. It is unavailable to the agent.

Return type:

Tuple

reset(seed=None)

Initialize the environment.

Note

The remaining budget is reset to the initial budget at the start of each episode.

Parameters:

seed (Optional[int], default=None) – Random state.

Returns:

obs (ndarray of shape (7, )) – Statistical feedback on the auctions during the timestep. Corresponds to the RL state, which includes the following components.
- timestep
- remaining budget
- impression level features at the previous timestep (budget consumption rate, cost per mille of impressions, auction winning rate, and reward)
- adjust rate (i.e., agent action) at the previous timestep
info ((empty) dict) – Additional information that may be useful for the package users. This is unavailable to the RL agent.

Return type:

ndarray
