rtbgym.envs.wrapper_rtb.CustomizedRTBEnv

class rtbgym.envs.wrapper_rtb.CustomizedRTBEnv(original_env, reward_predictor=None, scaler=None, action_min=0.1, action_max=10.0, action_type='discrete', n_actions=10, action_meaning=None)

Wrapper class for RTBEnv that lets decision makers customize the RL action space and the bidder.

Bases: gym.Env

Imported as: rtbgym.CustomizedRTBEnv

Note

Users can customize the following three decisions using CustomizedRTBEnv.

reward_predictor in Bidder class
We use predicted rewards to calculate the bid price as follows.

\({bid price}_{t, i} = {adjust rate}_{t} \times {predicted reward}_{t,i} ( \times const.)\)

scaler in Bidder class
The scaler defines const. in the bid price calculation as follows.

\(const. = scaler \times {standard bid price}\)

where standard_bid_price indicates the average of the standard bid price (the bid price that has approximately 50% impression probability) over all ads.

action space for agent
We transform the continuous adjust rate space \([0, \infty)\) into the agent action space. Both discrete and continuous actions are acceptable.

Note that we recommend setting the action space within [0.1, 10]. Rather than widening this range, tune the multiplier applied to the adjust rate using scaler.
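The two formulas in the note above can be combined into a small numeric sketch. The function name `bid_prices` and all the numbers are illustrative assumptions, not part of rtbgym. The sketch only shows how the adjust rate, the predicted rewards, and the scaler multiply together.

```python
def bid_prices(adjust_rate, predicted_rewards, scaler, standard_bid_price):
    """Hypothetical helper: bid price per auction within one timestep.

    bid_price[t, i] = adjust_rate[t] * predicted_reward[t, i] * const.
    const.          = scaler * standard_bid_price
    """
    const = scaler * standard_bid_price
    return [adjust_rate * reward * const for reward in predicted_rewards]


# Illustrative values only (assumptions, not rtbgym defaults).
predicted_rewards = [0.01, 0.02, 0.05]  # e.g., predicted click probabilities
prices = bid_prices(
    adjust_rate=2.0,           # the agent's action for this timestep
    predicted_rewards=predicted_rewards,
    scaler=1.0,
    standard_bid_price=100.0,  # average standard bid price over all ads
)
```

Because the scaler enters only through const., doubling scaler has the same effect on every bid as doubling the adjust rate, which is why the action space itself can stay within a narrow range.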

The Constrained Markov Decision Process (CMDP) is defined as follows:

timestep: int (> 0)

For instance, an episode may consist of 24 hourly timesteps in a day or seven daily timesteps in a week. There are (search volume, ) auctions during a timestep. Note that a single auction does NOT correspond to a timestep.

state: array-like of shape (7, )

Statistical feedback from the auctions during the timestep, including the following values.

timestep
remaining budget
impression level features at the previous timestep (budget consumption rate, cost per mille of impressions, auction winning rate, and reward)
adjust rate (i.e., RL agent action) at the previous timestep
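As a reading aid, the seven state components can be given names. The sketch below assumes the observation follows the order in which the values are listed above. That ordering, the `RTBState` class, and `unpack_state` are illustrative assumptions and not part of rtbgym; the environment's `obs_keys` attribute is the place to confirm the actual layout.

```python
from typing import NamedTuple, Sequence


class RTBState(NamedTuple):
    """Hypothetical named view of the 7-dimensional state (assumed order)."""

    timestep: float
    remaining_budget: float
    budget_consumption_rate: float   # previous timestep
    cost_per_mille_of_impression: float  # previous timestep
    winning_rate: float              # previous timestep
    reward: float                    # previous timestep
    adjust_rate: float               # agent action at the previous timestep


def unpack_state(obs: Sequence[float]) -> RTBState:
    """Convert a length-7 observation into a named tuple."""
    if len(obs) != 7:
        raise ValueError(f"expected 7 state components, got {len(obs)}")
    return RTBState(*(float(x) for x in obs))


# Illustrative observation (made-up numbers).
state = unpack_state([3, 2500.0, 0.1, 1.2, 0.35, 4, 1.0])
```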

action: {int, float, array-like of shape (1, )} (>= 0)

Adjust rate parameter used in the bid price calculation below. Note that the bid price is determined individually for each auction.

\({bid price}_{t, i} = {adjust rate}_{t} \times {predicted reward}_{t,i} ( \times {const.})\)

Both discrete and continuous actions are acceptable.

reward: int (>= 0)

Total clicks/conversions gained during the timestep.

discount_rate: int

Discount factor for cumulative reward calculation. Set discount_rate = 1 (i.e., no discount) in RTB.

constraint: int (> 0)

Total cost should not exceed the initial budget.
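The objective and constraint of the CMDP above can be sketched on a toy trajectory. The helper names and the per-timestep rewards and costs below are made-up illustrations, not rtbgym functions or outputs.

```python
def episode_return(rewards, discount_rate=1.0):
    """Cumulative reward of one episode; discount_rate = 1 means no discount."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (discount_rate ** t) * reward
    return total


def satisfies_budget(costs, initial_budget):
    """Constraint check: total cost must not exceed the initial budget."""
    return sum(costs) <= initial_budget


# Illustrative per-timestep feedback (assumed values).
rewards = [2, 0, 3, 1]         # clicks/conversions gained at each timestep
costs = [400, 250, 600, 300]   # budget consumed at each timestep

total_reward = episode_return(rewards)                       # undiscounted sum
within_budget = satisfies_budget(costs, initial_budget=3000)
```

With discount_rate = 1 the return is simply the total number of clicks/conversions over the episode, which matches the recommendation for RTB.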

Parameters:

original_env (RTBEnv) – Original RTB environment.
reward_predictor (BaseEstimator, default=None) – A machine learning model to predict the reward to determine the bidding price. If None, the ground-truth (expected) reward is used instead of the predicted one.
scaler ({int, float}, default=None (> 0)) – Scaling factor (constant value) used for bid price determination. If None, scaler is autofitted by bidder.auto_fit_scaler().
action_min (float, default=0.1 (> 0)) – Minimum value of action.
action_max (float, default=10.0 (> 0)) – Maximum value of action.
action_type ({"discrete", "continuous"}, default="discrete") – Type of the action space.
n_actions (int, default=10 (> 0)) – Number of actions. Used only when action_type="discrete".
action_meaning (ndarray of shape (n_actions, ), default=None) –
Mapping from each discrete action index to a specific action value. Used only when action_type="discrete".

If None, the values are automatically set to span [action_min, action_max] as np.logspace(-1, 1, n_actions).
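To illustrate the default discrete action grid, the sketch below builds log-spaced adjust rates in pure Python. For the defaults it produces the same values as np.logspace(-1, 1, n_actions). The function `logspace_actions` is a hypothetical helper, and generalizing the exponents to log10(action_min) and log10(action_max) is an assumption rather than confirmed rtbgym behavior.

```python
import math


def logspace_actions(action_min=0.1, action_max=10.0, n_actions=10):
    """Return n_actions log-spaced adjust rates from action_min to action_max."""
    if action_min <= 0 or action_max <= action_min or n_actions < 2:
        raise ValueError("need 0 < action_min < action_max and n_actions >= 2")
    lo, hi = math.log10(action_min), math.log10(action_max)
    step = (hi - lo) / (n_actions - 1)
    return [10 ** (lo + i * step) for i in range(n_actions)]


# Index -> adjust rate lookup, in the spirit of action_meaning.
action_meaning = logspace_actions()
action_index = 4                       # a discrete action chosen by the agent
adjust_rate = action_meaning[action_index]
```

A log-spaced grid places as many actions below an adjust rate of 1 (bid less than the scaled predicted reward) as above it, which a linear grid over [0.1, 10] would not.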

Examples

Setup:

# import necessary module from rtbgym
from rtbgym import RTBEnv, CustomizedRTBEnv
from rtbgym.policy import OnlineHead
from rtbgym.ope.online import calc_on_policy_policy_value

# import necessary module from other libraries
from sklearn.linear_model import LogisticRegression
from d3rlpy.algos import DiscreteRandomPolicy

# initialize and customize environment
env = RTBEnv(random_state=12345)
env = CustomizedRTBEnv(
    original_env=env,
    reward_predictor=LogisticRegression(),
    action_type="discrete",
)

# define (RL) agent (i.e., policy)
agent = OnlineHead(DiscreteRandomPolicy())
agent.build_with_env(env)

Interaction:

# OpenAI Gym and Gymnasium-like interaction with agent
for episode in range(1000):
    obs, info = env.reset()
    done = False

    while not done:
        action = agent.predict_online(obs)
        obs, reward, done, truncated, info = env.step(action)

Online Evaluation:

# calculate on-policy policy value
on_policy_performance = calc_on_policy_policy_value(
    env,
    agent,
    n_trajectories=100,
    random_state=12345
)

Output:

>>> on_policy_performance

11.75

References

Di Wu, Xiujun Chen, Xun Yang, Hao Wang, Qing Tan, Xiaoxun Zhang, Jian Xu, and Kun Gai. “Budget Constrained Bidding by Model-free Reinforcement Learning in Display Advertising.” 2018.

Jun Zhao, Guang Qiu, Ziyu Guan, Wei Zhao, and Xiaofei He. “Deep Reinforcement Learning for Sponsored Search Real-time Bidding.” 2018.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. “OpenAI Gym.” 2016.

Attributes:

initial_budget
obs_keys
step_per_episode

Methods

`reset`([seed])	Initialize the environment.
`step`(action)	Roll out the auctions that arise during the timestep and return feedback to the agent.

step(action)

Roll out the auctions that arise during the timestep and return feedback to the agent.

Parameters:

action ({int, float, array-like of shape (1, )} (>= 0)) – RL agent action, which indicates the adjust rate parameter used for bid price determination. Both discrete and continuous actions are acceptable.

Returns:

obs (ndarray of shape (7, )) – Statistical feedback from the auctions during the timestep. Corresponds to the RL state, which includes the following components.
- timestep
- remaining budget
- impression level features at the previous timestep (budget consumption rate, cost per mille of impressions, auction winning rate, and reward)
- adjust rate (i.e., agent action) at the previous timestep
reward (int (>= 0)) – Total clicks/conversions gained during the timestep.
done (bool) – Whether the episode has ended.
info (dict) – Additional feedback (total impressions, clicks, and conversions) that may be useful for the package users. This is unavailable to the RL agent.

Return type:

Tuple[Any]

reset(seed=None)

Initialize the environment.

Note

Remaining budget is initialized to the initial budget of an episode.

Parameters:

seed (Optional[int], default=None) – Random state.

Returns:

obs (ndarray of shape (7, )) – Statistical feedback from the auctions during the timestep. Corresponds to the RL state, which includes the following components.
- timestep
- remaining budget
- impression level features at the previous timestep (budget consumption rate, cost per mille of impressions, auction winning rate, and reward)
- adjust rate (i.e., agent action) at the previous timestep
info ((empty) dict) – Additional information that may be useful for the package users. This is unavailable to the RL agent.

Return type:

ndarray
