rtbgym.envs.wrapper_rtb.CustomizedRTBEnv

class rtbgym.envs.wrapper_rtb.CustomizedRTBEnv(original_env, reward_predictor=None, scaler=None, action_min=0.1, action_max=10.0, action_type='discrete', n_actions=10, action_meaning=None)

Wrapper class for RTBEnv that lets decision makers customize the RL action space and the bidder.

Bases: gym.Env

Imported as: rtbgym.CustomizedRTBEnv

Note

Users can customize the following three decision-making components using CustomizedRTBEnv.
  • reward_predictor in Bidder class

    We use predicted rewards to calculate bid price as follows.

    \({bid price}_{t, i} = {adjust rate}_{t} \times {predicted reward}_{t,i} ( \times const.)\)

  • scaler in Bidder class

    Scaler defines const. in the bid price calculation as follows.

    \(const. = scaler \times {standard bid price}\)

    where standard_bid_price is the average, over all ads, of the standard bid price (the bid price that has approximately a 50% impression probability).

  • action space for agent

    We transform the continuous adjust rate space \([0, \infty)\) into the agent action space. Both discrete and continuous actions are acceptable.

    We recommend keeping the action space within [0.1, 10]. To shift the overall range of bid prices, tune the multiplier applied to the adjust rate through scaler instead.
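The two formulas in the note above compose into a single per-auction bid price. The following is a minimal sketch of that arithmetic only, not the library's Bidder implementation; the function name compute_bid_prices and all numeric values are hypothetical and chosen for illustration.

```python
import numpy as np


def compute_bid_prices(adjust_rate, predicted_rewards, scaler, standard_bid_price):
    """Hypothetical helper: bid price = adjust rate * predicted reward * const.

    const. = scaler * standard_bid_price, as defined in the note above.
    """
    const = scaler * standard_bid_price
    return adjust_rate * np.asarray(predicted_rewards, dtype=float) * const


# Illustrative values (assumptions, not taken from the library).
predicted_rewards = np.array([0.01, 0.05, 0.20])  # one entry per auction i
bids = compute_bid_prices(
    adjust_rate=2.0,           # agent action at timestep t
    predicted_rewards=predicted_rewards,
    scaler=1.0,
    standard_bid_price=100.0,  # average standard bid price over all ads
)
```

Because the adjust rate and scaler enter the product symmetrically, rescaling scaler shifts every bid by the same factor, which is why the action range itself can stay within [0.1, 10].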

The Constrained Markov Decision Process (CMDP) definition is given as follows:
timestep: int (> 0)

For instance, timesteps can represent the 24 hours of a day or the seven days of a week. Each timestep contains (search volume, ) auctions. Note that a single auction does NOT correspond to a timestep.

state: array-like of shape (7, )

Statistical feedback on the auctions during the timestep, including the following values.

  • timestep

  • remaining budget

  • impression level features at the previous timestep (budget consumption rate, cost per mille of impressions, auction winning rate, and reward)

  • adjust rate (i.e., RL agent action) at the previous timestep

action: {int, float, array-like of shape (1, )} (>= 0)

Adjust rate parameter used for the bid price calculation as follows. Note that the following bid price is individually determined for each auction.

\({bid price}_{t, i} = {adjust rate}_{t} \times {predicted reward}_{t,i} ( \times {const.})\)

Both discrete and continuous actions are acceptable.

reward: int (>= 0)

Total clicks/conversions gained during the timestep.

discount_rate: int

Discount factor for cumulative reward calculation. Set discount_rate = 1 (i.e., no discount) in RTB.

constraint: int (> 0)

Total cost should not exceed the initial budget.
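The CMDP components above can be sketched in a few lines. This is an illustrative sketch only: the helper names make_state, episode_return, and satisfies_budget are hypothetical, the ordering of the seven state entries simply follows the list above (the actual layout is exposed by the environment, e.g. via obs_keys), and all numbers are made up.

```python
import numpy as np


def make_state(timestep, remaining_budget, budget_consumption_rate,
               cost_per_mille, winning_rate, reward, adjust_rate):
    """Hypothetical helper: pack the seven state components into shape (7,)."""
    return np.array(
        [timestep, remaining_budget, budget_consumption_rate,
         cost_per_mille, winning_rate, reward, adjust_rate],
        dtype=float,
    )


def episode_return(rewards, discount_rate=1.0):
    """Cumulative reward; discount_rate = 1 means no discounting."""
    return sum(discount_rate ** t * r for t, r in enumerate(rewards))


def satisfies_budget(costs, initial_budget):
    """Constraint: total cost must not exceed the initial budget."""
    return sum(costs) <= initial_budget


# Illustrative values (assumptions).
state = make_state(1, 2000, 0.33, 50.0, 0.4, 3, 1.0)
total_reward = episode_return([3, 0, 5])
within_budget = satisfies_budget([1000, 1500], initial_budget=3000)
```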

Parameters:
  • original_env (RTBEnv) – Original RTB environment.

  • reward_predictor (BaseEstimator, default=None) – A machine learning model to predict the reward to determine the bidding price. If None, the ground-truth (expected) reward is used instead of the predicted one.

  • scaler ({int, float}, default=None (> 0)) – Scaling factor (constant value) used for bid price determination. If None, scaler is autofitted by bidder.auto_fit_scaler().

  • action_min (float, default=0.1 (> 0)) – Minimum value of action.

  • action_max (float, default=10.0 (> 0)) – Maximum value of action.

  • action_type ({"discrete", "continuous"}, default="discrete") – Type of the action space.

  • n_actions (int, default=10 (> 0)) – Number of actions. Used only when action_type="discrete".

  • action_meaning (ndarray of shape (n_actions, ), default=None) –

    Mapping from each discrete action index to a specific action (adjust rate). Used only when action_type="discrete".

    If None, the values are automatically set within [action_min, action_max] as np.logspace(-1, 1, n_actions).
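To make the discrete action mapping concrete, here is a minimal sketch that builds a log-spaced grid and looks up the adjust rate for one action index. It assumes the grid spans action_min to action_max on a log scale, which reproduces np.logspace(-1, 1, n_actions) for the default bounds; the library's internal construction may differ.

```python
import numpy as np

# Default bounds and grid size from the signature above.
action_min, action_max, n_actions = 0.1, 10.0, 10

# Assumption: log-spaced grid between the bounds.
action_meaning = np.logspace(
    np.log10(action_min), np.log10(action_max), n_actions
)

# A discrete agent outputs an index; the grid turns it into an adjust rate.
action_index = 3
adjust_rate = action_meaning[action_index]
```

A log-spaced grid places as many actions below an adjust rate of 1 as above it, so the agent can shrink bids as finely as it can raise them.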

Examples

Setup:

# import necessary modules from rtbgym
from rtbgym import RTBEnv, CustomizedRTBEnv
from rtbgym.policy import OnlineHead
from rtbgym.ope.online import calc_on_policy_policy_value

# import necessary modules from other libraries
from sklearn.linear_model import LogisticRegression
from d3rlpy.algos import DiscreteRandomPolicy

# initialize and customize environment
env = RTBEnv(random_state=12345)
env = CustomizedRTBEnv(
    original_env=env,
    reward_predictor=LogisticRegression(),
    action_type="discrete",
)

# define (RL) agent (i.e., policy)
agent = OnlineHead(DiscreteRandomPolicy())
agent.build_with_env(env)

Interaction:

# OpenAI Gym and Gymnasium-like interaction with agent
for episode in range(1000):
    obs, info = env.reset()
    done = False

    while not done:
        action = agent.predict_online(obs)
        obs, reward, done, truncated, info = env.step(action)

Online Evaluation:

# calculate on-policy policy value
on_policy_performance = calc_on_policy_policy_value(
    env,
    agent,
    n_trajectories=100,
    random_state=12345
)

Output:

>>> on_policy_performance

11.75

References

Di Wu, Xiujun Chen, Xun Yang, Hao Wang, Qing Tan, Xiaoxun Zhang, Jian Xu, and Kun Gai. “Budget Constrained Bidding by Model-free Reinforcement Learning in Display Advertising.” 2018.

Jun Zhao, Guang Qiu, Ziyu Guan, Wei Zhao, and Xiaofei He. “Deep Reinforcement Learning for Sponsored Search Real-time Bidding.” 2018.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. “OpenAI Gym.” 2016.

Attributes:
initial_budget
obs_keys
step_per_episode

Methods

reset([seed])

Initialize the environment.

step(action)

Roll out the auctions that arise during the timestep and return feedback to the agent.

step(action)

Roll out the auctions that arise during the timestep and return feedback to the agent.

Parameters:

action ({int, float, array-like of shape (1, )} (>= 0)) – RL agent action, which indicates the adjust rate parameter used for bid price determination. Both discrete and continuous actions are acceptable.

Returns:

  • obs (ndarray of shape (7, )) – Statistical feedback on the auctions during the timestep. Corresponds to the RL state, which includes the following components.

    • timestep

    • remaining budget

    • impression level features at the previous timestep (budget consumption rate, cost per mille of impressions, auction winning rate, and reward)

    • adjust rate (i.e., agent action) at the previous timestep

  • reward (int (>= 0)) – Total clicks/conversions gained during the timestep.

  • done (bool) – Whether the episode has ended.

  • info (dict) – Additional feedback (total impressions, clicks, and conversions) that may be useful for the package users. This is unavailable to the RL agent.

Return type:

Tuple[Any]

reset(seed=None)

Initialize the environment.

Note

Remaining budget is initialized to the initial budget of an episode.

Parameters:

seed (Optional[int], default=None) – Random state.

Returns:

  • obs (ndarray of shape (7, )) – Statistical feedback on the auctions during the timestep. Corresponds to the RL state, which includes the following components.

    • timestep

    • remaining budget

    • impression level features at the previous timestep (budget consumption rate, cost per mille of impressions, auction winning rate, and reward)

    • adjust rate (i.e., agent action) at the previous timestep

  • info ((empty) dict) – Additional information that may be useful for the package users. This is unavailable to the RL agent.

Return type:

Tuple[ndarray, dict]
