scope_rl.policy.head.EpsilonGreedyHead

class scope_rl.policy.head.EpsilonGreedyHead(base_policy, name, n_actions, epsilon, random_state=None)

Class that converts a deterministic policy into an epsilon-greedy policy (applicable to discrete action spaces).

Bases: scope_rl.policy.BaseHead

Imported as: scope_rl.policy.EpsilonGreedyHead

Note

The epsilon-greedy policy stochastically chooses an action \(a \in \mathcal{A}\) given state \(s\) as follows.

\[\pi(a \mid s) := (1 - \epsilon) \cdot \mathbb{I}(a = a^{*}) + \epsilon / |\mathcal{A}|\]

where \(\epsilon\) is the probability of taking a random action and \(a^{*}\) is the greedy action. \(\mathbb{I}(\cdot)\) denotes the indicator function.
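As a worked illustration of the definition above, assume (purely for the sake of example) four actions and \(\epsilon = 0.2\). The greedy action then has probability

\[(1 - 0.2) + 0.2 / 4 = 0.85,\]

each of the three non-greedy actions has probability

\[0.2 / 4 = 0.05,\]

and the probabilities sum to one:

\[0.85 + 3 \times 0.05 = 1.\]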

Note

To ensure API compatibility with d3rlpy, BaseHead inherits from d3rlpy.algos.QLearningAlgoBase, which provides additional methods including fit, predict, and predict_value. For methods that are not described in this API reference, please refer to the d3rlpy documentation.

Parameters:
  • base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.

  • name (str) – Name of the policy.

  • n_actions (int (> 0)) – Number of actions.

  • epsilon (float) – Probability of exploration. The value should be within [0, 1].

  • random_state (int (>= 0), default=None) – Random state.

Attributes:
random_state

Methods

  • calc_action_choice_probability(x) – Calculate action choice probabilities.

  • calc_pscore_given_action(x, action) – Calculate the pscore of a given action.

  • sample_action_and_output_pscore(x) – Sample an action stochastically based on the pscore.

sample_action_and_output_pscore(x)

Sample an action stochastically based on the pscore.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

Returns:

  • action (ndarray of shape (n_samples, )) – Sampled action.

  • pscore (ndarray of shape (n_samples, )) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
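The sampling step can be sketched in plain NumPy as follows. This is a minimal illustration of the epsilon-greedy definition, not the library's implementation: the helper name `sample_epsilon_greedy` is hypothetical, and the greedy actions are passed in directly instead of being predicted from `x` by a base policy.

```python
import numpy as np


def sample_epsilon_greedy(greedy_action, n_actions, epsilon, random_state=None):
    """Hypothetical helper: sample epsilon-greedy actions and their pscores."""
    rng = np.random.default_rng(random_state)
    greedy_action = np.asarray(greedy_action)
    n_samples = greedy_action.shape[0]

    # With probability epsilon, replace the greedy action by a uniform draw.
    explore = rng.random(n_samples) < epsilon
    random_action = rng.integers(n_actions, size=n_samples)
    action = np.where(explore, random_action, greedy_action)

    # A uniform draw can also land on the greedy action, so the greedy action
    # has probability (1 - epsilon) + epsilon / n_actions; all others have
    # probability epsilon / n_actions.
    pscore = np.where(
        action == greedy_action,
        1.0 - epsilon + epsilon / n_actions,
        epsilon / n_actions,
    )
    return action, pscore


# Assumed greedy actions for three states (illustrative values only).
action, pscore = sample_epsilon_greedy([2, 0, 1], n_actions=4, epsilon=0.2, random_state=12345)
```

Both returned arrays have shape (n_samples, ), matching the return values documented above.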

calc_action_choice_probability(x)

Calculate action choice probabilities.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

Returns:

pscore – Probability of each action being chosen under the policy, for each state (pscore stands for propensity score).

Return type:

ndarray of shape (n_samples, n_actions)
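The following sketch shows how a probability matrix of this shape could be built from the epsilon-greedy definition. It is an assumption-laden illustration and not the library's code: `epsilon_greedy_probabilities` is a hypothetical helper, and the greedy actions are supplied directly instead of being obtained from a base policy.

```python
import numpy as np


def epsilon_greedy_probabilities(greedy_action, n_actions, epsilon):
    """Hypothetical helper: build the (n_samples, n_actions) probability matrix."""
    greedy_action = np.asarray(greedy_action)
    n_samples = greedy_action.shape[0]

    # Every action receives the uniform exploration mass epsilon / n_actions.
    prob = np.full((n_samples, n_actions), epsilon / n_actions)

    # The greedy action additionally receives the exploitation mass 1 - epsilon.
    prob[np.arange(n_samples), greedy_action] += 1.0 - epsilon
    return prob


# Two states whose greedy actions are assumed to be 2 and 0, respectively.
probs = epsilon_greedy_probabilities([2, 0], n_actions=4, epsilon=0.2)
```

Each row of `probs` is a distribution over actions, so the rows sum to one.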

calc_pscore_given_action(x, action)

Calculate the pscore of a given action.

Parameters:
  • x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

  • action (array-like of shape (n_samples, )) – Action.

Returns:

pscore – Pscore of the given state and action.

Return type:

ndarray of shape (n_samples, )
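For a given action, the pscore follows directly from the epsilon-greedy formula. The sketch below assumes the greedy actions are already known; `pscore_given_action` is a hypothetical helper written for illustration, not the method's actual implementation.

```python
import numpy as np


def pscore_given_action(greedy_action, action, n_actions, epsilon):
    """Hypothetical helper: epsilon-greedy probability of each given action."""
    greedy_action = np.asarray(greedy_action)
    action = np.asarray(action)

    # Indicator of whether the given action coincides with the greedy one.
    is_greedy = action == greedy_action

    # (1 - epsilon) * I(a = greedy) + epsilon / n_actions, element-wise.
    return (1.0 - epsilon) * is_greedy + epsilon / n_actions


# Assumed greedy actions and observed actions for three states.
pscore = pscore_given_action(
    greedy_action=[2, 0, 1], action=[2, 3, 1], n_actions=4, epsilon=0.2
)
```

With these illustrative inputs, the first and third actions match the greedy action and the second does not.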
