scope_rl.policy.head

Wrapper classes to convert a greedy policy into a stochastic policy.

Classes

BaseHead

Base class to convert a greedy policy into a stochastic policy.

ContinuousEvalHead

Class to transform the base policy into a deterministic evaluation policy.

EpsilonGreedyHead

Class to convert a deterministic policy into an epsilon-greedy policy (applicable to the discrete action case).

GaussianHead

Class to sample actions from a Gaussian distribution (applicable to the continuous action case).

OnlineHead

Class to enable online interaction.

SoftmaxHead

Class to convert a Q-learning-based policy into a softmax policy (applicable to discrete action spaces).

TruncatedGaussianHead

Class to sample continuous actions from a truncated Gaussian distribution (applicable to continuous action spaces).

class scope_rl.policy.head.BaseHead

Base class to convert a greedy policy into a stochastic policy.

Bases: d3rlpy.algos.QLearningAlgoBase

Imported as: scope_rl.policy.BaseHead

Note

To ensure API compatibility with d3rlpy, BaseHead inherits d3rlpy.algos.QLearningAlgoBase. This base class also provides additional methods, including fit, predict, and predict_value. For methods not described in this API reference, please refer to the d3rlpy documentation.

Attributes:
action_scalar

Methods

calc_action_choice_probability(x)

Calculate the action choice probabilities.

calc_pscore_given_action(x, action)

Calculate the pscore of the given action.

predict_online(x)

Predict the best action in an online environment.

predict_value_online(x, action[, with_std])

Predict the state action value in an online environment.

sample_action_and_output_pscore(x)

Sample an action stochastically with its pscore.

sample_action_and_output_pscore_online(x)

Sample an action and calculate its pscore in an online environment.

sample_action_online(x)

Sample an action in an online environment.

abstract sample_action_and_output_pscore(x)

Sample an action stochastically with its pscore.

abstract calc_action_choice_probability(x)

Calculate the action choice probabilities.

abstract calc_pscore_given_action(x, action)

Calculate the pscore of the given action.

predict_online(x)

Predict the best action in an online environment.

predict_value_online(x, action, with_std=False)

Predict the state action value in an online environment.

sample_action_online(x)

Sample an action in an online environment.

sample_action_and_output_pscore_online(x)

Sample an action and calculate its pscore in an online environment.
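
The three abstract methods of BaseHead form the contract that each concrete head fills in. As a rough illustration only, the following pure-Python sketch mirrors those method names on a toy uniform-random policy. ToyUniformHead is hypothetical: it does not inherit from BaseHead or d3rlpy, and it works on plain lists rather than arrays. The return shapes are assumed to follow those documented for the discrete heads in this reference.

```python
import random


class ToyUniformHead:
    """Hypothetical stand-in (not part of scope_rl) that mirrors the
    three abstract method names of BaseHead for a uniform policy."""

    def __init__(self, n_actions, random_state=None):
        self.n_actions = n_actions
        self.rng = random.Random(random_state)

    def calc_action_choice_probability(self, x):
        # one row of probabilities per state, one column per action
        return [[1.0 / self.n_actions] * self.n_actions for _ in x]

    def calc_pscore_given_action(self, x, action):
        # pick the probability of the given action in each row
        probs = self.calc_action_choice_probability(x)
        return [row[a] for row, a in zip(probs, action)]

    def sample_action_and_output_pscore(self, x):
        # draw one action per state, then report its probability
        action = [self.rng.randrange(self.n_actions) for _ in x]
        return action, self.calc_pscore_given_action(x, action)


head = ToyUniformHead(n_actions=4, random_state=0)
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 states, state_dim = 2
action, pscore = head.sample_action_and_output_pscore(states)
```

Under a uniform policy every pscore equals 1 / n_actions, whichever action is drawn.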

class scope_rl.policy.head.OnlineHead(base_policy, name)

Class to enable online interaction.

Bases: scope_rl.policy.BaseHead

Imported as: scope_rl.policy.OnlineHead

Note

This class wraps a d3rlpy policy so that it becomes an instance of BaseHead.

Note

To ensure API compatibility with d3rlpy, BaseHead inherits d3rlpy.algos.QLearningAlgoBase. This base class also provides additional methods, including fit, predict, and predict_value. For methods not described in this API reference, please refer to the d3rlpy documentation.

Parameters:
  • base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.

  • name (str) – Name of the policy.

Methods

calc_action_choice_probability(x)

Only for API consistency.

calc_pscore_given_action(x, action)

Only for API consistency.

sample_action_and_output_pscore(x)

Only for API consistency.

sample_action_and_output_pscore(x)

Only for API consistency.

calc_action_choice_probability(x)

Only for API consistency.

calc_pscore_given_action(x, action)

Only for API consistency.

class scope_rl.policy.head.EpsilonGreedyHead(base_policy, name, n_actions, epsilon, random_state=None)

Class to convert a deterministic policy into an epsilon-greedy policy (applicable to the discrete action case).

Bases: scope_rl.policy.BaseHead

Imported as: scope_rl.policy.EpsilonGreedyHead

Note

An epsilon-greedy policy stochastically chooses an action \(a \in \mathcal{A}\) given state \(s\) as follows.

\[\pi(a \mid s) := (1 - \epsilon) \cdot \mathbb{I}(a = a^{*}) + \epsilon / |\mathcal{A}|\]

where \(\epsilon\) is the probability of taking a random action and \(a^{*}\) is the greedy action. \(\mathbb{I}(\cdot)\) denotes the indicator function.
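
As a small worked example of the formula, the sketch below builds the action-choice probabilities for a single state. The helper name epsilon_greedy_probs and its list-based output are illustrative assumptions, not the library's implementation.

```python
def epsilon_greedy_probs(greedy_action, n_actions, epsilon):
    """Return pi(. | s) for one state under an epsilon-greedy policy."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be within [0, 1]")
    # every action receives the uniform exploration mass epsilon / |A|
    probs = [epsilon / n_actions] * n_actions
    # the greedy action additionally receives the exploitation mass 1 - epsilon
    probs[greedy_action] += 1.0 - epsilon
    return probs


probs = epsilon_greedy_probs(greedy_action=2, n_actions=4, epsilon=0.2)
```

With four actions and epsilon = 0.2, the greedy action ends up with probability 0.85 and each other action with 0.05. Even the greedy action can be selected by the random draw, which is why its probability exceeds 1 - epsilon.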

Note

To ensure API compatibility with d3rlpy, BaseHead inherits d3rlpy.algos.QLearningAlgoBase. This base class also provides additional methods, including fit, predict, and predict_value. For methods not described in this API reference, please refer to the d3rlpy documentation.

Parameters:
  • base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.

  • name (str) – Name of the policy.

  • n_actions (int (> 0)) – Number of actions.

  • epsilon (float) – Probability of exploration. The value should be within [0, 1].

  • random_state (int, default=None (>= 0)) – Random state.

Attributes:
random_state

Methods

calc_action_choice_probability(x)

Calculate action choice probabilities.

calc_pscore_given_action(x, action)

Calculate the pscore of a given action.

sample_action_and_output_pscore(x)

Sample an action stochastically with its pscore.

sample_action_and_output_pscore(x)

Sample an action stochastically with its pscore.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

Returns:

  • action (ndarray of shape (n_samples, )) – Sampled action.

  • pscore (ndarray of shape (n_samples, )) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
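
Sampling with a propensity can be pictured as drawing one action per row of an action-choice probability matrix and recording the probability of the action that was drawn. The sketch below uses only the standard library. The function name sample_action_and_pscore and its list-based inputs are illustrative assumptions, not the library's implementation.

```python
import random


def sample_action_and_pscore(action_probs, rng):
    """Draw one action per row and return it with its probability."""
    actions, pscores = [], []
    for row in action_probs:
        # random.choices draws an index in proportion to the given weights
        a = rng.choices(range(len(row)), weights=row, k=1)[0]
        actions.append(a)
        pscores.append(row[a])
    return actions, pscores


rng = random.Random(12345)
action_probs = [
    [0.0, 1.0, 0.0],     # degenerate row: action 1 is always drawn
    [0.05, 0.05, 0.9],   # epsilon-greedy-like row
]
action, pscore = sample_action_and_pscore(action_probs, rng)
```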

calc_action_choice_probability(x)

Calculate action choice probabilities.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

Returns:

pscore – Probability of each action being chosen under the behavior policy (pscore stands for propensity score).

Return type:

ndarray of shape (n_samples, n_actions)

calc_pscore_given_action(x, action)

Calculate the pscore of a given action.

Parameters:
  • x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

  • action (array-like of shape (n_samples, )) – Action.

Returns:

pscore – Pscore of the given state and action.

Return type:

ndarray of shape (n_samples, )
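
For a discrete policy, the pscore of a given action is the entry of the (n_samples, n_actions) action-choice probability matrix in that action's column. A minimal sketch with plain lists follows; the helper name pscore_from_choice_probability is hypothetical.

```python
def pscore_from_choice_probability(action_probs, action):
    """Select, for each sample, the probability of the given action."""
    if len(action_probs) != len(action):
        raise ValueError("action_probs and action must have the same length")
    return [row[a] for row, a in zip(action_probs, action)]


# epsilon = 0.2 with 4 actions: the greedy action gets 0.85, the others 0.05
action_probs = [
    [0.85, 0.05, 0.05, 0.05],  # greedy action is 0
    [0.05, 0.05, 0.85, 0.05],  # greedy action is 2
]
pscore = pscore_from_choice_probability(action_probs, action=[0, 1])
```

The first sample took its greedy action and the second did not, so the two pscores differ.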

class scope_rl.policy.head.SoftmaxHead(base_policy, name, n_actions, tau=1.0, random_state=None)

Class to convert a Q-learning-based policy into a softmax policy (applicable to discrete action spaces).

Bases: scope_rl.policy.BaseHead

Imported as: scope_rl.policy.SoftmaxHead

Note

A softmax policy stochastically chooses an action (i.e., \(a \in \mathcal{A}\)) given state \(s\) as follows.

\[\pi(a \mid s) := \frac{\exp(Q(s, a) / \tau)}{\sum_{a' \in \mathcal{A}} \exp(Q(s, a') / \tau)}\]

where \(\tau\) is the temperature parameter of the softmax function. \(Q(s, a)\) is the predicted value for the given \((s, a)\) pair.
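
To make the role of the temperature concrete, the sketch below evaluates the softmax formula for one state with a few values of \(\tau\). It is a standalone illustration using only the standard library. The function name softmax_probs is an assumption, and subtracting the maximum before exponentiating is a common numerical-stability device rather than something this reference specifies.

```python
import math


def softmax_probs(q_values, tau=1.0):
    """Return pi(. | s) from Q(s, .) with temperature tau."""
    if tau == 0:
        raise ValueError("tau must not be zero")
    scaled = [q / tau for q in q_values]
    # subtracting the maximum leaves the result unchanged but avoids overflow
    peak = max(scaled)
    weights = [math.exp(v - peak) for v in scaled]
    total = sum(weights)
    return [w / total for w in weights]


q_values = [1.0, 2.0, 3.0]
sharp = softmax_probs(q_values, tau=0.1)      # close to greedy
flat = softmax_probs(q_values, tau=100.0)     # close to uniform
inverted = softmax_probs(q_values, tau=-1.0)  # favours the lowest Q-value
```

A small positive \(\tau\) concentrates the probability on the action with the highest Q-value, a large \(\tau\) approaches the uniform policy, and a negative \(\tau\) reverses the ranking.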

Note

To ensure API compatibility with d3rlpy, BaseHead inherits d3rlpy.algos.QLearningAlgoBase. This base class also provides additional methods, including fit, predict, and predict_value. For methods not described in this API reference, please refer to the d3rlpy documentation.

Parameters:
  • base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.

  • name (str) – Name of the policy.

  • n_actions (int (> 0)) – Number of actions.

  • tau (float, default=1.0) – Temperature parameter. The value should not be zero. A negative value leads to a sub-optimal policy.

  • random_state (int, default=None (>= 0)) – Random state.

Attributes:
random_state

Methods

calc_action_choice_probability(x)

Calculate action choice probabilities.

calc_pscore_given_action(x, action)

Calculate the pscore of a given action.

sample_action_and_output_pscore(x)

Sample an action stochastically with its pscore.

sample_action_and_output_pscore(x)

Sample an action stochastically with its pscore.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

Returns:

  • action (ndarray of shape (n_samples, )) – Sampled action.

  • pscore (ndarray of shape (n_samples, )) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

calc_action_choice_probability(x)

Calculate action choice probabilities.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

Returns:

pscore – Probability of each action being chosen under the behavior policy (pscore stands for propensity score).

Return type:

ndarray of shape (n_samples, n_actions)

calc_pscore_given_action(x, action)

Calculate the pscore of a given action.

Parameters:
  • x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

  • action (array-like of shape (n_samples, )) – Action.

Returns:

pscore – Pscore of the given state and action.

Return type:

ndarray of shape (n_samples, )

class scope_rl.policy.head.GaussianHead(base_policy, name, sigma, random_state=None)

Class to sample actions from a Gaussian distribution (applicable to the continuous action case).

Bases: scope_rl.policy.BaseHead

Imported as: scope_rl.policy.GaussianHead

Note

This class should be used when action_space is not clipped. Otherwise, please use TruncatedGaussianHead instead.

Given a deterministic policy, a Gaussian policy samples an action \(a \in \mathcal{A}\) given state \(s\) as follows.

\[a \sim \mathrm{Normal}(\pi(s), \sigma)\]

where \(\sigma\) is the standard deviation of the normal distribution. \(\pi(s)\) is the action chosen by the deterministic policy.
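
For continuous actions the pscore is a probability density rather than a probability. The sketch below evaluates the density of an action around the greedy action, assuming the action dimensions are independent and each has its own standard deviation. Both that factorisation and the function name gaussian_pscore are assumptions made for illustration; this reference does not state how the library computes the value.

```python
import math


def gaussian_pscore(greedy_action, action, sigma):
    """Density of `action` under Normal(greedy_action, sigma), taken as a
    product over independent action dimensions (an assumption)."""
    density = 1.0
    for mu, a, s in zip(greedy_action, action, sigma):
        z = (a - mu) / s
        density *= math.exp(-0.5 * z * z) / (s * math.sqrt(2.0 * math.pi))
    return density


at_mean = gaussian_pscore([0.0], [0.0], [1.0])
off_mean = gaussian_pscore([0.0], [1.5], [1.0])
two_dim = gaussian_pscore([0.0, 0.0], [0.0, 0.0], [1.0, 2.0])
```

Because it is a density, the value can exceed 1 when sigma is small.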

Note

To ensure API compatibility with d3rlpy, BaseHead inherits d3rlpy.algos.QLearningAlgoBase. This base class also provides additional methods, including fit, predict, and predict_value. For methods not described in this API reference, please refer to the d3rlpy documentation.

Parameters:
  • base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.

  • name (str) – Name of the policy.

  • sigma (array-like of shape (action_dim, )) – Standard deviation of Gaussian distribution.

  • random_state (int, default=None (>= 0)) – Random state.

Attributes:
random_state

Methods

calc_action_choice_probability(...)

Calculate action choice probabilities.

calc_pscore_given_action(x, action)

Calculate the pscore of a given action.

sample_action_and_output_pscore(x)

Sample an action stochastically with its pscore.

calc_action_choice_probability(greedy_action, action)

Calculate action choice probabilities.

Parameters:
  • greedy_action (array-like of shape (n_samples, action_dim)) – Greedy action.

  • action (array-like of shape (n_samples, action_dim)) – Sampled action.

Returns:

pscore – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

Return type:

ndarray of shape (n_samples, )

sample_action_and_output_pscore(x)

Sample an action stochastically with its pscore.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

Returns:

  • action (ndarray of shape (n_samples, action_dim)) – Sampled action.

  • pscore (ndarray of shape (n_samples, )) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

calc_pscore_given_action(x, action)

Calculate the pscore of a given action.

Parameters:
  • x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

  • action (array-like of shape (n_samples, action_dim)) – Action.

Returns:

pscore – Pscore of the given state and action.

Return type:

ndarray of shape (n_samples, )

class scope_rl.policy.head.TruncatedGaussianHead(base_policy, name, sigma, minimum, maximum, random_state=None)

Class to sample continuous actions from a truncated Gaussian distribution (applicable to continuous action spaces).

Bases: scope_rl.policy.BaseHead

Imported as: scope_rl.policy.TruncatedGaussianHead

Note

Given a deterministic policy, a truncated Gaussian policy samples an action \(a \in \mathcal{A}\) given state \(s\) as follows.

\[a \sim \mathrm{TruncNorm}(\pi(s), \sigma)\]

where \(\sigma\) is the standard deviation of the truncated normal distribution. \(\pi(s)\) is the action chosen by the deterministic policy.

Note

To ensure API compatibility with d3rlpy, BaseHead inherits d3rlpy.algos.QLearningAlgoBase. This base class also provides additional methods, including fit, predict, and predict_value. For methods not described in this API reference, please refer to the d3rlpy documentation.

Parameters:
  • base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.

  • name (str) – Name of the policy.

  • sigma (array-like of shape (action_dim, )) – Standard deviation of Gaussian distribution.

  • minimum (array-like of shape (action_dim, )) – Minimum value of action vector.

  • maximum (array-like of shape (action_dim, )) – Maximum value of action vector.

  • random_state (int, default=None (>= 0)) – Random state.

Attributes:
random_state

Methods

calc_action_choice_probability(...)

Calculate action choice probabilities.

calc_pscore_given_action(x, action)

Calculate the pscore of a given action.

sample_action_and_output_pscore(x)

Sample an action stochastically with its pscore.

calc_action_choice_probability(greedy_action, action)

Calculate action choice probabilities.

Parameters:
  • greedy_action (array-like of shape (n_samples, action_dim)) – Greedy action.

  • action (array-like of shape (n_samples, action_dim)) – Sampled action.

Returns:

pscore – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

Return type:

ndarray of shape (n_samples, )

sample_action_and_output_pscore(x)

Sample an action stochastically with its pscore.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

Returns:

  • action (ndarray of shape (n_samples, action_dim)) – Sampled action.

  • pscore (ndarray of shape (n_samples, )) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

calc_pscore_given_action(x, action)

Calculate the pscore of a given action.

Parameters:
  • x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

  • action (array-like of shape (n_samples, action_dim)) – Action.

Returns:

pscore – Pscore of the given state and action.

Return type:

ndarray of shape (n_samples, )

class scope_rl.policy.head.ContinuousEvalHead(base_policy, name, random_state=None)

Class to transform the base policy into a deterministic evaluation policy.

Bases: scope_rl.policy.BaseHead

Imported as: scope_rl.policy.ContinuousEvalHead

Note

To ensure API compatibility with d3rlpy, BaseHead inherits d3rlpy.algos.QLearningAlgoBase. This base class also provides additional methods, including fit, predict, and predict_value. For methods not described in this API reference, please refer to the d3rlpy documentation.

Parameters:
  • base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.

  • name (str) – Name of the policy.

  • random_state (int, default=None (>= 0)) – Random state. (This is for API consistency.)

Attributes:
random_state

Methods

calc_action_choice_probability(x)

Only for API consistency.

calc_pscore_given_action(x, action)

Only for API consistency.

sample_action_and_output_pscore(x)

Only for API consistency.

sample_action_and_output_pscore(x)

Only for API consistency.

calc_action_choice_probability(x)

Only for API consistency.

calc_pscore_given_action(x, action)

Only for API consistency.