scope_rl.policy.head.SoftmaxHead

class scope_rl.policy.head.SoftmaxHead(base_policy, name, n_actions, tau=1.0, random_state=None)

Class to convert a Q-learning based policy into a softmax policy (applicable to discrete action space).

Bases: scope_rl.policy.BaseHead

Imported as: scope_rl.policy.SoftmaxHead

Note

A softmax policy stochastically chooses an action (i.e., \(a \in \mathcal{A}\)) given state \(s\) as follows.

\[\pi(a \mid s) := \frac{\exp(Q(s, a) / \tau)}{\sum_{a' \in \mathcal{A}} \exp(Q(s, a') / \tau)}\]

where \(\tau\) is the temperature parameter of the softmax function. \(Q(s, a)\) is the predicted value for the given \((s, a)\) pair.
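The formula above can be sketched for a single state as follows. This is a minimal illustration in plain Python, not the library's implementation; the function name `softmax_policy` and the Q-values are assumptions made up for the example.

```python
import math


def softmax_policy(q_values, tau=1.0):
    """Return pi(. | s) for one state from its Q-values and temperature tau."""
    if tau == 0:
        raise ValueError("tau must be non-zero")
    scaled = [q / tau for q in q_values]
    # Subtract the maximum before exponentiating for numerical stability;
    # the shift cancels in the ratio, so the probabilities are unchanged.
    shift = max(scaled)
    weights = [math.exp(z - shift) for z in scaled]
    total = sum(weights)
    return [w / total for w in weights]


q_values = [1.0, 2.0, 3.0]  # hypothetical Q(s, a) for three actions
probs = softmax_policy(q_values, tau=1.0)
```

The resulting probabilities sum to one, and actions with larger Q-values receive larger probabilities when \(\tau\) is positive.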

Note

To ensure API compatibility with d3rlpy, BaseHead inherits from d3rlpy.algos.QLearningAlgoBase. This base class provides additional methods, including fit, predict, and predict_value. For methods that are not described in this API reference, please refer to the d3rlpy documentation.

Parameters:
  • base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.

  • name (str) – Name of the policy.

  • n_actions (int (> 0)) – Number of actions.

  • tau (float, default=1.0) – Temperature parameter. The value must be non-zero. A negative value makes the policy favor actions with lower Q-values, which leads to a sub-optimal policy.

  • random_state (int (>= 0), default=None) – Random state.
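To illustrate how the choice of tau shapes the policy, here is a small sketch in plain Python. The helper `softmax` and the Q-values are hypothetical and only mirror the softmax definition given in the note above; they are not part of scope_rl.

```python
import math


def softmax(q_values, tau):
    """Softmax over Q-values with temperature tau (tau must be non-zero)."""
    scaled = [q / tau for q in q_values]
    shift = max(scaled)  # shift for numerical stability
    weights = [math.exp(z - shift) for z in scaled]
    total = sum(weights)
    return [w / total for w in weights]


q_values = [1.0, 2.0, 3.0]  # hypothetical Q-values; action 2 is the greedy choice

sharp = softmax(q_values, tau=0.1)      # small positive tau: close to greedy
flat = softmax(q_values, tau=100.0)     # large tau: close to uniform
inverted = softmax(q_values, tau=-1.0)  # negative tau: favors the lowest Q-value
```

In this sketch, a small positive tau concentrates almost all probability on the greedy action, a large tau flattens the distribution, and a negative tau reverses the ranking of actions.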

Attributes:
random_state

Methods
  • calc_action_choice_probability(x) – Calculate action choice probabilities.

  • calc_pscore_given_action(x, action) – Calculate the pscore of a given action.

  • sample_action_and_output_pscore(x) – Sample stochastic action with its pscore.

sample_action_and_output_pscore(x)

Sample stochastic action with its pscore.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

Returns:

  • action (ndarray of shape (n_samples, )) – Sampled action.

  • pscore (ndarray of shape (n_samples, )) – Probability of the sampled action being chosen under this policy (pscore stands for propensity score).
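As a rough sketch of what sampling an action together with its pscore involves, the following plain-Python example draws one action per state from a given probability table and records the probability of the drawn action. The function `sample_action_and_pscore`, the probability values, and the seed are assumptions for illustration; the library's own method works on arrays and obtains the probabilities from the base policy's Q-values.

```python
import random


def sample_action_and_pscore(probs, rng):
    """Sample one action per row of `probs` and return it with its probability."""
    actions, pscores = [], []
    for row in probs:
        # rng.choices draws an index in proportion to the given weights.
        action = rng.choices(range(len(row)), weights=row, k=1)[0]
        actions.append(action)
        pscores.append(row[action])
    return actions, pscores


# Hypothetical action choice probabilities for 2 states and 3 actions.
probs = [
    [0.1, 0.2, 0.7],
    [0.5, 0.5, 0.0],
]
rng = random.Random(12345)  # fixed seed, playing the role of random_state
actions, pscores = sample_action_and_pscore(probs, rng)
```

Both returned lists have one entry per state, matching the (n_samples, ) shape described above.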

calc_action_choice_probability(x)

Calculate action choice probabilities.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

Returns:

pscore – Probability of choosing each action in the given state under this policy (pscore stands for propensity score).

Return type:

ndarray of shape (n_samples, n_actions)

calc_pscore_given_action(x, action)

Calculate the pscore of a given action.

Parameters:
  • x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

  • action (array-like of shape (n_samples, )) – Action.

Returns:

pscore – Probability of choosing the given action in the given state under this policy.

Return type:

ndarray of shape (n_samples, )
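Conceptually, the pscore of a given action is the entry of the action choice probability table selected by that action. A minimal plain-Python sketch, assuming the probabilities have already been computed (the function name and the numbers are hypothetical):

```python
def pscore_given_action(probs, actions):
    """Pick, for each state, the probability of the action that was given."""
    if len(probs) != len(actions):
        raise ValueError("probs and actions must have the same number of samples")
    return [row[a] for row, a in zip(probs, actions)]


# Hypothetical probabilities of shape (n_samples=2, n_actions=3).
probs = [
    [0.1, 0.2, 0.7],
    [0.5, 0.3, 0.2],
]
actions = [2, 0]  # one action index per state
pscore = pscore_given_action(probs, actions)
```

The result has one value per sample, in line with the return shape (n_samples, ).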
