scope_rl.policy.head.SoftmaxHead
- class scope_rl.policy.head.SoftmaxHead(base_policy, name, n_actions, tau=1.0, random_state=None)
Class to convert a Q-learning based policy into a softmax policy (applicable to discrete action space).
Bases: scope_rl.policy.BaseHead
Imported as: scope_rl.policy.SoftmaxHead
Note
A softmax policy stochastically chooses an action (i.e., \(a \in \mathcal{A}\)) given state \(s\) as follows.
\[\pi(a \mid s) := \frac{\exp(Q(s, a) / \tau)}{\sum_{a' \in \mathcal{A}} \exp(Q(s, a') / \tau)}\]
where \(\tau\) is the temperature parameter of the softmax function and \(Q(s, a)\) is the predicted value for the given \((s, a)\) pair.
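The formula above can be sketched in plain Python for a single state. This is a minimal illustration and not SCOPE-RL's implementation: the function name `softmax_policy` and the Q-values are hypothetical, and the max-subtraction step is a standard numerical-stability trick that is assumed here rather than taken from the library.

```python
import math


def softmax_policy(q_values, tau=1.0):
    """Return pi(a | s) for one state, given its list of Q(s, a) values.

    Hypothetical helper for illustration only (not part of scope_rl).
    """
    if tau == 0:
        raise ValueError("tau must be non-zero")
    scaled = [q / tau for q in q_values]
    # Subtracting the maximum leaves the ratio unchanged but avoids overflow in exp.
    shift = max(scaled)
    weights = [math.exp(z - shift) for z in scaled]
    total = sum(weights)
    return [w / total for w in weights]


q_values = [1.0, 2.0, 3.0]  # hypothetical Q(s, a) for three actions
probs = softmax_policy(q_values, tau=1.0)
```

The resulting probabilities sum to one, and actions with larger Q-values receive larger probabilities when \(\tau > 0\).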
Note
To ensure API compatibility with d3rlpy, BaseHead inherits d3rlpy.algos.QLearningAlgoBase. This base class also has additional methods including fit, predict, and predict_value. Please also refer to the d3rlpy documentation for the methods that are not described in this API reference.
- Parameters:
base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.
name (str) – Name of the policy.
n_actions (int (> 0)) – Number of actions.
tau (float, default=1.0) – Temperature parameter. The value must not be zero. A negative value reverses the preference order (actions with lower Q-values become more likely) and thus leads to a sub-optimal policy.
random_state (int (>= 0), default=None) – Random state.
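To make the role of tau concrete, the sketch below evaluates the softmax formula for a small positive, a large positive, and a negative temperature. It is an illustration under stated assumptions: the helper `softmax` and the Q-values are hypothetical and do not come from scope_rl.

```python
import math


def softmax(q_values, tau):
    """Softmax over Q-values with temperature tau (hypothetical helper)."""
    scaled = [q / tau for q in q_values]
    shift = max(scaled)  # stabilizes exp without changing the result
    weights = [math.exp(z - shift) for z in scaled]
    total = sum(weights)
    return [w / total for w in weights]


q_values = [0.0, 1.0, 2.0]  # hypothetical Q-values; action 2 has the highest value

sharp = softmax(q_values, tau=0.1)      # small tau: close to greedy on action 2
flat = softmax(q_values, tau=100.0)     # large tau: close to uniform
inverted = softmax(q_values, tau=-1.0)  # negative tau: favors the lowest-valued action
```

A small positive tau concentrates probability on the highest-valued action, a large tau approaches uniform random choice, and a negative tau prefers the action with the lowest Q-value, which is why it yields a sub-optimal policy.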
- Attributes:
- random_state
Methods
calc_action_choice_probability(x) – Calculate action choice probabilities.
calc_pscore_given_action(x, action) – Calculate the pscore of a given action.
sample_action_and_output_pscore(x) – Sample stochastic action with its pscore.
- sample_action_and_output_pscore(x)
Sample stochastic action with its pscore.
- Parameters:
x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).
- Returns:
action (ndarray of shape (n_samples, )) – Sampled action.
pscore (ndarray of shape (n_samples, )) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
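As a rough sketch of what sampling an action together with its pscore involves, the example below draws one action per row of a probability matrix and records the probability of the drawn action. The function `sample_action_and_pscore`, the probability values, and the seed are hypothetical; the actual method takes states `x` and computes the probabilities from the base policy's Q-values.

```python
import random


def sample_action_and_pscore(probs, rng):
    """Sample one action per row of `probs` and return (actions, pscores).

    `probs` is a list of per-state action distributions (hypothetical input).
    """
    actions, pscores = [], []
    for row in probs:
        # Draw an action index with probability proportional to its weight.
        action = rng.choices(range(len(row)), weights=row, k=1)[0]
        actions.append(action)
        # The pscore is the probability the policy assigned to the drawn action.
        pscores.append(row[action])
    return actions, pscores


rng = random.Random(12345)  # fixed seed for reproducibility
probs = [[0.1, 0.2, 0.7], [0.5, 0.5, 0.0]]  # hypothetical pi(a | s) for two states
actions, pscores = sample_action_and_pscore(probs, rng)
```

Both returned lists have one entry per state, mirroring the (n_samples, ) shapes documented above.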
- calc_action_choice_probability(x)
Calculate action choice probabilities.
- Parameters:
x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).
- Returns:
pscore – Probability of each action being chosen under the policy for each given state (pscore stands for propensity score).
- Return type:
ndarray of shape (n_samples, n_actions)
- calc_pscore_given_action(x, action)
Calculate the pscore of a given action.
- Parameters:
x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).
action (array-like of shape (n_samples, )) – Action.
- Returns:
pscore – Pscore of the given state and action.
- Return type:
ndarray of shape (n_samples, )
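Conceptually, the pscore of a given action is the entry of the action choice probability matrix selected by that action. The sketch below shows this lookup on hypothetical data; `pscore_given_action` is an illustrative helper, not the library's code, and it starts from precomputed probabilities rather than from states.

```python
def pscore_given_action(probs, actions):
    """Pick pi(a_i | s_i) for each sample i (hypothetical helper).

    probs:   list of n_samples rows, each with n_actions probabilities
    actions: list of n_samples action indices
    """
    if len(probs) != len(actions):
        raise ValueError("probs and actions must have the same number of samples")
    return [row[a] for row, a in zip(probs, actions)]


probs = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]  # hypothetical pi(a | s) for two states
actions = [2, 0]                            # given action for each state
pscores = pscore_given_action(probs, actions)
```

The result has one value per sample, matching the documented return shape (n_samples, ).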