scope_rl.policy.head
Wrapper classes to convert a greedy policy into a stochastic policy.
Classes
BaseHead – Base class to convert a greedy policy into a stochastic policy.
ContinuousEvalHead – Class to transform the base policy into a deterministic evaluation policy.
EpsilonGreedyHead – Class to convert a deterministic policy into an epsilon-greedy policy (applicable to discrete action case).
GaussianHead – Class to sample action from Gaussian distribution (applicable to continuous action case).
OnlineHead – Class to enable online interaction.
SoftmaxHead – Class to convert a Q-learning based policy into a softmax policy (applicable to discrete action space).
TruncatedGaussianHead – Class to sample continuous actions from Truncated Gaussian distribution (applicable to continuous action space).
- class scope_rl.policy.head.BaseHead
Base class to convert a greedy policy into a stochastic policy.
Bases: d3rlpy.algos.QLearningAlgoBase
Imported as: scope_rl.policy.BaseHead
Note
To ensure API compatibility with d3rlpy, BaseHead inherits d3rlpy.algos.QLearningAlgoBase. This base class also has additional methods including fit, predict, and predict_value. Please also refer to the d3rlpy documentation for the methods that are not described in this API reference.
- Attributes:
- action_scalar
Methods
Calculate the action choice probabilities.
calc_pscore_given_action(x, action) – Calculate the pscore of the given action.
Predict the best action in an online environment.
predict_value_online(x, action[, with_std]) – Predict the state action value in an online environment.
sample_action_and_output_pscore(x) – Sample an action stochastically with its pscore.
Sample an action and calculate its pscore in an online environment.
Sample an action in an online environment.
- abstract sample_action_and_output_pscore(x)
Sample an action stochastically with its pscore.
- class scope_rl.policy.head.OnlineHead(base_policy, name)
Class to enable online interaction.
Bases: scope_rl.policy.BaseHead
Imported as: scope_rl.policy.OnlineHead
Note
This class aims to make a d3rlpy policy an instance of BaseHead.
Note
To ensure API compatibility with d3rlpy, BaseHead inherits d3rlpy.algos.QLearningAlgoBase. This base class also has additional methods including fit, predict, and predict_value. Please also refer to the d3rlpy documentation for the methods that are not described in this API reference.
- Parameters:
base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.
name (str) – Name of the policy.
Methods
Only for API consistency.
calc_pscore_given_action(x, action) – Only for API consistency.
Only for API consistency.
- class scope_rl.policy.head.EpsilonGreedyHead(base_policy, name, n_actions, epsilon, random_state=None)
Class to convert a deterministic policy into an epsilon-greedy policy (applicable to discrete action case).
Bases: scope_rl.policy.BaseHead
Imported as: scope_rl.policy.EpsilonGreedyHead
Note
An epsilon-greedy policy stochastically chooses an action \(a \in \mathcal{A}\) given state \(s\) as follows.
\[\pi(a \mid s) := (1 - \epsilon) \cdot \mathbb{I}(a = a^*) + \epsilon / |\mathcal{A}|\]
where \(\epsilon\) is the probability of taking random actions and \(a^*\) is the greedy action. \(\mathbb{I}(\cdot)\) denotes the indicator function.
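As an illustration of the formula above, here is a minimal pure-Python sketch of the epsilon-greedy probabilities for a single state. It is not the library's implementation: the helper names `epsilon_greedy_probs` and `sample_with_pscore` are hypothetical and only mirror the definition in the note.

```python
import random


def epsilon_greedy_probs(greedy_action, n_actions, epsilon):
    """Return pi(a | s) for every action under an epsilon-greedy policy."""
    # Every action receives the uniform share epsilon / |A|.
    probs = [epsilon / n_actions] * n_actions
    # The greedy action additionally receives the remaining mass 1 - epsilon.
    probs[greedy_action] += 1.0 - epsilon
    return probs


def sample_with_pscore(greedy_action, n_actions, epsilon, rng):
    """Sample one action and return it with its propensity score."""
    probs = epsilon_greedy_probs(greedy_action, n_actions, epsilon)
    action = rng.choices(range(n_actions), weights=probs, k=1)[0]
    return action, probs[action]


probs = epsilon_greedy_probs(greedy_action=2, n_actions=4, epsilon=0.2)
action, pscore = sample_with_pscore(2, 4, 0.2, random.Random(0))
```

With `n_actions=4` and `epsilon=0.2`, the greedy action receives probability 0.85 and each of the other actions receives 0.05.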
Note
To ensure API compatibility with d3rlpy, BaseHead inherits d3rlpy.algos.QLearningAlgoBase. This base class also has additional methods including fit, predict, and predict_value. Please also refer to the d3rlpy documentation for the methods that are not described in this API reference.
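- Parameters:
base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.
name (str) – Name of the policy.
n_actions (int (> 0)) – Number of actions.
epsilon (float) – Probability of taking random actions.
random_state (int (>= 0), default=None) – Random state.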
- Attributes:
- random_state
Methods
calc_action_choice_probability(x) – Calculate action choice probabilities.
calc_pscore_given_action(x, action) – Calculate the pscore of a given action.
sample_action_and_output_pscore(x) – Sample an action stochastically based on the pscore.
- sample_action_and_output_pscore(x)
Sample an action stochastically based on the pscore.
- Parameters:
x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).
- Returns:
action (ndarray of shape (n_samples, )) – Sampled action.
pscore (ndarray of shape (n_samples, )) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
- calc_action_choice_probability(x)
Calculate action choice probabilities.
- Parameters:
x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).
- Returns:
pscore – Probability of each action being chosen under the behavior policy, one column per action (pscore stands for propensity score).
- Return type:
ndarray of shape (n_samples, n_actions)
- calc_pscore_given_action(x, action)
Calculate the pscore of a given action.
- Parameters:
x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).
action (array-like of shape (n_samples, )) – Action.
- Returns:
pscore – Pscore of the given state and action.
- Return type:
ndarray of shape (n_samples, )
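The two return shapes documented above, (n_samples, n_actions) and (n_samples, ), suggest a simple relationship in the discrete case. The sketch below assumes that the pscore of a given action is the matching entry of the choice-probability matrix. This is an assumption about how the two outputs relate, not a description of the library's code, and `pscore_given_action` is a hypothetical helper.

```python
def pscore_given_action(choice_prob, action):
    """Pick, for each sample, the probability of the action that was taken.

    choice_prob: nested list of shape (n_samples, n_actions)
    action: list of shape (n_samples,) holding action indices
    """
    return [row[a] for row, a in zip(choice_prob, action)]


# Two samples, four actions; the greedy actions are 0 and 2 with epsilon = 0.2.
choice_prob = [
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.85, 0.05],
]
action = [0, 1]
pscore = pscore_given_action(choice_prob, action)
```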
- class scope_rl.policy.head.SoftmaxHead(base_policy, name, n_actions, tau=1.0, random_state=None)
Class to convert a Q-learning based policy into a softmax policy (applicable to discrete action space).
Bases: scope_rl.policy.BaseHead
Imported as: scope_rl.policy.SoftmaxHead
Note
A softmax policy stochastically chooses an action (i.e., \(a \in \mathcal{A}\)) given state \(s\) as follows.
\[\pi(a \mid s) := \frac{\exp(Q(s, a) / \tau)}{\sum_{a' \in \mathcal{A}} \exp(Q(s, a') / \tau)}\]
where \(\tau\) is the temperature parameter of the softmax function. \(Q(s, a)\) is the predicted value for the given \((s, a)\) pair.
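The effect of the temperature is easiest to see numerically. The following is a minimal pure-Python sketch of the softmax formula above for a single state. `softmax_probs` is a hypothetical helper rather than part of the library, and the max-subtraction step is a standard numerical-stability trick that leaves the probabilities unchanged.

```python
import math


def softmax_probs(q_values, tau=1.0):
    """Return pi(a | s) for every action given the Q-values of one state."""
    if tau == 0:
        raise ValueError("tau must be non-zero")
    scaled = [q / tau for q in q_values]
    # Subtracting the maximum does not change the result but avoids overflow.
    shift = max(scaled)
    exps = [math.exp(v - shift) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


q = [1.0, 2.0, 3.0]
p_default = softmax_probs(q, tau=1.0)   # favors the highest Q-value
p_flat = softmax_probs(q, tau=10.0)     # closer to uniform
p_neg = softmax_probs(q, tau=-1.0)      # favors the lowest Q-value
```

A larger positive `tau` flattens the distribution, while a negative `tau` reverses the ranking so that the action with the lowest Q-value becomes the most likely.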
Note
To ensure API compatibility with d3rlpy, BaseHead inherits d3rlpy.algos.QLearningAlgoBase. This base class also has additional methods including fit, predict, and predict_value. Please also refer to the d3rlpy documentation for the methods that are not described in this API reference.
- Parameters:
base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.
name (str) – Name of the policy.
n_actions (int (> 0)) – Number of actions.
tau (float, default=1.0) – Temperature parameter. The value must not be zero. A negative value reverses the preference order over actions and therefore leads to a sub-optimal policy.
random_state (int (>= 0), default=None) – Random state.
- Attributes:
- random_state
Methods
calc_action_choice_probability(x) – Calculate action choice probabilities.
calc_pscore_given_action(x, action) – Calculate the pscore of a given action.
sample_action_and_output_pscore(x) – Sample stochastic action with its pscore.
- sample_action_and_output_pscore(x)
Sample stochastic action with its pscore.
- Parameters:
x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).
- Returns:
action (ndarray of shape (n_samples, )) – Sampled action.
pscore (ndarray of shape (n_samples, )) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
- calc_action_choice_probability(x)
Calculate action choice probabilities.
- Parameters:
x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).
- Returns:
pscore – Probability of each action being chosen under the behavior policy, one column per action (pscore stands for propensity score).
- Return type:
ndarray of shape (n_samples, n_actions)
- calc_pscore_given_action(x, action)
Calculate the pscore of a given action.
- Parameters:
x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).
action (array-like of shape (n_samples, )) – Action.
- Returns:
pscore – Pscore of the given state and action.
- Return type:
ndarray of shape (n_samples, )
- class scope_rl.policy.head.GaussianHead(base_policy, name, sigma, random_state=None)
Class to sample action from Gaussian distribution (applicable to continuous action case).
Bases: scope_rl.policy.BaseHead
Imported as: scope_rl.policy.GaussianHead
Note
This class should be used when action_space is not clipped. Otherwise, please use TruncatedGaussianHead instead.
Given a deterministic policy, a Gaussian policy samples an action \(a \in \mathcal{A}\) given state \(s\) as follows.
\[a \sim \mathrm{Normal}(\pi(s), \sigma)\]
where \(\sigma\) is the standard deviation of the normal distribution. \(\pi(s)\) is the action chosen by the deterministic policy.
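To make the sampling rule above concrete, the sketch below draws an action around a greedy action and evaluates its density in pure Python. It assumes that the action dimensions are independent, so the pscore is the product of per-dimension densities. That is an assumption made for illustration, and the helper names are hypothetical rather than the library's.

```python
import math
import random


def gaussian_pdf(a, mean, sigma):
    """Density of Normal(mean, sigma) evaluated at a."""
    z = (a - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def sample_gaussian_action(greedy_action, sigma, rng):
    """Sample one action vector and return it with its density (pscore)."""
    action = [rng.gauss(m, s) for m, s in zip(greedy_action, sigma)]
    # Assumption: independent dimensions, so the joint density is a product.
    pscore = math.prod(
        gaussian_pdf(a, m, s) for a, m, s in zip(action, greedy_action, sigma)
    )
    return action, pscore


greedy_action = [0.3, -1.2]   # pi(s) for a two-dimensional action
sigma = [0.5, 0.5]
action, pscore = sample_gaussian_action(greedy_action, sigma, random.Random(0))
```

Because the action is continuous, the pscore here is a probability density rather than a probability, so it can exceed 1 when `sigma` is small.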
Note
To ensure API compatibility with d3rlpy, BaseHead inherits d3rlpy.algos.QLearningAlgoBase. This base class also has additional methods including fit, predict, and predict_value. Please also refer to the d3rlpy documentation for the methods that are not described in this API reference.
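- Parameters:
base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.
name (str) – Name of the policy.
sigma (array-like of shape (action_dim, )) – Standard deviation of Gaussian distribution.
random_state (int (>= 0), default=None) – Random state.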
- Attributes:
- random_state
Methods
calc_action_choice_probability(greedy_action, action) – Calculate action choice probabilities.
calc_pscore_given_action(x, action) – Calculate the pscore of a given action.
sample_action_and_output_pscore(x) – Sample stochastic action with its pscore.
- calc_action_choice_probability(greedy_action, action)
Calculate action choice probabilities.
- Parameters:
greedy_action (array-like of shape (n_samples, action_dim)) – Greedy action.
action (array-like of shape (n_samples, action_dim)) – Sampled action.
- Returns:
pscore – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
- Return type:
ndarray of shape (n_samples, )
- sample_action_and_output_pscore(x)
Sample stochastic action with its pscore.
- Parameters:
x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).
- Returns:
action (ndarray of shape (n_samples, action_dim)) – Sampled action.
pscore (ndarray of shape (n_samples, )) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
- calc_pscore_given_action(x, action)
Calculate the pscore of a given action.
- Parameters:
x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).
action (array-like of shape (n_samples, action_dim)) – Action.
- Returns:
pscore – Pscore of the given state and action.
- Return type:
ndarray of shape (n_samples, )
- class scope_rl.policy.head.TruncatedGaussianHead(base_policy, name, sigma, minimum, maximum, random_state=None)
Class to sample continuous actions from Truncated Gaussian distribution (applicable to continuous action space).
Bases: scope_rl.policy.BaseHead
Imported as: scope_rl.policy.TruncatedGaussianHead
Note
Given a deterministic policy, a truncated Gaussian policy samples an action \(a \in \mathcal{A}\) given state \(s\) as follows.
\[a \sim \mathrm{TruncNorm}(\pi(s), \sigma)\]
where \(\sigma\) is the standard deviation of the truncated normal distribution. \(\pi(s)\) is the action chosen by the deterministic policy.
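For intuition about how truncation changes the density, here is a minimal one-dimensional sketch in pure Python. It renormalizes the Gaussian density by the probability mass that falls inside the bounds. The rejection sampler is an illustrative choice, not necessarily how the library samples, and all helper names are hypothetical.

```python
import math
import random


def std_normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def trunc_gaussian_pdf(a, mean, sigma, minimum, maximum):
    """Density of a Gaussian truncated to [minimum, maximum]."""
    if a < minimum or a > maximum:
        return 0.0
    z = (a - mean) / sigma
    density = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    # Probability mass of the untruncated Gaussian inside the bounds.
    mass = std_normal_cdf((maximum - mean) / sigma) - std_normal_cdf(
        (minimum - mean) / sigma
    )
    return density / mass


def sample_trunc_gaussian(mean, sigma, minimum, maximum, rng):
    """Rejection sampling: redraw until the sample lies within the bounds."""
    while True:
        a = rng.gauss(mean, sigma)
        if minimum <= a <= maximum:
            return a


rng = random.Random(0)
action = sample_trunc_gaussian(0.5, 1.0, -1.0, 1.0, rng)
pscore = trunc_gaussian_pdf(action, 0.5, 1.0, -1.0, 1.0)
```

Inside the bounds the truncated density is larger than the untruncated one, because the mass that would have fallen outside is redistributed over the allowed interval.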
Note
To ensure API compatibility with d3rlpy, BaseHead inherits d3rlpy.algos.QLearningAlgoBase. This base class also has additional methods including fit, predict, and predict_value. Please also refer to the d3rlpy documentation for the methods that are not described in this API reference.
- Parameters:
base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.
name (str) – Name of the policy.
sigma (array-like of shape (action_dim, )) – Standard deviation of Gaussian distribution.
minimum (array-like of shape (action_dim, )) – Minimum value of action vector.
maximum (array-like of shape (action_dim, )) – Maximum value of action vector.
random_state (int (>= 0), default=None) – Random state.
- Attributes:
- random_state
Methods
calc_action_choice_probability(greedy_action, action) – Calculate action choice probabilities.
calc_pscore_given_action(x, action) – Calculate the pscore of a given action.
sample_action_and_output_pscore(x) – Sample stochastic action with its pscore.
- calc_action_choice_probability(greedy_action, action)
Calculate action choice probabilities.
- Parameters:
greedy_action (array-like of shape (n_samples, action_dim)) – Greedy action.
action (array-like of shape (n_samples, action_dim)) – Sampled action.
- Returns:
pscore – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
- Return type:
ndarray of shape (n_samples, )
- sample_action_and_output_pscore(x)
Sample stochastic action with its pscore.
- Parameters:
x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).
- Returns:
action (ndarray of shape (n_samples, action_dim)) – Sampled action.
pscore (ndarray of shape (n_samples, )) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
- calc_pscore_given_action(x, action)
Calculate the pscore of a given action.
- Parameters:
x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).
action (array-like of shape (n_samples, action_dim)) – Action.
- Returns:
pscore – Pscore of the given state and action.
- Return type:
ndarray of shape (n_samples, )
- class scope_rl.policy.head.ContinuousEvalHead(base_policy, name, random_state=None)
Class to transform the base policy into a deterministic evaluation policy.
Bases: scope_rl.policy.BaseHead
Imported as: scope_rl.policy.ContinuousEvalHead
Note
To ensure API compatibility with d3rlpy, BaseHead inherits d3rlpy.algos.QLearningAlgoBase. This base class also has additional methods including fit, predict, and predict_value. Please also refer to the d3rlpy documentation for the methods that are not described in this API reference.
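- Parameters:
base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.
name (str) – Name of the policy.
random_state (int (>= 0), default=None) – Random state.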
- Attributes:
- random_state
Methods
Only for API consistency.
calc_pscore_given_action(x, action) – Only for API consistency.
Only for API consistency.