scope_rl.policy.head.TruncatedGaussianHead

class scope_rl.policy.head.TruncatedGaussianHead(base_policy, name, sigma, minimum, maximum, random_state=None)

Class to sample continuous actions from a truncated Gaussian distribution (applicable to continuous action spaces).

Bases: scope_rl.policy.BaseHead

Imported as: scope_rl.policy.TruncatedGaussianHead

Note

Given a deterministic policy, a truncated Gaussian policy samples an action \(a \in \mathcal{A}\) given state \(s\) as follows.

\[a \sim \mathrm{TruncNorm}(\pi(s), \sigma)\]

where \(\pi(s)\) is the action chosen by the deterministic policy and \(\sigma\) is the standard deviation of the truncated normal distribution. The distribution is truncated to the range given by the minimum and maximum parameters.
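The sampling rule above can be illustrated with a minimal standard-library sketch. It is not the library's implementation: the function name `sample_truncated_gaussian`, the inverse-CDF technique, and all numeric values are assumptions chosen for illustration, and each action dimension is treated as independent.

```python
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for cdf / inv_cdf


def sample_truncated_gaussian(mean, sigma, minimum, maximum, rng):
    """Draw one action vector; each dimension is truncated to [minimum, maximum]."""
    action = []
    for m, s, lo, hi in zip(mean, sigma, minimum, maximum):
        # CDF values of the standardized bounds
        p_lo = _STD.cdf((lo - m) / s)
        p_hi = _STD.cdf((hi - m) / s)
        # uniform draw restricted to the probability mass between the bounds
        u = rng.uniform(p_lo, p_hi)
        # inv_cdf requires 0 < u < 1, so keep u strictly inside that interval
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        a = m + s * _STD.inv_cdf(u)
        action.append(min(max(a, lo), hi))  # clip floating-point rounding error
    return action


rng = random.Random(12345)      # plays the role of random_state
greedy_action = [0.2, -0.5]     # stands in for pi(s)
samples = [
    sample_truncated_gaussian(greedy_action, [0.5, 0.5], [-1.0, -1.0], [1.0, 1.0], rng)
    for _ in range(1000)
]
```

Every sampled action stays inside the bounds, and the samples concentrate around the greedy action; a smaller `sigma` makes the policy closer to deterministic.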

Note

To ensure API compatibility with d3rlpy, BaseHead inherits from d3rlpy.algos.QLearningAlgoBase. This base class provides additional methods, including fit, predict, and predict_value. For methods that are not described in this API reference, please refer to the d3rlpy documentation.

Parameters:

base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.
name (str) – Name of the policy.
sigma (array-like of shape (action_dim, )) – Standard deviation of the Gaussian distribution.
minimum (array-like of shape (action_dim, )) – Minimum value of the action vector.
maximum (array-like of shape (action_dim, )) – Maximum value of the action vector.
random_state (int (>= 0), default=None) – Random state.

Attributes:

random_state

Methods

`calc_action_choice_probability`(...)	Calculate action choice probabilities.
`calc_pscore_given_action`(x, action)	Calculate the pscore of a given action.
`sample_action_and_output_pscore`(x)	Sample stochastic action with its pscore.

calc_action_choice_probability(greedy_action, action)

Calculate action choice probabilities.

Parameters:

greedy_action (array-like of shape (n_samples, action_dim)) – Greedy action.
action (array-like of shape (n_samples, action_dim)) – Sampled action.

Returns:

pscore – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

Return type:

ndarray of shape (n_samples, )
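As a rough illustration of what such a propensity looks like for a truncated Gaussian, the sketch below evaluates the truncated normal density per action dimension and multiplies across dimensions to obtain one value per sample. Taking the product over dimensions is an assumption that matches the documented return shape (n_samples, ); the helper names `truncated_gaussian_pdf` and `pscore` are hypothetical and not part of the library.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal


def truncated_gaussian_pdf(a, mean, sigma, lo, hi):
    """Density of a 1-D Gaussian N(mean, sigma^2) truncated to [lo, hi]."""
    if a < lo or a > hi:
        return 0.0  # no probability mass outside the bounds
    # probability mass of the untruncated Gaussian inside [lo, hi]
    mass = _STD.cdf((hi - mean) / sigma) - _STD.cdf((lo - mean) / sigma)
    return _STD.pdf((a - mean) / sigma) / (sigma * mass)


def pscore(greedy_action, action, sigma, minimum, maximum):
    """One value per sample: product of the per-dimension densities (assumed)."""
    scores = []
    for g_row, a_row in zip(greedy_action, action):
        p = 1.0
        for g, a, s, lo, hi in zip(g_row, a_row, sigma, minimum, maximum):
            p *= truncated_gaussian_pdf(a, g, s, lo, hi)
        scores.append(p)
    return scores


greedy = [[0.0, 0.0], [0.5, -0.5]]
sampled = [[0.1, -0.2], [0.4, 2.0]]  # second row has a component outside the bounds
scores = pscore(greedy, sampled, sigma=[0.5, 0.5], minimum=[-1.0, -1.0], maximum=[1.0, 1.0])
```

Because the action space is continuous, these values are densities rather than probabilities, so they can exceed 1 when `sigma` is small.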

sample_action_and_output_pscore(x)

Sample stochastic action with its pscore.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

Returns:

action (ndarray of shape (n_samples, action_dim)) – Sampled action.
pscore (ndarray of shape (n_samples, )) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

calc_pscore_given_action(x, action)

Calculate the pscore of a given action.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).
action (array-like of shape (n_samples, action_dim)) – Action.

Returns:

pscore – Propensity score (pscore) of the given action in the given state.

Return type:

ndarray of shape (n_samples, )
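Unlike calc_action_choice_probability, this method takes the state x rather than the greedy action. One plausible reading, sketched below under that assumption, is that the state is first mapped to a greedy action by the deterministic base policy and the density of the supplied action is then evaluated around it. The toy policy `toy_predict`, the 1-D action space, and the constants are hypothetical stand-ins, not library code.

```python
from statistics import NormalDist

_STD = NormalDist()
SIGMA, LOW, HIGH = 0.5, -1.0, 1.0  # illustrative values for a 1-D action


def toy_predict(x):
    """Hypothetical deterministic policy: mean of the state features, clipped to the bounds."""
    return [min(max(sum(row) / len(row), LOW), HIGH) for row in x]


def density(a, mean):
    """Gaussian density centered at `mean`, truncated to [LOW, HIGH]."""
    if not LOW <= a <= HIGH:
        return 0.0
    mass = _STD.cdf((HIGH - mean) / SIGMA) - _STD.cdf((LOW - mean) / SIGMA)
    return _STD.pdf((a - mean) / SIGMA) / (SIGMA * mass)


def pscore_given_action(x, action):
    """State -> greedy action -> density of the supplied action."""
    return [density(a, g) for g, a in zip(toy_predict(x), action)]


states = [[0.2, 0.4], [1.0, -1.0]]  # shape (n_samples, state_dim)
actions = [0.3, 0.0]                # one 1-D action per state
pscores = pscore_given_action(states, actions)
```

This form is convenient when only logged states and actions are available, since the greedy action does not have to be computed separately.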
