scope_rl.policy.head.GaussianHead

class scope_rl.policy.head.GaussianHead(base_policy, name, sigma, random_state=None)

Class that samples actions from a Gaussian distribution (applicable to the continuous action case).

Bases: scope_rl.policy.BaseHead

Imported as: scope_rl.policy.GaussianHead

Note

Use this class when action_space is not clipped. If it is clipped, use TruncatedGaussianHead instead.

Given a deterministic policy \(\pi\), a Gaussian policy samples an action \(a \in \mathcal{A}\) in state \(s\) as follows.

\[a \sim \mathrm{Normal}(\pi(s), \sigma)\]

where \(\pi(s)\) is the action chosen by the deterministic policy and \(\sigma\) is the standard deviation of the normal distribution.
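As a minimal sketch of this sampling rule, the snippet below draws actions around a fixed stand-in for \(\pi(s)\). It is an illustration, not SCOPE-RL's implementation: the helper name sample_gaussian_action, the example values, and the assumption that each action dimension is sampled independently are all hypothetical. Only the Python standard library is used, so real code would operate on ndarrays instead of lists.

```python
import random


def sample_gaussian_action(greedy_action, sigma, rng):
    """Draw one action a ~ Normal(pi(s), sigma), independently per dimension.

    Hypothetical helper for illustration; not part of SCOPE-RL.
    """
    return [rng.gauss(mu, sd) for mu, sd in zip(greedy_action, sigma)]


rng = random.Random(12345)    # seeded generator, playing the role of random_state
greedy_action = [0.5, -1.0]   # stand-in for pi(s) from a deterministic policy
sigma = [0.1, 0.2]            # one standard deviation per action dimension

samples = [sample_gaussian_action(greedy_action, sigma, rng) for _ in range(10000)]

# The per-dimension sample mean should sit close to the greedy action.
mean_per_dim = [sum(s[d] for s in samples) / len(samples) for d in range(2)]
```

Averaged over many draws, the sampled actions center on the greedy action, and a larger sigma spreads them more widely around it.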

Note

To ensure API compatibility with d3rlpy, BaseHead inherits from d3rlpy.algos.QLearningAlgoBase. That base class provides additional methods, including fit, predict, and predict_value. For methods that are not described in this API reference, please refer to the d3rlpy documentation.

Parameters:
  • base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.

  • name (str) – Name of the policy.

  • sigma (array-like of shape (action_dim, )) – Standard deviation of the Gaussian distribution, one entry per action dimension.

  • random_state (int (>= 0), default=None) – Random state.

Attributes:
random_state

Methods

  • calc_action_choice_probability(...) – Calculate action choice probabilities.

  • calc_pscore_given_action(x, action) – Calculate the pscore of a given action.

  • sample_action_and_output_pscore(x) – Sample stochastic action with its pscore.

calc_action_choice_probability(greedy_action, action)

Calculate action choice probabilities.

Parameters:
  • greedy_action (array-like of shape (n_samples, action_dim)) – Greedy action.

  • action (array-like of shape (n_samples, action_dim)) – Sampled action.

Returns:

pscore – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

Return type:

ndarray of shape (n_samples, )
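The following is a hedged sketch of how such a pscore could be computed for a Gaussian policy. It assumes that action dimensions are independent, so the density of an action is the product of the per-dimension normal densities centered on the greedy action. The function names normal_pdf and gaussian_pscore are hypothetical, and the library's actual computation may differ.

```python
import math


def normal_pdf(value, mean, std):
    """Density of Normal(mean, std) evaluated at value."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gaussian_pscore(greedy_action, action, sigma):
    """Return one density per sample.

    Assumption (not confirmed by the reference): dimensions are independent,
    so per-dimension densities are multiplied together.
    """
    pscore = []
    for mu_row, a_row in zip(greedy_action, action):
        density = 1.0
        for mu, a, sd in zip(mu_row, a_row, sigma):
            density *= normal_pdf(a, mu, sd)
        pscore.append(density)
    return pscore


greedy_action = [[0.0, 0.0], [1.0, -1.0]]  # shape (n_samples, action_dim)
action = [[0.0, 0.0], [1.5, -1.0]]         # shape (n_samples, action_dim)
sigma = [1.0, 0.5]                         # shape (action_dim, )

pscore = gaussian_pscore(greedy_action, action, sigma)  # shape (n_samples, )
```

Under this assumption the returned values are probability densities rather than probabilities, so they can exceed 1 when sigma is small.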

sample_action_and_output_pscore(x)

Sample stochastic action with its pscore.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (following d3rlpy's implementation, the state is denoted ‘x’ rather than ‘s’).

Returns:

  • action (ndarray of shape (n_samples, action_dim)) – Sampled action.

  • pscore (ndarray of shape (n_samples, )) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
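One way to picture this method is as three steps: predict the greedy action, add Gaussian noise, and evaluate the density of the resulting action. The sketch below follows that pattern under stated assumptions. The function sample_action_and_pscore, the toy_policy stand-in for the base policy, and the independent-dimension density are all hypothetical and are not taken from the library's source.

```python
import math
import random


def sample_action_and_pscore(predict, x, sigma, rng):
    """Return (action, pscore) for a batch of states x.

    `predict` is a hypothetical stand-in for the deterministic base policy.
    """
    actions, pscores = [], []
    for state in x:
        greedy = predict(state)
        # Step 2: perturb the greedy action with per-dimension Gaussian noise.
        action = [rng.gauss(mu, sd) for mu, sd in zip(greedy, sigma)]
        # Step 3: density of the sampled action (independent dimensions assumed).
        density = 1.0
        for a, mu, sd in zip(action, greedy, sigma):
            z = (a - mu) / sd
            density *= math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
        actions.append(action)
        pscores.append(density)
    return actions, pscores


def toy_policy(state):
    """Toy deterministic policy: push each state coordinate back toward zero."""
    return [-0.5 * s for s in state]


rng = random.Random(0)
x = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]  # shape (n_samples, state_dim)
action, pscore = sample_action_and_pscore(toy_policy, x, sigma=[0.3, 0.3], rng=rng)
```

Returning the pscore together with the action means the density of each sampled action is recorded at the moment it is drawn.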

calc_pscore_given_action(x, action)

Calculate the pscore of a given action.

Parameters:
  • x (array-like of shape (n_samples, state_dim)) – State (following d3rlpy's implementation, the state is denoted ‘x’ rather than ‘s’).

  • action (array-like of shape (n_samples, action_dim)) – Action.

Returns:

pscore – Propensity score (pscore) of the given action in the given state.

Return type:

ndarray of shape (n_samples, )
