scope_rl.policy.head

Wrapper classes to convert a greedy policy into a stochastic policy.

Classes

`BaseHead`	Base class to convert a greedy policy into a stochastic policy.
`ContinuousEvalHead`	Class to transform the base policy into a deterministic evaluation policy.
`EpsilonGreedyHead`	Class to convert a deterministic policy into an epsilon-greedy policy (applicable to discrete action case).
`GaussianHead`	Class to sample an action from a Gaussian distribution (applicable to continuous action case).
`OnlineHead`	Class to enable online interaction.
`SoftmaxHead`	Class to convert a Q-learning based policy into a softmax policy (applicable to discrete action space).
`TruncatedGaussianHead`	Class to sample continuous actions from a truncated Gaussian distribution (applicable to continuous action space).

class scope_rl.policy.head.BaseHead

Base class to convert a greedy policy into a stochastic policy.

Bases: d3rlpy.algos.QLearningAlgoBase

Imported as: scope_rl.policy.BaseHead

Note

To ensure API compatibility with d3rlpy, BaseHead inherits from d3rlpy.algos.QLearningAlgoBase. That base class provides additional methods, including fit, predict, and predict_value. For methods that are not described in this API reference, please refer to the d3rlpy documentation.

Attributes:

action_scalar

Methods

`calc_action_choice_probability`(x)	Calculate the action choice probabilities.
`calc_pscore_given_action`(x, action)	Calculate the pscore of the given action.
`predict_online`(x)	Predict the best action in an online environment.
`predict_value_online`(x, action[, with_std])	Predict the state action value in an online environment.
`sample_action_and_output_pscore`(x)	Sample an action stochastically with its pscore.
`sample_action_and_output_pscore_online`(x)	Sample an action and calculate its pscore in an online environment.
`sample_action_online`(x)	Sample an action in an online environment.

abstract sample_action_and_output_pscore(x)

Sample an action stochastically with its pscore.

abstract calc_action_choice_probability(x)

Calculate the action choice probabilities.

abstract calc_pscore_given_action(x, action)

Calculate the pscore of the given action.

predict_online(x)

Predict the best action in an online environment.

predict_value_online(x, action, with_std=False)

Predict the state action value in an online environment.

sample_action_online(x)

Sample an action in an online environment.

sample_action_and_output_pscore_online(x)

Sample an action and calculate its pscore in an online environment.

class scope_rl.policy.head.OnlineHead(base_policy, name)

Class to enable online interaction.

Bases: scope_rl.policy.BaseHead

Imported as: scope_rl.policy.OnlineHead

Note

This class wraps a d3rlpy policy so that it becomes an instance of BaseHead.

Parameters:

base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.
name (str) – Name of the policy.

Methods

`calc_action_choice_probability`(x)	Only for API consistency.
`calc_pscore_given_action`(x, action)	Only for API consistency.
`sample_action_and_output_pscore`(x)	Only for API consistency.

sample_action_and_output_pscore(x)

Only for API consistency.

calc_action_choice_probability(x)

Only for API consistency.

calc_pscore_given_action(x, action)

Only for API consistency.

class scope_rl.policy.head.EpsilonGreedyHead(base_policy, name, n_actions, epsilon, random_state=None)

Class to convert a deterministic policy into an epsilon-greedy policy (applicable to discrete action case).

Bases: scope_rl.policy.BaseHead

Imported as: scope_rl.policy.EpsilonGreedyHead

Note

An epsilon-greedy policy stochastically chooses an action (i.e., \(a \in \mathcal{A}\)) given state \(s\) as follows.

\[\pi(a \mid s) := (1 - \epsilon) \cdot \mathbb{I}(a = a^*) + \epsilon / |\mathcal{A}|\]

where \(\epsilon\) is the probability of taking random actions and \(a^*\) is the greedy action. \(\mathbb{I}(\cdot)\) denotes the indicator function.
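The formula above can be illustrated with a small NumPy sketch. It is not the library's implementation: the helper name `epsilon_greedy_probabilities` and the example inputs are assumptions made for illustration, and the greedy actions are passed in directly instead of being predicted by a base policy.

```python
import numpy as np


def epsilon_greedy_probabilities(greedy_action, n_actions, epsilon):
    """Return pi(a | s) for every action, shape (n_samples, n_actions).

    Hypothetical helper: `greedy_action` holds the index a* for each state.
    """
    greedy_action = np.asarray(greedy_action, dtype=int)
    n_samples = greedy_action.shape[0]
    # every action receives the uniform exploration mass epsilon / |A|
    prob = np.full((n_samples, n_actions), epsilon / n_actions)
    # the greedy action a* additionally receives the mass (1 - epsilon)
    prob[np.arange(n_samples), greedy_action] += 1.0 - epsilon
    return prob


# hypothetical greedy actions for three states, with |A| = 4 and epsilon = 0.2
prob = epsilon_greedy_probabilities([2, 0, 3], n_actions=4, epsilon=0.2)
```

Each row sums to one. In the first row the greedy action (index 2) has probability 0.85 and every other action has probability 0.05.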

Parameters:

base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.
name (str) – Name of the policy.
n_actions (int (> 0)) – Number of actions.
epsilon (float) – Probability of exploration. The value should be within [0, 1].
random_state (int, default=None (>= 0)) – Random state.

Attributes:

random_state

Methods

`calc_action_choice_probability`(x)	Calculate action choice probabilities.
`calc_pscore_given_action`(x, action)	Calculate the pscore of a given action.
`sample_action_and_output_pscore`(x)	Sample an action stochastically based on the pscore.

sample_action_and_output_pscore(x)

Sample an action stochastically based on the pscore.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

Returns:

action (ndarray of shape (n_samples, )) – Sampled action.
pscore (ndarray of shape (n_samples, )) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

calc_action_choice_probability(x)

Calculate action choice probabilities.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

Returns:

pscore – Probability of each action being chosen under the policy (pscore stands for propensity score).

Return type:

ndarray of shape (n_samples, n_actions)

calc_pscore_given_action(x, action)

Calculate the pscore of a given action.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).
action (array-like of shape (n_samples, )) – Action.

Returns:

pscore – Pscore of the given state and action.

Return type:

ndarray of shape (n_samples, )
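The two return shapes are related: for a discrete policy, the pscore of a given action can be read from the matrix of action choice probabilities by picking one entry per row. The sketch below shows that lookup under this assumption. The helper name `pscore_of_given_action` and the probability values are hypothetical, and the library may compute the result differently.

```python
import numpy as np


def pscore_of_given_action(action_choice_probability, action):
    """Pick pi(a_i | s_i) for each sample i, shape (n_samples, ).

    Hypothetical helper illustrating a row-wise lookup.
    """
    action_choice_probability = np.asarray(action_choice_probability)
    action = np.asarray(action, dtype=int)
    n_samples = action.shape[0]
    # fancy indexing selects column action[i] from row i
    return action_choice_probability[np.arange(n_samples), action]


# hypothetical probabilities of shape (n_samples, n_actions) = (2, 4)
prob = np.array([[0.05, 0.05, 0.85, 0.05],
                 [0.85, 0.05, 0.05, 0.05]])
pscore = pscore_of_given_action(prob, action=[2, 1])
```

Here the first sample took its greedy action and the second did not, so the pscores are 0.85 and 0.05.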

class scope_rl.policy.head.SoftmaxHead(base_policy, name, n_actions, tau=1.0, random_state=None)

Class to convert a Q-learning based policy into a softmax policy (applicable to discrete action space).

Bases: scope_rl.policy.BaseHead

Imported as: scope_rl.policy.SoftmaxHead

Note

A softmax policy stochastically chooses an action (i.e., \(a \in \mathcal{A}\)) given state \(s\) as follows.

\[\pi(a \mid s) := \frac{\exp(Q(s, a) / \tau)}{\sum_{a' \in \mathcal{A}} \exp(Q(s, a') / \tau)}\]

where \(\tau\) is the temperature parameter of the softmax function. \(Q(s, a)\) is the predicted value for the given \((s, a)\) pair.
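As a rough illustration of the formula above, the following NumPy sketch computes softmax probabilities from a matrix of Q-values. The helper name `softmax_probabilities` and the Q-values are assumptions for illustration. Subtracting the row-wise maximum is a common numerical-stability step and is not taken from the library.

```python
import numpy as np


def softmax_probabilities(q_values, tau=1.0):
    """Return pi(a | s) of shape (n_samples, n_actions) from Q(s, a).

    Hypothetical helper: `q_values` has shape (n_samples, n_actions).
    """
    logits = np.asarray(q_values, dtype=float) / tau
    # subtracting the row-wise maximum leaves the result unchanged
    # but avoids overflow in exp
    logits = logits - logits.max(axis=1, keepdims=True)
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum(axis=1, keepdims=True)


# hypothetical Q-values for two states and three actions
q_values = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
prob_sharp = softmax_probabilities(q_values, tau=0.1)    # close to greedy
prob_flat = softmax_probabilities(q_values, tau=10.0)    # close to uniform
prob_negative = softmax_probabilities(q_values, tau=-1.0)  # prefers low Q
```

A small positive temperature concentrates the probability on the action with the highest Q-value, a large one flattens the distribution, and a negative one favors the action with the lowest Q-value.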

Parameters:

base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.
name (str) – Name of the policy.
n_actions (int (> 0)) – Number of actions.
tau (float, default=1.0) – Temperature parameter. The value must be nonzero. A negative value reverses the preference order, so actions with lower Q-values become more likely, which leads to a sub-optimal policy.
random_state (int, default=None (>= 0)) – Random state.

Attributes:

random_state

Methods

`calc_action_choice_probability`(x)	Calculate action choice probabilities.
`calc_pscore_given_action`(x, action)	Calculate the pscore of a given action.
`sample_action_and_output_pscore`(x)	Sample stochastic action with its pscore.

sample_action_and_output_pscore(x)

Sample stochastic action with its pscore.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

Returns:

action (ndarray of shape (n_samples, )) – Sampled action.
pscore (ndarray of shape (n_samples, )) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

calc_action_choice_probability(x)

Calculate action choice probabilities.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

Returns:

pscore – Probability of each action being chosen under the policy (pscore stands for propensity score).

Return type:

ndarray of shape (n_samples, n_actions)

calc_pscore_given_action(x, action)

Calculate the pscore of a given action.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).
action (array-like of shape (n_samples, )) – Action.

Returns:

pscore – Pscore of the given state and action.

Return type:

ndarray of shape (n_samples, )

class scope_rl.policy.head.GaussianHead(base_policy, name, sigma, random_state=None)

Class to sample an action from a Gaussian distribution (applicable to continuous action case).

Bases: scope_rl.policy.BaseHead

Imported as: scope_rl.policy.GaussianHead

Note

This class should be used when action_space is not clipped. Otherwise, please use TruncatedGaussianHead instead.

Given a deterministic policy, a Gaussian policy samples an action \(a \in \mathcal{A}\) given state \(s\) as follows.

\[a \sim Normal(\pi(s), \sigma)\]

where \(\sigma\) is the standard deviation of the normal distribution. \(\pi(s)\) is the action chosen by the deterministic policy.
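The sampling rule above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the library's code: the helper names `gaussian_pdf` and `sample_gaussian_action` are hypothetical, the greedy actions are given directly, and the pscore is assumed to be the product of the per-dimension densities (that is, the action dimensions are treated as independent).

```python
import numpy as np


def gaussian_pdf(x, mean, sigma):
    """Density of Normal(mean, sigma) evaluated at x (elementwise)."""
    z = (x - mean) / sigma
    return np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def sample_gaussian_action(greedy_action, sigma, rng):
    """Sample a ~ Normal(pi(s), sigma) and return (action, pscore).

    Hypothetical helper: `greedy_action` has shape (n_samples, action_dim)
    and `sigma` has shape (action_dim, ).
    """
    greedy_action = np.asarray(greedy_action, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # sigma broadcasts across the n_samples axis
    action = rng.normal(loc=greedy_action, scale=sigma)
    # ASSUMPTION: independent dimensions, so the joint density is a product
    pscore = gaussian_pdf(action, greedy_action, sigma).prod(axis=1)
    return action, pscore


rng = np.random.default_rng(12345)
greedy_action = np.array([[0.0, 1.0], [0.5, -0.5]])  # hypothetical pi(s)
sigma = np.array([1.0, 0.5])
action, pscore = sample_gaussian_action(greedy_action, sigma, rng)
```

Note that for a continuous action the pscore is a probability density, so it can exceed one when sigma is small.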

Parameters:

base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.
name (str) – Name of the policy.
sigma (array-like of shape (action_dim, )) – Standard deviation of Gaussian distribution.
random_state (int, default=None (>= 0)) – Random state.

Attributes:

random_state

Methods

`calc_action_choice_probability`(...)	Calculate action choice probabilities.
`calc_pscore_given_action`(x, action)	Calculate the pscore of a given action.
`sample_action_and_output_pscore`(x)	Sample stochastic action with its pscore.

calc_action_choice_probability(greedy_action, action)

Calculate action choice probabilities.

Parameters:

greedy_action (array-like of shape (n_samples, action_dim)) – Greedy action.
action (array-like of shape (n_samples, action_dim)) – Sampled action.

Returns:

pscore – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

Return type:

ndarray of shape (n_samples, )

sample_action_and_output_pscore(x)

Sample stochastic action with its pscore.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

Returns:

action (ndarray of shape (n_samples, action_dim)) – Sampled action.
pscore (ndarray of shape (n_samples, )) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

calc_pscore_given_action(x, action)

Calculate the pscore of a given action.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).
action (array-like of shape (n_samples, action_dim)) – Action.

Returns:

pscore – Pscore of the given state and action.

Return type:

ndarray of shape (n_samples, )

class scope_rl.policy.head.TruncatedGaussianHead(base_policy, name, sigma, minimum, maximum, random_state=None)

Class to sample continuous actions from a truncated Gaussian distribution (applicable to continuous action space).

Bases: scope_rl.policy.BaseHead

Imported as: scope_rl.policy.TruncatedGaussianHead

Note

Given a deterministic policy, a truncated Gaussian policy samples an action \(a \in \mathcal{A}\) given state \(s\) as follows.

\[a \sim TruncNorm(\pi(s), \sigma)\]

where \(\sigma\) is the standard deviation of the truncated normal distribution and \(\pi(s)\) is the action chosen by the deterministic policy. The distribution is truncated to the range given by the minimum and maximum parameters.
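To make the effect of truncation concrete, the sketch below evaluates a one-dimensional truncated Gaussian density. It assumes the distribution is a normal centered at \(\pi(s)\) and restricted to the interval between a lower and an upper bound, then renormalized. The helper names are hypothetical and the library may rely on a different routine.

```python
import math

import numpy as np

# vectorized error function built from the standard library
_erf = np.vectorize(math.erf, otypes=[float])


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + _erf(np.asarray(z, dtype=float) / math.sqrt(2.0)))


def normal_pdf(z):
    """Standard normal density."""
    return np.exp(-0.5 * np.asarray(z, dtype=float) ** 2) / math.sqrt(2.0 * math.pi)


def truncated_gaussian_pdf(x, mean, sigma, minimum, maximum):
    """Density of Normal(mean, sigma) restricted to [minimum, maximum]."""
    x = np.asarray(x, dtype=float)
    # probability mass the untruncated normal places inside the bounds
    mass = normal_cdf((maximum - mean) / sigma) - normal_cdf((minimum - mean) / sigma)
    # renormalize so the density integrates to one over the bounds
    density = normal_pdf((x - mean) / sigma) / (sigma * mass)
    # outside the bounds the density is zero
    return np.where((x >= minimum) & (x <= maximum), density, 0.0)


# hypothetical greedy action 0.3, sigma 0.5, action range [-1, 1]
density_inside = truncated_gaussian_pdf(0.3, 0.3, 0.5, -1.0, 1.0)
density_outside = truncated_gaussian_pdf(2.0, 0.3, 0.5, -1.0, 1.0)
```

Because the mass outside the bounds is redistributed, the density inside the bounds is higher than that of the untruncated Gaussian with the same mean and standard deviation.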

Parameters:

base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.
name (str) – Name of the policy.
sigma (array-like of shape (action_dim, )) – Standard deviation of Gaussian distribution.
minimum (array-like of shape (action_dim, )) – Minimum value of action vector.
maximum (array-like of shape (action_dim, )) – Maximum value of action vector.
random_state (int, default=None (>= 0)) – Random state.

Attributes:

random_state

Methods

`calc_action_choice_probability`(...)	Calculate action choice probabilities.
`calc_pscore_given_action`(x, action)	Calculate the pscore of a given action.
`sample_action_and_output_pscore`(x)	Sample stochastic action with its pscore.

calc_action_choice_probability(greedy_action, action)

Calculate action choice probabilities.

Parameters:

greedy_action (array-like of shape (n_samples, action_dim)) – Greedy action.
action (array-like of shape (n_samples, action_dim)) – Sampled action.

Returns:

pscore – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

Return type:

ndarray of shape (n_samples, )

sample_action_and_output_pscore(x)

Sample stochastic action with its pscore.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).

Returns:

action (ndarray of shape (n_samples, action_dim)) – Sampled action.
pscore (ndarray of shape (n_samples, )) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

calc_pscore_given_action(x, action)

Calculate the pscore of a given action.

Parameters:

x (array-like of shape (n_samples, state_dim)) – State (we will follow the implementation of d3rlpy and thus use ‘x’ rather than ‘s’).
action (array-like of shape (n_samples, action_dim)) – Action.

Returns:

pscore – Pscore of the given state and action.

Return type:

ndarray of shape (n_samples, )

class scope_rl.policy.head.ContinuousEvalHead(base_policy, name, random_state=None)

Class to transform the base policy into a deterministic evaluation policy.

Bases: scope_rl.policy.BaseHead

Imported as: scope_rl.policy.ContinuousEvalHead

Parameters:

base_policy (QLearningAlgoBase) – Reinforcement learning (RL) policy.
name (str) – Name of the policy.
random_state (int, default=None (>= 0)) – Random state. (This is for API consistency.)

Attributes:

random_state

Methods

`calc_action_choice_probability`(x)	Only for API consistency.
`calc_pscore_given_action`(x, action)	Only for API consistency.
`sample_action_and_output_pscore`(x)	Only for API consistency.

sample_action_and_output_pscore(x)

Only for API consistency.

calc_action_choice_probability(x)

Only for API consistency.

calc_pscore_given_action(x, action)

Only for API consistency.