Title: Multivariate Symmetric Uncertainty and Other Measurements
Description: Estimators for multivariate symmetrical uncertainty based on the work of Gustavo Sosa et al. (2016) <arXiv:1709.08730>, total correlation, information gain and symmetrical uncertainty of categorical variables.
Authors: Gustavo Sosa [aut], Elias Maciel [cre]
Maintainer: Elias Maciel <[email protected]>
License: GPL-3 | file LICENSE
Version: 0.0.1
Built: 2025-01-23 03:54:25 UTC
Source: https://github.com/eliasmacielr/msu
Estimate the sample size for a variable as a function of its categories.
categorical_sample_size(categories, increment = 10)
categories | A vector containing the number of categories of each variable.
increment | A number used as a constant multiplier when incrementing the sample size.
The sample size for a categorical variable based on an ordered permutation heuristic approximation of its categories.
Information gain (also called mutual information) is a measure of the mutual dependence between two variables (see https://en.wikipedia.org/wiki/Mutual_information).
information_gain(x, y)
IG(x, y)
x | A factor representing a categorical variable.
y | A factor representing a categorical variable.
Information gain estimation based on Shannon entropy for variables x and y.
information_gain(factor(c(0,1)), factor(c(1,0)))
information_gain(factor(c(0,0,1,1)), factor(c(0,1,1,1)))
information_gain(factor(c(0,0,1,1)), factor(c(0,1,0,1)))
## Not run:
information_gain(c(0,1), c(1,0))
## End(Not run)
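As a cross-check, information gain can be computed from the entropy identity IG(X;Y) = H(X) + H(Y) - H(X,Y). A minimal base-R sketch, assuming base-2 logarithms; the helpers `ent` and `ent2` are illustrative names, not the package implementation:

```r
# Cross-check of the identity IG(X;Y) = H(X) + H(Y) - H(X,Y) in base R
# (base-2 logarithms assumed; 'ent' and 'ent2' are illustrative helpers,
# not the package source):
ent <- function(x) {
  p <- table(x) / length(x)     # relative frequency of each level
  p <- p[p > 0]
  -sum(p * log2(p))
}
ent2 <- function(x, y) {
  p <- table(x, y) / length(x)  # joint relative frequencies
  p <- p[p > 0]                 # 0 * log(0) is taken to be 0
  -sum(p * log2(p))
}
x <- factor(c(0, 0, 1, 1))
y <- factor(c(0, 1, 1, 1))
ig <- ent(x) + ent(y) - ent2(x, y)
ig                              # about 0.311 bits
```

For factor inputs, `information_gain(x, y)` should agree with this value.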
The joint Shannon entropy provides an estimation of the measure of uncertainty between two random variables (see https://en.wikipedia.org/wiki/Joint_entropy).
joint_shannon_entropy(x, y)
joint_H(x, y)
x | A factor representing a categorical variable.
y | A factor representing a categorical variable.
Joint Shannon entropy estimation for variables x and y.
shannon_entropy for the entropy of a single variable and multivar_joint_shannon_entropy for the entropy associated with more than two random variables.
joint_shannon_entropy(factor(c(0,0,1,1)), factor(c(0,1,0,1)))
joint_shannon_entropy(factor(c('a','b','c')), factor(c('c','b','a')))
## Not run:
joint_shannon_entropy(1)
joint_shannon_entropy(c('a','b'), c('d','e'))
## End(Not run)
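The underlying computation can be sketched in base R over the cross-tabulation of x and y. This assumes base-2 logarithms; the helper name is illustrative, not the package source:

```r
# Minimal sketch of joint entropy over the cross-tabulation of x and y
# (base-2 logarithms assumed; illustrative helper, not the package source):
joint_entropy_sketch <- function(x, y) {
  p <- table(x, y) / length(x)  # joint relative frequencies
  p <- p[p > 0]                 # empty cells contribute nothing
  -sum(p * log2(p))
}
joint_entropy_sketch(factor(c(0,0,1,1)), factor(c(0,1,0,1)))  # 2 bits
```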
MSU is a generalization of symmetrical uncertainty (SU) that considers the interaction between two or more variables, whereas SU can only consider the interaction between two variables. For instance, consider a table with two variables, X1 and X2, and a third variable, Y (the class of the case), that results from applying the logical XOR operator to X1 and X2:
X1 | X2 | Y |
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 0 |
For this case, MSU(X1, X2, Y) = 0.5. This contrasts with the measurements obtained by SU of the variables X1 and X2 against Y, SU(X1, Y) = 0 and SU(X2, Y) = 0.
msu(table_variables, table_class)
table_variables | A list of factors as categorical variables.
table_class | A factor representing the class of the case.
Multivariate symmetrical uncertainty estimation for the variable set {table_variables, table_class}. The result is rounded to 7 decimal places.
# completely predictable
msu(list(factor(c(0,0,1,1))), factor(c(0,0,1,1)))
# XOR
msu(list(factor(c(0,0,1,1)), factor(c(0,1,0,1))), factor(c(0,1,1,0)))
## Not run:
msu(c(factor(c(0,0,1,1)), factor(c(0,1,0,1))), factor(c(0,1,1,0)))
msu(list(factor(c(0,0,1,1)), factor(c(0,1,0,1))), c(0,1,1,0))
## End(Not run)
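Assuming the msu package is installed, the XOR contrast between SU and MSU can be checked directly:

```r
# XOR contrast between SU and MSU (assumes the msu package is installed):
library(msu)
X1 <- factor(c(0, 0, 1, 1))
X2 <- factor(c(0, 1, 0, 1))
Y  <- factor(c(0, 1, 1, 0))      # Y = XOR(X1, X2)
symmetrical_uncertainty(X1, Y)   # 0: X1 alone is uninformative about Y
symmetrical_uncertainty(X2, Y)   # 0: so is X2
msu(list(X1, X2), Y)             # positive: the pair determines Y
```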
The multivariate joint Shannon entropy provides an estimation of the measure of the uncertainty associated with a set of variables (see https://en.wikipedia.org/wiki/Joint_entropy).
multivar_joint_shannon_entropy(table_variables, table_class)
multivar_joint_H(table_variables, table_class)
table_variables | A list of factors as categorical variables.
table_class | A factor representing the class of the case.
Joint Shannon entropy estimation for the variable set {table_variables, table_class}.
shannon_entropy for the entropy of a single variable and joint_shannon_entropy for the entropy associated with two random variables.
multivar_joint_shannon_entropy(list(factor(c(0,1)), factor(c(1,0))), factor(c(1,1)))
The sampling for the items of the created variable is done with replacement.
new_informative_variable(variable_labels, variable_class, information_level = 1)
variable_labels | A factor as the labels for the new informative variable.
variable_class | A factor as the class of the variable.
information_level | An integer as the information level of the new variable.
A factor that represents an informative uniform categorical random variable created using the Kononenko method.
The sampling for the items of the created variable is done with replacement.
new_variable(elements, n)
elements | A vector with the elements from which to choose to create the variable.
n | An integer indicating the number of items to be contained in the variable.
A factor that represents a uniform categorical variable.
new_variable(c(0,1), 4)
new_variable(c('a','b','c'), 10)
Create a set of categorical variables using the logical XOR operator.
new_xor_variables(n_variables = 2, n_instances = 1000, noise = 0)
n_variables | An integer as the number of variables to be created; this is the number of column variables of the table. An additional column is added as the result of applying the XOR operator over the instances.
n_instances | An integer as the number of instances to be created; this is the number of rows of the table.
noise | A number as the noise level for the variables.
A set of random variables constructed using the logical XOR operator.
new_xor_variables(2, 4, 0)
new_xor_variables(5, 10, 0.5)
Relative frequency of values of a categorical variable.
rel_freq(variable)
variable | A factor as a categorical variable.
Relative frequency distribution table for the values in variable.
rel_freq(factor(c(0,1)))
rel_freq(factor(c('a','a','b')))
## Not run:
rel_freq(c(0,1))
## End(Not run)
Estimate the sample size for a categorical variable.
sample_size(max, min = 1, z = 1.96, error = 0.05)
max | A number as the maximum value of the possible categories.
min | A number as the minimum value of the possible categories.
z | A number as the confidence coefficient.
error | Admissible sampling error.
The sample size for a categorical variable based on a variance heuristic approximation.
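One common variance-based heuristic is Cochran's formula n = z^2 p(1-p)/e^2 with maximum variance p = 0.5. Whether sample_size() implements exactly this form, and how max and min enter the computation, is an assumption; the defaults z = 1.96 and error = 0.05 give the familiar value:

```r
# Cochran's variance-based sample size with maximum variance p = 0.5
# (an assumption about the heuristic; how max and min enter sample_size()
# is not shown here):
z <- 1.96
error <- 0.05
p <- 0.5
n <- ceiling(z^2 * p * (1 - p) / error^2)
n   # 385
```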
The Shannon entropy estimates the average minimum number of bits needed to encode a string of symbols, based on the frequency of the symbols (see http://www.bearcave.com/misl/misl_tech/wavelets/compression/shannon.html).
shannon_entropy(x)
H(x)
x | A factor representing a categorical variable.
Shannon entropy estimation of the categorical variable.
shannon_entropy(factor(c(1,0)))
shannon_entropy(factor(c('a','b','c')))
## Not run:
shannon_entropy(1)
shannon_entropy(c('a','b','c'))
## End(Not run)
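The computation can be sketched in base R from the relative frequencies of the factor levels. This assumes a base-2 logarithm; the helper name is illustrative, not the package source:

```r
# Sketch of Shannon entropy in base R (base-2 logarithm assumed;
# illustrative helper, not the package source):
entropy_sketch <- function(x) {
  p <- table(x) / length(x)   # relative frequency of each level
  p <- p[p > 0]               # 0 * log(0) is taken to be 0
  -sum(p * log2(p))
}
entropy_sketch(factor(c(1, 0)))           # 1 bit for a fair binary split
entropy_sketch(factor(c('a', 'b', 'c')))  # log2(3), about 1.585 bits
```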
Symmetrical uncertainty (SU) is the product of a normalization of the information gain (IG) with respect to entropy. SU(X,Y) is a value in the range [0,1], where SU(X,Y) = 0 if X and Y are totally independent and SU(X,Y) = 1 if X and Y are totally dependent.
symmetrical_uncertainty(x, y)
SU(x, y)
x | A factor representing a categorical variable.
y | A factor representing a categorical variable.
Symmetrical uncertainty estimation based on Shannon entropy. The result is rounded to 7 decimal places.
# completely predictable
symmetrical_uncertainty(factor(c(0,1,0,1)), factor(c(0,1,0,1)))
# XOR factor variables
symmetrical_uncertainty(factor(c(0,0,1,1)), factor(c(0,1,1,0)))
symmetrical_uncertainty(factor(c(0,1,0,1)), factor(c(0,1,1,0)))
## Not run:
symmetrical_uncertainty(c(0,1,0,1), c(0,1,1,0))
## End(Not run)
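The usual normalization is SU(X,Y) = 2 IG(X;Y) / (H(X) + H(Y)). A base-R sketch under that assumption (the helper names are illustrative, and base-2 logarithms are assumed):

```r
# Sketch of SU(X,Y) = 2 * IG(X;Y) / (H(X) + H(Y)) in base R
# (an assumption about the exact form used by the package):
ent <- function(x) {
  p <- table(x) / length(x); p <- p[p > 0]
  -sum(p * log2(p))
}
ent2 <- function(x, y) {
  p <- table(x, y) / length(x); p <- p[p > 0]
  -sum(p * log2(p))
}
su_sketch <- function(x, y) {
  ig <- ent(x) + ent(y) - ent2(x, y)   # information gain
  2 * ig / (ent(x) + ent(y))           # normalize into [0, 1]
}
su_sketch(factor(c(0,1,0,1)), factor(c(0,1,0,1)))  # 1: totally dependent
su_sketch(factor(c(0,0,1,1)), factor(c(0,1,1,0)))  # 0: one XOR input alone
```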
Total correlation is a generalization of information gain (IG) to measure the dependency of a set of categorical random variables (see https://en.wikipedia.org/wiki/Total_correlation).
total_correlation(table_variables, table_class)
C(table_variables, table_class)
table_variables | A list of factors as categorical variables.
table_class | A factor representing the class of the case.
Total correlation estimation for the variable set {table_variables, table_class}.
total_correlation(list(factor(c(0,1)), factor(c(1,0))), factor(c(0,0)))
total_correlation(list(factor(c('a','b')), factor(c('a','b'))), factor(c('a','b')))
## Not run:
total_correlation(list(factor(c(0,1)), factor(c(1,0))), c(0,0))
total_correlation(c(factor(c(0,1)), factor(c(1,0))), c(0,0))
## End(Not run)
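Total correlation can be written as C(X1,...,Xn) = sum_i H(Xi) - H(X1,...,Xn). A base-R sketch of this identity (an assumption about the package's exact form; the variadic `ent` helper is illustrative):

```r
# Sketch of C(X1,...,Xn) = sum_i H(Xi) - H(X1,...,Xn) in base R
# (base-2 logarithms assumed; illustrative helper, not the package source):
ent <- function(...) {
  p <- table(...) / length(..1)   # marginal or joint relative frequencies
  p <- p[p > 0]
  -sum(p * log2(p))
}
x <- factor(c(0, 1)); y <- factor(c(1, 0)); z <- factor(c(0, 0))
tc <- ent(x) + ent(y) + ent(z) - ent(x, y, z)
tc   # 1 bit of total correlation
```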