Title: Multivariate Symmetric Uncertainty and Other Measurements
Description: Estimators for multivariate symmetrical uncertainty based on the work of Gustavo Sosa et al. (2016) <arXiv:1709.08730>, total correlation, information gain and symmetrical uncertainty of categorical variables.
Authors: Gustavo Sosa [aut], Elias Maciel [cre]
Maintainer: Elias Maciel <[email protected]>
License: GPL-3 | file LICENSE
Version: 0.0.1
Built: 2025-01-23 03:54:25 UTC
Source: https://github.com/eliasmacielr/msu
Estimate the sample size for a variable as a function of its categories.
categorical_sample_size(categories, increment = 10)
categories | A vector containing the number of categories of each variable.
increment | A number used as a constant multiplier when incrementing the sample size.
The sample size for a categorical variable based on an ordered permutation heuristic approximation of its categories.
Information gain (also called mutual information) is a measure of the mutual dependence between two variables (see https://en.wikipedia.org/wiki/Mutual_information).
information_gain(x, y)
IG(x, y)
x | A factor representing a categorical variable.
y | A factor representing a categorical variable.
Information gain estimation based on Shannon entropy for variables x and y.
information_gain(factor(c(0,1)), factor(c(1,0)))
information_gain(factor(c(0,0,1,1)), factor(c(0,1,1,1)))
information_gain(factor(c(0,0,1,1)), factor(c(0,1,0,1)))
## Not run:
information_gain(c(0,1), c(1,0))
## End(Not run)
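As a cross-check, information gain can be computed from the entropy identity IG(X;Y) = H(X) + H(Y) - H(X,Y). A minimal base-R sketch, assuming base-2 logarithms; the helpers `ent` and `ent2` are illustrative names, not the package implementation:

```r
# Cross-check of the identity IG(X;Y) = H(X) + H(Y) - H(X,Y) in base R
# (base-2 logarithms assumed; 'ent' and 'ent2' are illustrative helpers,
# not the package source):
ent <- function(x) {
  p <- table(x) / length(x)     # relative frequency of each level
  p <- p[p > 0]
  -sum(p * log2(p))
}
ent2 <- function(x, y) {
  p <- table(x, y) / length(x)  # joint relative frequencies
  p <- p[p > 0]                 # 0 * log(0) is taken to be 0
  -sum(p * log2(p))
}
x <- factor(c(0, 0, 1, 1))
y <- factor(c(0, 1, 1, 1))
ig <- ent(x) + ent(y) - ent2(x, y)
ig                              # about 0.311 bits
```

For factor inputs, `information_gain(x, y)` should agree with this value.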
The joint Shannon entropy provides an estimation of the measure of uncertainty between two random variables (see https://en.wikipedia.org/wiki/Joint_entropy).
joint_shannon_entropy(x, y)
joint_H(x, y)
x | A factor representing a categorical variable.
y | A factor representing a categorical variable.
Joint Shannon entropy estimation for variables x and y.
shannon_entropy for the entropy of a single variable and multivar_joint_shannon_entropy for the entropy associated with more than two random variables.
joint_shannon_entropy(factor(c(0,0,1,1)), factor(c(0,1,0,1)))
joint_shannon_entropy(factor(c('a','b','c')), factor(c('c','b','a')))
## Not run:
joint_shannon_entropy(1)
joint_shannon_entropy(c('a','b'), c('d','e'))
## End(Not run)
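The underlying computation can be sketched in base R over the cross-tabulation of x and y. This assumes base-2 logarithms; the helper name is illustrative, not the package source:

```r
# Minimal sketch of joint entropy over the cross-tabulation of x and y
# (base-2 logarithms assumed; illustrative helper, not the package source):
joint_entropy_sketch <- function(x, y) {
  p <- table(x, y) / length(x)  # joint relative frequencies
  p <- p[p > 0]                 # empty cells contribute nothing
  -sum(p * log2(p))
}
joint_entropy_sketch(factor(c(0,0,1,1)), factor(c(0,1,0,1)))  # 2 bits
```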
MSU is a generalization of symmetrical uncertainty (SU) that considers the interaction between two or more variables, whereas SU can only consider the interaction between two variables. For instance, consider a table with two variables, X1 and X2, and a third variable, Y (the class of the case), that results from applying the logical XOR operator to X1 and X2:
X1 | X2 | Y |
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 0 |
For this case, MSU(X1, X2, Y) = 0.5. This contrasts with the measurements obtained by SU of the variables X1 and X2 against Y, SU(X1, Y) = 0 and SU(X2, Y) = 0.
msu(table_variables, table_class)
table_variables | A list of factors as categorical variables.
table_class | A factor representing the class of the case.
Multivariate symmetrical uncertainty estimation for the variable set {table_variables, table_class}. The result is rounded to 7 decimal places.
# completely predictable
msu(list(factor(c(0,0,1,1))), factor(c(0,0,1,1)))
# XOR
msu(list(factor(c(0,0,1,1)), factor(c(0,1,0,1))), factor(c(0,1,1,0)))
## Not run:
msu(c(factor(c(0,0,1,1)), factor(c(0,1,0,1))), factor(c(0,1,1,0)))
msu(list(factor(c(0,0,1,1)), factor(c(0,1,0,1))), c(0,1,1,0))
## End(Not run)
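Assuming the msu package is installed, the XOR contrast between SU and MSU can be checked directly:

```r
# XOR contrast between SU and MSU (assumes the msu package is installed):
library(msu)
X1 <- factor(c(0, 0, 1, 1))
X2 <- factor(c(0, 1, 0, 1))
Y  <- factor(c(0, 1, 1, 0))      # Y = XOR(X1, X2)
symmetrical_uncertainty(X1, Y)   # 0: X1 alone is uninformative about Y
symmetrical_uncertainty(X2, Y)   # 0: so is X2
msu(list(X1, X2), Y)             # positive: the pair determines Y
```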
The multivariate joint Shannon entropy provides an estimation of the measure of the uncertainty associated with a set of variables (see https://en.wikipedia.org/wiki/Joint_entropy).
multivar_joint_shannon_entropy(table_variables, table_class)
multivar_joint_H(table_variables, table_class)
table_variables | A list of factors as categorical variables.
table_class | A factor representing the class of the case.
Joint Shannon entropy estimation for the variable set {table_variables, table_class}.
shannon_entropy for the entropy of a single variable and joint_shannon_entropy for the entropy associated with two random variables.
multivar_joint_shannon_entropy(list(factor(c(0,1)), factor(c(1,0))), factor(c(1,1)))
The sampling for the items of the created variable is done with replacement.
new_informative_variable(variable_labels, variable_class, information_level = 1)
variable_labels | A factor as the labels for the new informative variable.
variable_class | A factor as the class of the variable.
information_level | An integer as the information level of the new variable.
A factor that represents an informative uniform categorical random variable created using the Kononenko method.
The sampling for the items of the created variable is done with replacement.
new_variable(elements, n)
elements | A vector with the elements from which to choose to create the variable.
n | An integer indicating the number of items to be contained in the variable.
A factor that represents a uniform categorical variable.
new_variable(c(0,1), 4)
new_variable(c('a','b','c'), 10)
Create a set of categorical variables using the logical XOR operator.
new_xor_variables(n_variables = 2, n_instances = 1000, noise = 0)
n_variables | An integer as the number of variables to be created; this is the number of column variables of the table. An additional column is added as the result of applying the XOR operator over the instances.
n_instances | An integer as the number of instances to be created; this is the number of rows of the table.
noise | A number as the noise level for the variables.
A set of random variables constructed using the logical XOR operator.
new_xor_variables(2, 4, 0)
new_xor_variables(5, 10, 0.5)
Relative frequency of values of a categorical variable.
rel_freq(variable)
variable | A factor as a categorical variable.
Relative frequency distribution table for the values in variable.
rel_freq(factor(c(0,1)))
rel_freq(factor(c('a','a','b')))
## Not run:
rel_freq(c(0,1))
## End(Not run)
Estimate the sample size for a categorical variable.
sample_size(max, min = 1, z = 1.96, error = 0.05)
max | A number as the maximum value of the possible categories.
min | A number as the minimum value of the possible categories.
z | A number as the confidence coefficient.
error | Admissible sampling error.
The sample size for a categorical variable based on a variance heuristic approximation.
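One common variance-based heuristic is Cochran's formula n = z^2 p(1-p)/e^2 with maximum variance p = 0.5. Whether sample_size() implements exactly this form, and how max and min enter the computation, is an assumption; the defaults z = 1.96 and error = 0.05 give the familiar value:

```r
# Cochran's variance-based sample size with maximum variance p = 0.5
# (an assumption about the heuristic; how max and min enter sample_size()
# is not shown here):
z <- 1.96
error <- 0.05
p <- 0.5
n <- ceiling(z^2 * p * (1 - p) / error^2)
n   # 385
```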
The Shannon entropy estimates the average minimum number of bits needed to encode a string of symbols, based on the frequency of the symbols (see http://www.bearcave.com/misl/misl_tech/wavelets/compression/shannon.html).
shannon_entropy(x)
H(x)
x | A factor representing a categorical variable.
Shannon entropy estimation of the categorical variable.
shannon_entropy(factor(c(1,0)))
shannon_entropy(factor(c('a','b','c')))
## Not run:
shannon_entropy(1)
shannon_entropy(c('a','b','c'))
## End(Not run)
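The computation can be sketched in base R from the relative frequencies of the factor levels. This assumes a base-2 logarithm; the helper name is illustrative, not the package source:

```r
# Sketch of Shannon entropy in base R (base-2 logarithm assumed;
# illustrative helper, not the package source):
entropy_sketch <- function(x) {
  p <- table(x) / length(x)   # relative frequency of each level
  p <- p[p > 0]               # 0 * log(0) is taken to be 0
  -sum(p * log2(p))
}
entropy_sketch(factor(c(1, 0)))           # 1 bit for a fair binary split
entropy_sketch(factor(c('a', 'b', 'c')))  # log2(3), about 1.585 bits
```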
Symmetrical uncertainty (SU) is the product of a normalization of the information gain (IG) with respect to entropy. SU(X,Y) is a value in the range [0,1], where SU(X,Y) = 0 if X and Y are totally independent and SU(X,Y) = 1 if X and Y are totally dependent.
symmetrical_uncertainty(x, y)
SU(x, y)
x | A factor representing a categorical variable.
y | A factor representing a categorical variable.
Symmetrical uncertainty estimation based on Shannon entropy. The result is rounded to 7 decimal places.
# completely predictable
symmetrical_uncertainty(factor(c(0,1,0,1)), factor(c(0,1,0,1)))
# XOR factor variables
symmetrical_uncertainty(factor(c(0,0,1,1)), factor(c(0,1,1,0)))
symmetrical_uncertainty(factor(c(0,1,0,1)), factor(c(0,1,1,0)))
## Not run:
symmetrical_uncertainty(c(0,1,0,1), c(0,1,1,0))
## End(Not run)
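The usual normalization is SU(X,Y) = 2 IG(X;Y) / (H(X) + H(Y)). A base-R sketch under that assumption (the helper names are illustrative, and base-2 logarithms are assumed):

```r
# Sketch of SU(X,Y) = 2 * IG(X;Y) / (H(X) + H(Y)) in base R
# (an assumption about the exact form used by the package):
ent <- function(x) {
  p <- table(x) / length(x); p <- p[p > 0]
  -sum(p * log2(p))
}
ent2 <- function(x, y) {
  p <- table(x, y) / length(x); p <- p[p > 0]
  -sum(p * log2(p))
}
su_sketch <- function(x, y) {
  ig <- ent(x) + ent(y) - ent2(x, y)   # information gain
  2 * ig / (ent(x) + ent(y))           # normalize into [0, 1]
}
su_sketch(factor(c(0,1,0,1)), factor(c(0,1,0,1)))  # 1: totally dependent
su_sketch(factor(c(0,0,1,1)), factor(c(0,1,1,0)))  # 0: one XOR input alone
```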
Total correlation is a generalization of information gain (IG) to measure the dependency of a set of categorical random variables (see https://en.wikipedia.org/wiki/Total_correlation).
total_correlation(table_variables, table_class)
C(table_variables, table_class)
table_variables | A list of factors as categorical variables.
table_class | A factor representing the class of the case.
Total correlation estimation for the variable set {table_variables, table_class}.
total_correlation(list(factor(c(0,1)), factor(c(1,0))), factor(c(0,0)))
total_correlation(list(factor(c('a','b')), factor(c('a','b'))), factor(c('a','b')))
## Not run:
total_correlation(list(factor(c(0,1)), factor(c(1,0))), c(0,0))
total_correlation(c(factor(c(0,1)), factor(c(1,0))), c(0,0))
## End(Not run)
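Total correlation can be written as C(X1,...,Xn) = sum_i H(Xi) - H(X1,...,Xn). A base-R sketch of this identity (an assumption about the package's exact form; the variadic `ent` helper is illustrative):

```r
# Sketch of C(X1,...,Xn) = sum_i H(Xi) - H(X1,...,Xn) in base R
# (base-2 logarithms assumed; illustrative helper, not the package source):
ent <- function(...) {
  p <- table(...) / length(..1)   # marginal or joint relative frequencies
  p <- p[p > 0]
  -sum(p * log2(p))
}
x <- factor(c(0, 1)); y <- factor(c(1, 0)); z <- factor(c(0, 0))
tc <- ent(x) + ent(y) + ent(z) - ent(x, y, z)
tc   # 1 bit of total correlation
```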