Tcl Library Source Code

Check-in [44761b447e]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Add several commands to the statistics package: Wasserstein distance, KL divergence, logistic regression. Also documentation and tests for these commands.
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA3-256:44761b447e071be7f617251f95c1a92e8c1f90d78d0d120e4df87be8bb5faa4c
User & Date: arjenmarkus 2018-09-04 19:11:39
Context
2018-09-06
09:46
Fix namespace resolution for relative namespaces. Add "normalize" and 'strip" commands. check-in: ccd9433cb2 user: pooryorick tags: trunk
2018-09-04
19:11
Add several commands to the statistics package: Wasserstein distance, KL divergence, logistic regression. Also documentation and tests for these commands. check-in: 44761b447e user: arjenmarkus tags: trunk
2018-09-03
08:19
Replace getheader/setheader with "header". Adapt header commands to work with header parameters. Add "header serialize". Add RFC 2231 support. check-in: 1388f3b7c9 user: pooryorick tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to modules/math/ChangeLog.







1
2
3
4
5
6
7






2018-07-22  Arjen Markus <arjenmarkus@users.sourceforge.net>
	* trig.tcl: Implementation of (additional) trigonometric functions - package math::trig
	* trig.test: Trst for the (additional) trigonometric functions
	* trig.man: Documentation for the (additional) trigonometric functions
	* pkgIndex.tcl: Add the new package

2018-07-19  Arjen Markus <arjenmarkus@users.sourceforge.net>
>
>
>
>
>
>







1
2
3
4
5
6
7
8
9
10
11
12
13
2018-08-04  Arjen Markus <arjenmarkus@users.sourceforge.net>
	* statistics.tcl: Source stat_wasserstein.tcl and stat_logit.tcl - for new commands
	* statistics.test: Add corresponding tests
	* statistics.man: Add description of these commands
	* pkgIndex.tcl: Bump the version to 1.3.0

2018-07-22  Arjen Markus <arjenmarkus@users.sourceforge.net>
	* trig.tcl: Implementation of (additional) trigonometric functions - package math::trig
	* trig.test: Trst for the (additional) trigonometric functions
	* trig.man: Documentation for the (additional) trigonometric functions
	* pkgIndex.tcl: Add the new package

2018-07-19  Arjen Markus <arjenmarkus@users.sourceforge.net>

Changes to modules/math/TODO.

1






2
3
4
5
6
7
8
This file records outstanding actions for the math module







dd. 17 june 2018
- Factor out the backward rotation in the intersection routines for circles
- Add a normalisation routine for vectors
- Add routines to construct a perpendicular vector and line
- Add a routine to return the perpendicular bisector of a line segment
- Add routines to deal with triangles (incircle, circumcircle)

>
>
>
>
>
>







1
2
3
4
5
6
7
8
9
10
11
12
13
14
This file records outstanding actions for the math module

dd. 4 september 2018
- Implement a "typical profile" for timeseries and determining residuals
  (Plus perhaps a notion of outliers)
- Implement detection of extreme values/periods with extreme values


dd. 17 june 2018
- Factor out the backward rotation in the intersection routines for circles
- Add a normalisation routine for vectors
- Add routines to construct a perpendicular vector and line
- Add a routine to return the perpendicular bisector of a line segment
- Add routines to deal with triangles (incircle, circumcircle)

Changes to modules/math/pkgIndex.tcl.

17
18
19
20
21
22
23

24
25
26
27
28
29
30
31
32
33
34
package ifneeded math::bignum            3.1.1 [list source [file join $dir bignum.tcl]]
package ifneeded math::bigfloat          1.2.2 [list source [file join $dir bigfloat.tcl]]
package ifneeded math::machineparameters 0.1   [list source [file join $dir machineparameters.tcl]]

if {![package vsatisfies [package provide Tcl] 8.5]} {return}
package ifneeded math::calculus          0.8.1 [list source [file join $dir calculus.tcl]]
# statistics depends on linearalgebra (for multi-variate linear regression).

package ifneeded math::statistics        1.2.0 [list source [file join $dir statistics.tcl]]
package ifneeded math::linearalgebra     1.1.6 [list source [file join $dir linalg.tcl]]
package ifneeded math::calculus::symdiff 1.0.1 [list source [file join $dir symdiff.tcl]]
package ifneeded math::bigfloat          2.0.2 [list source [file join $dir bigfloat2.tcl]]
package ifneeded math::numtheory         1.1.1 [list source [file join $dir numtheory.tcl]]
package ifneeded math::decimal           1.0.3 [list source [file join $dir decimal.tcl]]
package ifneeded math::geometry          1.3.0 [list source [file join $dir geometry.tcl]]
package ifneeded math::trig              1.0   [list source [file join $dir trig.tcl]]
if {![package vsatisfies [package require Tcl] 8.6]} {return}
package ifneeded math::exact             1.0   [list source [file join $dir exact.tcl]]
package ifneeded math::PCA               1.0   [list source [file join $dir pca.tcl]]







>
|










17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
package ifneeded math::bignum            3.1.1 [list source [file join $dir bignum.tcl]]
package ifneeded math::bigfloat          1.2.2 [list source [file join $dir bigfloat.tcl]]
package ifneeded math::machineparameters 0.1   [list source [file join $dir machineparameters.tcl]]

if {![package vsatisfies [package provide Tcl] 8.5]} {return}
package ifneeded math::calculus          0.8.1 [list source [file join $dir calculus.tcl]]
# statistics depends on linearalgebra (for multi-variate linear regression).
# statistics depends on optimize (for logistic regression).
package ifneeded math::statistics        1.3.0 [list source [file join $dir statistics.tcl]]
package ifneeded math::linearalgebra     1.1.6 [list source [file join $dir linalg.tcl]]
package ifneeded math::calculus::symdiff 1.0.1 [list source [file join $dir symdiff.tcl]]
package ifneeded math::bigfloat          2.0.2 [list source [file join $dir bigfloat2.tcl]]
package ifneeded math::numtheory         1.1.1 [list source [file join $dir numtheory.tcl]]
package ifneeded math::decimal           1.0.3 [list source [file join $dir decimal.tcl]]
package ifneeded math::geometry          1.3.0 [list source [file join $dir geometry.tcl]]
package ifneeded math::trig              1.0   [list source [file join $dir trig.tcl]]
if {![package vsatisfies [package require Tcl] 8.6]} {return}
package ifneeded math::exact             1.0   [list source [file join $dir exact.tcl]]
package ifneeded math::PCA               1.0   [list source [file join $dir pca.tcl]]

Added modules/math/stat_logit.tcl.

















































































































































































































































>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
# stat_logit.tcl --
#     Logistic regression functions - part of the statistics package
#
#     Note:
#     The implementation was derived from the Wikipedia page on logistic regression,
#     (https://en.wikipedia.org/wiki/Logistic_regression) as is the test case.
#
#     TODO:
#     - Deviance to evaluate the goodness of fit
#     - Evaluate the probability
#

package require math::optimize

namespace eval ::math::statistics {
     variable xLogit {}
     variable yLogit {}
}

# logistic-model --
#     Fit 1/0 data to a logistic model
#
# Arguments:
#     xdata        Independent variables (list of lists if there are more than one)
#     ydata        Corresponding scores (0 or 1)
#
# Result:
#     Estimate of the parameters for a logistic model
#
# Note:
#     It is expected that the independent variables have roughly the same scale
#
proc ::math::statistics::logistic-model {xdata ydata} {
    variable xLogit
    variable yLogit

    set xLogit {}
    foreach coords $xdata {
        lappend xLogit [concat 1.0 $coords]
    }
    set yLogit $ydata

    #
    # Use a trivial starting point
    #
    set startx [lrepeat [llength [lindex $xLogit 0]] 0.0]

    set result [::math::optimize::nelderMead LogisticML_NM $startx]

    return [dict get $result x]
}


# LogisticML_NM --
#     Calculate the (log) maximum likelihood for the given logistic model
#     using Nelder-Mead
#
# Arguments:
#     args        Vector of the current regression coefficients
#
# Returns:
#     Log maximum likelihood
#
proc ::math::statistics::LogisticML_NM {args} {
    variable xLogit
    variable yLogit

    set loglike 0.0
    foreach coords $xLogit score $yLogit {
        set sum 0.0

        foreach c $coords v $args {
            set sum [expr {$sum + $v * $c}]
        }
        set exp [expr {exp(-$sum)}]

        if { $score == 1 } {
            set loglike [expr {$loglike - log(1.0 + $exp)}]
        } else {
            set loglike [expr {$loglike - $sum - log(1.0 + $exp)}]
        }
    }
    return [expr {-$loglike}]
}

# logistic-probability --
#     Calculate the probability of a positive score (1) given the model
#
# Arguments:
#     coeffs      Coefficients of the logistic model (for instance outcome of model fit)
#     values      Values of the independent variables
#
# Returns:
#     Probability
#
proc ::math::statistics::logistic-probability {coeffs values} {
    set sum 0.0

    foreach c $coeffs v [concat 1.0 $values] {
        set sum [expr {$sum + $c * $v}]
    }

    return [expr {1.0 / (1.0 + exp(-$sum))}]
}

# test case: from Wikipedia
if {0} {
set xdata {0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50}
set ydata {0    0    0    0    0    0    1    0    1    0    1    0    1    0    1    1    1    1    1    1   }

set coeffs [::math::statistics::logistic-model $xdata $ydata]

puts "Model fit: $coeffs"

puts "Probabilities:"

foreach x {1 2 3 4 5} {
    puts "$x - [::math::statistics::logistic-probability $coeffs $x]"
}
}

Added modules/math/stat_wasserstein.tcl.













































































































































































































































































































































































































































>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
# stat-wasserstein.tcl --
#     Determine the Wasserstein distance between two probability distributions
#
#     Note:
#     This is an implementation for one-dimensional distributions (or better:
#     non-negative patterns)
#
#     Note 2:
#     The lower bound of 1.0e-10 is probably not at all necessary
#

# LastNonZero --
#     Auxiliary procedure to find the last non-zero entry
#
# Arguments:
#     prob           Probability distribution
#
# Result:
#     Index in the list of the last non-zero entry
#
# Note:
#     To avoid numerical problems any value smaller than 1.0e-10 is considered to
#     be zero
#
proc ::math::statistics::LastNonZero {prob} {
    set maxidx [expr {[llength $prob] - 1}]

    for {set idx $maxidx} {$idx >= 0} {incr idx -1} {
        if { [lindex $prob $idx] > 1.0e-10 } {
            return $idx
        }
    }

    return -1 ;# No non-zero entry
}

# Normalise --
#     Auxiliary procedure to normalise the probability distribution
#
# Arguments:
#     prob           Probability distribution
#
# Result:
#     Normalised distribution (i.e. the entries sum to 1)
#
# Note:
#     To avoid numerical problems any value smaller than 1.0e-10 is set to zero
#
proc ::math::statistics::Normalise {prob} {

    set newprob {}
    set sum     0.0

    foreach p $prob {
        set sum [expr {$sum + $p}]
    }

    if { $sum == 0.0 } {
        return -code error "Probability distribution should not consist of only zeroes"
    }

    foreach p $prob {
        lappend newprob [expr {$p > 1.0e-10? ($p/$sum) : 0.0}]
    }

    return $newprob
}

# wasserstein-distance --
#     Determine the Wasserstein distance using a "greedy" algorithm.
#
# Arguments:
#     prob1          First probability distribution, interpreted as a histogram
#                    with uniform bin width
#     prob2          Second probability distribution
#
# Result:
#     Distance between the two distributions
#
proc ::math::statistics::wasserstein-distance {prob1 prob2} {
    #
    # First step: make sure the histograms have the same length and the
    # same cumulative weight.
    #
    if { [llength $prob1] != [llength $prob2] } {
        return -code error "Lengths of the probability histograms must be the same"
    }

    set prob1 [Normalise $prob1]
    set prob2 [Normalise $prob2]

    set distance 0.0

    #
    # Determine the last non-zero bin - this bin will be shifted to the second
    # distribution
    #
    while {1} {
        set idx1 [LastNonZero $prob1]
        set idx2 [LastNonZero $prob2]

        if { $idx1 < 0 } {
            break ;# We are done
        }

        set bin1 [lindex $prob1 $idx1]
        set bin2 [lindex $prob2 $idx2]

        if { $bin1 <= $bin2 } {
            lset prob1 $idx1 0.0
            lset prob2 $idx2 [expr {$bin2 - $bin1}]
            set distance [expr {$distance + abs($idx2-$idx1) * $bin1}]
        } else {
            lset prob1 $idx1 [expr {$bin1 - $bin2}]
            lset prob2 $idx2 0.0
            set distance [expr {$distance + abs($idx2-$idx1) * $bin2}]
        }
    }

    return $distance
}

# kl-divergence --
#     Calculate the Kullback-Leibler (KL) divergence for two discrete distributions
#
# Arguments:
#     prob1          First probability distribution - the divergence is calculated
#                    with this one as the basis
#     prob2          Second probability distribution - the divergence of this
#                    distribution wrt the first is calculated
#
# Notes:
#     - The KL divergence is an asymmetric measure
#     - It is actually only defined if prob2 is only zero when prob1 is too
#     - The number of elements in the two distributions must be the same and
#       bins must be the same
#
proc ::math::statistics::kl-divergence {prob1 prob2} {
    if { [llength $prob1] != [llength $prob2] } {
        return -code error "Lengths of the two probability histograms must be the same"
    }

    #
    # Normalise the probability histograms
    #
    set prob1 [Normalise $prob1]
    set prob2 [Normalise $prob2]

    #
    # Check for well-definedness while going along
    #
    set sum 0.0
    foreach p1 $prob1 p2 $prob2 {
        if { $p2 == 0.0 && $p1 != 0.0 } {
            return -code error "Second probability histogram contains unmatched zeroes"
        }

        if { $p1 != 0.0 } {
            set sum [expr {$sum - $p1 * log($p2/$p1)}]
        }
    }

    return $sum
}

if {0} {
# tests --
#

# Almost trivial
set prob1 {0.0 0.0 0.0 1.0}
set prob2 {0.0 0.0 1.0 0.0}

puts "Expected distance: 1"
puts "Calculated: [wasserstein-distance $prob1 $prob2]"
puts "Symmetric:  [wasserstein-distance $prob2 $prob1]"

# Less trivial
set prob1 {0.0 0.75 0.25 0.0}
set prob2 {0.0 0.0  1.0  0.0}

puts "Expected distance: 0.75"
puts "Calculated: [wasserstein-distance $prob1 $prob2]"
puts "Symmetric:  [wasserstein-distance $prob2 $prob1]"

# Shift trivial
set prob1 {0.0 0.1 0.2 0.4 0.2 0.1 0.0 0.0}
set prob2 {0.0 0.0 0.0 0.1 0.2 0.4 0.2 0.1}

puts "Expected distance: 2"
puts "Calculated: [wasserstein-distance $prob1 $prob2]"
puts "Symmetric:  [wasserstein-distance $prob2 $prob1]"


# KL-divergence
set prob1 {0.0 0.1 0.2 0.4 0.2 0.1 0.0 0.0}
set prob2 {0.0 0.1 0.2 0.4 0.2 0.1 0.0 0.0}

puts "KL-divergence for equal distributions: 0"
puts "KL-divergence: [kl-divergence $prob1 $prob2]"

set prob1 {0.1e-8 0.1    0.2 0.4 0.2 0.1 0.0 0.0    0.0}
set prob2 {0.1e-8 0.1e-8 0.1 0.2 0.4 0.2 0.1 0.1e-8 0.1e-8}

puts "KL-divergence for shifted distributions: ??"
puts "KL-divergence: [kl-divergence $prob1 $prob2]"

# Hm, the normalisation proc causes a slight problem with elements of 1.0e-10
set prob1 {0.1e-8 0.1  0.2  0.4 0.2  0.1  0.0     0.0}
set prob2 {0.1e-8 0.11 0.19 0.4 0.24 0.06 0.1e-8  0.1e-8}

puts "KL-divergence slightly dififerent distributions: ??"
puts "KL-divergence: [kl-divergence $prob1 $prob2]"
}

Changes to modules/math/statistics.man.

2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
...
614
615
616
617
618
619
620
















































621
622
623
624
625
626
627
[manpage_begin math::statistics n [vset VERSION]]
[keywords {data analysis}]
[keywords mathematics]
[keywords statistics]
[moddesc {Tcl Math Library}]
[titledesc {Basic statistical functions and procedures}]
[category  Mathematics]
[require Tcl 8.4]
[require math::statistics [vset VERSION]]
[description]
[para]

The [package math::statistics] package contains functions and procedures for
basic statistical data analysis, such as:

................................................................................
is returned (as a list of "sampleSize" values), otherwise a list of samples is returned.

[list_begin arguments]
[arg_def list data]           List of values to chose from
[arg_def int sampleSize]      Number of values per sample
[arg_def int numberSamples]   Number of samples (default: 1)
[list_end]

















































[list_end]

[section "MULTIVARIATE LINEAR REGRESSION"]

Besides the linear regression with a single independent variable, the
statistics package provides two procedures for doing ordinary







|







 







>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>







2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
...
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
[manpage_begin math::statistics n [vset VERSION]]
[keywords {data analysis}]
[keywords mathematics]
[keywords statistics]
[moddesc {Tcl Math Library}]
[titledesc {Basic statistical functions and procedures}]
[category  Mathematics]
[require Tcl 8.5]
[require math::statistics [vset VERSION]]
[description]
[para]

The [package math::statistics] package contains functions and procedures for
basic statistical data analysis, such as:

................................................................................
is returned (as a list of "sampleSize" values), otherwise a list of samples is returned.

[list_begin arguments]
[arg_def list data]           List of values to chose from
[arg_def int sampleSize]      Number of values per sample
[arg_def int numberSamples]   Number of samples (default: 1)
[list_end]

[call [cmd ::math::statistics::wasserstein-distance] [arg prob1] [arg prob2]]
Compute the Wasserstein distance or earth mover's distance for two equidstantly spaced histograms
or probability densities. The histograms need not to be normalised to sum to one,
but they must have the same number of entries.
[nl]
Note: the histograms are assumed to be based on the same equidistant intervals.
As the bounds are not passed, the value is expressed in the length of the intervals.

[list_begin arguments]
[arg_def list prob1]          List of values for the first histogram/probability density
[arg_def list prob2]          List of values for the second histogram/probability density
[list_end]

[call [cmd ::math::statistics::kl-divergence] [arg prob1] [arg prob2]]
Compute the Kullback-Leibler (KL) divergence for two equidstantly spaced histograms
or probability densities. The histograms need not to be normalised to sum to one,
but they must have the same number of entries.
[nl]
Note: the histograms are assumed to be based on the same equidistant intervals.
As the bounds are not passed, the value is expressed in the length of the intervals.
[nl]
Note also that the KL divergence is not symmetric and that the second histogram
should not contain zeroes in places where the first histogram has non-zero values.

[list_begin arguments]
[arg_def list prob1]          List of values for the first histogram/probability density
[arg_def list prob2]          List of values for the second histogram/probability density
[list_end]

[call [cmd ::math::statistics::logistic-model] [arg xdata] [arg ydata]]
Estimate the coefficients of the logistic model that fits the data best. The data consist
of independent x-values and the outcome 0 or 1 for each of the x-values. The result
can be used to estimate the probability that a certain x-value gives 1.

[list_begin arguments]
[arg_def list xdata]          List of values for which the success (1) or failure (0) is known
[arg_def list ydata]          List of successes or failures corresponding to each value in [term xdata].
[list_end]

[call [cmd ::math::statistics::logistic-probability] [arg coeffs] [arg x]]
Calculate the probability of success for the value [term x] given the coefficients of the
logistic model.

[list_begin arguments]
[arg_def list coeffs]         List of coefficients as determine by the [cmd logistic-model] command
[arg_def float x]             X-value for which the probability needs to be determined
[list_end]

[list_end]

[section "MULTIVARIATE LINEAR REGRESSION"]

Besides the linear regression with a single independent variable, the
statistics package provides two procedures for doing ordinary

Changes to modules/math/statistics.tcl.

17
18
19
20
21
22
23

24
25
26
27
28
29
30
31
32
33
....
1790
1791
1792
1793
1794
1795
1796


1797
1798
1799
1800
1801
1802
1803
# version 0.7:   added Kruskal-Wallis test (by Torsten Berg)
# version 0.8:   added Wilcoxon test and Spearman rank correlation
# version 0.9:   added kernel density estimation
# version 0.9.3: added histogram-alt, corrected test-normal
# version 1.0:   added test-anova-F
# version 1.0.1: correction in pdf-lognormal and cdf-lognormal
# version 1.1:   added test-Tukey-range and test-Dunnett


package require Tcl 8.5 ; # 8.5+ feature in test-anovo-F: **-operator
package provide math::statistics 1.2.0
package require math

if {![llength [info commands ::lrepeat]]} {
    # Forward portability, emulate lrepeat
    proc ::lrepeat {n args} {
	if {$n < 1} {
	    return -code error "must have a count of at least 1"
................................................................................
source [file join [file dirname [info script]] pdf_stat.tcl]
source [file join [file dirname [info script]] plotstat.tcl]
source [file join [file dirname [info script]] liststat.tcl]
source [file join [file dirname [info script]] mvlinreg.tcl]
source [file join [file dirname [info script]] kruskal.tcl]
source [file join [file dirname [info script]] wilcoxon.tcl]
source [file join [file dirname [info script]] stat_kernel.tcl]



#
# Define the tables
#
namespace eval ::math::statistics {
    variable tukey_table_05
    variable tukey_table_01







>


|







 







>
>







17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
....
1791
1792
1793
1794
1795
1796
1797
1798
1799
1800
1801
1802
1803
1804
1805
1806
# version 0.7:   added Kruskal-Wallis test (by Torsten Berg)
# version 0.8:   added Wilcoxon test and Spearman rank correlation
# version 0.9:   added kernel density estimation
# version 0.9.3: added histogram-alt, corrected test-normal
# version 1.0:   added test-anova-F
# version 1.0.1: correction in pdf-lognormal and cdf-lognormal
# version 1.1:   added test-Tukey-range and test-Dunnett
# version 1.3:   added wasserstein-distance, kl-divergence and logit regression

package require Tcl 8.5 ; # 8.5+ feature in test-anovo-F: **-operator
package provide math::statistics 1.3.0
package require math

if {![llength [info commands ::lrepeat]]} {
    # Forward portability, emulate lrepeat
    proc ::lrepeat {n args} {
	if {$n < 1} {
	    return -code error "must have a count of at least 1"
................................................................................
source [file join [file dirname [info script]] pdf_stat.tcl]
source [file join [file dirname [info script]] plotstat.tcl]
source [file join [file dirname [info script]] liststat.tcl]
source [file join [file dirname [info script]] mvlinreg.tcl]
source [file join [file dirname [info script]] kruskal.tcl]
source [file join [file dirname [info script]] wilcoxon.tcl]
source [file join [file dirname [info script]] stat_kernel.tcl]
source [file join [file dirname [info script]] stat_wasserstein.tcl]
source [file join [file dirname [info script]] stat_logit.tcl]

#
# Define the tables
#
namespace eval ::math::statistics {
    variable tukey_table_05
    variable tukey_table_01

Changes to modules/math/statistics.test.

14
15
16
17
18
19
20

21
22
23
24
25
26
27
....
1194
1195
1196
1197
1198
1199
1200
1201



















































































































































1202
1203

testsNeedTcl     8.5;# statistics,linalg!
testsNeedTcltest 2.1

support {
    useLocal math.tcl math
    useLocal linalg.tcl math::linearalgebra

}
testing {
    useLocal statistics.tcl math::statistics
}

# -------------------------------------------------------------------------

................................................................................
            set result 0
        }
    }

    set result
} -result 1





















































































































































# End of test cases
testsuiteCleanup







>







 








>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>


14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
....
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351

testsNeedTcl     8.5;# statistics,linalg!
testsNeedTcltest 2.1

support {
    useLocal math.tcl math
    useLocal linalg.tcl math::linearalgebra
    useLocal optimize.tcl math::optimize
}
testing {
    useLocal statistics.tcl math::statistics
}

# -------------------------------------------------------------------------

................................................................................
            set result 0
        }
    }

    set result
} -result 1


#
# Tests for Wasserstein distance and KL divergence
#
# Tests for error conditions
#
test "Wasserstein-1.1" "Error: lengths differ" -body {
    set distance [::math::statistics::wasserstein-distance {0 1} {1 0 0 0}]
} -returnCodes 1 -result {Lengths of the probability histograms must be the same}

#
# Check the symmetry for arbitrary histograms
#
test "Wasserstein-1.2" "Test symmetry" -body {
    set okay 1
    for {set i 0} {$i < 10} {incr i} {
        set histogram1 {}
        set histogram2 {}
        for {set j 0} {$j < 3*($i+1)} {incr j} {
            lappend histogram1 [expr {rand()}]
            lappend histogram2 [expr {rand()}]
        }

        set distance1 [::math::statistics::wasserstein-distance $histogram1 $histogram2]
        set distance2 [::math::statistics::wasserstein-distance $histogram2 $histogram1]

        if { abs($distance1-$distance2) > 1.0e-6 } {
            set okay 0
        }
    }

    return $okay

} -result 1

#
# Check the non-negativity for arbitrary histograms
#
test "Wasserstein-1.3" "Test non-negativity" -body {
    set okay 1
    for {set i 0} {$i < 10} {incr i} {
        set histogram1 {}
        set histogram2 {}
        for {set j 0} {$j < 3*($i+1)} {incr j} {
            lappend histogram1 [expr {rand()}]
            lappend histogram2 [expr {rand()}]
        }

        set distance1 [::math::statistics::wasserstein-distance $histogram1 $histogram2]

        if { $distance1 < 0.0 } {
            set okay 0
        }
    }

    return $okay

} -result 1

#
# Check the non-normalised histograms
#
test "Wasserstein-1.4" "Test non-normalised histograms" -match tolerant -body {
    return [list [::math::statistics::wasserstein-distance {2 0 0} {0 0 0.5}]  \
                 [::math::statistics::wasserstein-distance {1 0 0} {0 0 0.25}] ]
} -result {2.0 2.0}

#
# Check arbitrarily extended histograms
#
test "Wasserstein-1.5" "Test extended histograms" -match tolerant -body {
    return [list [::math::statistics::wasserstein-distance {1 0 0}     {0 0 1}]     \
                 [::math::statistics::wasserstein-distance {0 0 1 0 0} {0 0 0 0 1}] \
                 [::math::statistics::wasserstein-distance {1 0 0 0 0} {0 0 1 0 0}] ]
} -result {2.0 2.0 2.0}

#
# Check numerical results
#
test "Wasserstein-1.6" "Test numerical results" -match tolerant -body {
    return [list [::math::statistics::wasserstein-distance {1 0 0}     {0 0.5 0.5}]             \
                 [::math::statistics::wasserstein-distance {1 0 0 0 0} {0 0 0 0.5 0.5}]         \
                 [::math::statistics::wasserstein-distance {1 0 0 0 0} {0 0.25 0.25 0.25 0.25}] ]
} -result {1.5 3.5 2.5}

#
# Tests for error conditions
#
test "KL-divergence-1.1" "Error: lengths differ" -body {
    set distance [::math::statistics::kl-divergence {0 1} {1 0 0 0}]
} -returnCodes 1 -result {Lengths of the two probability histograms must be the same}

test "KL-divergence-1.2" "Error: unmatched zeroes" -body {
    set distance [::math::statistics::kl-divergence {0.3 0.3 0.4} {1 0 0}]
} -returnCodes 1 -result {Second probability histogram contains unmatched zeroes}


test "KL-divergence-1.3" "Matched zeroes should be accepted" -match tolerant -body {
    set distance [::math::statistics::kl-divergence {0.3 0.3 0.4 0} {0.7 0.2 0.1 0}]
} -returnCodes 0 -result 0.42196792

#
# Tests for equal histograms (not all normalised)
#
test "KL-divergence-1.4" "Equal histograms give zero divergence" -body {
    set distance [::math::statistics::kl-divergence {0.3 0.3 0.4} {0.3 0.3 0.4}]
} -result 0.0

test "KL-divergence-1.5" "Non-normalised but equal histograms give zero divergence" -body {
    set distance [::math::statistics::kl-divergence {0.3 0.3 0.4} {0.6 0.6 0.8}]
} -result 0.0

#
# Numerical tests - note: the expected values were taken from the implementation
#                         No independent source found
#
test "KL-divergence-1.6" "Shifted histograms" -match tolerant -body {
    set distance [::math::statistics::kl-divergence {1.0e-8 0.3 0.3 0.3 0.1} {0.3 0.3 0.3 0.1 1.0e-8}]
} -result 1.9413931

test "KL-divergence-1.7" "Arbitrary histograms" -match tolerant -body {
    set distance [::math::statistics::kl-divergence {0.4 0.2 0.3 0.1} {0.1 0.1 0.7 0.1}]
} -result 0.4389578


#
# Tests for logistic regression
#
set xdata {0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50}
set ydata {0    0    0    0    0    0    1    0    1    0    1    0    1    0    1    1    1    1    1    1   }

test "Logit-regression-1.0" "Logistic regression - coefficients" -match tolerant -body {
    set coeffs [::math::statistics::logistic-model $xdata $ydata]
} -result {-4.07679035286303 1.5045982920571572}

test "Logit-regression-1.1" "Logistic regression - probabilities" -match tolerant -body {
    set coeffs [::math::statistics::logistic-model $xdata $ydata]

    set probabilities {}
    foreach x {1 2 3 4 5} {
        lappend probabilities [::math::statistics::logistic-probability $coeffs $x]
    }

    return $probabilities

} -result {0.07094967663389527 0.2558609520815849 0.6075450376084848 0.8745281239093334 0.9691176473773739}


# End of test cases
testsuiteCleanup