Tcl Source Code

View Ticket
Login
Ticket UUID: 713653b951c625ef974f5f06b2e96634ca5fb1f9
Title: Floating point precision problems on x86 musl
Type: Bug Version: 8.6.11
Submitter: rubicon Created on: 2022-07-13 14:20:05
Subsystem: 48. Number Handling Assigned To: kbk
Priority: 5 Medium Severity: Minor
Status: Closed Last Modified: 2022-07-17 13:28:18
Resolution: Fixed Closed By: jan.nijtmans
    Closed on: 2022-07-17 13:28:18
Description:
Hello, sorry if this isn't the correct place to report this.

While running Tcllib tests with Tcl 8.6.12 on 32-bit x86 Alpine Linux, which uses musl libc, I ran into a problem with some math tests failing.

I proceeded to try running the same tests on previous versions of Tcl 8.6, and found that the tests pass on 8.6.10 but fail on 8.6.11.

I think the problem started with https://core.tcl-lang.org/tcl/info/5cb42f70ad0e65ba , and I managed to suppress the problem by disabling the special i386 TCL_IEEE_DOUBLE_ROUNDING and TCL_DEFAULT_DOUBLE_ROUNDING macros.

However, not being familiar with low-level programming, I'm not sure if this is the correct way of fixing the problem, and indeed it causes Tcl's expr tests to fail instead: https://core.tcl-lang.org/tcl/artifact/5ad80ae6f0ae48d7 .

I have attached the output from Tcllib's math tests, and Tcl's expr-old and format tests.

I would appreciate it if someone could look into this issue.

Thank you.
User Comments: jan.nijtmans added on 2022-07-17 13:28:18:

Fix merged now. Closing.

There is no Tcl release planned now, it could happen somewhere between September and December.


rubicon added on 2022-07-16 02:24:13:
Thanks for the fix, the affected tests are all passing now.

Just wondering, will a release of Tcl 8.6 containing this fix be made available soon? Otherwise, I'll be submitting this patch for tclStrToD.c to Alpine.

kbk added on 2022-07-15 22:08:43:

@jan.nijtmans - OK, just waiting for the test result, since @rubicon is responsive. Will merge forward once it tests clean on the Alpine machine with the problem.


jan.nijtmans added on 2022-07-15 20:58:38:

Thanks, @kbk, for the fix. Given rubicon's explanation, I think the proposed solution is the way to go. (I already revised a similar fix, but yours is more complete....)


kbk added on 2022-07-15 17:16:34:

Thanks for tracking this down! Escaping the code with the floating point control word not restored is indeed a bad problem.

Could you test the change on the bug-713653b951 branch before I merge forward?


rubicon added on 2022-07-15 11:54:27:

After a bit of trial and error, I zeroed in on what happens before a return from MakeLowPrecisionDouble, and found that the TCL_DEFAULT_DOUBLE_ROUNDING macro runs before return retval; but not before return copysign(0.0, -signum);.

By adding that to make the code that I was commenting out look like this (of course, no longer commented out now):

    if (significand == 0) {
	TCL_DEFAULT_DOUBLE_ROUNDING;
	return copysign(0.0, -signum);
    }

The problem with Tcl's expr-old and format tests is solved.

I will try it with Tcllib's math and units tests shortly, if I don't say anything else it means they pass too.

I guess this probably means glibc resets the rounding mode automatically, but musl doesn't.

I see something similar happens with MakeHighPrecisionDouble, but I'll leave it to those more knowledgeable than me to determine if the TCL_DEFAULT_DOUBLE_ROUNDING needs to be added there as well.


rubicon added on 2022-07-15 09:32:51:

I found a musl toolchain at https://musl.cc , and by using i686-linux-musl-native.tgz on Debian buster with the following configure options for Tcl (./configure CC=/tmp/i686-linux-musl-native/bin/gcc --host=i686-linux-musl --disable-shared), I managed to reproduce the failures in Tcl's expr-old and format tests.

After commenting out

    if (significand == 0) {
	return copysign(0.0, -signum);
    }
in MakeLowPrecisionDouble in tclStrToD.c, the failures disappear.


rubicon added on 2022-07-15 04:58:10:
Here is some more information.

I grepped the build log for 'tclStrToD' to see the gcc command used to build and link it (sorry if this is just noise and not helpful):

gcc -c -I"." -I/builds/rubicon/aports/main/tcl/src/tcl8.6.12/unix -I/builds/rubicon/aports/main/tcl/src/tcl8.6.12/generic -I/builds/rubicon/aports/main/tcl/src/tcl8.6.12/libtommath -O2 -Os -fomit-frame-pointer -pipe -Os -fomit-frame-pointer -Wall -Wpointer-arith -fPIC -DBUILD_tcl -DPACKAGE_NAME=\"tcl\" -DPACKAGE_TARNAME=\"tcl\" -DPACKAGE_VERSION=\"8.6\" -DPACKAGE_STRING=\"tcl\ 8.6\" -DPACKAGE_BUGREPORT=\"\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_SYS_PARAM_H=1 -DUSE_THREAD_ALLOC=1 -D_REENTRANT=1 -D_THREAD_SAFE=1 -DHAVE_PTHREAD_ATTR_SETSTACKSIZE=1 -DHAVE_PTHREAD_ATFORK=1 -DTCL_THREADS=1 -DTCL_CFGVAL_ENCODING=\"iso8859-1\" -DHAVE_ZLIB=1 -DMODULE_SCOPE=extern\ __attribute__\(\(__visibility__\(\"hidden\"\)\)\) -DHAVE_HIDDEN=1 -DHAVE_CAST_TO_UNION=1 -DTCL_SHLIB_EXT=\".so\" -DNDEBUG=1 -DTCL_CFG_OPTIMIZED=1 -DTCL_TOMMATH=1 -DMP_PREC=4 -D_LARGEFILE64_SOURCE=1 -DTCL_WIDE_INT_TYPE=long\ long -DHAVE_STRUCT_DIRENT64=1 -DHAVE_STRUCT_STAT64=1 -DHAVE_OPEN64=1 -DHAVE_LSEEK64=1 -DHAVE_TYPE_OFF64_T=1 -DHAVE_GETCWD=1 -DHAVE_MKSTEMP=1 -DHAVE_OPENDIR=1 -DHAVE_STRTOL=1 -DHAVE_WAITPID=1 -DNO_GETWD=1 -DHAVE_GETNAMEINFO=1 -DHAVE_GETADDRINFO=1 -DHAVE_FREEADDRINFO=1 -DHAVE_GAI_STRERROR=1 -DHAVE_STRUCT_ADDRINFO=1 -DHAVE_STRUCT_IN6_ADDR=1 -DHAVE_STRUCT_SOCKADDR_IN6=1 -DHAVE_STRUCT_SOCKADDR_STORAGE=1 -DHAVE_GETPWUID_R_5=1 -DHAVE_GETPWUID_R=1 -DHAVE_GETPWNAM_R_5=1 -DHAVE_GETPWNAM_R=1 -DHAVE_GETGRGID_R_5=1 -DHAVE_GETGRGID_R=1 -DHAVE_GETGRNAM_R_5=1 -DHAVE_GETGRNAM_R=1 -DHAVE_DECL_GETHOSTBYNAME_R=1 -DHAVE_GETHOSTBYNAME_R_6=1 -DHAVE_GETHOSTBYNAME_R=1 -DHAVE_DECL_GETHOSTBYADDR_R=1 -DHAVE_GETHOSTBYADDR_R_8=1 -DHAVE_GETHOSTBYADDR_R=1 -DHAVE_TERMIOS_H=1 -DHAVE_SYS_IOCTL_H=1 -DHAVE_SYS_TIME_H=1 -DTIME_WITH_SYS_TIME=1 -DHAVE_GMTIME_R=1 -DHAVE_LOCALTIME_R=1 -DHAVE_MKTIME=1 -DHAVE_TM_GMTOFF=1 -DHAVE_TIMEZONE_VAR=1 -DHAVE_STRUCT_STAT_ST_BLOCKS=1 -DHAVE_STRUCT_STAT_ST_BLKSIZE=1 -DHAVE_BLKCNT_T=1 -DHAVE_INTPTR_T=1 -DHAVE_UINTPTR_T=1 -DNO_UNION_WAIT=1 -DHAVE_SIGNED_CHAR=1 -DHAVE_LANGINFO=1 -DHAVE_MKSTEMPS=1 -DHAVE_SYS_IOCTL_H=1 -DTCL_UNLOAD_DLLS=1 -DHAVE_CPUID=1      -DMP_FIXED_CUTOFFS -DMP_NO_STDINT /builds/rubicon/aports/main/tcl/src/tcl8.6.12/generic/tclStrToD.c

gcc -O2 -Os -fomit-frame-pointer -pipe -Os -fomit-frame-pointer  -Wl,--as-needed,-O1,--sort-common -Wl,--export-dynamic  -shared -o libtcl8.6.so regcomp.o regexec.o regfree.o regerror.o tclAlloc.o tclAssembly.o tclAsync.o tclBasic.o tclBinary.o tclCkalloc.o tclClock.o tclCmdAH.o tclCmdIL.o tclCmdMZ.o tclCompCmds.o tclCompCmdsGR.o tclCompCmdsSZ.o tclCompExpr.o tclCompile.o tclConfig.o tclDate.o tclDictObj.o tclDisassemble.o tclEncoding.o tclEnsemble.o tclEnv.o tclEvent.o tclExecute.o tclFCmd.o tclFileName.o tclGet.o tclHash.o tclHistory.o tclIndexObj.o tclInterp.o tclIO.o tclIOCmd.o tclIORChan.o tclIORTrans.o tclIOGT.o tclIOSock.o tclIOUtil.o tclLink.o tclListObj.o tclLiteral.o tclLoad.o tclMain.o tclNamesp.o tclNotify.o tclObj.o tclOptimize.o tclPanic.o tclParse.o tclPathObj.o tclPipe.o tclPkg.o tclPkgConfig.o tclPosixStr.o tclPreserve.o tclProc.o tclRegexp.o tclResolve.o tclResult.o tclScan.o tclStringObj.o tclStrToD.o tclThread.o tclThreadAlloc.o tclThreadJoin.o tclThreadStorage.o tclStubInit.o tclTimer.o tclTrace.o tclUtf.o tclUtil.o tclVar.o tclZlib.o tclTomMathInterface.o tclUnixChan.o tclUnixEvent.o tclUnixFCmd.o tclUnixFile.o tclUnixPipe.o tclUnixSock.o tclUnixTime.o tclUnixInit.o tclUnixThrd.o tclUnixCompat.o tclUnixNotfy.o  tclOO.o tclOOBasic.o tclOOCall.o tclOODefineCmds.o tclOOInfo.o tclOOMethod.o tclOOStubInit.o tclLoadDl.o  bn_s_mp_reverse.o bn_s_mp_mul_digs_fast.o bn_s_mp_sqr_fast.o bn_mp_add.o bn_mp_and.o bn_mp_add_d.o bn_mp_clamp.o bn_mp_clear.o bn_mp_clear_multi.o bn_mp_cmp.o bn_mp_cmp_d.o bn_mp_cmp_mag.o bn_mp_cnt_lsb.o bn_mp_copy.o bn_mp_count_bits.o bn_mp_div.o bn_mp_div_d.o bn_mp_div_2.o bn_mp_div_2d.o bn_mp_div_3.o bn_mp_exch.o bn_mp_expt_u32.o bn_mp_grow.o bn_mp_init.o bn_mp_init_copy.o bn_mp_init_multi.o bn_mp_init_set.o bn_mp_init_size.o bn_s_mp_karatsuba_mul.o bn_s_mp_karatsuba_sqr.o bn_s_mp_balance_mul.o bn_mp_lshd.o bn_mp_mod.o bn_mp_mod_2d.o bn_mp_mul.o bn_mp_mul_2.o bn_mp_mul_2d.o bn_mp_mul_d.o bn_mp_neg.o bn_mp_or.o bn_mp_radix_size.o bn_mp_radix_smap.o bn_mp_read_radix.o bn_mp_rshd.o bn_mp_set.o bn_mp_shrink.o bn_mp_sqr.o bn_mp_sqrt.o bn_mp_sub.o bn_mp_sub_d.o bn_mp_signed_rsh.o bn_mp_to_ubin.o bn_s_mp_toom_mul.o bn_s_mp_toom_sqr.o bn_mp_to_radix.o bn_mp_ubin_size.o bn_mp_xor.o bn_mp_zero.o bn_s_mp_add.o bn_s_mp_mul_digs.o bn_s_mp_sqr.o bn_s_mp_sub.o    -Wl,--as-needed,-O1,--sort-common -Wl,--export-dynamic  -ldl -lz  -lpthread    "-Wl,-rpath,/usr/lib"

I worked on this earlier this month, and didn't use fossil bisect, but by replacing 8.6.10 with 8.6.11, file-by-file (sort of), I isolated the changes that cause the tests to fail to tclStrToD.c, and by patching the file a little bit at a time, I stumbled upon those lines in MakeLowPrecisionDouble that I mentioned in my previous remark.

Lastly, the math tests are only failing on x86, Alpine is built for a few other architectures (aarch64, armhf, armv7, ppc64le, s390x, and of course x86_64), and they don't fail on those.

rubicon added on 2022-07-15 04:22:53:
I'll try to answer the questions as best as I can, but as I am a little under the weather today, I may be a bit slow.

> what compiler is this?

GCC 11.2.1

> Are you changing optimization flags

Alpine Linux's build system sets the following environment variables ( from https://git.alpinelinux.org/abuild/tree/abuild.conf ):

export CFLAGS="-Os -fomit-frame-pointer"
export CXXFLAGS="$CFLAGS"
export CPPFLAGS="$CFLAGS"
export LDFLAGS="-Wl,--as-needed,-O1,--sort-common"

> You did say that everything in Tcl's own tests passed before you started adjusting the rounding modes, right?

Sorry for not clarifying this better.

The expr tests (that I linked to, but which is no longer attached to this bug report) fail when I adjust the rounding modes.

The expr-old and format tests which are still attached fail together with the Tcllib math tests.

> I've not found that to be necessary at the -O2 setting that Tcl uses by default

Alpine uses -Os, is this likely the source of the problem?

> the next thing that I'd suspect is that something is messed up in the C library

Through trial and error (as I am not a C programmer), after commenting the following lines in MakeLowPrecisionDouble:

    if (significand == 0) {
	return copysign(0.0, -signum);
    }

Both Tcl's expr-old and format tests and Tcllib's math and units tests all pass.

Alpine is now using musl v1.2.3, which has this implementation of copysign: https://git.musl-libc.org/cgit/musl/tree/src/math/copysign.c?h=v1.2.3

> Are we sure that 8.6.10 and 8.6.11 were built with the same compiler, include files, system libraries and optimization flags?

I tested 8.6.10 and 8.6.11 in a Gitlab CI within about an hour of each other, I don't think anything changed within that short period of time.

kbk added on 2022-07-14 18:49:04:

Unfortunately, strange platform issues like this can be hard to debug "at arm's length". I don't have an Alpine machine to test on (and I think that all the Alpine machines I've ever encountered have been ARM rather than ix86).

Disabling the rounding mode isn't the correct solution (obviously, since it starts breaking other things!) - so we need to get to the bottom of this.

It would appear that it's something gone amiss in the environment. I've tested 8.6.11 with the standard Tcl test suite on both gcc and clang on Xubuntu 20.04 LTS, and the tests are passing. Likewise, all the test in the math module of tcllib are passing as well, with both gcc and clang builds.

For starters, what compiler is this? (I don't know if Alpine uses gcc, clang, tcc or something all its own...) Are you changing optimization flags from the set that tcl/unix/configure provides?

The testing of floating point conversion is pretty comprehensive. I don't think anything that would introduce breakage of this magnitude would slip through Tcl's own test suite. (You did say that everything in Tcl's own tests passed before you started adjusting the rounding modes, right?)

The broken Tcllib math tests suggest that something very weird is going on. The 'Trig-2.3' test is probably the easiest to attack - it's not doing anything terribly out of the ordinary and its execution path is only a few lines of Tcl, but the outputs grow increasingly bizarre. (The nelderMead tests also indicate something very fundamental, because they just do basic arithmetic.)

One very odd possibility is that the floating point mode setting is being reordered by the optimizer in the compiler you're using. I've heard rumoured that in some compilers it's necessary to replace the line

double retval;		/* Value of the number. */
in two places (the MakeLowPrecisionDouble and MakeHighPrecisionDouble functions in tclStrToD.c) with
volatile double retval; /* Value of the number */
to prevent that from happening. I've not found that to be necessary at the -O2 setting that Tcl uses by default, whether using gcc, clang, or Visual C++.

Beyond that, the next thing that I'd suspect is that something is messed up in the C library or the compiler intrinsics. As you've probably noticed, tclStrToD.c uses some odd functions - things like ldexp(), frexp() and copysign().

Of course, the change to tclStrToD.c might be entirely a red herring. Are we sure that 8.6.10 and 8.6.11 were built with the same compiler, include files, system libraries and optimization flags? (Another reason that I have for suspecting that the issue might be somewhere else is that the format-* tests that are failing don't go through tclStrToD.c. They eventually land in the sprintf() function of the C library.)

If you can verify that 8.6.10 works and 8.6.11 fails when built in the same toolchain, the next step might be to make a shell script that exits with status 0 for success on a test case, and status 1 for failure, and feed that to fossil bisect. That's pretty heavyweight but might narrow it from the point release down to the individual commit.


jan.nijtmans added on 2022-07-14 16:07:36:

Kevin, something for you?


Attachments: