# DEMOFL A demonstration of fl-numbers: k-bit arithmetic

## Contents

- Initialization
- Comparison to IEEE754 single precision
- Rounding into k bits
- The relative rounding error unit
- Basic operations
- Output format
- Checking theorems in k-bit arithmetic I
- Checking theorems in k-bit arithmetic II
- Predecessor and successor
- Vector and matrix operations
- Sparse matrices
- Matrix routines
- Doubled precision accumulation of dot products
- Mixed precisions
- Operator precedence
- Interval operations
- Interval vectors and matrices
- Specification of intervals
- Directed rounding
- Mixing precision and double rounding
- Enjoy INTLAB

## Initialization

Recently we published some papers showing that in the famous Wilkinson estimates factors gamma_n can often be replaced by n*eps. For testing purposes, the fl-package simulates IEEE 754 k-bit arithmetic including over- and underflow, exceptional values, directed rounding and alike.

To work, for example, with k bits precision and exponent range -E+1 ... E use

k = 5; E = 20; flinit(k,E);

The precision and the exponent range are limited to

1 <= prec <= 26 and 1 <= expBias <= 484

in order to achieve high performance of all operations. To retrieve the current setting of precision and exponent range use

flinit

fl-format set to 5 mantissa bits incl. impl. 1 and exponent range -19 .. 20 for normalized fl-numbers 101

## Comparison to IEEE754 single precision

Using flinit(24,127), the set of fl-numbers is identical to IEEE754 single precision. However, library routines in Matlab may use a different order to compute, for example, dot products, so that results in single and fl-arithmetic may be different. Consider

flinit(24,127); % same as IEEE754 single precision x = single(randn(1000,1)); % random single precision vector all(x == fl(x)) % verifies that x is exactly representable in k bit s = x'*x % dot product in single precision t = fl(x)'*fl(x) % dot product in k-bit precision s-t % difference of results

ans = logical 1 s = single 891.1407 fl-type t = +1.10111101100100011111111 * 2^9 fl-type ans = +1.10000000000000000000000 * 2^-13

Obviously the results computed in single precision and the fl-package do not coincide. This is due to a Matlab library routine to compute the dot product x'*x. Computed in a loop, the result by single precision and the fl-package are identical. To verify see

s_loop = single(0); for i=1:1000 % dot product x'*x by conventional loop s_loop = s_loop + x(i)*x(i); end s_loop - s % difference to Matlab library routine s_loop - t % difference to fl-package

ans = single -1.8311e-04 fl-type ans = +0.00000000000000000000000

## Rounding into k bits

The function 'fl' rounds a given double precision number into the current format and/or into a specified precision; the function 'double' is a typecast into double precision. Since fl-numbers are always a subset of IEEE754 double precision numbers, a typecast is always error-free.

flinit(5,20); x = fl(32) y = fl(33) double([x y]) [z,exact] = fl([32 33])

fl-type x = +1.0000 * 2^5 fl-type y = +1.0000 * 2^5 ans = 32 32 fl-type z = +1.0000 * 2^5 +1.0000 * 2^5 exact = 1×2 logical array 1 0

Here x=32 is representable in a single bit, so no rounding error occurs, but 33 needs 6 bits to be stored. Rounding tie to even produces the result y=32 in 5 bits. The second parameter 'exact' is (componentwise) 1 if no rounding error occurred. It offers a simple possibility to check whether a double number fits into k bits.

## The relative rounding error unit

Mathematically, the relative rounding error unit for k-bit arithmetic is 2^(-k), that is the maximal relative error when rounding a real number into the fl-format. However, it is known that just above a power of 2 the relative error is larger, and just below it is smaller.

x = linspace(0,10,100000); X = fl(x); close plot(x,relerr(x,X))

## Basic operations

The four basic operations are executed as usual, for example

x = fl(11) y = fl(3) z = x*y t = (-sqrt(y) + 1/y)*(1:3)

fl-type x = +1.0110 * 2^3 fl-type y = +1.1000 * 2^1 fl-type z = +1.0000 * 2^5 fl-type t = -1.0111 * 2^0 -1.0111 * 2^1 -1.0001 * 2^2

## Output format

Often it is useful to see the bit pattern, the default for fl-numbers. To switch to ordinary display as a double precision number use

```
format fldouble
z
```

===> Display fl-variables as doubles fl-type z = 32

One may use flinit('DoubleDisplay) as well.

The bit representation of fl-numbers and double numbers can be displayed directly by

d = 11 x = fl(d) getbits(d) getbits(x)

d = 11 fl-type x = 11 ans = ' +1.0110000000000000000000000000000000000000000000000000 * 2^3' ans = ' +1.0110 * 2^3'

The display may be restricted to m bits as well:

m = 10 getbits(d,m)

m = 10 ans = ' +1.011000000 * 2^3'

## Checking theorems in k-bit arithmetic I

As a basis for this toolbox we showed in

(*) S.M. Rump: IEEE754 -bit arithmetic inherited by double precision, ACM TOMS, 43(3), 2017

that for fl-numbers A,B in k-bit precision and m>=2k there is no double rounding when first computing the quotient in m-bits and then round the result into k-bits. That means

fl_k(A/B) = fl_k(fl_m(A/B))

We test that by checking all pairs of fl-numbers in [1,2), see "help flsequence".

k = 12; flinit(k,100); m = 2*k; tic A = flsequence(1,pred(2)); % all fl-numbers in [1,2) for i=1:length(A) B = A(i); index = find( flround(A/B,k) ~= flround( flround(A/B,m) , k ) ); if any(index) Aindex = A(index) B end end t = toc

t = 1.2168

Note that this is not a rigorous proof because there may be a double rounding in the computation of flround(A/B,m) and flround(A/B,k) because A/B is computed in double precision. There is a chicken-egg problem: If the statement is true, then it is proved since 53>=m>=k. Note that not writing the above code in vectorized form increases the computing time to more than one hour.

## Checking theorems in k-bit arithmetic II

In p-precision floating-point arithmetic the true result of an operation op is rounded to the nearest p-precision number. Usually the relative error is bounded by u=2^(-p), the relative rounding error unit. It has been noted in that this can be improved into

|fl(a op b) - ( a op b)| <= u/(1+u)|a op b| .

It is easy to see that the improved bound is attained if and only if, up to scaling, the true result a op b is equal to 1+u. For addition this is easy to achieve, but what about multiplication?

As shown in

C.-P. Jeannerod and S.M. Rump: On relative errors of floating-point operations: Optimal bounds and applications. Math. Comp., 87:803-819, 2018.

that the estimate is sharp if and only if 2^p+1 is prime, i.e. it is a Fermat prime. For that it is necessary that p=2^k is a power of 2, and up to now only five Fermat primes are known for k in {0,1,2,3,4} corresponding to 2^p+1 in {3,5,17,257,65537}. That means the only exceptional precisions p we know of for which the estimate is not sharp are p in {1,2,4,8,16}.

We might want to check the theorem by the following code

flinit('DisplayBits') setround(0) tic for k=0:5 if k==5 p = 10; else p = 2^k; end flinit(p,100); x = flsequence(1,2)'; y = fl((1+2^(-p))./double(x)); index = find( double(x).*double(y) == 1+2^(-p) ); if any(index) disp(['precision ' int2str(p) ' bits: examples for sharp estimate:']) examples = fl([x(index) y(index)]) else disp(['precision ' int2str(p) ' bits: estimate is never sharp']) end end toc

===> Display fl-variables by bit representation precision 1 bits: estimate is never sharp precision 2 bits: estimate is never sharp precision 4 bits: estimate is never sharp precision 8 bits: estimate is never sharp precision 16 bits: estimate is never sharp precision 10 bits: examples for sharp estimate: fl-type examples = +1.010000000 * 2^0 +1.100110100 * 2^-1 +1.010010000 * 2^0 +1.100100000 * 2^-1 +1.100100000 * 2^0 +1.010010000 * 2^-1 +1.100110100 * 2^0 +1.010000000 * 2^-1 Elapsed time is 0.021602 seconds.

The last case p=10 is only for check.

## Predecessor and successor

A sequence of consecutive k-bit fl-numbers may computed by

```
flinit(12,100);
flinit('DisplayDouble');
x = fl(32)
[ pred(x) x succ(x) succ(x,2) ]
```

===> Display fl-variables as doubles fl-type x = 32 fl-type ans = 31.9922 32.0000 32.0156 32.0313

## Vector and matrix operations

Vector and matrix operations are supported as well, for example

A = fl(reshape(1:16,4,4)) A + 25

fl-type A = 1 5 9 13 2 6 10 14 3 7 11 15 4 8 12 16 fl-type ans = 26 30 34 38 27 31 35 39 28 32 36 40 29 33 37 41

## Sparse matrices

Sparse vectors and matrices of fl-numbers as well as fl-intervals are specified as usual.

n = 4; A = fl(band(circulant(1:n),1,2)) S = sparse(A) [I,J] = find(S); IJ = [ I' ; J' ] is_sp = issparse(S) nnzS = nnz(S) [p,q] = bandwidth(S)

fl-type A = 1 2 3 0 4 1 2 3 0 4 1 2 0 0 4 1 fl-type S = (1,1) 1 (2,1) 4 (1,2) 2 (2,2) 1 (3,2) 4 (1,3) 3 (2,3) 2 (3,3) 1 (4,3) 4 (2,4) 3 (3,4) 2 (4,4) 1 IJ = 1 2 1 2 3 1 2 3 4 2 3 4 1 1 2 2 2 3 3 3 3 4 4 4 is_sp = logical 1 nnzS = 12 p = 1 q = 2

## Matrix routines

Many of the usual operations for full and sparse matrices are available in k-bit arithmetic.

t = trace(S) D = diag(S,-1)

fl-type t = (1,1) 4 fl-type D = (1,1) 4 (2,1) 4 (3,1) 4

The display routines for double intervals apply to fl-interval quantities as well.

T = tril(intval(A)) intvalinit('DisplayInfSup') T intvalinit('DisplayMidRad') T

fl-type intval T = 1.0000 0.0000 0.0000 0.0000 4.0000 1.0000 0.0000 0.0000 0.0000 4.0000 1.0000 0.0000 0.0000 0.0000 4.0000 1.0000 ===> Default display of intervals by infimum/supremum (e.g. [ 3.14 , 3.15 ]) fl-type intval T = [ 1.0000, 1.0000] [ 0.0000, 0.0000] [ 0.0000, 0.0000] [ 0.0000, 0.0000] [ 4.0000, 4.0000] [ 1.0000, 1.0000] [ 0.0000, 0.0000] [ 0.0000, 0.0000] [ 0.0000, 0.0000] [ 4.0000, 4.0000] [ 1.0000, 1.0000] [ 0.0000, 0.0000] [ 0.0000, 0.0000] [ 0.0000, 0.0000] [ 4.0000, 4.0000] [ 1.0000, 1.0000] ===> Default display of intervals by midpoint/radius (e.g. < 3.14 , 0.01 >) fl-type intval T = < 1.0000, 0.0000> < 0.0000, 0.0000> < 0.0000, 0.0000> < 0.0000, 0.0000> < 4.0000, 0.0000> < 1.0000, 0.0000> < 0.0000, 0.0000> < 0.0000, 0.0000> < 0.0000, 0.0000> < 4.0000, 0.0000> < 1.0000, 0.0000> < 0.0000, 0.0000> < 0.0000, 0.0000> < 0.0000, 0.0000> < 4.0000, 0.0000> < 1.0000, 0.0000>

## Doubled precision accumulation of dot products

The difference between double precision computations and k-bit precision can be explored. The default for matrix products is accumulation of dot products in working precision to simulate k-bit precision for all operations. For summation and dot products, doubled precision accumulation is possible.

The following generates some Toeplitz matrix and computes a residual in double and in 24-bit precision.

```
flinit(12,100)
flinit('AccumulateDouble')
n = 4;
A = toeplitz(1:n)
Ainv = inv(A)
resdble = eye(n) - A*Ainv
resfl_24 = eye(n) - fl(A)*fl(Ainv)
```

ans = 'Initialization of fl-format to 12 mantissa bits incl. impl. 1 and exponent range -99 .. 100 for normalized fl-numbers' ===> fl-accumulation precision double the working precision A = 1 2 3 4 2 1 2 3 3 2 1 2 4 3 2 1 Ainv = -0.4000 0.5000 0.0000 0.1000 0.5000 -1.0000 0.5000 0.0000 0.0000 0.5000 -1.0000 0.5000 0.1000 0.0000 0.5000 -0.4000 resdble = 1.0e-15 * 0 -0.1480 -0.2220 -0.2220 -0.1665 0.2220 -0.2220 0 -0.0555 0.0370 0.1110 -0.1110 -0.1665 0.0740 -0.1665 0 fl-type resfl_24 = 1.0e-04 * 0 -0.0000 0 0.9155 0.3052 0 0 0.6104 0.6104 -0.0000 0 0.3052 0.9155 -0.0000 0 0

Note that using fl(eye(n)) in the last statement would not change the result because 1 and 0 is exactly representable in any k-bit precision.

The same computation with k-bit precision accumulation of dot products is as follows:

```
flinit('AccumulateSingle')
resfl_12 = eye(n) - fl(A)*fl(Ainv)
```

===> fl-accumulation in working precision fl-type resfl_12 = 1.0e-03 * 0 -0.0000 0 0 0 0 0 0 0.1831 -0.0000 0 0 0.0916 -0.0000 0 0

As may be expected, some components are better, some are worse.

## Mixed precisions

Mixed operations between fl-numbers and single or double numbers are first executed and then rounded. This is unambiguous if the double number is exactly representable in the specified k-bit precision. However, consider

flinit(5,100); x = fl(33)/fl(11) y = 33/fl(11)

fl-type x = 2.8750 fl-type y = 3

Here fl(11)=11, so y is exactly equal to 3=fl(3), but fl(33)=32 so that x is not equal to y. The same happens for comparison:

fl(32) < fl(33) fl(32) == fl(33) fl(32) == 33

ans = logical 0 ans = logical 1 ans = logical 0

For safety reasons, a warning is given when a newly specified format does not cover all quantities of the old fl-format. Special caution is necessary when mixing precisions.

## Operator precedence

As usual, Matlab follows strictly the operator concept, and in particular the precedence of operators. Therefore special care is necessary when using mixed precisions. Consider

x = 3 * 11 + fl(1) y = fl(3) * 11 + 1

fl-type x = 34 fl-type y = 32

Here first 3*11 is executed in double precision in the computation of x, whereas the computation of y consists only of fl-number operations.

## Interval operations

Intervals with fl-number as endpoints are supported as well. For example,

```
format infsup
x = infsup(fl(1),2)
3*x - 1
x/3
```

fl-type intval x = [ 1.0000, 2.0000] fl-type intval ans = [ 2.0000, 5.0000] fl-type intval ans = [ 0.3281, 0.6876]

The syntax and semantic is as for intervals with double precision endpoints, rounding is always outwards.

z = fl(infsup(32,33))

fl-type intval z = [ 32.0000, 34.0000]

## Interval vectors and matrices

Again interval vectors and matrices of fl-numbers as well as the operation between those are as usual.

```
n = 4;
intvalinit('Display_')
flinit(12,99)
A = midrad(fl(randn(n)),1e-3)
P = A*intval(A')
C = compmat(A)
```

===> Default display of intervals with uncertainty (e.g. 3.14_), changed to inf/sup or mid/rad if input too wide ans = 'Initialization of fl-format to 12 mantissa bits incl. impl. 1 and exponent range -98 .. 99 for normalized fl-numbers' fl-type intval A = 0.72__ 0.18__ -1.23__ -0.53__ 1.60__ 0.53__ -0.19__ -0.90__ -2.06__ -0.55__ -0.30__ -0.89__ -0.74__ 0.30__ 0.96__ 0.28__ fl-type intval P = 2.3___ 1.95__ -0.73__ -1.80__ 1.95__ 3.7___ -2.7___ -1.5___ -0.73__ -2.7___ 5.5___ 0.8___ -1.80__ -1.5___ 0.8___ 1.6___ C = 0.7148 -0.1772 -1.2280 -0.5347 -1.6001 0.5266 -0.1907 -0.9023 -2.0664 -0.5544 0.3007 -0.8938 -0.7449 -0.2994 -0.9583 0.2776

Note that by definition Ostrowski's comparison matrix is a point matrix.

## Specification of intervals

There are many ways to produce the narrowest interval with k-bit endpoints including 0.1:

flinit('DisplayBits') x1 = fl(intval(1)/10); x2 = fl(intval('0.1')); x3 = midrad(0.1,realmin); [ x1 x2 x3 ]'

===> Display fl-variables by bit representation fl-type intval ans = [ +1.10011001100 * 2^-4 , +1.10011001101 * 2^-4 ] [ +1.10011001100 * 2^-4 , +1.10011001101 * 2^-4 ] [ +1.10011001100 * 2^-4 , +1.10011001101 * 2^-4 ]

Note that x3 is correct because first the interval midrad(0.1,realmin) with double precision endpoints is computed, then it is rounded into the specified k-bit precision.

## Directed rounding

All operations and in particular a type cast from double into fl-numbers respect the current rounding mode. For example,

```
setround(1) % rounding upwards
32 + fl(1)
x = fl(33)
setround(0)
```

fl-type ans = +1.00001000000 * 2^5 fl-type x = +1.00001000000 * 2^5

## Mixing precision and double rounding

The function flround(d,k) rounds d, a double number or interval, into k-bit precision, see help flround.

Concerning double rounding of the square root, I proved in the mentioned paper (*) the following. Suppose the square root of a k-bit number X is first rounded into m bits with result g, and then g is rounded into k bits, then this is equal to rounding the square root of X directly into k bits provided m>=2k+2. Moreover, for m=2k+1 there is exactly one counterexample to that statement namely, up to scaling, the predecessor of 2. For k=12 this is checked as follows:

k = 12; flinit(k,100); tic X = flsequence(1,pred(4)); % All k-bit fl-numbers in [1,4) sizeX = size(X) for m=(2*k+1):(2*k+2) m sqrtX = sqrt(double(X)); g = flround(sqrtX,m); index = find( flround(sqrtX,k)~=flround(g,k) ); if any(index) getbits(X(index),k) % prints counterexample else disp('no counterexample') end end toc

sizeX = 1 4096 m = 25 ans = ' +1.11111111111 * 2^1' m = 26 no counterexample Elapsed time is 0.024712 seconds.

It may be a hint, but this is not quite a proof of correctness; even not for k=12 because in flround(sqrtX,k) or the computation of g a double rounding might have occurred.

## Enjoy INTLAB

INTLAB was designed and written by S.M. Rump, head of the Institute for Reliable Computing, Hamburg University of Technology. Suggestions are always welcome to rump (at) tuhh.de