Passing Glucas options to the compiler.
To make Glucas as fast and reliable as possible, we have to pass
some flags and macro definitions at compile time. Some flags are specific to
the compiler and others to Glucas or the YEAFFT library.
We cannot give advice on compiler-specific flags. You should read the
compiler's documentation and select the flags that make the binary as fast
as possible.
The Glucas and YEAFFT options are macro definitions, which can be set using
the macro definition facility of most compilers. To do so, include
-DYOUR_OPTION1[=value1] [-DYOUR_OPTION2[=value2]] ...
among the compiler flags when invoking the compiler.
For Metrowerks CodeWarrior a similar functionality can be achieved
by editing the parameters in the file macos-codewarrior-prefix.h
and including it in the Prefix File setting of the C/C++ Compiler Settings.
YEAFFT options
- Y_AVAL=value
- It selects the type and size of the radices that will be used in the FFTs.
(See Glucas internals.)
The default value is 3; 4 or 5 can also be defined. We recommend starting
with the default and then comparing its performance with the other
choices.
Y_AVAL=3 YEAFFT uses radices 4,5,6,7,8 and 9 in FFT passes.
It is the default and the best option on most systems.
Y_AVAL=4 It uses radices 4,5,6,7,8 and 9 in the first FFT pass and
8,16 in the other passes.
Y_AVAL=5 It can also use radix-32 reduction in the middle passes.
Only a few processors gain some speed from it.
- Y_BLOCKSIZE=value
- This parameter sets the number of contiguous elements of the main data
array, counted in doubles, that are stored without padding. Padding is necessary
to avoid cache thrashing; without it, performance drops drastically. This
parameter is related to Y_SHIFT and Y_PADDING_LEVEL.
- Y_SHIFT=value
- This parameter and Y_BLOCKSIZE set the basic size of the array
pad. It is computed as
basic_pad_size = (Y_BLOCKSIZE) >> Y_SHIFT
Note that the basic pad size must be greater than one to avoid
alignment problems.
- Y_PADDING_LEVEL=value
- This parameter sets the complexity of the padding. If
Y_PADDING_LEVEL=0, there is actually no padding. If
Y_PADDING_LEVEL=n, holes have sizes from 1 to n, depending on
the result of the padding algorithm. If the level is 1, all holes have the
same size.
- Y_MANY_REGISTERS
- When it is defined together with Y_AVAL > 3, first-pass radix
reductions from 10 to 16 can be used (See Glucas internals.) Because these
routines use many local variables, this is only an advantage when the processor has
a lot of registers (32 FPU registers or more). In fact, we have not yet
seen a machine that gains from this feature.
- Y_MEM_THRESHOLD=value
- To avoid cache misses where possible, the FFT passes differ depending on
the pad between data (See Glucas internals.) The threshold from pass 1 to pass 2
is defined by this parameter. The default is 2048. It is a good choice
for most systems, but some, like Alpha ev67, run faster with 8192 instead.
The value should be a power of two.
- Y_TARGET=value
- The YEAFFT library makes intensive use of C preprocessor macros. Most of the FFT
tasks are built from small macros defined in the file ygeneric.h.
This file is written with a generic processor in mind. The generic
processor is the default and is selected by defining Y_TARGET=0. Certainly, there
are many things one could write better for a specific processor. If you are brave
enough, do it. You can even write a collection of assembler macros. This is
an advanced feature and we recommend not touching it. To use it, you have to change some
lines in the mccomp.h file and write your own my_proc.h file.
Since release v.2.8a, prefetch hints have been available. They increase
performance a lot in some cases. To use this feature, you have to define
a Y_TARGET other than generic. As of release 2.8b, this is the list of
target values. Options 16, 17, 41 and 51 are not recommended;
they are still experimental.
0- Generic. No prefetch other than the GCC v3.1 builtin is used. All pure C code.
    Generic C compiler.
1- Pentium, Pentium MMX, Pentium II. No prefetch. A lot of assembler
    lines. GNU/gcc compiler or compatible _asm_ extensions.
11- Pentium 3. Prefetch used. A lot of assembler code. GNU/gcc compiler
    or compatible _asm_ extensions.
12- AMD Athlon. Prefetch used. A lot of assembler code. GNU/gcc compiler
    or compatible _asm_ extensions.
16- Pentium 3. Prefetch used. Only two lines of assembler code. GNU/gcc
    compiler or compatible _asm_ extensions. Not recommended.
17- AMD Athlon. Prefetch used. Only two lines of assembler code. GNU/gcc
    compiler or compatible _asm_ extensions. Not recommended.
21- PowerPC 601. Prefetch used. Only two lines of assembler code.
    GNU/gcc compiler or compatible _asm_ extensions or Metrowerks
    CodeWarrior intrinsics.
23- PowerPC 604e, 7xx, 74xx. Prefetch used. Only two lines of assembler
    code. GNU/gcc compiler or compatible _asm_ extensions or Metrowerks
    CodeWarrior intrinsics.
31- Alpha ev56, ev6, ev67, ev68. Prefetch used. Only two lines of
    assembler code. Compaq-C compiler with asm calls.
32- Alpha ev56, ev6, ev67. Prefetch used. Only two lines of assembler
    code. GNU/gcc compiler or compatible _asm_ extensions.
41- Ultrasparc-II. No prefetch. Only two lines of assembler
    code. GNU/gcc compiler or compatible _asm_ extensions. Not recommended.
51- Intel Itanium IA-64. No prefetch. Only two lines of assembler
    code. GNU/gcc compiler or compatible _asm_ extensions. Not recommended;
    use the Y_ITANIUM option for a terrific performance.
- Y_PREFETCH_EXPENSIVE
- When prefetch is available through a Y_TARGET other than generic, some
routines can be unrolled to avoid unnecessary calls to prefetch hints. This
can be useful when prefetch hints are expensive in performance terms. At the
moment, it is still an experimental feature.
- Y_LONG_MACROS
- YEAFFT is built mostly from small macros that do elementary
FFT work. Sometimes it is more convenient to use big macros, which make it
easier to tune operations with long latencies.
- Y_VECTORIZE2
- Do not mistake this for a multithreading option. For radix-4 reduction it is
possible to unroll inner loops with a register pressure similar to radix-8.
It can help a bit on some processors.
- Y_ITANIUM
- This option activates special code for IA64 processors since 2.8c. It has no
effect in earlier releases. The code is plain C, with no assembler lines,
but it carries a big performance penalty on processors other than Intel IA64.
Using this option is strongly recommended on IA64 machines, where it can
double the performance.
- Y_MINIMUM
- This option configures YEAFFT to use the minimum possible amount
of precomputed trigonometric factors. This increases the work at
FFT-time but reduces the memory traffic. On some systems it is
worthwhile. The default is to precompute only a part of the trigonometric
factors and complete them at FFT-time.
- Y_MAXIMUM
- This option is the opposite of Y_MINIMUM. Here, all the trigonometric
factors are precomputed, so there is less work to do at
FFT-time. The drawback is that this increases the memory traffic
and can slow the program down.
- _PTHREADS=value
- This option enables the use of POSIX threads. It is set automatically
when the binary is built with the configure script using
--enable-pthread=n. Otherwise, if you want to use n threads, you
should add -D_PTHREADS=n to your compiler command line options.
When using configure, the option --enable-pthread is
equivalent to --enable-pthread=2. This option is not
recommended for single processor machines.
- _OPENMP
- This option enables the use of OpenMP directives.
If you want to set the number of threads to use, you have to include the
option Y_NUM_THREADS. Warning: if you want to use OpenMP
multiprocessing, you have to enable it with the proper compiler flag (usually
-omp). The compiler then defines the _OPENMP
macro itself, so YOU DON'T HAVE TO DEFINE IT EXPLICITLY. This option is not
recommended for single processor machines.
- _SUNMP
- This option enables Sun MP C directives (see the documentation for the
SunWSpro C compiler). Warning: you have to define PARALLEL=n
in your shell environment, and you also have to define Y_NUM_THREADS
and use the specific compiler flag -xexplicitpar.
This option is not recommended for single processor machines.
- Y_NUM_THREADS=value
- This option is used to set EXPLICITLY how many threads are used with the
OpenMP or SunMP multithreaded options. You have to assign a
value when using SunMP but not when using OpenMP (where the
best choice can be computed at runtime). In any case, to avoid a number of
threads that is not a power of two, it is better to use this option whenever you use
either _OPENMP or _SUNMP.
GLUCAS options
The following macros are defined to manage some aspects of Lucas Lehmer tests,
not the FFT routines.
- Y_SECURE
- When defined, Glucas makes a roundoff check every iteration
(See Glucas internals.) It costs about 5% of performance, but in some cases
it is worthwhile. We recommend using it on systems with unreliable
hardware or software (low-end PCs, overclocked systems, etc.). In
2.9.2, when it is not defined, the roundoff error is checked
during the first 131072 iterations after start. If the error in any of these
iterations is over 0.40, Glucas restarts and tries to adjust its
accuracy. Once this initial phase has passed, Glucas
continues checking the roundoff error, but only every 64 iterations and
with a higher threshold (0.45). A further error restarts the test
from the last save file with increased accuracy.
You can also manage the roundoff check at run time by editing the ini file
(See Configure files.)
- Y_KILL_BRANCHES
- The carry and normalization phase of the
Discrete Weighted Transform
(See Glucas internals.) has some unpredictable branches. These can slow down
some processors. Glucas has alternative code
that avoids all these branches at the cost of some extra floating-point and integer
instructions. Some processors gain with this option activated.
- Y_VECTORIZE
- If this option is used, it is possible to avoid some dependency stalls in the carry
and normalization phase of the
Discrete Weighted Transform.