Discussion:
Interpreters and caller-saved registers
a***@mips.complang.tuwien.ac.at
2023-10-13 07:44:46 UTC
In August we discussed in comp.arch that Intel's APX will increase the
number of caller-saved registers, but not the number of callee-saved
registers. And I pointed out that more caller-saved registers are
hardly useful for interpreters.

This discussion inspired me to reconsider how we might use
caller-saved registers for stack caching in Gforth. With stack caching
as used in Gforth [ertl&gregg05], the top 0 to n stack items reside in
registers, giving n+1 different stack representations (one for each
number of stack items held in registers).

E.g., on RISC-V the stack representations S0-S3 are:

stack
elements   S0       S1       S2       S3
top        8(s8)    s7       s0       s2
second     16(s8)   8(s8)    s7       s0
third      24(s8)   16(s8)   8(s8)    s7
fourth     32(s8)   24(s8)   16(s8)   8(s8)

With these stack representations, the definition

: cubed dup dup * * ;

is compiled into:

$3F9F301748 dup 1->2
3F9EFA4ACA: mv s0,s7
$3F9F301750 dup 2->3
3F9EFA4ACC: mv s2,s0
$3F9F301758 * 3->2
3F9EFA4ACE: mul s0,s0,s2
$3F9F301760 * 2->1
3F9EFA4AD2: mul s7,s7,s0
$3F9F301768 ;s 1->1
3F9EFA4AD6: ld a6,0(s9)
3F9EFA4ADA: addi s9,s9,8
3F9EFA4ADC: mv s10,a6
3F9EFA4ADE: ld a5,0(s10)
3F9EFA4AE2: jr a5

We use n=3 if enough registers are available, but until September 2023
we used n=1 on AMD64 because of a lack of registers. (n>3 results in
diminishing returns [ertl&gregg05].)

So on AMD64 we got:

$7F4534F89B00 dup 1->1
0x00007f4534c41030: mov %r8,0x0(%r13)
0x00007f4534c41034: sub $0x8,%r13
$7F4534F89B08 dup 1->1
0x00007f4534c41038: mov %r8,0x0(%r13)
0x00007f4534c4103c: sub $0x8,%r13
$7F4534F89B10 * 1->1
0x00007f4534c41040: imul 0x8(%r13),%r8
0x00007f4534c41045: add $0x8,%r13
$7F4534F89B18 * 1->1
0x00007f4534c41049: imul 0x8(%r13),%r8
0x00007f4534c4104e: add $0x8,%r13
$7F4534F89B20 ;s 1->1
0x00007f4534c41052: mov (%rbx),%r10
0x00007f4534c41055: add $0x8,%rbx
0x00007f4534c41059: mov %r10,%r15
0x00007f4534c4105c: mov (%r15),%rcx
0x00007f4534c4105f: jmp *%rcx

I.e., lots of memory accesses and lots of stack pointer updates.

As you can see in the RISC-V register names, gcc used callee-saved
registers (with names starting with "s") rather than caller-saved
registers (names starting with "t") for the stack cache registers.
That's because there are >100 VM instructions in Gforth that call
functions with code like:

/* start in S1 */
char *a_addr = (char *)s7;
free_l(a_addr);
long wior = 0;
s7 = wior;
/* end in S1 */
... invoke next VM instruction ...

(I used s7 here instead of the name of the C variable to make the
connection with the stuff above clearer).

Now the idea for making use of caller-saved registers is that, for a
VM instruction that ends in representation S1, (the variables in)
registers s0 and s2 are dead, because any transition to S2 or S3 will
overwrite these registers. Unfortunately, gcc does not know that, and
assumes that the invocation of the next VM instruction could also
invoke something that expects the stack to be in S3, where s0 and s2
are live. Therefore it considers s0 and s2 live across the call to
free_l() and allocates them in callee-saved registers. In Gforth, the
VM instructions containing calls all end in S1, so s0 and s2 are (in
reality, but unknown to gcc) dead after each call and could be
allocated to caller-saved registers.

So the idea is to kill these two registers after the call to free_l()
(and after the other calls) in the eyes of gcc. More generally, my
idea was to kill s2 at the end of every VM instruction that ends in
S2, kill s2 and s0 at the end of every VM instruction that ends in S1,
and kill s2, s0, and s7 at the end of every VM instruction that ends
in S0. That should provide the maximum freedom to gcc, which is
generally believed to help the compiler produce good code.

How do we kill a variable/register? By writing to it, or by making the
compiler believe that we write to it. One way would be:

s2 = 0;

However, this costs an instruction; a cheap one on sophisticated
microarchitectures, but still. So what we actually do is:

asm("" : "=X"(s2));

This tells gcc that the asm statement writes to s2, and thus kills it,
but it actually does not generate any assembly language.

I implemented this and then added the S2 and S3 representations on
AMD64, and gcc then indeed managed to keep s2 and s0 in caller-saved
registers.

Unfortunately, gcc-11.4 also introduced two additional redundant move
instructions in every VM instruction, and Bernd Paysan reported that
gcc-12 and gcc-13 introduced even more superfluous code in every VM
instruction. This is similar to what we saw from gcc-3.0 for
Gforth back then, and to what we saw from clang the last time we
tried it.

I tried to work around this issue by having the kills only at the end
of VM instructions that perform a call, and indeed, that worked for
gcc-11.4. However, gcc-12 and gcc-13 still produced bad code.
Finally Bernd Paysan had the right idea and added -fno-tree-vectorize
to the list of options that we use to avoid gcc shenanigans, and now
we can also use this idea with gcc-12 and gcc-13.

The result is that the code for CUBED now looks as follows on AMD64:

$7F9D0D2531F0 dup 1->2
0x00007f9d0cefa8b2: mov %r8,%r15
$7F9D0D2531F8 dup 2->3
0x00007f9d0cefa8b5: mov %r15,%r9
$7F9D0D253200 * 3->2
0x00007f9d0cefa8b8: imul %r9,%r15
$7F9D0D253208 * 2->1
0x00007f9d0cefa8bc: imul %r15,%r8
$7F9D0D253210 ;s 1->1
0x00007f9d0cefa8c0: mov (%r14),%rbx
0x00007f9d0cefa8c3: add $0x8,%r14
0x00007f9d0cefa8c7: mov (%rbx),%rax
0x00007f9d0cefa8ca: jmp *%rax

In terms of performance, the difference is (on a 4GHz Core i5-6600K,
numbers in seconds):

sieve   bubble  matrix  fib     fft
0.070   0.096   0.029   0.058   0.022   with S0-S1
0.060   0.091   0.019   0.051   0.021   with S0-S3

Concerning the discussion of caller-saved vs. callee-saved registers:
if AMD64 had more callee-saved registers, we would have seen these
performance benefits many years ago, and we would not have had to
spend the work to introduce the kills, discover the problems, and work
around them.

This optimization is quite specific to the stack caching optimization.
A more general approach for VM interpreters is to surround every call
with code saving VM registers before and code restoring them
afterwards. This allows the C compiler to put the VM registers into
caller-saved registers. E.g.:

vmstate->ip = ip;
vmstate->sp = sp;
...
free_l(a_addr);
ip = vmstate->ip;
sp = vmstate->sp;
...

Of course, these saves and restores also have a cost, but at least
they allow the use of caller-saved registers in the VM instructions
that do not perform calls. To reduce the overhead of these saves and
restores, one can keep m VM registers in C variables across calls
instead of saving them, where m is the architecture-specific number of
callee-saved registers. So architectures with more callee-saved
registers have less of this saving and restoring overhead.

Followups set to comp.arch; change this to comp.compilers (moderated)
if your followup is more appropriate there.

@InProceedings{ertl&gregg05,
author = {M. Anton Ertl and David Gregg},
title = {Stack Caching in {Forth}},
crossref = {euroforth05},
pages = {6--15},
url = {http://www.complang.tuwien.ac.at/papers/ertl%26gregg05.ps.gz},
pdfurl = {http://www.complang.tuwien.ac.at/anton/euroforth2005/papers/ertl%26gregg05.pdf},
OPTnote = {not refereed},
abstract = {Stack caching speeds Forth up by keeping stack items
in registers, reducing the number of memory accesses
for stack items. This paper describes our work on
extending Gforth's stack caching implementation to
support more than one register in the canonical
state, and presents timing results for the resulting
Forth system. For single-representation stack
caches, keeping just one stack item in registers is
usually best, and provides speedups up to a factor
of 2.84 over the straight-forward stack
representation. For stack caches with multiple stack
representations, using the one-register
representation as canonical representation is
usually optimal, resulting in an overall speedup of
up to a factor of 3.80 (and up to a factor of 1.53
over single-representation stack caching).}
}
@Proceedings{euroforth05,
title = {21st EuroForth Conference},
booktitle = {21st EuroForth Conference},
year = {2005},
key = {EuroForth'05},
editor = {M. Anton Ertl},
url = {http://www.complang.tuwien.ac.at/anton/euroforth2005/papers/proceedings.pdf}
}

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Thomas Koenig
2023-10-15 19:52:45 UTC
[Replying to comp.compilers as this is more pertinent there]
Post by a***@mips.complang.tuwien.ac.at
asm("" : "=X"(s2));
This tells gcc that the asm statement writes to s2, and thus kills it,
but it actually does not generate any assembly language.
[...]
Post by a***@mips.complang.tuwien.ac.at
Unfortunately, gcc-11.4 also introduced two additional redundant move
instructions in every VM instruction, and Bernd Paysan reported that
gcc-12 and gcc-13 introduced even more superfluous code in every VM
instruction.
It is well known that compilers in general, and gcc specifically,
often generate superfluous register moves; there are quite a few PRs
on this in gcc's bug database. I have submitted a few of them myself,
such as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111373, which
includes compiler-generated code like

movq %rdx, %rsi
movq %rax, %rdx
movq %rcx, 8(%rdi)
movq %rsi, %rax
movq %rdx, 16(%rdi)
movq %rax, (%rdi)
ret

where it is obvious to anybody who can read assembly that the
register moves are unneeded (although they are likely to
be zero-cycle operations because of register renaming).

However, if this got worse between releases, it is a regression, and
regressions get higher priority for fixing. So, if it is feasible
to generate a reduced test case (for which cvise, for example,
is an excellent tool), filing a bug report would be a good thing.
Post by a***@mips.complang.tuwien.ac.at
This is similar to what we have seen from gcc-3.0 for
Gforth at that time, and what we have seen from clang last we tried
it.
I tried to work around this issue by having the kills only at the end
of VM instructions that perform a call, and indeed, that worked for
gcc-11.4. However, gcc-12 and gcc-13 still produced bad code.
Finally Bernd Paysan had the right idea and added -fno-tree-vectorize
to the list of options that we use to avoid gcc shenanigans, and now
we can also use this idea with gcc-12 and gcc-13.
That is strange, and would give valuable hints for investigating
this regression.

This sort of code is an example of the contradictions in today's
compiler technology. On the one hand, they do amazing optimizations
on large amounts of code which no programmer could hope to reach
while staying productive. On the other hand, it is very common
to see glaring inefficiencies when one looks at even small chunks
of code.

(A good assembler programmer can often beat compiler-generated
code by a factor of two or more, especially if SIMD is involved,
but SIMD is really hard to generate code for).

So far, nobody has found an algorithm to "just remove the
silliness" from compiled programs. Maybe it would be feasible to
run some peephole optimization as a final pass, which could improve
code like the above, but that might be difficult in the more
general case where registers are reused in other basic blocks
(which would essentially mean redoing the register allocation).

So, still work to do...
a***@mips.complang.tuwien.ac.at
2023-10-19 17:14:28 UTC
Post by Thomas Koenig
It is well known that compilers in general, and gcc specifically,
often generate superfluous register moves
...
Post by Thomas Koenig
However, if this got worse between releases, it is a regression, and
regressions get higher priority for fixing. So, if it is feasible
to generate a reduced test case (for which cvise, for example,
is an excellent tool), filing a bug report would be a good thing.
If you want to file such a bug report, I can give you the commit of
Gforth before we added all the workarounds, where all the problems
are visible without ado.
Post by Thomas Koenig
This sort of code is an example of the contradictions in today's
compiler technology. On the one hand, they do amazing optimizations
on large amounts of code which no programmer could hope to reach
while staying productive.
There are certainly cases where compilers unroll loops (plus
ramp-up/ramp-down and compensation code) into huge swaths of code
that make it hard to see where the action is, but amazing?
Post by Thomas Koenig
So far, nobody has found an algorithm for "just remove the
silliness" from compiled programs. Maybe it would be feasible to
run some peephole optimization as last passes which could improve
code like the one above, but that might also be difficult in the
more general case where registers are reused in other basic blocks
(which would mean just to redo the register allocation).
Yes, I doubt that it can be solved with peephole optimization, or it
would have been done already.

- anton
--
M. Anton Ertl
***@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/
Thomas Koenig
2023-10-22 18:43:47 UTC
Post by a***@mips.complang.tuwien.ac.at
Post by Thomas Koenig
It is well known that compilers in general, and gcc specifically,
often generate superfluous register moves
...
Post by Thomas Koenig
However, if this got worse between releases, it is a regression, and
regressions get higher priority for fixing. So, if it is feasible
to generate a reduced test case (for which cvise, for example,
is an excellent tool), filing a bug report would be a good thing.
If you want to file such a bug report, I can give you the commit of
Gforth before we added all the workarounds, where all the problems
are visible without ado.
This reply shows an interesting aspect of compiler development that is
often overlooked: The social aspect.

Compiler writers generally want to improve their product, but they
also generally feel that bug submitters (at least those who don't have
a support contract) should invest a minimum of work if they want
something fixed, and a general "look at large package xyz, it'll be
obvious" is below that threshold. (This is the reason why gcc, for
example, asks for a complete and self-contained test case in its bug
reports.)

People who complain about bugs, but are not willing to put in that
minimum amount of work, are often ignored.
a***@mips.complang.tuwien.ac.at
2023-10-24 17:15:04 UTC
Post by Thomas Koenig
Post by a***@mips.complang.tuwien.ac.at
If you want to file such a bug report, I can give you the commit of
Gforth before we added all the workarounds, where all the problems
are visible without ado.
This reply shows an interesting aspect of compiler development that is
often overlooked: The social aspect.
Compiler writers generally want to improve their product, but they
also generally feel that bug submitters (at least those who don't have
a support contract) should invest a minimum of work if they want
something fixed, and a general "look at large package xyz, it'll be
obvious" is below that threshold. (This is the reason why gcc, for
example, asks for a complete and self-contained test case in its bug
reports.)
People who complain about bugs, but are not willing to put in that
minimum amount of work, are often ignored.
In my experience with gcc maintainers, when I put in that effort, it
is wasted, because

1) the resulting bug report is resolved as invalid (e.g., PR25285) in
less time than it took me to create it. Moreover, gcc maintainers
who knew much less about the performance implications of the bug
than I did (I had researched the topic for several years at that
point) wrote patronizingly down to me; that would have been
forgivable if they had kept their part of the social contract and
fixed the bug.

2) they confirm the bug and do nothing about it (PR93811).

3) they just do nothing about it (PR 93765).

In the present case, declaring the bug to be INVALID would be easy
given that the program uses features outside standard C, and I expect
that if you make such a bug report, it will be resolved as INVALID.
So why should anyone waste their time on it? If you think they are
going to fix it, why don't you invest your time in it?

- anton
--
M. Anton Ertl
***@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/
Kaz Kylheku
2023-10-25 17:19:17 UTC
Post by a***@mips.complang.tuwien.ac.at
In my experience with gcc maintainers, when I put in that effort, it
is wasted, because
I sent a patch to the GCC patch mailing list in April 2022. A patch with
documentation updates and test cases. Pinged a number of times (most recently,
sometime this summer, I think). No reply, affirmative or negative.

Engaging GCC isn't the best way to spend one's time.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.