
Title: Testing Firefox Wasm Tail Call
Brief: Or why assumptions are not always correct.
Date: 1708076705
Tags: Optimization, Wasm, Interpreters
CSS: /style.css
### Lore ###
Interpreting comes at a cost: the more you nest, the more complex things become.
This is especially true on the web, where any user program already sits on layers upon layers
of interfaces. It gets pretty funny: I can't even run a ZX Spectrum emulator written in JavaScript at more than a few frames per second.
A lot of software targeting the web ships its own languages and interpreters (such as Godot with GDScript), and in realtime, simulation-intensive cases the overheads do matter.
One of the things often suggested for improving interpreter performance is `tail calling`,
and it works empirically on native platforms. [Check this post](https://mort.coffee/home/fast-interpreters/).
So I wondered: could it also work on the Wasm platform? After all, Firefox recently [pushed support](https://bugzilla.mozilla.org/show_bug.cgi?id=1846789) for the [experimental spec](https://github.com/WebAssembly/tail-call/blob/main/proposals/tail-call/Overview.md).
### Results ###
I based the test interpreter on the `fast-interpreters` post linked above.
Sources are available on [github](https://github.com/quantumedbox/wasm-tail-call-interpreter-benchmark). It does nothing but increment a counter up to 100000000,
which is a relevant case for nothing but instruction decoding, which is exactly what we are testing here.
First, native:
```
time ./jump-table
real 0m3,094s
user 0m3,082s
sys 0m0,012s
time ./tail-call
real 0m2,491s
user 0m2,485s
sys 0m0,005s
```
A run time decrease of `19.3%`! Formidable.
But on the web it's more interesting:
```
tail-call.wasm (cold): 10874ms - timer ended
jump-table.wasm (cold): 6610ms - timer ended
```
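The timings above come from a console-timer-style harness. A minimal sketch of such a harness follows; it is my own illustration, and the module bytes below are just the empty-module header so the snippet stays self-contained. In the real benchmark you would load `tail-call.wasm` / `jump-table.wasm` and call their exported entry point:

```javascript
// Minimal cold-run timing sketch (Node 16+ or browser). The bytes here are
// only the Wasm magic number + version, i.e. a valid empty module, used as a
// stand-in for the real benchmark binaries.
const emptyModule = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);

async function timeRun(label, bytes) {
    // A "cold" run includes compilation and instantiation, matching the
    // results quoted above.
    const t0 = performance.now();
    const { instance } = await WebAssembly.instantiate(bytes);
    // With the real benchmark modules you would invoke the exported entry
    // point here, e.g. instance.exports.run() (name is hypothetical).
    const ms = performance.now() - t0;
    console.log(`${label} (cold): ${Math.round(ms)}ms`);
    return instance;
}

timeRun("empty.wasm", emptyModule);
```

One caveat of cold-run timing is that it folds compile time into the measurement; separating instantiation from execution would isolate pure dispatch overhead.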
Tail calls are actually slower in this case (by `39.2%`), and I'm not yet sure why.
Intuition proven wrong, but testing it first proved useful :)
Note: I'm running this on an amd64 CPU, stable Firefox 122.0, compiled with Zig's bundled Clang version 16.
It seems like JIT compilation is the way to go on the web, folding everything down to Wasm bytecode.
But overall, with the plain jump table the overhead over native is a *mere* 113.6%, which I would say isn't critical for a lot of cases, especially if the interpreter is intended mostly as an interface adapter, which is the case with GDScript.