Title: Testing Firefox Wasm Tail Call
Brief: Or why assumptions are not always correct.
Date: 1708076705
Tags: Wasm, Interpreters
CSS: /style.css

### Lore ###
Interpreting comes at a cost: the more you nest, the more complex things become. It's especially true on the web, where any user program already sits on layers and layers of interfaces.

It gets pretty funny: I can't even run a ZX Spectrum emulator written in JavaScript at more than a few frames a second.

A lot of software targeting the web ships its own language and interpreter (such as Godot with GDScript), and in real-time, simulation-intensive cases the overheads do matter.

One thing that is often suggested for improving interpreter performance is `tail calling`, and empirically it works on native platforms: [check this post](https://mort.coffee/home/fast-interpreters/).

And so I wondered: could it work on the Wasm platform? Firefox recently [pushed support](https://bugzilla.mozilla.org/show_bug.cgi?id=1846789) for the [experimental spec](https://github.com/WebAssembly/tail-call/blob/main/proposals/tail-call/Overview.md) of it, after all.

### Results ###
I based the test interpreter on the `fast-interpreters` post linked above. Sources are available on [GitHub](https://github.com/quantumedbox/wasm-tail-call-interpreter-benchmark).

It does nothing but increment a counter up to 100000000, which is a relevant case for nothing except instruction decoding, which is what we are testing here.

First, native:
```
time ./jump-table
real    0m3,094s
user    0m3,082s
sys     0m0,012s

time ./tail-call
real    0m2,491s
user    0m2,485s
sys     0m0,005s
```

A run time decrease of about `19.5%`! Formidable.

But with the web it's more interesting:
```
tail-call.wasm (cold): 10874ms - timer ended
jump-table.wasm (cold): 6610ms - timer ended
```

Tail calls are actually slower in this case: the jump-table build takes `39.2%` less time (put the other way, tail calls take about 64.5% longer), and I'm not yet sure why.
Intuition proven wrong, but testing it first proved useful :)

Note: I'm running this on an amd64 CPU with stable Firefox 122.0; the code was compiled with Zig's Clang, version 16.

Seems like JIT compilation on the web is the way to go, to fold everything to Wasm bytecode. But overall, with a plain jump table the overhead relative to native is a *mere* 113.6%, which I would say isn't critical for a lot of cases, especially if the interpreter is intended mostly as an interface adapter, which is the case with GDScript.
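
For readers who haven't seen the two dispatch styles side by side, here is a minimal sketch in C. It is not the code from the benchmark repository: the three-instruction bytecode, the handler names and the `MUSTTAIL` macro are all made up for illustration. It assumes a compiler that understands `__attribute__((musttail))` (recent Clang does); on other compilers the macro expands to nothing and the calls may or may not be turned into jumps.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical three-instruction bytecode, for illustration only. */
enum { OP_INC, OP_JUMP_IF_LESS, OP_HALT };

/* Jump-table style: one loop with one switch over the opcode. */
int64_t run_jump_table(const uint8_t *code, int64_t limit) {
    int64_t acc = 0;
    const uint8_t *ip = code;
    for (;;) {
        switch (*ip++) {
        case OP_INC:
            acc++;
            break;
        case OP_JUMP_IF_LESS: /* one operand byte: absolute jump target */
            ip = (acc < limit) ? code + *ip : ip + 1;
            break;
        case OP_HALT:
            return acc;
        default:
            assert(0 && "unknown opcode");
            return -1;
        }
    }
}

/* Tail-call style: every instruction is a function that ends by
 * calling the handler of the next instruction. */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL /* assumption: plain calls; long programs may overflow the stack */
#endif

typedef int64_t op_fn(const uint8_t *code, const uint8_t *ip,
                      int64_t acc, int64_t limit);

static op_fn op_inc, op_jump_if_less, op_halt;

/* Indexed by opcode; this sketch does no bounds check on the opcode. */
static op_fn *const handlers[] = { op_inc, op_jump_if_less, op_halt };

/* ip points at the next opcode: fetch it and jump into its handler. */
#define DISPATCH() MUSTTAIL return handlers[*ip](code, ip + 1, acc, limit)

static int64_t op_inc(const uint8_t *code, const uint8_t *ip,
                      int64_t acc, int64_t limit) {
    acc++;
    DISPATCH();
}

static int64_t op_jump_if_less(const uint8_t *code, const uint8_t *ip,
                               int64_t acc, int64_t limit) {
    /* here ip points at the operand byte */
    ip = (acc < limit) ? code + *ip : ip + 1;
    DISPATCH();
}

static int64_t op_halt(const uint8_t *code, const uint8_t *ip,
                       int64_t acc, int64_t limit) {
    (void)code; (void)ip; (void)limit;
    return acc;
}

int64_t run_tail_call(const uint8_t *code, int64_t limit) {
    return handlers[code[0]](code, code + 1, 0, limit);
}
```

Feeding both functions the program `{ OP_INC, OP_JUMP_IF_LESS, 0, OP_HALT }` with some limit makes them count up to that limit and return it, which is roughly the shape of the benchmark loop. All handlers share one signature on purpose, since a guaranteed tail call needs the caller and callee to agree on it.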