Yeah, good to see someone who agrees. It feels like a lot of dynarec papers recycle the exact same material from the previous one and then add a few new things. Not that that's so awful, but a few bad points keep being reiterated.
It's like always, everyone is quoting everyone else. Remember when it was revealed that the amount of iron in spinach is much lower than originally thought? Someone made an error during the calculation and everyone else kept quoting the result without rechecking it. People are just lazy, and that's also true for me. How often did I think that I should write a better introduction to dynamic recompilation, but it was just too tempting to spend my spare time on less intellectual topics.
At least I try to keep an open mind. There is no pure right or wrong with this topic. That's why I tried to cover so many different options when I initially started this thread. For example, I like dynamic register allocation, but if someone proves to me that in his case static allocation is faster, then why should I argue against it? You make a dynarec because you want the speed, so speed almost always wins.
HotSpot is probably one of the few cases where a slight speed increase doesn't really justify including an additional emulation engine.
BTW, I have to admit that I only browsed through Michael Steil's work, but wanted to include it because it at least looks interesting and it's good to provide different views.
I do know David and Victor, though. I had known David for several years before his project, and Neil Bradley started his dynarec mailing list (not the Yahoo one) because he wanted to have a discussion with Victor and me at the same time. Victor would be the first to admit that his English isn't perfect, but I think his knowledge and ideas are good enough to look past that.
On the other hand, if only very small regions are being modified (not entire blocks) you can black-list the opcodes and pass them through an interpreter, in which case it would make sense to have both active at the same time. Most games don't do things like this anymore, but I know I saw games on GBA that did.
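Roughly what I mean, as a sketch in C. The two entry points at the bottom are just made-up names for whatever the emulator provides; the point is only that both engines stay active and the dispatcher picks per region:

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch: small regions that are known to get modified go on a black-list,
   and the dispatcher hands those to the interpreter while everything else
   stays recompiled.  The extern hooks are assumed, not real APIs. */

#define MAX_REGIONS 32

static struct { uint32_t start, end; } blacklisted[MAX_REGIONS];
static int num_blacklisted;

/* assumed: interpret until the PC leaves [start,end), return the new PC */
extern uint32_t interpret_until_leaving(uint32_t pc, uint32_t start, uint32_t end);
/* assumed: run the translated block at pc, return the next PC */
extern uint32_t run_translated_block(uint32_t pc);

static bool in_blacklist(uint32_t pc, int *which)
{
    for (int i = 0; i < num_blacklisted; i++)
        if (pc >= blacklisted[i].start && pc < blacklisted[i].end) {
            *which = i;
            return true;
        }
    return false;
}

/* register a small self-modified region (called when a write hits translated code) */
void blacklist_region(uint32_t start, uint32_t end)
{
    if (num_blacklisted < MAX_REGIONS) {
        blacklisted[num_blacklisted].start = start;
        blacklisted[num_blacklisted].end   = end;
        num_blacklisted++;
    }
}

/* one dispatcher step: interpret inside black-listed regions, recompiled code elsewhere */
uint32_t dispatch(uint32_t pc)
{
    int i;
    if (in_blacklist(pc, &i))
        return interpret_until_leaving(pc, blacklisted[i].start, blacklisted[i].end);
    return run_translated_block(pc);
}
```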
Self-modifying code is mostly something from the "old ages". Today with split caches it's rather bad for the performance on most processors. A lot of 8-bit software seems to do it, but I don't really see the point of doing a dynarec for an 8-bit system. Even ten years ago practically every half-decent computer was capable of running an 8-bit system at full speed in an interpreter. All those 8-bit dynarecs (I know of at least two or three) might be fun, but they aren't really worth the hassle. Not to mention that those systems often needed cycle-exact emulation, which is much easier to achieve with interpretation anyway.
I guess nowadays self-modifying code should be the biggest problem on embedded/handheld systems that hold the code on a media type where "execute in place" is possible (namely ROM or NOR flash), but which is probably slower than RAM.
The GBA comes to mind, because it has 16-bit ROM access, but 32-bit work RAM. I've never programmed the GBA, but I would execute Thumb code from the ROM directly and copy the few ARM routines to the WRAM for speed. Otherwise you'd need two memory accesses for each ARM instruction, which practically nullifies the speed advantage.
I think I read a blog entry or version information about a GBA emulator, where self-modifying code was a problem in at least one case. Given your posting here, I wonder if that emulator could have been yours.
Of course, unless the system has lots of RAM and/or runs only one application, almost every modern system replaces code in RAM every now and then. But due to memory protection on those systems, code and data are typically in different memory pages. Thus it's quite easy to decide when to remove some code from the cache: if a page that originally held code is written to, it's most likely that the code in that page was modified, so better recompile it.
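The bookkeeping for that is pretty trivial. A minimal sketch, assuming 4 KB pages, a 16 MB guest address space and a made-up hook into the translation cache:

```c
#include <stdint.h>

/* Page-granular invalidation sketch: remember which pages we compiled code
   from, and throw away those translations when the page is written to.
   flush_translations_in_page() is an assumed hook, not a real API. */

#define PAGE_SHIFT 12
#define NUM_PAGES  ((16u * 1024 * 1024) >> PAGE_SHIFT)

static uint8_t page_has_code[NUM_PAGES];   /* set when a block from this page is translated */

extern void flush_translations_in_page(uint32_t page);  /* assumed cache hook */

/* call whenever a block starting at guest_pc has been translated */
void note_translated(uint32_t guest_pc)
{
    page_has_code[guest_pc >> PAGE_SHIFT] = 1;
}

/* call from the memory-write handler */
void notify_write(uint32_t guest_addr)
{
    uint32_t page = guest_addr >> PAGE_SHIFT;
    if (page_has_code[page]) {
        /* something wrote into a page we compiled from: assume the code changed */
        flush_translations_in_page(page);
        page_has_code[page] = 0;
    }
}
```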
Still, if you can manage not to include it in the release code it wouldn't hurt. Interpreters can be several thousand lines of code and I suppose several dozen KB, which really isn't that much, but still. Probably nothing compared to any notable amount of recompiled code.
Most likely true. Very little software should have an initialization section as complex as an interpretive CPU emulator.
I'm not that familiar with what the various JVM JIT strategies are like, at least not in detail.
I have to admit that my information is pretty outdated by now, too. But I guess that due to the all-important compatibility those parts of the JVM most likely haven't really changed.
Then again, since you said beginnings of basic blocks aren't marked I'm going to guess that includes the beginnings of if statements and loops too. If so you're right, that is pretty silly.
I believe that's the case. But I have nothing against being proven wrong.
Being a stack machine makes it more compact, arguably.
Yes, the code density for stack architectures is better, simply because operations don't need to encode their operands; those are just taken from the stack.
But almost all newer VMs have switched to register architectures for speed, like the Lua VM or Parrot.
Rather absurdly, a lot of embedded platforms don't even use JVM JIT and are relying on interpreters because of memory issues. So just where it needs the speed the most it doesn't get any. I don't think I have to tell you how little Java chip technologies caught on, despite being available in a lot of ARMs, which embedded platforms usually have.
I do know of Jazelle, of course. But I'm not sure how fast it is compared to a JIT. From what I've read, some of the more complex opcodes have to be interpreted in software anyway.
Although I do actually work with embedded systems, I didn't have the need for Java yet. We still program all our controllers in C. Well, AVRs or PICs aren't really the best target for Java, and all the 32-bit systems ran with Linux.
But never mind the lack of register allocation (to be fair, they'd have to redo that at runtime regardless because it's 100% dependent on the number of registers available).
I see register allocation from a register VM to a real processor as something like page replacement. Allocate as many registers as you can by simply mapping the next virtual register to the next free hardware register. This will probably work for a large share of the translation units. If you run out of hardware registers, use LRU or the cheaper second-chance algorithm to replace one of the already mapped registers. What works for page replacement in virtual memory could also work for dynamic register allocation.
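A rough sketch of what I mean, with plain LRU instead of second-chance and made-up spill/fill emitter hooks:

```c
#include <stdint.h>

/* Register allocation treated like page replacement: map guest registers to
   host registers on first use, evict the least recently used mapping when
   the host registers run out.  emit_spill/emit_fill are assumed hooks that
   would generate the actual load/store code. */

#define GUEST_REGS 32
#define HOST_REGS  8

static int  guest_to_host[GUEST_REGS];   /* -1 = not mapped */
static int  host_to_guest[HOST_REGS];    /* -1 = free */
static long last_use[HOST_REGS];
static long now;

extern void emit_spill(int host, int guest);  /* assumed: store host reg to guest state */
extern void emit_fill(int host, int guest);   /* assumed: load guest state into host reg */

/* call once per translation unit */
void regalloc_init(void)
{
    for (int i = 0; i < GUEST_REGS; i++) guest_to_host[i] = -1;
    for (int i = 0; i < HOST_REGS; i++)  host_to_guest[i] = -1;
}

/* return the host register holding the given guest register, mapping it if needed */
int map_guest_reg(int guest)
{
    now++;
    if (guest_to_host[guest] >= 0) {          /* already mapped: just touch it */
        last_use[guest_to_host[guest]] = now;
        return guest_to_host[guest];
    }

    int victim = 0;
    for (int i = 0; i < HOST_REGS; i++) {     /* pick a free register, or the LRU one */
        if (host_to_guest[i] < 0) { victim = i; break; }
        if (last_use[i] < last_use[victim]) victim = i;
    }

    if (host_to_guest[victim] >= 0) {         /* evict: write the old value back */
        emit_spill(victim, host_to_guest[victim]);
        guest_to_host[host_to_guest[victim]] = -1;
    }

    emit_fill(victim, guest);
    guest_to_host[guest] = victim;
    host_to_guest[victim] = guest;
    last_use[victim] = now;
    return victim;
}
```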
I'm more disgusted with the outright lack of optimizations entirely. Javac is barely more than a tokenizer - okay, yeah, it does flatten down expressions into stack format but it really does surprisingly little. I hear that you can get back perfectly genuine Java code from bytecode most of the time, hence the availability of Java obfuscating programs.
Indeed, when you follow the stack progression you can rebuild the original syntax tree, which is more or less the original code. All that's lacking is names and comments.
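Just to make that concrete, here's a toy illustration; the opcode set is invented, not real JVM bytecode, but the principle is the same: follow the stack and the sub-expressions fall right back out.

```c
#include <stdio.h>

/* Toy "decompiler": walk a linear stack-machine program and rebuild the
   expression it came from, exactly the way the stack progression implies. */

typedef enum { PUSH_VAR, ADD, MUL, STORE } Op;
typedef struct { Op op; const char *name; } Insn;

int main(void)
{
    /* corresponds to:  x = a + b * c  */
    Insn prog[] = {
        { PUSH_VAR, "a" }, { PUSH_VAR, "b" }, { PUSH_VAR, "c" },
        { MUL, 0 }, { ADD, 0 }, { STORE, "x" }
    };
    int n = sizeof prog / sizeof prog[0];

    char stack[16][128];     /* each slot holds the text of a sub-expression */
    int sp = 0;

    for (int i = 0; i < n; i++) {
        char tmp[128];
        switch (prog[i].op) {
        case PUSH_VAR:
            snprintf(stack[sp++], sizeof stack[0], "%s", prog[i].name);
            break;
        case ADD:
        case MUL:
            /* pop two operands, push the combined sub-expression */
            snprintf(tmp, sizeof tmp, "(%s %c %s)",
                     stack[sp - 2], prog[i].op == ADD ? '+' : '*', stack[sp - 1]);
            sp -= 2;
            snprintf(stack[sp++], sizeof stack[0], "%s", tmp);
            break;
        case STORE:
            printf("%s = %s;\n", prog[i].name, stack[--sp]);
            break;
        }
    }
    return 0;   /* prints: x = (a + (b * c)); */
}
```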
This is a flaw of all stack-based machines, so Microsoft's .NET is no different here. While they did a few things differently, they stuck with a stack VM as well. Sun probably didn't know that they would be using a JIT compiler on the bytecode, but Microsoft knew from the start and should have learnt from Sun's errors, so shame on Microsoft.
So not only is Java forcing a lot of expensive optimizations on the JVM, it's also giving it less information to work with in the bytecode format. The whole thing is a mess, and it's a good reason not to compare JVM JITs to recompilers - aside from being stack-based, which as you mentioned hardly any real processor is, a JIT really is more like a compiler than a recompiler. And it has given people the misconception that recompilers need a lot of time to do a good job.
The JVM is a bit of a mess anyway, mixing everything from low-level operations to high-level object-related stuff. You won't see this in a normal processor. On the other hand, a JVM won't have those timing issues or problems with trying to optimize condition code handling.
In a dynarec sometimes less is more. I'm a bit of a perfectionist (probably also one of the reasons why I don't release much stuff into the public), so I told David Sharp all sorts of optimizations he could use. I think in the end he used very few, because he found out that trying too hard to optimize everything took more time than the speed-up was worth it.
So what does that tell us? FX!32 may have been better off being fully dynamic. People argue a lot about static recompilers because they think that'll produce better code and provide advantages over dynarecs.
That's what I thought, too, until I read what the profiler actually does (i.e. not much).
The other argument is that they have most of the code pre-compiled in a database. But media access is much slower than RAM. So the question is, wouldn't it be faster to simply translate the code dynamically than to search a database for the right code segment and then load it? But that's more or less what you wrote further down as well.
I don't really see the point, also because you have to include the interpreter anyway, since there are problems that cannot be solved during static recompilation.
I think there was a statically recompiling N64 emulator once (Corn?), but it just compiled all the code it found when the game was loaded. Yes, it was fast, but it never ran anything but Mario 64. And I bet there would be lots of code discovery problems with other ROMs.
Unaligned accesses cause interrupts on most platforms where they're not supported outright, and Alpha is no exception. They could easily have relied on this and patched the code at runtime as the faults occurred, or only after many of them occur from the same location.
But interrupts are expensive, and you have to attach your emulator quite deep in the system. Maybe they didn't want to do it.
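Not that the idea itself is hard; the bookkeeping would be trivial, something like the sketch below. The "deep in the system" part is getting the fault delivered to you at all (FX!32 ran on NT/Alpha, so the real mechanism would look quite different). Both extern hooks here are made-up names:

```c
#include <stdint.h>

/* Sketch: let unaligned accesses trap, emulate them in the fault handler,
   and once the same code address has trapped often enough, patch that site
   with an aligned-safe sequence.  How the trap reaches this function
   (SIGBUS on Unix, a structured exception on NT) is the platform-specific
   part that is glossed over here. */

#define SITES           256
#define PATCH_THRESHOLD 16

extern void emulate_unaligned(void *fault_pc, void *fault_addr); /* assumed helper */
extern void patch_unaligned_site(void *fault_pc);                /* assumed recompiler hook */

static struct { void *pc; unsigned hits; } site[SITES];

void on_unaligned_fault(void *fault_pc, void *fault_addr)
{
    emulate_unaligned(fault_pc, fault_addr);     /* keep the program running */

    unsigned slot = (unsigned)(((uintptr_t)fault_pc >> 2) % SITES);
    if (site[slot].pc != fault_pc) {             /* crude direct-mapped table */
        site[slot].pc   = fault_pc;
        site[slot].hits = 0;
    }
    if (++site[slot].hits == PATCH_THRESHOLD)
        patch_unaligned_site(fault_pc);          /* hot spot: rewrite it for good */
}
```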
One other thing - people often assume that dynarec means basic block at a time.
True, there is no theoretical limit to how big your translation units can be.
You could actually see the N64 emulator above as a dynarec instead, only that the translation unit was all the code in the ROM.
I guess I'd normally prefer basic blocks, since you probably don't translate the same code twice, and you often return to the dispatcher loop, which makes timing easier. Of course the latter part makes the dynarec slower than it could be. There are always advantages and disadvantages.
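For reference, the per-block dispatcher I have in mind looks roughly like this (block cache, recompiler and event handler are assumed hooks); you can see both why timing is easy and where the time goes:

```c
#include <stdint.h>

/* Rough shape of a per-basic-block dispatcher, assuming each translated
   block returns the next guest PC and how many guest cycles it consumed.
   lookup_block/recompile_block/run_scheduled_events are made-up hooks. */

typedef struct {
    uint32_t next_pc;
    int      cycles;
} BlockResult;

typedef BlockResult (*TranslatedBlock)(void);

extern TranslatedBlock lookup_block(uint32_t pc);      /* assumed block cache */
extern TranslatedBlock recompile_block(uint32_t pc);   /* assumed recompiler */
extern void            run_scheduled_events(void);     /* timers, interrupts, ... */

void run(uint32_t pc, int cycles_per_slice)
{
    for (;;) {
        int budget = cycles_per_slice;
        while (budget > 0) {
            TranslatedBlock block = lookup_block(pc);
            if (!block)
                block = recompile_block(pc);
            BlockResult r = block();      /* execute one basic block */
            pc      = r.next_pc;
            budget -= r.cycles;           /* timing is easy: we get control back every block */
        }
        run_scheduled_events();           /* raise interrupts, update timers */
    }
}
```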
Of course you could also resort to chaining basic blocks, like Shade. Although in that case your register allocation advantage might not work.
The only problem is dealing with recursive links.
Yes, that could render an emulator rather unresponsive ;-)
For those who don't know the problem: To understand recursion, you have to understand recursion first...
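To make the chaining point concrete, a simplified sketch: the first time a block exits, you look up its successor once and link the exit, so later runs skip the dispatcher entirely. A real dynarec like Shade would patch the generated machine code rather than fill in a pointer, and this toy version only handles one exit per block; the names are made up.

```c
#include <stdint.h>
#include <stddef.h>

/* Block chaining sketch: link each block's exit directly to the translated
   successor so the dispatcher is bypassed after the first pass. */

typedef struct Block Block;
struct Block {
    uint32_t guest_pc;
    void   (*code)(void);    /* translated host code for this block */
    uint32_t exit_pc;        /* guest PC this block exits to */
    Block   *chained;        /* direct link to the successor, once known */
};

extern Block *find_block(uint32_t pc);       /* assumed block cache lookup */
extern Block *translate_block(uint32_t pc);  /* assumed recompiler */

void run_chained(uint32_t pc)
{
    Block *b = find_block(pc);
    if (!b)
        b = translate_block(pc);

    for (;;) {
        b->code();                           /* run the translated block */

        if (!b->chained) {                   /* slow path: one lookup, then link the exit */
            Block *next = find_block(b->exit_pc);
            if (!next)
                next = translate_block(b->exit_pc);
            b->chained = next;
        }
        b = b->chained;                      /* fast path: no dispatcher involved */
        /* note: no cycle accounting here - a chained loop never gives control
           back, which is exactly how an emulator gets unresponsive */
    }
}
```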
Overall an interesting discussion, BTW.
I forgot to mention a book:
Virtual Machines by Jim Smith and Ravi Nair. While the bulk is probably about VMs like VMware, a whole chapter is on emulation techniques (interpretation, threaded code, binary translation), and other chapters also cover some problems related to dynamic binary translation.