
· Registered
Joined
·
183 Posts
My projects: learning how to do it now, and my xwindowsxp, which will start in a year or so - maybe less, maybe not. I will be using the xwindows source code to start the xwindowsxp project. As I said, right now I am learning how to make my own video plugin in Glide.
 

· King of Pain
Joined
·
491 Posts
Discussion Starter · #43 ·
Originally posted by Nezzar
Do I see a new emulator coming? :p
Not from me in the near future, because I still have way more interest in different processor architectures than graphics and sound.

But I think of writing a real document about the basics of dynamic recompilation, and I also have some changes and additions for http://www.dynarec.com/ in mind.
The problem is that I should also rework my German BeOS FAQ, and probably even spend my precious free time on more entertaining activities... So much to do and so little time...
 

· King of Pain
Joined
·
491 Posts
Discussion Starter · #44 ·
Originally posted by n64warrior
My projects: learning how to do it now, and my xwindowsxp, which will start in a year or so - maybe less, maybe not. I will be using the xwindows source code to start the xwindowsxp project. As I said, right now I am learning how to make my own video plugin in Glide.
Good luck with that, but it still has nothing to do with dynamic recompilation at all.
 

· King of Pain
Joined
·
491 Posts
Discussion Starter · #52 ·
Since there still seems to be some interest in this topic and this forum thread is the third hit when you search for "dynamic recompilation" on Google, I guess I should add some more recent information.

First of all, dynarec.com is long dead. Neil Bradley didn't renew the domain registration, so instead it is used by one of those typical useless sites. Unfortunately all the material on there is gone too. I think I should have a backup of the FAQ somewhere, but it's not that impressive anymore.
I probably should combine the old FAQ and what I wrote here, make the text better and add some diagrams, but time is always an issue.

Meanwhile there are some interesting PDFs you might want to check out instead:

Study of the Techniques for Emulation Programming; like the name says, this document by Victor Moya covers a lot of topics around emulation, including binary translation. You might want to visit his page about emulation as well.

David Sharp has written a report about Tarmac, a dynamically recompiling ARM emulator, which was his university project. Parts of it were later used in the Archimedes/RiscPC emulator Red Squirrel. It's a bit older, but it should be interesting, especially for beginners.

A newer university report is Michael Steil's paper on Dynamic Re-compilation of Binary RISC Code for CISC Architectures.

I hope these resources provide a more in-depth view of this topic.
 

· Emu author
Joined
·
1,490 Posts
This new paper once again commits the same logical fallacy of applying the "90/10" rule to so-called "hotspot" recompilers. Even code that is executed 10% of the time will still quickly execute enough times to be recompiled. Comparing compile time to run time is usually comparing an O(1) problem to an O(n) problem, where the latter never wins asymptotically. Translating code that only executes once or a few times doesn't hurt you very much for the precise reason that this code only executes once or a few times, and that it tends to be clustered in between real in-flight sequences where the program is transitioning anyway. This is especially true for games.

Hotspot recompilers are useful for decreasing memory use, although there are other techniques for doing this, and on a modern OS/platform unused code blocks can eventually get swapped out anyway. They bring in the extra overhead of needing an interpreter in the first place, not to mention the storage for tracking block execution before the blocks are recompiled. I have yet to see solid empirical evidence that this approach gives an asymptotic performance advantage, or even a better experience for the user... the only comparison I've seen is of Java's first JIT vs HotSpot, which is pretty flawed given the number of unrelated improvements that have been made since then. It's also more of a big deal in the JVM because Javac does almost no optimization and relies on the JVM to do a lot of expensive optimizations that you mostly wouldn't want to do in a normal recompiler targeting normal binaries that have been optimized by normal compilers.
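
For concreteness, the bookkeeping such a dispatcher needs looks roughly like this - a minimal C++ sketch with made-up names, where translateBlock and interpretBlock stand in for a real dynarec and interpreter:

#include <cstdint>
#include <unordered_map>

typedef void (*TranslatedFn)();
TranslatedFn translateBlock(uint32_t guestPC); // the dynarec proper (hypothetical, provided elsewhere)
void interpretBlock(uint32_t guestPC); // the interpreter (hypothetical, provided elsewhere)

struct BlockInfo {
    uint32_t execCount = 0;
    TranslatedFn translated = nullptr; // stays null until the block counts as "hot"
};

const uint32_t kHotThreshold = 50; // arbitrary threshold
std::unordered_map<uint32_t, BlockInfo> blocks; // per-block counters, keyed by guest PC

void runBlock(uint32_t guestPC) {
    BlockInfo &b = blocks[guestPC];
    if (b.translated) { b.translated(); return; } // already compiled: just run it
    if (++b.execCount >= kHotThreshold) {
        b.translated = translateBlock(guestPC); // block got hot: compile it now
        b.translated();
    } else {
        interpretBlock(guestPC); // still cold: keep interpreting
    }
}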
 

· King of Pain
Joined
·
491 Posts
Discussion Starter · #54 ·
You're right, HotSpot techniques only save memory and maybe a little time, because they don't translate blocks that are only executed once. But since such blocks are most likely found in the initialization part of an application, it doesn't really matter that much whether pure interpretation is faster or not.

As for the use of two emulation techniques in one emulator, most dynarec cores do have an interpretive brother, pretty much due to the fact that most programmers don't start out with the dynarec right away and like to test the result of the translated code against the interpretation. In some cases it makes sense to leave the interpreter in the release version as an option for software that doesn't run with the dynarec too well (eg. due to lots of self-modifying code), but I wouldn't use them at the same time either.

The JVM is a pet peeve of mine. Sun sells HotSpot as a great advantage; I see it as a correction of a big mistake. If they want to optimize the code produced by the JIT, they need to know the boundaries of the basic blocks. Unfortunately they don't mark the beginnings of basic blocks in the bytecode (despite the fact that the compiler is aware of them), so they have to search for those beginnings first (the end of a basic block is any jump or branch, of course).
The other problem with the JVM is that it's a stack machine. Virtually all CPUs nowadays are register machines. If they were using a register based VM, the JIT would just have to map the virtual registers to real ones, with occasional remapping, if there aren't enough registers. Instead they are using a VM technology from 1963 and either don't do any register allocation at all, or have to perform an expensive graph-coloring algorithm during runtime, which is normally done during the compiler run.

BTW, FX!32, a tool that statically recompiled x86 code on the Alpha version of Windows NT, first interprets the code, gathers profiling information, and launches the translator after the first run has finished.
At first I thought that the profiler might gather a lot of information, but when I dug deeper to find more specific documentation, I found out that it's rather mundane. It only records three types of data:
1. Jump target addresses, i.e. basic blocks.
2. Unaligned memory accesses, because unlike on x86 they have to be handled specially on Alpha.
3. Register-indirect jumps, because they cannot be statically translated and you need to fall back to the interpreter for them.

So in the end, everything is about basic blocks when you try to perform an optimized translation.
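
To get a feel for how little that is, the whole per-run profile could be pictured as something like this (a hypothetical C++ shape for illustration, not FX!32's actual format):

#include <cstdint>
#include <vector>

// Hypothetical summary of what the profiler hands to the translator.
struct RunProfile {
    std::vector<uint32_t> jumpTargets; // basic block entry points discovered at run time
    std::vector<uint32_t> unalignedAccessSites; // instructions that need special handling on Alpha
    std::vector<uint32_t> indirectJumpSites; // sites that have to stay with the interpreter
};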
 

· Emu author
Joined
·
1,490 Posts
You're right, HotSpot techniques only save memory and maybe a little time, because they don't translate blocks that are only executed once. But since such blocks are most likely found in the initialization part of an application, it doesn't really matter that much whether pure interpretation is faster or not.
Yeah, good to see someone who agrees. It feels like a lot of dynarec papers are recycling the exact same material from the previous one, then adding new things. Not that that's so awful, but a few bad points keep being reiterated.

As for the use of two emulation techniques in one emulator, most dynarec cores do have an interpretive brother, pretty much due to the fact that most programmers don't start out with the dynarec right away and like to test the result of the translated code against the interpretation. In some cases it makes sense to leave the interpreter in the release version as an option for software that doesn't run with the dynarec too well (eg. due to lots of self-modifying code), but I wouldn't use them at the same time either.
Yeah, starting with an interpreter is definitely the way to go, and I have done so for at least one of the emulators I've written with a recompiler. Ideally the interpreter should never be superior in release code - even very routine self-modifying code can be dealt with better than straight interpretation most of the time, if you have a suitable strategy for dealing with it. On the other hand, if only very small regions are being modified (not entire blocks) you can black-list the opcodes and pass them through an interpreter, in which case it would make sense to have both active at the same time. Most games don't do things like this anymore, but I know I saw games on GBA that did.

Still, if you can manage to not include it in the release code it wouldn't hurt. Interpreters can be several thousand lines of code and I suppose several dozen KB.. which really isn't that much, but still. Probably nothing compared to any notable amount of recompiled code.

The JVM is a pet peeve of mine. Sun sells HotSpot as a great advantage; I see it as a correction of a big mistake. If they want to optimize the code produced by the JIT, they need to know the boundaries of the basic blocks. Unfortunately they don't mark the beginnings of basic blocks in the bytecode (despite the fact that the compiler is aware of them), so they have to search for those beginnings first (the end of a basic block is any jump or branch, of course).
I'm not that familiar with what the various JVM JIT strategies are like, at least not in detail. But it seems to me that JVM bytecode is going to be so structured and well behaved that you should be able to compile a method at a time. Nothing's going to jump into the middle of the method, ifs are only going to branch forward, loops are only going to branch backwards, their starts won't overlap, and breaks go to the same point where loops exit. This is especially true if the JVM marks these control structures clearly. It shouldn't have that much of a problem finding the exit points of a method, I would expect, since there has to be an instruction for that.

Then again, since you said beginnings of basic blocks aren't marked I'm going to guess that includes the beginnings of if statements and loops too. If so you're right, that is pretty silly.

The other problem with the JVM is that it's a stack machine. Virtually all CPUs nowadays are register machines. If they were using a register based VM, the JIT would just have to map the virtual registers to real ones, with occasional remapping, if there aren't enough registers. Instead they are using a VM technology from 1963 and either don't do any register allocation at all, or have to perform an expensive graph-coloring algorithm during runtime, which is normally done during the compiler run.
Being a stack machine makes it more compact, arguably. Of course, having to recompile in the first place throws that right out the window. Rather absurdly, a lot of embedded platforms don't even use JVM JIT and are relying on interpreters because of memory issues. So just where it needs the speed the most it doesn't get any. I don't think I have to tell you how little Java chip technologies caught on, despite being available in a lot of ARMs, which embedded platforms usually have.

And you're right, this means that JVM JIT is even more expensive than it has to be. But never mind the lack of register allocation (to be fair, they'd have to redo that at runtime regardless because it's 100% dependent on the number of registers available). I'm more disgusted with the outright lack of optimizations entirely. Javac is barely more than a tokenizer - okay, yeah, it does flatten down expressions into stack format but it really does surprisingly little. I hear that you can get back perfectly genuine Java code from bytecode most of the time, hence the availability of Java obfuscating programs. So not only is Java forcing a lot of expensive optimizations on the JVM but it's giving it less information to work with in the bytecode format. The whole thing is a mess, and it's a good reason not to compare JVM JIT to recompilers - aside from being stack based like nothing is, like you mentioned, JIT really is more like a compiler than a recompiler. And it has given people the misconception that recompilers need a lot of time to do a good job.

BTW, FX!32, a tool that statically recompiled x86 code on the Alpha version of Windows NT, first interprets the code, gathers profiling information, and launches the translator after the first run has finished.
At first I thought that the profiler might gather a lot of information, but when I dug deeper to find more specific documentation, I found out that it's rather mundane. It only records three types of data:
1. Jump target addresses, i.e. basic blocks.
2. Unaligned memory accesses, because unlike on x86 they have to be handled specially on Alpha.
3. Register-indirect jumps, because they cannot be statically translated and you need to fall back to the interpreter for them.

So in the end, everything is about basic blocks when you try to perform an optimized translation.
So what does that tell us? FX!32 may have been better off being fully dynamic. People argue a lot about static recompilers because they think that'll produce better code and provide advantages over dynarecs.

Unaligned accesses can cause interrupts on most platforms where they're not supported outright, Alpha is no exception. Could have easily relied on this and patched the code at runtime as they occurred, or only after many of them occur from the same location.

One argument I see a lot is that static is preferable because you just have to load recompiled code from disk. Actually, it's entirely possible that loading code from disk will be slower than recompiling it (if you're in a scenario where you have to load the original no matter what, which is usually the case).

One other thing - people often assume that dynarec means basic block at a time. In reality, a dynarec can go as recursively deep as you want (my GBA dynarec will follow all unresolved branch targets, and I've never seen this take too much time). Once you do this you can do some pretty good global-ish optimizations without actually tracking a lot of additional data or adding very much additional time. For instance, if you track the liveness state of registers/flags at the beginning of blocks then you:

- scan a block, tracking all dependencies and modifications and branch targets as you go
- resolve all branch targets and recompile those blocks if they don't exist
- merge in liveness information from freshly recompiled blocks and perform liveness analysis

This is especially useful for dynamic register allocation, where you won't have to write back registers if they're dead at the start of the linked in block. This will for instance occur when calling functions that are modifying caller-save registers. It's also useful for dead-flag elimination, for instance if you have the following sequence

reg = reg - 1; //sets flags
if(reg != 0) goto label;
...

If it happens that all other flags are dead after the fallthrough it means only the zero flag would be live here.

The only problem is dealing with recursive links. I.e., if one of the blocks you're going to recompile links to the block you just recompiled. You have to avoid the infinite recursion, obviously, but if you haven't finished recompiling the dependent blocks you'll have no way of knowing where to link to. I think a simple solution is to just keep a list of these unresolved links and go in and patch them in a second pass once your recompilation has exited the top level of the recursion, so you kind of end up with a compilation and linking stage, only the compilation does as much linking as it can directly. Oh, and if the recursive case happens you have to use really naive/pessimistic liveness info, but that's not a big deal (I mean, just call all the registers/flags live, or maybe do an early guess to determine what definitely isn't).
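
A rough C++ sketch of that compile-then-patch idea, with every helper name invented and the actual code emission left out:

#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct Block { void *hostCode; }; // a liveness summary would live here too

// Helpers a real dynarec would provide (all hypothetical):
void *emitBlock(uint32_t pc); // scan the guest block and emit host code
std::vector<uint32_t> branchTargetsOf(uint32_t pc); // branch targets collected during the scan
void *linkSiteFor(Block &from, uint32_t target); // address of the jump that needs patching
void linkDirectly(Block &from, uint32_t target, Block &to); // patch the link right away
void patchJump(void *site, void *hostDest); // patch a recorded link later

std::unordered_map<uint32_t, Block> cache; // guest PC -> translated block
std::unordered_set<uint32_t> inProgress; // guards against infinite recursion
struct Fixup { void *site; uint32_t targetPC; };
std::vector<Fixup> fixups; // links we couldn't resolve during the recursion

Block &translate(uint32_t pc) {
    auto it = cache.find(pc);
    if (it != cache.end()) return it->second; // already translated

    inProgress.insert(pc);
    Block &blk = cache[pc];
    blk.hostCode = emitBlock(pc);

    for (uint32_t target : branchTargetsOf(pc)) {
        if (inProgress.count(target)) {
            // Recursive case: assume pessimistic liveness and patch the link later.
            fixups.push_back({linkSiteFor(blk, target), target});
        } else {
            Block &dest = translate(target); // recurse; merge in liveness info here
            linkDirectly(blk, target, dest);
        }
    }
    inProgress.erase(pc);
    return blk;
}

void resolveFixups() { // the second, "linking" pass after the top-level call returns
    for (const Fixup &f : fixups)
        patchJump(f.site, cache[f.targetPC].hostCode);
    fixups.clear();
}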
 

· King of Pain
Joined
·
491 Posts
Discussion Starter · #56 ·
Yeah, good to see someone who agrees. It feels like a lot of dynarec papers are recycling the exact same material from the previous one, then adding new things. Not that that's so awful, but a few bad points keep being reiterated.
It's like always, everyone is quoting everyone else. Remember when it was revealed that the amount of iron in spinach is much lower than originally thought? Someone made an error during the calculation and everyone else was quoting the result without rechecking it. People are just lazy, and that's also true for me. How often have I thought that I should write a better introduction to dynamic recompilation, but it was just too tempting to spend my spare time on less intellectual topics.

At least I try to keep an open mind. There is no pure right or wrong with this topic. That's why I tried to cover so many different options when I initially started this thread. For example, I like dynamic register allocation, but if someone proves to me that in his case static allocation is faster, then why should I argue against it? You make a dynarec because you want the speed, so speed almost always wins.
HotSpot is probably one of the few cases where a slight speed increase doesn't really justify including an additional emulation engine.

BTW, I have to admit that I only browsed through Michael Steil's work, but wanted to include it because it at least looks interesting and it's good to provide different views.
I do know David and Victor, though. I'd known David for several years before his project, and Neil Bradley started his dynarec mailing-list (not the Yahoo one) because he wanted to have a discussion with Victor and me at the same time. Victor would be the first to admit that his English isn't perfect, but I think his knowledge and ideas are so good that it's easy to look past that.

On the other hand, if only very small regions are being modified (not entire blocks) you can black-list the opcodes and pass them through an interpreter, in which case it would make sense to have both active at the same time. Most games don't do things like this anymore, but I know I saw games on GBA that did.
Self-modifying code is mostly something from the "old ages". Today, with split caches, it's rather bad for performance on most processors. A lot of 8-bit software seems to do it, but I don't really see the point of doing a dynarec for an 8-bit system. Even ten years ago practically every half-decent computer was capable of running an 8-bit system at full speed in an interpreter. All those 8-bit dynarecs (I know of at least two or three) might be fun, but they aren't really worth the hassle. Not to mention that those systems often needed cycle-exact emulation, which is much easier to achieve with interpretation anyway.
I guess nowadays self-modifying code should be the biggest problem on embedded/handheld systems that hold the code on a media type where "execute in place" is possible (namely ROM or NOR flash) but that is probably slower than RAM.
The GBA comes to mind, because it has 16-bit ROM access, but 32-bit work RAM. I've never programmed the GBA, but I would execute Thumb code from the ROM directly and copy the few ARM routines to the WRAM for speed. Otherwise you'd need two memory accesses for each ARM instruction, which practically nullifies the speed advantage.
I think I read a blog entry or version information about a GBA emulator, where self-modifying code was a problem in at least one case. Given your posting here, I wonder if that emulator could have been yours.

Of course, unless the system has lots of RAM and/or runs only one application, almost every modern system replaces code in RAM every now and then. But due to memory protection on those systems, code and data are typically in different memory pages. That makes it quite easy to decide when to remove some code from the translation cache: if a page that originally held code is written to, it's most likely that the code in that page was modified, so better recompile it.
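
As a sketch (hypothetical names, assuming 4 KB guest pages), the write-side check is about as simple as it sounds:

#include <cstdint>
#include <unordered_set>

const uint32_t kPageShift = 12; // assuming 4 KB pages

void invalidateTranslationsInPage(uint32_t page); // drops cached blocks for that page (hypothetical, provided by the block cache)

std::unordered_set<uint32_t> pagesWithCode; // pages we have translated code from

void onGuestWrite(uint32_t address) {
    uint32_t page = address >> kPageShift;
    if (pagesWithCode.count(page)) {
        invalidateTranslationsInPage(page); // the code in this page was most likely modified
        pagesWithCode.erase(page); // it gets registered again when it is retranslated
    }
}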

Still, if you can manage to not include it in the release code it wouldn't hurt. Interpreters can be several thousand lines of code and I suppose several dozen KB.. which really isn't that much, but still. Probably nothing compared to any notable amount of recompiled code.
Most likely true. Very few programs should have an initialization section with the complexity of an interpretive CPU emulator.

I'm not that familiar with what the various JVM JIT strategies are like, at least not in detail.
I have to admit that my information is pretty outdated by now, too. But I guess that due to the all-important compatibility those parts of the JVM most likely haven't really changed.

Then again, since you said beginnings of basic blocks aren't marked I'm going to guess that includes the beginnings of if statements and loops too. If so you're right, that is pretty silly.
I believe that's the case. But I have nothing against being proven wrong.


Being a stack machine makes it more compact, arguably.
Yes, the code density of stack architectures is better, simply because operations don't need to encode their operands; they're simply on the stack.
But almost all newer VMs have switched to register architectures for speed, like the Lua VM or Parrot.

Rather absurdly, a lot of embedded platforms don't even use JVM JIT and are relying on interpreters because of memory issues. So just where it needs the speed the most it doesn't get any. I don't think I have to tell you how little Java chip technologies caught on, despite being available in a lot of ARMs, which embedded platforms usually have.
I do know of Jazelle, of course. But I'm not sure how fast it is compared to a JIT. From what I've read, some of the more complex opcodes have to be interpreted in software anyway.
Although I do actually work with embedded systems, I haven't had the need for Java yet. We still program all our controllers in C. Well, AVRs or PICs aren't really the best target for Java, and all the 32-bit systems run Linux.

But never mind the lack of register allocation (to be fair, they'd have to redo that at runtime regardless because it's 100% dependent on the number of registers available).
I see register allocation from a register VM to a real processor like page replacement: allocate as many registers as you can by simply mapping the next virtual register to the next free hardware register. That will probably cover a large share of the translation units. If you run out of hardware registers, use an LRU or cheaper second-chance algorithm to replace one of the already mapped registers. What works for page replacement in virtual memory could also work for dynamic register allocation.
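
A minimal sketch of such a second-chance allocator, assuming six usable host registers and leaving the actual spill/reload code emission out (all names made up):

const int kHostRegs = 6; // e.g. roughly what 32-bit x86 leaves you after fixed uses

struct Slot { int virtReg; bool referenced; }; // referenced = the "second chance" bit

Slot hostReg[kHostRegs] = { {-1, false}, {-1, false}, {-1, false},
                            {-1, false}, {-1, false}, {-1, false} };
int clockHand = 0;

void spillToMemory(int virtReg, int hostIdx); // emits the write-back (hypothetical, provided elsewhere)

int allocate(int virtReg) { // returns the index of the host register holding virtReg
    for (int i = 0; i < kHostRegs; ++i) // 1. already mapped?
        if (hostReg[i].virtReg == virtReg) { hostReg[i].referenced = true; return i; }
    for (int i = 0; i < kHostRegs; ++i) // 2. a free host register left?
        if (hostReg[i].virtReg < 0) { hostReg[i].virtReg = virtReg; hostReg[i].referenced = true; return i; }
    for (;;) { // 3. second chance: evict, skipping recently used slots once
        Slot &s = hostReg[clockHand];
        int idx = clockHand;
        clockHand = (clockHand + 1) % kHostRegs;
        if (s.referenced) { s.referenced = false; continue; }
        spillToMemory(s.virtReg, idx); // write the evicted virtual register back
        s.virtReg = virtReg;
        s.referenced = true;
        return idx;
    }
}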

I'm more disgusted with the outright lack of optimizations entirely. Javac is barely more than a tokenizer - okay, yeah, it does flatten down expressions into stack format but it really does surprisingly little. I hear that you can get back perfectly genuine Java code from bytecode most of the time, hence the availability of Java obfuscating programs.
Indeed, when you follow the stack progression, you can rebuild the original syntax tree, which is more or less the original code. All that is missing is names and comments.
This is a flaw of all stack-based machines, so Microsoft's .NET is no different here. While they did a few things differently, they also stuck with a stack VM. Sun probably didn't know that they would be using a JIT compiler on the bytecode, but Microsoft knew from the start and should have learnt from Sun's errors, so shame on Microsoft.

So not only is Java forcing a lot of expensive optimizations on the JVM but it's giving it less information to work with in the bytecode format. The whole thing is a mess, and it's a good reason not to compare JVM JIT to recompilers - aside from being stack based like nothing is, like you mentioned, JIT really is more like a compiler than a recompiler. And it has given people the misconception that recompilers need a lot of time to do a good job.
The JVM is a bit of a mess anyway, from low-level operations to high-level object-related stuff you won't see in a normal processor. On the other hand, a JVM won't have those timing issues or problems with trying to optimize condition code handling.
In a dynarec, sometimes less is more. I'm a bit of a perfectionist (probably also one of the reasons why I don't release much stuff to the public), so I told David Sharp about all sorts of optimizations he could use. I think in the end he used very few, because he found that trying too hard to optimize everything took more time than the speed-up was worth.

So what does that tell us? FX!32 may have been better off being fully dynamic. People argue a lot about static recompilers because they think that'll produce better code and provide advantages over dynarecs.
That's what I thought, too, until I read what the profiler actually does (i.e. not much).
The other argument is that they have most of the code pre-compiled in a database. But media access is much slower than RAM, so the question is whether it wouldn't be faster to simply translate the code dynamically than to search a database for the right code segment and then load it. But that's more or less what you wrote further down as well.
I don't really see the point, also because you have to include the interpreter anyway, since there are problems that cannot be solved during static recompilation.
I think there was a statically recompiling N64 emulator once (Corn?), but it just compiled all the code it found when the game was loaded. Yes, it was fast, but it never ran anything but Mario 64. And I bet there would be lots of code discovery problems with other ROMs.

Unaligned accesses can cause interrupts on most platforms where they're not supported outright, Alpha is no exception. Could have easily relied on this and patched the code at runtime as they occurred, or only after many of them occur from the same location.
But interrupts are expensive, and you have to attach your emulator quite deep in the system. Maybe they didn't want to do it.

One other thing - people often assume that dynarec means basic block at a time.
True, there is no theoretical limit to how big your translation units can be.
You could actually see the N64 emulator above as a dynarec instead, except that the translation unit was the whole code in the ROM.
I guess I'd normally prefer basic blocks, since you probably don't translate the same code twice, and you often return to the dispatcher loop, which makes timing easier. Of course the latter part makes the dynarec slower than it could be. There are always advantages and disadvantages.
Of course you could also resort to chaining basic blocks, like Shade does, although in that case your register allocation advantage might not work.
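
For reference, the dispatcher loop is basically just this (a bare-bones C++ sketch with invented names; chaining would patch translated blocks to jump to each other directly and skip this loop in the common case):

#include <cstdint>
#include <unordered_map>

typedef uint32_t (*BlockFn)(); // runs one translated block, returns the next guest PC

BlockFn translateBlock(uint32_t pc); // the dynarec proper (hypothetical, provided elsewhere)

std::unordered_map<uint32_t, BlockFn> blockCache;

void run(uint32_t pc) {
    for (;;) {
        BlockFn &fn = blockCache[pc]; // look the block up, creating an empty slot on a miss
        if (!fn) fn = translateBlock(pc); // translate on first use
        pc = fn(); // returning here after every block is what makes the timing easy
    }
}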

The only problem is dealing with recursive links.
Yes, that could render an emulator rather unresponsive ;-)
For those who don't know the problem: To understand recursion, you have to understand recursion first...

Overall an interesting discussion, BTW.

I forgot to mention a book:
Virtual Machines by Jim Smith and Ravi Nair. While the bulk is probably about VMs like VMware, a whole chapter is on emulation techniques (interpretation, threaded code, binary translation), and other chapters also cover some problems related to dynamic binary translation.
 

· Emu author
Joined
·
1,490 Posts
Hey serge. I was helping zenogais with dynarecs back before he wrote the tutorial, and I thought that he used code at some point that was based on some I wrote for him. But it looks like he used stuff from GoldFinger so I'm not really sure what he did with what I gave him ;p

Anyway, what immediately strikes me as strange is that "register" is used as an identifier. Is that really okay in C++? I know C would have a fit since that's a keyword.

What do you mean when you say that the ModRM doesn't match the written function? I think to best understand this you need a thorough understanding of how x86 encoding works, which this document doesn't explain for you. Basically, the mod/rm byte in an instruction describes the registers. This page will help you:

sandpile.org -- IA-32 architecture -- 32bit mod R/M byte

The byte contains three fields: mod, r/m, and reg. Mod describes the addressing mode. r/m refers to either a register or a memory location, depending on the mode - it can also refer to an extended mode. What reg refers to depends on the instruction: the list at the top of the link gives the possibilities, which include 8 different sets of registers. Another thing it can hold, which is probably less obvious, is a sub-operation - this is the case with ALU operations against immediates. There the immediate is encoded separately, so those bits are free and are used to determine what kind of ALU operation it is.

There are two examples of these extensions: if mod is 0, then the encoding for EBP (5) means to use [displacement32] instead of [EBP]. For modes 0 through 2, the encoding for ESP (4) means to use [SIB], where what that means is determined by another byte, the SIB (scale-index-base) byte. This is what the enums near the top of zenogais's code are for, and they explain what those values of rm mean.
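
If it helps, the bit layout itself is simple enough to write down. A minimal emitter sketch (not zenogais's code): mod lives in bits 7-6, reg (or the sub-opcode) in bits 5-3, and r/m in bits 2-0.

#include <cstdint>

// Builds a ModRM byte from its three fields: mod (2 bits), reg (3 bits), r/m (3 bits).
uint8_t MakeModRM(uint8_t mod, uint8_t reg, uint8_t rm) {
    return (uint8_t)(((mod & 3) << 6) | ((reg & 7) << 3) | (rm & 7));
}

// Special cases in 32-bit addressing:
//   mod == 0 && rm == 5 (the EBP slot)  ->  [disp32] follows instead of [EBP]
//   mod != 3 && rm == 4 (the ESP slot)  ->  a SIB (scale-index-base) byte follows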

I do think that, like you said, there are errors. It appears that the code swaps the r/m and reg fields: the parameter he calls register goes into the real r/m field (the lower 3 bits), and the parameter he calls rm ends up in the real reg field (the middle 3 bits). However, this is really only a naming error: the swap is consistent, so the errors basically cancel out, leaving him with code that probably worked. I didn't see any other logical errors in what he's doing.

Could you please elaborate a bit more on what's giving you difficulty?
 

· Registered
Joined
·
111 Posts
C++ has a fit too, I had to change that.

the function call doesn't match the header.

in a few places it makes calls like this
ModRM(0, to, DISP32)

where DISP32=5 and to is an X86RegisterType

but the function is written
void ModRM(unsigned char mod, unsigned char rm, X86RegisterType reg);

so I'm a bit confused. Thought maybe it was just a typo, but it doesn't work even if I call ModRM(0, DISP32,to).

so I'm not really sure where the problem is.

Guess I start reading that page now.
 

· Emu author
Joined
·
1,490 Posts
When you say it doesn't work what do you mean exactly? That it doesn't produce the correct results or that it doesn't even compile?

X86RegisterType is an enum. In C you would be able to implicitly convert an integer to it, but in C++ you cannot. Maybe zenogais was adapting C code and tripped over it. He should have actually tested it, to say the least. As I explained to you, the code got the rm and reg fields swapped with each other, but it was consistent in how it output things. If you want to fix it you need to swap the names AND you need to swap how they're output by ModRM.

You could change X86RegisterType to unsigned char like the others. Casting from enums to integral types is no problem in C++.

But it sounds like you'd be better off just chucking this tutorial and figuring it out yourself. You're trying to fix something you don't really understand fundamentally and I don't see an awful lot of value in that.
 