
Dynamic Recompilation - An Introduction

In another thread sniff_381 asked me to explain what dynamic recompilation (dynarec for short) is.
Since most NG emus have a dynarec now, I thought it might be a good idea to cover the topic in a separate thread.

I have to admit that I haven't programmed a dynarec myself yet, but I have decent knowledge of the basics and some details. I'm not totally sure how much I should go into depth anyway. I guess I'll see if there is enough interest to talk about details.

First of all, the term "dynamic recompilation" is a bit odd, because "to recompile" usually means compiling the source code of a program again. The older term "binary translation" is more precise, since it is the binary code of a game or application that is translated, not the source code. "Dynamic" only means that this is done at runtime and on demand.

So what's the difference to "traditional" or "interpretive" emulation?
An interpretive emulator always picks the instruction the program counter (PC) points to, decodes it, and executes it, just like a real processor would. So every time the emulator comes across the same instruction it has to do all the steps again.
In his article How To Write a Computer Emulator, Marat Fayzullin uses this pseudo C-code sequence to describe the process:

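for( ;; )
{
    OpCode = Memory[PC++];   /* fetch the instruction the PC points to */

    switch(OpCode)           /* decode and execute it */
    {
        case OpCode1:
        case OpCode2:
        ...
    }

    /* Check for interrupts and do other */
    /* cyclic tasks here                 */
    if(ExitRequired) break;
}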
Dynamic recompilation deviates from this procedure in two ways: it works with whole blocks of code instead of single instructions, and it translates those blocks into the machine language of the processor the emulator is running on. There wouldn't be a speed advantage if the translated blocks weren't cached and simply recalled as soon as the program counter enters that block again.
Here is some sample code from my (unfortunately still unfinished) DRFAQ:
/* the following line defines a 'function pointer',               */
/* which can be used to call the code generated by the translator */
/* CTX is the context of the processor, ie. the register values   */

int (*dyncode)(Context *CTX);

/* the following simplified loop is often called the "dispatcher" */

for( ;; ) {

  /* try to find the current address of the PC in the translation cache */
  address = block_translated(CTX->PC);

  /* nothing found, ie. first translate the code block starting at the PC address */
  if (address == NULL)
    /* do the translation and add it to the translation cache */                                      
    address = translate_block(CTX->PC);

  /* point the function pointer to the address of the translated code */
  dyncode = (int(*)(Context*)) address;

  /* call the translated code with the current context */
  status = (*dyncode)(CTX);

  /* handle interrupts and other events here */
}
That's basically how a dynarec works, although I still haven't explained how the translation cache and, of course, the translation itself are handled.

I spoke of code blocks several times, and it might be a good idea to define the term, since not all will be into compiler theory...
In compilers the smallest block of cohesive instructions is called a basic block. Such a block has a starting point and ends with the next conditional jump or branch, ie. the block ends as soon as there is a possibility that the program counter changes to anything other than the next instruction. It's also important that no other code block can jump into the middle of the basic block, only to its starting address, because only then can the compiler treat it as a separate collection of code that can be optimized in every possible way.
Most dynarecs probably work with basic blocks, but some end the block only at the next unconditional jump or branch, which leads to larger blocks, often called translation units. This leads to faster code, because all conditional branches can jump within the translated code without having to go through the dispatcher loop first, but it can be problematic to handle interrupts since you don't have a guarantee that the code returns to the dispatcher. Of course that could be handled within the generated code, but that makes things more complicated.
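The two ways of ending a block can be sketched in a few lines of C. This is only an illustration under assumptions: the instruction classes and the names InsnClass, BlockPolicy and block_length are made up for the example, the "decoder" is reduced to a pre-classified array, and the sketch does not check whether some other code jumps into the middle of the block.

```c
#include <stddef.h>

/* Hypothetical classification of already decoded guest instructions. */
typedef enum { INSN_OTHER, INSN_COND_BRANCH, INSN_UNCOND_JUMP } InsnClass;

/* END_AT_ANY_BRANCH gives basic blocks, END_AT_UNCOND_JUMP gives the
   larger "translation units". */
typedef enum { END_AT_ANY_BRANCH, END_AT_UNCOND_JUMP } BlockPolicy;

/* Returns the number of instructions in the block that starts at index
   start; the terminating branch is counted as part of the block. */
size_t block_length(const InsnClass *code, size_t count, size_t start,
                    BlockPolicy policy)
{
    size_t i = start;
    while (i < count) {
        InsnClass c = code[i++];
        if (c == INSN_UNCOND_JUMP)
            break;                      /* ends a block under both policies */
        if (c == INSN_COND_BRANCH && policy == END_AT_ANY_BRANCH)
            break;                      /* ends a basic block only */
    }
    return i - start;
}
```

With the same input, the second policy simply runs past the conditional branch and keeps going until the unconditional jump.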

I think that's enough for an introduction. If there are any questions feel free to ask, and if there is interest in extending the parts I haven't covered yet, I could write something about the translation cache, some translation problems, register allocation, the difference to threaded interpretation, etc.
21 - 40 of 86 Posts
You might laugh, but I already thought of writing something book-like about that topic, but it's always a time issue and I doubt that anybody would want to buy or even publish it.
If I should be insane enough to write a larger text about it, I'd probably distribute it as a PDF.
Translation Caching

Our next topic is the translation cache. But I won't go into detail about how the generated code blocks are actually stored and freed, since I haven't given it too much thought yet. For performance reasons it is most likely a good idea to allocate a large block of memory via the operating system and then do your own arena management in that chunk.

The most interesting thing is how to remember which block is already translated and how to do that fast.
For 8-bit processors which normally have a 16-bit address range it would be easy to simply make an array of addresses and mark each single address when it is the start address of a recompiled block. But for those processors where dynamic recompilation is really interesting, this simple approach does not work because a 32-bit address range would need far too large an array, even if you take into account that most of the processors only permit aligned instructions. So other, less memory consuming methods have to be found.

When you come straight from computer science studies the most obvious solution would be a hash. That means a value is mapped through a special hash function (often containing a modulo operation to define the confines) onto a much smaller array. The advantage is that in the best case you only perform the simple function and get the key to the memory location where you find the address of the recompiled code with one memory access (or a zero if it hasn't been recompiled yet). The problem is that the hash function is likely to generate the same key for different values, which happens more often when the hash array is too small and/or the hash function isn't that good. When this happens you get a so-called collision, which has to be solved somehow. The typical solution is to keep a linked list of all values that collided, but that means that you need several lookups since you'll have to search the list in linear order until you find what you want.
So hashing is a possible solution, but not necessarily the best one.
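For illustration, a hashed lookup with linked-list collision handling could look roughly like the following sketch. The table size, the hash function and the names hash_lookup and hash_insert are assumptions made up for this example, not code from any particular emulator.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define HASH_SIZE 1024u   /* assumed table size */

typedef struct HashEntry {
    uint32_t guest_pc;          /* start address of the original block   */
    void *host_code;            /* address of the translated code        */
    struct HashEntry *next;     /* next entry that collided in this slot */
} HashEntry;

static HashEntry *hash_table[HASH_SIZE];

/* Drop the two alignment bits, then use a modulo to stay inside the table. */
static unsigned hash_pc(uint32_t pc)
{
    return (pc >> 2) % HASH_SIZE;
}

/* Returns the translated code for pc, or NULL if it isn't cached yet. */
void *hash_lookup(uint32_t pc)
{
    for (HashEntry *e = hash_table[hash_pc(pc)]; e != NULL; e = e->next)
        if (e->guest_pc == pc)
            return e->host_code;
    return NULL;
}

/* Registers a translated block; returns 0 on success, -1 if out of memory. */
int hash_insert(uint32_t pc, void *host_code)
{
    HashEntry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->guest_pc = pc;
    e->host_code = host_code;
    e->next = hash_table[hash_pc(pc)];   /* chain in front of older entries */
    hash_table[hash_pc(pc)] = e;
    return 0;
}
```

Addresses 0x1000 and 0x2000 end up in the same slot with this hash function, so looking up the older of the two already costs a second step through the list.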

Another problem that might have to be solved during the translation cache lookup is to detect self-modifying code. If the code behaves nicely you can skip that, but if self-modifying code occurs from time to time you'll have to detect it.
The typical computer science solution would likely be to run some kind of checksum (eg. CRC) over the original code, and regenerate the checksum every time the block is about to be executed. If the checksum has changed the block has to be translated again.
The problem here is that checksums can be fooled when several values (in this case instruction encodings) change but the change is not visible in the checksum. Also, recalculating the checksum before every run of the code should hit performance hard.
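A small sketch of the checksum idea, including the way it can be fooled. The additive checksum is deliberately weak to make the point (a CRC would be fooled far less easily, but not never), and CachedBlock, block_remember and block_is_stale are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint32_t *guest_code;  /* original instructions of the block  */
    size_t length;               /* number of 32-bit instructions       */
    uint32_t checksum;           /* checksum taken at translation time  */
} CachedBlock;

/* Deliberately simple additive checksum over the original code. */
uint32_t block_checksum(const uint32_t *code, size_t length)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < length; i++)
        sum += code[i];
    return sum;
}

/* Call once after the block has been translated. */
void block_remember(CachedBlock *b, const uint32_t *code, size_t length)
{
    b->guest_code = code;
    b->length = length;
    b->checksum = block_checksum(code, length);
}

/* Call before every execution; nonzero means "translate again".
   Note that this walks the whole block every single time. */
int block_is_stale(const CachedBlock *b)
{
    return block_checksum(b->guest_code, b->length) != b->checksum;
}
```

Changing one instruction word is detected, but swapping two words of the block leaves the sum untouched, so block_is_stale would report the block as unchanged.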

Since traditional solutions aren't perfect let's see what alternatives there are...

A dirty trick that UltraHLE is said to be using works as follows:
Instead of using a data structure to memorize which blocks have been translated, the first instruction of the block is replaced by an illegal instruction that also contains the offset to the generated code - since MIPS has 32-bit instructions that's quite possible. So the emulator just takes a look at the start of the block to recognize if it has been translated already or if it has to be translated still. A side-effect is that code that modifies the first instruction of the block leads to the block being translated again automatically.
The disadvantage is that self-modifying code is only detected when the illegal instruction is replaced, and since you modify the original code you might run into problems when that block is actually just a sub-block of a larger block that might run into that illegal instruction. This could be handled of course, when the original instruction is stored somewhere, but it makes things a bit more complicated.
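As a rough sketch of the trick, assuming a MIPS-like guest with a 6-bit primary opcode in the top bits: the marker opcode value 0x1F, the 26-bit offset layout and the function names are assumptions for this example only, not what UltraHLE actually does.

```c
#include <stdint.h>

#define MARKER_OPCODE 0x1Fu         /* assumed to be unused by the guest CPU */
#define OFFSET_MASK   0x03FFFFFFu   /* low 26 bits carry the cache offset    */

/* Overwrites the first instruction of a block with the marker and returns
   the original word, so the caller can store it somewhere if needed. */
uint32_t mark_translated(uint32_t *guest_word, uint32_t cache_offset)
{
    uint32_t original = *guest_word;
    *guest_word = (MARKER_OPCODE << 26) | (cache_offset & OFFSET_MASK);
    return original;
}

/* Returns 1 and stores the cache offset if the word is a marker, else 0. */
int is_translated(uint32_t guest_word, uint32_t *cache_offset)
{
    if ((guest_word >> 26) != MARKER_OPCODE)
        return 0;
    *cache_offset = guest_word & OFFSET_MASK;
    return 1;
}
```

If the game overwrites the marked word with a normal instruction, is_translated fails for it and the block is translated again, which is the automatic side-effect described above.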

The elegant solution would be a paged translation map. When you don't know what to do in emulation, it often helps to take a look at how the hardware does things.
Most of the processors that are interesting candidates for dynamic recompilation organize their memory in pages, ie. the higher part of the address is the page number and the lower part is the page offset. The typical page size is 4K, which means that in a 32-bit address space you'd have 20 bits for the page number and 12 bits for the page offset. Even if you want to keep track of all possible pages (which isn't normally necessary) you'd need 1M entries (= 20-bit page number) x 4 bytes (size of an address on a 32-bit system) = 4MB, which might sound like much but actually keeps track of all pages in a 4GB address range. Now you still need 4K entries x 4 bytes = 16KB per page to have all the addresses, but you only need to keep track of pages that actually contain code, so that's far less than you might assume, and when you have a processor like MIPS where all instructions have to be aligned to 32 bits (ie. the start address of each instruction has the lower two bits cleared) it's only 1K entries x 4 bytes = 4KB per page.
When the emulation jumps to a certain address you first look at the page (by shifting the address right by 12 bits) to see if code in that page has already been translated. If there is no translated code yet you allocate a new memory location that is large enough to hold all addresses for that page, enter a pointer to that area in the page number entry, and finally enter the address of the translated code in the location of the page offset at which the original code block starts.
The lookup needs just two memory accesses, one to find the location of the page directory via the page number and another to look up the address of the translated code via the page offset in the page directory.
I hope this doesn't sound too complicated, because it really isn't...
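A minimal sketch of such a paged translation map, assuming 4K pages, 32-bit guest addresses and 4-byte aligned instructions as on MIPS. The names transmap_lookup and transmap_insert are made up for the example, and the first-level table is simply a static array.

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT     12u                          /* 4K pages                   */
#define PAGE_COUNT     (1u << (32u - PAGE_SHIFT))   /* 2^20 page number entries   */
#define SLOTS_PER_PAGE (1u << (PAGE_SHIFT - 2u))    /* 1024 aligned instructions  */

/* First level: one pointer per page, NULL while the page has no
   translated code. Second level: one code address per instruction slot. */
static void **page_table[PAGE_COUNT];

/* Two memory accesses: page number entry, then the slot inside the page. */
void *transmap_lookup(uint32_t pc)
{
    void **slots = page_table[pc >> PAGE_SHIFT];
    if (slots == NULL)
        return NULL;
    return slots[(pc & 0xFFFu) >> 2];
}

/* Enters the address of translated code; returns 0 on success, -1 on
   allocation failure. The per-page array is created on first use. */
int transmap_insert(uint32_t pc, void *host_code)
{
    void ***entry = &page_table[pc >> PAGE_SHIFT];
    if (*entry == NULL) {
        *entry = calloc(SLOTS_PER_PAGE, sizeof(void *));
        if (*entry == NULL)
            return -1;
    }
    (*entry)[(pc & 0xFFFu) >> 2] = host_code;
    return 0;
}
```

In the dispatcher this would take the place of block_translated(): a NULL result means the block starting at that address still has to be translated and entered with transmap_insert.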

The paged transmap also allows for a solution to identify self-modifying code. Every time a write access is performed it is checked if the page number for that access indicates that code on that page has been translated. If so, the cached code for that page is freed and the address in the page number entry cleared to force a recompilation of the code. This might sound crude, but in paged environments data and code are normally on different pages, so it is really likely that the code has been modified.
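The write check could be sketched like this. To keep the example small it assumes a toy guest with only 64K of memory, and all names (guest_write8, page_add_code, page_has_code) are hypothetical; freeing the generated code itself is only hinted at in a comment.

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT 12u
#define RAM_SIZE   (1u << 16)                  /* toy 64K guest memory (assumption) */
#define PAGE_COUNT (RAM_SIZE >> PAGE_SHIFT)
#define SLOTS      (1u << (PAGE_SHIFT - 2u))   /* one slot per aligned instruction  */

static uint8_t guest_ram[RAM_SIZE];
static void **code_pages[PAGE_COUNT];          /* NULL = no translated code on page */

/* Nonzero if the page containing addr holds translated code. */
int page_has_code(uint16_t addr)
{
    return code_pages[addr >> PAGE_SHIFT] != NULL;
}

/* Records a translated block starting at addr; returns 0 on success. */
int page_add_code(uint16_t addr, void *host_code)
{
    void ***entry = &code_pages[addr >> PAGE_SHIFT];
    if (*entry == NULL && (*entry = calloc(SLOTS, sizeof(void *))) == NULL)
        return -1;
    (*entry)[(addr & 0xFFFu) >> 2] = host_code;
    return 0;
}

/* Guest store handler: do the write, then drop the page's translations. */
void guest_write8(uint16_t addr, uint8_t value)
{
    void ***entry = &code_pages[addr >> PAGE_SHIFT];
    guest_ram[addr] = value;
    if (*entry != NULL) {
        /* a real dynarec would also free the generated code itself here */
        free(*entry);
        *entry = NULL;   /* forces a recompilation on the next lookup */
    }
}

uint8_t guest_read8(uint16_t addr)
{
    return guest_ram[addr];
}
```

A write to a page without translated code costs only the one extra check, which is why the method is cheap as long as code and data really live on different pages.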

Since I was talking about Harvard architectures (ie. split instruction and data caches) before, there is another way to detect self-modifying code in that case. Since those architectures have to flush their caches after code has been modified, you can try to trace that (either some system call or the processor operation) and free the appropriate code that is no longer valid. According to an article, the official 68K emulator for PowerMacs uses that solution.
Re: Translation Caching

Originally posted by M.I.K.e7
I hope this doesn't sound too complicated, because it really isn't.
Don't get me wrong, but it's really a good read. :) Makes me sound a LOT smarter when I read it aloud. :D

But are you sure you aren't involved in emu coding?
Originally posted by Shiori
Don't get me wrong, but it's really a good read. :)
Well, cute little girls like the one in the picture shouldn't try to understand that anyway, maybe when she's a bit older ;-)

Makes me sound a LOT smarter when I read it aloud. :D
Haven't thought about that yet. Maybe I should try too? ;-)

But are you sure you aren't involved in emu coding?
So far I haven't worked on a single emulator, but I've done a lot of reading and had some discussions with Neil Bradley or Bart Trzynadlowski.
I started writing a dynarec some time ago, but stalled the project because I didn't like the way it turned out, and thought it might be a good idea to study some theories and rethink some of the design issues before I try it again.
So far I'm still in the theory phase, since I'm still not sure if I know enough about the topic, but sometimes I feel that I give it more thought than some programmers who actually wrote a complete dynarec.
Translation without Babelfish

Who actually waited for the next article yesterday? ;)
Sorry about the delay, but I had to make up my mind what to cover next.

I think it's time to tackle the most difficult part of a dynarec, the translation.
You should already know that a block of the original code is translated to a block of host machine code.
To join the generated code with the emulator you also need some glue code to set up registers and write back the values after the block has been executed. For static register allocation you have something you could call a master block, since it does all the setup and cleaning for all generated blocks. With dynamic register allocation on the other hand you need a prologue and epilogue for each generated block, as register allocation can be different from block to block.

In the following postings I'll discuss different translation methods...
Direct Translation

Using direct translation the code block is processed linearly and each instruction is translated separately. Often the code that generates the translation is placed directly after the decoding of the instruction, ie. the instruction decoder looks very much like the one from an interpretive emulator, with the exception that machine code is generated instead of the instruction being simulated.

The advantage of that method is that you can simply transform an interpretive emulator into a dynamic recompiler.

But there are many disadvantages.
First of all, optimizing the code is very hard because you can only reasonably do a one or two instruction lookahead to test if the current instruction could be combined with the following ones for a better translation.
Retargeting the dynarec to a different host processor can be quite tedious, since you'll have to go through the whole decoder loop, which should actually be portable.

Unfortunately I see code like that much too often in real-world examples.
The VIP Disaster

Once I was past the direct translation stage - I was heading in that direction and I didn't like it, which was one of the reasons why I stalled my dynarec and started research again - I thought of making the dynamic recompiler more portable.

Wouldn't it be cool to have some virtual intermediate processor (VIP, ie. the original code is translated to VIP "instructions", which are then translated to target instructions) and enable dynamic recompilation between a whole lot of processors without too many translators (just two per processor, one that translates to VIP while the other translates from VIP)?

I thought so, and since I didn't know of the failed UNCOL project (they tried to define some kind of universal machine language and it didn't work out, but I guess I wouldn't have cared anyway) I started analyzing maybe dozens of processor architectures, and after several months I knew much more about several architectures, but I gave up on the VIP idea.

It is true that basically all that processors do is calculate, but there are surprisingly many differences...
Just take the number of logical instructions, where some architectures have just the standard ones while others have a whole lot of combinations. Or the fact that some architectures handle the carry flag differently during subtraction (ie. they borrow) while others don't have any flags at all.
The biggest difference is probably division, where some architectures also calculate the remainder, others have a separate instruction to calculate the remainder, and some require you to calculate the remainder via multiplication. Also some architectures produce results only in special registers, do division steps (ie. only calculate a certain number of bits of the result per instruction), or don't have a division instruction at all.
If you also add strange instructions like the "add and branch" from PA-RISC you get a whole lot of different instructions.

With the VIP there would be two extremes: either you end up having all possible instructions from all architectures you know (and there will still be many you don't know), or you make it very simple and compose translations for more complex instructions with a sequence of VIP instructions. Either way it's bad:
When you make the VIP too complex porting will be very tedious since you'd have to provide translations for hundreds and hundreds of VIP instructions for each new host processor and no one will want to do that, which nullifies the whole idea behind using the VIP in the first place.
If you make it simple, then it will be a charm to port, but the quality of the target code will be very bad, as it is very hard to optimize code when even very simple original instructions end up as a long sequence in VIP. For example, many processors have different condition flags and some don't have any, so those couldn't be handled directly in a simple VIP; you'd have to translate them into separate VIP instructions, which could transform a very simple ADD instruction into a long VIP sequence.
Finding a compromise between these extremes will be very hard and most likely result in narrowing the number of supported architectures, which again ruins the idea of using a VIP.

So that isn't really a good idea either...

Ok, since the first two approaches to translation aren't that recommended, what else could be done?
In my opinion, the translator should make it easy to optimize translations and should be not too closely linked to the instruction decode to make porting easier.

The best solution is probably to generate a block decode structure, ie. as the instructions of a block are decoded the decode information is added to a structure (you might even add some additional information about register and flag use), which is then handed to the translator that is a totally different module.

This method is relatively fast, since you don't do several translations as with the VIP, and lookaheads are much easier in a pre-decoded block structure than in direct translation, which also makes peephole optimization much easier.

You still have to write a special translator for each host system, but since the translator is a separate module that communicates with the decoder via a data structure porting is a much cleaner process. Not to mention that you are able to do very specific optimization, which probably would not be possible if you were using a stronger abstraction.
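To make the idea of a block decode structure a bit more concrete, here is a sketch under assumptions: the field names, the tiny set of guest operations and the flag analysis are invented for illustration. The decoder would fill the structure, and a separate pass marks which flag results the translator really has to produce.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BLOCK_INSNS 64   /* arbitrary limit for this sketch */

typedef enum { OP_ADD, OP_SUB, OP_COND_BRANCH } OpKind;   /* hypothetical guest ops */

typedef struct {
    OpKind  kind;
    uint8_t dst, src1, src2;  /* guest register numbers                         */
    uint8_t sets_flags;       /* instruction writes the condition flags         */
    uint8_t reads_flags;      /* instruction reads the condition flags          */
    uint8_t flags_needed;     /* filled in below: must the flags be computed?   */
} DecodedInsn;

typedef struct {
    uint32_t    guest_pc;     /* start address of the original block */
    size_t      count;        /* number of decoded instructions      */
    DecodedInsn insn[MAX_BLOCK_INSNS];
} DecodedBlock;

/* Walks the block backwards: flags written by an instruction are only
   needed if something reads them before they are overwritten again. */
void mark_needed_flags(DecodedBlock *b)
{
    int live = 1;   /* conservatively assume the next block reads the flags */
    for (size_t i = b->count; i-- > 0; ) {
        DecodedInsn *in = &b->insn[i];
        if (in->sets_flags) {
            in->flags_needed = (uint8_t)live;
            live = 0;
        }
        if (in->reads_flags)
            live = 1;
    }
}
```

For a block of ADD, SUB and a conditional branch, only the SUB ends up with flags_needed set, so the translator for the host can leave out the flag handling for the ADD. That kind of lookahead is awkward in direct translation but trivial on a pre-decoded block.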
The ultimate tutorial for conceptual dynarec programming!!! Good job Mike! :)

YES, it will be great if you put these articles together in PDF and contribute it to emulation programming sites like "how to emulation" in!
I wouldn't call it the ultimate tutorial, because you'll find other sources with much more practical information, simply because of my lack of experience in that field.
But you'd probably have to search hard to find a single source that discusses more of the different possible techniques, since I had some time to play with these possibilities in my mind.
Some of those possibilities are easy to reject of course, but others might have their use for some special cases, even if they don't seem to be ideal from a general point of view. That's why I think it's important to know many of the techniques to have alternatives if the best general approach doesn't suit the problem well.
Code Generation

After some days without new information I think I should talk a little about code generation, since I don't want to dive into the platform specific translations yet.

One possibility that is used by a few dynarecs is preassembled routines, ie. the covers (the code that represents a certain operation, in our case an instruction of the original code) are written in assembly (although I know at least one example where they are hacked in hex code) and translated during the compilation of the emulator.
Those covers contain only placeholders where register references, addresses, or immediate values should be. After the whole cover has been copied to the target memory location the placeholders have to be patched with the appropriate values. Due to this fact the range of possible instruction formats is somewhat limited unless you want to make the whole process too complicated.
Obviously this method isn't that flexible and also won't lead to optimized code, as you have to translate each instruction all by itself instead of being able to combine two or three instructions to make a faster semantic translation of the code sequence.
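A sketch of the copy-and-patch step for such a cover, assuming an ARM host and a cover with exactly one placeholder. The Cover structure and emit_cover are hypothetical names; the single preassembled word is ADD R0, R0, #0 with the 8-bit immediate field left open.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    const uint32_t *words;   /* preassembled instruction words            */
    size_t count;            /* number of words in the cover              */
    size_t patch_index;      /* which word contains the placeholder       */
    uint32_t patch_mask;     /* bits of that word reserved for the value  */
} Cover;

/* ARM: ADD R0, R0, #0 ; the 8-bit immediate field is the placeholder. */
static const uint32_t add_imm_words[] = { 0xE2800000u };
const Cover cover_add_imm = { add_imm_words, 1, 0, 0x000000FFu };

/* Copies the cover to dest, patches the placeholder with value and
   returns the position where the next cover should be written. */
uint32_t *emit_cover(uint32_t *dest, const Cover *c, uint32_t value)
{
    memcpy(dest, c->words, c->count * sizeof(uint32_t));
    dest[c->patch_index] =
        (dest[c->patch_index] & ~c->patch_mask) | (value & c->patch_mask);
    return dest + c->count;
}
```

Every additional kind of placeholder (register fields, addresses split over several words) needs more patch information per cover, which is where the limitation mentioned above comes from.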

Other methods typically use code emitter functions for covers. Those functions directly write executable instructions to a memory location. The difference here is how readable the code is. Often it's like this:
	emit_4byte(0xE2800000 | 1);  /* ADD R0, R0, #1 */
	emit_4byte(0xE1A0F00E);      /* MOV PC, LR     */
Some readers might recognize the ARM example code I was using for the demonstration of the dynamic function call. I only changed it to show how values like the immediate #1 in the ADD instruction tend to be added to such code.
Of course this is already more flexible than preassembled code, but it isn't very readable and even harder to edit. Mind you that you won't even find the comments that tell you which instruction is generated in some example dynarecs I found.

To improve the last method you should use either functions or macros to make the code emitters more readable and also easier to use, which might look like this in the end:
	emit_ADDI(REG0, REG0, 1);  /* immediate addition */
	emit_MOV(REG_PC, REG_LR);  /* return             */
With a few dozen code emitters like these you should be able to program and edit covers quite easily after some practice, and it is clear what they do even without the comments.
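The emitters from the example could be implemented roughly as follows for an ARM host. This is a sketch with assumptions: emit_begin and the global write pointer are invented, only the always-executed condition and plain 8-bit immediates are handled, and making the buffer executable is left out completely.

```c
#include <stdint.h>

enum { REG0 = 0, REG_LR = 14, REG_PC = 15 };

static uint32_t *emit_ptr;   /* where the next instruction word goes */

/* Hypothetical helper: point the emitters at a code buffer. */
void emit_begin(uint32_t *buffer)
{
    emit_ptr = buffer;
}

void emit_4byte(uint32_t word)
{
    *emit_ptr++ = word;
}

/* ADD Rd, Rn, #imm8 (condition "always", rotate field 0) */
void emit_ADDI(unsigned rd, unsigned rn, unsigned imm8)
{
    emit_4byte(0xE2800000u | ((rn & 0xFu) << 16) | ((rd & 0xFu) << 12)
               | (imm8 & 0xFFu));
}

/* MOV Rd, Rm (condition "always", no shift) */
void emit_MOV(unsigned rd, unsigned rm)
{
    emit_4byte(0xE1A00000u | ((rd & 0xFu) << 12) | (rm & 0xFu));
}
```

Calling emit_ADDI(REG0, REG0, 1) followed by emit_MOV(REG_PC, REG_LR) writes the same two words as the raw emit_4byte calls shown earlier, 0xE2800001 and 0xE1A0F00E.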

This is probably one of the last basics I can explain without using too much assembly. So are there any things you don't understand fully yet and that should be better explained before I continue?

What I have left now are topics about timing issues, specific translation problems, and code optimization, but those are very specific in most cases and maybe not that interesting for most. After that I'll probably have to pass or just answer questions, because I don't really have too much knowledge of more in-depth stuff to continue. I didn't plan to fully explain a dynamic recompiler anyway ;)
Compared to my Knowledge this is In-Depth already ;) . Very good read, this should definitly be a good Emu-Coder resource.

Not really a Question on the Dynarec Stuff, but do You know a table of the X86-instructions( & maybe SSE/ 3dnow/ MMX ) and their Hex-Codes?
ok.than how whould i simulate a chip like the RCP which is 2 or more cpu's or gpu's or a cpu and a gpu in 1 chip with fixed instructions and formats?? thats what i realy need to know..
Originally posted by n64warrior
ok.than how whould i simulate a chip like the RCP which is 2 or more cpu's or gpu's or a cpu and a gpu in 1 chip with fixed instructions and formats?? thats what i realy need to know..
You should have a Idea how to emulate a System or how it works or be able( or atleast TRY) to reverse it, if not JUST GIVE UP DAMMIT :rolleyes:
What do You want? someone who writes a Emulator for You if you ask the right Questions?

Hope this thread aint going down now :mad: ...

PS. Found the Hexcodes in a PDF Document at Intel`s
i am not asking anyone to write an emulator for me . i just want to learn as munch infor. about emulation, so i can get tooie working.and to make a some ideas on how to emulate it.but still my main goal is get real good at programing software for cross that lunix and apple software can run on windows on a Pentium2 at good speed.and i also want to to make a xwindowsxp that whould be an open source project.but thats all later.for now i just want get good at n64 emulation.and at least make one good video plugin.
Wow,so much projetcs :lol: .
Originally posted by N-Rage
Compared to my Knowledge this is In-Depth already ;) . Very good read, this should definitly be a good Emu-Coder resource.
Everything is relative of course. I still feel that I'm in need of a lot of research.

Not really a Question on the Dynarec Stuff, but do You know a table of the X86-instructions( & maybe SSE/ 3dnow/ MMX ) and their Hex-Codes?
Since the encoding of the x86 instructions is rather complicated, I doubt that there is a simple table for these.
Just a small example: "MOV EAX, EBX" and "MOV AX, BX" have the same encoding; the correct instruction can only be identified by the context, ie. which mode (32-bit or 16-bit) the processor is currently in. If you mean the other instruction you have to put the prefix byte hex 66 before the encoded instruction.
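As a small illustration of that example, here is a sketch of an emitter for a register-to-register MOV using opcode hex 89. The function name and its parameters are made up; the register numbers follow the usual x86 numbering (EAX/AX = 0, EBX/BX = 3).

```c
#include <stddef.h>
#include <stdint.h>

/* Writes "MOV dst, src" for two registers and returns the number of bytes.
   With toggle_operand_size set, the prefix 0x66 selects the operand size
   that is NOT the default of the current mode. */
size_t emit_mov_reg(uint8_t *out, unsigned dst, unsigned src,
                    int toggle_operand_size)
{
    size_t n = 0;
    if (toggle_operand_size)
        out[n++] = 0x66;    /* operand-size override prefix */
    out[n++] = 0x89;        /* MOV r/m, r */
    /* ModRM byte: register-direct mode, reg = source, r/m = destination */
    out[n++] = (uint8_t)(0xC0u | ((src & 7u) << 3) | (dst & 7u));
    return n;
}
```

In 32-bit mode the bytes 89 D8 mean MOV EAX, EBX and 66 89 D8 mean MOV AX, BX; in 16-bit mode it is exactly the other way round.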

I would have directed you to the official Intel Instruction Set Reference, but it seems that you found that already.

I guess AMD will have something similar to find out more about 3DNow!, but I'm not totally sure.
Originally posted by n64warrior
ok.than how whould i simulate a chip like the RCP which is 2 or more cpu's or gpu's or a cpu and a gpu in 1 chip with fixed instructions and formats?? thats what i realy need to know..
First of all, I meant this to be a general discussion about dynamic recompilation and I want to keep it clean from references to specific systems, which is why I won't turn this into an N64 emulation discussion. If you want to know more about that you'd better open a new discussion thread and/or look for answers here:

Secondly, according to my information the RCP isn't a normal CPU but rather a video/audio chip combination, where dynamic recompilation doesn't make sense because there isn't any program code you could translate and cache.
You might try to cache display lists and their effect, but unlike program code blocks display lists aren't necessarily linked to specific addresses, so recognizing an already encountered display list will be slow, and not only due to this fact I assume that you won't get any speed increase with such a technique.

If you simply mean the synchronization of the CPU and the rest of the system, then I'll cover that topic in "Timing Issues" a bit later...

i am not asking anyone to write an emulator for me . i just want to learn as munch infor. about emulation, so i can get tooie working.and to make a some ideas on how to emulate it.
I guess you're just one of those guys who are pissed off that the only N64 emulator that is fast enough on your system to be playable is UltraHLE, but that uses lots and lots of dirty tricks to be able to run the games that fast at the expense of compatibility, which means that only few titles work.
Now you probably think that all the other N64 emulator authors just suck, and you'd be able to do a lightning fast emulator with top notch compatibility all by yourself, while you don't have enough knowledge to do that and the main problem simply is that your PC is too slow.

but still my main goal is get real good at programing software for cross that lunix and apple software can run on windows on a Pentium2 at good speed.and i also want to to make a xwindowsxp that whould be an open source project.but thats all later.
Good luck, but I guess you won't finish any of these projects in the next 5 or 10 years...
I had the idea of writing a VDM (Virtual DOS Machine) for BeOS to be able to run simple DOS tools under my favourite operating system. Apart from the use of the V86 mode of the processor this should be almost trivial compared to what you have on your mind, and I am still at the research stage after I thought of the VDM several months ago, and I'll probably drop the whole idea because of more important things I could do.

for now i just want get good at n64 emulation.and at least make one good video plugin.
It's certainly not the right thread to discuss that here...
Timing Issues

When it comes to handling multiple chip functions in emulation one might think that with all the multi-tasking operating systems the different functions could run in parallel as separate threads. But it will be very hard to synchronize the timing between the parts, which will either lead to weird effects or the emulation won't work well at all. The only thing that can probably run in a separate thread is the user interface IMO, where the system can buffer user input for later use.

So how does synchronization work in an emulator?
A traditional emulator "executes" one instruction after another, then checks the clock cycles it would have taken on the real machine. This is done for two reasons: first of all, when you know how much time it takes to perform the instruction on the real processor you can adjust the timing of the emulator so that it runs just as fast as the real system, and secondly, after a certain amount of clock cycles are processed it is time to refresh the display, output some sound, etc. The amount of clock cycles that have to pass before the emulator has to process the other tasks totally depends on the emulated system, so I won't go into detail here.
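A toy interpreter loop with cycle counting might look like the sketch below. The three-instruction CPU, the cycle table and the value of CYCLES_PER_FRAME are all invented for the example; a real system defines its own numbers.

```c
#include <stdint.h>

#define CYCLES_PER_FRAME 100   /* hypothetical; depends on the emulated system */

enum { OP_NOP = 0, OP_INC = 1, OP_HALT = 2 };   /* made-up toy instruction set */

static const int cycle_table[] = { 2, 3, 1 };   /* cycles per opcode on the "real" CPU */

typedef struct {
    uint16_t pc;
    uint8_t acc;
    int cycles_left;          /* cycles until the other tasks are due */
    unsigned frames;          /* counts how often those tasks ran     */
    const uint8_t *memory;
} Cpu;

void run(Cpu *cpu)
{
    for (;;) {
        uint8_t op = cpu->memory[cpu->pc++];
        if (op > OP_HALT)
            return;                          /* unknown opcode: stop the toy CPU */
        cpu->cycles_left -= cycle_table[op]; /* account for the real duration */

        switch (op) {
        case OP_NOP:  break;
        case OP_INC:  cpu->acc++; break;
        case OP_HALT: return;
        }

        if (cpu->cycles_left <= 0) {
            cpu->frames++;                   /* refresh display, output sound, ... */
            cpu->cycles_left += CYCLES_PER_FRAME;
        }
    }
}
```

Adding CYCLES_PER_FRAME instead of resetting the counter keeps the few cycles that overshot the limit, so the error doesn't accumulate over time.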

In some rare cases, when the emulation has to be very exact, a so-called single-cycle emulation is used. In that case not the whole instruction is emulated at once, but only a small part of it during each clock cycle, to get the timing of the emulator as close to the original system as possible. This is very slow and complicated of course, and the total opposite of dynamic recompilation, which is meant to be fast and not totally exact. Single-cycle emulation is used by some C64 emulators, for example.

How does synchronization work in a dynamic recompiler?
Basically it's the same way, only that you execute blocks instead of single instructions or even single cycles, but you still count the clock cycles the same block would have needed on the real machine to synchronize the timing of the CPU emulation with the remaining emulation.
Depending on how exact the emulation has to be, some blocks might spend too many clock cycles. If that happens you'll need to find out how many clock cycles can pass without running into a problem, and basically split the block into smaller parts during translation to stay below that cycle limit. This means that you can perform fewer optimizations because the blocks are smaller, and when you work with dynamic register allocation you have to generate prologue/epilogue code for each smaller block.

The most extreme solution that Neil Bradley came up with, if you should need very exact timing when using a dynarec, is that you basically have one original instruction per block. To keep the overhead low you should only do it with static register allocation, or you'll end up with more prologue/epilogue code than actual translations.
But instead of returning to the dispatcher loop as would be normal with translated blocks, you decrease a cycle counter and only jump to the exit code when the counter becomes negative. It looks a bit like this in x86 code:
	; translated instruction here
	sub ecx, cycles
	js  exitcode_1234
	; next translated instruction
You need separate exit code for each instruction, but when you use static register allocation this won't be more than writing the current program counter address of the emulated system to an appropriate variable, so that the emulator knows where the dynarec left the block.
When you also use the paged translation map you can address each translated instruction individually, so you don't have blocks anymore but linear code that can be left at any moment when it is time to do so.
One of the advantages is that you can actually make branches inside the generated code, because even when it would be an unlimited loop the execution thread would leave the block after a certain amount of clock cycles.
Since you'd most likely still translate block for block instead of the whole executable code (which is even impossible to find when you have indirect jumps), there are cases when you translate a block in the dispatcher that another block just branched to. In that case you could patch the previous block not to leave to the dispatcher but jump to the newly translated block directly.
This whole idea - I hope I made it clear - should provide the best timing when using a dynamic recompiler, and it even has the advantage that it only returns to the dispatcher loop of the emulator when it has to due to clock cycles being "used up", but otherwise it runs totally on generated code.
No light without shadow, of course. The disadvantages are that you have to work with static register allocation (although some might not see that as a flaw) because the memory overhead for dynamic register allocation would be too big, and since you translate the instructions out of context you cannot use any optimization at all.
Use this technique if you really need instruction-exact timing, otherwise it might be a bit extreme.
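The "sub ecx, cycles / js exitcode" pair from the x86 snippet could be generated by an emitter like the following sketch. The function name and the use of buffer offsets instead of real addresses are assumptions for the example; it only writes the bytes and does not execute anything.

```c
#include <stddef.h>
#include <stdint.h>

/* Stores a 32-bit value in little-endian order, as x86 expects. */
static void put32(uint8_t *out, uint32_t v)
{
    out[0] = (uint8_t)v;
    out[1] = (uint8_t)(v >> 8);
    out[2] = (uint8_t)(v >> 16);
    out[3] = (uint8_t)(v >> 24);
}

/* Appends "sub ecx, cycles" and "js exit" at offset pos of the code
   buffer. exit_pos is the buffer offset of this instruction's exit code.
   Returns the offset just past the emitted bytes. */
size_t emit_cycle_check(uint8_t *code, size_t pos, uint32_t cycles,
                        size_t exit_pos)
{
    code[pos++] = 0x81;            /* SUB r/m32, imm32             */
    code[pos++] = 0xE9;            /* ModRM: register-direct, ECX  */
    put32(code + pos, cycles);
    pos += 4;

    code[pos++] = 0x0F;            /* JS rel32                     */
    code[pos++] = 0x88;
    /* the displacement is relative to the end of the jump instruction */
    uint32_t rel = (uint32_t)(exit_pos - (pos + 4));
    put32(code + pos, rel);
    pos += 4;
    return pos;
}
```

The pair costs 12 bytes per translated instruction with these encodings, on top of the separate exit code, which gives an idea of the size overhead of this technique.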