Next Generation Emulation banner
1 - 20 of 86 Posts

· King of Pain
Joined
·
491 Posts
Discussion Starter · #1 ·
In another thread sniff_381 asked me to explain what dynamic recompilation (dynarec for short) is.
Since most NG emus have a dynarec now, I thought it might be a good idea to cover the topic in a separate thread.

I have to admit that I haven't programmed a dynarec myself yet, but I have decent knowledge of the basics and some details. I'm not totally sure how much I should go into depth anyway. I guess I'll see if there is enough interest to talk about details.

First of all, the term "dynamic recompilation" is a bit odd, because "to recompile" often means to compile the source code of a program again, but the older term "binary translation" is more precise, since the binary code of a game or application is translated - not the source code - and "dynamic" only means that it is done during runtime and on demand.

So what's the difference to "traditional" or "interpretive" emulation?
An interpretive emulator always picks the instruction the program counter (PC) points to, decodes it, and executes it, just like a real processor would do it. So every time the emulator comes across it same instruction it has to do all the steps again.
In his article How To Write a Computer Emulator (http://fms.komkon.org/EMUL8/HOWTO.html) Marat Fayzullin uses this pseudo C-code sequence to describe the process:
Code:
Counter=InterruptPeriod;
PC=InitialPC;

for( ;; )
{
  OpCode=Memory[PC++];
  Counter-=Cycles[OpCode];

  switch(OpCode)
  {
    case OpCode1:
    case OpCode2:
    ...
  }

  if(Counter<=0)
  {
    /* Check for interrupts and do other */
    /* cyclic tasks here                 */
    ...
    Counter+=InterruptPeriod;
    if(ExitRequired) break;
  }
}
Dynamic recompilation deviates from this procedure by working with whole blocks of code instead of single instructions, and that those blocks are translated into the machine language of the processor the emulator is running on. There wouldn't be a speed advantage if the translated blocks weren't cached and simply recalled as soon as the program counter enters that block again.
Here is some sample code from my (unfortunately still unfinished) DRFAQ (http://www.dynarec.com/~mike/drfaq.html):
Code:
/* the following line defines a 'function pointer',               */
/* which can be used to call the code generated by the translator */
/* CTX is the context of the processor, ie. the register values   */

int (*dyncode)(Context *CTX);


/* the following simplyfied loop is often called the "dispatcher" */

for( ;; ) {

  /* try to find the current address of the PC in the translation cache */
  address = block_translated(CTX->PC);

  /* nothing found, ie. first translate the code block starting at the PC address */
  if (address == NULL)
    /* do the translation and add it to the translation cache */                                      
    address = translate_block(CTX->PC);

  /* point the function pointer to the address of the translated code */
  dyncode = (int(*)(Context*)) address;

  /* call the translated code with the current context */
  status = (*dyncode)(CTX);

  /* handle interrupts and other events here */                                 
}
That's basically how a dynarec works, only that I still haven't explained how the translation cache and of course the translation are handled.

I spoke of code blocks several times, and it might be a good idea to define the term, since not all will be into compiler theory...
In compilers the smallest block of cohesive instructions is called a basic block. Such a block has a starting point and ends with the next conditional jump or branch, ie. as soon as there is a possibility that the program counter changes apart from pointing to the next instruction the block ends. It's also important that no other code block can jump into the middle of the basic block, only at the starting address, because only that way the compiler can see it as a separate collection of code that can be optimized in every possible way.
Most dynarecs probably work with basic blocks, but some use end the block with the next unconditional jump or branch, which leads to larger blocks, often called translation units. This leads to faster code, because all conditional branches can jump in the translated code without having to go through the dispatcher loop first, but it can be problematic to handle interrupts since you don't have a guarantee that the code returns to the dispatcher. Of course that could be handled within the generated code, but that makes things more complicated.

I think that's enough for an introduction. If there are any questions feel free to ask, and if there is interest in extending the parts I haven't covered yet, I could write something about the translation cache, some translation problems, register allocation, the difference to threaded interpretation, etc.
 

· King of Pain
Joined
·
491 Posts
Discussion Starter · #3 ·
I have to admit that I also had to look up fuction pointers in K&R because I simply didn't need them before I thought of how to call dynamically generated code.
But using function pointers is much cleaner than some assembly hacks I've seen in real dynarecs.

I forgot it today, but tomorrow I might post a nice little example that explains how you can call generated code with a function pointer.
 

· King of Pain
Joined
·
491 Posts
Discussion Starter · #5 ·
Calling dynamically generated code

As announced yesterday I'll provide a simple example now that shows how to call dynamically generated code. For some this might be even more intimidating, but those who looked up how function pointers work in C and have a slight understanding of x86 assembly it should make clear how that calling process works.

Code:
/* In the beginning we'll have to define the function pointer.  */
/* I called the function 'dyncode' and gave it an int argument  */
/* as well as an int return value just to show what's possible. */

int (*dyncode)(int);  /* prototype for call of dynamic code */

/* The following char array is initialized with some binary code */
/* which takes the first argument from the stack, increases it,  */
/* and returns to the caller.                                    */
/* Just very simple code for testing purposes...                 */

unsigned char code[] = {0x8B,0x44,0x24,0x04,  /* mov eax, [esp+4] */
                        0x40,                 /* inc eax          */
                        0xC3                  /* ret              */
                       };


/* Include the prototypes of the functions we are using... */

#include < stdio.h >


int main(void)
{
  /* To show you that the code can be dynamically generated    */
  /* although I defined static data above, the code is copied  */
  /* into an allocated memory area and the starting address is */
  /* assigned to the function pointer 'dyncode'.               */
  /* The strange stuff in front of the malloc is just to cast  */
  /* the address to the same format the function pointer is    */
  /* definded with, otherwise you'd get a compiler warning.    */

  dyncode = (int (*)(int)) malloc(sizeof(code));
  memcpy(dyncode, code, sizeof(code));

  /* To show that the code works it is called with the argument 41 */
  /* and the retval sould be 42, obviously.                        */

  printf("retval = %d\n", (*dyncode)(41) );  /* call the code and print the return value */  

  return 0;
}
This code has been written with GCC in mind, but it should work with any C compiler on any x86 operating system that passes function arguments on the stack.

I originally wrote this example with some ARM machine code instead of x86, and all that I had to change was the definition of the code[] array.
That's the nice thing about working with a function pointer to call dynamic code, apart from the generated code everything else is totally portable to any system with a C compiler.

A warning to those working with harvard architecture processors (ie. those with split instruction and data caches):
After copying the code and before calling it you'll have to flush the caches, otherwise the code will be in the data cache but not in the instruction cache and the processor will get into trouble.

While x86 processors nowadays have split L1 caches as well it's not a problem on these because they solve such issues in hardware due to transpartent compatibility with x86 processors that still had a unified cache.

Ok, so much for dynamically generated code...
Anyone still with me?
 

· playing FGO
Joined
·
10,942 Posts
Finally, a decent, intelligent thread.... the first for this day, i think. :)
 

· King of Pain
Joined
·
491 Posts
Discussion Starter · #8 ·
@Shiori:

Someone has to improve the niveau ;-)

@fivefeet8:

I'm not that much into C++ but I think it should be possible to do a dynarec in C++ as well.
IIRC David Sharp did his tARMac project (http://www.dynarec.com/~dave/tarmac/index.html) in C++.

But no matter what a C++ enthusiast will tell you, the language has a certain overhead compared to C, and when you need to use a dynarec to emulate a system at full speed you probably don't want that overhead as well. There is a reason why most emulators are written in plain ANSI C.
 

· King of Pain
Joined
·
491 Posts
Discussion Starter · #9 · (Edited)
When is using a dynarec reasonable?

Those who are still following this thread probably noticed that dynamic recompilation is far from being trivial, and I haven't even touched the more complex issues yet.

This leads to the ultimate question: When does it make sense to use dynamic recompilation anyway?

The advantages of dynamic recompilation are:
  • more speed
  • more speed (nothing else really)

The disadvantages of dynamic recompilation are:
  • quite complicated
  • hard to debug
  • not as exact as interpretive emulation
  • not portable to systems with other processors
  • problems with self-modifying code

This means, as long as you can pull the whole emulation off at a decent speed by using traditional emulation (perferable even a portable solution), just do it and don't give dynamic recompilation a second thought.
Although I've seen people toying with dynarecs for 6502 and similar 8-bit processors it's not worth the hassle, since a nice CPU core written in C would be portable to different systems and should run at full speed on any current and even most older computers.

Even most 16-bit processors should be tried in interpretive emulation before thinking of dynamic recompilation. One of the few reasonable 16-bit candidates would be the 68000, because it is widely used and quite complex, so a dynarec for it might speed up a lot of emulators if you stick to the same API.

Where dynamic recompilation really shines is 32-bit and 64-bit processors, because it makes sense to do operations on hardware registers when the original code does so. Especially the MIPS (used in PSX and N64, eg.) and SuperH (Saturn eg.) processors with their simple instruction set should be emulated via dynamic recompilation to get a decent speed.

One thing to keep in mind is that an emulator with a dynarec needs a lot of RAM, because it not only needs at least the same amount of memory as a traditional emulator but additionally also memory for the translation cache, ie. the code blocks that have already been translated.

Eventually, it's a good idea to start with interpretive emulation to see if that's fast enough and switch to dynamic recompilation when it isn't. During the switching process it's a good idea to keep both CPU emulations to test the dynarec against the interpreter which should make debugging a little easier.
 

· King of Pain
Joined
·
491 Posts
Discussion Starter · #10 ·
Threaded Interpreter - The compromise

Some emulators that claim to use dynamic recompilation actually utilize a technique called threaded interpretation, eg. Generator does that (http://www.squish.net/generator/docs.html).

How does threaded interpretation differ from dynamic recompilation and what do they have in common?

Both techniques work on code blocks and "translate" these into some other representation. This means that both share the disadvantage of needing more memory than a traditional emulator and having problems when the already translated code should be changed by the translation (keyword: self-modifying code).

But instead of translating to code threaded interpretation fills the translation cache with addresses to the instruction emulation routines instead, ie. each instruction found in the source binary will be translated to an address and parameters that point to a piece of code in the emulator that emulates this instruction.

The only thing that you spare compared to a traditional emulator so far is the repetive decoding of all instructions in a block. But threaded emulation can take one further step. Due to having to analyze a whole block of code you can find out which condition flags need to be calculated for an instruction, ie. if a certain flag is overwritten by the side-effect of a following instruction before it can be tested or taken as input by another it doesn't have to be calculated. Since calculation of condition flags often can need more than half the time to emulate an instruction this approach can lead to a noticable speed improvement.
The emulator Generator mentioned above has two different emulation functions for each single instruction, one that calculates all flags and another that doesn't calculate any flags. The address of one of these functions will be added to the translation cache as appropriate.

The advantage of threaded emulation is that it can be portable when it is programmed in a high-level lamguage (like C) and works with function pointers.
The disadvantage is that you cannot access hardware registers directly (unless the instruction routines are written in assembly and you are using static register allocation, but then it wouldn't be portable and you could also use dynamic recompilation), so you still need to access the register file in memory every time an instruction reads or alters a register.

I think that should be enough about threaded interpreting...
Maybe I should cover register allocation next, since I already mentioned it here.
 

· King of Pain
Joined
·
491 Posts
Discussion Starter · #14 ·
Intermediate Summary

Ok, maybe I should sum up some of the stuff here that no one can complain about too much information ;-)

Dynamic recompilation also known as dynamic binary translation is the process of translating binary code blocks during runtime into the binary code of the host machine.

The translated code is collected in a translation cache.

In a dispatcher loop the dynarec decides if certain code blocks still have to be translated and eventually calls the translated code.

The generated code is ideally called using a function pointer.
Here is an example how the function pointer code above has to be changed to get the same result on an ARM processor based machine:

Code:
unsigned long code[] = {0xE2800001,  /* ADD R0, R0, #1 */
                        0xE1A0F00E   /* MOV PC, LR     */
                       };
As you can see only the code[] array (that would be the generated code in a real dynarec) has to be changed. The switch from "unsigned char" to "unsigned long" has only been made due to the fact that ARM has 32-bit fixed lenght instructions, but since we cast it to the function pointer later there is no difference.

Of course the code generator has to be different on each processor, but it makes sense to make everything else portable, thus you only have to write a new code generator but not a totally new emulator.

Mind you that an ARM processor with separate instruction and data caches (eg. the StrongARM) needs its caches to be flushed before the code can be called, but that's operating system specific and I won't go into that detail here.
 

· King of Pain
Joined
·
491 Posts
Discussion Starter · #17 ·
Which emulators use dynarecs?

Every decent PSX and N64 emulator should be using a dynarec today, and ePSE surely does. I'm not totally sure about Bleem, but I guess it had one too.

There is a list of open source emulators with dynarecs on Dynarec.com (http://www.dynarec.com/dynarecs.html).
FPSE used to be on the list too, but then it went closed source and the old source was rather outdated, so I removed it. I guess I'll add GPSE to the list someday.

As I said in some previous posting those dynarecs for 6502 are merely toys and there isn't much need for them. The really interesting things are the dynarecs for processors like MIPS that would need a monster machine to run at full speed using interpretive emulation.
 

· King of Pain
Joined
·
491 Posts
Discussion Starter · #18 ·
UltraHLE was the first N64 emulator with a dynarec IIRC. And it's also said to use some dirty tricks. I might get to one of these tricks when I talk about translation caching.
 

· King of Pain
Joined
·
491 Posts
Discussion Starter · #19 · (Edited)
Register Allocation

One of the big advantages of dynamic recompilation is that you can actually use the registers of the host machine when an emulated instruction uses registers. This can not only reduce the amount of instructions needed to emulate another instruction but also minimize slow memory accesses for every referenced register.
The process of mapping the emulated registers to the registers of the host machine is called register allocation.

Unfortunately a lot of dynarecs in emulators still fetch all needed register values from the register file (just a memory structure where the contents of all registers of the emulated processor) in the beginning of each emulated instruction and write the result back to the register file afterwards. Only a little better are those that don't store the value of the result register if it is used as input in the following instruction.
This really spoils a large part of what dynamic recompilation is about, but still seen quite often in "real world" examples.

There are two different methods of allocating registers:

Static register allocation: This means that in every translated block the same emulated registers are always allocated to the same host registers. When the host machine has enough registers to hold all emulated registers this is the optimal solution, but there are also some advantages even if it has fewer registers (mainly related to timing and translation block handling; I'll cover that later). In the latter case this means that only some of the emulated registers are held in host registers (ideally the most often used ones) and for the remaining ones the register file is accessed.

Dynamic register allocation: In every translation block the registers are allocated differently. Ideally you load the values of the emulated registers into the host registers at the beginning of the block and store those that were modified back to the register file before the block returns control to the dispatcher loop. If there aren't enough registers to hold all registers used in the block you'll have store the value held in a host register to free it for the next one.

Implementation wise static register allocation is rather easy, because you always use the same host registers to hold specific emulated registers and access the memory locations in the register file for all the others. This also means that you always load the same register values and store them to the same memory locations afterwards no matter what translated code block you are executing. So you only need one setup/clean-up code for all blocks which is occasionally called glue code because it's the thing that connects the emulator and the generated code.

The implementation of dynamic register allocation is a bit more complicated though. First of all there is more bookkeeping to do, since the emulator has to remember these facts:
  • which host register holds which emulated register: you need to know if the register is already in use and where to store the value to if you need to free the register
  • which emulated register is held in which host register: this tells you if the register is already allocated, and if it is you know which register to use in the generated code
  • has the value held in the host register been modified? This isn't really necessary, but it's a good idea not to store a value to the register file that hasn't been changed since you spare another unnecessary memory access.

How does register replacement work when you run out of registers?
There are very many methods that could be applied, and probably only one would be ideal, which basically means that you replace that register which isn't needed in that block anymore. Since you do have the entire block you could actually go through it backwards to find out if there is a register the code doesn't use anymore, but that can be a bit tedious.
That ideal solution reminded me of the best but theoretical solution for a page replacement algorithm in operating systems, so I came up with the idea of using another page replacement algorithm called second chance, which is similar to LRU (least recently used) but simpler to implement.
For that algorithm you set a reference flag for the host register every time it is referenced during code generation. When you need a register you go through all host registers in a circle (using a modulo operation with the maximal number of host registers), pick the next register register where the reference flag isn't set, while unsetting the reference flag of all registers you have to skip. Probably sounds a bit weird, but it's easy to implement and should lead to good results.
Of course if you come up with an easy implementation of the optimal solution that would be perfect.

I guess that's the most important things about register allocation...
If you want to know how to handle 64-bit registers on the IA-32 then take a look at the 1964 documentation (http://e64.wwemu.com/emus/1964/1964_recompiler_doc_101.pdf).
 
1 - 20 of 86 Posts
This is an older thread, you may not receive a response, and could be reviving an old thread. Please consider creating a new thread.
Top