21 - 40 of 40 Posts
Excuse me: there is NO right or wrong. It's the dev's choice. Yes, it's insane using a dynamic recompiler on a C64 emulator (which, I may add, HAS been done before, and done very well), but there should be no "shoulding" people about what can and can't be done.
alright boss.

what i'm having a little trouble with is what the difference would be between having the line in normal code format compared to asm. like with your example, would it be faster to store that value in the original variable (V01) or to do it the recompilation way? it just seems faster the normal way.
well there's a bunch of different factors that make dynarecs faster (note: the factors listed below depend on the dynarec implementation):

1) a great speed benefit comes from not having the overhead of repetitive function calls for interpreted opcodes.
you basically recompile a bunch of the opcodes into blocks, so you're not doing any function calls while executing (or very few function calls).
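to make that dispatch overhead concrete, here's a toy switch-based interpreter for a made-up 2-opcode ISA (all names are invented for illustration, not from any real emulator). every emulated instruction pays a decode + call here; a dynarec would translate the whole block once and jump straight into the generated code, paying that cost zero times per execution:

```c
#include <assert.h>
#include <stdint.h>

/* Toy ISA for illustration only. */
enum { OP_ADDI, OP_MOV };

typedef struct { uint8_t op, rd, rs; int32_t imm; } Insn;
typedef struct { int32_t regs[8]; } Cpu;

static void exec_addi(Cpu *c, const Insn *i) { c->regs[i->rd] = c->regs[i->rs] + i->imm; }
static void exec_mov (Cpu *c, const Insn *i) { c->regs[i->rd] = c->regs[i->rs]; }

/* every single instruction goes through this decode + call overhead;
 * a dynarec eliminates it by compiling the whole block up front */
static void interp_run(Cpu *c, const Insn *prog, int n) {
    for (int i = 0; i < n; i++) {
        switch (prog[i].op) {
            case OP_ADDI: exec_addi(c, &prog[i]); break;
            case OP_MOV:  exec_mov (c, &prog[i]); break;
        }
    }
}
```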

2) great use of cpu cache, which is a big speedup.

3) you can generate optimized code from constant parameters.
for example, pretend you want to recompile something like:

Code:
// Assume the system you're emulating has this opcode:
addimm  reg1, reg2, 0; // reg1 = reg2 + 0
the interpreter function in an emulator for this opcode might look like this:
Code:
void addimm(int reg1, int reg2, u32 immediate) {
    cpu.regs[reg1] = cpu.regs[reg2] + immediate;
}
if recompiling the code, you can do optimizations like this:

Code:
// Note: the code below uses pcsx2's emitter syntax.
// different emu's use their own emitters to generate 
// x86 code more easily.

void rec_addimm(int reg1, int reg2, u32 immediate) {
    if (immediate == 0) { 
        MOV32MtoR(EAX, &cpu.regs[reg2]); // move memory to general purpose register (gpr)
        MOV32RtoM(&cpu.regs[reg1], EAX); // move gpr to memory
    }
    else {
        MOV32MtoR(EAX, &cpu.regs[reg2]); // move memory to gpr
        ADD32ItoR(EAX, immediate); // eax = eax + immediate
        MOV32RtoM(&cpu.regs[reg1], EAX); // move gpr to memory 
    }
}
notice that when immediate is equal to zero, you can just move reg2 to reg1, since reg2+0 == reg2.

since you're recompiling this code, you spend more cpu time doing this optimization once, but then you can execute the optimized code as many times as you want.

if you were to do the same thing in the interpreter function like this:

Code:
void addimm(int reg1, int reg2, u32 immediate) {
    if (immediate == 0) {
        cpu.regs[reg1] = cpu.regs[reg2];
    }
    else {
        cpu.regs[reg1] = cpu.regs[reg2] + immediate;
    }
}
the conditional if-statement will run every time you call that interpreter function, and therefore you'll actually lose speed compared to the original interpreter function w/o the conditional (because the compiler would turn the latter code into compares + jumps, and it'd be faster to just always do the addition instead...)

4) there's something called reg-allocation/reg-caching that you can do with dynarecs; the basic idea is to "keep the emulated-system's regs in your CPU's registers for as long as possible".

this way you can eliminate a lot of reading/writing from/to memory.
also, a lot of reg-to-reg x86 operations are shorter than reg-to-memory/memory-to-reg ones, and therefore cause less cpu 'cache-clutter' and speed things up.
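here's a toy sketch of the reg-caching idea (all the names and the round-robin eviction are made up for illustration; real allocators like pcsx2's are more involved). the recompiler keeps a map of guest reg -> host reg, so repeated uses of the same guest reg skip the emitted memory load entirely:

```c
#include <assert.h>

#define NUM_HOST 4  /* pretend we reserved 4 host regs for caching */

typedef struct {
    int guest[NUM_HOST];   /* which guest reg each host reg holds, -1 = free */
    int next_evict;        /* simple round-robin eviction, for illustration */
    int loads;             /* counts the memory loads we'd have to emit */
} RegCache;

static void cache_init(RegCache *rc) {
    for (int i = 0; i < NUM_HOST; i++) rc->guest[i] = -1;
    rc->next_evict = 0;
    rc->loads = 0;
}

/* returns the host reg holding guest reg `greg`,
 * emitting a load only on a cache miss */
static int alloc_host_reg(RegCache *rc, int greg) {
    for (int i = 0; i < NUM_HOST; i++)
        if (rc->guest[i] == greg) return i;          /* hit: no memory access */
    int h = rc->next_evict;                          /* miss: evict and reload */
    rc->next_evict = (rc->next_evict + 1) % NUM_HOST;
    rc->guest[h] = greg;
    rc->loads++;                 /* this is where we'd emit a MOV32MtoR-style load */
    return h;
}
```

the second lookup of the same guest reg comes back from the cache with no load, which is exactly the read/write elimination described above.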

5) you can reorder opcodes and even perform stuff such as constant propagation and constant folding to optimize and speed things up.
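a minimal sketch of how constant propagation can work at recompile time (structures invented for illustration): the recompiler tracks which guest regs hold compile-time-known values, and when all inputs of an op are known it folds the result immediately instead of emitting the arithmetic:

```c
#include <assert.h>
#include <stdint.h>

#define NREGS 8

typedef struct {
    int     known[NREGS];  /* 1 if the reg's value is known at recompile time */
    int32_t value[NREGS];
} ConstState;

/* recompile "rd = rs + imm"; returns 1 if the op was folded away
 * (no runtime code needed), 0 if the real add must be emitted */
static int rec_addimm_fold(ConstState *cs, int rd, int rs, int32_t imm) {
    if (cs->known[rs]) {
        cs->known[rd] = 1;
        cs->value[rd] = cs->value[rs] + imm;  /* computed now, not at runtime */
        return 1;
    }
    cs->known[rd] = 0;  /* result depends on runtime state: emit the real add */
    return 0;
}
```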


the basic idea of dynarecs is "do expensive work once, then run the generated optimized code many times."

hope that helps.
and the best way to understand dynamic recompilation IMO is to play around with the source of some emu that does it.
i didn't understand the difference between interpreters and dynarecs before i started working on pcsx2.
and now a year later, i wrote my own dynarec :)
 
^_^ DynaRec's are not always the way to go.

Super nintendo emulation, for example, will not benefit from it. In fact it is likely to negatively impact it.
 
btw, the way you explained the pcsx2 functions was a bit hard to follow at first, but i get what you're saying and i sort of understand it now. i doubt i'll touch an emulator anytime soon where using a dynarec would benefit it, but it's nice to know how to do it just in case.

thank you everyone for the info on dynarecs though as it has really helped a lot :D
 
TBH, I do encourage you to write a more complex emulator when you're ready, like PSX or maybe even i286. I think you'll do a good job of it.

@mudlord, thanks for the link. I don't remember seeing that one!
 
yea i just need to figure out what's causing me to not be able to concentrate at all. i got a fair bit of gameboy finished, but until i can actually concentrate properly it's on hold; no idea why i can't think straight while working on it lately :/.
 
mudlord... drop it. ^_^
 
Can't we all just... get along oO

yea i just need to figure out what's causing me to not be able to concentrate at all. i got a fair bit of gameboy finished, but until i can actually concentrate properly it's on hold; no idea why i can't think straight while working on it lately :/.
I know that feeling. It's been bothering me for years. What I do is my coding elsewhere besides home. Good thing I have a notebook PC. I find it hard to concentrate at home. Distractions may include TV, family members, temptations to start playing your games, daydreaming, etc. That's why I like going to the library, restaurants, and other public places that have wifi nearby so I can concentrate on my projects. I get tempted to play Azurik and Halo (the first one, the rest can't live up to the original in any way) all the time, so I need to get away from it. Just a few suggestions. I hate it when I want to code, and I can't concentrate or even get comfortable.
 
What I do is I do my coding elsewhere besides home. Good thing I have a notebook PC.
hmm i think i'd get more distracted on a notebook. :D
when i'm really serious about coding, i basically lock myself in my room and exit out everything on my pc except for visual studio and any documentation i need.

if i had a notebook i'd probably be moving around all over the place, and not get anything done. :emb:

I hate it when I want to code, and I can't concentrate or even get comfortable.
yeah i've had that happen before, it sucks.
the best feeling is when you have a really good coding day, and basically all your ideas magically flow together and everything works the first or second time! xD

for the most complex stuff i program, it usually takes me a few days to come up with a nice solution.
and i do a lot of walking around and thinking, until i finally have a working idea in my head; then i quickly write the basic code down on paper, and later i code the real thing.

i don't like to code stuff unless i have a 'nice' solution.
when i have complex problems i always try to think of simple/nice solutions for coding them; it just takes a lot of thinking to find the best way to do it.
but the reward is better and more organized code in the end, so i think it pays off.
 
my issue with concentration is not even limited to coding, and i have a feeling part of it may be what's causing these headaches; i would not doubt it for a second.

also i do tend to code without much around me, but lately not even that has been helping me any. i mean, i have the ideas in my head, but when i go to try and code them in visual studio my mind immediately goes blank, and even when it doesn't, it's very difficult to get anything down that i feel good about.

nice to know i am not the only one like that though :D
 
the solution would be to make blocks really small, and then exit out of the CPU recompilation/execution to update the other parts of the system. however this'll eliminate the speed gain from dynamic recompilation (because there's always an overhead in the recompilation and searching phases, and it's only justified if your execution phase runs a lot of code; otherwise, an interpreter would be faster)
I disagree that an interpreter would be faster. It depends on what you're emulating and what you're emulating it on. I've seen very weak recompilers or threaded interpreters that were still faster than their interpreter counterparts. Then again, it also depends on how efficient the interpreter is to begin with, since most people aren't willing to bust out assembler code for their interpreters these days. I would still say that a recompiler that checks cycles and allows an exit after every opcode should still beat out an interpreter in most cases, except where it increases icache pressure too dramatically. But this kind of recompiler would be much weaker than what you could potentially do.

As for what constitutes a "lot of code", the overhead amortizes away very quickly. You're not liable to gain a lot of performance by checking every 1000 instructions instead of every 20. Usually checks are done before branches, or after X possible cycles have accumulated, to bound the amount it can go over.

There is an alternative to allowing cycle overshoot: use a hybrid of a coarse recompiler and either a fine-grained recompiler or an interpreter. The coarse recompiler executes blocks of instructions, allowing for all of the optimizations therein. Before executing a particular block, it checks the remaining cycles against the maximum number the block may consume, and if there are not enough then it bails out to the fine-grained recompiler or interpreter. This execution path must run until it can get back to a coarse block, which is the more difficult part of the problem, but fairly manageable.

Even on platforms with tight cycle windows you can make things work a large amount of the time with a coarse recompiler - the rule of thumb is to allow more cycles than the emulated machine, not less. That means that if your block goes over by N cycles you generally just throw those cycles away rather than take them away from the next block to be executed. That way it's as if the later instructions got condensed in, rather than a latency skip. This prevents you from skipping over events in the overshoot, so it works unless the game wanted to skip those events (much less common, but does happen).
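the hybrid dispatch described above can be sketched roughly like this (toy structures, invented for illustration; real dispatchers also handle re-entry into a partially executed block, which is elided here): run coarse blocks while the cycle budget allows, and drop to per-instruction stepping near the end of the window:

```c
#include <assert.h>

/* one recompiled block: worst-case cycles it may consume,
 * and the cycles it actually takes this run */
typedef struct { int max_cycles; int actual_cycles; } Block;

/* execute blocks until the cycle budget runs out;
 * returns the cycles actually consumed */
static int run_slice(const Block *blocks, int nblocks, int budget) {
    int used = 0, b = 0;
    while (used < budget && b < nblocks) {
        if (budget - used >= blocks[b].max_cycles) {
            /* enough budget for the worst case: run the fast coarse block */
            used += blocks[b].actual_cycles;
        } else {
            /* fine-grained fallback: step one cycle at a time so we can
             * stop exactly at the window boundary (block bookkeeping elided) */
            used += 1;
            continue;
        }
        b++;
    }
    return used;
}
```

note that the coarse path checks against `max_cycles` but consumes `actual_cycles`, so it can never overshoot the window, which is the whole point of the check.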

blueshogun96 said:
* Any CALL or JUMP instructions must be emulated in software. Never emulate a JUMP, CALL, RET or any instruction that modifies the EIP register directly in your code block. Emulate these in software. Only use CALL when you need to call an emulator function that emulates the needed functionality in software (i.e. host memory access, privileged instructions, emulated devices, etc.)
Actually, indirect jumps and calls can (and should) be translated to native jumps and calls if the destination block has been recompiled. If it hasn't then you can either recompile it on the spot or branch to code to do so at runtime and backpatch the branch to go to the new block.
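the compile-on-first-use-then-backpatch idea can be modeled with a plain function-pointer table (everything here is invented for illustration; a real dynarec patches the emitted machine code itself rather than a C table): every jump target starts out pointing at a "compile me" stub, and the stub replaces itself with the compiled block on first use:

```c
#include <assert.h>

#define NTARGETS 4

typedef int (*BlockFn)(int target);

static BlockFn slots[NTARGETS];  /* dispatch slot per jump target */
static int compile_count;        /* how many times we "recompiled" */

/* stand-in for a generated code block */
static int compiled_block(int target) { return target * 100; }

/* first jump to a target lands here: compile, backpatch, then run */
static int compile_stub(int target) {
    compile_count++;                 /* real code: recompile the block here */
    slots[target] = compiled_block;  /* backpatch: later jumps skip the stub */
    return compiled_block(target);
}

static void link_init(void) {
    for (int i = 0; i < NTARGETS; i++) slots[i] = compile_stub;
    compile_count = 0;
}

/* an emulated indirect jump/call just goes through the slot */
static int jump_indirect(int target) { return slots[target](target); }
```

after the first jump to a target, every later jump goes straight to the compiled block with no lookup or compile cost, which is what makes translating indirect jumps to native jumps worthwhile.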

By the way, for those people contemplating doing a dynarec for something simple like chip8 - do it. It doesn't matter if it "needs" it or not, it's good experience and a lot easier than doing it for something much more complex. I did my first recompiler for another imaginary platform called "lc2", and while the design was quite different than my later "real" one (and in some ways pretty bad) it was still a good educational experience.
 
Even on platforms with tight cycle windows you can make things work a large amount of the time with a coarse recompiler - the rule of thumb is to allow more cycles than the emulated machine, not less. That means that if your block goes over by N cycles you generally just throw those cycles away rather than take them away from the next block to be executed. That way it's as if the later instructions got condensed in, rather than a latency skip. This prevents you from skipping over events in the overshoot, so it works unless the game wanted to skip those events (much less common, but does happen).
well that doesn't sound very accurate.
on old systems, i like to be as cycle accurate as possible.
actually i like to be cycle accurate always unless it's 100% guaranteed to not affect anything.

but yeah i'll agree there's ways to make a fast rec for older systems.
but it'll be harder to make it fast and accurate than doing it for a system that isn't so picky about cycles IMO.
and generally i think the best way to go is with a fast interpreter in that case.
 
well that doesn't sound very accurate.
on old systems, i like to be as cycle accurate as possible.
actually i like to be cycle accurate always unless it's 100% guaranteed to not affect anything.

but yeah i'll agree there's ways to make a fast rec for older systems.
but it'll be harder to make it fast and accurate than doing it for a system that isn't so picky about cycles IMO.
and generally i think the best way to go is with a fast interpreter in that case.
Of course it'll be harder to make it fast and accurate. Faster is usually harder :( But it's doable.

My later suggestion isn't "very accurate", but people assume a lot about demands for platforms without really having a large case study behind it. Yeah, depending on the platform, some varying percentage of games will break, sometimes in very minor ways and sometimes entirely. You can always fall back on something else. I don't really like the attitude some have that an emulator is worthless if it fails to emulate any particular game - if it's something that all games need but only 0.01% of the time then it's a different story, but I'd rather have an emulator that played 50% of games at real time speed than 100% of games at 50% speed. I think emulation just brings out perfectionist nature in people. I too have fallen under that before, it's hard to escape.

You like to be cycle accurate always unless you're sure it will break nothing, but are working on a PS2 emulator? With a library of games that size there is bound to be something that won't work with whatever non-exact timing model you have, even if it wasn't a deliberate decision on the author's part... will you be able to live with yourself then? ;p
 
You like to be cycle accurate always unless you're sure it will break nothing, but are working on a PS2 emulator? With a library of games that size there is bound to be something that won't work with whatever non-exact timing model you have, even if it wasn't a deliberate decision on the author's part... will you be able to live with yourself then? ;p
well when i'm finished with my VU recs it should be 100% cycle/pipeline accurate.
anything wrong timing-wise will be because of pcsx2's timing model, which i'm sure will never be perfect, but will get better in time...

speedwise my recs might be slower than the current VU recs, but i rather have near-perfect emulation, than something hacked together for speed (like the current VU recs).
if i find anything that has a big performance hit and is not needed for 90% of games, i'll make it into an optional speed hack. but the default options will try to be as hack-less and accurate as possible. (also prevents noobs from saying, why won't XXX work!?)

furthermore we plan to thread my recs, so even if they're a bit slower, since they'll be running in parallel with the other parts of the emu, it might be a net speed gain overall.

and if my recs are 20% slower, but 99% accurate, i think it's a good trade-off... hardware will get faster over the years, but the only way for the emulation to get more accurate is by code rewrites.
but don't get me wrong, speed is important to me too; but we already have a fast VU rec in pcsx2... now i think it's time to have an accurate one.
 
well when i'm finished with my VU recs it should be 100% cycle/pipeline accurate.
anything wrong timing-wise will be because of pcsx2's timing model, which i'm sure will never be perfect, but will get better in time...
But will it check for events between every instruction? What about events that can happen in the middle of an instruction, such as affecting the results of a load? Because you said that a recompiler is useless like that and it can't be 100% accurate otherwise... I personally think attempting something like this on a PS2 emulator would be crazy.

speedwise my recs might be slower than the current VU recs, but i rather have near-perfect emulation, than something hacked together for speed (like the current VU recs).
if i find anything that has a big performance hit and is not needed for 90% of games, i'll make it into an optional speed hack. but the default options will try to be as hack-less and accurate as possible. (also prevents noobs from saying, why won't XXX work!?)
But what about the limitations in floating point representation/results from using IEEE floats? Those are a big hack, right? But the emulator would be very slow w/o them.

furthermore we plan to thread my recs, so even if they're a bit slower, since they'll be running in parallel with the other parts of the emu, it might be a faster overall net-gain.
Wow, are you sure they'll be 100% accurate? Splitting parts of an emulator off into threads is about the worst thing you can do for synchronization accuracy, unless you plan to never have them run at the same time (obviously not the case). Are we really on the same page when it comes to accuracy?

and if my recs are 20% slower, but 99% accurate, i think its a good trade-off... hardware will get faster over the years, but the only way for the emulation to get more accurate is by code rewrites.
but don't get me wrong, speed is important to me too; but we already have a fast VU rec in pcsx2... now i think its time to have an accurate one.
Hm, 99% accurate, I see 1% has been notched off ;p

The way I see it, unless your timing is completely perfect then any tiny amount being wrong can screw you over. In reality, it can be much worse to be extremely close than far off but generously so. It's usually better to offer 50% too many cycles than 5% too few.

"Accuracy" really has only a weak correlation with compatibility. A less accurate emulator can be more compatible, even without per-game hacks (or hacks in general, really). Of course that depends on your metric of accuracy. The thing is, no one is going to care if your emulator is behaving more like a PS2 if it doesn't mean more games work, but they will care if it's slower.
 
But will it check for events between every instruction? What about events that can happen in the middle of an instruction, such as affecting the results of a load?
that's not applicable to the VU's.
they independently run their programs w/o interruption.
you basically send it a 'microProgram', and it runs it till completion w/o relying on any other parts of the ps2.

Because you said that a recompiler is useless like that and it can't be 100% accurate otherwise... I personally think attempting something like this on a PS2 emulator would be crazy.
i never said it was 'useless', but i did say it wasn't accurate to do that on older systems.

i also said that with modern systems you can run your recs for 1000's of cycles w/o syncing, because they're not as cycle critical; which is the case with ps2 emulation.
and it's one of those cases where it won't break games to do this, because it's impossible for the game to know exactly what state the other parts of the system are in w/o polling for their status.


i think i should clarify that when i say 100% cycle and pipeline accuracy, i'm talking about calculating stalls and cycles correctly, and executing non-interlocked opcodes with the correct operand results according to the stage in the pipeline they're run.
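the stall/cycle bookkeeping being described can be modeled roughly like this (a deliberately simplified toy, not the real VU pipeline: one issue per cycle, a per-reg ready time, no dual issue or special upper/lower pipe rules): each reg records the cycle its last writer's result becomes readable, and an instruction reading it earlier stalls the difference:

```c
#include <assert.h>

#define NREGS 8

typedef struct {
    int ready[NREGS];  /* cycle at which each reg's last result is readable */
    int cycle;         /* current cycle count */
} Pipe;

/* issue "rd = f(rs)" where the result takes `latency` extra cycles
 * to become readable; returns the stall cycles inserted */
static int issue(Pipe *p, int rd, int rs, int latency) {
    int stall = p->ready[rs] > p->cycle ? p->ready[rs] - p->cycle : 0;
    p->cycle += stall + 1;              /* wait out the stall, then issue */
    p->ready[rd] = p->cycle + latency;  /* result visible `latency` later */
    return stall;
}
```

a recompiler can run exactly this kind of accounting at compile time and bake the resulting cycle counts into the generated block, so the runtime code pays nothing for the bookkeeping.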

But what about the limitations in floating point representation/results from using IEEE floats? Those are a big hack, right? But the emulator would be very slow w/o them.
that's why my emu won't be 100% accurate overall.
but that has nothing to do with cycle/pipeline calculations, so there's no reason i can't do perfect cycle/pipelines in my recs.

Wow, are you sure they'll be 100% accurate? Splitting parts of an emulator off into threads is about the worst thing you can do for synchronization accuracy, unless you plan to never have them run at the same time (obviously not the case). Are we really on the same page when it comes to accuracy?
there's nothing wrong with threading if you do it right ;p
and actually threading will make stuff more accurate in terms of syncing than we have now.
too lazy to explain the details.
but just think that on the real ps2 the stuff is run in parallel; so not doing this is technically more wrong.

The thing is, no one is going to care if your emulator is behaving more like a PS2 if it doesn't mean more games work, but they will care if it's slower.
well generally more games should work the more accurate the emulation is; if not then you did something wrong, or its just a fluke that the old way seemed to work correctly.
and i don't code to make those people happy, i do it for myself.

p.s. if my recs are slower and less compatible, then we'll just keep the old ones and make my rec optional...
 
that's not applicable to the VU's.
they independently run their programs w/o interruption.
you basically send it a 'microProgram', and it runs it till completion w/o relying on any other parts of the ps2.
I find it hard to believe that that's completely literally true. If there's any means of communicating between the different components then they're not in isolation and can probably see the state of the others at any arbitrary moment. I'm just talking about things that apply to goals of "100% accuracy."

i never said it was 'useless', but i did say it wasn't accurate to do that on older systems.
You're confusing things; I was referring to where you said that you may as well do an interpreter instead of a cycle-granular dynarec.

i also said that with modern systems you can run your recs for 1000's of cycles w/o syncing, because they're not as cycle critical; which is the case with ps2 emulation.
and its one of those cases where it won't break games to do this, because its impossible for the game to know exactly what state the other parts of the system are in w/o polling for its status.
If there are any shared buses or interrupts then it isn't impossible. Maybe it's impractical for the games to deliberately detect the states of the other components but timing errors in emulators tend to arise due to states that were achieved accidentally and "just work."

i think i should clarify that when i say 100% cycle and pipeline accuracy, i'm talking about calculating stalls and cycles correctly, and executing non-interlocked opcodes with the correct operand results according to the stage in the pipeline they're run.
That's fine. If we're talking about results due to multiplexing registers (term from C6x programming with non-interlocked instructions) then you definitely want to try to emulate the results correctly. This isn't so much a timing issue, which is more what I'm referring to.

On the other hand, how do you know that games are sensitive to stalls? Sure, I can see this making a huge impact on overall timing - a tight inner loop could perhaps take many times more cycles than you're giving it otherwise. But if a game is sensitive to this kind of timing accuracy it could be sensitive to pretty much anything, including the situations you said a game would never be sensitive to. Can you verify any cases where you know a game relies on the execution being slow enough?

By the way, I hope VU is largely statically scheduled and doesn't have a lot of runtime stalls. If it has cache then you're going to be way off no matter what, unless you want to emulate that and take a massive speed hit. These seem like the kind of units that would only have a small amount of dedicated RAM, but the stall timing on the CPU is probably never going to be anywhere close to correct.

there's nothing wrong with threading if you do it right ;p
and actually threading will make stuff more accurate in terms of syncing than we have now.
too lazy to explain the details.
but just think that on the real ps2 the stuff is run in parallel; so not doing this is technically more wrong.
I don't buy this argument. You may as well say that making every cycle last as long as a cycle lasts on the PS2 is more accurate. Real world timing and emulation timing are very separate things. The parallelism you achieve here will be nothing like the parallelism the PS2 achieves, except perhaps in very broad strokes. I also think it'd be kind of bad if you actually got better accuracy from users of quad core, but with OS scheduling being such a big thing far removed from your emulator I doubt you will...

well generally more games should work the more accurate the emulation is; if not then you did something wrong, or its just a fluke that the old way seemed to work correctly.
and i don't code to make those people happy, i do it for myself.

p.s. if my recs are slower and less compatible, then we'll just keep the old ones and make my rec optional...
When it comes to timing accuracy anything goes and I stand by what I said 100%. You have to clarify what it means to be "more accurate" with timing - generally people mean that the global cycle numbers at which events occur are within a lower percentage error of what they'd be on a real machine, which is what you're going for. But timing inaccuracy is timing inaccuracy, and being closer can hurt more. I don't see that as being a "fluke."

For anything not involving timing then "more accuracy" helps.

I'm kinda getting tired of every emulator author only "coding for themselves", not that I doubt you ;P It'd be nice to see some that actually do care about what other people end up using :/
 
they can be kept separate though.
 