Next Generation Emulation


Never give up your dreams · 181 Posts · Discussion Starter #1
I have been reading some forum threads, and most posts say that CUDA wouldn't help in an emulation scenario because the video card would already be stressed by the graphics emulation, so adding CUDA on top of that would bring no benefit.

Well... basically, my idea is to use 2 video cards.

One video card would emulate the CPU of the Xbox 360 with CUDA (or OpenCL); that would solve the hardware speed limitation of emulating the Xbox 360 CPU, as a GPU can be up to 14x faster than a CPU!
(Even Intel conceded that -> http://blogs.nvidia.com/2010/06/gpus-are-only-up-to-14-times-faster-than-cpus-says-intel/)
The other video card would handle the graphics emulation.

So the concept, as you can see, is simple (I'm not saying the implementation is simple too): one video card to emulate the CPU of the Xbox 360 and the other to emulate the GPU of the Xbox 360.

I searched some forums and found that we can choose which graphics card runs CUDA and leave the other one to do something else (in my case, graphics emulation). It's just a configuration task.
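
A minimal sketch of that configuration step, assuming the CUDA runtime API and that the compute board is device 1 (the numbering is machine-specific, so treat the index as an assumption):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // List every CUDA-capable board in the machine...
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("device %d: %s\n", i, prop.name);
    }
    // ...and pin this process's compute work to the second one,
    // leaving device 0 free for the graphics emulation.
    if (count > 1) cudaSetDevice(1);
    return 0;
}
```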

I have also read that a GPU couldn't emulate a CPU at all, but I think this method could still speed up the process somewhat, sharing some of the emulation with CUDA and some with the user's CPU.

I do want to build my own emulator. Of course I won't start with the Xbox 360, but that's my main target. So... before I go forward and invest money in this project... I want your opinion on this idea... especially from the experienced ones... Thank you in advance! :thumb:

PS: As far as I know, OpenCL is not as mature as CUDA, but comparing the code... they are really similar... so I can start developing the emulator using CUDA and then, once OpenCL matures, translate the code. The advantage of OpenCL is that it can run on both brands of video card: ATI and nVidia.
 

There is always hope, but you have to supply it. · 6,886 Posts
JoaoHadouken said:
I have been reading some forum threads, and most posts say that CUDA wouldn't help in an emulation scenario because the video card would already be stressed by the graphics emulation, so adding CUDA on top of that would bring no benefit.

Well... basically, my idea is to use 2 video cards.

One video card would emulate the CPU of the Xbox 360 with CUDA (or OpenCL); that would solve the hardware speed limitation of emulating the Xbox 360 CPU, as a GPU can be up to 14x faster than a CPU!
(Even Intel conceded that -> http://blogs.nvidia.com/2010/06/gpus-are-only-up-to-14-times-faster-than-cpus-says-intel/)
The other video card would handle the graphics emulation.

So the concept, as you can see, is simple (I'm not saying the implementation is simple too): one video card to emulate the CPU of the Xbox 360 and the other to emulate the GPU of the Xbox 360.

I searched some forums and found that we can choose which graphics card runs CUDA and leave the other one to do something else (in my case, graphics emulation). It's just a configuration task.

I have also read that a GPU couldn't emulate a CPU at all, but I think this method could still speed up the process somewhat, sharing some of the emulation with CUDA and some with the user's CPU.

I do want to build my own emulator. Of course I won't start with the Xbox 360, but that's my main target. So... before I go forward and invest money in this project... I want your opinion on this idea... especially from the experienced ones... Thank you in advance! :thumb:

PS: As far as I know, OpenCL is not as mature as CUDA, but comparing the code... they are really similar... so I can start developing the emulator using CUDA and then, once OpenCL matures, translate the code. The advantage of OpenCL is that it can run on both brands of video card: ATI and nVidia.
Let's be super nice and pretend someone skilled enough wants to code an Xbox 360 emulator in CUDA: they won't be able to do it. There's a lack of the required documentation to emulate the 360's hardware. It's not possible at all without some decent information to work with.
 

Never give up your dreams · 181 Posts · Discussion Starter #3
Let's be super nice and pretend someone skilled enough wants to code an Xbox 360 emulator in CUDA: they won't be able to do it. There's a lack of the required documentation to emulate the 360's hardware. It's not possible at all without some decent information to work with.
OK, thanks for the reply. I know the lack of documentation/information is a big problem... but let's not worry about that right now; I already know the implementation won't be easy. And as I said, I won't start building an Xbox 360 emulator right away, so documentation may surface in the meantime.

My real question is whether this configuration would bring benefits in general, especially in speed. Well... thank you anyway.
 

クロスエクス · 4,662 Posts
In my very limited understanding of things, I think that for GPGPU to be useful it needs to be applied to extremely parallelizable (is that a word?) workloads. Like rendering stuff, which is what GPUs do so well.

I can't imagine a point in an emulator (core?) that could benefit from that at all, especially with the latency of getting data back and forth from the GPU.
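
To put a rough number on that round-trip cost, here is a hedged sketch that times a tiny host-to-device-and-back copy with CUDA events; the trip count is arbitrary and the result varies a lot per machine:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int* dev = nullptr;
    int host = 0;
    cudaMalloc(&dev, sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int trips = 1000;
    cudaEventRecord(start);
    for (int i = 0; i < trips; ++i) {
        // Send 4 bytes to the GPU and read them straight back.
        cudaMemcpy(dev, &host, sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(&host, dev, sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    // Total milliseconds over 1000 trips equals microseconds per trip.
    std::printf("avg round trip: %.1f us\n", total_ms);

    cudaFree(dev);
    return 0;
}
```

If every emulated instruction had to pay even a few microseconds of that, the GPU would lose to the host CPU long before any compute advantage showed up.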
 

Never give up your dreams · 181 Posts · Discussion Starter #5
In my very limited understanding of things, I think that for GPGPU to be useful it needs to be applied to extremely parallelizable (is that a word?) workloads. Like rendering stuff, which is what GPUs do so well.

I can't imagine a point in an emulator (core?) that could benefit from that at all, especially with the latency of getting data back and forth from the GPU.
I think I understand your point of view... so, let's analyse: you're saying the GPU wouldn't be much use because CPU emulation is more serial than parallel?

Well, I don't know the techniques for emulating a CPU yet, that's why I came here, but... are you sure there isn't any part of CPU emulation that could be done in parallel? Some kind of translation, calculation, or anything that needs speed? My idea is not to emulate the CPU with ONLY CUDA; it's just to offload the heavy part to CUDA. That said, you pointed out another possible problem: the latency. But is the latency between the GPU and CPU really so high that we couldn't get much benefit from using this technology? Do you have any real experience to share that shows it's not a good idea to use CUDA?

Thank you! :)
 

クロスエクス · 4,662 Posts
I never coded an emulator, and I never coded GPGPU stuff, so I don't have anything to prove either way. Take it as it is: a mere comment. You'll know your answer once you get your hands dirty with it. Code a CHIP-8 emu or something, code some GPGPU stuff, etc...
 

Never give up your dreams · 181 Posts · Discussion Starter #7
hmm... for a moment I thought you were an emu author... anyway, thanks! :)

Code a CHIP-8 emu or something...
Yeah... I thought about beginning with something really small like that. Thank you for helping...

Anybody else have something to add? Any experience, statistics, etc...?
 

Registered · 376 Posts
Just an FYI: to the end user, OpenCL is much preferred due to its vendor neutrality.
 

Never give up your dreams · 181 Posts · Discussion Starter #9
Just an FYI: to the end user, OpenCL is much preferred due to its vendor neutrality.
Yeah, thanks for that information; it confirms I'm thinking along the right lines.

JoaoHadouken said:
PS: As far as I know, OpenCL is not as mature as CUDA, but comparing the code... they are really similar... so I can start developing the emulator using CUDA and then, once OpenCL matures, translate the code. The advantage of OpenCL is that it can run on both brands of video card: ATI and nVidia.
I also want to make this emulator cross-platform. So even my first emulator (CHIP-8, I think) has to work on the main systems available: Windows, Linux (especially Ubuntu) and Mac OS X. Once I get that working fine, I can move forward. I don't want to exclude anyone... :)
 

Registered · 196 Posts
There are some major issues with your idea of using a GPU to emulate a CPU...

A GPU is good at dealing with parallel tasks in a SIMD fashion (doing the same thing to each element of a vector), and generally speaking it is mostly focused on math computations...

If you disassemble .xex files, you'll see that games don't launch a bunch of threads or do things in parallel; they're like PC games (I guess I could say they *are* PC games): they have a main thread that does a lot of things, and some smaller threads for limited, asynchronous tasks (network, sound, Kinect...).

Another problem is the kind of instructions a CPU executes, as opposed to a GPU. Take a simple instruction like RLWINM, which is basically a left rotation plus applying a mask; this would be really painful to do with a GPU...
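
For reference, here is roughly what RLWINM boils down to when emulated on the host, a minimal sketch following the PowerPC definition (the helper names are made up for illustration):

```cuda
#include <cstdint>

// Rotate left by sh bits (sh in 0..31).
static uint32_t rotl32(uint32_t v, unsigned sh) {
    return (v << sh) | (v >> ((32u - sh) & 31u));
}

// PowerPC numbers bits from the MSB: the mask has ones from bit mb
// through bit me inclusive, wrapping around when mb > me.
static uint32_t ppc_mask(unsigned mb, unsigned me) {
    uint32_t head = 0xFFFFFFFFu >> mb;          // ones from bit mb to bit 31
    uint32_t tail = 0xFFFFFFFFu << (31u - me);  // ones from bit 0 to bit me
    return (mb <= me) ? (head & tail) : (head | tail);
}

// rlwinm rA, rS, SH, MB, ME  ==  rA = ROTL32(rS, SH) & MASK(MB, ME)
// A couple of cheap scalar ops on a CPU; as a GPU workload there is
// nothing here to parallelize, only round-trip overhead.
uint32_t emu_rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me) {
    return rotl32(rs, sh) & ppc_mask(mb, me);
}
```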

Now obviously there are some things that can be done in parallel. For example, something like static recompilation would heavily benefit from being done in parallel, once you've isolated code. Another place where it could be useful is when dealing with reconstructing the operations that have to be done by the GPU (Xbox 360 games use Direct3D 9, but as a static library; you actually have a single function that receives a buffer containing all the instructions generated by the D3D9 code... including shaders, vertex/texture objects, etc). One last one I can think of (after thinking about it for at least half a second) is decoding the XMA streams, since those are normally decoded on dedicated hardware.

Anyway, at this point I guess it'd be more interesting for you to focus on getting something to work rather than thinking about optimizations this early in the process. If it works on a CPU, porting it to use a GPU won't be that big a deal, not to mention that debugging your emulation is clearly easier on a CPU...
 

代言人 · 7,183 Posts
A GPU emulator will not work, simply because the algorithms involved are not written for massively multi-threaded architectures.
GPU pathfinding looks a lot different from single-threaded A* pathfinding, though derived from A*. Main difference: the GPU doesn't support recursion.
In other words, the entire algorithm has to be translated, not just the instructions.
Try running single-threaded code on a GPU and you will get something like 1/100 of the performance.
A GPU is only fast when you write an algorithm that is inherently parallelizable and executable on a GPU.
One example: you have to rewrite every single recursion into loops, if that's even possible (see the sketch below).
You also need spin locks or atomic operations to enforce memory consistency.
It just gets ugly to force a single-threaded algorithm into a massively multi-threaded one.
If you port CPU code straight to the GPU and enable multi-threaded kernels,
you are either going to end up with memory hazards or 1/100 of the performance.
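
To illustrate the recursion point: on GPU hardware of this era a recursive routine has to become an iterative one with an explicit stack before it can live in a kernel. A sketch with a made-up flat tree layout:

```cuda
// Children stored as indices into a flat array; -1 means "no child".
struct Node { int value; int left; int right; };

// Host-style version: natural recursion.
int sum_tree_recursive(const Node* nodes, int root) {
    if (root < 0) return 0;
    return nodes[root].value
         + sum_tree_recursive(nodes, nodes[root].left)
         + sum_tree_recursive(nodes, nodes[root].right);
}

// GPU-friendly version: the call stack becomes an explicit array,
// so it can run inside a kernel with no recursion support.
__host__ __device__ int sum_tree_iterative(const Node* nodes, int root) {
    int stack[64];                     // fixed bound replaces the call stack
    int top = 0, total = 0;
    if (root >= 0) stack[top++] = root;
    while (top > 0) {
        const Node n = nodes[stack[--top]];
        total += n.value;
        if (n.left  >= 0) stack[top++] = n.left;
        if (n.right >= 0) stack[top++] = n.right;
    }
    return total;
}
```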

Bucket sort is a great example for performance.
If you just sort within each block and use atomic operations to collect the results, the performance will be terrible.
You need to combine parallel reduction and bubble sort to obtain optimal results on a GPU.

An easier example, summing, is great for illustrating memory hazards.
Suppose each block takes an address and adds onto a single destination address starting at zero:
the value at that address will end up being whatever the last block to access it wrote (see the sketch below).
Again, you need parallel reduction, which takes log2(n) iterations for an array of size n.
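
A concrete sketch of that hazard (sizes are illustrative): every thread adds into one address. Without atomics the interleaved read-modify-writes lose updates; atomicAdd restores correctness but serializes the threads, which is why the log2(n)-step parallel reduction is the idiomatic answer:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sum_racy(const int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) *out += in[i];          // unsynchronized read-modify-write
}

__global__ void sum_atomic(const int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, in[i]);  // correct, but serializes every add
}

int main() {
    const int n = 1 << 20;
    int *in, *out, result;
    cudaMalloc(&in, n * sizeof(int));
    cudaMalloc(&out, sizeof(int));

    int* ones = new int[n];
    for (int i = 0; i < n; ++i) ones[i] = 1;   // correct sum is exactly n
    cudaMemcpy(in, ones, n * sizeof(int), cudaMemcpyHostToDevice);

    cudaMemset(out, 0, sizeof(int));
    sum_racy<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaMemcpy(&result, out, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("racy:   %d (expected %d)\n", result, n);  // far too small

    cudaMemset(out, 0, sizeof(int));
    sum_atomic<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaMemcpy(&result, out, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("atomic: %d (expected %d)\n", result, n);

    delete[] ones;
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```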

Not to mention the fact that it is hard to keep the entire grid saturated.
You have order-of-execution priorities.
Even with asynchronous kernel calls, you get dependencies.
The longest path determines the total execution time.
Even if you can parallelize 90% of the instructions with only one serial chain of instructions left,
if that remaining 10% takes more time to execute on the slower GPU cores, the CPU is going to beat the crap out of the GPU.

Lastly, and most importantly, the most time-consuming part of any GPU operation is the memory transfer.
You cannot just let the GPU do a certain operation and pop the result back to the CPU;
the CPU and GPU use different memories.
For most operations that aren't rendering-related, by the time you have passed the data down to the GPU, the CPU could already have finished the task on its own.
You either do everything on the GPU or everything on the CPU.

All in all, the GPU is awesome at tasks designed specifically for the GPU.
The CPU should handle all the misc tasks, unless you have all the tasks well pipelined.
 

Never give up your dreams · 181 Posts · Discussion Starter #12
Well... OK then... really frustrating, but... those are the real facts and I have to deal with them...

-Ashe- and Fadingz... thank you for the advice and information. I hope to find another way of developing an Xbox 360 emulator; I'm going to study, get a better understanding of emulation and of how the Xbox 360 works, etc... anyway, I won't give up. :)
 

Registered · 1,601 Posts
Well... OK then... really frustrating, but... those are the real facts and I have to deal with them...

-Ashe- and Fadingz... thank you for the advice and information. I hope to find another way of developing an Xbox 360 emulator; I'm going to study, get a better understanding of emulation and of how the Xbox 360 works, etc... anyway, I won't give up. :)
You won't understand it better until you code emulators yourself, especially what is involved when you lack documentation... it has been proven time and time again that, hardware-wise, it is not possible to run the Xbox 360 easily or even fast, including if you decide to offload the tasks to video cards, which don't offer better speed than an actual CPU for this kind of work...
 

代言人 · 7,183 Posts
The Xbox 360 utilizes a tri-core architecture.
Here are my notes on the Xbox 360 spec from my graduate class
(taken from Dr. Milo Martin's lecture slides, also available in the public domain):



• ISA extended with VMX-128 operations
• 128 registers, 128 bits each
• Packed “vector” operations
• Example: four 32-bit floating-point numbers
• One instruction: VR1 * VR2 → VR3
• Four single-precision operations
• Also supports conversion to Microsoft DirectX data formats
• Similar to AltiVec (and Intel’s MMX, SSE, SSE2, etc.)
• Works great for 3D graphics kernels and compression
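
To make the packed-vector idea concrete, a host-side sketch of one such instruction; vmulfp128-style semantics are assumed and the register layout is purely illustrative:

```cuda
// Illustrative guest state: a vector register holds four packed floats.
struct VR128 { float f[4]; };

// One packed multiply (VR1 * VR2 -> VR3) emulated on the host is four
// single-precision multiplies, exactly as the notes describe.
void emu_vmulfp128(VR128& vd, const VR128& va, const VR128& vb) {
    for (int i = 0; i < 4; ++i)
        vd.f[i] = va.f[i] * vb.f[i];
}
```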

• Peak performance: ~77 gigaflops
• Gigaflop = 1 billion floating-point operations per second

• Pipelined superscalar processor
• 3.2 GHz operation
• Superscalar: two-way issue
• VMX-128 instructions (four single-precision operations at a time)
• Hardware multithreading: two threads per processor
• Three processor cores per chip

• Result:
• 3.2 GHz × 2-way issue × 4 ops × 3 cores = ~77 gigaflops

• ISA: 64-bit PowerPC chip
• RISC ISA
• Like MIPS, but with condition codes
• Fixed-length 32-bit instructions
• 32 64-bit general purpose registers (GPRs)


Each Xenon chip:
• 165 million transistors
• IBM’s 90nm process
• Three cores
• 3.2 GHz
• Two-way superscalar
• Two-way multithreaded
• Shared 1MB cache


Pipeline:
• Four-instruction fetch
• Two-instruction “dispatch”
• Five functional units
• “VMX128” execution “decoupled” from other units
• 14-cycle VMX dot-product
• Branch predictor: “4K” G-share predictor
• Unclear if 4KB or 4K 2-bit counters
• Per thread

Memory:
• 128B cache blocks throughout
• 32KB 2-way set-associative instruction cache (per core)
• 32KB 4-way set-associative data cache (per core)
• Write-through, lots of store buffering
• Parity
• 1MB 8-way set-associative second-level cache (per chip)
• Special “skip L2” prefetch instruction
• MESI cache coherence
• Error Correcting Codes (ECC)
• 512MB GDDR3 DRAM, dual memory controllers
• Total of 22.4 GB/s of memory bandwidth
• Direct path to GPU


GPU parent die:
• 232 million transistors
• 500 MHz
• 48 unified shader ALUs
• Mini-cores for graphics


GPU daughter die:
• 100 million transistors
• 10MB eDRAM
• “Embedded”
• NEC Electronics
• Anti-aliasing: render at 4x resolution, then sample
• Z-buffering: track the “depth” of pixels
• 256GB/s internal bandwidth
 

Super Moderator · 13,069 Posts
Accelerating instruction execution is only viable once the instructions are already properly emulated. CUDA could be useful for accelerating development processes and research, not execution.

System compatibility would also suffer. What about users of other discrete GPUs, like ATI's? Performance aside, the output couldn't even be guaranteed across hardware.
 

Never give up your dreams · 181 Posts · Discussion Starter #16
@Fadingz, again, thank you. I downloaded the PDF you got this information from and will study it :)

Accelerating instruction execution is only viable once the instructions are already properly emulated. CUDA could be useful for accelerating development processes and research, not execution.

System compatibility would also suffer. What about users of other discrete GPUs, like ATI's? Performance aside, the output couldn't even be guaranteed across hardware.
hmm... I do understand... but do you think this method could help solve the emulation speed problem? It doesn't need to be a complete solution, of course, but a 20% ~ 30% speed increase would be nice...
 

Premium Member · 6,071 Posts
Every time I see a post about Xbox 360 emu theory, the most important perspectives are missed... \/_\/

You guys ONLY focus on hardware and think almost nothing of the software aspect. Without that, you have nothing to go by. You can come up with a million and one ideas to emulate the hardware, or even manage to emulate the hardware perfectly, but if you insist on being clueless about the software side (i.e. BIOS, HDD format, Dashboard, .xex files, etc.) then you're wasting your time altogether, because these things don't magically come documented or available either. Has the .xex file format been properly documented yet? How does the BIOS boot sequence work? What dashboards have been dumped? These are some of the things you have to consider before you can even think of touching anything hardware-related.

Another thing: while I did enjoy scanning Fadingz's post on how the hardware interacts and what not (which is also important to know), what good is it if you don't have register-level information on the chipsets? Where are the devices located in the 64GB address space (MMIO), or what ports are they mapped to? Do you have any information on the exclusive instructions the 360's CPU has? Have you thought about how the GPU can directly access video memory (seriously, that's f@#%ing scary!!!) and how you'd possibly emulate that? Do you even know what you need to emulate the vector units, or what instruction set they're based on?? What else is there to emulating sound besides the SiS audio chip (what DSPs are we dealing with)? How does the gamepad interact on a hardware level? Those are just a few obstacles to think about in the beginning stages. You can know what the hardware specs are and still get nowhere fast if you don't know how it works on a byte level; knowing the hardware specs is only 20% of the battle at most. Of course, it's impossible to lay out all of the requirements in the beginning, but eventually you need to ask yourself how you are going to [eventually] document all this. Some things you'll need right from the start; other things you can eventually reverse engineer.

As for finding register-level documentation on these chipsets, some are easier to find than others. The GPU is based off of the ATI R600, and its specs are freely available, as are the USB 2.0 specs needed to emulate the gamepads and other input devices. There's even someone who's managed to write homebrew code to interface with the GPU on a low level, which can give you something to test on. The CPU is another story. It's common sense to find some docs on PPC64, but that's not enough to actually emulate it. Besides the fact that you have to emulate three of those suckers, what about the VUs? They use a special instruction set called AltiVec (IIRC). I searched the net for documentation on this, but haven't found s@#%. I did get my hands on a datasheet when I was working for Microsoft 2 years ago, but anything else I had access to (which wasn't very much) was closely guarded, and a breach of such information would not only get me fired but even blacklisted. Good and accurate information on the CPU is hard enough to come by.

And I don't even want to think about threading... argh!
 

Never give up your dreams · 181 Posts · Discussion Starter #20
hmmm... I see your point, BlueShogun. Thank you for the information and for the kinds of questions I have to ask myself before starting work on an emu... I know these things are really important... but let's not put the cart before the horse... I don't intend to build an Xbox 360 emulator right now; it was just an idea I had a year ago and wanted to share now... I don't know when I'm going to step into Xbox 360 emulation; perhaps I can help you with Xbox 1 emulation first... I don't know yet. Anyway, your reply was really useful. Thanks! :)
 