Hey ARM!
Quake 3 Arena code.
I pushed out the Quake 3 Arena code used for demoing limare and for benchmarking.
You can build it on your linux-sunxi with
make ARCH=arm
or, for the limare version (which will not build without the matching limare driver, which i haven't pushed out yet :))
make ARCH=arm USE_LIMARE=1
for the GLESv2 version (with the broken lamps due to missing alphatest):
make ARCH=arm USE_GLES2=1
Get a full Quake 3 Arena version first though, and stick all the paks in ~/ioquake3/baseq3. Add this to demofour.cfg in the same directory:
cg_drawfps 1
timedemo 1
set demodone "quit"
set demoloop1 "demo four; set nextdemo vstr demodone"
vstr demoloop1
To run the timedemo, run the quake binary with +exec demofour.cfg
For your own reverse engineering purposes, to build the GLESv1 version with logging included, edit code/egl/egl_glimp.c, and remove the // before:
//#define QGL_LOG_GL_CALLS 1
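The idea behind such a define is a macro wrapper that prints every GL call before dispatching it. The actual code in egl_glimp.c may well differ; here is just a minimal sketch of the pattern (QGL_CALL, fake_glClear and the call counter are illustrative names, not the real implementation):

```c
#include <stdio.h>

#define QGL_LOG_GL_CALLS 1

static int qgl_call_count; /* purely for demonstration */

#ifdef QGL_LOG_GL_CALLS
/* log the call name to stderr, then dispatch it */
#define QGL_CALL(func, ...) \
    do { \
        qgl_call_count++; \
        fprintf(stderr, "GL: %s\n", #func); \
        func(__VA_ARGS__); \
    } while (0)
#else
#define QGL_CALL(func, ...) func(__VA_ARGS__)
#endif

/* stand-in for a real GL entry point */
static void fake_glClear(int mask) { (void)mask; }
```

With logging enabled, every wrapped call leaves a line in the log, which is exactly the kind of data stream one replays for reverse engineering.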
But be aware, you are not free to spread that dumped data. That is id Software data, albeit in a raw form.
I'd be much obliged if anyone hacks up input support, or re-adds sound. Or even adds the missing GLES2 shaders (color as a uniform for instance). That would make this code playable from the console, and should make it easier for me to provide playable limare code.
As you all can see, we have nothing to hide. The relevant differences between GLES1 and limare are in the GL implementation layer. I did shortcut vertex counting to ease logging, but this has only a limited effect on CPU usage. The lower CPU usage of limare is not significant or interesting, as we do less checking than a full driver anyway. In simpler tests (rotating cubes), our scheduling results in a much higher CPU usage though (like 50% more, from 10% to 15% :)), even if we are not significantly faster. As said in my previous post, i am not sure yet whether to keep this, or to find some improvements. Further tests, on much more powerful hardware, will tell.
Connors Compiler Work.
Connor had a massive motivation boost from FOSDEM (and did not suffer from the bug that was going round and hit so many of us in the last week). Earl from zaReason is sending him an A13 tablet, which should spur him on even further.
He has been coding like a madman, and he is now close to being able to compile the relatively simple shaders used in the Quake 3 Arena demo. He still has to manually convert the ESSL of the vertex shader to his own GP_IR first though, but that's already massive progress which gets us very close to our goals.
I am going to add MBS loading (Mali Binary Shader format) to limare to be able to forgo the binary compiler and load pre-compiled shaders into our programs. Since MBS is also spit out by the open-gpu-tools, we can then distribute our own compiled MBS files directly, and provide a fully open source Q3A implementation on mali.
How cool is that!
The very near future.
With my post purely about Q3A and its numbers, the reactions were rather strange. It seems a lot of people got hung up exclusively on us being only 2% faster, or on us using this "ancient" game. The blog entry itself explained fully why this ancient game was actually a very good choice, yet only very few read it. Very few realized what a massive milestone an almost pixel-perfect Quake 3 Arena is for a reverse engineered driver.
As for performance... When i started investigating the mali, i had postulated that we would be happy to have only 75% of the performance of the binary driver. I assumed that, even with performance per watt being mighty important in the mobile space, 75% was the threshold at which the advantages of an open source driver would outweigh the loss of performance. This would then mean that only ARM's big partners would end up shipping ARM's own binaries. And for projects like CyanogenMod and proper linux distributions, there would be no question about what to ship.
With Q3A, and with the various rotating cubes, we now have proven that we can have a 100% match in performance. Sometimes we can even beat performance. All of this is general, no Q3A specific tricks here!
This is absolutely unique, and is beyond even the wildest dreams of a developer of any reverse engineered driver.
Absolutely nothing stops us now from delivering an open source driver that broadly matches the binary driver in performance! And this is exactly what we will be doing next!
Hey ARM!
We are not going away, we are here to stay. We cannot be silenced or stopped anymore, and we are becoming harder and harder to ignore.
It is only a matter of time before we produce an open source graphics driver stack which rivals your binary in performance. And that time is measured in weeks and months now. The requests from your own customers, for support for this open source stack, will only grow louder and louder.
So please, stop fighting us. Embrace us. Work with us. Your customers and shareholders will love you for it.
-- libv.
Chipmunk bindings for MonoTouch
[image: (c) S. Delcroix 2013]
At this point in time, the ~2000 manually crafted lines of code can be found
What's missing?
Mainly queries, plus a few function calls here and there.
Integrate with cocos2d
Memory management
For more on this, or something totally different, contact me, I'm open for contracting.
Use Skype 4.10 with openSUSE 12.3 x86_64
I’m testing PulseAudio 3.0 on openSUSE 12.3 RC1 and it might be helpful to hang this information out here where Google can find it:
To use Skype with openSUSE 12.3, you need to download the Skype package for openSUSE. If you’re using a 64 bit machine/install, like most of us nowadays, you also need some 32 bit compatibility packages, included with openSUSE.
- Download Skype from http://www.skype.com/en/download-skype/skype-for-linux/
- Select openSUSE 12.1 32 bit (the most recent openSUSE version for which they offer Skype at time of writing) and save the package
- As root,
zypper in skype-4.1.0.20-suse.i586.rpm alsa-plugins-pulse-32bit
to install the package, its 32 bit requirements, and the 32 bit ALSA plugin for PulseAudio it also needs, but doesn’t/can’t specify automatically.
Happy calling.
Blog Moved
KDE Project:
I've moved my blog to http://llunak.blogspot.com. As I no longer focus on KDE (and even previously my blogs were often more about openSUSE than KDE), I've moved to a more generic blog site. KDE-related posts will still be aggregated on Planet KDE.
KDE SC 4.10 packages for openSUSE
As of today KDE SC 4.10 final packages are available for openSUSE 12.2 and Factory users. The new KR410 repo got built and you can replace your KR49 repos with it. KDE SC 4.10 will be part of openSUSE 12.3 and all minor updates for that KDE release will be shipped via the official update channel. Of course KDF does also contain KDE SC 4.10 final.
Currently there are no major bugs known. As already mentioned before, kio_sysinfo got replaced by kinfocenter because kio_sysinfo for openSUSE is unmaintained.
It is recommended to run nepomukcleaner to get rid of all legacy data. Beware though that it might take a long time. If you have nothing valuable in your nepomuk database, it is probably quicker to just remove its data from ~/.kde4/share/apps/nepomuk and start from scratch.
Another thing to note: if nepomuk crashes while indexing, you might encounter an almost two-year-old Qt bug which, according to the report, Qt devs are reluctant to fix. The problem with this bug is that after the crash virtuoso-t goes crazy. Thus you have to restart nepomuk via KDE’s systemsettings to make it behave again.
Please report packaging issues to the opensuse-kde mailinglist or #opensuse-kde on IRC. Bugs not specific to openSUSE should be reported upstream at KDE’s bug tracker.
KDE Platform, Workspaces and Applications 4.10 available for openSUSE
Hot on the heels of the announcement from KDE, the openSUSE KDE team is happy to announce the availability of packages for the latest stable release of the KDE Platform, Workspaces, and Applications.
Packages are available in the KDE:Distro:Factory repository (which is where the packages to land in 12.3 are tested) for openSUSE Factory (soon to be 12.3) and openSUSE 12.2, and soon (when the Open Build System finishes rebuilding a number of packages) in the KDE:Release:410 repository for openSUSE 12.2 users.
If you want to contribute and help KDE packaging in openSUSE, use the KDE:Distro:Factory version, otherwise stick to the KDE:Release:410 repository.
Enjoy!
Quake 3 Arena timedemo on top of the lima driver!
Let me get straight to the main point before delving into details: We now have a limare (our proto/research driver) port of Quake 3 Arena which is running the q3a timedemo 2% faster than the binary driver. With 3% less cpu overhead than the binary driver to boot!
Here is the timedemo video up on youtube. It is almost pixel-perfect, with just a few rounding errors introduced due to us being forced to use a slightly different vertex shader (ESSL, pulled through the binary compiler instead of a hand coded shader). We have the exact same tearing as the binary drivers, which are also not synced to display on the linux-sunxi kernel (but ever so slightly more tearing than the original ;)).
This Q3A port is not playable for a few reasons. One is, i threw out the touchscreen input support, but never hacked in the standard SDL based input, so we have no input today. It should be easy to add though. Secondly, i only include the shaders that are needed for running the timedemo. The full game (especially its cut scenes) requires a few more shaders, which are even simpler than the ones currently included. I also need to implement the equivalent of glTexSubImage2D, as that is used by the cut scenes. So, yes, it is not playable today, but it should be easy to change that :)
We are also not fully open source yet, as we are still using the binary shader compiler. Even after begging extensively, Connor was not willing to "waste time" on hand coding the few shaders needed. He has the necessary knowledge to do so though. So who knows, maybe when i push the code out (the q3a tree is a breeze to clean, but the lima code is a mess, again), he might still give us the few shaders that we need, and we might even gain a few promille performance points still :)
I will first be pushing out the q3a code, so that others can use the dumping code from it for their own GPU reverse engineering projects. The limare code is another hackish mess again (but not as bad as last time round), so cleaning that up will take a bit longer than cleaning up q3a.
Why frag like it is 1999?
Until now, i was mostly grabbing, replaying, and then porting, EGL/GLES2 programs that were specifically written for reverse engineering the mali. These were written by an absolute openGL/openGLES newbie, someone called libv. These tests ended up targeting very specific but far too limited things, and had very little in common with real world usage of the GPU. As most of the basic things were known for mali, it was high time to step things up a level.
So what real world OpenGL(ES) application does one pick then?
Quake 3 Arena of course. The demo four timedemo was the perfect next step for reverse engineering our GPU.
This 1999 first person shooter was very kindly open sourced by id Software in 2005. Oliver McFadden later provided an openGLES1 port of ioquake3 for the Nokia N900. With the Mali binary providing an OpenGLES1 emulation library, it was relatively easy to get a version going which runs on the Mali binary drivers. Thank you Oliver, you will be missed.
The Q3A engine was written for fixed 3D pipelines, and this has some very profound consequences. First, it limited the dependency on the shader compiler and allowed me to focus almost purely on the command stream. This completely fits with the main strategy of our reverse engineering project, namely it being 2 almost completely separate projects in one (command stream versus shader compilers). Secondly, and this was a nice surprise when i started looking at captures, the mali OpenGLES1 implementation had some very hardware specific optimizations that one could never expose with OpenGLES2 directly. Q3A ended up being vastly more educational than I had expected it to be.
With Q3A we also have a good benchmark, allowing us to get a better insight into performance for the first time. And on top of all of that, we get a good visual experience and it is a dead-certain crowdpleaser (and it was, thanks for the cheers guys :))
The only downside is that the data needed to run demo four is not available with the q3a demo release and is therefore not freely downloadable. Luckily you can still find Q3A CDs on eBay, and i have heard that Steam users can easily download it from there.
The long story
After linuxtag, where i demoed the rotating companion cube, I assumed that my knowledge about the mali was advanced enough that bringing up Q3A would take only a few weeks. But as these things usually go, with work and real life getting in the way, it never pans out like that. January 17th is when i first had q3a working correctly, with enough time left to worry about some optimization before FOSDEM, but only just enough.
I started with an android device and the kwaak3 "app", which is just Oliver's port with some androidiness added. I captured some frames to find out what i still missed with limare. When i finally had some time available, i first spent it cleaning up the linuxtag code, which i pushed out early December. I had already brought up Q3A on linux-sunxi with the mali binary drivers, as can be seen from the video i then published on youtube.
One thing about the youtube video though... Oliver had a tiny error in his code, one that possibly never showed up on the N900. In his version of the texture loading code, the lightmaps' original format would end up being RGB whereas the destination format is RGBA. This difference in format, and the in-driver conversion, is not supported by the openGLES standard. This made the mali driver refuse to load the texture, which later on had the driver use only the primary texture, even though a second set of texture coordinates was attached to the command stream. The vertex shader did not reflect this, and in my openGL newbieness i assumed that Ben and Connor had a bug in their vertex shader disassembler. You can clearly see the flat walls in the video i posted. Once i fixed the bug though, q3a suddenly looked a lot more appealing.
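Since openGLES will not do that conversion in the driver, the fix is to expand the lightmap data to RGBA on the CPU before handing it to glTexImage2D. A minimal sketch of such a conversion (illustrative only, not the actual ioquake3 code):

```c
/* Expand tightly packed RGB pixels to RGBA, forcing alpha to opaque.
 * GLES rejects glTexImage2D calls where the client format (RGB) does
 * not match the internal format (RGBA), so this has to happen on the
 * CPU before upload. */
void rgb_to_rgba(const unsigned char *src, unsigned char *dst, int npixels)
{
    int i;

    for (i = 0; i < npixels; i++) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = 0xFF; /* lightmaps are opaque */
    }
}
```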
I then started with turning the openGLES1 support code in Quake's GLimp layer into a dumper of openGLES1 commands and data, in a way that made it easy to replay individual frames. Then i chose some interesting frames, replayed them, turned them into a GLES2 equivalent (which is not always fully possible, alphaFunc comes to mind), and then improved limare until it ran the given frames nicely through (the mali has hw alphaFunc, so limare is able to do this directly too). Rinse and repeat, over several interesting frames.
By the evening of January the 16th, i felt that i knew enough to attempt to write a GLimp for limare. This is exactly when my father decided to give me a call. Some have met him at Le Paon last Friday, when he, to my surprise, joined us for a beer after work as his office is not far away. He remarked that i seemed "a bit on edge" when he called on the 16th. Yes, i indeed was, and how could i be anything else at a time like this :) I hacked all night, as at the time i was living purely at night anyway, and minutes before my girlfriend woke up i gave it my first shot. Crash, a stupid bug in my code. I told my girlfriend that i wouldn't join her for "breakfast" before i went to bed, as i was simply way too close. By the time she left for work, i was able to run until the first few in-game frames, when the rendering would hang, with the mali only coming back several seconds later. After a bit of trying around, i gave the GP (vertex shader) a bit more space for its tile heap. This time it ran for about 800 frames before the same thing happened. I doubled the tile-heap again, and it ran all the way through!
The evening before i had hoped that i would get about 20fps out of this hardware. This already was a pretty cocky and arrogant guess, as the binary driver ran this demo at about 47.3fps, but i felt confident that the hardware had little to hide. And then the demo ran through, and produced a number.
30.5fps
Way beyond my wildest dreams. Almost 65% of the performance of the binary driver. Un-be-liev-ab-le. And this was with plain sequential job handling. Start a GP job, wait for it to finish, then start the PP job, wait for it to finish, then flip. 30.5fps still! Madness!
I had two weeks left until FOSDEM, so i had a choice: either add input support and invite someone from the public to come and play before the audience, or optimize until we beat the binary driver. The framerate of the first pass decided that; optimization it was. I had a good benchmark, only a third of the performance still needed to be found, and most of the corners for that extra performance were known.
My first optimization was to tackle the PP polygon list block access pattern. During my previous talk at FOSDEM, i explained that this was the only bit I found that might be IP encumbered. In the meantime, over the weekly beers with Michael Matz, the SuSE Labs toolchain lead, i had learned that there is a thing called the "Hilbert space-filling curve". Thanks Matz, that was worth about 2.2fps. I benchmarked another few patterns: two level hilbert (inside the plb block, and out), and the non-rotated hilbert pattern that is used for the textures. None would give us the same performance as the hilbert curve.
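For those who have never run into it: the Hilbert curve visits every cell of a 2^n by 2^n grid while only ever stepping to an adjacent cell, which keeps consecutively visited polygon list blocks close together in memory. The usual distance-to-coordinate conversion is tiny; here is a sketch in C (the textbook formulation, not limare's actual PLB code):

```c
/* rotate/flip a quadrant appropriately */
static void hilbert_rot(int n, int *x, int *y, int rx, int ry)
{
    if (ry == 0) {
        if (rx == 1) {
            *x = n - 1 - *x;
            *y = n - 1 - *y;
        }
        /* swap x and y */
        int t = *x;
        *x = *y;
        *y = t;
    }
}

/* convert a distance d along the curve to grid coordinates (x, y);
 * n is the side length of the grid and must be a power of two */
void hilbert_d2xy(int n, int d, int *x, int *y)
{
    int rx, ry, s, t = d;

    *x = *y = 0;
    for (s = 1; s < n; s *= 2) {
        rx = 1 & (t / 2);
        ry = 1 & (t ^ rx);
        hilbert_rot(s, x, y, rx, ry);
        *x += s * rx;
        *y += s * ry;
        t /= 4;
    }
}
```

Walking d from 0 to n*n - 1 then gives the block order.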
Building with -O3 then gave us another 1.5fps. Passing vec2s between the shaders gave us 0.3fps. It was time to put in proper interleaved job handling. With the help of Marcus Meissner (the SuSE Security lead), an ioctl struct sizing issue was found for the job wait thread. This fixed the reliability issues with threading on the r3p0 kernel of linux-sunxi. (ARM! Stable kernel interfaces now!) But thanks Marcus, as proper threading and interleaved job handling put me at 40.7 fps!
And then i got stuck. I only had 40.7fps and knew nothing that could account for such a big gap in performance. I tried a few things left and right, but nothing... I then decided to port q3a to GLES2 (with the loss of alphafunc and buggered up lamps as a result) to see whether our issue was with the compiled versus hand-coded shader. But I quickly ran into an issue with multi-texture program state tracking, which was curious, as the lima code was logically the same. Once this was fixed the GLES2 port ran at about 47.6fps, slightly faster than GLES1, which i think might be because of the lack of alphafunc.
Immediately after that i ported the multi-texture state tracking fix to the limare GLimp, but i sadly got no change in framerate out of it. Strangely, it seemed like there was no multitexturing going on, as my debugging printfs were not being triggered. I then noticed the flag for telling Q3A that our GL implementation supports multitexturing. Bang. 46.7fps. I simply couldn't believe how stupid that was. If that had been correct on the first run, i would've hit above 75% of the framerate, how insane would that have been :)
For the final 1.5fps, which put us at 48.2fps, i added a third frame, this while only rendering out to two framebuffers. Job done!
Adding a fourth frame did not improve numbers, and i left some minute cpu usage and memory usage optimizations untouched. We are faster than the binary driver, while employing no tricks. We know what we need to know about this chip and there is nothing left to prove with Q3A performance.
The numbers.
The fact that we are slightly faster is actually normal. We do not have to adhere to the OpenGLES standard, so we can do without a lot of the checking that a proper driver normally needs to do. This is why the goal was not to match the binary driver's performance, but to beat it, which is exactly what we achieved. From some less PP- and CPU-bound programs, like the spinning cubes, it does seem that we are more aggressive with scheduling though.
Now let's look at some numbers. Here is the end of the timedemo log for the binary driver, on an Allwinner A10 (single cortex a8, at 1GHz), with a Mali-400MP1 at 320MHz, rendering to a 1024x600 LCD, with framerate printing enabled:
THEINDIGO^7 hit the fraglimit.
marty^7 was melted by THEINDIGO^7's plasmagun
1260 frames 27.3 seconds 46.2 fps 10.0/21.6/50.0/5.6 ms
----- CL_Shutdown -----
RE_Shutdown( 1 )
-----------------------
And here is the end of the timedemo log for the limare port:
THEINDIGO^7 hit the fraglimit.
marty^7 was melted by THEINDIGO^7's plasmagun
]64f in 1.313632s: 48.719887 fps (1280 at 39.473158 fps)
1260 frames 26.7 seconds 47.2 fps 9.0/21.2/74.0/5.6 ms
----- CL_Shutdown -----
RE_Shutdown( 1 )
]Max frame memory used: 2731/4096kB
Auxiliary memory used: 13846/16384kB
Total jobs time: 32.723190 seconds
GP job time: 2.075425 seconds
PP job time: 39.921429 seconds
-----------------------
Looking at the numbers from the limare driver, my two render threads are seriously overcommitted on the fragment shader (PP). We really are fully fragment shader bound, which is not surprising, as we only have a single fragment shader. Our GP is sitting idle most of the time.
It does seem promising for a quad core mali though. I will now get myself a quad-core A9 SoC, and put that one through its paces. My feeling is that there we will either hit a wall with memory bandwidth or with the CPU, as q3a is single threaded. Limare does not yet support multiple fragment shaders, so the last remaining big unknown will get solved there too.
Another interesting number is the maximum frame time. 50.0ms for the binary driver, versus 74.0ms for limare. My theory there is that i am scheduling differently than the original driver and that we get hit by us overcommitting the fragment shader. Wait and see whether this difference in scheduling will improve or worsen the numbers on the potentially 4 times faster SoC. We will not be context switching anymore with our render threads, and we will no longer be limited by the fragment shader. This should then decide whether another scheme should be picked or not.
Once we fix up the Allwinner A10 display engine, and can reliably sync to refresh rate, this difference in job scheduling should become mostly irrelevant.
The star: the mali by falanx.
In the previous section i was mostly talking about the strategy of scheduling GP and PP jobs, of which there tends to be one of each per frame. Performance optimization is a very high level problem on the mali, which is a luxury. On mali we do not need to bother with highly specific command queue patterns which most optimally use the available resources, which then ends up being SoC and board specific. We are as fast as the original driver without any trickery, and this has absolutely nothing to do with my supposed ability as a hacker. The credit fully goes to the design of the mali. There is simply no random madness with the mali. This chip makes sense.
The mali is the correct mix of the sane and the insane. All the real optimization is baked into the hardware design. The vertex shader is that insane for a reason. There is none of that "We can fix it in software" bullshit going on. The mali just is this fast. And after 20 months of throwing things at the mali, i still have not succeeded in getting the mali to hard or soft lockup the machine. Absolutely amazing.
When i was pretty much the only open source graphics developer who was pushing display support and modesetting forwards, I often had to hear that modesetting was easy, and that 3d was insane. The mali proves this absolutely wrong. Modesetting is a very complex problem to solve correctly, with an almost endless set of combinations, and it requires very good insight and the ability to properly structure things. If you fail to structure correctly, you have absolutely no chance of satisfying 99.9% of your users; you'll be lucky if you satisfy 60%. Compared to modesetting, 3D is clearly delineated, and it is a vastly more tractable and manageable problem... Provided that your hardware is sane.
The end of the 90s was an absolute bloodbath for graphics hardware vendors, with just a few, suddenly big, companies surviving. That's exactly when a few Norwegian demo-sceners at the Trondheim University decided that they would do 3D vastly better than those survivors, and they formed a company to do so, called Falanx. It must've seemed like suicide, and I am very certain that pretty much everybody declared them absolutely insane (like Engadget did). Now, 12 years later, seeing what came out of that, I must say that I have to agree. Falanx was indeed insane, but it was that special kind of insanity that we call pure genius.
You crazy demo-sceners. You rock, and let this Q3a port be my salute to you.
Our business app doesn't need your game development skills (using damped springs to place labels on a map)
The problem
Positioning labels on a map, a chart, or a plan is a problem every UI developer faces at one time or another, if he has the chance to work on rich business apps. This problem is NP-hard for non-trivial cases. I faced that issue twice: first while building a silverlight charting library, and then very recently for positioning labels around a pin on a plan. This post isn't about how we solved those two cases. For the charting problem, we went for an algorithmic solution which was working but not optimal. For the second, I don't know how they solved it, as I didn't take the job.
Solution by Physics system
But the other night, as I contributed Chipmunk bindings to monotouch, one of the replies I got was roughly "I'm not into gaming stuffs, I don't really care. but it looks nice". That got me thinking. First, I'm not a game developer[1][2] either, and second, you probably should care. One of the most elegant solutions I was able to come up with for solving a label-positioning problem uses recipes normally used by game developers. It involves a physics engine, some bodies and some springs.
Let's say every label you want to position is a physics body with a mass (the same for all the labels) and a shape (the rectangle bounding box is a good approximation). Let's say then that all those bodies are attached with a (damped) spring to the pin. If we run a physics engine on this setup, let the springs pull, and let the shapes collide, it'll stabilise itself to a local optimum. With some luck, or the right amount of initial entropy, you might even reach the global optimum, but that's not important. Multiple pins with labels close to each other? Check. A lot of labels? Check. Restrict the labels to a zone, or avoid some others? Check, just add constraints or shapes to help the collision algorithms.
Implementation
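To get a feel for why this stabilises, here is a stripped-down, 1-D toy version of a damped spring pulling a body towards a pin, integrated with plain explicit Euler. This is only a sketch of the principle; Chipmunk does all of this properly, in 2-D and with collisions:

```c
#include <math.h>

typedef struct {
    double pos;
    double vel;
} body_t;

/* one integration step for a body tied to a pin by a damped spring */
void spring_step(body_t *b, double pin, double rest_length,
                 double stiffness, double damping, double mass, double dt)
{
    double dist = b->pos - pin;
    double dir = (dist >= 0.0) ? 1.0 : -1.0;
    double stretch = fabs(dist) - rest_length; /* > 0: spring pulls in */
    double force = -stiffness * stretch * dir - damping * b->vel;

    b->vel += force / mass * dt; /* explicit Euler */
    b->pos += b->vel * dt;
}
```

Run enough steps and pos settles at rest_length away from the pin with zero velocity — the same mechanism that pushes each label into a stable spot around its pin.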
But then, how hard is this to implement? If you have a physics engine you can use, this is probably shorter and cleaner to implement than any other method (except the obvious place-at-random-retry-if-it-overlaps). Here's my implementation. It uses monotouch and the chipmunk bindings.
What next
Now you can play with the spring parameters and the number of iterations to get the results you expect. You can also use ellipses instead of rectangular bounding boxes; expect smoother and more natural placement in this case. If you have a lot of labels, use different restLength values for the labels you want close to the pin and the ones you want a bit farther. A more complex system will take more time (more iterations) to stabilise, so adapt the number of steps, but in any case keep the dt (time delta) parameter low; experience shows good results between 1/60 and 1/20.
Want more than this from me? Contact me, I'm open for contracting.
[1] I don't like categorisation. Nor being categorised. I'm not a UI dev, I'm not a C# dev, I'm not a mobile dev. I'm not even a developer. I have a bagful of hats and wear the ones that get the job done.
[2] But still, I wrote and tested most of the cocos2d and chipmunk bindings for mono touch, I ported a few games to those platforms, and very recently wrote my first game with my 6 y.o. in a day (more on this later).
My openSUSE 12 Journal 11: VirtualBox
At the heart of these changes is the perceived lack of openness & transparency of commercial juggernauts shepherding open source projects. Another example is LibreOffice, being a fork of OpenOffice a few years back, and is now the default in openSUSE distributions.
VirtualBox was originally from Innotek GmbH, which was acquired by Sun Microsystems Inc. in February 2008, which in turn got acquired by Oracle Corporation in January 2010. VirtualBox is not shipped as a default on openSUSE, but you can install it very easily because the binaries are available in the default online repositories.
Personally, I think VirtualBox is the 'BEST' virtualization software for the desktop. I would go with KVM or even Xen for enterprise server virtualization. However, for virtualizing Windows or Linux on a desktop for quick testing purposes, I'll pick VirtualBox anytime for its ease of use & free of cost attributes.
Easy Install Method
Secure Boot on openSUSE talk at FOSDEM cancelled
For those of you who are attending FOSDEM this year and were planning to attend my talk about Secure Boot on openSUSE on Sunday: I'm sorry to announce that I had to cancel my travel to Brussels (and my talk) for family reasons.
Since my slides were already written, I thought I could still share them with you. Feel free to ask questions / leave comments on this blog post.


