[COMMIT] [AOSP] JustArchi's ArchiDroid Optimizations - Increases performance up to 6x
Hello dear developers.
I'd like to share with you the result of nearly 200 hours spent trying to optimize Android and push it to its limits.
In general, you should already be experienced in setting up your buildbox, using git, building AOSP/CyanogenMod/OmniROM from source and cherry-picking things from review/gerrit. If you don't know how to build your own ROM from source, this is not something you can apply to your ROM. Also, as you have probably noticed, this is not something you can apply to every ROM: these optimizations are applied during compilation, so only AOSP ROMs self-compiled from source can use this masterpiece.
So, what is it about? As we know, Android contains a bunch of low-level C/C++ code, which is compiled and acts as a backend for our Java frontend and Android apps. Unfortunately, Google didn't put much effort into optimization, so as a result we're using the same old flags set back in 2006 for Android Donut or whatever existed back then. As you can guess, in 2006 we didn't have devices as powerful as we do now, so we had to sacrifice performance for smaller code size in order to fit on our little devices and run well with a very low amount of memory. However, this is no longer the case, and by using the newest compilers such as GCC 4.8 and setting flags properly, we can achieve something I call "Android in 2014".
You may have heard of some developers claiming to use "O3 flags" in their ROMs. While this may be true, those flags are applied only to low-level ARM code, mostly used during kernel compilation. Additionally, they only replace the O2 flag, which is already fast, so as you may guess, this is more likely a placebo effect that disappears right after you change the kernel. Take a look at the most cherry-picked "O3 Flags" commit. See the big "-Os" in "TARGET_thumb_CFLAGS"? This is what I'm talking about.
However, the commit I'm about to present is not a placebo, as it applies flags to everything that is compiled and, most importantly, to target THUMB, which is about 90% of Android.
Now I'll tell you some facts. We have three interesting optimization levels: Os, O2 and O3. O2 enables all optimizations that do not involve a space-speed tradeoff. Os is similar to O2, but it disables all flags that increase code size. It also performs further optimizations to reduce code size to the minimum. O3 enables all O2 optimizations plus some extra optimizations that tend to increase code size and may, or may not, increase performance. If you want to ask whether there's something more, like O4, there is: Ofast. However, it breaks IEEE floating-point compliance and doesn't work with Android, as e.g. sqlite3 is not compatible with Ofast's -ffast-math flag. So it's a no go for us.
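To give an idea of why -ffast-math can break code that relies on IEEE behaviour, here is a minimal C sketch. It is not taken from sqlite3 (the exact failure there is not described in this post), and the helper name is_nan_ieee is made up for illustration. Under IEEE rules NaN is the only value that compares unequal to itself. -ffast-math implies -ffinite-math-only, which allows GCC to assume NaNs never occur, so a check like this one may be optimized down to a constant 0.

```c
#include <math.h>

/* Hypothetical helper, for illustration only.
   Returns 1 if x is NaN, relying on IEEE 754 semantics:
   NaN is the only value that is not equal to itself.
   With -ffast-math (which implies -ffinite-math-only) the compiler
   may assume x is never NaN and fold this comparison to 0. */
int is_nan_ieee(double x)
{
    return x != x;
}
```

Built with -O2 or -O3 this behaves as expected. Any code that depends on checks of this kind is a candidate for silent breakage under -Ofast.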
Now here comes the fun part. By default, Android is compiled with the O2 flag for target ARM (about 10% of Android, mostly kernel) and the Os flag for target THUMB (about 90% of Android, nearly everything apart from the kernel). Some guys think that Os is better than O2 or O3 because smaller code is faster due to fitting better in the CPU cache. Unfortunately, I have proven that this is a myth. Os was good back in 2006, as only with this flag was Google able to compile Dalvik and its virtual machine while keeping a good amount of free memory and space on eMMC cards. As of now, we have plenty of space, plenty of RAM, plenty of CPU power and still the good old Os flag for 90% of Android.
Now you should ask: where is your proof? Here it is:
As you may have noticed, I compiled the whetstone.c benchmark using three different optimization flags: Os, O2 and O3. I repeated each test two more times, just to make sure that Android doesn't lie to me. The source code of this test is available here, and you may download it, compile it for our beloved Android and try it yourself. As you can see, O3 > O2 >> Os: Os performs about 2.5x worse than O2, and about 3.0x worse than O3.
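If you want to repeat this kind of comparison with your own code, a tiny timing harness is enough. The sketch below is not whetstone.c and does not reproduce its numbers. The functions workload and time_workload are hypothetical stand-ins, and the integer recurrence is just an arbitrary loop that the compiler cannot throw away.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical workload (not whetstone): a simple integer recurrence whose
   result depends on every iteration, so the loop cannot be dropped.
   Unsigned arithmetic wraps modulo 2^32, which is well defined in C. */
uint32_t workload(uint32_t iterations)
{
    uint32_t x = 1;
    for (uint32_t i = 0; i < iterations; i++) {
        x = x * 1664525u + 1013904223u;
    }
    return x;
}

/* Runs the workload once, stores its result in *result and returns the
   elapsed CPU time in seconds as measured by clock(). */
double time_workload(uint32_t iterations, uint32_t *result)
{
    clock_t start = clock();
    *result = workload(iterations);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Build the same file three times, once each with -Os, -O2 and -O3, pass the iteration count at run time (for example from argv) so it is not a compile-time constant, and compare the reported times. Run each build several times, as the results vary between runs.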
But of course, Android is not a freaking benchmark, it's an operating system. We can't tell if things are getting better or worse based on a simple benchmark. I kept that in mind and provided the community with JustArchi's Mysterious Builds for testing. I gave testers both builds without telling them what the mysterious change was. Both builds were compiled with the same toolchain, same version, same commits. The one and only mysterious change was that every component compiled as target THUMB (the major portion of Android) was optimized for speed (O3) in build #1 (experimental) and optimized for size (Os) in build #2 (normal behaviour). Check the poll yourself: 9 votes for build 1 in terms of performance, and 1 vote for build 2. I decided that this, together with the benchmark, is enough to say that O2/O3 for target THUMB is something we want.
Now the battle is: O2 or O3? This is a tough choice, so here are some facts:
1. The kernel compiled with O2 is 4902 KB, with O3 4944 KB, so O3 is 42 KB bigger.
2. The ROM compiled with O3 is 3 MB larger than with O2 after zip compression. Quick overview: 97 binaries in /system/bin and 2 binaries in /system/xbin + 283 libraries in /system/lib and other files, about 400 files in total. 3 MB / 400 ≈ 7.5 KB average size increase per file.
3. No issues
In general, I doubt that this extra chunk of code causes any significant increase in memory usage or any slowdown. I suggest using O3 if it doesn't cause you any issues compared to O2, but older devices may use O2 purely to save on code size, similar to the way Google did it back in 2006 using the Os flag.
Now let's get down to business
Here is a list of important improvements:
- Added missing Cortex-A9 CPU variant (-mcpu=cortex-a9)
- Disabled global workarounds for Cortex-A8, they're applied only when you're targeting a Cortex-A8 CPU now (-Wl,--fix-cortex-a8)
- Bumped GCC version to 4.8 from default 4.7, as it performs much better than default 4.7 and gives excellent results
- Optimized all instructions, both ARM and THUMB, even further for speed (-O3)
- Also optimized for speed the parts which are compiled with Clang (-O3)
- Turned off all debugging code (-DNDEBUG)
- Performed loop invariant motion on trees. It also moved operands of conditions that are invariant out of the loop, so that we can use just trivial invariantness analysis in loop unswitching. The pass also includes store motion (-ftree-loop-im)
- Created a canonical counter for number of iterations in loops for which determining number of iterations requires complicated analysis. Later optimizations then may determine the number easily (-ftree-loop-ivcanon)
- Performed induction variable optimizations (strength reduction, induction variable merging and induction variable elimination) on trees (-fivopts)
- Tried to reduce the number of symbolic address calculations by using shared “anchor” symbols to address nearby objects. This transformation can help to reduce the number of GOT entries and GOT accesses on some targets (-fsection-anchors)
- Assumed that loop indices do not overflow, and that loops with nontrivial exit condition are not infinite. This enables a wider range of loop optimizations even if the loop optimizer itself cannot prove that these assumptions are valid (-funsafe-loop-optimizations)
- Allowed the compiler to assume the strictest aliasing rules applicable to the language being compiled. For C (and C++), this activates optimizations based on the type of expressions. This is only applied to target ARM, nothing has been changed in this matter apart from more precision in warnings (-fstrict-aliasing)
- Placed each function and data item into its own section, this is required for -Wl,--gc-sections (-ffunction-sections -fdata-sections)
- Moved branches with loop invariant conditions out of the loop (-funswitch-loops)
- Attempted to avoid false dependencies in scheduled code by making use of registers left over after register allocation. This optimization most benefits processors with lots of registers (-frename-registers)
- Re-ran common subexpression elimination after loop optimizations are performed (-frerun-cse-after-loop)
- Didn't keep the frame pointer in a register for functions that don't need one. This avoids the instructions to save, set up and restore frame pointers; it also makes an extra register available in many functions (-fomit-frame-pointer)
- Made a redundant load elimination pass performed after reload. The purpose of this pass is to clean up redundant spilling (-fgcse-after-reload)
- Ran a store motion pass after global common subexpression elimination. This pass attempts to move stores out of loops (-fgcse-sm)
- Eliminated redundant loads that come after stores to the same memory location, both partial and full redundancies (-fgcse-las)
- Constructed webs as commonly used for register allocation purposes and assigned each web individual pseudo register. This allows the register allocation pass to operate on pseudos directly, but also strengthens several other optimization passes, such as CSE, loop optimizer and trivial dead code remover (-fweb)
- Performed tail duplication to enlarge superblock size. This transformation simplifies the control flow of the function allowing other optimizations to do a better job (-ftracer)
- Enabled optimization in the GNU linker, which significantly reduces launch time and memory usage. This is especially visible during the boot process, which is a few seconds faster than usual (-Wl,-O1)
- Applied special --as-needed flag to GNU linker. The flag tells the linker to link in the produced binary only the libraries containing symbols actually used by the binary itself. This not only improves startup times (as the loader does not have to load all the libraries for every step) but might avoid the full initialization of things, which we're not even physically able to use (-Wl,--as-needed)
- Performed global optimizations that become possible when the linker resolves addressing in the program, such as relaxing address modes and synthesizing new instructions in the output object file (-Wl,--relax)
- Sorted the common symbols by alignment in descending order. This is to prevent gaps between symbols due to alignment constraints (-Wl,--sort-common)
- Enabled garbage collection of unused input sections, thanks to -ffunction-sections and -fdata-sections (-Wl,--gc-sections)
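Some of the flags above let the compiler assume more about your code, and -fstrict-aliasing is a typical example. The C sketch below is a generic illustration, not code from Android or from the commit. The helper name float_bits_safe is hypothetical, and it assumes a 32-bit IEEE 754 float. It shows a pattern that stays correct at any optimization level, with the risky variant kept in a comment.

```c
#include <stdint.h>
#include <string.h>

/* Risky variant, shown only as a comment:
 *
 *     uint32_t bits = *(uint32_t *)&f;
 *
 * Reading a float object through a uint32_t pointer violates the C
 * aliasing rules. With -fstrict-aliasing the optimizer is allowed to
 * assume the two types never overlap, so such code may misbehave.
 */

/* Hypothetical helper: returns the raw bit pattern of a float.
   Copying the bytes with memcpy is well defined, and compilers
   typically reduce it to a single register move when optimizing. */
uint32_t float_bits_safe(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

If a component only misbehaves at higher optimization levels, patterns like the commented-out cast are a good place to start looking, although the post does not say which components, if any, are affected.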
Looks badass? It is badass. Head over to my ArchiDroid 2.X project and see for yourself how people react after switching to my ROM. Take a look at just one small example, or another one. No bullsh*t guys, this is the future.
However, please read my commit carefully before you decide to cherry-pick it. You must understand that Google's flags haven't been touched for 7 years, and nobody can assure you that the new ones will work properly for your ROM and your device. You may need to experiment with them a bit to find out whether they cause conflicts or other issues.
I can assure you that my OmniROM build compiles fine with some fixes mentioned in the commit itself. Just don't forget to clean ccache (rm -rf /home/youruser/.ccache or rm -rf /root/.ccache) and make clean/clobber.
You can use, modify and share my commit any way you want, just please keep proper credits in changelogs and in the repo itself. If you feel generous, you may also buy me a coke for the massive number of hours put into those experiments.
Now go ahead and show your users how things should be done
JustArchi's ArchiDroid Optimizations V3
Older entries are provided for reference only. I suggest using only latest commit above.
JustArchi's ArchiDroid Optimizations V2
JustArchi's ArchiDroid Optimizations V1
AFTER applying above commit and AFTER EVERY CHANGE regarding flags, ALWAYS make clean/clobber AND empty ccache (rm -rf ~/.ccache)
How to properly add 4.8 toolchain to your local manifest?
Open .repo/local_manifests/roomservice.xml in your source root directory (or create it) and add:
For Linaro 4.8:
<!-- LINARO TOOLCHAIN -->
<project name="JustArchi/Linaro" path="prebuilts/gcc/linux-x86/arm/arm-eabi-4.8" remote="github" revision="4.8-eabi" />
<project name="JustArchi/Linaro" path="prebuilts/gcc/linux-x86/arm/arm-linux-androideabi-4.8" remote="github" revision="4.8-androideabi" />
For Google's GCC 4.8:
<!-- GOOGLE GCC TOOLCHAIN -->
<project name="platform/prebuilts/gcc/linux-x86/arm/arm-eabi-4.8" path="prebuilts/gcc/linux-x86/arm/arm-eabi-4.8" remote="aosp" revision="master" />
<project name="platform/prebuilts/gcc/linux-x86/arm/arm-linux-androideabi-4.8" path="prebuilts/gcc/linux-x86/arm/arm-linux-androideabi-4.8" remote="aosp" revision="master" />
Example of ArchiDroid's roomservice.xml
Your repos should show up in the specified "path" after the next repo sync.
THUMB O2+ errors?
These are the most common issues.
* Change the -O3 flag in TARGET_thumb_CFLAGS back to -Os, make clean/clobber, empty ccache and try again. This fixes most of the issues.
* RIL problems on the Exynos 4210 family? Add -fno-tree-vectorize to TARGET_thumb_CFLAGS.
* Broken exFAT -> https://github.com/JustArchi/android...4bffccee650e0d
Errors caused by toolchain?
1. Try Google's GCC 4.8 if you used Linaro 4.8 or SaberMod 4.8
2. Fall back to Google's GCC 4.7 if the above didn't help (change TARGET_GCC_VERSION back to 4.7)
Errors caused by GCC 4.8?
* ART Fix (bootloop) -> https://github.com/JustArchi/android...4443998d028407
* Not booting kernel -> https://github.com/JustArchi/android...e4bfb3cff64de9
Errors caused by Linaro?
* error: unknown CPU architecture -> https://github.com/JustArchi/android...85174baacecb03
(Keep in mind that this is a sample fix for the smdk4412 kernel, you may need to use a similar solution in your own case. Also, this error happens only with the Linaro toolchain, it doesn't happen with Google's GCC)
* ROMs built with Linaro 4.8 may show some miscellaneous graphical glitches in the latest Google Play Store (4.8+). Other toolchains, including Google's GCC 4.8, are not affected.
* error: undefined reference to 'memmove' -> https://github.com/XperiaSTE/android...2d8219c1e6807a
- For inspiration and first steps
- For some nice commits