I'd like to share with you the result of nearly 300 hours spent trying to optimize Android and push it to its limits.
In general, you should already be experienced in setting up your buildbox, using git, building AOSP/CyanogenMod/OmniROM from source and cherry-picking things from review/gerrit. Being able to solve git conflicts would also be nice. If you don't know how to build your own ROM from source, this is not something you can apply to your ROM. Also, as you've probably noticed, this is not something you can apply to an already prebuilt ROM (stock), as these optimizations are applied during compilation, so only AOSP ROMs self-compiled from source can use this masterpiece.
So, what is it about? As we know, Android contains a bunch of low-level C/C++ code, which is compiled and acts as a backend for our Java frontend and Android apps. Unfortunately, Google didn't put much effort into optimization, so as a result we're still using the same old flags set back in 2006 for Android Donut or whatever existed back then. As you can guess, in 2006 we didn't have devices as powerful as today's; we had to sacrifice performance for smaller code size, to fit our little devices and run well on a very low amount of memory. However, this is no longer the case, and by using the newest compilers and setting the flags properly, we can achieve something great.
You may have heard of some developers claiming to use "O3 flags" in their ROMs. While this may be true, those flags are applied only to low-level ARM code, mostly used during kernel compilation. Additionally, they merely override the O2 flag, which is already fast, so as you may guess, this is more of a placebo effect, and it disappears right after you change the kernel. Take a look at the most cherry-picked "O3 Flags" commit. See the big "-Os" in "TARGET_thumb_CFLAGS"? This is what I'm talking about.
However, the commit I'm about to present is not a placebo, as it applies flags to everything that is compiled and, most importantly, to target THUMB, which is about 90% of Android.
Now I'll tell you some facts. We have three interesting optimization levels: Os, O2 and O3. O2 enables all optimizations that do not involve a space-speed tradeoff. Os is similar to O2, but it disables all flags that increase code size. It also performs further optimizations to reduce code size to the minimum. O3 enables all O2 optimizations plus some extra optimizations that tend to increase code size and may, or may not, increase performance. If you want to ask whether there's something more, like O4, there is: Ofast. However, it breaks the IEEE floating-point standard and doesn't work with Android, as e.g. sqlite3 is not compatible with Ofast's -ffast-math flag. So it's a no-go for us.
Now here comes the fun part. Android by default is compiled with the O2 flag for target ARM (about 10% of Android, mostly low-level parts) and the Os flag for target THUMB (about 90% of Android, nearly everything apart from the low-level parts). Some guys think that Os is better than O2 or O3 because smaller code is faster thanks to fitting better in the CPU cache. Unfortunately, my tests show that this is a myth. Os was good back in 2006, as only with this flag was Google able to compile Dalvik and its virtual machine while keeping a good amount of free memory and space on eMMC cards. As of now, we have plenty of space, plenty of RAM, plenty of CPU power and still the good old Os flag for 90% of Android.
I've run countless tests to find out what is the most efficient in terms of GCC optimization. Two selected tests are presented below.
As you may have noticed, I compiled the whetstone.c benchmark using three different optimization flags: Os, O2 and O3. I set the CPU governor to performance at maximum frequency, and I repeated each test two additional times, just to make sure that Android doesn't lie to me. The source code of this test is available here, so you can download it, compile it for our beloved Android and try it yourself. As you can see, O3 > O2 >> Os: Os performs about 2.5 times worse than O2, and about 3 times worse than O3.
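The benchmark procedure described above can be sketched roughly as follows. This is only a dry-run outline, not the exact commands used for the test: the compiler name (arm-linux-androideabi-gcc), the output file names and the -lm link flag are assumptions, and the script only prints the commands instead of running them. The resulting binaries would have to be run on the device itself.

```shell
#!/bin/sh
# CC and SRC are assumptions; the post only states that whetstone.c was
# built with -Os, -O2 and -O3 and that every test was run three times.
CC="${CC:-arm-linux-androideabi-gcc}"
SRC="${SRC:-whetstone.c}"
RUNS=3   # one run plus two repeats

# Print the compile command for one optimization level.
# -lm is assumed because whetstone uses math library functions.
build_cmd() {
    echo "$CC -$1 $SRC -o whetstone-$1 -lm"
}

for level in Os O2 O3; do
    build_cmd "$level"
    i=1
    while [ "$i" -le "$RUNS" ]; do
        echo "./whetstone-$level    # run $i of $RUNS"
        i=$((i + 1))
    done
done
```

Keeping everything except the optimization level identical is the point of the loop: any difference in the scores then comes from the flag alone.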
But of course, Android is not a freaking benchmark, it's an operating system. We can't tell whether things are getting better or worse from a simple benchmark alone. I kept that in mind and provided the community with JustArchi's Mysterious Builds for testing. I released both mysterious builds without telling my users what the mysterious change was. Both builds were compiled with the same toolchain, same version, same commits. The one and only mysterious change was that every component compiled as target THUMB (the major portion of Android) was optimized for speed (O3) in build #1 (experimental) and for size (Os) in build #2 (normal behaviour). Check the poll yourself: 9 votes for build #1 in terms of performance, and 1 vote for build #2. I decided that this, together with the benchmark, is enough to say that O2/O3 for target THUMB is something we want.
Now, it doesn't matter that much whether you use O2 or O3, but here is some comparison:
1. A kernel compiled with O2 is 4902 KB, with O3 4944 KB, so O3 is 42 KB bigger.
2. A ROM compiled with O3 is 3 MB larger than with O2 after zip compression. Quick overview: 97 binaries in /system/bin and 2 binaries in /system/xbin + 283 libraries in /system/lib and other files, about 400 files in total. 3 MB / 400 = 7.5 KB size increase per file.
3. It's unlikely that code working properly at the O2 level will break at the O3 level; most issues appear on the Os <-> O2 transition.
4. If it doesn't cause any issues, and speeds up a binary by a little bit, why not use it?
5. The only real reason not to use O3 is potentially higher memory usage due to larger binaries.
In general, I doubt that this extra chunk of code causes any significant memory usage or slower performance. I suggest using O3 if it doesn't cause you any issues compared to O2, but older devices may use O2 purely to save on code size, similar to what Google did back in 2006 with the Os flag.
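For anyone who wants to redo the size arithmetic from the comparison above with their own numbers, here is a small sketch. The figures are the ones quoted in the list; the variable names are made up for illustration, and 1 MB is treated as 1000 KB to match the 7.5 KB result.

```shell
#!/bin/sh
# Figures quoted in the comparison list.
kernel_o2_kb=4902
kernel_o3_kb=4944
rom_growth_mb=3
file_count=400   # 97 + 2 + 283 plus other files, rounded as in the post

# Integer arithmetic is enough for the kernel difference.
kernel_growth_kb=$((kernel_o3_kb - kernel_o2_kb))

# awk handles the fractional results.
kernel_growth_pct=$(awk -v a="$kernel_o2_kb" -v b="$kernel_o3_kb" \
    'BEGIN { printf "%.2f", (b - a) * 100 / a }')
per_file_kb=$(awk -v mb="$rom_growth_mb" -v n="$file_count" \
    'BEGIN { printf "%.1f", mb * 1000 / n }')

echo "Kernel growth: ${kernel_growth_kb} KB (${kernel_growth_pct}%)"
echo "Average ROM growth per file: ${per_file_kb} KB"
```

In relative terms the kernel grows by less than one percent, which is why the choice between O2 and O3 is mostly about behaviour, not size.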
Now let's get down to business.
Here is a list of important improvements:
- Optimized all instructions, both ARM and THUMB, even further for speed (-O3)
- Also optimized for speed the parts which are compiled with Clang (-O3)
- Turned off all debugging code (lack of -g)
- Eliminated redundant loads that come after stores to the same memory location, both partial and full redundancies (-fgcse-las)
- Ran a store motion pass after global common subexpression elimination. This pass attempts to move stores out of loops (-fgcse-sm)
- Enabled the identity transformation for graphite. For every SCoP we generate the polyhedral representation and transform it back to gimple. We can then check the costs or benefits of the GIMPLE -> GRAPHITE -> GIMPLE transformation. Some minimal optimizations are also performed by the code generator ISL, like index splitting and dead code elimination in loops (-fgraphite -fgraphite-identity)
- Performed interprocedural pointer analysis and interprocedural modification and reference analysis (-fipa-pta)
- Performed induction variable optimizations (strength reduction, induction variable merging and induction variable elimination) on trees (-fivopts)
- Didn't keep the frame pointer in a register for functions that don't need one. This avoids the instructions to save, set up and restore frame pointers; it also makes an extra register available in many functions (-fomit-frame-pointer)
- Attempted to avoid false dependencies in scheduled code by making use of registers left over after register allocation. This optimization most benefits processors with lots of registers (-frename-registers)
- Tried to reduce the number of symbolic address calculations by using shared “anchor” symbols to address nearby objects. This transformation can help to reduce the number of GOT entries and GOT accesses on some targets (-fsection-anchors)
- Performed tail duplication to enlarge superblock size. This transformation simplifies the control flow of the function allowing other optimizations to do a better job (-ftracer)
- Performed loop invariant motion on trees. It also moved operands of conditions that are invariant out of the loop, so that we can use just trivial invariantness analysis in loop unswitching. The pass also includes store motion (-ftree-loop-im)
- Created a canonical counter for number of iterations in loops for which determining number of iterations requires complicated analysis. Later optimizations then may determine the number easily (-ftree-loop-ivcanon)
- Assumed that loop indices do not overflow, and that loops with nontrivial exit condition are not infinite. This enables a wider range of loop optimizations even if the loop optimizer itself cannot prove that these assumptions are valid (-funsafe-loop-optimizations)
- Moved branches with loop invariant conditions out of the loop (-funswitch-loops)
- Constructed webs as commonly used for register allocation purposes and assigned each web individual pseudo register. This allows the register allocation pass to operate on pseudos directly, but also strengthens several other optimization passes, such as CSE, loop optimizer and trivial dead code remover (-fweb)
- Sorted the common symbols by alignment in descending order. This is to prevent gaps between symbols due to alignment constraints (-Wl,--sort-common)
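Since not every toolchain accepts every flag from the list above, it can help to probe your compiler before a full build. The following is a minimal sketch, not part of the commit: the CC variable and the fallback to a host gcc are assumptions, and the linker flag -Wl,--sort-common is left out because it only matters at link time. The graphite flags in particular may be rejected by toolchains built without ISL support.

```shell
#!/bin/sh
# Compiler flags exactly as named in the list above.
ARCHI_FLAGS="-O3 -fgcse-las -fgcse-sm -fgraphite -fgraphite-identity -fipa-pta -fivopts -fomit-frame-pointer -frename-registers -fsection-anchors -ftracer -ftree-loop-im -ftree-loop-ivcanon -funsafe-loop-optimizations -funswitch-loops -fweb"

# check_flags CC: compile a trivial C program once per flag and report
# whether the compiler accepted it.
check_flags() {
    cc="$1"
    for flag in $ARCHI_FLAGS; do
        if echo 'int main(void) { return 0; }' |
            "$cc" -Werror "$flag" -x c -c - -o /dev/null 2>/dev/null; then
            echo "ok       $flag"
        else
            echo "rejected $flag"
        fi
    done
}

# CC is an assumption: point it at the gcc binary of the toolchain you build with.
if command -v "${CC:-gcc}" >/dev/null 2>&1; then
    check_flags "${CC:-gcc}"
else
    echo "no compiler found; set CC to your toolchain's gcc"
fi
```

A flag reported as rejected here is a good first candidate to remove when the ROM build fails.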
However, please read my commit carefully before you decide to cherry-pick it. You must understand that Google's flags haven't been touched for 7 years and nobody can assure you that the new ones will work properly for your ROM and your device. You may want to experiment with them a bit to find out whether they cause conflicts or other issues.
I can assure you that my ArchiDroid based on CM compiles fine with the steps suggested in the commit itself. Just don't forget to clean ccache (rm -rf /home/youruser/.ccache or rm -rf /root/.ccache) and make clean/clobber.
You can use, modify and share my commit any way you want, just please keep proper credits in changelogs and in the repo itself. If you feel generous, you may also buy me a coke for the massive amount of hours put into those experiments.
Now go ahead and show your users how things should be done.
Android "Lollipop" (5.1.1 & 5.0.2 tested)
JustArchi's ArchiDroid Optimizations V4.1 for CyanogenMod (latest)
A set of commits you may want to pick to fix O3-related issues:
external_bluetooth_bluedroid | hardware_qcom_display | libcore | frameworks_av #1 | frameworks_av #2
Older entries are provided for reference only. I suggest using only latest commit above.
Android "Lollipop" (5.1.1 & 5.0.2 tested)
JustArchi's ArchiDroid Optimizations V4 for CyanogenMod
Android "Kitkat" 4.4.4:
JustArchi's ArchiDroid Optimizations V3 for CyanogenMod
JustArchi's ArchiDroid Optimizations V3 for OmniROM
JustArchi's ArchiDroid Optimizations V2
JustArchi's ArchiDroid Optimizations V1
AFTER applying above commit and AFTER EVERY CHANGE regarding flags, ALWAYS make clean/clobber AND empty ccache (rm -rf ~/.ccache)
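The clean-up step above can be wrapped in a small helper so it is never forgotten. This is just a sketch: the function names and the DRY_RUN switch are made up for illustration, and it defaults to printing the commands instead of running them. It honours CCACHE_DIR if you moved your cache away from ~/.ccache, and it is meant to be run from the source root directory.

```shell
#!/bin/sh
# run: print the command in dry-run mode (the default), execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# reset_build: empty the compiler cache and the build output so that
# changed flags actually take effect on the next build.
reset_build() {
    run rm -rf "${CCACHE_DIR:-$HOME/.ccache}"
    run make clobber
}

reset_build
```

Set DRY_RUN=0 once you are sure the printed paths are the ones you want to delete.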
Q: How to properly change toolchains used in local manifest?
Open .repo/local_manifests/roomservice.xml from your source root directory (or create it). Here is a sample manifest that replaces the default 4.8 toolchain (both eabi and androideabi) with 4.8 SaberMod and 4.9 ArchiToolchain:
<?xml version="1.0" encoding="UTF-8"?>
<manifest>
  <remove-project name="platform/prebuilts/gcc/linux-x86/arm/arm-eabi-4.8" />
  <project name="ArchiDroid/Toolchain" path="prebuilts/gcc/linux-x86/arm/arm-eabi-4.8" remote="github" revision="architoolchain-5.2-arm-linux-gnueabihf" />
  <remove-project name="platform/prebuilts/gcc/linux-x86/arm/arm-linux-androideabi-4.8" />
  <project name="ArchiDroid/Toolchain" path="prebuilts/gcc/linux-x86/arm/arm-linux-androideabi-4.8" remote="github" revision="uber-4.9-arm-linux-androideabi" />
</manifest>
Q: Compiler error:
(...)/prebuilts/gcc/linux-x86/arm/arm-linux-androideabi-4.8/bin/../libexec/gcc/arm-linux-androideabi/4.8.x-sabermod/cc1: error while loading shared libraries: libcloog-isl.so.4: cannot open shared object file: No such file or directory
apt-get install libcloog-isl4
(...)/prebuilts/gcc/linux-x86/arm/arm-linux-androideabi-4.8/bin/../libexec/gcc/arm-linux-androideabi/4.8.x-sabermod/cc1: error while loading shared libraries: libisl.so.13: cannot open shared object file: No such file or directory
Add the following entries to your /etc/apt/sources.list:
deb http://ftp.debian.org/debian testing main contrib non-free
deb-src http://ftp.debian.org/debian testing main contrib non-free
Issues below are for older commits and should be used for reference only
Kitkat THUMB O2+ errors?
These are the most common issues.
* Change -O3 flag from TARGET_thumb_CFLAGS back to -Os, make clean/clobber, empty ccache and try again. This fixes most of the issues.
* RIL problems on the Exynos 4210 family? Add -fno-tree-vectorize to TARGET_thumb_CFLAGS.
* Broken exFAT -> https://github.com/JustArchi/android...4bffccee650e0d
Errors caused by toolchain?
1. Try Google's GCC 4.8 if you used Linaro 4.8 or SaberMod 4.8
2. Fall back to Google's GCC 4.7 if the above didn't help (change TARGET_GCC_VERSION back to 4.7)
Errors caused by GCC 4.8+?
* ART Fix (bootloop) -> https://github.com/JustArchi/android...4443998d028407
* Not booting kernel -> https://github.com/JustArchi/android...e4bfb3cff64de9 and https://github.com/JustArchi/android...90528feb0c9bdd
Errors caused by GCC 4.9+?
* Graphical glitches in PlayStore -> https://github.com/JustArchi/android...d57f3982191662
Errors caused by Linaro?
* error: unknown CPU architecture -> https://github.com/JustArchi/android...85174baacecb03 (Keep in mind that this is a sample fix for the smdk4412 kernel; you may need a similar solution in your own case. Also, this error happens only with the Linaro toolchain, not with Google's GCC)
* error: undefined reference to 'memmove' -> https://github.com/XperiaSTE/android...2d8219c1e6807a
@IAmTheOneTheyCallNeo - For inspiration and first steps
@metalspring - For some nice commits
@sparksco - For SaberMod, some nice commits and support for the optimization idea