Comparison of Dalvik and Java Bytecode

This blog post describes the major similarities and differences between Dalvik and Java bytecode. Knowing how the two differ is essential for understanding the characteristics of Android applications and for analyzing malicious behavior.

Android applications are usually written in the Java language and executed in the Dalvik Virtual Machine (DVM), which differs from the classical Java Virtual Machine (JVM). The DVM was developed by Google and is optimized for the characteristics of mobile operating systems, especially the Android platform. The bytecode that runs in Dalvik is produced by translating traditional JVM bytecode into the dex format: the conversion tool dx takes Java .class files as input. The JVM, in contrast, runs plain Java class files. If you want to reverse engineer an Android application, you need to understand the Dalvik bytecode format, and you also need solid knowledge of static and dynamic instrumentation. William Enck et al. summed up the differences between JVM and DVM bytecode in their paper “A Study of Android Application Security” as follows:

1. Architecture of the Android Application

JVM bytecode is composed of one or more .class files, each of which contains one Java class. At run time, the JVM dynamically loads the bytecode for each class from the corresponding .class file. Dalvik bytecode, by contrast, is composed of a single .dex file that contains all the classes of the application. The accompanying figure shows the generation process of the .dex file. After the Java compiler has created the JVM bytecode, the Dalvik dx compiler consumes the .class files, recompiles them to Dalvik bytecode, and then merges the result into one .dex file. This process includes translation, reconstruction and interpretation of the basic elements of the application (constant pool, class definitions and data segments). The constant pool describes all the constants, including references, method names, and numeric constants. Class definitions include access flags, class names, and so on. The data segments include all the function code executed by the target VM, related information about classes and functions (e.g. the number of registers used by the DVM, the local variable table, the operand stack size), and instance variables.
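The two container formats are easy to tell apart by their leading bytes. The following is a minimal sketch, assuming you already have the first bytes of a file in memory; the class name `ContainerKind` and the method `classify` are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

public class ContainerKind {
    /** Classifies a file by its leading magic bytes. */
    public static String classify(byte[] header) {
        // Java class files start with the 4-byte magic 0xCAFEBABE.
        if (header.length >= 4
                && (header[0] & 0xFF) == 0xCA && (header[1] & 0xFF) == 0xFE
                && (header[2] & 0xFF) == 0xBA && (header[3] & 0xFF) == 0xBE) {
            return "class";
        }
        // DEX files start with "dex\n", then a three-digit version and a NUL byte.
        if (header.length >= 8
                && header[0] == 'd' && header[1] == 'e'
                && header[2] == 'x' && header[3] == '\n'
                && header[7] == 0) {
            return "dex";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        byte[] classHeader = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};
        byte[] dexHeader = "dex\n035\0".getBytes(StandardCharsets.US_ASCII);
        System.out.println(classify(classHeader));
        System.out.println(classify(dexHeader));
    }
}
```

An application built for the JVM would yield many files of the first kind, while an Android package would typically contain a single file of the second kind.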

2. Register Structure

The DVM is register-based, while the JVM is stack-based. In JVM bytecode, local variables are kept in the local variable table and pushed onto the operand stack before opcodes manipulate them. The JVM can also work directly on the stack without explicitly storing values in the local variable table. In Dalvik bytecode, local variables are assigned to any of the 2^16 (65,536) available registers. Dalvik opcodes do not access elements on a stack. Instead, they operate directly on the registers.
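To make the difference concrete, here is a small illustrative method with the bytecode that each toolchain would typically produce for it shown in comments. The class name `AddExample` is made up for this sketch, the Dalvik listing uses smali-style register names, and the exact output depends on the compiler version.

```java
public class AddExample {
    // Java source for a simple static method.
    public static int add(int a, int b) {
        return a + b;
    }

    // Stack-based JVM bytecode, typically:
    //   iload_0    // push local variable 0 (a) onto the operand stack
    //   iload_1    // push local variable 1 (b)
    //   iadd       // pop both values, push their sum
    //   ireturn    // pop the sum and return it
    //
    // Register-based Dalvik bytecode, typically:
    //   add-int v0, p0, p1    // v0 = p0 + p1, operands named directly
    //   return v0

    public static void main(String[] args) {
        System.out.println(add(2, 3));
    }
}
```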

3. Instruction Set

Dalvik has 218 opcodes, which differ substantially from the 200 opcodes in Java. For instance, Java has a dozen opcodes for transferring data between the stack and the local variable table, and these have no counterpart in Dalvik. Instructions in Dalvik are longer than in Java, since most of them contain the source and destination registers. For a comprehensive overview of Dalvik opcodes, see the blog post by Gabor Paller and the Android Developers documentation.
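The trade-off between instruction count and instruction length can be seen by encoding the statement `c = a + b` by hand. This is a sketch under the assumption that a, b and c are int values in local slots (or registers) 0, 1 and 2; the class name `InstructionSize` is hypothetical.

```java
public class InstructionSize {
    // JVM encoding: four instructions of one byte each.
    public static final byte[] JVM_CODE = {
        0x1a,        // iload_0
        0x1b,        // iload_1
        0x60,        // iadd
        0x3d         // istore_2
    };

    // Dalvik encoding of "add-int v2, v0, v1": one instruction that spans
    // two 16-bit code units, laid out as opcode, destination, source, source.
    public static final byte[] DALVIK_CODE = {
        (byte) 0x90, // add-int opcode
        0x02,        // destination register v2
        0x00,        // first source register v0
        0x01         // second source register v1
    };

    public static void main(String[] args) {
        System.out.println("JVM:    4 instructions, " + JVM_CODE.length + " bytes");
        System.out.println("Dalvik: 1 instruction,  " + DALVIK_CODE.length + " bytes");
    }
}
```

In this case the total size is the same, but Dalvik needs a single, longer instruction where the JVM needs four short ones.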

4. Constant Pool Structure

JVM bytecode duplicates constant pool entries across the .class files, e.g. the names of referenced functions. By providing a single constant pool that all classes in Dalvik reference, the dx compiler eliminates this duplication. Moreover, dx removes some constants by inlining their values directly into the bytecode. As a result, integer, long integer, and single- and double-precision floating-point constants disappear from the constant pool during dx compilation.
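The effect of a shared pool can be sketched with a toy model in which each constant pool is just a list of strings. This is not how dx is implemented; the class `SharedPool` and its methods are hypothetical and only illustrate the deduplication idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SharedPool {
    /** Total entries when every class keeps its own pool, duplicates included. */
    public static int perClassTotal(List<List<String>> pools) {
        int total = 0;
        for (List<String> pool : pools) {
            total += pool.size();
        }
        return total;
    }

    /** Merges all per-class pools into one pool that stores each constant once. */
    public static List<String> merge(List<List<String>> pools) {
        Set<String> shared = new LinkedHashSet<>();
        for (List<String> pool : pools) {
            shared.addAll(pool);
        }
        return new ArrayList<>(shared);
    }

    public static void main(String[] args) {
        List<List<String>> pools = Arrays.asList(
            Arrays.asList("java/lang/Object", "toString", "Foo"),
            Arrays.asList("java/lang/Object", "toString", "Bar"));
        System.out.println("separate pools: " + perClassTotal(pools) + " entries");
        System.out.println("shared pool:    " + merge(pools).size() + " entries");
    }
}
```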

5. Ambiguous Primitive Types

In JVM bytecode, integer and single-precision float constants use different opcodes, as do long integer and double-precision float constants. Dalvik, by contrast, uses the same opcodes for integer and float constants of the same width, so its constant opcodes are untyped beyond their precision.
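The ambiguity is easy to reproduce: one 32-bit pattern loaded by an untyped constant instruction can be read either as an int or as a float. The sketch below assumes such a raw value has been extracted from the bytecode; the class `UntypedConstant` and its methods are hypothetical helpers.

```java
public class UntypedConstant {
    /** Reads the raw 32-bit register value as an int. */
    public static int asInt(int raw) {
        return raw;
    }

    /** Reads the same raw 32-bit register value as a float. */
    public static float asFloat(int raw) {
        return Float.intBitsToFloat(raw);
    }

    public static void main(String[] args) {
        int raw = 0x3f800000; // bits carried by an untyped Dalvik constant
        System.out.println("as int:   " + asInt(raw));
        System.out.println("as float: " + asFloat(raw));
    }
}
```

A decompiler has to look at how the register is used later (for example, by an integer or a float arithmetic opcode) to decide which reading is the intended one.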

6. Null References

Dalvik bytecode does not have a specific null type. Instead, Dalvik uses a constant with the value 0. A constant 0 in Dalvik bytecode is therefore ambiguous: it may be a numeric zero or a null reference, and its type must be recovered from context.
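The two methods below illustrate the point. The bytecode in the comments is what the toolchains would typically emit; the class name `ZeroOrNull` is made up for this sketch and the exact listing may vary by compiler version.

```java
public class ZeroOrNull {
    // JVM:    aconst_null; areturn
    // Dalvik: const/4 v0, 0x0; return-object v0
    public static Object nullRef() {
        return null;
    }

    // JVM:    iconst_0; ireturn
    // Dalvik: const/4 v0, 0x0; return v0
    public static int zero() {
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(nullRef());
        System.out.println(zero());
    }
}
```

In the Dalvik listings the constant load is identical in both methods. Only the later use of the register (here `return-object` versus `return`) reveals whether the 0 was a null reference or an integer.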

7. Object References

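JVM bytecode uses dedicated, typed opcodes for comparing two object references and for comparing an object reference against null. Dalvik reuses its simpler integer comparison opcodes for both purposes: a comparison between two registers and a comparison of one register against zero. Thus, the type information of the compared values must be recovered during decompilation.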
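As a rough illustration, the following methods are annotated with the comparison opcodes each format typically uses. The class name `RefCompare` is hypothetical, and whether the compiler picks the "equal" or the "not equal" variant depends on how it lays out the branches.

```java
public class RefCompare {
    // JVM:    ifnull / ifnonnull   (typed: object reference vs. null)
    // Dalvik: if-eqz / if-nez      (same opcodes as for integers)
    public static boolean isNull(Object o) {
        return o == null;
    }

    // JVM:    if_acmpeq / if_acmpne   (typed: two object references)
    // Dalvik: if-eq / if-ne           (same opcodes as for integers)
    public static boolean same(Object a, Object b) {
        return a == b;
    }

    // JVM:    ifeq / ifne
    // Dalvik: if-eqz / if-nez
    public static boolean isZero(int i) {
        return i == 0;
    }

    public static void main(String[] args) {
        Object x = new Object();
        System.out.println(isNull(null) + " " + same(x, x) + " " + isZero(0));
    }
}
```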

8. Storage of Primitive Types in Arrays

Dalvik uses ambiguous, untyped opcodes to operate on arrays of primitive types, while the JVM uses typed, unambiguous ones. The array type information must therefore be recovered for a correct translation.
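The overloads below sketch the difference for array loads, with the typical opcode of each format noted in the comments. The class name `ArrayLoad` is made up for illustration.

```java
public class ArrayLoad {
    // JVM: iaload    Dalvik: aget
    public static int first(int[] a) {
        return a[0];
    }

    // JVM: faload    Dalvik: aget (same opcode as for int[])
    public static float first(float[] a) {
        return a[0];
    }

    // JVM: laload    Dalvik: aget-wide
    public static long first(long[] a) {
        return a[0];
    }

    // JVM: daload    Dalvik: aget-wide (same opcode as for long[])
    public static double first(double[] a) {
        return a[0];
    }

    public static void main(String[] args) {
        System.out.println(first(new int[] {7}));
        System.out.println(first(new float[] {1.5f}));
    }
}
```

Given only an `aget`, a decompiler cannot tell whether the array holds int or float values, so it has to trace where the array reference came from.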