This post describes the main similarities and differences between Dalvik and Java bytecode. Understanding these differences is especially useful when you analyze the characteristics and possible malicious behavior of Android applications.
Android applications are usually written in Java and executed in the Dalvik Virtual Machine (DVM), which differs from the classic Java Virtual Machine (JVM). The DVM was developed by Google and optimized for the characteristics of mobile operating systems (especially the Android platform). The bytecode that runs in Dalvik is produced by the dx tool, which translates Java .class files. Unlike the DVM, the JVM executes plain Java class files. If you want to reverse engineer an Android application, you need to understand the Dalvik bytecode format and have in-depth knowledge of static and dynamic analysis. William Enck and his co-authors summarize the differences between JVM and DVM bytecode in "A Study of Android Application Security".
- Android application architecture
JVM bytecode consists of one or more .class files (each file contains one Java class). At runtime, the JVM dynamically loads the bytecode of each class from the corresponding .class file. Dalvik bytecode consists of a single .dex file that contains all classes of the application. The following figure shows the generation process of the .dex file. After the Java compiler creates the JVM bytecode, the Dalvik dx compiler consumes all .class files, recompiles them into Dalvik bytecode, and merges the result into one .dex file. This process includes the translation, reconstruction, and interpretation of three basic elements of the application: the constant pools, the class definitions, and the data segment. A constant pool describes all constants used by a class, including references to other classes, method names, and numeric constants. A class definition includes basic information such as access flags and the class name. The data segment contains the method code executed by the target VM, related information about methods (such as the number of DVM registers used, the local variable table, and the size of the operand stack), and information about class and instance variables.
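The two container formats are easy to tell apart by their first bytes: a .class file starts with the bytes CA FE BA BE, while a .dex file starts with the ASCII text "dex" followed by a newline and a version string. The following is a minimal sketch of such a check; the class name `BytecodeContainerSniffer` and its `sniff` method are hypothetical names chosen for illustration and are not part of any Android tool.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper that classifies a file by its leading magic bytes. */
class BytecodeContainerSniffer {
    // Java .class files begin with the four bytes CA FE BA BE.
    private static final byte[] CLASS_MAGIC =
            {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};
    // .dex files begin with the ASCII text "dex\n", followed by a version and a NUL byte.
    private static final byte[] DEX_MAGIC = "dex\n".getBytes(StandardCharsets.US_ASCII);

    /** Returns "class", "dex", or "unknown" depending on the leading bytes. */
    static String sniff(byte[] data) {
        if (startsWith(data, CLASS_MAGIC)) return "class";
        if (startsWith(data, DEX_MAGIC)) return "dex";
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) throws IOException {
        // Each argument is a path to a file to classify.
        for (String name : args) {
            byte[] bytes = Files.readAllBytes(Paths.get(name));
            System.out.println(name + ": " + sniff(bytes));
        }
    }
}
```

An unpacked application would typically yield one "dex" result for its single bytecode file, whereas a desktop Java build yields one "class" result per compiled class.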
- Register structure
The DVM is register-based, while the JVM is stack-based. In JVM bytecode, local variables are kept in the local variable table and pushed onto the operand stack before an opcode operates on them. The JVM can also work directly on the stack without explicitly storing values in the local variable table. In Dalvik bytecode, local variables are assigned to any of the 2^16 available registers. Dalvik opcodes do not access elements on a stack. Instead, they operate directly on the registers.
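To make the contrast concrete, here is a small sketch with a trivial method and, in comments, the instructions each VM would execute for it. The JVM listing is what `javap -c` prints for this method; the Dalvik listing is only approximate, since the exact register numbers and opcode variant depend on the compiler that produced the .dex file. The class name `RegisterVsStack` is made up for this example.

```java
class RegisterVsStack {
    // JVM (stack-based):
    //   iload_0    // push a from the local variable table onto the operand stack
    //   iload_1    // push b
    //   iadd       // pop both values, push a + b
    //   ireturn    // pop the result and return it
    //
    // Dalvik (register-based), approximately:
    //   add-int v0, v1, v2    // v0 = v1 + v2; the instruction names its registers
    //   return v0
    static int add(int a, int b) {
        return a + b;
    }

    public static void main(String[] args) {
        System.out.println(add(2, 3));
    }
}
```

The JVM version needs two load instructions just to move operands onto the stack. The Dalvik version has no such moves, but its single arithmetic instruction is wider because it encodes the destination and both source registers.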
- Instruction set
Dalvik has 218 opcodes, and its instruction set differs substantially from Java's 200 opcodes. For example, Java has more than a dozen opcodes for moving data between the operand stack and the local variable table, while Dalvik has none. Dalvik instructions tend to be longer than Java instructions because most of them encode their source and destination registers. For a comprehensive overview of Dalvik opcodes, see the write-ups by Gabor Paller and the Android developers.
- Constant pool structure
In JVM bytecode, constants such as the names of referenced methods are replicated across the constant pools of the individual .class files. The dx compiler eliminates much of this replication by providing a single constant pool that all classes in the .dex file reference. In addition, dx removes some constants by inlining their values directly into the bytecode. As a result, integer, long integer, and single- and double-precision floating-point constants disappear from the pool during dx compilation.
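The effect of merging per-class pools into one shared pool can be sketched with ordinary collections. This is only an illustration of the deduplication idea, not the actual dx algorithm; the class `SharedConstantPool`, its methods, and the sample entries are all hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SharedConstantPool {
    /** Sample data: two classes whose pools overlap in two entries. */
    static Map<String, List<String>> sample() {
        Map<String, List<String>> pools = new LinkedHashMap<>();
        pools.put("A", Arrays.asList("Ljava/lang/String;", "toString", "length"));
        pools.put("B", Arrays.asList("Ljava/lang/String;", "toString", "println"));
        return pools;
    }

    /** Total entries when every class keeps its own pool (the .class model). */
    static int perClassEntries(Map<String, List<String>> pools) {
        int total = 0;
        for (List<String> pool : pools.values()) {
            total += new LinkedHashSet<>(pool).size();
        }
        return total;
    }

    /** One deduplicated pool referenced by all classes (the .dex model). */
    static Set<String> merge(Map<String, List<String>> pools) {
        Set<String> shared = new LinkedHashSet<>();
        for (List<String> pool : pools.values()) {
            shared.addAll(pool);
        }
        return shared;
    }

    public static void main(String[] args) {
        Map<String, List<String>> pools = sample();
        System.out.println("separate pools: " + perClassEntries(pools));
        System.out.println("shared pool:    " + merge(pools).size());
    }
}
```

With the sample data, the two separate pools hold six entries in total, while the shared pool holds four, because the type descriptor and the method name `toString` are stored only once.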
- Ambiguous primitive type
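In JVM bytecode, the opcodes for integer constants differ from those for single-precision floating-point constants, and the opcodes for long integer constants differ from those for double-precision floating-point constants. Dalvik, in contrast, uses the same opcodes for integer and floating-point constants of the same width, so the opcode specifies only the size of the value, not its type.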
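A decompiler that sees a 32-bit literal in Dalvik bytecode therefore has to decide from context whether it is an int or a float. The sketch below shows why the bits alone are not enough: the same hypothetical literal `0x3F800000` is a valid value under both interpretations. The class and method names are made up for illustration.

```java
class UntypedConstant {
    // Hypothetical 32-bit literal as it might appear in an untyped constant load.
    static final int RAW = 0x3F800000;

    /** Interpretation 1: the bits are an int. */
    static int asInt(int raw) {
        return raw;
    }

    /** Interpretation 2: the same bits are an IEEE 754 single-precision float. */
    static float asFloat(int raw) {
        return Float.intBitsToFloat(raw);
    }

    public static void main(String[] args) {
        System.out.println("as int:   " + asInt(RAW));
        System.out.println("as float: " + asFloat(RAW));
    }
}
```

Read as an int, the literal is 1065353216; read as a float, it is 1.0. Only the way the register is used later (for example in an integer or a floating-point arithmetic instruction) reveals which one the programmer meant.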
- Null reference
Dalvik bytecode has no dedicated null type. Instead, Dalvik uses a constant with the value 0. A decompiler must therefore resolve the ambiguous constant 0 correctly, as either an integer or a null reference.
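The following sketch shows two methods whose constants are distinct in JVM bytecode but identical in Dalvik bytecode. The Dalvik instructions in the comments are approximate and depend on the compiler; the class name `NullOrZero` is hypothetical.

```java
class NullOrZero {
    // JVM:    iconst_0 ; ireturn
    // Dalvik: const/4 v0, 0x0 ; return v0
    static int zero() {
        return 0;
    }

    // JVM:    aconst_null ; areturn
    // Dalvik: const/4 v0, 0x0 ; return-object v0
    static Object nothing() {
        return null;
    }

    public static void main(String[] args) {
        System.out.println(zero() + " " + nothing());
    }
}
```

In Dalvik, the constant load is the same in both methods. Here the return instruction happens to reveal the type, but in general the decompiler has to follow the register to its uses to tell 0 from null.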
- Object reference
JVM bytecode uses dedicated, typed opcodes for comparing object references and for comparing a reference with null. Dalvik simplifies these into plain integer comparisons: a comparison of two values and a comparison of one value with zero. The decompilation process must therefore recover the type information of the compared operands.
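As a sketch, the four comparisons below map to four different opcode families in the JVM but only two in Dalvik. The opcode names in the comments are the standard ones from each instruction set; which member of each pair the compiler picks (for example `ifnull` or `ifnonnull`) depends on how it lays out the branch. The class and method names are hypothetical.

```java
class ReferenceComparison {
    // JVM: if_acmpeq / if_acmpne    Dalvik: if-eq / if-ne
    static boolean sameRef(Object a, Object b) {
        return a == b;
    }

    // JVM: if_icmpeq / if_icmpne    Dalvik: if-eq / if-ne (same as above)
    static boolean sameInt(int a, int b) {
        return a == b;
    }

    // JVM: ifnull / ifnonnull       Dalvik: if-eqz / if-nez
    static boolean isNull(Object o) {
        return o == null;
    }

    // JVM: ifeq / ifne              Dalvik: if-eqz / if-nez (same as above)
    static boolean isZero(int i) {
        return i == 0;
    }

    public static void main(String[] args) {
        Object o = new Object();
        System.out.println(sameRef(o, o) + " " + sameInt(1, 2) + " "
                + isNull(o) + " " + isZero(0));
    }
}
```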
- Storage of primitive type arrays
Dalvik uses ambiguous (untyped) opcodes to operate on arrays of primitive types, while the JVM uses defined, type-specific opcodes. The array type information must be recovered in order to convert the bytecode correctly.
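A short sketch of the array case: each method below reads one element, and the comments name the load opcode each VM uses. In the JVM the opcode identifies the element type, while Dalvik distinguishes only 32-bit from 64-bit elements for these four types. The class name `ArrayAccess` is hypothetical.

```java
class ArrayAccess {
    // JVM: iaload    Dalvik: aget
    static int first(int[] xs) {
        return xs[0];
    }

    // JVM: faload    Dalvik: aget (same opcode as for int[])
    static float first(float[] xs) {
        return xs[0];
    }

    // JVM: laload    Dalvik: aget-wide
    static long first(long[] xs) {
        return xs[0];
    }

    // JVM: daload    Dalvik: aget-wide (same opcode as for long[])
    static double first(double[] xs) {
        return xs[0];
    }

    public static void main(String[] args) {
        System.out.println(first(new int[]{7}) + " " + first(new float[]{1.5f}) + " "
                + first(new long[]{9L}) + " " + first(new double[]{2.5}));
    }
}
```

To translate an `aget` back into Java bytecode, a tool has to work out where the array register was created or how the loaded value is used, and then choose between `iaload` and `faload` accordingly.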