Author of this article: Lazyyuuuuu

1. Background

App launch is the user's first point of contact with an app and directly shapes their first impression of it. Cloud Music is an app with a history of nearly 10 years. As its businesses grew and complex scenarios piled up, different businesses and requirements kept adding code to the startup path, which posed great challenges to the app's startup performance. As the Cloud Music user base expanded and usage deepened, more and more users reported that startup was slow, and slow startup can even reduce users' willingness to stay. The Cloud Music iOS app therefore urgently needed a dedicated optimization of startup performance.

2. Analysis

2.1 Definition of startup

As is well known, starting with iOS 13 Apple fully replaced the previous dyld2 with dyld3 [1] and introduced the concept of a launch closure in dyld3. The launch closure is created when the app is downloaded or updated, when the system is updated, or when the app is started for the first time after the phone restarts. The meaning of a cold start therefore differs before and after iOS 13.

Before iOS 13:
  • Cold start: before the user taps the app, no process for the app exists in the system; the user taps the app and the system creates a process to launch it;
  • Hot start: after a cold start, the user sends the app to the background, the app process still exists in the system, and the user taps the app to return to it;

iOS 13 and later:
  • Cold start: after the phone restarts, the system holds no cached information for the app process; the user taps the app and the system creates a process to launch it;
  • Hot start: the user kills the app process, but cached information for the app process still exists in the system; the user taps the app and the system creates a process to launch it;
  • Return to foreground: after the app has launched, the user sends it to the background, the app process still exists in the system, and the user taps the app to return to it;

In Cloud Music's startup optimization work, the cold start as defined for iOS 13 and later is always used as the baseline. Both the startup time measured from the user's perspective and the startup time measured with App Launch in Instruments are taken after restarting the phone.

2.2 Definition of Cold Start

Generally, an iOS cold start is defined as the span from the user tapping the app icon until the first frame has been rendered after the launch image has completely disappeared. The whole process can be divided into two stages:

  • T1 stage: everything before main(). The system creates the app process, loads the Mach-O file into memory and creates the launch closure; dyld then performs loading, symbol binding, initialization and so on, and finally jumps to main().
  • T2 stage: from the jump into main(), through the creation of the app's UI scenes and the related delegate life cycle methods, until the first frame of the first screen has been rendered.

The overall process is shown in the following figure:

All timings in this article were measured on an iPhone 8 Plus running iOS 14.3, in Debug mode.

2.3 Process of cold start

From the definition of cold start, the whole process divides into two stages, T1 and T2. In each stage the iOS system performs its own processing and code calls at different nodes, so each stage is analyzed and optimized separately.

The T1 stage startup process is shown in the following figure:

As the figure above shows, in the T1 stage the system is doing the initialization work needed to run the app, so what we can do is minimize our impact on that work. Looking at the whole flow, the steps that follow the launch closure can be governed and optimized in a targeted way: dynamic library loading, rebase & bind, Objc Init, +load, and static initializers.

The T2 stage startup process is shown in the following figure:

As the figure above shows, the T2 stage consists almost entirely of business-side code. This is where crash reporting, app configuration, A/B test data, location, event tracking, network initialization, container prewarming, and second- or third-party SDK initialization tend to get stuffed, so optimizing this stage has a relatively high ROI.

2.4 Status Quo of Cloud Music

Cloud Music launched in 2013 and has accumulated nearly 10 years of business development and code. During that period startup performance received relatively little attention or governance. Besides music playback, Cloud Music also integrates live streaming, karaoke (K song) and other services, so the code on the startup path is fairly complex overall. Moreover, because of the particular nature of Cloud Music's splash-ad business, the author found after starting the optimization project that the red screen shown at startup consists of two parts: the app's ordinary launch screen and a fake red screen. The entire startup process is shown in the figure below:

2.4.1 Analysis of various situations in T1 stage

dynamic library

From WWDC 2022 [2] we know that the number of dynamic libraries in an app affects the duration of the entire T1 stage. We therefore need to know, first, how much the current dynamic libraries cost in T1 and, second, which of them contribute and can be optimized. Setting the DYLD_PRINT_STATISTICS environment variable in Xcode gives a rough picture of the time spent on all dynamic libraries in the T1 stage, as shown in the following figure:

As the Xcode output shows, dynamic library loading accounts for a large share of the total pre-main time. Unpacking Cloud Music's production IPA showed as many as 16 dynamic libraries in the Frameworks directory.

+load method

iOS developers are familiar with +load: it offers a very early hook for running basic configuration code, class registration code, or method swizzling code. For exactly that reason, over continuous business iteration anyone who wanted an earlier hook reached for +load, so the project ended up with far too many +load methods and startup performance suffered badly. Cloud Music has this problem too. Let's look at how the usage of +load was analyzed.

Classes and categories that implement +load are recorded at compile time in two sections of the Mach-O __DATA segment, __objc_nlclslist and __objc_nlcatlist. We can therefore use getsectbynamefromheader to fish out every class and category that defines +load, as shown in the following figure:

<img src="https://p6.music.126.net/obj/wonDlsKUwrLClGjCm8Kx/15973148132/0272/54f6/0d23/032c4b2c3a67bb09656c3c46d91dfa31.png">

Once we know every class and category that defines +load, we want to know how long each +load takes so that the most expensive ones can be optimized first. The natural approach is to hook the +load methods, and to hook all of them the hook has to be installed as early as possible. The best moment is the +load of a dynamic library of our own, provided that this library is the first dynamic library loaded, as shown in the following figure:

Because the Cloud Music project is componentized with Cocoapods, it is enough to create a pod whose name starts with AAA, such as AAAHookLoad, and add it to the Podfile so that its dynamic library is loaded first; the open source library A4LoadMeasure [3] is a useful reference. For a single-project setup any name works: just move the library to the first position under Build Phases => Link Binary With Libraries in the project settings, as shown in the following figure:

After hooking +load, we found nearly 800 +load calls in the Cloud Music project, costing more than 550 ms in total. Indiscriminate use of +load clearly has a large impact on overall startup performance.
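The measurement itself is simple once a hook is in place. As a minimal sketch in plain C (this is not the Objective-C runtime hook used in the project; `timed_entry`, `run_timed` and the two sample initializers are hypothetical names), each initializer is called through a wrapper that reads a monotonic clock before and after the call:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for clock_gettime under strict -std modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Stand-in for a +load implementation: a function taking no arguments. */
typedef void (*init_fn)(void);

typedef struct {
    const char *name;   /* label for the report, e.g. a class name */
    init_fn     fn;     /* the initializer to measure */
    double      millis; /* duration, filled in by run_timed() */
} timed_entry;

/* Current monotonic time in milliseconds. */
static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

/* Runs every entry in order, records its duration, returns the total. */
static double run_timed(timed_entry *entries, size_t count) {
    assert(entries != NULL || count == 0);
    double total = 0.0;
    for (size_t i = 0; i < count; ++i) {
        double start = now_ms();
        entries[i].fn();
        entries[i].millis = now_ms() - start;
        total += entries[i].millis;
    }
    return total;
}

/* Two placeholder initializers, present only to exercise the wrapper. */
static int g_calls = 0;
static void sample_load_a(void) { ++g_calls; }
static void sample_load_b(void) { ++g_calls; }
```

Sorting the entries by `millis` afterwards gives the kind of ranking used to pick which +load methods to optimize first.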

static initializer

Within a single binary, the static initializer stage begins once the +load methods have run. An app written mainly in Objective-C usually contains little static initializer code, although some underlying libraries may use it. The following kinds of code cause static initialization:

  • C/C++ constructor functions marked __attribute__((constructor)), such as:

    __attribute__((constructor)) static void test() {
      NSLog(@"test");
    }
  • C++ static global variables of non-primitive types, such as:

    class Test1 {
      static const std::string testStr1;
    };
    const std::string testStr2 = "test";
    static Test1 test1;
  • Global variables that need to be initialized at runtime, such as:

    bool test2 () {
      NSLog(@"is a test func");
      return false;
    }
    bool g_testFlag = test2();

In fact, the initialization of any global variable whose value cannot be determined at compile time can be considered to happen at this stage.

To analyze static initializers, note that the __mod_init_func section of the Mach-O __DATA segment stores the addresses of the initialization functions. As with +load, hooking the corresponding function pointers is enough to measure how long each function takes. The Cloud Music project has few static initializer functions and their cost is not significant, so they were not a focus.

Impact of Page In

When the user taps the app, the system creates a process and allocates a region of virtual memory for it. Virtual memory has to be mapped to physical memory: when the process accesses a virtual page that has no physical page mapped yet, a page fault is raised and a Page In is triggered. During this process an I/O operation reads data from disk into the physical page. If the page belongs to the __TEXT segment it also has to be decrypted, and the system verifies the signature of the decrypted page. If Page In happens frequently during startup, the I/O, decryption and verification it causes can therefore add up to a significant cost. Note that Apple optimized this process in iOS 13 and later, where decryption is no longer needed during Page In.

Page In can be analyzed with the System Trace tool in Instruments: find the Main Thread and select the Summary:Virtual Memory option. The File Backed Page In row shown there is the corresponding page fault data. The data shows that Page In is not a bottleneck for Cloud Music, as shown in the following figure:

2.4.2 Analysis of the T2 stage

The T2 stage is mainly the method execution after main(). Two tools can be used to analyze it: one hooks objc_msgSend and outputs a flame graph, the other is the App Launch tool in Instruments provided by Apple, which analyzes the entire startup process. With these two tools we can find what needs optimizing from details such as the timeline, the method call stacks, and the execution state of different threads.

The flame graph was invented by Linux performance expert Brendan Gregg. Unlike other profiling methods, a flame graph shows the time distribution from a global perspective, listing every call stack that may be causing a performance bottleneck.

Hook objc_msgSend to generate flame graph

Objective-C is a dynamic language: at runtime, Objective-C method calls go through objc_msgSend, which looks up the function pointer from the receiver and the method's selector. So we only need to hook objc_msgSend and add timing code before and after the call to the original method to obtain the method name and its duration. fishhook is the usual first thought for hooking objc_msgSend, but because objc_msgSend is implemented in assembly, a fishhook-based hook has to save and restore the register state itself. The HookZz [4] library can also hook objc_msgSend and is more convenient than fishhook.

Here we use the open source library appletrace [5] to profile methods through objc_msgSend and generate the flame graph, as shown in the following figure:

Analysis of App Launch tools in Instruments

Analyzing the flame graph data together with actual debugging showed that the per-method durations on the flame graph are not especially accurate and carry some error, but the relative proportions still reflect each method's weight in the T2 stage. The flame graph also shows only the timeline and method call stacks of the startup path: the state across threads is not intuitive enough, the performance of C/C++ functions is not captured, and the individual stages are not clearly delineated. This is where the App Launch tool in Instruments is needed for a second pass.

Instruments, which ships with Xcode, contains a series of analysis tools. After a recording, App Launch displays each stage of the launch path in detail; with the stages separated into intervals, the main-thread bottlenecks of each stage and the state of the other threads are easy to find, as shown in the following figure:

2.4.3 Current Situation of Advertising Business

As mentioned above, Cloud Music shows a fake red screen, and it is produced by the advertising business. After consulting colleagues from the advertising team, we learned that ads on the Cloud Music client are requested in real time and displayed immediately. The fake red screen is therefore shown before the request is sent and disappears only once the interface data has returned, after which either the ad is displayed or the app goes to the home page. The request is made in real time because the advertising business has to pull live ads from external ad networks and then distribute them according to business rules. Given network fluctuation and response latency, the advertising business has a relatively large impact on startup performance. The overall process is shown in the following figure:

3. Practice

3.1 T1 Stage Governance

3.1.1 Dynamic Library Governance

A growing number of dynamic libraries not only lengthens the time the system needs to create the launch closure but also lengthens the dynamic library loading phase. Apple officially recommends keeping the number of dynamic libraries within 6; Cloud Music currently has 16, which shows how much pressure there is. Dynamic libraries can be governed mainly in the following ways:

  • Convert dynamic libraries to static libraries: the recommended approach, which also reduces package size;
  • Merge dynamic libraries: hard to carry out in practice, because the libraries come from second- and third-party providers and several parties would have to cooperate;
  • Lazy-load dynamic libraries: the benefit is obvious, but every business team has to adapt its code and a unified entry point is needed;

Cloud Music favors converting dynamic libraries to static libraries, which suits the long-term development of an app better. During the conversion we found that many of the dynamic libraries use OpenSSL while libraries already in the project also use OpenSSL; linking them statically leads to symbol conflicts, which is why they had been made dynamic libraries. To deal with this, the first step is to find the libraries whose OpenSSL symbols conflict, and the second is to unify the OpenSSL version across the whole project.

Find the cause of OpenSSL symbol conflict

After integrating the OpenSSL static library and converting one dynamic library to a static library, some symbols were linked incorrectly, which caused a crash at runtime. The symbol involved was _RC4_set_key, and the LinkMap showed that _RC4_set_key had been linked to the copy inside one of the company's internal second-party SDKs.

Open the LinkMap.txt file and search for the _RC4_set_key symbol; the file index in front of it is 2333, as shown below:

Then look up that index in the Object files block at the top of the LinkMap; it turns out to be the Yunxin IM SDK, as shown in the following figure:
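For a large LinkMap this lookup can also be scripted. The following is a minimal sketch, assuming each row of the # Symbols: table has the form address, size, [file index], name; the helper name `file_index_for_symbol` is made up for illustration and is not part of any tool mentioned here:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parses one row of the "# Symbols:" table, assumed to look like
 *   0x100ABCDE0<TAB>0x00000040<TAB>[2333] _RC4_set_key
 * Returns the file index if the row names `symbol`, otherwise -1. */
static int file_index_for_symbol(const char *line, const char *symbol) {
    assert(line != NULL && symbol != NULL);
    unsigned long long address = 0, size = 0;
    int index = -1;
    char name[512];
    /* %511[^\n] keeps spaces, so names such as "-[Foo bar]" stay intact. */
    if (sscanf(line, "%llx %llx [%d] %511[^\n]",
               &address, &size, &index, name) != 4) {
        return -1; /* header, comment, or a row from another table */
    }
    return strcmp(name, symbol) == 0 ? index : -1;
}
```

Feeding the file to this function line by line and stopping at the first non-negative result gives the index to look up in the Object files block.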

The Cloud Music project depends on 4 Yunxin dynamic libraries. Checking the symbols of all 4 showed that two of them depend on OpenSSL. The next step is to make the OpenSSL symbols link correctly against Cloud Music's own OpenSSL library.

Troubleshooting OpenSSL symbol linking issues

Inspecting the project configuration showed that the order in which OpenSSL symbols are linked follows the order in Other Linker Flags, which in turn comes from the order of OTHER_LDFLAGS in the xcconfig files that Cocoapods generates for the Pods. Modifying the order of OTHER_LDFLAGS in the xcconfig confirmed that the OpenSSL linking problem is then resolved. This suggests two ways to solve the OpenSSL symbol linking problem:

  • Modify the Podfile so that whitelisted libraries are linked first at the linking stage;
  • Hide the OpenSSL symbols in every library other than the OpenSSL library itself;

Considering long-term development and to avoid hidden linking hazards later, we chose the second approach: Yunxin hides the symbols of third-party libraries when it exports its own libraries.

With the OpenSSL symbols unified, we converted the four related dynamic libraries to static libraries and also removed one dynamic library that was no longer used. Three libraries remain as a long-term goal because of ffmpeg-related symbol conflicts and their wide usage, and a dependency on a Thunder network library is the next optimization target. So far 5 dynamic libraries have been optimized away, for a gain of about 200 ms.

3.1.2 +load method governance

In principle +load should not be used during development, and many large companies ban it outright once they have coding standards in place. The drawbacks of +load are as follows:

  • +load runs very early, before the crash detection SDK has finished initializing; if code in +load goes wrong, the SDK cannot capture the problem;
  • The order in which +load methods are called depends on the link order of the corresponding files. If one +load registers a service and another +load tries to obtain it, the registering +load may not have run yet;
  • All +load code runs on the main thread, so every +load in the app adds to total startup time. Since +load can be added to any business class at will, an unintentional addition during business development may increase startup time significantly;
  • From the Page In perspective, running a +load requires loading not only the +load symbol but also the symbols it executes, which adds unnecessary cost;

The +load methods were optimized mainly in the following ways:

  • Remove unnecessary code;
  • Defer the code in +load until after the main-thread startup work or after the home page is displayed;
  • Have underlying libraries expose a dedicated initialization API and initialize them in one place;
  • Lazy-load business code behind its interfaces;
  • Move the code to +initialize. Note that a category's +initialize overrides the main class's +initialize, and that +initialize can run multiple times when subclasses exist, so dispatch_once is needed to make sure the code executes only once;

A detailed review of what the +load methods in Cloud Music are used for showed that many underlying libraries either perform registration inside +load through macros or expose only a registration interface, which callers then choose to invoke from +load. We reworked the registration mechanism of several libraries on the principle of decentralized registration with centralized, unified initialization. This unifies the registration timing and gives better control over business callers, paving the way for future monitoring. Decentralized registration uses the attribute feature to write the structured data into a designated section of the __DATA segment at compile time:

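// Places a variable in the named section of the __DATA segment and keeps it from being stripped.
#define _MODULE_DATA_SECT(sectname) __attribute__((used, section("__DATA," sectname)))
#define _ModuleEntrySectionName   "_ModuleSection"
// One registration record: the name of the class to register.
typedef struct {
    const char *className;
} _ModuleRegisterEntry;
// Emits a static record for className into the section at compile time.
#define __ModuleRegisterInternal(className) \
static _ModuleRegisterEntry _Module##className##Entry _MODULE_DATA_SECT(_ModuleEntrySectionName) = { \
    #className  \
};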

At the same time, we provide a unified initialization interface. Its implementation fishes the registered data out of the corresponding section and registers everything in one pass through the original interface:

size_t dataLength = sizeof(_ModuleRegisterEntry);
for (id headerItem in appImageHeaders) {
    const ne_mach_header *mach_header = (__bridge const ne_mach_header *)(headerItem);
    unsigned long size = 0;
    // getsectiondata returns a byte pointer, so the indexing below is in bytes.
    uint8_t *dataPtr = getsectiondata(mach_header, SEG_DATA, _ModuleEntrySectionName, &size);
    if (!dataPtr) {
        continue;
    }
    size_t count = size / dataLength;
    for (size_t i = 0; i < count; ++i) {
        _ModuleRegisterEntry *entry = (_ModuleRegisterEntry *)&dataPtr[i * dataLength];
        // Call the original registration interface with entry->className
    }
}

For the original macro-based registration in +load, we added a deprecation annotation so that business developers notice the change in usage when they use it:

static inline __attribute__((deprecated("NEModuleHubExport is deprecated, please use 'ModuleRegister'"))) void func_loadDeprecated (void) {}
#define NEModuleHubExport \
+(void)load { \
    /* Call the original registration interface (a // comment would swallow the continued lines) */ \
    func_loadDeprecated();  \
}

Because of the large number of existing +load methods, in the first phase we optimized only the top 30 key +load methods, each taking more than 2 ms, and we are pushing the business teams to continue the cleanup.

3.1.3 Useless code cleanup

From the earlier analysis we know that, in both the rebase & bind and the Objc Init stages, the number of classes and categories in the project affects the time taken. In large apps in particular, continuous business development produces a huge amount of code, and many features and much code are never used after release, so cleaning up this unused code also reduces startup time. Unused code cleanup benefits package size even more, and Cloud Music has already cleaned up unused code as part of its package size optimization [6].

So how do we find out which code is unused? There are generally two methods: static code scanning and online big-data statistics. Static scanning again starts from the Mach-O: the __objc_selrefs and __objc_classrefs sections store the selectors and classes that are referenced, while the __objc_classlist section stores all classes, so comparing the two sets yields the classes that are never referenced. Objective-C is a dynamic language, however, and many classes are only invoked dynamically at runtime, so before deleting a class we need to make sure it really is never called. Online statistics are based on whether a class's metadata has been marked as initialized. Every Objective-C class has its own metadata, and one flag bit in it records whether the class has been initialized. This flag is not affected by anything else: once the class is initialized, it is set. The Objective-C runtime source reads the flag as follows:

struct objc_class : objc_object {
    bool isInitialized() {
        return getMeta()->data()->flags & RW_INITIALIZED;
    }
};

We cannot call this method directly, because it is internal to the Objective-C runtime. The metadata layout of a class does not change, however, so we can read the RW_INITIALIZED flag by mirroring the class metadata structure and thereby determine whether a class has been initialized. The code is as follows:

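#import <objc/runtime.h>

#define FAST_DATA_MASK 0x00007ffffffffff8UL
#define RW_INITIALIZED (1<<29)

- (BOOL)isUsedClass:(NSString *)cls {
    Class metaCls = objc_getMetaClass(cls.UTF8String);
    if (metaCls) {
        // On 64-bit, bits sits 32 bytes into objc_class: isa (8) + superclass (8) + cache (16).
        uint64_t *bits = (uint64_t *)((uint8_t *)(__bridge void *)metaCls + 32);
        // Masking yields the class_rw_t pointer; its first field is the 32-bit flags word.
        uint32_t *data = (uint32_t *)(*bits & FAST_DATA_MASK);
        if ((*data & RW_INITIALIZED) > 0) {
            return YES;
        }
    }
    return NO;
}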

With the code above we can tell whether a class has been initialized, collect usage statistics for the app's classes, and then use big-data analysis to decide which classes can be cleaned up. This way we identified thousands of unused classes. After excluding business-side code such as A/B tests and features pre-embedded for upcoming releases, we cleaned up more than 300 classes.
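The static-scan half of this analysis boils down to a set difference. The following is a minimal sketch, assuming the class names have already been extracted from __objc_classlist and __objc_classrefs into string arrays; the extraction step is not shown, and the function names and sample class names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if `name` appears in `list`. A linear scan keeps the sketch
 * short; a real scan over thousands of classes would use a hash set. */
static int contains_name(const char *const *list, size_t count, const char *name) {
    for (size_t i = 0; i < count; ++i) {
        if (strcmp(list[i], name) == 0) return 1;
    }
    return 0;
}

/* Writes every defined class that is never referenced into `out` and
 * returns how many were written. `out` needs room for `defined_count`
 * entries. The result is a candidate list only: classes reached purely
 * through runtime lookups will also show up here. */
static size_t unreferenced_classes(const char *const *defined, size_t defined_count,
                                   const char *const *referenced, size_t referenced_count,
                                   const char **out) {
    assert(out != NULL || defined_count == 0);
    size_t written = 0;
    for (size_t i = 0; i < defined_count; ++i) {
        if (!contains_name(referenced, referenced_count, defined[i])) {
            out[written++] = defined[i];
        }
    }
    return written;
}
```

Because of the dynamic-dispatch caveat, the output of such a scan is best cross-checked against the runtime initialization flag before anything is deleted.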

3.1.4 Binary rearrangement

From the earlier analysis of Page In we know that too many Page Ins during startup produce many I/O operations plus decryption and verification, and these can be costly. Binary rearrangement reduces that cost. A process accesses virtual memory in units of pages: if two methods used during startup sit on different pages, the system has to take two page faults and perform two Page Ins to load both pages. If the methods on the startup path are scattered across many pages, the startup process triggers a large number of Page Ins. To reduce them, all we need to do is place every method used on the startup path on consecutive pages, so that the system touches fewer memory pages when loading those symbols and the whole startup takes less time, as shown in the following figure:
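To make the page arithmetic concrete, here is a small sketch that counts how many distinct pages a set of function start offsets falls on. It assumes 16 KB pages and that no function straddles a page boundary; the offsets in the usage note and the name `pages_touched` are illustrative only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE_BYTES 16384u /* assumed page size: 16 KB */

static int compare_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Counts the distinct pages that contain the given start offsets.
 * Each distinct page is at most one File Backed Page In at launch. */
static size_t pages_touched(const uint64_t *offsets, size_t count) {
    assert(offsets != NULL || count == 0);
    if (count == 0) return 0;
    uint64_t *pages = malloc(count * sizeof *pages);
    assert(pages != NULL);
    for (size_t i = 0; i < count; ++i) {
        pages[i] = offsets[i] / PAGE_SIZE_BYTES; /* page number of offset */
    }
    qsort(pages, count, sizeof *pages, compare_u64);
    size_t distinct = 1;
    for (size_t i = 1; i < count; ++i) {
        if (pages[i] != pages[i - 1]) ++distinct;
    }
    free(pages);
    return distinct;
}
```

Four startup functions at offsets 0, 20000, 40000 and 70000 land on four different pages, while the same four functions packed at offsets 0, 256, 512 and 768 share a single page.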

Rearranging symbols takes two steps: first collect the symbols of the methods and functions on the entire startup path, then generate an order file from them and set it as the Order File build setting, which is passed to ld. When the project is built, Xcode reads this order file and the linker lays out the Mach-O according to the symbol order in it. There are two common schemes in the industry for collecting the symbols:

  • Hook objc_msgSend: captures only Objective-C symbols and Swift symbols marked @objc dynamic;
  • Clang instrumentation: captures Objective-C, C/C++, Swift, and block symbols completely;

Because the Cloud Music project is componentized and full source compilation still had some problems after the components were turned into binaries, we first chose Hook objc_msgSend to collect symbols so that we could validate the idea quickly. The hook itself can follow the scheme used for the flame graph above. Hooking objc_msgSend collected more than 14,000 deduplicated symbols on the startup path, which we set as the Order File of the main project, as shown in the following figure:

After the build finishes, check whether the order of the symbols in the # Symbols: part of the LinkMap file matches the order in the order file to confirm that the configuration took effect, as shown in the following figure:
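With more than ten thousand symbols, comparing the two files by eye is impractical. A minimal sketch of an automated check follows, assuming the symbol names have already been read from both files into arrays and that symbols absent from the LinkMap can simply be skipped; `order_applied` and `index_of_symbol` are hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the position of `name` in `list`, or -1 when it is absent. */
static long index_of_symbol(const char *const *list, size_t count, const char *name) {
    for (size_t i = 0; i < count; ++i) {
        if (strcmp(list[i], name) == 0) return (long)i;
    }
    return -1;
}

/* Returns 1 when the order-file symbols that occur in the link map occur
 * there in the same relative order, 0 otherwise. Order-file symbols that
 * are missing from the link map are skipped (an assumption of this sketch). */
static int order_applied(const char *const *linkmap, size_t linkmap_count,
                         const char *const *order, size_t order_count) {
    assert(linkmap != NULL || linkmap_count == 0);
    assert(order != NULL || order_count == 0);
    long last = -1;
    for (size_t i = 0; i < order_count; ++i) {
        long pos = index_of_symbol(linkmap, linkmap_count, order[i]);
        if (pos < 0) continue;     /* not linked into this binary */
        if (pos <= last) return 0; /* appears earlier than its predecessor */
        last = pos;
    }
    return 1;
}
```

The nested linear search is quadratic, which is acceptable for a one-off check; for repeated use the LinkMap names could be indexed in a hash table first.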

Finally, the effect of binary rearrangement has to be verified. Various articles online suggest that System Trace in Instruments shows the effect. After restarting the phone, run the app under System Trace until the home page appears, then stop the recording, find the main thread, and select Summary:Virtual Memory at the bottom left to see the File Backed Page In data, as shown in the following figure:

After several restart-and-cold-start test runs we found that the File Backed Page In data in System Trace is unstable and fluctuates widely, which makes it hard to demonstrate a difference before and after binary rearrangement. We suspected that App Launch in Instruments might also expose Page In data, so we located the Main Thread in App Launch as well and selected Summary:Virtual Memory, as shown in the following figure:

In App Launch, by contrast, the File Backed Page In figures are much larger than in System Trace and relatively stable. App Launch also lets us select an individual App Life Cycle stage, so we can look only at the data before the first frame is rendered. After repeated tests, comparing averages, the optimized build was less than 50 ms faster than the unoptimized one. At this point we were quite skeptical about the effect of binary rearrangement. Reviewing the test conditions, we found two things to improve. First, Apple optimized Page In in iOS 13, so we prepared an iOS 12 device for testing. Second, Hook objc_msgSend cannot cover all symbols, so we spent some time fixing full source compilation of the project and exported the symbols on the startup path through Clang instrumentation.

Clang instrumentation relies on SanitizerCoverage, the code coverage tool built into LLVM and shipped with Xcode. With the right compiler configuration, it inserts a call to the __sanitizer_cov_trace_pc_guard callback into every function at compile time. By implementing this callback we can obtain, at runtime, the address of the function in which the call was inserted and resolve that address to a symbol, which lets us collect the function symbols used throughout startup. Adding -fsanitize-coverage=func,trace-pc-guard to Other C Flags collects the symbols of C, C++ and Objective-C methods. If the project contains Swift code, -sanitize-coverage=func and -sanitize=undefined must also be added to Other Swift Flags so that Swift symbols are collected. For projects that manage code with Cocoapods, the open source project AppOrderFiles [7] is a useful reference implementation. Note that AppOrderFiles first resolves each function address to a symbol and then deduplicates. In medium and large projects the number of calls during startup can reach several million, so this step takes a very long time; deduplicating the addresses first and resolving symbols afterwards saves time. In addition, because the Cloud Music project enables the generate_multiple_pod_projects feature of Cocoapods, the Podfile has to be changed as follows so that Other C Flags and Other Swift Flags are set for every sub-project:

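post_install do |installer|
  # With generate_multiple_pod_projects enabled, every pod lives in its own
  # sub-project, so walk the targets of all sub-projects.
  installer.pod_target_subprojects.flat_map { |project| project.targets }.each do |target|
    target.build_configurations.each do |config|
      # Note: these assignments replace any flags already set on the target.
      # Instrument C / C++ / Objective-C function entries.
      config.build_settings['OTHER_CFLAGS'] = '-fsanitize-coverage=func,trace-pc-guard'
      # Instrument Swift function entries.
      config.build_settings['OTHER_SWIFT_FLAGS'] = '-sanitize-coverage=func -sanitize=undefined'
    end
  end
end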
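
The "deduplicate first, resolve later" idea can be illustrated with a minimal C sketch. The two __sanitizer_cov_* callbacks are the hooks that SanitizerCoverage expects; everything else (record_unique_pc, resolve_unique_pcs, counting_resolver, the fixed table size) is a hypothetical name chosen for illustration, not code from AppOrderFiles or Cloud Music. The sketch is single-threaded, while real startup code runs on several threads and would need a lock or a lock-free queue.

```c
#include <stddef.h>
#include <stdint.h>

#define SLOT_COUNT 65536u /* power of two; illustrative capacity only */

static uintptr_t g_slots[SLOT_COUNT]; /* open-addressing hash set, 0 = empty */
static uintptr_t g_order[SLOT_COUNT]; /* unique addresses in first-call order */
static size_t g_unique_count = 0;

/* Hypothetical helper: returns 1 the first time an address is seen, else 0. */
static int record_unique_pc(uintptr_t pc) {
    if (pc == 0 || g_unique_count >= SLOT_COUNT / 2) return 0; /* keep the table sparse */
    size_t i = (size_t)((pc >> 2) * 2654435761u) & (SLOT_COUNT - 1);
    while (g_slots[i] != 0) {
        if (g_slots[i] == pc) return 0; /* duplicate call site: drop it now */
        i = (i + 1) & (SLOT_COUNT - 1);
    }
    g_slots[i] = pc;
    g_order[g_unique_count++] = pc;
    return 1;
}

/* Called once per module by the instrumentation: give each guard a non-zero id. */
void __sanitizer_cov_trace_pc_guard_init(uint32_t *start, uint32_t *stop) {
    static uint32_t next_id = 0;
    if (start == stop || *start != 0) return; /* already initialised */
    for (uint32_t *guard = start; guard < stop; guard++) *guard = ++next_id;
}

/* Called at the entry of every instrumented function (coverage level "func"). */
void __sanitizer_cov_trace_pc_guard(uint32_t *guard) {
    (void)guard;
    /* The return address lies inside the instrumented caller. */
    record_unique_pc((uintptr_t)__builtin_return_address(0));
}

/* Hypothetical resolver signature; on iOS the body might call dladdr(). */
typedef void (*symbol_resolver)(uintptr_t pc, void *context);

/* Resolve each unique address exactly once, in first-call order. */
static size_t resolve_unique_pcs(symbol_resolver resolve, void *context) {
    for (size_t i = 0; i < g_unique_count; i++) resolve(g_order[i], context);
    return g_unique_count;
}

/* Example resolver that only counts how often it is invoked. */
static void counting_resolver(uintptr_t pc, void *context) {
    (void)pc;
    ++*(size_t *)context;
}
```

With this layout the expensive address-to-symbol step runs once per distinct function instead of once per call, which is where the time saving described above comes from.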

Through Clang instrumentation, we collected about 20,000 deduplicated symbols on the startup link and tested on an iPhone 6 Plus running iOS12.5.4. Averaged over multiple tests, binary rearrangement brought an improvement of about 180ms. The result data shows that the effect of binary rearrangement is hard to predict. Before iOS13, the decryption and verification that Apple performs during Page In is the dominant cost, and the ordering of symbols itself has less impact.

3.2 T2 Stage Governance

The governance of the T2 stage mainly covers the configuration and initialization of each startup task and the loading of the home page. This area also offers the largest room for optimization. As shown above, because of the particular nature of the Cloud Music business, the advertising business accounts for a large share of the T2 stage, so we also addressed the advertising business in this stage. The Cloud Music home page is already cached and, because of the advertising business, it is not a bottleneck in the startup process, so we focus on the individual startup tasks.

Except for some AppDelegate initialization code that has not yet been brought under management, all startup tasks in Cloud Music are managed by a startup task management framework. In the T2 stage we therefore mainly use the flame graph generated by hooking objc_msgSend and the App Launch tool in Instruments, combined with the startup task management framework, to analyze the performance of the entire startup link. From this analysis and the subsequent optimization, we summarize the following optimization directions:

3.2.1 Optimization of high frequency OC method

OC is a dynamic language, and method calls are dispatched at runtime through objc_msgSend, so we built a flame graph to analyze the cost of each method. The advantage of a dynamic language is flexibility, but it comes with relatively poor performance, and the impact is especially visible and far-reaching when it shows up in low-level libraries.

NEHeimdall library optimization

From the flame graph, we can see that one method of a low-level library is called very frequently and that its cumulative cost is high, as shown in the following figure:

In the enlarged image, the frequently called method is [[NEHeimdall]disableOptions] . NEHeimdall is our low-level library for runtime crash protection. It hooks container classes, NSString, UIView, NSObject and other classes, and checks inside each hooked method whether protection is switched on. Low-level container classes such as NSArray are used everywhere and called very often, so calling [[NEHeimdall]disableOptions] inside every objectAtIndex call is indeed expensive.

There are two main optimization ideas. One is to check the switch state at Hook time and decide then whether to enable protection. The other is to turn the original [[NEHeimdall]disableOptions] method into a C function, which improves overall performance. The first idea requires larger changes and, because the switch is driven by AB testing, cannot guarantee that switch changes take effect in real time, so we finally chose the second.
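
As a rough illustration of the second idea, the switch can be kept in a flag that hot paths read through a plain C function instead of an OC message send. This is only a sketch under stated assumptions: the option bits and the names guard_set_disabled_options and guard_is_disabled are hypothetical, since the article does not show the real NEHeimdall interface.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical option bits; the real library's options are not shown in the article. */
enum {
    GUARD_OPTION_CONTAINER = 1u << 0,
    GUARD_OPTION_STRING    = 1u << 1,
};

/* Bitmask of protections that are currently switched off. */
static _Atomic uint32_t g_disabled_options = 0;

/* Called whenever the AB configuration changes, so the switch stays live. */
static inline void guard_set_disabled_options(uint32_t options) {
    atomic_store_explicit(&g_disabled_options, options, memory_order_relaxed);
}

/* Hot path: a direct C call and one atomic load, with no dynamic dispatch. */
static inline bool guard_is_disabled(uint32_t option) {
    uint32_t current = atomic_load_explicit(&g_disabled_options, memory_order_relaxed);
    return (current & option) != 0;
}
```

A hooked objectAtIndex implementation would call guard_is_disabled(GUARD_OPTION_CONTAINER) before applying protection. Because the flag is re-read on every call, a later AB update still takes effect immediately, which is the property the first idea could not guarantee.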

JSON parsing optimization

ABTest is an indispensable component in large apps, and the cached AB data has to be read early on the startup link. Because of the long history of the Cloud Music project, ABTest data serialization and deserialization still use the SBJson library for JSON parsing, and SBJson calls its sub-methods very frequently, as shown in the following figure:

Benchmark data 8 published by community members some years ago shows that the performance of the SBJson library is relatively poor, as shown in the following figure:

As the figure shows, the system-provided NSJSONSerialization performs best for JSON parsing, so in the ABTest component we removed SBJson and parse JSON data with NSJSONSerialization instead. Some components outside the startup link still depend on SBJson, and the next step is to remove the SBJson dependency from the entire project.

3.2.2 runtime traversal optimization

The dynamism of OC gives developers a lot of room for extension, so in day-to-day development people often use tricks such as hooks and symbol traversal, and these operations are very costly in terms of performance.

Hook optimization

Many scenarios in the Cloud Music project require hooks, either through Method Swizzle or through fishhook traversing the symbol table. When we analyzed the flame graph and Instruments data, we found that both hook methods affect performance, as shown in the following figure:

We considered two ways to optimize hooks. One is to switch to a better-performing hook library, but that introduces a new library and carries a migration cost. The other is to move the existing hook code onto a child thread, but the timing of a child thread is uncertain, and the hook must be guaranteed to complete before the corresponding class is used. We experimented with the second approach but did not ship it in the end. In the future we will manage hooks in a unified way to reduce the time lost to repeated hooks.

EXTConcreteProtocol optimization

A protocol in OC has no default implementation, but in many scenarios a default implementation would be very convenient. EXTConcreteProtocol in the libextobjc library provides this capability. Through Instruments, we found that the ext_loadConcreteProtocol method is particularly time-consuming, as shown in the following figure:

The source code shows that ext_loadConcreteProtocol also relies on runtime traversal to give protocols a default implementation. Only one place in the existing business uses EXTConcreteProtocol, yet its impact on startup time is particularly large, so the optimization is to remove the dependency and rework the business code: adding a category on NSObject that conforms to the protocol also gives the protocol a default implementation.

3.2.3 Network related optimization

In the Cloud Music project, two network-related points mainly affect startup performance: the synchronization of cookie settings, and the generation and use of the UserAgent.

Cookie settings synchronization optimization

Most apps have scenarios that jump to third-party H5 pages. In Cloud Music, a WKWebView object was created in advance on the startup link in order to synchronize cookies, and creating a WKWebView instance is very time-consuming. We optimized this with lazy loading: the WKWebView object is created only when an H5 page is actually opened, and cookies are synchronized at that point.

Optimizing per-launch UserAgent generation

UserAgent is an essential request parameter. In Cloud Music it was obtained by temporarily creating a UIWebView object and evaluating navigator.userAgent, and this was repeated on every launch. The time is mainly spent creating the UIWebView object. Looking at the actual content of the UserAgent, only the system version number and the App version number change with upgrades, and everything else stays the same. We therefore cache the UserAgent and refresh the cache only when the system or the App is updated, which reduces the impact on startup performance, as shown in the following figure:
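
The caching rule described above (rebuild only when the system version or the App version changes) can be sketched as follows. This is a minimal in-memory illustration, not the Cloud Music implementation: ua_cache, cached_user_agent and demo_builder are hypothetical names, the builder stands in for the expensive web view step, and persisting the cache between launches is left out.

```c
#include <stdio.h>
#include <string.h>

#define UA_MAX 256
#define VERSION_MAX 32 /* longer version strings are truncated and always rebuild */

typedef struct {
    char system_version[VERSION_MAX];
    char app_version[VERSION_MAX];
    char user_agent[UA_MAX];
    int valid; /* 0 until the first successful build */
} ua_cache;

/* Stand-in for the expensive step (creating a web view, reading navigator.userAgent). */
typedef void (*ua_builder)(char *out, size_t out_size);

/* Returns the cached UserAgent, rebuilding it only when a version changed. */
static const char *cached_user_agent(ua_cache *cache, const char *system_version,
                                     const char *app_version, ua_builder build) {
    int stale = !cache->valid ||
                strcmp(cache->system_version, system_version) != 0 ||
                strcmp(cache->app_version, app_version) != 0;
    if (stale) {
        build(cache->user_agent, sizeof cache->user_agent);
        snprintf(cache->system_version, sizeof cache->system_version, "%s", system_version);
        snprintf(cache->app_version, sizeof cache->app_version, "%s", app_version);
        cache->valid = 1;
    }
    return cache->user_agent;
}

/* Demo builder that counts how often the expensive path runs. */
static int g_build_calls = 0;
static void demo_builder(char *out, size_t out_size) {
    g_build_calls++;
    snprintf(out, out_size, "DemoUA/%d", g_build_calls);
}
```

Calling cached_user_agent twice with the same versions invokes the builder once, and changing either version triggers exactly one rebuild.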

3.2.4 System Interface

While analyzing the flame graph and Instruments data, we also found that some system interfaces have a large impact on the duration of the startup link. So far there are two main ones:

  • The bundleWithIdentifier: interface in NSBundle;
  • The beginReceivingRemoteControlEvents interface in UIApplication;

Cloud Music wraps Bundle lookup in its own layer and obtains the corresponding Bundle by podName. Internally, it first searched with the system bundleWithIdentifier: interface and, if nothing was found, fell back to locating the URL through mainBundle. Our analysis shows that bundleWithIdentifier: performs poorly on its first call, while looking up the Bundle through mainBundle is very fast. We verified that the mainBundle approach can find the Bundle, so we swapped the order and now look up the Bundle through mainBundle first, as shown in the following figure:
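
The reordering amounts to trying the cheap lookup first and falling back to the expensive one only on a miss. A minimal sketch, in which find_bundle, the two demo lookups and the paths are all hypothetical stand-ins for the mainBundle-based search and bundleWithIdentifier:

```c
#include <stddef.h>
#include <string.h>

/* A lookup returns a bundle path, or NULL when the bundle is not found. */
typedef const char *(*bundle_lookup)(const char *pod_name);

/* Try the cheap lookup first; pay for the slow one only on a miss. */
static const char *find_bundle(const char *pod_name, bundle_lookup fast, bundle_lookup slow) {
    const char *path = fast(pod_name);
    return path != NULL ? path : slow(pod_name);
}

/* Demo stand-in for the fast mainBundle-based search. */
static const char *lookup_in_main_bundle(const char *pod_name) {
    return strcmp(pod_name, "DemoPod") == 0 ? "/app/DemoPod.bundle" : NULL;
}

/* Demo stand-in for the slower identifier-based search; counts its calls. */
static int g_slow_calls = 0;
static const char *lookup_by_identifier(const char *pod_name) {
    (void)pod_name;
    g_slow_calls++;
    return "/app/Frameworks/Other.bundle";
}
```

When the fast path hits, the slow interface is never touched during startup.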

The beginReceivingRemoteControlEvents interface is mainly used to display playback information and buttons on the lock screen, which requires enabling Remote Control Events first. As a music app, Cloud Music needs to display this information while playing music. Previously, the playback-related services registered their instances in the IOC container at startup. We modified the lower layer of the IOC container to support lazy loading, so that an instance is initialized only when the related service is first used. This moves the cost of beginReceivingRemoteControlEvents out of startup, as shown in the following figure:
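
The lazy IOC registration can be pictured as a registry that stores factories instead of instances. The sketch below is an assumption-laden illustration, not the Cloud Music container: register_lazy_service, resolve_service and make_player_service are hypothetical names, and the code is not thread-safe.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SERVICES 16 /* illustrative capacity */

typedef void *(*service_factory)(void);

typedef struct {
    const char *name;
    service_factory factory; /* how to build the instance */
    void *instance;          /* NULL until first use */
} service_entry;

static service_entry g_services[MAX_SERVICES];
static size_t g_service_count = 0;

/* Registration at startup only records the factory; nothing is created yet. */
static int register_lazy_service(const char *name, service_factory factory) {
    if (g_service_count == MAX_SERVICES) return 0;
    g_services[g_service_count++] = (service_entry){name, factory, NULL};
    return 1;
}

/* The instance is built on the first resolve and reused afterwards. */
static void *resolve_service(const char *name) {
    for (size_t i = 0; i < g_service_count; i++) {
        service_entry *entry = &g_services[i];
        if (strcmp(entry->name, name) == 0) {
            if (entry->instance == NULL) entry->instance = entry->factory();
            return entry->instance;
        }
    }
    return NULL; /* unknown service */
}

/* Demo factory: the counter stands in for expensive setup work such as
   enabling remote control events. */
static int g_player_inits = 0;
static int g_player_state = 0;
static void *make_player_service(void) {
    g_player_inits++;
    return &g_player_state;
}
```

Registering the playback service at launch then costs only a table write, and the expensive initialization happens when playback is first requested.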

3.2.5 Advertising business optimization

After an in-depth analysis of the advertising business, we found that Cloud Music currently serves ads to both member and non-member users. Member users see relatively few ads, and those are generally internal operation activities, which do not need to pull data from the advertising alliance. At the code level, the ad interface request was not sent until the advertising business code was executed, which is already too late. In response to these two findings, we optimized the advertising business as follows:

  • A dynamically configurable switch for ad interface requests from member users;
  • Moving the ad interface request earlier;

Internal operation activities are generally configured by the operations team, and the configuration includes the target audience. The dynamic switch is therefore controlled on the backend: it is turned on when a configured activity needs to reach members, and it stays off otherwise, in which case member users do not request the interface at all. For non-member users, the impact of the advertising business cannot be avoided. On top of the current design, we moved the ad interface request forward so that it is sent as soon as the network library is initialized, which shortens the impact of the request duration on startup. According to grayscale release data, this saves about 300~400ms on average.

3.2.6 Optimization at other business levels

In addition, some startup costs come from business expansion or new features, such as reading the phone number for one-tap login on iPhone. After Cloud Music added one-tap login, it checked through the SDK whether the carrier supports one-tap login and fetched the number. In the previous design, this check and fetch ran regardless of whether the user was logged in, and fetching from the SDK takes a certain amount of time. We changed it to fetch only when the user is not logged in.

There were also some less common usage problems in business code. We made many such optimizations and will not list them one by one here.

4. Summary

After this phase of dedicated startup optimization, the startup performance of the Cloud Music App has improved by 30%+ so far. However, every startup optimization only addresses the situation the app is in at that moment. Large apps iterate very frequently and carry a large volume of business requirements. How can we detect and intercept code that affects startup performance during daily development? How can we quickly locate and attribute a regression after a new version goes live? How can we even measure the startup experience as perceived by an individual user? These questions need to be considered and put into practice after a phase of startup governance. We are currently building out an anti-regression system for startup performance and will share more of our anti-regression ideas once it is online and running stably.

As shown above, the advertising business has a particularly large impact on the overall startup performance of the Cloud Music App, especially because of the uncertainty of the interface response time. Advertising involves revenue, so short-term changes in this area are difficult. Although we optimized for member users this time, we will analyze the advertising business further and make additional optimizations in the future. Some business-level optimizations, such as tabbar lazy loading, home page loading and routine +load cleanup, will also be addressed further.

PS: Here is a small summary table of Cloud Music's optimization practice:

| Stage | Optimization direction | Probable gain | Analysis tools/methods |
| --- | --- | --- | --- |
| T1/pre-main | Dynamic library to static library | Average 20-30ms/library | Unpack/Xcode Environment Variables |
| T1/pre-main | +load | See specific business | Hook load summary |
| T1/pre-main | Useless code cleanup | See specific business | Big data statistics usage |
| T1/pre-main | Binary rearrangement | 50-200ms | Hook objc_msgSend/Clang instrumentation |
| T2/post-main | High frequency OC method | 200-300ms | Flame Graph |
| T2/post-main | Runtime symbol traversal | 300-500ms | Flame Graph/Instrument |
| T2/post-main | Network related | 200-300ms | Flame Graph/Instrument |
| T2/post-main | System interface | 100-200ms | Flame Graph/Instrument |
| T2/post-main | Business impact | 300-400ms | Flame Graph/Instrument |

5. References


NetEase Cloud Music Technical Team