Reviewer | Cai Fangfang
Author | Ru Bingsheng
Tencent T4-level technical expert, special researcher at Tencent Research Institute, and well-known industry expert in R&D effectiveness and software quality. He is a Most Valuable Expert of Tencent Cloud, Alibaba Cloud, and Huawei Cloud, a certified DevOps Enterprise Coach, an IT book author of the year for influence, and the author of several technical bestsellers, including the Geek Time column "52 Lectures on Software Testing"; his new book, "The Beauty of Software R&D Effectiveness Improvement", will be published soon. He has served many times as co-chairman, producer, and keynote speaker at major domestic technology summits.
Some time ago I wrote an article, "How to Use R&D Efficiency to…", which aroused a lot of discussion and attention among industry peers. Today I want to continue with another topic in the process of improving R&D efficiency: "measurement". The purpose of discussing measurement is not to argue about right and wrong, but to provoke everyone's deeper thinking on this topic.
1 Cases of measurement failure
Let's first look at a case of "involution" and other undesirable behaviors caused by an improperly designed measurement system.
For example, by using "clicks" to measure the results of self-media operations, it is possible that the number of clicks will increase significantly, but the number of followers of the official account will decrease. The reason is that "Title Party" and other methods are used to trick readers into opening the link, but the actual content is not true to the name, and readers will not continue to pay attention to the official account after a few times.
2 Times have changed, and the underlying logic of many things has changed
Why do today's measurements fail so easily? As I mentioned in the previous article, in the face of change, what matters most is not upgrading methods and technologies but upgrading the mode of thinking. We are in the midst of digital transformation and need to thoroughly transform the scientific-management thinking of the industrial age into new thinking for the byte-economy era.
When measuring software R&D effectiveness, most of the time we are still trying to improve the R&D model of the byte economy with management concepts formed in the industrial age. But the times have changed, and the underlying logic of many things has changed with them. Are the scientific-management concepts formed in the industrial age still applicable in today's byte economy? This is worth pondering.
From the perspective of software R&D effectiveness, this article discusses several questions that must be answered when measuring R&D effectiveness in the byte-economy era:
- Should R&D effectiveness be measured?
- Can R&D effectiveness be measured?
- How to measure R&D effectiveness?
- How to choose the metrics for R&D efficiency?
3 Should R&D effectiveness be measured?
Yes. The answer to this question is beyond doubt.
Peter Drucker, the father of modern management, said, "There is no improvement without measurement." This underlying logic has never changed; it is just that measurement in the industrial era and measurement in the byte-economy era differ in many of their ideas and methods.
The value of metrics for R&D process improvement is very clear. In the industrial era, the risks in the R&D and production of physical products were relatively visible; it was easier to find ways to prevent them and to assign the related responsibilities. Software product development in the byte-economy era (there is no production stage), however, advances through the digital collaboration of ever more engineers. The more people involved in R&D, the higher the cost of communication and the greater the probability of random deviation. Moreover, the software development process itself has very low visibility: risks are easily covered up at each stage and quietly accumulate where no one can see them. Without a proper measurement system to reveal these risks, the outcome is easy to imagine, let alone continuous improvement and governance.
Measurement is also necessary for people's sense of fairness. "Even if I have no credit, at least I have worked hard." Most people pay attention only to their own contribution, not to its actual effect. As a manager, you should applaud hard work but pay for credit, and distinguishing credit from hard work requires objective measurement data; otherwise team members will gradually slide into inaction.
4 Can R&D effectiveness be measured?
Having clarified that R&D effectiveness must be measured, let's look at a more practical question: can it be measured? "Should" and "can" are questions on two different levels; "should" does not imply "can", just as "I want to make money" and "I can make money" are two completely different things.
On this issue the industry holds two completely different views. One, based on the theory of Peter Drucker, the father of modern management, advocates that R&D effectiveness can be measured; the other, represented by the world-class software development master Martin Fowler, holds that R&D effectiveness cannot be measured.
My own view is more moderate: I think it can be measured, but there is no perfect measurement. The reasons are as follows.
4.1 The one-sidedness of the measurement itself cannot be avoided
Real things are complex and many-sided, and measurement is an abstract, quantitative means of describing and comparing these concrete facts. In a sense, the result of any measurement is necessarily one-sided and reflects only part of the facts.
Managers often break a goal down into measurable indicators. However, goals and indicators are usually not a simple whole-and-parts relationship. The breakdown may seem smooth and natural, but when the broken-down indicators are put back together, the result is often dumbfounding.
There is a joke: you tell an artificial intelligence, "I want a girlfriend with big eyes like Zhao Wei's and a big mouth like Julia Roberts's, who loves sports, on land and in water." Working strictly from these indicators, the AI hands you a female frog. The relationship between indicators and goals is rarely a sufficient-and-necessary one.
4.2 The measurement process is easy to fall into local thinking
Indicators are for achieving goals, but in practice, indicators are often the enemy of goals.
Managers often break goals down into indicators. Over time they know only the indicators and forget the more important goals behind them. If the goal is the forest, the indicators are the trees; eventually they see only trees and no forest. A manager who has forgotten the goal becomes very short-sighted. Those who don't understand data are bad, but those who only look at data are the worst.
There was a dark period in Ford Motor Company's history when the older generation of managers, rich in hands-on experience but without business-school training, were pushed out and replaced by data analysts with management degrees from prestigious schools, in an attempt to drive business growth through refined, data-driven management. Because these analysts were unfamiliar with the business, they could only look at the measurement data; the less they understood the business, the more they relied on the data to make decisions, and in the end the whole company sank into a quagmire.
Software development has a similar embarrassment. To improve code quality, strict test-coverage requirements are laid down. Over time everyone mechanically pursues the number and forgets the original intention behind it, ending in the awkward situation of piles of high-coverage unit tests that contain no assertions at all.
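A minimal sketch of how this plays out, using a hypothetical `discount` function and pytest-style tests: the first test drives coverage to 100% yet can never fail.

```python
def discount(price: float, is_vip: bool) -> float:
    """Hypothetical example: return the price after a VIP discount."""
    return price * 0.8 if is_vip else price

def test_discount_covers_both_branches():
    # Both branches execute, so line/branch coverage reaches 100%...
    discount(100.0, True)
    discount(100.0, False)
    # ...but with no assert, this test passes no matter what discount returns.

def test_discount_with_real_assertions():
    # What the coverage requirement was originally meant to encourage:
    assert discount(100.0, True) == 80.0
    assert discount(100.0, False) == 100.0
```

Both tests contribute identically to the coverage number; only the second one protects the behavior.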
4.3 The interpretation of metric data can be very misleading
The measurement data itself does not lie, but there is plenty of room in how data is presented and interpreted. Different interpretations of the same data can lead to completely different conclusions, and this is easily exploited to serve different agendas.
For example, a researcher asks: if you had a terminal illness and there were a new drug that could cure it, but 20% of those who take it may die from it, would you take it? Most people choose not to. Asked the other way around: if there were a new drug that cures 80% of patients, while the rest die, would you take it? The vast majority choose to take it. The underlying data in the two questions is identical, yet the answers are opposite. The reason is simple: the first framing emphasizes "loss", the second emphasizes "gain", and human nature prefers gain and avoids loss.
There are many similar cases in the field of R&D efficiency: the same data, interpreted by different people, leads to different decisions.
In summary, I think whether R&D effectiveness can be measured depends on the scenario; debating measurability divorced from context does not mean much. Just as nothing in nature is inherently dirty, only things in the wrong place are dirty: food is clean in the bowl and dirty spilled on clothes; soil is clean in the garden and dirty shaken onto the bed.
5 How to measure R&D effectiveness?
So how should R&D efficiency be measured? Below are some of my thoughts.
5.1 Listen to managers' measurement demands, but don't blindly follow them
In the era before the automobile was invented, if you asked carriage riders what kind of transportation they wanted, the answer was most likely "a faster horse-drawn carriage." Follow that line of thinking and you fall into the trap of studying horseshoe design and optimizing horse feed, and the automobile is never invented. Much of the time, the needs users state are merely their own presumed "solutions".
When a manager tells you "I want this measurement data and that measurement data," do not take the list at face value. Dig into the real motivation behind the data the manager wants to see and the problems the manager hopes to solve with it. Understanding these deeper needs is the essence of the problem; only on that basis is a relatively sound measurement scheme possible.
5.2 The metric system should be hierarchical
The measurement dimensions that senior managers, middle managers, and front-line engineers care about are certainly different. Don't try to provide a seemingly grand, all-encompassing measurement system: when your measurement system tries to serve everyone, it ends up serving no one.
An ideal approach can borrow from OKR practice: senior management sets the overall goal of the measurement system, middle management breaks it down into executable, quantifiable indicators, and front-line engineers break those down further into engineering-level indicators. Each level cares only about its own indicators and the goals of the level above; higher levels should not pay excessive attention to lower-level detail indicators.
5.3 The design goal of measurement is to guide correct behavior
Measurement is never an end in itself, but a means to an end. Metrics serve the goal, so a good measurement design must exert positive traction on the goal. If a metric's negative traction on the goal outweighs its positive traction, the measurement has essentially failed.
For example, many software companies in China use Sonar to control the static quality of their code. To promote Sonar adoption, many of them use indicators like "Sonar project access rate", i.e. what percentage of projects have enabled Sonar in their continuous-integration (CI) pipelines, to measure the uptake of static code inspection. The indicator seems pertinent, but its traction toward the ultimate goal is limited. The ultimate goal of using Sonar is to improve code quality; merely plugging Sonar in improves nothing and easily degenerates into a race over the access number. Once you see this logic, you will find that metrics like "mean repair time for serious Sonar issues" and "trend of new Sonar issues" are far more practical and instructive.
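A minimal sketch of the contrast, under assumed data shapes (not the real SonarQube API): the access rate is trivially maximized, while a weekly trend of new serious issues actually says something about quality.

```python
# Toy sketch; the data below is hypothetical, not pulled from SonarQube.
from collections import Counter

# The vanity metric: easy to push to 100% without fixing any code.
projects_with_sonar, total_projects = 48, 50
print(f"Sonar access rate: {projects_with_sonar / total_projects:.0%}")

# ISO weeks in which serious issues were opened (assumed records).
issue_weeks = ["2021-W36", "2021-W36", "2021-W37",
               "2021-W37", "2021-W37", "2021-W38"]
weekly_new = Counter(issue_weeks)
for week in sorted(weekly_new):
    # A rising series here is a real quality signal; the access rate is not.
    print(week, weekly_new[week])
```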
Therefore, a good metric must serve to solve the essential problem and be able to guide correct behavior.
5.4 Remember not to use "star chasing" metrics based on "comparative thinking"
When we see a person succeed, we immediately assume that everything he did in the past was reasonable; when we see a company succeed, we marvel at how effective its strategies and engineering practices must have been. This is what is terrible about "comparative thinking". In fact, no company succeeds by benchmarking itself against its competitors.
The success of OKR at Google made many companies rush to adopt the practice, but how many have actually succeeded because of OKR? This kind of "star chasing" measurement only drags you into deeper involution.
For an R&D effectiveness measurement system, never blindly copy the so-called best practices of the "big factories", and don't benchmark your measurement practices against theirs. Your context and organizational ecology are different: the same medicine that cures a big company can be fatal to you.
5.5 Don't cast a wide net with measurement
Do not launch large-scale measurement without clear improvement goals, because measurement has a cost, and not a low one. Many large organizations like to spend heavily on building an R&D-efficiency measurement data platform, hoping to mine improvement points from effectiveness big data. This "cast the net wide" strategy looks promising but achieves little in practice. Experience shows that the cost of building such a measurement data platform is often significantly higher than its actual benefit.
The ideal approach is to identify the points that need to change through deep insight into the R&D process, then find a set of metrics that can confirm or refute your view, take the corresponding measures, and finally use measurement data to verify the actual value of those measures. This "precision fishing" strategy is usually far more practical.
6 How to select the metrics for R&D effectiveness?
This question is too big to fully unfold here, but I want to illustrate indicator selection through two typical cases. Often, when we cannot articulate what is right, reverse thinking about what is wrong can give us just as much inspiration.
6.1 A "murder case" caused by the defect rate per thousand lines of code
The defect rate per thousand lines of code is widely known and used by many companies as a code-quality metric, but can this indicator really reflect code quality objectively? Is it really a qualified indicator? I will not give a conclusion directly; instead, let's analyze it together and let you draw your own conclusion.
The defect rate per thousand lines of code is defined as the number of defects divided by the number of lines of code in thousands (defects per KLOC). Assume that, under normal circumstances, a team's average defect rate per thousand lines falls in the range of 5-10.
Now consider three engineers (the arithmetic is sketched in the code after this list):
- Engineer A's technical ability is relatively weak: implementing requirement X takes him 20,000 lines of code and introduces 158 defects. His defect rate per thousand lines = 158 / 20 = 7.9, within the 5-10 average range. Judged by this metric, Engineer A is at a normal level, attracts no one's attention, and gets along just fine.
- Engineer B is technically excellent: he implements the same requirement X in only 3,000 lines of code, introducing 10 defects. His defect rate per thousand lines = 10 / 3 ≈ 3.3, well below the 5-10 average range. Do you think Engineer B will be praised for it? Quite the opposite: he is likely to be judged as insufficiently tested and ordered to strengthen testing.
- Engineer C has technical aspirations and strives to become an expert. He implements the same requirement in 4,000 lines of code, but because his skills are still maturing he introduces 58 defects. His defect rate per thousand lines = 58 / 4 = 14.5, well above the 5-10 average range, so he will without doubt be criticized and ordered to improve his code quality.
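A worked sketch of the arithmetic above, using the figures from the three bullets (the helper function and threshold check are just illustrative):

```python
def defects_per_kloc(defects: int, lines_of_code: int) -> float:
    """Defect rate per thousand lines of code."""
    return defects / (lines_of_code / 1000)

engineers = {
    "A (weak, verbose)":   (158, 20_000),
    "B (strong, concise)": (10,   3_000),
    "C (improving)":       (58,   4_000),
}
for name, (defects, loc) in engineers.items():
    rate = defects_per_kloc(defects, loc)
    verdict = "normal" if 5 <= rate <= 10 else "flagged"
    print(f"{name}: {rate:.1f} defects/KLOC -> {verdict}")
# B (3.3) and C (14.5) both get flagged, while A (7.9) passes unnoticed.
```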
Evaluating these three engineers by the defect rate per thousand lines of code is obviously unfair.
To make matters worse, requirements change. Engineer B's and Engineer C's code is well structured and easy to change, so they complete the change quickly without introducing new defects. Engineer A's code lacks sound design: large portions must be rewritten and many new defects are introduced, yet because his code base is so large, his defect rate per thousand lines stays within the average range. Under this measurement system Engineer A again escapes unscathed, while Engineers B and C keep receiving negative reviews because their work looks too easy and their workload appears "not full".
Clearly, the defect rate per thousand lines of code is a failed metric, and the values it conveys are the opposite of what we intend. Engineer B's experience says "we don't believe you can write high-quality code"; Engineer C's says "we don't tolerate the growing pains of technical improvement"; Engineer A's says "we welcome mediocre programmers". All of these run directly against our actual values.
The above analysis assumes no deliberate gaming of the indicator. In real work, engineers tend to dilute the code to lower the defect rate per thousand lines rather than actually reduce the number of defects, because compared with fixing defects, diluting code (writing one line as several, always putting braces on their own lines, adding more comments and blank lines, and so on) is easier and more controllable. As I always say, never underestimate the "creativity" of engineers facing a metric.
At this point the designer of the measurement system may catch on to these tricks, and so a new "defect rate per development equivalent" indicator is born. It replaces thousands of lines of code with the development equivalent, a reasoned estimate of development workload that can be understood as the complexity of the abstract syntax tree (AST) compiled from the source code. Compared with shallow statistics like line counts, the development equivalent is far less susceptible to programming habits or deliberate behaviors (line breaks, comments, etc.), so diluting the code can no longer lower the defect rate.
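A minimal sketch of the idea (not any vendor's actual development-equivalent algorithm): approximate code volume by counting AST nodes, so comments, blank lines, and line breaks have no effect.

```python
import ast

def ast_volume(source: str) -> int:
    """Count AST nodes as a crude proxy for development equivalent."""
    return sum(1 for _ in ast.walk(ast.parse(source)))

compact = "def f(a, b): return a + b"
diluted = '''
# lots of comments...
def f(a,
      b):

    # ...and blank lines change nothing below
    return (a +
            b)
'''
# Both versions parse to the same tree, so the counts are identical.
print(ast_volume(compact), ast_volume(diluted))
```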
At first glance the development equivalent seems to solve the problem, but on deeper thought it may make things worse. Engineers can still lower the defect rate by artificially inflating the development equivalent (for example, by reducing encapsulation); it is merely harder than diluting lines, which raises the difficulty of gaming the indicator and pushes everyone into an "algorithmic arms race". The end result: engineers spend even more time and energy in the wrong place to lower the defect rate, while actual code quality still does not improve.
So what went wrong? Calm down and think: is there any relationship between the number of lines of code and code quality? If there is, is it causal, or merely a correlation? You may suddenly realize that there is only a correlation between lines of code and code quality, not causation: code quality does not deteriorate simply because the line count grows. The major premise behind using the defect rate per thousand lines to measure code quality simply does not hold; it is wrong at the source. It is like the correlation between forest-fire rates and ice-cream sales: both are driven by hot weather, and there is no causality between them. Trying to reduce forest fires by cutting ice-cream sales is totally unworkable. Superstitious belief in the thousand-line defect rate is self-deception of the same kind.
So what is the correct approach? We know that as long as defects can be repaired quickly, defects themselves are not terrible. What we fear is defects that are very hard to repair, defects that sit unfixed for days, and, worse still, repairs that require "bone-breaking" changes to the original code.
So we can use the mean time to repair (MTTR) of defects to measure code quality. MTTR better reflects the quality of the code itself and the technical maturity of the team: code with a long mean repair time is usually complex and highly coupled, while code with a short mean repair time tends to be clearly structured, consistently named, and easy to understand, extend, and change. Compared with the defect rate per thousand lines, MTTR exerts much stronger positive traction on code quality.
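A minimal sketch of the computation, assuming defect records carry open/close timestamps (the data shape is hypothetical):

```python
from datetime import datetime
from statistics import mean

# Assumed defect records with open and close times.
defects = [
    {"opened": datetime(2021, 9, 1, 9),  "closed": datetime(2021, 9, 1, 17)},
    {"opened": datetime(2021, 9, 3, 10), "closed": datetime(2021, 9, 7, 12)},
    {"opened": datetime(2021, 9, 8, 14), "closed": datetime(2021, 9, 8, 18)},
]

repair_hours = [
    (d["closed"] - d["opened"]).total_seconds() / 3600 for d in defects
]
# A persistently high MTTR usually points at complex, highly coupled code,
# exactly the signal the thousand-line defect rate fails to give.
print(f"MTTR: {mean(repair_hours):.1f} hours")
```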
6.2 The rights and wrongs of workload estimation in agile mode
In agile, should workload be estimated in "story points" or in "person-days"? Many people think either will do, that it mainly depends on the team's habits. That answer is completely wrong: the correct approach is to use "story points", not "person-days".
The reason is not complicated: workload is a quantity, while a person-day is a unit of time. If the task is to move a thousand bricks, the thousand bricks are the workload. Move fast, and it is a thousand bricks; move slowly, and it is still a thousand bricks. The workload itself has nothing to do with time.
Workload is connected to time through the concept of rate. For the same thousand bricks, if you can move 10 bricks per minute, you finish in 100 minutes; if I can move only 5 per minute, I need 200 minutes. Only once the rate is determined can workload be converted into time.
The problem is that when planning an iteration we have no way of knowing the rate precisely. The rate changes dynamically with many factors and is anything but constant: an engineer's proficiency, whether similar problems have been handled before, how many meetings must be attended, even chores at home all directly affect it. Therefore we cannot equate "story points", which represent workload, with "person-days", which represent time. A toy sketch of this arithmetic follows.
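The brick numbers above, plus a story-point version under an assumed velocity history (the sprint figures are hypothetical):

```python
# Workload is fixed; time depends entirely on the rate.
workload_bricks = 1000
for bricks_per_minute in (10, 5):
    minutes = workload_bricks / bricks_per_minute
    print(f"rate={bricks_per_minute}/min -> {minutes:.0f} minutes")

# Story points convert to calendar time only via measured velocity,
# and velocity drifts with proficiency, meetings, interruptions, etc.
story_points = 40
velocity_history = [18, 22, 20]  # points per iteration, assumed data
avg_velocity = sum(velocity_history) / len(velocity_history)
print(f"~{story_points / avg_velocity:.1f} iterations needed")
```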
Why, then, do many teams still estimate workload directly in time and insist there is no difference between "story points" and "person-days"? Because in their minds the rate is constant, so workload and time convert linearly. Behind this seemingly "reasonable" assumption lies a huge logical error.
7 Summary
This article has systematically explored R&D effectiveness measurement, focusing on its concrete practice, and has used two specific cases, the defect rate per thousand lines of code and agile workload estimation, to discuss common pitfalls in selecting measurement indicators.
In fact, you will find that measuring R&D efficiency is a process of abstracting and simplifying the entire software R&D process. But simplification brings distortion: the perceptual, meaningful parts that deserve repeated deliberation disappear, leaving only conclusive-looking metrics. A moment's carelessness gives us the illusion that we hold the truth in our hands.
So, with the momentum of cloud native and the trend of digitalization, how should companies quickly integrate and upgrade their R&D efficiency? Ru Bingsheng, the author of this article, invites readers to the enterprise R&D management forum of the Tencent Cloud CIF Engineering Efficiency Summit, where five industry experts from different companies will address the pain points of upgrading corporate R&D efficiency, combining theory with practice and real cases to help companies deliver business value more efficiently, with higher quality, more reliably, and more sustainably.
This October, the CIF Online Summit is open for registration.
More exciting agendas are waiting for you!