Let me start with something that left a deep impression on me recently. During some extreme-traffic regression tests we ran internally, we observed abnormal CPU usage on TiKV (TiDB's distributed storage component), but nothing unusual showed up in our Grafana metrics or log output, which left us puzzled for several days. In the end, an old hand made an educated guess, combined it with profiling, and found the real culprit, which turned out to be somewhere nobody expected: the logging module used for debugging. (To be clear, this bug has since been fixed, and it only triggers under extremely heavy load with the log level fully opened, so rest assured.)
This article is not a bug analysis. What I think matters more is the tools we used along the way and the thinking process of that old hand. As an observer, I watched younger colleagues follow his fluent use of perf and his switching among various tools and interfaces with admiring eyes, and I vaguely felt that something was off: it meant this craft could not be copied.
Afterwards, I did some research on the user experience of basic software and found that theory and material in this area are indeed scarce (most research targets consumer products; for system software there is probably only the UNIX philosophy school), and what exists lacks systematization, coming down instead to each author's personal "taste". Yet there clearly are good and bad software experiences: an experienced engineer can usually tell within a few keystrokes whether a command-line tool is pleasant to use, whether it is a tool with "taste".
In many cases, "taste" is called "taste" because it is unclear. This is certainly a manifestation of the artistry of software development, but it also means that it cannot be copied and is not easy to learn. I don’t think this is good either. Today’s article and possibly the next few articles (although I don’t know what to write in the next few articles, but set up a Flag first) will try to summarize where the good basic software experience comes from. .
As the first article, this one focuses on two of the most important topics: observability and interactivity. As for why these two are put together, let me keep that as a small piece of suspense and answer it at the end.
Observability
What is observability? For that you can read "Distributed System Observability in My Eyes" [1], which I published two years ago; I won't repeat it here. As our practice of observability in TiDB has deepened, so has our understanding of the topic. To set the stage, let's first clarify one question: when we talk about observability, who is doing the observing?
Who is observing?
Many readers may pause for a moment and think: what a question, of course it's a person, not the machine. Yes, it is indeed people who observe, yet this simple truth is often ignored by software designers. What is the difference between the two, and why is it important to emphasize that the subject is human?
To answer that, we need to face a reality: human short-term working memory is very limited. A large body of psychological research shows that the capacity of human working memory is roughly four items, that is, we can keep about four pieces of information in focus at once [2], however much more we can hold by chunking. Take the way we remember phone numbers: given 13800001111, we usually don't recite the digits one by one, but form groups like 138-0000-1111.
Once you understand the basic assumptions and bandwidth of the human mental model, I suspect many system software developers will stop boasting: "My software has more than 1,000 monitoring items!" Not only is that not a good thing; the extra information destroys the formation of short-term memory, introduces noise, and forces users to spend a lot of time fishing for key information in an ocean of data while unconsciously classifying it (I believe one of the brain's background tasks is to index and classify information, and note that this also consumes bandwidth). Hence the first conclusion: a single screen of a software interface should ideally contain no more than four pieces of key information. Which raises the next question: what is key information, and what is noise?
Distinguish key information from noise
There is no standard answer, but for system software my experience is: follow the key resources. Software is really quite simple; at its core it is the use and allocation of hardware resources, and the art of balancing them. The key hardware resources are just the few below. For each of them, over a given sampling window (a single data point doesn't mean much), you can ask a few simple questions to get a rough picture of how the system is running:
- CPU: Which threads are working? What are those threads doing? How much CPU time does each of them consume?
- Memory: What is currently being kept in memory? What are the hit rates? (Usually we care most about the business-level caches.)
- Network I/O: Is QPS/TPS abnormal? Which requests account for most of the current network I/O? Is the bandwidth sufficient? Is request latency high? Long connections or short ones (a rough measure of syscall cost)?
- Disk I/O: Is the disk reading or writing files? Which files? What pattern do most reads and writes follow? How large is the throughput? How long does a single I/O take?
- Critical logs: Not all logs are useful; people only care about logs containing specific keywords. So, do logs with those keywords appear?
Put the system through this standard interrogation and you will have a fair picture of its operating state. As a rough illustration, a minimal sampling sketch is given below.
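The sketch assumes the third-party gopsutil library (github.com/shirou/gopsutil) is available; it only answers the coarse, machine-level questions over one sampling window, while per-thread CPU time, per-file I/O, and keyword log scanning need tools like perf or your logging pipeline.

Go
package main

import (
	"fmt"
	"time"

	"github.com/shirou/gopsutil/v3/cpu"
	"github.com/shirou/gopsutil/v3/disk"
	"github.com/shirou/gopsutil/v3/mem"
	psnet "github.com/shirou/gopsutil/v3/net"
)

func main() {
	// CPU: how busy is the machine over a one-second sampling window?
	// (Errors are elided for brevity throughout this sketch.)
	cpuPct, _ := cpu.Percent(time.Second, false)
	fmt.Printf("cpu busy: %.1f%%\n", cpuPct[0])

	// Memory: how much is used, and how close to the limit are we?
	vm, _ := mem.VirtualMemory()
	fmt.Printf("mem used: %.1f%% of %d MiB\n", vm.UsedPercent, vm.Total/1024/1024)

	// Disk I/O: cumulative bytes read/written; diff two samples to get throughput.
	ioStats, _ := disk.IOCounters()
	for name, st := range ioStats {
		fmt.Printf("disk %s: read %d MiB, written %d MiB\n",
			name, st.ReadBytes/1024/1024, st.WriteBytes/1024/1024)
	}

	// Network I/O: aggregate bytes in/out; again, diff two samples for bandwidth.
	nio, _ := psnet.IOCounters(false)
	fmt.Printf("net: sent %d MiB, received %d MiB\n",
		nio[0].BytesSent/1024/1024, nio[0].BytesRecv/1024/1024)
}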
The further key point is that these system metrics must be tied to the business context to be useful. For example, for a database that supports transactions, suppose we look at the CPU threads and call stacks and find that a lot of CPU time is spent on things like wait / sleep / idle, and no other I/O resource is the bottleneck. Looking only at those numbers will leave you confused, but combined with the transaction conflict rate the picture becomes much clearer. It is even more useful to tell the observer directly which transactions the lock waits are being spent on, and even which rows are in conflict.
This doesn't mean the other information is useless; rather, the value of much of it is only apparent in hindsight. Most debug logs, for instance, or the auxiliary information used to verify a conjecture, are of little help when solving an unknown problem, and they require the observer to have a lot of background knowledge. The best way to present this kind of information is folded away, preferably out of sight by default.
If you open TiDB's internal Grafana, you will see a large number of such metrics, such as stall-conditions-changed-of-each-cf (I know what this metric means, but I would guess 99% of TiDB users don't). From the name alone I can feel the inner struggle of the engineer who wrote it: he clearly wanted other people (or his future self) to understand what it refers to, but unfortunately, at least with me, he did not succeed.
What comes after observation? Action. But before acting, think: what is the prerequisite for action? Our handling of problems generally follows this pattern (my own summary, though any book on cognitive psychology will have something similar): observe -> form a motivation -> conjecture -> verify the conjecture -> form a plan -> act, then return to observation and repeat the loop.
The part where the human (or the veteran's experience) matters most is the link from observation to conjecture. As for the motivation behind observation, there are really only two:
- Resolving the failure at hand;
- Avoiding potential risks (preventing future failures).
If the system shows no problem, no change is needed. I consider these two steps important because essentially every other link in the chain can be automated; only these two are hard, since they require human knowledge, experience, and intuition.
A system with good observability is usually a master at putting human intuition to work. A small example: when opening a system's dashboard, we try not to read specific text first. If the screen has many red and yellow blocks, intuition tells us the system is probably in an unhealthy state; further, if the red and yellow are concentrated in one particular area of the screen, our attention naturally focuses there; and if the whole screen is green, the system is probably in reasonably good health.
How do we make the most of human intuition? Or rather, where should we direct it? I think the best target is: anticipating risk.
Where does human intuition come in? Anticipating risk
Some prior knowledge is needed here. Before getting into the topic, let me share a story I once heard. Back in the day, a motor at a Ford factory broke down, so they brought in a master. He listened to the sound, watched the machine run for a while, then drew a line on the motor with chalk and said the coil at that spot had a few too many windings. The half-believing workers followed his advice, and sure enough the problem was solved. The master then charged a repair fee of $10,000 (a sky-high price at the time). Ford's boss asked why drawing a single line was worth so much money, and the master wrote out the invoice: drawing the line, $1; knowing where to draw it, $9,999.
Setting aside whether the story is true, if it is, it shows that intuition and experience can produce real value. My first reaction on hearing it was that the master must simply have seen this situation far too many times (obviously), and that the problem must be a common one.
In fact, the hardest part of troubleshooting is using observation (especially a few characteristic signals) to eliminate most of the unreliable directions, and trusting that the causes of common failures converge. The first step for a system with good observability is therefore to guide the user's intuition in that direction: use the knowledge of those who came before to surface the most likely failure points and their related indicators (CPU usage and the like). The second step is to present them with a few psychological tricks.
Let me illustrate with TopSQL, a small feature coming soon in TiDB. The function is simple to describe: we found that many user failures are related to a small number of SQL statements, whose CPU footprint differs markedly from all the others even though each one looks fairly normal in isolation. So TopSQL answers one question: how much CPU is being consumed, and by which SQL? I'll resist interpreting the screenshot below; I suspect you'll know how to use it right away:
Your intuition tells you that the dense block of green in the second half looks different from the rest, pushing up overall CPU usage, and that something feels wrong there. That's right, and it is probably the correct direction. A good visualization lets human intuition lock onto the main contradiction quickly.
What is "an operation"? Identify the true life cycle of the operation
While writing the first point, I thought of another key resource that is often overlooked: time. I originally wanted to put it in the key resources section, but on reflection it fits better here.
From a slightly metaphysical point of view, our computers are all implementations of Turing machines. We learn early on that the minimum feature set of a Turing-complete language is: reading and writing variables, branching, and looping. To put it poetically: a program is countless cycles of reincarnation, big loops nested with small loops, and within each one, choices (branches) are made continually according to the current state (variables).
Having said that, sharp readers may already guess where I'm going: talking about observability detached from the cycle is meaningless. The definition of the cycle is flexible. For a person, the big cycle is obviously a lifetime; a small cycle can be a year or a day; a cycle doesn't even have to be measured in time at all, say, the cycle of a job.
For database software, what is a reasonable cycle? The execution of one SQL statement? Or a transaction from Begin to Commit? There is no standard answer, but my personal suggestion is: the closer the cycle is to the end user's actual usage scenario, the more practical it is.
For example, in a database, choosing the execution of a single SQL statement as the cycle is not as good as choosing the transaction, and the transaction is not as good as the full link of one application request. In fact, TiDB introduced OpenTracing very early to track which functions are called and how much time each takes during the execution of a SQL statement, but at first it was only used in TiDB's SQL layer (those familiar with us know our SQL layer and storage layer are separate); it was not implemented in TiKV at the storage layer, so tracing the execution of a SQL statement hit a dead end as soon as it went down into TiKV.
Later we implemented an initial version of passing the TraceID and SpanID down to TiKV, and at least the picture of one cycle became more complete. We had planned to stop there, but then something small happened. One day a customer asked: why is my application's access to TiDB so slow? We looked at the TiDB monitoring: no, the SQL sent to the database was returning within milliseconds. But the customer insisted: look, my request did nothing else, so why don't the two sides add up? Only after we added a tracer did we learn that the problem was in the customer's own network.
This case reminded me that if we do full-link tracing, the "full link" should be measured from the business-side request; that is the life cycle that makes sense to look at. So afterwards, by extending session variables in TiDB, we made it possible for users to pass OpenTracing-protocol tracer information into the TiDB system through a session variable, connecting the business layer and the database layer and truly enabling full-life-cycle tracking. This feature will also appear in a release in the very near future. A rough sketch of the propagation idea follows.
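The sketch below only illustrates the idea: the application starts the root span for the whole business request, serializes the span context, and hands it to the database session. It uses the opentracing-go package and database/sql; the session variable name @trace_context is hypothetical, and it assumes a real tracer (e.g. Jaeger) has been registered with opentracing.SetGlobalTracer.

Go
package main

import (
	"context"
	"database/sql"
	"encoding/json"

	opentracing "github.com/opentracing/opentracing-go"
)

func handleRequest(ctx context.Context, db *sql.DB) error {
	// The root span covers the whole business request, not just the SQL inside it.
	span, ctx := opentracing.StartSpanFromContext(ctx, "business-request")
	defer span.Finish()

	// Serialize the span context into a text carrier so it can cross the
	// process boundary into the database.
	carrier := opentracing.TextMapCarrier{}
	if err := opentracing.GlobalTracer().Inject(
		span.Context(), opentracing.TextMap, carrier); err != nil {
		return err
	}
	payload, err := json.Marshal(carrier)
	if err != nil {
		return err
	}

	// Hand the trace context to the database session (hypothetical variable name),
	// so the SQL layer and the storage layer can attach their spans to ours.
	if _, err := db.ExecContext(ctx, "SET @trace_context = ?", string(payload)); err != nil {
		return err
	}

	rows, err := db.QueryContext(ctx, "SELECT /* traced business query */ 1")
	if err != nil {
		return err
	}
	return rows.Close()
}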
Having said all this, let me summarize a few points:
- Time is also an important resource.
- Whether you are grabbing samples or doing traces, choosing the right cycle is very important.
- The closer the cycle is to the business's own cycle, the more useful it is.
When observability saves your life: post-mortem observation
I don't believe anyone stares at monitoring dashboards all day for fun. In practice, by the time we need observability, a perceivable failure or a clear risk has usually already occurred: the system may already be gravely ill, or the alarm has gone off without the root cause being known, or some subtle anomaly appeared at an earlier point in time, and at that moment we discover that, apart from the usual metrics, there is no more information to go on. Of course, we don't keep the CPU profiler running forever; normally the profiler is triggered manually. But if we want to find the cause after the incident, having a CPU profile recorded from before the incident is of great help for resolving and reviewing the problem. A better approach is therefore to turn on the profiler automatically at fairly short intervals (say, every few minutes) and save the results automatically, like keeping the records of regular in-depth physical exams, deleting old records on a schedule. When something does go wrong, you can quickly go back in time and save lives far more efficiently.
Also, believe me, profiling (let alone intermittent profiling) carries no obvious performance penalty. This capability is called Continuous Profiling; it is very practical and will be available to you soon.
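A minimal sketch of the idea with Go's standard runtime/pprof package is below: every minute, capture a short CPU profile into a timestamped file and prune old ones. The directory layout, interval, and retention are illustrative, not how TiDB actually implements it.

Go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime/pprof"
	"sort"
	"time"
)

// continuousProfile captures a CPU profile of `duration` every `every`,
// keeping only the most recent `keep` files in `dir`.
func continuousProfile(dir string, every, duration time.Duration, keep int) error {
	for {
		name := filepath.Join(dir,
			fmt.Sprintf("cpu-%s.pprof", time.Now().Format("20060102-150405")))
		f, err := os.Create(name)
		if err != nil {
			return err
		}
		if err := pprof.StartCPUProfile(f); err != nil {
			f.Close()
			return err
		}
		time.Sleep(duration) // profile for a short window
		pprof.StopCPUProfile()
		f.Close()

		prune(dir, keep)
		time.Sleep(every - duration)
	}
}

// prune deletes the oldest profiles so only `keep` remain; the timestamped
// names sort chronologically.
func prune(dir string, keep int) {
	files, _ := filepath.Glob(filepath.Join(dir, "cpu-*.pprof"))
	sort.Strings(files)
	for len(files) > keep {
		os.Remove(files[0])
		files = files[1:]
	}
}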
In our experience, combined with the previous section, once you have a complete tracing system, most debugging sessions can find the root cause of the problem through tracing plus logs.
The best observability is to be able to guide the user: "What should I do next?"
Regarding the action step mentioned above, I noticed a particularly interesting phenomenon while watching veterans handle problems: an experienced developer can always quickly decide, from observation alone, what to do next, without looking things up or waiting for someone else's guidance, staying completely in a state of flow (for example, on seeing uneven data distribution or hotspots in a TiDB cluster, they know to adjust the scheduling policy or manually split a Region). Newcomers, however, always get stuck at exactly this step: they go off to Google or to the documentation, while their inner monologue runs, "I can see the problem, but now what?" If at this moment the system could suggest which indicators to look at next, or what action to take, it would be far friendlier. Not many systems can do this; if yours can, I believe it has done an excellent job on observability. I put this point at the end of the observability section precisely to use it as a bridge to interactivity.
Interactivity
Before talking about interactivity in basic software, let me review a bit of computer history with you. In my view, one thread of computer history is the history of human-computer interaction: from the earliest machines, where a panel full of wires and switches gave no clue how to operate them, to today, when someone who has never opened the iPhone manual can use one proficiently. Behind this lies progress in multiple disciplines, including but not limited to psychology, cognitive science, neuroscience, philosophy, and of course computer science.
Back in our own field, basic software is admittedly a bit removed from the general public, and in the past much of the design was done by engineers, people like us who generally lack an understanding of human nature (no offense). The typical logic goes: "I am a human, therefore I understand humans. I can understand my own design; since I'm human, other humans can understand it too. If others can't figure out how to use it, they should go read the documentation" (said with a straight face).
When we review failures, we often conclude "improper user operation", but is that really the root cause? At a previous company I experienced an incident that left a deep impression on me. We had a home-grown distributed file system which, like all file systems, had a shell supporting some UNIX-style commands.
One day an engineer executed the command rm -rf usr local/... (note the space after usr), and the system obediently began deleting itself... In the end, the review did not blame the operator; instead it held the system's designer (a company executive at the time) responsible, because this was simply bad interaction design: a confirmation before deleting important directories, or protection through the permission system, would have prevented the whole thing. The machine did work exactly according to its logic; there was no bug here (the deletion was even very efficient, it being a distributed system, LOL).
Over my long years as an engineer, I gradually came to understand one truth: the best engineers find a balance between logic and empathy, and good design comes from an understanding of both technology and psychology. After all, we write programs for people.
As users of software, we are not so much using it as having a conversation with it. And since it is a conversation, it is an interactive process. What makes a good interactive experience? Let me try to summarize a few principles for software designers; it's my first attempt at this, and I may well add more later.
Nobody reads the documentation: one-command start and exploratory learning
Admit it: nobody reads the manual. When we get a new iPhone, the first reaction is to press the power button (remarkably, we seem to know subconsciously where it is), certainly not to look up the power button in the manual; once it's on, we explore the new world with our fingers. For the same simple reason, why should system software require reading documents before you can get anything done?
I often lecture our young product managers: "Your users will stay on your GitHub homepage, or on the Quick Start section of your docs, for ten seconds at best. They don't even have the patience to finish reading it. They subconsciously look for the dark-background block of text (the shell command), copy its contents into their terminal to see what happens, and do nothing else. If that first command fails, there will be no second one. Remember, you get exactly one chance."
A small example: when we were building TiUP (TiDB's installation and deployment tool), I repeatedly warned its product manager: no rambling on the homepage, just one command that the user can paste and run:
Screenshot of TiUP's homepage (tiup.io)
This example can be extended a bit. I remember attending FOSDEM in Brussels the year before the pandemic; in a bar near the venue I was chatting with a DevOps engineer from the UK, and perhaps we had both drunk a bit too much, but he said: "Any system software that can't be installed successfully with one apt-get install is not good software."
You may then ask: if there really are pieces of information or concepts that must be delivered to the user, what in cognitive psychology would be called a mental model, what is the best way? My own experience: exploratory learning. A system that supports this way of building cognition usually needs to be self-explanatory, that is, after the first step (such as turning on the iPhone), the output of each step tells the user how to determine and complete the next one.
For example, MySQL's system tables are familiar to every MySQL user. With just an interactive mysql client connected to an instance, you don't need anyone to tell you what's inside INFORMATION_SCHEMA: a SHOW TABLES shows you, and then you can use SELECT * FROM to explore the contents of each table in INFORMATION_SCHEMA step by step. This is an excellent example of self-explanatory design (with one premise: SQL is the unified interaction language).
Another particularly good example is Telegram's BotFather. Anyone who has written a Telegram bot will have been impressed by how easy BotFather makes it. One picture says it all:
The process of creating a chatbot with Telegram's botfather
Telegram is a chat application, and BotFather cleverly reuses the familiar IM interaction for the otherwise rather dry process of bot development, instead of coldly throwing the user a URL, https://core.telegram.org/bots/api , and leaving them to figure it out on their own.
To close this section: there is an unkillable urban legend that a fish's memory lasts only seven seconds. I'd say the same is true of people. May you be a "fish" who gets to use good software.
Think one more step for the user, tell them half a step, and let them take the other half themselves
I like reading science fiction, and one ultimate philosophical question many novels explore is: do we really have self-awareness? We like to think we do, yet when the software spits out Unknown Error, you definitely wish a voice would tell you what to do next, right? When excellent basic software has to deliver negative feedback, the best thing it can do is suggest what the developer should do next. A classic example: every Rust developer has been schooled by the compiler, yet the process is not, strictly speaking, painful. Look at the output below:
Plain Text
error[E0596]: cannot borrow immutable borrowed content `*some_string` as mutable
--> error.rs:8:5
|
7 | fn change(some_string: &String) {
| ------- use `&mut String` here to make mutable
8 | some_string.push_str(", world");
| ^^^^^^^^^^^ cannot borrow as mutable
It isn't painful because the compiler tells you exactly where the problem is, why it's a problem, and what to do next. A mediocre compiler might just print "cannot borrow as mutable" and call it a day; a compiler that cares about experience helps you think one step further.
Returning to the question of self-awareness, there is an old joke I once heard: a test engineer walks into a bar and orders NaN glasses of Null; a test engineer walks into a bar disguised as the owner and orders 500 beers without paying; ten thousand test engineers roar past the door of the bar; a test engineer walks into a bar and orders a beer'; DROP TABLE. The test engineers leave the bar satisfied, then a regular customer walks in and orders a fried rice, and the bar blows up, LOL. The moral for software designers: you can never exhaust what users will think of doing. Rather than letting users' imagination run free, it is better to design the storyline yourself and lead users along it step by step. But then, why leave them half a step?
My answer:
"Participation" will bring happiness. People are sometimes contradictory. While hoping that the machine will do everything automatically, they also expect that they will have the initiative. Sometimes the software already knows that the next step must be to do something, but leaving the operator to complete it is equivalent to giving the operator a sense of accomplishment.
The right to choose should stay with the operator, especially for one-way-door decisions: go or no-go should remain a human call.
On this, I have a few tips (a small sketch follows the list):
- For operations that may trigger a chain of further operations (terraform deployment scripts, say, or cluster changes), provide a Dry Run mode that only prints the operations without executing them.
- For such batch operations, design save points wherever possible so a failed run doesn't have to start over from scratch (much like resumable downloads); the experience is much better.
- When a genuine Unknown Error is hit, output as much contextual information as possible to help debugging, and at the end of the error message tell the user where to file a GitHub issue, ideally with the issue title pre-filled in the URL (while leaving the decision of whether to file it to the user).
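Here is a small sketch of tips one and three: a --dry-run flag that only prints the planned operations, and an unknown-error path that prints the context it has plus a pre-filled GitHub issue link. The repository URL, flag name, and operations are all made up for illustration.

Go
package main

import (
	"flag"
	"fmt"
	"net/url"
	"os"
)

func main() {
	dryRun := flag.Bool("dry-run", false, "print the planned operations without executing them")
	flag.Parse()

	ops := []string{"stop node tikv-3", "upgrade binary", "start node tikv-3"}
	for _, op := range ops {
		if *dryRun {
			fmt.Println("[dry-run]", op)
			continue
		}
		if err := execute(op); err != nil {
			reportUnknownError(op, err)
			os.Exit(1)
		}
	}
}

// execute is a stand-in for the real change; here it always fails so the
// error path below can be shown.
func execute(op string) error {
	return fmt.Errorf("unknown error while running %q", op)
}

// reportUnknownError prints the context we have plus a pre-filled issue link,
// leaving the decision of whether to file it to the user.
func reportUnknownError(op string, err error) {
	fmt.Fprintf(os.Stderr, "operation failed: %s\nerror: %v\n", op, err)
	title := url.QueryEscape(fmt.Sprintf("cluster change failed at step %q: %v", op, err))
	fmt.Fprintf(os.Stderr,
		"if this looks like a bug, you can open an issue with the context pre-filled:\n"+
			"https://github.com/example/project/issues/new?title=%s\n", title)
}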
A unified language: the controller and the controlled object
I have interviewed many systems engineers, and there is one question I always ask: what is the best database CLI tool you have used? The overwhelming majority answer redis-cli almost without thinking. I would give the same answer myself, which made me wonder: why?
"Controller"-"Controlled Object" is a very common mode in basic software, just like when we are operating a TV, most of the time is through the remote control, so it can be considered that the user is the first to the TV. One and most of the contacts are actually remote controls, so analogy to the basic software, the design of the controller is actually very important. To make a good controller, I think the key points are:
- Build a unified interactive language
- Self-consistent and concise conceptual model
Let me use redis-cli to illustrate. Anyone who has used it knows that every operation follows the pattern [CMD] [ARG1] [ARG2] .... There are no exceptions in redis-cli: whether you are manipulating data or changing configuration, everything uses the same interaction language, and that language is clear at a glance. The language also has some very natural conventions, for example, commands (CMD) are always a few letters containing no symbols.
Bash
redis 127.0.0.1:6379> SET k v
OK
redis 127.0.0.1:6379> DEL k
(integer) 1
redis 127.0.0.1:6379> CONFIG SET loglevel "notice"
OK
redis 127.0.0.1:6379> CONFIG GET loglevel
1) "loglevel"
2) "notice"
Redis-cli interactive example
The MySQL example from the exploratory-learning section works the same way: SQL itself is the unified interaction language, though it is not quite as intuitive as Redis's.
The second point is the conceptual model. Redis has the advantage of being a key-value database, so the concept is extremely simple: everything is a key-value pair. Look at its CLI tool: every function and interaction maps onto this key-value model. This feels natural because we reach for redis-cli only after accepting that Redis is a KV database, so by the time we use the CLI, the mental assumption of a key-value model is already in place, which makes every operation feel obvious. Many excellent database systems do the same, Oracle for instance: in theory you can perform all operations on the software itself through SQL, because anyone using Oracle can be assumed to already know the relational model and SQL.
Having given the positive examples, here is a counter-example. As everyone knows, the main TiDB project (leaving aside other tools such as cdc and binlog) has at least three controller tools: tidb-ctl, tikv-ctl, and pd-ctl. TiDB is indeed a distributed system composed of multiple components, but for users, most of the time the object being operated is TiDB as a whole (a database), yet the several ctl tools are used quite differently. For instance, pd-ctl is an interactive controller whose scope of influence is roughly PD itself and TiKV; tikv-ctl overlaps with it somewhat, but operates only on a single TiKV instance. This is thoroughly puzzling: TiKV is obviously a distributed system, yet tikv-ctl is a single-node controller? So which ctl should be used to control TiKV? Answer: most of the time, pd-ctl (surprised?).
It is as if you owned one TV set but needed three remote controls to operate it, and the remote that actually controls the TV were called the set-top box remote. In everyday life this would be considered an obvious design fault, so why does everyone's tolerance suddenly become so much higher in the field of basic software?
No surprises: I'm not afraid of trouble, I'm afraid of being startled
I don't know whether this is a universal phenomenon, but when users of basic software run into errors (especially ones caused by bad interaction), they usually blame themselves first and feel guilty, thinking it must be their own fault, and rarely attribute it to the software. What's more, once someone can operate some complex, fragmented piece of software fluently, many regard that as a kind of "skill"; after all, nobody wants to be seen fumbling.
There are deeper reasons behind this (hacker culture has a tendency to glorify complexity), but let me say it plainly: this is the software's problem! Just as I have never been shy about admitting that I can't use gdb, not because my IQ is lacking, but because the thing is genuinely hard to use.
Yet I have seen plenty of people treat fluency with command-line gdb as something to show off. Back to the counter-example above: at one heavy TiDB user's site I watched their operators do day-to-day maintenance, and one operator switched between the various ctl tools with great fluency; he saw no problem with it and even seemed a little proud. Thinking about it later, people are simply very adaptable, and what really frustrates them is not the trouble itself. When you perform an operation on a system, you usually carry a subconscious assumption. For example, when a feature is called an "xx switch", the user's expectation on flipping it on is an unambiguous positive response; if that is not what happens, the user will be very frustrated. Here is a real story: TiDB 5.0 introduced a new feature called MPP (Massively Parallel Processing), and we added a switch-like configuration named tidb_allow_mpp.
You may have spotted the problem already: as a switch-type configuration, setting it to OFF gives 100% negative feedback, which is fine; but when it is set to ON, whether the feature actually kicks in depends on the optimizer's judgment. In other words, there is a certain probability that MPP will not take effect. It is like a light switch in a room: when you flip it off, the light is guaranteed to go out, but when you flip it on, the light may or may not come on (the light has decided the room is bright enough already, no need...). You would never call that light smart; you would say it is broken. A better way to write this configuration would be:
tidb_mpp_mode = ON | OFF | AUTO
I don't need to explain this, and you don't need to read the documentation: you know how to use it at a glance. Good configuration should explain itself. Generally speaking, configuration items are the hardest-hit area for ruined user experience; more on that when we get to feedback. A tiny sketch of the tri-state idea follows.
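As a small sketch of the tri-state idea (the type and names below are illustrative, not TiDB's actual code): each value states exactly what the user should expect, instead of a boolean that silently may or may not take effect.

Go
package main

import "fmt"

type MPPMode int

const (
	MPPOff  MPPMode = iota // never use MPP
	MPPOn                  // use MPP; surface an error or warning if it cannot be used
	MPPAuto                // let the optimizer decide; no promise either way
)

// ParseMPPMode maps the user-facing setting onto the tri-state mode, rejecting
// anything that is not one of the three explicit values.
func ParseMPPMode(s string) (MPPMode, error) {
	switch s {
	case "OFF":
		return MPPOff, nil
	case "ON":
		return MPPOn, nil
	case "AUTO":
		return MPPAuto, nil
	default:
		return MPPOff, fmt.Errorf("tidb_mpp_mode must be ON, OFF or AUTO, got %q", s)
	}
}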
There is a "quiet principle" in UNIX philosophy, which says that if the program has nothing special to express, it should be quiet. One specific manifestation is to encourage the command line program to execute successfully without outputting anything, just exit with 0 as the return code. In fact, I have reservations about this. If the user's behavior is in line with expectations As a result, a clear positive feedback should be used as a reward (for example, printing a Success is good), and don't forget the humanity master Paplov.
Feedback: expose progress, not internal details
I mentioned feedback just now, and I don't think it is an exaggeration to call feedback the most important ingredient of a good experience. Anyone who has studied cybernetics knows how central the concept of feedback is. The self-explanatory behavior discussed earlier feels good to use precisely because its feedback is immediate.
What surprises me is how horribly many pieces of basic software handle interactive feedback. A familiar example: when some database software receives a complex query and you hit Enter, it usually just hangs there. The database may indeed be working hard scanning data, and a few minutes later it either returns a result or falls over, but during that whole time there is no feedback on how much data has been scanned or how much remains. This is a poor experience, because that information is progress (ClickHouse does well here). Feedback needs to be carefully designed; some of my rules of thumb are:
Feedback must be immediate, ideally within 200 ms of hitting Enter (the human physiological reaction time; beyond that people start to feel a lag). The sense of smoothness is created by feedback.
Feed back progress, not details, and certainly not details that require context to understand (unless in a debug mode). Here is a counter-example of our own ( https://asktug.com/t/topic/2017 ), with a small progress sketch after the discussion:
Bash
MySQL [test]> SELECT COUNT(1) AS count, SUM(account_balance) AS amount, trade_desc AS type FROM b_test WHERE member_id = '22792279001' AND detail_create_date >= '2019-11-19 17:00:00' AND detail_create_date < '2019-11-28 17:00:00' GROUP BY trade_desc;
ERROR 9005 (HY000): Region is unavailable
What's wrong with this case? Obviously, for users, Region is a TiDB-internal concept, so the natural questions are: what is a Region (I planted this one earlier, did you notice)? Why does my SELECT have anything to do with Regions? Why is a Region unavailable? And what can I do about it? Exposing this information to the user is useless; it only creates noise. The actual cause here was that TiKV was too busy to return the requested data in time. Better feedback would be: which piece of data (in terms the user understands, such as which table and which rows) could not be read because a particular TiKV node was too busy, ideally with a hint about why it is busy and how to resolve it; at the very least, post a link to the relevant FAQ (I have even seen software print a StackOverflow search URL directly, LOL).
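Below is a small sketch of what "feed back progress" might look like for a long scan: report on stderr at most every 200 ms, in terms the user understands (rows scanned against an estimate), and keep internal concepts out of it. The function shape is purely illustrative.

Go
package main

import (
	"fmt"
	"os"
	"time"
)

// scanWithProgress repeatedly calls scanOne (which returns false when the scan
// is finished) and reports progress at most once every 200 ms.
func scanWithProgress(totalEstimate int64, scanOne func() bool) {
	var scanned int64
	ticker := time.NewTicker(200 * time.Millisecond)
	defer ticker.Stop()

	for scanOne() {
		scanned++
		select {
		case <-ticker.C:
			fmt.Fprintf(os.Stderr, "\rscanned %d / ~%d rows (%.0f%%)",
				scanned, totalEstimate, float64(scanned)/float64(totalEstimate)*100)
		default:
			// No tick due yet: keep scanning without printing.
		}
	}
	fmt.Fprintf(os.Stderr, "\rscanned %d rows, done          \n", scanned)
}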
Set milestones for positive feedback. For example, when a server program starts serving normally, print an ASCII art banner and use colored labels for the different log levels; these are clear signals to the user (redis-server does a good job here). Designing feedback for interactive command-line programs is usually easy. What is genuinely troublesome is that basic software tends to rely heavily on configuration files, and the problem with configuration is that the feedback loop from modifying it to confirming it took effect is very long. The common flow is: edit the config, restart, observe the effect. Because the configuration lives in a file, the feedback on the edit itself is terrible: the user doesn't know whether the change took effect, especially for options whose effect is not obvious. Some good practices: on startup, print which configuration file was read and what its content is; and provide a command-line function along the lines of print-default-config that dumps a template configuration directly, saving the user a round of Googling. A sketch of both practices follows.
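In the sketch below the flag names and the config shape are illustrative: the program either dumps a default template, or announces exactly which file it loaded and what it thinks that file says.

Go
package main

import (
	"flag"
	"fmt"
	"os"
)

const defaultConfig = `# example template
log-level = "info"
listen-addr = "0.0.0.0:4000"
`

func main() {
	configPath := flag.String("config", "", "path to the config file")
	printDefault := flag.Bool("print-default-config", false, "print a config template and exit")
	flag.Parse()

	if *printDefault {
		fmt.Print(defaultConfig)
		return
	}

	// Tell the user exactly which file we read and what we think it contains,
	// so "did my change take effect?" has an immediate answer.
	if *configPath != "" {
		content, err := os.ReadFile(*configPath)
		if err != nil {
			fmt.Fprintf(os.Stderr, "failed to read config %s: %v\n", *configPath, err)
			os.Exit(1)
		}
		fmt.Printf("loaded config from %s:\n%s\n", *configPath, content)
	} else {
		fmt.Println("no --config given, using built-in defaults")
	}
	// ... start serving ...
}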
For distributed systems, configuration gets more complicated still, because on top of the distinction between local and global configuration there are the problems of distributing updated configuration, including rolling restarts (requiring a process restart for configuration to take effect is itself not good design). To be honest, I don't have a particularly good answer yet. A possible direction is a distributed global configuration center such as etcd, or (for a database) a set of global configuration tables. The overall principles, though, are: centralized beats scattered; immediate effect beats restart; a unified way of interacting (for both changing and reading configuration) beats multiple ways. One possible shape of that idea is sketched below.
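As a sketch of that etcd-style direction (the key name is made up and error handling is minimal): read one authoritative key at startup and watch it, so changes take effect on every node without a rolling restart.

Go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchGlobalConfig loads the shared config once, then applies every update
// as it arrives, with no process restart required.
func watchGlobalConfig(endpoints []string, apply func(string)) error {
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		return err
	}
	defer cli.Close()

	// Centralized: every node reads the same key instead of its own local file.
	resp, err := cli.Get(context.Background(), "/config/global")
	if err != nil {
		return err
	}
	for _, kv := range resp.Kvs {
		apply(string(kv.Value))
	}

	// Immediate effect: apply updates as they arrive.
	for wresp := range cli.Watch(context.Background(), "/config/global") {
		for _, ev := range wresp.Events {
			fmt.Println("config changed, applying new value")
			apply(string(ev.Kv.Value))
		}
	}
	return nil
}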
Written at the end
That is about all for now, though I think this article is only an introduction; there must be many good practices I haven't covered, and I'd love to hear from anyone with ideas. Let me also resolve the suspense left at the beginning: why write observability and interactivity together in the first article? Because together they map onto a classic model of human action from cognitive psychology [3]:
When using software, users face two gulfs: the gulf of execution, where they must figure out how to operate the software, how to "talk" to it; and the gulf of evaluation, where they must interpret the result of the operation. Our mission as designers is to help users bridge these two gulfs, and that corresponds exactly to the interactivity and observability discussed in this article.
Designing software that is a pleasure to use is an art, and not necessarily a simpler one than designing a clever algorithm or a robust program; in a sense it is harder, because it requires the designer to have a deep understanding of, and passion for, both software and people. Let me close with a quote from Steve Jobs, as encouragement to us all:
Design is not just what it looks like and feels like. Design is how it works.
References:
[1] Huang Dongxu, "Distributed System Observability in My Eyes", 2020
[2] "Overtaxed Working Memory Knocks the Brain Out of Sync", Quanta Magazine
[3] Donald Norman, The Design of Everyday Things, 1988