[Editor's note: This article introduces what a notebook is, the importance and advantages of notebooks in data science, and the key factors that data scientists and algorithm teams should weigh when choosing one. It also offers a preliminary comparison of common notebooks along those selection dimensions, as a reference for data scientists and algorithm engineers.]
A notebook is a web-based interactive computing environment in which users can develop and document code, run it, display results, and share their work. Compared with a traditional non-interactive development environment, the defining feature of a notebook is that it executes code cell by cell. Notebooks are a vital tool in data science, and data scientists use them for experiments and exploratory work. In recent years, with the growth of big data, non-technical users such as business analysts have increasingly adopted notebooks as well.
1. Core Advantages of Notebook
In a traditional non-interactive development environment, a developer's program must be compiled into an executable and then run in full. When an error occurs, the developer must return to the editor, change the code, and then re-run everything from the start.
In a notebook, developers write and run programs cell by cell. When an error occurs, they only need to fix and re-run the offending cell; the results of cells that ran correctly are kept in kernel memory and need not be recomputed, which greatly improves development efficiency. This is why notebooks are loved by data scientists and algorithm engineers and are widely used in AI algorithm development and training. Take deep learning as an example: training a model usually takes several hours to more than ten hours. Debugging a model in a notebook means that after a minor change there is no need to retrain everything, saving data scientists and algorithm engineers a great deal of time.
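The cell-persistence behaviour described above can be sketched in plain Python: a kernel is essentially a long-lived namespace, and each cell is executed against it, so earlier results stay available. This is a minimal illustration of the idea, not real Jupyter internals:

```python
# A toy "kernel": one persistent namespace shared by all cells.
namespace = {}

def run_cell(source: str) -> None:
    """Execute one cell's source code against the shared namespace."""
    exec(source, namespace)

# Cell 1: an expensive step (think: loading data, training) runs only once.
run_cell("data = list(range(1_000_000))")

# Cell 2: can be edited and re-run repeatedly without redoing cell 1,
# because "data" is still alive in the shared namespace.
run_cell("total = sum(data)")
print(namespace["total"])
```

Re-running only the second `run_cell` call is the notebook workflow in miniature: the costly first cell's result survives in memory across iterations.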
2. The Basic Structure of a Notebook
The earliest notebook was Mathematica, launched in 1988. Early notebooks were used mainly in academia. As notebooks moved from academia into industry over the past decade, more and more of them have appeared on the market: the open-source Jupyter and Apache Zeppelin; the commercially hosted Colab, JetBrains Datalore, and IDP Studio**; and Polynote, which supports mixing languages, among others.
Although notebooks come in many varieties, they share two core components. One is the front-end client, an ordered list of input/output cells into which users enter code, text, and so on. The other is the back-end kernel, which may run locally or in the cloud. Code is passed from the front end to the kernel, which runs it and returns the result to the user. The kernel largely determines a notebook's computing performance; IDP Studio**, for example, rewrote its kernel in Rust, improving notebook startup and resource-provisioning speed by an order of magnitude.
(** In this article, IDP Studio refers only to the notebook interactive programming environment within IDP Studio; other IDP Studio features, such as model management and model publishing, are beyond the scope of this article.)
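The client/kernel split can be illustrated with a small sketch: the front end sends a cell's code to the kernel, and the kernel executes it and returns the output to display. This is a deliberately simplified model; real Jupyter kernels communicate over ZeroMQ with a much richer message protocol:

```python
import io
import contextlib

class ToyKernel:
    """A stand-in for a notebook kernel: executes code, returns its output."""

    def __init__(self):
        self.namespace = {}  # state persists across execute requests

    def execute(self, code: str) -> str:
        """Handle one 'execute request' from the front-end client."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)
        # The captured stdout plays the role of the 'execute reply'
        # that the front end renders as the cell's output.
        return buffer.getvalue()

# The front end passes each cell's code to the kernel and renders the reply.
kernel = ToyKernel()
kernel.execute("x = 21")                 # no output; state is stored
reply = kernel.execute("print(x * 2)")   # reply contains "42\n"
print(reply, end="")
```

Because the kernel holds the state, the front end can live anywhere (a browser tab, a remote client) while the computation happens wherever the kernel is configured to run.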
3. How to Choose a Suitable Notebook
Different notebooks have their own strengths, and data scientists and algorithm engineers need to choose the tool that best fits their core needs. Based on interviews with many data scientists, we summarize below the four questions they care about most when selecting a notebook; these can serve as selection criteria for algorithm developers and data-mining practitioners.
1) Completeness and ease of use of basic functions
- Installation and deployment: For newer data scientists, commercially hosted notebooks (such as IDP Studio, Colab, and JetBrains Datalore) follow the SaaS model: they work out of the box and are easier to install and get started with. Open-source notebooks must be installed by users themselves; local installation is usually straightforward, but installing and running them on a remote server is more challenging.
- Version management: Both algorithm models and algorithm interfaces need continuous updating and optimization, so version management is very important, and notebooks differ in how complete and easy to use their version-management features are. For example, open-source products such as Jupyter support version management via Git; IDP Studio supports Git and additionally has built-in version management that saves historical versions automatically; Colab does not currently support version management.
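The "automatically saves historical versions" idea can be modeled in a few lines: each save copies the notebook file into a history directory under a timestamped name. This is a hypothetical sketch of the pattern, not IDP Studio's actual mechanism:

```python
import shutil
import time
from pathlib import Path

def snapshot(notebook_path: str, history_dir: str = ".history") -> Path:
    """Save a timestamped copy of a notebook file.

    A toy model of automatic history-saving: call it on every save
    and old versions accumulate under the history directory.
    """
    src = Path(notebook_path)
    dest_dir = src.parent / history_dir
    dest_dir.mkdir(exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.stem}.{stamp}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 preserves file metadata
    return dest
```

A real implementation would also deduplicate identical saves and prune old snapshots; Git-based workflows get both for free at the cost of manual commits.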
- Language support: Python, SQL, and R are the languages most commonly used in machine learning and data science, with Python far in the lead: in Kaggle's 2021 survey of more than 25,000 data scientists, 84% used Python. Common notebooks all support Python well, but their depth of support for SQL and R, the second and third most common languages, varies. Data scientists should therefore consider whether a notebook natively supports the languages they use. For example, Jupyter supports Python, Julia, and R well, but using SQL requires installing and configuring plug-ins yourself; IDP Studio has deep native support for Python and SQL, but does not yet support other common languages; and for good support of Scala and other languages, Polynote is worth considering.
2) Efficiency improvement
Beyond basic functionality, data scientists care about whether a notebook can reduce non-core work and improve development efficiency.
- Code assistance: Code assistance saves developers significant time and effort. It mainly includes code completion, error prompts, quick fixes, and jump-to-definition. Open-source tools have rich ecosystems but generally rely on third-party plug-ins for code assistance. Commercially hosted products have these functions built in, though their focus and performance differ; code completion is the most common. IDP Studio offers the most comprehensive code assistance and a comparatively better experience in speed and performance, although its completion for some third-party libraries still needs improvement.
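The core of code completion is easy to demonstrate with Python's standard library: `rlcompleter` matches a typed prefix against the names in a namespace, which is the baseline that notebook completion engines build on (real assistants add type inference, documentation popups, and fuzzy ranking):

```python
import rlcompleter

# Complete against an explicit namespace, much as a notebook kernel
# completes against the user's session state.
namespace = {"dataframe": None, "data_loader": None, "model": None}
completer = rlcompleter.Completer(namespace)

# Collect every match for the prefix "data".
matches = []
i = 0
while (m := completer.complete("data", i)) is not None:
    matches.append(m)
    i += 1
print(sorted(matches))
```

Jupyter's actual completion goes through the kernel's `complete_request` message, so the same prefix-matching happens server-side with full knowledge of the live session.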
- Access to data sources: Data is the cornerstone of a data scientist's daily work, yet data sources are usually scattered across many places, which makes access a real challenge. Whether data can be reached easily matters a great deal, so data scientists should choose a notebook based on where their own data lives. Currently, the open-source Jupyter and Zeppelin require users to configure access themselves; Colab supports data access only through Google Drive; IDP Studio integrates with mainstream data sources, letting users connect with one click.
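When a notebook has no built-in connector for a data source, the fallback is to query it from code in a cell. A minimal sketch using Python's built-in `sqlite3` (the same pattern applies to other databases through their respective drivers):

```python
import sqlite3

# An in-memory database stands in for a remote data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 340.5), ("north", 80.0)],
)

# Query the data source directly from a notebook cell.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 200.0), ('south', 340.5)]
```

Integrated connectors remove exactly this boilerplate: credentials, drivers, and connection setup are handled by the platform instead of by every notebook author.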
- Environment management: Mature data scientists and algorithm teams have higher requirements for convenient environment setup and management. They want to configure environments quickly, and to build and manage consistent environments that can be shared between individuals and across teams. Notebooks differ in how well they support environment configuration and reuse; overall, Datalore, which is built around team collaboration, is slightly easier to use for environment management. Users can choose according to their own environment-management needs.
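At minimum, reproducible environments can be scripted: Python's standard `venv` module creates an isolated environment whose dependencies can then be pinned and shared. Hosted notebooks automate this; the sketch below is the manual equivalent:

```python
import sys
import venv
from pathlib import Path

def create_env(path: str) -> Path:
    """Create an isolated virtual environment and return its interpreter path.

    Pass with_pip=True to venv.create to also bootstrap pip, so pinned
    dependencies (e.g. from a requirements.txt) can be installed into it.
    """
    venv.create(path, with_pip=False)
    # The interpreter lives under bin/ (or Scripts/ on Windows).
    subdir = "Scripts" if sys.platform == "win32" else "bin"
    return Path(path) / subdir / "python"
```

Sharing the environment then reduces to sharing the pinned dependency list; each collaborator recreates an identical environment from it, which is the consistency property teams are after.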
3) Accelerate collaboration
- Collaborative analysis across teams: Algorithms and business analysis are increasingly intertwined. Algorithm developers want to share results with business staff as interactive visual reports, enabling efficient collaborative analysis between algorithm and business teams. Data scientists with strong cross-team collaboration needs should look at notebooks such as Datalore and IDP Studio, which launched this year and emphasize team collaboration in their positioning.
- Collaborative programming: Beyond cross-team collaboration, notebook sharing, real-time collaborative editing, and commenting have become increasingly prominent needs for data scientists; for now, demand for these features appears stronger overseas. Common notebooks all support collaborative programming to some degree, but they differ in real-time capability and ease of use.
4) Cost
Cost is usually an important factor for data scientists and algorithm engineers, but for notebooks we consider it less important than performance and ease of use, because even commercial notebooks typically offer individual users a free basic version.
We are excited to see notebooks grow in popularity and become a bridge between algorithm teams and business teams. Their application in industry continues to deepen, giving data scientists strong support for algorithm development, experimentation, and exploration.
For more technical content, follow the official account: Baihai IDP