
For data science and AI researchers, reproducing research results is critical. Reproduction is not only a way to study an algorithm; it also helps researchers discover new research directions.

IDP provides a self-developed interactive notebook programming environment that is well suited to data analysis and code presentation. Its main features include intelligent development assistance, a self-adapting environment, one-click execution, one-click connection to data sources, integration with other platform tools, and visual workflow management.

This article uses the reproduction of research results as its running example to explain how to write clean, well-organized notebooks in IDP.

The development cycle for reproducing research results can usually be divided into three stages: organizing and recording the experiment, organizing code and ideas, and preparing to share. The three stages are covered in order below.

I. ORGANIZATION AND RECORDS

"Organization and records" means recording the complete experimental process. The experiment itself should follow the explicit instructions in the paper or, for your own experiments, the details of your experimental design.

Drawing on the experience of other researchers, notebook writing at this stage should follow these points:

1. Storytelling of the content

An IDP notebook consists of four parts: code, text, SQL, and data visualization. Make good use of the expressive power of text, and even multimedia, to build a computational narrative that introduces the topic, lays out the steps, and interprets the results. To make the story engaging, estimate who will read it: a non-technical colleague in your lab, an analyst in another lab, a journal reader, or the general public? The answer determines what kind of story you tell and how detailed the language needs to be.


Figure 1. Composition of IDP Notebook

2. Focus more on process than results

IDP notebooks are highly interactive, and the variable manager in the right-hand column makes it quick and easy to try and compare different methods or parameters. That convenience has a cost: during interactive exploration we often forget to record the process, usually because we are too lazy to create a new cell. Document every exploration, even those that lead to dead ends. These records help you remember what you did and why. Do not wait until the analysis is complete and the results are reliable before adding explanatory text. By then you may have forgotten why a particular parameter value was chosen, where a piece of code was copied from, or what was interesting about an intermediate result. If you do not have time to document fully what you are doing or thinking in the moment, leave short descriptive notes as reminders and expand them when you can pause.


Figure 2. IDP Workspace Variable Manager
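One lightweight way to follow the advice above is to append every trial, dead ends included, to a running log inside the notebook. The sketch below is only an illustration: `record_trial`, `experiment_log`, and the parameter values and scores are hypothetical placeholders, not IDP features or real results.

```python
# Hypothetical in-notebook trial log; not an IDP API.
experiment_log = []

def record_trial(params, result, note=""):
    """Append one trial (successful or not) to the in-memory log and return it."""
    entry = {
        "trial": len(experiment_log) + 1,
        "params": dict(params),   # copy so later edits to params do not alter the log
        "result": result,
        "note": note,
    }
    experiment_log.append(entry)
    return entry

# Placeholder values for illustration only.
record_trial({"learning_rate": 0.1}, result=0.62, note="unstable; dead end")
record_trial({"learning_rate": 0.01}, result=0.81, note="stable; keep this value")

for entry in experiment_log:
    print(entry["trial"], entry["params"], entry["result"], entry["note"])
```

Keeping the note next to the parameters means the reason for a choice is saved at the moment it is made, rather than reconstructed later.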

3. Partition the content

IDP's self-developed notebook is an interactive environment, so it is easy to write and run single-line cells. Over the course of an experiment this can leave many short code cells that are hard for others to follow. Group them into blocks that each correspond to a meaningful analysis step, and use the outline panel in the right-hand column to jump to any block at any time.

More concretely, modularize the code by cell and label each cell with markdown above it. Think of each cell as a paragraph: it holds one function or accomplishes one task (for example, creating a plot). Avoid long cells (anything over 100 lines or one page is too long). Put low-level documentation in code comments. Divide your notebook into sections with descriptive markdown headers, which makes it easy to navigate and to add a table of contents. Split a long notebook into a series of notebooks and keep a top-level index notebook with links to each one.


Figure 3. IDP workspace outline

II. ORGANIZING CODE AND IDEAS

At this point, the experiment is essentially complete. To make the notebook cleaner and better structured, organize it along the following lines:

1. The importance of following the rules

Software engineering relies heavily on object-oriented ideas, and the same basic rule applies here: avoid repetition. In a notebook it is very easy to copy a cell, adjust a few lines, paste the result into a new cell or another notebook, and run it again, so some duplicated code is inevitable during an experiment. Wrap such code in a function. If the logic is more complex, use a class; that is a judgment call.

It's important to note that if you're going to reuse code in other projects or notebooks, consider converting it into a module, package, or library, and follow good software development practices (such as unit testing).
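As a minimal sketch of the refactoring described above, suppose the same standardization lines had been copied into several cells. The function name `standardize` and the sample numbers are assumptions chosen for illustration.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores for a sequence of numbers, using the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize constant data")
    return [(v - mu) / sigma for v in values]

# One call per dataset replaces copies of the same few lines in separate cells.
train_scores = standardize([2.0, 4.0, 6.0])
test_scores = standardize([10.0, 20.0, 30.0])

# A tiny check in the spirit of unit testing: z-scores should average to zero.
assert abs(mean(train_scores)) < 1e-9
```

Once the logic lives in one function, a fix or a parameter change is made in one place instead of in every copied cell.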

2. Configuration of the experimental environment

Documenting dependencies is very important, because it determines whether you can regenerate the analysis in the future. For computational experiments, manage dependencies explicitly from the beginning with a tool such as conda's environment.yml or pip's requirements.txt, and list all relevant dependencies together with their versions. Always work in the environment created from that list, so that no undocumented dependency slips in.

In a notebook, you can install the listed dependencies with a single command such as !pip install -r requirements.txt, and switch environments with one click at the bottom of the workspace, to complete the experimental configuration. If you are familiar with Linux, you can also use the terminal in the left-hand column to configure the experimental environment directly.
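To record the versions actually in use, one option is to query the running environment from inside the notebook. The sketch below uses the standard-library `importlib.metadata` module; the helper name `pinned_requirements` and the package list are hypothetical, and the output is only a starting point for a requirements.txt file.

```python
from importlib import metadata

def pinned_requirements(packages):
    """Return 'name==version' lines for installed packages, with a comment line for missing ones."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"# {name}: not installed")
    return lines

# Hypothetical list of the packages this notebook imports.
for line in pinned_requirements(["numpy", "pandas"]):
    print(line)
```

The printed lines can be pasted into requirements.txt, or compared against it to spot an undocumented dependency.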

3. The magic of version control

The interactive nature of a notebook makes it easy to change or delete important content by accident, so the version history on the right side of the IDP workspace header provides version control. In addition, notebooks contain code, and code inevitably contains bugs. Being able to determine when a bug was introduced and when it was fixed, and to analyze its likely impact, is a critical capability in scientific computing.


Figure 4. IDP version management

4. Use of Workflow

Once the experimental work has stabilized, consider building a pipeline. Notebooks that record preliminary exploration are rarely reused widely, but once a stable analysis method has been identified, a well-designed notebook can be extended to other tasks through a pipeline, so that the analysis is easily repeated with different input data and parameters. With this in mind, design your notebook from the start to allow for later reuse. Put key variable declarations (especially those that change when a new analysis is performed) at the top of the notebook rather than burying them in the middle. Perform preparatory steps, such as data cleaning, directly in the notebook and avoid manual intervention as much as possible.

A pipeline also reduces the risk that other people get different results because they misread or skipped steps in a README file.


Figure 5. IDP Visual Workflow Management
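The layout recommended above, with parameters at the top and cleaning done in code, might look like the following sketch. It is plain Python rather than IDP's workflow feature, and every name and value in it (`RAW_VALUES`, `MISSING_MARKERS`, `THRESHOLD`, `run_pipeline`) is a hypothetical stand-in.

```python
# --- Parameters: the values most likely to change between analyses (all hypothetical) ---
RAW_VALUES = ["3.5", "4.1", "", "n/a", "5.0"]   # stand-in for data read from a file
MISSING_MARKERS = {"", "n/a"}
THRESHOLD = 4.0

def clean(raw_values, missing_markers):
    """Drop entries marked as missing and convert the rest to floats."""
    return [float(v) for v in raw_values if v.strip().lower() not in missing_markers]

def analyze(values, threshold):
    """Count how many cleaned values exceed the threshold."""
    return sum(1 for v in values if v > threshold)

def run_pipeline(raw_values, missing_markers, threshold):
    """Run cleaning and analysis end to end with no manual steps in between."""
    return analyze(clean(raw_values, missing_markers), threshold)

print(run_pipeline(RAW_VALUES, MISSING_MARKERS, THRESHOLD))
```

Rerunning the analysis on new data then means editing only the parameter block at the top.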

III. SHARING

Sharing can seem like a distant concern, and thinking about it while you are still writing a notebook may feel like wasted effort. In fact, sharing and communication are the cornerstones of further experimental research. The following points need attention.

1. Readability

Shared content must be readable, and experimental data in particular requires detailed and clear explanation. Ideally, you would share the entire dataset with the notebook. In practice, many datasets are too large or too sensitive to share this way. In these cases, consider breaking a large, complex dataset into multiple layers, so that even if the original data is too large to ship with the published notebook, or is restricted by privacy or other access concerns, a smaller processed layer can still be shared and reproducibility is not affected.
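One possible reading of the layered approach above is to publish a small, reproducible subset together with a fingerprint of the raw data. The sketch below assumes that reading; `fingerprint`, `shareable_sample`, and the generated rows are hypothetical, and sampling alone does not make sensitive data safe to publish.

```python
import hashlib
import random

def fingerprint(rows):
    """SHA-256 digest of the raw rows, so readers can check they hold the same data."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()

def shareable_sample(rows, size, seed=0):
    """Reproducible random subset small enough to ship with the notebook."""
    rng = random.Random(seed)   # fixed seed so everyone draws the same subset
    return rng.sample(rows, min(size, len(rows)))

raw_rows = [(i, i * i) for i in range(1000)]   # stand-in for a large raw dataset
sample = shareable_sample(raw_rows, size=10, seed=42)
print(len(sample), fingerprint(raw_rows)[:12])
```

Readers who later obtain the full dataset can recompute the fingerprint to confirm it matches the data behind the published results.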

2. Convenience

Once all of the points above are done and the notebook is clean and usable, one question remains: how can other people access, run, and explore it? Team collaboration is an IDP feature that supports one-click sharing of notebooks. At present, one-click sharing of data visualization cells is available.
We can also store the notebook in a public code repository, such as GitHub, with a clear README file.


To explore more IDP features, try IDP now for free.


Baihai_IDP

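IDP is an AI training and inference cloud platform. It provides enterprises and institutions with an integrated platform solution covering compute resources, model building, and model application, helping them build their own AI and large models quickly and efficiently.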