# How to Optimize Jupyter-Based Analysis and Data-Mining Test Projects

## Python syntax-level optimization

### A naming scheme worth trying

``[concrete_noun](_[action])(_[prepositional_phrase])_[data_structure]``
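
As a rough illustration of the pattern above, the sketch below uses plain Python containers. Every name in it is hypothetical and not taken from the original; it only shows how the optional action and prepositional-phrase parts might slot in between the noun and the data structure.

```python
# Hypothetical names following
# [concrete_noun](_[action])(_[prepositional_phrase])_[data_structure]

# noun + data structure
user_list = ["ann", "bob", "cat"]

# noun + action + data structure
user_sorted_list = sorted(user_list, reverse=True)

# noun + action + prepositional phrase + data structure
order_grouped_by_user_dict = {
    "ann": [12.5, 7.0],
    "bob": [3.0],
}

# noun + prepositional phrase + data structure
total_per_user_dict = {
    user: sum(amounts) for user, amounts in order_grouped_by_user_dict.items()
}
```

Reading a name right to left tells you the container type first and how it was produced second, which helps when many intermediate variables live in one notebook.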

## Jupyter-level optimization

### Linear execution

``````
<!-- Acceptable example -->

```cell 1
import pandas as pd
```

```cell 2
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})  # example data with columns 'a' and 'b'
```

```cell 3
def sum_ab(row):
    return row['a'] + row['b']
```

```cell 4
df.apply(sum_ab, axis=1)
```
``````
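
``````
<!-- Unacceptable example: it does not run when executed top to bottom -->

```cell 1
import pandas as pd
```

```cell 2
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})  # example data with columns 'a' and 'b'
```

```cell 3
df.apply(sum_ab, axis=1)  # NameError: sum_ab is not defined until cell 4
```

```cell 4
def sum_ab(row):
    return row['a'] + row['b']
```
``````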
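
One way to spot a notebook that was run out of order is to look at the execution counts saved in the `.ipynb` file. The sketch below assumes the nbformat 4 JSON layout (a top-level `cells` list whose code cells carry `execution_count`); the function name `is_linearly_executed` is made up for this example. It only checks the saved counters, so it is a rough heuristic, not proof that a fresh top-to-bottom run will succeed.

```python
import json


def is_linearly_executed(notebook_json: str) -> bool:
    """Return True if executed code cells have strictly increasing execution counts."""
    notebook = json.loads(notebook_json)
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    ]
    # Ignore code cells that were never run (execution_count is null).
    executed = [count for count in counts if count is not None]
    return all(a < b for a, b in zip(executed, executed[1:]))


# Minimal hand-written notebook: the second cell was run before the first.
out_of_order = json.dumps({
    "cells": [
        {"cell_type": "code", "execution_count": 2, "source": "df.apply(sum_ab, axis=1)"},
        {"cell_type": "code", "execution_count": 1, "source": "def sum_ab(row): ..."},
    ]
})
print(is_linearly_executed(out_of_order))  # False
```

The most reliable check remains restarting the kernel and running all cells from the top.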

### Load modules and read data at the top

``````
```cell 1
!pip install scikit-learn
```

```cell 2
import pandas as pd
from sklearn import metrics
import sys

# Custom modules
sys.path.append('../')
from my_module import my_func
```

```cell 3
# Read the data right after the imports
df = pd.read_csv('../data/csv/train.csv')
```
``````

### One cell, one function

``````
```cell 1
df_1 = df.sum(axis=1)
```

```cell 2
df_2 = df_1.fillna(0)
```

```cell 3
ggplot(df_2, aes(x='x', y='y')) + geom_point()
```
``````

``````
```cell 1
df_1 = df.sum(axis=1)
df_2 = df_1.fillna(0)

# Plot
ggplot(df_2, aes(x='x', y='y')) + geom_point()
```
``````

### Separate data (including intermediate results) from computation

``````
```cell 1
df_1 = df.sum(axis=1)
```

```cell 2
df_2 = df_1.fillna(0)
df_2.to_pickle('../temp/df_2.pickle')
```

```cell 3
ggplot(df_2, aes(x='x', y='y')) + geom_point()
```
``````

``````
```cell 1
##~~~~ Intermediate processing ~~~~##
# df_1 = df.sum(axis=1)
# df_2 = df_1.fillna(0)
# df_2.to_pickle('../temp/df_2.pickle')
##~~~~ Intermediate processing ~~~~##

# Plot
ggplot(df_2, aes(x='x', y='y')) + geom_point()
```
``````
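
A possible way to keep an intermediate result on disk and reuse it on later runs is a small load-or-compute helper. This is a minimal sketch using the standard-library `pickle` module; the helper name `load_or_compute` and its parameters are assumptions for this example, not something from the original.

```python
import pickle
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Load a pickled result from cache_path, or compute it and save it there."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Reuse the stored intermediate result instead of recomputing it.
        with cache_path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    # Create the cache directory (for example ../temp) if it is missing.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

In a notebook this might be called as `df_2 = load_or_compute('../temp/df_2.pickle', lambda: df.sum(axis=1).fillna(0))`. Note that the helper never notices when the upstream data changes, so the cache file has to be deleted by hand to force a recomputation.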

### Move abstractions and reusable code outside the notebook

``````
```cell 1
def func1(x):
    """Return x + 1."""
    return x + 1

def func2(x):
    temp = list(map(func1, x))
    temp.sort()
    return temp[0] + temp[-1]

df.a.apply(func2)
```
``````

``````
|__ my_module
    |__ __init__.py
    |__ a.py
|__ notebook
    |__ test.ipynb
``````
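
To make the layout above concrete, here is a sketch of what `my_module/a.py` could contain once the helper functions are moved out of the notebook. The file contents are an assumption for illustration; `func2` is written with `sorted` but is meant to behave like the in-notebook version (add 1 to every item, then add the smallest and largest results).

```python
# my_module/a.py (hypothetical contents)


def func1(x):
    """Return x + 1."""
    return x + 1


def func2(values):
    """Add 1 to every item, then return the smallest plus the largest result."""
    shifted = sorted(func1(v) for v in values)
    return shifted[0] + shifted[-1]


if __name__ == "__main__":
    print(func2([3, 1, 2]))  # 6
```

For `from my_module import *` to expose these names, `my_module/__init__.py` would also need to re-export them, for example with `from .a import func1, func2`.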

``````
# import cell
import pandas as pd
import sys

sys.path.append('../')
from my_module import *
``````

``````
%load_ext autoreload
# Reload edited modules automatically before each cell runs
%autoreload 2

import pandas as pd
import sys

sys.path.append('../')
from my_module import *
``````

## Project-level optimization

### One notebook solves one problem

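``````
- 0. introduction and contents.ipynb
- eda.1 EDA question 1.ipynb
- eda.2 EDA question 2.ipynb
- eda. ...
- 1.1 approach 1 + feature engineering.ipynb
- 1.2 approach 1 training and results.ipynb
- 2.1 approach 2 + feature engineering.ipynb
- 2.2 approach 2 training and results.ipynb
- 3.1 approach 3 + feature engineering.ipynb
- 3.2 approach 3 training and results.ipynb
- ...
- final.1 conclusions.ipynb
``````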

### Organize files as needed

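``````
-- Project root
   |__ SQL: SQL queries the project needs
   |__ notebook: where the notebooks live
       |__ 0. introduction and contents.ipynb
       |__ eda.1 EDA question 1.ipynb
       |__ eda.2 EDA question 2.ipynb
       |__ eda. ...
       |__ 1.1 approach 1 + feature engineering.ipynb
       |__ 1.2 approach 1 training and results.ipynb
       |__ 2.1 approach 2 + feature engineering.ipynb
       |__ 2.2 approach 2 training and results.ipynb
       |__ 3.1 approach 3 + feature engineering.ipynb
       |__ 3.2 approach 3 training and results.ipynb
       |__ ...
       |__ final.1 conclusions.ipynb
   |__ src: files referenced when writing reports or documentation
   |__ data: raw data
       |__ csv: csv files
           |__ train.csv
           |__ ...
       |__ ...
   |__ temp: intermediate data
   |__ output: combined analysis results needed for the final report
       |__ *.pptx
       |__ *.pdf
       |__ src
           |__ example.png
           |__ ...
   |__ temp_module: self-written modules that the notebooks import
``````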
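
If you set up this layout often, a short script can create the empty directories for you. The sketch below is only one reading of the tree above: the nesting (a top-level `src`, `data/csv`, and `output/src`) is an assumption, and the function name `create_project_skeleton` is made up for this example.

```python
from pathlib import Path

# Directory names taken from the layout above; the nesting is an assumption.
PROJECT_DIRS = [
    "SQL",
    "notebook",
    "src",
    "data/csv",
    "temp",
    "output/src",
    "temp_module",
]


def create_project_skeleton(root):
    """Create the project directories under root and return the created paths."""
    root = Path(root)
    created = []
    for relative in PROJECT_DIRS:
        path = root / relative
        # parents=True also creates intermediate folders such as data/ and output/.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    # An empty __init__.py makes temp_module importable as a package.
    (root / "temp_module" / "__init__.py").touch()
    return created
```

Calling `create_project_skeleton("my_project")` is safe to repeat, since existing directories are left untouched.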
