Machine learning scratch

`numpy` notes

numpy array shape

>>> np.array([1,2,3,4]) # Rank 1 array
array([1, 2, 3, 4])

>>> np.array([[1],[2],[3],[4]]) # Rank 2 array (a matrix)
array([[1],
       [2],
       [3],
       [4]])
       
>>> np.arange(16).reshape((2,2,4)) # Rank 3 array
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],
       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]]])
>>> M = np.arange(16).reshape((2,2,4)) # you can also flatten first, then reshape
>>> M.size # number of elements
16
>>> M.shape # shape (size along each dimension)
(2, 2, 4)
>>> np.array([[1],[2],[3],[4]]).shape # Rank 2 array shape
(4, 1)

>>> np.transpose(M) # Rank 3 array transpose
array([[[ 0,  8],
        [ 4, 12]],
       [[ 1,  9],
        [ 5, 13]],
       [[ 2, 10],
        [ 6, 14]],
       [[ 3, 11],
        [ 7, 15]]])
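A short sketch of what the transpose above is doing: with no `axes` argument, `np.transpose` reverses the axis order, so a `(2, 2, 4)` array becomes `(4, 2, 2)` and the element at `M[i, j, k]` ends up at `T[k, j, i]`. The names `T` and `S` are only illustrative.

```python
import numpy as np

M = np.arange(16).reshape((2, 2, 4))
T = np.transpose(M)              # same as M.T: axis order reversed to (2, 1, 0)

print(T.shape)                   # (4, 2, 2)
# Each element moves from index (i, j, k) to (k, j, i).
print(M[1, 0, 3], T[3, 0, 1])    # 11 11

# An explicit permutation swaps only the axes you name.
S = np.transpose(M, axes=(1, 0, 2))
print(S.shape)                   # (2, 2, 4)
```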

numpy basics

>>> import numpy as np
>>> l = np.array([2,3,4,232,9])
>>> np.mean(l), np.median(l), sum(l)
(50.0, 4.0, 250)

>>> np.logspace(np.log10(10), np.log10(100), 11)
array([ 10.        ,  12.58925412,  15.84893192,  19.95262315,
        25.11886432,  31.6227766 ,  39.81071706,  50.11872336,
        63.09573445,  79.43282347, 100.        ])
>>> np.linspace(10,100,11)
array([ 10.,  19.,  28.,  37.,  46.,  55.,  64.,  73.,  82.,  91., 100.])
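To make the difference between the two calls concrete, here is a small check. `np.logspace` takes exponents and spaces the points evenly on a log scale (constant ratio between neighbours), while `np.linspace` spaces them evenly on a linear scale (constant difference). The variable names are illustrative.

```python
import numpy as np

log_pts = np.logspace(1, 2, 11)      # exponents 1..2, i.e. values 10..100
lin_pts = np.linspace(10, 100, 11)

# logspace(a, b, n) gives the same points as 10 ** linspace(a, b, n).
same = np.allclose(log_pts, 10 ** np.linspace(1, 2, 11))

ratios = log_pts[1:] / log_pts[:-1]  # constant ratio between neighbours
steps = np.diff(lin_pts)             # constant difference between neighbours
print(same, ratios[0], steps[0])
```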

Reading a CSV with numpy (pandas is more convenient)

data = np.genfromtxt(path_to_csv, dtype=float, delimiter=',', names=True)
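As a rough illustration of what `names=True` gives you: the header row becomes the field names of a structured array, so columns can be read by name. The CSV text below is made-up sample data, fed in through `io.StringIO` instead of a real file path.

```python
import io
import numpy as np

# Made-up sample data standing in for the file at path_to_csv.
csv_text = "role_id,ratio\n1,0.5\n2,0.9\n1,0.7\n"

data = np.genfromtxt(io.StringIO(csv_text), dtype=float, delimiter=',', names=True)

print(data.dtype.names)    # field names taken from the header row
print(data['ratio'])       # column access by header name
```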

np.where # find the indices where a condition holds
np.unique # remove duplicates
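A minimal sketch of the two helpers on a toy array (the array and variable names are made up for illustration):

```python
import numpy as np

a = np.array([3, 1, 3, 7, 1])

# One-argument form: a tuple of index arrays, one per dimension.
idx = np.where(a == 3)[0]

# Three-argument form: pick from the second or third argument elementwise.
clipped = np.where(a > 2, a, 0)

# Sorted unique values, optionally with how often each one occurs.
values, counts = np.unique(a, return_counts=True)

print(idx, clipped, values, counts)
```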

Converting to CSV with LibreOffice

/Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to csv *.xlsx --outdir ./csv/

Plotting

import pylab as pl
import numpy as np

data = np.random.normal(size=10000)
pl.hist(data, bins=np.logspace(np.log10(0.1),np.log10(1.0), 50))
pl.gca().set_xscale("log")
pl.show()
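One thing worth checking with log-spaced bins: the bin edges only cover 0.1 to 1.0, so every sample outside that range (including all negative values from the normal distribution) is left out of the histogram. The sketch below uses `np.histogram` to count without drawing anything; the fixed seed is an assumption added for repeatability.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed, assumed here for repeatability
data = rng.normal(size=10000)
edges = np.logspace(np.log10(0.1), np.log10(1.0), 50)   # 50 edges -> 49 bins

counts, _ = np.histogram(data, bins=edges)

# Samples that fall inside the outermost edges (the last bin is closed on the right).
inside = np.count_nonzero((data >= edges[0]) & (data <= edges[-1]))
print(len(counts), counts.sum(), inside, data.size)
```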

`pandas` notes

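pandas.concat # concatenate DataFrames or Series
df.loc[:, 'role_id'] # all rows, the 'role_id' column
df.loc[N] # the row whose index label is N
df.iloc[:, 1:3] # all rows, columns at positions 1 and 2 (end excluded)
df.iloc[:, 1:-1] # all rows, every column except the first and the last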
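The selectors above can be tried on a toy frame. The column names `p`, `q`, `r` and the values are invented for this sketch.

```python
import pandas as pd

a = pd.DataFrame({'role_id': [1, 2], 'p': [0.1, 0.2], 'q': [5, 6], 'r': [7, 8]})
b = pd.DataFrame({'role_id': [3], 'p': [0.3], 'q': [9], 'r': [10]})

df = pd.concat([a, b], ignore_index=True)   # stack rows, renumber the index 0..2

role_ids = df.loc[:, 'role_id'].tolist()    # label-based column selection
row_1_role = df.loc[1, 'role_id']           # row with index label 1
mid_cols = list(df.iloc[:, 1:3].columns)    # positions 1 and 2 only
inner_cols = list(df.iloc[:, 1:-1].columns) # drops the first and last column

print(role_ids, row_1_role, mid_cols, inner_cols)
```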

df['fields'].map(lambda x: len(x)).max()


# Max, over all rows, of the 1st, 3rd, 5th, ... items (x[0::2]) of each list in api_seqs.
px['api_seqs'].apply(lambda x: max(x[0::2])).max()
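A small worked example of the two expressions above, on invented data. `x[0::2]` keeps positions 0, 2, 4, ... of each list, so the second, fourth, ... entries never take part in the maximum.

```python
import pandas as pd

# Invented sample rows; only the column names come from the notes above.
px = pd.DataFrame({
    'fields': ['ab', 'abcd', 'a'],
    'api_seqs': [[5, 100, 2], [1, 200, 9, 300], [4]],
})

# Length of the longest value in 'fields'.
longest = px['fields'].map(len).max()

# Per row: max of items at positions 0, 2, 4, ...; then the max over all rows.
odd_max = px['api_seqs'].apply(lambda x: max(x[0::2])).max()

print(longest, odd_max)   # 4 9
```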

# Number of occurrences of each value
px.loc[:,'role_id'].value_counts()

# Reset the index
px.reset_index()
px.loc[px.ratio>0.8, 'role_id'].value_counts().reset_index(name='cnt').rename(columns={'index':'role_id'})
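The column that `reset_index` produces from the old index has, as far as I know, been named differently across pandas versions (`index` in 1.x, the original column name from 2.0 on), which is why the `rename` above may or may not be needed. A sketch that should not depend on that detail is to name the index explicitly with `rename_axis` first. The data is invented.

```python
import pandas as pd

px = pd.DataFrame({
    'role_id': [1, 2, 1, 3, 1, 2],
    'ratio':   [0.9, 0.85, 0.95, 0.5, 0.81, 0.9],
})

counts = (
    px.loc[px['ratio'] > 0.8, 'role_id']
      .value_counts()               # occurrences per role_id, most frequent first
      .rename_axis('role_id')       # name the index explicitly
      .reset_index(name='cnt')      # index -> 'role_id' column, counts -> 'cnt'
)
print(counts)
```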

# Multi-condition selection (a Boolean selector applied along the row axis).
# The selector can combine conditions with the Boolean operators `& | ~`.
px.loc[(px['cnn_banned']==1) & (px['crazy_click']==1)]
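A toy example of all three operators, on invented rows. The parentheses around each comparison matter: `&` and `|` bind more tightly than `==` in Python, so leaving them out changes what is compared.

```python
import pandas as pd

px = pd.DataFrame({
    'role_id': [1, 2, 3, 4],
    'cnn_banned': [1, 1, 0, 0],
    'crazy_click': [1, 0, 1, 0],
})

banned = px['cnn_banned'] == 1
clicky = px['crazy_click'] == 1

both = px.loc[banned & clicky]          # rows where both conditions hold
either = px.loc[banned | clicky]        # rows where at least one holds
neither = px.loc[~(banned | clicky)]    # rows where neither holds

print(both['role_id'].tolist(), either['role_id'].tolist(), neither['role_id'].tolist())
```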

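See the official documentation for pandas.DataFrame.loc.

In fact, px.loc accepts several kinds of arguments. That is what makes it powerful, and also a little harder to understand: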

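A single label.
A list or array of labels.
A slice of labels (unlike an ordinary Python slice, it is a closed interval that includes both the start and the stop label).
A Boolean array with the same length as the axis being selected.
A callable that takes a Series or DataFrame and returns one of the four valid arguments above.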
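Each of the five argument kinds can be tried on a toy frame. The frame, its labels and the variable names are invented for this sketch; `iloc` is included only to contrast the closed label slice with an ordinary positional slice.

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

single = df.loc['b']                     # one label -> the row as a Series
several = df.loc[['a', 'd']]             # list of labels
sliced = df.loc['b':'c']                 # label slice, both ends included
masked = df.loc[df['x'] > 25]            # Boolean array as long as the row axis
called = df.loc[lambda d: d['x'] > 25]   # callable returning one of the above

by_pos = df.iloc[1:2]                    # positional slice excludes the end

print(list(sliced.index), list(by_pos.index))   # ['b', 'c'] ['b']
```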

Indexing and selecting data is a very important concept; I will go through it in detail later.

秦川