TensorFlow Study Notes (11): A Guide to Data Manipulation

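Introduction

Building a solid machine learning project with TensorFlow takes several kinds of coding skills:

Engineering: how to read data, design and run the computation graph, save and restore variables, save summary statistics, share variables, and deploy in a distributed setting
Data manipulation: how to turn raw data step by step into the data the model needs, which may involve Tensor transformations, string processing, JSON processing, and so on
Model theory: linear regression, logistic regression, softmax regression, support vector machines, decision trees, random forests, GBDT, CNN, RNN
Numerical computation theory: the potential numerical problems in computing cross entropy (why tf.nn.softmax_cross_entropy_with_logits should be used), gradient descent, the Hessian matrix and eigenvectors, Newton's method, and the Adam optimizer

Earlier articles in this series have already covered TensorFlow engineering, and how it ties in with model theory, in some detail. This article focuses on data manipulation and walks through some of the more important TensorFlow APIs for it, to help you implement your own business logic.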

Tensor Transformation

Concatenation

TensorFlow provides two kinds of concatenation:

tf.concat(values, axis, name='concat'): joins tensors along an existing axis
tf.stack(values, axis=0, name='stack'): joins tensors along a newly created axis

t1 = [[1, 2, 3], [4, 5, 6]]
t2 = [[7, 8, 9], [10, 11, 12]]
tf.concat([t1, t2], 0) ==> [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
tf.concat([t1, t2], 1) ==> [[1, 2, 3, 7, 8, 9], [4, 5, 6, 10, 11, 12]]
tf.stack([t1, t2], 0)  ==> [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
tf.stack([t1, t2], 1)  ==> [[[1, 2, 3], [7, 8, 9]], [[4, 5, 6], [10, 11, 12]]]
tf.stack([t1, t2], 2)  ==> [[[1, 7], [2, 8], [3, 9]], [[4, 10], [5, 11], [6, 12]]]

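The results above are hard to read at a glance. Looking at the shapes makes them much easier to follow (the asterisk marks the newly created axis):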

t1 = [[1, 2, 3], [4, 5, 6]]
t2 = [[7, 8, 9], [10, 11, 12]]
tf.concat([t1, t2], 0)  # [2,3] + [2,3] ==> [4, 3]
tf.concat([t1, t2], 1)  # [2,3] + [2,3] ==> [2, 6]
tf.stack([t1, t2], 0)   # [2,3] + [2,3] ==> [2*,2,3]
tf.stack([t1, t2], 1)   # [2,3] + [2,3] ==> [2,2*,3]
tf.stack([t1, t2], 2)   # [2,3] + [2,3] ==> [2,3,2*]
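
The shape rules above can be sketched in plain Python. This is only an illustration of the shape arithmetic, not how TensorFlow implements it; the helper names concat_shape and stack_shape are hypothetical and are not part of the TensorFlow API.

```python
def concat_shape(shapes, axis):
    """Shape obtained by joining tensors along an existing axis."""
    result = list(shapes[0])
    for shape in shapes[1:]:
        # All inputs must have the same rank ...
        assert len(shape) == len(result)
        # ... and must agree on every axis except the one being joined.
        assert all(a == b for i, (a, b) in enumerate(zip(result, shape)) if i != axis)
    # The joined axis grows to the sum of the input sizes on that axis.
    result[axis] = sum(shape[axis] for shape in shapes)
    return result


def stack_shape(shapes, axis):
    """Shape obtained by stacking identically shaped tensors along a new axis."""
    assert all(list(shape) == list(shapes[0]) for shape in shapes)
    result = list(shapes[0])
    # The new axis has one entry per input tensor.
    result.insert(axis, len(shapes))
    return result


print(concat_shape([[2, 3], [2, 3]], 0))  # [4, 3]
print(concat_shape([[2, 3], [2, 3]], 1))  # [2, 6]
print(stack_shape([[2, 3], [2, 3]], 0))   # [2, 2, 3]
print(stack_shape([[2, 3], [2, 3]], 1))   # [2, 2, 3]
print(stack_shape([[2, 3], [2, 3]], 2))   # [2, 3, 2]
```

In short, concat keeps the rank and enlarges one axis, while stack raises the rank by one.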

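Extraction

tf.slice(input_, begin, size, name=None): extracts a contiguous sub-region, given a start index and a size for each axis
tf.gather(params, indices, validate_indices=None, name=None): extracts the entries of axis=0 at the given set of indices, which suits non-contiguous selections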

input = [[[1, 1, 1], [2, 2, 2]],
         [[3, 3, 3], [4, 4, 4]],
         [[5, 5, 5], [6, 6, 6]]]
tf.slice(input, [1, 0, 0], [1, 1, 3]) ==> [[[3, 3, 3]]]
tf.slice(input, [1, 0, 0], [1, 2, 3]) ==> [[[3, 3, 3],
                                            [4, 4, 4]]]
tf.slice(input, [1, 0, 0], [2, 1, 3]) ==> [[[3, 3, 3]],
                                           [[5, 5, 5]]]
                                           
tf.gather(input, [0, 2]) ==> [[[1, 1, 1], [2, 2, 2]],
                              [[5, 5, 5], [6, 6, 6]]]

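Suppose we want to extract [[[3, 3, 3]]] from input. This output sits at index 1 on axis=0 of input, index 0 on axis=1, and indices 0-2 on axis=2, so begin=[1,0,0] and size=[1,1,3].

Suppose we want to extract [[[3, 3, 3], [4, 4, 4]]] from input. This output sits at index 1 on axis=0 of input, indices 0-1 on axis=1, and indices 0-2 on axis=2, so begin=[1,0,0] and size=[1,2,3].

Suppose we want to extract [[[3, 3, 3]], [[5, 5, 5]]] from input. This output sits at indices 1-2 on axis=0 of input, index 0 on axis=1, and indices 0-2 on axis=2, so begin=[1,0,0] and size=[2,1,3].

Suppose we want to extract [[[1, 1, 1], [2, 2, 2]], [[5, 5, 5], [6, 6, 6]]] from input. This output sits at indices [0, 2] on axis=0 of input, which are not contiguous, so tf.gather is the right tool.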
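
The begin/size reasoning above can be mimicked on nested Python lists. The sketch below is a plain-Python illustration under simplifying assumptions (every size entry is non-negative, and gathering happens on axis 0 only); slice_nested and gather_nested are hypothetical names, not TensorFlow functions.

```python
def slice_nested(data, begin, size):
    """Take size[d] consecutive entries starting at begin[d] on every axis d."""
    if not begin:
        # No axes left to slice: data is a single element.
        return data
    start, count = begin[0], size[0]
    return [slice_nested(item, begin[1:], size[1:])
            for item in data[start:start + count]]


def gather_nested(data, indices):
    """Pick arbitrary, possibly non-contiguous, entries along axis 0."""
    return [data[i] for i in indices]


data = [[[1, 1, 1], [2, 2, 2]],
        [[3, 3, 3], [4, 4, 4]],
        [[5, 5, 5], [6, 6, 6]]]

print(slice_nested(data, [1, 0, 0], [1, 1, 3]))  # [[[3, 3, 3]]]
print(slice_nested(data, [1, 0, 0], [2, 1, 3]))  # [[[3, 3, 3]], [[5, 5, 5]]]
print(gather_nested(data, [0, 2]))
```

Slicing always returns one contiguous block per axis, which is why picking rows 0 and 2 without row 1 needs the gather form.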

Type Conversion

tf.string_to_number(string_tensor, out_type=None, name=None): converts strings to tf.float32 (the default) or tf.int32
tf.to_double(x, name='ToDouble'): converts to tf.float64
tf.to_float(x, name='ToFloat'): converts to tf.float32
tf.to_int32(x, name='ToInt32'): converts to tf.int32
tf.to_int64(x, name='ToInt64'): converts to tf.int64
tf.cast(x, dtype, name=None): converts to the type given by dtype

Shape Conversion

tf.reshape(tensor, shape, name=None): reshapes to the new shape; if one dimension is set to -1, its size is inferred from the total number of elements
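
How a -1 dimension gets inferred can be shown with a few lines of arithmetic. The following is a plain-Python sketch of that rule, not TensorFlow's own code; infer_shape is a hypothetical helper name.

```python
def infer_shape(num_elements, shape):
    """Replace a single -1 in shape with the size that keeps the element count."""
    if shape.count(-1) > 1:
        raise ValueError("at most one dimension may be -1")
    # Product of all explicitly given dimensions.
    known = 1
    for dim in shape:
        if dim != -1:
            known *= dim
    if -1 in shape:
        if known == 0 or num_elements % known != 0:
            raise ValueError("cannot infer the missing dimension")
        missing = num_elements // known
        return [missing if dim == -1 else dim for dim in shape]
    if known != num_elements:
        raise ValueError("shape does not match the number of elements")
    return list(shape)


# A [2, 6] tensor holds 12 elements.
print(infer_shape(12, [3, -1]))     # [3, 4]
print(infer_shape(12, [-1]))        # [12]
print(infer_shape(12, [2, -1, 3]))  # [2, 2, 3]
```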

SparseTensor

TensorFlow represents a sparse tensor with three dense tensors: indices, values, and dense_shape.

Suppose we have the following dense tensor:

[[1, 0, 0, 0],
 [0, 0, 2, 0],
 [0, 0, 0, 0]]

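The three dense tensors that represent this data as a SparseTensor are:

indices: [[0, 0], [1, 2]]
values: [1, 2]
dense_shape: [3, 4]

A sparse tensor can be converted to a dense tensor in either of two ways. tf.sparse_to_dense takes the indices, shape, and values as separate arguments, while tf.sparse_tensor_to_dense takes a SparseTensor object: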

tf.sparse_to_dense(sparse_indices, output_shape, sparse_values, default_value=0, validate_indices=True, name=None)
tf.sparse_tensor_to_dense(sp_input, default_value=0, validate_indices=True, name=None)
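
What the conversion does can be sketched in plain Python for the two-dimensional case. This is an illustration of the idea only, assuming a rank-2 tensor and valid indices; to_dense_2d is a hypothetical name and not a TensorFlow function.

```python
def to_dense_2d(indices, values, dense_shape, default_value=0):
    """Build a dense 2-D list from the three components of a sparse tensor."""
    rows, cols = dense_shape
    # Start with every cell set to the default value.
    dense = [[default_value] * cols for _ in range(rows)]
    # Write each stored value at its (row, column) position.
    for (row, col), value in zip(indices, values):
        dense[row][col] = value
    return dense


dense = to_dense_2d(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4])
for row in dense:
    print(row)
# [1, 0, 0, 0]
# [0, 0, 2, 0]
# [0, 0, 0, 0]
```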

String Operations

Splitting

tf.string_split(source, delimiter=' ')

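source is a one-dimensional array of strings. Each string is split on delimiter into multiple elements, and the result is returned as a SparseTensor.

For example, given two strings where source[0] is "hello world" and source[1] is "a b c", the output is:

st.indices: [[0, 0], [0, 1], [1, 0], [1, 1], [1, 2]]
st.values: ['hello', 'world', 'a', 'b', 'c']
st.dense_shape: [2, 3]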
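
The reason the result is sparse is that the rows can hold different numbers of tokens. The plain-Python sketch below reproduces the three components for the example above; it is not TensorFlow code, split_to_sparse is a hypothetical name, and dropping empty tokens is an assumption made to keep the sketch simple.

```python
def split_to_sparse(source, delimiter=" "):
    """Split each string and describe the ragged result as a sparse tensor."""
    indices, values = [], []
    max_tokens = 0
    for row, text in enumerate(source):
        # Assumption: empty tokens (e.g. from repeated delimiters) are dropped.
        tokens = [token for token in text.split(delimiter) if token]
        for col, token in enumerate(tokens):
            indices.append([row, col])
            values.append(token)
        max_tokens = max(max_tokens, len(tokens))
    # One row per input string; as many columns as the longest row needs.
    dense_shape = [len(source), max_tokens]
    return indices, values, dense_shape


indices, values, dense_shape = split_to_sparse(["hello world", "a b c"])
print(indices)      # [[0, 0], [0, 1], [1, 0], [1, 1], [1, 2]]
print(values)       # ['hello', 'world', 'a', 'b', 'c']
print(dense_shape)  # [2, 3]
```

Position [0, 2] has no entry because the first string has only two tokens.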

Joining

tf.string_join(inputs, separator=None, name=None) is straightforward to use:

tf.string_join(["hello", "world"], separator=" ") ==> "hello world"

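Custom Ops

tf.py_func(func, inp, Tout, stateful=True, name=None) turns an arbitrary Python function func into a TensorFlow op.

func must take numpy arrays as input and may accept several of them; it also returns numpy arrays and may return several. inp supplies the input values, and Tout specifies the data type(s) of the output.

First, an example that parses JSON: the input is an array of JSON strings and the output is a feature matrix.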

import tensorflow as tf
import numpy as np
import json

# Two sample records, each a JSON object serialized as a string
json_str_1 = '''
{"name": "shuiping.chen",
"score": 95,
"department": "industrial engineering",
"rank": 2
}
'''
json_str_2 = '''
{"name": "zhuibing.dan",
"score": 87,
"department": "production engineering",
"rank": 4
}
'''

# Pack the JSON strings into a numpy array to feed to the op
input_array = np.array([json_str_1, json_str_2])

def parse_json(json_str_array):
    # Decode every JSON string into a dict
    fea_dict_array = [json.loads(item) for item in json_str_array]
    ret_feature = []
    for fea_dict in fea_dict_array:
        # Keep only the numeric fields as features
        feature = [fea_dict["score"], fea_dict["rank"]]
        ret_feature.append(feature)
    # The dtype must match the Tout passed to tf.py_func
    return np.array(ret_feature, dtype=np.float32)

# Wrap the Python function as an op that outputs a tf.float32 tensor
parse_json_op = tf.py_func(parse_json, [input_array], tf.float32)
sess = tf.Session()
print(sess.run(parse_json_op))

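Next, an example with multiple inputs and multiple outputs: it takes two numpy arrays and returns three arrays, namely their sum, their difference, and their matrix product.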

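# Reuses tf, np and sess from the previous example
array1 = np.array([[1, 2], [3, 4]], dtype=np.float32)
array2 = np.array([[5, 6], [7, 8]], dtype=np.float32)

def add_minus_dot(array1, array2):
    # Element-wise sum, element-wise difference, and matrix product
    return array1 + array2, array1 - array2, np.dot(array1, array2)

# Tout is a list because the function returns three arrays
add_minus_dot_op = tf.py_func(add_minus_dot, [array1, array2], [tf.float32, tf.float32, tf.float32])
print(sess.run(add_minus_dot_op))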

Author: 丹追兵
