Computing TF-IDF in SQL

Rant

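Work recently required me to write TF-IDF directly in SQL. That gave me a fresh understanding of TF-IDF, and a fresh dislike of SQL. In R or Python, TF-IDF is easy: call a package, or write it by hand in a few minutes. In SQL it is slower to write and rather verbose, though that may just be because I have not written much of it. This article covers TF-IDF in two parts:

the theory
the SQL implementation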

Intuitively, TF-IDF feels to me like what data mining competitions call a "rule": simple and effective.

TF-IDF theory

TF-IDF has two parts: TF and IDF.

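1. TF (term frequency):

$$ TF = \text{number of times the term appears in a document} $$

or, normalized:

$$ TF = \frac{\text{number of times the term appears in a document}}{\text{total number of terms in the document}} $$

or, with a different normalization:

$$ TF = \frac{\text{number of times the term appears in a document}}{\text{count of the most frequent term in the document}} $$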
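To make the three definitions above concrete, here is a minimal Python sketch that computes all of them for a single document. The function name `term_frequencies` and the sample token list are made up for illustration.

```python
from collections import Counter

def term_frequencies(tokens):
    """Return {term: (raw count, count / total terms, count / max count)}."""
    counts = Counter(tokens)
    total_terms = len(tokens)
    max_count = max(counts.values())
    return {
        term: (n, n / total_terms, n / max_count)
        for term, n in counts.items()
    }

# Hypothetical document: one user's tags, already tokenized.
doc = ["返现", "优惠券", "返现", "礼", "返现", "优惠券"]
tf = term_frequencies(doc)
print(tf["返现"])  # (3, 0.5, 1.0)
```

The SQL later in this article uses the third variant, dividing by the count of the user's most frequent tag.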

2. IDF (inverse document frequency):

$$ IDF = \log\left(\frac{\text{total number of documents}}{\text{number of documents containing the term} + 1}\right) $$
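A short Python sketch of the formula above shows what the +1 in the denominator does. The function name `idf` and the 10-document corpus are hypothetical. The natural logarithm is used here because it matches the idf values in this article's result tables.

```python
import math

def idf(total_docs, docs_with_term):
    """Smoothed IDF: ln(total documents / (documents containing the term + 1))."""
    return math.log(total_docs / (docs_with_term + 1))

# Hypothetical corpus of 10 documents.
for docs_with_term in (1, 5, 9, 10):
    print(docs_with_term, round(idf(10, docs_with_term), 4))
```

With this smoothing, a term found in 9 of the 10 documents gets an IDF of exactly 0, and a term found in all 10 gets a slightly negative one. So very common terms are pushed to zero or below rather than to a small positive value.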

3. TF-IDF:

$$ TFIDF = TF \times IDF $$

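What this means in practice: if a term is rare overall but appears many times in a given article, it probably reflects what that article is about, which makes it exactly the keyword we are looking for.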

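In practice

Take some text belonging to a user, or some categorical features, and tokenize it (categorical features can be treated as already tokenized).

All tokens of one user form one document.
The total number of documents is the number of users, since each user contributes one document.
The number of documents containing a term comes from grouping by uid, tag and then counting rows per tag (tag is a token after tokenization, or a category).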
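The mapping above can be sketched in plain Python before any SQL is written. This is a minimal illustration, not the author's implementation. The `docs` dictionary, its uids, and the tag lists are hypothetical, and TF uses the max-count normalization.

```python
import math
from collections import Counter

# Hypothetical input: uid -> all tags of that user (one "document" per user).
docs = {
    "u1": ["返现", "返现", "闪住"],
    "u2": ["闪住", "礼"],
    "u3": ["闪住"],
    "u4": ["礼", "亲子酒店"],
}

total_doc = len(docs)  # total number of documents = number of users

# Number of documents (users) that contain each tag at least once.
docs_with_tag = Counter(tag for tags in docs.values() for tag in set(tags))

tfidf = {}
for uid, tags in docs.items():
    counts = Counter(tags)
    max_count = max(counts.values())
    for tag, n in counts.items():
        tf = n / max_count  # normalized by the user's most frequent tag
        idf = math.log(total_doc / (docs_with_tag[tag] + 1))
        tfidf[(uid, tag)] = tf * idf

print(round(tfidf[("u1", "返现")], 4), tfidf[("u1", "闪住")])
```

For user u1, the tag that only u1 has scores highest. The tag shared by three of the four users scores 0, because its IDF is ln(4 / 4).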

Suppose the data looks like this, stored in table table1 of database dw:

 uid     hotelid    tags
125554    428365    美食预订,返现,优惠券,礼,休闲度假,亲子酒店
125554    456909    休闲度假,商务出行,健身室,免费WiFi
125554    662503    优惠券,亲子酒店,闪住,空气清新房
125554    7904714    返现,,健身室,免费WiFi
125554    3893971    返现,优惠券,礼,商务出行,空气清新房
125554    5445544    优惠券,礼,商务出行
125554    6395558    美食预订,返现,优惠券,礼,亲子酒店
125554    662485    美食预订,返现,优惠券,礼,休闲度假,亲子酒店
125554    429724    美食预订,返现,优惠券,礼,休闲度假,亲子酒店
125554    428827    返现,优惠券,礼,休闲度假
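Each row stores its tags as one comma-separated string, so the rows have to be flattened into one (uid, tag) pair per tag before anything can be counted. The Python sketch below shows the intended result of that flattening. The helper name `explode_tags` is made up. Like the SQL in this article, it treats both the ASCII comma and the full-width comma as delimiters and drops empty tags.

```python
def explode_tags(uid, tags):
    """Flatten one row into (uid, tag) pairs, dropping empty tags."""
    # Treat the full-width comma as an ordinary comma before splitting.
    normalized = tags.replace("，", ",")
    return [(uid, tag) for tag in normalized.split(",") if tag != ""]

# Row taken from the sample table; note the empty tag between the two commas.
rows = explode_tags("125554", "返现,,健身室,免费WiFi")
# rows == [("125554", "返现"), ("125554", "健身室"), ("125554", "免费WiFi")]
```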

In SQL, use split to break the string on a fixed delimiter, then explode to turn the resulting array into rows:

select uid,tag,total_doc
        ,count(*)over(partition by tag) as words_in_doc -- number of documents (uids) containing this tag
        ,n -- raw term frequency
        ,tf -- normalized term frequency
        ,log(total_doc / (count(*)over(partition by tag) +1)) idf -- IDF
        ,tf * log(total_doc / (count(*)over(partition by tag) +1)) as tfidf -- TF-IDF
from(
    select a.uid,a.tag
        ,count(*) n -- raw term frequency
        ,count(*)  /(max(count(*)) over(partition by uid)) as tf -- normalized by the user's most frequent tag
        ,dense_rank() over (order by uid) + dense_rank() over (order by uid  desc) - 1 as total_doc -- number of distinct uids = total documents
    from
    (select uid,tag
    from (select uid,tag1
          from dw.table1 LATERAL VIEW explode(split(tags,',')) a as tag1) b -- split on ASCII commas
    LATERAL VIEW explode(split(tag1,'，')) a as tag     -- then split on full-width commas
    where tag not in ('') -- drop empty tags
     )a
    group by uid,tag
    )a

Result:
uid            tag        total_doc    words_in_doc    n    tf    idf    tfidf
00399066    麻辣火锅    58    1    1    1.0    3.367295829986474    3.367295829986474
02932115    首住特惠    58    1    3    0.034482758620689655    3.367295829986474    0.11611364930987841
02769693    闪住    58    45    1    1.0    0.23180161405732438    0.23180161405732438
02689299    闪住    58    45    18    0.42857142857142855    0.23180161405732438    0.09934354888171044
02589732    闪住    58    45    2    0.2    0.23180161405732438    0.04636032281146488
00087790    闪住    58    45    3    0.6    0.23180161405732438    0.13908096843439463
03373990    闪住    58    45    14    0.4666666666666667    0.23180161405732438    0.10817408656008472
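As a sanity check, the idf and tfidf columns can be recomputed by hand from total_doc, words_in_doc and tf. The sketch below copies three rows from the table above and compares them against the formula. The values agree with the natural logarithm, and the tolerance allows for last-digit floating-point differences between engines.

```python
import math

total_doc = 58

# (words_in_doc, tf, idf from the table, tfidf from the table)
rows = [
    (1, 1.0, 3.367295829986474, 3.367295829986474),
    (45, 0.42857142857142855, 0.23180161405732438, 0.09934354888171044),
    (45, 0.2, 0.23180161405732438, 0.04636032281146488),
]

for words_in_doc, tf, idf_shown, tfidf_shown in rows:
    idf = math.log(total_doc / (words_in_doc + 1))
    assert math.isclose(idf, idf_shown, rel_tol=1e-9)
    assert math.isclose(tf * idf, tfidf_shown, rel_tol=1e-9)
```

A tag held by 45 of the 58 users ends up with an IDF of about 0.23, while a tag held by a single user gets about 3.37.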

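I was being lazy here and put everything into a single query. Splitting it into several intermediate tables makes it much easier to follow.

The same computation in several steps: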
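The least obvious expression in the single query is total_doc, which adds an ascending and a descending dense_rank and subtracts 1. For any row, the two ranks always add up to the number of distinct uids plus one, so the expression counts distinct uids without a separate subquery. The Python sketch below imitates this. The `dense_rank` helper and the sample uids are made up for illustration.

```python
def dense_rank(values, descending=False):
    """Dense rank of each value: 1 for the first distinct value in sort order."""
    ordered = sorted(set(values), reverse=descending)
    rank_of = {v: i + 1 for i, v in enumerate(ordered)}
    return [rank_of[v] for v in values]

# Hypothetical uid column after grouping by (uid, tag): 3 distinct uids.
uids = ["a", "a", "b", "c", "c", "c"]
asc = dense_rank(uids)
desc = dense_rank(uids, descending=True)
total_doc = [x + y - 1 for x, y in zip(asc, desc)]
print(total_doc)  # [3, 3, 3, 3, 3, 3]
```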

-- This whole block exists only to get the number of uids, used as the total document count
use dw;
set hive.mapred.mode=nonstrict;
drop table if exists dw.table2;
create table dw.table2 as 
select uid,tag,total_doc -- total_doc is the number of uids, used as the document count
from
    (
    select uid,tag
    from (select uid,tag1
          from dw.table1 LATERAL VIEW explode(split(tags,',')) a as tag1) b
    LATERAL VIEW explode(split(tag1,'，')) a as tag     
    where tag not in ('') 
    )a
left join (select count(distinct uid) as total_doc from dw.table1 )b
on 1=1;

Result:

uid            tag            total_doc
02932115    首住特惠      58
02932115    首住特惠      58
02932115    首住特惠      58
06275610    闪住            58
06100328    闪住            58
06100328    闪住            58

Once the tokens are in the shape shown above, TF-IDF can be computed directly.

Then compute everything:

select uid,tag,total_doc
        ,count(*)over(partition by tag) as words_in_doc -- number of documents (uids) containing this tag
        ,n -- raw term frequency
        ,tf -- normalized term frequency
        ,log(total_doc / (count(*)over(partition by tag) +1)) idf -- IDF (the subquery is grouped by uid, tag, so each uid appears once per tag and counting rows per tag gives the number of documents containing it)
        ,tf * log(total_doc / (count(*)over(partition by tag) +1)) as tfidf -- TF-IDF
from (
    select uid,tag,total_doc
    ,count(*) n
    ,count(*)  /(max(count(*)) over(partition by uid)) as tf -- normalized by the user's most frequent tag
    from dw.table2
    group by uid,tag,total_doc
    )a

Result: same structure as the single-query version

uid    tag    total_doc    words_in_doc    n    tf    idf    tfidf
00399066    麻辣火锅    58    1    1    1.0    3.367295829986474    3.367295829986474
02932115    首住特惠    58    1    3    0.034482758620689655    3.367295829986474    0.11611364930987841
02769693    闪住    58    45    1    1.0    0.23180161405732438    0.23180161405732438
06100328    闪住    58    45    5    0.25    0.23180161405732438    0.057950403514331096
00258565    闪住    58    45    3    0.42857142857142855    0.23180161405732438    0.09934354888171044
01001760    闪住    58    45    4    0.4444444444444444    0.23180161405732438    0.10302293958103305
03018508    闪住    58    45    15    0.9375    0.23180161405732438    0.2173140131787416

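*Summary: TF-IDF is a powerful rule. It is not limited to text: any categorical data that carries real meaning can use it, instead of just taking the mode of the category and stopping there.
Key functions: in SQL, split breaks the string on a fixed delimiter and explode turns the resulting array into rows.
*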

The theory section of this article is based on:
Ruan Yifeng's blog (阮一峰的网络日志)


aloneme
