Help wanted: computing the tag overlap between two groups. Which algorithm should I learn?


This topic was created 2149 days ago; the information in it may have since evolved or changed.

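I'd like to ask everyone for advice. I have only recently started programming and am learning Python, and there is an idea I would like to work on:
I have many groups of tags, each group different from the others, and I want to compute how similar any two groups are and find the pair with the highest overlap. Which algorithm should I study for this? Is there a Python solution?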

Python

Algorithms

Overlap

Learning

18 replies

lithiumii

2019-12-03 00:34:20 +08:00

I don't know algorithms, but as a blind guess: PCA (principal component analysis)?

TaihongZhang

2019-12-03 00:40:04 +08:00

@lithiumii OK, I'll go take a look.

how2code

2019-12-03 00:45:41 +08:00

Whoever suggested PCA should be dragged out for 251...

The simplest approach is probably keyword TF-IDF + cosine similarity.
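For readers who want to see what TF-IDF + cosine similarity looks like on tag groups, here is a minimal standard-library sketch. The group names, the sample tags, and the helper names `tfidf_vectors` and `cosine` are all made up for illustration, and the smoothed IDF formula is just one common variant, not necessarily what any particular library computes.

```python
import math
from collections import Counter

def tfidf_vectors(groups):
    """Map each tag group to a {tag: weight} dict using a smoothed TF-IDF."""
    n = len(groups)
    # Document frequency: in how many groups does each tag appear?
    df = Counter(tag for tags in groups.values() for tag in set(tags))
    vectors = {}
    for name, tags in groups.items():
        tf = Counter(tags)
        # Assumed smoothed IDF: log((1 + n) / (1 + df)) + 1, always positive
        vectors[name] = {
            t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {tag: weight} vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical sample data
groups = {
    "g1": ["python", "algorithm", "learning"],
    "g2": ["python", "algorithm", "web"],
    "g3": ["python", "cooking", "travel"],
}
vecs = tfidf_vectors(groups)
print("g1 vs g2:", cosine(vecs["g1"], vecs["g2"]))
print("g1 vs g3:", cosine(vecs["g1"], vecs["g3"]))
```

Because "python" appears in every group it gets the lowest weight, so sharing it counts for less than sharing a rarer tag such as "algorithm".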

ZRS

2019-12-03 00:46:39 +08:00

Just treat each label as its own dimension and compute the cosine similarity.
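A rough sketch of that idea, assuming each group is a set of distinct tags: if every tag is one dimension with value 0 or 1, the dot product is simply the number of shared tags and each vector's length is the square root of its tag count. The function name `binary_cosine` and the sample tags are made up for illustration.

```python
import math

def binary_cosine(a, b):
    """Cosine similarity of two tag sets viewed as 0/1 vectors."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    # Dot product = number of shared tags; each norm = sqrt(number of tags)
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical sample data
x = {"python", "algorithm", "learning", "overlap"}
y = {"python", "algorithm", "learning", "overlap", "web"}
print("cosine(x, y) =", binary_cosine(x, y))
```

With this formulation there is no need to build explicit vectors, which keeps things cheap when the tag vocabulary is large.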

how2code

2019-12-03 00:48:25 +08:00

@how2code First result on Google: https://stackoverflow.com/questions/12118720/python-tf-idf-cosine-to-find-document-similarity

wangyzj

2019-12-03 01:02:27 +08:00

@lithiumii Haha, that's what I was thinking too.

klesh

2019-12-03 01:03:30 +08:00

See whether Jaccard similarity or the overlap coefficient is enough for your needs?

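a = {'foo', 'bar', 'hello', 'world'}
b = {'foo', 'bar', 'hello', 'world', 'test'}
c = a.intersection(b)  # tags shared by both groups
d = a.union(b)  # all distinct tags across both groups
# Jaccard similarity: |a ∩ b| / |a ∪ b|
print('js(a, b)=', float(len(c))/float(len(d)))
# Overlap coefficient: |a ∩ b| / min(|a|, |b|)
print('oc(a, b)=', float(len(c))/float(min(len(a), len(b))))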
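
The original question asks for the pair with the highest overlap among many groups. One possible way to extend the Jaccard measure above is to score every pair and keep the best one. This is only a sketch: the names `jaccard`, `most_similar_pair`, and the sample groups are hypothetical, and comparing all pairs grows quadratically with the number of groups, so it is only suited to modest amounts of data.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar_pair(groups):
    """Return (score, name1, name2) for the pair with the highest Jaccard score."""
    return max(
        (jaccard(groups[x], groups[y]), x, y)
        for x, y in combinations(sorted(groups), 2)
    )

# Hypothetical sample data
groups = {
    "g1": {"python", "algorithm", "learning"},
    "g2": {"python", "algorithm", "web"},
    "g3": {"python", "cooking", "travel"},
}
best = most_similar_pair(groups)
print("best pair:", best)
```

Swapping `jaccard` for the overlap coefficient or a cosine measure only requires changing the scoring function passed through `most_similar_pair`.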