How can I use a regex to detect and remove partially duplicated URLs?

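1. Below is a list of URLs from Toutiao (今日头条). Toutiao serves images through a CDN, so the same image ID returns the same content but appears under different CDN hostnames. How can I use a regex to detect this kind of duplicate and keep only one URL per ID? I have spent a long time on this without solving it.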

['http:\\/\\/p3.pstatp.com\\/origin\\/1b7b000317e8e6eae3e0', 'http:\\/\\/p3.pstatp.com\\/origin\\/1b7b000317e8e6eae3e0', 'http:\\/\\/pb9.pstatp.com\\/origin\\/1b7b000317e8e6eae3e0', 'http:\\/\\/pb1.pstatp.com\\/origin\\/1b7b000317e8e6eae3e0', 'http:\\/\\/p3.pstatp.com\\/origin\\/1b7800060aed2ccfa0cc","width":640,"url_list":[{"url":"http:\\/\\/p3.pstatp.com\\/origin\\/1b7800060aed2ccfa0cc"},{"url":"http:\\/\\/pb9.pstatp.com\\/origin\\/1b7800060aed2ccfa0cc"},{"url":"http:\\/\\/pb1.pstatp.com\\/origin\\/1b7800060aed2ccfa0cc"}],"uri":"origin\\/1b7800060aed2ccfa0cc","height":917},{"url":"http:\\/\\/p3.pstatp.com\\/origin\\/1b7d0003099985f45ee3', 'http:\\/\\/p3.pstatp.com\\/origin\\/1b7d0003099985f45ee3', 'http:\\/\\/pb9.pstatp.com\\/origin\\/1b7d0003099985f45ee3', 'http:\\/\\/pb1.pstatp.com\\/origin\\/1b7d0003099985f45ee3', 'http:\\/\\/p1.pstatp.com\\/origin\\/1b7c000309f203688954', 'http:\\/\\/p1.pstatp.com\\/origin\\/1b7c000309f203688954', 'http:\\/\\/pb3.pstatp.com\\/origin\\/1b7c000309f203688954', 'http:\\/\\/pb9.pstatp.com\\/origin\\/1b7c000309f203688954', 'http:\\/\\/p1.pstatp.com\\/origin\\/1b7800060af42554fb15', 'http:\\/\\/p1.pstatp.com\\/origin\\/1b7800060af42554fb15', 'http:\\/\\/pb3.pstatp.com\\/origin\\/1b7800060af42554fb15', 'http:\\/\\/pb9.pstatp.com\\/origin\\/1b7800060af42554fb15', 
'http:\\/\\/p3.pstatp.com\\/origin\\/1b7c000309fad41441ae","width":640,"url_list":[{"url":"http:\\/\\/p3.pstatp.com\\/origin\\/1b7c000309fad41441ae"},{"url":"http:\\/\\/pb9.pstatp.com\\/origin\\/1b7c000309fad41441ae"},{"url":"http:\\/\\/pb1.pstatp.com\\/origin\\/1b7c000309fad41441ae"}],"uri":"origin\\/1b7c000309fad41441ae","height":917},{"url":"http:\\/\\/p1.pstatp.com\\/origin\\/1b7d000309a67b996cfd","width":640,"url_list":[{"url":"http:\\/\\/p1.pstatp.com\\/origin\\/1b7d000309a67b996cfd"},{"url":"http:\\/\\/pb3.pstatp.com\\/origin\\/1b7d000309a67b996cfd"},{"url":"http:\\/\\/pb9.pstatp.com\\/origin\\/1b7d000309a67b996cfd"}],"uri":"origin\\/1b7d000309a67b996cfd","height":917},{"url":"http:\\/\\/p3.pstatp.com\\/origin\\/1b7c00030a00854feda6', 'http:\\/\\/p3.pstatp.com\\/origin\\/1b7c00030a00854feda6', 'http:\\/\\/pb9.pstatp.com\\/origin\\/1b7c00030a00854feda6', 'http:\\/\\/pb1.pstatp.com\\/origin\\/1b7c00030a00854feda6', 'http:\\/\\/p3.pstatp.com\\/origin\\/1b7d000309aa72ca8132', 'http:\\/\\/p3.pstatp.com\\/origin\\/1b7d000309aa72ca8132', 'http:\\/\\/pb9.pstatp.com\\/origin\\/1b7d000309aa72ca8132', 'http:\\/\\/pb1.pstatp.com\\/origin\\/1b7d000309aa72ca8132']
4 answers
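a = [r'http:\/\/p3.pstatp.com\/origin\/1b7b000317e8e6eae3e0', r'http:\/\/p9.pstatp.com\/origin\/1b7b000317e8e6eae3e0']
# Rewrite the CDN host to a single value so both URLs become identical, then dedupe with a set
a = set([i.replace("p3", "p9") for i in a])
print(a)

Output (Python 3):

{'http:\\/\\/p9.pstatp.com\\/origin\\/1b7b000317e8e6eae3e0'}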

The time complexity may be a bit high, but this can be optimized.
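
The replace-based idea above covers only one hostname pair. A more general variant, as a minimal sketch, is to pull the image ID out of each entry with a regex and keep the first URL seen per ID. The host pattern (`p3`, `pb9`, and so on), the hex ID format, and the names `URL_RE` and `dedupe_by_id` are assumptions inferred from the sample data in the question, not anything Toutiao documents.

```python
import re

# Matches one escaped URL and captures the image ID after "origin\/".
# Host and ID patterns are assumptions based on the question's sample data.
URL_RE = re.compile(r'http:\\/\\/pb?\d+\.pstatp\.com\\/origin\\/([0-9a-f]+)')

def dedupe_by_id(entries):
    """Keep the first URL seen for each image ID, preserving input order."""
    first_by_id = {}
    for entry in entries:
        # An entry may hold several URLs if an earlier extraction over-captured.
        for match in URL_RE.finditer(entry):
            first_by_id.setdefault(match.group(1), match.group(0))
    return list(first_by_id.values())

urls = [
    r'http:\/\/p3.pstatp.com\/origin\/1b7b000317e8e6eae3e0',
    r'http:\/\/pb9.pstatp.com\/origin\/1b7b000317e8e6eae3e0',
    r'http:\/\/p1.pstatp.com\/origin\/1b7c000309f203688954',
    r'http:\/\/pb3.pstatp.com\/origin\/1b7c000309f203688954',
]
result = dedupe_by_id(urls)
print(result)
```

Because the loop uses `finditer`, list entries that accidentally contain leftover JSON with several URLs are still reduced to one URL per ID. This is a single pass over the input.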

Use a set:

new_list = list(set(old_list))
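
Note that a plain `set` removes only exact duplicates and does not keep the original order. If order matters, one possible alternative (a sketch, with placeholder data in `old_list`) is `dict.fromkeys`, which preserves insertion order in Python 3.7+. URLs that differ only in CDN hostname would still need to be normalized first.

```python
# Placeholder data: the first two entries are exact duplicates.
old_list = [
    r'http:\/\/p3.pstatp.com\/origin\/1b7b000317e8e6eae3e0',
    r'http:\/\/p3.pstatp.com\/origin\/1b7b000317e8e6eae3e0',
    r'http:\/\/p1.pstatp.com\/origin\/1b7c000309f203688954',
]

# dict keys are unique and keep insertion order, so this drops exact
# duplicates while keeping the first occurrence of each URL in place.
new_list = list(dict.fromkeys(old_list))
print(new_list)
```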

Deduplicating in Python:

set(my_list)  # my_list is your list of URLs; avoid naming it `list`, which shadows the built-in