A Data Science and Machine Learning Knowledge Map and Resource Collection for Programmers


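This article, A Data Science and Machine Learning Knowledge Map and Resource Collection for Programmers, is part of the Programmer's Hands-On Handbook for Data Science and Machine Learning. Much of its content comes from suggestions and material collected by hitvoice@github, to whom I am grateful.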

Data Science & Machine Learning Reference

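This article gathers all the resources I have used while learning data science. It focuses on introductory material and survey-style resources for each area and does not dig deep into the research frontier; readers who want more can follow my Programmer's Hands-On Handbook for Data Science and Machine Learning. The article starts with an introductory overview of data science and machine learning, then lists resources, books, and tutorials, then gives reference articles for each specific area, and closes with a set of practical tools. The diagram of my view of data science and machine learning follows; it is part of my series on programming worldview and methodology: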

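This article will be refined as my perspective and ability grow through my own study and practice. I am not a pure machine learning or data mining researcher; I approach the field mostly from an engineering angle, looking for the parts that can be combined with and applied in engineering work.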

Introduction & Overview

Introduction

Machine Learning

Deep Learning

Statistics

News: Industry News

Application: Real-World Use Cases of Data Mining, Machine Learning, and Deep Learning

Resources

Collections: Resource Roundups

Books

Video Courses

Blogs & Forums

Methodology

Data Processing

Machine Learning

Natural Language Processing

Deep Learning

Application

Recommender Systems

Crawlers & Search Engines

Crawlers

Search Engines

Toolkits

Language

Python

Java

Matlab

R

Cluster Computing

    • Mahout

    • MLlib

    Deep Learning Toolkits

    Data Visualization

    Books

    Video Courses

    Toolkits

    Data Sets

    Collections: Resource Roundups

    • awesome-public-datasets: An awesome list of high-quality open datasets in public domains (on-going).

    • Wikimedia Dumps: Bulk downloads of data from the Wikimedia wikis.

    • Reddit Datasets: The Reddit discussion board for datasets.

    • Militarized Interstate Disputes: Nearly 200 years of international threats, conflicts, etc. for modelling or prediction. Includes action taken, level of hostility, fatalities, and outcomes. Multiple datasets, e.g., 962KB, 179KB. http://www.correlatesofwar.or...

    Single-Source Databases

    Cross-Disciplinary Databases and Search Engines

    Text

    • 20 Newsgroups: The text from 20000 messages taken from 20 Usenet newsgroups for text analysis, classification, etc. 61.6MB

    • Amazon Reviews: Over 142 million product reviews for sentiment analysis, recommender systems, and more. 20GB

    • SMS Spam Collection: A collection of 5,574 SMS (text) messages, some spam, some normal, for spam filtering. 204KB. http://www.dt.fee.unicamp.br/...

    Social Networks

    Media: Audio, Video, and Images

    • Labeled Faces in the Wild: 13,000 named faces for facial recognition. Multiple training and test sets. 173MB total

    • Mushroom Identification: For hypothetically classifying mushrooms as edible or poisonous based on their characteristics. 3 files, 480KB

    • NORB 3D Object Recognition: Binocular images of 50 toy figurines for 3D object recognition from images. Multiple files, over 5GB total

    • One Million Songs: Audio features and metadata for a subset (10,000) of the one million popular songs dataset for recognition/classification. 1.8GB

    • Hate Speech Identification: A sampling of Twitter posts that have been judged based on whether they are offensive or contain hate speech, as a training set for text analysis. 2.66MB

    • Hidden Beauty of Flickr Pictures: 15,000 Flickr photo IDs that have received ratings based on aesthetics, for image analysis. 138KB, use Flickr API to get images

    Recognition

    • Human Activity Recognition with Smartphones: Sensor data for recognizing human activity such as walking and sitting. 25MB. https://www.kaggle.com/uciml/...

    Driving Data

    Domain: Domain-Specific Data

    Sports

    • Football Strategy: Thousands of scenarios to make the best coaching decisions. 876KB total

    • Horses for Courses: Horse-racing data for predicting race results. 19MB total

    • NBA & MLB Stats: Current and past season stats for teams and players for fantasy sports predictions.

    Medicine

    Aliens

    • UFO Reports: 80,000 historic reports for classification or regression. This dataset has been standardized from the source data at nuforc.org. 14.6MB total.

    Food

    • Wine Quality: Chemical properties of red and white wines (separately) and quality, for classification. 3 files, 343KB total.

    Finance

    Others

    Competition: Machine Learning Competitions

    Career

    Updated on 2016-11-13