A Data Science and Machine Learning Knowledge Map and Resource Collection for Programmers

DataScience & Machine Learning Reference
Introduction & Overview
Resources
Methodology
Application
- Recommender System
CrawlerSE: Crawlers and Search Engines
- Crawler
- Search Engine
Toolkits
- Language
  - Python
  - Java
  - Matlab
  - R
- ClusterComputing
Data Visual: Data Visualization
Data Sets
Others
- Competition: Machine Learning Competitions
- Career

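This article, A Data Science and Machine Learning Knowledge Map and Resource Collection for Programmers, is part of Data Science and Machine Learning in Action: A Handbook for Programmers. Much of its content comes from suggestions and material collected by hitvoice@github, with thanks.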

DataScience & Machine Learning Reference

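This article collects all the resources the author gathered while learning data science. It focuses on introductory and survey-style material for each area and does not dig deeply into frontier research; readers who want more can follow the author's Data Science and Machine Learning in Action: A Handbook for Programmers. The article starts with an introductory overview of data science and machine learning, then lists resources, books, and tutorials, then gives reference articles for specific areas, and closes with a set of practical tools.

[Figure: the author's map of the data science and machine learning landscape, part of the author's series on programming worldview and methodology]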

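This article will be refined as the author's perspective and skills grow through study and practice. The author is not a pure machine learning or data mining researcher, and approaches the field mainly from an engineering angle, looking for the parts that can be combined with and applied in engineering work.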

Introduction & Overview

Introduction

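An Introduction to Data Science and Machine Learning

[The differences among data analytics, data analysis, data mining, data science, machine learning, and big data](https://www.quora.com/What-is-the-difference-between-Data-Analytics-Data-Analysis-Data-Mining-Data-Science-Machine-Learning-and-Big-Data-1)

How to explain machine learning and data mining to people outside computer science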

Machine Learning

Visual Intro To Machine Learning: an illustrated walkthrough of using decision trees to classify homes in New York versus San Francisco
A Gentle Guide to Machine Learning
Machine Learning basics for a newbie
What is machine learning, and how does it work?

Deep Learning

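A fun tour of machine learning concepts, from multivariate fitting and neural networks to deep learning, for anyone interested
[Translation] An Intuitive Explanation of Neural Networks (http://www.hackcv.com/index.p...): a very accessible explanation of convolutional neural networks.
Deep-Learning-Papers-Reading-Roadmap: a paper reading roadmap for anyone interested in deep learning
A Programmer's Introductory Guide to Deep Learning: from Fei Lianghong's talk at QCon Shanghai 2016 (Global Software Development Conference).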

Statistics

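Zhihu: What are some real-world examples of "data lying"?

News: Industry and News

The deep learning framework war is under way: who will win the title of "industry standard for deep learning"?

Application: Real-World Use Cases of Data Mining, Machine Learning, and Deep Learning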

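Resources

Collections: Resource Roundups

An Incomplete Roundup of Introductory Machine Learning Resources: a themed collection from 机器学习日报 (Machine Learning Daily).
Top-down learning path: Machine Learning for Software Engineers: a path for software engineers to advance in machine learning

Books

Video Courses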

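University of Illinois at Urbana-Champaign: Text Mining and Analytics
National Taiwan University: Machine Learning Techniques
Stanford Machine Learning course
CS224d: Deep Learning for Natural Language Processing
Unsupervised Feature Learning and Deep Learning: a tutorial series from Stanford on unsupervised feature learning and deep learning
Xiaoxiang (小象) machine learning video course
Xiaoxiang (小象) deep learning video course

Blogs & Forums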

Methodology

Data Process: Data Processing

Machine Learning

Natural Language Processing

Deep Learning

Major paper: 14 design patterns for deep convolutional neural networks

Application

Recommender System

CrawlerSE: Crawlers and Search Engines

Crawler

Search Engine

Toolkits

Language

Python

Jupyter: interactive programming and data presentation
data-science-ipython-notebooks: a collection of data science code examples built on IPython
The Open Source Data Science Masters

Java

Matlab

R

ClusterComputing

Mahout

MLlib

DeepLearning: Deep Learning Toolkits

Evaluation of Deep Learning Toolkits
A code-level analysis of deep learning system programming models: TensorFlow vs. CNTK
tensorflow-playground: Play with neural networks!
dl-docker: packages commonly used deep learning tools into a single Docker image
deep-learning-models: Keras code and weights files for popular deep learning models.
Top Deep Learning Projects

Data Visual: Data Visualization

Books

Video Courses

John C. Hart Coursera

Toolkits

Data Sets

Collections: Resource Roundups

awesome-public-datasets: An awesome list of high-quality open datasets in public domains (on-going).
Wikimedia Dumps: bulk downloads of Wiki data
Reddit Datasets: the Reddit board for discussing datasets
Militarized Interstate Disputes: Nearly 200 years of international threats, conflicts, etc. for modelling or prediction. Includes action taken, level of hostility, fatalities, and outcomes. Multiple datasets, e.g., 962KB, 179KB. http://www.correlatesofwar.or...

Single Databases

Cross-Disciplinary Databases and Search Engines

Text

20 Newsgroups: The text from 20000 messages taken from 20 Usenet newsgroups for text analysis, classification, etc. 61.6MB
Amazon Reviews: Over 142 million product reviews for sentiment analysis, recommender systems, and more. 20GB
SMS Spam Collection: A collection of 5,574 SMS (text) messages, some spam, some normal, for spam filtering. 204KB. http://www.dt.fee.unicamp.br/...

Social Network

http://enigma.io
http://www.ufindthem.com/
http://NetworkRepository.com (a machine learning data repository with interactive visual analytics)
http://MLvis.com
Yahoo Instant Messenger Friends Connectivity Graph: Connections between Yahoo users who communicate with each other using Yahoo messenger, can be used to identify key social contacts/influencers. Add dataset to cart to access. 28MB total.

Media: Audio, Video, and Images

Labeled Faces in the Wild: 13,000 named faces for facial recognition. Multiple training and test sets. 173MB total
Mushroom Identification: For hypothetically classifying mushrooms as edible or poisonous based on their characteristics. 3 files, 480KB
NORB 3D Object Recognition: Binocular images of 50 toy figurines for 3D object recognition from images. Multiple files, over 5GB total
One Million Songs: Audio features and metadata for a subset (10,000) of the one million popular songs dataset for recognition/classification. 1.8GB
Hate Speech Identification: A sampling of Twitter posts that have been judged based on whether they are offensive or contain hate speech, as a training set for text analysis. 2.66MB
Hidden Beauty of Flickr Pictures: 15,000 Flickr photo IDs that have received ratings based on aesthetics, for image analysis. 138KB, use Flickr API to get images

Recognition

Driving Data

223GB of historical self-driving data open-sourced by Udacity

Domain: Domain-Specific Data

Sports

Football Strategy: Thousands of scenarios to make the best coaching decisions. 876KB total
Horses for Courses: Horse-racing data for predicting race results. 19MB total
NBA & MLB Stats: Current and past season stats for teams and players for fantasy sports predictions.

Medicines

National Survey on Drug Use and Health: Predict drug use based on health survey questions. 2GB total
Prostate Cancer: Tumor and nontumor samples, used to recognize prostate cancer. 4.8MB total
Record of Heart Sound: Recordings of normal and abnormal heartbeats, used to recognize heart murmur, etc. 47.7MB total

Alien

UFO Reports: 80,000 historic reports for classification or regression. This dataset has been standardized from the source data at nuforc.org. 14.6MB total.

Foods

Wine Quality: Chemical properties of red and white wines (separately) and quality, for classification. 3 files, 343KB total.

Finance

Others

Competition: Machine Learning Competitions

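Alibaba Tianchi hands-on competitions for newcomers
Kaggle: the official getting-started competitions, a good way to learn the basics
Kaggle Tutorial: a complete tutorial built on the hotel recommendation competition
Driven Data
Innocentive
Crowdanalytix
Tunedit
DataFountain (DF): the professional Chinese data competition platform designated by CCF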

王下邀月熊_Chevalier
