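This article, "A Data Science and Machine Learning Knowledge System and Resource Collection for Programmers", is part of the author's "A Programmer's Hands-On Handbook for Data Science and Machine Learning". Much of the content comes from suggestions and material collected by hitvoice@github, with many thanks.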
DataScience & Machine Learning Reference
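This article gathers all the resources the author has collected while learning data science. It focuses on introductory material and survey-style resources for each area and does not dig deep into cutting-edge work; readers who want more can follow the author's "A Programmer's Hands-On Handbook for Data Science and Machine Learning". The article starts with an introductory overview of data science and machine learning, then provides a series of resources, books, and tutorials, then lists reference articles for specific areas, and finally introduces a set of practical tools. The author's diagram of the data science and machine learning landscape belongs to the author's series on programming worldview and methodology.
This article will keep being refined as the author's perspective and ability grow through study and practice. The author is not a pure machine learning or data mining researcher and approaches the field mainly from an engineering angle, looking for the parts that can be combined with and applied in engineering work.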
Introduction & Overview
Introduction
- [Differences and similarities among data analytics, data analysis, data mining, data science, machine learning, and big data](https://www.quora.com/What-is-the-difference-between-Data-Analytics-Data-Analysis-Data-Mining-Data-Science-Machine-Learning-and-Big-Data-1)
Machine Learning
Visual Intro To Machine Learning: an illustrated walkthrough of using a decision tree to classify homes in New York and San Francisco
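To make the decision-tree idea concrete, here is a minimal sketch of a single split (a "decision stump") chosen by Gini impurity. It is not taken from the linked tutorial: the data rows, the feature choice (elevation and price per square foot), and the helper names `gini`, `best_split`, and `fit_stump` are all assumptions made up for illustration.

```python
# Toy rows invented for illustration: (elevation_m, price_per_sqft_usd, city)
HOMES = [
    (5, 1200, "New York"),
    (10, 900, "New York"),
    (8, 1500, "New York"),
    (12, 700, "New York"),
    (60, 950, "San Francisco"),
    (90, 800, "San Francisco"),
    (45, 1100, "San Francisco"),
    (120, 1000, "San Francisco"),
]


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the list is pure)."""
    if not labels:
        return 0.0
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))


def best_split(rows, feature_index):
    """Return (threshold, weighted_impurity) for the best split on one feature."""
    values = sorted({row[feature_index] for row in rows})
    best = (None, float("inf"))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [row[-1] for row in rows if row[feature_index] <= threshold]
        right = [row[-1] for row in rows if row[feature_index] > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


def fit_stump(rows, feature_index):
    """Fit a one-split tree: returns (threshold, label_if_low, label_if_high)."""
    threshold, _ = best_split(rows, feature_index)
    left = [row[-1] for row in rows if row[feature_index] <= threshold]
    right = [row[-1] for row in rows if row[feature_index] > threshold]
    majority = lambda labels: max(set(labels), key=labels.count)
    return threshold, majority(left), majority(right)


def predict(stump, value):
    """Classify one feature value with a fitted stump."""
    threshold, low_label, high_label = stump
    return low_label if value <= threshold else high_label


if __name__ == "__main__":
    stump = fit_stump(HOMES, feature_index=0)  # split on elevation
    print(stump)
    print(predict(stump, 75))
```

On this toy data, elevation separates the two cities cleanly while price alone does not. A full decision tree repeats the same search recursively on each side of a split and across all features.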
Deep Learning
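[[Translation] An Intuitive Explanation of Neural Networks](http://www.hackcv.com/index.p...): a very accessible explanation of convolutional neural networks.
Deep-Learning-Papers-Reading-Roadmap: a paper reading roadmap for anyone interested in deep learning
A Programmer's Introductory Guide to Deep Learning: from Fei Lianghong's talk at the 2016 QCon Global Software Development Conference (Shanghai).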
Statistics
News: Industry News
Application: real-world use cases of data mining, machine learning, and deep learning
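Resources
Collections: resource roundup posts
An Incomplete Roundup of Introductory Machine Learning Resources: a themed collection from Machine Learning Daily (机器学习日报).
Top-down learning path: Machine Learning for Software Engineers: a path into machine learning for software engineers
Books
Video Courses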
University of Illinois at Urbana-Champaign: Text Mining and Analytics
Unsupervised Feature Learning and Deep Learning: a tutorial series from Stanford on unsupervised feature learning and deep learning
Blogs & Forums
Methodology
Data Processing
Machine Learning
Natural Language Processing
Deep Learning
Application
Recommender Systems
CrawlerSE: Crawlers and Search Engines
Crawler
Search Engine
Toolkits
Language
Python
Jupyter: interactive programming and data presentation
data-science-ipython-notebooks: a collection of IPython-based data science code demonstrations
Java
Matlab
R
ClusterComputing
Mahout
MLlib
DeepLearning: deep learning toolkits
tensorflow-playground: Play with neural networks!
dl-docker: bundles commonly used deep learning tools into a single Docker image
deep-learning-models: Keras code and weights files for popular deep learning models.
Data Visualization
Books
Video Courses
Toolkits
Data Sets
Collections: resource roundup posts
awesome-public-datasets: An awesome list of high-quality open datasets in public domains (on-going).
Wikimedia Dumps: packaged downloads of Wiki data
Reddit Datasets: the Reddit board for discussing datasets
Militarized Interstate Disputes: Nearly 200 years of international threats, conflicts, etc. for modelling or prediction. Includes action taken, level of hostility, fatalities, and outcomes. Multiple datasets, e.g., 962KB, 179KB. http://www.correlatesofwar.or...
Single Databases
Cross-Disciplinary Databases and Search Engines
Open Data Inception (2,500+ open data portals)
Text
20 Newsgroups: The text from 20,000 messages taken from 20 Usenet newsgroups for text analysis, classification, etc. 61.6MB
Amazon Reviews: Over 142 million product reviews for sentiment analysis, recommender systems, and more. 20GB
SMS Spam Collection: A collection of 5,574 SMS (text) messages, some spam, some normal, for spam filtering. 204KB. http://www.dt.fee.unicamp.br/...
Social Network
http://NetworkRepository.com (a machine learning data repository with interactive visual analytics)
Yahoo Instant Messenger Friends Connectivity Graph: Connections between Yahoo users who communicate with each other using Yahoo messenger, can be used to identify key social contacts/influencers. Add dataset to cart to access. 28MB total.
Media: audio, video, and images
Labeled Faces in the Wild: 13,000 named faces for facial recognition. Multiple training and test sets. 173MB total.
Mushroom Identification: For hypothetically classifying mushrooms as edible or poisonous based on their characteristics. 3 files, 480KB
NORB 3D Object Recognition: Binocular images of 50 toy figurines for 3D object recognition from image. Multiple files, over 5GB total
One Million Songs: Audio features and metadata for a subset (10,000) of the one million popular songs dataset for recognition/classification. 1.8GB
Hate Speech Identification: A sampling of Twitter posts that have been judged based on whether they are offensive or contain hate speech, as a training set for text analysis. 2.66MB
Hidden Beauty of Flickr Pictures: 15,000 Flickr photo IDs that have received ratings based on aesthetics, for image analysis. 138KB, use Flickr API to get images
Recognition
Human Activity Recognition with Smartphones: Sensor data for recognizing the human activity (walking, sitting, etc.). 25MB. https://www.kaggle.com/uciml/...
Driving Data
Domain: domain-specific data
Sports
Football Strategy: Thousands of scenarios to make the best coaching decisions. 876KB total.
Horses for Courses: Horse-racing data for predicting race results. 19MB total.
NBA & MLB Stats: Current and past season stats for teams and players for fantasy sports predictions.
Medicine
National Survey on Drug Use and Health: Predict drug use based on health survey questions. 2GB total.
Prostate Cancer: Tumor and nontumor samples, used to recognize prostate cancer. 4.8MB total.
Record of Heart Sound: Recordings of normal and abnormal heartbeats, used to recognize heart murmur, etc. 47.7MB total.
Alien
UFO Reports: 80,000 historic reports for classification or regression. This dataset has been standardized from the source data at nuforc.org. 14.6MB total.
Food
Wine Quality: Chemical properties of red and white wines (separately) and quality, for classification. 3 files, 343KB total.
Finance
Others
Competition: machine learning competitions
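Kaggle: the official getting-started competitions, a good way to begin learning
Kaggle Tutorial: a complete tutorial based on a hotel recommendation competition
DataFountain: DF, a professional Chinese data competition platform designated by the CCF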