Stanford Fall 2021: Practical Machine Learning

1 Data Acquisition

Open-source datasets
• MNIST: digits written by employees of the US Census Bureau
• ImageNet: millions of images from image search engines
• AudioSet: YouTube sound clips for sound classification
• LibriSpeech: 1000 hours of English speech from audiobooks
• Kinetics: YouTube video clips for human action classification
• KITTI: traffic scenarios recorded by cameras and other sensors
• Amazon Review: customer reviews from Amazon online shopping
• SQuAD: question-answer pairs derived from Wikipedia

Dataset websites:
• Papers with Code Datasets: academic datasets with leaderboards
• Kaggle Datasets: ML datasets uploaded by data scientists
• Google Dataset Search: search for datasets on the Web
• Datasets bundled with toolkits: TensorFlow, Hugging Face
• Various conference/company ML competitions
• Open Data on AWS: 100+ large-scale raw datasets
• Data lakes in your own organization

1.1 Web Scraping

Crawl individual pages

Extract data

Crawl Images
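
The "extract data" and "crawl images" steps above can be sketched as follows, assuming the page HTML has already been downloaded into a string. This is only an illustration using Python's standard `html.parser`; the class name `LinkAndImageParser`, the sample HTML, and the `example.com` URLs are all made up. A real crawler would also have to fetch pages, respect robots.txt, and throttle its requests.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndImageParser(HTMLParser):
    """Collects absolute link and image URLs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []   # pages to crawl next
        self.images = []  # image files to download

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


# Hypothetical page content standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <a href="/listing/1">House 1</a>
  <a href="/listing/2">House 2</a>
  <img src="/img/house1.jpg" alt="front view">
</body></html>
"""

parser = LinkAndImageParser("https://example.com")
parser.feed(SAMPLE_HTML)
print(parser.links)
print(parser.images)
```

The collected links would feed the next round of page crawling, and the image URLs would be handed to a downloader.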

2 Data Labeling

2.1 Semi-Supervised Learning (SSL)

Semi-supervised learning usually builds a model on top of assumptions about the data, so how well it works depends on whether those assumptions are reasonable.

• Focus on the scenario where there is a small amount of labeled data, along with a large amount of unlabeled data
• Make assumptions about the data distribution in order to use the unlabeled data

Continuity assumption: examples with similar features are more likely to have the same label
Cluster assumption: data have an inherent cluster structure, and examples in the same cluster tend to have the same label
Manifold assumption: data lie on a manifold of much lower dimension than the input space

Self-training

Further reading:
https://blog.csdn.net/tyh70537/article/details/80244490
https://www.sohu.com/a/454689217_500659
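
A minimal sketch of a self-training loop, under stated assumptions: the "model" is a toy nearest-centroid classifier on 1-D features, the confidence score is a made-up distance ratio, and the names `fit_centroids`, `predict_with_confidence`, `self_train` and the `threshold` parameter are all hypothetical. The point is only the loop structure: train on labeled data, pseudo-label the unlabeled examples the model is confident about, add them to the training set, and repeat.

```python
def fit_centroids(xs, ys):
    """Return {label: mean feature value} for 1-D features."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_confidence(centroids, x):
    """Return (label, confidence); confidence is 1.0 when x sits on a
    centroid and drops as x moves toward the decision boundary."""
    dists = {y: abs(x - c) for y, c in centroids.items()}
    label = min(dists, key=dists.get)
    total = sum(dists.values())
    confidence = 1.0 if total == 0 else 1.0 - dists[label] / total
    return label, confidence


def self_train(labeled_x, labeled_y, unlabeled_x, threshold=0.9, max_rounds=10):
    xs, ys = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    for _ in range(max_rounds):
        centroids = fit_centroids(xs, ys)
        confident, remaining = [], []
        for x in pool:
            label, conf = predict_with_confidence(centroids, x)
            if conf >= threshold:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:
            break  # nothing new to pseudo-label
        for x, label in confident:
            xs.append(x)
            ys.append(label)  # pseudo-label joins the training set
        pool = remaining
    return fit_centroids(xs, ys), pool


model, still_unlabeled = self_train(
    [0.0, 1.0, 9.0, 10.0], [0, 0, 1, 1],
    [0.5, 2.0, 8.0, 9.5, 5.0],
    threshold=0.8,
)
print(model, still_unlabeled)
```

In this toy run the point 5.0 sits exactly between the two classes, so it never gets a pseudo-label and stays in the unlabeled pool.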

2.2 Active Learning

Further reading:
https://blog.csdn.net/qq_15111861/article/details/85264109

• Focuses on the same scenario as SSL, but with a human in the loop

Self-training: the model helps propagate labels to unlabeled data
Active learning: the model selects the most "interesting" data for labelers

• Uncertainty sampling

Select the examples whose predictions are most uncertain
For example, those whose highest class prediction score is close to random (1/n for n classes)
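
Uncertainty sampling can be sketched as picking the examples whose top predicted probability is lowest. The function name `least_confident` and the probability table are illustrative assumptions; in practice the probabilities would come from the current model.

```python
def least_confident(probabilities, k):
    """Return indices of the k examples whose top class probability is lowest."""
    top_scores = [max(p) for p in probabilities]
    ranked = sorted(range(len(probabilities)), key=lambda i: top_scores[i])
    return ranked[:k]


# Hypothetical model outputs for four unlabeled examples and three classes.
predicted = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # top score close to 1/3, i.e. near random
    [0.60, 0.30, 0.10],
    [0.40, 0.35, 0.25],
]

to_label = least_confident(predicted, k=2)
print(to_label)  # [1, 3]
```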

• Query-by-committee

Train multiple models and select the samples on which the models disagree
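
One common way to measure committee disagreement is vote entropy; the sketch below assumes that measure, which the notes above do not specify. The names `vote_entropy` and `query_by_committee` and the vote table are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the label votes cast by the committee for one example."""
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def query_by_committee(committee_votes, k):
    """committee_votes[i] holds the labels each model predicted for example i.
    Returns the indices of the k examples with the highest disagreement."""
    scores = [vote_entropy(v) for v in committee_votes]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical predictions from a committee of three models.
votes = [
    ["cat", "cat", "cat"],   # full agreement
    ["cat", "dog", "bird"],  # full disagreement
    ["dog", "dog", "cat"],   # partial disagreement
]

picked = query_by_committee(votes, k=2)
print(picked)  # [1, 2]
```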

2.3 Active Learning + Self-training
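
The notes give only the heading for this combination, so the following is one plausible reading, not the course's stated method: confident predictions become pseudo-labels (self-training), while the least confident examples are sent to human labelers (active learning). The function name `route_unlabeled` and both thresholds are assumptions chosen for illustration.

```python
def route_unlabeled(probabilities, confident_at=0.9, uncertain_at=0.5):
    """Split unlabeled examples by the model's top class probability."""
    pseudo_labeled, for_humans, undecided = [], [], []
    for i, probs in enumerate(probabilities):
        top = max(probs)
        if top >= confident_at:
            # Self-training: trust the model and keep (index, predicted class).
            pseudo_labeled.append((i, probs.index(top)))
        elif top <= uncertain_at:
            # Active learning: ask a human to label this example.
            for_humans.append(i)
        else:
            undecided.append(i)  # leave for a later round
    return pseudo_labeled, for_humans, undecided


# Hypothetical two-class model outputs.
routes = route_unlabeled([[0.95, 0.05], [0.5, 0.5], [0.7, 0.3], [0.08, 0.92]])
print(routes)
```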

3 Data Cleaning

The figure on the right is a box plot. For background on box plots, see the linked blog post on box plots.
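
A box plot is commonly used to spot outliers: points beyond the whiskers, conventionally 1.5 times the interquartile range (IQR) past the quartiles. A minimal sketch of that rule, assuming the 1.5 multiplier and a made-up `prices` column:

```python
import statistics


def iqr_outliers(values, whisker=1.5):
    """Return the values that fall outside the box-plot whiskers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical column with one obviously wrong entry.
prices = [10, 12, 11, 13, 12, 11, 10, 95]
print(iqr_outliers(prices))  # [95]
```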

3.1 Rule-based Detection

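Functional dependencies

That is, each x always maps to exactly one y; for example, a postal code determines a geographic location. If one x maps to several different y values, check the data for a dependency violation. A sample that violates the dependency should either be dropped or fixed manually.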
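
A functional dependency check can be sketched by grouping rows on the determining column and flagging any value that maps to more than one result. The helper `fd_violations`, the column names, and the sample rows are illustrative assumptions.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return {lhs value: set of rhs values} for every lhs value
    that maps to more than one distinct rhs value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: vals for key, vals in seen.items() if len(vals) > 1}


# Hypothetical records; zip should determine city.
rows = [
    {"zip": "94305", "city": "Stanford"},
    {"zip": "94305", "city": "Stanford"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},
]

violations = fd_violations(rows, "zip", "city")
print(violations)
```

Rows whose key appears in the result are the ones to drop or fix by hand.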

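Denial constraints

That is, the data are constrained by a rule or function. For example, if a record has a home address, it must also have a postal code. Or, if user IDs are declared unique and a duplicate ID appears, consider removing the duplicate samples.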
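
The two example rules can be written as simple checks over records. This is a sketch under assumed field names (`user_id`, `address`, `zip`) and made-up data; `find_violations` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def address_without_zip(row):
    """Rule 1: a record with a home address must also have a postal code."""
    return bool(row.get("address")) and not row.get("zip")


def find_violations(rows):
    """Return (row index, reason) pairs for every rule violation."""
    id_counts = Counter(row["user_id"] for row in rows)
    report = []
    for i, row in enumerate(rows):
        if address_without_zip(row):
            report.append((i, "address present but zip missing"))
        if id_counts[row["user_id"]] > 1:  # Rule 2: user_id must be unique
            report.append((i, "duplicate user_id"))
    return report


# Hypothetical user table.
users = [
    {"user_id": 1, "address": "1 Main St", "zip": "94305"},
    {"user_id": 2, "address": "2 Oak Ave", "zip": ""},
    {"user_id": 1, "address": "", "zip": ""},
]

report = find_violations(users)
print(report)
```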

3.2 Pattern-based Detection

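Syntactic patterns

For example, if a feature's text values are required to be uppercase, a lowercase value is a syntactic error. Likewise, if a column is declared as integer but contains a float, consider removing the value or fixing it manually.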
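
Syntactic checks like these map naturally onto regular expressions. A small sketch, with the patterns `UPPERCASE_CODE` and `INTEGER` and the sample columns chosen purely for illustration:

```python
import re

UPPERCASE_CODE = re.compile(r"[A-Z]+")  # values must be all uppercase letters
INTEGER = re.compile(r"[+-]?\d+")       # values must look like integers


def syntax_errors(values, pattern):
    """Return the values that do not fully match the expected pattern."""
    return [v for v in values if not pattern.fullmatch(v)]


bad_codes = syntax_errors(["USA", "CHN", "fra"], UPPERCASE_CODE)
bad_ints = syntax_errors(["3", "42", "7.5"], INTEGER)
print(bad_codes)  # ['fra']
print(bad_ints)   # ['7.5']
```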

Semantic patterns

For example, if a column is supposed to hold capital cities but contains the name of some township, that is a semantic error.
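
A semantic check needs outside knowledge about what the values mean. One simple approach, assumed here, is to compare the column against a trusted reference list; `KNOWN_CAPITALS` below is a tiny made-up stand-in for such a list.

```python
# Tiny illustrative reference set; a real check would use a complete list.
KNOWN_CAPITALS = {"Beijing", "Washington, D.C.", "Paris", "Tokyo"}


def semantic_errors(values, reference):
    """Return the values that are syntactically fine but not in the reference set."""
    return [v for v in values if v not in reference]


capital_column = ["Paris", "Tokyo", "Palo Alto"]
suspicious = semantic_errors(capital_column, KNOWN_CAPITALS)
print(suspicious)  # ['Palo Alto']
```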
