Stanford Fall 2021: Practical Machine Learning

1 Data Acquisition

Open-source datasets
• MNIST: digits written by employees of the US Census Bureau
• ImageNet: millions of images from image search engines
• AudioSet: YouTube sound clips for sound classification
• LibriSpeech: 1000 hours of English speech from audiobooks
• Kinetics: YouTube video clips for human action classification
• KITTI: traffic scenarios recorded by cameras and other sensors
• Amazon Review: customer reviews from Amazon online shopping
• SQuAD: question-answer pairs derived from Wikipedia

Dataset websites:
• Papers with Code Datasets: academic datasets with leaderboards
• Kaggle Datasets: ML datasets uploaded by data scientists
• Google Dataset Search: search for datasets on the Web
• Datasets bundled with toolkits: TensorFlow, Hugging Face
• Various conference/company ML competitions
• Open Data on AWS: 100+ large-scale raw datasets
• Data lakes in your own organization

1.1 Web Scraping


Crawl individual pages

Extract data

Crawl Images
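
The "extract data" and "crawl images" steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the course's crawler: the `PageExtractor` class, the `extract` helper, the sample HTML, and the `example.com` base URL are all hypothetical, and a real crawler would first download each page (and respect the site's terms and rate limits).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collect the page title, outgoing links, and image URLs from HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()


def extract(html, base_url):
    """Parse one page and return the extracted fields as a dict."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    return {"title": parser.title, "links": parser.links, "images": parser.images}


# Hypothetical page content; a real crawler would download this first.
SAMPLE_HTML = """
<html><head><title>House listing</title></head>
<body><a href="/page/2">next</a><img src="img/front.jpg"></body></html>
"""
result = extract(SAMPLE_HTML, "https://example.com/listings/")
print(result)
```

The `links` list is what a crawler would follow to reach further pages, and the `images` list is what an image crawler would download.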

2 Data Labeling

2.1 Semi-Supervised Learning (SSL)


Semi-supervised learning builds a model on top of assumptions about the data, so how well it works depends on whether those assumptions actually hold.

• Focus on the scenario where there is a small amount of labeled data along with a large amount of unlabeled data
• Make assumptions about the data distribution in order to use the unlabeled data

  • Continuity assumption: examples with similar features are more likely to have the same label
  • Cluster assumption: data have an inherent cluster structure, and examples in the same cluster tend to have the same label
  • Manifold assumption: data lie on a manifold of much lower dimension than the input space

Self-training

Further reading on self-training:
https://blog.csdn.net/tyh70537/article/details/80244490
https://www.sohu.com/a/454689217_500659
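
A minimal sketch of a self-training loop, assuming a toy nearest-centroid classifier on 1-D features with at least two classes. The function names, the distance-based confidence score, and the `threshold` value are illustrative assumptions, not something the notes specify. Each round trains on the labeled set, pseudo-labels only the unlabeled points predicted with high confidence, and repeats until no new point qualifies.

```python
def centroids(points, labels):
    """Mean of the 1-D points in each class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_confidence(x, cents):
    """Return (label, confidence); confidence lies in [0.5, 1]."""
    dists = sorted((abs(x - c), y) for y, c in cents.items())
    (d_best, label), (d_second, _) = dists[0], dists[1]
    total = d_best + d_second
    # Equal distance to the two nearest centroids means maximal ambiguity.
    conf = 0.5 if total == 0 else d_second / total
    return label, conf


def self_train(labeled_x, labeled_y, unlabeled_x, threshold=0.75, max_rounds=10):
    """Grow the labeled set with confident pseudo-labels."""
    xs, ys, pool = list(labeled_x), list(labeled_y), list(unlabeled_x)
    for _ in range(max_rounds):
        cents = centroids(xs, ys)  # "train" on the current labeled set
        keep = []
        for x in pool:
            label, conf = predict_with_confidence(x, cents)
            if conf >= threshold:
                xs.append(x)       # accept the pseudo-label
                ys.append(label)
            else:
                keep.append(x)     # stay unlabeled for the next round
        if len(keep) == len(pool):
            break                  # nothing new was added; stop
        pool = keep
    return centroids(xs, ys), pool


final_cents, remaining = self_train(
    [0.0, 10.0], ["a", "b"], [1.0, 2.0, 9.0, 8.0, 5.2]
)
print(final_cents, remaining)
```

The point near the middle (5.2) never reaches the confidence threshold and stays unlabeled, which is the intended behavior: low-confidence pseudo-labels would otherwise feed errors back into training.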

2.2 Active Learning

Further reading on active learning:
https://blog.csdn.net/qq_15111861/article/details/85264109

• Focus on the same scenario as SSL, but with a human in the loop

  • Self-training: the model helps propagate labels to unlabeled data
  • Active learning: the model selects the most “interesting” data for labelers

• Uncertainty sampling

  • Select examples whose predictions are most uncertain
  • For example, examples whose highest class prediction score is close to random guessing (1/n for n classes)
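
Uncertainty sampling as described above can be sketched as "least confident" selection: rank examples by their highest predicted class probability and send the lowest ones to labelers. The function name and the probability table are made up for illustration.

```python
def least_confident(probabilities, k):
    """Return indices of the k examples whose top class probability is lowest."""
    scored = [(max(p), i) for i, p in enumerate(probabilities)]
    scored.sort()  # lowest top probability first
    return [i for _, i in scored[:k]]


# Hypothetical model outputs for four unlabeled examples, three classes each.
probs = [
    [0.98, 0.01, 0.01],  # confident
    [0.34, 0.33, 0.33],  # close to 1/3: nearly random
    [0.60, 0.30, 0.10],
    [0.40, 0.35, 0.25],
]
picked = least_confident(probs, 2)
print(picked)
```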

• Query-by-committee

  • Train multiple models and select the samples on which the models disagree
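
One common way to measure committee disagreement is vote entropy; the notes do not name a specific measure, so treat this as one possible choice. The sketch below assumes each committee member has already produced a predicted label per example, and the labels shown are invented.

```python
from collections import Counter
from math import log


def vote_entropy(votes):
    """Entropy of the label votes cast for one example."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * log(c / total) for c in counts.values())


def query_by_committee(committee_predictions, k):
    """committee_predictions[m][i] is model m's label for example i."""
    n_examples = len(committee_predictions[0])
    scores = []
    for i in range(n_examples):
        votes = [preds[i] for preds in committee_predictions]
        scores.append((vote_entropy(votes), i))
    # Highest disagreement first; ties broken by example index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:k]]


# Hypothetical predictions from three models on four unlabeled examples.
committee = [
    ["cat", "dog", "cat", "dog"],
    ["cat", "cat", "cat", "dog"],
    ["cat", "bird", "dog", "dog"],
]
selected = query_by_committee(committee, 2)
print(selected)
```

Example 1 receives three different votes and example 2 receives a 2-to-1 split, so those two are selected for labeling.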

2.3 Active Learning + Self-training
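
The notes give no detail for this combination, so the following is only one plausible reading: use the model's confidence to route each unlabeled example, keeping confident predictions as pseudo-labels (self-training) and sending the least confident ones to human labelers (active learning). The function name and both thresholds are assumptions.

```python
def route_by_confidence(probabilities, high=0.9, low=0.6):
    """Split unlabeled examples into pseudo-labeled, to-label, and undecided."""
    pseudo_labeled, to_labelers, undecided = [], [], []
    for i, p in enumerate(probabilities):
        top = max(p)
        if top >= high:
            # Confident: keep the predicted class index as a pseudo-label.
            pseudo_labeled.append((i, p.index(top)))
        elif top <= low:
            # Uncertain: worth a human label.
            to_labelers.append(i)
        else:
            undecided.append(i)
    return pseudo_labeled, to_labelers, undecided


# Hypothetical two-class model outputs.
probs = [[0.95, 0.05], [0.52, 0.48], [0.70, 0.30], [0.08, 0.92]]
pseudo, ask_human, later = route_by_confidence(probs)
print(pseudo, ask_human, later)
```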


3 Data Cleaning



The figure on the right is a box plot. For background on box plots, see this blog post: Box Plot.
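
A box plot conventionally draws its whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles and shows anything outside them as an outlier. A minimal sketch of that rule, assuming made-up price values and the "inclusive" quartile method from the standard library:

```python
from statistics import quantiles


def iqr_outliers(values, whisker=1.5):
    """Return the values that fall outside the box-plot whiskers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical prices; the last one is far from the rest.
prices = [210, 220, 225, 230, 240, 250, 255, 260, 900]
outliers = iqr_outliers(prices)
print(outliers)
```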

3.1 Rule-based Detection

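Functional dependencies

  • Each value of x always maps to exactly one value of y. For example, a postal code determines the location it belongs to. If one x maps to several different y values, check the data for a dependency violation. A record that breaks the dependency should either be deleted or fixed manually.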
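
A functional dependency check can be sketched by grouping rows on the determining column and flagging any key that maps to more than one value. The column names and rows below are hypothetical.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return {lhs value: sorted rhs values} for keys that break lhs -> rhs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}


# Hypothetical records: postal code should determine the city.
rows = [
    {"zip": "94305", "city": "Stanford"},
    {"zip": "94305", "city": "Stanford"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},
]
violations = fd_violations(rows, "zip", "city")
print(violations)
```

The flagged keys are the ones to delete or fix by hand.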

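Denial constraints

  • Constrain the data with a rule or function. For example, a record that has a home address must also have a postal code. Likewise, if user IDs are declared unique and a duplicate ID appears, consider removing the duplicate records.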
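
The two rules above can be written as small checks that return the offending records. The field names (`user_id`, `address`, `zip`) and the rows are assumptions made for this sketch.

```python
from collections import Counter


def duplicate_ids(rows):
    """User IDs that appear more than once (violates uniqueness)."""
    counts = Counter(row["user_id"] for row in rows)
    return sorted(uid for uid, c in counts.items() if c > 1)


def missing_zip(rows):
    """IDs of records that have an address but no postal code."""
    return [row["user_id"] for row in rows
            if row.get("address") and not row.get("zip")]


# Hypothetical records.
rows = [
    {"user_id": 1, "address": "1 Main St", "zip": "94305"},
    {"user_id": 2, "address": "5 Oak Ave", "zip": ""},
    {"user_id": 2, "address": "", "zip": ""},
    {"user_id": 3, "address": "", "zip": ""},
]
dupes = duplicate_ids(rows)
no_zip = missing_zip(rows)
print(dupes, no_zip)
```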

3.2 Pattern-based Detection


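Syntactic patterns

  • For example, if a feature is specified to be uppercase English text, a lowercase value violates the syntactic pattern. Similarly, if a column is declared to hold integers but a float appears, consider dropping the value or fixing it manually.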
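
Syntactic patterns map naturally onto regular expressions. A minimal sketch, assuming the column values arrive as strings; the two patterns and the sample columns are illustrative.

```python
import re

UPPERCASE = re.compile(r"[A-Z]+")    # uppercase ASCII letters only
INTEGER = re.compile(r"[+-]?\d+")    # optional sign followed by digits


def syntax_errors(values, pattern):
    """Return the values that do not fully match the expected pattern."""
    return [v for v in values if not pattern.fullmatch(v)]


# Hypothetical columns.
codes = ["US", "CN", "fr", "DE"]
ages = ["31", "42.5", "-7", "n/a"]
bad_codes = syntax_errors(codes, UPPERCASE)
bad_ages = syntax_errors(ages, INTEGER)
print(bad_codes, bad_ages)
```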

Semantic patterns

  • For example, if a column is supposed to contain capital cities but the data includes the name of a small town, the value is semantically wrong.
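
A semantic check needs outside knowledge of what the column means, which can be approximated by a reference list. The sketch below assumes a tiny hand-written set of capitals; a real check would use a complete reference table.

```python
# Tiny hypothetical reference list; deliberately incomplete.
KNOWN_CAPITALS = {"Beijing", "Paris", "Tokyo", "Washington, D.C."}


def semantic_errors(values, allowed):
    """Return the values that are not in the reference set."""
    return [v for v in values if v not in allowed]


# Hypothetical "capital" column containing one town name.
column = ["Paris", "Tokyo", "Springfield", "Beijing"]
suspects = semantic_errors(column, KNOWN_CAPITALS)
print(suspects)
```

With an incomplete reference list, flagged values are only candidates for review, not confirmed errors.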

Original article: https://54852.com/langs/942658.html