from http://hao.memect.com/?p=3294
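One difficulty in big-data analysis is that massive datasets are hard to store locally and take a very long time to download. For example, at a download speed of 3 MB/s (the current average broadband speed in China), 1 TB of data takes about 4 days to download. Some datasets run to tens of terabytes, so the download alone would take months.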
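The download-time estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant transfer rate; the function name `download_days` is made up for illustration.

```python
SECONDS_PER_DAY = 86_400
TB = 10**12  # assumption: decimal terabyte
MB = 10**6   # assumption: decimal megabyte


def download_days(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Days needed to transfer size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s / SECONDS_PER_DAY


# Sizes in TB: the 1 TB example, a "tens of TB" case, and Common Crawl's 50 TB.
for size_tb in (1, 30, 50):
    days = download_days(size_tb * TB, 3 * MB)
    print(f"{size_tb:>2} TB at 3 MB/s: {days:.1f} days")
```

Under these assumptions 1 TB takes a little under 4 days, and a 30 TB dataset takes close to four months.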
To address this, Amazon's AWS cloud platform hosts many commonly used large-scale datasets as Public Data Sets ( https://aws.amazon.com/datasets ), which can be used on Amazon EC2 without downloading them.
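Taking Linux as an example, the procedure is:
- Launch an EC2 instance.
- Look up the EBS snapshot ID of the dataset you want (for PubChem it is snap-e6df3c8f), then create an EBS volume and choose to create it from that snapshot.
- Attach the new EBS volume to the EC2 instance (done in the AWS Management Console).
- On the EC2 instance, run lsblk to check that the volume is attached; if a device such as xvdf shows up, it worked.
- Use mount to mount the EBS volume at a path; for example, sudo mount /dev/xvdf /opt/pubchem mounts the volume at /opt/pubchem.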
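The first three steps are all done in the AWS graphical console and are very intuitive. The last two are just two commands, after which you can immediately start working with terabytes of data; Common Crawl, for example, is 50 TB.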
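For readers who prefer the command line to the console, the same steps can be sketched as a list of commands. The Python helper below only builds and prints the command strings; it does not call AWS. It assumes the AWS CLI is installed (the `aws ec2 create-volume` and `aws ec2 attach-volume` subcommands), that the volume is created in the same availability zone as the instance, and that a volume attached as /dev/sdf shows up as /dev/xvdf on the instance, which is common on Linux AMIs but not guaranteed. The helper name, the zone `us-east-1a`, and the IDs `i-EXAMPLE` and `vol-EXAMPLE` are placeholders, not values from the original post.

```python
import shlex


def snapshot_mount_commands(snapshot_id, zone, instance_id, volume_id,
                            attach_device="/dev/sdf",
                            local_device="/dev/xvdf",
                            mount_point="/opt/pubchem"):
    """Build (not run) the commands for creating, attaching and mounting a volume."""
    steps = [
        # Create an EBS volume from the public snapshot, in the instance's zone.
        ["aws", "ec2", "create-volume",
         "--snapshot-id", snapshot_id, "--availability-zone", zone],
        # Attach it; volume_id is the ID reported by the previous command.
        ["aws", "ec2", "attach-volume", "--volume-id", volume_id,
         "--instance-id", instance_id, "--device", attach_device],
        # On the instance: confirm that the block device is visible.
        ["lsblk"],
        # The mount point must exist before mounting.
        ["sudo", "mkdir", "-p", mount_point],
        ["sudo", "mount", local_device, mount_point],
    ]
    return [shlex.join(step) for step in steps]


for line in snapshot_mount_commands("snap-e6df3c8f", "us-east-1a",
                                    "i-EXAMPLE", "vol-EXAMPLE"):
    print(line)
```

The first two commands run wherever the AWS CLI is configured; the last three run on the EC2 instance itself.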
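Keep an eye on the price. The cheapest EBS storage currently costs 5 cents per GB per month, which means the Common Crawl data would cost about $2,500 a month, plus I/O charges.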
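The storage figure is simple multiplication. The sketch below assumes 1 TB = 1,000 GB and uses the 5 cents per GB-month rate quoted above; I/O charges are not included, and the function name is made up for illustration.

```python
def monthly_ebs_cost_usd(size_gb: float, usd_per_gb_month: float = 0.05) -> float:
    """Storage-only monthly cost; I/O charges are not included."""
    return size_gb * usd_per_gb_month


common_crawl_gb = 50 * 1000  # 50 TB, assuming 1 TB = 1,000 GB
print(f"${monthly_ebs_cost_usd(common_crawl_gb):,.0f} per month")
```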
The fifty-plus datasets currently online are:
The 1000 Genomes Project, initiated in 2008, is an international public-private consortium that aims to build the most detailed map of human genetic variation available.
AnthroKids – Anthropometric Data of Children (1970s anthropometric measurements of children)
Anthropometric data on children from two studies in 1975 and 1977
Apache Software Foundation Public Mail Archives (Apache Software Foundation mailing lists up to 2011)
A collection of all publicly available Apache Software Foundation mail archives as of July 11, 2011
C57BL/6J by C3H/HeJ mouse cross from the Jake Lusis lab at UCLA
Common Crawl Corpus (5 billion web pages)
A corpus of web crawl data composed of over 5 billion web pages. This data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use.
A collection of daily weather measurements (temperature, wind speed, humidity, pressure, &c.) from 9000+ weather stations around the world.
DBpedia 3.5.1 (the DBpedia structured knowledge base)
DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web
Denisova Genome (the Denisovan genome)
The high-coverage genome sequence of a Denisovan individual sequenced to ~30x coverage on the Illumina platform. Together with their sister group the Neandertals, Denisovans are the most closely related extinct relatives of currently living humans.
Enron Email Data
Enron email data publicly released as part of FERC’s Western Energy Markets investigation converted to industry standard formats by EDRM. The data set consists of 1,227,255 emails with 493,384 attachments covering 151 custodians. The email is provided in Microsoft PST, IETF MIME, and EDRM XML formats.
Ensembl – FASTA Database Files (Ensembl eukaryotic genome transcript and translation models)
Ensembl sequence databases of transcript and translation models
Ensembl Annotated Human Genome Data (FASTA Release 73) (gene sequences for human and 50 other species)
The Ensembl project produces genome databases for human as well as over 50 other species, and makes this information freely available.
Ensembl Annotated Human Genome Data (MySQL Release 73) (gene sequences for human and 50 other species, MySQL version)
The Ensembl project produces genome databases for human as well as over 50 other species, and makes this information freely available.
A data dump of all federal contracts from the Federal Procurement Data Center found at USASpending.gov.
Freebase Data Dump (the Freebase knowledge graph)
Freebase is an open database of the world’s information, covering millions of topics in hundreds of categories
Freebase Quad Dump (the Freebase knowledge graph in quad format)
A data dump of all the current facts and assertions in Freebase
Freebase Simple Topic Dump (simplified topic data from the Freebase knowledge graph)
A data dump of the basic identifying facts about every topic in Freebase
GenBank (sequence database)
An annotated collection of all publicly available DNA sequences including more than 85.7B bases and 82.8M sequence records.
Google Books Ngrams (n-gram language model data from Google Books)
A data set containing Google Books n-gram corpuses. This data set is freely available on Amazon S3 in a Hadoop friendly file format and is licensed under a Creative Commons Attribution 3.0 Unported License. The original dataset is available from http://books.google.com/ngrams/.
Human Liver Cohort characterizing gene expression in liver samples
Jay Flatley (CEO of Illumina) human genome data set.
Japan Census Data
Multiple data sets including: (1) Population Census of Japan (1995, 2000, 2005, 2010), (2) Establishment and Enterprise Census of Japan (1999, 2001, 2004, 2006), and (3) Economic Census of Japan (2009).
M-Lab dataset: Network Diagnostic Tool (NDT) (2009 Internet performance diagnostic data, e.g. network speed)
NDT test results created through Measurement Lab (M-Lab) between February 2009 and September 2009
M-Lab dataset: Network Path and Application Diagnosis tool (NPAD) (2009 Internet routing and packet header test data)
NPAD test results created through Measurement Lab (M-Lab) between February 2009 and September 2009
Marvel Universe Social Graph (a fictional social network graph)
This dataset is an example of a social collaboration network based on the characters in The Marvel Universe, that is, the artificial world that takes place in the universe of the Marvel comic books.
Million Song Dataset
The Million Songs Collection is a collection of 28 datasets containing audio features and metadata for a million contemporary popular music tracks.
Million Song Sample Dataset (a 10,000-song subset of the Million Song Dataset)
This is a 10,000 song subset of audio features and metadata from the Million Songs collection – a collection of 28 datasets containing audio features and metadata for a million contemporary popular music tracks.
A collection of data from the modENCODE project ( http://www.modencode.org )
NASA NEX (NASA Earth satellite imagery and climate change data)
Three NASA NEX datasets are now available, including climate projections and satellite images of Earth.
OpenStreetMap Rendering Database (open-source global map data)
A PostGIS 8.3 data cluster of all OpenStreetMap data for the planet.
Public-domain data for the oil & gas industry, assembled from the contributions of participating agencies in the United States, Canada and around the world. This data provides industry stakeholders with an opportunity to focus their efforts on the analysis and interpretation of this data without concern for the trivial and time-consuming tasks of locating, downloading, reformatting and integrating the data prior to value-added work being performed.
PubChem Library (bioactivity data for small organic molecules)
A data set of information on the biological activities of small molecules.
The Sloan Digital Sky Survey is the most ambitious astronomical survey ever undertaken.
Whole Genome Shotgun Sequencing of the Cannabis Sativa Cultivar “Chemdawg”
The WestburyLab USENET corpus (data from more than 40,000 USENET newsgroups)
The WestburyLab USENET corpus is an anonymized compilation of postings from 47,860 English-language newsgroups from 2005-2010.
Twilio/Wigle.net Street Vector Data Set (complete US street names and addresses)
Twilio/Wigle.net database of mapped US street names and address ranges.
University of Florida Sparse Matrix Collection
The University of Florida Sparse Matrix Collection is a large, widely available, and actively growing set of sparse matrices that arise in real applications.
Wikipedia Extraction (WEX) (structured Wikipedia data enhanced with Freebase)
A processed dump of the English language Wikipedia
Wikipedia Page Traffic Statistic V3 (hourly Wikipedia page views for 3 months of 2011)
This dataset contains a 150 GB sample of the data used to power trendingtopics.org. It includes a full 3 months of hourly page traffic statistics from Wikipedia (1/1/2011-3/31/2011).
Wikipedia Page Traffic Statistics (hourly Wikipedia page views for 7 months of 2009)
Contains 7 months of hourly pageview statistics for all articles in Wikipedia
Wikipedia Traffic Statistics V2 (hourly Wikipedia page views for 16 months in 2009-2010)
Contains 16 months of hourly pageview statistics for all articles in Wikipedia
Wikipedia XML Data (Wikipedia, 2009 edition, in XML format)
A complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML.
YRI Trio Dataset (complete genomes of three Yoruba individuals)
Complete genome sequence data for three Yoruba individuals from Ibadan, Nigeria