Massive Public Data Sets Provided by Amazon

from http://hao.memect.com/?p=3294

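One difficulty in big data analysis is that massive data sets are hard to store locally, and downloading them takes a very long time. For example, at a download speed of 3 MB/s (currently the average broadband speed in China), 1 TB of data takes about 4 days to download. Some data sets run to tens of terabytes, so the download alone would take months.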
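The download estimate can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly steady link; the function name `download_days` is made up for illustration, and the 50 TB case uses the size this post quotes for CommonCrawl.

```python
SECONDS_PER_DAY = 86_400
TB = 10**12  # decimal terabyte, in bytes (assumption)
MB = 10**6   # decimal megabyte, in bytes (assumption)


def download_days(size_bytes: float, speed_bytes_per_s: float) -> float:
    """Days needed to transfer size_bytes at a constant speed."""
    return size_bytes / speed_bytes_per_s / SECONDS_PER_DAY


one_tb = download_days(1 * TB, 3 * MB)
fifty_tb = download_days(50 * TB, 3 * MB)

print(f"1 TB at 3 MB/s:  {one_tb:.1f} days")    # 3.9 days
print(f"50 TB at 3 MB/s: {fifty_tb:.0f} days")  # 193 days, roughly 6 months
```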

To address this, Amazon's AWS cloud platform hosts many commonly used large-scale data sets as Public Data Sets ( https://aws.amazon.com/datasets ), which can be used on AWS EC2 without downloading them.

Taking Linux as an example, the steps are:

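  1. Launch an EC2 instance.
  2. Look up the EBS snapshot ID of the data set you want (PubChem, for example, is snap-e6df3c8f), and create a new EBS volume from that snapshot.
  3. Attach the new EBS volume to the EC2 instance (done in the AWS Management Console).
  4. On the EC2 instance, run lsblk to check that the volume is attached; if a device such as xvdf appears, the attachment succeeded.
  5. Mount the EBS volume on a path with mount; for example, sudo mount /dev/xvdf /opt/pubchem mounts the volume at /opt/pubchem (the directory must already exist).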

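Steps 1-3 are all done in the AWS graphical console and are very intuitive. Steps 4-5 are just two commands, after which you can immediately start working with terabytes of data; CommonCrawl, for example, has 50 TB.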
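Steps 4 and 5 can also be scripted. The sketch below is only an illustration: the helper names `device_attached` and `mount_command` are hypothetical, and it assumes the device list comes from `lsblk -ln -o NAME` (one device name per line, no header). It builds the mount command without running it, so it is safe to try anywhere.

```python
def device_attached(lsblk_names: str, device: str) -> bool:
    """Return True if `device` (e.g. "xvdf") appears in the output of
    `lsblk -ln -o NAME`, which prints one device name per line."""
    return device in lsblk_names.split()


def mount_command(device: str, mount_point: str) -> list:
    """Build the argv list for mounting /dev/<device> at mount_point."""
    return ["sudo", "mount", f"/dev/{device}", mount_point]


# Sample text standing in for real `lsblk -ln -o NAME` output (assumption).
sample = "xvda\nxvda1\nxvdf\n"

attached = device_attached(sample, "xvdf")
cmd = mount_command("xvdf", "/opt/pubchem")
if attached:
    print(" ".join(cmd))  # sudo mount /dev/xvdf /opt/pubchem
```

On a real instance, the sample text would be replaced by the captured output of lsblk, and the command list could be passed to `subprocess.run(cmd, check=True)`.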

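Mind the price: EBS currently costs at least 5 US cents per GB per month, which means that holding the CommonCrawl data (about 50 TB) would cost about 2,500 US dollars a month, plus I/O charges.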
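The monthly storage figure follows directly from the per-GB price. A minimal sketch, assuming 1 TB = 1000 GB and the 5 cents per GB per month quoted in this post (the actual price varies by volume type and region, and I/O charges are not included); the function name is made up for illustration.

```python
def ebs_monthly_cost_usd(size_gb: float, usd_per_gb_month: float = 0.05) -> float:
    """Storage-only monthly cost of an EBS volume at a flat per-GB price."""
    return size_gb * usd_per_gb_month


common_crawl_gb = 50 * 1000  # 50 TB expressed in GB (decimal units, assumption)
cost = ebs_monthly_cost_usd(common_crawl_gb)

print(f"{cost:,.0f} USD per month")  # 2,500 USD per month
```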

The fifty-odd data sets currently online are:

1000 Genomes Project (see http://en.wikipedia.org/wiki/1000_Genomes_Project )
The 1000 Genomes Project, initiated in 2008, is an international public-private consortium that aims to build the most detailed map of human genetic variation available.

 

1980 US Census: US population census data for 1980
Data from the 1980 US Census

 

1990 US Census: US population census data for 1990
Data from the 1990 US Census

 

2000 US Census: US population census data for 2000
Data from the 2000 US Census

 

2003-2006 US Economic Data: US economic data for 2003-2006
US Economic Data for years 2003 to 2006

 

2008 TIGER/Line Shapefiles: US Census 2000 with detailed administrative boundaries
Census 2000 and Current United States shapefiles

 

3D Version of the PubChem Library: 3D version of the PubChem data on the biological activities of small organic molecules
3D Version of the PubChem Library

 

AnthroKids – Anthropometric Data of Children: anthropometric measurements of children from the 1970s
Anthropometric data on children from two studies in 1975 and 1977

 

Apache Software Foundation Public Mail Archives: Apache Software Foundation mailing lists up to 2011
A collection of all publicly available Apache Software Foundation mail archives as of July 11, 2011

 

Business and Industry Summary Data: US business and industry data
US Business and Industry Summary Data

 

C57BL/6J by C3H/HeJ mouse cross from the Jake Lusis lab at UCLA

 

Common Crawl Corpus: 5 billion web pages
A corpus of web crawl data composed of over 5 billion web pages. This data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use.

 

A collection of daily weather measurements (temperature, wind speed, humidity, pressure, &c.) from 9000+ weather stations around the world.

 

DBpedia 3.5.1: the DBpedia structured knowledge base
DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web

 

Denisova Genome: the genome of a Denisovan
The high-coverage genome sequence of a Denisovan individual sequenced to ~30x coverage on the Illumina platform. Together with their sister group the Neandertals, Denisovans are the most closely related extinct relatives of currently living humans.

 

Enron Email Data: the Enron email data set
Enron email data publicly released as part of FERC’s Western Energy Markets investigation converted to industry standard formats by EDRM. The data set consists of 1,227,255 emails with 493,384 attachments covering 151 custodians. The email is provided in Microsoft PST, IETF MIME, and EDRM XML formats.

 

Ensembl – FASTA Database Files: Ensembl transcript and translation models for eukaryotic genomes
Ensembl sequence databases of transcript and translation models

 

Ensembl Annotated Human Genome Data (FASTA Release 73): gene sequences for human and 50 other species
The Ensembl project produces genome databases for human as well as over 50 other species, and makes this information freely available.

 

Ensembl Annotated Human Genome Data (MySQL Release 73): gene sequences for human and 50 other species, MySQL version
The Ensembl project produces genome databases for human as well as over 50 other species, and makes this information freely available.

 

A data dump of all federal contracts from the Federal Procurement Data Center found at USASpending.gov.

 

Federal Reserve Economic Data – Fred: Federal Reserve economic time series
Database of 20,059 U.S. economic time series.

 

Freebase Data Dump: the Freebase knowledge graph
Freebase is an open database of the world’s information, covering millions of topics in hundreds of categories

 

Freebase Quad Dump: the Freebase knowledge graph in quad format
A data dump of all the current facts and assertions in Freebase

 

Freebase Simple Topic Dump: simplified topic data from the Freebase knowledge graph
A data dump of the basic identifying facts about every topic in Freebase

 

GenBank: the GenBank sequence database
An annotated collection of all publicly available DNA sequences including more than 85.7B bases and 82.8M sequence records.

 

Google Books Ngrams: n-gram language models from Google Books
A data set containing Google Books n-gram corpuses. This data set is freely available on Amazon S3 in a Hadoop friendly file format and is licensed under a Creative Commons Attribution 3.0 Unported License. The original dataset is available from http://books.google.com/ngrams/.

 

Human Liver Cohort (Sage Bionetworks): gene expression in human liver
Human Liver Cohort characterizing gene expression in liver samples

 

Human Microbiome Project: human microbiome data
Human Microbiome Project Data Set

 

Jay Flatley (CEO of Illumina) human genome data set.

 

NCBI Influenza Resource Center Data.

 

Japan Census Data: Japanese census statistics
Multiple data sets including: (1) Population Census of Japan (1995, 2000, 2005, 2010), (2) Establishment and Enterprise Census of Japan (1999, 2001, 2004, 2006), and (3) Economic Census of Japan (2009).

 

Labor Statistics Databases: statistics from the US Department of Labor
Various Labor Statistics

 

M-Lab dataset: Network Diagnostic Tool (NDT), 2009 diagnostic data on Internet performance (e.g. connection speed)
NDT test results created through Measurement Lab (M-Lab) between February 2009 and September 2009

 

M-Lab dataset: Network Path and Application Diagnosis tool (NPAD), 2009 test data on Internet routing, packet headers, etc.
NPAD test results created through Measurement Lab (M-Lab) between February 2009 and September 2009

 

Marvel Universe Social Graph: a graph of a fictional social network
This dataset is an example of a social collaboration network based on the characters in The Marvel Universe, that is, the artificial world that takes place in the universe of the Marvel comic books.

 

Material Safety Data Sheets: material safety data
230,000 Material Safety Data Sheets.

 

Million Song Dataset: data on a million songs
The Million Songs Collection is a collection of 28 datasets containing audio features and metadata for a million contemporary popular music tracks.

 

Million Song Sample Dataset: a 10,000-song subset of the Million Song Dataset
This is a 10,000 song subset of audio features and metadata from the Million Songs collection – a collection of 28 datasets containing audio features and metadata for a million contemporary popular music tracks.

 

A collection of data from the modENCODE project ( http://www.modencode.org )

 

NASA NEX: NASA satellite imagery of Earth and climate change data
Three NASA NEX datasets are now available, including climate projections and satellite images of Earth.

 

OpenStreetMap Rendering Database: open-source map data for the whole world
A PostGIS 8.3 data cluster of all OpenStreetMap data for the planet.

 

Public-domain data for the oil & gas industry, assembled from the contributions of participating agencies in the United States, Canada and around the world. This data provides industry stakeholders with an opportunity to focus their efforts on the analysis and interpretation of this data without concern for the trivial and time-consuming tasks of locating, downloading, reformatting and integrating the data prior to value-added work being performed.

 

PubChem Library: data on the biological activities of small organic molecules
A data set of information on the biological activities of small molecules.

 

Sloan Digital Sky Survey DR6 Subset: the Sloan Digital Sky Survey
The Sloan Digital Sky Survey is the most ambitious astronomical survey ever undertaken.

 

Whole Genome Shotgun Sequencing of the Cannabis Sativa Cultivar “Chemdawg”

 

The WestburyLab USENET corpus: data from more than 40,000 USENET newsgroups
The WestburyLab USENET corpus is an anonymized compilation of postings from 47,860 English-language newsgroups from 2005-2010.

 

Transportation Databases: US Department of Transportation statistics on aviation, shipping, highways, railways, pipelines, bicycles, and more
Various transportation statistics

 

Twilio/Wigle.net Street Vector Data Set: complete US street names and addresses
Twilio/Wigle.net database of mapped US street names and address ranges.

 

Unigene: NCBI's transcriptome database
UniGene: An Organized View of the Transcriptome.

 

University of Florida Sparse Matrix Collection: the University of Florida's sparse matrix data set
The University of Florida Sparse Matrix Collection is a large, widely available, and actively growing set of sparse matrices that arise in real applications.

 

Wikipedia Extraction (WEX): structured Wikipedia data enriched with Freebase
A processed dump of the English language Wikipedia

 

Wikipedia Page Traffic Statistic V3: hourly Wikipedia page traffic for 3 months of 2011
This dataset contains a 150 GB sample of the data used to power trendingtopics.org. It includes a full 3 months of hourly page traffic statistics from Wikipedia (1/1/2011-3/31/2011).

 

Wikipedia Page Traffic Statistics: hourly Wikipedia page traffic for 7 months of 2009
Contains 7 months of hourly pageview statistics for all articles in Wikipedia

 

Wikipedia Traffic Statistics V2: hourly Wikipedia page traffic for 16 months of 2009-2010
Contains 16 months of hourly pageview statistics for all articles in Wikipedia

 

Wikipedia XML Data: the 2009 edition of Wikipedia in XML format
A complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML.

 

YRI Trio Dataset: complete genomes of three Yoruba individuals
Complete genome sequence data for three Yoruba individuals from Ibadan, Nigeria
