Massive Public Data Sets Provided by Amazon

from http://hao.memect.com/?p=3294

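One difficulty in big-data analysis is that massive data sets are hard to store locally and take a very long time to download. For example, at a download speed of 3 MB/s (currently the average broadband speed in China), 1 TB of data takes about 4 days to download. Some data sets run to tens of terabytes, so downloading alone would take months.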
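The download-time figures above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes, 1 MB/s = 10^6 bytes per second) and a constant transfer rate; the function name `download_days` is only illustrative.

```python
# Rough download-time estimate. Decimal units are an assumption:
# 1 TB = 10**12 bytes and 1 MB/s = 10**6 bytes per second.
SECONDS_PER_DAY = 24 * 60 * 60


def download_days(size_tb: float, speed_mb_per_s: float) -> float:
    """Days needed to download size_tb terabytes at a constant speed."""
    size_bytes = size_tb * 10**12
    bytes_per_second = speed_mb_per_s * 10**6
    return size_bytes / bytes_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    for size_tb in (1, 50):
        days = download_days(size_tb, 3)
        print(f"{size_tb} TB at 3 MB/s: about {days:.1f} days")
```

At 3 MB/s, 1 TB comes out a little under 4 days, and a 50 TB data set takes roughly half a year.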

To address this, Amazon's AWS cloud platform hosts many commonly used large-scale data sets as Public Data Sets (https://aws.amazon.com/datasets), which can be used on AWS EC2 without downloading them.

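Taking Linux as an example, the steps are:

1. Launch an EC2 instance.
2. Look up the EBS snapshot ID for the data set you want (for example, PubChem is snap-e6df3c8f), and choose that snapshot as the source when creating a new EBS volume.
3. Attach the new EBS volume to the EC2 instance (done in the AWS Management Console).
4. On the EC2 instance, run lsblk to check that the volume is attached; if a device such as xvdf appears, it worked.
5. Use mount to mount the EBS volume at a path; for example, sudo mount /dev/xvdf /opt/pubchem mounts the volume at /opt/pubchem.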

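Steps 1-3 are done in the AWS graphical console and are very intuitive. Steps 4-5 are just two commands, after which you can immediately start working with terabytes of data; CommonCrawl, for example, is 50 TB.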
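The last two steps (checking lsblk, then mounting) could be scripted roughly as follows. This is only a sketch: the sample lsblk output, including the device sizes, is made up for illustration, and the helper names `device_attached` and `mount_command` are hypothetical. The code only inspects text and builds the command; it does not run anything on the instance.

```python
# Tree-drawing characters that lsblk puts in front of partition names.
TREE_CHARS = "├└─|`-"

# Hypothetical lsblk output; the sizes and device numbers are illustrative only.
SAMPLE_LSBLK = """\
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda    202:0    0    8G  0 disk
└─xvda1 202:1    0    8G  0 part /
xvdf    202:80   0  230G  0 disk
"""


def device_attached(lsblk_output: str, device: str) -> bool:
    """Return True if `device` (e.g. "xvdf") is listed in the NAME column."""
    lines = lsblk_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if fields and fields[0].lstrip(TREE_CHARS) == device:
            return True
    return False


def mount_command(device: str, mount_point: str) -> list:
    """Build the argument list for mounting /dev/<device> at mount_point."""
    return ["sudo", "mount", f"/dev/{device}", mount_point]


if __name__ == "__main__":
    if device_attached(SAMPLE_LSBLK, "xvdf"):
        print(" ".join(mount_command("xvdf", "/opt/pubchem")))
    else:
        print("xvdf not found; check that the volume is attached")
```

On a real instance the lsblk text would come from running the command itself, and the returned list could be passed to `subprocess.run`. Note that mount expects the target directory (here /opt/pubchem) to exist already.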

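Watch the price: the cheapest EBS storage currently costs 5 cents per GB per month, so the CommonCrawl data would cost 2,500 US dollars a month, plus read/write charges.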
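The storage cost above is a simple multiplication. The sketch below assumes 1 TB = 1000 GB and uses the 5 cents per GB per month quoted in the text as the default price; read/write charges are not modelled, and the function name is illustrative.

```python
def monthly_storage_cost_usd(size_tb: float, price_per_gb_month: float = 0.05) -> float:
    """Monthly storage cost in US dollars, assuming 1 TB = 1000 GB."""
    size_gb = size_tb * 1000
    return size_gb * price_per_gb_month


if __name__ == "__main__":
    cost = monthly_storage_cost_usd(50)  # CommonCrawl at 50 TB
    print(f"50 TB for one month: about {cost:.0f} USD")
```

Prices change over time, so the default here should be replaced with whatever rate currently applies.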

The fifty-plus data sets currently online are:

1000 Genomes Project: for details see http://en.wikipedia.org/wiki/1000_Genomes_Project

The 1000 Genomes Project, initiated in 2008, is an international public-private consortium that aims to build the most detailed map of human genetic variation available.

1980 US Census: data from the 1980 US census

Data from the 1980 US Census

1990 US Census: data from the 1990 US census

Data from the 1990 US Census

2000 US Census: data from the 2000 US census

Data from the 2000 US Census

2003-2006 US Economic Data: US economic data for 2003-2006

US Economic Data for years 2003 to 2006

2008 TIGER/Line Shapefiles: US Census 2000 data with detailed administrative boundaries

Census 2000 and Current United States shapefiles

3D Version of the PubChem Library: 3D version of the PubChem bioactivity data for small organic molecules

3D Version of the PubChem Library

AnthroKids – Anthropometric Data of Children: anthropometric measurements of children from the 1970s

Anthropometric data on children from two studies in 1975 and 1977

Apache Software Foundation Public Mail Archives: the Apache Software Foundation's mailing lists up to 2011

A collection of all publicly available Apache Software Foundation mail archives as of July 11, 2011

Business and Industry Summary Data: US business and industry data

US Business and Industry Summary Data

C57BL/6J by C3H/HeJ Mouse Cross (Sage Bionetworks): mouse cross data

C57BL/6J by C3H/HeJ mouse cross from the Jake Lusis lab at UCLA

Common Crawl Corpus: 5 billion web pages

A corpus of web crawl data composed of over 5 billion web pages. This data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use.

Daily Global Weather Measurements, 1929-2009 (NCDC, GSOD): 80 years of daily global weather data

A collection of daily weather measurements (temperature, wind speed, humidity, pressure, &c.) from 9000+ weather stations around the world.

DBpedia 3.5.1: the DBpedia structured knowledge base

DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web

Denisova Genome: genome of a Denisovan

The high-coverage genome sequence of a Denisovan individual sequenced to ~30x coverage on the Illumina platform. Together with their sister group the Neandertals, Denisovans are the most closely related extinct relatives of currently living humans.

Enron Email Data: the Enron email data set

Enron email data publicly released as part of FERC’s Western Energy Markets investigation converted to industry standard formats by EDRM. The data set consists of 1,227,255 emails with 493,384 attachments covering 151 custodians. The email is provided in Microsoft PST, IETF MIME, and EDRM XML formats.

Ensembl – FASTA Database Files: Ensembl transcript and translation models for eukaryotic genomes

Ensembl sequence databases of transcript and translation models

Ensembl Annotated Human Genome Data (FASTA Release 73): gene sequences for human and over 50 other species

The Ensembl project produces genome databases for human as well as over 50 other species, and makes this information freely available.

Ensembl Annotated Human Genome Data (MySQL Release 73): gene sequences for human and over 50 other species, MySQL version

The Ensembl project produces genome databases for human as well as over 50 other species, and makes this information freely available.

Federal Contracts from the Federal Procurement Data Center (USASpending.gov): US federal government contracts

A data dump of all federal contracts from the Federal Procurement Data Center found at USASpending.gov.

Federal Reserve Economic Data – Fred: Federal Reserve economic time series

Database of 20,059 U.S. economic time series.

Freebase Data Dump: the Freebase knowledge graph

Freebase is an open database of the world’s information, covering millions of topics in hundreds of categories

Freebase Quad Dump: the Freebase knowledge graph in quad format

A data dump of all the current facts and assertions in Freebase

Freebase Simple Topic Dump: simplified topic data from the Freebase knowledge graph

A data dump of the basic identifying facts about every topic in Freebase

GenBank: the GenBank sequence database

An annotated collection of all publicly available DNA sequences including more than 85.7B bases and 82.8M sequence records.

Google Books Ngrams: n-gram language model data from Google Books

A data set containing Google Books n-gram corpuses. This data set is freely available on Amazon S3 in a Hadoop friendly file format and is licensed under a Creative Commons Attribution 3.0 Unported License. The original dataset is available from http://books.google.com/ngrams/.

Human Liver Cohort (Sage Bionetworks): human liver gene expression

Human Liver Cohort characterizing gene expression in liver samples

Human Microbiome Project: human microbiome data

Human Microbiome Project Data Set

Illumina – Jay Flatley (CEO of Illumina) Human Genome Data Set: human genome data

Jay Flatley (CEO of Illumina) human genome data set.

Influenza Virus (including updated Swine Flu sequences): influenza virus data

NCBI Influenza Resource Center Data.

Japan Census Data: Japanese census data

Multiple data sets including: (1) Population Census of Japan (1995, 2000, 2005, 2010), (2) Establishment and Enterprise Census of Japan (1999, 2001, 2004, 2006), and (3) Economic Census of Japan (2009).

Labor Statistics Databases: statistics from the US Department of Labor

Various Labor Statistics

M-Lab dataset: Network Diagnostic Tool (NDT): 2009 Internet performance (e.g. network speed) diagnostic data

NDT test results created through Measurement Lab (M-Lab) between February 2009 and September 2009

M-Lab dataset: Network Path and Application Diagnosis tool (NPAD): 2009 test data on Internet routing, packet headers, etc.

NPAD test results created through Measurement Lab (M-Lab) between February 2009 and September 2009

Marvel Universe Social Graph: a social network graph of a fictional universe

This dataset is an example of a social collaboration network based on the characters in The Marvel Universe, that is, the artificial world that takes place in the universe of the Marvel comic books.

Material Safety Data Sheets: material safety data

230,000 Material Safety Data Sheets.

Million Song Dataset: data on a million songs

The Million Songs Collection is a collection of 28 datasets containing audio features and metadata for a million contemporary popular music tracks.

Million Song Sample Dataset: a 10,000-song subset of the Million Song Dataset

This is a 10,000 song subset of audio features and metadata from the Million Songs collection – a collection of 28 datasets containing audio features and metadata for a million contemporary popular music tracks.

Model Organism Encyclopedia of DNA Elements (modENCODE): encyclopedia of DNA elements in model organisms

A collection of data from the modENCODE project ( http://www.modencode.org )

NASA NEX: NASA's Earth satellite imagery and climate change data

Three NASA NEX datasets are now available, including climate projections and satellite images of Earth.

OpenStreetMap Rendering Database: open-source global map data

A PostGIS 8.3 data cluster of all OpenStreetMap data for the planet.

Petroleum Public Data Set (working Title): petroleum data

Public-domain data for the oil & gas industry, assembled from the contributions of participating agencies in the United States, Canada and around the world. This data provides industry stakeholders with an opportunity to focus their efforts on the analysis and interpretation of this data without concern for the trivial and time-consuming tasks of locating, downloading, reformatting and integrating the data prior to value-added work being performed.

PubChem Library: bioactivity data for small organic molecules

A data set of information on the biological activities of small molecules.

Sloan Digital Sky Survey DR6 Subset: the Sloan Digital Sky Survey

The Sloan Digital Sky Survey is the most ambitious astronomical survey ever undertaken.

The Cannabis Sativa Genome: cannabis genome

Whole Genome Shotgun Sequencing of the Cannabis Sativa Cultivar “Chemdawg”

The WestburyLab USENET corpus: data from more than 40,000 USENET newsgroups

The WestburyLab USENET corpus is an anonymized compilation of postings from 47,860 English-language newsgroups from 2005-2010.

Transportation Databases: US Department of Transportation statistics on aviation, maritime, highway, rail, pipeline, bicycle, and other transport

Various transportation statistics

Twilio/Wigle.net Street Vector Data Set: complete US street names and addresses

Twilio/Wigle.net database of mapped US street names and address ranges.

Unigene: NCBI's transcriptome database

UniGene: An Organized View of the Transcriptome.

University of Florida Sparse Matrix Collection: the University of Florida's sparse matrix data set

The University of Florida Sparse Matrix Collection is a large, widely available, and actively growing set of sparse matrices that arise in real applications.

Wikipedia Extraction (WEX): structured Wikipedia data enhanced with Freebase

A processed dump of the English language Wikipedia

Wikipedia Page Traffic Statistic V3: 3 months of hourly Wikipedia page traffic from 2011

This dataset contains a 150 GB sample of the data used to power trendingtopics.org. It includes a full 3 months of hourly page traffic statistics from Wikipedia (1/1/2011-3/31/2011).

Wikipedia Page Traffic Statistics: 7 months of hourly Wikipedia page views from 2009

Contains 7 months of hourly pageview statistics for all articles in Wikipedia

Wikipedia Traffic Statistics V2: 16 months of hourly Wikipedia page views from 2009-2010

Contains 16 months of hourly pageview statistics for all articles in Wikipedia

Wikipedia XML Data: the 2009 edition of Wikipedia in XML format

A complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML.

YRI Trio Dataset: complete genomes of three Yoruba individuals

Complete genome sequence data for three Yoruba individuals from Ibadan, Nigeria
