RealNest - Nested Data from Real-World Datasets
- 64 * 1024 row version: tables_65536.tar
- 10 * 64 * 1024 row version: tables_655360.tar
This readme is adapted from the RealNest repository. Visit the GitHub repository for the latest updates.
This website contains the static version of the RealNest dataset, a collection of nested data derived from real-world datasets. The dataset is designed to help computer science researchers benchmark and evaluate data systems and data formats supporting nested data types.
RealNest is provided as a script that downloads and generates the data, but for convenience and to facilitate standardized comparisons, we host two static datasets on this website, with 64 * 1024 and 10 * 64 * 1024 rows respectively. Each dataset is a .tar file containing a folder for each table. These sample datasets were downloaded and generated by our script in mid-May 2024.
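As a minimal sketch of how the archive layout described above could be inspected, the following Python snippet lists the table folders in one of the .tar files without extracting it. The helper name `list_table_folders` is hypothetical, and the snippet assumes only that each table folder contains a `schema.json` file; it makes no assumption about whether the folders sit at the top level of the archive.

```python
import posixpath
import tarfile


def list_table_folders(tar_path):
    """Return the sorted names of folders that contain a schema.json file.

    Assumption: a table folder is recognized by the presence of schema.json;
    any parent directories inside the archive are ignored.
    """
    tables = set()
    with tarfile.open(tar_path, "r") as archive:
        for member in archive:
            if posixpath.basename(member.name) != "schema.json":
                continue
            folder = posixpath.basename(posixpath.dirname(member.name))
            if folder:  # skip a schema.json that has no enclosing folder
                tables.add(folder)
    return sorted(tables)
```

For example, `list_table_folders("tables_65536.tar")` would return one entry per table folder found in the smaller static dataset.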
Because we provide the script that downloads the original datasets and processes them into a common format, one can regenerate the dataset from newer versions of the underlying data. One can also generate more rows than the static datasets contain, since even the larger of the two contains only a small part of each original data source. The script is available at the RealNest GitHub repository.
Note that the static datasets hosted here remain under the same licenses and terms of use as the original datasets they are generated from. If you are the owner of an original dataset, and object to the inclusion of your data in the RealNest static datasets hosted here, please contact Peter Boncz (boncz@cwi.nl), and we will take action.
Please note that below we attempt to properly attribute the individual datasets as required by their various open-source licenses and terms of usage.
Dataset Structure
The dataset contains a directory for each table with the following files:
- `schema.json`: The schema of the table. The schema is a JSON object with a single key, `columns`, containing a list of columns. Each column is a JSON object with 2 or 3 keys:
  - `name` - The name of the column as a string.
  - `type` - The type of the column as a string.
  - `children` - Optional, only exists for nested types (`list`, `struct`, `map`). Describes the child types of the nested type as a list of column objects. The `list` type always has a single child column with the name `child`. The `map` type always has two child columns with the names `key` and `value`.
- `data.jsonl.gz`: The data of the table in Gzip-compressed JSON Lines format.
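The schema layout described above can be walked recursively. The following is a minimal sketch, not part of the RealNest script: the function name `iter_column_paths`, the dotted path notation, and the example schema are all hypothetical, and the leaf type names in the example (`BIGINT`, `VARCHAR`, `DOUBLE`) are assumptions chosen for illustration.

```python
import json


def iter_column_paths(columns, prefix=""):
    """Yield (dotted_path, type) for every column, descending into children."""
    for column in columns:
        path = f"{prefix}.{column['name']}" if prefix else column["name"]
        yield path, column["type"]
        # "children" is optional and only present for list, struct, and map.
        yield from iter_column_paths(column.get("children", []), path)


# Hypothetical schema following the documented structure.
example_schema = json.loads("""
{"columns": [
  {"name": "id", "type": "BIGINT"},
  {"name": "tags", "type": "list",
   "children": [{"name": "child", "type": "VARCHAR"}]},
  {"name": "attrs", "type": "map",
   "children": [{"name": "key", "type": "VARCHAR"},
                {"name": "value", "type": "DOUBLE"}]}
]}
""")

column_paths = list(iter_column_paths(example_schema["columns"]))
```

In practice, the schema would be loaded from a table's `schema.json` file with `json.load` instead of the inline example.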
The schema might contain a `JSON` type, which may happen for empty JSON objects in the data (`{}`) or when DuckDB's schema inference detects incompatible types. Columns of this type can be ignored, since they are not typical for structured data, or they can be handled as VARCHAR columns, where the value is the JSON string.
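A minimal sketch of the second option, handling `JSON`-typed columns as strings while reading a table, might look as follows. The helper names are hypothetical. The sketch assumes that each line of `data.jsonl.gz` is a JSON object keyed by column name and that the type string in `schema.json` is spelled `JSON`. It only handles `JSON` columns at the top level of the schema.

```python
import gzip
import json


def json_typed_columns(schema):
    """Return the names of top-level columns whose type is JSON."""
    return [c["name"] for c in schema["columns"] if c["type"].upper() == "JSON"]


def read_rows(data_path, json_columns=()):
    """Yield one dict per line, re-serializing JSON-typed columns to strings."""
    with gzip.open(data_path, "rt", encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            for name in json_columns:
                if row.get(name) is not None:
                    # Keep the value as its JSON text, like a VARCHAR column.
                    row[name] = json.dumps(row[name])
            yield row
```

To ignore such columns instead, the same loop could delete the listed keys from each row rather than re-serializing them.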
Attribution
The data has been downloaded from various public sources and converted to a common format. We note that the real-world datasets from which RealNest is derived are released under varying open-source licenses and terms of usage.
The sources of the original datasets are:
1. Amazon Berkeley Objects (LICENSE)
   - J. Collins, S. Goel, K. Deng, A. Luthra, L. Xu, E. Gundogdu, X. Zhang, T. F. Yago Vicente, T. Dideriksen, H. Arora, M. Guillaumin, and J. Malik, "ABO: Dataset and benchmarks for real-world 3D object understanding," CVPR, 2022.
2. AWS Public Blockchain Data (LICENSE)
3. Data Lake as Code (ATTRIBUTIONS)
4. CORD-19 (LICENSE)
   - L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Eide, K. Funk, R. M. Kinney, Z. Liu, W. Merrill, P. Mooney, D. A. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. D. Wade, K. Wang, C. Wilhelm, B. Xie, D. A. Raymond, D. S. Weld, O. Etzioni, and S. Kohlmeier, "CORD-19: The COVID-19 open research dataset," ArXiv, 2020.
5. Daylight Map Distribution of OpenStreetMap (Open Database License (ODbL))
6. GitHub Archive
7. CERN Open Data
   - CMS collaboration (2017). SingleMu primary dataset in AOD format from Run of 2012 (/SingleMu/Run2012B-22Jan2013-v1/AOD). CERN Open Data Portal. DOI:10.7483/OPENDATA.CMS.IYVQ.1J0W
8. Overture Maps Foundation Open Map Data
   - Overture data is licensed under the Community Database License Agreement Permissive v2 (CDLA) unless derived from a source that requires publishing under a different license, such as data derived from OpenStreetMap, that constitutes a 'Derivative Database' (as defined under ODbL v1.0), which will be licensed under ODbL v1.0.
9. Twitter Stream Archive
By using the data from this website, you agree to the terms of use of the original data sources. The data source for each table is given below (Data source numbers refer to the list above):
# | Table / Folder Name | Data Source |
---|---|---|
1 | amazon-berkeley-objects-listings | [1] Amazon Berkeley Objects |
2 | aws-public-blockchain-btc-transactions | [2] AWS Public Blockchain Data |
3 | aws-public-blockchain-eth-logs | [2] AWS Public Blockchain Data |
4 | aws-roda-hcls-datalake-clinvar_summary_variants-gene_specific_summary | [3] Data Lake as Code |
5 | aws-roda-hcls-datalake-clinvar_summary_variants-hgvs4variation | [3] Data Lake as Code |
6 | aws-roda-hcls-datalake-clinvar_summary_variants-submission_summary | [3] Data Lake as Code |
7 | aws-roda-hcls-datalake-gnomad-sites | [3] Data Lake as Code |
8 | aws-roda-hcls-datalake-gtex_8-rnaseqcv1_1_9_gene_tpm | [3] Data Lake as Code |
9 | aws-roda-hcls-datalake-gtex_8-rsemv1_3_0_transcript_expected_count | [3] Data Lake as Code |
10 | aws-roda-hcls-datalake-gtex_8-rsemv1_3_0_transcript_tpm | [3] Data Lake as Code |
11 | aws-roda-hcls-datalake-opentargets_latest-aotfelasticsearch | [3] Data Lake as Code |
12 | aws-roda-hcls-datalake-opentargets_latest-cooccurrences | [3] Data Lake as Code |
13 | aws-roda-hcls-datalake-opentargets_latest-diseasetophenotype | [3] Data Lake as Code |
14 | aws-roda-hcls-datalake-opentargets_latest-epmccooccurrences | [3] Data Lake as Code |
15 | aws-roda-hcls-datalake-opentargets_latest-failedcooccurrences | [3] Data Lake as Code |
16 | aws-roda-hcls-datalake-opentargets_latest-failedmatches | [3] Data Lake as Code |
17 | aws-roda-hcls-datalake-opentargets_latest-interaction | [3] Data Lake as Code |
18 | aws-roda-hcls-datalake-opentargets_latest-interactionevidence | [3] Data Lake as Code |
19 | aws-roda-hcls-datalake-opentargets_latest-knowndrugsaggregated | [3] Data Lake as Code |
20 | aws-roda-hcls-datalake-opentargets_latest-matches | [3] Data Lake as Code |
21 | aws-roda-hcls-datalake-thousandgenomes_dragen-var_partby_samples | [3] Data Lake as Code |
22 | cord-19-document_parses | [4] CORD-19 |
23 | daylight-openstreetmap-osm_elements | [5] Daylight Map Distribution of OpenStreetMap |
24 | daylight-openstreetmap-osm_features | [5] Daylight Map Distribution of OpenStreetMap |
25 | gharchive-CommitCommentEvent | [6] GitHub Archive |
26 | gharchive-ForkEvent | [6] GitHub Archive |
27 | gharchive-GollumEvent | [6] GitHub Archive |
28 | gharchive-IssueCommentEvent | [6] GitHub Archive |
29 | gharchive-IssuesEvent | [6] GitHub Archive |
30 | gharchive-MemberEvent | [6] GitHub Archive |
31 | gharchive-PullRequestEvent | [6] GitHub Archive |
32 | gharchive-PullRequestReviewCommentEvent | [6] GitHub Archive |
33 | gharchive-PullRequestReviewEvent | [6] GitHub Archive |
34 | gharchive-PushEvent | [6] GitHub Archive |
35 | gharchive-ReleaseEvent | [6] GitHub Archive |
36 | hep-adl-ethz-Run2012B_SingleMu | [7] CERN Open Data |
37 | overturemaps-us-west-2-admins | [8] Overture Maps Foundation Open Map Data |
38 | overturemaps-us-west-2-base | [8] Overture Maps Foundation Open Map Data |
39 | overturemaps-us-west-2-buildings | [8] Overture Maps Foundation Open Map Data |
40 | overturemaps-us-west-2-divisions | [8] Overture Maps Foundation Open Map Data |
41 | overturemaps-us-west-2-places | [8] Overture Maps Foundation Open Map Data |
42 | overturemaps-us-west-2-transportation | [8] Overture Maps Foundation Open Map Data |
43 | twitter-stream-2023-01 | [9] Twitter Stream Archive |