RealNest - Nested Data from Real-World Datasets

This readme is adapted from the RealNest repository. Visit the GitHub repository for the latest updates.

This website contains the static version of the RealNest dataset, a collection of nested data derived from real-world datasets. The dataset is designed to help computer science researchers benchmark and evaluate data systems and data formats supporting nested data types.

RealNest is provided as a script that downloads and generates the data. For convenience, and to facilitate standardized comparisons, this website also hosts two static datasets of 64 * 1024 (65,536) and 10 * 64 * 1024 (655,360) rows, respectively. Each dataset is a .tar file containing a folder for each table. These sample datasets were downloaded and generated by our script in mid-May 2024.
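As a rough illustration of the archive layout described above, the following Python sketch lists the table folders inside one of the static .tar files. It assumes that the table folders sit at the top level of the archive. The function name list_table_folders and the constant names are hypothetical and not part of RealNest.

```python
import tarfile
from pathlib import PurePosixPath

# Row counts of the two static datasets, as stated above.
SMALL_ROWS = 64 * 1024        # 65,536
LARGE_ROWS = 10 * 64 * 1024   # 655,360


def list_table_folders(tar_path):
    """Return the sorted names of the top-level folders in a .tar archive.

    Assumption: each top-level folder corresponds to one table.
    """
    folders = set()
    with tarfile.open(tar_path, "r") as archive:
        for member in archive.getmembers():
            # PurePosixPath drops a leading "./" if the archive uses one.
            parts = PurePosixPath(member.name).parts
            if not parts:
                continue
            # Count explicit directory entries and the parents of nested files.
            if member.isdir() or len(parts) > 1:
                folders.add(parts[0])
    return sorted(folders)
```

Calling list_table_folders on a downloaded archive should then return one entry per table, provided the top-level assumption holds for that archive.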

Because we provide the script that downloads the original datasets and processes them into a common format, one can regenerate the dataset from newer versions of the underlying data. One can also generate datasets larger than the static ones, since even the larger of the two static datasets contains only a small part of each original data source. This script is available at the RealNest GitHub repository.

Note that the static datasets hosted here remain under the same licenses and terms of use as the original datasets they are generated from. If you are the owner of an original dataset, and object to the inclusion of your data in the RealNest static datasets hosted here, please contact Peter Boncz (boncz@cwi.nl), and we will take action.

Please note that below we attempt to properly attribute the individual datasets as required by their various open-source licenses and terms of usage.

Dataset Structure

The dataset contains a directory for each table with the following files:

The schema might contain columns of type JSON. This happens for empty JSON objects in the data ({}) or when DuckDB's schema inference detects incompatible types. Such columns can be ignored, since they are not typical for structured data, or they can be handled as VARCHAR columns whose value is the JSON string.
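The two options for JSON-typed columns can be sketched in Python as follows. This is a minimal sketch, assuming the schema has already been loaded into a dict that maps column names to type strings; that representation and the function name resolve_json_columns are hypothetical and do not reflect the actual schema file format. Only top-level columns are considered, so JSON types nested inside other types are left untouched.

```python
def resolve_json_columns(schema, mode="varchar"):
    """Return a copy of `schema` with top-level JSON columns resolved.

    schema: dict mapping column name -> type string (hypothetical format).
    mode:   "varchar" keeps JSON columns but types them as VARCHAR,
            "drop" removes them.
    """
    if mode not in ("varchar", "drop"):
        raise ValueError("mode must be 'varchar' or 'drop'")

    resolved = {}
    for name, col_type in schema.items():
        if col_type.strip().upper() == "JSON":
            if mode == "drop":
                continue  # ignore the column entirely
            resolved[name] = "VARCHAR"  # the value is the JSON string
        else:
            resolved[name] = col_type
    return resolved
```

For example, a schema with an id column of type BIGINT and a payload column of type JSON keeps both columns in "varchar" mode (payload becomes VARCHAR) and keeps only id in "drop" mode.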

Attribution

The data has been downloaded from various public sources and converted to a common format. We note that the real-world datasets from which RealNest is derived are released under varying open-source licenses and terms of usage.

The sources of the original datasets are:

  1. Amazon Berkeley Objects (LICENSE)
    • J. Collins, S. Goel, K. Deng, A. Luthra, L. Xu, E. Gundogdu, X. Zhang, T. F. Yago Vicente, T. Dideriksen, H. Arora, M. Guillaumin, and J. Malik, "ABO: Dataset and benchmarks for real-world 3D object understanding," CVPR, 2022.
  2. AWS Public Blockchain Data (LICENSE)
  3. Data Lake as Code (ATTRIBUTIONS)
  4. CORD-19 (LICENSE)
    • L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Eide, K. Funk, R. M. Kinney, Z. Liu, W. Merrill, P. Mooney, D. A. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. D. Wade, K. Wang, C. Wilhelm, B. Xie, D. A. Raymond, D. S. Weld, O. Etzioni, and S. Kohlmeier, "CORD-19: The COVID-19 open research dataset," ArXiv, 2020.
  5. Daylight Map Distribution of OpenStreetMap (Open Database License (ODbL))
  6. GitHub Archive
  7. CERN Open Data
    • CMS collaboration (2017). SingleMu primary dataset in AOD format from Run of 2012 (/SingleMu/Run2012B-22Jan2013-v1/AOD). CERN Open Data Portal. DOI:10.7483/OPENDATA.CMS.IYVQ.1J0W
  8. Overture Maps Foundation Open Map Data
    • Overture data is licensed under the Community Database License Agreement Permissive v2 (CDLA) unless derived from a source that requires publishing under a different license, such as data derived from OpenStreetMap, that constitutes a 'Derivative Database' (as defined under ODbL v1.0), which will be licensed under ODbL v1.0.
  9. Twitter Stream Archive

By using the data from this website, you agree to the terms of use of the original data sources. The data source for each table is given below (the bracketed data source numbers refer to the list above):

# | Table / Folder Name | Data Source
1 | amazon-berkeley-objects-listings | [1] Amazon Berkeley Objects
2 | aws-public-blockchain-btc-transactions | [2] AWS Public Blockchain Data
3 | aws-public-blockchain-eth-logs | [2] AWS Public Blockchain Data
4 | aws-roda-hcls-datalake-clinvar_summary_variants-gene_specific_summary | [3] Data Lake as Code
5 | aws-roda-hcls-datalake-clinvar_summary_variants-hgvs4variation | [3] Data Lake as Code
6 | aws-roda-hcls-datalake-clinvar_summary_variants-submission_summary | [3] Data Lake as Code
7 | aws-roda-hcls-datalake-gnomad-sites | [3] Data Lake as Code
8 | aws-roda-hcls-datalake-gtex_8-rnaseqcv1_1_9_gene_tpm | [3] Data Lake as Code
9 | aws-roda-hcls-datalake-gtex_8-rsemv1_3_0_transcript_expected_count | [3] Data Lake as Code
10 | aws-roda-hcls-datalake-gtex_8-rsemv1_3_0_transcript_tpm | [3] Data Lake as Code
11 | aws-roda-hcls-datalake-opentargets_latest-aotfelasticsearch | [3] Data Lake as Code
12 | aws-roda-hcls-datalake-opentargets_latest-cooccurrences | [3] Data Lake as Code
13 | aws-roda-hcls-datalake-opentargets_latest-diseasetophenotype | [3] Data Lake as Code
14 | aws-roda-hcls-datalake-opentargets_latest-epmccooccurrences | [3] Data Lake as Code
15 | aws-roda-hcls-datalake-opentargets_latest-failedcooccurrences | [3] Data Lake as Code
16 | aws-roda-hcls-datalake-opentargets_latest-failedmatches | [3] Data Lake as Code
17 | aws-roda-hcls-datalake-opentargets_latest-interaction | [3] Data Lake as Code
18 | aws-roda-hcls-datalake-opentargets_latest-interactionevidence | [3] Data Lake as Code
19 | aws-roda-hcls-datalake-opentargets_latest-knowndrugsaggregated | [3] Data Lake as Code
20 | aws-roda-hcls-datalake-opentargets_latest-matches | [3] Data Lake as Code
21 | aws-roda-hcls-datalake-thousandgenomes_dragen-var_partby_samples | [3] Data Lake as Code
22 | cord-19-document_parses | [4] CORD-19
23 | daylight-openstreetmap-osm_elements | [5] Daylight Map Distribution of OpenStreetMap
24 | daylight-openstreetmap-osm_features | [5] Daylight Map Distribution of OpenStreetMap
25 | gharchive-CommitCommentEvent | [6] GitHub Archive
26 | gharchive-ForkEvent | [6] GitHub Archive
27 | gharchive-GollumEvent | [6] GitHub Archive
28 | gharchive-IssueCommentEvent | [6] GitHub Archive
29 | gharchive-IssuesEvent | [6] GitHub Archive
30 | gharchive-MemberEvent | [6] GitHub Archive
31 | gharchive-PullRequestEvent | [6] GitHub Archive
32 | gharchive-PullRequestReviewCommentEvent | [6] GitHub Archive
33 | gharchive-PullRequestReviewEvent | [6] GitHub Archive
34 | gharchive-PushEvent | [6] GitHub Archive
35 | gharchive-ReleaseEvent | [6] GitHub Archive
36 | hep-adl-ethz-Run2012B_SingleMu | [7] CERN Open Data
37 | overturemaps-us-west-2-admins | [8] Overture Maps Foundation Open Map Data
38 | overturemaps-us-west-2-base | [8] Overture Maps Foundation Open Map Data
39 | overturemaps-us-west-2-buildings | [8] Overture Maps Foundation Open Map Data
40 | overturemaps-us-west-2-divisions | [8] Overture Maps Foundation Open Map Data
41 | overturemaps-us-west-2-places | [8] Overture Maps Foundation Open Map Data
42 | overturemaps-us-west-2-transportation | [8] Overture Maps Foundation Open Map Data
43 | twitter-stream-2023-01 | [9] Twitter Stream Archive
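
Since every folder name in the table begins with a prefix that identifies its data source, the attribution for a table can be looked up programmatically. The following Python sketch shows one way to do this. The prefixes are read off the table above; the names SOURCE_BY_PREFIX and data_source_for are hypothetical and not part of RealNest.

```python
# Folder-name prefix -> (data source number, data source name),
# derived from the table above.
SOURCE_BY_PREFIX = {
    "amazon-berkeley-objects": (1, "Amazon Berkeley Objects"),
    "aws-public-blockchain": (2, "AWS Public Blockchain Data"),
    "aws-roda-hcls-datalake": (3, "Data Lake as Code"),
    "cord-19": (4, "CORD-19"),
    "daylight-openstreetmap": (5, "Daylight Map Distribution of OpenStreetMap"),
    "gharchive": (6, "GitHub Archive"),
    "hep-adl-ethz": (7, "CERN Open Data"),
    "overturemaps": (8, "Overture Maps Foundation Open Map Data"),
    "twitter-stream": (9, "Twitter Stream Archive"),
}


def data_source_for(folder_name):
    """Return (source number, source name) for a table folder name."""
    for prefix, source in SOURCE_BY_PREFIX.items():
        # Require the trailing "-" so a prefix only matches whole name parts.
        if folder_name.startswith(prefix + "-"):
            return source
    raise KeyError(f"no known data source for folder {folder_name!r}")
```

A lookup such as data_source_for("gharchive-PushEvent") then yields the source number and name to cite. Folder names generated from newer data may not match these prefixes, in which case the function raises KeyError.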