The processed versions of the datasets used in our experiments are provided with the repository in data/source_tables. To stress-test each method, we use depleted versions of each table, which include only the minimum number of attributes that can still be used to perform prediction.
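
As a rough illustration of what "depleted" means here, the sketch below keeps only a chosen subset of columns from a table. It is a minimal example under stated assumptions, not the procedure used to build the tables in data/source_tables: the `deplete` helper, the toy rows, and the column names (`id`, `key`, `target`, `extra_1`, `extra_2`) are all hypothetical.

```python
from typing import Dict, Iterable, List


def deplete(rows: Iterable[Dict[str, object]], keep: Iterable[str]) -> List[Dict[str, object]]:
    """Return copies of the rows restricted to the columns listed in `keep`.

    Hypothetical helper: it only illustrates dropping every attribute
    that is not strictly needed for prediction.
    """
    keep = list(keep)
    return [{col: row[col] for col in keep} for row in rows]


# Toy table with made-up column names and values (assumptions for illustration).
full_table = [
    {"id": "a", "key": "x", "target": 1.0, "extra_1": 10, "extra_2": "foo"},
    {"id": "b", "key": "y", "target": 2.0, "extra_1": 20, "extra_2": "bar"},
]

# Keep only an identifier, a join key, and the prediction target.
depleted = deplete(full_table, keep=["id", "key", "target"])
```

In this toy case the depleted table retains the three listed columns, and the two `extra_*` attributes are removed from every row.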

We used the following sources for our datasets:

  • Company Employees source - CC0
  • Housing Prices source
  • Movie Ratings and Movie Revenue source - CC0
  • US Accidents source - CC BY-NC-SA 4.0
  • US Elections source - CC0

The Schools dataset is an internal dataset found in the Open Data US data lake. The US County Population dataset is an internal dataset found in YADL.

YADL is derived from YAGO3 (source) and shares its CC BY 4.0 license.

Datasets were pre-processed before they were used in our experiments. Pre-processing steps are reported in the preparation repository and the pipeline repository.