Retrieve, Merge, Predict

The processed versions of the datasets used in our experiments are provided with the repository in data/source_tables. To stress-test each method, we use depleted versions of each table, which include only the minimum number of attributes that can still be used to perform prediction.
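As a rough illustration of what a depleted table looks like, the sketch below keeps only a join-key column and a prediction target and drops every other attribute. The helper `deplete_table`, the toy rows, and the column names are all hypothetical and are not taken from the repository; the actual pre-processing may differ.

```python
from typing import Dict, Iterable, List


def deplete_table(rows: List[Dict], keep: Iterable[str]) -> List[Dict]:
    """Return a copy of `rows` that retains only the columns named in `keep`.

    Hypothetical helper for illustration; not part of the repository.
    """
    keep = list(keep)
    return [{col: row[col] for col in keep} for row in rows]


# Toy rows; the column names are invented for this example.
full_table = [
    {"company": "Acme", "city": "Austin", "sector": "Retail", "employees": 120},
    {"company": "Globex", "city": "Boston", "sector": "Energy", "employees": 450},
]

# Keep one join-key column and the prediction target, drop everything else.
depleted = deplete_table(full_table, keep=["company", "employees"])
```

The original rows are left untouched, so the full and depleted versions of a table can be compared side by side.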

We used the following sources for our datasets:

Company Employees source - CC0
Housing Prices source
Movie Ratings and Movie Revenue source - CC0
US Accidents source - CC BY-NC-SA 4.0
US Elections source - CC0

The Schools dataset is an internal dataset found in the Open Data US data lake. The US County Population dataset is an internal dataset found in YADL.

YADL is derived from YAGO3 (source) and shares its CC BY 4.0 license.

Datasets were pre-processed before they were used in our experiments. Pre-processing steps are reported in the preparation repository and the pipeline repository.