Data Ingestion
The current version of our API allows datasets to be uploaded to our platform. However, we have been using a separate internal process to ingest data from public sources, which lets us transform, aggregate, and otherwise manipulate the data. We are now isolating those processes so that raw data can be "Adapted" into the final Dataset. As part of this, we also store every raw dataset version and keep track of every difference between them.
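To make the idea concrete, here is a minimal sketch of what such an adapt step could look like, written with pandas. The function name, the on-disk layout, and the specific transformation are illustrative assumptions, not our actual implementation:

```python
# Hypothetical sketch of an "adapt" step: keep the raw version, produce the Dataset.
import hashlib
from pathlib import Path

import pandas as pd


def adapt_raw_dataset(raw_csv: str, out_dir: str) -> pd.DataFrame:
    """Store the raw file as an immutable version, then return the adapted Dataset."""
    raw = pd.read_csv(raw_csv)

    # Keep every raw version: content-address the file so differences stay traceable.
    digest = hashlib.sha256(Path(raw_csv).read_bytes()).hexdigest()[:12]
    versions = Path(out_dir) / "raw_versions"
    versions.mkdir(parents=True, exist_ok=True)
    raw.to_csv(versions / f"{digest}.csv", index=False)

    # Example transformation/aggregation before the data becomes the final Dataset:
    # normalise column names and collapse duplicate rows.
    adapted = (
        raw.rename(columns=str.strip)
           .groupby(["Date", "Entity"], as_index=False)
           .sum(numeric_only=True)
    )
    return adapted
```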
The new data ingestion platform also supports data with multiple dimensions. In the past, we only accepted a Date dimension and a single, generic Entity dimension, which were shared across all Variables in the Dataset:
Date | Entity | Production | Imports | Exports |
---|---|---|---|---|
2020-01-01 | USA | 2123 | 23000 | 25000 |
2020-02-01 | Canada | 3223 | 24000 | 27000 |
2020-03-01 | Mexico | 5423 | 22000 | 21000 |
Our API now works with "Columns" rather than Variables, and each column can be marked as an Entity, allowing for composite keys, or in other words multiple dimensions:
Date | Country | State | Production | Imports | Exports |
---|---|---|---|---|---|
2020-01-01 | USA | Florida | 2123 | 23000 | 25000 |
2020-02-01 | USA | California | 3223 | 24000 | 27000 |
2020-03-01 | USA | Texas | 5423 | 22000 | 21000 |
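For the table above, a column specification could look roughly like the following. The field names (`name`, `role`, `type`) and the entity/value roles are hypothetical, meant only to illustrate how a composite key falls out of marking more than one column as an Entity:

```python
# Hypothetical column specification; not the real API schema.
columns = [
    {"name": "Date",       "role": "dimension", "type": "date"},
    {"name": "Country",    "role": "entity",    "type": "string"},
    {"name": "State",      "role": "entity",    "type": "string"},
    {"name": "Production", "role": "value",     "type": "number"},
    {"name": "Imports",    "role": "value",     "type": "number"},
    {"name": "Exports",    "role": "value",     "type": "number"},
]

# The composite key is simply the date plus every column marked as an entity.
key_columns = ["Date"] + [c["name"] for c in columns if c["role"] == "entity"]
assert key_columns == ["Date", "Country", "State"]
```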
Data Pipelines
Having better detail about the structure of each dataset allows for richer data manipulation scenarios, which we will be enabling as Pipelines:
A Pipeline is triggered whenever a participating Dataset is updated. Individual columns of each dataset are then extracted as defined by the pipeline; the data is transformed, aggregated, or combined; and new Datasets can be created. This may sound complex, but we are working to make it very simple to combine data from multiple datasets on our web portal.
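As a rough sketch of what a single pipeline step could do, the snippet below combines columns from two datasets that share the Date/Country/State key into a new derived Dataset. The dataset names, the Population column, and the function signature are assumptions for illustration; real pipelines will be configured on the web portal rather than written by hand:

```python
# Hypothetical pipeline step, expressed in pandas.
import pandas as pd


def run_pipeline(production: pd.DataFrame, population: pd.DataFrame) -> pd.DataFrame:
    """Combine columns from two datasets into a new derived Dataset."""
    # Extract only the columns the pipeline declares an interest in.
    prod = production[["Date", "Country", "State", "Production"]]
    pop = population[["Date", "Country", "State", "Population"]]

    # Combine on the shared composite key and derive a new variable.
    merged = prod.merge(pop, on=["Date", "Country", "State"])
    merged["ProductionPerCapita"] = merged["Production"] / merged["Population"]
    return merged
```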