Use ddf_utils for ETL tasks

Create DDF dataset from non-DDF data files

If you want to learn how to compose DDF datasets, read Recipe Cookbook (draft). If you are not familiar with the DDF model, please refer to the DDF data model document.

ddf_utils provides data classes and methods for most parts of the DDF model: concept/entity/datapoint/synonym (with more to come). Together with the other utility functions, we hope to provide a toolbox for users to easily create a DDF dataset. To see it in action, check this notebook for a demo.

In general, we are building scripts to transform data from one format to another, so the usual guidelines for programming and data ETL apply here. You should care about the correctness of the scripts and beware of bad data.

Create DDF dataset from CSV file

When you have a clean CSV data file, you can use the ddf from_csv command to create a DDF dataset. Currently only one format is supported: the primary keys as well as all indicators must be in columns.

ddf from_csv -i input_file_or_path -o out_path

Here -i sets the input file or path; when it is a path, all files in that path will be processed. -o sets the path where the generated DDF dataset will be written. If -i is not set, it defaults to the current path.
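To make the supported layout concrete, here is a minimal sketch of a CSV in which the primary keys and the indicators each occupy their own column. The column names (geo, time, population, gdp_per_capita) and all values are made up for illustration; they are assumptions, not names required by ddf_utils.

```python
import csv
import io

# Hypothetical columns: "geo" and "time" play the role of primary keys,
# "population" and "gdp_per_capita" are indicators. All values are dummies.
FIELDNAMES = ["geo", "time", "population", "gdp_per_capita"]
ROWS = [
    {"geo": "aaa", "time": 2000, "population": 100, "gdp_per_capita": 10},
    {"geo": "aaa", "time": 2001, "population": 110, "gdp_per_capita": 11},
    {"geo": "bbb", "time": 2000, "population": 200, "gdp_per_capita": 20},
]


def rows_to_csv_text(rows, fieldnames):
    """Serialize rows into CSV text with one column per key or indicator."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv_text(ROWS, FIELDNAMES)
print(csv_text)
```

Saving this text to a file would give an input of the shape described above, which could then be passed to the command with -i.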

Compare 2 datasets

The ddf diff command compares 2 datasets and returns useful statistics for each indicator.

ddf diff -i indicator1 -i indicator2 dataset1 dataset2

For now this command supports the following statistics:

  • rval: the standard correlation coefficient
  • avg_pct_chg: the average percentage change
  • max_pct_chg: the maximum percentage change
  • rmse: the root mean squared error
  • nrmse: equals rmse/(max - min) where max and min are calculated with data in dataset2
  • new_datapoints: datapoints in dataset1 but not dataset2
  • dropped_datapoints: datapoints in dataset2 but not dataset1

If no indicator is specified in the command, rmse and nrmse will be calculated.
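The definitions of rmse, nrmse, new_datapoints and dropped_datapoints above can be sketched in a few lines of plain Python. This is an illustration of the formulas only, not the ddf_utils implementation. It assumes each dataset is a mapping from a primary key tuple to a value, that rmse is computed over the keys present in both datasets, that max and min are taken over all values in dataset2, and that the datapoint statistics are reported as counts. The function name compare_indicator is hypothetical.

```python
import math


def compare_indicator(data1, data2):
    """Compare two {primary_key: value} mappings for one indicator.

    Illustrative only: the key layout and the use of counts are assumptions.
    """
    keys1, keys2 = set(data1), set(data2)
    shared = keys1 & keys2

    result = {
        # datapoints in dataset1 but not dataset2
        "new_datapoints": len(keys1 - keys2),
        # datapoints in dataset2 but not dataset1
        "dropped_datapoints": len(keys2 - keys1),
        "rmse": None,
        "nrmse": None,
    }
    if not shared:
        return result

    squared_errors = [(data1[k] - data2[k]) ** 2 for k in shared]
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    result["rmse"] = rmse

    # nrmse = rmse / (max - min), with max and min taken from dataset2
    values2 = list(data2.values())
    spread = max(values2) - min(values2)
    if spread != 0:
        result["nrmse"] = rmse / spread
    return result


# Dummy values keyed by (geo, time)
dataset1 = {("aaa", 2000): 10.0, ("aaa", 2001): 12.0, ("bbb", 2000): 7.0}
dataset2 = {("aaa", 2000): 10.0, ("aaa", 2001): 14.0, ("ccc", 2000): 4.0}

stats = compare_indicator(dataset1, dataset2)
print(stats)
```

With these dummy values, two keys are shared, one datapoint exists only in dataset1 and one only in dataset2. Returning None when nothing is shared, or when all values in dataset2 are equal, is a design choice of this sketch to avoid dividing by zero.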

Note

Please note that rval and avg_pct_chg assume there is a geo column in the datapoints, which limits their usefulness for now. We will improve this later.

You can also compare 2 commits of a git folder. In this case you should run

cd dataset_path

ddf diff --git -o path/to/export/to -i indicator head_ref base_ref

Because the script needs to export the different commits from the git repo, you should provide the -o flag to set the path where the exported datasets will be put.