Use ddf_utils for ETL tasks¶
Create DDF dataset from non-DDF data files¶
If you want to learn how to compose DDF datasets, read Recipe Cookbook (draft). If you are not familiar with the DDF model, please refer to the DDF data model document.
ddf_utils provides data classes and methods for most parts of the DDF model: concept/entity/datapoint/synonym (and more to come). Together with the other utility functions, we hope to provide a toolbox for users to easily create a DDF dataset. To see it in action, check this notebook for a demo.
In general, we are building scripts to transform data from one format to another, so the usual guidelines for programming and data ETL apply here. You should care about the correctness of the scripts and beware of bad data.
Create DDF dataset from CSV file¶
When you have a clean CSV data file, you can use the ddf from_csv command to create a DDF dataset. Currently only one format is supported:
Primary Keys as well as all indicators should be in columns.
ddf from_csv -i input_file_or_path -o out_path
Where -i sets the input file or path; when it is a path, all files in that path will be processed. -o sets the path where the generated DDF dataset will be written. If -i is not set, it defaults to the current path.
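As an illustration of the supported layout, the following minimal Python sketch writes a CSV in which the primary keys and the indicators are all columns. It uses only the standard library and is not part of ddf_utils; the column names (geo, year, population, gdp), the helper names and the values are made up for this example.

```python
import csv
import io

# Hypothetical sample data with made-up values: geo and year are the
# primary key columns, population and gdp are indicator columns.
rows = [
    {"geo": "swe", "year": 2000, "population": 100, "gdp": 1.5},
    {"geo": "swe", "year": 2001, "population": 110, "gdp": 1.6},
    {"geo": "nor", "year": 2000, "population": 50, "gdp": 2.0},
]


def find_duplicate_keys(rows, key_columns):
    """Return the primary key tuples that appear more than once."""
    seen = set()
    duplicates = []
    for row in rows:
        key = tuple(row[name] for name in key_columns)
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates


def write_wide_csv(rows, key_columns, indicator_columns, out):
    """Write rows with primary keys first, then one column per indicator."""
    fieldnames = list(key_columns) + list(indicator_columns)
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow({name: row[name] for name in fieldnames})


key_columns = ["geo", "year"]
indicator_columns = ["population", "gdp"]

# Guard against bad data before writing anything.
duplicates = find_duplicate_keys(rows, key_columns)
if duplicates:
    raise ValueError(f"duplicated primary keys: {duplicates}")

buffer = io.StringIO()
write_wide_csv(rows, key_columns, indicator_columns, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing the same text to a file instead of the in-memory buffer should give an input in the layout that the from_csv command describes.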
Compare 2 datasets¶
The ddf diff command compares 2 datasets and returns useful statistics for each indicator.
ddf diff -i indicator1 -i indicator2 dataset1 dataset2
For now this command supports the following statistics:
rval: the standard correlation coefficient
avg_pct_chg: average percentage changes
max_pct_chg: the maximum of change in percentage
rmse: the root mean squared error
nrmse: equals rmse/(max - min) where max and min are calculated with data in dataset2
new_datapoints: datapoints in dataset1 but not dataset2
dropped_datapoints: datapoints in dataset2 but not dataset1
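To make the definitions above concrete, here is a rough Python sketch of how rmse, nrmse, new_datapoints and dropped_datapoints could be computed for one indicator. It is not the ddf_utils implementation: the function name compare_indicator, the dict-based data layout and the sample values are assumptions for illustration. It also assumes that rmse is computed over the datapoints present in both datasets and that max and min are taken over all values in dataset2.

```python
import math


def compare_indicator(dataset1, dataset2):
    """Compare two {primary_key_tuple: value} mappings for one indicator."""
    keys1, keys2 = set(dataset1), set(dataset2)
    shared = keys1 & keys2

    rmse = None
    nrmse = None
    if shared:
        # Root mean squared error over datapoints present in both datasets.
        squared_errors = [(dataset1[k] - dataset2[k]) ** 2 for k in shared]
        rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
        # Assumption: max and min are taken over all values in dataset2.
        value_range = max(dataset2.values()) - min(dataset2.values())
        if value_range != 0:
            nrmse = rmse / value_range

    return {
        "rmse": rmse,
        "nrmse": nrmse,
        "new_datapoints": sorted(keys1 - keys2),      # in dataset1 only
        "dropped_datapoints": sorted(keys2 - keys1),  # in dataset2 only
    }


# Made-up sample values keyed by (geo, year).
dataset1 = {("swe", 2000): 11.0, ("swe", 2001): 17.0, ("nor", 2000): 5.0}
dataset2 = {("swe", 2000): 10.0, ("swe", 2001): 10.0, ("fin", 2000): 20.0}

result = compare_indicator(dataset1, dataset2)
print(result)
```

With these sample values the shared datapoints differ by 1 and 7, so rmse is 5.0, and dataset2 spans 10 to 20, so nrmse is 0.5.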
If no indicator is specified in the command, rmse and nrmse will be calculated.
Note
Please note that rval and avg_pct_chg assume there is a geo column in datapoints, which is not very useful for now. We will improve this later.
You can also compare 2 commits of a git repository. In this case you should run:
cd dataset_path
ddf diff --git -o path/to/export/to -i indicator head_ref base_ref
Because the script needs to export different commits from the git repo, you should provide the -o flag to set the path where the exported datasets will be put.