ddf_utils package

Submodules

ddf_utils.cli module

script for ddf dataset management tasks

ddf_utils.i18n module

i18n project management for Gapminder’s datasets.

The workflow is described in this google doc. The json part comes from discussion here

ddf_utils.i18n.merge_translations_csv(path, split_path='langsplit', lang_path='lang', overwrite=False)

merge all translated csv files and update datapackage.json

ddf_utils.i18n.merge_translations_json(path, split_path='langsplit', lang_path='lang', overwrite=False)

merge all translated json files and update datapackage.json.

ddf_utils.i18n.split_translations_csv(path, split_path='langsplit', exclude_concepts=None, overwrite=False)

split all string concepts and save them as csv files

ddf_utils.i18n.split_translations_json(path, split_path='langsplit', exclude_concepts=None, overwrite=False)

split all string concepts and save them as json files

Note

There is an issue with the dataframe.to_json() method for multi-index files (i.e. datapoints), which prevents the split files from being read back and merged. In this case merge_translations_json() will fail.
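
A minimal round-trip sketch using the csv variants, with a hypothetical dataset path and the default split_path/lang_path values from the signatures above:

>>> from ddf_utils.i18n import split_translations_csv, merge_translations_csv
>>> # split string concepts into the langsplit directory for translators
>>> split_translations_csv('./my-dataset')
>>> # ... once translated files are placed under the lang directory ...
>>> merge_translations_csv('./my-dataset')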

ddf_utils.package module

functions for handling DDF datapackage

ddf_utils.package.create_datapackage(path, gen_schema=True, progress_bar=False, **kwargs)

create datapackage.json based on the files in path.

If you want to set some attributes manually, you can pass them as keyword arguments to this function (see the sketch below).

Note

A DDFcsv datapackage MUST contain the fields name and resources.

If name is not provided, the base name of path will be used.

Parameters:
  • path (str) – the dataset path to create datapackage.json
  • gen_schema (bool) – whether to create DDFSchema in datapackage.json. Default is True
  • progress_bar (bool) – whether progress bar should be shown when generating ddfSchema.
  • kwargs (dict) – metadata to write into datapackage.json. According to spec, title, description, author and license SHOULD be fields in datapackage.json.
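
A usage sketch, assuming create_datapackage() returns the datapackage dictionary rather than writing it to disk; the dataset path and metadata values below are illustrative:

>>> import os
>>> from ddf_utils.package import create_datapackage
>>> from ddf_utils.io import dump_json
>>> dp = create_datapackage('./my-dataset', gen_schema=True,
...                         name='my-dataset', title='My Dataset',
...                         author='me', license='CC-BY-4.0')
>>> dump_json(os.path.join('./my-dataset', 'datapackage.json'), dp)
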
ddf_utils.package.get_datapackage(path, use_existing=True, update=False, progress_bar=False)

get the datapackage.json from a dataset path, creating one if it doesn't exist

Parameters:

path (str) – the dataset path

Keyword Arguments:
 
  • use_existing (bool) – whether or not to use the existing datapackage
  • update (bool) – if true, the resources and schema in the existing datapackage.json will be updated; otherwise the existing datapackage.json is returned as-is
  • progress_bar (bool) – whether progress bar should be shown when generating ddfSchema.
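
A sketch, assuming the function returns the datapackage as a dict; the dataset path is hypothetical:

>>> from ddf_utils.package import get_datapackage
>>> dp = get_datapackage('./my-dataset', use_existing=True, update=False)
>>> name = dp['name']
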
ddf_utils.package.get_ddf_files(path, root=None)

yield all csv files which are named following the DDF model standard.

Parameters:
  • path (str) – the path to check
  • root (str, optional) – if path is relative, root will be joined with each file path.
ddf_utils.package.is_datapackage(path)

check if a directory is a dataset directory

This function checks whether ddf--index.csv and datapackage.json exist to judge whether the directory is a dataset.

ddf_utils.io module

io functions for ddf files

ddf_utils.io.cleanup(path, how='ddf', exclude=None, use_default_exclude=True)

remove all ddf files in the given path

ddf_utils.io.csvs_to_ddf(files, out_path)

convert raw files to ddfcsv

Parameters:
  • files (list) – a list of file paths to build ddf csv
  • out_path (str) – the directory to put the ddf dataset
ddf_utils.io.download_csv(urls, out_path)

download csv files

ddf_utils.io.dump_json(path, obj)

convenient function to dump a dictionary object to json

ddf_utils.io.open_google_spreadsheet(docid)

read a Google spreadsheet into an Excel io object

ddf_utils.io.serve_concept()
ddf_utils.io.serve_datapoint(df_: pandas.DataFrame, out_dir, concept, copy=True, by: Iterable = None, formatter: Callable = format_float_digits, **kwargs)

save a pandas dataframe to a datapoint file. The file path of the csv will be out_dir/ddf--datapoints--$concept--$by.csv

Additional keyword arguments will be passed to the pd.DataFrame.to_csv() function.
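
A sketch with hypothetical datapoints; the exact output file name follows the pattern above:

>>> import pandas as pd
>>> from ddf_utils.io import serve_datapoint
>>> df = pd.DataFrame({'geo': ['swe', 'nor'],
...                    'year': [2020, 2020],
...                    'population': [10353442, 5379475]})
>>> # writes a ddf--datapoints--population--... csv file into ./ddf-out
>>> serve_datapoint(df, './ddf-out', 'population', by=['geo', 'year'])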

ddf_utils.io.serve_entity()

ddf_utils.patch module

functions working with patches

ddf_utils.patch.apply_patch(base, patch)

apply a patch created with daff. For more on the diff format, see: http://specs.frictionlessdata.io/tabular-diff-format/

Returns: the updated DataFrame.
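
A sketch, assuming base and patch are paths to csv files (the file names below are hypothetical):

>>> from ddf_utils.patch import apply_patch
>>> # base.csv is the original table, patch.csv is a daff-style tabular diff
>>> updated = apply_patch('base.csv', 'patch.csv')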

ddf_utils.qa module

QA functions.

ddf_utils.qa.avg_pct_chg(comp_df, indicator, on='geo')

return the average percentage change between old and new data

ddf_utils.qa.compare_with_func(dataset1, dataset2, fns=None, indicators=None, key=None, **kwargs)

compare two datasets using the given comparison functions

ddf_utils.qa.dropped_datapoints(comp_df, indicator, **kwargs)
ddf_utils.qa.max_change_index(comp_df, indicator, **kwargs)
ddf_utils.qa.max_pct_chg(comp_df, indicator, **kwargs)

return the maximum percentage change between old and new data

ddf_utils.qa.new_datapoints(comp_df, indicator, **kwargs)
ddf_utils.qa.nrmse(comp_df, indicator, **kwargs)
ddf_utils.qa.rmse(comp_df, indicator, **kwargs)
ddf_utils.qa.rval(comp_df, indicator, on='geo')

return r-value between old and new data

ddf_utils.str module

string functions for ddf files

ddf_utils.str.fix_time_range(s)

change a time range to the middle year of the range, e.g. fix_time_range('1980-90') = 1985
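
For example, following the description above:

>>> from ddf_utils.str import fix_time_range
>>> y = fix_time_range('1980-90')   # 1985, the middle year of the range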

ddf_utils.str.format_float_digits(number, digits=5, threshold=None, keep_decimal=False)

format the number as a string, limiting the maximum number of digits and removing trailing zeros.

ddf_utils.str.format_float_sigfig(number, sigfig=5, threshold=None)

format the number as a string, keeping the given number of significant digits.
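
Illustrative calls for both formatting functions with their defaults; the exact output strings depend on the digits/sigfig settings:

>>> from ddf_utils.str import format_float_digits, format_float_sigfig
>>> s1 = format_float_digits(123.4567890123)    # keep at most 5 digits (default)
>>> s2 = format_float_sigfig(0.000123456789)    # keep 5 significant digits (default)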

ddf_utils.str.parse_time_series(ser, engine='pandas')

try to parse datetimes from a Series of strings

See https://docs.google.com/document/d/1Cd2kEH5w3SRJYaDcu-M4dU5SY8No84T3g-QlNSW6pIE/edit#heading=h.oafc7aswaafy for more details on the supported formats.
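
A sketch with a hypothetical yearly series of strings:

>>> import pandas as pd
>>> from ddf_utils.str import parse_time_series
>>> ser = pd.Series(['1990', '1991', '1992'])
>>> parsed = parse_time_series(ser)   # engine defaults to 'pandas'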

ddf_utils.str.to_concept_id(s, sep='_')

convert a string to an alphanumeric concept-id format, using sep as the separator.
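
For example (the exact result depends on the implementation, but it is typically a lower-cased, sep-joined string):

>>> from ddf_utils.str import to_concept_id
>>> cid = to_concept_id('GDP per capita')   # typically 'gdp_per_capita'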

ddf_utils.transformer module

functions for common tasks on ddf datasets

ddf_utils.transformer.extract_concepts(dfs, base=None, join='full_outer')

extract concepts from a list of dataframes.

Parameters:

dfs (list[DataFrame]) – a list of dataframes to be extracted

Keyword Arguments:
 
  • base (DataFrame) – the base concept table to join
  • join ({'full_outer', 'ingredients_outer'}) – how to join the base dataframe. full_outer means the union of base and extracted concepts; ingredients_outer means keeping only the concepts found in the extracted dataframes
Returns:

the result concept table

Return type:

DataFrame
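
A sketch with hypothetical DDF tables (a datapoints table and an entity table):

>>> import pandas as pd
>>> from ddf_utils.transformer import extract_concepts
>>> dps = pd.DataFrame({'geo': ['swe'], 'year': [2020], 'population': [10353442]})
>>> ents = pd.DataFrame({'geo': ['swe'], 'name': ['Sweden']})
>>> concepts = extract_concepts([dps, ents])   # join='full_outer' by default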

ddf_utils.transformer.merge_keys(df, dictionary, target_column, merged='drop', agg_method='sum')

merge multiple keys in target_column into one, aggregating datapoint values with agg_method
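
A sketch, assuming the dictionary maps each new key to the list of existing keys it replaces; the country grouping below is purely illustrative:

>>> import pandas as pd
>>> from ddf_utils.transformer import merge_keys
>>> df = pd.DataFrame({'geo': ['swe', 'nor', 'dnk'],
...                    'year': [2020, 2020, 2020],
...                    'population': [10353442, 5379475, 5831404]})
>>> merged = merge_keys(df, {'scandinavia': ['swe', 'nor', 'dnk']},
...                     target_column='geo', agg_method='sum')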

ddf_utils.transformer.split_keys(df, target_column, dictionary, splited='drop')

split entities in target_column into multiple entities, based on a mapping dictionary

ddf_utils.transformer.translate_column(df, column, dictionary_type, dictionary, target_column=None, base_df=None, not_found='drop', ambiguity='prompt', ignore_case=False)

change values in a column based on a mapping dictionary.

The dictionary can be provided as a python dictionary, pandas dataframe or read from file.

Note

When translating with a base DataFrame, if an ambiguity is found in the data (for example, aligning a dataset with the entity id congo to a dataset containing both cod, the Democratic Republic of the Congo, and cog, the Republic of the Congo), the function will ask for user input to choose one of them or to skip it.

Parameters:
  • df (DataFrame) – The dataframe to be translated
  • column (str) – The column to be translated
  • dictionary_type (str) – The type of dictionary, choose from inline, file and dataframe
  • dictionary (str or dict) – The dictionary. Depending on dictionary_type, the value of this parameter should be: inline: a dict; file: a file path (str); dataframe: a dict that must have key and value keys. See the Examples section below.
  • target_column (str, optional) – The column to store the translated results. If this is None, the column set with column will be replaced.
  • base_df (DataFrame, optional) – When dictionary_type is dataframe, this option should be set
  • not_found (str) – What to do if a key in the dictionary is not found in the dataframe to be translated. Available options are drop, error, include
  • ambiguity (str) – What to do when there are ambiguities in the dictionary. Available options are prompt, skip, error

Examples

>>> df = pd.DataFrame([['geo', 'Geographical places'], ['time', 'Year']], columns=['concept', 'name'])
>>> df
  concept                 name
0     geo  Geographical places
1    time                 Year
>>> translate_column(df, 'concept', 'inline', {'geo': 'country', 'time': 'year'})
   concept                 name
0  country  Geographical places
1     year                 Year
>>> base_df = pd.DataFrame([['geo', 'country'], ['time', 'year']], columns=['concept', 'alternative_name'])
>>> base_df
  concept alternative_name
0     geo          country
1    time             year
>>> translate_column(df, 'concept', 'dataframe',
...                     {'key': 'concept', 'value': 'alternative_name'},
...                     target_column='new_name', base_df=base_df)
  concept                 name new_name
0     geo  Geographical places  country
1    time                 Year     year
ddf_utils.transformer.translate_header(df, dictionary, dictionary_type='inline')

change the headers of a dataframe based on a mapping dictionary.

Parameters:
  • df (DataFrame) – The dataframe to be translated
  • dictionary_type (str, default to inline) – The type of dictionary, choose from inline or file
  • dictionary (dict or str) – The mapping dictionary or path of mapping file
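
For example, with an inline dictionary:

>>> import pandas as pd
>>> from ddf_utils.transformer import translate_header
>>> df = pd.DataFrame({'geo': ['swe'], 'name': ['Sweden']})
>>> df2 = translate_header(df, {'geo': 'country'})   # rename column geo -> country
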
ddf_utils.transformer.trend_bridge(old_ser: pandas.Series, new_ser: pandas.Series, bridge_length: int) → pandas.Series

smoothing data between series.

To avoid getting artificial stairs in the data, we smooth between two series. Sometimes one source is systematically higher than another, and if we jump from one to the other in a single year, this looks like an actual change in the data.

Parameters:
  • old_ser (Series) – the data series from the old source
  • new_ser (Series) – the data series from the new source
  • bridge_length (int) – the length of bridge
Returns:

the bridged data (bridge_data)

Return type:

Series
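
A sketch with two hypothetical, overlapping yearly series:

>>> import pandas as pd
>>> from ddf_utils.transformer import trend_bridge
>>> old_ser = pd.Series([10., 11., 12., 13., 14.], index=range(1990, 1995))
>>> new_ser = pd.Series([14., 15., 16., 17., 18.], index=range(1993, 1998))
>>> bridged = trend_bridge(old_ser, new_ser, bridge_length=3)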