ddf_utils package¶
Subpackages¶
Submodules¶
ddf_utils.cli module¶
script for ddf dataset management tasks
ddf_utils.i18n module¶
i18n project management for Gapminder’s datasets.
The workflow is described in this google doc. The json part comes from discussion here
- ddf_utils.i18n.merge_translations_csv(path, split_path='langsplit', lang_path='lang', overwrite=False)¶
merge all translated csv files and update datapackage.json
- ddf_utils.i18n.merge_translations_json(path, split_path='langsplit', lang_path='lang', overwrite=False)¶
merge all translated json files and update datapackage.json.
- ddf_utils.i18n.split_translations_csv(path, split_path='langsplit', exclude_concepts=None, overwrite=False)¶
split all string concepts and save them as csv files
- ddf_utils.i18n.split_translations_json(path, split_path='langsplit', exclude_concepts=None, overwrite=False)¶
split all string concepts and save them as json files
Note
There is an issue with the DataFrame.to_json() method for MultiIndex files (i.e. datapoints), which means the split files cannot be read back and merged. In this case merge_translations_json() will fail.
ddf_utils.package module¶
functions for handling DDF datapackage
- ddf_utils.package.create_datapackage(path, gen_schema=True, progress_bar=False, **kwargs)¶
create datapackage.json based on the files in path.
If you want to set some attributes manually, you can pass them as keyword arguments to this function.
Note
A DDFcsv datapackage MUST contain the fields name and resources.
If name is not provided, the base name of path will be used.
Parameters: - path (str) – the dataset path to create datapackage.json
- gen_schema (bool) – whether to create DDFSchema in datapackage.json. Default is True
- progress_bar (bool) – whether progress bar should be shown when generating ddfSchema.
- kwargs (dict) – metadata to write into datapackage.json. According to the spec, title, description, author and license SHOULD be fields in datapackage.json.
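As the note above says, a DDFcsv datapackage must at least contain name and resources. Below is a stdlib-only sketch of the metadata dict this implies; the helper name is hypothetical, and the real create_datapackage also scans path to fill in resources and can generate a ddfSchema.

```python
import os


def minimal_datapackage(path, **kwargs):
    """Hypothetical sketch: the minimal dict a DDFcsv datapackage.json
    must contain, with extra metadata taken from keyword arguments the
    same way create_datapackage accepts them."""
    meta = {
        # `name` is required; fall back to the base name of the path
        'name': kwargs.pop('name', os.path.basename(os.path.normpath(path))),
        # `resources` lists the dataset's csv files; left empty in this sketch
        'resources': [],
    }
    meta.update(kwargs)  # e.g. title, description, author, license
    return meta
```

The real function derives `resources` from the DDF files it finds under path; this sketch only shows the shape of the result.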
- ddf_utils.package.get_datapackage(path, use_existing=True, update=False, progress_bar=False)¶
get the datapackage.json from a dataset path, creating one if it does not exist
Parameters: path (str) – the dataset path
Keyword Arguments: - use_existing (bool) – whether or not to use the existing datapackage
- update (bool) – if True, update the resources and schema in the existing datapackage.json; otherwise just return the existing datapackage.json
- progress_bar (bool) – whether progress bar should be shown when generating ddfSchema.
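The use_existing and update flags combine as described above. Here is a hypothetical stdlib-only sketch of that control flow; resetting resources stands in for the real regeneration step, and the function name is invented:

```python
import json
import os


def get_datapackage_sketch(path, use_existing=True, update=False):
    """Sketch of the decision logic: reuse an existing datapackage.json,
    optionally refresh it, or create a fresh one."""
    dp_file = os.path.join(path, 'datapackage.json')
    if use_existing and os.path.exists(dp_file):
        with open(dp_file) as f:
            dp = json.load(f)
        if update:
            # the real function would re-scan the dataset files and
            # regenerate resources and ddfSchema here
            dp['resources'] = []
        return dp
    # no existing datapackage, or we chose to ignore it: create a fresh one
    return {'name': os.path.basename(os.path.normpath(path)), 'resources': []}
```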
- ddf_utils.package.get_ddf_files(path, root=None)¶
yield all csv files which are named following the DDF model standard.
Parameters: - path (str) – the path to check
- root (str, optional) – if path is relative, append the root to all files.
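For context, DDFcsv file names start with ddf-- followed by the collection and its keys, e.g. ddf--concepts.csv or ddf--datapoints--population--by--geo--time.csv. A rough, hypothetical approximation of the name check (not the exact rule ddf_utils applies):

```python
import re

# Illustrative pattern only: files for the concepts, entities and
# datapoints collections, with optional --key segments, end in .csv.
DDF_FILE = re.compile(r'^ddf--(concepts|entities|datapoints)(--[\w-]+)*\.csv$')


def looks_like_ddf_file(filename):
    """Return True if the file name follows the sketched DDF pattern."""
    return DDF_FILE.match(filename) is not None
```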
- ddf_utils.package.is_datapackage(path)¶
check if a directory is a dataset directory
This function checks whether ddf--index.csv and datapackage.json exist to judge if the directory is a dataset.
ddf_utils.io module¶
io functions for ddf files
- ddf_utils.io.cleanup(path, how='ddf', exclude=None, use_default_exclude=True)¶
remove all ddf files in the given path
- ddf_utils.io.csvs_to_ddf(files, out_path)¶
convert raw files to ddfcsv
Parameters: - files (list) – a list of file paths to build ddf csv
- out_path (str) – the directory to put the ddf dataset
- ddf_utils.io.download_csv(urls, out_path)¶
download csv files
- ddf_utils.io.dump_json(path, obj)¶
convenience function to dump a dictionary object to json
- ddf_utils.io.open_google_spreadsheet(docid)¶
read a Google spreadsheet into an Excel io object
- ddf_utils.io.serve_concept()¶
- ddf_utils.io.serve_datapoint(df_: DataFrame, out_dir, concept, copy=True, by: Iterable = None, formatter: Callable = format_float_digits, **kwargs)¶
save a pandas dataframe to a datapoint file. the csv file path will be out_dir/ddf--datapoints--$concept--$by.csv
additional keyword arguments can be passed to the pd.DataFrame.to_csv() function.
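A small sketch of the naming scheme quoted above, assuming the $by placeholder expands to by-- followed by the key columns joined with --, which is the usual DDFcsv convention; check the files serve_datapoint actually writes for the authoritative form:

```python
import os


def datapoint_path(out_dir, concept, by):
    """Hypothetical helper: build the datapoint file path for a concept
    keyed by the `by` columns, following the sketched DDFcsv convention."""
    name = 'ddf--datapoints--{}--by--{}.csv'.format(concept, '--'.join(by))
    return os.path.join(out_dir, name)
```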
- ddf_utils.io.serve_entity()¶
ddf_utils.patch module¶
functions working with patches
- ddf_utils.patch.apply_patch(base, patch)¶
apply a patch created with daff. For more on the diff format, see http://specs.frictionlessdata.io/tabular-diff-format/
Returns: the updated DataFrame.
ddf_utils.qa module¶
QA functions.
- ddf_utils.qa.avg_pct_chg(comp_df, indicator, on='geo')¶
return the average percentage change between old and new data
- ddf_utils.qa.compare_with_func(dataset1, dataset2, fns=None, indicators=None, key=None, **kwargs)¶
compare 2 datasets with functions
- ddf_utils.qa.dropped_datapoints(comp_df, indicator, **kwargs)¶
- ddf_utils.qa.max_change_index(comp_df, indicator, **kwargs)¶
- ddf_utils.qa.max_pct_chg(comp_df, indicator, **kwargs)¶
return the maximum percentage change between old and new data
- ddf_utils.qa.new_datapoints(comp_df, indicator, **kwargs)¶
- ddf_utils.qa.nrmse(comp_df, indicator, **kwargs)¶
- ddf_utils.qa.rmse(comp_df, indicator, **kwargs)¶
- ddf_utils.qa.rval(comp_df, indicator, on='geo')¶
return the r-value between old and new data
ddf_utils.str module¶
string functions for ddf files
- ddf_utils.str.fix_time_range(s)¶
change a time range to the middle year of the range, e.g. fix_time_range('1980-90') = 1985
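An illustrative re-implementation of the behaviour described, assuming a short end year ('90') borrows its century from the start year; the real fix_time_range may accept more formats:

```python
def fix_time_range_sketch(s):
    """Hypothetical sketch: return the middle year of a range string,
    e.g. '1980-90' -> 1985; a plain year passes through as an int."""
    if '-' not in s:
        return int(s)
    start, end = s.split('-')
    # expand a short end year using the leading digits of the start year
    end = start[:len(start) - len(end)] + end
    return (int(start) + int(end)) // 2
```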
- ddf_utils.str.format_float_digits(number, digits=5, threshold=None, keep_decimal=False)¶
format the number as a string, limiting the maximum number of digits and removing trailing zeros.
- ddf_utils.str.format_float_sigfig(number, sigfig=5, threshold=None)¶
format the number as a string, keeping some significant digits.
- ddf_utils.str.parse_time_series(ser, engine='pandas')¶
try to parse datetimes from a Series of strings
see the document https://docs.google.com/document/d/1Cd2kEH5w3SRJYaDcu-M4dU5SY8No84T3g-QlNSW6pIE/edit#heading=h.oafc7aswaafy for more details on the supported formats
- ddf_utils.str.to_concept_id(s, sep='_')¶
convert a string to alphanumeric format.
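A rough approximation of what "alphanumeric format" means here, assuming lower-casing plus collapsing runs of other characters into sep; the real to_concept_id may differ in edge cases:

```python
import re


def to_concept_id_sketch(s, sep='_'):
    """Hypothetical sketch: lower-case the string and replace every run
    of non-alphanumeric characters with `sep`."""
    return re.sub(r'[^0-9a-z]+', sep, s.strip().lower()).strip(sep)
```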
ddf_utils.transformer module¶
functions for common tasks on ddf datasets
- ddf_utils.transformer.extract_concepts(dfs, base=None, join='full_outer')¶
extract concepts from a list of dataframes.
Parameters: dfs (list[DataFrame]) – a list of dataframes to extract concepts from
Keyword Arguments: - base (DataFrame) – the base concept table to join
- join ({'full_outer', 'ingredients_outer'}) – how to join the base dataframe. full_outer means the union of the base and the extracted concepts; ingredients_outer means only keep the concepts found in the extracted dataframes.
Returns: the result concept table
Return type: DataFrame
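A sketch of the two join modes, under the assumption that every column header in the input dataframes counts as a concept; the real extract_concepts also infers concept types and other columns:

```python
import pandas as pd


def extract_concepts_sketch(dfs, base=None, join='full_outer'):
    """Hypothetical sketch of the join semantics described above."""
    # treat every column header in the input dataframes as a concept
    extracted = pd.DataFrame(
        sorted({c for df in dfs for c in df.columns}), columns=['concept'])
    if base is None:
        return extracted
    if join == 'full_outer':
        # union of the base table and the extracted concepts
        return base.merge(extracted, on='concept', how='outer')
    # 'ingredients_outer': keep only concepts present in the extracted set
    return base.merge(extracted, on='concept', how='right')
```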
- ddf_utils.transformer.merge_keys(df, dictionary, target_column, merged='drop', agg_method='sum')¶
merge keys
- ddf_utils.transformer.split_keys(df, target_column, dictionary, splited='drop')¶
split entities
- ddf_utils.transformer.translate_column(df, column, dictionary_type, dictionary, target_column=None, base_df=None, not_found='drop', ambiguity='prompt', ignore_case=False)¶
change values in a column based on a mapping dictionary.
The dictionary can be provided as a python dictionary, a pandas dataframe, or read from a file.
Note
When translating with a base DataFrame, if an ambiguity is found in the data, for example a dataset with entity id congo to be aligned to a dataset with both cod (Democratic Republic of the Congo) and cog (Republic of the Congo), the function will ask for user input to choose one of them or to skip it.
Parameters: - df (DataFrame) – The dataframe to be translated
- column (str) – The column to be translated
- dictionary_type (str) – The type of dictionary, choose from inline, file and dataframe
- dictionary (str or dict) – The dictionary. Depending on dictionary_type, the value of this parameter should be: inline: a dict; file: the file path (str); dataframe: a dict which must have key and value keys. See the examples section.
- target_column (str, optional) – The column to store the translated results. If this is None, the column set with column will be replaced.
- base_df (DataFrame, optional) – When dictionary_type is dataframe, this option should be set
- not_found (str) – What to do if a key in the dictionary is not found in the dataframe to be translated. Available options are drop, error, include
- ambiguity (str) – What to do when there is ambiguity in the dictionary. Available options are prompt, skip, error
Examples
>>> df = pd.DataFrame([['geo', 'Geographical places'], ['time', 'Year']], columns=['concept', 'name'])
>>> df
  concept                 name
0     geo  Geographical places
1    time                 Year
>>> translate_column(df, 'concept', 'inline', {'geo': 'country', 'time': 'year'})
   concept                 name
0  country  Geographical places
1     year                 Year
>>> base_df = pd.DataFrame([['geo', 'country'], ['time', 'year']], columns=['concept', 'alternative_name'])
>>> base_df
  concept alternative_name
0     geo          country
1    time             year
>>> translate_column(df, 'concept', 'dataframe',
...                  {'key': 'concept', 'value': 'alternative_name'},
...                  target_column='new_name', base_df=base_df)
  concept                 name new_name
0     geo  Geographical places  country
1    time                 Year     year
- ddf_utils.transformer.translate_header(df, dictionary, dictionary_type='inline')¶
change the headers of a dataframe based on a mapping dictionary.
Parameters: - df (DataFrame) – The dataframe to be translated
- dictionary_type (str, default to inline) – The type of dictionary, choose from inline or file
- dictionary (dict or str) – The mapping dictionary or path of mapping file
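For the inline case this is essentially a column rename. A minimal sketch (the real translate_header can also read the mapping from a file when dictionary_type='file'):

```python
import pandas as pd


def translate_header_sketch(df, dictionary):
    """Hypothetical sketch of the inline case: rename columns with a
    plain mapping dict; headers missing from the dict are left untouched."""
    return df.rename(columns=dictionary)
```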
- ddf_utils.transformer.trend_bridge(old_ser: Series, new_ser: Series, bridge_length: int) → Series¶
smoothing data between series.
To avoid getting artificial stairs in the data, we smooth between two series. Sometimes one source is systematically higher than another, and if we jump from one to the other in a single year, this looks like an actual change in the data.
Parameters: - old_ser (Series) –
- new_ser (Series) –
- bridge_length (int) – the length of the bridge
Returns: bridge_data, the bridged data
Return type: Series
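An illustrative bridging scheme on plain lists, assuming the offset between the sources at the junction is faded out linearly over bridge_length points; the real trend_bridge operates on pandas Series and may align the two series differently:

```python
def trend_bridge_sketch(old, new, bridge_length):
    """Hypothetical sketch: start the bridged series at the last value of
    the old series and fade the source offset out over bridge_length
    points, after which it follows the new series exactly."""
    offset = old[-1] - new[0]  # the gap where the two series meet
    bridged = []
    for i, v in enumerate(new):
        # weight goes from 1 at the junction down to 0 after the bridge
        w = max(0.0, 1.0 - i / bridge_length)
        bridged.append(v + offset * w)
    return bridged
```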