Downloading Data from Source Providers

ddf_utils provides a few classes in the ddf_utils.factory module to help download data from several data providers. Currently we support downloading files from Clio-infra, IHME GBD, ILOStat, OECD and WorldBank. Whenever possible, we use the provider's bulk download method.

General Interface

We created a general class for these data loaders. It has a metadata property and three methods: load_metadata, has_newer_source and bulk_download.

metadata holds all kinds of metadata provided by the data source, such as the dimension list and the available values in each dimension. load_metadata tries to load this metadata. has_newer_source tries to find out whether a newer version of the source data is available. bulk_download downloads the requested data files.
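The shape of this interface can be sketched with an abstract base class. This is only an illustration: the class names LoaderSketch and ToyLoader, the lazy loading inside the metadata property, and the method signatures are all assumptions, not the actual ddf_utils implementation. The real loaders also differ from each other in their bulk_download arguments, as the sessions in this document show.

```python
from abc import ABC, abstractmethod


class LoaderSketch(ABC):
    """Hypothetical stand-in for the general loader class (name is illustrative)."""

    def __init__(self):
        self._metadata = None  # filled on first access (an assumption)

    @property
    def metadata(self):
        # Load on first access so callers can simply read `loader.metadata`.
        if self._metadata is None:
            self._metadata = self.load_metadata()
        return self._metadata

    @abstractmethod
    def load_metadata(self):
        """Fetch and return the provider's metadata."""

    @abstractmethod
    def has_newer_source(self, date):
        """Return True if the provider has data newer than `date`."""

    @abstractmethod
    def bulk_download(self, out_dir, **kwargs):
        """Download the requested files into `out_dir`."""


class ToyLoader(LoaderSketch):
    """In-memory example; a real loader would make HTTP requests instead."""

    def load_metadata(self):
        return {"version": "2019-06-01", "datasets": ["a", "b"]}

    def has_newer_source(self, date):
        # ISO-formatted dates compare correctly as plain strings.
        return self.metadata["version"] > date

    def bulk_download(self, out_dir, **kwargs):
        datasets = kwargs.get("datasets", self.metadata["datasets"])
        # Only builds the target paths; nothing is downloaded here.
        return [f"{out_dir.rstrip('/')}/{name}.csv" for name in datasets]


loader = ToyLoader()
print(loader.has_newer_source("2019-01-01"))
print(loader.bulk_download("/tmp/", datasets=["a"]))
```

A shared base class like this lets calling code treat every provider the same way: load metadata, check for updates, then download.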

IHME GBD Loader

Note

The API we use in the IHME loader is not documented anywhere on the IHME website, so the loader may have problems.

The IHME GBD loader works in the same way as the GBD Results Tool. Just as one would run a search in the results tool, we need to select the context, country, age and so on. In the IHME loader, we provide a dictionary of query parameters to the bulk_download method (see the docstring of bulk_download for usage). The values must be the numeric IDs used by IHME, which we can look up in the metadata.

Example Usage:

In [1]: from ddf_utils.factory.ihme import IHMELoader

In [2]: l = IHMELoader()

In [3]: md = l.load_metadata()

In [4]: md.keys()
Out[4]: dict_keys(['age', 'cause', 'groups', 'location', 'measure', 'metric', 'rei', 'sequela', 'sex', 'year', 'year_range', 'version'])

In [5]: md['age'].head()
Out[5]:
    id      name short_name  sort  plot       type
1    1   Under 5         <5    22     0  aggregate
10  10  25 to 29         25    10     1   specific
11  11  30 to 34         30    11     1   specific
12  12  35 to 39         35    12     1   specific
13  13  40 to 44         40    13     1   specific

In [6]: l.bulk_download('/tmp/', context='le', version=376, year=[2017], email='your-email@mailer.com')
working on https://s3.healthdata.org/gbd-api-2017-public/xxxx
check status as http://ghdx.healthdata.org/gbd-results-tool/result/xxxx
available downloads:
http://s3.healthdata.org/gbd-api-2017-public/xxxx_files/IHME-GBD_2017_DATA-03cf30ab-1.zip
downloading http://s3.healthdata.org/gbd-api-2017-public/xxxx_files/IHME-GBD_2017_DATA-03cf30ab-1.zip to /tmp/xxxx/IHME-GBD_2017_DATA-xxxx-1.zip
1.13MB [00:01, 582kB/s]
Out[6]: ['03cf30ab']
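Since the query values must be numeric IDs, a small lookup helper can translate readable names into IDs before calling bulk_download. The following is a minimal sketch, not part of ddf_utils: ids_for_names is a hypothetical helper, and the metadata table is represented as a plain list of dicts so the example runs without pandas. With a real metadata DataFrame, md['age'].to_dict('records') would produce the same structure.

```python
def ids_for_names(records, names, key="name"):
    """Return the `id` of every record whose `key` field is in `names`.

    Raises KeyError if a requested name is missing, so typos fail early.
    """
    by_name = {rec[key]: rec["id"] for rec in records}
    missing = [n for n in names if n not in by_name]
    if missing:
        raise KeyError(f"not found in metadata: {missing}")
    return [by_name[n] for n in names]


# Toy rows copied from the md['age'].head() output in the session above.
age_records = [
    {"id": 1, "name": "Under 5", "short_name": "<5"},
    {"id": 10, "name": "25 to 29", "short_name": "25"},
    {"id": 11, "name": "30 to 34", "short_name": "30"},
]

age_ids = ids_for_names(age_records, ["25 to 29", "30 to 34"])
print(age_ids)  # [10, 11]
```

The resulting list could then be used as a query value in the same way as year=[2017] in the session above. The accepted parameter names are listed in the bulk_download docstring.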

ILOStat Loader

The ILO data loader uses the bulk download facility from ILO.

See the API doc for how to use this loader.

WorldBank Loader

The WorldBank loader can download all datasets listed in the data catalog in CSV (zip) format.

Example Usage:

In [1]: from ddf_utils.factory.worldbank import WorldBankLoader

In [2]: w = WorldBankLoader()

In [3]: md = w.load_metadata()

In [4]: md.head()
Out[4]:
                     accessoption acronym api                          apiaccessurl  ...
0  API, Bulk download, Query tool     WDI   1  http://data.worldbank.org/developers  ...
1  API, Bulk download, Query tool     ADI   1  http://data.worldbank.org/developers  ...
2  API, Bulk download, Query tool     GEM   1  http://data.worldbank.org/developers  ...
3                      Query tool     NaN   0                                   NaN  ...
4  API, Bulk download, Query tool    MDGs   1  http://data.worldbank.org/developers  ...
...

In [5]: w.bulk_download('MDGs', '/tmp/')
Out[5]: '/tmp/'
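Because the datasets arrive as zipped CSV files, a quick way to inspect a download is to read the header row of each CSV inside the archive. The sketch below uses only the standard library. The archive contents and the UTF-8 encoding are assumptions made for illustration; the real file names and layout depend on the dataset.

```python
import csv
import io
import zipfile


def csv_headers(zip_source):
    """Map each CSV member of a zip archive to its header row."""
    headers = {}
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            if not member.lower().endswith(".csv"):
                continue
            with archive.open(member) as raw:
                # utf-8-sig drops a byte-order mark if the file starts with one.
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                headers[member] = next(csv.reader(text), [])
    return headers


# Build a tiny archive in memory so the sketch runs without a download.
# The member names and rows are made up, not real WorldBank data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "ExampleData.csv", "Country Name,Country Code,2000\nAruba,ABW,1.0\n"
    )
    archive.writestr("readme.txt", "not a csv")

print(csv_headers(buffer))
```

For a file on disk, pass the path of the downloaded zip file to csv_headers instead of the in-memory buffer.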

OECD Loader

The OECD loader can download all datasets in OECD stats. We use the SDMX-JSON API, and each downloaded dataset is saved as a JSON file. Learn more about SDMX-JSON in the OECD API doc.

Example Usage:

In [1]: from ddf_utils.factory.oecd import OECDLoader

In [2]: o = OECDLoader()

In [3]: md = o.load_metadata()

In [4]: # metadata contains all available datasets.

In [5]: md.head()
Out[5]:
            id                                               name
0          QNA                        Quarterly National Accounts
1      PAT_IND                                  Patent indicators
2  SNA_TABLE11     11. Government expenditure by function (COFOG)
3    EO78_MAIN  Economic Outlook No 78 - December 2005 - Annua...
4        ANHRS    Average annual hours actually worked per worker

In [6]: o.bulk_download('/tmp/', 'EO78_MAIN')
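A downloaded SDMX-JSON file stores observation values under compact index keys that point into the dimension lists, so it usually needs to be decoded before analysis. The following is a rough sketch of that decoding, assuming the file uses the flat layout in which every dimension sits at the observation level. Files grouped by series keep their observations under a "series" object instead and need an extra loop. The function name flat_observations and the toy message are illustrative and not real OECD data.

```python
import json


def flat_observations(message):
    """Yield one dict per observation from a flat SDMX-JSON data message."""
    dimensions = message["structure"]["dimensions"]["observation"]
    for key, values in message["dataSets"][0]["observations"].items():
        # Each key part is a position in the matching dimension's `values` list.
        positions = [int(part) for part in key.split(":")]
        row = {
            dim["id"]: dim["values"][pos]["id"]
            for dim, pos in zip(dimensions, positions)
        }
        row["value"] = values[0]  # first array element is the observed value
        yield row


# Hand-written toy message (assumed flat layout), not real OECD data.
toy = json.loads("""
{
  "dataSets": [{"observations": {"0:0": [1.5], "1:0": [2.5]}}],
  "structure": {"dimensions": {"observation": [
    {"id": "LOCATION", "values": [{"id": "AUS"}, {"id": "AUT"}]},
    {"id": "TIME_PERIOD", "values": [{"id": "2005"}]}
  ]}}
}
""")

rows = list(flat_observations(toy))
print(rows)
```

For a file on disk, load it with json.load on an open file handle and pass the result to flat_observations.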

Clio-infra Loader

The Clio-infra loader parses the Clio-infra home page and bulk downloads all datasets or all country profiles.

Example Usage:

In [1]: from ddf_utils.factory.clio_infra import ClioInfraLoader

In [2]: c = ClioInfraLoader()

In [3]: md = c.load_metadata()

In [4]: md.head()
Out[4]:
                  name                                     url     type
0    Cattle per Capita    ../data/CattleperCapita_Compact.xlsx  dataset
1  Cropland per Capita  ../data/CroplandperCapita_Compact.xlsx  dataset
2     Goats per Capita     ../data/GoatsperCapita_Compact.xlsx  dataset
3   Pasture per Capita   ../data/PastureperCapita_Compact.xlsx  dataset
4      Pigs per Capita      ../data/PigsperCapita_Compact.xlsx  dataset

In [5]: md['type'].unique()
Out[5]: array(['dataset', 'country'], dtype=object)

In [6]: c.bulk_download('/tmp', data_type='dataset')
downloading https://clio-infra.eu/data/CattleperCapita_Compact.xlsx to /tmp/Cattle per Capita.xlsx
...
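The metadata lists relative URLs, while the download log shows absolute URLs and file names built from the dataset name. One plausible way to get from one to the other is sketched below. The BASE_PAGE value, the download_plan helper and the second toy row are assumptions made for illustration, not the loader's actual code.

```python
from posixpath import join as posix_join
from urllib.parse import urljoin

# Assumption: the page the relative links were scraped from.
BASE_PAGE = "https://clio-infra.eu/index.html"


def download_plan(rows, out_dir, data_type):
    """Return (absolute_url, target_path) pairs for rows of the given type."""
    plan = []
    for row in rows:
        if row["type"] != data_type:
            continue
        url = urljoin(BASE_PAGE, row["url"])  # resolves the leading "../"
        target = posix_join(out_dir, row["name"] + ".xlsx")
        plan.append((url, target))
    return plan


rows = [
    # First row copied from the md.head() output above.
    {"name": "Cattle per Capita",
     "url": "../data/CattleperCapita_Compact.xlsx", "type": "dataset"},
    # Made-up row, included only to show the type filter.
    {"name": "Example country", "url": "../docs/example.xlsx", "type": "country"},
]

for url, target in download_plan(rows, "/tmp", "dataset"):
    print(f"downloading {url} to {target}")
```

Separating the planning step from the actual download makes it easy to review which files would be fetched before any network request is made.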