Database

Specification

dlstats stores information from various statistical providers. The main goal is to keep up-to-date time series useful to economists, as well as their historical revisions.

Structure

The database structure is described in BSON [1]_.

Journal

On top of MongoDB's internal journaling mechanism, we keep a record of every operation impacting the database. The method field stores the name of the dlstats method that performed the operation.

journal : {
              _id : MongoID,
              method : str,
              arguments : []
             }
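
For instance, a fetcher could log its calls with a few lines of pymongo. This is only a sketch: the log_operation helper and the update_series call are hypothetical, not part of dlstats.

from pymongo import MongoClient

db = MongoClient()["dlstats"]

def log_operation(method, *arguments):
    # Record a dlstats method call in the journal and return its _id,
    # so that other documents can point back to it via _id_journal.
    entry = {"method": method, "arguments": list(arguments)}
    return db.journal.insert_one(entry).inserted_id

# Hypothetical call: journal the update of one provider dataset.
journal_id = log_operation("update_series", "eurostat", "nama_gdp_c")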

Categories

Generic schema

Time series are organized in a tree of categories. Each node stores references to its children, which is a simple and efficient way to store a tree in MongoDB [2]_.

categories : {
              _id : MongoID,
              _id_journal : MongoID,
              name : str,
              children_id : [MongoID],
              series_id : [MongoID]
             }
[1] http://www.bsonspec.org
[2] http://docs.mongodb.org/manual/tutorial/model-tree-structures/
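
With this child-references pattern, walking the tree is just a matter of following children_id. A minimal sketch, assuming the db handle from the journal example:

def get_children(db, category_id):
    # Fetch a node, then fetch all of its children in a single query.
    node = db.categories.find_one({"_id": category_id})
    if node is None:
        return []
    return list(db.categories.find({"_id": {"$in": node["children_id"]}}))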

Metadata

The metadata differs across statistical providers. We add the corresponding fields when needed.

Eurostat

For Eurostat, we add a number of URLs for accessing the raw tsv, dft or sdmx files, as well as a flowRef field identifying the dataflow [3]_. We call codes the nomenclature of attributes that atomically defines a time series. These codes are provided only for exploring the database; within the program, a time series is identified by its unique id. A document from the codes collection contains all the series related to that code. It is therefore possible to query for time series using a set of constraints on codes; at the application level, the client intersects the series_id sets to keep only the relevant time series. We keep a pointer to the time series for better performance.

categories : {
              _id : MongoID,
              _id_journal : MongoID,
              name : str,
              children_id : [MongoID],
              url_tsv : str,
              url_dft : str,
              url_sdmx : str,
              flowRef : str,
              codes : {
                       _id_journal : MongoID,
                       name : str,
                       values : {
                                 key : str,
                                 description : str,
                                 series_id : [MongoID]
                                }
                      }
             }
[3] http://epp.eurostat.ec.europa.eu/portal/page/portal/sdmx_web_services/getting_started/rest_sdmx_2.1
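
The client-side intersection described above could look like the following sketch. It assumes codes is embedded in the category document as a list of subdocuments, each with a name and a list of values; those structural details are assumptions, not a fixed part of this specification.

def find_series_by_codes(db, category_id, constraints):
    # constraints maps a code name to the required key, e.g. {"FREQ": "A"}.
    category = db.categories.find_one({"_id": category_id})
    selected = None
    for code in category.get("codes", []):
        wanted = constraints.get(code["name"])
        if wanted is None:
            continue
        # Collect the series carrying the wanted key for this code...
        ids = {series_id
               for value in code["values"] if value["key"] == wanted
               for series_id in value["series_id"]}
        # ...and intersect across all constrained codes.
        selected = ids if selected is None else selected & ids
    return selected if selected is not None else set()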

Time series

The values are stored in a list. The position field in a revisions subdocument refers to an index in that list.

series : {
          _id : MongoID,
          _id_journal : MongoID,
          name : str,
          start_date : timestamp,
          end_date : timestamp,
          release_dates : [timestamp],
          values : [float64],
          frequency : str,
          revisions : {
                       value : float64,
                       position : int,
                       release_date : timestamp
                      },
          codes : {
                   name : str,
                   value : str
                  },
          categories_id : MongoID
         }
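
To make the position convention concrete, the sketch below rebuilds the values as they stood at a given release date. It assumes that values holds the initial release, that revisions is stored as a list of such subdocuments, and that later revisions overwrite earlier ones; these assumptions go beyond what the schema specifies.

def values_as_of(series_doc, cutoff):
    # Start from the initial release and replay every revision
    # published on or before the cutoff timestamp, in release order.
    values = list(series_doc["values"])
    revisions = sorted(series_doc.get("revisions", []),
                       key=lambda rev: rev["release_date"])
    for rev in revisions:
        if rev["release_date"] <= cutoff:
            values[rev["position"]] = rev["value"]
    return values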

Implementation

MongoDB

Pros

  • simple (from a developer perspective)
  • large number of drivers
  • no ORM headache
  • painless sharding
  • very large user base
  • decent documentation

Cons

  • immature (MongoDB 1.x was scary, 2.x is stable)
  • complex configuration, a lot of fine-tuning required
  • slow map/reduce

Impact on the structure

Growing documents force MongoDB to move them on disk, which hurts performance and should be avoided. Preallocation can alleviate the issue. Alternatively, setting the padding factor to a higher value may help, but it comes with a memory cost.
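
A hedged sketch of preallocation, assuming the db handle from the journal example: if we can bound the number of observations, inserting a full-size values list up front prevents the document from growing later. The bound of 600 observations is an arbitrary example.

placeholder = {
    "name": "example series",
    "frequency": "M",
    # Room for 50 years of monthly data; individual positions are
    # filled in later with $set, so the document never grows.
    "values": [float("nan")] * 600,
    "release_dates": [],
    "revisions": [],
}
db.series.insert_one(placeholder)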

A large number of keys is bad because MongoDB isn't Python: document keys aren't indexed with hash tables, so querying documents with many keys forces MongoDB to perform a large number of comparisons. In case of read performance issues, normalization should improve the results.

HDF5

Better than all the other solutions as long as everything fits in RAM, which makes it unfit for our job.

Cassandra

Pros

  • supported by the Apache Software Foundation
  • excellent write performance