

dlstats stores data from various statistical providers. The main goal is to keep time series that are useful to economists up to date, together with their historical revisions.


The database structure is described in bson[1]_.


On top of MongoDB's internal journaling mechanism, we keep a record of every operation that modifies the database. The method field stores the name of the dlstats method that was called.

journal : {
              _id : MongoID,
              method : str,
              arguments : []
}
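
As an illustration of how such a journal entry could be produced, here is a minimal sketch using a decorator. The JOURNAL list is an in-memory stand-in for the journal collection, and journaled and update_series are hypothetical names, not part of dlstats.

```python
import functools

# In-memory stand-in for the journal collection (assumption for this sketch).
JOURNAL = []

def journaled(func):
    """Record the method name and its positional arguments, then run it."""
    @functools.wraps(func)
    def wrapper(*args):
        JOURNAL.append({
            "_id": len(JOURNAL),      # placeholder; MongoDB would assign the id
            "method": func.__name__,
            "arguments": list(args),
        })
        return func(*args)
    return wrapper

@journaled
def update_series(series_id, values):
    """Hypothetical dlstats method; only its name and arguments matter here."""
    return len(values)

update_series("gdp_fr", [1.0, 2.0])
print(JOURNAL[0]["method"])
```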


Generic schema

Time series are organized in a tree of categories. Each node stores a reference to the node’s children. It provides a simple and efficient solution to tree storage[2]_.

categories : {
              _id : MongoID,
              _id_journal : MongoID,
              name : str,
              children_id : [MongoID],
              series_id : [MongoID]
}
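
To show how child references are followed, here is a small sketch that walks the tree and collects every series stored under a node. The categories dictionary is an in-memory stand-in for the collection, with integers in place of MongoIDs; series_under is a hypothetical helper.

```python
# In-memory stand-in for the categories collection, keyed by _id (assumption).
categories = {
    1: {"_id": 1, "name": "root", "children_id": [2, 3], "series_id": []},
    2: {"_id": 2, "name": "national accounts", "children_id": [4], "series_id": [10]},
    3: {"_id": 3, "name": "prices", "children_id": [], "series_id": [11, 12]},
    4: {"_id": 4, "name": "gdp", "children_id": [], "series_id": [13]},
}

def series_under(categories, root_id):
    """Collect series_id values from a node and all of its descendants."""
    found = []
    stack = [root_id]
    while stack:
        node = categories[stack.pop()]
        found.extend(node["series_id"])
        # Push children in reverse so they are visited in stored order.
        stack.extend(reversed(node["children_id"]))
    return found

print(series_under(categories, 1))  # [10, 13, 11, 12]
```

Against a real database, each step down the tree would be one lookup by _id, which is the cost of this storage scheme.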


The metadata differs across statistical providers. We add the corresponding fields when needed.


For eurostat, we add a number of URLs for accessing the raw tsv, dft or sdmx files. There is also a field for the flowRef identifying the dataflow[3]_. We call codes the nomenclature of attributes that atomically defines a time series. Those codes are only provided for exploring the database; in the program, a time series is of course identified by its unique id. A document from the codes collection contains all the series related to this code. Consequently, it is possible to query for time series using a set of constraints on codes; at the application level, the client would intersect the series_id sets to keep only the relevant time series. We keep a pointer to the time series for better performance.

categories : {
              _id : MongoID,
              _id_journal : MongoID,
              name : str,
              children_id : [MongoID],
              url_tsv : str,
              url_dft : str,
              url_sdmx : str,
              flowRef : str,
              codes : {
                       _id_journal : MongoID,
                       name : str,
                       values : {
                                 key : str,
                                 description : str,
                                 series_id : [MongoID]
                       }
              }
}
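
The client-side query on codes can be sketched as follows, assuming the client combines the series_id sets by intersection and that codes and values are stored as lists of documents. The sample data and the series_matching helper are hypothetical.

```python
# Sample codes documents with integers in place of MongoIDs (assumption).
codes = [
    {"name": "geo", "values": [
        {"key": "FR", "description": "France", "series_id": [1, 2, 3]},
        {"key": "DE", "description": "Germany", "series_id": [4, 5]},
    ]},
    {"name": "unit", "values": [
        {"key": "EUR", "description": "Euro", "series_id": [1, 4]},
        {"key": "PCT", "description": "Percent", "series_id": [2, 3, 5]},
    ]},
]

def series_matching(codes, constraints):
    """Return the ids of the series that satisfy every code constraint."""
    matches = None
    for code in codes:
        wanted = constraints.get(code["name"])
        if wanted is None:
            continue  # no constraint on this code
        ids = set()
        for value in code["values"]:
            if value["key"] == wanted:
                ids.update(value["series_id"])
        # Keep only the series that also satisfied the previous constraints.
        matches = ids if matches is None else matches & ids
    return matches if matches is not None else set()

print(series_matching(codes, {"geo": "FR", "unit": "EUR"}))  # {1}
```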

Time series

The values are stored in a list. The position field in the revisions subdocument is the index, in that list, of the value being revised.

series : {
          _id : MongoID,
          _id_journal : MongoID,
          name : str,
          start_date : timestamp,
          end_date : timestamp,
          release_dates : [timestamp],
          values : [float64],
          frequency : str,
          revisions : {
                       value : float64,
                       position : int,
                       release_date : timestamp
          },
          codes : {
                   name : str,
                   value : str
          },
          categories_id : MongoID
}
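
The link between position and the values list can be illustrated with a short sketch. It rests on two assumptions that the schema does not settle: revisions is kept as a list of entries, and each entry stores the superseded value together with the release date of the revision. The revise helper is hypothetical.

```python
from datetime import datetime

series = {
    "name": "gdp_fr",
    "values": [100.0, 101.5, 103.0],
    "revisions": [],  # assumed to be a list of revision entries
}

def revise(series, position, new_value, release_date):
    """Replace values[position] and keep the superseded value in revisions."""
    series["revisions"].append({
        "value": series["values"][position],  # assumption: the old value is kept
        "position": position,                  # index into series["values"]
        "release_date": release_date,
    })
    series["values"][position] = new_value

revise(series, 1, 101.8, datetime(2014, 6, 30))
print(series["values"])  # [100.0, 101.8, 103.0]
```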




  • simple (from a developer perspective)
  • large number of drivers
  • no ORM headache
  • painless sharding
  • very large user base
  • decent documentation


  • immature (mongodb 1.x was scary, 2.x is stable)
  • complex configuration, a lot of fine-tuning required
  • slow map/reduce

Impact on the structure

Growing documents impact performance and should be avoided. Preallocation can alleviate the issue. Alternatively, setting the padding to a higher value may help but comes with a memory cost.
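
A minimal sketch of preallocation, assuming the number of periods is known in advance: the values list is created at full length with NaN placeholders, which are stored as doubles like real observations, and each slot is later overwritten in place with a $set update. The helper names are hypothetical.

```python
import math

def preallocated_series(name, n_periods):
    """Build a series document whose values list already has its final length."""
    return {"name": name, "values": [math.nan] * n_periods}

def set_value_update(position, value):
    """Build the update document that overwrites one slot in place."""
    return {"$set": {"values.%d" % position: value}}

doc = preallocated_series("gdp_fr", 4)
update = set_value_update(3, 1.5)
print(update)  # {'$set': {'values.3': 1.5}}

# With pymongo, the update could then be applied as:
#     collection.update_one({"_id": series_id}, update)
```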

A large number of keys is costly because MongoDB isn’t Python. Collections aren’t indexed with hash tables; if a collection has a large number of keys, MongoDB has to perform a large number of comparisons to execute a query. If read performance becomes an issue, normalization should improve the results.


Better than all the other solutions as long as everything is loaded in RAM. Unfit for our job,



  • supported by the Apache Software Foundation
  • excellent write performance