Database¶
Specification¶
dlstats stores information from various statistical providers. The main goal is to keep up-to-date time series that are useful to the economist as well as their historical revisions.
Structure¶
The database structure is described in bson[1]_.
Journal¶
On top of MongoDB internal journaling mechanics, we keep a reference of all operations impacting the database. The method field stores the name of the method from dlstats.
journal : {
_id : MongoID,
method : str,
arguments : []
}
Categories¶
Generic schema¶
Time series are organized in a tree of categories. Each node stores a reference to the node’s children. It provides a simple and efficient solution to tree storage[2]_.
categories : {
_id : MongoID,
_id_journal : MongoID,
name : str,
children_id : [MongoID],
series_id : [MongoID]
}
[1] | http://www.bsonspec.org |
[2] | http://docs.mongodb.org/manual/tutorial/model-tree-structures/ |
Metadata¶
The metadata differs across statistical providers. We add the corresponding fields when needed.
Eurostat¶
For eurostat, we add a number of URLs for accessing the raw tsv, dft or sdmx files. Also, there is a field for the flowRef identifying the dataflow[3]_. We name codes the nomenclature of attributes that defines atomically the time series. Those codes are only provided for exploration of the database. In the program, a time series is of course identified by its unique id. A document from the codes collection contains all the series related to this code. Consequently, it is possible to query for time series using a set of constraint on codes; at the application level, the client would differentiate all the series_id sets to only get the relevant time series. We keep a pointer to the time series for better performances.
categories : {
_id : MongoID,
_id_journal : [MongoID],
name : str,
children_id : MongoID,
url_tsv : str,
url_dft : str,
url_sdmx : str,
flowRef : str,
codes : {
_id_journal : MongoID,
name : str,
values : {
key : str,
description : str,
series_id : [MongoID]
}
}
}
[3] | http://epp.eurostat.ec.europa.eu/portal/page/portal/sdmx_web_services/getting_started/rest_sdmx_2.1 |
Time series¶
The values are in a list. The position field in the revisions subcollection relates to the index of that list.
series : {
_id : MongoID,
_id_journal : MongoID,
name : str,
start_date : timestamp,
end_date : timestamp,
release_dates : [timestamp],
values : [float64],
frequency : str,
revisions : {
value : float64,
position : int,
release_date : timestamp
},
codes : {
name : str,
value : str
},
categories_id : MongoID
}
Implementation¶
MongoDB¶
Pros¶
- simple (from a developer perspective)
- large number of drivers
- no ORM headache
- painless sharding
- very large user base
- decent documentation
Cons¶
- immature (mongodb 1.x was scary, 2.x is stable)
- complex configuration, lot of fine-tuning required
- slow map/reduce
Impact on the structure¶
Growing documents impact performance and should be avoided. Preallocation can alleviate the issue. Alternatively, setting the padding to a higher value may help but comes with a memory cost.
Large number of keys are bad because MongoDB isn’t Python. Collections aren’t indexed with hash tables; if the collection has a large number of keys, mongoDB has to do a large number of comparisons to execute a query. In case of reading performance issues, normalization should improve the results.
HDF5¶
Better than all the other solutions as long as everything is loaded in RAM. Unfit for our job,