October 25, 2012
Version 0.5 released
New features
- added a way to import several files for the same entity. A few comments are
in order:
- Each file can have different data points. For example, if you have
historical data for some fields going back to 1950 for some individuals, and
for other fields going back to only 2000, the import mechanism will merge
those data sets.
- It can also optionally fill missing data points. Currently it only
supports filling with the “previous value” (the value the individual had
(if any) for that field in a previous period). In the future, we will add
more ways to fill those by interpolating existing data. Note that
currently only data points which are entirely missing are filled, not
those which are set to the special value corresponding to “missing” for the
field type (i.e. False for booleans, -1 for integers and “nan” for floats).
This will probably change in the future.
- As a consequence of this new feature, it is now possible to import liam1
files using the “normal” import file syntax.
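The merge and "previous value" fill described above can be pictured with a small sketch. This is not LIAM2's import code: the function names, the (period, id) keying and the sample data are all assumptions made for illustration, using only plain dictionaries.

```python
def merge_sources(sources):
    """Merge several {(period, id): {field: value}} mappings into one."""
    merged = {}
    for source in sources:
        for key, fields in source.items():
            merged.setdefault(key, {}).update(fields)
    return merged


def fill_with_previous(merged, field_names):
    """Fill entirely missing data points with the individual's previous value."""
    last_seen = {}  # (id, field) -> most recent known value
    for period, ident in sorted(merged):
        row = merged[(period, ident)]
        for name in field_names:
            if name in row:
                last_seen[(ident, name)] = row[name]
            elif (ident, name) in last_seen:
                row[name] = last_seen[(ident, name)]
    return merged


# Two hypothetical files for the same entity, covering different periods.
ages = {(1950, 1): {"age": 30}, (2000, 1): {"age": 80}}
work = {(2000, 1): {"working": False}, (2001, 1): {"working": True}}

merged = fill_with_previous(merge_sources([ages, work]), ["age", "working"])
```

In this sketch the 1950 row keeps no "working" value because no earlier value exists for that individual, while the 2001 row inherits "age" from 2000.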
- added an optional “default_entity” key to the “simulation” block of
simulation files, so that the interactive console starts directly in that
entity.
- added a function to compute the Nth percentile: grppercentile(expr,
percent[, filter]).
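As a rough illustration of what an Nth percentile with an optional filter computes, here is a pure-Python sketch. The function name, the `keep` parameter and the use of linear interpolation between neighbouring values are assumptions; the interpolation rule LIAM2 actually applies is not stated here.

```python
import math


def percentile(values, percent, keep=None):
    """Nth percentile of values, optionally restricted to entries where keep is True."""
    if keep is not None:
        values = [v for v, k in zip(values, keep) if k]
    if not values:
        return math.nan
    ordered = sorted(values)
    # Fractional position of the requested percentile in the sorted list.
    rank = (len(ordered) - 1) * percent / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    weight = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


ages = [20, 30, 40, 50, 60]
median_all = percentile(ages, 50)
median_young = percentile(ages, 50, keep=[True, True, True, False, False])
```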
- implemented an optional filter argument for many functions. The behaviour is
different depending on the kind of function:
- for functions that change an existing variable (clip() and round()), the
value of individuals who do not satisfy the filter is not modified.
- for functions which create a new variable (uniform(), normal() and
randint()), the value for individuals who do not satisfy the filter is the
missing value corresponding to the type of the column (-1 for randint(),
nan for uniform() and normal()).
- for aggregate functions (grpmin(), grpmax(), grpstd(), grpmedian() and
grppercentile()), the aggregate is computed over the individuals who
satisfy the filter.
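The three behaviours can be compared side by side in a short sketch. The helper names below are hypothetical and operate on plain lists rather than on LIAM2 columns; the sketch assumes the filter is True for the individuals the function should apply to.

```python
import math
import random


def clip_filtered(values, low, high, keep):
    """Modify an existing variable: entries outside the filter are left untouched."""
    return [min(max(v, low), high) if k else v for v, k in zip(values, keep)]


def uniform_filtered(keep, rng):
    """Create a new variable: entries outside the filter get the missing value (nan)."""
    return [rng.random() if k else math.nan for k in keep]


def min_filtered(values, keep):
    """Aggregate: computed only over the entries that satisfy the filter."""
    return min(v for v, k in zip(values, keep) if k)


incomes = [-5, 50, 500]
clipped = clip_filtered(incomes, 0, 100, [True, True, False])
drawn = uniform_filtered([True, False], random.Random(0))
lowest = min_filtered(incomes, [False, True, True])
```

Here the third income stays at 500 because it is outside the filter, the second draw is nan, and the minimum ignores the first income.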
- added new functions for testing, assertTrue and assertEqual:
- assertTrue(expr) evaluates its expression argument and checks that it is
True.
- assertEqual(expr1, expr2) evaluates its two expressions and checks that
they are equal.
- The behaviour when an assertion fails is configurable through the
“assertions” option in the “simulation” block. This option can take three
values:
- “raise”: interrupt the simulation (this is the default).
- “warn”: display a warning message.
- “skip”: do not run the assertion at all.
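The difference between the three modes can be sketched as follows. This is an illustration of the semantics only, not LIAM2's implementation: `run_assertion` and its arguments are hypothetical, and the check is passed as a callable so that "skip" can avoid evaluating it at all.

```python
import warnings


def run_assertion(check, mode="raise"):
    """Run a zero-argument check according to the given assertions mode."""
    if mode not in ("raise", "warn", "skip"):
        raise ValueError("unknown assertions mode: %r" % (mode,))
    if mode == "skip":
        # The expression is never evaluated, so it costs nothing.
        return None
    passed = bool(check())
    if passed:
        return True
    if mode == "raise":
        # Interrupts the caller, like the default behaviour described above.
        raise AssertionError("assertion failed")
    warnings.warn("assertion failed")
    return False
```

With "warn" the run continues after a failed check, which can be convenient while a model is still being developed; "skip" suits long production runs where the checks are not needed.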
- added commands to the console:
- entities: prints the list of available entities.
- periods: prints the list of available periods for the current entity.
- added new command line arguments to override paths specified in the
simulation file:
- --input-path: override the input path
- --input-file: override the input file
- --output-path: override the output path
- --output-file: override the output file
- added --versions command line argument to display versions of all the
libraries used.
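One way to picture how such overrides interact with the simulation file is the argparse sketch below. Only the four option names come from the list above; the configuration dictionary, its keys and the helper functions are assumptions made for the example.

```python
import argparse


def build_parser():
    """Command line parser exposing the four path overrides."""
    parser = argparse.ArgumentParser(description="path override sketch")
    parser.add_argument("--input-path")
    parser.add_argument("--input-file")
    parser.add_argument("--output-path")
    parser.add_argument("--output-file")
    return parser


def apply_overrides(config, args):
    """Return a copy of config where options given on the command line win."""
    result = dict(config)
    for key in ("input_path", "input_file", "output_path", "output_file"):
        value = getattr(args, key)
        if value is not None:
            result[key] = value
    return result


# Hypothetical values standing in for what a simulation file would specify.
file_config = {
    "input_path": "input",
    "input_file": "base.h5",
    "output_path": "output",
    "output_file": "simulation.h5",
}
args = build_parser().parse_args(["--input-file", "other.h5"])
effective = apply_overrides(file_config, args)
```

Options that are not given on the command line default to None in argparse, so the values from the file are kept for them.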
Miscellaneous improvements
- performance optimisations:
- fields which are used in lag expressions are cached (stored in memory) to
avoid fetching them from disk. This considerably speeds up lag expressions
at the expense of a bit more memory used.
- implemented a few internal functions in Cython to get C-level performance.
This considerably speeds up alignment and groupby expressions, especially
when the number of “alignment categories” (the number of possible
combinations of values for the variables used to partition) is high.
The downside is that anyone who wants to recreate the liam2 binaries from
the source code and benefit from this optimisation needs to have Cython and
a C compiler installed (there is a pure-Python fallback otherwise).
- other minor optimisations to groupby and alignments with take or leave
filters.
- slightly sped up initial data loading for very large datasets with a lot of
historical data.
- choice() arguments (options and probabilities) now accept expressions
(i.e. they can be computed at run time).
- improved the interactive console:
- made the interactive console start in the last simulated period by default.
- changed the behaviour of the “entity” command without argument to print the
current entity.
- the “period” command can now be called without argument to print the
current period.
- added more explicit checks for bad input:
- check for duplicate headers in alignment files.
- check all arguments to groupby() are valid instead of only the first one.
- check for invalid keyword arguments to dump().
- check for invalid keyword arguments to csv().
- check the type of arguments to choice().
- validate globals at load time to make sure the declared globals are
actually present in the dataset.
- disallow strings for the score expression in the matching() function.
- improved the test coverage. There is still a long way to go before full
test coverage, but the changes in this version are a first step in the right
direction:
- automated many tests by using the new assertion functions.
- added more tests.
- only copy declared globals to the output file, and do not create a “globals”
node at all if there is no declared global.
- manually close input and output files when an error happens during
initialisation, so that the user only sees the real error message.
- globals can be entirely missing from the input file if they are not used in
the simulation file.
- made the usual code clean-ups.
Fixes
- fixed typo in the code outputting durations (“hourss” instead of “hours”).
- fixed a bug which prevented defining constants without quoting them in some
cases.
- fixed a crash when all groups were empty in a groupby(xxx, expr=grpcount(),
percent=True).
- fixed aggregate functions (grpmin, grpmax, grpstd, grpmedian and
grppercentile) to accept a scalar as argument (even though it is not very
useful to do that).
- fixed a bug which prevented using a simulation output file as input in some
cases.
Posted by Gaëtan de Menten