.. highlight:: yaml

.. index:: processes

.. _processes_label:

Processes
#########

The processes are the core of a model. LIAM2 supports two kinds of processes:
*assignments*, which change the value of a variable (the predictor) using an
expression, and *actions*, which do not (but have other effects).

For each entity (for example, "household" and "person"), the block of
processes starts with the header "processes:". Each process then starts on a
new line with an indentation of four spaces.

Assignments
===========

Assignments have the following general format: ::

    processes:
        variable1_name: expression1
        variable2_name: expression2
        ...

The variable name will usually be one of the variables defined in the
**fields** block of the entity but, as we will see later, this is not always
necessary.

In this case, the name of the process equals the name of the *endogenous
variable*. *Process names* have to be **unique** for each entity. See the
section about procedures if you need several processes which modify the same
variable.

To run the processes, they have to be specified in the "processes" section of
the simulation block of the file. This explains why the *process names* have
to be unique for each entity.

*example* ::

    entities:
        person:
            fields:
                - age: int
            processes:
                age: age + 1

    simulation:
        processes:
            - person: [age]
        ...

Temporary variables
===================

All fields declared in the "fields" section of the entity are stored in the
output file. Often you need a variable only to store an intermediate result
during the computation of another variable.

In LIAM2, you can create a temporary variable at any point in the simulation
by simply having an assignment to an undeclared variable. Its value will be
discarded at the end of the period.

*example* ::

    person:
        fields:
            # period and id are implicit
            - age:      int
            - agegroup: int

        processes:
            age: age + 1
            agediv10: trunc(age / 10)
            agegroup: agediv10 * 10
            agegroup2: agediv10 * 5

In this example, *agediv10* and *agegroup2* are temporary variables. In this
particular case, we could have bypassed the temporary variable, but when a
long expression occurs several times, it is often cleaner and more efficient
to express it (and compute it) only once by using a temporary variable.

Actions
=======

Since actions don't return any value, they do not need a variable to store
that result, and they only ever need the condensed form: ::

    processes:
        process_name: action_expression
        ...

*example* ::

    processes:
        remove_deads: remove(dead)

Procedures
==========

A process can consist of sub-processes; in that case, we call it a
*procedure*. Processes within a procedure are executed in the order they are
declared.

Each sub-process starts on a new line, again with an indentation of four
spaces and preceded by a dash (-).

So the general setup is: ::

    processes:
        variable_name: expression
        process_name2: action_expression
        process_name3:
            - subprocess_31: expression
            - subprocess_32: expression

In this example, there are three processes, of which the first two do not have
sub-processes. The third process is a procedure which consists of two
sub-processes. If it is executed, subprocess_31 will be executed first and
then subprocess_32.

Contrary to normal processes, the names of sub-processes (processes inside
procedures) do not need to be unique. In the above example, it is possible for
subprocess_31 and subprocess_32 to have the same name, and hence simulate the
same variable. Procedure names (like process_name3) do not directly refer to a
specific endogenous variable.

*example* ::

    processes:
        ageing:
            - age: age * 2     # in our world, people age strangely
            - age: age + 1
            - agegroup: trunc(age / 10) * 10

The processes on *age* and *agegroup* are grouped in *ageing*. In the
simulation block you specify the *ageing* process if you want to update *age*
and *agegroup*. By using procedures, you can actually make *building blocks*
or modules in the model.

Local (temporary) variables
---------------------------

Temporary variables defined/computed within a procedure are local to that
procedure: they are only valid within that procedure. If you want to pass
variables between procedures, you have to make them global by defining them in
the **fields** section.

*(bad) example* ::

    person:
        fields:
            - age: int

        processes:
            ageing:
                - age: age + 1
                - isold: age >= 150   # isold is a local variable

            rejuvenation:
                - age: age - 1
                - backfromoldage: isold and age < 150   # WRONG !

In this example, *isold* and *backfromoldage* are local variables. They can
only be used in the procedure where they are defined. Because we are trying to
use the local variable *isold* in another procedure in this example, LIAM2
will refuse to run, complaining that *isold* is not defined.

Actions
-------

Actions inside procedures do not even need a process name.

*example* ::

    processes:
        death_procedure:
            - dead: age > 150
            - remove(dead)

.. index:: expressions

Expressions
===========

Expressions can either compute new values for existing individuals, or change
the number of individuals by using the so-called life-cycle functions.

.. index:: simple expressions

simple expressions
------------------

Let us start with a simple increment; the following process increases the
value of a variable by one each simulation period. ::

    age: age + 1

The name of the process is *age* and what it does is increase the variable
*age* of each individual by one, each period.

- Arithmetic operators: +, -, \*, /, \** (exponent), % (modulo)

Note that an integer divided by an integer returns a float. For example
"1 / 2" will evaluate to 0.5 instead of 0 as it would in many programming
languages. If you are only interested in the integer part of that result (for
example, if you know the result has no decimal part), you can use the *trunc*
function: ::

    agegroup5: 5 * trunc(age / 5)

- Comparison operators: <, <=, ==, !=, >=, >
- Boolean operators: and, or, not

Note that starting with version 0.6, you do not need to use parentheses when
you mix *boolean operators* with other operators. ::

    inwork: workstate > 0 and workstate < 5
    to_give_birth: not gender and age >= 15 and age <= 50

is now equivalent to: ::

    inwork: (workstate > 0) and (workstate < 5)
    to_give_birth: not gender and (age >= 15) and (age <= 50)

- Conditional expressions:
  if(condition, expression_if_true, expression_if_false)

*example* ::

    agegroup_civilstate: if(age < 50,
                            5 * trunc(age / 5),
                            10 * trunc(age / 10))

Note that the *if* function always requires three arguments. If you want to
leave a variable unchanged when a condition is not met, use the variable in
the *expression_if_false*: ::

    # retire people (set workstate = 9) when aged 65 or more
    workstate: if(age >= 65, 9, workstate)

You can nest if-statements. The example below retires men (gender = True) over
64 and women over 61. ::

    workstate: if(gender,
                  if(age >= 65, 9, workstate),
                  if(age >= 62, 9, workstate))

    # could also be written like this:
    workstate: if(age >= if(gender, 65, 62), 9, workstate)

.. index:: globals usage

.. _globals_usage:

globals
-------

Globals can be used in expressions in any entity.

LIAM2 currently supports two kinds of globals: tables and multi-dimensional
arrays. They both need to be imported (see the :ref:`import_data` section) and
declared (see the :ref:`globals_declaration` section) before they can be used.

Globals tables come in two varieties: those with a PERIOD column and those
without.

The fields in a globals **table with a PERIOD column** can be used like normal
(entity) fields, except that they need to be prefixed by the name of their
table: ::

    myvariable: mytable.MYINTFIELD * 10

The value used for MYINTFIELD is the value it has for the period currently
being evaluated.

There is a special case for the **periodic** table: its fields do not need to
be prefixed by "periodic." (but they can be, if desired). ::

    - retirement_age: if(gender, 65, WEMRA)
    - workstate: if(age >= retirement_age, 9, workstate)

This changes the workstate of the individual to retired (9) if the age has
reached the required retirement age in that year.

Another way to use globals from a table with a PERIOD column is to specify
explicitly for which period you want them to be evaluated. This is done by
using tablename.FIELDNAME[period_expr], where period_expr can be any
expression yielding a valid period value. Here are a few artificial
examples: ::

    workstate: if(age >= WEMRA[2010], 9, workstate)
    workstate: if(age >= WEMRA[period - 1], 9, workstate)
    workstate: if(age >= WEMRA[year_of_birth + 60], 9, workstate)

Globals **tables without a PERIOD column** can only be used with this second
syntax, and in that case LIAM2 will not automatically subtract the "base
period" from the index, which means that to access a particular row, you have
to use its row index (0 based).

Globals **arrays** can simply be used like a normal field: ::

    myvariable: MYARRAY * 2

.. index:: mathematical functions

mathematical functions
----------------------

- log(expr): natural logarithm (ln)
- exp(expr): exponential
- abs(expr): absolute value
- round(expr[, n]): returns the value of expr rounded to n digits after the
  decimal point. If n is not specified, 0 is used.
- trunc(expr): returns the truncated value (by dropping the decimal part) of
  expr as an integer.
- clip(x, a, b): returns a if x < a, b if x > b and x otherwise.
- min(x, a), max(x, a): the minimum or maximum of x and a.

.. index:: count, sum, avg, std, min, max, median, percentile, gini, aggregate functions

aggregate functions
-------------------

- count([condition]): counts the objects in the entity. If a condition is
  given, only the ones satisfying it are counted.
- sum(expr[, filter=condition][, skip_na=True]): sum of the expression
- avg(expr[, filter=condition][, skip_na=True]): average
- std(expr[, filter=condition][, skip_na=True]): standard deviation
- min(expr[, filter=condition][, skip_na=True]): minimum
- max(expr[, filter=condition][, skip_na=True]): maximum
- median(expr[, filter=condition][, skip_na=True]): median
- percentile(expr, percent[, filter=condition][, skip_na=True]): percentile
- gini(expr[, filter=condition][, skip_na=True]): Gini coefficient

**sum** sums any expression over all the individuals of the current entity.
For example *sum(earnings)* will produce the sum of the earnings of all
persons in the sample.

**count** counts the number of individuals in the current entity, optionally
satisfying a (boolean) condition. For example, *count(gender)* will produce
the total number of men in the sample. Contrary to **sum**, **count** does not
require an argument: *count()* will return the total number of individuals in
the sample.
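
For instance, aggregates are frequently combined with a *filter*; the
*earnings* field used below is only illustrative: ::

    avg_female_earnings: avg(earnings, filter=not gender)
    total_earnings_65plus: sum(earnings, filter=age >= 65)
    nb_adults: count(age >= 18)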

Note that sum and count are exactly equivalent if their only argument is a
boolean variable (eg. count(ISWIDOW) == sum(ISWIDOW)).

*example* ::

    macros:
        WIDOW: civilstate == 5
    processes:
        cnt_widows: show(count(WIDOW))

.. index:: link methods, count, sum, avg, min, max

.. _link_methods:

link methods
------------

(one2many links)

- link.count([filter]): counts the number of related individuals
- link.sum(expr[, filter]): computes the sum of an expression over the related
  individuals
- link.avg(expr[, filter]): computes the average of an expression over the
  related individuals
- link.min(expr[, filter]): computes the minimum of an expression over the
  related individuals
- link.max(expr[, filter]): computes the maximum of an expression over the
  related individuals

*example* ::

    entities:
        household:
            fields:
                # period and id are implicit
                - nb_persons: {type: int, initialdata: false}
            links:
                persons: {type: one2many, target: person, field: household_id}

            processes:
                household_composition:
                    - nb_persons: persons.count()
                    - nb_students: persons.count(workstate == 1)
                    - nb_children: persons.count(age <= 17)
                    - avg_age: persons.avg(age)

.. index:: temporal functions, lag, value_for_period, duration, tavg, tsum

temporal functions
------------------

- lag(expr[, num_periods][, missing=value]): value at a previous period.

  **expr**: any expression.

  **num_periods**: optional argument specifying the number of periods to go
  back to. This can be either a constant or a scalar expression. Defaults
  to 1.

  **missing**: the value to return for individuals which were not present in
  the past period. By default, it returns the missing value corresponding to
  the type of the expression: -1 for an integer expression, nan for a float or
  False for a boolean.

  *example* ::

      lag(age)            # the age each person had last year, -1 if newborn
      lag(age, missing=0) # the age each person had last year, 0 if newborn
      avg(lag(age))       # average age that the current population had
                          # last year
      lag(avg(age))       # average age of the population of last year
      lag(age, 2)         # the age each person had two years ago
                          # (-1 for newborns)
      lag(lag(age))       # equivalent to lag(age, 2), but slightly less
                          # efficient

- value_for_period(expr, period[, missing=value]): value at a specific period.

  **expr**: any expression.

  **period**: the period used to evaluate the expression. This can be either a
  constant or a scalar expression.

  **missing**: the value to return for individuals which were not present in
  that period. By default, it returns the missing value corresponding to the
  type of the expression: -1 for an integer expression, nan for a float or
  False for a boolean.

  *example* ::

      value_for_period(inwork and not male, 2002)

- duration(expr): number of consecutive periods the expression has been True.

  *examples* ::

      duration(inwork and (earnings > 2000))
      duration(educationlevel == 4)

- tavg(expr): average of an expression since the individual was created.

  *example* ::

      tavg(income)

- tsum(expr): sum of an expression since the individual was created.

.. index:: random, uniform, normal, randint

random functions
----------------

- uniform: random numbers with a uniform distribution [0, 1)
- normal: random numbers with a normal distribution
- randint: random integers between bounds

*example* ::

    # a random variable with the stdev derived from errsal
    normal(loc=0.0, scale=std(errsal))
    randint(0, 10)
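
A common pattern is to compare a uniform draw with a probability to decide
whether an event occurs for each individual. In this small sketch, the
*death_prob* variable is purely hypothetical: ::

    # death_prob is assumed to hold each individual's probability
    # of dying during the current period
    dead: uniform() < death_prob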

.. index:: choice

choice
~~~~~~

Monte Carlo or probabilistic simulation is a method for iteratively evaluating
a deterministic model using sets of random numbers as inputs. In
microsimulation, the technique is used to simulate changes of state dependent
variables. Take the simplest example: suppose that we have an exogenous
probability of an event happening, P(x = 1), or not happening, P(x = 0). Then
draw a random number u from a uniform [0, 1) distribution. If, for
individual i, u_i < P(x = 1), then x_i = 1; if not, then x_i = 0. The expected
number of occurrences of x after, say, 100 runs is then P(x = 1) * 100 and the
expected value is 1 * P(1) + 0 * P(0) = P(1). This type of simulation hinges
on the confrontation between a random variable and an exogenous probability.

In LIAM2, such a probabilistic simulation is called a **choice** process.
Suppose there are i = 1..n choice options, each with a probability
prob_option_i. A choice expression then has the following form: ::

    choice([option_1, option_2, ..., option_n],
           [prob_option_1, prob_option_2, ..., prob_option_n])

Note that both the list of options and the list of their probabilities are
enclosed in square brackets. The options can be of any numeric type.

A simple example of a process using a choice expression is the simulation of
the gender of newborns (51% males and 49% females), as such: ::

    gender: choice([True, False], [0.51, 0.49])

In the current version of LIAM2, it is not possible to combine a choice with
alignment.

Here is a more complex example of a process using choice. Suppose we want to
simulate the work status (blue collar worker or white collar worker) of all
working individuals. We want to assign 1 or 2 to their collar variable based
on their gender and level of education (education_level = 2, 3, 4). We could
write our process as follows: ::

    collar_process:
        - no_collar: WORKING and collar == -1
        - collar: if(no_collar and (education_level == 2),
                     if(gender,
                        choice([1, 2], [0.836, 0.164]),
                        choice([1, 2], [0.687, 0.313])),
                     collar)
        - collar: if(no_collar and (education_level == 3),
                     if(gender,
                        choice([1, 2], [0.643, 1 - 0.643]),
                        choice([1, 2], [0.313, 1 - 0.313])),
                     collar)
        - collar: if(no_collar and (education_level == 4),
                     if(gender,
                        choice([1, 2], [0.082, 1 - 0.082]),
                        choice([1, 2], [0.039, 1 - 0.039])),
                     collar)

The procedure *collar_process* has *collar* as its key endogenous variable and
has four sub-processes. The first sub-process defines a local variable
*no_collar*, which selects the individuals the procedure should apply to:
all workers who do not have a value for *collar* yet.

The next three sub-processes simulate the actual collar variable. If one meets
the above *no_collar* filter and has the lowest level of education (2), then
one has a probability of about 83.6% (men) and 68.7% (women) of being a blue
collar worker. If one has an *education_level* of 3, the probability of being
a blue collar worker is lower (64.3% for men and 31.3% for women), etc.

.. index:: logit, alignment

Regressions
-----------

.. index:: logit_score

logit_score
~~~~~~~~~~~

The logit of a number p between 0 and 1 is given by the formula: ::

    logit(p) = log(p / (1 - p))

Its inverse, the logistic function, has the interesting property that it can
convert any real number into a probability: ::

    logistic(a) = 1 / (1 + exp(-a))

The logit_score function is a logistic with a random part: ::

    logit_score(a) = logistic(a - logit(u))

where *u* is a random number drawn from a uniform [0, 1) distribution.

*logit_score* is very useful in behavioural equations. A behavioural equation
starts by creating a score that reflects the risk p* of the event occurring.

A typical usage is as follows: ::

    - score_variable: if(condition_1,
                         logit_score(logit_expr_1),
                         if(condition_2,
                            logit_score(logit_expr_2),
                            -1))

However, the nested structure can make things less readable if you have many
different conditions. In that case, one would prefer the following longer
form: ::

    process_name:
        # initialise the score to -1
        - score_variable: -1

        # first condition
        - score_variable: if(condition_1,
                             logit_score(logit_expr_1),
                             score_variable)
        # second condition
        - score_variable: if(condition_2,
                             logit_score(logit_expr_2),
                             score_variable)

        # ... other conditions ...

In the first sub-process, the variable *score_variable* is set equal to -1,
because this makes it highly unlikely that the event will happen to those not
covered by the conditions for which the logit is applied. Next, subject to
conditions *condition_1* and *condition_2*, this score (risk) is simulated on
the basis of estimated logits. Note that specifying the endogenous variable
*score_variable* without any transformation in the "else" branch of the if
functions makes sure that the score variable is not modified by a sub-process
it does not pertain to.

When the score is known, it can either be used as is: ::

    - event_happened: uniform() < score_variable

or in combination with an alignment (see below).

.. index:: align, take, leave

align
~~~~~

Now that we have computed a score (risk) for an event happening, we might want
to use alignment so that the number of events occurring per category matches a
proportion defined externally.

There are different ways to choose which individuals are taken. The
methodology currently used by LIAM2 is called "alignment by sorting", that is,
for each category, the N individuals with the highest scores are selected.

The score computation is not done internally by the align() function, but is
rather computed by an expression given by the modeller. One will usually use
logit_score() to compute it, but it can be computed in any other way a
modeller chooses.

To learn more about the alignment process, see "Evaluating Alignment Methods
in Dynamic Microsimulation Models" by Li and O'Donoghue.

An alignment expression takes the following general form: ::

    align(score,
          proportions
          [, filter=conditions]
          [, take=conditions]
          [, leave=conditions]
          [, expressions=expressions]
          [, possible_values=pvalues]
          [, frac_need="uniform"|"round"|"cutoff"])

For example, it could look like: ::

    - unemp: align(unemp_score,
                   'al_p_unemployed_m.csv',
                   filter=not ISINWORK and (age > 15) and (age < 65),
                   take=ISUNEMPLOYED,
                   leave=ISSTUDENT or ISRETIRED)

Now let us examine each argument in turn:

* **score**: it must be an expression (or a simple variable) returning a
  numerical value. It will be used to rank individuals. One will usually use
  logit_score() to compute the score, but it can be computed in any other way
  a modeller chooses. Note that the score is not modified in any way within
  the align() function, so if one wants a random factor, it should be added
  manually (or through the use of a function like logit_score which includes
  one).

* **proportions**: the target proportions for each category. This argument can
  take many forms. The most common one will probably be a string holding the
  name of a file containing the alignment data (like in the example above),
  but it can be any of the following:

  + a single scalar, for aligning with a constant proportion.
  + a list of scalars, for aligning with constant proportions per category.
  + an expression returning a single scalar.

  + an expression returning an n-dimensional array. In that case, expressions
    and possible values will be retrieved from that array, so you can simply
    use: ::

        align(score_expr, array_expr)

  + a list of expressions returning scalars [expr1, expr2].
  + a string treated as a filename. That file should be in the "array" format
    described in the :ref:`import_data` section. In that case, the
    proportions, expressions (column names) and possible values are read from
    that file. The "fname" argument, which used to be the way to define this,
    is still supported for backward compatibility.

    There is no technical restriction on names for files containing alignment
    data but, by convention, they usually follow this pattern: they start with
    the prefix *al_*, followed by the name of the endogenous variable and a
    suffix *_m* or *_f*, depending on gender.

* **filter**: an expression specifying which individuals to take into account
  for the alignment. Note that if the align() function is used inside an
  *if()* expression, its filter is adjusted accordingly ("anded" with the
  filter of the if() expression). For example: ::

      - aligned: if(condition,
                    align(score_expr1, 'filename1.csv'),
                    align(score_expr2, 'filename2.csv'))

  is equivalent to: ::

      - aligned1: align(score_expr1, 'filename1.csv', filter=condition)
      - aligned2: align(score_expr2, 'filename2.csv', filter=not condition)
      - aligned: if(condition, aligned1, aligned2)

* **take**: an expression specifying individuals which should always be
  selected, regardless of their score. This argument should be used with care,
  as those individuals will be selected unconditionally, even if that means
  overflowing the number of individuals desired to satisfy the alignment.

  Suppose that the alignment specifies that 10 individuals should experience a
  certain event, and that there are 3 individuals who meet the conditions
  specified in the *take*. Then these 3 individuals will be selected a priori
  (irrespective of their score) and the alignment process will select the
  remaining 7 candidates from the rest of the sample.

  A "softer" alternative can easily be achieved by setting a very high score
  for individuals who should be taken first.

* **leave**: an expression specifying individuals which should never be
  selected, regardless of their score. This argument should be used with care,
  as those individuals will *never* be selected, even if that causes the
  target number of individuals for some categories not to be reached.

  A "softer" alternative can easily be achieved by setting a very low score
  for individuals who should be taken last.

  Note that even if the score for an individual is -1 (or any other negative
  number), it *can* still be selected by the alignment expression. This
  happens when there are not enough candidates (selected by the filter) to
  meet the alignment needs.

* **expressions**: specifies the expressions used to partition the individuals
  into the different alignment categories. If proportions is a file name, the
  column names declared in the file are used by default, but they can be
  overridden using this argument. For example: ::

      align(0.0, 'al_p_dead.csv', expressions=[gender, age + 1])

* **possible_values**: specifies the different values of each expression in
  the expressions argument that should be evaluated. The combination of the
  different lists of possible values forms all the alignment categories. For
  example: ::

      align(0.0,
            proportions=[0.1, 0.2, 0.3, 0.4],
            expressions=[gender, age < 50],
            possible_values=[[False, True], [False, True]])

* **frac_need**: controls how "fractional needs" are handled.

  This argument can take any of three values: "uniform" (default), "cutoff" or
  "round".

  + "uniform" draws a random number (u) from a uniform distribution and adds
    one individual if u < fractional_need. "uniform" is the default behavior.
  + "round" simply rounds needs to the nearest integer. In other words, one
    individual is added for a category if the fractional need for that
    category is >= 0.5.
  + "cutoff" tries to match the total need as closely as possible (at the
    expense of a slight loss of precision for individual categories) by
    searching for the "cutoff point" that yields: ::

        count(frac_need >= cutoff) == sum(frac_need)

In practice, alignment data is often separate for men and women. In that case,
one will usually use the following form: ::

    - variable: if(condition,
                   if(gender,
                      align(score_expr, 'filename_m.csv'),
                      align(score_expr, 'filename_f.csv')),
                   False)

Since LIAM2 supports alignment with any number of dimensions, one could also
merge both data files into a single file with one more dimension and use a
single align() expression: ::

    - variable: if(condition,
                   align(score_expr, 'filename_m_and_f.csv'),
                   False)

    # or even

    - variable: align(score_expr, 'filename_m_and_f.csv', filter=condition)

The example below describes the process of getting (or keeping) a job: ::

    inwork:
        - work_score: -1
        # men
        - work_score: if(ISMALE and ACTIVEAGE and ISINWORK,
                         logit_score(-0.196599 * age + 0.0086552 * age ** 2
                                     - 0.000988 * age ** 3
                                     + 0.1892796 * ISMARRIED + 3.554612),
                         work_score)
        - work_score: if(ISMALE and ACTIVEAGE and
                             (ISUNEMPLOYED or ISOTHERINACTIVE),
                         logit_score(0.9780908 * age - 0.0261765 * age ** 2
                                     + 0.000199 * age ** 3 - 12.39108),
                         work_score)
        # women
        - work_score: if(ISFEMALE and ACTIVEAGE and ISINWORK,
                         logit_score(-0.2740483 * age + 0.0109883 * age ** 2
                                     - 0.0001159 * age ** 3
                                     - 0.0906834 * ISMARRIED + 3.648706),
                         work_score)
        - work_score: if(ISFEMALE and ACTIVEAGE and
                             (ISUNEMPLOYED or ISOTHERINACTIVE),
                         logit_score(0.8217638 * age - 0.0219761 * age ** 2
                                     + 0.000166 * age ** 3
                                     - 0.5590975 * ISMARRIED - 10.48043),
                         work_score)
        - work: if(ACTIVEAGE,
                   if(ISMALE,
                      align(work_score, 'al_p_inwork_m.csv',
                            leave=ISSTUDENT or ISRETIRED),
                      align(work_score, 'al_p_inwork_f.csv',
                            leave=ISSTUDENT or ISRETIRED)),
                   False)

The first sub-process illustrates a "*soft leave*" by setting the score
variable *work_score* to -1. This makes sure that the a priori selection
probability is very low (but not zero, as in the case of *leave* conditions)
for those who satisfy the filter of the alignment but for whom a score is not
explicitly set by the subsequent sub-processes.

Next come four *if* conditions, restricting each behavioural equation to the
sub-sample it pertains to. The first two conditions pertain to men and
describe, respectively, the probability of keeping a job and of getting a job.
The next two conditions describe the same transitions for women.

The last sub-process describes the alignment itself. Alignment is applied to
individuals between the ages of 15 and 65. The input files of the alignment
process are 'al_p_inwork_m.csv' and 'al_p_inwork_f.csv'. The alignment process
uses as input the scores simulated previously and the information in the
alignment files, and sets the boolean variable *work*. The *leave* conditions
ensure that students and retired individuals are never selected; no *take*
conditions are used in this case.
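
The *proportions* argument does not need to be a file name. As a minimal
sketch (the score variable and the 5% target below are purely illustrative,
reusing the ACTIVEAGE macro from the example above), one can also align on a
constant proportion: ::

    # select 5% of the individuals passing the filter,
    # trying those with the highest scores first
    - emigrated: align(emigration_score, 0.05, filter=ACTIVEAGE)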

.. index:: align_abs

align_abs
~~~~~~~~~

align_abs is equivalent to align(), except that it aligns to absolute numbers
instead of proportions. It also supports a few additional arguments to work on
a **linked entity**.

The general form of align_abs is: ::

    align_abs(score,
              need
              [, filter=conditions]
              [, take=conditions]
              [, leave=conditions]
              [, expressions=expressions]
              [, possible_values=pvalues]
              [, frac_need="uniform"|"round"|"cutoff"]
              [, link=link_name]
              [, secondary_axis=column_name]
              [, errors="default"|"carry"])

In addition to all the arguments supported by *align()*, *align_abs()* also
supports an optional "link" argument, which makes it work on a linked entity.
Here is a description of the arguments specific to align_abs:

* **link**: must be the name of a one2many link. When the link argument is
  used, the groups (given by the alignment file or the *expressions* argument)
  are evaluated on the linked entity and the needs are expressed in terms of
  that linked entity. When the link argument is in effect, align_abs uses the
  "Chenard" algorithm.

  This can be used, for example, to take as many *households* as necessary
  while trying to get as close as possible to a particular distribution of
  *persons*.

* **secondary_axis**: name of an axis which will influence rel_need when the
  subtotal for that axis is exceeded. See total_by_sex in Chenard.
  *secondary_axis* can only be used in combination with the link argument and
  it *must* be one of the alignment columns.

* **errors**: if set to 'carry', the error for a period (the difference
  between the number of individuals aligned and the target for each category)
  is stored and added to the target for the next period. In the current
  version of LIAM2, *errors* can only be used in combination with the *link*
  argument.

*example* ::

    test_align_link:
        # this is a procedure defined at the level of households
        - num_persons: persons.count()
        - total_population: sum(num_persons)

        # MIG_PERCENT is a simple float periodic global
        - num_migrants: total_population * MIG_PERCENT

        # MIG is a 3d array: age - gender - period but we want only the
        # 2d array for this period.
        # Currently, we need to manually compute the index (0-based)
        # for the current period in the array. We know the first
        # period in our array is 2000, so the index for the current
        # period is: "period - 2000".
        # period is the last dimension of the array and we do not
        # want to modify other dimensions, so we use ":" for those
        # dimensions.
        - mig_period: MIG[:, :, period - 2000]

        # distribute the total desired migrants by age and gender
        - need: num_migrants * mig_period

        # households have a 50% chance to be candidates for immigration
        - is_candidate: uniform() < 0.5

        # apply alignment, using the number of persons in each household
        # as a score, so that households with more persons are tried first,
        # as this gives better results
        - aligned: align_abs(num_persons, need,
                             filter=is_candidate,
                             link=persons,
                             secondary_axis=gender,
                             errors='carry')

.. index:: logit_regr

logit_regr
~~~~~~~~~~

logit_regr is a shortcut to call logit_score and "evaluate whether the event
happened" in a single function. Thus, the function returns a boolean: True for
individuals which are selected, False for all others. Its general form is: ::

    - aligned: logit_regr(expression
                          [, filter=conditions]
                          [, align=proportions])

The *align* argument supports all the same formats as the *proportions*
argument of align(): filename, percentage, list of values, ...

Whether the event happens is evaluated differently depending on whether the
align argument is used or not.

If alignment is used, logit_regr is equivalent to: ::

    - aligned: align(logit_score(expression), proportions,
                     filter=conditions)

Without the align argument, the condition for the event occurring is
p* > 0.5, which means that in this form, logit_regr is equivalent to: ::

    - aligned: if(conditions, logit_score(expression) > 0.5, False)

*example* ::

    - to_give_birth: logit_regr(0.0,
                                filter=FEMALE and (age >= 15) and (age <= 50),
                                align='al_p_birth.csv')

other regressions
~~~~~~~~~~~~~~~~~

- Continuous (expr + normal(0, 1) * mult + error_var):
  cont_regr(expr[, filter=None, mult=0.0, error_var=None])
- Clipped continuous (always positive):
  clip_regr(expr[, filter=None, mult=0.0, error_var=None])
- Log continuous (exponential of continuous):
  log_regr(expr[, filter=None, mult=0.0, error_var=None])

.. index:: matching

Matching function
-----------------

**matching** (aka marriage market) matches individuals from set 1 with
individuals from set 2. For each individual in set 1, following a particular
order (given by the expression in the *orderby* argument), the function
computes the score of all (unmatched) individuals in set 2 and takes the best
scoring one.

One has to specify the boolean filters which provide the two sets to match
(set1filter and set2filter), the criterion deciding in which order the
individuals of the first set are matched, and the expression that will be used
to assign a score to each individual of the second set (given a particular
individual in set 1). In the score expression, the fields of the set 1
individual can be used normally and the fields of its possible partners can be
used by prefixing them with "**other.**".

The matching function returns the identification number of the matched
individual for individuals which were matched, and -1 for the others. If the
two sets have different sizes, the excess of the larger set is simply ignored.

*generic setup* ::

    matching(set1filter=boolean_expr,
             set2filter=boolean_expr,
             orderby=difficult_match,
             score=coef1 * field1 + coef2 * other.field2 + ...)

*example* ::

    marriage:
        - to_couple: not in_couple and age >= 18 and age <= 90
        - avg_age_males_to_couple: avg(age, filter=to_couple and MALE)
        - difficult_match: if(to_couple and FEMALE,
                              abs(age - avg_age_males_to_couple),
                              nan)
        - partner_id: if(to_couple,
                         matching(set1filter=FEMALE, set2filter=MALE,
                                  orderby=difficult_match,
                                  score=- 0.4893 * other.age
                                        + 0.0131 * other.age ** 2
                                        - 0.0001 * other.age ** 3
                                        + 0.0467 * (other.age - age)
                                        - 0.0189 * (other.age - age) ** 2
                                        + 0.0003 * (other.age - age) ** 3
                                        - 0.9087 * (other.work and not work)
                                        - 1.3286 * (not other.work and work)
                                        - 0.6549 * (other.work and work)),
                         partner_id)

The code above shows an application. First, we decide that all persons between
18 and 90 years old who are not part of a couple are candidates for marriage.
Next, for each candidate woman, the variable *difficult_match* is the
difference between her age and the average age of candidate men.

In a third step, for each candidate woman in turn (following the order set by
*difficult_match*), all candidate men who are still available are assigned a
score, and the man with the highest score is matched with that woman. This
score depends on his age, his age difference with the woman and the work
status of the potential partners.

.. index:: lifecycle functions

Lifecycle functions
-------------------

.. index:: new

new
~~~

**new** creates new individuals. It can create individuals of the same entity
(eg. a woman gives birth) or of another entity (eg. a *person*'s marriage
creates a new *household*).

The function returns the id of the newly created individuals.

*generic format* ::

    new('entity_name'[, filter=expr][, number=value]
        *set initial values of a selection of variables*)

The first argument specifies the entity in which the individuals will be
created (eg person, household, ...).

Then, one should use either the *filter* or the *number* argument.

* **filter** specifies which individuals of the current entity will serve as
  the origin for the new individuals (for persons, that would translate to who
  is giving birth, but the function can of course be used for any kind of
  entity).

* **number** specifies how many individuals need to be created. In that case,
  the new individuals do not have an "origin" individual, so they cannot copy
  values from one.

Any subsequent argument specifies values for fields of the new individuals.
Any field which is not specified there will receive the missing value
corresponding to the type of the field ('nan' for floats, -1 for integers and
False for booleans). Those extra arguments can be given constants, but also
any expression (possibly using links, random functions, ...). Those
expressions are evaluated in the context of the origin individuals. For
example, you could write "mother_age = age", which would set the field
"mother_age" of the new children to the age of their mother.

*example 1* ::

    birth:
        - to_give_birth: logit_regr(0.0,
                                    filter=not gender and
                                           (age >= 15) and (age <= 50),
                                    align='al_p_birth.csv')
        - new('person', filter=to_give_birth,
              mother_id = id,
              father_id = partner.id,
              household_id = household_id,
              partner_id = -1,
              age = 0,
              civilstate = SINGLE,
              gender = choice([True, False], [0.51, 0.49]))

The first sub-process (*to_give_birth*) is a logit regression over women (not
gender) between 15 and 50 which returns a boolean value indicating whether
that person should give birth or not. The logit itself does not have a
deterministic part (0.0), which means that all women who meet the above
condition are equally likely to give birth (they are selected randomly). This
process is also aligned on the data in 'al_p_birth.csv'.

In the above case, a new person is created each time a woman is selected to
give birth. Secondly, a number of links are established: the value of the
*mother_id* field of the child is set to the id number of his/her mother, the
child's father is set to the partner of the mother, the child receives the
household number of his/her mother, ... Finally, some variables of the child
are set to specific initial values: the most important of these is its gender,
which is the result of a simple choice process.

**new** can create individuals of different entities; the procedure
*get_a_life* below makes sure that all those who are single when they are 24
years old leave their parents' household for their own household. The region
of this new household is drawn randomly through a choice process.

*example 2* ::

    get_a_life:
        - household_id: if(ISSINGLE and age == 24,
                           new('household',
                               region_id=choice([0, 1, 2, 3],
                                                [0.1, 0.2, 0.3, 0.4])),
                           household_id)

.. index:: clone

clone
~~~~~

**clone** is very similar to **new** but is intended for cases where most or
all variables describing the new individual should be copied from its
parent/origin instead of being set to "missing".

With clone, you cannot specify what kind of entity you want to create, as it
is always the same as that of the origin item. However, similarly to **new**,
**clone** also allows fields to be set explicitly by any expression evaluated
on the parent/origin.

Put differently, a **new** with no fields mentioned will result in a new item
whose field values are all set to missing and have to be filled through
simulation; by contrast, a **clone** with no fields mentioned will result in a
new item that is an exact copy of the origin, except for its id number, which
is always set automatically.

*example* ::

    make_twins:
        - clone(filter=new_born and is_twin,
                gender=choice([True, False], [0.51, 0.49]))

.. index:: remove

remove
~~~~~~

**remove** removes items from an entity dataset. With this command you can
remove obsolete items (eg. dead persons, empty households), thereby ensuring
they are not simulated anymore. This will also save some memory and, in some
cases, improve simulation speed.

The procedure below simulates whether an individual survives or not, and what
happens in the latter case. ::

    dead_procedure:
        # decide who dies
        - dead: if(gender,
                   logit_regr(0.0, align='al_p_dead_m.csv'),
                   logit_regr(0.0, align='al_p_dead_f.csv'))
        # change the civilstate of the surviving partner
        - civilstate: if(partner.dead, 5, civilstate)
        # break the link to the dead partner
        - partner_id: if(partner.dead, -1, partner_id)
        # remove the dead
        - remove(dead)

The first sub-process, *dead*, simulates whether an individual is 'scheduled
for death', using only a logistic stochastic variable and the age- and
gender-specific alignment process. Next, some links are updated for the
surviving partner: the sub-process *civilstate* sets the variable of that name
to 5 (widow(er)) for those individuals whose partner has been scheduled for
death, and in that case the partner identification code is erased as well. All
other procedures describing the inheritance process should be included here.
Finally, the *remove* command is called to remove the *dead* from the
simulation dataset.

Output
======

LIAM2 produces simulation output in three ways. First of all, by default, the
simulated datasets are stored in hdf5 format. These can be accessed at the end
of the run. You can use several tools to inspect the data.

You can display information during the simulation (in the console log) using
the *show* function. You can write that same information to csv files using
the *csv* function. You can produce tabular data by using the *dump* or
*groupby* functions.

In the interactive console, you can use any of those output functions to
inspect the data interactively.

.. index:: show

show
----

*show* evaluates expressions and prints the result to the console log. Note
that, in the *interactive console*, show is implicit on all commands, so you
do not need to use it. *show* has the following signature: ::

    show(expr1[, expr2, expr3, ...])

*example 1* ::

    show(count(age >= 18))
    show(count(not dead), avg(age, filter=not dead))

The first process will print out the number of persons aged 18 and older in
the dataset. The second one displays the number of living people and their
average age.

*example 2* ::

    show("Count:", count(),
         "Average age:", avg(age),
         "Age std dev:", std(age))

gives ::

    Count: 19944 Average age: 42.7496991576 Age std dev: 21.9815913417

Note that you can use the special character "\\n" to display the rest of the
result on the next line.

*example 3* ::

    show("Count:", count(),
         "\nAverage age:", avg(age),
         "\nAge std dev:", std(age))

gives ::

    Count: 19944
    Average age: 42.7496991576
    Age std dev: 21.9815913417

.. index:: qshow

qshow
-----

*qshow* evaluates expressions and prints their results to the console log
alongside the "textual form" of the expressions. If several expressions are
given, each is printed on a separate line. *qshow* usage is exactly the same
as *show*.

*example* ::

    qshow(count(), avg(age), std(age))

will give: ::

    count(): 19944
    avg(age): 42.7496991576
    std(age): 21.9815913417

.. index:: csv

csv
---

The **csv** function writes values to csv files. ::

    csv(expr1[, expr2, expr3, ...]
        [, suffix='file_suffix'][, fname='filename'][, mode='w'])

'suffix', 'fname' and 'mode' are optional arguments. By default (if neither
'fname' nor 'suffix' is used), the name of the csv file is generated using the
following pattern: "{entity}_{period}.csv".

*example* ::

    csv(avg(income))

will create one file for each simulated period. Assuming start_period is 2002
and periods is 2, it will create two files, "person_2002.csv" and
"person_2003.csv", with the average income of the population for the periods
2002 and 2003 respectively.

Arguments:

- 'suffix' allows customizing the name of the files easily. When it is used,
  the files are named using the following pattern:
  "{entity}_{period}_{suffix}.csv".

  *example* ::

      csv(avg(income), suffix='income')

  would create "person_2002_income.csv" and "person_2003_income.csv".

- 'fname' allows defining the exact file name or pattern to use. You can
  optionally use the '{entity}' and '{period}' key words to customize the
  name.

  *example* ::

      csv(avg(income), fname='income{period}.csv')

  would create "income2002.csv" and "income2003.csv".

- 'mode' allows appending (mode='a') to a csv file instead of overwriting it
  (mode='w' by default). This allows you, for example, to store the value of
  some expression for all periods in the same file (instead of one file per
  period by default).

  *example* ::

      csv(period, avg(income), fname='avg_income.csv', mode='a')

  Note that unless you erase/overwrite the file one way or another between two
  runs of a simulation, you will append the data of the current simulation to
  that of the previous one. One way to overwrite the file automatically at the
  start of a simulation is to have a procedure in the init section without
  mode='a'.

  If you want that file to start empty, you can do so this way: ::

      csv(fname='avg_income.csv')

  If you want some headers in your file, you could write them at that
  point: ::

      csv('period', 'average income', fname='avg_income.csv')

When you use the csv() function in combination with (at least one) table
expression (see the dump and groupby functions below), the results are
appended below each other: ::

    csv(table_expr1, 'and here goes another table', table_expr2,
        fname='tables.csv')

will produce a file with a layout like this: ::

    | table 1 value at row 1, col 1 | col 2 | ... | col N |
    | ...                           | ...   | ... | ...   |
    | row N, col 1                  | col 2 | ... | col N |
    | and here goes another table   |       |     |       |
    | table 2 value at row 1, col 1 | ...   | col N |     |
    | ...                           | ...   | ...   |     |
    | row N, col 1                  | ...   | col N |     |

You can also output several rows with a single command by enclosing values
between brackets: ::

    csv([row1value1, ..., row1valueN],
        ...
        [rowNvalue1, ..., rowNvalueN],
        fname='several_rows.csv')

*example* ::

    csv(['this is', 'a header'],
        ['with', 'several lines'],
        fname='person_age_aggregates.csv')

will produce a file with a layout like this: ::

    | this is | a header      |
    | with    | several lines |
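
Putting the pieces together, a table expression can be written to its own csv
file each period; the expression and suffix below are only illustrative: ::

    csv(groupby(trunc(age / 10), gender), suffix='pop_by_age')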

.. index:: dump

dump
----

**dump** produces a table with the expressions given as arguments evaluated
over many (possibly all) individuals of the dataset.

*general format* ::

    dump([expr1, expr2, ...,
          filter=filterexpression, missing=value, header=True])

If no expression is given, *all* fields of the current entity will be dumped
(including temporary variables available at that point); otherwise, each
expression is evaluated on the individuals which satisfy the filter and the
results are laid out as a table.

The 'filter' argument allows evaluating the expressions only on the
individuals which satisfy the filter. Defaults to None (evaluate on all
individuals).

The 'missing' argument can be used to transform 'nan' values to another value.
Defaults to None (no transformation).

The 'header' argument determines whether column names should be included in
the dump or not. Defaults to True.

*example* ::

    show(dump(age, partner.age, gender, filter=id < 10))

gives ::

    id | age | partner.age | gender
     0 |  27 |          -1 |  False
     1 |  86 |          71 |  False
     2 |  16 |          -1 |   True
     3 |  19 |          -1 |  False
     4 |  27 |          21 |  False
     5 |  89 |          92 |   True
     6 |  59 |          61 |   True
     7 |  65 |          29 |  False
     8 |  38 |          35 |   True
     9 |  48 |          52 |   True

.. index:: groupby

groupby
-------

**groupby** (aka *pivot table*) groups all individuals by their value for the
given expressions, and optionally computes an expression for each group (using
the *expr* argument). If no expression is given, it computes the number of
individuals in each group. A *filter* can be specified to limit the
individuals taken into account.

*general format* ::

    groupby(expr1[, expr2, expr3, ...]
            [, expr=expression]
            [, filter=filterexpression]
            [, percent=True]
            [, pvalues=possible_values])

*example* ::

    show(groupby(trunc(age / 10), gender))

gives ::

    trunc((age / 10)) | gender |      |
                      |  False | True | total
                    0 |    818 |  803 |  1621
                    1 |    800 |  800 |  1600
                    2 |   1199 | 1197 |  2396
                    3 |   1598 | 1598 |  3196
                    4 |   1697 | 1696 |  3393
                    5 |   1496 | 1491 |  2987
                    6 |   1191 | 1182 |  2373
                    7 |    684 |  671 |  1355
                    8 |    369 |  357 |   726
                    9 |    150 |  147 |   297
                total |  10002 | 9942 | 19944

*example* ::

    show(groupby(inwork, gender))

gives ::

    inwork | gender |      |
           |  False | True | total
     False |   6170 | 5587 | 11757
      True |   3832 | 4355 |  8187
     total |  10002 | 9942 | 19944

*example* ::

    show(groupby(inwork, gender, percent=True))

gives ::

    inwork | gender |       |
           |  False |  True |  total
     False |  30.94 | 28.01 |  58.95
      True |  19.21 | 21.84 |  41.05
     total |  50.15 | 49.85 | 100.00

*example* ::

    groupby(workstate, gender, expr=avg(age))

gives the average age by workstate and gender ::

    workstate | gender |       |
              |  False |  True | total
            1 |  41.29 | 40.53 | 40.88
            2 |  40.28 | 44.51 | 41.88
            3 |   8.32 |  7.70 |  8.02
            4 |  72.48 | 72.27 | 72.38
            5 |  42.35 | 46.56 | 43.48
        total |  42.67 | 42.38 | 42.53

As of version 0.6, groupby can also be used in larger expressions. This can be
used, for example, to compute alignment targets on the fly: ::

    # see the note below about expr=count(condition) vs filter=condition
    - men_by_age: groupby(age, expr=count(gender))
    - men_prop_by_age: men_by_age / groupby(age)
    - aligned: align(proportions=men_prop_by_age)

Note that there is a subtle difference between using "filter=condition" and
"expr=count(condition)". The former will not take the filtered individuals
into account at all, while the latter will take them into account but not
count them. This can make a difference in the output if there are some empty
categories, and this can be important when using the result of a groupby
inside a larger expression (as above) because it can only work with arrays of
the same size.

Compare ::

    groupby(civilstate, filter=age > 80)

    civilstate |     |    |
             1 |   3 |  4 | total
           542 | 150 | 85 |   777

with ::

    groupby(civilstate, expr=count(age > 80))

    civilstate |   |     |    |
             1 | 2 |   3 |  4 | total
           542 | 0 | 150 | 85 |   777

The *expr* argument will usually be used with an aggregate function, but it
also supports normal expressions, in which case the values for each individual
are displayed in a list. This feature should only be used with care and
usually in combination with a strong *filter*, to avoid producing extremely
large tables which would take forever to display. ::

    groupby(agegroup_civilstate, gender, expr=id, filter=id < 20)

    agegroup_civilstate |               gender |                      |
                        |                False |                 True | total
                      0 |            [0 1 4 6] |            [2 3 5 7] | [0 1 4 6 2 3 5 7]
                      5 |            [8 10 12] |            [9 11 13] | [8 10 12 9 11 13]
                     10 |           [14 16 18] |           [15 17 19] | [14 16 18 15 17 19]
                  total | [0 1 4 6 8 10 12     | [2 3 5 7 9 11 13     | [0 1 4 6 8 10 12 14 16 18
                        |  14 16 18]           |  15 17 19]           |  2 3 5 7 9 11 13 15 17 19]

or ::

    groupby(civilstate, gender, expr=age, filter=id > 100 and id < 110)

    civilstate |        gender |                  |
               |         False |             True | total
             1 |       [46 47] |       [46 47 47] | [46 47 46 47 47]
             2 |          [47] |          [45 46] | [47 45 46]
             4 |          [46] |               [] | [46]
         total | [46 47 47 46] | [46 47 47 45 46] | [46 47 47 46 46 47 47 45 46]

.. index:: interactive console, debugging

Debugging and the interactive console
=====================================

LIAM2 features an interactive console which allows you to explore the state of
the memory interactively, either during or after a simulation has completed.

You can reach it in two ways. You can either pass "-i" as the last argument
when running the executable, in which case the interactive console is launched
after the whole simulation is over. The alternative is to use breakpoints in
your simulation to interrupt the simulation at a specific point (see below).

Type "help" in the console for the list of available commands. In addition to
those commands, you can type any expression that is allowed in the simulation
file and get the result directly. Show is implicit for all operations.

*examples* ::

    >>> avg(age)
    53.7131819615

    >>> groupby(trunc(age / 20), gender, expr=count(inwork))

    trunc(age / 20) | gender |      |
                    |  False | True | total
                  0 |     14 |   18 |    32
                  1 |    317 |  496 |   813
                  2 |    318 |  258 |   576
                  3 |     40 |  102 |   142
                  4 |      0 |    0 |     0
                  5 |      0 |    0 |     0
              total |    689 |  874 |  1563

.. index:: breakpoint

breakpoint
----------

**breakpoint** temporarily stops the execution of the simulation and launches
the interactive console. There are two additional commands available in the
interactive console when you reach it through a breakpoint: "step" to execute
(only) the next process and "resume" to resume normal execution.

*general format* ::

    breakpoint([period])

The "period" argument is optional and, if given, makes the breakpoint
interrupt the simulation only for that period.

*example* ::

    marriage:
        - in_couple: MARRIED or COHAB
        - breakpoint(2002)
        - ...

.. index:: assertions, assertTrue, assertEqual

assertions
----------

Assertions can be used to check that your model really produces the results it
should produce. The behavior when an assertion fails is determined by the
:ref:`assertions-label` simulation option.

- assertTrue(expr): evaluates the expression and checks that its result is
  True.
- assertEqual(expr1, expr2): evaluates both expressions and checks that their
  results are equal.
- assertEquiv(expr1, expr2): evaluates both expressions and checks that their
  results are equal, tolerating a difference in shape (though they must be
  compatible).

- assertIsClose(expr1, expr2): evaluates both expressions and checks that
  their results are almost equal.
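
*example* (a minimal sketch; the checks and the *earnings* field used here are
only illustrative) ::

    sanity_checks:
        - assertTrue(min(age) >= 0)
        - assertEqual(count(age < 0), 0)
        # should hold as long as earnings has no missing values
        - assertIsClose(avg(earnings), sum(earnings) / count())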