Validation and Transformation Language VTL

VTL is a new language intended to be directly usable by statisticians to specify, and thus document, validation rules and logical transformations on data. This is the case, for example, when administrative data tables require transformations to become statistical sources (checks, conceptualisation, consistency, etc.).

VTL is based on a logical formalisation of the validation and transformation rules which has the advantage of being able to be converted automatically into specific programming languages for execution (Python, R, Java, SQL, etc.)

This logical formalisation provides a “neutral” expression for which various implementations can be implemented.

TREVAS, an open-source project being developed by INSEE, is an example of a VTL implementation engine written in Java which relies in particular on spark for execution.

VTL is being developed by members of the SDMX, DDI and GSIM communities and is beginning to be used more widely by the international statistical and research community.

Below is an example of a VTL instruction set for the table transformation:

  • The dummy table ds_source contains an employee identifier, the company name, the gross salary, the amount of charges
  • The first statement will create a new table ds_net by calculating the net salary amount: [calc NET_WAGE := GROSS_WAGE – CHARGES]
  • The second statement creates a new table ds_gros_salaires from ds_net by filtering on salaries above 3000
  • The third statement creates a ds_final table by deleting the variables SALAIRE_BRUT et CHARGES

 

Note that it is possible to obtain the df_final table in a single command:

df_final:=ds_source [calc salaire_net:=salaire_brut–charges] [filter salaire_net < 3000] [keep id, salaire_net] ;

IIt is also possible to combine tables by join. In the example below, the individus table is matched by the SIRET to the entreprises table, a filter is then applied.

individus_clean := individus [rename siret to SIRET] [calc identifier SIRET := SIRET, measure NIR :=NIR, measure sexe :=substr(NIR, 0,1)];
jointure := left_join(individus_clean, entreprises);
femmes := jointure [rename NOM to NOM_ENTREPRISE, EFFECTIF to EFFECTIF_ENTREPRISE][filter sexe = “2”][keep NIR, NOM_ENTREPRISE, EFFECTIF_ENTREPRISE];

A final example inspired by the GIT VTL-Lab scripts

/* Validation of indicator variables */

/* We create a variable V_INDICATRICE_TOUTES where we sum the indicators
* Observations where none or more than one indicator is 1 are deleted */

Table-indicatrices3 := Table-indicatrices2  [calc V_INDICATRICE_TOUTES :=V_INDICATRICE_a + V_INDICATRICE_b + V_INDICATRICE_c + V_INDICATRICE_d + V_INDICATRICE_e] [filter V_INDICATRICE _TOUTES = 1];