Ideas for developing a quality framework for job vacancy statistics

The sections below describe each error type and its relevance to using on-line job advertisements for statistical purposes.

Phase 1 errors

Phase 1 applies to a single dataset in isolation. For a complex statistical output from many different datasets, carry out the phase 1 evaluation separately for each source dataset. The framework can be used for both administrative and survey data.


Measurement (variables)

The measurement side describes the path from an abstract target concept to a final edited value for a concretely defined variable.


Validity error

Measurement begins with the target concept, or ‘the ideal information that is sought about an object’. To obtain this information, we must define a variable or measure that can be observed in practice. Validity error indicates misalignment between the ideal target information and the operational target measure used to collect it.


Measurement error

Once the target measure is defined, we collect actual data values; the values obtained for specific units are the obtained measures. Measurement error is the difference between a unit's obtained measure and the true value of the target measure.


Processing error

The edited measure is the final value recorded in the administrative or survey dataset after any processing, validation, or other checks. These checks might correct errors in the values originally obtained, but they can also introduce new errors of their own – the processing errors.
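
As a minimal sketch of how such checks can themselves introduce error, consider a hypothetical range edit on advertised salaries: it corrects an obvious typo but also overwrites a legitimate extreme value. The thresholds and figures below are invented for illustration.

  # Hypothetical edit rule: force annual salaries into an assumed plausible range.
  PLAUSIBLE_MIN, PLAUSIBLE_MAX = 10_000, 200_000  # assumed thresholds, not real rules

  def edit_salary(value):
      """Return the edited measure and whether the rule changed the obtained value."""
      if value < PLAUSIBLE_MIN:
          return PLAUSIBLE_MIN, True   # eg a monthly figure mistaken for an annual one
      if value > PLAUSIBLE_MAX:
          return PLAUSIBLE_MAX, True   # may destroy a genuine extreme value
      return value, False

  obtained = [35_000, 2_500, 450_000]  # the last value is genuine
  print([edit_salary(v) for v in obtained])
  # [(35000, False), (10000, True), (200000, True)] – the genuine 450,000 is now wrong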


Representation (objects)

This deals with defining and creating ‘objects’ – the basic elements of the population being measured.


Frame error

The target set is similar to the target concept – it is the set of all objects the data producer would ideally have data on. Frame error is the mismatch between this target set and the accessible set, the objects the source can actually reach. An important distinction between the usual statistical concept of ‘units’ and ‘objects’ in this context is that in some administrative datasets the base units could be records of individual events (eg transactions with customers).


Selection error

Many collections have objects in the accessible set that don’t end up in the data. For instance, our accessible set could be all people eligible to vote, but the accessed set, the set we actually obtain information about, includes only people who actually registered on the electoral roll. The missing, unregistered people are a source of selection error.


Missing/redundancy error

The observed set comprises the objects in the final, verified dataset. Most checks on administrative data are likely to remove objects that should not have been in the selected set to begin with (eg someone under 18 trying to enrol to vote); these are selection errors. Errors where the agency itself mistakenly rejects or duplicates objects during processing are fairly rare, but the category exists to keep such errors distinct from reporting-type errors.


Phase 2 errors

Phase 2 of the error framework covers errors arising when existing data is used to produce an output that meets a particular statistical purpose. Often this involves combining different datasets covering different parts of the population, or integrating several datasets into one.


Measurement (variables)


Relevance error

The target concept in phase 2 is similar to that in phase 1 – the ideal information sought about the statistical units. The harmonised measures are the practical measures decided on when designing the statistical output, such as a survey question aligned with a standard classification. In some cases they could be the same measures as in one of the datasets to be combined, but they may also be standardised statistical measures that do not align perfectly with the variables in the original datasets. Relevance error is the misalignment between the target concept and the harmonised measures.


Mapping error

We transform measures in the source datasets into harmonised variable values. The values we assign in this process are called re-classified measures. Practical difficulties encountered in this stage lead to mapping errors.
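
As a hedged sketch of this step, assume free-text occupation titles in a source dataset are re-classified to a standard occupation code (the lookup table and codes below are invented for illustration). Titles that cannot be mapped, or that map to the wrong code, give rise to mapping error.

  # Hypothetical lookup from source job titles to harmonised occupation codes.
  TITLE_TO_CODE = {
      "software developer": "2512",
      "software engineer": "2512",
      "registered nurse": "2221",
  }

  def reclassify(title):
      """Return the re-classified measure, or None if the title cannot be mapped."""
      return TITLE_TO_CODE.get(title.strip().lower())

  titles = ["Software Developer", "Rockstar Ninja Coder", "Registered Nurse"]
  print({t: reclassify(t) for t in titles})
  # 'Rockstar Ninja Coder' -> None: an unmapped title, a candidate mapping error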


Comparability error

Regardless of how the re-classified measures are derived, we may need editing and imputation to obtain consistent outputs; the final values after these processes are our adjusted measures. In addition to the usual imputation for units with missing variables in the source datasets, we may need extra checks to reconcile values that are correct within each individual dataset but disagree with each other for the output measure.
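
As a hedged sketch, suppose two sources report a contract type for the same vacancy, each value defensible under its own dataset's definitions, but the values disagree for the harmonised output. A simple priority rule over sources (an assumed business rule, not part of the framework) produces the adjusted measure.

  # Reconcile conflicting values for the same unit across linked sources.
  SOURCE_PRIORITY = ["survey", "portal_a", "portal_b"]  # assumed order of trust

  def adjusted_measure(values_by_source):
      """Take the value from the highest-priority source that reported one."""
      for source in SOURCE_PRIORITY:
          if values_by_source.get(source) is not None:
              return values_by_source[source]
      return None  # no source reported a value: left for imputation

  unit = {"portal_a": "permanent", "portal_b": "fixed-term"}  # sources disagree
  print(adjusted_measure(unit))  # 'permanent' under the assumed priority order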


Representation

Representation in phase 2 deals with creating a list of statistical units to include in the output data, based on the source data’s objects. This is where the object/unit distinction matters most – the individual datasets may be based on transactions or events that we need to link and then group into newly created statistical units relating to customers, stores, or other entities of interest in the statistical target population.


Coverage error

The target population is fairly familiar from survey statistics – it is the ‘set of statistical units that the statistics should cover’. The linked sets are the units that are connected across the relevant datasets; note that these units will not necessarily be the final statistical units of the output. Coverage error arises where the linked sets miss units in the target population or include units outside it.
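
A minimal linkage sketch, assuming the datasets share a common identifier (real applications often need probabilistic linkage instead); all identifiers and records are invented. Units that fail to link contribute to coverage error.

  # Link two source datasets on an assumed common identifier.
  ads = {"A1": {"title": "nurse"}, "A2": {"title": "developer"}}
  employers = {"A1": {"employer": "Hospital X"}, "A3": {"employer": "Shop Y"}}

  linked = {k: {**ads[k], **employers[k]} for k in ads.keys() & employers.keys()}
  unlinked = (ads.keys() - employers.keys()) | (employers.keys() - ads.keys())

  print(linked)    # {'A1': {'title': 'nurse', 'employer': 'Hospital X'}}
  print(unlinked)  # A2 and A3 fail to link – potential coverage error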


Identification error

Depending on the type of units in a linked data set, we may want to create ‘composite units’, which are made up of one or more ‘base units’. We can consider the aligned sets as a table that records the composite and base units. A failure to correctly align these units would be treated as identification error.
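
The aligned-set table can be sketched as a simple mapping from base units to composite units; here we assume, purely for illustration, that individual advertisements are the base units and an employer's vacancy is the composite unit. Assigning an advertisement to the wrong vacancy would be an identification error.

  from collections import defaultdict

  # Aligned set: each base unit (an advertisement) is assigned to a composite
  # unit (a vacancy). The assignments below are illustrative assumptions.
  base_to_composite = {
      "ad_001": "vacancy_17",
      "ad_002": "vacancy_17",  # duplicate posting of the same vacancy
      "ad_003": "vacancy_18",
  }

  composite_units = defaultdict(list)
  for base, comp in base_to_composite.items():
      composite_units[comp].append(base)

  print(dict(composite_units))
  # {'vacancy_17': ['ad_001', 'ad_002'], 'vacancy_18': ['ad_003']}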


Unit error

The final statistical units in the output dataset could be created from scratch, without a direct correspondence to any of the units in the source datasets. Unit error covers mistakes made in constructing these final statistical units.


Understanding errors arising from modelling

To help understand modelling errors, we consider two types of error that can arise when we use a statistical model to estimate a target variable:

  • Model structure error – the chosen model specification may not capture the real relationship between the variables. For example, we might use a simple linear model to predict one variable from another when the true relationship is non-linear. Common techniques for assessing this type of error include goodness-of-fit tests and residual plots (see the sketch after this list).
  • Parameter uncertainty – there is always some uncertainty in the estimated values of a model's parameters, which we need to measure and propagate through to the final results that rely on the model. Techniques such as bootstrapping or Bayesian estimation are often used; the sketch below also illustrates a simple bootstrap.
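
As a hedged illustration of both points, the sketch below fits a straight line to data generated from an assumed quadratic relationship, reads the lack of fit off the residuals, and bootstraps an interval for the slope. The data-generating model, sample size, and replicate count are all invented for illustration.

  import numpy as np

  rng = np.random.default_rng(0)
  x = np.linspace(0, 4, 200)
  y = 1.0 + 0.5 * x**2 + rng.normal(0, 0.3, x.size)  # true relationship is quadratic

  # Model structure error: fit a straight line, then inspect the residuals.
  slope, intercept = np.polyfit(x, y, 1)
  residuals = y - (intercept + slope * x)
  middle = residuals[(x > 1) & (x < 3)].mean()
  ends = residuals[(x <= 1) | (x >= 3)].mean()
  print("mean residual, middle vs ends:", middle, ends)  # opposite signs: curvature missed

  # Parameter uncertainty: bootstrap a 95% interval for the slope.
  slopes = []
  for _ in range(1000):
      idx = rng.integers(0, x.size, x.size)  # resample (x, y) pairs with replacement
      slopes.append(np.polyfit(x[idx], y[idx], 1)[0])
  low, high = np.percentile(slopes, [2.5, 97.5])
  print(f"bootstrap 95% interval for the slope: [{low:.3f}, {high:.3f}]")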

We also consider whether an overall measure of model uncertainty can be determined. If we have more than one possible model, we might combine the results of the different models to provide an overall measure of uncertainty; Bayesian model averaging is one way of doing this.
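
As one hedged illustration, approximate model weights can be derived from BIC values, with each weight proportional to exp(-BIC/2); this is only a rough stand-in for full Bayesian model averaging, and the BIC values and point estimates below are invented.

  import math

  # Hypothetical BIC values and point estimates for three candidate models.
  bic = {"linear": 512.3, "quadratic": 498.1, "spline": 499.4}
  estimates = {"linear": 10.2, "quadratic": 11.8, "spline": 11.5}

  best = min(bic.values())
  raw = {m: math.exp(-(b - best) / 2) for m, b in bic.items()}  # shift for stability
  total = sum(raw.values())
  weights = {m: r / total for m, r in raw.items()}

  averaged = sum(weights[m] * estimates[m] for m in weights)
  print(weights)
  print(f"model-averaged estimate: {averaged:.2f}")  # pooled across candidate models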