WPI Overview

From ESSnet Big Data

This webpage describes in some detail the objectives, tasks, deliverables and timing of workpackage I (WPI) on mobile networks data of ESSnet Big Data.

Objectives

The main objective of WPI Mobile networks data is to continue building the production framework initiated in the previous ESSnet. A modular rather than a linear approach will be followed, both to avoid blocking the research because of obstacles in accessing real data and to develop the different elements of the production framework optimally.

The modular approach will enable the project to concentrate on many different aspects of the framework, such as the data access itself and the relationship with Mobile Network Operators (MNOs). Other important aspects are the methodological and IT elements (thus also enabling the identification of necessary skills and capabilities), quality issues, proposals for standards and related metadata, and dissemination aspects of statistical products (mainly visualisation, thus orienting the outputs towards stakeholders). As a novelty, instrumental (semi-simulated) data closely resembling real data will be used in this line of work when real data are not accessible.

All in all, the ultimate goal is to push forward an ESS production framework with mobile phone data aiming at a standardised statistical production process.

Description of work

Before describing the planned work, it is important to state explicitly the difference between the two kinds of data sets the WP aims to process: microdata and aggregate data. The former are data at the mobile device level, coming either from Call Detail Records (CDRs) or from signalling data. The latter are aggregations of microdata (e.g. the number of detected mobile devices) in each territorial cell and in a given time period. These data sets, according to our experience, will have to be complemented with technical information about the network (antennae position and others).
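
As an illustration of the distinction, the following minimal sketch aggregates device-level records into counts of detected devices per territorial cell and time period. The record layout (device, cell, timestamp), the hourly period and the sample values are assumptions made for the example only; they are not a format defined by the WP.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical microdata: one record per network event (device, cell, timestamp).
events = [
    ("dev_a", "cell_1", "2019-03-01T08:05:00"),
    ("dev_a", "cell_1", "2019-03-01T08:40:00"),
    ("dev_b", "cell_1", "2019-03-01T08:15:00"),
    ("dev_b", "cell_2", "2019-03-01T09:10:00"),
    ("dev_c", "cell_2", "2019-03-01T09:30:00"),
]

def aggregate_devices(events):
    """Count distinct devices detected per (cell, hourly period)."""
    seen = defaultdict(set)
    for device_id, cell_id, timestamp in events:
        # Truncate the timestamp to the start of its hour (assumed period length).
        period = datetime.fromisoformat(timestamp).replace(minute=0, second=0)
        seen[(cell_id, period.isoformat())].add(device_id)
    return {key: len(devices) for key, devices in seen.items()}

aggregates = aggregate_devices(events)
```

Counting distinct devices rather than events keeps a device that generates several events in the same cell and period from being counted more than once.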

The work plan for this WP builds on the results already achieved in WP5 Mobile phone data of the former ESSnet on Big Data. As stated above, a modular approach will be followed. It is proposed to develop 8 tracks. They are not completely independent, but play a clear distinctive role in the overall strategy. They are:

  1. Access to data and NSI-MNO partnership models;
  2. Synthetic data generation;
  3. Methodology;
  4. Information technologies (IT);
  5. Standards and metadata;
  6. Quality;
  7. Application on real data;
  8. Visualisation.

These tracks have been structured in activities. Complementarily, since 7+2 countries will be partners in this WP, various roles have been profiled to make the project manageable: two participating roles and five different research roles/profiles in these tasks. In this way each partner can make their optimal contribution according to their staff and to their professional/academic background.

Tasks

The foreseen contents of each track are provided in detail below.

Track 1 - Access to data and NSI-MNO partnership models

Access to data is still one of the major goals for this sort of data. Negotiations and agreements with MNOs must still be pursued, and establishing the conditions under which access will be granted is a clear objective for the project and the whole ESS. As analysed in the former ESSnet, there are a number of issues to be overcome (legal issues, extraction and preprocessing costs, risk perception…). One of the main conclusions of the former experience in the ESS is that MNOs will inevitably be part of the statistical production process in the initial stages of data collection and data preprocessing. This is clearly in line with the Reference Methodological Framework for Mobile Network Data proposed and promoted by Eurostat.

Two general activities are planned in this track:

  • Scenarios: a compilation of the different potential scenarios (third-party, in-house,…) is to be offered to provide the ESS with an overview of the different possibilities. This compilation should be based on the experience gained during the contact with MNOs and a list of a priori pros and cons must be constructed.
  • Framework: this activity revolves around the development of the access framework to be pursued according to the Reference Methodological Framework by Eurostat. This entails the involvement of MNOs in the initial data collection and data preprocessing stages of the statistical production process. Special emphasis will be placed on the economics of this collaboration between NSIs and MNOs, which will be better understood as agreements are reached.

Given the experience gained in the first ESSnet on Big Data and the ongoing contacts that some partners already have with MNOs, making the conditions for access explicit and shareable with all ESS partners is key, especially underlining the potential benefits for MNOs (statistical knowledge, higher-quality statistical products, corporate social responsibility,…). University experts will be contacted to produce an updated report on all these aspects with special emphasis on the inclusion of MNOs as part of the official statistical process. This report will be complemented with the empirical experience gained in the contact with MNOs.

Taken together, this information will offer a view on the different partnership models between NSIs and MNOs. By partnership model we mean a collaborative association between NSIs and MNOs for the production of official statistics in which several aspects are agreed simultaneously:

  • Sharing of knowledge, both technological and statistical, in order to achieve high-quality standards in the statistical products. This sharing should respect intellectual property rights and industrial secrecy conditions so as not to impinge negatively on the competitiveness of the telecommunication market (undue transfer of know-how among MNOs and loss of competitive position).
  • Sharing of production activities, so that delimited roles in the production process are commonly agreed by NSIs and MNOs. This may entail an analysis and an agreement on the associated production costs (costs should be attached to production activities and never to data themselves).
  • An agreement on the scopes of statistical products produced from mobile network data. This aims at delimiting private and public scopes in order to avoid conflicts of interest in the production and dissemination of mobile phone data-based statistical products.

Track 2 - Synthetic data generation

This is an intermediate instrumental task supporting steps needed in other parts of the research. Firstly, these synthetic data, with a structure and format similar to real data (taking advantage of our knowledge of them), will play the role of real data while access to the latter has not been obtained. This will avoid blocking the research because of the lack of data. Secondly, these data will assist in the assessment of the economics regarding data access and its many aspects, such as the analysis of costs/benefits for MNOs in granting access to different types of data or the technological issues regarding the preprocessing of these different types of data. Thirdly, they are necessary to assess statistical models in the inference stage or, in the case of microdata simulation, to assess geolocation routines or space-time interpolation methods (see below). In particular, quality assessment (in terms of bias, variance, and model goodness of fit) will clearly benefit from these synthetic data. Finally, they will also provide an extraordinary framework for reproducibility of the results.

These synthetic (semi-simulated) data are not the target per se of the research. They will play the role of real data whenever real data are not available, apart from the quality assessment and reproducibility framework (where they are needed). Simulating whole population values will be a central part of this task; otherwise the instrumental value will be very low. The need for both methodology and IT tools is thus clear. We plan three activities in this track:

  • Methodology: a methodological framework to produce synthetic mobile network operator data is needed. The aim is not to produce single datasets but to construct a tool generating as many datasets as needed according to different scenarios and assumptions (with/without antenna location, CDRs/signalling data, with variables such as signal strength/Timing Advance, multiple SIMs for the same subscribers…). We aim at simulating both aggregate and micro data, i.e. at the level of different territorial cells and at the level of individual mobile devices. The latter is indeed much more involved since many factors come into play (simulating trajectories and their detection in a map, etc.). We shall begin with a simplified scenario enabling us to produce synthetic aggregate data just by a direct aggregation procedure and then we shall introduce more complex elements to make the simulation more realistic.
  • IT: the former methodology must be implemented in concrete software tools to generate the datasets. The infrastructure needed for this exercise together with software libraries and tools must be investigated and constructed.
  • Generation: finally, once the statistical methodology and IT tools are in place, datasets must be generated accordingly, ready to be used in other activities of the project. They must be free and openly available to the whole ESS. For this task, resources in the form of a distributed computing platform will be needed. Eurostat is working on providing facilities in this direction, which will be used in the project.
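
The simplified starting scenario described in the Methodology activity above could look roughly like the sketch below: devices move between territorial cells, each device is detected with some probability, and aggregate data are obtained by direct aggregation of the simulated microdata. The function name, its parameters, the one-dimensional arrangement of cells and the random-walk movement are all illustrative assumptions, not the WP's actual simulator.

```python
import random
from collections import Counter

def generate_synthetic(n_devices, n_cells, n_periods, detect_prob=1.0, seed=0):
    """Return (microdata, aggregates) for one simplified, hypothetical scenario.

    microdata: list of (device_id, period, cell) for detected devices.
    aggregates: Counter mapping (period, cell) -> number of detected devices.
    """
    rng = random.Random(seed)  # fixed seed so that a scenario is reproducible
    position = [rng.randrange(n_cells) for _ in range(n_devices)]
    microdata = []
    for period in range(n_periods):
        for device_id in range(n_devices):
            # Assumed movement model: stay or move to a neighbouring cell,
            # with cells arranged on a line and clamped at both ends.
            step = rng.choice((-1, 0, 1))
            position[device_id] = min(n_cells - 1, max(0, position[device_id] + step))
            if rng.random() < detect_prob:
                microdata.append((device_id, period, position[device_id]))
    # Direct aggregation: each device appears at most once per period.
    aggregates = Counter((period, cell) for _, period, cell in microdata)
    return microdata, aggregates

microdata, aggregates = generate_synthetic(n_devices=50, n_cells=5, n_periods=3)
```

Calling the function with different parameter values and seeds yields as many datasets as needed, and the simulated positions provide a known ground truth against which later steps can be checked.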

Track 3 - Methodology

As a result of the previous ESSnet, several key methodological points were identified as needing further (indeed constant) research. In line with the Reference Methodological Framework cited above, the two-phase life-cycle model used in administrative data was proposed in the deliverables of the former ESSnet as a tool to understand the generation of these data as well as to identify error sources. More recent proposals in the realm of admin data are also suitable for this framework under this view. This must provide an overall understanding of the whole production process. Additionally, the geolocation of events (thus the spatial attributes of the statistical units) and the inferential stage (going from data to the target population) are two key steps in this process. Thus, the WP plans three activities:

  • Framework: as described, by using the two-phase life-cycle model all potential error sources in the process must be identified. This will be essential not only for the methodological proposals regarding the inference stage, but to assess the quality of the final product (inasmuch as non-sampling errors play a fundamental role in survey data quality).
  • Geolocation of events: the assignment of spatial attributes to events, hence to statistical units, is essential for the utility and quality of these data. In this activity the key idea is the overlapping character of the geographical cells covering the territory. Techniques already exploited in deliverables of the former ESSnet, such as BSA, should be explored further to allow for the overlapping character of the cells (Best N-Service Area). Instrumental (semi-simulated) data will be used in this task when needed, with priority given to real data.
  • Inference: once a set of microdata is available, the inferential step connecting these to the target population must be undertaken. Firstly, the target population in the microdata set (general population, inbound tourists, outbound tourists…) must be identified. Once the target population is identified, the estimates must be constructed together with their accuracy assessment (coefficients of variation, confidence intervals…). In this accuracy assessment, the instrumental (semi-simulated) data will be used so that the application on real data (see below) can be assessed along similar lines. This task will serve as an input for WPK on methodology and quality, since it potentially offers the possibility to add a layer of abstraction so that the connection between Big Data sets and target populations can be inferred in a rigorous way.
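
To make the role of overlapping cells in geolocation concrete, here is a minimal sketch that turns the cell of an event into a probability distribution over small tiles of the territory using Bayes' rule. The coverage map, the relative signal strengths and the uniform prior are hypothetical inputs chosen for the example; this is one possible approach and is not claimed to be the method implemented in mobloc or in the former ESSnet deliverables.

```python
def location_posterior(coverage, cell_id, prior=None):
    """Probability of each tile given that an event was registered in cell_id.

    coverage: dict cell -> dict tile -> relative signal strength (> 0).
    Cells may overlap, i.e. a tile can appear under several cells.
    """
    tiles = {tile for strengths in coverage.values() for tile in strengths}
    if prior is None:
        # Assumption: without further information all tiles are equally likely.
        prior = {tile: 1.0 / len(tiles) for tile in tiles}
    # Total strength received at each tile from all cells covering it.
    total = {tile: sum(s.get(tile, 0.0) for s in coverage.values()) for tile in tiles}
    # Bayes: P(tile | cell) is proportional to P(cell | tile) * P(tile), where
    # P(cell | tile) is taken as the cell's share of the strength at that tile.
    unnormalised = {
        tile: prior[tile] * strength / total[tile]
        for tile, strength in coverage[cell_id].items()
    }
    norm = sum(unnormalised.values())
    return {tile: p / norm for tile, p in unnormalised.items()}

# Two hypothetical cells overlapping on tile "t2".
coverage = {
    "A": {"t1": 1.0, "t2": 1.0},
    "B": {"t2": 1.0, "t3": 1.0},
}
posterior_a = location_posterior(coverage, "A")
```

In this toy example a device seen in cell A is twice as likely to be in tile t1 as in the shared tile t2, because a device located in t2 could equally have connected to cell B.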

Track 4 - Information technologies

The production of software solutions is an explicit goal of the WP. To provide a sufficient number of IT tools, it should be possible to develop them in parallel. A modular approach will be followed, providing solutions for the different elements of the production framework developed in the track on methodology (software solutions for quality issues are deferred to later tracks). Based on the experience with the former ESSnet, it should be underlined that these solutions are usually better developed along with the methodology. The following activities are planned:

  • Geolocation -- mobloc: the WP identifies the different aspects for the continuation of the mobloc package already developed in the former ESSnet.
  • Inference -- pestim: continuation of the pestim package already developed in the former ESSnet.

There is a critical issue in the development of the IT solutions, namely scalability. This depends strongly on the concrete data sets available, either instrumental or real, and scalability will be approached accordingly.

Notice that the combination of statistical methodology and software solutions will enable the ESSnet to identify necessary skills and capabilities to integrate this new data source in the standard production process at NSIs, thus providing input to European programmes like the European Statistical Training Programme (ESTP).

Track 5 - Standards and metadata

The production of official statistics with mobile network data is a new process in the ESS. Thus, it requires new standards and metadata. Furthermore, the modular approach in different steps of the process requires that these modules interact with each other in a transparent way to make them functionally modular. This is the role of standards and metadata, gluing all pieces together. Three activities are planned:

  • Structural metadata: standard definitions of the variables involved in the process need to be provided. A first tentative glossary of terms for mobile network data (MND) is needed. Also, special attention must be paid to both the definitions and the formats of the core variables (ID, space and time). This should be completed with the rest of the variables.
  • Standards: for a modular approach to be successful, suitable interfaces allowing the modules to interact transparently need to be provided. The plan is to concentrate on the exchange file formats and the process interfaces.
  • Integration: all production elements (methodology and IT) will be integrated to configure an end-to-end process, i.e. from raw telco data collection to the final population estimates. This should be the high-level output of this work package.
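
As a purely illustrative example of what a definition and format for the core variables (ID, space and time) might involve, the sketch below declares a record type with basic validation and a CSV exchange format. The field names, the ISO 8601 timestamp convention and the CSV layout are assumptions for the example; the actual standards are an output of this track.

```python
import csv
import io
from dataclasses import astuple, dataclass, fields
from datetime import datetime

@dataclass(frozen=True)
class NetworkEvent:
    """Hypothetical record holding the three core variables."""
    device_id: str  # ID: pseudonymised device identifier
    cell_id: str    # space: identifier of the cell where the event was registered
    timestamp: str  # time: ISO 8601 string, e.g. "2019-03-01T08:05:00"

    def __post_init__(self):
        if not self.device_id or not self.cell_id:
            raise ValueError("device_id and cell_id must be non-empty")
        datetime.fromisoformat(self.timestamp)  # raises ValueError if malformed

def to_csv(events):
    """Serialise events to CSV text with a fixed header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([field.name for field in fields(NetworkEvent)])
    writer.writerows(astuple(event) for event in events)
    return buffer.getvalue()

def from_csv(text):
    """Parse CSV text back into validated NetworkEvent records."""
    return [NetworkEvent(**row) for row in csv.DictReader(io.StringIO(text))]

sample = [NetworkEvent("dev_a", "cell_1", "2019-03-01T08:05:00")]
round_trip = from_csv(to_csv(sample))
```

Validating records where modules exchange data is one way to let modules developed in parallel interact through an agreed file format rather than through each other's internals.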

In the development of standards, the aim must be to provide not only a set of concrete standards but also the process of updating them according to the ever-changing technology behind these data.

Track 6 - Quality

In this track, both methodology and IT aspects are to be treated in a combined way to focus on quality aspects. The ultimate goal in this track is basically to address the quality of the process as a whole and of each element. The following activities are planned, based on the preceding elements:

  • Geolocation of network events: quality measures need to be provided for the assignment of variables in the data model.
  • Inference: both the quality of the identification of the target population in the data sets and the inferential step need to be addressed. This activity will also be potentially used as an input for WPK on methodology and quality by adding a layer of abstraction so that this proposal can be used with other big data sources. Instrumental (semi-simulated) data will be used in this activity to assess the performance of quality indicators.
  • Framework: when all pieces are glued together to produce final estimates, their quality needs to be addressed.
  • Software implementation: all the preceding quality indicators must be implemented in software solutions. It is not necessary to provide separate libraries or packages for them, since they can be incorporated into the IT tools provided in the IT track.
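
The use of instrumental data to assess quality indicators can be sketched as follows: because the true population is known in a simulation, the bias, variance and coefficient of variation of an estimator can be measured directly over repeated replications. The estimator used here (observed devices divided by an assumed known market share) and all parameter values are hypothetical simplifications, not the estimators of pestim.

```python
import random
import statistics

def assess_estimator(true_population, market_share, replications=200, seed=1):
    """Monte Carlo assessment of a simple scaling estimator on simulated data."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replications):
        # Each individual is observed by the MNO with probability market_share.
        observed = sum(rng.random() < market_share for _ in range(true_population))
        # Assumed estimator: scale the observed count by the known market share.
        estimates.append(observed / market_share)
    mean_estimate = statistics.mean(estimates)
    std_dev = statistics.stdev(estimates)
    return {
        "bias": mean_estimate - true_population,
        "variance": std_dev ** 2,
        "cv": std_dev / mean_estimate,
    }

indicators = assess_estimator(true_population=1000, market_share=0.3)
```

With real data the true population is unknown, so indicators of this kind can only be checked in this way on simulated data and then applied along similar lines to real data.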

Track 7 - Application on real data

This track shall concentrate on the application of all previous work to real data. As activities, in a natural way, the WP proposes the application of each element of the production framework to the set of real data under analysis. Thus, the following activities are planned:

  • Real data: the proposals are applied to real data according to preceding tasks (geolocation, inference, quality measures, and assessment of final estimates).
  • Iteration: according to the results obtained by experiments with real data some parts of the methodology or the software implementation could be improved.

Two statistical domains stand out as natural applications of this data source: population dynamics and tourism statistics. The concrete choice will ultimately depend on the agreements reached with the MNOs and data availability.

Finally, two important issues are considered in order to cover a wide range of possibilities:

a) A contingency plan should be put in place in case access to real data is not achieved. It is proposed to exploit as much as possible those real data sets already in use by partners in this WP. Findings in the current project will be applied to these data. If further data are accessed, they will also be exploited. In all cases, synthetic data will be used instrumentally when real data are lacking.

b) The scenario with real data from multiple MNOs will be considered only minimally, since this is a fairly unrealistic situation. The methodological, technological, and quality-related findings for a single MNO will be explored in a preliminary way for a standardised application to many MNOs (it is foreseen to propose standard procedures for data preprocessing and microdata model generation). Regarding the inference exercise, some considerations will be made to make the statistical models applicable in both the single- and multiple-MNO settings. However, the priority will be to set up the different process steps in the single-MNO scenario.

Track 8 - Visualisation

The final track focuses on the dissemination of results and the orientation of the whole framework, especially for partners in the ESS and stakeholders in general. Every stage in the preceding production framework must be completed with a visualisation layer. Moreover, the website on GitHub devoted to the software tools developed during the previous ESSnet is planned to be slightly refurbished and completed so that it offers a view of the whole process and guides the user through the different elements. The following two activities are planned:

  • Geolocation/Inference: this activity gathers the visualisation layers for both the geolocation and inference stages, which must be then made homogeneous.
  • Dissemination: this activity must be clearly user-oriented aiming at showing a complete view of the whole production process.

Roles

Given the number of partners in this WP, the coordination task is large; to make it manageable, two participating roles are defined. On the one hand, there will be a core of members (5 partners: ES, FR, NL, IT, RO) taking responsibility for developing the bulk of the research. They will have to execute most of the tasks described in this work plan. Access to real data is not essential to play this role, only the commitment to undertake the execution of a large number of tasks. In this way, the research output will be more homogeneous. On the other hand, a second group (EE, DE) will be kept informed as needed and take part in very specific tasks. In particular, the application of the different developments to real and/or simulated data will be a central task for them. In addition, a collaboration with other NSIs (UK, IE, BE) will be established in different ways, since they are carrying out some specific actions with mobile phone data and their contribution will be of interest. The expected result will be a common production framework widespread across the ESS.

Within each of these two major participating roles, we envisage five professional/academic roles in the project. They are of course an idealization because in many concrete tasks it is virtually impossible to detach one role from another (e.g. methodological work from software development work, quality assessment from metadata construction, …). Thus, they should be understood as a generic description of the main professional/academic background needed to conduct the corresponding task(s). The following roles are defined:

  • Statistician/Survey manager/Domain expert/Business manager
This is indeed a mix of those staff members devoted to managing surveys, with a deep knowledge of some domain (although this is not compulsory), whose main function is to direct and manage the statistical production process from data collection to dissemination. This role will be central in the tracks on access (Track 1) and application on real data (Track 7). It will also be needed in concrete tasks in the other tracks.
  • Methodologist
This will be the role taking charge of the more mathematical content of the project. It will be the leading role in the track on methodology although clearly assisted by software developers (see below). Complementarily, this role will need to assist IT experts in the development of software (see below).
  • IT expert/Software developer
Two slightly different sub-roles can be identified. On the one hand, the IT infrastructure for access and for computing will be a central issue in many parts of the project. A deep knowledge of high-performance computer systems will be needed. On the other hand, the development of software is an explicit request for this project. The methodology must be implemented in concrete software solutions. This role will take charge of these aspects.
  • Quality/Metadata/Standards expert
This role comprises experts on quality, metadata, and statistical standards, which will be needed in the tracks on the development of standards (track 5) and on quality (track 6). The use of new data sources with new methodology and new IT tools is an opportunity for the ESS to standardise the statistical production process from the very beginning in using these new data. A knowledge of international statistical standards is clearly needed.
  • Visualisation/Dissemination expert
This role will take care of providing the visualisation layers for the different elements in the production framework as well as organising the website on GitHub presenting the results (code, examples, …). GIS knowledge and data visualisation expertise are required to develop these software tools. This role will work in tight coordination with the general work package on dissemination of the project. Notice that according to the responsibilities of each participating role, the statistician profile will mostly correspond to the non-core partners group.

Finally, notice that all these roles correspond to traditional roles within NSIs. As a result of the work in this WP and in agreement with the partnership models investigated in the first track, the need for a role specialised in telecommunication engineering will be explored (currently, no such role exists in the NSIs). Certainly, this knowledge will be needed, but the issue is whether this role should be part of NSIs or whether the partnerships will suffice for this.

Milestones and deliverables

See here for an overview of available milestones and deliverables.

WPI milestones

  IM1   Report on the 1st WP meeting 2019   Month 6
  IM2   Report on the 2nd WP meeting 2019   Month 9
  IM3   Report on the 3rd WP meeting 2019   Month 12
  IM4   Report on the WP meeting mid-2020   Month 20

WPI deliverables

  I1   Access to mobile network data: updated overview   Month 20
  I2   Generation of instrumental (semi-simulated) mobile networks data   Month 12
  I3   Methodological framework for the production of official statistics with mobile networks data   Month 15
  I4   Some IT tools for the production of official statistics with mobile networks data   Month 18
  I5   First proposed standards and metadata for the production of official statistics with mobile networks data   Month 16
  I6   Quality issues in the production of official statistics with mobile networks data   Month 21
  I7   Some experimental results with mobile networks data   Month 24
  I8   Visualisation tools for the production of official statistics with mobile networks data   Month 24

Since the approach will be modular, there is not a strong linear dependence in the composition of these documents. However, results from one track will be useful for others and this will be reflected in the development of the work (hence in the production of the deliverables). In this line, some parts of these documents will be produced independently of other tracks while others will wait for key results in the latter. The plan is to publish every independent part of the deliverables as draft versions to be completed later on according to the progress of the work. This is the detailed plan:

  Deliverable/Track   Independent content   Final   Partners
  I1   Month 4 (Scenarios)   Month 20 (Framework: depending on ongoing negotiations)   INE (Statistics Spain)[1], Statistics Estonia, INSEE (Statistics France)
  I2   -   Month 12   INE (Statistics Spain), INS (Statistics Romania)
  I3   Month 3 (Two-phase Life-cycle Model)   Month 15 (Geolocation of network events & Inference)   INE (Statistics Spain), INSEE (Statistics France), ISTAT (Statistics Italy), CBS (Statistics Netherlands), INS (Statistics Romania)
  I4   -   Month 18 (Possible posterior iterations to be included)   INE (Statistics Spain)[1], CBS (Statistics Netherlands), INS (Statistics Romania)
  I5   Month 9 (Glossary of terms, metadata for core and extra variables, and file format)   Month 16 (Process interfaces to integrate all elements in the production framework)   INE (Statistics Spain), ISTAT (Statistics Italy)
  I6   Month 16 (Issues for geolocation of events and inference)   Month 21 (Including software development)   INE (Statistics Spain), Destatis (Statistics Germany), ISTAT (Statistics Italy), INS (Statistics Romania)
  I7   Month 17 (Application to simulated data)   Month 24 (Application to real data)   INE (Statistics Spain)[1], Destatis (Statistics Germany), INSEE (Statistics France)
  I8   Month 19 (Geolocation of events & inference)   Month 24   INE (Statistics Spain)[1], INSEE (Statistics France), CBS (Statistics Netherlands), INS (Statistics Romania)

  [1] INE participating in this track mainly as coordinator.