WPJ Overview

From ESSnet Big Data

Objectives

The tourism departments of most European National Statistical Institutes (NSIs) have been confronted in recent years with rapid socio-economic developments, unexpected technological disruptions and cultural shifts. They are often required to identify new tourism phenomena and trends in tourism industry processes, and the relationships between them, in a timely manner. The growing volume of information gives rise to big data and to information systems designed to administer, analyse, aggregate and visualize these data. Using these data can in turn add value to survey and register data. This is the main challenge NSIs face. WPJ will therefore focus on developing a Tourism Information System which should, in the future, become an integral part of official statistics for intelligently monitoring the tourism industry.

The main objective of the package is to address the need for a conceptual framework and to set up a smart pilot Tourism Information System that will support statistical production in the field of tourism by integrating various big data sources with administrative registers and statistical databases using innovative statistical methods.

Specific objectives:

  1. Identification and evaluation of availability and quality of big data obtained with the use of various methods.
  2. Developing methods of combining data and their spatio-temporal disaggregation.
  3. Developing a prototype of a solid production system for tourism statistics.
  4. Recommending which of the identified and improved data sources produce good quality tourism estimates.
  5. Recommending how to set up a smart and robust prototype of Tourism Information System.

Description of work

To analyse and identify potential data sources, information held by external data administrators will be used in order to obtain the widest possible knowledge about habits and travel behaviour of tourists.

Data that can be used as a big data source could be obtained from, among others, web portals offering services for planning, organizing, executing, monitoring and evaluating a tourist trip, as well as from institutions managing urban infrastructure at a destination. Data sources will be selected so that they can also be obtained and used in other European countries.

One of the fundamental methods of obtaining data from web portals is web scraping. A prerequisite for a well-functioning web scraping framework is the selection of appropriate tools and resources. For this purpose, the most popular open-source tools will be analysed. The involvement of a large community of programmers in the development of this type of software will shorten the time needed for implementation and system management. Selecting popular, frequently updated portals will allow the system to be adapted and used quickly in the future. The scraping process will be automated while following best practices: for example, appropriate delays will be set between server queries, and the process itself will run during off-peak hours. Appropriate tests will be prepared to detect changes in the structure of web portals, and automated processes using regular expressions will be set up to check the quality of downloaded data.
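The practices listed above (delays between queries, off-peak execution, structure tests and regular-expression quality checks) could be sketched roughly as follows. This is a minimal illustration and not the project's implementation: the delay, the off-peak window, the HTML markers and the price pattern are all hypothetical values chosen for the example, and no real portal is queried.

```python
import re
import time
from datetime import datetime, time as dtime

# Hypothetical settings; the project text names no concrete values.
MIN_DELAY_SECONDS = 2.0
OFF_PEAK_START = dtime(1, 0)   # 01:00
OFF_PEAK_END = dtime(5, 0)     # 05:00

# Hypothetical markers expected in a portal page, and a simple price format.
REQUIRED_MARKERS = ('class="offer-name"', 'class="offer-price"')
PRICE_PATTERN = re.compile(r"\d{1,6}(\.\d{2})?")


def is_off_peak(now):
    """Return True when `now` falls inside the assumed off-peak window."""
    return OFF_PEAK_START <= now.time() < OFF_PEAK_END


class Throttle:
    """Enforce a minimum delay between consecutive server queries."""

    def __init__(self, min_delay, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Sleep just long enough to respect the minimum delay."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def structure_unchanged(html):
    """Structure test: every expected marker must still be present."""
    return all(marker in html for marker in REQUIRED_MARKERS)


def is_valid_price(text):
    """Quality check: the scraped value must match the price pattern."""
    return PRICE_PATTERN.fullmatch(text.strip()) is not None
```

In use, a scraper would call `Throttle.wait()` before each request, start only when `is_off_peak(datetime.now())` holds, and raise an alert whenever `structure_unchanged` returns False for a downloaded page.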

To show the existence of dependency between collected data and their impact on each other, it is intended to integrate acquired information into a multi-purpose statistical product that gives a complete picture of tourism in a given geographical area. This product will be domain- and country-driven. The combination of big data with existing statistical and administrative registers will help to use big data in the official statistical system more effectively.

Most online sources are available in the same form in all European Union countries. These sources require neither payment nor any work on the part of the data holder in order to share them. Incorporating Internet sources into regular statistical production will enable data collection on a regular basis, which is their great advantage. Some sources are shared by many data holders. If data are temporarily unavailable from one holder, they can be obtained from another, which makes it possible to maintain the continuity of results.

Task 1 – Inventory of big data sources related to tourism statistics

Lead CBS (Statistics Netherlands)

a. web scraping: Internet portals related to offers of the accommodation base, transport, traffic flow in local communication, admission tickets, meteorological data, communications on epidemiological threats and natural disasters, etc.

Lead GUS (Statistics Poland), supported by NSI-BG (Statistics Bulgaria), HSL (Hesse Statistical Office, DE), ELSTAT (Statistics Greece), NSI-SK (Statistics Slovakia), INE (Statistics Portugal)

b. external data administrators: data on water consumption, waste production, energy meters, use and prices of parking meters and car parks, car traffic, data from store transactions, payment cards, mobile data operators etc.

Lead NSI-BG (Statistics Bulgaria), supported by HSL (Hesse Statistical Office, DE), ELSTAT (Statistics Greece), CBS (Statistics Netherlands), GUS (Statistics Poland), INE (Statistics Portugal), NSI-SK (Statistics Slovakia)

c. source characteristics including: variable identification, variable taxonomy, variable possible mappings with our statistical variables, variable ontology (relationships and hierarchies description).

Lead INE (Statistics Portugal), supported by ELSTAT (Statistics Greece), CBS (Statistics Netherlands)

Task 2 – Examining availability, legal aspects and the quality of the new identified data sources used in the project

Performed by NSI-BG (Statistics Bulgaria), supported by ELSTAT (Statistics Greece), NSI-SK (Statistics Slovakia)

Only data sources that were not analysed in the ESSnet Big Data I project will be taken into consideration.

Task 3 – Developing a methodology for combining and disaggregating data from various sources

Lead GUS (Statistics Poland)

a. combining data: The diversity of sources and frequencies of big data, and the possibility of combining them with administrative registers and statistical databases, require the use of innovative statistical methods (Non-Extensive Cross Entropy, Adjusted Spatio-Temporal Disaggregation Model, etc.). This subtask will deliver data on “Expenditure by expenditure categories” and “Average expenditure per trip” broken down by country. These estimates will be compared with existing statistics.

Lead GUS (Statistics Poland), supported by ELSTAT (Statistics Greece), CBS (Statistics Netherlands)

b. spatio-temporal disaggregation of tourism data: A significant part of the data is available with annual or quarterly frequency, hence there is a need to bring this information down to the lowest level of spatial aggregation (villages, districts, housing estates, etc.) and temporal aggregation (months, days, etc.). In this subtask, data on “Number of establishments and bed-places” by NUTS 3 regions will be delivered.

Lead CBS (Statistics Netherlands), supported by ELSTAT (Statistics Greece), GUS (Statistics Poland)
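As a rough illustration of the temporal side of subtask b, the sketch below splits quarterly totals into monthly values in proportion to a monthly indicator series (for instance, counts of scraped accommodation offers). This pro-rata rule is a deliberately simple stand-in and not the Adjusted Spatio-Temporal Disaggregation Model named in this task; the function name and the sample figures are hypothetical.

```python
def prorata_disaggregate(quarterly_totals, monthly_indicator):
    """Split each quarterly total across its three months.

    Each month receives a share of the quarterly total equal to its share
    of the indicator within that quarter, so the monthly values always add
    back up to the published quarterly figure.
    """
    if len(monthly_indicator) != 3 * len(quarterly_totals):
        raise ValueError("need exactly three indicator values per quarter")

    monthly = []
    for q, total in enumerate(quarterly_totals):
        block = monthly_indicator[3 * q:3 * q + 3]
        block_sum = sum(block)
        if block_sum <= 0:
            raise ValueError("indicator must be positive within each quarter")
        monthly.extend(total * value / block_sum for value in block)
    return monthly


# Hypothetical example: one quarterly total and a monthly indicator.
months = prorata_disaggregate([600.0], [1, 2, 3])
```

The same proportional idea carries over to space: a regional total can be shared out across smaller areas using an areal indicator instead of a monthly one.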

Task 4 – Flash estimates in the field of tourism

Lead GUS (Statistics Poland), supported by ELSTAT (Statistics Greece), CBS (Statistics Netherlands)

Data in the field of tourism statistics are provided with a relatively long delay (in Poland, monthly data on the accommodation base are available at T + 45). Flash estimates in the field of tourism would respond to the growing demand from stakeholders for information on the rapidly changing situation on the tourism market. One potential application of the collected data (offers of the accommodation base, among others) will be to check whether they can be combined with data from the sample survey on accommodation to determine future movements of tourists and their expenses. In this task, flash estimates of monthly data on “Nights spent at tourist accommodation establishments” will be obtained. Flash estimates will be compared with existing statistics.
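One way such a flash estimate could be produced is a bridge equation, one of the approaches named in the project implementation section: a regression that links the late-arriving official figure to an indicator available much earlier. The sketch below fits a one-variable bridge equation by ordinary least squares. It is an assumption-laden illustration; the choice of a single indicator, the function names and the numbers are all hypothetical.

```python
def fit_bridge(indicator, target):
    """Fit target = intercept + slope * indicator by ordinary least squares.

    `indicator` holds past monthly values of an early signal (e.g. scraped
    offer counts); `target` holds the official figures for the same months.
    """
    n = len(indicator)
    if n != len(target) or n < 2:
        raise ValueError("need two or more paired observations")

    mean_x = sum(indicator) / n
    mean_y = sum(target) / n
    sxx = sum((x - mean_x) ** 2 for x in indicator)
    if sxx == 0:
        raise ValueError("indicator has no variation")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(indicator, target))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def flash_estimate(model, new_indicator):
    """Predict the not-yet-published figure from the latest indicator."""
    intercept, slope = model
    return intercept + slope * new_indicator


# Hypothetical history: indicator values and official nights spent.
model = fit_bridge([1.0, 2.0, 3.0, 4.0], [12.0, 14.0, 16.0, 18.0])
estimate = flash_estimate(model, 5.0)
```

Comparing such estimates with the official figures once they are published, as the task description foresees, gives a direct measure of the revision error.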

Task 5 – Use of big data sources and developed methodology to improve the quality of data in various statistical areas

Lead CBS (Statistics Netherlands)

a. verification of estimations of the size of tourist traffic according to travel directions, means of transport, types of accommodation.

Lead CBS (Statistics Netherlands), supported by NSI-BG (Statistics Bulgaria), GUS (Statistics Poland)

b. verification of estimation of the amount of expenses related to trips: Numerous Internet portals allow for an ongoing tracking of changes in the prices of air tickets, railway tickets, hotel prices, public transport costs, fuel, meals and admission tickets. Their combination with data obtained from mobile network operators will allow precise estimation of tourist consumption (expenses) and thus will improve the quality of balance of payments data.

Lead GUS (Statistics Poland), supported by CBS (Statistics Netherlands), HSL (Hesse Statistical Office, DE)

c. modification of Tourism Satellite Account based on the Tourism Information System.

Lead GUS (Statistics Poland), supported by ISTAT (Statistics Italy), NSI-SK (Statistics Slovakia)

Task 6 – Summary, problems encountered plus future perspectives

Lead GUS (Statistics Poland), supported by NSI-BG (Statistics Bulgaria), HSL (Hesse Statistical Office, DE), ELSTAT (Statistics Greece), CBS (Statistics Netherlands), INE (Statistics Portugal), NSI-SK (Statistics Slovakia)

Suggest pilots and domains with successful implementation potential for further elaboration. Description of problems encountered and possible ways of dealing with them. Recommendation on legal aspects; availability and sustainability; methodology; quality; technical requirements that will support statistical production in the field of tourism.

Project implementation

When one deals with multiple big data sources, one usually encounters different levels of completeness, detail as well as a lack of data structure. Each source will be analysed by exploratory analysis methods (Cronbach’s alpha and KR-20 for continuous and dichotomous variables, respectively) to determine the consistency of the set and relations between variables. Data gaps will be imputed. Model-based imputation methods (e.g. Support Vector Regression) will be more appropriate for big data than classical imputation methods. There is also a necessity to use methods that select an optimal set of variables for the model, such as Principal Component Analysis.
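For reference, the consistency measure mentioned above can be computed as in the sketch below. Cronbach's alpha for k variables is k/(k-1) multiplied by one minus the ratio of the summed item variances to the variance of the row totals; for dichotomous (0/1) variables the same formula yields KR-20. The function name and sample data are hypothetical, and the sketch assumes complete data with no gaps.

```python
from statistics import variance


def cronbach_alpha(items):
    """Compute Cronbach's alpha.

    `items` is a list of columns, one list of scores per variable, all of
    equal length. With 0/1 scores the result coincides with KR-20.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need two or more variables")
    n = len(items[0])
    if n < 2 or any(len(column) != n for column in items):
        raise ValueError("columns must have equal length of 2 or more")

    # Sum of the variances of the individual variables.
    item_variance_sum = sum(variance(column) for column in items)

    # Variance of the per-observation totals across all variables.
    totals = [sum(row) for row in zip(*items)]
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("row totals have no variation")

    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical data: two perfectly consistent variables.
alpha = cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]])
```

Values close to 1 indicate that the variables move together; values near 0 indicate that they share little common variation.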

The lack of sampling frames does not allow the use of classical methods of direct estimation. For this reason, non-classical estimators or statistical models will be used. Combining several sources is therefore crucial to obtaining a better picture of the tourism domain (e.g. Non-Extensive Cross Entropy Econometrics will be investigated as a tool for combining multiple sources).

Disaggregation and flash estimates in tourism require a non-standard approach (e.g. the Adjusted Spatio-Temporal method, bridge equations and time series models can be used).

The basic challenge associated with the design and implementation of big data technology is to determine the area of application. The processing of big data is performed using real-time transactional analytics or a comprehensive analysis of archived data. Selection of the area determines the optimal architecture that should be used. Examples of such architecture are local infrastructure, infrastructure located in the cloud and hybrid infrastructure, i.e. connecting the two previous ones.

Throughout the pilot, various solutions will be used to collect, analyse and process obtained information. During the collection and processing of data particular emphasis will be placed on adequate data security.

At the midpoint of the project, an interim report will be prepared, showing the activities implemented, the level of work progress, preliminary results and a general description of the methods applied. The final report will summarise the work carried out. It will also contain the final results, the method of process replication and a full description of the applied methodology. An important part of the final report will be a methodological module, in which innovative methods of combining big data with administrative and statistical data will be described. The technical module will present the prototype of the IT infrastructure, of which the automation of processes will be an important element. Additionally, the report will describe the problems encountered and the ways of solving them.

An attempt will be made to identify statistical areas in which the results of the system operation can be used, as well as potential beneficiaries from outside the official statistics (people and organizations) at the micro, meso and macro level.

The following principles will be applied in respect of each of the WP meetings:

  • to be provided one week before the meeting: one draft report per task, mentioning obtained results and challenges for the next period;
  • during the meetings: progress will be discussed and tasks might be (slightly) adjusted, especially in the case of interference between the tasks;
  • to be provided two weeks after the meeting: a deliverable containing conclusions from the meeting + reports per task.

Timeliness and scope of work will be monitored on the basis of monthly work time records submitted by individual team members, both directly by the project manager and by relevant organizational units of the Statistical Office in Rzeszów that do not participate in the project. Regular WebEx meetings with the countries involved in WPJ will also take place in order to prevent possible threats to the timely completion of the work. In addition, work in the project will be carried out by specialists who already have experience of working with administrative sources and big data from participation in other projects.

Milestones and deliverables

See here for an overview of available milestones and deliverables.

WPJ milestones

  IM2   Report on the WP meeting mid-2019   Month 9
  IM2   Report on the WP meeting mid-2020   Month 20

WPJ deliverables

  J1  ESSnet Methods for webscraping, data processing and analyses   Month 6  
  J2  Interim technical report showing the preliminary results and a general description of the methods used   Month 12  
  J3  Methodological framework report   Month 16  
  J4  Technical Report containing a prototype of IT infrastructure   Month 20  Important elements will include templates and optimal methods of data analysis as well as a description of the automation of the most important processes
  J5  Final Report containing final results and a full description of the methodology used   Month 24