Description of the Workpackage
The purpose of this pilot is to investigate whether webscraping, text mining and inference techniques can be used to collect, process and improve general information about enterprises. Compared with the “webscraping / job vacancies” pilot, the challenges are the more massive scraping of websites and the collection and analysis of more unstructured data.
In particular, the aim is twofold:
- to demonstrate whether business registers can be improved by using webscraping techniques and by applying model-based approaches in order to predict for each enterprise the values of some key variables;
- to verify whether statistical outputs can be produced from predicted data, whether or not combined with other sources of data (survey or administrative data); in particular, the estimates produced by the survey on “ICT use by enterprises”, a survey common to EU Member States, might serve as a set of benchmark estimates.
Some specific use cases of interest concern the estimation, from enterprise websites, of the following information:
- whether an enterprise performs e-commerce or not
- whether an enterprise manages job vacancies on its site
- presence in social media
- contact information: location, contact emails, etc.
- profiling information: type of activity, links with other enterprises, etc.
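To make the use cases above more concrete, the following is a minimal Python sketch of how naive indicators could be derived from the content of a single page. The keyword lists, the list of social media domains and the function name `extract_indicators` are illustrative assumptions, not part of the work package; a real pilot would need language-specific terms and, as described in the later tasks, a model-based predictor rather than fixed rules.

```python
import re

# Hypothetical patterns and term lists, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SOCIAL_DOMAINS = ("facebook.com", "twitter.com", "linkedin.com",
                  "youtube.com", "instagram.com")
ECOMMERCE_TERMS = ("add to cart", "shopping cart", "checkout", "basket")
VACANCY_TERMS = ("job vacancies", "careers", "work with us", "open positions")


def extract_indicators(page: str) -> dict:
    """Return naive keyword-based indicators for one page of HTML or text."""
    lowered = page.lower()
    return {
        # True if any e-commerce term occurs in the page
        "ecommerce": any(term in lowered for term in ECOMMERCE_TERMS),
        # True if any job-vacancy term occurs in the page
        "job_vacancies": any(term in lowered for term in VACANCY_TERMS),
        # Social media domains mentioned anywhere in the page
        "social_media": sorted(d for d in SOCIAL_DOMAINS if d in lowered),
        # Distinct e-mail addresses found in the page
        "emails": sorted(set(EMAIL_RE.findall(page))),
    }


if __name__ == "__main__":
    sample = ('<a href="https://www.facebook.com/acme">Follow us</a> '
              "Contact: info@acme.example. Add to cart")
    print(extract_indicators(sample))
```

Such rule-based indicators are easy to audit but brittle; they are best seen as a baseline against which the predicted values can be compared.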
WP1 will coordinate its work with WP2.
Methodological, quality and technical results of the work package, including intermediate findings, will be used as inputs for the envisaged WP 8 of SGA-2, should SGA-2 be realized. When carrying out the tasks listed below, care will be taken to store these results for later use, using the facilities described in WP 9.
Task 1 – Data access
- For each partner participating in this task, production of an inventory of enterprises to be the object of web scraping:
  - definition of the population of interest (requirements on enterprises in terms of economic activities and size),
  - identification of business registers containing the information needed to identify the enterprises. All the countries involved in the work package will participate in this task.
- For each inventory of enterprises produced in activity 1, identification of URLs:
  - selection of archives containing the indication of URLs pertaining to the enterprises included in the population of interest (activity 1), and estimation of coverage;
  - in case of insufficient coverage, development of applications for searching the web for the URL of an enterprise given its identifiers (denomination, fiscal code, economic activity, etc.), and evaluation of the reliability of the resulting URL.
- Investigation (both at European and Member State level) of legal aspects and privacy issues concerning the scraping of websites, in co-ordination with Eurostat.
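As an illustration of the URL identification step in this task, the sketch below ranks candidate URLs for an enterprise by the string similarity between its denomination and the first label of each candidate's host name. This is only one possible reliability heuristic and an assumption of this example; the list of legal forms and the function names are hypothetical, and a real application would also use the fiscal code, the economic activity and the page content.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Hypothetical, non-exhaustive list of legal forms to ignore when comparing names.
LEGAL_FORMS = {"spa", "srl", "ltd", "gmbh", "bv", "sa", "inc"}


def normalise_name(denomination: str) -> str:
    """Lower-case the name, drop dots and legal forms, and join the tokens."""
    cleaned = denomination.lower().replace(".", "")  # "S.r.l." becomes "srl"
    tokens = re.findall(r"[a-z0-9]+", cleaned)
    return "".join(t for t in tokens if t not in LEGAL_FORMS)


def domain_label(url: str) -> str:
    """Return the first label of the host name, ignoring a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host.split(".")[0]


def score_url(denomination: str, url: str) -> float:
    """Similarity in [0, 1] between the normalised name and the domain label."""
    return SequenceMatcher(None, normalise_name(denomination),
                           domain_label(url)).ratio()


def rank_candidates(denomination: str, urls: list) -> list:
    """Return (score, url) pairs sorted from most to least similar."""
    return sorted(((score_url(denomination, u), u) for u in urls), reverse=True)


if __name__ == "__main__":
    candidates = ["https://directory.example/rossi",
                  "https://www.rossicostruzioni.example/home"]
    for score, url in rank_candidates("Rossi Costruzioni S.r.l.", candidates):
        print(f"{score:.2f}  {url}")
```

A threshold on the score, calibrated on a manually checked sample, could then be used to decide which URLs are reliable enough to keep.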
Task 2 – Data handling
- Detailed use case definition: a stakeholder and user consultation to analyse statistical data needs will be carried out in order to revise and detail the envisioned use cases. In particular, this task will also involve coordination with the ESS.VIP “European System of Interoperable Statistical Business Registers”.
- Review of the literature on techniques and available open source solutions for massive web scraping (JSoup, HTTrack, etc.) and study of problems related to website accessibility (blocking mechanisms), also in coordination with the UNECE Sandbox project. Implementation of one or more web-scraping techniques in applications to collect enterprise website content, and comparison with approaches followed by third parties (if knowledge is available).
- Scraping of enterprise website content in the countries involved.
- Storage of scraped websites content in a database to be shared with all partners (possibly UNECE Sandbox, to be evaluated also from a legal point of view).
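The data handling steps above can be sketched with the Python standard library alone. The example below is an assumption-laden illustration, not the tooling chosen by the work package (which names JSoup and HTTrack as candidates): it extracts visible text and link targets from an HTML page that has already been downloaded, and checks a robots.txt file to see whether a URL may be fetched by a given user agent. The class and function names and the user agent string are hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class TextAndLinks(HTMLParser):
    """Collect visible text and hyperlink targets, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html: str):
    """Return (visible_text, list_of_link_targets) for one HTML document."""
    parser = TextAndLinks()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against the content of a robots.txt file."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = ("<html><head><style>p{}</style></head><body>"
            "<p>Welcome to Acme</p><a href='/shop'>Shop</a></body></html>")
    print(parse_page(page))
    robots = "User-agent: *\nDisallow: /private/"
    print(allowed_by_robots(robots, "pilot-bot", "https://acme.example/shop"))
```

Downloading, politeness delays and storage in the shared database are deliberately left out; respecting robots.txt is only one of the accessibility and legal aspects the task mentions.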
Task 3 – Testing of Methods and Techniques
In this task, we will carry out a testing activity that will be enriched and finalized in SGA-2.
- Select some use cases, out of the defined ones, that are sufficiently representative of the overall potential statistical outputs and of the information that could enrich business registers.
- Build a proof of concept of the selected use cases to predict characteristics of the enterprises by applying text and data mining techniques.
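A proof of concept of this kind could start from a very small text classifier. The sketch below implements a multinomial Naive Bayes learner with Laplace smoothing in plain Python and trains it on a handful of invented sentences to predict whether a site offers e-commerce. The choice of Naive Bayes, the class name `NaiveBayesText` and the toy training data are assumptions made for illustration; the work package does not prescribe a particular learner.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list:
    """Split text into lower-case alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesText:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:             # ignore unseen words
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Invented toy data: True means the site offers e-commerce.
    texts = ["add to cart and checkout securely",
             "buy online free shipping checkout",
             "our company history and mission",
             "contact our office for a quote"]
    labels = [True, True, False, False]
    model = NaiveBayesText().fit(texts, labels)
    print(model.predict("proceed to checkout with your cart"))
```

In practice the training labels would come from manual annotation or survey responses, and several learners would be compared before one is retained.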
Task 4 – Finalization of Methods and Techniques (foreseen for SGA-2)
Starting from results achieved in task 3, in this task we will fully develop methods and techniques applicable to all use cases. In particular, we will:
- Select a sample of websites and manually identify the related enterprise characteristics, and/or use surveys on “ICT and enterprises” to obtain information on some characteristics of the enterprises (to be used as “training” and “validation” datasets for the next steps).
- Apply text and data mining techniques (learners) to predict characteristics of the enterprises. Evaluate quality indicators for some of them (e.g. accuracy, sensitivity, specificity). On the basis of the quality indicators, choose the best predictor.
- Apply the best predictor to the whole set of scraped texts in order to predict characteristics of enterprises. Compare and possibly integrate the Business Registers with the obtained information.
- On the basis of the predicted values, produce estimates (means and totals) of population parameters (for instance, the percentage of enterprises offering e-commerce, present on social media, etc.) for different domains of interest (category of economic activity), and evaluate the related Mean Square Error.
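The quality indicators and domain estimates mentioned in this task can be illustrated with a short Python sketch. It computes accuracy, sensitivity and specificity from validated and predicted binary values, and the proportion of positive predictions per domain. The function names and the domain codes in the example are hypothetical, and the sketch does not attempt the evaluation of the Mean Square Error, which depends on the estimation approach chosen.

```python
from collections import defaultdict


def quality_indicators(actual, predicted) -> dict:
    """Accuracy, sensitivity and specificity for binary (True/False) values."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # true positives
    tn = sum(1 for a, p in pairs if not a and not p)  # true negatives
    fp = sum(1 for a, p in pairs if not a and p)      # false positives
    fn = sum(1 for a, p in pairs if a and not p)      # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else None,
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
    }


def domain_proportions(domains, predicted) -> dict:
    """Share of positive predictions within each domain of interest."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for domain, value in zip(domains, predicted):
        totals[domain] += 1
        positives[domain] += int(bool(value))
    return {d: positives[d] / totals[d] for d in totals}


if __name__ == "__main__":
    # Invented validation data for illustration only.
    actual = [True, True, False, False, True]
    predicted = [True, False, False, True, True]
    print(quality_indicators(actual, predicted))
    # Hypothetical domain codes standing in for categories of economic activity.
    print(domain_proportions(["G", "G", "C", "C", "G"], predicted))
```

Unweighted proportions are shown for simplicity; estimates for the whole population would normally require weights or a model that accounts for enterprises without a website.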
Deliverables (SGA-1 only)
|2.1|Report with legal aspects|month 12|
|2.2|Technical and methodological report describing web scraping, prediction and inference procedures|month 18|
Milestones (SGA-1 only)
|2.3|Progress and technical report of first internal WP-meeting|month 4|