WP1 Sprint 2016 07 28-29 Virtual Ideas (background)

From ESSnet Big Data
Revision as of 09:50, 11 July 2016

Joint analysis of CEDEFOP data?

WP1 Virtual Sprint July 2016



In the final report of the project ”Real-time labour market information on skill requirements: feasibility study and working prototype”, a data model was presented, see the figure below. The figure is used to explain a general approach to web scraping for statistics. In Sweden, four data sets of job advertisements from the state employment agency have been exploited, covering modules 2, 3 and 5.


The data sets are XML files; they are cleaned and transformed into a database, as illustrated by modules 2 to 3 in the figure. Since the data come from a single source, duplicate removal was not considered in the first place. Based on the data in module 3, we studied variables such as occupation, organisation number and enterprise sector.
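The cleaning step could be sketched as follows. This is a minimal illustration only: the element and field names (`advert`, `occupation`, `orgnr`, `sector`) are assumptions for the example, not the actual schema of the employment agency files, and a SQLite database stands in for whatever database was actually used.

```python
# Hypothetical sketch: parse job-advert XML and load it into a database.
# Element/field names are illustrative assumptions, not the real schema.
import sqlite3
import xml.etree.ElementTree as ET

def load_adverts(xml_text, conn):
    """Parse one XML file of adverts and insert one row per advert."""
    root = ET.fromstring(xml_text)
    rows = [
        (ad.findtext("occupation"), ad.findtext("orgnr"), ad.findtext("sector"))
        for ad in root.iter("advert")
    ]
    conn.executemany("INSERT INTO adverts VALUES (?, ?, ?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE adverts (occupation TEXT, orgnr TEXT, sector TEXT)")

sample = """<adverts>
  <advert><occupation>Nurse</occupation><orgnr>5560000001</orgnr><sector>Q</sector></advert>
  <advert><occupation>Developer</occupation><orgnr>5560000002</orgnr><sector>J</sector></advert>
</adverts>"""
n = load_adverts(sample, conn)
print(n)  # 2
```

Once the adverts are in tabular form, the variables of interest (occupation, organisation number, sector) can be studied with ordinary queries.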

The data sets have good coverage of occupations and sectors. Only a small percentage of companies are covered compared with the business register. However, we cannot conclude from the data alone how good the coverage is; data from other sources need to be added as a complement.
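The coverage comparison amounts to intersecting the organisation numbers seen in the adverts with those in the business register. A toy sketch, with entirely made-up identifiers:

```python
# Illustrative only: what share of register companies appear in the adverts,
# matched by organisation number. Both ID sets are invented for the example.
register_ids = {"5560000001", "5560000002", "5560000003", "5560000004"}
advert_ids = {"5560000001", "5560000003"}

covered = register_ids & advert_ids
coverage = len(covered) / len(register_ids)
print(f"{coverage:.0%}")  # 50%
```

As the text notes, a low share here does not by itself tell us whether the adverts miss real vacancies or whether the uncovered companies simply had none, which is why complementary sources are needed.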

In the virtual sprint, interesting questions include, for example:

  1. How can we map the organisation numbers in the adverts to the legal units in the job vacancy statistics, so that the advertisement data can be compared with the job vacancy statistics?
  2. How can we find duplicate adverts within one source or across multiple sources, and can we combine the use of the structured data with text analysis?
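For question 2, one possible combination of the structured data and text analysis is to require exact agreement on structured fields first and only then compare the free text. The sketch below is an assumption about how this could look, not an agreed method: the fields used, the similarity measure (`difflib.SequenceMatcher`) and the 0.8 threshold are all illustrative choices.

```python
# Hedged sketch: flag two adverts as likely duplicates when their structured
# fields (organisation number, occupation) match exactly AND their free text
# is similar. Fields, measure and threshold are illustrative assumptions.
from difflib import SequenceMatcher

def likely_duplicate(ad_a, ad_b, threshold=0.8):
    # Cheap structured check first: different employer or occupation -> no.
    if ad_a["orgnr"] != ad_b["orgnr"] or ad_a["occupation"] != ad_b["occupation"]:
        return False
    # Then a text-similarity check on the advert body.
    ratio = SequenceMatcher(None, ad_a["text"], ad_b["text"]).ratio()
    return ratio >= threshold

a = {"orgnr": "5560000001", "occupation": "Nurse",
     "text": "We are looking for a nurse for our ward in Stockholm."}
b = {"orgnr": "5560000001", "occupation": "Nurse",
     "text": "We are looking for a nurse for our ward in Stockholm city."}
c = {"orgnr": "5560000002", "occupation": "Nurse",
     "text": "We are looking for a nurse for our ward in Stockholm."}

print(likely_duplicate(a, b))  # True  (same employer, near-identical text)
print(likely_duplicate(a, c))  # False (different organisation number)
```

Filtering on structured fields before computing text similarity also keeps the number of pairwise comparisons manageable when multiple sources are combined.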