Webscraping - Quality Guidelines for Big Data


Data class Web scraping (online job vacancies, enterprise characteristics)

  1. Short description of this class of big data, the source(s) and the structure of the raw data

The class “web scraping” refers to the process of acquiring data directly from the Internet, i.e., from websites. It relies on a variety of methods to filter, process and analyze the data.

Sources of the data include all websites publicly available on the Internet, accessed either via web browser engines or via an API (Application Programming Interface). Accordingly, there are three basic types of web scrapers used to access information from the Internet:

  • crawlers that access web data directly and extract information based on HTML/XHTML meta tags,
  • robots that access structured or semi-structured data in various formats (e.g., JSON, CSV),
  • applications that use an API to transfer data directly from the hosting server to a local machine (see the sketch below).
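
As an illustration of the third type, the minimal Python sketch below retrieves structured records via a hypothetical API endpoint using the requests library; the URL and the shape of the JSON response are assumptions, not a reference to any particular service.

  import requests

  # Hypothetical endpoint; a real project would substitute the API of the
  # website being scraped.
  API_URL = "https://api.example.org/v1/job-vacancies"

  def fetch_page(page=1):
      # Request one page of records and parse the JSON response.
      response = requests.get(API_URL, params={"page": page}, timeout=30)
      response.raise_for_status()  # fail loudly on HTTP errors
      return response.json()

  records = fetch_page()
  print("Retrieved", len(records), "records")  # assumes the API returns a list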

Data are extracted from the raw websites using CSS (Cascading Style Sheets) tags and classes. For instance, an example of the website raw data is presented in Figure 1.

Figure 1. An example of raw data of the class Web scraping

According to the figure, to extract the information in the green boxes, we have to prepare a web scraping robot that extracts all readable text from the “div” tags with the class “news priority”.
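
A minimal sketch of such a robot in Python, using the BeautifulSoup library: the URL is an assumption, while the tag and class names follow the example in Figure 1.

  import requests
  from bs4 import BeautifulSoup

  URL = "https://www.example.org/news"  # hypothetical page laid out as in Figure 1

  html = requests.get(URL, timeout=30).text
  soup = BeautifulSoup(html, "html.parser")

  # Select every <div> carrying both the "news" and "priority" classes
  # and keep only its readable text.
  for box in soup.select("div.news.priority"):
      print(box.get_text(strip=True))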

The output data can be delivered in various formats: CSV, JSON, SQL-like tables, etc.
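
For instance, the scraped records can be written to CSV or JSON with the Python standard library alone; the file names and sample records below are placeholders.

  import csv
  import json

  # Placeholder for the list of text snippets scraped above.
  records = ["first headline", "second headline"]

  with open("output.csv", "w", newline="", encoding="utf-8") as f:
      writer = csv.writer(f)
      writer.writerow(["text"])  # header row
      writer.writerows([r] for r in records)

  with open("output.json", "w", encoding="utf-8") as f:
      json.dump(records, f, ensure_ascii=False, indent=2)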

 

  2. Short description of the role of the big data class in the ESSnet(s), including links to deliverables (if already existing)

The role of the big data class “web scraping” in the ESSnet relates to various use cases conducted by NSIs to deliver data directly from the Internet. It includes web scraping of:

  • job vacancies[1]
  • enterprise characteristics[2]
  • comments/news[3]
  • Twitter data[4]
  • prices of products and services
  • tourism accommodation establishments[5]
  • border movements[6]

The list of use cases is not limited to those written above. However, it contains the most reliable examples of the use of web scraping in the ESSnet.

  3. Basic description of the processes necessary to transform the raw data into statistical data

There are five general steps to follow when transforming the raw data into statistical data (a sketch composing them in code follows below):

  1. Data acquisition
  2. Pre-processing the raw dataset (including tag identification)
  3. Processing data into a machine-readable format (including data cleansing and text mining methods)
  4. Data evaluation and improvement (including imputation of missing data/data linkability)
  5. Delivery of data in a usable format (e.g., a CSV or JSON file)


In terms of the GSBPM framework, the Collect and Process phases are used to deliver the output data from web scraping.
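
As an illustration, the five steps could be composed in Python as sketched below; the URL, tag and class names reuse the hypothetical Figure 1 example from section 1, and the helper names are ours, not part of any standard.

  import requests
  from bs4 import BeautifulSoup

  def acquire(url):
      # Step 1: data acquisition.
      return requests.get(url, timeout=30).text

  def preprocess(html):
      # Step 2: identify the relevant tags (here: the Figure 1 example).
      return BeautifulSoup(html, "html.parser").select("div.news.priority")

  def process(elements):
      # Step 3: machine-readable records with basic cleansing.
      return [el.get_text(strip=True) for el in elements]

  def evaluate(records):
      # Step 4: evaluation/improvement; here we merely drop empty records,
      # where a real pipeline would also impute or link data.
      return [r for r in records if r]

  # Step 5: deliver the records in a usable format (see the CSV/JSON
  # example in section 1).
  records = evaluate(process(preprocess(acquire("https://www.example.org/news"))))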

  4. Quality guidelines relevant for this big data class

4.1. Coverage

Because the sample of web data is in many cases unknown, coverage is a relevant issue we have to tackle during web scraping. Depending on the specification of the dataset, various aspects of coverage arise. For example, web-scraped Twitter data are not representative, because Twitter is not commonly used and different groups of individuals are usually under-covered. When scraping websites, we should also be aware that not all enterprises are present on the web; in particular, small enterprises and self-employed persons may not be present on the web or in social media. Therefore, there is a persistent problem of over-coverage of large enterprises relative to small ones.

4.2. Comparability over time

As web data change all the time, this data source must be treated with caution when comparing data over time. Web scrapers should be maintained and adapted to these changes, and the same applies to the algorithms used to estimate the final statistical data.

4.3. Process errors / data source specific errors

As mentioned above, the biggest issue is the continuous change of the Internet. This means that we should monitor changes on the Internet and modify the web scraping software if necessary.
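
One simple way to support such monitoring is to fingerprint the page structure and compare fingerprints between runs; the sketch below is only one possible approach, with a hypothetical URL and no real scheduling.

  import hashlib
  import requests
  from bs4 import BeautifulSoup

  URL = "https://www.example.org/news"  # hypothetical monitored page

  def structure_fingerprint(html):
      # Hash only the tag names and class attributes, so that a change in
      # page structure is flagged even though the content changes daily.
      soup = BeautifulSoup(html, "html.parser")
      skeleton = "|".join(
          tag.name + "." + ".".join(tag.get("class", []))
          for tag in soup.find_all(True)
      )
      return hashlib.sha256(skeleton.encode("utf-8")).hexdigest()

  reference = structure_fingerprint(requests.get(URL, timeout=30).text)
  # ... at the next scheduled run ...
  current = structure_fingerprint(requests.get(URL, timeout=30).text)
  if current != reference:
      print("Page structure changed - review the scraper before the next run.")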



[5] As above

[6] As above