Social Media - Quality Guidelines for Big Data
Data Class Social Media
By Istat team: Gabriele Ascari, Giovanna Brancato, Paolo Righi, Tiziana Tuoto
- Short description of this class of big data, the source(s) and the structure of the raw data
In June 2013, the UNECE Task Team on Big Data classified, among other types, Social Networks (human-sourced information). Under this umbrella they include all collections of human experiences, digitized and stored anywhere from personal computers to social networks. These data are loosely structured and often ungoverned, and include:
1100. Social Networks: Facebook, Twitter, Tumblr etc.
1200. Blogs and comments
1300. Personal documents
1400. Pictures: Instagram, Flickr, Picasa etc.
1500. Videos: YouTube etc.
1600. Internet searches
1700. Mobile data content: text messages
1800. User-generated maps
The use of social networks is strongly related to characteristics outside the NSIs' control: some platforms, e.g. Twitter and YouTube, can to some extent be accessed without subscribing to an account, while Facebook and Instagram can only be accessed by subscribers; company e-mails can be accessed by the company in some countries (e.g. India), while this is definitely not acceptable in many European countries. User-generated content and internet searches (e.g. Google Trends) are to some extent publicly accessible.
In this section, we mainly refer to publicly available social media, mostly reporting experiences that use Twitter as a case study, due to its wide availability.
The differences between these data and other internet-scraped data should be underlined.
- Short description of the role of the big data class in the ESSnet(s), including links to deliverables (if already existing)
In the previous ESSnet on Big Data, SGA-2, the use case “Population” in work package 7 “Multi-Domains” was dedicated to showing the structure of the population in different regions according to specific facts – e.g., public opinion on a topic (in the pilot, Brexit) and life satisfaction in different regions – by means of social networks as a data source. Three examples were conducted. The first was to identify the scale of depression in different countries based on Google Trends. The second was to detect the social mood related to public events or facts (e.g., Brexit). The goal of the third example was to identify life satisfaction in the population according to their comments/posts/tweets.
Many other European and international experiences are currently exploring/exploiting the use of social networks for official statistics (e.g. CBS with …, Istat for an index on the social mood on the economy, others for experiments on migration …).
- Basic description which processes are necessary to transform the raw data into statistical data
Let us consider Twitter as the social network source. Twitter’s Streaming API is used to collect samples of public tweets. The sample of tweets can be filtered according to relevant keywords. The sampling algorithm is entirely controlled by Twitter’s Streaming API and very little is known about it. The API returns at most a 1% sample of all the tweets produced on Twitter at a given time. When a filter is specified, the API returns all the tweets matching the request, up to the “1% of all tweets” limit.
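The interaction between the keyword filter and the “1% of all tweets” cap can be illustrated with a small local simulation. The sketch below is purely hypothetical: Twitter's actual sampling algorithm is undisclosed, so the uniform subsampling used here is an assumption for illustration only.

```python
import random

def sample_stream(tweets, keywords, cap_fraction=0.01, seed=42):
    """Simulate a filtered stream: keep tweets matching any keyword,
    but return at most cap_fraction of the full stream volume.
    (Hypothetical behaviour: the real algorithm is a black box.)"""
    rng = random.Random(seed)
    matching = [t for t in tweets if any(k.lower() in t.lower() for k in keywords)]
    cap = int(len(tweets) * cap_fraction)
    if len(matching) <= cap:
        return matching               # filter volume below the cap: all matches returned
    return rng.sample(matching, cap)  # above the cap: only a subsample is returned

# Example: 10,000 tweets, ~3% mention "brexit" -> only 1% (100 tweets) come back.
stream = [f"tweet {i} about brexit" if i % 33 == 0 else f"tweet {i}" for i in range(10_000)]
sample = sample_stream(stream, ["Brexit"])
print(len(sample))
```

The key consequence for statistics is visible here: when the topic volume exceeds the cap, the researcher observes a truncated sample whose selection mechanism is unknown.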
It is, however, possible to buy different tweet samples from the Twitter company (e.g. at the Bank of Italy, other example…).
The sampled tweets need to be processed, cleaned and normalized, and then the target information needs to be extracted. Indeed, even when a filter is applied, the observed tweets provide only a fuzzy representation of the phenomenon of interest. To extract the relevant information, natural language processing algorithms are usually applied, often in an unsupervised, lexicon-based approach; frequently the conditions for applying supervised machine learning methods are not met, due to the lack of a proper training set of labelled tweets. The presence of potential out-of-topic tweets should be checked, ideally by introducing a diagnostic step into the extraction and processing models.
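A minimal sketch of such an unsupervised, lexicon-based pipeline is given below. The mini-lexicon, the normalization rules and the "no-match" diagnostic are illustrative assumptions; a real application would use a full sentiment lexicon and language-specific preprocessing.

```python
import re

# Illustrative mini-lexicon (assumption: a real application uses a full lexicon).
LEXICON = {"happy": 1, "good": 1, "great": 1, "sad": -1, "bad": -1, "awful": -1}

def normalize(tweet):
    """Lower-case, strip URLs and @mentions, keep alphabetic tokens only."""
    tweet = re.sub(r"https?://\S+|@\w+", "", tweet.lower())
    return re.findall(r"[a-z]+", tweet)

def sentiment(tweet):
    """Lexicon-based score; tweets with no lexicon hit are flagged as a
    diagnostic for potentially out-of-topic content."""
    tokens = normalize(tweet)
    score = sum(LEXICON.get(tok, 0) for tok in tokens)
    matched = sum(tok in LEXICON for tok in tokens)
    if matched == 0:
        return "no-match"  # diagnostic step: possibly out-of-topic tweet
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("Feeling great today! https://t.co/xyz"))  # positive
print(sentiment("@friend totally unrelated post"))         # no-match
```

The "no-match" flag is one simple way to implement the diagnostic step mentioned above: tweets that never touch the lexicon can be routed to manual inspection rather than silently scored as neutral.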
- Quality guidelines relevant for this big data class
The following quality aspects should be covered within each data class (if not relevant, please state so explicitly); the mention and description of quality guidelines for additional quality aspects is welcome!
- Comparability over time (e.g. is the data structure stable over time)
- Measurement errors
- Model Errors (depending on the models applied to transform the raw data to statistical data)
- Process Errors / data source specific errors (depending on the very diverse processes needed to transform the raw data to statistical data)
4.1 Coverage errors in social networks data
Among the differences between big data and traditional sources is the degree of control National Statistical Institutes (NSIs) have over data acquisition. In traditional surveys, NSIs can plan, design and carry out the acquisition procedures; for the use of administrative data sources, they may have agreements with the data providers and some knowledge of the reliability and quality of the data. Big data, on the other hand, share many characteristics with “found data”. This is especially true for data derived from social networks, which are shared willingly by individuals, outside any control by statistical organizations. First of all, the activities of preserving and maintaining the data, allowing their re-use, depend on the decisions of the data curator, who might have no interest in keeping the data useful for statistical purposes. Secondly, the databases can be organized in such a way that the provenance and origins of the data cannot be traced (Baker R., 2017).
In the throughput phase, big data are characterized by a set of complex treatment operations. For Twitter data, they can be identified as (Hsieh Y.P. and Murphy J., 2017):
- the coverage delimitation, i.e. the establishment of the time, territory and language of the tweets;
- the identification of the topical keywords and the definition of the data extraction queries;
- the automated text analysis (or machine learning algorithm) to assign the sentiment, and the analysis (and imputation) of demographic data on the profiles behind the tweets.
From such characteristics derives one of the most controversial aspects of big data with regard to their use in official statistics: the representativeness of the target population. Indeed, in the survey life cycle, coverage errors pertain to the representation line, i.e. to the target population or the set of units to be studied (the “who”) (Groves et al., 2004). Since the generation of the data depends on external factors and not on NSIs' decisions, it is often the case that such data will not be representative of the whole population but only of a fraction of it, which will probably have specific characteristics that differentiate it from the broader population. In other words, the problem is similar to that faced by studies dealing with non-probability samples. In such cases, the sampling bias, more than the sampling variance, is the most dangerous drawback the estimates face. For example, when considering Twitter data, it is evident that the collected tweets refer only to specific subsets of the general population: the subset of people with a Twitter account, and the subset of Twitter users who have chosen to share some of their messages publicly. Thus, inferences should not be made from a collection of tweets, and any result from the analysis should be limited to the population underlying those tweets.
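The selection bias of a non-probability sample can be made concrete with a small simulation. All the figures below (share of young people, platform take-up rates, approval rates) are invented for illustration; the point is only that when platform membership correlates with the variable of interest, the platform-based estimate departs systematically from the population value.

```python
import random

random.seed(0)

# Hypothetical population: younger people are both more likely to be on the
# platform and more likely to approve, so the platform subset is biased.
population = []
for _ in range(100_000):
    young = random.random() < 0.3                        # 30% young (assumption)
    on_platform = random.random() < (0.6 if young else 0.1)
    approves = random.random() < (0.7 if young else 0.4)
    population.append((on_platform, approves))

overall = sum(a for _, a in population) / len(population)
platform = [a for on, a in population if on]
platform_rate = sum(platform) / len(platform)

print(f"true approval:     {overall:.3f}")       # close to 0.49
print(f"platform estimate: {platform_rate:.3f}") # noticeably higher: selection bias
```

No amount of extra platform data removes this gap; it is bias, not variance, which is exactly the drawback discussed above.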
This relates to another point in the coverage aspect of big data derived from social networks: we often have no means to trace the collected events back to the units behind them. Indeed, while we may be interested in studying some characteristics of a population, social networks often do not offer direct information about the units of that population, only what those units have willingly shared with the world. In other words, we collect events generated by the target population, not data on the population itself. Considering the Twitter example, this means that we do not have data on the person behind a Twitter account; we cannot even know whether it is a person at all, since the account could be linked to an organization or to multiple individuals. As such, the collected data could be affected both by undercoverage with respect to the target population and by overcoverage with respect to specific subpopulations.
A further problem in assessing coverage error is that, in the use of big data, the interest lies in maximizing topic coverage rather than population coverage, thus blurring the distinction between coverage error, measurement error and item missingness.
Establish the population of interest. The definition and study of coverage error require the definition of the target population, that should be explicitly identified in terms of type, time and place.
Research the background of the units. Social network data are composed of events generated by units. By analyzing the content of the messages or the related metadata through profiling techniques, it may be possible to identify some characteristics of the units at the individual or aggregated level. This is often necessary, as Twitter does not require users to submit real personal information such as age or occupation, leaving the choice to them. Once unit characteristics have been derived, the characteristics of the “observed” units should be analyzed against the target population to assess the presence and extent of the coverage error.
Surveys to obtain coverage awareness. Short surveys may be launched in order to identify the characteristics of an observed population, such as the users of a social network. The results should offer an idea of the demographic characteristics of the users and the differences between them and the target population.
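The comparison between observed users and the target population described in the two guidelines above can be sketched as a simple coverage-ratio table. The age classes and shares below are invented for illustration; in practice the target distribution would come from a census or population register, and the observed one from profiling or an awareness survey.

```python
# Hypothetical age distributions (shares by age class; assumptions for illustration).
target_population = {"15-29": 0.20, "30-49": 0.35, "50+": 0.45}
observed_users    = {"15-29": 0.45, "30-49": 0.40, "50+": 0.15}

# Coverage ratio per class: values far from 1 signal over-/under-representation.
coverage_ratio = {k: observed_users[k] / target_population[k] for k in target_population}

for k, r in coverage_ratio.items():
    flag = "over-represented" if r > 1 else "under-represented"
    print(f"{k}: ratio {r:.2f} ({flag})")
```

Ratios of this kind give a first quantitative reading of the coverage error: here the youngest class would be heavily over-represented and the oldest heavily under-represented, mirroring the Twitter discussion above.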
4.2 Comparability over time (e.g. is the data structure stable over time)
One of the main concerns in introducing big data into official statistics is the data generating process. Usually, National Statistical Offices (NSOs) control the overall production process starting from the data collection phase, namely the survey. Big data, on the other hand, are generated for a non-statistical purpose and provided by independent actors. At a fixed point in time, this issue can introduce errors such as coverage errors. Over a longer period, an uncontrolled process can undermine the comparability of the data between two or more reference times. These problems mainly depend on the stability of the data structure over time. NSOs have limited means to manage the data structure over time, so it is important that they are aware of the risk of basing statistical output on this type of data source.
We focus on comparability over time considering social media data for producing social mood analyses. Here is a non-exhaustive list of the conditions for carrying out a correct comparability analysis:
- Stability of the data provider. The provider has to supply data over an extended period of time: the question is whether comparable data will be available in the future, from similar providers or sources. Note that several social media platforms commonly used years ago have now completely disappeared.
- Stability of the social media data characteristics. The functionalities of the social media platform can change over time. As an example, the Twitter platform now allows tweets of up to 280 characters, raised from the previous 140-character limit. Changes of functionality affect the way Twitter is used, and the data processing (i.e. text mining analysis) may not account for these changes and may produce misleading output.
- Stability of the data access policy. Even if, in a given period, the social media platform allows free data access, partially (e.g. Facebook) or, in some sense, completely (e.g. Twitter), the data access policy can change and affect the time series. Moreover, even when the access policy itself is not modified (i.e. the download remains free), it is important that it remains stable for the automatic downloading procedures as well.
- Stability of the algorithms for the intermediate social media data. Considering Google Trends, which represents a sort of throughput of the big data generating process (an intermediate level of data processing), the Google algorithms change over time in a black-box context, producing different results for the same query. NSOs should take this issue into account.
Comparability over time can depend not only on the stability of the data provider and on the characteristics of the data source (stability of the data structure) but also on exogenous conditions:
- Different social media platforms compete over time for market share, so a given platform can have a different level of coverage of the target population over time, affecting the statistical output.
- Technological innovations (software and hardware) can undermine the use of a given social media platform and its appeal, ultimately affecting the coverage of the target population over time and thus the statistical output.
To deal with the concerns about the comparability over time of the statistical products, NSOs should rely on a suitable statistical framework. Some relevant precautions are listed here:
Integrated use of different data sources. Base the statistical output on more than one source of data. The sources can be of different types: big data, administrative data, survey data.
Continuous updating of data science techniques. Web scraping, text processing and machine learning tools have to be kept ready to catch changes in the data structure.
Fit an appropriate statistical methodology for producing the output. In accordance with the Analyse stage of the data generating process in AAPOR (2015), apply a statistical method that is not sensitive to extreme data, and define statistical tools for smoothing breaks in the time series related to structural changes of the data source or coverage changes over time.
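One simple example of a method that is not sensitive to extreme data is a centred rolling median, which damps isolated spikes that may reflect a source change (e.g. an API or policy change) rather than the phenomenon of interest. The sketch below is a minimal illustration, not a full break-adjustment procedure; the series and the spike are invented.

```python
def rolling_median(series, window=3):
    """Centred rolling median: damps isolated extreme values that may
    reflect source changes rather than the phenomenon of interest."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(sorted(series[lo:hi])[(hi - lo) // 2])
    return out

# A spike at t=3 (e.g. a change in the platform's data access) is damped.
raw = [10, 11, 10, 95, 12, 11, 10]
print(rolling_median(raw))
```

For genuine level shifts (rather than isolated spikes), more structured tools such as intervention models or explicit break adjustment in the time series would be needed; the rolling median only handles the robustness-to-outliers part of the precaution.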
4.3 Measurement errors in social networks data
Measurement errors pertain to the measurement line, i.e. to the variables of interest (the “what”) (Groves et al., 2004).
Measurement errors in Twitter data derive from the choice of the topical keywords and the search queries for data extraction. Studies have shown that even small variations in the choice of the topical keywords can lead to wide differences in the extracted data (Hsieh Y.P. and Murphy J., 2017). Once the tweets of interest have been extracted, an automated text analysis or machine learning algorithm predicts the sentiment, e.g. positive or negative. These operations can be evaluated in terms of precision (the proportion of retrieved tweets that are relevant to the target of the search query) and recall (the proportion of relevant tweets actually retrieved by the search query). Such concepts are usually translated into the sensitivity and specificity of the method used to extract and interpret the tweets.
Errors propagate: an error in the selection of the tweets may lead to a coverage error (if necessary, give an example of under- and over-coverage). Errors in the interpretation of the tweets, leading to measurement error, can depend on model errors in the specification of the algorithms applied for the query and interpretation of the tweets.
Establish the target information. The definition and study of measurement errors require the definition of the target variable of interest.
Research on measurement errors. Since the Query and Interpretation operations are those more risky for measurement error, the sensitivity and specificity of the query and interpretation algorithm could be tested on simulated data.
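On simulated or manually labelled data, the quality measures mentioned above reduce to simple ratios over a confusion matrix. The counts below are hypothetical, chosen only to show the computation.

```python
def retrieval_quality(tp, fp, fn, tn):
    """Precision, recall (a.k.a. sensitivity) and specificity of a
    query-and-interpret step, from a labelled confusion matrix."""
    precision   = tp / (tp + fp)   # retrieved tweets that are relevant
    recall      = tp / (tp + fn)   # relevant tweets that are retrieved (sensitivity)
    specificity = tn / (tn + fp)   # irrelevant tweets correctly left out
    return precision, recall, specificity

# Hypothetical evaluation on 1,000 labelled tweets.
p, r, s = retrieval_quality(tp=120, fp=30, fn=80, tn=770)
print(f"precision={p:.2f} recall={r:.2f} specificity={s:.2f}")
```

Running the query variants discussed above through such an evaluation makes it possible to quantify how sensitive the extraction is to small changes in the topical keywords.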
Comparison with other data sources. Where feasible, the information extracted from the tweets should be compared with other data sources (e.g. survey or administrative data) to assess the presence of measurement errors.
4.4 Model Errors (depending on the models applied to transform the raw data to statistical data)
4.5 Process Errors / data source specific errors (depending on the very diverse processes needed to transform the raw data to statistical data)
For these data, it seems that model errors and process errors can be confused with each other. This point needs further elaboration.
AAPOR (2015). Big Data in Survey Research. AAPOR Task Force Report. Public Opinion Quarterly, 79, pp. 839–880.
Baker R. (2017). Big data. A survey research perspective. Chapter 3 in Total Survey Error in Practice. Wiley and Sons.
ESSnet Big Data SGA-2, WP7, Multi-Domains: https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/0/04/WP7_Deliverable_7_7_2018_05_31.pdf
Hsieh Y.P. and Murphy J. (2017). Total Twitter error. Decomposing public opinion measurement on Twitter from a Total Survey Error perspective. Chapter 2 in Total Survey Error in Practice. Wiley and Sons.
Groves R.M., Fowler F.J. Jr., Couper M.P., Lepkowski J.M., Singer E., Tourangeau R. (2004). Survey Methodology. Wiley, New York.
UNECE Task Team on Big Data: https://statswiki.unece.org/display/bigdata/Classification+of+Types+of+Big+Data