===============================================================
Technical issues:
===============================================================

******************************
************* 1) *************
******************************

file: WP2_SocialMediaPresence_Dev.py
class: SocialMediaPresence
method: searchSocialMediaLinks(self, website, level)

Just after these lines:

    # if no links have been found on the main page, go one level deeper
    if none=='1' and level=='1' and len(URLs)>1:
        smd.goDeeperToFindSocialMedia(website,URLs)

we added the following except clause (with proper indentation, closing the try block that wraps the scraping code):

    except Exception:
        print("")
        print("=========================> Scraping failed for the current website :", website)
        print("")

This prevents the program from terminating when an error occurs on a particular webpage.

******************************
************* 2) *************
******************************

file: WP2_Step2a_CollectingTweetsByKeyword.py

Substitute the following line:

    hashTweet = tweepy.Cursor(api.search, q=keyword, count=5000, lang=language).items(5000)

with this line:

    hashTweet = tweepy.Cursor(api.search, q=keyword, count=5000, lang=language, tweet_mode='extended').items(5000)

In this way the retrieved tweets will not be truncated. Note that with tweet_mode='extended' the text of each status object is exposed through the full_text attribute rather than text, so downstream code that reads the tweet text must be updated accordingly.

******************************
************* 3) *************
******************************

It would be a good idea to log what happens during program execution to a file, especially when the number of enterprises considered is very high.

******************************
************* 4) *************
******************************

In step1 it would be advisable to write the csv file incrementally during program execution, rather than only at the end, so that partial results survive a crash.

===============================================================
Methodological issues:
===============================================================

In step2a a sample of tweets whose text contains specified keywords (e.g. "commercial", "marketing") is collected. We are not 100% sure of the purpose of this step.
We guess this sample is meant to prepare a training set for fitting a model (e.g. a multilevel logistic model) in step3. If this is the case, the keywords used to filter tweets are meant to represent possible categories of enterprises' activity on Twitter, and the text of the tweets the features to be used for prediction. Are we right? If not, please clarify. If we are right, we would like to share the following observations:

1) Is the list of keywords derived from any official classification of enterprises' activity on social media? If not, a proper selection of keywords would be the first task to address; domain experts could perhaps help.

2) Even with "good" keywords, we cannot take for granted that a tweet which contains, for instance, the keyword "commercial" is, from a semantic point of view, relevant to a "commercial activity". This is a serious concern: a training set prepared automatically will very likely contain label noise, and training sets affected by label noise invariably lead to poor predictions.

3) Very likely, the majority of tweets collected through Twitter's streaming API in step2 will not be published by enterprise accounts, whereas our target population consists of enterprises that are active on Twitter. It could therefore be a better idea to restrict the selection to tweets published by enterprises.

As a general remark, a high-quality training set is an essential ingredient for successful supervised learning. The best option would be (if possible) to derive enterprise activity labels and the related tweets from already available survey data.
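The except clause proposed in item 1 of the technical issues only works if the scraping code is wrapped in a matching try block. A minimal, self-contained sketch of the intended pattern follows; scrape_website and search_social_media_links are hypothetical stand-ins for the actual logic in SocialMediaPresence.searchSocialMediaLinks, not code from the project:

```python
def scrape_website(website):
    # Hypothetical stand-in for the real scraping logic; here it simply
    # fails for one particular site to demonstrate the pattern.
    if website == "http://broken.example":
        raise RuntimeError("connection refused")
    return ["https://twitter.com/example"]

def search_social_media_links(website):
    try:
        return scrape_website(website)
    except Exception:
        # Report the failure and keep going instead of terminating the run.
        print("")
        print("=========================> Scraping failed for the current website :", website)
        print("")
        return []

# A failing website no longer aborts the whole run:
links = search_social_media_links("http://broken.example")
```

Catching the broad Exception class is deliberate here, since any per-website error should be survivable; a production version might also log the traceback for later inspection.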
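Item 3 of the technical issues suggests logging execution progress to a file. A minimal sketch using Python's standard logging module; the file name, logger name, message format, and enterprise list are illustrative choices, not taken from the project:

```python
import logging

# Illustrative configuration: one log file per run, with timestamps.
logging.basicConfig(
    filename="wp2_run.log",
    filemode="w",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("WP2")

enterprises = ["ACME Ltd", "Globex"]  # illustrative list of enterprises
for i, enterprise in enumerate(enterprises, start=1):
    # One log line per enterprise makes it easy to see how far a long run got.
    logger.info("Processing enterprise %d/%d: %s", i, len(enterprises), enterprise)

logging.shutdown()  # flush handlers so the file is complete
with open("wp2_run.log") as f:
    log_text = f.read()
```

With a "processed i/N" line per enterprise, an interrupted run over a very large enterprise list can be diagnosed and resumed from the log alone.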
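Item 4 of the technical issues suggests writing the csv file during execution rather than only at the end. A minimal sketch using the standard csv module, flushing after each row so that partial results survive a crash; the file name, column names, and data are illustrative, not taken from step1:

```python
import csv

# Illustrative results; in step1 these would be produced one at a time.
results = [("ACME Ltd", "https://twitter.com/acme"),
           ("Globex", "")]

with open("enterprises.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["enterprise", "twitter_url"])
    for name, url in results:
        # Write and flush each row as soon as it is available,
        # instead of accumulating everything and writing at the end.
        writer.writerow([name, url])
        f.flush()

with open("enterprises.csv", newline="") as f:
    rows = list(csv.reader(f))
```

Flushing per row costs little for runs of this size and guarantees that, if the program dies on enterprise k, the first k-1 rows are already on disk.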