Difference between revisions of "WP4 AIS data1"
Revision as of 20:30, 13 July 2016
|WP4 AIS data|
Workpackage 4 (WP4) of ESSnet Big Data focuses on AIS data.
WP4 is carried out by representatives of five ESSnet Big Data partners: Statistics Netherlands (CBS, NL), Statistics Denmark (DST, DK), the Hellenic Statistical Authority (ELSTAT, GR), Statistics Norway (SSB, NO) and the Central Statistical Office (GUS, PL).
More background information on WP4 can be found here.
Maritime traffic has grown rapidly over the last decades. This called for different solutions to ensure safety at sea. In the 1940s radar was developed, but this technology could only detect ships that were in line of sight. Another technology that was launched in the last century was the Global Positioning System (GPS). It was developed for US defence to ensure that US troops were always able to find their bearings wherever they were in the world. Based on this technology, the Automatic Identification System (AIS) allows ships to broadcast their location and status information over a radio channel, making it possible to detect other ships wherever they are, as long as a radio signal can be sent and received. This AIS data is collected at several places around the world. Within Europe, EMSA, Kystverket, the Hellenic Coast Guard, Dirkzwager and Marine Traffic, among others, collect the data. The number of applications for this data is enormous, and some apps have already been built on the basis of this data. For official statistics, AIS data could be used to create marine traffic statistics, which could be interesting for estimating emissions or identifying locations at sea with a critical amount of traffic. Furthermore, the AIS data could be used as a backbone for the maritime transport statistics: based on AIS we know which other harbours a ship visited besides a given harbour. Here, we report on the activities in the ESSnet WP4 regarding collecting and preprocessing the data, and on the decisions we made concerning these activities.
Obtaining AIS data on European level
The first hurdles to take were deciding the level for which we should collect the data and obtaining the data at that level. There were three levels to choose from:
- National level for each country collaborating in WP4
- European level: all waters within the perimeter of the European countries
- World wide level: a data set covering the whole world.
In Table I, the three levels of data are scored against the criteria that were viewed as important: potential use for European statistics, potential use for national statistics, price of the data and size of the data.
|Criterion||National||European||World wide|
|Potential for EU level statistics||-||++||+++|
|Potential use for national statistics||++||+++||++++|
Table I: Scoring the different levels of data.
As Table I shows, the national data is not very usable for Europe-wide analysis. Its only advantages are its price (most of the participating countries already have national data) and its size (the data is not that big). European data is much better suited for European as well as national statistics. Since European data covers the routes vessels take within Europe, it gives information about those routes and also about the harbours of loading and unloading within Europe. Hence, this data could also have extra value at the national level. With the world wide data, one would be able to locate the harbours of loading and unloading around the world. Furthermore, this could be interesting when calculating the environmental footprint of certain goods transported over sea. However, this data is gathered by satellites and, as a consequence, it is expensive.
Based on the fact that we would like to make some statistics about European waters, and the extra information we would be able to gather based on the European data, it was decided to obtain European data. Several sources were identified to purchase the data from, i.e.:
- Hellenic Coastguard
- Marine Traffic (.com)
- Joint Research Centre (JRC)
In the subsequent paragraphs, a description is given of these sources.
European Maritime Safety Agency (EMSA)
The European Maritime Safety Agency is one of the EU's decentralised agencies. Based in Lisbon, the Agency provides technical assistance and support to the European Commission and Member States in the development and implementation of EU legislation on maritime safety, pollution by ships and maritime security. It has also been given operational tasks in the field of oil pollution response, vessel monitoring and in long range identification and tracking of vessels.
EMSA is an EU agency under the authority of DG MOVE, the transport policy Directorate-General (DG) of the European Commission. DG MOVE is responsible for the legal framework and governance of the EU maritime databases, while EMSA is responsible for the technical operation of the databases. Access to AIS data is therefore granted by DG MOVE and the High Level Steering Group (HLSG) of SafeSeaNet (SSN, representatives of the maritime authorities in the Member States), while EMSA will provide the data if they are told to do so by DG MOVE.
Eurostat has already worked with AIS data from EMSA, which provided AIS data for the year 2011. Eurostat is currently discussing with EMSA and DG MOVE whether it can get an extraction of AIS data for 2014, in order to see whether a model that uses AIS data to complement the port notifications sent by vessels can improve the current EU vessel port call statistics. This request has been approved by the HLSG. Eurostat is currently negotiating the terms of the data access agreement, which has to be signed between Eurostat and EMSA before the data is handed over. Eurostat has asked for a clause to be included that allows the 2014 AIS data extraction to be shared with WP4 of the ESSnet Big Data project. If this is accepted, we can test the 2014 AIS data within the framework of the ESSnet Big Data project. Unfortunately, at this moment EMSA seems to be of the opinion that sharing the AIS data with WP4 of the ESSnet Big Data project is not within the mandate for this extraction approved by the HLSG, but this is still under discussion.
If the ESS decides, after WP4, that it will need constant provision of AIS data from the SSN Central system (EMSA) or the SSN National systems (in each Member State) this kind of data access will have to be requested from the HLSG (for access to data from the SSN Central system) or the maritime authorities in each Member State (for access to data from the respective SSN National system). In case data is needed from the SSN Central system (hosted by EMSA), a formal request describing the planned project will have to be sent to DG MOVE by either Eurostat or the ESS.
Statistics Netherlands has also tried to get AIS data from EMSA, but we think it is better to obtain AIS data from EMSA via Eurostat. For that reason we decided not to investigate further the possibilities of Statistics Netherlands getting AIS data from EMSA directly. We will first wait for the outcome of the request Eurostat has already started to make the 2014 AIS data available for this WP as well.
The main focus of Kystverket (the Norwegian Coastal Administration) is the sea around Norway, but they also have data for other European countries. Statistics on traffic around Norway are available through www.havbase.no, and Kystverket is working on a policy and solution for making some of the underlying data publicly available. The commercial company DNV-GL makes calculations for Kystverket. Kystverket has agreed to share the national AIS data freely with Statistics Norway as input to statistical products.
When it comes to data for other European countries, Kystverket consulted their lawyer/jurist. They found out that sharing such data goes beyond their authorization. They believe that "each NSI must go to the AIS owner in their own country or buy AIS data commercially from a market player such as DNV-GL, Marine Traffic or similar."
Hellenic Coast Guard (HCG)
The Hellenic Coast Guard (HCG) is a public authority that operates under the Ministry of Maritime Affairs and Insular Policy in Greece. The HCG has a crucial role as the competent administration and the policymaker in the field of maritime industry and shipping. To fulfil its mission, the HCG plans, develops, deploys and operates systems and IT infrastructure for effective maritime surveillance. One of the basic elements of this infrastructure is the coastal AIS base station network.
In reply to our request for European AIS data, they informed us that they can only provide us with national historical AIS data since mid 2015. In their opinion, we should address our request for European AIS data to the European Maritime Safety Agency (EMSA), in accordance with EMSA's established procedures for data acquisition.
Dirkzwager is a private company that provides the maritime industry with data. They have AIS data for the European waters based on land based stations and satellite data. They provide this data to shipping companies so that these can follow their ships across the world. We arranged for WP4 that Dirkzwager delivers 6 months of data (from 8 October 2015 until 12 April 2016) for a special price of 2000 euro. These data contain AIS data for the European waters based on land based stations. Satellite data is not included. Normally obtaining one month of these data from Dirkzwager would cost 7500 euro and 6 months of data would cost 42000 euro. The data consist of the raw AIS data, with a timestamp for the date and time at which Dirkzwager received each AIS message from the live stream they use.
MarineTraffic is one of the world's most popular ship tracking services. MarineTraffic collects real-time vessel position data and uses them to create applications. It is a global pioneer in AIS vessel tracking, recording around 800 million vessel positions and 18 million vessel and port related events per month. A request for European AIS data has been sent to them, and a potential collaboration with MarineTraffic is under investigation.
Joint Research Centre (JRC)
In brief, the Joint Research Centre (JRC) is the in-house science service of the European Commission.
The JRC draws on over 50 years of scientific work experience and continually builds its expertise based on its seven scientific institutes. They are located in Belgium (Brussels and Geel), Germany, Italy, the Netherlands and Spain.
While most of their scientific work serves the policy Directorates-General of the European Commission, they address key societal challenges while stimulating innovation and developing new methods, tools and standards. They share know-how with the Member States, the scientific community and international partners. The JRC collaborates with over a thousand organisations worldwide whose scientists have access to many JRC facilities through various collaboration agreements.
We had a Webex meeting with JRC to investigate what they have already done with AIS data and what the possibilities are of sharing their AIS data with our WP. JRC has a lot of experience with using AIS data. Unfortunately, we cannot use their data because of legal issues, but we can use their experience in storing, cleaning and analysing AIS data.
The result of this investigation into possible sources for AIS data at European level is that we know for sure that we cannot obtain European data from Kystverket, the Hellenic Coast Guard or JRC. We are still investigating the possibilities of getting AIS data at European level from Marine Traffic, but this seems to be a very expensive alternative. We prefer using AIS data from EMSA because they provide the data for free, but the process of obtaining AIS data from EMSA takes a long time. Dirkzwager could provide us with 6 months of AIS data at European level at short notice and for a very good price, which is why we decided to use the Dirkzwager data within our WP. If the EMSA data becomes available during SGA-1 or SGA-2, we will also investigate this data in WP4.
Decisions concerning tools and environment
Some decisions were made concerning the tools to be used and the location where the data is going to be processed and stored. As can be seen in Figure 1, processing and storing the data consists of several steps. First the data is preprocessed, and then the preprocessed data is moved to its data store.
There were two possibilities to decode the data:
- At Statistics Netherlands, Python was used in combination with the AIS package
- At Statistics Denmark, Java was used in combination with the libAIS library
For legal reasons it was chosen to keep the data in the Netherlands and to process the data in Python. After that the data had to be moved to a data store. Since the aim of the project is to get hands-on experience with Big Data, it was decided that the data should be processed using the Hadoop stack. Since the data should be accessible for all the participating countries, it was decided to store the data at the UNECE sandbox in HDFS. From the HDFS, the data can be accessed using the tools that are available in the Hadoop stack, which are Pig, Hive, RHadoop and Spark.
Pig is a big data tool that uses the language Pig Latin, developed by Yahoo. It is an easily accessible language for working with Hadoop. Hive, developed by Facebook, makes it possible to run simple SQL queries in Hadoop. RHadoop integrates the statistical language R with Hadoop, which makes it convenient to do analysis on Big Data. Finally, Spark is the new kid on the block; it makes much more complex processing possible on data that is stored in, among other places, the HDFS. It can be used in combination with the programming languages Scala, Python, R and Java, although it is advised to use it with Python or possibly Scala. Furthermore, within Spark one can write SQL queries using SparkSQL. Because Spark is the most flexible tool, most of our work will be done in Spark. However, the use of the other tools is not prohibited.
Spark is accessible via SSH (Secure Shell, an encrypted command line interface for remote computers). Pig and Hive are accessible via HUE (the web interface for Hadoop), which also incorporates a file browser. RHadoop is available through a web version of RStudio.
For the final step it was decided that the resulting aggregates will be downloaded from the UNECE Sandbox using HUE. Researchers of the different NSIs will be able to analyse the data in their tools of choice, e.g. SPSS, SAS or R, or even import the data into a local database and use that for analysis. It will also be investigated whether the data can be accessed using JDBC or ODBC. However, it should be noted that analysis can also be done using the web version of RStudio with RHadoop, which has the advantage that one does not have to download the data from the sandbox.
Preprocessing and uploading the data
As can be seen in Figure 1, the data is processed in several stages. The received AIS messages from Dirkzwager are in NMEA format, which is a text encoded binary format (see the 6th field in Figure 2), which has to be decoded. The dataset was about 400 GB in size.
Figure 2. example of raw NMEA encoded AIS data
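To make the phrase "text encoded binary format" concrete, the following is a minimal sketch of how the 6th field of an AIVDM sentence can be turned back into bits. It is not the Python AIS module used in WP4; the function names payload_to_bits and read_uint are hypothetical, and the sample sentence is illustrative only and is not taken from the Dirkzwager data. It assumes the standard AIS armouring, in which each character carries 6 bits, the first 6 bits hold the message id and bits 8 to 37 hold the MMSI.

```python
def payload_to_bits(payload: str) -> str:
    """Turn an armoured AIS payload into a string of '0' and '1' characters."""
    bits = []
    for char in payload:
        value = ord(char) - 48      # undo the ASCII offset
        if value > 40:              # skip the gap in the armouring alphabet
            value -= 8
        bits.append(format(value, "06b"))  # each character carries 6 bits
    return "".join(bits)


def read_uint(bits: str, start: int, length: int) -> int:
    """Read an unsigned integer of `length` bits starting at bit `start`."""
    return int(bits[start:start + length], 2)


# Illustrative sentence (assumption: not part of the WP4 dataset).
sentence = "!AIVDM,1,1,,B,177KQJ5000G?tO`K>RA1wUbN0TKH,0*5C"
fields = sentence.split(",")
bits = payload_to_bits(fields[5])    # the 6th field holds the payload

message_type = read_uint(bits, 0, 6)  # message id, e.g. 1, 2, 3, 5 or 21
mmsi = read_uint(bits, 8, 30)         # ship identifier
print(message_type, mmsi)
```

A real decoder also has to handle signed fields, text fields and the fill bits in the 7th field, which is why an existing library was used for the actual decoding.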
It is important to note that some messages are distributed over several lines. Lines 3 and 4 in Figure 2 encode one message. You can see this because the first element after the AIVDM tag is a 2, denoting that the message is split into 2 parts. The element after that denotes which part of the message it is, so we can see that the 3rd line is the first part and the 4th line is the second part. The two strings in the 6th column have to be concatenated before they are sent to the decoder. As already mentioned, the messages were decoded using the Python AIS module. This library was already in use in the Netherlands. However, when decoding the European dataset, the program crashed several times. It took some time to figure out on which part of the dataset it actually crashed, since the error occurred only 10 times in a total of 4.5 billion cases.
AIS consists of several types of records. We divided the messages into four file types. The first contains the messages about the position of the vessel (where the message id is 1, 2, 3 or 21), and the second contains the voyage related messages (where the message id is 5). The third contains all other AIS messages we could decode whose message type is neither a position report nor voyage related. At this point we do not see any useful information in this data, but perhaps in the future there will be some data we can use. The fourth contains the data we could not decode. The records were written to different sets of files: one set concerns the position of the vessel and the others the voyage of the vessel.
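The concatenation of multi-part messages described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used in WP4: the function name reassemble is hypothetical, the payloads and checksums in the sample lines are placeholders rather than valid AIS data, and fragments are matched on the sequence id and channel (4th and 5th fields), which the text above does not describe.

```python
def reassemble(lines):
    """Yield one payload per AIS message, joining multi-part sentences."""
    pending = {}  # (sequence id, channel) -> payload parts received so far
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 7 or not fields[0].endswith("AIVDM"):
            continue  # not an AIVDM sentence
        total, number = int(fields[1]), int(fields[2])
        key = (fields[3], fields[4])
        payload = fields[5]
        if total == 1:
            yield payload  # single-line message, nothing to join
            continue
        if number == 1:
            pending[key] = [payload]  # first part starts a new message
        elif key in pending and len(pending[key]) == number - 1:
            pending[key].append(payload)  # next part arrived in order
        else:
            pending.pop(key, None)  # out-of-order part: drop the message
            continue
        if number == total:
            yield "".join(pending.pop(key))  # all parts are present


# Placeholder sentences; payloads and checksums are made up.
sample = [
    "!AIVDM,1,1,,A,PAYLOAD1,0*00",
    "!AIVDM,2,1,7,B,FIRSTHALF,0*00",
    "!AIVDM,2,2,7,B,SECONDHALF,2*00",
]
payloads = list(reassemble(sample))
print(payloads)
```

Dropping incomplete or out-of-order fragments, as done here, is one possible design choice; it keeps a damaged message from reaching the decoder.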
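The division into four file types can be sketched as a small routing function. The labels "position", "voyage", "other" and "undecoded" are hypothetical names chosen for this illustration, and representing an undecodable message by None is an assumption; only the message ids come from the text above.

```python
from collections import Counter

POSITION_IDS = {1, 2, 3, 21}  # position related messages
VOYAGE_IDS = {5}              # static and voyage related messages


def file_type(message_id):
    """Return the file type a decoded message belongs to."""
    if message_id is None:    # assumption: None marks a message that failed to decode
        return "undecoded"
    if message_id in POSITION_IDS:
        return "position"
    if message_id in VOYAGE_IDS:
        return "voyage"
    return "other"


# Made-up sequence of message ids to show the routing.
sample_ids = [1, 3, 5, 18, None, 21]
counts = Counter(file_type(message_id) for message_id in sample_ids)
print(counts)
```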
The set of files with positions of the vessel was the biggest. The position files have the following fields. The timestamps are technically not part of the VDM messages, but are transmitted as separate messages in the AIS stream.
The set of position files includes messages with ids 1, 2, 3 and 21. The set of files with voyage related data has message id 5 and consists of the following fields:
- type and cargo
After the data was decoded, all files were zipped to save space. The sets of files with locations and files with static and voyage related data were uploaded to the UNECE sandbox by a secure copy (scp). It took approximately 7 hours to copy the data.
Ships are linked between the two datasets that were uploaded to the UNECE sandbox by the mmsi and, approximately, by the timestamp. The filenames are built up as follows: <year><month><day><hour><minute>.csv.gz. The two datasets are distinguished by placing them in the folders Locations and Messages respectively. The locations data comprises 144 GB of compressed data (about 200 GB uncompressed) and the messages data comprises about 5 GB of compressed data (about 7 GB uncompressed).
After uploading, the data was copied to the Hadoop File System (HDFS) by using:
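As an illustration of the filename convention, the sketch below builds and parses such names. It assumes a four-digit year and zero-padded month, day, hour and minute, which is consistent with a wildcard such as 20151231*.csv.gz; the function names filename_for and moment_from are hypothetical.

```python
from datetime import datetime

STAMP_FORMAT = "%Y%m%d%H%M"  # <year><month><day><hour><minute>


def filename_for(moment: datetime) -> str:
    """Build the file name for a given minute."""
    return moment.strftime(STAMP_FORMAT) + ".csv.gz"


def moment_from(filename: str) -> datetime:
    """Recover the timestamp encoded in a file name."""
    stamp = filename.split(".", 1)[0]  # strip the .csv.gz extension
    return datetime.strptime(stamp, STAMP_FORMAT)


name = filename_for(datetime(2015, 12, 31, 23, 59))
print(name)
print(moment_from(name))
```

Because the timestamp is the prefix of the name, a whole day or month can be selected with a simple prefix wildcard.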
hdfs dfs -put Messages/* AIS/Messages/
hdfs dfs -put Locations/* AIS/Locations/
After this, the data was available for use. Sometimes the SSH interface of the UNECE sandbox freezes. It is therefore advisable to use the screen command. This holds not only when one uploads the data, but also when one wants to run a bigger job in Spark. Before we start the job, we type:
After which we start the job at hand, e.g.:
We hit ctrl-a followed by d to detach from the screen and let the job run. If we want to reconnect to the screen we type:
after which we can see if the job finished.
Accessing the data in the sandbox
The data will be made accessible within the HDFS filesystem at the locations /datasets/AIS/Messages and /datasets/AIS/Locations. For legal reasons we have decided to create an AIS group on the Sandbox, so that only the usernames of the members of WP4 can access the Dirkzwager data.
In this chapter, it will be demonstrated how the data will be made available from Spark. It is assumed that the spark context is available under the name sc as is standard in the spark shell. After we started the spark shell, we type:
val rawdata = sc.textFile("hdfs://namenode.ib.sandbox.ichec.ie:8020/datasets/AIS/Locations/*.csv.gz");
to get all data available in Spark. The variable 'rawdata' is a so-called resilient distributed dataset (RDD). If one only wants to analyse the data of a certain period, the wildcard *.csv.gz can be changed. For instance, if one wants to analyse the data of 31 December 2015, one can do:
var rawdata = sc.textFile("hdfs://namenode.ib.sandbox.ichec.ie:8020/datasets/AIS/Locations/20151231*.csv.gz");
After we loaded the data, we have to split the lines to be able to access the individual fields. This is done by the following line:
var splitdata = rawdata.map(_.split(","));
As a next step we have to get rid of the headers in the data. Normally we start the header with a hash sign to mark it as the header line, which makes filtering out the headers much easier. However, this was not done in the case of the AIS data. For that reason we have to scan for the field name of the first field:
var data = splitdata.filter(x => x(0)!="mmsi");
Now the RDD data only contains records without headers and we can access the data. For instance, a record count for the particular day we selected:
gives us: res3: Long = 24358293 as an answer. So, on 31 December 2015, 24,358,293 records were collected. One can exit Spark with the exit command.
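For readers who want to check the logic without access to the sandbox, the split, header-filter and count steps above can be mirrored in plain Python on a tiny made-up sample. This is not Spark code and does not reproduce the count reported above; apart from mmsi, the column names and all values in the sample are invented for illustration.

```python
# Made-up sample: two files concatenated, each starting with a header line.
sample_lines = [
    "mmsi,lon,lat",          # header (only the name mmsi is from the text)
    "244660000,4.1,51.9",
    "mmsi,lon,lat",          # the header repeats once per file
    "219000000,12.6,55.7",
]

# Split every line into its fields.
split_data = [line.split(",") for line in sample_lines]

# Drop the header lines by checking the name of the first field.
data = [fields for fields in split_data if fields[0] != "mmsi"]

# Count the remaining records.
record_count = len(data)
print(record_count)
```

The list comprehensions play the role of the map and filter steps on the RDD, and len plays the role of the record count.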
Methodology and techniques
At the start of the project we had to make some decisions about how to obtain AIS data and about what AIS data to obtain. It was decided to obtain the European data from a company called Dirkzwager. As an environment, the UNECE sandbox is used. The advantage of the sandbox is that it is open for all NSI's within WP4. In the future it can be open for all NSI's within the UNECE. After the data was obtained, it was transformed and uploaded to the UNECE sandbox and stored in the Hadoop Filesystem (HDFS). The data will be made available in the HDFS at the location /datasets/AIS/ where two directories are available, i.e. Locations and Messages. We were able to run some first queries on the data, as seen in the previous paragraph, which brings us to the conclusion that we have created a good starting point for the next phase of the project.
[[Media:20160404 Minutes First Meeting WP4 ESSnet BD.docx.ogg]]
link to the page for documents on AIS