Automated interconnection of Linked Data applications and datasets

Details of experiments for ESWC 2017 submission

This page provides details of the experiments presented in our ESWC 2017 submission.

  1. Datasets
  2. Applications
  3. Transformers
  4. Experiment results
  5. Platform prototype overview
  6. Reproducing the experimental results

Datasets

We chose 16 real-world datasets and manually created their output data graphs. The output data graphs are necessary for the application pipeline discovery algorithm to be able to include the datasets in discovered pipelines.

They are presented in the following table. For each dataset, the table specifies:

  • Datasource - link to the datasource from which the dataset was extracted
  • Dataset - name of the dataset
  • Dataset extraction query - SPARQL query to extract the dataset from the datasource (an illustrative sketch follows the table below)
  • Output data graph - output data graph of the dataset necessary for the application pipeline discovery algorithm
Datasource | Dataset | Dataset extraction query | Output data graph
SPARQL endpoint | DBLP | query | graph
SPARQL endpoint | DBPedia - Earthquakes | query | graph
SPARQL endpoint | DBPedia - Towns | query | graph
SPARQL endpoint | European Data Portal | query | graph
SPARQL endpoint | Check actions - Czech Supreme Audit Office | query | graph
SPARQL endpoint | Check actions - Czech Trade Inspection Authority | query | graph
SPARQL endpoint | Legislation CZ - Acts | query | graph
SPARQL endpoint | Legislation CZ - Versions of Acts | query | graph
SPARQL endpoint | Legislation UK - Acts | query | graph
SPARQL endpoint | Legislation UK - Versions of Acts | query | graph
SPARQL endpoint | LinkedMDB | query | graph
SPARQL endpoint | RÚIAN - Address Places in Czech Republic | query | graph
SPARQL endpoint | RÚIAN - Towns in Czech Republic | query | graph
SPARQL endpoint | Subsidies from public budgets in Czech Republic | query | graph
RDF dump | University of Sheffield - Department of Computer Science | query | graph
SPARQL endpoint | Towns in Wikidata | query | graph
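
To give an idea of what a dataset extraction query looks like, below is a minimal sketch modeled on the DBPedia - Earthquakes dataset. It is only an illustration: the actual queries are linked in the table above, and the dbo:Earthquake class and dbo:date property are assumptions made for this example. The CONSTRUCT template also hints at the kind of shape captured by the corresponding output data graph.

		# Hypothetical sketch of a dataset extraction query for a dataset like
		# DBPedia - Earthquakes; the real queries are linked in the table above.
		PREFIX dbo: <http://dbpedia.org/ontology/>

		CONSTRUCT {
		  ?quake a dbo:Earthquake ;
		         dbo:date ?date .
		}
		WHERE {
		  ?quake a dbo:Earthquake ;
		         dbo:date ?date .
		}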

Applications

We defined 7 hypothetical applications which consume Linked Data (LD). Hypothetical means that the applications do not actually exist. The discovered pipelines transform datasets into the shape specified by the input descriptor queries we provide for each application.

The applications are presented in the following table. For each application, the table specifies:

  • Application - name of the application
  • Description - description of the application
  • Input descriptor query - input descriptor query of the application necessary for the application pipeline discovery algorithm (an illustrative sketch follows the table below)
Application | Description | Input descriptor query
TimeInstants | Consumes time instants (instances of time:Instant) and shows them on a time line. | query
TimeIntervals | Consumes time intervals (instances of time:Interval) and shows them on a time line. | query
ThingsTimeLines | Consumes versioned things (dct:hasVersion) where each version has a temporal abstraction (an instance of time:Interval) and shows versions of a chosen thing on a time line. | query
PlacesOnMap | Consumes spatial things (instances of geo:SpatialThing) and shows them as points on a map. | query
ThingsOnMap | Consumes things with geographical abstractions and shows them as labeled points on a map. A geographical abstraction is a place associated (geo:location) with a location expressed as a spatial thing (an instance of geo:SpatialThing). | query
QuantifiedThingsOnMap | Consumes things with geographical and quantified abstractions and shows them as points on a map. Points have labels and quantified values. A geographical abstraction is a place associated (geo:location) with a location expressed as a spatial thing (an instance of geo:SpatialThing). A quantified abstraction is a thing with a value associated via rdf:value. | query
PersonalProfiles | Consumes persons (instances of foaf:Person) who made (foaf:made) some things, where each thing has a temporal or geographical abstraction (see the descriptions above). For a chosen person, it shows the things he or she made on a time line and/or on a map. | query
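
To illustrate, an input descriptor query for the TimeInstants application might look like the following minimal sketch. The actual queries are linked in the table above; the ASK form and the use of time:inXSDDateTime are assumptions made for this example. The intent is that the descriptor matches datasets containing instances of time:Instant.

		# Hypothetical sketch of an input descriptor query for TimeInstants;
		# the real query is linked in the table above.
		PREFIX time: <http://www.w3.org/2006/time#>

		ASK {
		  ?instant a time:Instant ;
		           time:inXSDDateTime ?dateTime .
		}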

Transformers

We defined 33 transformers, listed in the following table. The table starts with transformers which transform proprietary RDF shapes into non-proprietary ones, followed by transformers which transform non-proprietary RDF shapes into other non-proprietary shapes. A proprietary RDF shape is a shape which contains classes or predicates from proprietary vocabularies. A proprietary vocabulary is a vocabulary used only in data sources provided by a single publisher, i.e., it is not reused by other publishers. For each transformer, the table specifies:

  • Transformer - name of the transformer
  • Proprietary input - true when the input expected by the transformer has a proprietary RDF shape.
  • Update query - update query which defines the transformer (an illustrative sketch follows the table below)
Transformer | Proprietary input | Update query
cedr-dotace-castka2rdf-value | true | query
cedr-sidliNaAdrese2geo-SpatialThing | true | query
cedr-smlouvaPodpisDatum2dct-created | true | query
cedr-smlouvaPodpisDatum2time-Instant | true | query
dbpedia-date2time-Instant | true | query
dbpedia-populationMetro2rdf-value | true | query
dbpedia-populationTotal2rdf-value | true | query
lex-Act2frbr-Work | true | query
movie-initial-release-of2time-Instant | true | query
movie-person-name2foaf-name | true | query
movie-person2foaf-made | true | query
ruian-AdresniMisto2geo-SpatialThing | true | query
ruian-DefinicniBod2geo-SpatialThing | true | query
wikidata-coordinate-location2geo-SpatialThing | true | query
wikidata-population2rdf-value | true | query
bibtex-date2dct-issued | false | query
dct-created2time-Instant | false | query
dct-date2time-Instant | false | query
dct-issued2time-Instant | false | query
dct-valid2time-Interval-01 | false | query
dct-valid2time-Interval-02 | false | query
foaf-maker2foaf-made | false | query
foaf-name2dct-title | false | query
foaf-rdfs-label2foaf-name | false | query
foaf-skos-prefLabel2foaf-name | false | query
frbr-realization2dct-hasVersion | false | query
frbr-realizationOf2frbr-realization | false | query
gr-legalName2dct-title | false | query
org-hasMembership2org-member | false | query
schema-address2geo-SpatialThing | false | query
schema-GeoCoordinates2geo-SpatialThing | false | query
swrc-editor2foaf-made | false | query
time-Interval2time-Interval | false | query
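
To illustrate, the dct-created2time-Instant transformer could be defined by an update query along the lines of the following sketch. The actual query is linked in the table above; the IRI-minting scheme for the new instants is an assumption made for this example.

		# Hypothetical sketch of dct-created2time-Instant: derive a time:Instant
		# from every dct:created value; the real update query is linked above.
		PREFIX dct:  <http://purl.org/dc/terms/>
		PREFIX time: <http://www.w3.org/2006/time#>

		INSERT {
		  ?instant a time:Instant ;
		           time:inXSDDateTime ?created .
		}
		WHERE {
		  ?thing dct:created ?created .
		  # Mint an IRI for the new instant; the naming scheme is an assumption.
		  BIND(IRI(CONCAT(STR(?thing), "/created-instant")) AS ?instant)
		}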

Experiment results

The experimental results presented in the paper are only a summary of the detailed results we provide in this Excel file (27.2 MB). It has the following sheets:

  1. Single dataset - (Dataset, Application) pairs for Experiment 1. The column Expected denotes whether the application was identified as the best application for the dataset. Expected == false means that the algorithm discovered a pipeline for the pair and the pair was manually checked as useful after the discovery. The columns Expected* show ideal pipelines for the pair which were identified before we ran the discovery. The columns Unexpected* show other pipelines discovered by the algorithm which were manually marked as useful after the discovery. The other columns are helpers or reserved for future use.
  2. Single dataset - results - the full list of discovered pipelines for all datasets in Experiment 1. It is not meant to be read directly; it serves as the source data for the summary sheet.
  3. Single dataset - summary - the pivot table computed from Single dataset - results. It shows the grouping of pipelines and the ranking of the groups as described in the paper. Level 1 contains applications. Level 2 contains datasets. Level 3 contains groups of pipelines. Level 4 contains pipelines. Each pipeline is displayed as the sequence of transformers it contains.
  4. Two datasets - (Dataset+Linkset+Dataset, Application) pairs for Experiment 2. The columns have the same meaning as for Single dataset.
  5. Two datasets - results - the full list of discovered pipelines in Experiment 2. Like Single dataset - results, it is not meant to be read directly.
  6. Two datasets - summary - the pivot table computed from Two datasets - results. Its structure is the same as the structure of Single dataset - summary.

The ideal pipeline discovered by the algorithm for each pair of dataset(s) and application listed in the provided detailed results is available in the table below. For each pipeline, the table specifies:

  • Dataset(s) - dataset name (Experiment 1) or names of two datasets and linkset (Experiment 2)
  • Application - application name
  • Expected - shows whether the combination of Dataset(s) and Application was expected, i.e. whether the Application was chosen as the best for the Dataset(s) before we ran the discovery
  • Pipeline JSON - link to the JSON representation of the discovered pipeline which can be directly imported to LinkedPipes ETL
  • Pipeline in LP-ETL - link to the pipeline presented in LinkedPipes ETL user interface where it can be also executed
  • Result - link to the result of the execution of the pipeline in Turtle
Dataset(s) | Application | Expected | Pipeline JSON | Pipeline in LP-ETL | Result
DBLP | PersonalProfilesApplication | Yes | JSON | LP-ETL | TTL
DBPedia - Earthquakes | TimeInstantsApplication | Yes | JSON | LP-ETL | TTL
DBPedia - Towns | PlacesOnMapApplication | Yes | JSON | LP-ETL | TTL
European Data Portal | TimeInstantsApplication | Yes | JSON | LP-ETL | TTL
Check Actions - Czech Supreme Audit Office | TimeIntervalsApplication | Yes | JSON | LP-ETL | TTL
Check Actions - Czech Trade Inspection Authority | TimeInstantsApplication | Yes | JSON | LP-ETL | TTL
Legislation CZ - Acts | TimeInstantsApplication | Yes | JSON | LP-ETL | TTL
Legislation CZ - Versions of Acts | ThingsTimelinesApplication | Yes | JSON | LP-ETL | TTL
Legislation UK - Acts | TimeInstantsApplication | Yes | JSON | LP-ETL | TTL
Legislation UK - Versions of Acts | ThingsTimelinesApplication | Yes | JSON | LP-ETL | TTL
LinkedMDB | PersonalProfilesApplication | Yes | JSON | LP-ETL | IRI included an unencoded space: '32' [line 99891]
RÚIAN - Address Places in Czech Republic | PlacesOnMapApplication | Yes | JSON | LP-ETL | Virtuoso 22023 Error SR...: The result vector is too large SPARQL query.
RÚIAN - Towns in Czech Republic | ThingsOnMapApplication | Yes | JSON | LP-ETL | TTL
Subsidies from public budgets in Czech Republic | TimeInstantsApplication | Yes | JSON | LP-ETL | HTTP 500
University of Sheffield - Department of Computer Science | PersonalProfilesApplication | Yes | JSON | LP-ETL | 403 Forbidden
Wikidata - Towns | QuantifiedThingsOnMapApplication | Yes | JSON | LP-ETL | TTL
DBLP | TimeInstantsApplication | No | JSON | LP-ETL | TTL
DBPedia - Earthquakes | PlacesOnMapApplication | No | JSON | LP-ETL | TTL
Check Actions - Czech Supreme Audit Office | TimeInstantsApplication | No | JSON | LP-ETL | TTL
Legislation CZ - Versions of Acts | TimeInstantsApplication | No | JSON | LP-ETL | TTL
Legislation CZ - Versions of Acts | TimeIntervalsApplication | No | JSON | LP-ETL | TTL
Legislation UK - Versions of Acts | TimeInstantsApplication | No | JSON | LP-ETL | TTL
Legislation UK - Versions of Acts | TimeInstantsApplication | No | JSON | LP-ETL | TTL
LinkedMDB | TimeInstantsApplication | No | JSON | LP-ETL | IRI included an unencoded space: '32' [line 99891]
RÚIAN - Towns in Czech Republic | PlacesOnMapApplication | No | JSON | LP-ETL | TTL
University of Sheffield - Department of Computer Science | TimeInstantsApplication | No | JSON | LP-ETL | 403 Forbidden
Wikidata - Towns | PlacesOnMapApplication | No | JSON | LP-ETL | TTL
Wikidata - Towns | ThingsOnMapApplication | No | JSON | LP-ETL | TTL
Towns in Wikidata + Towns in Czech Republic - RÚIAN (Linkset: Towns from Wikidata --- Towns from RUIAN) | QuantifiedThingsOnMapApplication | Yes | JSON | LP-ETL | TTL
Towns in DBPedia + Towns in Czech Republic - RÚIAN (Linkset: Towns from DBPedia --- Towns from RUIAN) | QuantifiedThingsOnMapApplication | Yes | JSON | LP-ETL | DBPedia timeout
Towns in Wikidata + Towns in Czech Republic - RÚIAN (Linkset: Towns from Wikidata --- Towns from RUIAN) | ThingsOnMapApplication | No | JSON | LP-ETL | TTL
Towns in DBPedia + Towns in Czech Republic - RÚIAN (Linkset: Towns from DBPedia --- Towns from RUIAN) | ThingsOnMapApplication | No | JSON | LP-ETL | DBPedia timeout

Platform prototype overview

Currently, the platform consists of three different services:

  • LinkedPipes Discovery
  • LinkedPipes ETL
  • LinkedPipes Visualization

The LP-Discovery service is used solely for discovering application pipelines. It is preconfigured with the aforementioned transformers and applications. When a discovery is executed, it finds possible application pipelines. A discovered application pipeline can later be exported into a pre-configured LP-ETL instance.

The LP-ETL tool is capable of reliably executing application pipelines. LP-Discovery creates a selected pipeline remotely in LP-ETL, executes it, and returns the IRI of a named graph that will contain the execution results once LP-ETL finishes the execution.

LP-VIZ implements visual applications that can be applied to pipeline execution results. One can pass a reference to the named graph storing the pipeline execution results to LP-VIZ and let it visualize the data contained in the referenced graph.

Reproducing the experimental results

This page describes the steps necessary to reproduce the results described in our ESWC 2017 paper submission.

To evaluate the proposed platform, we have implemented LinkedPipes, a suite of web services, each specializing in different tasks related to processing Linked Data. In this section, we describe the current state of the implementation of this suite. We briefly describe the services related to application pipeline discovery and how the services communicate with each other to support application pipeline discovery workflows.

The application pipeline discovery itself is implemented in the discovery service, which provides the following API:

		start   POST         /discovery/start
		status  GET          /discovery/$id
		list    GET          /discovery/$id/pipelines
		csv     GET          /discovery/$id/csv
		export  GET          /discovery/$id/pipelines/$pipelineId
		execute GET          /discovery/$id/pipelines/$pipelineId/execute
		stop    GET          /discovery/$id/stop
		

The discovery itself is executed by calling the start API call. This API call expects a discovery configuration JSON object to be posted:

		{
		    "sparqlEndpoints": [{
		        "url": "http://www.europeandataportal.eu/sparql",
		        "descriptorIri": "https://...sample.ttl",
		        "defaultGraphIris": [],
		        "label": "European Data Portal"
		    }]
		}
		

The configuration object allows the user to specify an array of SPARQL endpoint definitions. For each endpoint, it requires its URL and a dereferenceable IRI of its descriptor. Optionally, it also accepts a list of default graph IRIs, which are the named graphs to be considered while performing the discovery. To make the discovery results more readable, it also accepts a user-defined label to distinguish the endpoints.

When the start API call is executed, it returns a JSON object containing just one property, which is an ID of the started pipeline discovery instance:

		{ "id": "c1582982-1038-4218-911c-12c94ebd2b19" }
		

This ID is to be used later as a parameter of the remaining API calls.

The next step is to wait for the discovery to complete. Although partial results (discovered application pipelines) are available via the list API call immediately after the iteration in which a pipeline is discovered, the status API call can be used to wait for the discovery to complete. It returns a JSON object such as:

		{
		  "pipelineCount": 1,
		  "isFinished": true,
		  "duration": 350
		}
		

Once the isFinished property is set to true, the pipeline discovery is finished and we can call the list API call to obtain all discovered application pipelines. The returned data contain details about all the discovered pipelines, mainly which of the provided datasources were used, which application is able to consume data from them, and which transformations are needed to assemble the application pipeline. Moreover, they contain an ID assigned to every discovered pipeline.

Such an ID can later be used with the export or execute API calls. The former responds with JSON-LD data in a format consumed by our ETL service. The latter directly contacts a pre-configured ETL service instance, imports the specified pipeline into it, and executes it. An example result of the execute call:

		{
		    "pipelineId" : "e2df045d-1fe3-46f8-8e61-cf5f168626b2",
		    "etlPipelineIri" : "http://xrg12.ms.mff.cuni.cz:8090/resources/pipelines/created-1482867746217",
		    "etlExecutionIri" : "http://xrg12.ms.mff.cuni.cz:8090/resources/executions/30fc168d-38b3-4c43-b3fe-2009d0f139e0",
		    "resultGraphIri" : "urn:23acd444-edbc-488c-ae31-a99a41b97e70"
		}
		

There are two more API calls that we have not mentioned so far. One of them is the stop API call, which can be used to terminate the pipeline discovery if necessary. The last one, csv, is an API call that temporarily simulates the ranking service; we used it for conducting the experiments. It implements the aforementioned ranking logic and provides a quick overview of the discovery results.

The service instances used in the experiments are:

  • Result storage: http://demo.visualization.linkedpipes.com:8890/sparql
  • LP-ETL instance: http://xrg12.ms.mff.cuni.cz:8090
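
Once an execution finishes, the pipeline results can be inspected by querying the result storage endpoint above for the named graph identified by the resultGraphIri returned from the execute call. A minimal sketch, using the example graph IRI from the execute response above:

		# List a sample of triples from a pipeline result graph; the graph IRI
		# is the resultGraphIri from the example execute response above.
		SELECT ?s ?p ?o
		FROM <urn:23acd444-edbc-488c-ae31-a99a41b97e70>
		WHERE {
		  ?s ?p ?o .
		}
		LIMIT 100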