Automated interconnection of Linked Data applications and datasets

Details of experiments for ESWC 2017 submission

This page provides details of the experiments presented in our ESWC 2017 submission.

  1. Datasets
  2. Applications
  3. Transformers
  4. Experiment results
  5. Platform prototype overview
  6. Reproducing the experimental results

Datasets

We chose 16 real-world datasets and manually created their output data graphs. The output data graphs are necessary for the application pipeline discovery algorithm to be able to include the datasets in discovered pipelines.

They are presented in the following table. For each dataset, the table specifies:

  • Datasource - link to the datasource from which the dataset was extracted
  • Dataset - name of the dataset
  • Dataset extraction query - SPARQL query to extract the dataset from the datasource (an illustrative sketch follows the table below)
  • Output data graph - output data graph of the dataset necessary for the application pipeline discovery algorithm
Datasource | Dataset | Dataset extraction query | Output data graph
SPARQL endpoint | DBLP | query | graph
SPARQL endpoint | DBPedia - Earthquakes | query | graph
SPARQL endpoint | DBPedia - Towns | query | graph
SPARQL endpoint | European Data Portal | query | graph
SPARQL endpoint | Check actions - Czech Supreme Audit Office | query | graph
SPARQL endpoint | Check actions - Czech Trade Inspection Authority | query | graph
SPARQL endpoint | Legislation CZ - Acts | query | graph
SPARQL endpoint | Legislation CZ - Versions of Acts | query | graph
SPARQL endpoint | Legislation UK - Acts | query | graph
SPARQL endpoint | Legislation UK - Versions of Acts | query | graph
SPARQL endpoint | LinkedMDB | query | graph
SPARQL endpoint | RÚIAN - Address Places in Czech Republic | query | graph
SPARQL endpoint | RÚIAN - Towns in Czech Republic | query | graph
SPARQL endpoint | Subsidies from public budgets in Czech Republic | query | graph
RDF dump | University of Sheffield - Department of Computer Science | query | graph
SPARQL endpoint | Towns in Wikidata | query | graph
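
To give an idea of what a dataset extraction query looks like, below is a minimal sketch modeled on the DBPedia - Earthquakes dataset. It is only an illustration: the actual queries are linked in the table above, and the dbo:Earthquake class and dbo:date property are assumptions made for this example. The CONSTRUCT template also hints at the kind of shape captured by the corresponding output data graph.

		# Hypothetical sketch of a dataset extraction query for a dataset like
		# DBPedia - Earthquakes; the real queries are linked in the table above.
		PREFIX dbo: <http://dbpedia.org/ontology/>

		CONSTRUCT {
		  ?quake a dbo:Earthquake ;
		         dbo:date ?date .
		}
		WHERE {
		  ?quake a dbo:Earthquake ;
		         dbo:date ?date .
		}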

Applications

We defined 7 hypothetical applications which consume Linked Data (LD). Hypothetical means that the applications do not actually exist. The discovered pipelines transform datasets into the shape specified by the input descriptor queries we provide for each application.

The applications are presented in the following table. For each application, the table specifies:

  • Application - name of the application
  • Description - description of the application
  • Input descriptor query - input descriptor query of the application necessary for the application pipeline discovery algorithm (an illustrative sketch follows the table below)
Application | Description | Input descriptor query
TimeInstants | Consumes time instants (instances of time:Instant) and shows them on a time line. | query
TimeIntervals | Consumes time intervals (instances of time:Interval) and shows them on a time line. | query
ThingsTimeLines | Consumes versioned things (dct:hasVersion) where each version has a temporal abstraction (an instance of time:Interval) and shows versions of a chosen thing on a time line. | query
PlacesOnMap | Consumes spatial things (instances of geo:SpatialThing) and shows them as points on a map. | query
ThingsOnMap | Consumes things with geographical abstractions and shows them as labeled points on a map. A geographical abstraction is a place associated (geo:location) with a location expressed as a spatial thing (an instance of geo:SpatialThing). | query
QuantifiedThingsOnMap | Consumes things with geographical and quantified abstractions and shows them as points on a map. Points have labels and quantified values. A geographical abstraction is a place associated (geo:location) with a location expressed as a spatial thing (an instance of geo:SpatialThing). A quantified abstraction is a thing with a value associated via rdf:value. | query
PersonalProfiles | Consumes persons (instances of foaf:Person) who made (foaf:made) some things, where each thing has a temporal or geographical abstraction (see the descriptions above). For a chosen person, it shows the things he or she made on a time line and/or on a map. | query
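
To illustrate, an input descriptor query for the TimeInstants application might look like the following minimal sketch. The actual queries are linked in the table above; the ASK form and the use of time:inXSDDateTime are assumptions made for this example. The intent is that the descriptor matches datasets containing instances of time:Instant.

		# Hypothetical sketch of an input descriptor query for TimeInstants;
		# the real query is linked in the table above.
		PREFIX time: <http://www.w3.org/2006/time#>

		ASK {
		  ?instant a time:Instant ;
		           time:inXSDDateTime ?dateTime .
		}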

Transformers

We defined 33 transformers, listed in the following table. The table starts with transformers which transform proprietary RDF shapes into non-proprietary ones, followed by transformers which transform non-proprietary RDF shapes into other non-proprietary shapes. A proprietary RDF shape is a shape which contains classes or predicates from proprietary vocabularies. A proprietary vocabulary is a vocabulary used only in data sources provided by a single publisher, i.e., it is not reused by other publishers. For each transformer, the table specifies:

  • Transformer - name of the transformer
  • Proprietary input - true when the input expected by the transformer has a proprietary RDF shape.
  • Update query - update query which defines the transformer (an illustrative sketch follows the table below)
Transformer | Proprietary input | Update query
cedr-dotace-castka2rdf-value | true | query
cedr-sidliNaAdrese2geo-SpatialThing | true | query
cedr-smlouvaPodpisDatum2dct-created | true | query
cedr-smlouvaPodpisDatum2time-Instant | true | query
dbpedia-date2time-Instant | true | query
dbpedia-populationMetro2rdf-value | true | query
dbpedia-populationTotal2rdf-value | true | query
lex-Act2frbr-Work | true | query
movie-initial-release-of2time-Instant | true | query
movie-person-name2foaf-name | true | query
movie-person2foaf-made | true | query
ruian-AdresniMisto2geo-SpatialThing | true | query
ruian-DefinicniBod2geo-SpatialThing | true | query
wikidata-coordinate-location2geo-SpatialThing | true | query
wikidata-population2rdf-value | true | query
bibtex-date2dct-issued | false | query
dct-created2time-Instant | false | query
dct-date2time-Instant | false | query
dct-issued2time-Instant | false | query
dct-valid2time-Interval-01 | false | query
dct-valid2time-Interval-02 | false | query
foaf-maker2foaf-made | false | query
foaf-name2dct-title | false | query
foaf-rdfs-label2foaf-name | false | query
foaf-skos-prefLabel2foaf-name | false | query
frbr-realization2dct-hasVersion | false | query
frbr-realizationOf2frbr-realization | false | query
gr-legalName2dct-title | false | query
org-hasMembership2org-member | false | query
schema-address2geo-SpatialThing | false | query
schema-GeoCoordinates2geo-SpatialThing | false | query
swrc-editor2foaf-made | false | query
time-Interval2time-Interval | false | query
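
To illustrate, the dct-created2time-Instant transformer could be defined by an update query along the lines of the following sketch. The actual query is linked in the table above; the IRI-minting scheme for the new instants is an assumption made for this example.

		# Hypothetical sketch of dct-created2time-Instant: derive a time:Instant
		# from every dct:created value; the real update query is linked above.
		PREFIX dct:  <http://purl.org/dc/terms/>
		PREFIX time: <http://www.w3.org/2006/time#>

		INSERT {
		  ?instant a time:Instant ;
		           time:inXSDDateTime ?created .
		}
		WHERE {
		  ?thing dct:created ?created .
		  # Mint an IRI for the new instant; the naming scheme is an assumption.
		  BIND(IRI(CONCAT(STR(?thing), "/created-instant")) AS ?instant)
		}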

Experiment results

The experimental results presented in the paper are only a summary of the detailed results we provide in this Excel file (27.2 MB). It has the following sheets:

  1. Single dataset - (Dataset, Application) pairs for Experiment 1. The column Expected denotes whether the application was identified as the best application for the dataset. Expected == false means that the algorithm discovered a pipeline for the pair and the pair was manually checked as useful after the discovery. The columns Expected* show ideal pipelines for the pair which were identified before we ran the discovery. The columns Unexpected* show other pipelines discovered by the algorithm which were manually marked as useful after the discovery. The other columns are helpers or reserved for future use.
  2. Single dataset - results - the full list of discovered pipelines for all datasets in Experiment 1. It is not meant to be read directly; it serves as the source data for the summary sheet.
  3. Single dataset - summary - the pivot table computed from Single dataset - results. It shows the grouping of pipelines and the ranking of the groups as described in the paper. Level 1 contains applications. Level 2 contains datasets. Level 3 contains groups of pipelines. Level 4 contains pipelines. Each pipeline is displayed as the sequence of transformers it contains.
  4. Two datasets - (Dataset+Linkset+Dataset, Application) pairs for Experiment 2. The columns have the same meaning as for Single dataset.
  5. Two datasets - results - the full list of discovered pipelines in Experiment 2. Like Single dataset - results, it is not meant to be read directly.
  6. Two datasets - summary - the pivot table computed from Two datasets - results. Its structure is the same as the structure of Single dataset - summary.

The ideal pipeline discovered by the algorithm for each pair of dataset(s) and application listed in the provided detailed results is available in the table below. For each pipeline, the table specifies:

  • Dataset(s) - dataset name (Experiment 1) or names of two datasets and linkset (Experiment 2)
  • Application - application name
  • Expected - shows whether the combination of Dataset(s) and Application was expected, i.e. whether the Application was chosen as the best for the Dataset(s) before we ran the discovery
  • Pipeline JSON - link to the JSON representation of the discovered pipeline which can be directly imported to LinkedPipes ETL
  • Pipeline in LP-ETL - link to the pipeline presented in LinkedPipes ETL user interface where it can be also executed
  • Result - link to the result of the execution of the pipeline in Turtle
Dataset(s) | Application | Expected | Pipeline JSON | Pipeline in LP-ETL | Result
DBLP | PersonalProfilesApplication | Yes | JSON | LP-ETL | TTL
DBPedia - Earthquakes | TimeInstantsApplication | Yes | JSON | LP-ETL | TTL
DBPedia - Towns | PlacesOnMapApplication | Yes | JSON | LP-ETL | TTL
European Data Portal | TimeInstantsApplication | Yes | JSON | LP-ETL | TTL
Check Actions - Czech Supreme Audit Office | TimeIntervalsApplication | Yes | JSON | LP-ETL | TTL
Check Actions - Czech Trade Inspection Authority | TimeInstantsApplication | Yes | JSON | LP-ETL | TTL
Legislation CZ - Acts | TimeInstantsApplication | Yes | JSON | LP-ETL | TTL
Legislation CZ - Versions of Acts | ThingsTimelinesApplication | Yes | JSON | LP-ETL | TTL
Legislation UK - Acts | TimeInstantsApplication | Yes | JSON | LP-ETL | TTL
Legislation UK - Versions of Acts | ThingsTimelinesApplication | Yes | JSON | LP-ETL | TTL
LinkedMDB | PersonalProfilesApplication | Yes | JSON | LP-ETL | IRI included an unencoded space: '32' [line 99891]
RÚIAN - Address Places in Czech Republic | PlacesOnMapApplication | Yes | JSON | LP-ETL | Virtuoso 22023 Error SR...: The result vector is too large SPARQL query.
RÚIAN - Towns in Czech Republic | ThingsOnMapApplication | Yes | JSON | LP-ETL | TTL
Subsidies from public budgets in Czech Republic | TimeInstantsApplication | Yes | JSON | LP-ETL | HTTP 500
University of Sheffield - Department of Computer Science | PersonalProfilesApplication | Yes | JSON | LP-ETL | 403 Forbidden
Wikidata - Towns | QuantifiedThingsOnMapApplication | Yes | JSON | LP-ETL | TTL
DBLP | TimeInstantsApplication | No | JSON | LP-ETL | TTL
DBPedia - Earthquakes | PlacesOnMapApplication | No | JSON | LP-ETL | TTL
Check Actions - Czech Supreme Audit Office | TimeInstantsApplication | No | JSON | LP-ETL | TTL
Legislation CZ - Versions of Acts | TimeInstantsApplication | No | JSON | LP-ETL | TTL
Legislation CZ - Versions of Acts | TimeIntervalsApplication | No | JSON | LP-ETL | TTL
Legislation UK - Versions of Acts | TimeInstantsApplication | No | JSON | LP-ETL | TTL
Legislation UK - Versions of Acts | TimeInstantsApplication | No | JSON | LP-ETL | TTL
LinkedMDB | TimeInstantsApplication | No | JSON | LP-ETL | IRI included an unencoded space: '32' [line 99891]
RÚIAN - Towns in Czech Republic | PlacesOnMapApplication | No | JSON | LP-ETL | TTL
University of Sheffield - Department of Computer Science | TimeInstantsApplication | No | JSON | LP-ETL | 403 Forbidden
Wikidata - Towns | PlacesOnMapApplication | No | JSON | LP-ETL | TTL
Wikidata - Towns | ThingsOnMapApplication | No | JSON | LP-ETL | TTL
Towns in Wikidata + Towns in Czech Republic - RÚIAN (Linkset: Towns from Wikidata --- Towns from RUIAN) | QuantifiedThingsOnMapApplication | Yes | JSON | LP-ETL | TTL
Towns in DBPedia + Towns in Czech Republic - RÚIAN (Linkset: Towns from DBPedia --- Towns from RUIAN) | QuantifiedThingsOnMapApplication | Yes | JSON | LP-ETL | DBPedia timeout
Towns in Wikidata + Towns in Czech Republic - RÚIAN (Linkset: Towns from Wikidata --- Towns from RUIAN) | ThingsOnMapApplication | No | JSON | LP-ETL | TTL
Towns in DBPedia + Towns in Czech Republic - RÚIAN (Linkset: Towns from DBPedia --- Towns from RUIAN) | ThingsOnMapApplication | No | JSON | LP-ETL | DBPedia timeout

Platform prototype overview

Currently, the platform consists of three different services:

  • LinkedPipes Discovery
  • LinkedPipes ETL
  • LinkedPipes Visualization

The LP-Discovery service is used solely for discovering application pipelines. It is preconfigured with the aforementioned transformers and applications. When a discovery is executed, it finds possible application pipelines. A discovered application pipeline can later be exported into a pre-configured LP-ETL instance.

The LP-ETL tool is capable of reliably executing application pipelines. LP-Discovery creates a selected pipeline remotely in LP-ETL, executes it, and returns the IRI of a named graph that will contain the execution results once LP-ETL finishes the execution.

LP-VIZ implements visual applications that can be applied to pipeline execution results. One can pass a reference to the named graph storing the pipeline execution results to LP-VIZ and let it visualize the data contained in the referenced graph.

Reproducing the experimental results

This page describes the steps necessary to reproduce the results described in our ESWC 2017 paper submission.

To evaluate the proposed platform, we have implemented LinkedPipes, a suite of web services, each specializing in different tasks related to processing Linked Data. In this section, we describe the current state of the implementation of this suite. We briefly describe the services related to application pipeline discovery and how the services communicate with each other to support application pipeline discovery workflows.

The application pipeline discovery itself is implemented in the discovery service, which provides the following API:

		start   POST         /discovery/start
		status  GET          /discovery/$id
		list    GET          /discovery/$id/pipelines
		csv     GET          /discovery/$id/csv
		export  GET          /discovery/$id/pipelines/$pipelineId
		execute GET          /discovery/$id/pipelines/$pipelineId/execute
		stop    GET          /discovery/$id/stop
		

The discovery itself is executed by calling the start API call. This API call expects a discovery configuration JSON object to be posted:

		{
		    "sparqlEndpoints": [{
		        "url": "http://www.europeandataportal.eu/sparql",
		        "descriptorIri": "https://...sample.ttl",
		        "defaultGraphIris": [],
		        "label": "European Data Portal"
		    }]
		}
		

The configuration object allows the user to specify an array of SPARQL endpoint definitions. For each endpoint, it requires its URL and a dereferenceable IRI of its descriptor. Optionally, it also accepts a list of default graph IRIs, which are the named graphs to be considered while performing the discovery. To make the discovery results more readable, it also accepts a user-defined label to distinguish the endpoints.

When the start API call is executed, it returns a JSON object containing just one property, which is an ID of the started pipeline discovery instance:

		{ "id": "c1582982-1038-4218-911c-12c94ebd2b19" }
		

This ID is to be used later as a parameter of the remaining API calls.

The next step is to wait for the discovery to complete. Although partial results (discovered application pipelines) are available via the list API call immediately after the iteration in which a pipeline is discovered, the status API call can be used to wait for the discovery to complete. It returns a JSON object such as:

		{
		  "pipelineCount": 1,
		  "isFinished": true,
		  "duration": 350
		}
		

Once the isFinished property is set to true, the pipeline discovery is finished and we can call the list API call to obtain all discovered application pipelines. The returned data contain details about all the discovered pipelines, mainly which of the provided datasources were used, which application is able to consume data from them, and which transformations are needed to assemble the application pipeline. Moreover, they contain an ID assigned to every discovered pipeline.

Such an ID can later be used with the export or execute API calls. The former responds with JSON-LD data in a format consumed by our ETL service. The latter directly contacts a pre-configured ETL service instance, imports the specified pipeline into it, and executes it. An example result of the execute call:

		{
		    "pipelineId" : "e2df045d-1fe3-46f8-8e61-cf5f168626b2",
		    "etlPipelineIri" : "http://xrg12.ms.mff.cuni.cz:8090/resources/pipelines/created-1482867746217",
		    "etlExecutionIri" : "http://xrg12.ms.mff.cuni.cz:8090/resources/executions/30fc168d-38b3-4c43-b3fe-2009d0f139e0",
		    "resultGraphIri" : "urn:23acd444-edbc-488c-ae31-a99a41b97e70"
		}
		

There are two more API calls that we have not mentioned so far. One of them is the stop API call, which can be used to terminate the pipeline discovery if necessary. The last one, csv, is an API call that temporarily simulates the ranking service; we used it for conducting the experiments. It implements the aforementioned ranking logic and provides a quick overview of the discovery results.

The service instances used in the experiments are:

  • Result storage: http://demo.visualization.linkedpipes.com:8890/sparql
  • LP-ETL instance: http://xrg12.ms.mff.cuni.cz:8090
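
Once an execution finishes, the pipeline results can be inspected by querying the result storage endpoint above for the named graph identified by the resultGraphIri returned from the execute call. A minimal sketch, using the example graph IRI from the execute response above:

		# List a sample of triples from a pipeline result graph; the graph IRI
		# is the resultGraphIri from the example execute response above.
		SELECT ?s ?p ?o
		FROM <urn:23acd444-edbc-488c-ae31-a99a41b97e70>
		WHERE {
		  ?s ?p ?o .
		}
		LIMIT 100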