Automated interconnection of Linked Data applications and datasets
Details of experiments for ESWC 2017 submission
This page provides details of the experiments presented in our ESWC 2017 submission.
Datasets
We have chosen 16 real-world datasets and manually created their output data graphs. The application pipeline discovery algorithm needs the output data graphs to be able to include the datasets in the discovered pipelines.
They are presented in the following table. For each dataset, the table specifies:
- Datasource - link to the datasource from which the dataset was extracted
- Dataset - name of the dataset
- Dataset extraction query - SPARQL query to extract the dataset from the datasource
- Output data graph - output data graph of the dataset necessary for the application pipeline discovery algorithm
Datasource | Dataset | Dataset extraction query | Output data graph |
---|---|---|---|
SPARQL endpoint | DBLP | query | graph |
SPARQL endpoint | DBPedia - Earthquakes | query | graph |
SPARQL endpoint | DBPedia - Towns | query | graph |
SPARQL endpoint | European Data Portal | query | graph |
SPARQL endpoint | Check actions - Czech Supreme Audit Office | query | graph |
SPARQL endpoint | Check actions - Czech Trade Inspection Authority | query | graph |
SPARQL endpoint | Legislation CZ - Acts | query | graph |
SPARQL endpoint | Legislation CZ - Versions of Acts | query | graph |
SPARQL endpoint | Legislation UK - Acts | query | graph |
SPARQL endpoint | Legislation UK - Versions of Acts | query | graph |
SPARQL endpoint | LinkedMDB | query | graph |
SPARQL endpoint | RÚIAN - Address Places in Czech Republic | query | graph |
SPARQL endpoint | RÚIAN - Towns in Czech Republic | query | graph |
SPARQL endpoint | Subsidies from public budgets in Czech Republic | query | graph |
RDF dump | University of Sheffield - Department of Comp. Sc. | query | graph |
SPARQL endpoint | Towns in Wikidata | query | graph |
Applications
We defined 7 hypothetical applications which consume Linked Data (LD). Hypothetical means that the applications do not actually exist. The discovered pipelines transform datasets to the shape specified by the input descriptor queries we provide for each application.
The applications are presented in the following table. For each application, the table specifies:
- Application - name of the application
- Description - description of the application
- Input descriptor query - input descriptor query of the application necessary for the application pipeline discovery algorithm
Application | Description | Input descriptor query |
---|---|---|
TimeInstants | Consumes time instants (instances of time:Instant) and shows them on a time line. | query |
TimeIntervals | Consumes time intervals (instances of time:Interval) and shows them on a time line. | query |
ThingsTimeLines | Consumes versioned things (dct:hasVersion) where each version has a temporal abstraction (instance of time:Interval) and shows versions of a chosen thing on a time line. | query |
PlacesOnMap | Consumes spatial things (instances of geo:SpatialThing) and shows them as points on a map. | query |
ThingsOnMap | Consumes things with geographical abstractions and shows them as labeled points on a map. A geographical abstraction is a place associated (geo:location) with a location expressed as a spatial thing (instances of geo:SpatialThing). | query |
QuantifiedThingsOnMap | Consumes things with geographical and quantified abstractions and shows them as points on a map. Points have labels and quantified values. A geographical abstraction is a place associated (geo:location) with a location expressed as a spatial thing (instances of geo:SpatialThing). A quantified abstraction is a thing with a value associated by rdf:value | query |
PersonalProfiles | Consumes persons (instances of foaf:Person) who made (foaf:made) some things where each thing has a temporal or geographical abstraction (see descriptions above). For a chosen person it shows the things he or she made on a timeline and/or on a map. | query |
Transformers
We defined 33 transformers listed in the following table. The table starts with transformers which transform proprietary RDF shapes to non-proprietary ones. Transformers which transform non-proprietary RDF shapes to other non-proprietary shapes follow. A proprietary RDF shape is a shape which contains classes or predicates from proprietary vocabularies. A proprietary vocabulary is a vocabulary which is used only in data sources provided by a single publisher, i.e., it is not reused by other publishers. For each transformer, the table specifies:
- Transformer - name of the transformer
- Proprietary input - true when the input expected by the transformer has a proprietary RDF shape.
- Update query - update query which defines the transformer
Transformer | Proprietary input | Update query |
---|---|---|
cedr-dotace-castka2rdf-value | true | query |
cedr-sidliNaAdrese2geo-SpatialThing | true | query |
cedr-smlouvaPodpisDatum2dct-created | true | query |
cedr-smlouvaPodpisDatum2time-Instant | true | query |
dbpedia-date2time-Instant | true | query |
dbpedia-populationMetro2rdf-value | true | query |
dbpedia-populationTotal2rdf-value | true | query |
lex-Act2frbr-Work | true | query |
movie-initial-release-of2time-Instant | true | query |
movie-person-name2foaf-name | true | query |
movie-person2foaf-made | true | query |
ruian-AdresniMisto2geo-SpatialThing | true | query |
ruian-DefinicniBod2geo-SpatialThing | true | query |
wikidata-coordinate-location2geo-SpatialThing | true | query |
wikidata-population2rdf-value | true | query |
bibtex-date2dct-issued | false | query |
dct-created2time-Instant | false | query |
dct-date2time-Instant | false | query |
dct-issued2time-Instant | false | query |
dct-valid2time-Interval-01 | false | query |
dct-valid2time-Interval-02 | false | query |
foaf-maker2foaf-made | false | query |
foaf-name2dct-title | false | query |
foaf-rdfs-label2foaf-name | false | query |
foaf-skos-prefLabel2foaf-name | false | query |
frbr-realization2dct-hasVersion | false | query |
frbr-realizationOf2frbr-realization | false | query |
gr-legalName2dct-title | false | query |
org-hasMembership2org-member | false | query |
schema-address2geo-SpatialThing | false | query |
schema-GeoCoordinates2geo-SpatialThing | false | query |
swrc-editor2foaf-made | false | query |
time-Interval2time-Interval | false | query |
Experiment results
The experimental results presented in the paper are only a summary of the detailed results we provide in this Excel file (27.2 MB). It has the following sheets:
- Single dataset - (Dataset,Application) pairs for Experiment 1. Column Expected denotes whether the application was identified as the best application for the dataset or not. Expected == false means that the algorithm discovered a pipeline for the pair and that the pair was manually checked as useful after the discovery. Columns Expected* show ideal pipelines for the pair which were identified before we ran the discovery. Columns Unexpected* show other pipelines discovered by the algorithm which were manually marked as useful after the discovery. Other columns are helpers or reserved for future use.
- Single dataset - results - shows the full list of discovered pipelines for all datasets in Experiment 1. It is raw data and is not meant to be read directly.
- Single dataset - summary - shows the pivot table computed from Single dataset - results. It shows the grouping of pipelines and ranking the groups as described in the paper. Level 1 contains applications. Level 2 contains datasets. Level 3 contains groups of pipelines. Level 4 contains pipelines. Each pipeline is displayed as a sequence of transformers which are present in the pipeline.
- Two datasets - (Dataset+Linkset+Dataset,Application) pairs for Experiment 2. The same meaning of columns as for Single dataset.
- Two datasets - results - shows the full list of discovered pipelines in Experiment 2. It is raw data and is not meant to be read directly.
- Two datasets - summary - shows the pivot table computed from Two datasets - results. Its structure is the same as the structure of Single dataset - summary.
The ideal pipeline discovered by the algorithm for each pair of dataset(s) and application listed in the provided detailed results is available in the table below. For each pipeline, the table specifies:
- Dataset(s) - dataset name (Experiment 1) or names of two datasets and linkset (Experiment 2)
- Application - application name
- Expected - shows whether the combination of Dataset(s) and Application was expected, i.e. whether the Application was chosen as the best for the Dataset(s) before we ran the discovery
- Pipeline JSON - link to the JSON representation of the discovered pipeline which can be directly imported to LinkedPipes ETL
- Pipeline in LP-ETL - link to the pipeline presented in the LinkedPipes ETL user interface where it can also be executed
- Result - link to the result of executing the pipeline, in Turtle
Dataset(s) | Application | Expected | Pipeline JSON | Pipeline in LP-ETL | Result |
---|---|---|---|---|---|
DBLP | PersonalProfilesApplication | Yes | JSON | LP-ETL | TTL |
DBPedia - Earthquakes | TimeInstantsApplication | Yes | JSON | LP-ETL | TTL |
DBPedia - Towns | PlacesOnMapApplication | Yes | JSON | LP-ETL | TTL |
European Data Portal | TimeInstantsApplication | Yes | JSON | LP-ETL | TTL |
Check Actions - Czech Supreme Audit Office | TimeIntervalsApplication | Yes | JSON | LP-ETL | TTL |
Check Actions - Czech Trade Inspection Authority | TimeInstantsApplication | Yes | JSON | LP-ETL | TTL |
Legislation CZ - Acts | TimeInstantsApplication | Yes | JSON | LP-ETL | TTL |
Legislation CZ - Versions of Acts | ThingsTimelinesApplication | Yes | JSON | LP-ETL | TTL |
Legislation UK - Acts | TimeInstantsApplication | Yes | JSON | LP-ETL | TTL |
Legislation UK - Versions of Acts | ThingsTimelinesApplication | Yes | JSON | LP-ETL | TTL |
LinkedMDB | PersonalProfilesApplication | Yes | JSON | LP-ETL | IRI included an unencoded space: '32' [line 99891] |
RÚIAN - Address Places in Czech Republic | PlacesOnMapApplication | Yes | JSON | LP-ETL | Virtuoso 22023 Error SR...: The result vector is too large SPARQL query. |
RÚIAN - Towns in Czech Republic | ThingsOnMapApplication | Yes | JSON | LP-ETL | TTL |
Subsidies from public budgets in Czech Republic | TimeInstantsApplication | Yes | JSON | LP-ETL | HTTP 500 |
University of Sheffield - Department of Computer Science | PersonalProfilesApplication | Yes | JSON | LP-ETL | 403 Forbidden |
Wikidata - Towns | QuantifiedThingsOnMapApplication | Yes | JSON | LP-ETL | TTL |
DBLP | TimeInstantsApplication | No | JSON | LP-ETL | TTL |
DBPedia - Earthquakes | PlacesOnMapApplication | No | JSON | LP-ETL | TTL |
Check Actions - Czech Supreme Audit Office | TimeInstantsApplication | No | JSON | LP-ETL | TTL |
Legislation CZ - Versions of Acts | TimeInstantsApplication | No | JSON | LP-ETL | TTL |
Legislation CZ - Versions of Acts | TimeIntervalsApplication | No | JSON | LP-ETL | TTL |
Legislation UK - Versions of Acts | TimeInstantsApplication | No | JSON | LP-ETL | TTL |
Legislation UK - Versions of Acts | TimeInstantsApplication | No | JSON | LP-ETL | TTL |
LinkedMDB | TimeInstantsApplication | No | JSON | LP-ETL | IRI included an unencoded space: '32' [line 99891] |
RÚIAN - Towns in Czech Republic | PlacesOnMapApplication | No | JSON | LP-ETL | TTL |
University of Sheffield - Department of Computer Science | TimeInstantsApplication | No | JSON | LP-ETL | 403 Forbidden |
Wikidata - Towns | PlacesOnMapApplication | No | JSON | LP-ETL | TTL |
Wikidata - Towns | ThingsOnMapApplication | No | JSON | LP-ETL | TTL |
Linkset: Towns from Wikidata --- Towns from RUIAN; Towns in Wikidata; Towns in Czech Republic - RÚIAN | QuantifiedThingsOnMapApplication | Yes | JSON | LP-ETL | TTL |
Linkset: Towns from DBPedia --- Towns from RUIAN; Towns in DBPedia; Towns in Czech Republic - RÚIAN | QuantifiedThingsOnMapApplication | Yes | JSON | LP-ETL | DBPedia timeout |
Linkset: Towns from Wikidata --- Towns from RUIAN; Towns in Wikidata; Towns in Czech Republic - RÚIAN | ThingsOnMapApplication | No | JSON | LP-ETL | TTL |
Linkset: Towns from DBPedia --- Towns from RUIAN; Towns in DBPedia; Towns in Czech Republic - RÚIAN | ThingsOnMapApplication | No | JSON | LP-ETL | DBPedia timeout |
Platform prototype overview
Currently, the platform consists of three different services:
- LinkedPipes Discovery
- LinkedPipes ETL
- LinkedPipes Visualization
The LP-Discovery service is used solely for discovering the application pipelines. It is preconfigured with the aforementioned transformers and applications. When a discovery is executed, it finds possible application pipelines. A discovered application pipeline can later be exported into a pre-configured LP-ETL instance.
The LP-ETL tool executes application pipelines reliably. LP-Discovery creates a selected pipeline remotely in LP-ETL, executes it and returns the IRI of a named graph that will contain the execution results once LP-ETL finishes the execution.
LP-VIZ implements some visual applications that can be applied to a pipeline execution result. One can pass a reference to the named graph used to store pipeline execution results to LP-VIZ and let it visualize the data contained in the referenced graph.
Reproducing the experimental results
This page describes the steps necessary to reproduce the results described in our ESWC 2017 paper submission.
To evaluate the proposed platform we have implemented LinkedPipes. LinkedPipes is a suite of web services, each specialized in different tasks related to processing Linked Data. In this section, we describe the current state of the implementation of this suite. We briefly describe the services related to application pipeline discovery. We also describe how the services communicate with each other in order to support application pipeline discovery workflows.
The application pipeline discovery itself is implemented in the discovery service, which provides the following API:
API call | Method | Path |
---|---|---|
start | POST | /discovery/start |
status | GET | /discovery/$id |
list | GET | /discovery/$id/pipelines |
csv | GET | /discovery/$id/csv |
export | GET | /discovery/$id/pipelines/$pipelineId |
execute | GET | /discovery/$id/pipelines/$pipelineId/execute |
stop | GET | /discovery/$id/stop |
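As a rough illustration of how a client might address these calls, the following Python sketch maps each call name to its HTTP method and URL. The function name `build_request` and the base URL used in the usage note are hypothetical; only the methods and path templates are taken from the listing above.

```python
from urllib.parse import quote

# Methods and path templates copied from the API listing above.
# "$id" and "$pipelineId" are the placeholders used in that listing.
API_CALLS = {
    "start": ("POST", "/discovery/start"),
    "status": ("GET", "/discovery/$id"),
    "list": ("GET", "/discovery/$id/pipelines"),
    "csv": ("GET", "/discovery/$id/csv"),
    "export": ("GET", "/discovery/$id/pipelines/$pipelineId"),
    "execute": ("GET", "/discovery/$id/pipelines/$pipelineId/execute"),
    "stop": ("GET", "/discovery/$id/stop"),
}


def build_request(base_url, call, discovery_id=None, pipeline_id=None):
    """Return (HTTP method, absolute URL) for one of the calls listed above.

    base_url is the address of a discovery service instance; this page does
    not state one, so the caller has to supply it.
    """
    method, template = API_CALLS[call]
    if "$pipelineId" in template:
        if pipeline_id is None:
            raise ValueError(f"{call} needs a pipeline ID")
        template = template.replace("$pipelineId", quote(pipeline_id, safe=""))
    if "$id" in template:
        if discovery_id is None:
            raise ValueError(f"{call} needs a discovery ID")
        template = template.replace("$id", quote(discovery_id, safe=""))
    return method, base_url.rstrip("/") + template
```

For example, `build_request("http://localhost:9000", "status", "c1582982-1038-4218-911c-12c94ebd2b19")` yields a GET request to `/discovery/c1582982-1038-4218-911c-12c94ebd2b19` on that (assumed) host.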
The discovery itself is executed by calling the start API call. This API call expects a discovery configuration JSON object to be posted:
{ "sparqlEndpoints": [{ "url": "http://www.europeandataportal.eu/sparql", "descriptorIri": "https://...sample.ttl", "defaultGraphIris": [], "label": "European Data Portal" }] }
The configuration object allows the user to specify an array of SPARQL endpoint definitions. For each endpoint, it requires the endpoint URL and a dereferenceable IRI of its descriptor. Optionally, it also accepts a list of default graph IRIs, which are the named graphs to be considered while performing the discovery. To make the discovery results more readable, it also accepts a label that the user can define to distinguish the endpoints.
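The configuration object described above could be assembled programmatically along the following lines. This is a minimal sketch: the helper names `endpoint_definition` and `discovery_config` are hypothetical, and omitting the label when none is given is an assumption based on the description of the label as an extra; the property names come from the example configuration.

```python
import json


def endpoint_definition(url, descriptor_iri, default_graph_iris=(), label=None):
    """Build one entry of the "sparqlEndpoints" array."""
    if not url or not descriptor_iri:
        # The text above names these two values as required.
        raise ValueError("url and descriptor_iri are both required")
    definition = {
        "url": url,
        "descriptorIri": descriptor_iri,
        # Emitted even when empty, as in the example configuration.
        "defaultGraphIris": list(default_graph_iris),
    }
    if label is not None:
        # Assumption: a missing label is acceptable to the service.
        definition["label"] = label
    return definition


def discovery_config(definitions):
    """Serialize endpoint definitions into the JSON body for the start call."""
    return json.dumps({"sparqlEndpoints": list(definitions)})
```

A call such as `discovery_config([endpoint_definition("http://www.europeandataportal.eu/sparql", "https://example.org/descriptor.ttl", label="European Data Portal")])` produces a body of the same shape as the example, where the descriptor IRI is a placeholder.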
When the start API call is executed, it returns a JSON object containing just one property, which is an ID of the started pipeline discovery instance:
{ "id": "c1582982-1038-4218-911c-12c94ebd2b19" }
This ID is to be used later as a parameter of the remaining API calls.
The next step is to wait for the discovery to complete. Partial results (discovered application pipelines) are available via the list API call immediately after the iteration in which a pipeline is discovered, but the status API call can be used to wait for the discovery to complete. An example status response:
{ "pipelineCount": 1, "isFinished": true, "duration": 350 }
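Waiting for completion can be sketched as a simple polling loop. The function `wait_for_discovery` and its parameters are hypothetical; `fetch_status` stands for any callable that performs the status call and returns the decoded JSON object, so the sketch does not prescribe an HTTP client. The polling interval and timeout are arbitrary illustrative values.

```python
import time


def wait_for_discovery(fetch_status, poll_interval=2.0, timeout=600.0, sleep=time.sleep):
    """Call fetch_status() repeatedly until it reports isFinished.

    Raises TimeoutError if the discovery has not finished after roughly
    `timeout` seconds of waiting between polls.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status.get("isFinished"):
            return status
        if waited >= timeout:
            raise TimeoutError(f"discovery not finished after {waited:.0f} s")
        sleep(poll_interval)
        waited += poll_interval
```

Passing `sleep` as a parameter keeps the loop testable without real delays; in normal use the default `time.sleep` applies.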
Once the isFinished property is set to true, the pipeline discovery is finished and we can call the list API call to obtain all discovered application pipelines. The returned data contain details about all the discovered pipelines, mainly which of the provided datasources were used, which application is able to consume data from them and which transformations are needed to assemble the application pipeline. Moreover, the data contain an ID assigned to every discovered pipeline.
Such an ID can later be used with the export or execute API calls. The former responds with JSON-LD data in the format consumed by our ETL service. The latter directly contacts a pre-configured ETL service instance, imports the specified pipeline into it and executes it. An example result of the execute call would be:
{ "pipelineId" : "e2df045d-1fe3-46f8-8e61-cf5f168626b2", "etlPipelineIri" : "http://xrg12.ms.mff.cuni.cz:8090/resources/pipelines/created-1482867746217", "etlExecutionIri" : "http://xrg12.ms.mff.cuni.cz:8090/resources/executions/30fc168d-38b3-4c43-b3fe-2009d0f139e0", "resultGraphIri" : "urn:23acd444-edbc-488c-ae31-a99a41b97e70" }
There are two more API calls that we have not mentioned so far. One of them is the stop API call, which can be used to terminate the pipeline discovery if necessary. The other is named csv, an API call that temporarily simulates the ranking service. We used it for conducting the experiments. It implements the aforementioned ranking logic and provides a quick overview of the discovery results.
- Result storage:
- http://demo.visualization.linkedpipes.com:8890/sparql
- LP-ETL instance:
- http://xrg12.ms.mff.cuni.cz:8090
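The resultGraphIri returned by the execute call identifies the named graph holding the pipeline output. The following sketch shows one way to build a query for that graph, assuming the result storage listed above accepts standard SPARQL Protocol GET requests with a `query` parameter; that is an assumption and is not confirmed by this page. The function names and the default limit are hypothetical.

```python
from urllib.parse import urlencode

# Result storage endpoint as listed above.
RESULT_STORAGE = "http://demo.visualization.linkedpipes.com:8890/sparql"


def result_query(execute_response, limit=100):
    """Build a SPARQL query reading triples from the result named graph."""
    graph_iri = execute_response["resultGraphIri"]
    # Reject characters that are not allowed inside <...> in SPARQL.
    if any(ch in '<>"{}|^`\\' or ch.isspace() for ch in graph_iri):
        raise ValueError(f"not a usable graph IRI: {graph_iri!r}")
    return (
        f"SELECT ?s ?p ?o WHERE {{ GRAPH <{graph_iri}> {{ ?s ?p ?o }} }} "
        f"LIMIT {int(limit)}"
    )


def result_query_url(execute_response, endpoint=RESULT_STORAGE, limit=100):
    """Return a GET URL that submits the query to the given endpoint."""
    query = result_query(execute_response, limit)
    return endpoint + "?" + urlencode({"query": query})
```

The graph may be empty until LP-ETL finishes the execution, so a client would typically retry the query or check the execution state first.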