SOLIDE at the BMBF Innovation Forum “Civil Security” 2018

© BMBF/VDI Technologiezentrum GmbH – Jörg Carstensen

SDA, as part of the SOLIDE project, participated in the BMBF Innovation Forum “Civil Security” 2018, which took place on 19 and 20 June 2018 at the invitation of the Federal Ministry of Education and Research (BMBF). The two-day conference on the framework program “Research for Civil Security” was held in the Café Moskau conference center in Berlin.

SOLIDE, one of the projects funded by the BMBF, was presented during the event in the session “Mission Support – Better Situation Management through Intelligent Information Acquisition”.

The SOLIDE project examines a new approach for efficient access to operational data using the command mission management software TecBos Command. The focus is on accessing information through a natural language dialogue. For this purpose, we research subject-specific algorithms for filtering relevant knowledge as well as suitable data integration procedures that make the available data usable and retrievable via dialogues.

SOLIDE is a joint project of PRO DV (Dortmund) and Aristech GmbH (Heidelberg), together with the research group Smart Data Analytics (SDA) of the University of Bonn and the Data Science Chair (DICE) of the University of Paderborn.

SDA contributes to the project a cutting-edge dialogue system that provides information support in emergency situations.


A BETTER project for exploiting Big Data in Earth Observation

The SANSA Stack is one of the earmarked big data analytics components to be employed in the BETTER data pipelines.

BETTER (Big-data Earth observation Technology and Tools Enhancing Research and development) is an EU H2020 research and innovation project that runs from November 2017 to the end of October 2020.

The project’s main objective is to implement Big Data solutions (denominated as Data Pipelines) based on large volumes of heterogeneous Earth Observation datasets. This should help address key Societal Challenges, so that users can focus on extracting the potential knowledge within the data and not on processing the data itself.

To achieve that, BETTER is improving the way Big Data service developers interact with end-users. After defining the challenges, the promoters validate the pipeline requirements and co-design the solution with a dedicated development team in a workshop. During the implementation, promoters can continuously test and validate the pipelines. Later, the implemented pipelines will be used by the public in hackathons, enabling the use of specific solutions in other areas and the collection of additional user feedback.




SANSA 0.4 (Semantic Analytics Stack) Released

We are happy to announce SANSA 0.4 – the fourth release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.

You can find the FAQ and usage examples at

The following features are currently supported by SANSA:

  • Reading and writing RDF files in N-Triples, Turtle, RDF/XML and N-Quads formats
  • Reading OWL files in various standard formats
  • Support for multiple data partitioning techniques
  • SPARQL querying via Sparqlify
  • Graph-parallel querying of RDF using SPARQL (1.0) via GraphX traversals (experimental)
  • RDFS, RDFS Simple, OWL-Horst, EL (experimental) forward chaining inference
  • Automatic inference plan creation (experimental)
  • RDF graph clustering with different algorithms
  • Terminological decision trees (experimental)
  • Anomaly detection (beta)
  • Knowledge graph embedding approaches: TransE (beta), DistMult (beta)

Noteworthy changes or updates since the previous release are:

  • Parser performance has been improved significantly, e.g. DBpedia 2016-10 can be loaded in <100 seconds on a 7 node cluster
  • Support for a wider range of data partitioning strategies
  • A better unified API across data representations (RDD, DataFrame, DataSet, Graph) for triple operations
  • Improved unit test coverage
  • Improved distributed statistics calculation (see ISWC paper)
  • Initial scalability tests on 6 billion triple Ethereum blockchain data on a 100 node cluster
  • New SPARQL-to-GraphX rewriter aiming at providing better performance for queries exploiting graph locality
  • Numeric outlier detection tested on DBpedia (en)
  • Improved clustering tested on 20 GB RDF data sets

Deployment and getting started:

  • There are template projects for SBT and Maven for Apache Spark as well as for Apache Flink available to get started.
  • The SANSA jar files are in Maven Central, i.e. in most IDEs you can just search for “sansa” to include the dependencies in Maven projects.
  • Example code is available for various tasks.
  • We provide interactive notebooks for running and testing code via Docker.

We want to thank everyone who helped to create this release, in particular the projects Big Data Europe, HOBBIT, SAKE, Big Data Ocean, SLIPO, QROWD, BETTER, BOOST and SPECIAL.

Spread the word by retweeting our release announcement on Twitter. For more updates, please view our Twitter feed and consider following us.

Greetings from the SANSA Development Team


Papers accepted at ISWC 2018

We are very pleased to announce that our group got 3 papers accepted for presentation at ISWC 2018: the 17th International Semantic Web Conference, which will be held on October 8 – 12, 2018 in Monterey, California, USA.

The International Semantic Web Conference (ISWC) is the premier international forum where Semantic Web / Linked Data researchers, practitioners, and industry specialists come together to discuss, advance, and shape the future of semantic technologies on the web, within enterprises and in the context of public institutions.

Here is the list of the accepted papers with their abstracts:

Abstract: Many question answering systems over knowledge graphs rely on entity and relation linking components in order to connect the natural language input to the underlying knowledge graph. Traditionally, entity linking and relation linking have been performed either as dependent, sequential tasks or as independent, parallel tasks. In this paper, we propose a framework called EARL, which performs entity linking and relation linking as a joint task. EARL implements two different solution strategies for which we provide a comparative analysis in this paper: The first strategy is a formalization of the joint entity and relation linking tasks as an instance of the Generalised Travelling Salesman Problem (GTSP). In order to be computationally feasible, we employ approximate GTSP solvers. The second strategy uses machine learning in order to exploit the connection density between nodes in the knowledge graph. It relies on three base features and re-ranking steps in order to predict entities and relations. We compare the strategies and evaluate them on a dataset with 5000 questions. Both strategies significantly outperform the current state-of-the-art approaches for entity and relation linking.

Abstract: Over the last years, the Semantic Web has been growing steadily. Today, we count more than 10,000 datasets made available online following Semantic Web standards. Nevertheless, many applications, such as data integration, search, and interlinking, may not take full advantage of the data without having a priori statistical information about its internal structure and coverage. In fact, there are already a number of tools which offer such statistics, providing basic information about RDF datasets and vocabularies. However, those usually show severe deficiencies in terms of performance once the dataset size grows beyond the capabilities of a single machine. In this paper, we introduce a software library for statistical calculations of large RDF datasets, which scales out to clusters of machines. More specifically, we describe the first distributed in-memory approach for computing 32 different statistical criteria for RDF datasets using Apache Spark. The preliminary results show that our distributed approach improves upon a previous centralized approach we compare against and provides approximately linear horizontal scale-up. The library is extensible beyond the 32 default criteria, is integrated into the larger SANSA framework, and is employed in at least four major usage scenarios beyond the SANSA community.
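
To give a feel for how such statistics can be split across machines, here is a minimal single-process sketch in plain Python. It is not the SANSA or Spark API: the function names (`partition_stats`, `merge`, `rdf_stats`), the three criteria shown, and the simplified N-Triples regular expression are all assumptions made for illustration. Each partition is summarized independently, and the partial results are merged with an associative operation, which is the property a distributed engine relies on.

```python
import re
from collections import Counter
from functools import reduce

# Simplified N-Triples pattern (an assumption for this sketch): IRI subject,
# IRI predicate, anything as object. It skips blank-node subjects and is not
# a full N-Triples parser.
TRIPLE = re.compile(r'^\s*<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')

def parse(line):
    """Return (subject, predicate, object) or None for non-matching lines."""
    m = TRIPLE.match(line)
    return m.groups() if m else None

def partition_stats(lines):
    """Map step: compute partial statistics for one partition of the data."""
    triples = [t for t in map(parse, lines) if t]
    return {
        "triples": len(triples),
        "property_usage": Counter(p for _, p, _ in triples),
        "subjects": {s for s, _, _ in triples},
    }

def merge(a, b):
    """Reduce step: combine two partial results (associative and commutative)."""
    return {
        "triples": a["triples"] + b["triples"],
        "property_usage": a["property_usage"] + b["property_usage"],
        "subjects": a["subjects"] | b["subjects"],
    }

def rdf_stats(partitions):
    """Summarize every partition, then fold the partial results together."""
    return reduce(merge, map(partition_stats, partitions))
```

Because `merge` does not care about the order in which partial results arrive, the same two functions could in principle be handed to a cluster framework's map and reduce operators.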

Abstract: Institutions from different domains require the integration of data coming from heterogeneous Web sources. Typical use cases include Knowledge Search, Knowledge Building, and Knowledge Completion. We report on the implementation of the RDF Molecule-Based Integration Framework MINTE+ in three domain-specific applications: Law Enforcement, Job Market Analysis, and Manufacturing. The use of RDF molecules as data representation and a core element in the framework gives MINTE+ enough flexibility to synthesize knowledge graphs in different domains. We first describe the challenges in each domain-specific application, then the implementation and configuration of the framework to solve the particular problems of each domain. We show how the parameters defined in the framework allow the integration process to be tuned with the best values for each domain. Finally, we present the main results and the lessons learned from each application.

This work has received funding from the EU Horizon 2020 projects BigDataEurope (GA no. 644564), QROWD (GA no. 723088), HOBBIT (GA no. 688227), and SlideWiki (GA no. 688095), the Marie Skłodowska-Curie action WDAqua (GA no. 642795), and the German Ministry of Education and Research (BMBF) in the context of the projects LiDaKrA (Linked-Data-basierte Kriminalanalyse, grant no. 13N13627) and InDaSpacePlus (grant no. 01IS17031).

Looking forward to seeing you at ISWC 2018.

Paper accepted at GRADES 2018 workshop at SIGMOD / PODS


We are very pleased to announce that our group got 1 paper accepted for presentation at the GRADES workshop at SIGMOD / PODS 2018: the ACM International Conference on Management of Data, which will be held in Houston, TX, USA, on June 10th – June 15th, 2018.

The annual ACM SIGMOD/PODS Conference is a leading international forum for database researchers, practitioners, developers, and users to explore cutting-edge ideas and results and to exchange techniques, tools, and experiences. The conference includes a fascinating technical program with research and industrial talks, tutorials, demos, and focused workshops. It also hosts a poster session to learn about innovative technology, an industrial exhibition to meet companies and publishers, and a careers-in-industry panel with representatives from leading companies.

The focus of the GRADES 2018 workshop is the application areas, usage scenarios and open challenges in managing large-scale graph-shaped data. The workshop is a forum for exchanging ideas and methods for mining, querying and learning with real-world network data, developing new common understandings of the problems at hand, sharing of data sets and benchmarks where applicable, and leveraging existing knowledge from different disciplines. Additionally, considering specific techniques (e.g., algorithms, data/index structures) in the context of the systems that implement them, rather than describing them in isolation, GRADES-NDA aims to present technical contributions inside the graph, RDF and other data management systems on graphs of a large size.

Here is the accepted paper with its abstract:

Abstract: In the past decade Knowledge graphs have become very popular and frequently rely on the Resource Description Framework (RDF) or Property Graphs (PG) as their data models. However, the query languages for these two data models – SPARQL for RDF and the PG traversal language Gremlin – are lacking basic interoperability. In this demonstration paper, we present Gremlinator, the first translator from SPARQL – the W3C standardized language for RDF – to Gremlin – a popular property graph traversal language. Gremlinator translates SPARQL queries to Gremlin path traversals for executing graph pattern matching queries over graph databases. This allows a user, who is well versed in SPARQL, to access and query a wide variety of Graph databases avoiding the steep learning curve for adapting to a new Graph Query Language (GQL). Gremlin is a graph computing system-agnostic traversal language (covering both OLTP graph databases and OLAP graph processors), making it a desirable choice for supporting interoperability for querying Graph databases. Gremlinator is planned to be released as an Apache TinkerPop plugin in the upcoming releases.

This work has received funding from the EU H2020 R&I programme for the Marie Skłodowska-Curie action WDAqua (GA No 642795).

Looking forward to seeing you at GRADES 2018.

Demo Paper accepted at SIGIR 2018

We are very pleased to announce that our group got 1 paper accepted for presentation at the demo session of SIGIR 2018: the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, which will be held in Ann Arbor, Michigan, U.S.A., on July 8-12, 2018.

The annual SIGIR conference is the major international forum for the presentation of new research results, and the demonstration of new systems and techniques, in the broad field of information retrieval (IR). The 41st ACM SIGIR conference welcomes contributions related to any aspect of information retrieval and access, including theories and foundations, algorithms and applications, and evaluation and analysis. The conference and program chairs invite those working in areas related to IR to submit high-impact original papers for review.


Here is the accepted paper with its abstract:

Abstract: Question answering (QA) systems provide user-friendly interfaces for retrieving answers from structured and unstructured data to natural language questions. Several QA systems, as well as related components, have been contributed by the industry and research community in recent years. However, most of these efforts have been performed independently from each other and with different focuses, and their synergies in the scope of QA have not been addressed adequately. Frankenstein is a novel framework for developing QA systems over knowledge bases by integrating existing state-of-the-art QA components performing different tasks. It incorporates several reusable QA components and employs machine-learning techniques to predict the best performing components and QA pipelines for a given question in order to generate static and dynamic executable QA pipelines. In this demo, attendees will be able to view the different functionalities of Frankenstein for performing independent QA component execution, QA component prediction given an input question, as well as the static and dynamic composition of different QA pipelines.

This work has received funding from the EU H2020 R&I programme for the Marie Skłodowska-Curie action WDAqua (GA No 642795).

Looking forward to seeing you at SIGIR 2018.

Papers accepted at ICWE 2018

We are very pleased to announce that our group got 2 papers accepted for presentation at ICWE 2018: the 18th International Conference on Web Engineering, which will be held in Cáceres, Spain, on 5 – 8 June 2018.

The ICWE is the prime yearly international conference on the different aspects of designing, building, maintaining and using Web applications. The theme for 2018, the 18th edition of the event, is “Enhancing the Web with Advanced Engineering”. The conference will cover the different aspects of Web Engineering, including the design, creation, maintenance, and usage of Web applications. ICWE 2018 is endorsed by the International Society for Web Engineering (ISWE) and belongs to the ICWE conference series owned by ISWE.

Here are the accepted papers with their abstracts:

  • “Efficiently Pinpointing SPARQL Query Containments” by Claus Stadler, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo, and Jens Lehmann.

    Abstract: Query containment is a fundamental problem in database research, which is relevant for many tasks such as query optimisation, view maintenance and query rewriting. For example, recent SPARQL engines built on Big Data frameworks that precompute solutions to frequently requested query patterns, are conceptually an application of query containment. We present an approach for solving the query containment problem for SPARQL queries – the W3C standard query language for RDF datasets. Solving the query containment problem can be reduced to the problem of deciding whether a sub graph isomorphism exists between the normalized algebra expressions of two queries. Several state-of-the-art methods are limited to matching two queries only, as well as only giving a boolean answer to whether a containment relation holds. In contrast, our approach is fit for view selection use cases, and thus capable of efficiently enumerating all containment mappings among a set of queries. Furthermore, it provides the information about how two queries’ algebra expression trees correspond under containment mappings. All of our source code and experimental results are openly available.


  • “A Platform for Semantically Representing and Analyzing Open Fiscal Data” by Fathoni A. Musyaffa, Lavdim Halilaj, Yakun Li, Fabrizio Orlandi, Hajira Jabeen, Sören Auer, and Maria-Esther Vidal.

    Abstract: Budget and spending data are among the most published Open Data datasets on the Web and are continuously increasing in volume over time. These datasets tend to be published in large tabular files – without predefined standards – and require complex domain and technical expertise to be used in real-world scenarios. Therefore, the potential benefits of having these datasets open and publicly available are hindered by their complexity and heterogeneity. Linked Data principles can facilitate integration, analysis and usage of these datasets. In this paper, we present OBEU, a Linked Data-based platform supporting the entire open data life-cycle of budget and spending datasets: from data creation to publishing and exploration. The platform is based on a set of requirements specifically collected by experts in the budget and spending data domain. It follows a micro-services architecture that easily integrates many different software modules and tools for analysis, visualization and transformation of data. Data is represented according to a logical model for open fiscal data which is translated into both RDF and tabular data formats. We demonstrate the validity of the implemented OBEU platform with real application scenarios and report on a user study conducted to confirm its usability.
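
As a rough illustration of what a “containment mapping” in the first abstract above means, the following Python sketch enumerates such mappings for plain basic graph patterns without projection. This is the classical homomorphism test for conjunctive queries, not the normalized-algebra subgraph-isomorphism approach of the paper; the function name `containment_mappings`, the tuple encoding of triple patterns, and the `?`-prefix convention for variables are assumptions made for this example.

```python
def is_var(term):
    """Variables are written with a leading '?', as in SPARQL."""
    return term.startswith("?")

def containment_mappings(general, specific):
    """Yield every mapping from the variables of `general` to terms of
    `specific` that sends each triple pattern of `general` onto some triple
    pattern of `specific`. Variables of `specific` are treated as fixed terms.
    If at least one mapping exists, `specific` is contained in `general`."""

    def extend(mapping, pattern, target):
        # Try to match one pattern against one target triple pattern.
        new = dict(mapping)
        for p, t in zip(pattern, target):
            if is_var(p):
                if new.setdefault(p, t) != t:
                    return None  # variable already bound to a different term
            elif p != t:
                return None      # constants must match exactly
        return new

    def search(i, mapping):
        # Backtracking search over the patterns of `general`.
        if i == len(general):
            yield mapping
            return
        for target in specific:
            extended = extend(mapping, general[i], target)
            if extended is not None:
                yield from search(i + 1, extended)

    yield from search(0, {})

# A query asking for anything with a type is more general than one asking
# for persons with a name.
general = [("?x", "rdf:type", "?t")]
specific = [("?s", "rdf:type", "foaf:Person"), ("?s", "foaf:name", "?n")]
mappings = list(containment_mappings(general, specific))
```

Returning the mappings themselves, rather than a yes/no answer, is what makes this style of check usable for view selection: each mapping says how the more general query lines up with the more specific one.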

This work was partly supported by grants from the European Union’s Horizon 2020 research and innovation programme for the projects HOBBIT (GA no. 688227), QROWD (GA no. 732194), WDAqua (GA no. 642795), the EU H2020 (GA no. 645833) and a DAAD scholarship.


Looking forward to seeing you at ICWE 2018.

Invited talk by Dr. Anastasia Dimou

On Wednesday, 21st of March, Anastasia Dimou from the Internet Technology & Data Science Lab visited SDA and gave a talk entitled “High Quality Linked Data Generation from Heterogeneous Data”.

Anastasia Dimou is a Post-Doc Researcher at the Internet Technology & Data Science Lab at Ghent University, Belgium. Anastasia joined the IDLab research group in February 2013. Her research expertise lies in the area of the Semantic Web, Linked Data Generation and Publication, Data Quality and Integration, Knowledge Representation and Management. She has broad experience with Semantic Wikis and Classification. As part of her research, she investigated a uniform language for describing the mapping rules for generating high-quality Linked Data from multiple heterogeneous data formats and access interfaces, and she also conducted research on Linked Data generation and publishing workflows. Her research activities led to the development of the RML tool chain (RMLProcessor, RMLEditor, RMLValidator, and RMLWorkbench). Anastasia has been involved in different national and international research projects and publications.

Prof. Jens Lehmann invited the speaker to the bi-weekly “SDA colloquium presentations”. The goal of her visit was to exchange experience and ideas on RML tools specialized for data quality and on-the-fly mapping, including heterogeneous dataset mapping into LOD. Apart from presenting various use cases where RML tools were used, she introduced a declarative RML serialization which models the mapping rules using the well-known YAML language. Anastasia shared with our group future research problems and challenges related to this research area.

In her talk, she introduced a full workflow, the RML tool chain, which models the components of an RML mapping lifecycle. She discussed its application to the structure of heterogeneous data sources. Anastasia Dimou mentioned that adding support for data quality during mapping should allow users to efficiently explore a structured search space, so that future mappings not only cover the range of the known domain but also help to discover new knowledge worth mapping from the existing knowledge base.

During the visit, SDA core research topics and main research projects were presented in a (successful!) attempt to find an intersection on the future collaborations with Anastasia and her research group.

As an outcome of this visit, we expect to strengthen our research collaboration networks with the Internet Technology & Data Science Lab at UGent, mainly on combining semantic knowledge for exploratory and mapping tools and applying those techniques to very large-scale KGs using our distributed analytics framework SANSA and DBpedia.

Papers and a tutorial accepted at ESWC 2018


We are very pleased to announce that our group got 3 papers accepted for presentation at ESWC 2018: the 15th edition of the Extended Semantic Web Conference, which will be held on June 3-7, 2018 in Heraklion, Crete, Greece.

The ESWC is a major venue for discussing the latest scientific results and technology innovations around semantic technologies. Building on its past success, ESWC is seeking to broaden its focus to span other relevant related research areas in which Web semantics plays an important role. ESWC 2018 will present the latest results in research, technologies, and applications in its field. Besides the technical program organized over twelve tracks, the conference will feature a workshop and tutorial program, a dedicated track on Semantic Web challenges, system descriptions and demos, a posters exhibition and a doctoral symposium.

Here are the accepted papers with their abstracts:

  • “Formal Query Generation for Question Answering over Knowledge Bases” by Hamid Zafar, Giulio Napolitano and Jens Lehmann.

    Abstract: Question answering (QA) systems often consist of several components such as Named Entity Disambiguation (NED), Relation Extraction (RE), and Query Generation (QG). In this paper, we focus on the QG process of a QA pipeline on a large-scale Knowledge Base (KB), with noisy annotations and complex sentence structures. We therefore propose SQG, a SPARQL Query Generator with modular architecture, enabling easy integration with other components for the construction of a fully functional QA pipeline. SQG can be used on large open-domain KBs and handle noisy inputs by discovering a minimal subgraph based on uncertain inputs, that it receives from the NED and RE components. This ability allows SQG to consider a set of candidate entities/relations, as opposed to the most probable ones, which leads to a significant boost in the performance of the QG component. The captured subgraph covers multiple candidate walks, which correspond to SPARQL queries. To enhance the accuracy, we present a ranking model based on Tree-LSTM that takes into account the syntactical structure of the question and the tree representation of the candidate queries to find the one representing the correct intention behind the question.

  • “Frankenstein: a Platform Enabling Reuse of Question Answering Components” (Resource Track) by Kuldeep Singh, Andreas Both, Arun Sethupat, Saeedeh Shekarpour.

    Abstract: Recently remarkable trials of the question answering (QA) community yielded in developing core components accomplishing QA tasks. However, implementing a QA system still was costly. While aiming at providing an efficient way for the collaborative development of QA systems, the Frankenstein framework was developed that allows dynamic composition of question answering pipelines based on the input question. In this paper, we are providing a full range of reusable components as independent modules of Frankenstein populating the ecosystem leading to the option of creating many different components and QA systems. Just by using the components described here, 380 different QA systems can be created offering the QA community many new insights. Additionally, we are providing resources which support the performance analyses of QA tasks, QA components and complete QA systems. Hence, Frankenstein is dedicated to improve the efficiency within the research process w.r.t. QA. 

  • “Using Ontology-based Data Summarization to Develop Semantics-aware Recommender Systems” by Tommaso Di Noia, Corrado Magarelli, Andrea Maurino, Matteo Palmonari, Anisa Rula.

    Abstract: In the current information-centric era, recommender systems are gaining momentum as tools able to assist users in daily decision-making tasks. They may exploit users’ past behavior combined with side/contextual information to suggest new items or pieces of knowledge users might be interested in. Within the recommendation process, Linked Data (LD) have already been proposed as a valuable source of information to enhance the predictive power of recommender systems, not only in terms of accuracy but also of diversity and novelty of results. In this direction, one of the main open issues in using LD to feed a recommendation engine is related to feature selection: how to select only the most relevant subset of the original LD dataset, thus avoiding both useless processing of data and the so-called “curse of dimensionality” problem. In this paper we show how ontology-based (linked) data summarization can drive the selection of properties/features useful to a recommender system. In particular, we compare a fully automated feature selection method based on ontology-based data summaries with more classical ones, and we evaluate the performance of these methods in terms of accuracy and aggregate diversity of a recommender system exploiting the top-k selected features. We set up an experimental testbed relying on datasets related to different knowledge domains. Results show the feasibility of a feature selection process driven by ontology-based data summaries for LD-enabled recommender systems.

This work was supported by an EU H2020 grant for the HOBBIT project (GA no. 688227), by German Federal Ministry of Education and Research (BMBF) funding for the project SOLIDE (no. 13N14456), and by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 642795 (WDAqua project).

Furthermore, we are pleased to announce that we got a tutorial accepted, which will be co-located with ESWC 2018.

Here is the accepted tutorial and its short description:

Looking forward to seeing you at ESWC 2018.

Paper accepted at Semantic Web Journal


We are very pleased to announce that our group got a paper accepted at the Semantic Web Journal for the Benchmarking Linked Data 2017 issue.

The journal Semantic Web – Interoperability, Usability, Applicability (published and printed by IOS Press, ISSN: 1570-0844), in short Semantic Web journal, brings together researchers from various fields which share the vision and need for more effective and meaningful ways to share information across agents and services on the future internet and elsewhere. As such, Semantic Web technologies shall support the seamless integration of data, on-the-fly composition, and interoperation of Web services, as well as more intuitive search engines. The semantics – or meaning – of information, however, cannot be defined without a context, which makes personalization, trust, and provenance core topics for Semantic Web research. New retrieval paradigms, user interfaces, and visualization techniques have to unleash the power of the Semantic Web and at the same time hide its complexity from the user. Based on this vision, the journal welcomes contributions ranging from theoretical and foundational research over methods and tools to descriptions of concrete ontologies and applications in all areas.

Here is the accepted paper with its abstract:

  • “SML-Bench – A Benchmarking Framework for Structured Machine Learning” by Patrick Westphal, Lorenz Bühmann, Simon Bin, Hajira Jabeen, Jens Lehmann.
    Abstract: The availability of structured data has increased significantly over the past decade and several approaches to learn from structured data have been proposed. These logic-based, inductive learning methods are often conceptually similar, which would allow a comparison among them even if they stem from different research communities. However, so far no efforts were made to define an environment for running learning tasks on a variety of tools, covering multiple knowledge representation languages. With SML-Bench, we propose a benchmarking framework to run inductive learning tools from the ILP and semantic web communities on a selection of learning problems. In this paper, we present the foundations of SML-Bench, discuss the systematic selection of benchmarking datasets and learning problems, and showcase an actual benchmark run on the currently supported tools.

This work was supported by grants from the EU FP7 Programme for the project GeoKnow (GA no. 318159), the German Research Foundation project GOLD, the German Ministry for Economic Affairs and Energy project SAKE (GA no. 01MD15006E), the European Union’s Horizon 2020 research and innovation programme for the project SLIPO (GA no. 731581), the European Union’s H2020 research and innovation action HOBBIT (GA no. 688227), and the CSA BigDataEurope (GA no. 644564).