Demo and Poster Papers accepted at ISWC 2019

We are very pleased to announce that our group got 7 demo/poster papers accepted for presentation at ISWC 2019, the 18th International Semantic Web Conference, which will be held on October 26–30, 2019 in Auckland, New Zealand.

The International Semantic Web Conference (ISWC) is the premier international forum where Semantic Web / Linked Data researchers, practitioners, and industry specialists come together to discuss, advance, and shape the future of semantic technologies on the web, within enterprises, and in the context of public institutions.

Here is the list of the accepted papers with their abstract:

  • “Querying large-scale RDF datasets using the SANSA framework” by Claus Stadler, Gezim Sejdiu, Damien Graux, and Jens Lehmann.
    Abstract: In this paper, we present Sparklify: a scalable software component for efficient evaluation of SPARQL queries over distributed RDF datasets. In particular, we demonstrate a W3C SPARQL endpoint powered by our SANSA framework’s RDF partitioning system and Apache Spark for querying the DBpedia knowledge base. This work is motivated by the lack of Big Data SPARQL systems capable of exposing large-scale heterogeneous RDF datasets via a Web SPARQL endpoint.
  • “How to feed the Squerall with RDF and other data nuts?” by Mohamed Nadjib Mami, Damien Graux, Simon Scerri, Hajira Jabeen, Sören Auer, and Jens Lehmann.
    Abstract: Advances in Data Management methods have resulted in a wide array of storage solutions with varying query capabilities that support different data formats. Traditionally, heterogeneous data was transformed off-line into a unique format and migrated to a unique data management system before being uniformly queried. However, with the increasing number of heterogeneous data sources, many of which are dynamic, modern applications prefer to access the original fresh data directly. Addressing this requirement, we designed and developed Squerall, a software framework that enables the querying of original large and heterogeneous data on the fly, without prior data transformation. Squerall is built from the ground up with extensibility in mind, e.g., supporting more data sources. Here, we explain Squerall’s extensibility aspect and demonstrate step by step how to add support for RDF data, a new extension to the previously supported range of data sources.
  • “Towards Semantically Structuring GitHub” by Dennis Oliver Kubitza, Matthias Böckmann, and Damien Graux.
    Abstract: With the recent increase of open-source projects, tools have emerged to enable developers to collaborate. Among these, git has received lots of attention, and various online platforms have been created around this tool, hosting millions of projects. Recently, some of these platforms opened APIs to allow users to query their public databases of open-source projects. Despite the common protocol core, there are for now no common structures one could use to link these sources of information. To tackle this, we propose here the first ontology dedicated to the git protocol and also describe GitHub’s features within it, to show how it is extendable to encompass more git-based data sources.
  • “Microbenchmarks for Question Answering Systems Using QaldGen” by Qaiser Mehmood, Abhishek Nadgeri, Muhammad Saleem, Kuldeep Singh, Axel-Cyrille Ngonga Ngomo, and Jens Lehmann.
    Abstract: Microbenchmarks are used to test the individual components of a given system and can thus provide a more detailed analysis pertaining to its different components. We present a demo of QaldGen, a framework for generating question samples for micro-benchmarking of Question Answering (QA) systems over Knowledge Graphs (KGs). QaldGen is able to select customized question samples from existing QA datasets. The sampling of questions is carried out using different clustering techniques. It is flexible enough to select benchmarks of varying sizes and complexities according to user-defined criteria on the most important features to be considered for QA benchmarking. We evaluated the usability of the interface using the standard system usability scale questionnaire. Our overall usability score of 77.25 (rank B+) suggests that the online interface is easy to use and well integrated.
  • “FALCON: An Entity and Relation Linking framework over DBpedia” by Ahmad Sakor, Kuldeep Singh, and Maria Esther Vidal.
    Abstract: We tackle the problem of entity and relation linking and present FALCON, a rule-based tool able to accurately map entities and relations in short texts to resources in a knowledge graph. FALCON resorts to fundamental principles of English morphology (e.g., compounding and headword identification) and performs joint entity and relation linking against a short text. We demonstrate the benefits of the rule-based approach implemented in FALCON on short texts composed of various types of entities. The attendees will observe the behavior of FALCON on the observed limitations of Entity Linking (EL) and Relation Linking (RL) tools. The demo is available online.
  • “Demonstration of a Customizable Representation Model for Graph-Based Visualizations of Ontologies – GizMO” by Vitalis Wiens, Mikhail Galkin, Steffen Lohmann, and Sören Auer.
    Abstract: Visualizations can facilitate the development, exploration, communication, and sense-making of ontologies. Suitable visualizations, however, are highly dependent on individual use cases and targeted user groups. In this demo, we present a methodology that enables customizable definitions for ontology visualizations. We showcase its applicability by introducing GizMO, a representation model for graph-based visualizations in the form of node-link diagrams. Additionally, we present two applications that operate on the GizMO representation model and enable individual customizations for ontology visualizations.
  • “Predict Missing Links Using PyKEEN” by Mehdi Ali, Charles Tapley Hoyt, Daniel Domingo-Fernandez, and Jens Lehmann.
    Abstract: PyKEEN is a framework which integrates several approaches to compute knowledge graph embeddings (KGEs). We demonstrate the usage of PyKEEN in a biomedical use case, i.e., we trained and evaluated several KGE models on a biological knowledge graph containing genes’ annotations to pathways and pathway hierarchies from well-known databases. We used the best-performing model to predict new links and present an evaluation in collaboration with a domain expert.

This work has received funding from the EU Horizon 2020 projects BigDataOcean (GA no. 732310), Boost4.0 (GA no. 780732), SLIPO (GA no. 731581) and QROWD (GA no. 723088).

Looking forward to seeing you at ISWC 2019.

Workshop papers accepted at ECML-PKDD/SoGood 2019

We are very pleased to announce that our group got 2 papers accepted at the 4th Workshop on Data Science for Social Good.

SoGood is a peer-reviewed workshop that focuses on how Data Science can and does contribute to social good in its widest sense. The workshop has been held annually since 2016 in conjunction with the ECML PKDD conference; this year it takes place on 20 September in Würzburg, Germany.

Here are the pre-prints of the accepted papers with their abstracts:

  • Linking Physicians to Medical Research Results via Knowledge Graph Embeddings and Twitter by Afshin Sadeghi and Jens Lehmann.
    Abstract: Informing professionals about the latest research results in their field is a particularly important task in health care, since any development in this field directly improves the health status of patients. Meanwhile, social media is an infrastructure that allows instant public sharing of information, and it has thus recently become popular in medical applications. In this study, we apply Multi Distance Knowledge Graph Embeddings (MDE) to link physicians and surgeons to the latest medical breakthroughs that are shared as research results on Twitter. Our study shows that, using this method, physicians can be informed about new findings in their field, given that they have an account dedicated to their profession.
  • Improving Access to Science for Social Good by Mehdi Ali, Sahar Vahdati, Shruti Singh, Sourish Dasgupta, and Jens Lehmann.
    Abstract: One of the major goals of science is to make the world a socially good place to live. The old paradigm of scholarly communication through publishing has generated an enormous amount of heterogeneous data and metadata. However, most scientific results are not easy to discover, in particular those which benefit social good and are also targeted at non-scientific audiences. In this paper, we showcase a knowledge graph embedding (KGE) based recommendation system to be used by students involved in activities aiming at social good. The recommendation system has been trained on a scholarly knowledge graph that we constructed. The obtained results highlight that the KGEs successfully encoded the structure of the KG and that our system could therefore provide valuable recommendations.


This study is partially supported by the projects MLwin (Maschinelles Lernen mit Wissensgraphen, grant no. 01IS18050F), Cleopatra (grant no. 812997), EPSRC grant EP/M025268/1, WWTF grant VRG18-013, and LAMBDA (GA no. 809965). The authors gratefully acknowledge financial support from the Federal Ministry of Education and Research of Germany (BMBF), which is funding MLwin, from the European Union Marie Curie ITN, which funds Cleopatra, as well as from Fraunhofer IAIS.

Paper accepted at TPDL 2019

We are very pleased to announce that our group got a paper accepted at TPDL 2019 (23rd International Conference on Theory and Practice of Digital Libraries), which will be held on September 9–12, 2019 at OsloMet – Oslo Metropolitan University, Oslo, Norway.

TPDL is a well-established scientific and technical forum on the broad topic of digital libraries, bringing together researchers, developers, content providers, and users of digital libraries and digital content management.

TPDL 2019 attempts to facilitate establishing connections and convergences between diverse research communities such as Digital Humanities, Information Sciences, and others that could benefit from (and contribute to) the ecosystems offered by digital libraries and repositories. To become especially useful to these diverse research and practitioner communities, digital libraries need to consider special needs and requirements for effective data utilization, management, and exploitation.

Here is the pre-print of the accepted paper with its abstract:

Abstract: Recently, semantic data have become more distributed, and available datasets increasingly serve non-technical as well as technical audiences. This is also the case with our EVENTSKG dataset, a comprehensive knowledge graph about scientific events, which serves the entire scientific and library community. A common way to query such data is via SPARQL queries. Non-technical users, however, have difficulties with writing SPARQL queries, because it is a time-consuming and error-prone task that requires some expert knowledge. Natural language interfaces tackle this problem by making semantic data accessible to a wider audience, i.e., not restricted to experts. In this work, we present SPARQL-AG, a front-end that automatically generates and executes SPARQL queries for querying EVENTSKG. SPARQL-AG helps potential semantic data consumers, both non-experts and experts, by generating SPARQL queries, ranging from simple to complex ones, via an interactive web interface. The distinguishing feature of SPARQL-AG is that users need neither know the schema of the knowledge graph being queried nor learn the SPARQL syntax, as SPARQL-AG offers them a familiar and intuitive interface for query generation and execution. It maintains separate clients to query three public SPARQL endpoints when asking for particular entities. The service is publicly available online and has been extensively tested.
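To make the idea of automatic query generation concrete, here is a minimal sketch of how a front-end can assemble a SPARQL SELECT query from form-style inputs (a hypothetical illustration: the class and property IRIs are invented and do not reflect EVENTSKG’s actual schema or SPARQL-AG’s implementation):

```python
def build_sparql(entity_class, filters, limit=10):
    """Assemble a simple SELECT query from form-style inputs.

    `filters` maps property IRIs to required literal values."""
    lines = ["SELECT ?s WHERE {", f"  ?s a <{entity_class}> ."]
    for prop, value in sorted(filters.items()):
        lines.append(f'  ?s <{prop}> "{value}" .')
    lines.append("}")
    lines.append(f"LIMIT {limit}")
    return "\n".join(lines)

# Hypothetical class and property IRIs, purely for illustration:
query = build_sparql("http://example.org/Event",
                     {"http://example.org/series": "ISWC"})
print(query)
```

A real front-end layers autocompletion and validation on top, but the core generation step is this kind of template filling.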

Furthermore, we got a poster paper accepted at the Poster & Demo Track.

Here is the accepted poster paper with its abstract:

Abstract: In this work, we tackle the problem of generating comprehensive overviews of research findings in a structured and comparable way. To bring structure to such information and thus enable researchers to, e.g., explore domain overviews, we present an approach for automatic unveiling of realm overviews for research artifacts (Aurora), which generates overviews of research domains and their relevant artifacts. Aurora is a semi-automatic crowd-sourcing workflow that captures such information in a semantic wiki. Our evaluation confirms that Aurora, compared to the current manual approach, reduces the effort for researchers to compile and read survey papers.



This work was co-funded by the European Research Council for the project ScienceGRAPH (Grant agreement ID: 819536).

Looking forward to seeing you at TPDL 2019.

Papers accepted at SEMANTiCS 2019

We are very pleased to announce that our group got 5 papers accepted for presentation at SEMANTiCS 2019, the 15th International Conference on Semantic Systems, which will be held on September 9–12, 2019 in Karlsruhe, Germany.

SEMANTiCS is an established knowledge hub where technology professionals, industry experts, researchers, and decision-makers can learn about new technologies, innovations, and enterprise implementations in the fields of Linked Data and Semantic AI. Since 2005, the conference series has focused on semantic technologies, which today, together with other methodologies such as NLP and machine learning, form the core of intelligent systems. The conference highlights the benefits of standards-based approaches.

Here is the list of the accepted papers with their abstract:

Abstract: Over the last two decades, the amount of data created, published, and managed using Semantic Web standards, especially the Resource Description Framework (RDF), has been increasing. As a result, efficient processing of such big RDF datasets has become challenging. Indeed, these processes require both efficient storage strategies and query-processing engines to be able to scale in terms of data size. In this study, we propose a scalable approach to evaluate SPARQL queries over distributed RDF datasets using semantic-based partitioning, implemented inside the state-of-the-art RDF processing framework SANSA. An evaluation of the performance of our approach in processing large-scale RDF datasets is also presented. The preliminary results of the conducted experiments show that our approach can scale horizontally and performs well compared with the previous Hadoop-based system. It is also comparable with in-memory SPARQL query evaluators when there is less shuffling involved.
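The general idea behind partitioning-based SPARQL evaluation can be sketched with a simple vertical (predicate-based) partitioner (a generic illustration in plain Python; the paper’s semantic-based scheme and its Spark implementation are more elaborate, and the triples below are invented):

```python
from collections import defaultdict

# Vertical (predicate-based) partitioning: one table per predicate, so a
# triple pattern with a bound predicate scans only its own partition.
# Triples are invented for illustration.
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:bob",   "foaf:name", '"Bob"'),
]

def partition_by_predicate(triples):
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)

tables = partition_by_predicate(triples)
# A pattern like `?s foaf:name ?o` now touches only one partition:
print(tables["foaf:name"])  # [('ex:alice', '"Alice"'), ('ex:bob', '"Bob"')]
```

Distributing each partition across a cluster is what lets a query engine evaluate triple patterns in parallel while avoiding full-dataset scans.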

Abstract: While the multilingual data on the Semantic Web grows rapidly, the building of multilingual ontologies from monolingual ones is still cumbersome, hampered by the lack of techniques for cross-lingual ontology enrichment. Cross-lingual ontology enrichment greatly facilitates semantic interoperability between ontologies in different natural languages. Achieving such enrichment by human labor is very costly and error-prone. Thus, in this paper, we propose a fully automated ontology enrichment approach (OECM), which builds a multilingual ontology by enriching a monolingual ontology from another one in a different natural language, using a cross-lingual matching technique. OECM selects the best translation among all available translations of ontology concepts based on their semantic similarity with the target ontology concepts. We present a use case of our approach for enriching English Scholarly Communication Ontologies using German and Arabic ontologies from the MultiFarm benchmark. We have compared our results with those of the Ontology Alignment Evaluation Initiative (OAEI 2018). Our approach achieves higher precision and recall than five state-of-the-art approaches. Additionally, we recommend some linguistic corrections to the Arabic ontologies in MultiFarm, which have enhanced our cross-lingual matching results.

Abstract: The disruptive potential of the upcoming digital transformations for the industrial manufacturing domain has led to several reference frameworks and numerous standardization approaches. On the other hand, the Semantic Web community has produced a remarkable body of work, for instance on data and service description, integration of heterogeneous sources and devices, and AI techniques in distributed systems. These two work streams are, however, mostly unrelated and only briefly regard each other’s requirements, practices, and terminology. We contribute to closing this gap by providing the Semantic Asset Administration Shell, an RDF-based representation of the Industrie 4.0 Component. We provide an ontology for the latest data model specification, created an RML mapping, supply resources to validate the RDF entities, and introduce basic reasoning on the Asset Administration Shell data model. Furthermore, we discuss the different assumptions and presentation patterns, and analyze the implications of a semantic representation on the original data. We evaluate the overhead thereby created and conclude that the semantic lifting is manageable, also for restricted or embedded devices, and therefore meets the conditions of Industrie 4.0 scenarios.

Abstract: Increasing digitization leads to a constantly growing amount of data in a wide variety of application domains. Data analytics, and in particular machine learning, plays a key role in gaining actionable insights from this data in a variety of domains and real-world applications. However, the configuration of data analytics workflows that include heterogeneous data sources requires significant data science expertise, which hinders wide adoption of existing data analytics frameworks by non-experts. In this paper, we present the Simple-ML framework, which adopts semantic technologies, in particular domain-specific semantic data models and dataset profiles, to support efficient configuration, robustness, and reusability of data analytics workflows. We present the semantic data models that lay the foundation for the framework development and discuss the data analytics workflows based on these models. Furthermore, we present an example instantiation of the Simple-ML data models for a real-world use case in the mobility application domain and discuss the emerging challenges.

Abstract: In the Big Data era, the amount of digital data is increasing exponentially. Knowledge graphs are gaining attention to handle the variety dimension of Big Data, allowing machines to understand the semantics present in data. For example, knowledge graphs such as STITCH, SIDER, and DrugBank have been developed in the biomedical domain. As the amount of data increases, it is critical to perform data analytics. Interaction network analysis is especially important in knowledge graphs, e.g., to detect drug-target interactions. Having a good target identification approach helps in accelerating and reducing the cost of discovering new medicines. In this work, we propose a machine learning-based approach that combines two inputs: (1) interactions and similarities among entities, and (2) a translation-to-embeddings technique. We focus on the problem of discovering missing links in the data, called link prediction. Our approach, named SimTransE, is able to analyze the drug-target interactions and similarities. Based on this analysis, SimTransE is able to predict new drug-target interactions. We empirically evaluate SimTransE using existing benchmarks and evaluation protocols defined by state-of-the-art approaches. Our results demonstrate the good performance of SimTransE in the task of link prediction.

Furthermore, we got 2 demo/poster papers accepted at the Poster & Demo Track.

Here is the list of the accepted poster/demo papers with their abstract:

Abstract: With the recent trend on blockchain, many users want to know more about the important players of the chain. In this study, we investigate and analyze the Ethereum blockchain network in order to identify the major entities across the transaction network. By leveraging the rich data available through Alethio’s platform in the form of RDF triples, we learn about the hubs and authorities of the Ethereum transaction network. Alethio uses SANSA for efficient reading and processing of such large-scale RDF data (transactions on the Ethereum blockchain) in order to perform analytics, e.g., finding top accounts or typical behavior patterns of exchanges’ deposit wallets, and more.

Abstract: Open Data portals often struggle to provide release features (i.e., stable versioning, up-to-date download links, rich metadata descriptions) for their datasets. By this means, wide adoption of publicly available data collections is hindered, since consuming applications cannot access fresh data sources or might break due to data quality issues. While there exists a variety of tools to efficiently control release processes in software development, the management of dataset releases is not as clear. This paper proposes a deployment pipeline for efficient dataset releases that is based on automated enrichment of DCAT/DataID metadata and is a first step towards efficient deployment pipelining for Open Data publishing.  


This work was partially funded by the EU Horizon 2020 projects Boost4.0 (GA no. 780732), BigDataOcean (GA no. 732310), SLIPO (GA no. 731581), and QROWD (GA no. 723088), by the Federal Ministry of Transport and Digital Infrastructure (BMVI) through the LIMBO project (GA no. 19F2029A and 19F2029G), and by the Simple-ML project.

Looking forward to seeing you at SEMANTiCS 2019.

Paper accepted at EPIA 2019

We are very pleased to announce that our group got a paper accepted for presentation at EPIA 2019, the 19th EPIA Conference on Artificial Intelligence, which will be held on September 3–6, 2019 in Vila Real, Portugal.

The EPIA Conference on Artificial Intelligence is a well-established European conference in the field of AI. The 19th edition, EPIA 2019, will take place at UTAD University in Vila Real on September 3–6, 2019. As in previous editions, this international conference is held under the patronage of the Portuguese Association for Artificial Intelligence (APPIA). The purpose of the conference is to promote research in all areas of Artificial Intelligence (AI), covering both theoretical/foundational issues and applications, as well as the scientific exchange among researchers, engineers, and practitioners in related disciplines.

Here is the pre-print of the accepted paper with its abstract: 

Abstract: Information on the internet suffers from noise and corrupt knowledge that may arise due to human and mechanical errors. To further exacerbate this problem, the ever-increasing amount of fake news on social media, and on the internet in general, has created another challenge to drawing correct information from the web. This huge sea of data makes it difficult for human fact-checkers and journalists to assess all the information manually. In recent years, automated fact-checking has emerged as a branch of natural language processing devoted to achieving this feat. In this work, we give an overview of recent work, emphasizing the key challenges faced during the development of such frameworks. We benchmark existing solutions for claim classification and introduce a new model dubbed SimpleLSTM, which outperforms the baseline by 11%, 10.2%, and 18.7% on the FEVER-Support, FEVER-Reject, and 3-Class datasets, respectively. The data, metadata, and code are released as open source and are publicly available.


This work was partially funded by the European Union Marie Curie ITN Cleopatra project (GA no. 812997). 

Looking forward to seeing you at EPIA 2019.

Paper accepted at DEXA 2019

We are very pleased to announce that our group got a paper accepted for presentation at DEXA 2019, the 30th International Conference on Database and Expert Systems Applications, which will be held on August 26–29, 2019 in Linz, Austria.

DEXA provides a forum to present research results and to examine advanced applications in the field. The conference and its associated workshops offer an opportunity for developers, scientists, and users to extensively discuss requirements, problems, and solutions in database, information, and knowledge systems. 

Here is the pre-print of the accepted paper with its abstract: 

Abstract: Context-specific descriptions of entities – expressed in RDF – pose challenges during data-driven tasks, e.g., data integration, and context-aware entity matching represents a building block for these tasks. However, existing approaches only consider inter-schema mapping of data sources and are not able to manage several contexts during entity matching. We devise COMET, an entity matching technique that relies on both the knowledge stated in RDF vocabularies and context-based similarity metrics to match contextually equivalent entities. COMET executes a novel 1-1 perfect matching algorithm for matching contextually equivalent entities based on the combined scores of semantic similarity and context similarity. COMET employs the Formal Concept Analysis algorithm to compute the context similarity of RDF entities. We empirically evaluate the performance of COMET on a testbed from DBpedia. The experimental results suggest that COMET is able to accurately match equivalent RDF graphs in a context-dependent manner.
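The combination of similarity scores into a 1-1 matching can be sketched as follows (a simplified greedy stand-in, not COMET’s actual perfect-matching algorithm or its Formal Concept Analysis-based context similarity; all scores are invented):

```python
# Greedy 1-1 matching over combined similarity; all scores are invented.
semantic = {("a1", "b1"): 0.9, ("a1", "b2"): 0.4,
            ("a2", "b1"): 0.5, ("a2", "b2"): 0.8}
context = {("a1", "b1"): 0.7, ("a1", "b2"): 0.6,
           ("a2", "b1"): 0.3, ("a2", "b2"): 0.9}

def match_one_to_one(pairs):
    """Pick pairs by descending combined score; each entity used at most once."""
    combined = {p: semantic[p] + context[p] for p in pairs}
    matched, used_a, used_b = {}, set(), set()
    for (a, b), _ in sorted(combined.items(), key=lambda kv: -kv[1]):
        if a not in used_a and b not in used_b:
            matched[a] = b
            used_a.add(a)
            used_b.add(b)
    return matched

print(match_one_to_one(semantic))  # a2 pairs with b2 (1.7), then a1 with b1 (1.6)
```

A perfect-matching formulation (e.g., the Hungarian algorithm) would optimize the total score globally rather than greedily, which is closer to what the abstract describes.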

This work was partially funded by the European project QualiChain (GA no. 822404).

Looking forward to seeing you at DEXA 2019.

Workshop paper accepted at ICAIL/AIAS 2019

We are very pleased to announce that our group got a paper accepted for presentation at AIAS 2019, the First Workshop on Artificial Intelligence and the Administrative State, which was held in conjunction with the 17th International Conference on AI and Law (ICAIL 2019) on Monday, 17 June 2019 in Montréal, Québec, Canada.

Recent advances in AI, Machine Learning, Human Language Technology, Network Science, and Human Factors analysis offer promising new approaches to improving the ability of all stakeholders, including agencies themselves, to operate within this complex regulatory environment. The scale of administrative states means that the benefits of automation have very high potential impact, both in improvements to government processes and in the delivery of services and benefits to citizens. At the same time, the black-box nature of many automated decision-making systems, particularly sub-symbolic AI components such as those generated by machine learning algorithms, can create considerable tension with the norms of transparency, accountability, and reason-giving that typically govern administrative action. Explainable, responsible, and trustworthy AI is vital for addressing these factors. 

Here is the pre-print of the accepted paper with its abstract: 

Abstract: The ubiquitous availability of online services and mobile apps results in a rapid proliferation of contractual agreements in the form of privacy policies. Despite the importance of such consent forms, the majority of users tend to ignore them due to their length and complexity. Thus, users might be consenting to policies that are not aligned with regulations such as the EU’s GDPR. In this study, we propose a hybrid approach which measures a privacy policy’s risk factor by applying both supervised deep learning and rule-based information extraction. Benefiting from an annotated dataset of 115 privacy policies, a deep learning component first predicts high-level categories for each paragraph. Then, a rule-based module extracts pre-defined attributes and their values, based on the high-level classes. Finally, a privacy policy’s risk factor is computed based on these attribute values.
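The final scoring step can be sketched as an aggregation over the extracted attribute values (a hypothetical illustration: the attribute names and severity weights below are invented, and the paper’s actual risk computation may differ):

```python
# Hypothetical risk aggregation: average severity of the extracted attribute
# values. Attribute names and weights are invented; the paper's actual
# scoring may differ.
SEVERITY = {
    "retention_indefinite": 1.0,
    "third_party_sharing": 0.8,
    "retention_limited": 0.2,
}

def risk_factor(attributes):
    """Mean severity of extracted attributes (unknown ones count as 0.5)."""
    if not attributes:
        return 0.0
    return sum(SEVERITY.get(a, 0.5) for a in attributes) / len(attributes)

print(risk_factor(["retention_indefinite", "third_party_sharing"]))  # 0.9
```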

SDA at ESWC 2019 and a Best Demo Award

The Extended Semantic Web Conference (ESWC) is a major venue for discussing the latest scientific results and technology innovations around semantic technologies. Building on its past success, ESWC is seeking to broaden its focus to span other relevant related research areas in which Web semantics plays an important role. ESWC 2019 will present the latest results in research, technologies and applications in its field. Besides the technical program organized over twelve tracks, the conference will feature a workshop and tutorial program, a dedicated track on Semantic Web challenges, system descriptions and demos, a posters exhibition and a doctoral symposium.

We are very pleased to announce that we got 2 papers accepted at ESWC 2019 for presentation at the main conference. We also had 1 workshop, 2 tutorials, 1 workshop paper, and 3 poster/demo papers accepted, and we participated in the Networking Session.

Furthermore, we are very happy to announce that we won the Best Demo Award for “MantisTable: a Tool for Creating Semantic Annotations on Tabular Data” by Marco Cremaschi, Anisa Rula, Alessandra Siano, and Flavio De Paoli.

Here are some further pointers in case you want to know more about the MantisTable tool:

Among the other contributions, our colleagues gave the following presentations:

Workshops & Tutorials: 


ESWC’19 was a great venue to meet the community, create new connections, talk about current research challenges, share ideas and settle new collaborations. We look forward to the next ESWC conference.
Until then, meet us at SDA!

Papers accepted at ISWC 2019

We are very pleased to announce that our group got 14 papers accepted for presentation at ISWC 2019, the 18th International Semantic Web Conference, which will be held on October 26–30, 2019 in Auckland, New Zealand. ISWC is an A-ranked conference (CORE ranking), currently 11th in Google Scholar in the category “Databases & Information Systems” with an h5-index of 41, and 4th among WWW-related conferences in MS Academic Search.

The International Semantic Web Conference (ISWC) is the premier international forum where Semantic Web / Linked Data / Knowledge Graph researchers, practitioners, and industry specialists come together to discuss, advance, and shape the future of semantic technologies on the web, within enterprises, and in the context of public institutions.

Here is the list of the accepted papers with their abstract: 

Abstract: Over the last years, Linked Data has grown continuously. Today, we count more than 10,000 datasets being available online following Linked Data standards. These standards allow data to be machine-readable and interoperable. Nevertheless, many applications, such as data integration, search, and interlinking, cannot take full advantage of Linked Data if it is of low quality. There exist a few approaches for the quality assessment of Linked Data, but their performance degrades with the increase in data size and quickly grows beyond the capabilities of a single machine. In this paper, we present DistQualityAssessment — an open-source implementation of quality assessment of large RDF datasets that can scale out to a cluster of machines. This is the first distributed, in-memory approach for computing different quality metrics for large RDF datasets using Apache Spark. We also provide a quality assessment pattern that can be used to generate new scalable metrics that can be applied to big data. The work presented here is integrated with the SANSA framework and has been applied to at least three use cases beyond the SANSA community. The results show that our approach is more generic, efficient, and scalable compared to previously proposed approaches.
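The flavor of such a quality metric can be shown with a toy example: the fraction of subjects carrying an rdfs:label, computed in the same filter-and-aggregate style that scales out over Spark (plain Python here; the triples are invented, and DistQualityAssessment’s actual metrics are richer):

```python
# Toy completeness metric: fraction of subjects carrying an rdfs:label.
# Triples are invented; real metrics run the same filter-and-aggregate
# pattern distributed over Spark.
triples = [
    ("ex:a", "rdfs:label", '"A"'),
    ("ex:a", "ex:p", "ex:b"),
    ("ex:b", "ex:p", "ex:c"),
]

def labeled_subject_ratio(triples):
    subjects = {s for s, _, _ in triples}
    labeled = {s for s, p, _ in triples if p == "rdfs:label"}
    return len(labeled) / len(subjects)

print(labeled_subject_ratio(triples))  # 0.5 -> ex:a is labeled, ex:b is not
```

Because both the filter and the count are embarrassingly parallel, this pattern maps directly onto distributed RDDs or DataFrames.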


Abstract: One of the key traits of Big Data is its complexity in terms of representation, structure, or formats. One existing way to deal with it is offered by Semantic Web standards. Among them, RDF – which models data as triples representing edges in a graph – has seen great success, and semantically annotated data has grown steadily towards a massive scale. Therefore, there is a need for scalable and efficient query engines capable of retrieving such information. In this paper, we propose Sparklify: a scalable software component for efficient evaluation of SPARQL queries over distributed RDF datasets. It uses Sparqlify as a SPARQL-to-SQL rewriter for translating SPARQL queries into Spark executable code. Our preliminary results demonstrate that our approach is more extensible, efficient, and scalable compared to state-of-the-art approaches. Sparklify is integrated into the larger SANSA framework, where it serves as the default query engine, and has been used in at least three external use scenarios.
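The core rewriting step can be sketched for a single triple pattern with a bound predicate, assuming one two-column table per predicate (a simplified illustration, not Sparqlify’s actual, far more general rewriter):

```python
# Simplified SPARQL-to-SQL rewriting for one triple pattern with a bound
# predicate, assuming a two-column (s, o) table per predicate. Sparqlify's
# actual rewriter is far more general.
def rewrite_triple_pattern(s, p, o):
    table = p.replace(":", "_")
    conditions = []
    if not s.startswith("?"):
        conditions.append(f"s = '{s}'")
    if not o.startswith("?"):
        conditions.append(f"o = '{o}'")
    sql = f"SELECT s, o FROM {table}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql

print(rewrite_triple_pattern("?person", "foaf:name", "?name"))
# SELECT s, o FROM foaf_name
```

Joining several such rewritten patterns on their shared variables is what turns a full basic graph pattern into executable SQL over the partitioned tables.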

Abstract: The last two decades witnessed a remarkable evolution in terms of data formats, modalities, and storage capabilities. Instead of having to adapt one's application needs to the earlier, more limited storage options, today there is a wide array of options to choose from to best meet an application's needs. This has resulted in vast amounts of data available in a variety of forms and formats which, if interlinked and jointly queried, can generate valuable knowledge and insights. In this article, we describe Squerall: a framework that builds on the principles of Ontology-Based Data Access (OBDA) to enable the querying of disparate heterogeneous sources using a single query language, SPARQL. In Squerall, original data is queried on-the-fly without prior data materialization or transformation. In particular, Squerall allows the aggregation and joining of large data in a distributed manner. Squerall supports five data sources out of the box and can, moreover, be programmatically extended to cover more sources and incorporate new query engines. The framework provides user interfaces for the creation of the necessary inputs, as well as for guiding non-SPARQL experts to write SPARQL queries. Squerall is integrated into the popular SANSA stack and available as open-source software via GitHub and as a Docker image.

Abstract: Relation linking is an important problem for knowledge graph-based Question Answering. Given a natural language question and a knowledge graph, the task is to identify relevant relations from the given knowledge graph. Since existing techniques for entity extraction and linking are more stable than those for relation linking, our idea is to exploit entities extracted from the question to support relation linking. In this paper, we propose a novel approach, based on DBpedia entities, for computing relation candidates. We have empirically evaluated our approach on different standard benchmarks. Our evaluation shows that our approach significantly outperforms existing baseline systems in recall, precision, and runtime.
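To make the core intuition concrete, here is a small sketch of entity-driven relation candidate generation: relations are ranked by how many of the question's linked entities they touch in the knowledge graph. This is an illustrative stand-in for the paper's approach, not its actual algorithm, and the toy triples are invented.

```python
from collections import Counter

def relation_candidates(kg_triples, question_entities):
    """Rank candidate relations for relation linking by counting, for each
    relation, how many of the entities linked in the question it connects
    in the knowledge graph. kg_triples: iterable of (s, p, o) tuples."""
    counts = Counter()
    for s, p, o in kg_triples:
        touched = {e for e in (s, o) if e in question_entities}
        if touched:
            counts[p] += len(touched)
    return [p for p, _ in counts.most_common()]

kg = [("dbr:Berlin", "dbo:country", "dbr:Germany"),
      ("dbr:Berlin", "dbo:mayor", "dbr:SomeMayor"),
      ("dbr:Hamburg", "dbo:country", "dbr:Germany")]
print(relation_candidates(kg, {"dbr:Berlin", "dbr:Germany"}))
# ['dbo:country', 'dbo:mayor']
```

A relation touching both question entities outranks one touching only a single entity, which mirrors why stable entity links are such a useful signal for the harder relation-linking step.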

Abstract: Over the last years, a number of Linked Data-based Question Answering (QA) systems have been developed. Consequently, the series of Question Answering over Linked Data challenges (QALD1–QALD9) and other datasets have been proposed to evaluate these systems. However, these QA datasets contain a fixed number of natural language questions and do not allow users to generate micro-benchmarks tailored towards specific use cases. We propose QaldGen, a natural language benchmark generation framework for knowledge graphs which is able to generate customised QA benchmarks from existing QA repositories. The framework is flexible enough to generate benchmarks of varying sizes according to user-defined criteria on the most important features to be considered for QA benchmarking. This is achieved using different clustering algorithms. We compare state-of-the-art QA systems over knowledge graphs using different QA benchmarks. The observed results show that specialised micro-benchmarking is important to pinpoint the limitations of the various components of QA systems.
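The selection principle behind such benchmark generation can be sketched very roughly: group questions by some feature and keep one representative per group, so a small benchmark still spans the feature range. QaldGen uses proper clustering over many question features; the single-feature bucketing below is only an illustration under that simplifying assumption.

```python
def micro_benchmark(questions, n_buckets):
    """Toy stand-in for clustering-based benchmark selection: sort questions
    by one feature (word count) and keep one representative per bucket,
    yielding a small benchmark that still covers short and long questions."""
    ranked = sorted(questions, key=lambda q: len(q.split()))
    step = max(1, len(ranked) // n_buckets)
    return [ranked[i] for i in range(0, len(ranked), step)][:n_buckets]

questions = ["Who wrote Hamlet?", "Where is Berlin?",
             "Who is the mayor of the capital of Germany?", "When?"]
print(micro_benchmark(questions, 2))
```

Replacing the word-count feature with richer ones (answer type, query shape, number of triples in the SPARQL query) and the bucketing with a real clustering algorithm gives the kind of tailored micro-benchmark the framework produces.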

Abstract: Knowledge graphs are composed of different elements: entity nodes, relation edges, and literal nodes. Each literal node contains an entity’s attribute value (e.g. the height of an entity of type person) and thereby encodes information which in general cannot be represented by relations between entities alone. However, most of the existing embedding or latent-feature-based methods for knowledge graph analysis only consider entity nodes and relation edges, and thus do not take the information provided by literals into account. In this paper, we extend existing latent feature methods for link prediction with a simple portable module for incorporating literals, which we name LiteralE. Unlike in concurrent methods, where literals are incorporated by adding a literal-dependent term to the output of the scoring function and thus only indirectly affect the entity embeddings, LiteralE directly enriches these embeddings with information from literals via a learnable parameterized function. This function can be easily integrated into the scoring function of existing methods and learned along with the entity embeddings in an end-to-end manner. In an extensive empirical study over three datasets, we evaluate LiteralE-extended versions of various state-of-the-art latent feature methods for link prediction and demonstrate that LiteralE presents an effective way to improve their performance. For these experiments, we augmented standard datasets with their literals, which we publicly provide as testbeds for further research. Moreover, we show that LiteralE leads to a qualitative improvement of the embeddings and that it can be easily extended to handle literals from different modalities.
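The gating idea can be sketched in a few lines. The paper uses full learnable weight matrices; the dimension-wise scalar weights below are a deliberate simplification to keep the sketch readable, so treat this as an illustration of the mechanism, not LiteralE itself.

```python
import math

def enrich_with_literals(entity_vec, literal_vec, gate_w, mix_w):
    """Dimension-wise toy version of the LiteralE gating idea:
        z_i = sigmoid(gate_w_i * (e_i + l_i))   # how much literal info to admit
        h_i = tanh(mix_w_i * (e_i + l_i))       # literal-aware mixture
        out_i = z_i * h_i + (1 - z_i) * e_i     # gated blend with the original e
    The weights stand in for trained parameters; in the paper they are
    matrices learned end-to-end with the embeddings."""
    out = []
    for e, l, gw, mw in zip(entity_vec, literal_vec, gate_w, mix_w):
        z = 1.0 / (1.0 + math.exp(-gw * (e + l)))
        h = math.tanh(mw * (e + l))
        out.append(z * h + (1 - z) * e)
    return out

print(enrich_with_literals([1.0, -0.5], [0.2, 0.1], [0.5, 0.5], [1.0, 1.0]))
```

Because the enriched vector has the same dimensionality as the original embedding, it can be dropped into the scoring function of any existing latent-feature model, which is the portability the abstract emphasizes.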

Abstract: The growing interest in free and open-source software over the last decades has accelerated the usage of versioning systems that help developers collaborate on the same projects. As a consequence, specific tools such as git and specialized open-source online platforms have gained importance. In this study, we introduce and share SemanGit, which provides a resource at the crossroads of both the Semantic Web and git web-based version control systems. SemanGit is the first collection of linked data extracted from GitHub, based on a git ontology we designed and extended to include GitHub-specific features. In this article, we present the dataset, describe the extraction process according to the ontology, show some promising analyses of the data, and outline how SemanGit could be linked with external datasets or enriched with new sources to allow for more complex analyses.

Abstract: Scientific events have become a key factor of scholarly communication for many scientific domains. They are considered the focal point for establishing scientific relations between scholarly objects such as people (e.g., chairs, participants), places (e.g., location), actions (e.g., roles of participants), and artifacts (e.g., proceedings) in the scholarly communication domain. Metadata of scientific events have been made available in unstructured or semi-structured formats, which hides the interconnected and complex relationships between them and prevents transparency. To facilitate the management of such metadata, the representation of event-related information in an interoperable form requires uniform conceptual modeling. The Scientific Events Ontology (OR-SEO) has been engineered to represent metadata of scientific events. We describe a systematic redesign of the information model that is used as a schema for the event pages of the community wiki, reusing well-known vocabularies to make OR-SEO interoperable in different contexts. OR-SEO is now in use on thousands of OpenResearch event pages, which enables users to represent structured knowledge about events without tackling technical implementation challenges and ontology development.

Abstract: There is an emerging trend of embedding knowledge graphs (KGs) in continuous vector spaces in order to use them for machine learning tasks. Recently, many knowledge graph embedding (KGE) models have been proposed that learn low-dimensional representations while trying to maintain the structural properties of the KGs, such as the similarity of nodes depending on their edges to other nodes. KGEs can be used to address tasks within KGs, such as the prediction of novel links and the disambiguation of entities. They can also be used for downstream tasks like question answering and fact-checking. Overall, these tasks are relevant for the Semantic Web community. Despite their popularity, the reproducibility of KGE experiments and the transferability of proposed KGE models to research fields outside the machine learning community can be a major challenge. Therefore, we present the KEEN Universe, an ecosystem for knowledge graph embeddings that we have developed with a strong focus on reproducibility and transferability. The KEEN Universe currently consists of the Python packages PyKEEN (Python KnowlEdge Graph EmbeddiNgs), BioKEEN (Biological KnowlEdge Graph EmbeddiNgs), and the KEEN Model Zoo for sharing trained KGE models with the community.
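As a flavour of what a KGE model computes, here is the scoring function of TransE, one of the classic models implemented in ecosystems of this kind. This is the well-known general formulation in plain Python, not code from PyKEEN, whose actual API differs.

```python
def transe_score(h, r, t):
    """TransE plausibility score: a true triple (h, r, t) should satisfy
    h + r close to t in the embedding space, so the score is the negative
    Euclidean distance between h + r and t (higher means more plausible).
    h, r, t are embedding vectors given as lists of floats."""
    dist = sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)) ** 0.5
    return -dist

# A perfectly translated triple scores 0.0; a mismatched tail scores lower.
print(transe_score([0.0, 0.0], [1.0, 0.0], [1.0, 0.0]))  # 0.0
```

Reproducibility efforts such as the KEEN Universe matter precisely because dozens of variations on this scoring idea exist, each with its own training regime and hyperparameters.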

Abstract: The data quality improvement of DBpedia has been the focus of many publications in past years, with topics covering both knowledge enrichment techniques (such as type learning, taxonomy generation, and interlinking) and error detection strategies (such as property or value outlier detection, type checking, ontology constraints, or unit tests), to name just a few. The concrete innovation of the DBpedia FlexiFusion workflow, leveraging the novel DBpedia PreFusion dataset, which we present in this paper, is to massively cut down the engineering workload needed to apply any of the vast range of available methods in a shorter time, and also to make it easier to produce customized knowledge graphs or DBpedias. While FlexiFusion is flexible enough to accommodate other use cases, our main use case in this paper is the generation of richer, language-specific DBpedias for the 20+ DBpedia chapters, which we demonstrate on the Catalan DBpedia. In this paper, we define a set of quality metrics and evaluate them for Wikidata and DBpedia datasets of several language chapters. Moreover, we show that an implementation of FlexiFusion, performed on the proposed PreFusion dataset, increases data size, richness, and quality in comparison to the source datasets.

Abstract: Answering simple questions over knowledge graphs is a well-studied problem in question answering. Previous approaches for this task built on recurrent and convolutional neural network (RNN and CNN) based architectures that use pretrained word embeddings. It was recently shown that a pretrained transformer network (BERT) can outperform RNN- and CNN-based approaches on various natural language processing tasks. In this work, we investigate how well BERT performs on the entity span prediction and relation prediction subtasks of simple QA. In addition, we provide an evaluation of both BERT- and BiLSTM-based models in limited data scenarios.

Abstract: Providing machines with the capability of exploring knowledge graphs and answering natural language questions has been an active area of research over the past decade. In this direction, translating natural language questions into formal queries has been one of the key approaches. To advance the research area, several datasets such as WebQuestions, QALD, and LC-QuAD have been published in the past. The largest dataset available for complex questions over knowledge graphs (LC-QuAD) contains five thousand questions. We now provide LC-QuAD 2.0 (Large-Scale Complex Question Answering Dataset) with 30,000 questions, their paraphrases, and their corresponding SPARQL queries. LC-QuAD 2.0 is compatible with both the Wikidata and DBpedia 2018 knowledge graphs. In this article, we explain how the dataset was created and the variety of questions available, with corresponding examples. We further provide a statistical analysis of the dataset.

Abstract: Non-goal-oriented, generative dialogue systems lack the ability to generate answers with grounded facts. A knowledge graph can be considered an abstraction of the real world consisting of well-grounded facts. This paper addresses the problem of generating well-grounded responses by integrating knowledge graphs into the dialogue system's response generation process in an end-to-end manner. A dataset for non-goal-oriented dialogues is proposed in this paper in the domain of soccer, conversing about different clubs and national teams, along with a knowledge graph for each of these teams. A novel neural network architecture is also proposed as a baseline on this dataset, which can integrate knowledge graphs into the response generation process, producing well-articulated, knowledge-grounded responses. Empirical evidence suggests that the proposed model performs better than other state-of-the-art models for knowledge-graph-integrated dialogue systems.

Abstract: In this paper, we conduct an empirical investigation of neural query graph ranking approaches for the task of complex question answering over knowledge graphs. We propose a novel self-attention based slot matching model which exploits the inherent structure of query graphs, our logical form of choice. Our proposed model generally outperforms other ranking models on two QA datasets over the DBpedia knowledge graph, evaluated in different settings. We also show that domain adaptation and pre-trained language model based transfer learning yield improvements, effectively offsetting the general lack of training data.


This work was partly supported by the EU Horizon2020 projects BigDataOcean (GA no. 732310), Boost4.0 (GA no. 780732), QROWD (GA no. 723088), SLIPO (GA no. 731581), BETTER (GA 776280), QualiChain (GA 822404), CLEOPATRA (GA no. 812997), LIMBO (Grant no. 19F2029I), OPAL (no. 19F2028A), KnowGraphs (no. 860801), SOLIDE (no. 13N14456), Bio2Vec (grant no. 3454), LAMBDA (#809965), FAIRplus (#802750), the ERC project ScienceGRAPH (#819536), “Industrial Data Space Plus” (GA 01IS17031), the Fraunhofer Cluster of Excellence “Cognitive Internet Technologies” (CCIT), “InclusiveOCW” (grant no. 01PE17004D), the German BMBF-funded project MLwin, the National Natural Science Foundation of China (61673304) and the Key Projects of the National Social Science Foundation of China (11&ZD189), EPSRC grant EP/M025268/1, WWTF grant VRG18-013, the WMF-funded GlobalFactSync project, and by the ADAPT Centre for Digital Content Technology funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European Regional Development Fund.

Looking forward to seeing you at ISWC 2019.

Poster and Workshop papers accepted at ESWC 2019

We are very pleased to announce that our group got 2 poster papers accepted for presentation at ESWC 2019: the 16th edition of the Extended Semantic Web Conference, which will be held on June 2-6, 2019 in Portorož, Slovenia.

The ESWC is a major venue for discussing the latest scientific results and technology innovations around semantic technologies. Building on its past success, ESWC is seeking to broaden its focus to span other relevant related research areas in which Web semantics plays an important role. ESWC 2019 will present the latest results in research, technologies and applications in its field. Besides the technical program organized over twelve tracks, the conference will feature a workshop and tutorial program, a dedicated track on Semantic Web challenges, system descriptions and demos, a posters exhibition and a doctoral symposium.

Here are the pre-prints of the accepted papers with their abstracts:

Abstract: Among the various domains involved in large RDF graphs, applications may rely on geographical information, which is often carried and presented via Points of Interest (POIs). In particular, one challenge is to extract patterns from POI sets to discover Areas of Interest (AOIs). To tackle it, a usual method is to aggregate various points according to specific distances (e.g. geographical) via clustering algorithms. In this study, we present a flexible architecture for designing pipelines able to aggregate POIs from contextual to geographical dimensions in a single run. This solution allows any combination of state-of-the-art clustering algorithms to compute AOIs and is built on top of a Semantic Web stack that allows multiple-source querying and filtering through SPARQL.
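The two-stage aggregation idea (contextual, then geographical) can be sketched in miniature. The snippet below groups POIs by a contextual dimension (their category) and then clusters each group geographically with a naive distance-threshold pass; real pipelines would plug in proper clustering algorithms such as DBSCAN at each stage, and the POI tuples here are invented.

```python
from collections import defaultdict

def cluster_pois(pois, max_dist):
    """Toy two-stage POI aggregation: first group by category (contextual
    dimension), then merge each category's points into geographic clusters
    whenever a point lies within max_dist of an existing cluster member.
    pois: list of (name, category, x, y) tuples; returns a list of clusters
    (candidate AOIs). The greedy merge is order-dependent and only
    illustrative."""
    by_cat = defaultdict(list)
    for poi in pois:
        by_cat[poi[1]].append(poi)
    aois = []
    for points in by_cat.values():
        clusters = []
        for p in points:
            placed = False
            for c in clusters:
                if any((p[2] - q[2]) ** 2 + (p[3] - q[3]) ** 2 <= max_dist ** 2 for q in c):
                    c.append(p)
                    placed = True
                    break
            if not placed:
                clusters.append([p])
        aois.extend(clusters)
    return aois

pois = [("c1", "cafe", 0, 0), ("c2", "cafe", 1, 0),
        ("c3", "cafe", 10, 10), ("m1", "museum", 0.5, 0)]
print(len(cluster_pois(pois, 2)))  # two cafe clusters + one museum cluster -> 3
```

In the paper's architecture the POIs themselves arrive via SPARQL from possibly multiple RDF sources, and each pipeline stage can use a different distance and algorithm.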

Abstract: Due to the rapid expansion of multilingual data on the web, developing ontology enrichment approaches has become an interesting and active subject of research. In this paper, we propose a cross-lingual matching approach for ontology enrichment (OECM) in order to enrich an ontology using another one in a different natural language. A prototype for the proposed approach has been implemented and evaluated using the MultiFarm benchmark. Evaluation results are promising and show high precision and recall compared to state-of-the-art approaches.
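To give a rough sense of the matching step in such a cross-lingual pipeline, the sketch below aligns concepts by label similarity, assuming the target ontology's labels have already been translated into the source language (the translation step is not shown). OECM's actual matching is more elaborate; plain string similarity is only the simplest possible stand-in.

```python
from difflib import SequenceMatcher

def match_concepts(source_labels, translated_target_labels, threshold=0.8):
    """Align concepts across two ontologies by label similarity: for each
    source label, keep the most similar (already translated) target label
    if its similarity ratio reaches the threshold."""
    matches = []
    for s in source_labels:
        scored = [(SequenceMatcher(None, s.lower(), t.lower()).ratio(), t)
                  for t in translated_target_labels]
        score, best = max(scored)
        if score >= threshold:
            matches.append((s, best, round(score, 2)))
    return matches

print(match_concepts(["Person"], ["person", "Automobile"]))
# [('Person', 'person', 1.0)]
```

Matched concepts would then drive the enrichment step, importing missing sub-concepts or properties from the foreign-language ontology into the source one.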

Furthermore, we got a paper accepted at the LASCAR Workshop: the 1st Workshop on Large Scale RDF Analytics, co-located with ESWC 2019.

The LASCAR Workshop seeks original articles and posters describing theoretical and practical methods and techniques for performing scalable analytics on knowledge graphs.

Here is the pre-print of the accepted paper with its abstract:

Abstract: Controlling the usage of business-critical data is essential for every company. While the upcoming age of Industry 4.0 promotes a seamless data exchange between all participating devices, facilities, and companies along the production chain, the required data control mechanisms are lagging behind. We claim that, for effective protection, enforcing both access and usage control is a must-have for organizing Industry 4.0 collaboration networks. Formalized, machine-readable policies are one fundamental building block for achieving the trust level needed for real data-driven collaborations. We explain the current challenges of specifying access and usage control policies and outline respective approaches relying on Semantic Web of Things practices. We analyze the requirements and implications of existing technologies and discuss their shortcomings. Based on our experiences from the specification of the International Data Spaces Usage Control Language, the necessary next steps towards automatically monitored and enforced policies are outlined and research needs are formulated.
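To illustrate what automated enforcement of a machine-readable policy looks like at its simplest, here is a minimal evaluator over a toy policy format. The actual International Data Spaces Usage Control Language expresses far richer constraints (time limits, purpose, obligations); the rule structure and field names below are entirely hypothetical.

```python
def is_allowed(policy, request):
    """Evaluate a toy machine-readable usage-control policy: each rule grants
    a set of actions on one data asset, optionally restricted to specific
    partner organizations. A request is permitted if any rule matches."""
    for rule in policy:
        if rule["asset"] != request["asset"]:
            continue
        if request["action"] not in rule["actions"]:
            continue
        partners = rule.get("partners")
        if partners and request["partner"] not in partners:
            continue
        return True
    return False

policy = [{"asset": "sensor-data", "actions": ["read"], "partners": ["org-a"]}]
print(is_allowed(policy, {"asset": "sensor-data", "action": "read", "partner": "org-a"}))  # True
print(is_allowed(policy, {"asset": "sensor-data", "action": "read", "partner": "org-b"}))  # False
```

The step from such access checks to full usage control is precisely the hard part the paper discusses: the policy must keep constraining what the partner does with the data after it has been handed over.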



This work was partially funded by the European H2020 SLIPO project (GA. 731581).

Looking forward to seeing you at ESWC 2019.