Paper accepted at IEEE BigData 2017

We are very pleased to announce that our group got a paper accepted for presentation at IEEE BigData 2017, which will be held on December 11th-14th, 2017, in Boston, MA, United States.

In recent years, “Big Data” has become a new ubiquitous term. Big Data is transforming science, engineering, medicine, healthcare, finance, business, and ultimately our society itself. The IEEE Big Data conference series, which started in 2013, has established itself as the top-tier research conference in Big Data.
The 2017 IEEE International Conference on Big Data (IEEE Big Data 2017) will provide a leading forum for disseminating the latest results in Big Data Research, Development, and Applications.

“Implementing Scalable Structured Machine Learning for Big Data in the SAKE Project” by Simon Bin, Patrick Westphal, Jens Lehmann, and Axel-Cyrille Ngonga Ngomo.

Abstract: Exploration and analysis of large amounts of machine generated data requires innovative approaches. We propose a combination of Semantic Web and Machine Learning to facilitate the analysis. First, data is collected and converted to RDF according to a schema in the Web Ontology Language OWL. Several components can continue working with the data, to interlink, label, augment, or classify. The size of the data poses new challenges to existing solutions, which we solve in this contribution by transitioning from in-memory to database.

This work was supported in part by a research grant from the German Ministry for Finances and Energy under the SAKE project (Grant agreement No. 01MD15006E) and by a research grant from the European Union’s Horizon 2020 research and innovation programme under the SLIPO project (Grant agreement No. 731581).

Looking forward to seeing you at IEEE BigData 2017.

“A Corpus for Complex Question Answering over Knowledge Graphs” elected as Paper of the month at Fraunhofer IAIS

We are very pleased to announce that our paper “A Corpus for Complex Question Answering over Knowledge Graphs” by Priyansh Trivedi, Gaurav Maheshwari, Mohnish Dubey and Jens Lehmann has been elected as the Paper of the month at Fraunhofer IAIS. This award is given to publications that have a high innovation impact in the research field after a committee evaluation.

This research paper has been accepted at the ISWC 2017 main conference. It presents a large gold standard Question Answering dataset over DBpedia, together with the framework used to create it. This is the largest QA dataset, with 5000 questions and their corresponding SPARQL queries. The paper was nominated for the “Best Student Paper Award” in the resource track.

Abstract: Being able to access knowledge bases in an intuitive way has been an active area of research over the past years. In particular, several question answering (QA) approaches which allow to query RDF datasets in natural language have been developed as they allow end users to access knowledge without needing to learn the schema of a knowledge base and learn a formal query language. To foster this research area, several training datasets have been created, e.g. in the QALD (Question Answering over Linked Data) initiative. However, existing datasets are insufficient in terms of size, variety or complexity to apply and evaluate a range of machine learning based QA approaches for learning complex SPARQL queries. With the provision of the Large-Scale Complex Question Answering Dataset (LC-QuAD), we close this gap by providing a dataset with 5000 questions and their corresponding SPARQL queries over the DBpedia dataset. In this article, we describe the dataset creation process and how we ensure a high variety of questions, which should enable to assess the robustness and accuracy of the next generation of QA systems for knowledge graphs.

The paper and authors were honored for this publication in a special event at Fraunhofer Schloss Birlinghoven, Sankt Augustin, Germany.


SDA at ISWC 2017 – A Ten-Year Best Paper and a Demo Award


The International Semantic Web Conference (ISWC) is the premier international forum where Semantic Web / Linked Data researchers, practitioners, and industry specialists come together to discuss, advance, and shape the future of semantic technologies on the web, within enterprises and in the context of public institutions.

We are very pleased to announce that we got 6 papers accepted at ISWC 2017 for presentation at the main conference. Additionally, we also had 6 Poster/Demo papers accepted.

Furthermore, we are happy to have won the SWSA Ten-Year Best Paper Award, which recognizes the highest impact papers from the 6th International Semantic Web Conference in Busan, Korea in 2007.
Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, Zachary G. Ives. DBpedia: A Nucleus for a Web of Open Data


In addition to this award, we are very happy to announce that we won the Best Demo Award for the SANSA Notebooks:
“The Tale of Sansa Spark” by Ivan Ermilov, Jens Lehmann, Gezim Sejdiu, Lorenz Buehmann, Patrick Westphal, Claus Stadler, Simon Bin, Nilesh Chakraborty, Henning Petzka, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo and Hajira Jabeen.

Here are some further pointers in case you want to know more about SANSA:

The audience displayed enthusiasm during the demonstration, appreciating the work and asking questions regarding the future of SANSA, technical details and possible synergy with industrial partners and projects. Gezim Sejdiu and Jens Lehmann, who presented the demo, talked for 3+ hours non-stop (without even time to eat 😉).

Among the other presentations, our colleagues presented the following:


ISWC17 was a great venue to meet the community, create new connections, talk about current research challenges, share ideas and settle new collaborations. We look forward to the next ISWC conference.

Until then, meet us at SDA!

Dr. Maria Maleshkova visits SDA

Dr. Maria Maleshkova from Karlsruhe Institute of Technology (KIT) visited the SDA group on the 11th of October 2017.

Maria Maleshkova is a postdoctoral researcher at the Karlsruhe Service Research Institute (KSRI) and the Institute of Applied Informatics and Formal Description Methods (AIFB) at the Karlsruhe Institute of Technology. Her research work covers the Web of Things (WoT) and semantics-based data integration topics as well as work in the area of the semantic description of Web APIs, RESTful services and their joint use with Linked Data. Prior to that, she was a Research Associate and a PhD student at the Knowledge Media Institute (KMi) at the Open University, where she worked on projects in the domain of SOA and Web Services.

Prof. Jens Lehmann invited the speaker to the bi-weekly “SDA colloquium presentations”. 40-50 researchers and students from SDA attended. The goal of her visit was to exchange experience and ideas on semantic web techniques specialized for smart services, including Internet of Things, Industry 4.0 technologies and many more. Apart from presenting various use cases where smart services have helped scientists to get useful insights from sensor data, Dr. Maleshkova shared with our group future research problems and challenges related to this research area and showed what is so smart about Smart Services.

As an outcome of this visit, we expect to strengthen our research collaboration networks with KIT, mainly on combining semantic knowledge and distributed analytics applied on SANSA.

Luís Garcia received the prize for best PhD thesis

We are very pleased to announce that Dr. Luis Paulo Faina Garcia, a researcher from SDA, received the prize for the best PhD thesis in Computer Science in 2016 from the Brazilian Government Council. The title of his thesis is “Noise Detection in Classification Problems”, written under the supervision of Prof. Dr. André de Carvalho from the University of São Paulo. In 2017 his thesis was also selected among the best theses by the Brazilian Computer Science Society.

The main contributions of his work improved the accuracy of a Machine Learning system based on noise detection to predict non-native species in protected areas of the Brazilian state of São Paulo. The work resulted in several publications in good conferences and high-quality journals.


Short Abstract: Large volumes of data have been produced in many application domains. Nonetheless, when data quality is low, the performance of Machine Learning techniques is harmed. Real data are frequently affected by the presence of noise, which, when used in the training of Machine Learning techniques for predictive tasks, can result in complex models, with high induction time and low predictive performance. Identification and removal of noise can improve data quality and, as a result, the induced model. This thesis proposes new techniques for noise detection and the development of a recommendation system based on meta-learning to recommend the most suitable filter for new tasks. Experiments using artificial and real datasets show the relevance of this research.


Prof. Manolis Koubarakis visits SDA

Prof. Manolis Koubarakis from the Department of Informatics and Telecommunications at the National and Kapodistrian University of Athens visited the SDA group on the 21st of September 2017.
Manolis Koubarakis and his research group, Management of Data, Information & Knowledge, have been working for the last 7 years on managing geospatial data and have contributed to various research projects and applications in this domain. Examples of successful projects include LEO: Linked Earth Observation Data and MELODIES: Maximizing the Exploitation of Linked Open Data in Enterprise and Science. Some of their applications, widely used by the research community, are Strabon (spatiotemporal RDF store) and SEXTANT (web-based platform for visualizing, exploring and interacting with time-evolving linked geospatial data).

The goal of his visit was to exchange experience and ideas on data management techniques specifically for geospatial data. Apart from presenting various use cases where geospatial tools have helped scientists to get useful insights from scientific data, Prof. Koubarakis shared with our group future research problems and challenges related to this research area. From our side, SDA researchers presented their work on managing Big Data (query processing, analytics, benchmarking, etc.), as well as related tools, like SANSA – Semantic Analytics Stack and Ontario – Semantic Data Lake.

SDA and MADgIK have already been working together for a few years in the context of the EU H2020 projects Big Data Europe and WDAqua and hope to strengthen this collaboration in new projects and joint research activities. An important outcome of this meeting was the plan to organize a common workshop on managing scientific geospatial data in the near future.

SDA at TPDL2017 & an Honorary Mention Award

TPDL 2017: The 21st edition of the International Conference on Theory and Practice of Digital Libraries took place in Thessaloniki, Greece from September 18 to 21, 2017.

We as SDA group had four scientific papers accepted and presented:

We are also happy to have won the Honorary Mention award for the long paper entitled ‘Exploiting Interlinked Research Metadata’, presented by Sahar Vahdati.

Paper abstract: OpenAIRE, the Open Access Infrastructure for Research in Europe, aggregates metadata about research (projects, publications, people, organizations, etc.) into a central Information Space. OpenAIRE aims at increasing interoperability and reusability of this data collection by exposing it as Linked Open Data (LOD). By following the LOD principles, it is now possible to further increase interoperability and reusability by connecting the OpenAIRE LOD to other datasets about projects, publications, people, and organizations. Doing so required us to identify link discovery tools that perform well, as well as candidate datasets that provide comprehensive scholarly communication metadata, and then to specify linking rules. We demonstrate the added value that interlinking provides for end users by implementing visual frontends for looking up publications to cite, and publication statistics, and evaluating their usability on top of interlinked vs. non-interlinked data.

This year at TPDL 2017, three very interesting keynote speeches were given: Paul Groth on “Machines are people too”, Elton Barker on “Back to the future: annotating, collaborating and linking in a digital ecosystem” and Dimitrios Tzovaras on “Visualization in the big data era: data mining from networked information”.

Thanks to all organizers of TPDL 2017, especially the general chairs:

  • Yannis Manolopoulos, Aristotle University of Thessaloniki, Greece
  • Lazaros Iliadis, Democritus University of Thrace, Greece

Papers accepted at K-CAP 2017

We are very pleased to announce that our group got 5 papers accepted for presentation at K-CAP 2017, which will be held on December 4th-6th, 2017, in Austin, Texas, United States.
The Ninth International Conference on Knowledge Capture attracts researchers from diverse areas of Artificial Intelligence, including knowledge representation, knowledge acquisition, intelligent user interfaces, problem-solving and reasoning, planning, agents, text extraction, and machine learning, information enrichment and visualization, as well as researchers interested in cyber-infrastructures to foster the publication, retrieval, reuse, and integration of data.

Here is the list of accepted papers with their abstracts:

“Capturing Knowledge in Semantically-typed Relational Patterns to Enhance Relation Linking” by Kuldeep Singh, Isaiah Onando Mulang, Ioanna Lytra, Mohamad Yaser Jaradeh, Ahmad Sakor, Maria-Esther Vidal, Christoph Lange and Sören Auer.

Abstract: Transforming natural language questions into formal queries is an integral task in Question Answering (QA) systems. QA systems built on knowledge graphs like DBpedia, require an extra step after Natural Language Processing (NLP) for linking words, specifically including named entities and relations, to their corresponding entities in a knowledge graph. To achieve this task, several approaches rely on background knowledge bases containing semantically-typed relations, e.g., PATTY, for an extra disambiguation step. Two major factors may affect the performance of relation linking approaches whenever background knowledge bases are accessed: a) limited availability of such semantic knowledge sources, and b) lack of a systematic approach on how to maximize the benefits of the collected knowledge. We tackle this problem and devise SIBKB, a semantic-based index able to capture knowledge encoded on background knowledge bases like PATTY. SIBKB represents a background knowledge base as a bi-partite and a dynamic index over the relation patterns included in the knowledge base. Moreover, we develop a relation linking component able to exploit SIBKB features. The benefits of SIBKB are empirically studied on existing QA benchmarks. Observed results suggest that SIBKB is able to enhance the accuracy of relation linking by up to three times.

“SimDoc: Topic Sequence Alignment based Document Similarity Framework” by Gaurav Maheshwari, Priyansh Trivedi, Harshita Sahijwani, Kunal Jha, Sourish Dasgupta and Jens Lehmann.

Abstract: Document similarity is the problem of estimating the degree to which a given pair of documents has similar semantic content. An accurate document similarity measure can improve several enterprise relevant tasks such as document clustering, text mining, and question-answering. In this paper, we show that a document’s thematic flow, which is often disregarded by bag-of-word techniques, is pivotal in estimating their similarity. To this end, we propose a novel semantic document similarity framework, called SimDoc. We model documents as topic-sequences, where topics represent latent generative clusters of related words. Then, we use a sequence alignment algorithm to estimate their semantic similarity. We further conceptualize a novel mechanism to compute topic-topic similarity to fine tune our system. In our experiments, we show that SimDoc outperforms many contemporary bag-of-words techniques in accurately computing document similarity, and on practical applications such as document clustering.
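
To make the idea of comparing documents as topic sequences more concrete, here is a minimal sketch in Python. The abstract does not say which alignment algorithm SimDoc uses, so this example assumes a plain Needleman-Wunsch global alignment as a stand-in. The function names, the gap penalty and the exact-match similarity are all hypothetical choices for illustration, not taken from the paper.

```python
def alignment_score(seq_a, seq_b, similarity, gap_penalty=-0.5):
    """Global (Needleman-Wunsch style) alignment score of two topic sequences.

    seq_a, seq_b : sequences of topic identifiers, one per text segment
    similarity   : function (topic, topic) -> float, the topic-topic similarity
    gap_penalty  : cost of skipping a topic in either sequence (assumed value)
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0.0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty sequence costs one gap per topic.
    for i in range(1, rows):
        score[i][0] = i * gap_penalty
    for j in range(1, cols):
        score[0][j] = j * gap_penalty

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                # align topic i-1 of seq_a with topic j-1 of seq_b
                score[i - 1][j - 1] + similarity(seq_a[i - 1], seq_b[j - 1]),
                # skip a topic of seq_a
                score[i - 1][j] + gap_penalty,
                # skip a topic of seq_b
                score[i][j - 1] + gap_penalty,
            )
    return score[-1][-1]


def exact_match(topic_a, topic_b):
    """Simplest possible topic-topic similarity (an assumption for this sketch)."""
    return 1.0 if topic_a == topic_b else -1.0


if __name__ == "__main__":
    doc_1 = [3, 3, 7, 1]  # hypothetical topic ids along document 1
    doc_2 = [3, 7, 1]     # document 2 follows the same flow, but shorter
    print(alignment_score(doc_1, doc_2, exact_match))
```

Unlike a bag-of-words comparison, the score depends on the order in which topics appear. Replacing `exact_match` with a graded topic-topic similarity would be the natural place to plug in the fine-tuning mechanism the abstract mentions.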

“SQCFramework: SPARQL Query Containment Benchmark Generation Framework” by Muhammad Saleem, Claus Stadler, Qaiser Mehmood, Jens Lehmann and Axel-Cyrille Ngonga Ngomo.

Abstract: Query containment is a fundamental problem in data management. Its main application is in global query optimization. A number of SPARQL query containment solvers for SPARQL have been developed recently. To the best of our knowledge, the Query Containment Benchmark (QC-Bench) is the only benchmark for evaluating these containment solvers. However, this benchmark contains a fixed number of synthetic queries, which were handcrafted by its creators. We propose SQCFramework, a SPARQL query containment benchmark generation framework which is able to generate customized SPARQL containment benchmarks from real SPARQL query logs. The framework is flexible enough to generate benchmarks of varying sizes and according to the user-defined criteria on the most important SPARQL features to be considered for query containment benchmarking. The generation of benchmarks is achieved using different clustering algorithms. We compare state-of-the-art SPARQL query containment solvers by using different query containment benchmarks generated from DBpedia and Semantic Web Dog Food query logs.

“Semantic Zooming for Ontology Graph Visualizations” by Vitalis Wiens, Steffen Lohmann and Sören Auer.

Abstract: Visualizations of ontologies, in particular graph visualizations in the form of node-link diagrams, are often used to support ontology development, exploration, verification, and sensemaking. With growing size and complexity of ontology graph visualizations, their represented information tend to become hard to comprehend due to visual clutter and information overload. We present an approach that abstracts and simplifies the underlying graph structure of ontologies. The new approach of semantic zooming for ontology graph visualizations separates the comprised information of an ontology into three layers with discrete levels of detail. The visual appearance layer is defined with the support of expert interviews. The approach is applied on a force-directed layout using the VOWL notation. The mental map is preserved using smart expanding and ordering of elements in the layout. Navigation and sensemaking are supported by local and global exploration methods, halo visualization, and smooth zooming. The results of a user study confirm an increase in readability, visual clarity, and information clarity of ontology graph visualizations enhanced with our semantic zooming approach.

“Bidirectional LSTM with a Context Input Window for Named Entity Recognition in Tweets” by Rafael Peres, Diego Esteves and Gaurav Maheshwari.

Abstract: Lately, with the increasing popularity of social media technologies, applying natural language processing for mining information in tweets has posed itself as a challenging task and has attracted significant research efforts. In contrast with the news text and other formal content, tweets pose a number of new challenges, due to their short and noisy nature. Thus, over the past decade, different Named Entity Recognition (NER) architectures have been proposed to solve this problem. However, most of them are based on handcrafted-features and restricted to a particular domain, which imposes a natural barrier to generalize over different contexts. In this sense, despite the long line of work in NER on formal domains, there are no studies in NER for tweets in Portuguese (despite 17.97 million monthly active users). To bridge this gap, we present a new gold-standard corpus of tweets annotated for Person, Location, and Organization (PLO). Additionally, we also perform multiple NER experiments using a variety of Long Short-Term Memory (LSTM) based models without resorting to any handcrafted rules. Our approach with a centered context input window of word embeddings yields 52.78 F1 score, 38.68% higher compared to a state of the art baseline system.


This work was supported by the European Union’s H2020 research and innovation program BigDataEurope (GA no. 644564), WDAqua: Marie Skłodowska-Curie Innovative Training Network (GA no. 642795) and by the European Union’s Horizon 2020 research and innovation programme GRACeFUL (GA no. 640954).

Looking forward to seeing you at K-CAP 2017.

SDA at SEMANTiCS 2017 & a Best Paper Award

SEMANTiCS 2017 is an international event on Linked Data and the Semantic Web where business users, vendors and academia meet. Our members actively participated in the 13th SEMANTiCS 2017, which took place in Amsterdam, Netherlands, on September 11-14.

We are very pleased to announce that we got 7 papers accepted at SEMANTiCS 2017 for presentation at the main conference. Additionally, we also had 4 Posters and 2 Demo papers accepted.

Furthermore, adding a feather to our hat, our colleague Harsh Thakkar (@harsh9t) secured a Best Research and Innovation Paper Award for his work “Trying Not to Die Benchmarking – Orchestrating RDF and Graph Data Management Solution Benchmarks Using LITMUS” (Github Org., Website, Docker, PDF).

Abstract. Knowledge graphs, usually modelled via RDF or property graphs, have gained importance over the past decade. In order to decide which Data Management Solution (DMS) performs best for specific query loads over a knowledge graph, it is required to perform benchmarks. Benchmarking is an extremely tedious task demanding repetitive manual effort, therefore it is advantageous to automate the whole process. However, there is currently no benchmarking framework which supports benchmarking and comparing diverse DMSs for both RDF and property graph DMS. To this end, we introduce, the first working prototype of, LITMUS which provides this functionality as well as fine-grained environment configuration options, a comprehensive set of DMS and CPU-specific key performance indicators and a quick analytical support via custom visualization (i.e. plots) for the benchmarked DMSs.

The audience displayed enthusiasm during the presentation, appreciating the work and asking questions regarding the future of his work and possible synergy with industrial partners/projects.

Furthermore, an interested crowd also joined the Poster & Demo session for first-hand experience with LITMUS.

Among the other presentations, our colleagues presented the following research papers: 

  • SMJoin: A Multi-way Join Operator for SPARQL Queries by Mikhail Galkin, Kemele M. Endris, Maribel Acosta, Diego Collarana, Maria-Esther Vidal, Sören Auer.
    Mikhail Galkin presented his work on introducing a concept and practical approach for multi-way joins applicable in SPARQL query engines. Multi-way joins refer to operators of n-arity, i.e., supporting more than two inputs. A set of optimizations was proposed to increase the performance of a multi-way operator. The audience was particularly interested in experimental results, the impact of query selectivity and operator optimizations.
  • IDOL: Comprehensive & Complete LOD Insights by Ciro Baron Neto, Dimitris Kontokostas, Amit Kirschenbaum, Gustavo Publio, Diego Esteves, and Sebastian Hellmann.
    The paper presents challenges and technical barriers in identifying and linking the whole linked open data cloud using a probabilistic data structure called a Bloom filter. The audience was most interested in questions related to the problem of cross-dataset error correction as well as the generation of further analytics and heuristics metadata.
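
For readers unfamiliar with the Bloom filter mentioned in the IDOL summary, the following Python sketch shows the general technique of testing set membership with a small bit array. It is not the IDOL implementation. The class name, the sizes, the SHA-256 based hashing and the `count_candidate_links` helper are assumptions made for illustration only.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, but false positives are possible."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with a seed (assumed scheme).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


def count_candidate_links(source_resources, target_resources, **filter_args):
    """Count target resources that possibly also occur in the source dataset."""
    bloom = BloomFilter(**filter_args)
    for resource in source_resources:
        bloom.add(resource)
    return sum(1 for resource in target_resources if resource in bloom)


if __name__ == "__main__":
    bloom = BloomFilter()
    bloom.add("http://dbpedia.org/resource/Berlin")
    print("http://dbpedia.org/resource/Berlin" in bloom)  # True
```

Because a lookup can report a resource as present when it is not, any count produced this way is an upper bound on the true overlap. That trade-off is what lets the filter summarize a very large dataset in a fixed amount of memory.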

SEMANTiCS was a great venue to meet the community, create new connections, talk about current research challenges, share ideas and settle new collaborations. We look forward to the next SEMANTiCS conference.

SDA at DEXA2017 & a Best Paper Award

The International Conference on Database and Expert Systems Applications (DEXA 2017) is one of the major venues for discussing the latest scientific results and technologies around database, information, and knowledge systems. Our members actively participated in the 28th DEXA 2017 Conferences and Workshops, which took place in Lyon, France from August 28-31, 2017.

We are very pleased to report that:

4 papers from our group were accepted for presentation @DEXA2017

  • MULDER: Querying the Linked Data Web by Bridging RDF Molecule Templates by Kemele M. Endris, Mikhail Galkin, Ioanna Lytra, Mohamed Nadjib Mami, Maria-Esther Vidal and Sören Auer.
    Kemele M. Endris presented his work on Querying the Linked Data Web by Bridging RDF Molecule Templates in the main conference.

    The audience showed high interest in his presentation and appreciated such a composition into existing query processing engines.
    Kemele M. Endris secured a Best Research Paper Award for his work.

    Abstract. The increasing number of RDF data sources that allow for querying Linked Data via Web services form the basis for federated SPARQL query processing. Federated SPARQL query engines provide a unified view of a federation of RDF data sources, and rely on source descriptions for selecting the data sources over which unified queries will be executed. Albeit efficient, existing federated SPARQL query engines usually ignore the meaning of data accessible from a data source, and describe sources only in terms of the vocabularies utilized in the data source. Lack of source description may conduce to the erroneous selection of data sources for a query, thus affecting the performance of query processing over the federation. We tackle the problem of federated SPARQL query processing and devise MULDER, a query engine for federations of RDF data sources. MULDER describes data sources in terms of RDF molecule templates, i.e., abstract descriptions of entities belonging to the same RDF class. Moreover, MULDER utilizes RDF molecule templates for source selection, and query decomposition and optimization. We empirically study the performance of MULDER on existing benchmarks, and compare MULDER performance with state-of-the-art federated SPARQL query engines. Experimental results suggest that RDF molecule templates empower MULDER federated query processing, and allow for the selection of RDF data sources that not only reduce execution time, but also increase answer completeness.

DEXA2017 was a great venue to meet the community, create new connections, talk about current research challenges, share ideas and settle new collaborations. We look forward to the next DEXA conference.