We are looking back at a busy and successful year 2017 full of new members, inspirational discussions, exciting conferences, a lot of accepted papers and awards as well as new software releases.
Below is a short summary of the main cornerstones for 2017:
The growth of the group in 2017
SDA is a new group, but not new in the field :). As a group, it was founded by Prof. Dr. Jens Lehmann at the beginning of 2016. The group has members at the University of Bonn with associated researchers at the Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS) and the Institute for Applied Computer Science Leipzig. Within 2017, the group has grown from 20 members to around 55 members (1 Professor, 1 Akademischer Rat / Assistant Professor, 11 PostDocs, 31 PhD students, 11 master students), as you can see on our team page.
An interesting future for AI and knowledge graphs
Artificial intelligence / machine learning and semantic technologies / knowledge graphs are central topics for SDA. Throughout the year, we achieved a range of interesting research results. These range from internationally leading results in question answering over knowledge graphs, to scalable distributed querying, inference and analysis of large RDF datasets, as well as new perspectives on industrial data spaces and data integration. Amid the race for ever-improving achievements in AI, which go far beyond what many could have imagined 10 years ago, our researchers were able to deliver important contributions and continue to shape different sub-areas of the growing AI research landscape.
We had 46 papers accepted at well-known conferences (e.g. The Web Conference 2018, WWW 2017, AAAI 2017, ISWC 2017, ESWC 2017, DEXA 2017, SEMANTiCS 2017, K-CAP 2017, WI 2017, KESW 2017, IEEE BigData 2017, NIPS 2017, TPDL 2017, ICSC 2018, ICEGOV 2018 and more). We estimate that our articles are cited 3000+ times per year (based on Google Scholar profiles).
We received several awards in 2017 – click on the posts to find out more:
- A Ten-Year Best Paper, a Demo Award and a Best Reviewer Award at ISWC 2017
- A Seven-Year Best Paper Award at ESWC 2017
- An Honorary Mention Award at TPDL 2017
- A Best Paper Award at SEMANTiCS 2017
- A Best Paper Award at DEXA 2017
- A Best Paper Award at the Pattern Recognition Journal
- The best PhD thesis in Computer Science in 2016 from the Brazilian Government Council
- NVIDIA Pioneering Research Award at ICML 2017
- Paper of the month at Fraunhofer IAIS (November 2017, March 2017)
From the funded projects we were happy to launch the final release of the Big Data Europe platform – an open source Big Data Processing Platform allowing users to install numerous big data processing tools and frameworks and create working data flow applications.
There were several other releases:
- SML-Bench – A Structured Machine Learning benchmark framework 0.2 has been released.
- WebVOWL – A Web-based Visualization of Ontologies had several releases in 2017.
- OpenResearch.org – A Crowd-Sourcing platform for collaborative management of scholarly metadata reached coverage of more than 5K computer science conferences in 2017.
Furthermore, SDA deeply values team bonding activities. We often try to introduce fun activities that involve teamwork and team building. At our X-mas party, we enjoyed a very international and lovely dinner together and played “Secret Santa” and a pantomime game.
Long-term team building through deeper discussions, genuine connections and healthy communication helps us to connect within the group!
Many thanks to all of you who have accompanied and supported us on this way!
Jens Lehmann on behalf of The SDA Research Team
We are very pleased to announce that our group got a paper accepted for presentation at the workshop on Optimal Transport and Machine Learning (http://otml17.marcocuturi.net) at NIPS 2017, the Thirty-first Annual Conference on Neural Information Processing Systems, which was held on December 4 – 9, 2017 in Long Beach, California.
The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS) is a single-track machine learning and computational neuroscience conference that includes invited talks, demonstrations and oral and poster presentations of refereed papers. NIPS has a responsibility to provide an inclusive and welcoming environment for everyone in the fields of AI and machine learning. Unfortunately, several events held at (or in conjunction with) this year’s conference fell short of these standards.
Here is the accepted paper with its abstract:
Abstract: Since their invention, generative adversarial networks (GANs) have become a popular approach for learning to model a distribution of real (unlabeled) data. Convergence problems during training are overcome by Wasserstein GANs which minimize the distance between the model and the empirical distribution in terms of a different metric, but thereby introduce a Lipschitz constraint into the optimization problem. A simple way to enforce the Lipschitz constraint on the class of functions, which can be modeled by the neural network, is weight clipping. Augmenting the loss by a regularization term that penalizes the deviation of the gradient norm of the critic (as a function of the network’s input) from one, was proposed as an alternative that improves training. We present theoretical arguments why using a weaker regularization term enforcing the Lipschitz constraint is preferable. These arguments are supported by experimental results on several data sets.
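To make the contrast in the abstract more concrete, here is a minimal sketch of two regularization terms computed from critic gradient norms. The two-sided term is the one described in the abstract (penalizing any deviation of the gradient norm from one). The one-sided term is only an assumed example of a "weaker" penalty, since the abstract does not spell out its exact form. The function names and sample values are hypothetical.

```python
def two_sided_penalty(grad_norms, target=1.0):
    """Mean of (||g|| - 1)^2: penalizes any deviation of the norm from one."""
    return sum((g - target) ** 2 for g in grad_norms) / len(grad_norms)


def one_sided_penalty(grad_norms, target=1.0):
    """Mean of max(0, ||g|| - 1)^2: penalizes only norms above one.

    Assumed form of a weaker term; norms below one satisfy the
    1-Lipschitz constraint and are therefore left unpenalized.
    """
    return sum(max(0.0, g - target) ** 2 for g in grad_norms) / len(grad_norms)


# Hypothetical gradient norms of the critic at three sample points.
norms = [0.5, 1.0, 1.5]
print(two_sided_penalty(norms))
print(one_sided_penalty(norms))
```

In this sketch, a gradient norm of 0.5 contributes to the two-sided penalty but not to the one-sided one, which is the sense in which the latter is weaker.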
This work is supported by WDAqua: Marie Skłodowska-Curie Innovative Training Network (GA no. 642795).
We are very pleased to announce that our group got a paper accepted for presentation at the 11th International Conference on Theory and Practice of Electronic Governance (ICEGOV) 2018, which will be held on April 4 – 6, 2018 in Galway, Ireland.
The conference focuses on the use of technology to transform the working of government and its relationships with citizens, businesses, and other non-state actors in order to improve public governance and its contribution to public policy and development (EGOV). It also promotes the interaction and cooperation between universities, research centres, governments, industries, and non-governmental organizations needed to develop the EGOV community. It is supported by a rich program of keynote lectures, plenary sessions, papers presentations within the thematic sessions, invited sessions, and networking sessions.
Here is the accepted paper with its abstract:
Abstract: Heterogeneity problems within open budgets and spending datasets hinder effective analysis and consumption of these datasets. To understand detailed types of heterogeneities available within open budgets and spending datasets, we analyzed more than 75 datasets from different levels of public administrations. We classified and enumerated these heterogeneities, and examined whether the heterogeneities found can be represented using state-of-the-art data models designed for representing open budgets and spending data. In the end, lessons learned are provided for public administrators, technical and scientific communities.
This work is supported by DAAD and partially by EU H2020 project no. 645833 (OpenBudgets.eu).
Looking forward to seeing you at ICEGOV2018.
We are happy to announce SANSA 0.3 – the third release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.
- Website: http://sansa-stack.net
- GitHub: https://github.com/SANSA-Stack
- Download: http://sansa-stack.net/downloads-usage/
- ChangeLog: https://github.com/SANSA-Stack/SANSA-Stack/releases
You can find the FAQ and usage examples at http://sansa-stack.net/faq/.
The following features are currently supported by SANSA:
- Reading and writing RDF files in N-Triples, Turtle, RDF/XML and N-Quads formats
- Reading OWL files in various standard formats
- Support for multiple data partitioning techniques
- SPARQL querying via Sparqlify (with some known limitations until the next Spark 2.3.* release)
- SPARQL querying via conversion to Gremlin path traversals (experimental)
- RDFS, RDFS Simple, OWL-Horst (all in beta status), EL (experimental) forward chaining inference
- Automatic inference plan creation (experimental)
- RDF graph clustering with different algorithms
- Rule mining from RDF graphs based on AMIE+
- Terminological decision trees (experimental)
- Anomaly detection (beta)
- Distributed knowledge graph embedding approaches: TransE (beta), DistMult (beta), several further algorithms planned
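For readers unfamiliar with the two embedding approaches named in the list above, the following sketch shows their textbook scoring functions on plain Python lists. It is not SANSA's API (SANSA is implemented on Spark and Flink). The function names and example vectors are hypothetical and serve only to illustrate how a triple (head, relation, tail) is scored.

```python
import math


def transe_score(h, r, t):
    """TransE: negative Euclidean distance between h + r and t.

    A plausible triple should satisfy h + r ≈ t, so scores closer
    to zero indicate a better fit.
    """
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def distmult_score(h, r, t):
    """DistMult: trilinear product sum_i h_i * r_i * t_i (higher is better)."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


# Hypothetical 2-dimensional embeddings of a head, relation and tail.
head, relation, tail = [1.0, 2.0], [3.0, 1.0], [2.0, 1.0]
print(transe_score(head, relation, tail))
print(distmult_score(head, relation, tail))
```

Note that DistMult is symmetric in head and tail, whereas TransE is not. This is one reason why several embedding models are usually offered side by side.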
Deployment and getting started:
- There are template projects for SBT and Maven for Apache Spark as well as for Apache Flink available to get started.
- The SANSA jar files are in Maven Central, i.e. in most IDEs you can just search for “sansa” to include the dependencies in Maven projects.
- There is example code for various tasks available.
- We provide interactive notebooks for running and testing code via Docker.
Greetings from the SANSA Development Team
We are very pleased to announce that we got 3 papers accepted at ICSC 2018 for presentation at the main conference, which will be held on Jan 31 – Feb 2, 2018 in California, United States.
The 12th IEEE International Conference on Semantic Computing (ICSC 2018) is dedicated to Semantic Computing (SC), which addresses the derivation, description, integration, and use of semantics (“meaning”, “context”, “intention”) for all types of resources, including data, documents, tools, devices, processes and people. The scope of SC includes, but is not limited to, analytics, semantics description languages and integration (of data and services), interfaces, and applications including biomed, IoT, cloud computing, software-defined networks, wearable computing, context awareness, mobile computing, search engines, question answering, big data, multimedia, and services.
Here is the list of the accepted papers with their abstracts:
Abstract: Researchers spend a lot of time in finding information about people, events, journals, and research areas related to topics of their interest. Digital libraries and digital scholarly repositories usually offer services to assist researchers in this task. However, every research community has its own way of distributing scholarly metadata.
Mailing lists provide an instantaneous channel and are often used for discussing topics of interest to a community of researchers, or to announce important information — albeit in an unstructured way. To bring structure specifically into the announcements of events and thus to enable researchers to, e.g., filter them by relevance, we present a semi-automatic crowd-sourcing workflow that captures metadata of events from call-for-papers emails into the OpenResearch.org semantic wiki. Evaluations confirm that our approach reduces a high number of actions that researchers should do manually to trace the call for papers received via mailing lists.
“Semantic Enrichment of IoT Stream Data On-Demand” by Farah Karim, Ola Al Naameh, Ioanna Lytra, Christian Mader, Maria-Esther Vidal, and Sören Auer
Abstract: Connecting the physical world to the Internet of Things (IoT) allows for the development of a wide variety of applications. Things can be searched, managed, analyzed, and even included in collaborative games.
Industries, health care, and cities are exploiting IoT data-driven frameworks to make these organizations more efficient, thus, improving the lives of citizens. For making IoT a reality, data produced by sensors, smart phones, watches, and other wearables need to be integrated; moreover, the meaning of IoT data should be explicitly represented. However, the Big Data nature of IoT data imposes challenges that need to be addressed in order to provide scalable and efficient IoT data-driven infrastructures. We tackle these issues and focus on the problems of describing the meaning of IoT streaming data using ontologies and integrating this data in a knowledge graph.
We devise DESERT, a SPARQL query engine able to on-Demand factorizE and Semantically Enrich stReam daTa in a knowledge graph.
Resulting knowledge graphs model the semantics or meaning of merged data in terms of entities that satisfy the SPARQL queries and relationships among those entities; thus, only data required for query answering is included in the knowledge graph.
We empirically evaluate the results of DESERT on SRBench, a benchmark of Streaming RDF data.
The experimental results suggest that DESERT allows for speeding up query execution while the size of the knowledge graphs remains relatively low.
Abstract: The amount of Linked Data both open, made available on the Web, and private, exchanged across companies and organizations, have been increasing in recent years. This data can be distributed in form of Knowledge Graphs (KGs), but maintaining these KGs is mainly the responsibility of data owners or providers. Moreover, building applications on top of KGs in order to provide, for instance, analytics, data access control, and privacy is left to the end user or data consumers. However, many resources in terms of development costs and equipment are required by both data providers and consumers, thus impeding the development of real-world applications over KGs. We propose to encapsulate KGs as well as data processing functionalities in a client-side system called Knowledge Graph Container, intended to be used by data providers or data consumers. Knowledge Graph Containers can be tailored to the target environments, ranging from Big Data to light-weight platforms. We empirically evaluate the performance and scalability of Knowledge Graph Containers with respect to state-of-the-art Linked Data management approaches. Observed results suggest that Knowledge Graph Containers increase the availability of Linked Data, as well as efficiency and scalability of various Knowledge Graph management tasks.
This work was supported by the European Union’s H2020 research and innovation program BigDataEurope (GA no. 644564), WDAqua: Marie Skłodowska-Curie Innovative Training Network (GA no. 642795), InDaSpace: a German Ministry for Finances and Energy research grant, a DAAD Scholarship, the European Commission with a grant for the H2020 project OpenAIRE2020 (GA no. 643410), OpenBudgets.eu (GA no. 645833) and by the European Union’s Horizon 2020 IoT European Platform Initiative (IoT-EPI) BioTope (GA no. 688203).
Looking forward to seeing you at ICSC 2018.
Prof.dr.ir. Wil van der Aalst is a distinguished university professor at the Technische Universiteit Eindhoven (TU/e) where he is also the scientific director of the Data Science Center Eindhoven (DSC/e). Since 2003 he has held a part-time position at Queensland University of Technology (QUT). Currently, he is also a visiting researcher at Fondazione Bruno Kessler (FBK) in Trento and a member of the Board of Governors of Tilburg University. His personal research interests include process mining, Petri nets, business process management, workflow management, process modeling, and process analysis. Wil van der Aalst has published over 200 journal papers, 20 books (as author or editor), 450 refereed conference/workshop publications, and 65 book chapters. Many of his papers are highly cited (he is one of the most cited computer scientists in the world; according to Google Scholar, he has an H-index of 135 and has been cited 80,000 times) and his ideas have influenced researchers, software developers, and standardization committees working on process support. Next to serving on the editorial boards of over 10 scientific journals, he also plays an advisory role for several companies, including Fluxicon, Celonis, and ProcessGold. Van der Aalst received honorary degrees from the Moscow Higher School of Economics (Prof. h.c.), Tsinghua University, and Hasselt University (Dr. h.c.). He is also an elected member of the Royal Netherlands Academy of Arts and Sciences, the Royal Holland Society of Sciences and Humanities, and the Academy of Europe. Recently, he was awarded a Humboldt Professorship, Germany’s most valuable research award (five million euros), and will move to RWTH Aachen University at the beginning of 2018.
Prof. Jens Lehmann invited the speaker to the bi-weekly “SDA colloquium presentations”. 40-50 researchers and students from SDA attended. The goal of his visit was to exchange experience and ideas on semantic web techniques specialized for process mining, including process modeling, classification algorithms and many more. Apart from presenting various use cases where process mining has helped scientists to get useful insights from raw data, Prof.dr.ir. Wil van der Aalst shared with our group future research problems and challenges related to this research area and gave a talk on “Learning Hybrid Process Models from Events: Process Mining for the Real World (Slides)”.
Abstract: Process mining provides new ways to utilize the abundance of event data in our society. This emerging scientific discipline can be viewed as a bridge between data science and process science: It is both data-driven and process-centric. Process mining provides a novel set of techniques to discover the real processes. These discovery techniques return process models that are either formal (precisely describing the possible behaviors) or informal (merely a “picture” not allowing for any form of formal reasoning). Formal models are able to classify traces (i.e., sequences of events) as fitting or non-fitting. Most process mining approaches described in the literature produce such models. This is in stark contrast with the over 25 available commercial process mining tools that only discover informal process models that remain deliberately vague on the precise set of possible traces. There are two main reasons why vendors resort to such models: scalability and simplicity.
In this talk, Prof. Van der Aalst will propose to combine the best of both worlds: discovering hybrid process models that have formal and informal elements. The discovered models allow for formal reasoning, but also reveal information that cannot be captured in mainstream formal models. A novel discovery algorithm returning hybrid Petri nets has been implemented in ProM and will serve as an example for the next wave of commercial process mining tools. Prof. Van der Aalst will also elaborate on his collaboration with industry. His research group at TU/e applied process mining in over 150 organizations, developed the open-source tool ProM, and influenced the 20+ commercial process mining tools available today.
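The abstract's remark that formal models can classify traces as fitting or non-fitting can be illustrated with a toy example. The sketch below uses a simple transition system as a stand-in for a formal process model. It is not a Petri net and not the hybrid discovery algorithm from the talk, and the activity names are hypothetical.

```python
# Hypothetical formal model: for each state, the activities allowed next.
# "start" and "end" are artificial markers, not recorded events.
MODEL = {
    "start": {"register"},
    "register": {"check", "reject"},
    "check": {"approve", "reject"},
    "approve": {"end"},
    "reject": {"end"},
}


def fits(trace, model=MODEL):
    """Return True if the trace can be replayed on the model from start to end."""
    state = "start"
    for activity in trace:
        if activity not in model.get(state, set()):
            return False  # the model does not allow this step
        state = activity
    # A fitting trace must also stop in a state from which "end" is allowed.
    return "end" in model.get(state, set())


print(fits(["register", "check", "approve"]))
print(fits(["register", "approve"]))
```

An informal model, by contrast, would only depict such arrows as a picture without committing to which sequences are allowed, so no such yes/no answer could be derived from it.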
During the meeting, SDA core research topics and main research projects were presented, and we tried to identify common ground for future collaborations with Prof. Van der Aalst and his research group.
As an outcome of this visit, we expect to strengthen our research collaboration networks with TU/e and in the future with RWTH Aachen University, mainly on combining semantic knowledge and distributed computing and analytics.
We are very pleased to announce that our group got a paper accepted for presentation at IEEE BigData 2017, which will be held on December 11th-14th, 2017 in Boston, MA, United States.
In recent years, “Big Data” has become a new ubiquitous term. Big Data is transforming science, engineering, medicine, healthcare, finance, business, and ultimately our society itself. The IEEE Big Data conference series, started in 2013, has established itself as the top-tier research conference in Big Data.
The 2017 IEEE International Conference on Big Data (IEEE Big Data 2017) will provide a leading forum for disseminating the latest results in Big Data Research, Development, and Applications.
“Implementing Scalable Structured Machine Learning for Big Data in the SAKE Project” by Simon Bin, Patrick Westphal, Jens Lehmann, and Axel-Cyrille Ngonga Ngomo.
Abstract: Exploration and analysis of large amounts of machine generated data requires innovative approaches. We propose a combination of Semantic Web and Machine Learning to facilitate the analysis. First, data is collected and converted to RDF according to a schema in the Web Ontology Language OWL. Several components can continue working with the data, to interlink, label, augment, or classify. The size of the data poses new challenges to existing solutions, which we solve in this contribution by transitioning from in-memory to database.
This work was supported in part by a research grant from the German Ministry for Finances and Energy under the SAKE project (Grant agreement No. 01MD15006E) and by a research grant from the European Union’s Horizon 2020 research and innovation programme under the SLIPO project (Grant agreement No. 731581).
Looking forward to seeing you at IEEE BigData 2017.
We are very pleased to announce that our paper “A Corpus for Complex Question Answering over Knowledge Graphs” by Priyansh Trivedi, Gaurav Maheshwari, Mohnish Dubey and Jens Lehmann has been selected as the Paper of the month at Fraunhofer IAIS. This award is given to publications that have a high innovation impact in the research field after a committee evaluation.
This research paper has been accepted at the ISWC 2017 main conference. It presents a large gold-standard question answering dataset over DBpedia and the accompanying framework used to create the dataset. This is the largest QA dataset, with 5000 questions and their corresponding SPARQL queries. The paper was nominated for the “Best Student Paper Award” in the resource track.
Abstract: Being able to access knowledge bases in an intuitive way has been an active area of research over the past years. In particular, several question answering (QA) approaches which allow to query RDF datasets in natural language have been developed as they allow end users to access knowledge without needing to learn the schema of a knowledge base and learn a formal query language. To foster this research area, several training datasets have been created, e.g. in the QALD (Question Answering over Linked Data) initiative. However, existing datasets are insufficient in terms of size, variety or complexity to apply and evaluate a range of machine learning based QA approaches for learning complex SPARQL queries. With the provision of the Large-Scale Complex Question Answering Dataset (LC-QuAD), we close this gap by providing a dataset with 5000 questions and their corresponding SPARQL queries over the DBpedia dataset. In this article, we describe the dataset creation process and how we ensure a high variety of questions, which should enable to assess the robustness and accuracy of the next generation of QA systems for knowledge graphs.
The paper and authors were honored for this publication in a special event at Fraunhofer Schloss Birlinghoven, Sankt Augustin, Germany.
The International Semantic Web Conference (ISWC) is the premier international forum where Semantic Web / Linked Data researchers, practitioners, and industry specialists come together to discuss, advance, and shape the future of semantic technologies on the web, within enterprises and in the context of public institutions.
We are very pleased to announce that we got 6 papers accepted at ISWC 2017 for presentation at the main conference. Additionally, we also had 6 Posters/Demo papers accepted.
Furthermore, we are happy to win the SWSA Ten-Year Best Paper Award, which recognizes the highest impact papers from the 6th International Semantic Web Conference in Busan, Korea in 2007.
Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, Zachary G. Ives. DBpedia: A Nucleus for a Web of Open Data
In addition to this award, we are very happy to announce that we won the Best Demo Award for the SANSA Notebooks:
“The Tale of Sansa Spark” by Ivan Ermilov, Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick Westphal, Claus Stadler, Simon Bin, Nilesh Chakraborty, Henning Petzka, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo and Hajira Jabeen.
Here are some further pointers in case you want to know more about SANSA:
- Website: http://sansa-stack.net/
- GitHub: https://github.com/SANSA-Stack
- Screencasts: https://www.youtube.com/watch?v=aHCoWmzUJlE&t=2s
The audience displayed enthusiasm during the demonstration, appreciating the work and asking questions regarding the future of SANSA, technical details and possible synergies with industrial partners and projects. Gezim Sejdiu and Jens Lehmann, who presented the demo, talked for 3+ hours non-stop (without even time to eat 😉 ).
In addition, our colleagues gave the following presentations:
- “Distributed Semantic Analytics using the SANSA Stack” by Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick Westphal, Claus Stadler, Ivan Ermilov, Simon Bin, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo and Hajira Jabeen.
Prof. Dr. Jens Lehmann presented work done in the SANSA project, with the main focus on offering a compact, scalable engine for the whole Semantic Web stack. The audience showed high interest in the project: the room was packed with around 150 attendees, many of them standing.
- “A Corpus for Complex Question Answering over Knowledge Graphs” by Priyansh Trivedi, Gaurav Maheshwari, Mohnish Dubey and Jens Lehmann.
Priyansh Trivedi presented LC-QuAD, a corpus for complex question answering over knowledge graphs. The paper presents a large gold-standard question answering dataset over DBpedia and the accompanying framework used to create the dataset. This is the largest QA dataset, with 5000 questions and their corresponding SPARQL queries. The paper was nominated for the “Best Student Paper Award” in the resource track.
- “Iguana : A Generic Framework for Benchmarking the Read-Write Performance of Triple Stores” by Felix Conrads, Jens Lehmann, Axel-Cyrille Ngonga Ngomo, Muhammad Saleem and Mohamed Morsey.
Felix Conrads presented his work on IGUANA, a benchmark system for measuring the read/write performance of triple stores. The framework allows running stress tests and was created in collaboration with the DICE group.
- “Sustainable Linked Data generation: the case of DBpedia” by Wouter Maroy, Anastasia Dimou, Dimitris Kontokostas, Ben De Meester, Jens Lehmann, Erik Mannens and Sebastian Hellmann.
In this work, in which we collaborated with the University of Ghent and InfAI Leipzig, we propose complex extraction workflows based on RML. As a major use case, they will be the base for the DBpedia Information Extraction Framework, which will become more declarative and easier to update.
- “Diefficiency Metrics: Measuring the Continuous Efficiency of Query Processing Approaches” by Maribel Acosta (KIT), Maria-Esther Vidal, York Sure-Vetter (KIT).
Maribel Acosta presented two experimental metrics related to the so-called “diefficiency” in incremental query answering. These metrics capture the behavior of incremental SPARQL query processors better than standard metrics, which often measure only when the first query result arrives, when the last query result arrives and the throughput in between. The metrics generalise to any method that produces results incrementally. This paper was nominated for the “Best Paper Award” in the Resource Track.
Here are some pointers:
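To illustrate why first-answer time, last-answer time and throughput can hide differences between incremental engines, here is a small sketch. It assumes that a diefficiency-style metric can be read as the area under the curve of answers produced over time up to a time t. This post does not define the metrics, so the step-function formulation below, the function name and the sample answer traces are all assumptions made for illustration.

```python
def dief_at_t(answer_times, t):
    """Area under the step-function count of answers over the interval [0, t].

    Each answer that arrived at time a <= t contributes (t - a) to the
    area, so engines that deliver answers earlier obtain a larger value.
    """
    return sum(t - a for a in answer_times if a <= t)


# Hypothetical answer traces: arrival time (in seconds) of each answer.
# Both engines share first answer (1 s), last answer (10 s) and total count.
engine_a = [1.0, 2.0, 3.0, 10.0]
engine_b = [1.0, 8.0, 9.0, 10.0]

print(dief_at_t(engine_a, 10.0))
print(dief_at_t(engine_b, 10.0))
```

Under these assumptions, the two engines are indistinguishable by the standard metrics, yet the first one accumulates a larger area because it delivers most answers earlier.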
- Visualization and Interaction for Ontologies and Linked Data (VOILA 2017)
Steffen Lohmann co-organized the International Workshop on Visualization and Interaction for Ontologies and Linked Data (VOILA 2017) for the third time at ISWC. Overall, more than 50 researchers and practitioners took part in this full-day event featuring talks, discussions, and tool demonstrations, including an interactive demo session. The workshop proceedings are published as CEUR-WS vol. 1947.
ISWC17 was a great venue to meet the community, create new connections, talk about current research challenges, share ideas and settle new collaborations. We look forward to the next ISWC conference.
Until then, meet us at SDA!
Maria Maleshkova is a postdoctoral researcher at the Karlsruhe Service Research Institute (KSRI) and the Institute of Applied Informatics and Formal Description Methods (AIFB) at the Karlsruhe Institute of Technology. Her research work covers the Web of Things (WoT) and semantics-based data integration topics as well as work in the area of the semantic description of Web APIs, RESTful services and their joint use with Linked Data. Prior to that, she was a Research Associate and a PhD student at the Knowledge Media Institute (KMi) at the Open University, where she worked on projects in the domain of SOA and Web Services.
Prof. Jens Lehmann invited the speaker to the bi-weekly “SDA colloquium presentations”. 40-50 researchers and students from SDA attended. The goal of her visit was to exchange experience and ideas on semantic web techniques specialized for smart services, including Internet of Things, Industry 4.0 technologies and many more. Apart from presenting various use cases where smart services have helped scientists to get useful insights from sensor data, Dr. Maleshkova shared with our group future research problems and challenges related to this research area and addressed the question “What is so smart about Smart Services?”
As an outcome of this visit, we expect to strengthen our research collaboration networks with KIT, mainly on combining semantic knowledge and distributed analytics as applied in SANSA.