SANSA Collaboration with Alethio

The SANSA team is excited to announce our collaboration with Alethio (a ConsenSys formation). SANSA is the major distributed, open-source solution for RDF querying, reasoning and machine learning. Alethio is building an Ethereum analytics platform that strives to provide transparency over what’s happening on the Ethereum p2p network, the transaction pool and the blockchain, and to enable “blockchain archaeology”. Their 5 billion triple dataset contains large-scale blockchain transaction data modelled as RDF according to the structure of the Ethereum ontology. EthOn – The Ethereum Ontology – is a formalization of the concepts/entities and relations of the Ethereum ecosystem, represented in RDF and OWL format. It describes all Ethereum terms, including blocks, transactions, contracts and nonces, as well as their relationships. Its main goal is to serve as a data model and learning resource for understanding Ethereum.

Alethio is interested in using SANSA as a scalable processing engine for their large-scale batch and stream processing tasks, such as querying the data in real time via SPARQL and performing related analytics on a wide range of subjects (e.g. asset turnover for sets of accounts, attack pattern detection or opcode usage statistics). At the same time, SANSA is interested in further industrial pilot applications for testing its scalability on larger datasets, maturing its code base and gaining experience with running the stack on production clusters. Specifically, the initial goal of Alethio was to load a 2TB EthOn dataset containing more than 5 billion triples and then perform several analytic queries on it with up to three inner joins. The queries are used to characterize movement between groups of Ethereum accounts (e.g. exchanges or investors in ICOs) and to aggregate their in- and out-value flow over the history of the Ethereum blockchain. The experiments were successfully run by Alethio on a cluster with up to 100 worker nodes and 400 cores, with a total of over 3TB of memory available.
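To give a flavour of the kind of aggregation these queries compute, here is a toy sketch in plain Python (our own illustration, not Alethio's actual pipeline; the accounts, groups and values are made up):

```python
# Illustrative sketch: aggregating the in- and out-value flow of labelled
# account groups over a list of transactions -- a miniature, in-memory
# version of the SPARQL aggregate queries described above.
from collections import defaultdict

# Hypothetical sample data: (from_account, to_account, value)
transactions = [
    ("0xaaa", "0xbbb", 10),
    ("0xbbb", "0xccc", 4),
    ("0xddd", "0xaaa", 7),
]

# Hypothetical grouping of accounts, e.g. exchanges vs. ICO investors
groups = {"0xaaa": "exchange", "0xbbb": "ico_investor",
          "0xccc": "exchange", "0xddd": "ico_investor"}

def value_flow(transactions, groups):
    """Sum incoming and outgoing value per account group."""
    flow = defaultdict(lambda: {"in": 0, "out": 0})
    for sender, receiver, value in transactions:
        flow[groups[sender]]["out"] += value
        flow[groups[receiver]]["in"] += value
    return dict(flow)

flows = value_flow(transactions, groups)
```

At Alethio's scale this grouping and summation happens distributed over the cluster via SANSA's SPARQL layer rather than in a single Python process.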

“I am excited to see that SANSA works and scales well to our data. Now, we want to experiment with more complex queries and tune the Spark parameters to gain the optimal performance for our dataset,” said Johannes Pfeffer, co-founder of Alethio. “I am glad that Alethio managed to run their workload and to see how well our methods scale to a 5 billion triple dataset,” added Gezim Sejdiu, PhD student at the Smart Data Analytics Group and SANSA core developer.

Parts of the SANSA team, including its leader Prof. Jens Lehmann as well as Dr. Hajira Jabeen, Dr. Damien Graux and Gezim Sejdiu, will continue the collaboration with the data science team of Alethio after these successful experiments. Beyond the initial tests above, we are jointly discussing possibilities for efficient stream processing in SANSA, further tuning of aggregate queries, as well as suitable Apache Spark parameters for efficient processing of the data. In the future, we want to join hands to optimize the performance of loading the data (e.g. reducing the disk footprint of datasets using compression techniques, which then allows more efficient SPARQL evaluation), handling streaming data, querying, and analytics in real time.

The SANSA team is happily looking forward to further interesting scientific research as well as industrial adoption.

Core model of the fork history of the Ethereum Blockchain modeled in EthOn

Paper accepted at ICLR 2018


We are very pleased to announce that our group, in collaboration with Fraunhofer IAIS, got a paper accepted for poster presentation at ICLR 2018: The Sixth International Conference on Learning Representations, which will be held on April 30 – May 3, 2018 at the Vancouver Convention Center, Vancouver, Canada.

The sixth edition of ICLR will offer many opportunities to present and discuss the latest advances in the performance of machine learning methods and deep learning. It takes a broad view of the field, including topics such as feature learning, metric learning, compositional modeling, structured prediction, reinforcement learning, and issues regarding large-scale learning and non-convex optimization. The range of domains to which these techniques apply is also very broad, from vision to speech recognition, text understanding, gaming, music, etc.

Here is the accepted paper with its abstract:

  • “On the regularization of Wasserstein GANs” by Henning Petzka, Asja Fischer, Denis Lukovnikov

    Abstract: Since their invention, generative adversarial networks (GANs) have become a popular approach for learning to model a distribution of real (unlabeled) data. Convergence problems during training are overcome by Wasserstein GANs which minimize the distance between the model and the empirical distribution in terms of a different metric, but thereby introduce a Lipschitz constraint into the optimization problem. A simple way to enforce the Lipschitz constraint on the class of functions, which can be modeled by the neural network, is weight clipping. Augmenting the loss by a regularization term that penalizes the deviation of the gradient norm of the critic (as a function of the network’s input) from one, was proposed as an alternative that improves training. We present theoretical arguments why using a weaker regularization term enforcing the Lipschitz constraint is preferable. These arguments are supported by experimental results on several data sets.
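For readers unfamiliar with the penalties being compared, the contrast can be sketched as follows (our notation, simplified; see the paper for the precise formulation): the widely used gradient penalty punishes any deviation of the critic's gradient norm from one, while a weaker, one-sided term punishes only norms exceeding one, i.e. actual violations of the Lipschitz constraint.

```latex
% Two-sided gradient penalty: penalizes any deviation of the norm from 1
\mathcal{L}_{\mathrm{GP}}
  = \lambda \, \mathbb{E}_{\hat{x}}\!\left[
      \left( \lVert \nabla_{\hat{x}} f(\hat{x}) \rVert_2 - 1 \right)^2
    \right]

% One-sided (weaker) penalty: penalizes only norms exceeding 1
\mathcal{L}_{\mathrm{LP}}
  = \lambda \, \mathbb{E}_{\hat{x}}\!\left[
      \max\!\left( 0,\; \lVert \nabla_{\hat{x}} f(\hat{x}) \rVert_2 - 1 \right)^{2}
    \right]
```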

This part of work is supported by WDAqua : Marie Skłodowska-Curie Innovative Training Network (GA no. 642795).

Looking forward to seeing you at ICLR 2018.

Invited talk by Svitlana Vakulenko

On Wednesday, 31st of January, Svitlana Vakulenko from the Institute for Information Business visited SDA and gave a talk entitled “Semantic Coherence for Conversational Browsing of a Knowledge Graph”.

Svitlana Vakulenko is a researcher at the Institute for Information Business at WU Wien and a PhD student in the Computer Science Department at TU Wien. Her research expertise lies in the area of machine learning for natural language processing. She has been involved in several international research projects and is currently working in the CommuniData FFG project, which aims to enhance the usability of Open Data and its accessibility for non-expert users in local communities. She is also involved in other projects, with a main focus on question answering over tabular data, Open Data conversational search, and exploratory search.

Prof. Jens Lehmann invited the speaker to the bi-weekly “SDA colloquium presentations”. The goal of her visit was to exchange experience and ideas on semantic search and dialogue system techniques specialized for question answering, including conversational search and exploratory search. Apart from presenting various use cases where semantic exploration over tabular data and open data has been used, she introduced a framework which models such conversational browsing systems. Svitlana shared with our group future research problems and challenges related to this research area and showed that semantic coherence can provide more insight and meaningful results in the conversational browsing scenario.

In this talk, she introduced the task of conversational browsing, which goes beyond question answering. She presented a framework that models the components of a conversational browsing system and discussed its application to the structure of a Knowledge Graph (KG). She mentioned that adding support for conversational browsing shall allow users to efficiently explore a structured search space, enabling future conversational search systems not only to answer a range of questions but also to help discover questions worth asking.

During the visit, SDA core research topics and main research projects were presented in an attempt to identify future collaborations with Svitlana and her research group.

As an outcome of this visit, we expect to strengthen our research collaboration network with the Institute for Information Business at WU Wien, mainly on combining semantic knowledge for exploratory and conversational search and applying those techniques to very large-scale KGs using our distributed analytics framework SANSA.

Papers and workshops accepted at TheWebConference (ex WWW) 2018

We are very pleased to announce that our group got 2 papers accepted for presentation at the 2018 edition of The Web Conference (27th edition of the former WWW conference), which will be held on April 23–27, 2018 in Lyon, France.
The 2018 edition of The Web Conference will offer many opportunities to present and discuss the latest advances in academia and industry, across research tracks, workshops, tutorials, an exhibition, posters, demos, a developers’ track, a W3C track, an industry track, a PhD symposium, challenges, a minute of madness, an international project track, W4A, a hackathon, the BIG web, and a journal track.

Here are the accepted papers with their abstracts:

  • “DL-Learner – A Framework for Inductive Learning on the Semantic Web” by Lorenz Bühmann, Patrick Westphal, Jens Lehmann and Simon Bin (journal paper track).

    Abstract: In this system paper, we describe the DL-Learner framework, which supports supervised machine learning using OWL and RDF for background knowledge representation. It can be beneficial in various data and schema analysis tasks with applications in different standard machine learning scenarios, e.g. in the life sciences, as well as Semantic Web specific applications such as ontology learning and enrichment. Since its creation in 2007, it has become the main OWL and RDF-based software framework for supervised structured machine learning and includes several algorithm implementations, usage examples and has applications building on top of the framework. The article gives an overview of the framework with a focus on algorithms and use cases.


  • “Why Reinvent the Wheel – Let’s Build Question Answering Systems Together” by Kuldeep Singh, Arun Sethupat Radhakrishna, Andreas Both, Saeedeh Shekarpour, Ioanna Lytra, Ricardo Usbeck, Akhilesh Vyas, Akmal Khikmatullaev, Dharmen Punjani, Christoph Lange, Maria-Esther Vidal, Jens Lehmann and Sören Auer (research track).

    Abstract: Modern question answering (QA) systems need to flexibly integrate a number of components specialised to fulfil specific tasks in a QA pipeline. Key QA tasks include Named Entity Recognition and Disambiguation, Relation Extraction, and Query Building. Since a number of different software components exist that implement different strategies for each of these tasks, a major challenge when building QA systems is how to select and combine the most suitable components into a QA system, given the characteristics of a question. We study this optimisation problem and train classifiers, which take features of a question as input and have the goal of optimising the selection of QA components based on those features. We then devise a greedy algorithm to identify the pipelines that include the suitable components and can effectively answer the given question. We implement this model within Frankenstein, a QA framework able to select QA components and compose QA pipelines. We evaluate the effectiveness of the pipelines generated by Frankenstein using the QALD and LC-QuAD benchmarks. These results not only suggest that Frankenstein precisely solves the QA optimisation problem but also enables the automatic composition of optimised QA pipelines, which outperform the static baseline QA pipeline. Thanks to this flexible and fully automated pipeline generation process, new QA components can be easily included in Frankenstein, thus improving the performance of the generated pipelines.
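The greedy composition step described in the abstract can be illustrated with a small sketch (our own simplification in plain Python, not Frankenstein's actual implementation; the per-component scores below are invented for illustration):

```python
# Minimal sketch of greedy QA pipeline composition: for each task in the
# pipeline, pick the component with the highest predicted score for the
# given question. In Frankenstein, these scores would come from trained
# classifiers over features of the question; here they are hard-coded.
predicted_scores = {
    "NED":               {"DBpediaSpotlight": 0.71, "TagMe": 0.64},
    "RelationExtraction": {"RelMatch": 0.58, "RNLIWOD": 0.62},
    "QueryBuilding":      {"SINA": 0.49, "NLIWOD-QB": 0.77},
}

def compose_pipeline(scores):
    """Greedily select the best-scoring component for each QA task."""
    return {task: max(candidates, key=candidates.get)
            for task, candidates in scores.items()}

best = compose_pipeline(predicted_scores)
```

For this (hypothetical) question the sketch selects DBpediaSpotlight, RNLIWOD and NLIWOD-QB; a different question would yield different scores and hence a different pipeline.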

This work was supported by grants from the EU FP7 Programme for the project GeoKnow (GA no. 318159), the German Research Foundation project GOLD, the German Ministry for Economic Affairs and Energy project SAKE (GA no. 01MD15006E), the European Union’s Horizon 2020 research and innovation programme for the project SLIPO (GA no. 731581), the EU Horizon 2020 R&I programme for the Marie Skłodowska-Curie action WDAqua (GA no. 642795), the Eurostars project QAMEL (E!9725), the European Union’s H2020 research and innovation action HOBBIT (GA no. 688227) and the CSA BigDataEurope (GA no. 644564).

Furthermore, we are pleased to announce the following workshops, which will be co-located with The Web Conference 2018.

Here are the accepted workshops with their short descriptions:

  • Linked Data on the Web (LDOW2018) by Tim Berners-Lee (W3C/MIT, USA), Sarven Capadisli (University of Bonn, Germany), Stefan Dietze (Leibniz Universität Hannover, Germany), Aidan Hogan (Universidad de Chile, Chile), Krzysztof Janowicz (University of California, Santa Barbara, USA) and Jens Lehmann (University of Bonn, Germany)
    The Web is developing from a medium for publishing textual documents into a medium for sharing structured data. This trend is fueled on the one hand by the adoption of the Linked Data principles by a growing number of data providers. On the other hand, large numbers of websites have started to semantically mark up the content of their HTML pages and thus also contribute to the wealth of structured data available on the Web. The 11th Workshop on Linked Data on the Web (LDOW2018) aims to stimulate discussion and further research into the challenges of publishing, consuming, and integrating structured data from the Web as well as mining knowledge from the global Web of Data.
    Topics of interest for the workshop include, but are not limited to, the following:

    • Web Data Quality Assessment
    • Web Data Cleansing
    • Integrating Web Data from Large Numbers of Data Sources
    • Mining the Web of Data
    • Linked Data Applications

Social media hashtag: #LDOW2018

  • Semantics, Analytics, Visualisation: Enhancing Scholarly Dissemination  by Alejandra Gonzalez-Beltran (Oxford e-Research Centre, University of Oxford, Oxford, UK), Francesco Osborne (Knowledge Media Institute, Open University, Milton Keynes, UK), Silvio Peroni (Department of Computer Science and Engineering, University of Bologna, Bologna, Italy), Sahar Vahdati (Smart Data Analytics, University of Bonn, Bonn, Germany)
    After the great success of the past three editions, we are pleased to announce SAVE-SD 2018, which aims to bring together publishers, companies and researchers from different fields (including Document and Knowledge Engineering, Semantic Web, Natural Language Processing, Scholarly Communication, Bibliometrics, Human-Computer Interaction, Information Visualisation, Bioinformatics, and Life Sciences) in order to bridge the gap between the theoretical/academic and practical/industrial aspects of scholarly data.
    The following topics will be addressed:

    • semantics of scholarly data, i.e. how to semantically represent, categorise, connect and integrate scholarly data, in order to foster reusability and knowledge sharing;
    • analytics on scholarly data, i.e. designing and implementing novel and scalable algorithms for knowledge extraction with the aim of understanding research dynamics, forecasting research trends, fostering connections between groups of researchers, informing research policies, analysing and interlinking experiments and deriving new knowledge;
    • visualisation of and interaction with scholarly data, i.e. providing novel user interfaces and applications for navigating and making sense of scholarly data and highlighting their patterns and peculiarities.

Looking forward to seeing you at The Web Conference (ex WWW) 2018.

Christmas Time at SDA – Time to look back at 2017

We are looking back at a busy and successful year 2017, full of new members, inspirational discussions, exciting conferences, many accepted papers and awards, as well as new software releases.

Below is a short summary of the main cornerstones for 2017:


The growth of the group in 2017

SDA is a new group, but not new in the field :). The group was founded by Prof. Dr. Jens Lehmann at the beginning of 2016. It has members at the University of Bonn, with associated researchers at the Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS) and the Institute for Applied Computer Science Leipzig. During 2017, the group has grown from 20 members to around 55 members (1 professor, 1 Akademischer Rat / assistant professor, 11 postdocs, 31 PhD students, 11 master students), as you can see on our team page.

An interesting future for AI and knowledge graphs

Artificial intelligence / machine learning and semantic technologies / knowledge graphs are central topics for SDA. Throughout the year, we have achieved a range of interesting research results, from internationally leading results in question answering over knowledge graphs, to scalable distributed querying, inference and analysis of large RDF datasets, to new perspectives on industrial data spaces and data integration. Amid the race for ever-improving achievements in AI, which go far beyond what many could have imagined 10 years ago, our researchers were able to deliver important contributions and continue to shape different sub-areas of the growing AI research landscape.

Papers accepted

We had 46 papers accepted at well-known conferences (e.g. The Web Conference 2018, WWW 2017, AAAI 2017, ISWC 2017, ESWC 2017, DEXA 2017, SEMANTiCS 2017, K-CAP 2017, WI 2017, KESW 2017, IEEE BigData 2017, NIPS 2017, TPDL 2017, ICSC 2018, ICEGOV 2018 and more). We estimate our articles to be cited around 3000+ times per year (based on Google Scholar profiles).


We received several awards in 2017.


Software releases

SANSA – an open-source data flow processing engine for performing distributed computation over large-scale RDF datasets – had two successful releases during 2017 (SANSA 0.3 and SANSA 0.2).

Among the funded projects, we were happy to launch the final release of the Big Data Europe platform – an open-source big data processing platform allowing users to install numerous big data processing tools and frameworks and create working data flow applications.

There were several other releases:

  • SML-Bench – A Structured Machine Learning benchmark framework 0.2 has been released.
  • WebVOWL – A Web-based Visualization of Ontologies had several releases in 2017.
  • – A Crowd-Sourcing platform for collaborative management of scholarly metadata reached coverage of more than 5K computer science conferences in 2017.

Furthermore, SDA deeply values team-bonding activities. :-) We often introduce fun activities that involve teamwork and team building. At our X-mas party, we enjoyed a very international and lovely dinner together, and played Secret Santa and a pantomime game.


Long-term team building through deeper discussions, genuine connections and healthy communication helps us to connect within the group!

Many thanks to all of you who have accompanied and supported us on this way!

Jens Lehmann on behalf of The SDA Research Team

SDA at NIPS 2017

We are very pleased to announce that our group got a paper accepted for presentation at the workshop on Optimal Transport and Machine Learning at NIPS 2017: The Thirty-first Annual Conference on Neural Information Processing Systems, which was held on December 4–9, 2017 in Long Beach, California.

The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS) is a single-track machine learning and computational neuroscience conference that includes invited talks, demonstrations, and oral and poster presentations of refereed papers.

Here is the accepted paper with its abstract:

“On the regularization of Wasserstein GANs” by Henning Petzka, Asja Fischer, Denis Lukovnikov.

Abstract: Since their invention, generative adversarial networks (GANs) have become a popular approach for learning to model a distribution of real (unlabeled) data. Convergence problems during training are overcome by Wasserstein GANs which minimize the distance between the model and the empirical distribution in terms of a different metric, but thereby introduce a Lipschitz constraint into the optimization problem. A simple way to enforce the Lipschitz constraint on the class of functions, which can be modeled by the neural network, is weight clipping. Augmenting the loss by a regularization term that penalizes the deviation of the gradient norm of the critic (as a function of the network’s input) from one, was proposed as an alternative that improves training. We present theoretical arguments why using a weaker regularization term enforcing the Lipschitz constraint is preferable. These arguments are supported by experimental results on several data sets.

This part of work is supported by WDAqua : Marie Skłodowska-Curie Innovative Training Network (GA no. 642795).


Paper accepted at ICEGOV 2018


We are very pleased to announce that our group got a paper accepted for presentation at the 11th International Conference on Theory and Practice of Electronic Governance (ICEGOV 2018), which will be held on April 4–6, 2018 in Galway, Ireland.

The conference focuses on the use of technology to transform the working of government and its relationships with citizens, businesses, and other non-state actors in order to improve public governance and its contribution to public policy and development (EGOV). It also promotes the interaction and cooperation between universities, research centres, governments, industries, and non-governmental organizations needed to develop the EGOV community. It is supported by a rich program of keynote lectures, plenary sessions, papers presentations within the thematic sessions, invited sessions, and networking sessions.

Here is the accepted paper with its abstract:

“Classifying Data Heterogeneity within Budget and Spending Open Data” by Fathoni A. Musyaffa, Fabrizio Orlandi, Hajira Jabeen, and Maria-Esther Vidal.

Abstract: Heterogeneity problems within open budget and spending datasets hinder effective analysis and consumption of these datasets. To understand the detailed types of heterogeneities present within open budget and spending datasets, we analyzed more than 75 datasets from different levels of public administration. We classified and enumerated these heterogeneities and examined whether the heterogeneities found can be represented using state-of-the-art data models designed for representing open budget and spending data. Finally, lessons learned are provided for public administrators as well as the technical and scientific communities.

This part of the work is supported by DAAD and partially by EU H2020 project no. 645833.

Looking forward to seeing you at ICEGOV2018.

SANSA 0.3 (Semantic Analytics Stack) Released

We are happy to announce SANSA 0.3 – the third release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.

You can find the FAQ and usage examples at

The following features are currently supported by SANSA:

  • Reading and writing RDF files in N-Triples, Turtle, RDF/XML, N-Quad format
  • Reading OWL files in various standard formats
  • Support for multiple data partitioning techniques
  • SPARQL querying via Sparqlify (with some known limitations until the next Spark 2.3.* release)
  • SPARQL querying via conversion to Gremlin path traversals (experimental)
  • RDFS, RDFS Simple, OWL-Horst (all in beta status), EL (experimental) forward chaining inference
  • Automatic inference plan creation (experimental)
  • RDF graph clustering with different algorithms
  • Rule mining from RDF graphs based on AMIE+
  • Terminological decision trees (experimental)
  • Anomaly detection (beta)
  • Distributed knowledge graph embedding approaches: TransE (beta), DistMult (beta), several further algorithms planned

Deployment and getting started:

  • There are template projects for SBT and Maven for Apache Spark as well as for Apache Flink available to get started.
  • The SANSA jar files are in Maven Central, i.e. in most IDEs you can just search for “sansa” to include the dependencies in Maven projects.
  • There is example code for various tasks available.
  • We provide interactive notebooks for running and testing code via Docker.

We want to thank everyone who helped to create this release, in particular the projects Big Data Europe, HOBBIT, SAKE, Big Data Ocean, SLIPO, QROWD and BETTER.

Greetings from the SANSA Development Team



Papers accepted at ICSC 2018

We are very pleased to announce that we got 3 papers accepted for presentation at the main conference of ICSC 2018, which will be held on Jan 31 – Feb 2, 2018 in California, United States.

The 12th IEEE International Conference on Semantic Computing (ICSC 2018) addresses Semantic Computing (SC): the derivation, description, integration, and use of semantics (“meaning”, “context”, “intention”) for all types of resources, including data, documents, tools, devices, processes and people. The scope of SC includes, but is not limited to, analytics, semantic description languages, integration (of data and services), interfaces, and applications including biomedicine, IoT, cloud computing, software-defined networks, wearable computing, context awareness, mobile computing, search engines, question answering, big data, multimedia, and services.

Here is the list of the accepted papers with their abstracts:

“SAANSET: Semi-Automated Acquisition of Scholarly Metadata using Platform” by Rebaz Omar, Sahar Vahdati, Christoph Lange, Maria-Esther Vidal and Andreas Behrend

Abstract: Researchers spend a lot of time in finding information about people, events, journals, and research areas related to topics of their interest. Digital libraries and digital scholarly repositories usually offer services to assist researchers in this task. However, every research community has its own way of distributing scholarly metadata.
Mailing lists provide an instantaneous channel and are often used for discussing topics of interest to a community of researchers, or to announce important information — albeit in an unstructured way. To bring structure specifically into the announcements of events and thus to enable researchers to, e.g., filter them by relevance, we present a semi-automatic crowd-sourcing workflow that captures metadata of events from call-for-papers emails into the semantic wiki. Evaluations confirm that our approach reduces the number of manual actions researchers need to perform to trace calls for papers received via mailing lists.

“Semantic Enrichment of IoT Stream Data On-Demand” by Farah Karim, Ola Al Naameh, Ioanna Lytra, Christian Mader, Maria-Esther Vidal, and Sören Auer

Abstract: Connecting the physical world to the Internet of Things (IoT) allows for the development of a wide variety of applications. Things can be searched, managed, analyzed, and even included in collaborative games.
Industries, health care, and cities are exploiting IoT data-driven frameworks to make these organizations more efficient, thus, improving the lives of citizens. For making IoT a reality, data produced by sensors, smart phones, watches, and other wearables need to be integrated; moreover, the meaning of IoT data should be explicitly represented. However, the Big Data nature of IoT data imposes challenges that need to be addressed in order to provide scalable and efficient IoT data-driven infrastructures. We tackle these issues and focus on the problems of describing the meaning of IoT streaming data using ontologies and integrating this data in a knowledge graph.
We devise DESERT, a SPARQL query engine able to on-Demand factorizE and Semantically Enrich stReam daTa in a knowledge graph.
Resulting knowledge graphs model the semantics or meaning of merged data in terms of entities that satisfy the SPARQL queries and relationships among those entities; thus, only data required for query answering is included in the knowledge graph.
We empirically evaluate the results of DESERT on SRBench, a benchmark of Streaming RDF data.
The experimental results suggest that DESERT allows for speeding up query execution while the size of the knowledge graphs remains relatively low.
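The on-demand idea behind DESERT can be caricatured with a toy sketch (our own simplification in plain Python; DESERT itself is a SPARQL engine over RDF streams, and the sensor records and sosa:-style predicates below are invented for illustration): instead of materializing every stream observation as triples, only the data the registered query actually needs enters the knowledge graph.

```python
# Toy sketch of query-driven ("on-demand") semantic enrichment: stream
# observations are turned into triples only if the registered query
# needs them, keeping the resulting knowledge graph small.
stream = [
    {"sensor": "s1", "property": "temperature", "value": 21.5},
    {"sensor": "s2", "property": "humidity",    "value": 0.4},
    {"sensor": "s3", "property": "temperature", "value": 19.0},
]

def enrich_on_demand(stream, wanted_property):
    """Materialize triples only for observations relevant to the query."""
    graph = set()
    for obs in stream:
        if obs["property"] == wanted_property:  # query-driven filter
            graph.add((obs["sensor"], "sosa:observes", obs["property"]))
            graph.add((obs["sensor"], "sosa:hasValue", obs["value"]))
    return graph

kg = enrich_on_demand(stream, "temperature")  # s2's observation never enters the KG
```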


“Shipping Knowledge Graph Management Capabilities to Data Providers and Consumers” by Omar Al-Safi, Christian Mader, Ioanna Lytra, Mikhail Galkin, Kemele Endris, Maria-Esther Vidal, and Sören Auer

Abstract: The amount of Linked Data both open, made available on the Web, and private, exchanged across companies and organizations, have been increasing in recent years. This data can be distributed in form of Knowledge Graphs (KGs), but maintaining these KGs is mainly the responsibility of data owners or providers. Moreover, building applications on top of KGs in order to provide, for instance, analytics, data access control, and privacy is left to the end user or data consumers. However, many resources in terms of development costs and equipment are required by both data providers and consumers, thus impeding the development of real-world applications over KGs. We propose to encapsulate KGs as well as data processing functionalities in a client-side system called Knowledge Graph Container, intended to be used by data providers or data consumers. Knowledge Graph Containers can be tailored to the target environments, ranging from Big Data to light-weight platforms. We empirically evaluate the performance and scalability of Knowledge Graph Containers with respect to state-of-the-art Linked Data management approaches. Observed results suggest that Knowledge Graph Containers increase the availability of Linked Data, as well as efficiency and scalability of various Knowledge Graph management tasks.


This work was supported by the European Union’s H2020 research and innovation programme BigDataEurope (GA no. 644564), WDAqua: Marie Skłodowska-Curie Innovative Training Network (GA no. 642795), InDaSpace: a German Ministry for Finances and Energy research grant, a DAAD scholarship, the European Commission with a grant for the H2020 project OpenAIRE2020 (GA no. 643410), (GA no. 645833), and the European Union’s Horizon 2020 IoT European Platform Initiative (IoT-EPI) project BioTope (GA no. 688203).

Looking forward to seeing you at ICSC 2018.

Wil van der Aalst visits SDA

Wil van der Aalst from Technische Universiteit Eindhoven (TU/e) visited the SDA group on the 29th of November 2017. Wil van der Aalst is a distinguished university professor at TU/e, where he is also the scientific director of the Data Science Center Eindhoven (DSC/e). Since 2003 he has held a part-time position at Queensland University of Technology (QUT). Currently, he is also a visiting researcher at Fondazione Bruno Kessler (FBK) in Trento and a member of the Board of Governors of Tilburg University. His personal research interests include process mining, Petri nets, business process management, workflow management, process modeling, and process analysis. Wil van der Aalst has published over 200 journal papers, 20 books (as author or editor), 450 refereed conference/workshop publications, and 65 book chapters. Many of his papers are highly cited (he is one of the most cited computer scientists in the world; according to Google Scholar, he has an h-index of 135 and has been cited 80,000 times) and his ideas have influenced researchers, software developers, and standardization committees working on process support. Next to serving on the editorial boards of over 10 scientific journals, he also plays an advisory role for several companies, including Fluxicon, Celonis, and ProcessGold. Van der Aalst received honorary degrees from the Moscow Higher School of Economics (Prof. h.c.), Tsinghua University, and Hasselt University (Dr. h.c.). He is also an elected member of the Royal Netherlands Academy of Arts and Sciences, the Royal Holland Society of Sciences and Humanities, and the Academy of Europe. Recently, he was awarded a Humboldt Professorship, Germany’s most valuable research award (five million euros), and will move to RWTH Aachen University at the beginning of 2018.

Prof. Jens Lehmann invited the speaker to the bi-weekly “SDA colloquium presentations”; 40–50 researchers and students from SDA attended. The goal of his visit was to exchange experience and ideas on semantic web techniques specialized for process mining, including process modeling, classification algorithms and more. Apart from presenting various use cases where process mining has helped scientists to get useful insights from raw data, Wil van der Aalst shared with our group future research problems and challenges related to this research area and gave a talk on “Learning Hybrid Process Models from Events: Process Mining for the Real World”.

Abstract: Process mining provides new ways to utilize the abundance of event data in our society. This emerging scientific discipline can be viewed as a bridge between data science and process science: It is both data-driven and process-centric. Process mining provides a novel set of techniques to discover the real processes. These discovery techniques return process models that are either formal (precisely describing the possible behaviors) or informal (merely a “picture” not allowing for any form of formal reasoning). Formal models are able to classify traces (i.e., sequences of events) as fitting or non-fitting. Most process mining approaches described in the literature produce such models. This is in stark contrast with the over 25 available commercial process mining tools that only discover informal process models that remain deliberately vague on the precise set of possible traces. There are two main reasons why vendors resort to such models: scalability and simplicity. 

In this talk, prof. Van der Aalst will propose to combine the best of both worlds: discovering hybrid process models that have formal and informal elements. The discovered models allow for formal reasoning, but also reveal information that cannot be captured in mainstream formal models. A novel discovery algorithm returning hybrid Petri nets has been implemented in ProM and will serve as an example for the next wave of commercial process mining tools. Prof. Van der Aalst will also elaborate on his collaboration with industry. His research group at TU/e applied process mining in over 150 organizations, developed the open-source tool ProM, and influenced the 20+ commercial process mining tools available today.

During the meeting, SDA core research topics and main research projects were presented in an attempt to identify future collaborations with Prof. Van der Aalst and his research group.

As an outcome of this visit, we expect to strengthen our research collaboration networks with TU/e and in the future with RWTH Aachen University, mainly on combining semantic knowledge and distributed computing and analytics.