Thesis Announcements

The Smart Data Analytics group is always looking for good students to write theses. The topics can be in one of several broad areas.

Please note that the list below is only a small sample of possible topics and ideas. Please contact us to discuss further, to find new topics, or to suggest a topic of your own.

Open Theses

Solving Mini Chess via Distributed Deep Reinforcement Learning and Proof Number Search

In 1956, a chess program beat a (novice) human opponent for the first time in a chess variant called Los Alamos chess. Los Alamos Chess has a 6×6 board and is therefore more suitable for computer programs – such variants are called “mini chess”. While chess engines have improved dramatically over the past decades, the game theoretic value of Los Alamos Chess (whether a draw or a win for White can be forced) is still unknown. In this thesis, the goal is to prove the game theoretic value of the game using a combination of a) deep reinforcement learning techniques for determining the best move in a particular position, b) position solvers based on proof number search, c) endgame tablebases generated for Los Alamos Chess and d) a distributed proof tree manager allowing the approach to be executed on a cluster. If the student is capable of implementing the approach on a small cluster, we will apply for execution on a large-scale cluster consisting of hundreds of computing nodes. Given that substantial resources may be invested in this thesis, students applying for it should have excellent grades, excellent programming skills and enthusiasm for game solving or chess.
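As a toy illustration of component b), the proof- and disproof-number propagation rules of proof number search can be sketched over a static AND/OR tree. A real solver expands the most-proving node iteratively and grows the tree as it goes; the tree encoding and function names here are illustrative only:

```python
# Minimal proof-number propagation on a static AND/OR game tree.
# Leaves are True (proved, e.g. a forced win) or False (disproved);
# internal nodes are ("OR", children) or ("AND", children).
INF = float("inf")

def pn_dn(node):
    """Return (proof number, disproof number) for a node."""
    if node is True:           # proved: nothing left to prove
        return 0, INF
    if node is False:          # disproved: nothing left to disprove
        return INF, 0
    kind, children = node
    values = [pn_dn(c) for c in children]
    pns = [pn for pn, _ in values]
    dns = [dn for _, dn in values]
    if kind == "OR":           # prove any one child; disprove all
        return min(pns), sum(dns)
    else:                      # AND: prove all children; disprove any one
        return sum(pns), min(dns)

# Tiny example: the root is proved via the AND branch of two wins.
tree = ("OR", [("AND", [True, True]), False])
print(pn_dn(tree))  # (0, inf): proof number 0 means the root is proved
```

In an actual solver, the (pn, dn) pair guides which frontier node to expand next, and components a) and c) above would supply leaf evaluations.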
Level: M; Contact: Prof. Dr. Jens Lehmann
Conversational AI & Climate Change
Climate Change is one of the major challenges humanity has to face. One of the key elements required for addressing its consequences is to provide objective information to citizens. Conversational AI methods (aka chatbots, speech assistants, dialogue systems, …) have become increasingly important over the past few years. In this master thesis, students can contribute to building a climate change chatbot by addressing one of the main challenges in deep learning and natural language processing below:
  • Improving reading comprehension techniques by transfer learning from large text corpora such as SQuAD to text documents describing climate change
  • Improving the translation of natural language questions to queries against (climate) knowledge graphs
  • Interactive question answering techniques for capturing user feedback
Level: M; Contact: Dr. Ricardo Usbeck, Prof. Dr. Jens Lehmann
Extracting Analogies from Clustered Ideas of Users
Research on crowd ideation platforms has shown that crowds have the potential to produce creative ideas. Traditional strategies for leveraging these ideas, such as organizing and describing them as text, require time and organizational effort in order to obtain results. Furthermore, such a process prevents the crowd from thinking outside the box. Therefore, the goal of this thesis is to develop an interface similar to [1] that allows the spatial arrangement and clustering of ideas. Next, analogy techniques are applied to the clustered ideas in order to capture the schema/model behind them. Finally, based on the schemas obtained from different clusters, the user can contribute his/her own idea.

Keywords: Open Innovation, Crowd Ideation, Creativity, Analogy, Semantics, Knowledge Graphs, Clustering, Information Extraction.
[1] IdeaHound: Improving Large-scale Collaborative Ideation with Crowd-Powered Real-time Semantic Modeling, 2016
Level: M; Contact: Dr. Abderrahmane Khiat
OWL Git-Diff: Towards an expressive Git-Diff tool for OWL Ontologies
Current git-diff tools use text-based comparison to differentiate between OWL ontologies, so git users are not well informed about the real semantic changes between those ontologies. OWL Git-Diff will be integrated within version-control systems to express the changes between OWL ontology versions in a user-friendly way. The Ecco tool [1] compares OWL 2 ontologies, but its output is XML-based text that cannot easily be integrated within such systems; it also does not support the various serialization formats of OWL. In contrast, OWL Git-Diff is intended to be well integrated within different version-control systems and to show expressive messages to users, allowing them to understand the various changes in an ontology. An extension of OWL Git-Diff could be to show the different changes in a sequence of ontology versions graphically.
[1] Gonçalves, Rafael S. et al. “Ecco: A Hybrid Diff Tool for OWL 2 Ontologies.” OWLED (2012).
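A minimal sketch of the intended kind of diff, here over two ontology versions reduced to sets of normalized axiom strings. A real implementation would parse and compare logical OWL axioms (e.g. via the OWL API) rather than strings; the function name and sample axioms are illustrative:

```python
# Toy axiom-level diff between two ontology versions, each given as a
# set of normalized axiom strings. Reporting added/removed axioms (not
# text lines) is what makes the diff semantic rather than textual.
def diff_axioms(old, new):
    return {
        "added":   sorted(new - old),
        "removed": sorted(old - new),
    }

v1 = {"SubClassOf(Cat Animal)", "SubClassOf(Dog Animal)"}
v2 = {"SubClassOf(Cat Animal)", "SubClassOf(Dog Mammal)"}
changes = diff_axioms(v1, v2)
print(changes)
# {'added': ['SubClassOf(Dog Mammal)'], 'removed': ['SubClassOf(Dog Animal)']}
```

From such a change set, the tool could then render commit-style messages ("Dog is now a subclass of Mammal instead of Animal") independently of the serialization format.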
Level: M; Contact: Ahmad Hemid, Dr. Abderrahmane Khiat
Generating Creative Ideas for Crowd Ideation Platforms
Subtopic 1: knowledge graph-based approach
Subtopic 2: machine learning-based approach
Research on creativity claims that new ideas come about through the combination of existing ideas, also known as combinational creativity; for example, bracelet + lifebuoy = Self Rescue Bracelet. However, it is quite challenging for computer-based approaches to generate valuable combinations and recognize their value (Boden 2009). The goal of this thesis is to develop a solution to obtain valuable combinations of ideas.
  • For the knowledge graph-based approach, the solution consists of (1) representing ideas more formally (close to a logic representation), (2) finding relations between ideas and (3) employing operators to produce new idea compositions.
  • For the machine learning-based approach, the solution consists of structuring ideas into Purpose and Mechanism and then employing text generation techniques, such as a Markov chain model, to produce new idea combinations.
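As a sketch of the machine learning-based approach, a bigram Markov chain trained on short idea descriptions can recombine their wording. The training corpus and function names below are illustrative toys, not part of any existing platform:

```python
import random

# Bigram Markov chain over short idea descriptions: learn word
# transitions from existing ideas, then sample new combinations.
def train(texts):
    model = {}
    for text in texts:
        words = text.split()
        for a, b in zip(words, words[1:]):
            model.setdefault(a, []).append(b)
    return model

def generate(model, start, max_len=8, seed=0):
    random.seed(seed)
    out = [start]
    while out[-1] in model and len(out) < max_len:
        out.append(random.choice(model[out[-1]]))
    return " ".join(out)

ideas = ["bracelet with rescue buoy", "smart bracelet with sensors"]
model = train(ideas)
print(generate(model, "bracelet"))  # e.g. "bracelet with sensors"
```

Because "with" was seen after "bracelet" in both ideas, the chain can cross over between them, which is exactly the kind of recombination the thesis would then evaluate for value.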

Keywords: Combinatorial Creativity, Crowd Ideation, Creativity, Machine Learning, Description Logics, Knowledge Graphs, Information Extraction.
Level: M; Contact: Dr. Abderrahmane Khiat
Selecting High Quality Ideas in Crowd Ideation Platforms
Crowd ideation platforms have emerged as a promising way to obtain innovative products. Their strength lies in crowds with different backgrounds generating many ideas, among which potentially innovative solutions to social problems can be sparked. However, these platforms introduce a new challenge: many ideas are duplicates, similar to each other, or too trivial to add value to products, and filtering out low-quality ideas manually has proved infeasible. The goal of this thesis is to develop an automatic solution that evaluates the creativity of ideas.

Keywords: Open Innovation, Crowd Ideation, Creativity Measurements, Idea Ranking, Machine Learning, Information Extraction.
Level: M; Contact: Dr. Abderrahmane Khiat
Applying Knowledge graph embeddings for Context-aware Question Answering
The task of Question Answering faces new challenges when applied in scenarios with frequently changing information, such as a driving car. Current semantic parsing approaches rely on extracting named entities and corresponding predicates from the input and matching them with patterns in static Knowledge Bases. So far, there has been little to no effort to include knowledge about the environment (i.e. context) in the QA pipeline. To improve the performance of so-called Context-aware QA, you will work on solutions that adapt different graph embedding approaches to the QA process. Please refer to the job description for further information.
Level: M; Contact: Jewgeni Rose
Smart Home – Acquisition of Individual Knowledge in the Customer Service Environment
Refer to the Miele job description for further information.
Level: M; Contact: Giulio Napolitano
Scalable graph kernels for RDF data
Develop graph kernels for RDF data and use traditional machine learning methods for classification.
Level: B, M; Contact: Dr. Hajira Jabeen
Distributed Knowledge graph Clustering
Clustering of heterogeneous data contained in knowledge graphs.
Level: B, M; Contact: Dr. Hajira Jabeen
Distributed Anomaly Detection in RDF
Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. While numerous techniques have been developed over the past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points, graph data is becoming ubiquitous, and techniques for structured graph data have recently become a focus. As objects in graphs have long-range correlations, a suite of novel techniques has been developed for anomaly detection in graph data.
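A very simple baseline in this direction flags nodes whose degree is a statistical outlier. Real graph anomaly detectors exploit much richer structure (e.g. egonet features), so this is only a sketch with an illustrative edge list:

```python
from statistics import mean, stdev

# Flag nodes whose degree deviates strongly (z-score) from the mean
# degree -- a crude stand-in for structural graph anomaly scores.
def degree_outliers(edges, z_thresh=2.0):
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    mu, sigma = mean(degree.values()), stdev(degree.values())
    return {n for n, d in degree.items() if sigma and (d - mu) / sigma > z_thresh}

# A star graph: one hub connected to 20 nodes, plus one extra edge.
edges = [("hub", f"n{i}") for i in range(20)] + [("n1", "n2")]
print(degree_outliers(edges))  # {'hub'}
```

In an RDF setting, the edges would come from triples, and per-predicate degree distributions (or embedding-based scores) would replace the single global degree.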
Level: B, M; Contact: Dr. Hajira Jabeen
PyTorch Integration in Spark
PyTorch is an open source deep learning platform that provides a seamless path from research prototyping to production deployment. This thesis integrates PyTorch into Apache Spark following existing guidelines and runs KG embedding models as preliminary tests.
Level: B; Contact: Dr. Hajira Jabeen
Use of Ontology information in Knowledge graph embeddings
Knowledge graph embedding represents entities as vectors in a common vector space. The objective of this work is to extend existing KG embedding models, such as TransE or ConvE, to include schema information, and to compare the performance.
Level: B, M; Contact: Dr. Hajira Jabeen
Negative sampling in Knowledge graph embeddings
Knowledge graph embedding represents entities as vectors in a common vector space. The objective of this work is to extend existing KG embedding models, such as TransE or ConvE, with more efficient and effective negative sampling methods, and to compare the performance.
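The corruption step common to TransE-style training can be sketched as follows, here in the "filtered" setting where corruptions that are themselves known true triples are skipped. The entity list, names, and uniform sampling are illustrative choices; the thesis would replace uniform sampling with smarter strategies:

```python
import random

# Generate k negative triples by corrupting the head or tail of a true
# triple with a random entity, skipping corruptions that are known
# true triples ("filtered" setting).
def negative_samples(triple, entities, known, k=2, seed=0):
    rng = random.Random(seed)
    h, r, t = triple
    negatives = []
    while len(negatives) < k:
        e = rng.choice(entities)
        corrupt = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        if corrupt not in known and corrupt != triple:
            negatives.append(corrupt)
    return negatives

known = {("bonn", "locatedIn", "germany")}
entities = ["bonn", "germany", "paris", "france"]
negs = negative_samples(("bonn", "locatedIn", "germany"), entities, known)
```

A TransE-style loss would then push the score of the true triple below the scores of these corrupted ones by a margin; "more effective" sampling means producing corruptions that are hard rather than trivially wrong.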
Level: B, M; Contact: Dr. Hajira Jabeen
Entity Resolution
Entity resolution is the task of identifying all mentions that represent the same real-world entity within a knowledge base or across multiple knowledge bases. We address the problem of performing entity resolution on RDF graphs containing multiple types of nodes, using the links between instances of different types to improve the accuracy.
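A baseline matching step compares attribute tokens with Jaccard similarity; the records, threshold, and function names below are illustrative, and the thesis would go beyond this by exploiting the links between instances of different types:

```python
# Jaccard similarity over attribute tokens as a baseline matching step
# between two knowledge bases (dicts of id -> label string).
def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def match(records_a, records_b, threshold=0.5):
    return [(i, j) for i, ra in records_a.items()
                   for j, rb in records_b.items()
                   if jaccard(ra, rb) >= threshold]

kb1 = {"a1": "Smart Data Analytics Bonn"}
kb2 = {"b1": "smart data analytics group bonn", "b2": "machine learning lab"}
print(match(kb1, kb2))  # [('a1', 'b1')]
```

In the RDF setting, the pairwise score could additionally reward pairs whose neighbours (via typed links) have already been matched, propagating resolution decisions through the graph.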
Level: B, M; Contact: Dr. Hajira Jabeen
Rule/Concept Learning in Knowledge Graphs
In the Semantic Web context, OWL ontologies play the key role of domain conceptualizations, while the corresponding assertional knowledge is given by the heterogeneous Web resources referring to them. However, being strongly decoupled, ontologies and assertional bases can be out of sync. In particular, an ontology may be incomplete, noisy, and sometimes inconsistent with the actual usage of its conceptual vocabulary in the assertions. Despite such problematic situations, we aim at discovering hidden knowledge patterns from ontological knowledge bases, in the form of multi-relational association rules, by exploiting the evidence coming from the (evolving) assertional data. The final goal is to make use of such patterns for (semi-)automatically enriching/completing existing ontologies.
Level: B, M; Contact: Dr. Hajira Jabeen
Intelligent Semantic Creativity: Culinarian
Computational creativity is an emerging branch of artificial intelligence that places computers at the center of the creative process. We aim to create a computational system that creates flavorful, novel, and perhaps healthy culinary recipes by drawing on big data techniques. It brings analytics algorithms together with disparate data sources from culinary science.
In its most ambitious form, the system would employ human-computer interaction for rating different recipes and model the human cognitive ability for the cooking process.
The end result is going to be an ingredient list, proportions, and a directed acyclic graph representing a partial ordering of the culinary recipe steps.
Level: B, M; Contact: Dr. Hajira Jabeen
IoT Data Catalogues
While platforms and tools such as Hadoop and Apache Spark allow for efficient processing of Big Data sets, it becomes increasingly challenging to organize and structure these data sets. Data sets take various forms, ranging from unstructured data in files to structured data in databases. Often the data sets reside in different storage systems, ranging from traditional file systems, over Big Data file systems (HDFS), to heterogeneous storage systems (S3, RDBMS, MongoDB, Elasticsearch, …). At AGT International, we are dealing primarily with IoT data sets, i.e. data sets that have been collected from sensors and that are processed using Machine Learning-based (ML) analytic pipelines. The number of these data sets is rapidly growing, increasing the importance of generating metadata that captures both technical (e.g. storage location, size) and domain metadata and correlates the data sets with each other, e.g. by storing provenance (data set x is a processed version of data set y) and domain relationships.
Level: M; Contact: Dr. Martin Strohbach, Prof. Dr. Jens Lehmann


(Work at AGT International in Darmstadt)

Understanding Short Text: a Named Entity Recognition Perspective
Named Entity Recognition (NER) models play an important role in the Information Extraction (IE) pipeline. However, despite the decent performance of NER models on newswire datasets, conventional approaches are to date unable to successfully identify classical named-entity types in short/noisy texts. This thesis will thoroughly investigate NER in microblogs and propose new algorithms to surpass the current state-of-the-art models in this research area.
Level: B, M; Contact: Diego Esteves
Multilingual Fact Validation Algorithms
DeFacto (Deep Fact Validation) is an algorithm that validates facts by finding trustworthy sources for them on the Web. Currently, it supports three main languages (en, de and fr). The goal of this thesis is to explore and implement alternative information retrieval (IR) methods to minimize the dependency on external tools for verbalizing natural language patterns. As a result, we expect to enhance the algorithm's performance by expanding its coverage.
Level: B, M; Contact: Diego Esteves
Experimental Analysis of Class CS Problems
In this thesis, we explore unsolved problems of theoretical computer science with machine learning methods, especially reinforcement learning.
Level: B, M; Contact: Diego Esteves
Generating Property Graphs from RDF using a semantic preserving conversion approach
Graph databases have been on the rise over the last decade due to their strength in the mining and analysis of complex networks. Property Graphs (PGs), one of the graph data models used by graph databases, are suitable for representing many real-life application scenarios. They allow complex networks (e.g. social networks, e-commerce) and interactions to be represented efficiently. In order to leverage this advantage of graph databases, conversions from other data models to property graphs are a current area of research. The aim of this thesis is to (i) propose a novel systematic conversion approach for generating PGs from RDF (another graph data model) and (ii) carry out exhaustive experiments on both RDF and PG datasets with respect to their native storage databases (i.e. graph DBs vs. triplestores). This will make it possible to identify the types of queries for which graph databases offer performance advantages and, ideally, to adapt the storage mechanism accordingly. The outcome of this work will be integrated into the LITMUS framework, an open extensible framework for benchmarking diverse Data Management Solutions.
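One naive instance of such a conversion maps literal-valued triples to node properties and IRI-valued triples to edges. The prefix-based literal test below is a simplification of proper RDF term handling (a real approach would use a parser and also handle blank nodes, datatypes, and reification):

```python
# Naive RDF-to-property-graph mapping: triples whose object is a
# literal become node properties; triples between IRIs become edges.
def rdf_to_pg(triples):
    nodes, edges = {}, []
    for s, p, o in triples:
        nodes.setdefault(s, {})
        if isinstance(o, str) and o.startswith('"'):   # literal object
            nodes[s][p] = o.strip('"')
        else:                                          # IRI object
            nodes.setdefault(o, {})
            edges.append((s, p, o))
    return nodes, edges

triples = [
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:alice", "foaf:knows", "ex:bob"),
]
nodes, edges = rdf_to_pg(triples)
print(nodes)  # {'ex:alice': {'foaf:name': 'Alice'}, 'ex:bob': {}}
print(edges)  # [('ex:alice', 'foaf:knows', 'ex:bob')]
```

The semantics-preserving part of the thesis lies precisely in the cases this sketch ignores: deciding systematically which triples become properties, which become edges, and how to avoid losing information either way.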
Level: B; Contact: Harsh Thakkar
Relation Linking for Question Answering in German
The task of relation linking in question answering is the identification of the relation (predicate) in a given question and its linking to the corresponding entity in a knowledge base. It is an important step in question answering, which allows us afterwards to build formal queries against, e.g., a knowledge graph. Most of the existing question answering systems focus on the English language and very few question answering components support other languages like German. The goal of this thesis is to identify from the literature as well as develop relation extraction tools that could be adapted to work for German questions.
Level: B, M; Contact: Dr. Ioanna Lytra
Hidden Research Community Detection 2.0
Scientific communities are well known as research fields; however, researchers also communicate in hidden communities built around co-authorship, topic interests, attended events, etc. In this thesis, which is the second phase of an already completed master thesis, we will focus on identifying more such communities by defining similarity metrics over the objects of a research knowledge graph that we will build from several datasets.
Level: M; Contact: Sahar Vahdati
Movement of Research Results and Education through OpenCourseWare
This thesis is research-based work in which we will build a knowledge graph for OCW (online courses) and the development of the research topics covered in this KG. We will use an analytics tool to define interesting queries that give us insights into the research question of how well aligned research is with teaching material.
Level: B, M; Contact: Sahar Vahdati
Development and implementation of a semantic Configuration- and Change-Management
This thesis is offered in cooperation with Schaeffler Technologies AG & Co. KG. A solid knowledge of OWL and RDF is needed, as well as a general interest in configuration and change management. The thesis can be written, and the work carried out, in English. A more detailed description in German is available here (pdf).
Level: M; Contact: Niklas Petersen
An Approach for (Big) Product Matching
Consider comparing the same product data from thousands of e-shops. Two main challenges make the comparison difficult. First, the completeness of the product specifications used for organizing the products differs across e-shops. Second, the way information about products and their taxonomy is represented is very diverse. To improve the consumer experience, e.g. by allowing offers by different vendors to be compared easily, approaches for product integration on the Web are needed.
The main focus of this work is on data modeling and semantic enrichment of product data in order to obtain an effective and efficient product matching result.
Level: M; Contact: Dr. Giulio Napolitano, Debanjan Chaudhuri
Learning word representations for out-of-vocabulary words using their contexts.
Natural language processing (NLP) research has recently witnessed a significant boost following the introduction of word embeddings as proposed by Mikolov et al. (2013) (Distributed Representations of Words and Phrases and their Compositionality). However, one of the biggest challenges of word embeddings trained with the vanilla neural-net architecture, with words as input and contexts as outputs, is the handling of out-of-vocabulary (OOV) words, as the model fails badly on unseen words. In this project we suggest an architecture using only the proposed word2vec model: given an unseen word, we predict a distributed embedding for it from the contexts it is used in, using the matrix that has learned to predict a context given a word.
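A crude stand-in for the proposed architecture is to average the embeddings of the observed context words; the real proposal would instead push the contexts through the learned context-prediction matrix. The vectors, words, and function name below are toy values:

```python
# Estimate an embedding for an unseen word by averaging the embeddings
# of the in-vocabulary words it co-occurs with. Words without a vector
# (themselves OOV) are simply skipped.
def oov_embedding(contexts, vectors):
    dims = len(next(iter(vectors.values())))
    acc, n = [0.0] * dims, 0
    for word in contexts:
        if word in vectors:
            acc = [a + v for a, v in zip(acc, vectors[word])]
            n += 1
    return [a / n for a in acc] if n else acc

vectors = {"climate": [1.0, 0.0], "change": [0.0, 1.0]}
vec = oov_embedding(["climate", "change", "unknownword"], vectors)
print(vec)  # [0.5, 0.5]
```

The project's contribution would be to replace this unweighted average with the inverse use of the trained word2vec output matrix, so that the predicted vector lives in the same space as the trained input embeddings.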
Level: M; Contact: Dr. Giulio Napolitano, Debanjan Chaudhuri
Reflecting on the User Experience Challenges of CEUR Make GUI and Harnessing the Experience: CEUR Make GUI
CEUR Make GUI is a graphical user interface supporting the workflow of publishing open access proceedings of scientific workshops via CEUR-WS.org, one of the largest open access repositories. For more details on the topic, please go through the publications mentioned below. In this thesis we aim to address the user experience challenges documented in the first part of the thesis cited below and to refactor the current code base. We would like to address challenges such as the permissiveness of the input fields and the displaying of feedback through well-defined UI patterns.
Email: Muhammad.rohan.ali.asmat@iais.fraunhofer.de
Level: B, M; Contact: Rohan Asmat
Developing Collaborative Workspace for Workshop Editors: CEUR Make GUI
CEUR Make GUI is a graphical user interface supporting the workflow of publishing open access proceedings of scientific workshops via CEUR-WS.org, one of the largest open access repositories. For more details on the topic, please go through the publications mentioned below. In this thesis we aim to produce a collaborative workspace for editing workshop proceedings and to enhance the user experience of the software. Based on the development of the collaborative workspace, we would also like to address user experience and collaborative/cooperative workspace challenges through a structured protocol.
Email: Muhammad.rohan.ali.asmat@iais.fraunhofer.de
Level: B, M; Contact: Rohan Asmat
Provide tools for LaTeX leveraging semantic web standards
When articles are written (and submitted to peer review), one of the biggest fears of researchers is forgetting some state-of-the-art work. Indeed, articles should be positioned among already existing ones to show that they are new. However, a specific relevant paper can sometimes be forgotten by the authors. To avoid this unpleasant situation, one could imagine a LaTeX package able to check whether any citation is missing from a manuscript. To do so, several things might be implemented: (i) extending the existing pdf2rdf tool with a tex2rdf module; (ii) generating BibTeX code from these RDF data; (iii) extracting RDF data from the reference sections of articles; (iv) aggregating all these RDF data and loading the dataset into a store; (v) developing a LaTeX package able to automatically query this endpoint to point out possibly missing references.
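At its simplest, the check in step (v) would compare the keys cited in a manuscript against a candidate set returned by the endpoint. Here the candidate set is hard-coded and the function name is illustrative:

```python
import re

# Extract the keys used in \cite-like commands (\cite, \citep, \citet,
# ...) from LaTeX source, handling comma-separated key lists.
def cited_keys(tex):
    keys = set()
    for group in re.findall(r"\\cite\w*\{([^}]*)\}", tex):
        keys.update(k.strip() for k in group.split(","))
    return keys

tex = r"As shown in \cite{lehmann2015dbpedia, auer2007} and \citep{mikolov2013}."
candidates = {"lehmann2015dbpedia", "auer2007", "mikolov2013", "bordes2013"}
missing = candidates - cited_keys(tex)
print(missing)  # {'bordes2013'}
```

In the envisioned package, `candidates` would instead come from a SPARQL query over the aggregated RDF reference data, ranked by relevance to the manuscript's topic.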
Level: M; Contact: Dr. Damien Graux

RDF2Résumé
In a nutshell, this topic aims at leveraging Semantic Web standards in the context of professional descriptions. Indeed, to describe oneself, the common approach is to write a CV (or a web page, or both) which should in theory fit the targeted goal. More generally, this kind of presentation can also be produced by companies or large groups to present a whole team for tenders or proposals…

More precisely, the topic offers the opportunity to automate the tasks of extracting and enriching information about people to make it fit a specific goal.

In detail, RDF2Résumé would involve the following distinct steps:
A] Converting RDF to professional material:
1. design a CV vocabulary to be used afterwards;
2. provide a simple tool (a simple piece of software, such as a script) able to generate, e.g., LaTeX/HTML code from an RDF file compliant with the aforementioned vocabulary;
3. in parallel, propose several final templates;
4. and finally, realize a basic user interface, optionally with the possibility of automatically switching languages.
B] Enabling semantic enrichment of the profiles using external sources.
C] Linking the tool with an already realized job platform to automatically match job seekers with job offers.

Level: M; Contact: Dr. Damien Graux
A block-chain forecast model: Extracting and analyzing users' most-used smart-contract features to predict the block-chain's future
In recent years, the block-chain concept [1] has become a key technology for recording transactions between two parties while providing several guarantees. On top of the general block-chain architecture, distributed computing platforms have emerged, such as Ethereum [2], which gives the opportunity to build and deploy smart contracts [3]: automatic actions that can be triggered by specific events in the chain. By construction, block-chain-related technologies are open source and publicly available, which allows any user to check, for instance, the complete history of the chain or some specific events. Moreover, the structure of the chain itself can be represented as a large knowledge graph. The goal of this study is to crawl the Ethereum smart-contract history (leveraging the knowledge graph introduced above) in order to compute statistics and then try to predict the future of the chain. To do so, several steps have to be taken:
1. retrieving information from the large RDF graph representing the chain using the SPARQL query language;
2. understanding the way smart contracts are scripted;
3. deploying ML algorithms on these data excerpts;
4. drawing conclusions.
Level: M; Contact: Dr. Damien Graux
Semantic Integration Approach for Big Data
Dimension = Volume & Variety
Current Big Data platforms do not provide a semantic integration mechanism, especially in the context of integrating semantically equivalent entities that do not share an ID.
In the context of this thesis, the student will evaluate and make the necessary extensions to the MINTE integration framework in a Big Data scenario.
Datasets: We are going to work with biomedical datasets.
Programming Language: Scala
Frameworks: Ideally integrated into the SANSA platform, but this is not a must.
References:
  • Synthesizing Knowledge Graphs from web sources with the MINTE framework
  • Semantic Join Operator to Integrate Heterogeneous RDF Graphs
  • MINTE: semantically integrating RDF graphs
Level: M; Contact: Diego Collarana
Semantic Similarity Metric for Big Data
Dimension = Volume, Variety
Identifying when two entities, coming from different data sources, are the same is a key step in the data analysis process.
The goal of this thesis topic is to evaluate the performance of the semantic similarity metrics we have developed in a Big Data scenario.
We will build a framework of operators around the semantic similarity functions and evaluate them. We are going to work with the following metrics: GADES, GARUM, and FCA (new, to be developed); see the references.
Datasets: We are going to work with biomedical datasets.
Programming Language: Scala, Java
Frameworks: Ideally integrated into the SANSA platform, but this is not a must.
References:
  • A Semantic Similarity Measure Based on Machine Learning and Entity Characteristics
  • A Graph-based Semantic Similarity Measure
Level: M; Contact: Diego Collarana
Embeddings for RDF Molecules
The use of embeddings in the NLP community is already common practice, and similar efforts are currently under way in the Knowledge Graph community.
Several approaches, such as TransE, RDF2Vec, etc., propose models to create embeddings of RDF molecules.
The goal of this thesis is to extend the similarity metric MateTee (see references) with state-of-the-art approaches to create embeddings of Knowledge Graph entities.
Datasets: We are going to work with Knowledge Graphs such as DBpedia and Drugbank.
Programming Language: Python
References:
  • A Semantic Similarity Metric Based on Translation Embeddings for Knowledge Graphs
Level: M; Contact: Diego Collarana
Hybrid Embedding for RDF Molecules
Following on from the topic discussed above, the goal of this thesis is to research hybrid embeddings, i.e. combining word embeddings with Knowledge Graph embeddings.
This is more foundational research.
Programming Language: Python
No references for the moment; part of the work is to find related literature.
Level: M; Contact: Diego Collarana
RDF Molecules Browser
Foster serendipitous discoveries by browsing RDF molecules of data, with a special focus on facets/filters that promote knowledge discovery not initially intended.
Programming Language: ReactJS
References:
  • A Faceted Reactive Browsing Interface for Multi RDF Knowledge Graph Exploration
  • A Serendipity-Fostering Faceted Browser for Linked Data
  • Fostering Serendipitous Knowledge Discovery using an Adaptive Multigraph-based Faceted Browser
Level: M; Contact: Diego Collarana

Ongoing Theses

  • Query Decomposer and Optimizer for querying scientific datasets
    Supervisor: Dr. Ioanna Lytra; Level: M; Year: 2019
  • Knowledge Data Containers with Access Control and Security Capabilities
    Supervisor: Dr. Ioanna Lytra; Level: M; Year: 2019
  • RDF compression techniques
    Supervisors: Dr. Damien Graux, Gezim Sejdiu; Level: M; Year: 2019

Completed Theses