The Smart Data Analytics group is always looking for good students to write theses. The topics can be in one of the following broad areas:
- Distributed Semantic Analytics
- Semantic Question Answering
- Structured Machine Learning
- Deep Learning
- Software Engineering for Data Science
- Semantic Data Management
- Knowledge Extraction and Validation
Please note that the list below is only a small sample of possible topics and ideas. Please contact us to discuss further, to find new topics, or to suggest a topic of your own.
|Distributed in-memory SPARQL Processing
SPARQL queries search for specified patterns in RDF data using triple patterns as building blocks. Although many different solutions for efficient querying have been proposed in the past, we want to explore novel tensor-based RDF representations and use them for fast and scalable querying on large-scale clusters. We will use a computational in-memory framework for distributed SPARQL query answering, based on the notion of the degree of freedom of a triple. This algorithm relies on a general model of RDF graphs based on the first principles of linear algebra, in particular on tensorial calculus.
|B, M||Dr. Hajira Jabeen|
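To give a feeling for the idea, the following minimal Python sketch stores an RDF graph as a sparse rank-3 boolean tensor and answers a single triple pattern against it. It is only an illustration under our own assumptions: the class `TripleTensor`, the function `degrees_of_freedom`, and the toy data are hypothetical, the code is single-machine rather than distributed, and we assume here that the degree of freedom of a triple pattern is simply its number of unbound positions.

```python
class TripleTensor:
    """Sparse rank-3 boolean tensor: cell (s, p, o) is 1 iff the triple exists."""

    def __init__(self, triples):
        self.ids = {}       # term -> integer index
        self.terms = []     # integer index -> term
        self.cells = set()  # coordinates of the non-zero cells
        for s, p, o in triples:
            self.cells.add((self._encode(s), self._encode(p), self._encode(o)))

    def _encode(self, term):
        # Assign the next free index to a term seen for the first time.
        if term not in self.ids:
            self.ids[term] = len(self.terms)
            self.terms.append(term)
        return self.ids[term]

    def match(self, s=None, p=None, o=None):
        """Return all triples matching the pattern; None marks a variable."""
        pattern = []
        for term in (s, p, o):
            if term is None:
                pattern.append(None)
            elif term in self.ids:
                pattern.append(self.ids[term])
            else:
                return []  # the constant never occurs in the data
        return sorted(
            tuple(self.terms[i] for i in cell)
            for cell in self.cells
            if all(q is None or q == c for q, c in zip(pattern, cell))
        )


def degrees_of_freedom(s=None, p=None, o=None):
    """Number of unbound positions in a triple pattern (assumed definition)."""
    return sum(term is None for term in (s, p, o))


data = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "age", "30"),
]
tensor = TripleTensor(data)
print(tensor.match(p="knows"))
print(degrees_of_freedom(p="knows"))
```

A query planner could, for example, evaluate the patterns with the fewest degrees of freedom first, since they are the most selective.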
Entity resolution is the task of identifying all mentions that represent the same real-world entity within a knowledge base or across multiple knowledge bases. We address the problem of performing entity resolution on RDF graphs containing multiple types of nodes, using the links between instances of different types to improve the accuracy.
|B, M||Dr. Hajira Jabeen|
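The role of links between nodes of different types can be illustrated with a very small Python sketch. It is not the approach to be developed in the thesis; it merely assumes, for illustration, that two instances are candidates for the same entity when the sets of nodes they link to overlap strongly (Jaccard similarity). The function names, the threshold, and the toy data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def neighbours(triples, node):
    """All objects that the given node links to."""
    return {o for s, _, o in triples if s == node}


def likely_same(triples, x, y, threshold=0.5):
    """Flag x and y as a candidate match if their neighbourhoods overlap enough."""
    return jaccard(neighbours(triples, x), neighbours(triples, y)) >= threshold


# Toy graph with paper, author, and venue nodes.
graph = [
    ("paper1", "author", "J. Smith"),
    ("paper1", "venue", "ISWC"),
    ("paper2", "author", "J. Smith"),
    ("paper2", "venue", "ISWC"),
    ("paper3", "author", "A. Jones"),
    ("paper3", "venue", "ISWC"),
]
print(likely_same(graph, "paper1", "paper2"))
print(likely_same(graph, "paper1", "paper3"))
```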
|Rule/Concept Learning using Swarm and Evolutionary Computation
In the Semantic Web context, OWL ontologies play the key role of domain conceptualizations while the corresponding assertional knowledge is given by the heterogeneous Web resources referring to them. However, being strongly decoupled, ontologies and assertional bases can be out of sync. In particular, an ontology may be incomplete, noisy, and sometimes inconsistent with the actual usage of its conceptual vocabulary in the assertions. Despite such problematic situations, we aim at discovering hidden knowledge patterns from ontological knowledge bases, in the form of multi-relational association rules, by exploiting the evidence coming from the (evolving) assertional data. The final goal is to make use of such patterns for (semi-)automatically enriching/completing existing ontologies.
|B, M||Dr. Hajira Jabeen|
|Intelligent Semantic Creativity: Culinarian
Computational creativity is an emerging branch of artificial intelligence that places computers in the center of the creative process. We aim to create a computational system that creates flavorful, novel, and perhaps healthy culinary recipes by drawing on big data techniques. It brings analytics algorithms together with disparate data sources from culinary science.
In its most ambitious form, the system would employ human-computer interaction for rating different recipes and model the human cognitive ability for the cooking process.
The end result will be an ingredient list, proportions, and a directed acyclic graph representing a partial ordering of the culinary recipe steps.
|B, M||Dr. Hajira Jabeen|
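The directed acyclic graph of recipe steps mentioned above can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The recipe and the helper `linearize` are hypothetical and only show how a partial ordering of steps can be turned into one valid cooking sequence.

```python
from graphlib import TopologicalSorter

# Hypothetical recipe: each step maps to the set of steps that must finish first.
steps = {
    "chop onions": set(),
    "boil water": set(),
    "cook pasta": {"boil water"},
    "fry onions": {"chop onions"},
    "combine": {"cook pasta", "fry onions"},
}


def linearize(dag):
    """Return one cooking order that respects the partial order of the steps."""
    return list(TopologicalSorter(dag).static_order())


order = linearize(steps)
print(order)
```

Steps that are not ordered relative to each other (here, chopping onions and boiling water) may appear in either order, which is exactly the freedom a partial ordering leaves open. A cyclic dependency makes `static_order()` raise `graphlib.CycleError`.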
|IoT Data Catalogues
While platforms and tools such as Hadoop and Apache Spark allow for efficient processing of Big Data sets, it becomes increasingly challenging to organize and structure these data sets. Data sets have various forms ranging from unstructured data in files to structured data in databases. Often the data sets reside in different storage systems ranging from traditional file systems, over Big Data file systems (HDFS), to heterogeneous storage systems (S3, RDBMS, MongoDB, Elastic Search, …). At AGT International, we are dealing primarily with IoT data sets, i.e. data sets that have been collected from sensors and that are processed using Machine Learning-based (ML) analytic pipelines. The number of these data sets is rapidly growing, increasing the importance of generating metadata that captures both technical metadata (e.g. storage location, size) and domain metadata and that correlates the data sets with each other, e.g. by storing provenance (data set x is a processed version of data set y) and domain relationships.
|M||Dr. Martin Strohbach, Prof. Dr. Jens Lehmann
(Work at AGT International in Darmstadt)
|Big Data Quality Assessment (assigned)
Data quality is considered a multidimensional concept that covers different aspects of quality such as accuracy, completeness, and timeliness. With the advent of Big Data, traditional quality assessment techniques are facing new challenges. Therefore, we should adapt the traditional techniques to big data technologies. The goal of this thesis is to re-implement the assessment techniques in the SANSA framework.
|B, M||Dr. Anisa Rula, Gezim Sejdiu|
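As an example of one quality dimension, the sketch below computes a simple completeness score in plain Python. It does not use the SANSA API; the function `property_completeness` and the sample triples are hypothetical, and completeness is assumed here to mean the fraction of subjects that have at least one value for a required property.

```python
def property_completeness(triples, required_properties):
    """For each required property, the share of subjects that carry it."""
    subjects = {s for s, _, _ in triples}
    scores = {}
    for prop in required_properties:
        having = {s for s, p, _ in triples if p == prop}
        scores[prop] = len(having) / len(subjects) if subjects else 0.0
    return scores


people = [
    ("a", "name", "Alice"),
    ("a", "email", "alice@example.org"),
    ("b", "name", "Bob"),
]
print(property_completeness(people, ["name", "email"]))
```

On a cluster, the same computation would have to be expressed as distributed operations over the triples, which is where the adaptation to big data technologies comes in.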
|Understanding Short-Text: a Named Entity Recognition perspective
Named Entity Recognition (NER) models play an important role in the Information Extraction (IE) pipeline. However, despite the decent performance of NER models on newswire datasets, conventional approaches are to date not able to successfully identify classical named-entity types in short/noisy texts. This thesis will thoroughly investigate NER in microblogs and propose new algorithms to outperform current state-of-the-art models in this research area.
|B, M||Diego Esteves|
|Multilingual Fact Validation Algorithms
DeFacto (Deep Fact Validation) is an algorithm able to validate facts by finding trustworthy sources for them on the Web. Currently, it supports 3 main languages (en, de and fr). The goal of this thesis is to explore and implement alternative information retrieval (IR) methods to minimize the dependency on external tools for verbalizing natural language patterns. As a result, we expect to enhance the algorithm's performance by expanding its coverage.
|B, M||Diego Esteves|
|Experimental Analysis of Class CS Problems
In this thesis, we explore unsolved problems of theoretical computer science with machine learning methods, especially reinforcement learning.
|B, M||Diego Esteves|
|Generating Property Graphs from RDF using a semantic preserving conversion approach
Graph Databases have been on the rise over the last decade due to their dominance in the mining and analysis of complex networks. Property Graphs (PGs), one of the graph data models which Graph Databases use, are suitable for the representation of many real-life application scenarios. They allow complex networks (e.g. social networks, e-commerce) and interactions to be represented efficiently. In order to leverage this advantage of graph databases, conversions of other data models to property graphs are a current area of research. The aim of this thesis is to (i) propose a novel systematic conversion approach for generating PGs from RDF (one of the graph data models) and (ii) carry out exhaustive experiments on both RDF and PG datasets with respect to their native storage databases (i.e. Graph DBs vs Triplestores). This will make it possible to identify the types of queries for which graph databases offer performance advantages and, ideally, to adapt the storage mechanism accordingly. The outcome of this work will be integrated into the LITMUS framework, which is an open extensible framework for benchmarking of diverse Data Management Solutions.
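To make the conversion problem concrete, here is a deliberately naive Python sketch. It is not the approach to be proposed in the thesis: it assumes a simple rule in which triples with a literal object become node properties and all other triples become labelled edges. The `Literal` wrapper, the function name, and the `ex:` identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """Marks an object value as an RDF literal rather than a resource."""
    value: object


def rdf_to_property_graph(triples):
    """Map RDF triples to (nodes with properties, labelled edges)."""
    nodes = {}  # node id -> {property name: [values]}
    edges = []  # (source id, edge label, target id)
    for s, p, o in triples:
        nodes.setdefault(s, {})
        if isinstance(o, Literal):
            # Literal-valued triples become (multi-valued) node properties.
            nodes[s].setdefault(p, []).append(o.value)
        else:
            # Resource-valued triples become edges between two nodes.
            nodes.setdefault(o, {})
            edges.append((s, p, o))
    return nodes, edges


rdf = [
    ("ex:alice", "ex:name", Literal("Alice")),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:name", Literal("Bob")),
]
nodes, edges = rdf_to_property_graph(rdf)
print(nodes)
print(edges)
```

This mapping drops datatypes, language tags, and blank-node semantics, so it is not semantics-preserving; handling such cases systematically is part of what the thesis would need to address.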
|Graph partitioning for RDF data (Taken)
Big RDF datasets need to be stored and processed in distributed RDF data stores that are built on top of server clusters. Several partitioning schemes exist, such as horizontal, vertical, and hash partitioning, that allow the datasets to be split across several nodes in order to achieve scalability and efficient query processing. The goal of this thesis is to study graph partitioning approaches for RDF data, compare the state of the art, and implement corresponding algorithms that will be integrated into the SANSA framework.
|B, M||Dr. Ioanna Lytra, Gezim Sejdiu|
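Two of the partitioning schemes mentioned above can be sketched in a few lines of Python. This is a single-machine illustration rather than SANSA code; the function names are hypothetical, hash partitioning is assumed to be keyed on the subject, and vertical partitioning is assumed to mean one (subject, object) table per predicate.

```python
import zlib
from collections import defaultdict


def hash_partition(triples, num_partitions):
    """Assign each triple to a partition by hashing its subject."""
    parts = [[] for _ in range(num_partitions)]
    for triple in triples:
        # crc32 is stable across runs, unlike Python's salted built-in hash().
        index = zlib.crc32(triple[0].encode("utf-8")) % num_partitions
        parts[index].append(triple)
    return parts


def vertical_partition(triples):
    """Group triples into one (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


sample = [
    ("alice", "knows", "bob"),
    ("alice", "age", "30"),
    ("bob", "knows", "carol"),
]
print(hash_partition(sample, 3))
print(vertical_partition(sample))
```

Subject-based hashing keeps all triples of one subject on the same node, which favours star-shaped queries, whereas vertical partitioning favours queries that touch only a few predicates.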
|Recommendation system for RDF partitioners
In order to store and query big RDF datasets efficiently in distributed environments, different partitioning techniques need to be implemented. Several techniques have been proposed for splitting Big RDF Data, ranging from vertical, hash, and graph to semantic-based partitioners. However, the selection of the “best partitioner” depends highly on the structure of the dataset, and the query efficiency and effectiveness are coupled to the query engine used. The goal of this thesis will be to develop a recommender system that will suggest the “best partitioner” based on the structure of the data and specific requirements.
|B, M||Gezim Sejdiu, Dr. Ioanna Lytra|
|Relation Linking for Question Answering in German
The task of relation linking in question answering is the identification of the relation (predicate) in a given question and its linking to the corresponding entity in a knowledge base. It is an important step in question answering, which allows us afterwards to build formal queries against, e.g., a knowledge graph. Most of the existing question answering systems focus on the English language, and very few question answering components support other languages like German. The goal of this thesis is to identify relation extraction tools from the literature, as well as to develop new ones, that can be adapted to work for German questions.
|B, M||Dr. Ioanna Lytra|
|Query Decomposer and Optimizer for querying scientific datasets
The number of scientific datasets has increased dramatically in recent years. The Copernicus data repository (http://www.copernicus.eu/) is a prominent example of a collection of datasets related to the climate, atmosphere, agriculture, and marine domains that is publicly available on the Web. Until now, scientists have had to look for the appropriate datasets, download them, and query/analyze them using their own infrastructure. Querying/analyzing scientific data without knowing about the underlying datasets is currently not possible. The goal of this thesis will be to create a query engine that can query scientific datasets transparently, without the user needing to be aware of the available datasets.
|B, M||Dr. Ioanna Lytra|
|Knowledge Data Containers with Access Control and Security Capabilities
The amount of Linked Data, both open (made available on the Web) and private (exchanged across companies and organizations), has been increasing in recent years. This data can be distributed in the form of Knowledge Graphs (KGs), but maintaining these KGs is mainly the responsibility of data owners or providers. Moreover, building applications on top of KGs in order to provide, for instance, analytics, data access control, and privacy is left to the end users or data consumers. However, many resources in terms of development costs and equipment are required by both data providers and consumers, thus impeding the development of real-world applications over KGs. KGs as well as data processing functionalities can be encapsulated in a client-side system called a Knowledge Graph Container, intended to be used by data providers or data consumers. The goal of this thesis is to integrate access control and security capabilities in these KG containers.
|M||Dr. Ioanna Lytra|
|Hidden Research Community Detection 2.0
Scientific communities are commonly identified with research fields; however, researchers also interact in hidden communities formed through co-authorship, shared topic interests, attended events, etc. In this thesis, which is the second phase of a completed master's thesis, we will focus on identifying more such communities by defining similarity metrics over the objects of a research knowledge graph that we will build from several datasets.
|Movement of Research Results and Education through OpenCourseWare
This thesis is research-oriented work in which we will build a knowledge graph (KG) for OpenCourseWare (OCW, i.e. online courses) and for the development of the research topics covered in this KG. We will use an analytics tool to define interesting queries that give insights into the research question of how well research is aligned with teaching material.
|B, M||Sahar Vahdati|
|Development and Implementation of a Semantic Configuration and Change Management
This thesis is offered in cooperation with Schaeffler Technologies AG & Co. KG. A solid knowledge of OWL and RDF is needed, as well as a general interest in configuration and change management. The thesis can be written in English, and an English-speaking work environment is possible. A more detailed description in German is available here (pdf).
|An Approach for (Big) Product Matching
Consider comparing the same product data from thousands of e-shops. Two main challenges make this comparison difficult. First, the completeness of the product specifications used for organizing the products differs across e-shops. Second, the ability to represent information about product data and their taxonomy is very diverse. To improve the consumer experience, e.g., by allowing offers from different vendors to be compared easily, approaches for product integration on the Web are needed.
The main focus of this work is on data modeling and semantic enrichment of product data in order to obtain an effective and efficient product matching result.
|M||Dr. Giulio Napolitano, Debanjan Chaudhuri|
|Learning word representations for out-of-vocabulary words using their contexts
Natural language processing (NLP) research has recently witnessed a significant boost following the introduction of word embeddings as proposed by Mikolov et al. (2013) (Distributed Representations of Words and Phrases and their Compositionality). However, one of the biggest challenges of learning word embeddings with the vanilla neural network architecture, with words as input and contexts as output, is the handling of out-of-vocabulary (OOV) words, as the model fails badly on unseen words. In this project we suggest an architecture that uses only the proposed word2vec model. Given an unseen word, we would predict a distributed embedding for it from the contexts in which it is used, using the matrix that has learned to predict the context given the word. (More details)
|M||Dr. Giulio Napolitano, Debanjan Chaudhuri|
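One possible reading of this idea can be sketched as follows. The sketch assumes, purely for illustration, that the embedding of an unseen word is estimated by averaging the rows of the trained output (context) matrix that belong to the words observed around it. The function name and the tiny two-dimensional vectors are hypothetical, and the architecture actually developed in the project may differ.

```python
def oov_embedding(context_words, output_vectors):
    """Average the output-matrix rows of the known context words."""
    known = [output_vectors[w] for w in context_words if w in output_vectors]
    if not known:
        raise ValueError("none of the context words is in the vocabulary")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy stand-in for the trained output (context) matrix of a word2vec model.
output_vectors = {
    "drink": [1.0, 0.0],
    "cup": [0.0, 1.0],
}

# Contexts in which the unseen word was observed; "sip" is itself unknown.
estimate = oov_embedding(["drink", "cup", "sip"], output_vectors)
print(estimate)
```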
|Reflecting on the User Experience Challenges of CEUR Make GUI and Harnessing the Experience: CEUR Make GUI
CEUR Make GUI is a graphical user interface supporting the workflow of publishing open access proceedings of scientific workshops via CEUR-WS.org, one of the largest open access repositories. For more details on the topic, please go through the publications mentioned below. In this thesis we aim to address the user experience challenges documented in the first part of the thesis, as cited below, and to refactor the current code base. We would like to address challenges such as the permissiveness of the input fields and the display of feedback through well-defined UI patterns.
Email: Muhammad.email@example.com
Current Repository . Thesis . Publication . Task Board
|B, M||Rohan Asmat|
|Developing Collaborative Workspace for Workshop Editors: CEUR Make GUI
CEUR Make GUI is a graphical user interface supporting the workflow of publishing open access proceedings of scientific workshops via CEUR-WS.org, one of the largest open access repositories. For more details on the topic, please go through the publications mentioned below. In this thesis we aim to produce a collaborative workspace for editing workshop proceedings and to enhance the user experience of the software. Building on the development of this collaborative workspace, we would also like to address the user experience and the collaborative and cooperative workspace challenges through a structured protocol.
Email: Muhammad.firstname.lastname@example.org
Current Repository . Thesis . Publication . Task Board
|B, M||Rohan Asmat|
|RDF compression techniques
As a starting point, an up-to-date survey of the state of the art in compression techniques for RDF could be produced. These techniques can mainly be divided into two families: those that compress datasets as much as possible in order to make transfers easier (see e.g. the study of Fernández et al.) and those that still allow the data to be queried (see e.g. the HDT structure). Secondly, a new compression model could be devised and then implemented. Obviously, I already have some suggestions which could help the student 😉, for instance trying to compress RDF graphs according to patterns that could be used in parallel with SPARQL query shapes.
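A basic building block of many RDF compression schemes is dictionary encoding, in which every term is stored once and triples are kept as tuples of integers. The Python sketch below shows this general technique only; it is not the HDT format, and the function names and sample triples are hypothetical.

```python
def compress(triples):
    """Replace every term by an integer ID; return (dictionary, ID triples)."""
    dictionary = []  # ID -> term
    ids = {}         # term -> ID
    encoded = []
    for triple in triples:
        row = []
        for term in triple:
            if term not in ids:
                ids[term] = len(dictionary)
                dictionary.append(term)
            row.append(ids[term])
        encoded.append(tuple(row))
    return dictionary, encoded


def decompress(dictionary, encoded):
    """Rebuild the original triples from the dictionary and the ID triples."""
    return [tuple(dictionary[i] for i in row) for row in encoded]


original = [
    ("a", "knows", "b"),
    ("b", "knows", "c"),
]
dictionary, encoded = compress(original)
print(dictionary)
print(encoded)
```

Because the integer triples can still be compared position by position, a triple pattern can be matched without decompressing the data, which is the property that distinguishes the second family of techniques.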
|Provide tools for LaTeX leveraging semantic web standards
When articles are written (and submitted to peer review), one of the biggest fears of researchers is to forget some state-of-the-art works. Indeed, articles should be positioned among the already existing ones to show that they are new. However, a specific relevant paper can sometimes be forgotten by the authors. To avoid this unpleasant situation, one could imagine a LaTeX package able to check whether any citation is missing in a manuscript. To do so, several things might be implemented: (i) extending the already existing pdf2rdf tool by implementing a tex2rdf module; (ii) generating bib-code from these RDF data; (iii) extracting RDF data from the reference sections of articles; (iv) aggregating all these RDF data and loading this dataset into a store; (v) developing a LaTeX package which would be able to automatically query this endpoint to possibly provide missing references.
Basically, this subject would offer the student the possibility of entering the Semantic Web world while creating a fancy and useful tool. In a nutshell, RDF2Résumé would involve (1) designing a résumé ontology; (2) providing a simple tool (a simple piece of software such as a script) able to generate, say, LaTeX code from an RDF file compliant with the aforementioned ontology; (3) in parallel, proposing several final résumé templates; (4) finally, building a basic user interface; and (+) offering the possibility of automatically changing languages.