Mohamed Sherif
Senior Researcher
Agile Knowledge Engineering and Semantic Web (AKSW)
University of Leipzig

Profiles: LinkedIn, Google Scholar, DBLP, GitHub, Google+

Augustusplatz 10, Room P905, 04109 Leipzig
sherif@informatik.uni-leipzig.de
Phone: +49-341-97-32260


Short CV


Dr. Mohamed Sherif is a Senior Researcher in the Agile Knowledge Engineering and Semantic Web (AKSW) group at the University of Leipzig. His research interests lie in the areas of RDF workflows, link discovery, and fusion.

Research Interests


  • Machine Learning
  • RDF Workflows
  • Link Discovery
  • Fusion

Publications


2017

  • M. A. Sherif and A. N. Ngomo, “A Systematic Survey of Point Set Distance Measures for Link Discovery,” Semantic web journal, 2017.
    [BibTeX] [Abstract] [Download PDF]
    Large amounts of geo-spatial information have been made available with the growth of the Web of Data. While discovering links between resources on the Web of Data has been shown to be a demanding task, discovering links between geo-spatial resources proves to be even more challenging. This is partly due to the resources being described by the means of vector geometry. Especially, discrepancies in granularity and error measurements across data sets render the selection of appropriate distance measures for geo-spatial resources difficult. In this paper, we survey existing literature for point-set measures that can be used to measure the similarity of vector geometries. We then present and evaluate the ten measures that we derived from literature. We evaluate these measures with respect to their time-efficiency and their robustness against discrepancies in measurement and in granularity. To this end, we use samples of real data sets of different granularity as input for our evaluation framework. The results obtained on three different data sets suggest that most distance approaches can be led to scale. Moreover, while some distance measures are significantly slower than other measures, distance measure based on means, surjections and sums of minimal distances are robust against the different types of discrepancies.

    @Article{sherif-pointset-survey-swj,
    Title = {{A Systematic Survey of Point Set Distance Measures for Link Discovery}},
    Author = {Mohamed Ahmed Sherif and Axel-Cyrille Ngonga Ngomo},
    Journal = {Semantic Web Journal},
    Year = {2017},
    Abstract = {Large amounts of geo-spatial information have been made available with the growth of the Web of Data. While discovering links between resources on the Web of Data has been shown to be a demanding task, discovering links between geo-spatial resources proves to be even more challenging. This is partly due to the resources being described by the means of vector geometry. Especially, discrepancies in granularity and error measurements across data sets render the selection of appropriate distance measures for geo-spatial resources difficult. In this paper, we survey existing literature for point-set measures that can be used to measure the similarity of vector geometries. We then present and evaluate the ten measures that we derived from literature. We evaluate these measures with respect to their time-efficiency and their robustness against discrepancies in measurement and in granularity. To this end, we use samples of real data sets of different granularity as input for our evaluation framework. The results obtained on three different data sets suggest that most distance approaches can be led to scale. Moreover, while some distance measures are significantly slower than other measures, distance measure based on means, surjections and sums of minimal distances are robust against the different types of discrepancies.},
    Keywords = {2017 group_aksw slipo sys:relevantFor:infai sys:relevantFor:bis ngonga simba sherif geo-distance limes},
    Owner = {sherif},
    Url = {http://www.semantic-web-journal.net/content/systematic-survey-point-set-distance-measures-link-discovery-1}
    }
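    A minimal, illustrative Python sketch of one of the surveyed measures, the sum of minimal distances, for 2D point sets (this is not the paper's evaluation code; the point data and the symmetric averaging are illustrative assumptions):

    from math import dist  # Python 3.8+: Euclidean distance between two points

    def sum_of_min_distances(a, b):
        """Symmetric 'sum of minimal distances' between two point sets a and b.

        Every point in a contributes its distance to the closest point in b,
        and vice versa; the two directed sums are averaged. Illustrative only.
        """
        d_ab = sum(min(dist(p, q) for q in b) for p in a)
        d_ba = sum(min(dist(q, p) for p in a) for q in b)
        return 0.5 * (d_ab + d_ba)

    # Toy example: the same square sampled at two different granularities.
    coarse = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    fine = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.0, 0.5),
            (1.0, 1.0), (0.5, 1.0), (0.0, 1.0), (0.0, 0.5)]
    print(sum_of_min_distances(coarse, fine))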

  • M. A. Sherif, K. Dreßler, P. Smeros, and A. Ngonga Ngomo, “RADON – Rapid Discovery of Topological Relations,” in Proceedings of the thirty-first aaai conference on artificial intelligence (aaai-17), 2017.
    [BibTeX] [Abstract] [Download PDF]
    Geospatial data is at the core of the Semantic Web, of which the largest knowledge base contains more than 30 billions facts. Reasoning on these large amounts of geospatial data requires efficient methods for the computation of links between the resources contained in these knowledge bases. In this paper, we present RADON – efficient solution for the discovery of topological relations between geospatial resources according to the DE9-IM standard. Our evaluation shows that we outperform the state of the art significantly and by several orders of magnitude.

    @InProceedings{radon_2017,
    Title = {{RADON} - {Rapid Discovery of Topological Relations}},
    Author = {Mohamed Ahmed Sherif and Kevin Dre{\ss}ler and Panayiotis Smeros and Axel-Cyrille {Ngonga Ngomo}},
    Booktitle = {Proceedings of The Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)},
    Year = {2017},
    Abstract = {Geospatial data is at the core of the Semantic Web, of which the largest knowledge base contains more than 30 billions facts. Reasoning on these large amounts of geospatial data requires efficient methods for the computation of links between the resources contained in these knowledge bases. In this paper, we present RADON - efficient solution for the discovery of topological relations between geospatial resources according to the DE9-IM standard. Our evaluation shows that we outperform the state of the art significantly and by several orders of magnitude.},
    Keywords = {sherif limes projecthobbit hobbit geiser group_aksw SIMBA sys:relevantFor:infai sys:relevantFor:bis sys:relevantFor:leds leds ngonga bioasq kevin},
    Owner = {sherif},
    Timestamp = {2016.11.14},
    Url = {https://svn.aksw.org/papers/2017/AAAI_RADON/public.pdf}
    }
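    RADON itself is built for batch link discovery; purely as an illustration of the DE-9IM relations it computes, the snippet below uses the shapely library to derive the DE-9IM intersection matrix and a few named predicates for a single pair of made-up geometries (no indexing or batch strategy):

    from shapely.geometry import Polygon

    a = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
    b = Polygon([(2, 0), (4, 0), (4, 2), (2, 2)])  # shares an edge with a

    print(a.relate(b))      # DE-9IM intersection matrix as a 9-character string
    print(a.touches(b))     # True: boundaries intersect, interiors do not
    print(a.intersects(b))  # True
    print(a.overlaps(b))    # False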

  • M. A. Sherif, A. Ngonga Ngomo, and J. Lehmann, “WOMBAT – A Generalization Approach for Automatic Link Discovery,” in 14th extended semantic web conference, portorož, slovenia, 28th may – 1st june 2017, 2017.
    [BibTeX] [Abstract] [Download PDF]
    A significant portion of the evolution of Linked Data datasets lies in updating the links to other datasets. An important challenge when aiming to update these links automatically under the open-world assumption is the fact that usually only positive examples for the links exist. We address this challenge by presenting and evaluating WOMBAT, a novel approach for the discovery of links between knowledge bases that relies exclusively on positive examples. WOMBAT is based on generalisation via an upward refinement operator to traverse the space of link specification. We study the theoretical characteristics of WOMBAT and evaluate it on 8 different benchmark datasets. Our evaluation suggests that WOMBAT outperforms state-of-the-art supervised approaches while relying on less information. Moreover, our evaluation suggests that WOMBAT's pruning algorithm allows it to scale well even on large datasets.

    @InProceedings{WOMBAT_2017,
    Title = {{WOMBAT} - {A Generalization Approach for Automatic Link Discovery}},
    Author = {Sherif, {Mohamed Ahmed} and {Ngonga Ngomo}, Axel-Cyrille and Lehmann, Jens},
    Booktitle = {14th Extended Semantic Web Conference, Portoro{\v{z}}, Slovenia, 28th May - 1st June 2017},
    Year = {2017},
    Publisher = {Springer},
    Abstract = {A significant portion of the evolution of Linked Data datasets lies in updating the links to other datasets. An important challenge when aiming to update these links automatically under the open-world assumption is the fact that usually only positive examples for the links exist. We address this challenge by presenting and evaluating WOMBAT, a novel approach for the discovery of links between knowledge bases that relies exclusively on positive examples. WOMBAT is based on generalisation via an upward refinement operator to traverse the space of link specification. We study the theoretical characteristics of WOMBAT and evaluate it on 8 different benchmark datasets. Our evaluation suggests that WOMBAT outperforms state-of-the-art supervised approaches while relying on less information. Moreover, our evaluation suggests that WOMBAT's pruning algorithm allows it to scale well even on large datasets.},
    Bdsk-url-1 = {http://svn.aksw.org/papers/2017/ESWC_WOMBAT/public.pdf},
    Keywords = {2017 group_aksw sys:relevantFor:geoknow sys:relevantFor:infai sys:relevantFor:bis ngonga simba sherif group_aksw geoknow wombat lehmann MOLE},
    Url = {http://svn.aksw.org/papers/2017/ESWC_WOMBAT/public.pdf}
    }
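    WOMBAT's refinement operator works over LIMES link specifications; the toy Python sketch below only conveys the idea of generalisation from positive examples: start from the most specific specification (threshold 1.0) and relax it until all known positive pairs are covered. The similarity function, property names and examples are made up:

    from difflib import SequenceMatcher

    def sim(a, b):
        """String similarity in [0, 1]; stands in for a LIMES similarity measure."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # Hypothetical positive examples: label pairs known to denote the same entity.
    positives = [("Leipzig", "City of Leipzig"),
                 ("Berlin", "Berlin, Germany"),
                 ("Dresden", "Dresden")]

    def coverage(threshold):
        """Fraction of positive pairs accepted by the spec sim(x, y) >= threshold."""
        return sum(sim(x, y) >= threshold for x, y in positives) / len(positives)

    # Upward refinement (generalisation): relax the most specific spec step by step.
    threshold = 1.0
    while threshold > 0.0 and coverage(threshold) < 1.0:
        threshold = round(threshold - 0.05, 2)

    print(f"learned spec: sim(source_label, target_label) >= {threshold}")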

2016

  • K. Georgala, M. A. Sherif, and A. N. Ngomo, “An efficient approach for the generation of allen relations,” in Proceedings of the 22nd european conference on artificial intelligence (ecai) 2016, the hague, 29. august – 02. september 2016, 2016.
    [BibTeX] [Abstract] [Download PDF]
    Event data is increasingly being represented according to the Linked Data principles. The need for large-scale machine learning on data represented in this format has thus led to the need for efficient approaches to compute RDF links between resources based on their temporal properties. Time-efficient approaches for computing links between RDF resources have been developed over the last years. However, dedicated approaches for linking resources based on temporal relations have been paid little attention to. In this paper, we address this research gap by presenting AEGLE, a novel approach for the efficient computation of links between events according to Allen’s interval algebra. We study Allen’s relations and show that we can reduce all thirteen relations to eight simpler relations. We then present an efficient algorithm with a complexity of O(n log n) for computing these eight relations. Our evaluation of the runtime of our algorithms shows that we outperform the state of the art by up to 4 orders of magnitude while maintaining a precision and a recall of 100%.

    @InProceedings{allenalgebra,
    Title = {An Efficient Approach for the Generation of Allen Relations},
    Author = {Kleanthi Georgala and Mohamed Ahmed Sherif and Axel-Cyrille Ngonga Ngomo},
    Booktitle = {Proceedings of the 22nd European Conference on Artificial Intelligence (ECAI) 2016, The Hague, 29. August - 02. September 2016},
    Year = {2016},
    Abstract = {Event data is increasingly being represented according to the Linked Data principles. The need for large-scale machine learning on data represented in this format has thus led to the need for efficient approaches to compute RDF links between resources based on their temporal properties. Time-efficient approaches for computing links between RDF resources have been developed over the last years. However, dedicated approaches for linking resources based on temporal relations have been paid little attention to. In this paper, we address this research gap by presenting A EGLE , a novel approach for the efficient computation of links between events according to Allen's interval algebra. We study Allen's relations and show that we can reduce all thirteen relations to eights simpler relations. We then present an efficient algorithm with a complexity of O(n log n) for computing these eight relations. Our evaluation of the runtime of our algorithms shows that we outperform the state of the art by up to 4 orders of magnitude while maintaining a precision and a recall of 100\%.},
    Keywords = {sys:relevantFor:infai group_aksw simba georgala sherif ngonga sake projecthobbit limes},
    Url = {http://svn.aksw.org/papers/2016/ECAI_AEGLE/public.pdf}
    }
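    AEGLE's contribution is the efficient batch computation of these relations; the thirteen Allen relations themselves can be classified for a single pair of intervals with plain endpoint comparisons, as in this minimal sketch (closed intervals, no indexing, no reduction to eight relations):

    def allen_relation(a, b):
        """Allen relation of interval a = (a1, a2) with respect to b = (b1, b2)."""
        a1, a2 = a
        b1, b2 = b
        if a2 < b1:  return "before"
        if b2 < a1:  return "after"
        if a2 == b1: return "meets"
        if b2 == a1: return "met-by"
        if a1 == b1 and a2 == b2: return "equals"
        if a1 == b1: return "starts" if a2 < b2 else "started-by"
        if a2 == b2: return "finishes" if a1 > b1 else "finished-by"
        if b1 < a1 and a2 < b2: return "during"
        if a1 < b1 and b2 < a2: return "contains"
        return "overlaps" if a1 < b1 else "overlapped-by"

    print(allen_relation((1, 3), (3, 6)))  # meets
    print(allen_relation((1, 4), (2, 6)))  # overlaps
    print(allen_relation((2, 3), (1, 6)))  # during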

  • M. A. Sherif, M. Hassan, T. Soru, A. Ngonga Ngomo, and J. Lehmann, “Lion’s den: feeding the linklion,” in Proceedings of ontology matching workshop, 2016.
    [BibTeX] [Download PDF]
    @InProceedings{lionsden16,
    Title = {Lion's Den: Feeding the LinkLion},
    Author = {Mohamed Ahmed Sherif and Mofeed Hassan and Tommaso Soru and Axel-Cyrille {Ngonga Ngomo} and Jens Lehmann},
    Booktitle = {Proceedings of Ontology Matching Workshop},
    Year = {2016},
    Keywords = {sherif hassan soru lehmann ngonga geoknow group_aksw SIMBA sys:relevantFor:infai sys:relevantFor:bis limes},
    Owner = {sherif},
    Timestamp = {2016.09.26},
    Url = {http://disi.unitn.it/~pavel/om2016/papers/om2016_poster5.pdf}
    }

  • M. A. Sherif, “Automating Geospatial RDF Dataset Integration and Enrichment,” PhD thesis, University of Leipzig, Leipzig, Germany, 2016.
    [BibTeX] [Abstract] [Download PDF]
    Over the last years, the Linked Open Data (LOD) has evolved from a mere 12 to more than 10,000 knowledge bases. These knowledge bases come from diverse domains including (but not limited to) publications, life sciences, social networking, government, media, linguistics. Moreover, the LOD cloud also contains a large number of crossdomain knowledge bases such as DBpedia and Yago2. These knowledge bases are commonly managed in a decentralized fashion and contain partly overlapping information. This architectural choice has led to knowledge pertaining to the same domain being published by independent entities in the LOD cloud. For example, information on drugs can be found in Diseasome as well as DBpedia and Drugbank. Furthermore, certain knowledge bases such as DBLP have been published by several bodies, which in turn has lead to duplicated content in the LOD. In addition, large amounts of geo-spatial information have been made available with the growth of heterogeneous Web of Data. The concurrent publication of knowledge bases containing related information promises to become a phenomenon of increasing importance with the growth of the number of independent data providers. Enabling the joint use of the knowledge bases published by these providers for tasks such as federated queries, cross-ontology question answering and data integration is most commonly tackled by creating links between the resources described within these knowledge bases. Within this thesis, we spur the transition from isolated knowledge bases to enriched Linked Data sets where information can be easily integrated and processed. To achieve this goal, we provide concepts, approaches and use cases that facilitate the integration and enrichment of information with other data types that are already present on the Linked Data Web with a focus on geo-spatial data. The first challenge that motivates our work is the lack of measures that use the geographic data for linking geo-spatial knowledge bases. This is partly due to the geo-spatial resources being described by the means of vector geometry. In particular, discrepancies in granularity and error measurements across knowledge bases render the selection of appropriate distance measures for geo-spatial resources difficult. We address this challenge by evaluating existing literature for pointset measures that can be used to measure the similarity of vector geometries. Then, we present and evaluate the ten measures that we derived from the literature on samples of three real knowledge bases. The second challenge we address in this thesis is the lack of automatic Link Discovery (LD) approaches capable of dealing with geospatial knowledge bases with missing and erroneous data. To this end, we present Colibri, an unsupervised approach that allows discovering links between knowledge bases while improving the quality of the instance data in these knowledge bases. A Colibri iteration begins by generating links between knowledge bases. Then, the approach makes use of these links to detect resources with probably erroneous or missing information. This erroneous or missing information detected by the approach is finally corrected or added. The third challenge we address is the lack of scalable LD approaches for tackling big geo-spatial knowledge bases. Thus, we present Deterministic Particle-Swarm Optimization (DPSO), a novel load balancing technique for LD on parallel hardware based on particle-swarm optimization.
We combine this approach with the Orchid algorithm for geo-spatial linking and evaluate it on real and artificial data sets. The lack of approaches for automatic updating of links of an evolving knowledge base is our fourth challenge. This challenge is addressed in this thesis by the Wombat algorithm. Wombat is a novel approach for the discovery of links between knowledge bases that relies exclusively on positive examples. Wombat is based on generalisation via an upward refinement operator to traverse the space of Link Specifications (LS). We study the theoretical characteristics of Wombat and evaluate it on different benchmark data sets. The last challenge addressed herein is the lack of automatic approaches for geo-spatial knowledge base enrichment. Thus, we propose Deer, a supervised learning approach based on a refinement operator for enriching Resource Description Framework (RDF) data sets. We show how we can use exemplary descriptions of enriched resources to generate accurate enrichment pipelines. We evaluate our approach against manually defined enrichment pipelines and show that our approach can learn accurate pipelines even when provided with a small number of training examples. Each of the proposed approaches is implemented and evaluated against state-of-the-art approaches on real and/or artificial data sets. Moreover, all approaches are peer-reviewed and published in a conference or a journal paper. Throughout this thesis, we detail the ideas, implementation and the evaluation of each of the approaches. Moreover, we discuss each approach and present lessons learned. Finally, we conclude this thesis by presenting a set of possible future extensions and use cases for each of the proposed approaches.

    @PhdThesis{Sherif-thesis-2016,
    Title = {{Automating Geospatial RDF Dataset Integration and Enrichment}},
    Author = {Mohamed Ahmed Sherif},
    School = {University of Leipzig},
    Year = {2016},
    Address = {Leipzig, Germany},
    Month = {December},
    Note = {\url{http://www.qucosa.de/recherche/frontdoor/?tx_slubopus4frontend[id]=21570}},
    Type = {PhD Thesis},
    Abstract = {Over the last years, the Linked Open Data (LOD) has evolved from a mere 12 to more than 10, 000 knowledge bases. These knowledge bases come from diverse domains including (but not limited to) publications, life sciences, social networking, government, media, linguistics. Moreover, the LOD cloud also contains a large number of crossdomain knowledge bases such as DBpedia and Yago2. These knowledge bases are commonly managed in a decentralized fashion and contain partly overlapping information. This architectural choice has led to knowledge pertaining to the same domain being published by independent entities in the LOD cloud. For example, information on drugs can be found in Diseasome as well as DBpedia and Drugbank. Furthermore, certain knowledge bases such as DBLP have been published by several bodies, which in turn has lead to duplicated content in the LOD. In addition, large amounts of geo-spatial information have been made available with the growth of heterogeneous Web of Data. The concurrent publication of knowledge bases containing related information promises to become a phenomenon of increasing importance with the growth of the number of independent data providers. Enabling the joint use of the knowledge bases published by these providers for tasks such as federated queries, cross-ontology question answering and data integration is most commonly tackled by creating links between the resources described within these knowledge bases. Within this thesis, we spur the transition from isolated knowledge bases to enriched Linked Data sets where information can be easily integrated and processed. To achieve this goal, we provide concepts, approaches and use cases that facilitate the integration and enrichment of information with other data types that are already present on the Linked Data Web with a focus on geo-spatial data. The first challenge that motivates our work is the lack of measures that use the geographic data for linking geo-spatial knowledge bases. This is partly due to the geo-spatial resources being described by the means of vector geometry. In particular, discrepancies in granularity and error measurements across knowledge bases render the selection of appropriate distance measures for geo-spatial resources difficult. We address this challenge by evaluating existing literature for pointset measures that can be used to measure the similarity of vector geometries. Then, we present and evaluate the ten measures that we derived from the literature on samples of three real knowledge bases. The second challenge we address in this thesis is the lack of automatic Link Discovery (LD) approaches capable of dealing with geospatial knowledge bases with missing and erroneous data. To this end,we present Colibri, an unsupervised approach that allows discovering links between knowledge bases while improving the quality of the instance data in these knowledge bases. A Colibri iteration begins by generating links between knowledge bases. Then, the approach makes use of these links to detect resources with probably erroneous or missing information. This erroneous or missing infor- mation detected by the approach is finally corrected or added. The third challenge we address is the lack of scalable LD approaches for tackling big geo-spatial knowledge bases. Thus, we present Deterministic Particle-Swarm Optimization (DPSO), a novel load balancing technique for LD on parallel hardware based on particle-swarm optimization. 
We combine this approach with the Orchid algorithm for geo-spatial linking and evaluate it on real and artificial data sets. The lack of approaches for automatic updating of links of an evolving knowledge base is our fourth challenge. This challenge is addressed in this thesis by the Wombat algorithm. Wombat is a novel approach for the discovery of links between knowledge bases that relies exclusively on positive examples. Wombat is based on generalisation via an upward refinement operator to traverse the space of Link Specifications (LS). We study the theoretical characteristics of Wombat and evaluate it on different benchmark data sets. The last challenge addressed herein is the lack of automatic approaches for geo-spatial knowledge base enrichment. Thus, we propose Deer, a supervised learning approach based on a refinement operator for enriching Resource Description Framework (RDF) data sets. We show how we can use exemplary descriptions of enriched resources to generate accurate enrichment pipelines. We evaluate our approach against manually defined enrichment pipelines and show that our approach can learn accurate pipelines even when provided with a small number of training examples. Each of the proposed approaches is implemented and evaluated against state-of-the-art approaches on real and/or artificial data sets. Moreover, all approaches are peer-reviewed and published in a con- ference or a journal paper. Throughout this thesis, we detail the ideas, implementation and the evaluation of each of the approaches. Moreover, we discuss each approach and present lessons learned. Finally, we conclude this thesis by presenting a set of possible future extensions and use cases for each of the proposed approaches.
    },
    Bdsk-url-1 = {http://www.qucosa.de/recherche/frontdoor/?tx_slubopus4frontend[id]=21570},
    Keywords = {2016 group_aksw sys:relevantFor:geoknow sys:relevantFor:infai sys:relevantFor:bis ngonga simba sherif group_aksw geoknow deer lehmann MOLE},
    Owner = {sherif},
    Timestamp = {2016.12.05},
    Url = {http://www.qucosa.de/recherche/frontdoor/?tx_slubopus4frontend[id]=21570}
    }

2015

  • C. Stadler, J. Unbehauen, P. Westphal, M. A. Sherif, and J. Lehmann, “Simplified RDB2RDF mapping,” in Proceedings of the 8th workshop on linked data on the web (ldow2015), florence, italy, 2015.
    [BibTeX] [Abstract] [Download PDF]
    The combination of the advantages of widely used relational databases and semantic technologies has attracted significant research over the past decade. In particular, mapping languages for the conversion of databases to RDF knowledge bases have been developed and standardized in the form of R2RML. In this article, we first review those mapping languages and then devise work towards a unified formal model for them. Based on this, we present the Sparqlification Mapping Language (SML), which provides an intuitive way to declare mappings based on SQL VIEWS and SPARQL construct queries. We show that SML has the same expressivity as R2RML by enumerating the language features and show the correspondences, and we outline how one syntax can be converted into the other. A conducted user study for this paper juxtaposing SML and R2RML provides evidence that SML is a more compact syntax which is easier to understand and read and thus lowers the barrier to offer SPARQL access to relational databases.

    @InProceedings{sml,
    Title = {Simplified {RDB2RDF} Mapping},
    Author = {Claus Stadler and Joerg Unbehauen and Patrick Westphal and Mohamed Ahmed Sherif and Jens Lehmann},
    Booktitle = {Proceedings of the 8th Workshop on Linked Data on the Web (LDOW2015), Florence, Italy},
    Year = {2015},
    Abstract = {The combination of the advantages of widely used relational databases and semantic technologies has attracted significant research over the past decade. In particular, mapping languages for the conversion of databases to RDF knowledge bases have been developed and standardized in the form of R2RML. In this article, we first review those mapping languages and then devise work towards a unified formal model for them. Based on this, we present the Sparqlification Mapping Language (SML), which provides an intuitive way to declare mappings based on SQL VIEWS and SPARQL construct queries. We show that SML has the same expressivity as R2RML by enumerating the language features and show the correspondences, and we outline how one syntax can be converted into the other. A conducted user study for this paper juxtaposing SML and R2RML provides evidence that SML is a more compact syntax which is easier to understand and read and thus lowers the barrier to offer SPARQL access to relational databases.},
    Bdsk-url-1 = {svn.aksw.org/papers/2015/LDOW_SML/paper-camery-ready_public.pdf},
    Keywords = {2015 group_aksw group_mole mole stadler lehmann sherif sys:relevantFor:geoknow geoknow peer-reviewed MOLE westphal},
    Url = {svn.aksw.org/papers/2015/LDOW_SML/paper-camery-ready_public.pdf}
    }
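    The paper is about the SML mapping language itself; the snippet below only illustrates the underlying RDB2RDF idea (SQL rows turned into RDF triples) with sqlite3 and rdflib, and uses neither SML nor R2RML syntax. Table, namespace and properties are made up:

    import sqlite3
    from rdflib import RDF, Graph, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/")  # hypothetical namespace

    # A tiny relational table standing in for a real database.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE person (id INTEGER, name TEXT)")
    con.executemany("INSERT INTO person VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

    # 'Mapping': one SQL query plus a rule turning each row into triples.
    g = Graph()
    for pid, name in con.execute("SELECT id, name FROM person"):
        subject = URIRef(f"http://example.org/person/{pid}")
        g.add((subject, RDF.type, EX.Person))
        g.add((subject, EX.name, Literal(name)))

    print(g.serialize(format="turtle"))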

  • M. A. Sherif and A. Ngonga Ngomo, “An optimization approach for load balancing in parallel link discovery,” in Semantics 2015, 2015.
    [BibTeX] [Abstract] [Download PDF]
    Many of the available RDF datasets describe millions of resources by using billions of triples. Consequently, millions of links can potentially exist among such datasets. While parallel implementations of link discovery approaches have been developed in the past, load balancing approaches for local implementations of link discovery algorithms have been paid little attention to. In this paper, we thus present a novel load balancing technique for link discovery on parallel hardware based on particle-swarm optimization. We combine this approach with the Orchid algorithm for geo-spatial linking and evaluate it on real and artificial datasets. Our evaluation suggests that while naïve approaches can be super-linear on small data sets, our deterministic particle swarm optimization outperforms both naïve and classical load balancing approaches such as greedy load balancing on large datasets.

    @InProceedings{sherifDPSO,
    Title = {An Optimization Approach for Load Balancing in Parallel Link Discovery},
    Author = {Mohamed Ahmed Sherif and Axel-Cyrille {Ngonga Ngomo}},
    Booktitle = {SEMANTiCS 2015},
    Year = {2015},
    Abstract = {Many of the available RDF datasets describe millions of resources by using billions of triples. Consequently, millions of links can potentially exist among such datasets. While parallel implementations of link discovery approaches have been developed in the past, load balancing approaches for local implementations of link discovery algorithms have been paid little attention to. In this paper, we thus present a novel load balancing technique for link discovery on parallel hardware based on particle-swarm optimization. We combine this approach with the Orchid algorithm for geo-spatial linking and evaluate it on real and artificial datasets. Our evaluation suggests that while na{\"i}ve approaches can be super-linear on small data sets, our deterministic particle swarm optimization outperforms both na{\"i}ve and classical load balancing approaches such as greedy load balancing on large datasets.},
    Bdsk-url-1 = {http://svn.aksw.org/papers/2015/SEMANTICS_DPSO/public.pdf},
    Keywords = {2015 sys:relevantFor:geoknow geoknow ngonga sherif simba group_aksw sys:relevantFor:infai sys:relevantFor:bis SIMBA limes},
    Url = {http://svn.aksw.org/papers/2015/SEMANTICS_DPSO/public.pdf}
    }
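    The deterministic particle-swarm optimiser itself is not reproduced here; as a point of reference for the "greedy load balancing" baseline named in the abstract, the classical longest-processing-time-first assignment looks roughly as follows (task costs and worker count are invented):

    import heapq

    def greedy_balance(task_costs, n_workers):
        """Longest-processing-time-first: give each task to the least loaded worker."""
        workers = [(0, w, []) for w in range(n_workers)]  # (load, worker id, tasks)
        heapq.heapify(workers)
        for cost in sorted(task_costs, reverse=True):
            load, w, tasks = heapq.heappop(workers)
            heapq.heappush(workers, (load + cost, w, tasks + [cost]))
        return sorted(workers)

    # Hypothetical per-task costs, e.g. estimated comparisons per geometry block.
    for load, worker, tasks in greedy_balance([12, 7, 31, 4, 18, 9, 25, 6], 3):
        print(f"worker {worker}: load={load}, tasks={tasks}")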

  • M. A. Sherif, A. Ngonga Ngomo, and J. Lehmann, “Automating RDF dataset transformation and enrichment,” in 12th extended semantic web conference, portorož, slovenia, 31st may – 4th june 2015, 2015.
    [BibTeX] [Abstract] [Download PDF]
    With the adoption of RDF across several domains, come growing requirements pertaining to the completeness and quality of RDF datasets. Currently, this problem is most commonly addressed by manually devising means of enriching an input dataset. The few tools that aim at supporting this endeavour usually focus on supporting the manual definition of enrichment pipelines. In this paper, we present a supervised learning approach based on a refinement operator for enriching RDF datasets. We show how we can use exemplary descriptions of enriched resources to generate accurate enrichment pipelines. We evaluate our approach against eight manually defined enrichment pipelines and show that our approach can learn accurate pipelines even when provided with a small number of training examples.

    @InProceedings{DEER_2015,
    Title = {Automating {RDF} Dataset Transformation and Enrichment},
    Author = {Sherif, {Mohamed Ahmed} and {Ngonga Ngomo}, Axel-Cyrille and Lehmann, Jens},
    Booktitle = {12th Extended Semantic Web Conference, Portoro{\v{z}}, Slovenia, 31st May - 4th June 2015},
    Year = {2015},
    Publisher = {Springer},
    Abstract = {With the adoption of RDF across several domains, come growing requirements pertaining to the completeness and quality of RDF datasets. Currently, this problem is most commonly addressed by manually devising means of enriching an input dataset. The few tools that aim at supporting this endeavour usually focus on supporting the manual definition of enrichment pipelines. In this paper, we present a supervised learning approach based on a refinement operator for enriching RDF datasets. We show how we can use exemplary descriptions of enriched resources to generate accurate enrichment pipelines. We evaluate our approach against eight manually defined enrichment pipelines and show that our approach can learn accurate pipelines even when provided with a small number of training examples.},
    Bdsk-url-1 = {http://svn.aksw.org/papers/2015/ESWC_DEER/public.pdf},
    Keywords = {2015 group_aksw sys:relevantFor:geoknow sys:relevantFor:infai sys:relevantFor:bis ngonga simba sherif group_aksw geoknow deer lehmann MOLE},
    Url = {http://svn.aksw.org/papers/2015/ESWC_DEER/public.pdf}
    }
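    DEER learns such pipelines with a refinement operator; the sketch below only illustrates the pipeline notion itself, i.e. an ordered list of enrichment functions that each take and return an rdflib Graph. The operators, namespace and data are invented:

    from rdflib import RDF, RDFS, Graph, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/")  # hypothetical namespace

    def add_labels(g):
        """Toy enrichment operator: derive rdfs:label from the URI's local name."""
        for s in set(g.subjects(RDF.type, EX.City)):
            if (s, RDFS.label, None) not in g:
                g.add((s, RDFS.label, Literal(str(s).rsplit("/", 1)[-1])))
        return g

    def tag_provenance(g):
        """Toy enrichment operator: mark every city as touched by this pipeline."""
        for s in set(g.subjects(RDF.type, EX.City)):
            g.add((s, EX.enrichedBy, Literal("toy-pipeline")))
        return g

    def run_pipeline(g, operators):
        for op in operators:  # a pipeline is simply an ordered list of operators
            g = op(g)
        return g

    g = Graph()
    g.add((URIRef("http://example.org/Leipzig"), RDF.type, EX.City))
    print(run_pipeline(g, [add_labels, tag_provenance]).serialize(format="turtle"))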

  • J. Lehmann, S. Athanasiou, A. Both, A. Garcia-Rojas, G. Giannopoulos, D. Hladky, K. Hoeffner, J. J. L. Grange, A. N. Ngomo, M. A. Sherif, C. Stadler, M. Wauer, P. Westphal, and V. Zaslawski, “Managing geospatial linked data in the geoknow project,” in Studies on the semantic web, 2015, pp. 51-78.
    [BibTeX] [Download PDF]
    @InBook{ios_geoknow_chapter,
    Title = {Managing Geospatial Linked Data in the GeoKnow Project},
    Author = {Jens Lehmann and Spiros Athanasiou and Andreas Both and Alejandra Garcia-Rojas and Giorgos Giannopoulos and Daniel Hladky and Konrad Hoeffner and Jon Jay Le Grange and Axel-Cyrille Ngonga Ngomo and Mohamed Ahmed Sherif and Claus Stadler and Matthias Wauer and Patrick Westphal and Vadim Zaslawski},
    Pages = {51--78},
    Year = {2015},
    Series = {Studies on the Semantic Web},
    Keywords = {2015 group_aksw sys:relevantFor:infai sys:relevantFor:bis sys:relevantFor:geoknow lehmann ngonga MOLE sherif hoeffner geoknow wauer westphal},
    Url = {http://jens-lehmann.org/files/2015/ios_geoknow_chapter.pdf}
    }

  • J. Lehmann, S. Athanasiou, A. Both, L. Buehmann, A. Garcia-Rojas, G. Giannopoulos, D. Hladky, K. Hoeffner, J. J. L. Grange, A. N. Ngomo, R. Pietzsch, R. Isele, M. A. Sherif, C. Stadler, M. Wauer, and P. Westphal, “The GeoKnow handbook,” 2015.
    [BibTeX] [Download PDF]
    @TechReport{geoknow_handbook,
    Title = {The {G}eo{K}now Handbook},
    Author = {Jens Lehmann and Spiros Athanasiou and Andreas Both and Lorenz Buehmann and Alejandra Garcia-Rojas and Giorgos Giannopoulos and Daniel Hladky and Konrad Hoeffner and Jon Jay Le Grange and Axel-Cyrille Ngonga Ngomo and Rene Pietzsch and Robert Isele and Mohamed Ahmed Sherif and Claus Stadler and Matthias Wauer and Patrick Westphal},
    Year = {2015},
    Keywords = {2015 group_aksw sys:relevantFor:infai sys:relevantFor:bis sys:relevantFor:geoknow lehmann ngonga MOLE sherif hoeffner geoknow westphal buehmann},
    Url = {http://jens-lehmann.org/files/2015/geoknow_handbook.pdf}
    }

2014

  • M. A. Sherif and A. Ngonga Ngomo, “Semantic quran: a multilingual resource for natural-language processing,” Semantic web journal, vol. XXX, pp. 1-5, 2014.
    [BibTeX] [Abstract] [Download PDF]
    In this paper we describe the Semantic Quran dataset, a multilingual RDF representation of translations of the Quran. The dataset was created by integrating data from two different semi-structured sources and aligned to an ontology designed to represent multilingual data from sources with a hierarchical structure. The resulting RDF data encompasses 43 different languages which belong to the most under-represented languages in the Linked Data Cloud, including Arabic, Amharic and Amazigh. We designed the dataset to be easily usable in natural-language processing applications with the goal of facilitating the development of knowledge extraction tools for these languages. In particular, the Semantic Quran is compatible with the Natural-Language Interchange Format and contains explicit morpho-syntactic information on the utilized terms. We present the ontology devised for structuring the data. We also provide the transformation rules implemented in our extraction framework. Finally, we detail the link creation process as well as possible usage scenarios for the Semantic Quran dataset.

    @Article{SHNG14,
    Title = {Semantic Quran: A Multilingual Resource for Natural-Language Processing},
    Author = {Mohamed Ahmed Sherif and Axel-Cyrille {Ngonga Ngomo}},
    Journal = {Semantic Web Journal},
    Year = {2014},
    Pages = {1-5},
    Volume = {XXX},
    Abstract = {In this paper we describe the Semantic Quran dataset, a multilingual RDF representation of translations of the Quran. The dataset was created by integrating data from two different semi-structured sources and aligned to an ontology designed to represent multilingual data from sources with a hierarchical structure. The resulting RDF data encompasses 43 different languages which belong to the most under-represented languages in the Linked Data Cloud, including Arabic, Amharic and Amazigh. We designed the dataset to be easily usable in natural-language processing applications with the goal of facilitating the development of knowledge extraction tools for these languages. In particular, the Semantic Quran is compatible with the Natural-Language Interchange Format and contains explicit morpho-syntactic information on the utilized terms. We present the ontology devised for structuring the data. We also provide the transformation rules implemented in our extraction framework. Finally, we detail the link creation process as well as possible usage scenarios for the Semantic Quran dataset.},
    Bdsk-url-1 = {http://www.semantic-web-journal.net/system/files/swj503.pdf},
    Keywords = {group_aksw SIMBA sys:relevantFor:infai sys:relevantFor:bis ngonga limes simba sherif 2014 limes semanticquran},
    Owner = {ngongaan},
    Timestamp = {2014.03.06},
    Url = {http://www.semantic-web-journal.net/system/files/swj503.pdf}
    }

  • M. A. Sherif, S. Coelho, R. Usbeck, S. Hellmann, J. Lehmann, M. Brümmer, and A. Both, “NIF4OGGD – NLP interchange format for open german governmental data,” in The 9th edition of the language resources and evaluation conference, 26-31 may, reykjavik, iceland, 2014.
    [BibTeX] [Abstract] [Download PDF]
    In the last couple of years the amount of structured open government data has increased significantly. Already now, citizens are able to leverage the advantages of open data through increased transparency and better opportunities to take part in governmental decision making processes. Our approach increases the interoperability of existing but distributed open governmental datasets by converting them to the RDF-based NLP Interchange Format (NIF). Furthermore, we integrate the converted data into a geodata store and present a user interface for querying this data via a keyword-based search. The language resource generated in this project is publicly available for download and via a dedicated SPARQL endpoint.

    @InProceedings{NIF4OGGD,
    Title = {NIF4OGGD - NLP Interchange Format for Open German Governmental Data},
    Author = {Sherif, Mohamed A. and Coelho, Sandro and Usbeck, Ricardo and Hellmann, Sebastian and Lehmann, Jens and Br{\"u}mmer, Martin and Both, Andreas},
    Booktitle = {The 9th edition of the Language Resources and Evaluation Conference, 26-31 May, Reykjavik, Iceland},
    Year = {2014},
    Abstract = {In the last couple of years the amount of structured open government data has increased significantly. Already now, citizens are able to leverage the advantages of open data through increased transparency and better opportunities to take part in governmental decision making processes. Our approach increases the interoperability of existing but distributed open governmental datasets by converting them to the RDF-based NLP Interchange Format (NIF). Furthermore, we integrate the converted data into a geodata store and present a user interface for querying this data via a keyword-based search. The language resource generated in this project is publicly available for download and via a dedicated SPARQL endpoint.},
    Bdsk-url-1 = {http://svn.aksw.org/papers/2014/LREC_NIF4OGGD/public.pdf},
    Keywords = {sys:relevantFor:infai sys:relevantFor:bis sys:relevantFor:geoknow sherif hellmann kilt lehmann usbeck bruemmer nif4oggd group_aksw kilt Lidmole MOLE simba},
    Url = {http://svn.aksw.org/papers/2014/LREC_NIF4OGGD/public.pdf}
    }
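    The core NIF pattern used for such conversions, a nif:Context carrying the reference text via nif:isString, can be sketched with rdflib as below; the document URI and text are placeholders, and offsets, annotations and the geodata store are omitted:

    from rdflib import RDF, Graph, Literal, Namespace, URIRef
    from rdflib.namespace import XSD

    NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")

    text = "Beispieltext aus einem offenen Verwaltungsdatensatz."  # placeholder
    ctx = URIRef(f"http://example.org/doc1#char=0,{len(text)}")    # placeholder URI

    g = Graph()
    g.bind("nif", NIF)
    g.add((ctx, RDF.type, NIF.Context))
    g.add((ctx, NIF.isString, Literal(text, lang="de")))
    g.add((ctx, NIF.beginIndex, Literal(0, datatype=XSD.nonNegativeInteger)))
    g.add((ctx, NIF.endIndex, Literal(len(text), datatype=XSD.nonNegativeInteger)))

    print(g.serialize(format="turtle"))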

  • S. Pokharel, M. A. Sherif, and J. Lehmann, “Ontology based data access and integration for improving the effectiveness of farming in nepal,” in Proc. of the international conference on web intelligence, 2014.
    [BibTeX] [Abstract] [Download PDF]
    It is widely accepted that food supply and quality are major problems in the 21st century. Due to the growth of the world’s population, there is a pressing need to improve the productivity of agricultural crops, which hinges on different factors such as geographical location, soil type, weather condition and particular attributes of the crops to plant. In many regions of the world, information about those factors is not readily accessible and dispersed across a multitude of different sources. One of those regions is Nepal, in which the lack of access to this knowledge poses a significant burden for agricultural planning and decision making. Making such knowledge more accessible can boot up a farmer’s living standard and increase their competitiveness on national and global markets. In this article, we show how we converted several available, although not easily accessible, datasets to RDF, thereby lowering the barrier for data re-usage and integration. We describe the conversion, linking, and publication process as well as use cases, which can be implemented using the farming datasets in Nepal.

    @InProceedings{wi_farming_nepal,
    Title = {Ontology Based Data Access and Integration for Improving the Effectiveness of Farming in Nepal},
    Author = {Suresh Pokharel and Mohamed Ahmed Sherif and Jens Lehmann},
    Booktitle = {Proc. of the International Conference on Web Intelligence},
    Year = {2014},
    Abstract = {It is widely accepted that food supply and quality are major problems in the 21st century. Due to the growth of the world's population, there is a pressing need to improve the productivity of agricultural crops, which hinges on different factors such as geographical location, soil type, weather condition and particular attributes of the crops to plant. In many regions of the world, information about those factors is not readily accessible and dispersed across a multitude of different sources. One of those regions is Nepal, in which the lack of access to this knowledge poses a significant burden for agricultural planning and decision making. Making such knowledge more accessible can boot up a farmer's living standard and increase their competitiveness on national and global markets. In this article, we show how we converted several available, although not easily accessible, datasets to RDF, thereby lowering the barrier for data re-usage and integration. We describe the conversion, linking, and publication process as well as use cases, which can be implemented using the farming datasets in Nepal.},
    Bdsk-url-1 = {http://svn.aksw.org/papers/2014/WI2014_agriNepalData/public.pdf},
    Keywords = {group_aksw MOLE 2014 sys:relevantFor:infai sys:relevantFor:bis sys:relevantFor:geoknow topic_geospatial lehmann sherif},
    Url = {http://svn.aksw.org/papers/2014/WI2014_agriNepalData/public.pdf}
    }

  • A. Ngonga Ngomo, M. A. Sherif, and K. Lyko, “Unsupervised link discovery through knowledge base repair,” in Extended semantic web conference (eswc 2014), 2014.
    [BibTeX] [Abstract] [Download PDF]
    The Linked Data Web has developed into a compendium of partly very large datasets. Devising efficient approaches to compute links between these datasets is thus central to achieve the vision behind the Data Web. Several unsupervised approaches have been developed to achieve this goal. Yet, so far, none of these approaches makes use of the replication of resources across several knowledge bases to improve the accuracy it achieves while linking. In this paper, we present Colibri, an iterative unsupervised approach for link discovery. Colibri allows discovering links between n datasets (n ≥ 2) while improving the quality of the instance data in these datasets. To this end, Colibri combines error detection and correction with unsupervised link discovery. We evaluate our approach on benchmark datasets with respect to the F-score it achieves. Our results suggest that Colibri can significantly improve the results of unsupervised machine-learning approaches for link discovery while correctly detecting erroneous resources.

    @InProceedings{colibriESWC14,
    Title = {Unsupervised Link Discovery Through Knowledge Base Repair},
    Author = {Axel-Cyrille {Ngonga Ngomo} and Mohamed Ahmed Sherif and Klaus Lyko},
    Booktitle = {Extended Semantic Web Conference (ESWC 2014)},
    Year = {2014},
    Abstract = {The Linked Data Web has developed into a compendium of partly very large datasets. Devising efficient approaches to compute links between these datasets is thus central to achieve the vision behind the Data Web. Several unsupervised approaches have been developed to achieve this goal. Yet, so far, none of these approaches makes use of the replication of resources across several knowledge bases to improve the accuracy it achieves while linking. In this paper, we present Colibri, an iterative unsupervised approach for link discovery. Colibri allows discovering links between n datasets (n $\geq$ 2) while improving the quality of the instance data in these datasets. To this end, Colibri combines error detection and correction with unsupervised link discovery. We evaluate our approach on benchmark datasets with respect to the F-score it achieves. Our results suggest that Colibri can significantly improve the results of unsupervised machine-learning approaches for link discovery while correctly detecting erroneous resources.},
    Bdsk-url-1 = {http://svn.aksw.org/papers/2013/ISWC_CHIMERA/public.pdf},
    Keywords = {ngonga sherif lyko simba group_aksw sys:relevantFor:infai sys:relevantFor:bis SIMBA limes},
    Owner = {sherif},
    Timestamp = {2014.03.17},
    Url = {http://svn.aksw.org/papers/2013/ISWC_CHIMERA/public.pdf}
    }
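    Colibri couples link discovery with error detection and repair; the toy check below illustrates only the detection half of that loop: when resources linked across several knowledge bases disagree on a property value, the minority value is flagged as a repair candidate. The data and property are invented:

    from collections import Counter

    # Hypothetical values of the same property for one linked resource,
    # as found in three knowledge bases that were linked to each other.
    linked_values = {"kb1": "1989-11-09", "kb2": "1989-11-09", "kb3": "1998-11-09"}

    majority_value, _ = Counter(linked_values.values()).most_common(1)[0]
    for kb, value in linked_values.items():
        if value != majority_value:
            print(f"{kb}: {value!r} disagrees with majority {majority_value!r}"
                  " -- candidate for correction")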

  • J. J. L. Grange, J. Lehmann, S. Athanasiou, A. G. Rojas, G. Giannopoulos, D. Hladky, R. Isele, A. Ngonga Ngomo, M. A. Sherif, C. Stadler, and M. Wauer, “The geoknow generator: managing geospatial data in the linked data web,” in Proceedings of the linking geospatial data workshop, 2014.
    [BibTeX] [Download PDF]
    @InProceedings{lgd_geoknow_generator,
    Title = {The GeoKnow Generator: Managing Geospatial Data in the Linked Data Web},
    Author = {Jon Jay Le Grange and Jens Lehmann and Spiros Athanasiou and Alejandra Garcia Rojas and Giorgos Giannopoulos and Daniel Hladky and Robert Isele and Axel-Cyrille {Ngonga Ngomo} and Mohamed Ahmed Sherif and Claus Stadler and Matthias Wauer},
    Booktitle = {Proceedings of the Linking Geospatial Data Workshop},
    Year = {2014},
    Bdsk-url-1 = {http://jens-lehmann.org/files/2014/lgd_geoknow_generator.pdf},
    Keywords = {2014 group_aksw group_mole mole ngonga lehmann sherif topic_Lifecycle sys:relevantFor:infai sys:relevantFor:bis sys:relevantFor:lod2 sys:relevantFor:geoknow geoknow lod lod2page peer-reviewed MOLE simba wauer stadler},
    Owner = {jl},
    Timestamp = {2014.04.12},
    Url = {http://jens-lehmann.org/files/2014/lgd_geoknow_generator.pdf}
    }

2013

  • A. Zaveri, J. Lehmann, S. Auer, M. M. Hassan, M. A. Sherif, and M. Martin, “Publishing and interlinking the global health observatory dataset,” Semantic web journal, vol. Special Call for Linked Dataset descriptions, iss. 3, pp. 315-322, 2013.
    [BibTeX] [Abstract] [Download PDF]
    The improvement of public health is one of the main indicators for societal progress. Statistical data for monitoring public health is highly relevant for a number of sectors, such as research (e.g. in the life sciences or economy), policy making, health care, pharmaceutical industry, insurances etc. Such data is meanwhile available even on a global scale, e.g. in the Global Health Observatory (GHO) of the United Nations’s World Health Organization (WHO). GHO comprises more than 50 different datasets, it covers all 198 WHO member countries and is updated as more recent or revised data becomes available or when there are changes to the methodology being used. However, this data is only accessible via complex spreadsheets and, therefore, queries over the 50 different datasets as well as combinations with other datasets are very tedious and require a significant amount of manual work. By making the data available as RDF, we lower the barrier for data re-use and integration. In this article, we describe the conversion and publication process as well as use cases, which can be implemented using the GHO data.

    @Article{zaveri-gho,
    Title = {Publishing and Interlinking the Global Health Observatory Dataset},
    Author = {Amrapali Zaveri and Jens Lehmann and S{\"o}ren Auer and Mofeed M. Hassan and Mohamed A. Sherif and Michael Martin},
    Journal = {Semantic Web Journal},
    Year = {2013},
    Number = {3},
    Pages = {315-322},
    Volume = {Special Call for Linked Dataset descriptions},
    Abstract = {The improvement of public health is one of the main indicators for societal progress. Statistical data for monitoring public health is highly relevant for a number of sectors, such as research (e.g. in the life sciences or economy), policy making, health care, pharmaceutical industry, insurances etc. Such data is meanwhile available even on a global scale, e.g. in the Global Health Observatory (GHO) of the United Nations's World Health Organization (WHO). GHO comprises more than 50 different datasets, it covers all 198 WHO member countries and is updated as more recent or revised data becomes available or when there are changes to the methodology being used. However, this data is only accessible via complex spreadsheets and, therefore, queries over the 50 different datasets as well as combinations with other datasets are very tedious and require a significant amount of manual work. By making the data available as RDF, we lower the barrier for data re-use and integration. In this article, we describe the conversion and publication process as well as use cases, which can be implemented using the GHO data. },
    Bdsk-url-1 = {http://www.semantic-web-journal.net/system/files/swj433.pdf},
    Date-modified = {2013-07-11 19:43:06 +0000},
    Ee = {http://dx.doi.org/10.3233/SW-130102},
    Keywords = {2013 MOLE group_aksw zaveri martin lehmann auer hassan sherif sys:relevantFor:infai sys:relevantFor:bis sys:relevantFor:lod2 lod2page peer-reviewed gho},
    Owner = {micha},
    Url = {http://www.semantic-web-journal.net/system/files/swj433.pdf}
    }

  • A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann, “User-driven quality evaluation of dbpedia,” in Proceedings of 9th international conference on semantic systems, i-semantics ’13, graz, austria, september 4-6, 2013, 2013, pp. 97-104.
    [BibTeX] [Abstract] [Download PDF]
    Linked Open Data (LOD) comprises of an unprecedented volume of structured datasets on the Web. However, these datasets are of varying quality ranging from extensively curated datasets to crowdsourced and even extracted data of relatively low quality. We present a methodology for assessing the quality of linked data resources, which comprises of a manual and a semi-automatic process. The first phase includes the detection of common quality problems and their representation in a quality problem taxonomy. In the manual process, the second phase comprises of the evaluation of a large number of individual resources, according to the quality problem taxonomy via crowdsourcing. This process is accompanied by a tool wherein a user assesses an individual resource and evaluates each fact for correctness. The semi-automatic process involves the generation and verification of schema axioms. We report the results obtained by applying this methodology to DBpedia. We identified 17 data quality problem types and 58 users assessed a total of 521 resources. Overall, 11.93\% of the evaluated DBpedia triples were identified to have some quality issues. Applying the semi-automatic component yielded a total of 222,982 triples that have a high probability to be incorrect. In particular, we found that problems such as object values being incorrectly extracted, irrelevant extraction of information and broken links were the most recurring quality problems. With this study, we not only aim to assess the quality of this sample of DBpedia resources but also adopt an agile methodology to improve the quality in future versions by regularly providing feedback to the DBpedia maintainers.

    @InProceedings{zaveri2013,
    Title = {User-driven Quality Evaluation of DBpedia},
    Author = {Amrapali Zaveri and Dimitris Kontokostas and Mohamed Ahmed Sherif and Lorenz B\"uhmann and Mohamed Morsey and S\"oren Auer and Jens Lehmann},
    Booktitle = {Proceedings of 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013},
    Year = {2013},
    Pages = {97-104},
    Publisher = {ACM},
    Abstract = {Linked Open Data (LOD) comprises of an unprecedented volume of structured datasets on the Web. However, these datasets are of varying quality ranging from extensively curated datasets to crowdsourced and even extracted data of relatively low quality. We present a methodology for assessing the quality of linked data resources, which comprises of a manual and a semi-automatic process. The first phase includes the detection of common quality problems and their representation in a quality problem taxonomy. In the manual process, the second phase comprises of the evaluation of a large number of individual resources, according to the quality problem taxonomy via crowdsourcing. This process is accompanied by a tool wherein a user assesses an individual resource and evaluates each fact for correctness. The semi-automatic process involves the generation and verification of schema axioms. We report the results obtained by applying this methodology to DBpedia. We identified 17 data quality problem types and 58 users assessed a total of 521 resources. Overall, 11.93\% of the evaluated DBpedia triples were identified to have some quality issues. Applying the semi-automatic component yielded a total of 222,982 triples that have a high probability to be incorrect. In particular, we found that problems such as object values being incorrectly extracted, irrelevant extraction of information and broken links were the most recurring quality problems. With this study, we not only aim to assess the quality of this sample of DBpedia resources but also adopt an agile methodology to improve the quality in future versions by regularly providing feedback to the DBpedia maintainers.},
    Bdsk-url-1 = {http://svn.aksw.org/papers/2013/ISemantics_DBpediaDQ/public.pdf},
    Date-modified = {2015-02-06 06:56:39 +0000},
    Ee = {http://doi.acm.org/10.1145/2506182.2506195},
    Keywords = {zaveri sherif morsey buemann kontokostas auer lehmann group_aksw sys:relevantFor:infai sys:relevantFor:bis sys:relevantFor:lod2 lod2page 2013 event_I-Semantics dbpediadq sys:relevantFor:geoknow topic_QualityAnalysis dataquality MOLE buehmann},
    Owner = {soeren},
    Timestamp = {2013.06.01},
    Url = {http://svn.aksw.org/papers/2013/ISemantics_DBpediaDQ/public.pdf}
    }
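    The semi-automatic part of the methodology verifies generated schema axioms against the extracted data; as a loose analogue, the snippet below checks a hypothetical datatype expectation (a date-valued property) and reports violating triples. Subjects and values are invented:

    from datetime import date

    # Hypothetical extracted (subject, birth-date value) pairs.
    pairs = [("dbr:Ada_Lovelace", "1815-12-10"),
             ("dbr:Alan_Turing", "23 June 1912"),   # not a valid xsd:date lexical form
             ("dbr:Grace_Hopper", "1906-12-09")]

    def violates_date_axiom(value):
        """True if the value does not parse as an ISO 8601 xsd:date."""
        try:
            date.fromisoformat(value)
            return False
        except ValueError:
            return True

    for subject, value in pairs:
        if violates_date_axiom(value):
            print(f"{subject}: {value!r} has a high probability of being incorrect")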