Asha Subramanian

Inferencing in the Large: Towards Automation of Semantic Integration and Knowledge Representation of Open Data

Data available in the public domain, especially through open data initiatives such as data.gov, data.gov.in and data.gov.uk, provide useful information on various aspects of government policy and administration. Considerable insight could be derived by semantically integrating such datasets across domains. Semantic integration involves extracting the common domains or themes that explain a collection of datasets, by identifying unique resources for data values and relations among rows of data across these datasets, using known or custom vocabularies and knowledge bases. The natural taxonomy and classification of the entities, instances and properties in the vocabularies allow for extraction of themes relevant to the datasets. Multiple research efforts have addressed the semantic annotation of web tables and CSV tables, which mainly involves interpreting tabular data by linking it to relevant vocabularies; however, they have not focused on the semantic integration of tables. Linking government data is an active research interest. The current process for semantically linking such datasets is largely manual: it involves identifying vocabularies, classes and properties for each dataset by hand, and then creating templates that automate the mapping of the data to the identified vocabularies. Our work presents two models: 1) the generation of semantically linked data for open datasets using vocabularies from the LOD Cloud such as DBpedia, YAGO [6], SKOS and UMBEL, and 2) the representation of the data in an intuitive homegrown Knowledge Representation Framework called MWF (Many Worlds on a Frame), loosely modelled on Kripke Semantics [7]. MWF allows for rich representation of data along two aspects, namely the type hierarchy (is-a) relationship and the containment hierarchy (is-in) relationship, supported by roles and associations, to transform the open datasets into a web of semantically interlinked themes and their associations.
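As a purely illustrative sketch of the two hierarchies mentioned above, the following Python fragment stores one is-a parent and one is-in parent per concept and walks each chain upwards. The names `Concept`, `Frame`, `types_of` and `containers_of`, and the sample concepts, are hypothetical and introduced only for this example; they do not describe the actual MWF data model, which also includes roles and associations not shown here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Concept:
    """Hypothetical node with an is-a parent (type) and an is-in parent (container)."""
    name: str
    is_a: Optional[str] = None   # parent in the type hierarchy
    is_in: Optional[str] = None  # parent in the containment hierarchy


class Frame:
    """Hypothetical store of concepts; not the MWF implementation."""

    def __init__(self) -> None:
        self.concepts: Dict[str, Concept] = {}

    def add(self, name: str, is_a: Optional[str] = None,
            is_in: Optional[str] = None) -> None:
        self.concepts[name] = Concept(name, is_a, is_in)

    def _chain(self, name: str, relation: str) -> List[str]:
        # Follow one relation upwards, stopping at the root or on a cycle.
        chain: List[str] = []
        seen = {name}
        parent = getattr(self.concepts[name], relation)
        while parent is not None and parent not in seen:
            chain.append(parent)
            seen.add(parent)
            node = self.concepts.get(parent)
            parent = getattr(node, relation) if node else None
        return chain

    def types_of(self, name: str) -> List[str]:
        return self._chain(name, "is_a")

    def containers_of(self, name: str) -> List[str]:
        return self._chain(name, "is_in")


# Sample data chosen only for illustration.
frame = Frame()
frame.add("Place")
frame.add("Country", is_a="Place")
frame.add("State", is_a="Place")
frame.add("City", is_a="Place")
frame.add("India", is_a="Country")
frame.add("Karnataka", is_a="State", is_in="India")
frame.add("Bengaluru", is_a="City", is_in="Karnataka")

print(frame.types_of("Karnataka"))       # ['State', 'Place']
print(frame.containers_of("Bengaluru"))  # ['Karnataka', 'India']
```

Keeping the two relations as separate parent links means a single concept can be classified (Karnataka is a State) and located (Karnataka is in India) independently, which is the distinction the is-a and is-in hierarchies draw.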