An Information Content and Set of Common Superconcepts-Based Algorithm to Estimate Similarity between Concepts of Ontologies ()
1. Introduction
Ontologies are manifestations of the common understanding of a domain. His domain is agreed between several agents, whose agreement in turn facilitates reliable, meaningful communications. All these communications in turn lead to other benefits such as interoperability, reuse and sharing [1] [2] . Therefore, communication or sharing of information and knowledge between different concepts has become a fundamental trend in computational linguistics [3] .
One of the most important activities in the study of ontologies is semantic similarity between words. This activity is necessary in many domains such as medicine [4] , learning [5] , large-scale storage systems and data management [6] . Semantic similarity is the set of correspondences between entities (e.g., classes, properties, and predicates) that form ontology. Semantic similarity is also a measure used to calculate the similarity between two concepts. In most existing works, the semantic similarity between two concepts of a same ontology was calculated.
However, for a given domain (e.g., transport, health), there are several ontologies designed whose exploration for the correspondence of the semantic content of concepts is increasingly difficult. Moreover, according to [7] , by using more than one ontology, it provides additional knowledge which improves the study of similarity and solves all cases of semantic similarity measurement whose concepts are not represented in only one ontology.
Given two ontologies in the same domain and two concepts belonging respectively to each of these ontologies, the problem studied here is to find an efficient measure of semantic similarity between these two concepts.
To solve this problem, we propose a method of semantic similarity between concepts from different ontologies.
The main contributions of our study are as follows:
• The proposition of new versions of two existing measures (i.e., Jiang & Conrah and Lin measures) to consider the fact that concepts belong to different ontologies.
• The proposition of a method that considers the Information Content (IC) and the set of Common Superconcepts (CS), which represents the set of concepts of common parents. Indeed, the knowledge revealed by the concepts is used to augment the information already presented in ontologies or taxonomies [8] . This method includes use of the two new measures proposed (Equation (8) and Equation (9)).
The remainder of this paper is organised as follows. Section 2 presents related works. Section 3 presents the problem formulation. Section 4 describes the proposed method. Sections 5 and 6 are related to the evaluated results, and the conclusion and future work respectively.
2. Related Works
Several methods have been proposed to determine semantic similarities. These methods are classified into two types: semantic similarity classification for a single ontology and semantic similarity classification for multiple ontologies [9] . In the first type, semantic similarity is studied for concepts belonging to a same ontology. In the second type, semantic similarity is studied for concepts belonging to different ontologies. Semantic similarity for a single ontology includes four methods: Structure-based, information content-based, feature-based, and hybrid-based methods. While semantic similarity between concepts of different ontologies includes two methods: length-based and function-based methods.
2.1. Semantic Similarity Based on a Single Ontology
2.1.1. Structure-Based Method
The structure-based method considers the shortest path between concepts in an ontology to calculate the semantic similarity. This method calculates the shortest path while the degree of similarity is determined according to the path length. There are various measurements on structure based method which have been used by [10] [11] . [10] [11] proposed a measure, called distance on the power set of nodes in a semantic network. The distance is the average minimum path length over all combinations of nodes between two subsets of nodes. While [12] [13] [14] discuss the depth method which takes into account the depth of the edges connecting two concepts in the ontology structure. This method calculates the depth between the root and the target concept.
2.1.2. Information Content-Based Method
The information content-based method is also known as the corpus-based method. It determines the similarity between two concepts based on the probabilities of each concept in the ontology and the occurrences of words in a given corpus [8] . This knowledge revealed by the corpus is used to augment the information already presented in ontologies or taxonomies. It consists of different measures including the Resnik [15] , Lin measure [16] , Jiang and Conrath measures [17] .
2.1.3. Feature-Based Method
Feature-based methods have attempted to overcome the limitations of structure-based method. Because the taxonomic links in an ontology do not necessarily represent uniform distances. The problem of structure-based method is solved by considering the degree of overlap between the ontological feature sets. As a result, they are more general and can potentially be applied in ontology similarity estimation contexts, where pairs of concepts belong to two different ontologies. A situation in which edge counting methods cannot be applied directly [18] . Thus, unlike edge counting measures which are based on the notion of minimum path distance, feature-based approaches evaluate the similarity between concepts according to their properties [19] .
2.1.4. Hybrid-Based Method
The hybrid method considers different sources of information to calculate the similarity score between concepts. This method is a mixture of several characteristics, namely: attribute similarity, ontology structure, information content and Lowest Common Ancestor (LCA) node depth. This method has been the subject of several studies, including [20] which proposed a hybrid method that combines term-based and concept-based methods. These measures concern structure and contextual relevance measures. The major advantage of this approach is that if the knowledge of an information source is inadequate, it can be derived from other sources [21] . Thus, the quality of the similarity measure would be improved. Among the works based on this method, we can cite [22] and [23] .
The information content method is the most widely used for estimating the similarity between two concepts in the same ontology. However, two concepts from multiple ontologies can be similar.
2.2. Semantic Similarity Based on Multiple Ontologies
The last ten years, the increase in the number of resources on the web has led researchers to introduce methods that calculate the similarity between concepts belonging to several ontologies.
These methods are:
• Structure-based method.
• Feature-based method.
2.2.1. Structure-Based Method
These methods have been proposed to evaluate the similarity of concepts by exploiting different knowledge. Some of these methods have been adapted to the biomedical domain to study clinical information [4] . Works in [4] presented and analysed methods to determine their advantages and disadvantages. Using an input such as SNOMED CT [24] , a new method based on the exploitation of the taxonomic structure of a biomedical ontology was proposed. This proposed similarity method achieves a similar level of precision as corpus-based approaches, while retaining the low computational complexity and lack of constraints of path-based method.
2.2.2. Feature-Based Method
Works in [24] proposed a method for measuring semantic similarity between two concepts c1 and c2 that belong to two different ontologies, each of which models knowledge from a different perspective. This method is based on the evaluation of the union of the set of superconcepts of c1 and c2 in each ontology and on the search for equivalences between them. The set of shared superconcepts for c1 belonging to ontology O1 and c2 to ontology O2 is composed of the superconcepts of c1 and c2 having the same label, as well as the subsumers of these equivalent superconcepts. Three different sets are highlighted:
• The set of superconcepts defined by Equation (1) and Equation (2) were proposed in [24]
(1)
(2)
The set of superconcepts of a given concept ci is all the concepts that are linked to ci, from the root to this concept ci, including ci itself.
• The set of terminologically Equivalent Superconcepts (ES) in
defined by were proposed in [24] :
(3)
In Equation (3),
means terminological correspondence.
• Finally, the set of Common Superconcepts (CS) in
defined by were proposed in [24] :
(4)
: The set of superconcepts of concept c1 in concept hierarchy
of concepts
of ontology O1 (Equation (1)).
: The set of superconcepts of concept c2 in hierarchy
of concepts
of ontology O2 (Equation (2)).
The set of common superconcepts (CS) in
is composed of the elements in ES and all superconcepts of the elements in ES (Equation (3) and Equation (4)). Unlike structure-based methods for calculating similarity, this similarity method considers all parents. It also evaluates sets of superconcepts instead of paths. This method exploits the taxonomic structure of the biomedical ontology. Therefore, it does not consider taxonomic depths such as information content [24] [25] .
3. Problem Specification
In this section, the assumptions, notations, and definitions of the problem are given.
3.1. Problem Statement
Generally, for a given domain, several ontologies are relevant to each other. In many works most similarity methods did not support more than one input ontology. However, for the same domain, some concepts belonging to different ontologies are semantically equivalent [7] . Therefore, the problem is to estimate a similarity measure between two concepts belonging to different ontologies.
3.2. Assumption and Notation
We make the following assumptions:
• Assumption 1: All ontologies belong to the same domain. However, the two concepts studied belong to two different ontologies.
• Assumption 2: The concepts whose similarity we want to estimate have at least two terminologically equivalent superconcepts.
3.3. Problem Formulation
Given: Two ontologies O1 and O2, two concepts c1 and c2 respectively of O1 and O2.
Goal: Estimate a similarity measure between two concepts c1 and c2 from multiple ontologies.
4. Proposed Method
This section presents our method to solve the semantic similarity problem.
Consider the set of ontologies O = {O1, O2}. Let c1 and c2 be two concepts such that
and
. The algorithm determines the semantic similarity of two concepts c1 and c2 from two different ontologies O1 and O2 respectively. Indeed, the algorithm uses the set of common superconcepts, information content IC values and terminologically equivalent superconcepts ES.
Information content (IC) is used in our method because it determines the similarity between two concepts based on the probabilities of each concept in the ontology and the occurrences of words in a given corpus [8] . This knowledge revealed by the corpus is used to augment the information already presented in ontologies or taxonomies.
When we refer to ontologies O1 and O2, their common superconcepts
is equal to
. Let Q be the sum of the information content of each concept belonging to the CS such that:
(5)
• In [21] , the expression for the semantic similarity measure between two concepts c1 and c2 in a single ontology according to Lin and Jiang & Conrath is represented respectively by Equation (6) and Equation (7) were proposed in [26] :
(6)
(7)
where:
• IC(x) represents the information content of the concept x.
• In our work,
is equal to
.
• The set of common superconcepts CS is defined as the most specific superconcepts (the one with the depth) that define the path between the pair of evaluated concepts. It is the union of the set of superconcepts
, of concept c1, and that of
of concept c2.
•
represents the sum of the information content of all common parents along the paths for the concepts whose similarity is to be calculated in a single ontology.
To calculate the similarity of concepts from two ontologies, it is necessary to consider the most specific superconcepts which define the path between the pair of concepts being evaluated (in our case, c1 and c2). Our work proposes integrating the information content of Q superconcepts into Equations (6) and (7). It results our proposed measures through Equation (8) and Equation (9):
(8)
(9)
Description of Our Algorithm
The literature review revealed that in [27] , an algorithm can determine the semantic similarity between two concepts c1 and c2 from the same ontology. This algorithm is based on the information content of concepts, therefore by rechecking the path between the concepts. However, the formulation of this algorithm does not allow to determine the semantic similarity of two concepts from different ontologies, which is addressed here. In other words, using the proposed algorithm in [27] , finding the semantic similarity between two concepts from two ontologies is almost impossible. Because if the information content alone is used for the determination of the semantic similarity in two different ontologies, it will not succeed in establishing the taxonomic link between the two concepts studied.
Recall that [24] proposes a measure based on the exploitation of the taxonomic structure of a biomedical ontology. The algorithm chooses two concepts c1 and c2 respectively in the ontologies O1 and O2. These two concepts are the objects of similarity measurement research.
Our algorithm consists of three steps. Indeed, the algorithm chooses two concepts c1 and c2 respectively in the ontologies O1 and O2. These two concepts are the objects of similarity measurement research. The proposed algorithm consists of 4 main steps:
• Step 1 (Refer to Line 3) calculates the information content of a concept. For each concept ci belonging to the set of concepts C or C', the information content of ci is then calculated from the number of descendants of ci and the total number of concepts Max in ontology O.
• Step 2 (lines 4 to 6) involves extracting for concepts c1 and c2, the set of superconcepts
, and
, the set of terminologically equivalent superconcept (ES), then the set of common superconcepts (CS).
• Step 3 (Line 7) calculates Q, which is the sum of the information content IC of each concept in the set of common superconcepts CS.
• Step 4 (line 8) calculates the semantic similarity between concepts c1 and c2 according to Equation (8) and Equation (9).
Algorithm 1 allows us to provide experimental results to support our approach in the next section of our paper (Section 5.1).
5. Performance Evaluation
We conducted a simulation to compare the efficiency of the methods. Python was used as the programming language during the simulation. This simulation
Algorithm 1. SemSim_SCS_c1, c2.
was performed on a computer equipped with a CPU, i7-2620M @ 2.70 GHz RAM, and 128 GB SSD with a windows 10 operating system.
5.1. Simulation Process
The simulation process followed the same steps as those described in the literature. Lin and Jiang & Conrath proposed a comparison of existing measures on 15 pairs of concepts from two reference datasets [26] . The simulation of their proposal confirmed that some concepts are equivalent or roughly equivalent. For example, the comparison shows that for concepts such as cheese and ice cream, eye and chin, lung and liver, the similarity value is the same for the existing and proposed measures, because the concepts are subsumed by numerous super-concepts.
In our simulation, in addition to
for ontologies O1 and O2, we considered the Least Common Subsummer (LCS) of the concepts defined in Tables 1-3. The LCS represents the superconcept whose maximum depth defines the path between the pair of concepts evaluated (Figure 1). These are sLCS for ontology O1 and rLCS for ontology O2. Next, we compared the pairs of concepts derived from the descendants of s3 and r3.
Table 1. Comparison of proposed and existing measures for subsumed.
Table 2. Comparison of proposed and existing measures for similars pairs.
Figure 1. Ontologies 01 and 02 with their concepts.
Table 3. Comparison of proposed and existing measures for non-similar pairs.
5.2. Results
After the simulation, the results are as follows. The simulation performed obeys the condition of Section 5.1, which is
.
1) The similarity measure is calculated using all concepts that have a relationship in each ontology with each of the equal concepts including s3 in the ontology O1 and r3 in ontology O2. Thus, in the ontology O1 the concepts concerned are:
In ontology O2:
2) Each concept of Tab_sim_O1 is associated with Tab_sim_O2 for the calculation of the similarity. First, the first concept s4 of Tab_sim_O1 is associated with the first concept r4 of Tab_sim_O2. After r4 of Tab_sim_O2, s4 is always associated with r11 of Tab_sim_O2 to calculate the similarity. The same process is repeated between s4 of Tab_sim_O1 and all the other concepts of Tab_sim_O2, until the last concept of Tab_sim_O2. After s4, the next concept s10 of Tab_sim_O1 is chosen to perform the same process.
3) After having traversed all the concepts of Tab_sim_O1 and Tab_sim_O2, the various results of the similarity calculation are compared. The results of the semantic similarities of each pair were calculated. Then, the highest value is kept for examination [27] .
6. Discussion
In Table 1, all concepts are subsumed by numerous super concepts, except the last concepts of each ontology. This allows our proposed measure to provide good results. The pairs (c4, r10), (c4, c3) and (c1, r11) have the same similarity value of Lin and JC (0.195925 for Lin and 0.189355 for JC), while the similarity values of our approach SCS_Lin and SCS_JC of these pairs vary and are larger than those reported in the literature.
In Table 2, the pairs have the same similarity values for the proposed and the existing measure: (s24, r22) and (s21, r22) whose values are [0.195059 for Lin, 0.188941 for J & C, 0.684695 for SCS_Lin, 0.647065 for SCS_JC]. The same is true for the pairs (s26, r26) and (s16, r26) whose values are [0.186517 for Lin, 0.184656 for J & C, 0.184656 for SCS_Lin, 0.642780 for SCS_JC]. Our measure shows better results of the similarity measure for concepts that are subsumed by many superconcepts.
In Table 3, all values of similarity measures for the proposed and the existing measure are different for all pairs. However, the proposed measures are higher than the existing measures.
The different comparison graphs of the concepts explained above are illustrated in Figures 2-4. It results that the similarity values from our measures
Figure 2. 1st round of comparison of proposed and existing measures.
Figure 3. 2nd round of comparison of proposed and existing measures.
Figure 4. 3rd round of comparison of proposed and existing measures.
(SCS_Lin and SCS_J & C) are better those existing (Lin and Jin). These different results lead us to the conclusion that our method is more efficient to calculate the semantic similarity between two concepts belonging to two different ontologies.
7. Conclusions
The problem studied in this paper is to find the semantic similarity between the concepts of two different ontologies. Existing works such as [27] have proposed an approach to calculate the semantic similarity of two concepts in the same ontology. However, several concepts are from different ontologies.
To overcome this difficulty, we proposed:
• A hybrid method based on the method of [24] and [27] . Our method is based on information content and set of common superconcepts. This method uses our proposed versions of existing measures.
The simulation confirmed that this method is more effective in solving the problem studied.
According to [2] by using more than one ontology, it provides additional knowledge that improves the similarity study and solves all cases where the terms are not represented in an ontology. Therefore, our method is an essential tool that can be used by ontology developers to evaluate several ontologies in the decision process.
Our method can provide semantic similarity between only two concepts from two ontologies.
• However, taking a lot of concepts from a lot of ontologies as input can be tricky to solve. So, future work could explore this issue.