Using Semantic Perimeters with Ontologies to Evaluate the Semantic Similarity of Scientific Papers

Samia Iltache, Catherine Comparot, Malik Si Mohammed, Pierre-Jean Charrel


This paper addresses the use of ontologies to compare scientific texts. It focuses on the abstracts of scientific papers: short, relatively well-structured texts that normally provide enough knowledge for a community of readers to assess the content of the associated papers. The problem is therefore to assess the semantic similarity of two papers by examining their respective abstracts. Because a domain ontology provides a useful way to represent the knowledge of a given domain, this work relies on ontologies of scientific domains. Our process begins with an automatic classification that associates each abstract with its relevant scientific domain, chosen from several candidate domains. The content of the abstract is then represented as a conceptual graph, which is enriched to construct its semantic perimeter. This notion of semantic perimeter allows us to assess the similarity between texts by matching their graphs. Among the many possible application fields of our approach, plagiarism detection is the main one addressed in this paper.
Povzetek (summary): The work in this paper deals with the use of ontologies to compare scientific texts. Plagiarism detection is the main application field addressed in this paper, among the many possible application fields of our approach.
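
As a rough illustration of the idea outlined in the abstract, the sketch below enriches the concepts found in each abstract with neighbouring ontology concepts and then compares the enriched sets. Everything in it is an assumption made for illustration: the toy `ONTOLOGY`, the function names `semantic_perimeter` and `perimeter_similarity`, the hop-based `radius`, and the Jaccard overlap are simple stand-ins for the conceptual-graph construction and graph matching, which the abstract does not specify.

```python
from collections import deque

# Hypothetical toy ontology: concept -> directly related concepts.
# A real domain ontology would be loaded from an external resource.
ONTOLOGY = {
    "ontology": {"knowledge_representation"},
    "knowledge_representation": {"ontology", "conceptual_graph"},
    "conceptual_graph": {"knowledge_representation", "graph_matching"},
    "graph_matching": {"conceptual_graph"},
    "plagiarism": {"text_similarity"},
    "text_similarity": {"plagiarism"},
}


def semantic_perimeter(concepts, ontology, radius=1):
    """Return the concepts plus every ontology neighbour within `radius` hops.

    Assumption: the perimeter is modelled here as a breadth-first
    expansion around the concepts extracted from an abstract.
    """
    seen = set(concepts)
    queue = deque((concept, 0) for concept in concepts)
    while queue:
        concept, depth = queue.popleft()
        if depth == radius:
            continue  # do not expand beyond the chosen radius
        for neighbour in ontology.get(concept, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen


def perimeter_similarity(concepts_a, concepts_b, ontology, radius=1):
    """Jaccard overlap of two perimeters (a stand-in for graph matching)."""
    perimeter_a = semantic_perimeter(concepts_a, ontology, radius)
    perimeter_b = semantic_perimeter(concepts_b, ontology, radius)
    union = perimeter_a | perimeter_b
    return len(perimeter_a & perimeter_b) / len(union) if union else 0.0


if __name__ == "__main__":
    a, b = {"ontology"}, {"conceptual_graph"}
    print(perimeter_similarity(a, b, ONTOLOGY, radius=0))
    print(perimeter_similarity(a, b, ONTOLOGY, radius=1))
```

In this toy setting, two abstracts that share no concept directly can still overlap once their concept sets are enriched, which is the intuition behind comparing perimeters rather than raw terms.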






Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.