<?xml version="1.0" encoding="UTF-8"?>
<record>
  <title>A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering</title>
  <journal>Journal of E-Technology</journal>
  <author>Arash Heidarian, Michael J. Dinneen</author>
  <volume>6</volume>
  <issue>3</issue>
  <year>2015</year>
  <doi></doi>
  <url>http://www.dline.info/jet/fulltext/v6n3/v6n3_3.pdf.pdf</url>
  <abstract>The increasing number of textual documents from diverse sources, such as websites (e.g. social
networks, news, magazines, blogs and medical recommendation websites), publications, articles and medical
prescriptions, leads to massive amounts of complex data every day. This phenomenon has caused many researchers to focus on
analysing the content and measuring the similarities among the documents and texts to cluster them. One popular method
to measure the similarity between documents is to represent the documents as vectors and measure the similarity among
them based on the angle or Euclidean distance between each pair. By only considering these two criteria for similarity
measurement, we may miss important underlying similarities in this area. We propose a new method, TS-SS, to measure the
similarity level among documents, in such a way that one hopes to better understand which documents are more (or less)
similar. This similarity level can be used as a handy measure for clustering and recommendation systems for documents. It
can also be used to show the top n documents most similar to a particular document or a search query. Our study gives insights into
the drawbacks of geometrical and non-geometrical similarity measures and provides a novel method that combines other
geometric criteria to measure the similarity level among documents from a new perspective. We apply Euclidean
distance, cosine similarity and our new method to four labelled datasets. Finally, we report how these three geometrical
similarity measures perform in terms of similarity level and clustering purity using four evaluation techniques. The
evaluation results show that our new model outperforms the other measures.</abstract>
</record>
