
<?xml version="1.0" encoding="UTF-8"?>
<record>
  <title>OPTIMA: On-Line Balancing of Range-Partitioned Data with Imperfect Partitioning Vector</title>
  <journal>International Journal of Computational Linguistics Research</journal>
  <author>Djahida Belayadi, Khaled-Walid Hidouci, Khadidja Midoun</author>
  <volume>9</volume>
  <issue>4</issue>
  <year>2018</year>
  <doi> 10.6025/jcl/2018/9/4/135-147</doi>
  <url>http://www.dline.info/jcl/fulltext/v9n4/jclv9n4_1.pdf</url>
  <abstract>Range query has a crucial role in large-scale data analysis. Unfortunately, the performance may be severely
degraded by data skew. Such problem is often faced in large-scale parallel databases, peer-to-peer (P2P) systems as well as in
Cloud computing. State-of-the-art methods designed to handle this problem offer significant improvements over naive implementations.
However, performance could be further improved by reducing the cost of global skew knowledge and broadcasting.
Ganesan, Bawa and Garcia-Molina proposed a load-balancing algorithm that guarantees a good ratio between the
maximum and minimum loads among nodes. However, their algorithm requires global max-min load information to use local
load balancing operations. Global load information can be found with O (log n) messages. In order to reduce this cost, we
propose OPTIMA, a novel online load balancing approach for range partitioned data. Whenever a partition becomes overloaded,
data transfers are performed, in background, from the most loaded nodes to the least loaded ones as in Ganesan et al.,
work. As a result, the partition boundaries and data sizes change. The key point of our proposal is the imperfect knowledge of
the global load information (partition statistics). We introduce the concept of â€œImperfect Partitioning Vectorâ€ (IPV), where,
both nodes and clients have an approximate information about the load distribution. They can nevertheless locate any data
with almost the same efficiency as using exact partition statistics. Furthermore, maintaining load distribution statistics do not
require exchanging additional messages or maintaining a data structure as opposed to the cost of efficient solutions from the
state-of-art (which require at least O (log n) messages).</abstract>
</record>
