OPTIMA: On-Line Balancing of Range-Partitioned Data with Imperfect Partitioning Vector

OPTIMA: On-Line Balancing of Range-Partitioned Data with Imperfect Partitioning Vector International Journal of Computational Linguistics Research Djahida Belayadi, Khaled-Walid Hidouci, Khadidja Midoun 9 4 2018 10.6025/jcl/2018/9/4/135-147 http://www.dline.info/jcl/fulltext/v9n4/jclv9n4_1.pdf Range query has a crucial role in large-scale data analysis. Unfortunately, the performance may be severely degraded by data skew. Such problem is often faced in large-scale parallel databases, peer-to-peer (P2P) systems as well as in Cloud computing. State-of-the-art methods designed to handle this problem offer significant improvements over naive implementations. However, performance could be further improved by reducing the cost of global skew knowledge and broadcasting. Ganesan, Bawa and Garcia-Molina proposed a load-balancing algorithm that guarantees a good ratio between the maximum and minimum loads among nodes. However, their algorithm requires global max-min load information to use local load balancing operations. Global load information can be found with O (log n) messages. In order to reduce this cost, we propose OPTIMA, a novel online load balancing approach for range partitioned data. Whenever a partition becomes overloaded, data transfers are performed, in background, from the most loaded nodes to the least loaded ones as in Ganesan et al., work. As a result, the partition boundaries and data sizes change. The key point of our proposal is the imperfect knowledge of the global load information (partition statistics). We introduce the concept of â€œImperfect Partitioning Vectorâ€ (IPV), where, both nodes and clients have an approximate information about the load distribution. They can nevertheless locate any data with almost the same efficiency as using exact partition statistics. Furthermore, maintaining load distribution statistics do not require exchanging additional messages or maintaining a data structure as opposed to the cost of efficient solutions from the state-of-art (which require at least O (log n) messages).