Processing Exact Results for Queries over Data Streams

This research thread focuses on processing exact results for join queries over high-speed data streams using limited resources, and proposes several novel techniques for processing join queries that incorporate secondary storage and non-dedicated computers. Existing approaches for stream joins either (a) deal with memory limitations by shedding load, and therefore cannot produce exact or highly accurate results when stream tuples arrive at time-varying rates, or (b) suffer from large I/O overhead due to random disk accesses. The proposed techniques exploit the high bandwidth of a disk subsystem by rendering the data access pattern largely sequential, eliminating small, random disk accesses. We investigate an I/O-efficient algorithm to process hybrid join queries, which join a fast, time-varying or bursty data stream with a persistent disk relation. Such a hybrid join is the crux of a number of common transformations in an active data warehouse. The proposed scheme reduces the response time of output results by exploiting spatio-temporal locality within the input stream, and minimizes disk overhead through disk-I/O amortization.
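
The idea of disk-I/O amortization in a hybrid join can be illustrated with a minimal sketch. This is not the algorithm from the publications below. It assumes a hash-partitioned disk relation, and the names `partition_of`, `hybrid_join`, and `read_partition` are hypothetical. Stream tuples are buffered into a batch and grouped by partition, so each needed partition is read once, sequentially, and shared by every buffered tuple that falls in it.

```python
from collections import defaultdict


def partition_of(key, num_partitions):
    """Map a join key to a partition of the disk relation (assumed scheme)."""
    return hash(key) % num_partitions


def hybrid_join(stream_batch, read_partition, num_partitions):
    """Join a batch of (key, payload) stream tuples with a partitioned relation.

    read_partition(p) stands in for one sequential disk read that returns
    the rows of partition p as a list of (key, value) pairs.
    """
    # Group the buffered stream tuples by the partition their key falls in.
    pending = defaultdict(list)
    for key, payload in stream_batch:
        pending[partition_of(key, num_partitions)].append((key, payload))

    results = []
    partitions_read = 0
    # Read only partitions with pending tuples, once each, in partition order.
    for p in sorted(pending):
        index = defaultdict(list)
        for key, value in read_partition(p):
            index[key].append(value)
        partitions_read += 1
        # Every pending tuple of this partition shares the single read.
        for key, payload in pending[p]:
            for value in index.get(key, []):
                results.append((key, payload, value))
    return results, partitions_read


# Toy in-memory stand-in for the persistent disk relation.
NUM_PARTITIONS = 4
relation = {k: f"dim-{k}" for k in range(8)}


def read_partition(p):
    return [(k, v) for k, v in relation.items()
            if partition_of(k, NUM_PARTITIONS) == p]


batch = [(1, "a"), (5, "b"), (1, "c"), (9, "d")]
results, reads = hybrid_join(batch, read_partition, NUM_PARTITIONS)
```

In this toy run all four stream tuples map to the same partition, so one read serves the whole batch. A larger batch amortizes each read over more tuples, but each tuple waits longer for its result.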

We propose an algorithm to parallelize a stream join operator over a shared-nothing system. The proposed algorithm distributes the processing load across a number of independent, non-dedicated nodes, based on a fixed or predefined communication pattern; dynamically maintains the degree of declustering in order to minimize communication and processing overheads; and provides mechanisms for reducing storage and communication overheads while scaling over a large number of nodes. We present experimental results showing the efficacy of the proposed algorithms.
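
As a rough illustration of declustering a stream join, the sketch below routes tuples by join key to independent nodes. Each node runs a symmetric hash join on its share of the keys. It is a simplified assumption, not the proposed algorithm: the classes `JoinNode` and `Router` are hypothetical, sliding windows are omitted, and the degree of declustering is changed by naively re-hashing all stored state rather than by an adaptive policy.

```python
from collections import defaultdict


class JoinNode:
    """One shared-nothing node running a symmetric hash join on its keys."""

    def __init__(self):
        self.tables = {"R": defaultdict(list), "S": defaultdict(list)}

    def insert(self, side, key, payload):
        other = "S" if side == "R" else "R"
        # Probe the opposite stream's table first, then store the new tuple.
        partners = self.tables[other].get(key, [])
        if side == "R":
            matches = [(key, payload, s) for s in partners]
        else:
            matches = [(key, r, payload) for r in partners]
        self.tables[side][key].append(payload)
        return matches


class Router:
    """Routes tuples of streams R and S to nodes by hashing the join key."""

    def __init__(self, degree):
        self.nodes = [JoinNode() for _ in range(degree)]

    def route(self, side, key, payload):
        node = self.nodes[hash(key) % len(self.nodes)]
        return node.insert(side, key, payload)

    def set_degree(self, degree):
        """Change the degree of declustering by re-hashing stored state."""
        old_nodes = self.nodes
        self.nodes = [JoinNode() for _ in range(degree)]
        for node in old_nodes:
            for side, table in node.tables.items():
                for key, payloads in table.items():
                    target = self.nodes[hash(key) % degree]
                    target.tables[side][key].extend(payloads)


router = Router(2)
output = []
output += router.route("R", 7, "r1")   # nothing to join with yet
output += router.route("S", 7, "s1")   # matches r1
router.set_degree(3)                   # scale out; state moves with its key
output += router.route("R", 7, "r2")   # still finds s1 on the new node
```

Because both streams are routed by the same hash of the join key, matching tuples always meet on one node and no results are lost or duplicated when the degree changes. The full re-hash is the costly part that a practical scheme would try to avoid.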

Related Publications:

L. Chakraborty and Ajit Singh, "A Partition-based Approach to Support Streaming Updates over Persistent Data in an Active Data Warehouse", 23rd IEEE International Parallel and Distributed Processing Symposium, IEEE Computer Society, Rome, Italy, May 2009
L. Chakraborty and Ajit Singh, "Processing Exact Results for Sliding Window Joins Over Time-Sequence, Streaming Data Using a Disk Archive", First Asian Conference on Intelligent Information and Database Systems, IEEE Computer Society, April 2009
L. Chakraborty, A. Singh, and Sagar Naik, "Scalable Update Propagation in Partially Replicated Disconnected Client Server Databases", International Conference on Enterprise Information Systems, Cyprus, Greece, May 2005
L. Chakraborty and Ajit Singh, "Streaming Updates over Persistent Data in an Active Data Warehouse", Submitted to a Journal
L. Chakraborty and Ajit Singh, "Processing Exact Results for Sliding Window Joins over Data Streams Using Disk Storage", Submitted to a Journal
L. Chakraborty and Ajit Singh, "Adaptive Parallelism for Stream Joins in a Shared Nothing System", Under Preparation
L. Chakraborty and Ajit Singh, "A Disk-based Adaptive Approach to Memory Limited Computation of Exact Results for Windowed Stream Joins", Under Preparation