- Spatiotemporal Join processing for Indoors Moving Objects
Contact tracing (CT) is a category of analytics that monitors whether two objects have come to close distance of (i.e., intersects with) each other for a certain period of time. Spatiotemporal join operator is essential to such analytics, where two sets of data streams are being checked for intersection in spatial and/or temporal dimension in real-time for trajectory data.
The nature of the aforementioned analytics category and its application require processing of huge number of users’ trajectories. This poses a computational challenge as well as a a privacy concern, since the trajectory data of users need to be collected and co-located for contact tracing processing.
This work approaches the above problem by devising efficient processing schemes of such spatiotemporal joins in an energy consumption aware approach due to the typical battery restrictions of mobile devices, while protecting users' privacy.
- DCS: Detection of Correlated Streams
The DCS framework detects correlations in pairs of data streams and timeseries by employing caching concepts, scheduling techniques, and enforcing policies to navigate the exploration task space efficiently.
On-line analytics processing (OLAP) tasks are vital for the field of data processing due to the increasing demand for real-time analysis of large volumes of timeseries or data streams that are produced at high velocity. Data analytics help data scientists in their daily tasks, where they need tools to search for and detect patterns and relationships in the vast data they explore.
An important indicator for finding such patterns is to find the correlation between pairs of timeseries or data streams. The correlation can be used to finding similarity (or dissimilarity) measures, running threshold queries, or reducing the size of the data, yet, preserving some of its characteristics.
In this research, we try to optimize the task of detecting correlation between data streams and timeseries. We do this through employing scheduling techniques, such that the exploration task is more efficient and more insightful. The three angles of this research are the use of (1) incremental sliding-window computation of aggregates, to avoid unnecessary re-computations, (2) intelligent scheduling of computation steps and operations, driven by a utility function, and (3) an exploration policy that tunes the utility function based on some observed data insights.
- Environmentally Aware Urban Analytics
IoT is a great vehicle for enabling solutions of problems in the connected environment that surrounds us (i.e., smart homes and smart cities), but only recently have the use of sensors and IoT has been proposed to address issues related to energy efficiency. Our hypothesis is that data processing and decision making needs to be carried out at the network edge, specifically as close to the physical system as possible, where data are generated and used, in order to produce results in real time and make sure the data is not exposed to privacy and security risks. To this end, we propose to leverage scheduling principles and statistical techniques in the context of two applications, namely aiming to reduce duty cycle of HVAC systems in smart homes and to mitigate road congestion in a smart cities. The common goal in these two aims is the reduction of energy consumption and reduction of atmospheric pollution.
To achieve our first aim, we propose intelligent scheduling of the duty cycles of HVAC systems in residential buildings. Our solution combines linear and polynomial regression enabled estimator that drives the calculations about the amounts of time thermally conditioned air should be supplied to each room. The output from our estimator is fed into our scheduler based on integer linear programming to decrease the duty cycle of the home’s HVAC systems. We evaluate the effectiveness and efficiency of our HVAC solution with a dataset collected from several residential houses in the state of Pennsylvania.
To achieve the second aim, we propose the concept of virtual bus lanes, that combines on demand creation of bus lanes in conjunction with dynamic control of traffic lights. Moreover, we propose to guide drivers through less congested routes using bulletin boards that provide them information in real time for such routes. Our methods are anchored to priority scheduling, incremental windowed-based aggregation and shortest path first Dijkstra’s algorithm. We evaluate the effectiveness and efficiency of our virtual bus lanes solution with a real dataset from the city of Beijing, China and a twenty four hours traffic scenario from the city of Luxembourg.
Course Work Projects
- Simulator for a modified PowerPc 604 and 620 Architectures
The goal was to design, implement, and evaluate the performance of a dynamically scheduled processor. Thus, we implemented the simulator using Tomasulo algorithm with renaming registers and reordering buffers. Also, we did implement a dynamic branch prediction using a target buffer.
- Simulators for cache coherence protocols (MSI, MESI) in CMPs
The goal was to build a CMP with dynamically configured cores, and each core has its own L1 private cache, and they all share a unified L2 cache. The protocols are writing back with invalidation. We implemented the simulator and evaluated the performance of each protocol.
- Simple Remote Procedure Call System
In this project, we were asked to design, implement, and evaluate the performance of a sRPC system. We implemented a client, server, and a port-mapper. We addressed issues related to parameter passing, binding, exception handling, call semantics, performance and data representation. We achieved server high availability and load balancing through replication and name resolution respectively.
- MiniGoogle: Document Indexing and Querying
The main objective of this project is to design and implement a basic data-intensive application to index and search large documents. More specifically, the goal is to design a simple search engine, referred to as tiny-Google, to retrieve documents relevant to simple search queries submitted by users. We did implement a replicated and reliable client/server model that consists of: the client, the server (has the indexing and querying masters), the helpers (for the mapping and reducing), and the name-server (for the name resolution).
- Simplified File Transfer Protocol System
The purpose of a file transfer protocol is to enable the transfer of files between machines, typically under the command of a user. Several issues need to be addressed in the design of a file transfer protocol, including dealing with differences in file name conventions, text and data representation, and directory structure. Furthermore, the protocol must ensure reliable transfer of files from one system to another. We did implement the system in a layered fashion with a replicated and load balanced servers and name-servers. Also, we did implement an error simulation module to introduce unreliability to the medium. We used the Go-Back-N as a sliding window protocol. Consequently, we did conduct a thorough analysis of the performance of the system with multiple experiments. Those involved different packet error rates, different packet drop rates, and different retransmission timeouts.
- myTRC: my Transactional Row Column store DBMS
The objective of the project is to develop the Transactional Row Column store myTRC that efficiently supports concurrent execution of OLTP (i.e., transactions) and OLAP (i.e., aggregate queries) workloads. myTRC provides limited transactional support. Limited support means that it does not support full durability. In addition to serializable and atomic access, it also provides the standard uncontrolled access to files.
- Checkers Solver
Developed a Checkers solver model that competed against my classmates’ engines. The engine was developed using minimax algorithm with alpha beta pruning, and some cutoff techniques using some common heuristics, as well as some of my own developed heuristics, I was able to win the tournament of my class.
- PASTRI: PAirwise STream CoRrelation Identifier (only Time series no data streams)
Developed a system that aims at finding accurate results of correlations between pairs of time series in real-time. By doing this, we minimize the time to the first produced results - i.e. achieve interactive response of the system and keeps refining the results, while the user is busy, exploring a subset of the provided results. Our approach modeled the length of each time series, which is of interest, and used a utility cost function, to identify the highly correlated streams. Our experiments show a speedup of 10 times producing the first results, compared to other systems.