Last Updated: January 2023. Note that thesis and project lists are subject to change.
Are you an outstanding undergraduate student looking to see what research is all about? Or are you perhaps thinking about graduate studies? There are two main ways you can get involved in my group.
Here are some examples of past ECE 499 / URA projects:
My group has openings both at the MASc and PhD level, to perform research on hardware/software architectures for the next generation of embedded systems (cyber-physical systems). I am looking for students with an interest in one or more of the following broad areas:
If you plan to apply to ECE at U Waterloo and you would like to work in my group, you are encouraged to contact me by email. Point out why you are interested in my research and why you think you would be a good addition to the team. If I like your background, I will usually ask you to apply for admission in the next available term (the university admits students in the Fall, Winter and Spring terms). Once applications are released to faculty for review, if I like your application, I will usually ask you for a phone/Skype interview.
Please note that admission to U Waterloo is highly competitive; I cannot respond positively to all requests. Past research experience in my core areas is not required for MASc students (although expertise gained through university courses and projects certainly helps), but it is a requirement when seeking direct admission to the PhD program. Many students in the department are first admitted to the MASc program and then move to the PhD program with the agreement of their advisor.
The following are some ideas for the type of Master's/PhD thesis that you would likely undertake in my group; note that I strongly encourage independent and original research by all students.
By integrating all system components on a single die, the System-on-Chip (SoC) paradigm promises to greatly increase reliability and to reduce the cost, packaging size and power consumption of embedded systems. The explosion of the smartphone market is a prime testament to this trend, but SoCs are also becoming increasingly popular in real-time systems such as those employed in the avionics, automotive and medical industries.
Unfortunately, current real-time Operating Systems (OSs) are ill-suited to modern SoC architectures, because they rely on a critical assumption: that the CPU is the only active component in the system. However, SoCs typically contain a variety of different processors, such as GPUs, packet processors and compression engines, all of which are active components able to initiate communication and access memory. Hence, traditional CPU-based protection and isolation mechanisms are no longer sufficient.
The goal of this thesis would be to design new OS abstractions and mechanisms to support strict isolation and timing predictability for applications running on multiple, heterogeneous processors on an SoC. Possible research topics include: 1) memory and cache partitioning; 2) task execution models; 3) driver models; 4) allocation of services; 5) virtualization and memory protection; 6) security guarantees for CPS. Key objectives would be predictable performance and ease of certification for safety-critical applications. Key skills include OS and system software design as well as real-time scheduling (different sub-topics would suit students with different backgrounds).
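To make topic 1) concrete, a classic OS-level technique for cache partitioning is page coloring: the physical page allocator gives each partition only pages that map to disjoint slices of cache sets, so partitions cannot evict each other's lines. The sketch below computes a page's color under assumed parameters (a 2 MiB, 16-way cache with 4 KiB pages); the constants and function name are illustrative, not from any specific OS.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed hardware parameters (illustrative only): a 2 MiB, 16-way
 * set-associative last-level cache and 4 KiB pages. */
#define CACHE_SIZE  (2u * 1024 * 1024)
#define CACHE_WAYS  16u
#define PAGE_SIZE   4096u

/* Each cache way spans CACHE_SIZE/CACHE_WAYS bytes of set-index space;
 * consecutive physical pages within that span map to distinct colors. */
#define NUM_COLORS  (CACHE_SIZE / CACHE_WAYS / PAGE_SIZE)   /* = 32 */

/* Color of a physical page: which slice of cache sets it maps to.
 * An allocator that hands each partition only its own colors ensures
 * the partitions never compete for the same cache sets. */
static unsigned page_color(uint64_t phys_addr)
{
    return (unsigned)((phys_addr / PAGE_SIZE) % NUM_COLORS);
}
```

Note that on a real SoC the same idea must be applied consistently to every active component (CPUs, GPUs, DMA engines), which is precisely where conventional CPU-centric mechanisms fall short.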
The goal of this thesis would be to extend automatic program transformation according to the PRedictable Execution Model (PREM) to parallel programs. In particular, we envision that affine computational kernels could be analyzed according to the polyhedral model and automatically transformed to execute according to the load/execute/unload PREM paradigm.
A first main complexity is how to optimize the program over multiple loop levels. A second is how to optimize across distributed loops; both situations are very common in the computational kernels used, for example, in neural networks. A third is how to optimize the load/unload phases over non-regular memory access patterns; we envision that this issue could be ameliorated by generating custom DMA units through HLS on FPGA.
We are interested in targeting multiple architectures found in modern SoCs, including: 1) arrays of cores with two-level scratchpad memory; 2) GPUs; 3) AI cores.
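The load/execute/unload structure can be sketched on a trivial kernel. Below, a vector addition is tiled so that each tile is first copied into local buffers (standing in for a scratchpad, with DMA transfers modeled as `memcpy`), then computed entirely from local memory, then written back; the tile size and function name are illustrative assumptions, not output of an actual PREM compiler.

```c
#include <string.h>

#define TILE 64   /* tile sized to fit the (assumed) scratchpad */

/* Original kernel: out[i] = a[i] + b[i], streaming from main memory.
 * PREM version: each tile is (1) loaded into local buffers, (2) computed
 * using local accesses only, (3) written back.  During the execute phase
 * the core issues no main-memory requests, so all memory traffic is
 * confined to the predictable load/unload phases. */
void vec_add_prem(const int *a, const int *b, int *out, int n)
{
    int la[TILE], lb[TILE], lo[TILE];   /* stand-ins for scratchpad */
    for (int t = 0; t < n; t += TILE) {
        int len = (n - t < TILE) ? n - t : TILE;
        /* load phase: DMA-like copy in */
        memcpy(la, a + t, len * sizeof(int));
        memcpy(lb, b + t, len * sizeof(int));
        /* execute phase: local accesses only */
        for (int i = 0; i < len; i++)
            lo[i] = la[i] + lb[i];
        /* unload phase: copy results back */
        memcpy(out + t, lo, len * sizeof(int));
    }
}
```

The research questions above arise exactly when this simple single-loop picture breaks down: nested loop levels, data reuse across distributed loops, and access patterns too irregular for a plain block copy.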
High-performance computer architectures are normally designed to optimize average-case performance. However, such optimizations often rely on speculative features, such as prefetching and request reordering, which can adversely affect worst-case scenarios. For this reason, it has been difficult to provide timing guarantees on the latency of memory requests in modern SoC platforms.
As a consequence, researchers in the real-time domain have devised a set of architectures (including caches, buses, main memory controllers...) specifically designed to provide tight latency bounds. However, such bounds are generally achieved by disabling most of the optimizations targeted at average performance. Hence, existing designs exhibit a fundamental trade-off between average performance and worst-case guarantees.
Our goal for this research is to overcome this trade-off by designing architectures that provide tight latency bounds with minimal performance degradation. In particular, we have recently introduced the Duetto paradigm, which achieves this result by pairing a conventional memory arbiter with a real-time one. The goal of this thesis would be to demonstrate that the paradigm can be extended throughout the memory hierarchy and can support application-specific configuration. Activities would involve both architectural design based on cycle-accurate simulation and RTL design (not necessarily by the same student), with the goal of demonstrating the paradigm on a RISC-V-based platform.
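A heavily simplified sketch of the pairing idea, under stated assumptions: a throughput-oriented arbiter (here reduced to "serve the requestor with the most pending work") normally drives memory, while a monitor tracks how long each requestor has been waiting; when a requestor nears its latency budget, arbitration falls back to a round-robin real-time arbiter. The data structure, policies, and thresholds below are illustrative and do not reproduce the actual Duetto design.

```c
#define NREQ 4

typedef struct {
    int pending[NREQ];   /* outstanding requests per requestor */
    int waiting[NREQ];   /* cycles the oldest request has waited */
    int budget;          /* per-request latency budget (cycles) */
    int rr_next;         /* round-robin pointer for the RT arbiter */
} arbiter_t;

/* Returns the requestor to service this cycle, or -1 if none is ready. */
int arbitrate(arbiter_t *arb)
{
    /* Real-time mode: any requestor about to exceed its budget is
     * served round-robin, bounding its worst-case latency. */
    for (int k = 0; k < NREQ; k++) {
        int r = (arb->rr_next + k) % NREQ;
        if (arb->pending[r] && arb->waiting[r] >= arb->budget - 1) {
            arb->rr_next = (r + 1) % NREQ;
            return r;
        }
    }
    /* High-performance mode: throughput-oriented choice. */
    int best = -1;
    for (int r = 0; r < NREQ; r++)
        if (arb->pending[r] && (best < 0 || arb->pending[r] > arb->pending[best]))
            best = r;
    return best;
}
```

The analytical challenge, which this one-level sketch hides, is proving that the fallback always engages early enough to honor the bound while the fast arbiter runs unconstrained the rest of the time, and doing so at every level of the memory hierarchy.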
Modern Cyber-Physical Systems (CPS) are complex, integrated architectures. Timing analysis is crucial to ensure that the computation performed in the cyber part of the system (electronic components) correctly interacts with the physical world. Unfortunately, such analysis is made more complex by the presence of multiple cyber and physical resources shared both among hardware components and software applications. Such shared resources include processing cycles, interconnection bandwidth, cache space, memory bandwidth, power consumption and many more.
To avoid over-pessimism in the timing analysis, it is essential to properly configure and partition shared resources among software partitions / virtual machines. The key idea is that we can avoid worst-case scenarios through a careful assignment of shared resources; for example, CPU scheduling could be altered to prevent two memory-intensive tasks from running at the same time on a multi-core. The two key goals of this thesis would be to: 1) study optimization algorithms that best allocate resources based on profiling information about the executed applications; 2) leverage less-pessimistic timing analysis based on the introduced resource isolation.
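A toy instance of goal 1), under assumed inputs: given per-task memory-intensity scores from profiling, pair tasks two-per-slot on a dual-core so that the worst co-running pair's combined intensity is minimized. For this simple additive model, sorting and pairing each most-intensive task with a least-intensive one is optimal; the scores and function name are hypothetical.

```c
#include <stdlib.h>

static int cmp_int(const void *x, const void *y)
{
    return *(const int *)x - *(const int *)y;
}

/* Given n2 (even) memory-intensity scores, pair tasks two-per-slot and
 * return the minimized worst-case combined intensity of any slot:
 * sort, then pair extremes (smallest with largest, and so on inward). */
int min_worst_pair(int *intensity, int n2)
{
    qsort(intensity, n2, sizeof(int), cmp_int);
    int worst = 0;
    for (int i = 0; i < n2 / 2; i++) {
        int s = intensity[i] + intensity[n2 - 1 - i];
        if (s > worst)
            worst = s;
    }
    return worst;
}
```

For example, scores {8, 1, 5, 4} yield pairs (1, 8) and (4, 5), both summing to 9, whereas the naive pairing (8, 5) would produce a slot of intensity 13. Real instances add many resource dimensions (cache space, memory bandwidth, power), which is what makes the optimization problem, and the matching timing analysis in goal 2), interesting.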