Navneet Aron, Shivnath Babu, Sorav Bansal, and Abhyudaya Chodisetti
With the advent of high-speed link technologies, the bottleneck in network processing is shifting to the CPU. Dedicated network processors are generally used to keep up with high-speed network traffic; TCP/IP processing may become a bottleneck if a general-purpose processor is used instead. In this paper, we examine whether TCP/IP stack processing can be made more efficient by parallelizing it, making it well suited for chip multiprocessors. We implemented a parallel TCP stack by extending LwIP, an existing TCP stack for embedded devices. We found that the LwIP stack can indeed be parallelized for a performance gain, and we demonstrate the improvement through experiments on a shared-memory multiprocessor machine.
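The abstract does not detail the parallelization strategy; as an illustrative sketch (not taken from the paper), one common way to parallelize a TCP/IP stack is connection-level parallelism, where each packet is dispatched to a worker by hashing its connection 4-tuple, so per-connection state is only ever touched by one thread and needs no locking:

```python
from collections import defaultdict

NUM_WORKERS = 4  # hypothetical number of cores

def flow_hash(conn):
    """Map a connection 4-tuple to a worker index deterministically."""
    src_ip, src_port, dst_ip, dst_port = conn
    # A simple XOR-fold; a real stack might use Toeplitz hashing, as in RSS.
    return (hash(src_ip) ^ src_port ^ hash(dst_ip) ^ dst_port) % NUM_WORKERS

def dispatch(packets):
    """Partition a packet trace into per-worker queues by connection."""
    queues = defaultdict(list)
    for conn, payload in packets:
        queues[flow_hash(conn)].append((conn, payload))
    return queues
```

Because all packets of a connection land in the same queue, in-order processing within a connection is preserved without cross-core synchronization.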
Joel Coburn, Jayanth Gummaraju, Varun Malhotra, Janani Ravi, and Suzanne Rivoire
Reconfigurable architectures present many opportunities for specialized optimization of hardware and software, which can often be harnessed only with a high-level understanding of the programs to be run. Using generic standard libraries is a convenient mechanism for the programmer to convey this high-level information to the compiler. In this paper, we propose compile-time techniques for optimizing programs that use the C++ Standard Template Library and a framework for integrating the techniques into a compiler. We also show ways to refine our analyses in response to future experimental results.
Metha Jeeradit, Jean Suh, Honggo Wijaya, and Chi Ho Yue
Designing a CMP system is a hard optimization problem because of the large parameter space involved. In this paper, we present a method for automatically determining an ideal CMP configuration for a given application set using genetic programming. Our approach is a static, software-based implementation that provides an effective method of searching for the ideal hardware configurations. The results of our approach can also be used as a training set for a neural network extension that can dynamically reconfigure the hardware based on the applications being run.
Initial results are promising, showing substantial improvement between the best configuration found in the first iteration and that found in the last. The algorithm also converges quickly, within 20 generations, for a combined application set.
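The abstract does not give the search procedure in detail; the following sketch shows, under assumed parameter names and a stand-in fitness function (none of which are from the paper), how a genetic search over a CMP configuration space might look:

```python
import random

# Hypothetical CMP parameter space (names and values illustrative only).
PARAM_SPACE = {
    "cores":       [1, 2, 4, 8, 16],
    "l1_kb":       [8, 16, 32, 64],
    "l2_kb":       [256, 512, 1024, 2048],
    "issue_width": [1, 2, 4],
}

def random_config():
    return {k: random.choice(v) for k, v in PARAM_SPACE.items()}

def crossover(a, b):
    # Uniform crossover: each parameter comes from either parent.
    return {k: random.choice([a[k], b[k]]) for k in PARAM_SPACE}

def mutate(cfg, rate=0.1):
    return {k: (random.choice(PARAM_SPACE[k]) if random.random() < rate else v)
            for k, v in cfg.items()}

def evolve(fitness, pop_size=20, generations=20):
    pop = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]  # truncation selection keeps the best half
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)
```

In a real setting the fitness function would run the application set through a simulator rather than a closed-form score; convergence within a small number of generations depends heavily on the mutation rate and population size.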
Jing Jiang, Ilya Katsnelson, and Ernesto Staroswiecki
The drive to achieve high levels of availability and reliability in computer systems has fueled the development of fault-tolerance techniques for some time. These techniques, however, have been designed mostly with single-processor, symmetric multiprocessor (SMP), or small chip-multiprocessor (CMP) systems in mind (i.e., up to two cores per chip). This work needs to be extended to larger CMP and polymorphic systems.
In this paper we present a technique to detect and recover from both transient and permanent errors within a chip multiprocessor. We also run benchmarks in the presence of faults to evaluate both the overall performance degradation caused by errors when our fault-tolerance scheme is used and the local performance profile, to better understand the implications of a fault for a processor element and its neighbors.
Finally, since this is not only a research paper but also a class report for EE392C, Spring 2003, we describe our experience working on this project, as well as the ideas that now form the future work section.
Dave Bloom, Brad Schumitsch, Garret Smith, and John Whaley
We investigated techniques for automatically exploiting method-level parallelism using method-level speculation. Our system, AMP, profiles the code and then automatically identifies candidates for speculation. We constructed a full tool chain for trace simulations and investigated a variety of heuristics for choosing which methods to speculate on. Using a 4-processor system, on a set of standard sequential benchmarks, our best heuristic achieves an average running time of only 73% of that of the sequential version.
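The abstract does not state which heuristics were tried; as a hypothetical illustration (the thresholds and profile fields below are assumptions, not AMP's actual criteria), a profile-driven heuristic might rank methods by expected speculative benefit:

```python
# Assumed costs for illustration; a real tool would measure or calibrate these.
FORK_OVERHEAD_CYCLES = 500   # cost of spawning a speculative thread
MAX_VIOLATION_RATE = 0.2     # tolerable frequency of dependence violations

def speculation_candidates(profile):
    """profile: {method: (avg_cycles, violation_rate)} -> ranked candidate list.

    A method is worth speculating on if it is long enough to amortize the
    fork overhead and rarely triggers a rollback.
    """
    candidates = []
    for method, (avg_cycles, violation_rate) in profile.items():
        # Work overlapped on success, minus work wasted on rollback and fork cost.
        expected_gain = ((1 - violation_rate) * avg_cycles
                         - violation_rate * avg_cycles
                         - FORK_OVERHEAD_CYCLES)
        if violation_rate <= MAX_VIOLATION_RATE and expected_gain > 0:
            candidates.append((expected_gain, method))
    return [m for _, m in sorted(candidates, reverse=True)]
```

Heuristics of this shape trade off coverage (speculating on many methods) against the cost of mis-speculation, which is the tension any candidate-selection policy must resolve.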
Rohit Kumar Gupta, Paul Wang Lee, Wajahat Qadeer, and Rebecca Sara Schultz
Vector processors have been used successfully to obtain significant speed-ups for data-parallel applications. However, vector architectures may not be suitable for general-purpose computing, and many novel techniques have been developed for that purpose, most notably in the direction of multithreading. We have developed a core for a scalable chip multiprocessor that efficiently combines both approaches. We present a hardware overview, discuss details of the pipeline and instruction set, and demonstrate how our system compares to other scalar and vector systems.
Amin Firoozshahian, Arjun Singh and John Kim
As the number of input ports in an on-chip interconnection network scales, existing structures such as buses or crossbars have limitations. Other topologies such as a torus or a mesh scale well, but their large path diversity makes the desired ordering difficult to maintain. This paper presents two schemes to overcome these difficulties: a multistage arbitration scheme, which divides the arbitration so that the arbitration cycle time can be kept small, and a Clos-like structure, which scales better and also achieves higher throughput. To evaluate these schemes, a cycle-accurate simulator was developed, and the performance of the two schemes is compared in terms of throughput, latency, and logic cost. Simulation results, along with the properties of the two schemes presented in the paper, show that a Clos-like network can achieve much better throughput, though at a higher cost.
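As an illustrative sketch (the radix parameters are hypothetical and not from the paper), the throughput advantage of a Clos-like structure comes from its path diversity: a three-stage Clos network offers one path per middle switch for every input-output pair, so load can be spread across the middle stage:

```python
import random

def clos_route(src_port, dst_port, n=4, m=4):
    """Pick a path (ingress, middle, egress switch) in a 3-stage Clos network.

    n: ports per first/last-stage switch; m: number of middle switches.
    """
    ingress = src_port // n
    egress = dst_port // n
    # Any of the m middle switches completes the path; choosing one at
    # random balances load, which is what drives the higher throughput.
    middle = random.randrange(m)
    return ingress, middle, egress
```

With m >= n the network is rearrangeably non-blocking (Clos's classic condition), at the cost of the extra middle-stage switches and wiring, which matches the cost/throughput trade-off noted above.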