Publications

ASPLOS'22 SOL: Safe On-Node Learning in Cloud Platforms

ASPLOS'22 ShEF: Shielded Enclaves for Cloud FPGAs

ASPLOS'22 RecShard: Statistical Feature-Based Memory Optimization for Industry-Scale Neural Recommendation

ACM TOS RAIL: Predictable, Low Tail Latency for NVMe Flash

CIDR'22 VIVA: An End-to-End System for Interactive Video Analytics

CIDR'22 A Progress Report on DBOS: A Database-oriented Operating System

arXiv Practical Scheduling for Real-World Serverless Computing

SOSP'21 Syrup: User-Defined Scheduling Across the Stack

SOSP'21 GhOSt: Fast & Flexible User-Space Delegation of Linux Scheduling

VLDB'22 DBOS: A DBMS-Oriented Operating System

SoCC'21 Llama: A Heterogeneous & Serverless Framework for Auto-Tuning Video Analytics Pipelines

SoCC'21 Faa$T: A Transparent Auto-Scaling Cache for Serverless Applications

ATC'21 INFaaS: Automated Model-less Inference Serving

HotOS'21 A Case against (Most) Context Switches

EuroSys'21 SmartHarvest: Harvesting Idle CPUs Safely and Efficiently in the Cloud

IEEE CAL RAMBO: Resource Allocation for Microservices Using Bayesian Optimization

OSDI'20 RackSched: A Microsecond-Scale Scheduler for Rack-Scale Computers

SoCC'20 Leveraging Application Classes to Save Power in Highly-Utilized Data Centers

arXiv DBOS: A Proposal for a Data-Centric Operating System

Top Picks'20 AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers

IEEE Micro The Hot Chips Renaissance

ASPLOS'20 Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators

ASPLOS'20 Classifying Memory Access Patterns for Prefetching

Poly'20 A Polystore Based Database Operating System (DBOS)

HotNets'19 Mind the Gap: A Case for Informed Request Scheduling at the NIC

SoCC'19 Centralized Core-Granular Scheduling for Serverless Functions

ATC'19 From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers

ISCA'19 AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers

HotOS'19 A Case for Managed and Model-Less Inference Serving

ASPLOS'19 TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators

arXiv A New Frontier for Pull-Based Graph Processing

NSDI'19 Shinjuku: Preemptive Scheduling for µSecond-Scale Tail Latency

;login Pocket: Elastic Ephemeral Storage for Serverless Analytics

;login Outsourcing Everyday Jobs to Thousands of Cloud Functions with gg

arXiv Trevor: Automatic configuration and scaling of stream processing pipelines

OSDI'18 Pocket: Elastic Ephemeral Storage for Serverless Analytics

ACM TACO QuMan: Profile-Based Improvement of Cluster Utilization

ATC'18 Understanding Ephemeral Storage for Serverless Analytics

ATC'18 Selecta: Heterogeneous Cloud Storage Configuration for Data Analytics

ICML'18 Learning Memory Access Patterns

CACM Amdahl's Law for Tail Latency

PLDI'18 Spatial: A Language and Compiler for Application Accelerators

Top Picks'18 Uncovering the Security Implications of Cloud Multi-Tenancy with Bolt

Top Picks'18 Plasticine: A Reconfigurable Accelerator for Parallel Patterns

arXiv Learning Memory Access Patterns

HPCA'18 Memory Hierarchy for Web Search

PPoPP'18 Making Pull-Based Graph Processing Performant

HPCA'18 GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition

ACM TOCS Corrigendum to “The IX Operating System: Combining Low Latency, High Throughput and Efficiency in a Protected Dataplane”

arXiv AppSwitch: Resolving the Application Identity Crisis

ATC'17 Persona: A High-Performance Bioinformatics Framework

ASPLOS'17 TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory

ASPLOS'17 ReFlex: Remote Flash ≈ Local Flash

ASPLOS'17 Bolt: I Know What You Did Last Summer... In The Cloud

ISCA'17 Plasticine: A Reconfigurable Architecture for Parallel Patterns

Top Picks'17 DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric

CODES'17 3D Nanosystems Enable Embedded Abundant-Data Computing: Special Session Paper

ACM TOCS The IX Operating System: Combining Low Latency, High Throughput, and Efficiency in a Protected Dataplane

IEEE CAL Security Implications of Data Mining in Cloud Scheduling

ISCA'16 DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric

ISCA'16 Automatic Generation of Efficient Accelerators for Reconfigurable Hardware

Tech Report IX Open-source v1.0 – Deployment and Evaluation Guide

Top Picks'16 Improving Resource Efficiency at Scale with Heracles

ASPLOS'16 HCloud: Resource-Efficient Provisioning in Shared Cloud Systems

ASPLOS'16 Generating Configurable Hardware from Parallel Patterns

EuroSys'16 Flash Storage Disaggregation

HPCA'16 HRL: Efficient and flexible reconfigurable logic for near-data processing

IEEE Computer Energy-Efficient Abundant-Data Computing: The N3XT 1,000x

PACT'15 Practical Near-Data Processing for In-Memory Analytics Frameworks

SoCC'15 Tarcil: Reconciling Scheduling Speed and Quality in Large Shared Clusters

SoCC'15 Energy Proportionality and Workload Consolidation for Latency-Critical Applications

ISCA'15 Heracles: Improving Resource Efficiency at Scale

Top Picks'15 Convolution Engine: Balancing Efficiency and Flexibility in Specialized Computing

OSDI'14 IX: A Protected Dataplane Operating System for High Throughput and Low Latency

ISCA'14 Towards Energy Proportionality for Large-Scale Latency-Critical Workloads

Top Picks'14 Quality-of-Service-Aware Scheduling in Heterogeneous Data Centers with Paragon

EuroSys'14 Reconciling High Server Utilization and Sub-Millisecond Quality-of-Service

ASPLOS'14 Quasar: Resource-Efficient and QoS-Aware Cluster Management

NVMW'14 High Performance Hardware-Accelerated Flash Key-Value Store

HPCA'14 Dynamic management of TurboMode in modern multi-core chips

ACM TOCS QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon

IISWC'13 iBench: Quantifying interference for datacenter applications

IISWC'13 Locality-Aware Task Management for Unstructured Parallelism: A Quantitative Limit Study

ISCA'13 ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems

ICAC'13 QoS-Aware Admission Control in Heterogeneous Datacenters

ISCA'13 Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing

IEEE Micro Selected Research from Hot Chips 24

DATE'13 Resource Efficient Computing for Warehouse-Scale Datacenters

ASPLOS'13 Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters

TRANSACT'13 Enhanced Concurrency Control with Transactional NACKs

IEEE CAL The Netflix Challenge: Datacenter Edition

SCIS Measuring and analyzing the energy use of enterprise computing systems

IISWC'12 ECHO: Recreating Network Traffic Maps for Datacenters with Tens of Thousands of Servers

OSDI'12 Dune: Safe User-Level Access to Privileged CPU Features

CODES'12 A Case of System-Level Hardware/Software Co-Design and Co-Verification of a Commodity Multi-Processor System with Custom Hardware

IEEE CAL Decoupling Datacenter Storage Studies from Access to Large-Scale Applications

ISCA'12 Towards Energy-Proportional Datacenter Memory with Mobile DRAM

IGCC'12 Green Enterprise Computing Data: Assumptions and Realities

Top Picks'12 Scalable and Efficient Fine-Grained Cache Partitioning with Vantage

ACM TACO Improving System Energy Efficiency with Memory Rank Subsetting

HPCA'12 SCD: A Scalable Coherence Directory with Flexible Sharer Set Encoding

IISWC'11 Decoupling Datacenter Studies from Access to Large-Scale Applications: A Modeling Approach for Storage Workloads

CACM Understanding Sources of Inefficiency in General-Purpose Chips

MobiHeld'11 MARS: Adaptive Remote Execution for Multi-Threaded Mobile Devices

TPCTC'11 Time and Cost-Efficient Modeling and Generation of Large-Scale TPCC/TPCE/TPCH Workloads

CACM The Case for RAMCloud

ISCA'11 Vantage: Scalable and Efficient Fine-Grain Cache Partitioning

MapReduce'11 Phoenix++: Modular MapReduce for Shared-Memory Systems

ICDCS'11 Cross-Examination of Datacenter Workload Modeling Techniques

ISPASS'11 Storage I/O Generation and Replay for Datacenter Applications

ASPLOS'11 Hardware Acceleration of Transactional Memory on Commodity Systems

PACT'11 Dynamic Fine-Grain Scheduling of Pipeline Parallelism

MICRO'10 The ZCache: Decoupling Ways and Associativity

IISWC'10 Eigenbench: A Simple Exploration Tool for Orthogonal TM Characteristics

IEEE Micro Server Engineering Insights for Large-Scale Online Services

ISCA'10 Understanding Sources of Inefficiency in General-Purpose Chips

ICS'10 Making Nested Parallel Transactions Practical Using Lightweight Hardware Support

SPAA'10 Implementing and Evaluating Nested Parallel Transactions in Software Transactional Memory

FCCM'10 FARM: A Prototyping Environment for Tightly-Coupled, Heterogeneous Architectures

NOCS'10 Evaluating Bufferless Flow Control for On-Chip Networks

ACM TACO An Analysis of On-Chip Interconnection Networks for Large-Scale Chip Multiprocessors

Usenix OSR Tainting is Not Pointless

Usenix OSR On the Energy (in)Efficiency of Hadoop Clusters

ICECCS'10 Implementing and Evaluating a Model Checker for Transactional Memory Systems

ASPLOS'10 Flexible Architectural Support for Fine-Grain Scheduling

Usenix OSR The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM

SC'09 Future Scaling of Processor-Memory Interfaces

IISWC'09 Phoenix Rebirth: Scalable MapReduce on a Large-Scale Shared-Memory System

Usenix Security'09 Nemesis: Preventing Authentication & Access Control Vulnerabilities in Web Applications

HotChips'09 The Stanford Pervasive Parallelism Lab

IEEE CAL Power Management of Datacenter Workloads Using Per-Core Power Gating

ICS'09 Fast Memory Snapshot for Concurrent Programming without Synchronization

DSN'09 Decoupling Dynamic Information Flow Tracking with a dedicated coprocessor

ISCA'09 A Memory System Design Framework: Creating Smart Memories

IEEE Micro Guest Editors' Introduction: Hot Chips Turns 20

POPL'09 Feedback-Directed Barrier Optimization in a Strongly Isolated STM

OSDI'08 Hardware Enforcement of Application Security Policies Using Tagged Memory

ACM TACO Comparative Evaluation of Memory Models for Chip Multiprocessors

HotPower'08 A Comparison of High-Level Full-System Power Models

IISWC'08 STAMP: Stanford Transactional Applications for Multi-Processing

CACM Transactional Memory

Usenix Security'08 Real-World Buffer Overflow Protection for Userspace & Kernelspace

SPAA'08 Improving Software Concurrency with Hardware-Assisted Memory Snapshot

SPAA'08 ASED: Availability, Security, and Debugging Support Using Transactional Memory

HPCA'08 Thread-safe dynamic binary translation using transactional memory

IEEE Computer Models and Metrics to Enable Energy-Efficiency Optimizations

PACT'07 The OpenTM Transactional Application Programming Interface

CASES'07 A Low Power Front-End for Embedded Processors Using a Block-Aware Instruction Set

SPAA'06 Towards Soft Optimization Techniques for Parallel Cognitive Applications

ISCA'07 Raksha: A Flexible Information Flow Architecture for Software Security

SIGMOD'07 JouleSort: A Balanced Energy-Efficiency Benchmark

ISCA'07 Comparing Memory Systems for Chip Multiprocessors

ISCA'07 An Effective Hybrid Transactional Memory System with Strong Isolation Guarantees

PPoPP'07 Transactional Programming in a Multi-Core Environment

PPoPP'07 Transactional Collection Classes

IEEE Micro RAMP: Research Accelerator for Multiple Processors

PPoPP'07 Potential Show-Stoppers for Transactional Synchronization

IEEE CAL From Chaos to QoS: Case Studies in CMP Resource Management

HPCA'07 Evaluating MapReduce for Multi-Core and Multiprocessor Systems

HPCA'07 A Scalable, Non-Blocking Approach to Transactional Memory

Top Picks'07 Transactional Memory: The Hardware-Software Interface

DATE'07 Register Pointer Architecture for Efficient Embedded Processors

DATE'07 ATLAS: A Chip-Multiprocessor with Transactional Memory Support

ACM Queue Unlocking Concurrency: Multicore Programming with Transactional Memory

dasCMP'06 From Chaos to QoS: Case Studies in CMP Resource Management

SCP Executing Java Programs with Transactional Memory

ASPLOS'06 Tradeoffs in Transactional Memory Virtualization

PACT'06 Testing Implementations of Transactional Memory

HPEC'06 CEARCH: Cognition Enabled Architecture

ACM TACO Block-Aware Instruction Set Architecture

ICPP'06 Vector Lane Threading

HotChips'06 RAMP: Research Accelerator for Multiple Processors

PLDI'06 The Atomos Transactional Programming Language

MOBS'06 Full-system Power Analysis and Modeling for Server Environments

WTW'06 Early Release: Friend or Foe

WDDD'06 Deconstructing Hardware Architectures for Security

ISCA'06 Architectural Semantics for Practical Transactional Memory

WTW'06 Parallelizing SPECjbb2000 with Transactional Memory

STMCS'06 The Software Stack for Transactional Memory: Challenges and Opportunities

DATE'06 Simultaneously Improving Code Size, Performance, and Energy in Embedded Processors

HPCA'06 The common case transactional behavior of multithreaded programs

Report Library-based Prefetching for Pointer Intensive Applications

WARFP'06 Building and Using the ATLAS Transactional Memory System

GLOBECOM'05 Automatic power management schemes for Internet servers and data centers

SCOOL'05 Transactional Execution of Java Programs

Tech Report RAMP: Research Accelerator for Multiple Processors - A Community Vision for a Shared Experimental Parallel HW/SW Platform

PACT'05 Characterization of TCC on Chip-Multiprocessors

EuroPar'05 Improving Instruction Delivery with a Block-Aware ISA

ISLPED'05 Energy-Efficient and High-Performance Instruction Fetch Using a Block-Aware ISA

ICS'05 TAPE: A Transactional Application Profiling Environment

ICPP'05 Heuristics for Profile-Driven Method-Level Speculative Parallelization

WARFP'05 ATLAS: A Scalable Emulator for Transactional Parallel Systems

Top Picks'04 Transactional Coherence and Consistency: Simplifying Parallel Hardware and Software

LAR'04 Stream Virtual Machine and Two-Level Compilation Model for Streaming Architectures and Languages

ASPLOS'04 Programming with Transactional Coherence and Consistency (TCC)

PACT'04 The Stream Virtual Machine

ISCA'04 Transactional Memory Coherence and Consistency

Top Picks'03 Scalable vector processors for embedded systems

ISCA'03 Overcoming the limitations of conventional vector processors

MICRO'02 Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks

PhD Thesis Scalable Vector Media Processors for Embedded Systems

IEEE Hardware/compiler codevelopment for an embedded media processor

IMS'00 Proceedings of the 2nd Workshop on Intelligent Memory Systems

IMS'00 Exploiting On-chip Memory Bandwidth in the VIRAM Compiler

HotChips'00 Vector IRAM: A Media-oriented Vector Processor with Embedded DRAM

DATE'00 How to solve the current memory access and data transfer bottlenecks: at the processor architecture or at the compiler level?

MS Thesis A Media-Enhanced Vector Architecture for Embedded Memory Systems

IEEE Computer A new direction for computer architecture research

ICCD'97 Intelligent RAM (IRAM): the industrial setting, applications, and architectures

IEEE Computer Scalable processors in the billion-transistor era: IRAM

ARVLSI'97 Pipelined multi-queue management in a VLSI ATM switch chip with credit-based flow-control

ISCA'97 The Energy Efficiency of IRAM Architectures

WMLD'97 Evaluation of Existing Architectures in IRAM Systems

IEEE Micro A case for intelligent RAM

ISSCC'97 Intelligent RAM (IRAM): chips that remember and compute

BS Thesis The Architecture, Operation, and Design of the Queue Management Block in the ATLAS I ATM Switch