Latest in cs.ar

TiM-DNN: Ternary in Memory accelerator for Deep Neural Networks (Sep 15 2019). The use of lower precision to perform computations has emerged as a popular technique to enable complex Deep Neural Networks (DNNs) to be realized on energy-constrained platforms. In the quest for lower precision, studies to date have shown that ternary ...
Instructional Level Parallelism (Sep 14 2019). This paper is a review of the developments in Instruction level parallelism. It takes into account all the changes made in speeding up the execution. The various drawbacks and dependencies due to pipelining are discussed and various solutions to overcome ...
QuTiBench: Benchmarking Neural Networks on Heterogeneous Hardware (Sep 11 2019). Neural Networks have become one of the most successful universal machine learning algorithms. They play a key role in enabling machine vision and speech recognition for example. Their computational complexity is enormous and comes along with equally challenging ...
Packet Chasing: Spying on Network Packets over a Cache Side-Channel (Sep 11 2019). This paper presents Packet Chasing, an attack on the network that does not require access to the network, which works regardless of the privilege level of the process receiving the packets. A spy process can easily probe and discover the exact cache location ...
NoCs in Heterogeneous 3D SoCs: Co-Design of Routing Strategies and Microarchitectures (Sep 10 2019). Heterogeneous 3D System-on-Chips (3D SoCs) are the most promising design paradigm to combine sensing and computing within a single chip. A special characteristic of communication networks in heterogeneous 3D SoCs is the varying latency and throughput ...
TMA: Tera-MACs/W Neural Hardware Inference Accelerator with a Multiplier-less Massive Parallel Processor (Sep 08 2019). Computationally intensive Inference tasks of Deep neural networks have enforced revolution of new accelerator architecture to reduce power consumption as well as latency. The key figure of merit in hardware inference accelerators is the number of multiply-and-accumulation ...
SPRING: A Sparsity-Aware Reduced-Precision Monolithic 3D CNN Accelerator Architecture for Training and Inference (Sep 02 2019). CNNs outperform traditional machine learning algorithms across a wide range of applications. However, their computational complexity makes it necessary to design efficient hardware accelerators. Most CNN accelerators focus on exploring dataflow styles ...
Touché: Towards Ideal and Efficient Cache Compression By Mitigating Tag Area Overheads (Sep 02 2019). Compression is seen as a simple technique to increase the effective cache capacity. Unfortunately, compression techniques either incur tag area overheads or restrict data placement to only include neighboring compressed cache blocks to mitigate tag area ...
An Ultra-Efficient Memristor-Based DNN Framework with Structured Weight Pruning and Quantization Using ADMM (Aug 29 2019). The high computation and memory storage of large deep neural networks (DNNs) models pose intensive challenges to the conventional Von-Neumann architecture, incurring substantial data movements in the memory hierarchy. The memristor crossbar array has ...
A Machine Learning Accelerator In-Memory for Energy Harvesting (Aug 29 2019). There is increasing demand to bring machine learning capabilities to low power devices. By integrating the computational power of machine learning with the deployment capabilities of low power devices, a number of new applications become possible. In ...
Tiny but Accurate: A Pruned, Quantized and Optimized Memristor Crossbar Framework for Ultra Efficient DNN Implementation (Aug 27 2019). The state-of-art DNN structures involve intensive computation and high memory storage. To mitigate the challenges, the memristor crossbar array has emerged as an intrinsically suitable matrix computation and low-power acceleration framework for DNN applications. ...
BRISC-V: An Open-Source Architecture Design Space Exploration Toolbox (Aug 27 2019). In this work, we introduce a platform for register-transfer level (RTL) architecture design space exploration. The platform is an open-source, parameterized, synthesizable set of RTL modules for designing RISC-V based single and multi-core architecture ...
Cyclic Sequence Generators as Program Counters for High-Speed FPGA-based Processors (Aug 26 2019). This paper compares the performance of conventional radix-2 program counters with program counters based on Feedback Shift Registers (FSRs), a class of cyclic sequence generator. FSR counters have constant time scaling with bit-width, $N$, whereas FPGA-based ...
Tvarak: Software-managed hardware offload for DAX NVM storage redundancy (Aug 26 2019). Tvarak efficiently implements system-level redundancy for direct-access (DAX) NVM storage. Production storage systems complement device-level ECC (which covers media errors) with system-checksums and cross-device parity. This system-level redundancy enables ...
4-Bit High-Speed Binary Ling Adder (Aug 25 2019). Binary addition is one of the most primitive and most commonly used applications in computer arithmetic. A large variety of algorithms and implementations have been proposed for binary addition. Huey Ling proposed a simpler form of CLA equations which ...
Enabling and Exploiting Partition-Level Parallelism (PALP) in Phase Change Memories (Aug 21 2019). Phase-change memory (PCM) devices have multiple banks to serve memory requests in parallel. Unfortunately, if two requests go to the same bank, they have to be served one after another, leading to lower system performance. We observe that a modern PCM ...
Comparing ternary and binary adders and multipliers (Aug 20 2019). While many papers have proposed implementations of ternary adders and ternary multipliers, no comparisons have generally been done with the corresponding binary ones. We compare the implementations of binary and ternary adders and multipliers with the ...
Boosting the Bounds of Symbolic QED for Effective Pre-Silicon Verification of Processor Cores (Aug 19 2019). Existing techniques to ensure functional correctness and hardware trust during pre-silicon verification face severe limitations. In this work, we systematically leverage two key ideas: 1) Symbolic QED, a recent bug detection and localization technique ...
Across-Stack Profiling and Characterization of Machine Learning Models on GPUs (Aug 19 2019). The world sees a proliferation of machine learning/deep learning (ML) models and their wide adoption in different application domains recently. This has made the profiling and characterization of ML models an increasingly pressing task for both hardware ...
Ternary circuits: why R=3 is not the Optimal Radix for Computation (Aug 19 2019). A demonstration that e=2.718 rounded to 3 is the best radix for computation is disproved. The MOSFET-like CNTFET technology is used to compare inverters, Nand, adders, multipliers, D Flip-Flops and SRAM cells. The transistor count ratio between ternary ...
A Computational Model for Tensor Core Units (Aug 19 2019). To respond to the need of efficient training and inference of deep neural networks, a plethora of domain-specific hardware architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature of these architectures ...
A bi-directional Address-Event transceiver block for low-latency inter-chip communication in neuromorphic systems (Aug 18 2019). Neuromorphic systems typically use the Address-Event Representation (AER) to transmit signals among nodes, cores, and chips. Communication of Address-Events (AEs) between neuromorphic cores/chips typically requires two parallel digital signal buses for ...
Workload-Aware Opportunistic Energy Efficiency in Multi-FPGA Platforms (Aug 18 2019). The continuous growth of big data applications with high computational and scalability demands has resulted in increasing popularity of cloud computing. Optimizing the performance and power consumption of cloud resources is therefore crucial to relieve ...
CHoNDA: Near Data Acceleration with Concurrent Host Access (Aug 18 2019). Near-data accelerators (NDAs) that are integrated with main memory have the potential for significant power and performance benefits. Fully realizing these benefits requires the large available memory capacity to be shared between the host and the NDAs ...
Reinforcement Learning based Interconnection Routing for Adaptive Traffic Optimization (Aug 13 2019). Applying Machine Learning (ML) techniques to design and optimize computer architectures is a promising research direction. Optimizing the runtime performance of a Network-on-Chip (NoC) necessitates a continuous learning framework. In this work, we demonstrate ...
Work-in-Progress: A Simulation Framework for Domain-Specific System-on-Chips (Aug 10 2019). Heterogeneous system-on-chips (SoCs) have become the standard embedded computing platforms due to their potential to deliver superior performance and energy efficiency compared to homogeneous architectures. They can be particularly suited to target a ...
TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning (Aug 08 2019; Aug 25 2019). Recent studies from several hyperscalers pinpoint embedding layers as the most memory-intensive deep learning (DL) algorithm being deployed in today's datacenters. This paper addresses the memory capacity and bandwidth challenges of embedding layers ...
High-Level Combined Deterministic and Pseudoexhuastive Test Generation for RISC Processors (Aug 08 2019). Recent safety standards set stringent requirements for the target fault coverage in embedded microprocessors, with the objective to guarantee robustness and functional safety of the critical electronic systems. This motivates the need for improving the ...
Near-Memory Computing: Past, Present, and Future (Aug 07 2019). The conventional approach of moving data to the CPU for computation has become a significant performance bottleneck for emerging scale-out data-intensive applications due to their limited data reuse. At the same time, the advancement in 3D integration ...
3D-aCortex: An Ultra-Compact Energy-Efficient Neurocomputing Platform Based on Commercial 3D-NAND Flash Memories (Aug 07 2019). The first contribution of this paper is the development of extremely dense, energy-efficient mixed-signal vector-by-matrix-multiplication (VMM) circuits based on the existing 3D-NAND flash memory blocks, without any need for their modification. Such compatibility ...
Addressing multiple bit/symbol errors in DRAM subsystem (Aug 05 2019). As DRAM technology continues to evolve towards smaller feature sizes and increased densities, faults in DRAM subsystem are becoming more severe. Current servers mostly use CHIPKILL based schemes to tolerate up-to one/two symbol errors per DRAM beat. Multi-symbol ...
PERI: A Posit Enabled RISC-V Core (Aug 05 2019). Owing to the failure of Dennard's scaling the last decade has seen a steep growth of prominent new paradigms leveraging opportunities in computer architecture. Two technologies of interest are Posit and RISC-V. Posit was introduced in mid-2017 as a viable ...
CREST: Hardware Formal Verification with ANSI-C Reference Specifications (Aug 04 2019). This paper presents CREST, a prototype front-end tool intended as an add-on to commercial EDA formal verification environments. CREST is an adaptation of the CBMC bounded model checker for C, an academic tool widely used in industry for software analysis ...
Analysis and Optimization of I/O Cache Coherency Strategies for SoC-FPGA Device (Aug 04 2019). Unlike traditional PCIe-based FPGA accelerators, heterogeneous SoC-FPGA devices provide tighter integrations between software running on CPUs and hardware accelerators. Modern heterogeneous SoC-FPGA platforms support multiple I/O cache coherence options ...
Towards Multidimensional Verification: Where Functional Meets Non-Functional (Aug 01 2019). Trends in advanced electronic systems' design have a notable impact on design verification technologies. The recent paradigms of Internet-of-Things (IoT) and Cyber-Physical Systems (CPS) assume devices immersed in physical environments, significantly ...
Runtime Mitigation of Packet Drop Attacks in Fault-tolerant Networks-on-Chip (Aug 01 2019). Fault-tolerant routing (FTR) in Networks-on-Chip (NoCs) has become a common practice to sustain the performance of multi-core systems with an increasing number of faults on a chip. On the other hand, usage of third-party intellectual property blocks has ...
Generalized Fault-Tolerance Topology Generation for Application Specific Network-on-Chips (Aug 01 2019). The Network-on-Chips is a promising candidate for addressing communication bottlenecks in many-core processors and neural network processors. In this work, we consider the generalized fault-tolerance topology generation problem, where the link or switch ...
Mixed-level identification of fault redundancy in microprocessors (Jul 29 2019). A new high-level implementation independent functional fault model for control faults in microprocessors is introduced. The fault model is based on the instruction set, and is specified as a set of data constraints to be satisfied by test data generation. ...
Pyramid: Machine Learning Framework to Estimate the Optimal Timing and Resource Usage of a High-Level Synthesis Design (Jul 29 2019). The emergence of High-Level Synthesis (HLS) tools shifted the paradigm of hardware design by making the process of mapping high-level programming languages to hardware design such as C to VHDL/Verilog feasible. HLS tools offer a plethora of techniques ...
The Preliminary Evaluation of a Hypervisor-based Virtualization Mechanism for Intel Optane DC Persistent Memory Module (Jul 28 2019). Non-volatile memory (NVM) technologies, being accessible in the same manner as DRAM, are considered indispensable for expanding main memory capacities. Intel Optane DCPMM is a long-awaited product that drastically increases main memory capacities. However, ...
A Workload and Programming Ease Driven Perspective of Processing-in-Memory (Jul 26 2019). Many modern and emerging applications must process increasingly large volumes of data. Unfortunately, prevalent computing paradigms are not designed to efficiently handle such large-scale data: the energy and performance costs to move this data between ...
Performance Comparison of Quasi-Delay-Insensitive Asynchronous Adders (Jul 24 2019). In this technical note, we provide a comparison of the design metrics of various quasi-delay-insensitive (QDI) asynchronous adders, where the adders correspond to diverse architectures. QDI adders are robust, and the objective of this technical note is ...
Reconfigurable multiplier architecture based on memristor-cmos with higher flexibility (Jul 22 2019). Multiplication is an indispensable operation in most of digital signal processing systems. Recently, many systems need to execute different types of algorithms on a multiplier. Therefore, it needs complicated computation and large area occupation. In ...
PPAC: A Versatile In-Memory Accelerator for Matrix-Vector-Product-Like Operations (Jul 19 2019). Processing in memory (PIM) moves computation into memories with the goal of improving throughput and energy-efficiency compared to traditional von Neumann-based architectures. Most existing PIM architectures are either general-purpose but only support ...
CADS: Core-Aware Dynamic Scheduler for Multicore Memory Controllers (Jul 17 2019). Memory controller scheduling is crucial in multicore processors, where DRAM bandwidth is shared. Since increased number of requests from multiple cores of processors becomes a source of bottleneck, scheduling the requests efficiently is necessary to utilize ...
Coprocessors: failures and successes (Jul 16 2019). The appearance and disappearance of coprocessors by integration into the CPU, the success or failure of coprocessors are examined by summarizing their characteristics from the mainframes of the 1960s. The coprocessors most particularly reviewed are the ...
Fast Modeling L2 Cache Reuse Distance Histograms Using Combined Locality Information from Software Traces (Jul 11 2019). To mitigate the performance gap between the CPU and the main memory, multi-level cache architectures are widely used in modern processors. Therefore, modeling the behaviors of the downstream caches becomes a critical part of the processor performance ...
Efficient Uncertainty Modeling for System Design via Mixed Integer Programming (Jul 10 2019). The post-Moore era casts a shadow of uncertainty on many aspects of computer system design. Managing that uncertainty requires new algorithmic tools to make quantitative assessments. While prior uncertainty quantification methods, such as generalized ...
A Range Matching CAM for Hierarchical Defect Tolerance Technique in NRAM Structures (Jul 10 2019). Due to the small size of nanoscale devices, they are highly prone to process disturbances which results in manufacturing defects. Some of the defects are randomly distributed throughout the nanodevice layer. Other disturbances tend to be local and lead ...
FusionAccel: A General Re-configurable Deep Learning Inference Accelerator on FPGA for Convolutional Neural Networks (Jul 04 2019). The deep learning accelerator is one of the methods to accelerate deep learning network computations, which is mainly based on convolutional neural network acceleration. To address the fact that concurrent convolutional neural network accelerators are ...
TicToc: Enabling Bandwidth-Efficient DRAM Caching for both Hits and Misses in Hybrid Memory Systems (Jul 04 2019). This paper investigates bandwidth-efficient DRAM caching for hybrid DRAM + 3D-XPoint memories. 3D-XPoint is becoming a viable alternative to DRAM as it enables high-capacity and non-volatile main memory systems; however, 3D-XPoint has 4-8x slower read, ...
To Update or Not To Update?: Bandwidth-Efficient Intelligent Replacement Policies for DRAM Caches (Jul 04 2019). This paper investigates intelligent replacement policies for improving the hit-rate of gigascale DRAM caches. Cache replacement policies are commonly used to improve the hit-rate of on-chip caches. The most effective replacement policies often require ...
On the Optimal Refresh Power Allocation for Energy-Efficient Memories (Jul 02 2019; Jul 18 2019). Refresh is an important operation to prevent loss of data in dynamic random-access memory (DRAM). However, frequent refresh operations incur considerable power consumption and degrade system performance. Refresh power cost is especially significant in ...
Bridging the Architecture Gap: Abstracting Performance-Relevant Properties of Modern Server Processors (Jul 01 2019). We describe a universal modeling approach for predicting single- and multicore runtime of steady-state loops on server processors. To this end we strictly differentiate between application and machine models: An application model comprises the loop code, ...
HTS: A Hardware Task Scheduler for Heterogeneous Systems (Jun 29 2019). As the Moore's scaling era comes to an end, application specific hardware accelerators appear as an attractive way to improve the performance and power efficiency of our computing systems. A massively heterogeneous system with a large number of hardware ...
Tucker Tensor Decomposition on FPGA (Jun 28 2019). Tensor computation has emerged as a powerful mathematical tool for solving high-dimensional and/or extreme-scale problems in science and engineering. The last decade has witnessed tremendous advancement of tensor computation and its applications in machine ...
Mixed-Signal Charge-Domain Acceleration of Deep Neural networks through Interleaved Bit-Partitioned Arithmetic (Jun 27 2019; Jul 12 2019). Low-power potential of mixed-signal design makes it an alluring option to accelerate Deep Neural Networks (DNNs). However, mixed-signal circuitry suffers from limited range for information encoding, susceptibility to noise, and Analog to Digital (A/D) ...
FPGA-based Multi-Chip Module for High-Performance Computing (Jun 26 2019). Current integration, architectural design and manufacturing technologies are not suited for the computing density and power efficiency requested by Exascale computing. New approaches in hardware architecture are thus needed to overcome the technological ...
Automatic Conversion from Flip-flop to 3-phase Latch-based Designs (Jun 25 2019). Latch-based designs have many benefits over their flip-flop based counterparts but have limited use partially because most RTL specifications are flop-centric and automatic conversion of FF to latch-based designs is challenging. Conventional conversion ...
Adaptive Precision CNN Accelerator Using Radix-X Parallel Connected Memristor Crossbars (Jun 22 2019). Neural processor development is reducing our reliance on remote server access to process deep learning operations in an increasingly edge-driven world. By employing in-memory processing, parallelization techniques, and algorithm-hardware co-design, memristor ...
A Retrospective Recount of Computer Architecture Research with a Data-Driven Study of Over Four Decades of ISCA Publications (Jun 22 2019). This study began with a research project, called DISCvR, conducted at the IBM-ILLINOIS Center for Cognitive Computing Systems Research. The goal of DISCvR was to build a practical NLP based AI pipeline for document understanding which will help us better ...
An Overview of In-memory Processing with Emerging Non-volatile Memory for Data-intensive Applications (Jun 15 2019). The conventional von Neumann architecture has been revealed as a major performance and energy bottleneck for rising data-intensive applications. The decade-old idea of leveraging in-memory processing to eliminate ...
Branch Prediction Is Not a Solved Problem: Measurements, Opportunities, and Future Directions (Jun 13 2019). Modern branch predictors predict the vast majority of conditional branch instructions with near-perfect accuracy, allowing superscalar, out-of-order processors to maximize speculative efficiency and thus performance. However, this impressive overall effectiveness ...
Thread Batching for High-performance Energy-efficient GPU Memory Design (Jun 13 2019). Massive multi-threading in GPU imposes tremendous pressure on memory subsystems. Due to rapid growth in thread-level parallelism of GPU and slowly improved peak memory bandwidth, the memory becomes a bottleneck of GPU's performance and energy efficiency. ...
Data Conversion in Area-Constrained Applications: the Wireless Network-on-Chip Case (Jun 12 2019). Network-on-Chip (NoC) is currently the paradigm of choice to interconnect the different components of System-on-Chips (SoCs) or Chip Multiprocessors (CMPs). As the levels of integration continue to grow, however, current NoCs face significant scalability ...
Transport Triggered Array Processor for Vision Applications (Jun 10 2019). Low-level sensory data processing in many Internet-of-Things (IoT) devices pursue energy efficiency by utilizing sleep modes or slowing the clocking to the minimum. To curb the share of stand-by power dissipation in those designs, near-threshold/sub-threshold ...
Practical Byte-Granular Memory Blacklisting using Califorms (Jun 05 2019; Jun 10 2019). Recent rapid strides in memory safety tools and hardware have improved software quality and security. While coarse-grained memory safety has improved, achieving memory safety at the granularity of individual objects remains a challenge due to high performance ...
Pangloss: a novel Markov chain prefetcher (Jun 03 2019). We present Pangloss, an efficient high-performance data prefetcher that approximates a Markov chain on delta transitions. With a limited information scope and space/logic complexity, it is able to reconstruct a variety of both simple and complex access ...
Ara: A 1 GHz+ Scalable and Energy-Efficient RISC-V Vector Processor with Multi-Precision Floating Point Support in 22 nm FD-SOI (Jun 02 2019). In this paper, we present Ara, a 64-bit vector processor based on the version 0.5 draft of RISC-V's vector extension, implemented in GlobalFoundries 22FDX FD-SOI technology. Ara's microarchitecture is scalable, as it is composed of a set of identical ...
Sparse Matrix to Matrix Multiplication: A Representation and Architecture for Acceleration (long version) (Jun 02 2019). Accelerators for sparse matrix multiplication are important components in emerging systems. In this paper, we study the main challenges of accelerating Sparse Matrix Multiplication (SpMM). For the situations that data is not stored in the desired order ...
Isolation-Aware Timing Analysis and Design Space Exploration for Predictable and Composable Many-Core Systems (May 31 2019; Jun 24 2019). Composable many-core systems enable the independent development and analysis of applications which will be executed on a shared platform where the mix of concurrently executed applications may change dynamically at run time. For each individual application, ...
iVAMS 1.0: Polynomial-Metamodel-Integrated Intelligent Verilog-AMS for Fast, Accurate Mixed-Signal Design Optimization (May 30 2019). Electronic circuit behavioral models built with hardware description/modeling languages such as Verilog-AMS for system-level simulations are typically functional models. They do not capture the physical design (layout) information of the target design. ...
Fallout: Reading Kernel Writes From User Space (May 29 2019). Recently, out-of-order execution, an important performance optimization in modern high-end processors, has been revealed to pose a significant security threat, allowing information leaks across security domains. In particular, the Meltdown attack leaks ...
Polystore++: Accelerated Polystore System for Heterogeneous Workloads (May 24 2019). Modern real-time business analytics consist of heterogeneous workloads (e.g., database queries, graph processing, and machine learning). These analytic applications need programming environments that can capture all aspects of the constituent workloads ...
Indicating Asynchronous Multipliers (May 24 2019). Multiplication is a basic arithmetic operation that is encountered in almost all general-purpose microprocessing and digital signal processing applications, and multiplication is physically realized using a multiplier. This paper discusses the physical ...
In-DRAM Bulk Bitwise Execution Engine (May 23 2019; Jun 04 2019). Many applications heavily use bitwise operations on large bitvectors as part of their computation. In existing systems, performing such bulk bitwise operations requires the processor to transfer a large amount of data on the memory channel, thereby consuming ...
Low-power Programmable Processor for Fast Fourier Transform Based on Transport Triggered Architecture (May 18 2019). This paper describes a low-power processor tailored for fast Fourier transform computations where transport triggering template is exploited. The processor is software-programmable while retaining an energy-efficiency comparable to existing fixed-function ...
Performance Analysis of 6T and 9T SRAM (May 18 2019). The SRAM cell is made up of latch, which ensures that the cell data is preserved as long as power is turned on and refresh operation is not required for the SRAM cell. SRAM is widely used for on-chip cache memory in microprocessors, game software, computers, ...
HALLS: An Energy-Efficient Highly Adaptable Last Level STT-RAM Cache for Multicore Systems (May 18 2019). Spin-Transfer Torque RAM (STT-RAM) is widely considered a promising alternative to SRAM in the memory hierarchy due to STT-RAM's non-volatility, low leakage power, high density, and fast read speed. The STT-RAM's small feature size is particularly desirable ...
Fast TLB Simulation for RISC-V Systems (May 16 2019). Address translation and protection play important roles in today's processors, supporting multiprocessing and enforcing security. Historically, the design of the address translation mechanisms has been closely tied to the instruction set. In contrast, ...
Indicating Asynchronous Array Multipliers (May 15 2019). Multiplication is an important arithmetic operation that is frequently encountered in microprocessing and digital signal processing applications, and multiplication is physically realized using a multiplier. This paper discusses the physical implementation ...
Fully Integrated On-FPGA Molecular Dynamics Simulations (May 14 2019). The implementation of Molecular Dynamics (MD) on FPGAs has received substantial attention. Previous work, however, has consisted of either proof-of-concept implementations of components, usually the range-limited force; full systems, but with much of ...
FPGA-based Binocular Image Feature Extraction and Matching System (May 13 2019; May 14 2019). Image feature extraction and matching is a fundamental but computation intensive task in machine vision. This paper proposes a novel FPGA-based embedded system to accelerate feature extraction and matching. It implements SURF feature point detection and ...
Reconfigurable Hardware Implementation of the Successive Overrelaxation Method (May 12 2019). In this chapter, we study the feasibility of implementing SOR in reconfigurable hardware. We use Handel-C, a higher level design tool, to code our design, which is analyzed, synthesized, and placed and routed using the FPGAs proprietary software (DK Design ...
Optimizing Routerless Network-on-Chip Designs: An Innovative Learning-Based Framework (May 11 2019). Machine learning applied to architecture design presents a promising opportunity with broad applications. Recent deep reinforcement learning (DRL) techniques, in particular, enable efficient exploration in vast design spaces where conventional design ...
Exploiting Fine-Grain Ordered Parallelism in Dense Matrix Algorithms (May 09 2019). Dense linear algebra kernels are critical for wireless applications, and the oncoming proliferation of 5G only amplifies their importance. Many such matrix algorithms are inductive, and exhibit ample amounts of fine-grain ordered parallelism -- when multiple ...
SAWL: A Self-adaptive Wear-leveling NVM Scheme for High Performance Storage Systems (May 08 2019). In order to meet the needs of high performance computing (HPC) in terms of large memory, high throughput and energy savings, the non-volatile memory (NVM) has been widely studied due to its salient features of high density, near-zero standby power, byte-addressable ...
Efficient Similarity-aware Compression to Reduce Bit-writes in Non-Volatile Main Memory for Image-based Applications (May 07 2019). Image bitmaps have been widely used in in-memory applications, which consume lots of storage space and energy. Compared with legacy DRAM, non-volatile memories (NVMs) are suitable for bitmap storage due to the salient features in capacity and power savings. ...
Rethinking Arithmetic for Deep Neural Networks (May 07 2019). We consider efficiency in deep neural networks. Hardware accelerators are gaining interest as machine learning becomes one of the drivers of high-performance computing. In these accelerators, the directed graph describing a neural network can be implemented ...
IRONHIDE: A Secure Multicore Architecture that Leverages Hardware Isolation Against Microarchitecture State Attacks (Apr 29 2019). Modern microprocessors enable aggressive hardware virtualization that exposes the microarchitecture state of the processor due to temporal sharing of hardware resources. This paper proposes a novel secure multicore architecture, IRONHIDE that aims to ...
IRONHIDE: A Secure Multicore that Efficiently Mitigates Microarchitecture State Attacks for Interactive Applications (Apr 29 2019; Aug 09 2019). Microprocessors enable aggressive hardware virtualization by means of which multiple processes temporally execute on the system. These security-critical and ordinary processes interact with each other to assure application progress. However, temporal ...