Scalable Multi-Core Simulation Using Parallel Dynamic Binary Translation

    Paper - Scalable Multi-Core Simulation Using Parallel Dynamic Binary Translation
  • Type: Conference Paper
  • Authors:
    O. Almer, I. Böhm, T. Edler von Koch, B. Franke, S. Kyle, V. Seeker, C. Thompson and N. Topham.
  • To appear in Proceedings of the International Symposium on Systems, Architectures, Modeling, and Simulation (SAMOS'11), Samos, Greece, July 19-22, 2011.
  • Abstract:

    In recent years multi-core processors have seen broad adoption in application domains ranging from embedded systems through general-purpose computing to large-scale data centres. Simulation technology for multi-core systems, however, lags behind and does not provide the simulation speed required to effectively support design space exploration and parallel software development. While state-of-the-art instruction set simulators (ISS) for single-core machines reach or exceed the performance levels of speed-optimised silicon implementations of embedded processors, the same does not hold for multi-core simulators where large performance penalties are to be paid. In this paper we develop a fast and scalable simulation methodology for multi-core platforms based on parallel and just-in-time (JIT) dynamic binary translation (DBT). Our approach can model large-scale multi-core configurations, does not rely on prior profiling, instrumentation, or compilation, and works for all binaries targeting a state-of-the-art embedded multi-core platform implementing the ARCompact instruction set architecture (ISA). We have evaluated our parallel simulation methodology against the industry standard SPLASH-2 and EEMBC MULTIBENCH benchmarks and demonstrate simulation speeds up to 25,307 MIPS on a 32-core x86 host machine for as many as 2048 target processors whilst exhibiting minimal and near constant overhead.

Generalized Just-In-Time Trace Compilation using a Parallel Task Farm in a Dynamic Binary Translator

    Paper - Generalized Just-In-Time Trace Compilation using a Parallel Task Farm in a Dynamic Binary Translator
  • Type: Conference Paper
  • Authors:
    I. Böhm, T. Edler von Koch, S. Kyle, B. Franke and N. Topham.
  • To appear in ACM SIGPLAN 2011 Conference on Programming Language Design and Implementation (PLDI'11), San Jose, CA, June 2011.
  • Abstract:

    Dynamic Binary Translation (DBT) is the key technology behind cross-platform virtualization and allows software compiled for one Instruction Set Architecture (ISA) to be executed on a processor supporting a different ISA. Under the hood, DBT is typically implemented using Just-In-Time (JIT) compilation of frequently executed program regions, also called traces. The main challenge is translating frequently executed program regions as fast as possible into highly efficient native code. As time for JIT compilation adds to the overall execution time, the JIT compiler is often decoupled and operates in a separate thread independent from the main simulation loop to reduce the overhead of JIT compilation. In this paper we present two innovative contributions. The first contribution is a generalized trace compilation approach that considers all frequently executed paths in a program for JIT compilation, as opposed to previous approaches where trace compilation is restricted to paths through loops. The second contribution reduces JIT compilation cost by compiling several hot traces in a concurrent task farm. Altogether we combine generalized light-weight tracing, large translation units, parallel JIT compilation and dynamic work scheduling to ensure timely and efficient processing of hot traces. We have evaluated our industry-strength, LLVM-based parallel DBT implementing the ARCompact ISA against three benchmark suites (EEMBC, BioPerf and SPEC CPU2006) and demonstrate speedups of up to 2.08 on a standard quad-core Intel Xeon machine. Across short- and long-running benchmarks our scheme is robust and never results in a slowdown. In fact, using four processors, total execution time can be reduced by 11.5% on average over state-of-the-art decoupled, parallel (or asynchronous) JIT compilation.
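
    The decoupled, parallel JIT scheme described above can be sketched as a task farm: the main simulation loop enqueues hot traces, and a pool of worker threads compiles them concurrently, off the critical path. This is a minimal Python sketch rather than the paper's LLVM-based implementation; `compile_trace` is a hypothetical stand-in for the actual JIT compiler.

    ```python
    import queue
    import threading

    # Hypothetical trace compiler: in the paper this is an LLVM-based JIT;
    # here we just "compile" by recording a placeholder result per trace.
    def compile_trace(trace, compiled):
        compiled[trace] = f"native({trace})"

    def worker(tasks, compiled):
        while True:
            trace = tasks.get()
            if trace is None:          # sentinel: shut this worker down
                tasks.task_done()
                return
            compile_trace(trace, compiled)
            tasks.task_done()

    def run_task_farm(hot_traces, n_workers=4):
        tasks = queue.Queue()
        compiled = {}
        threads = [threading.Thread(target=worker, args=(tasks, compiled))
                   for _ in range(n_workers)]
        for t in threads:
            t.start()
        # The main simulation loop would enqueue traces as they become hot;
        # compilation proceeds concurrently with simulation.
        for trace in hot_traces:
            tasks.put(trace)
        for _ in threads:
            tasks.put(None)
        for t in threads:
            t.join()
        return compiled

    compiled = run_task_farm(["loop_a", "loop_b", "call_chain_c"])
    ```

    Dynamic work scheduling falls out of the shared queue: whichever worker is idle picks up the next hot trace.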

Cycle-Accurate Performance Modelling in an Ultra-Fast Just-In-Time Dynamic Binary Translation Instruction Set Simulator

    Paper - Cycle-Accurate Performance Modelling in an Ultra-Fast Just-In-Time Dynamic Binary Translation Instruction Set Simulator
  • Type: Conference Paper
  • Authors:
    I. Böhm, B. Franke and N. Topham.
  • Proceedings of the International Symposium on Systems, Architectures, Modeling, and Simulation (SAMOS'10), Samos, Greece, July 19-22, 2010.
  • Award: Received Best Paper Award
  • Abstract:

    Instruction set simulators (ISS) are vital tools for compiler and processor architecture design space exploration and verification. State-of-the-art simulators using just-in-time (JIT) dynamic binary translation (DBT) techniques are able to simulate complex embedded processors at speeds above 500 MIPS. However, these functional ISS do not provide microarchitectural observability. In contrast, low-level cycle-accurate ISS are too slow to simulate full-scale applications, forcing developers to revert to FPGA-based simulations. In this paper we demonstrate that it is possible to run ultra-high speed cycle-accurate instruction set simulations surpassing FPGA-based simulation speeds. We extend the JIT DBT engine of our ISS and augment JIT generated code with a verified cycle-accurate processor model. Our approach can model any microarchitectural configuration, does not rely on prior profiling, instrumentation, or compilation, and works for all binaries targeting a state-of-the-art embedded processor implementing the ARCompact™ instruction set architecture (ISA). We achieve simulation speeds up to 63 MIPS on a standard x86 desktop computer, whilst the average cycle-count deviation is less than 1.5 % for the industry standard EEMBC and CoreMark benchmark suites.
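
    The idea of augmenting JIT-generated code with a cycle-accurate model can be illustrated with a small sketch: each translated block executes its functional behaviour and also charges a pre-computed cycle cost to a counter. The opcode costs below are invented for illustration; the paper's verified model accounts for the full microarchitecture (pipeline state, hazards, memory latencies).

    ```python
    # Hypothetical per-opcode cycle costs, for illustration only.
    CYCLE_COST = {"add": 1, "mul": 3, "ld": 2, "st": 2}

    def translate_block(block):
        """'Translate' a basic block: return a function that executes the
        block and also updates a cycle counter, mimicking how the JIT
        generated code is augmented with the performance model."""
        cost = sum(CYCLE_COST[op] for op in block)
        def native(state):
            state["instructions"] += len(block)
            state["cycles"] += cost      # performance model, baked in
        return native

    state = {"instructions": 0, "cycles": 0}
    for fn in (translate_block(["ld", "add", "st"]),
               translate_block(["mul", "add"])):
        fn(state)
    ```

    Because the cost is folded into the translated code at compile time, the counter update adds almost no per-instruction overhead at simulation time.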

Integrated Instruction Selection and Register Allocation for Compact Code Generation Exploiting Freeform Mixing of 16- and 32-bit Instructions

    Paper - Integrated Instruction Selection and Register Allocation for Compact Code Generation Exploiting Freeform Mixing of 16- and 32-bit Instructions
  • Type: Conference Paper
  • Authors: T. Edler von Koch, I. Böhm and B. Franke
  • To appear in: International Symposium on Code Generation and Optimization (CGO'10), Toronto, Canada, 2010.
  • Abstract:

    For memory constrained embedded systems code size is at least as important as performance. One way of increasing code density is to exploit compact instruction formats, e.g. ARM Thumb(-2), where the processor operates in either a standard or a compact instruction mode. The ARCompact ISA considered in this paper is different in that it allows freeform mixing of 16- and 32-bit instructions without a mode switch. Compact 16-bit instructions can be used anywhere in the code provided that additional register constraints are satisfied. In this paper we present an integrated instruction selection and register allocation methodology and develop two approaches for mixed-mode code generation: a simple opportunistic scheme and a more advanced feedback-guided instruction selection scheme. We have implemented a code generator targeting the ARCompact ISA and evaluated its effectiveness on the ARC 750D embedded processor using the EEMBC benchmark suite. On average, we achieve a code size reduction of 16.7% across all benchmarks while at the same time improving performance by 17.7% on average.
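
    The opportunistic scheme can be sketched as follows: emit the compact 16-bit form whenever every operand register falls within the reduced register set that 16-bit ARCompact instructions can address (assumed here to be r0-r3 and r12-r15). This is an illustrative sketch, not the paper's integrated selector, and the mnemonics are invented.

    ```python
    # Assumed compact-addressable register subset for 16-bit encodings.
    COMPACT_REGS = {f"r{i}" for i in (0, 1, 2, 3, 12, 13, 14, 15)}

    def select_encoding(op, regs):
        """Opportunistic scheme: emit the 16-bit form whenever every
        operand register satisfies the compact-register constraint."""
        if all(r in COMPACT_REGS for r in regs):
            return (op + "_s", 2)   # compact form, 2 bytes
        return (op, 4)              # full 32-bit form

    def code_size(instructions):
        return sum(select_encoding(op, regs)[1] for op, regs in instructions)

    program = [("add", ["r0", "r1", "r2"]),   # all compact -> 2 bytes
               ("sub", ["r0", "r5", "r1"]),   # r5 forces the 32-bit form
               ("mov", ["r12", "r15"])]       # all compact -> 2 bytes
    ```

    The feedback-guided scheme goes further by letting the register allocator steer hot values into the compact subset, which this sketch does not attempt.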

Automated Code Generation using Dynamic Programming

Book - Automatic Code Generation
  • Type: Book
  • Author: Igor Böhm
  • Publisher: VDM Verlag Dr. Mueller e.K. (8 Feb 2008)
  • ISBN-10: 3836461587
  • ISBN-13: 978-3836461580
  • Abstract:

    Building compiler back ends from declarative specifications that map tree structured intermediate representations onto target machine code is the topic of this book. Although many tools and approaches have been devised to tackle the problem of automated code generation, there is still room for improvement. In this context we present HBURG, an implementation of a code generator generator that emits compiler back ends from concise tree pattern specifications written in our code generator description language. The language features attribute grammar style specifications and allows for great flexibility with respect to the placement of semantic actions. Our main contribution is to show that these language features can be integrated into automatically generated code generators that perform optimal instruction selection based on tree pattern matching combined with dynamic programming. In order to substantiate claims about the usefulness of our language we provide two complete examples that demonstrate how to specify code generators for RISC and CISC architectures. Compiler writers are the primary target audience of this book.
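
    The core technique, optimal instruction selection via tree pattern matching combined with dynamic programming, can be sketched in a few lines: each rule assigns a cost to a tree pattern, and a recursive pass picks the cheapest cover for each node (a real BURG-style generator would tabulate per-node costs bottom-up). The rules, costs and instruction names below are illustrative and not taken from HBURG.

    ```python
    class Node:
        def __init__(self, op, *kids):
            self.op, self.kids = op, kids

    # (operator, child patterns: None = "any covered subtree",
    #  cost, emitted instruction) -- all illustrative.
    RULES = [
        ("Const", (), 1, "li"),
        ("Reg",   (), 0, ""),
        ("Add",   (None, None), 1, "add"),
        ("Add",   (None, "Const"), 1, "addi"),  # folds the constant
    ]

    def cover(node):
        """Return (total_cost, instructions) for the cheapest cover."""
        best = None
        for op, kid_pats, cost, insn in RULES:
            if op != node.op or len(kid_pats) != len(node.kids):
                continue
            total, code, ok = cost, [], True
            for kid, pat in zip(node.kids, kid_pats):
                if pat is not None:
                    if kid.op != pat:
                        ok = False
                        break
                    continue  # pattern consumes the child directly
                c, sub = cover(kid)
                total += c
                code += sub
            if ok and (best is None or total < best[0]):
                best = (total, code + ([insn] if insn else []))
        return best

    tree = Node("Add", Node("Reg"), Node("Const"))
    ```

    For the tree above, dynamic programming prefers the constant-folding rule (`addi`, total cost 1) over loading the constant and adding (`li` + `add`, total cost 2).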

Automated Code Generation using Dynamic Programming Techniques

Master Thesis - Automatic Code Generation Using Dynamic Programming Techniques
  • Abstract:

    Building compiler back ends from declarative specifications that map tree structured intermediate representations onto target machine code is the topic of this thesis. Although many tools and approaches have been devised to tackle the problem of automated code generation, there is still room for improvement. In this context we present HBURG, an implementation of a code generator generator that emits compiler back ends from concise tree pattern specifications written in our code generator description language. The language features attribute grammar style specifications and allows for great flexibility with respect to the placement of semantic actions. Our main contribution is to show that these language features can be integrated into automatically generated code generators that perform optimal instruction selection based on tree pattern matching combined with dynamic programming. In order to substantiate claims about the usefulness of our language we provide two complete examples that demonstrate how to specify code generators for RISC and CISC architectures.

  • Kurzfassung:

    This diploma thesis describes HBURG, a tool that automatically generates a code generator for a target machine from a specification of a program's abstract syntax tree and a specification of the desired target machine. Mappings between abstract syntax trees and a target machine are defined by tree patterns. For this purpose we have developed a declarative description language that allows attributes to be attached to tree patterns, effectively parameterising them. In addition, tree patterns can be annotated with semantic actions at arbitrary positions. We show that these language features are applicable in code generators based on tree pattern matching and dynamic programming. Two example code generator specifications for RISC and CISC target machines demonstrate the expressiveness of the developed language.


Rule Based vs. Statistical Chunking of CoNLL Data Sets

Seminar Paper - Rule Based vs. Statistical Chunking of CoNLL Data Sets
  • Abstract:

    Segmentation and labeling are among the most common operations in language processing [7]. Chunking is a popular representative of a segmentation process that aims to segment tagged tokens into meaningful structures. This paper compares two chunking approaches: an approach based on regular expression rules developed by a human, and a machine-based chunking approach built on an N-gram statistical tagger. Experimental results show that the performance of the machine-based chunker is very similar to that of the regular expression chunker. Another interesting finding is that it was considerably harder to define regular expressions capturing noun phrases than verb phrases, a difficulty most likely caused by the fact that the structure of noun phrases can be very complex.
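
    The rule-based side of the comparison can be illustrated with a toy NP chunker over (token, POS-tag) pairs: here an NP is taken to be an optional determiner, any number of adjectives, and one or more nouns. This simplistic rule is an assumption for illustration only; as the paper notes, realistic noun phrase rules are considerably harder to write.

    ```python
    def chunk_np(tagged):
        """Group (token, tag) pairs into noun-phrase chunks using the
        toy rule NP = DT? JJ* (NN|NNS)+ (illustrative, not exhaustive)."""
        chunks, current, seen_noun = [], [], False
        for token, tag in tagged:
            if tag in ("NN", "NNS"):
                current.append(token)
                seen_noun = True
            elif tag in ("DT", "JJ") and not seen_noun:
                current.append(token)
            else:
                if seen_noun:                 # a complete NP ends here
                    chunks.append(current)
                # DT/JJ after a noun starts a new candidate chunk
                current = [token] if tag in ("DT", "JJ") else []
                seen_noun = False
        if seen_noun:
            chunks.append(current)
        return chunks

    sentence = [("the", "DT"), ("quick", "JJ"), ("fox", "NN"),
                ("jumps", "VBZ"), ("over", "IN"),
                ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
    ```

    The statistical alternative would instead learn chunk boundaries from tagged training data, with no hand-written pattern at all.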


Unigram Backoff vs. TnT: Evaluating Part of Speech Taggers

Seminar Paper - Unigram Backoff vs. TnT: Evaluating Part of Speech Taggers
  • Abstract:

    Automated statistical part-of-speech (POS) tagging has been a very active research area for many years and is a foundation of natural language processing systems. This paper introduces and analyzes the performance of two part-of-speech taggers: the NLTK unigram backoff tagger and the TnT tagger, a trigram tagger. Experimental results show that the TnT tagger outperforms the NLTK unigram backoff tagger. However, the tagging accuracy of both taggers decreases by almost the same amount when the taggers are tested on a corpus that differs from the one they were trained on.
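
    The unigram-with-backoff idea can be sketched briefly: each word is assigned its most frequent tag from the training data, and unseen words fall back to a default tag (NN here, as an assumption). This mirrors the spirit of chaining NLTK's UnigramTagger to a DefaultTagger, not the exact evaluated setup.

    ```python
    from collections import Counter, defaultdict

    def train_unigram(tagged_corpus):
        """Learn each word's most frequent tag from (word, tag) pairs."""
        counts = defaultdict(Counter)
        for word, tag in tagged_corpus:
            counts[word][tag] += 1
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag(words, model, backoff_tag="NN"):
        # Unseen words back off to a default tag.
        return [(w, model.get(w, backoff_tag)) for w in words]

    train = [("the", "DT"), ("dog", "NN"), ("barks", "VBZ"), ("the", "DT")]
    model = train_unigram(train)
    ```

    A trigram tagger like TnT conditions on the two preceding tags as well, which is what gives it the edge reported above.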


RAIDframe & Distributed RAID

Seminar Paper - RAIDframe & Distributed RAID
  • Abstract:

    This paper is divided into two parts, both related to interesting developments in the area of computer storage. The first part deals with RAID and introduces RAIDframe, a framework for rapid prototyping of RAID architectures. In the second part, two distributed storage concepts, namely Storage Area Networks (SAN) and Network Attached Storage (NAS), are introduced and compared to Direct Attached Storage.
    Over the past decade, several trends in computing have driven the design of storage subsystems towards increasing parallelism. Disk arrays are an innovation that improves both the dependability and performance of storage systems. Since disk arrays consist of many disk drives and, hence, many disk arms, rather than one large drive with one disk arm, potential throughput can be increased, thus improving performance.
    After covering the basics, we elaborate on the need for a framework, namely RAIDframe, that enables rapid RAID prototyping. Many key ideas and concepts implemented in RAIDframe are explained in order to illustrate its power and flexibility. Finally, a section describing the most important steps necessary to extend RAIDframe with a new architecture demonstrates its extensibility.
    While storage prices keep dropping, the data itself remains invaluable. High-speed computer networks make it possible to aggregate servers into central, organization-wide server farms. Keeping reliable storage on a per-server basis can soon become a maintenance and cost problem. Distributed storage helps organizations aggregate reliable storage in centrally managed places, thus reducing the per-server maintenance overhead. Currently, two approaches to distributed storage exist: Storage Area Networks and Network Attached Storage. This paper introduces the main concepts behind them and compares distributed to local storage. Two concrete distributed storage protocols are presented to give the reader a better overview.
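
    The parity mechanism underlying many of the RAID architectures discussed can be shown in a few lines: the parity block is the XOR of the data blocks, so any single lost block can be reconstructed by XOR-ing the survivors. A minimal sketch:

    ```python
    def xor_blocks(blocks):
        """XOR a list of equal-length byte blocks together."""
        out = bytes(len(blocks[0]))
        for b in blocks:
            out = bytes(x ^ y for x, y in zip(out, b))
        return out

    data = [b"disk0!", b"disk1!", b"disk2!"]
    parity = xor_blocks(data)          # stored on the parity disk

    # Simulate losing disk 1 and rebuilding it from survivors + parity:
    # d0 ^ d2 ^ (d0 ^ d1 ^ d2) == d1
    rebuilt = xor_blocks([data[0], data[2], parity])
    ```

    RAIDframe's contribution is to let such encode/decode operations be composed into full architectures (striping, rotation, recovery DAGs) without rewriting the surrounding infrastructure.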


PortBrowser: A user interface for the BSD ports system

PortBrowser: A user interface for the BSD ports system
  • Abstract:

    BSD operating systems offer a powerful and flexible framework for installing third-party packages, called the BSD ports system. This framework provides a complete environment coupled with a rich set of commands that can build almost any kind of software from source code and package it. Thus, any piece of software that has been incorporated into the BSD ports system can easily be installed, deleted or upgraded. The PortBrowser, a graphical front end for the BSD ports system, is a lightweight tool that offers a simple, portable, and secure environment for the most frequent tasks provided by the ports system. It is useful for novice as well as experienced users, since it allows easy browsing through the BSD ports tree and offers basic search facilities. This paper covers the basics and some details of the various flavours of BSD ports tree implementations, followed by a description of the PortBrowser project. Implementation details as well as the GUI design of the PortBrowser are described in an easy-to-understand manner.


Integration of Security Measures and Techniques in an Operating System (considering OpenBSD as an example)

Integration of Security Measures and Techniques in an Operating System
  • Abstract:

    This paper covers defensive security technologies and secure programming techniques employed in the OpenBSD operating system. Privilege separation, a secure design pattern implemented in many privileged processes, is discussed using the network time protocol daemon OpenNTPD as an example. Afterwards, a closer look is taken at address space and memory protection techniques such as W^X and the Stack Smashing Protector. Finally, the process of gathering entropy to produce high-quality random numbers is discussed based on examples found in the OpenBSD operating system.
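
    The privilege-separation pattern can be sketched as follows. In OpenNTPD the two sides are separate processes, and the unprivileged child additionally chroots and drops privileges; this thread-based Python sketch only illustrates the narrow request interface between the untrusted parser and the privileged side, with an invented "adjust clock" operation and sanity bound.

    ```python
    import queue
    import threading

    def unprivileged_parser(raw_packet, requests):
        # Untrusted parsing happens here; a compromise of this side
        # cannot set the clock directly, only submit a request.
        offset = float(raw_packet.decode())
        requests.put(("adjust", offset))

    def privileged_monitor(raw_packet):
        requests = queue.Queue()   # the narrow communication channel
        t = threading.Thread(target=unprivileged_parser,
                             args=(raw_packet, requests))
        t.start()
        t.join()
        op, offset = requests.get()
        # The privileged side validates every request before acting.
        if op == "adjust" and abs(offset) < 1000:
            return f"clock adjusted by {offset}s"
        return "request refused"
    ```

    The design choice is that the privileged side exposes one validated operation rather than trusting anything the parser produces.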