Languages, Systems, and Data Seminar (Fall 2021)

Time: Fridays, noon - 1:05pm (PT)
Location: The Internet / The LSD Lab (Engineering 2, Room 398)
Organizers: Lindsey Kuper and Tyler Sorensen

The Languages, Systems, and Data Seminar meets weekly to discuss interesting topics in the areas of programming languages, systems, databases, formal methods, security, software engineering, verification, architecture, and beyond. Our goal is to encourage interactions and discussions between students, researchers, and faculty with interests in these areas. The seminar is open to everyone interested. Participating UCSC students should register for the 2-credit course CSE 280O (let the organizers know if you’re an undergrad and need a permission code).

For fall 2021, we will continue to host the LSD Seminar on Zoom, but local folks can gather in person in the lab to tune into the Zoom talks together.

Talks will be advertised on the ucsc-lsd-seminar-announce (for anyone) and lsd-group (for UCSC-affiliated people) mailing lists.

Date	Speaker	Title
Sept. 24	None	Welcome!
Oct. 1	Maria Schett	Cheaper (& correct) blockchain protocols and programs
Oct. 8	Tanvir Ahmed Khan	Rescuing Data Center Processors
Oct. 15	Elizabeth Dinella	Data Driven Program Merge
Oct. 22	Vaishnavi Sundararajan	Better Safe than Sorry: Symbolic Verification for Security Protocols
Oct. 29	Jenna Wise	Gradual Verification of Recursive Heap Data Structures
Nov. 5	Thomas Koehler	Optimizing Processing Pipelines with a Rewrite-Based Domain-Extensible Compiler
Nov. 12	Giulia Guidi	Sparse Matrices and High-Performance Computing Meet Biology
Nov. 19	Hayley LeBlanc	Finding Crash Consistency Bugs in Persistent Memory File Systems
Dec. 3	Sydney Gibson	Efficient Verification of High-Performance Cryptographic Routines on New Hardware

Sept. 24

Welcome!

Oct. 1

Speaker: Maria Schett

Title: Cheaper (& correct) blockchain protocols and programs

Abstract: We will explore how we can make it cheaper to run blockchain protocols and programs—while guaranteeing correctness!

For cheaper protocols, we move from a blockchain to a blockgraph. Such a blockgraph, which we show to be a reliable point-to-point link, is built together by a set of servers. Thereby, the servers embed many parallel runs of a deterministic protocol into it. And because the protocol is deterministic, every server can locally replay the protocol—and no more messages need to be sent [1]!

For cheaper programs, we super-optimize bytecode of the Ethereum Virtual Machine, which runs on many servers to execute programs on the blockchain. Now, to avoid the halting problem, every executed bytecode costs money. So we have a clear optimization target—money—and we use an automated theorem prover to synthesize cheaper, observationally equivalent, programs [2].

Bio: Maria A Schett is currently finishing her PhD on “Cost Reduction With Guarantees: Formal Reasoning Applied To Blockchain Technologies” at University College London. She loves living in London and talking about blockchains, distributed systems, and compilers. She is conversational in term rewriting and qualitative research (from a previous life) and in Esperanto.

[1] “Embedding a Deterministic BFT Protocol in a Block DAG”, Maria A Schett and George Danezis, Proc. of the 2021 ACM Symposium on Principles of Distributed Computing (PODC) 2021

[2] “Synthesis of Super-Optimized Smart Contracts using Max-SMT”, Elvira Albert, Pablo Gordillo, Albert Rubio and Maria A Schett, Proc. of 32nd International Conference on Computer-Aided Verification (CAV) 2020

Oct. 8

Speaker: Tanvir Ahmed Khan

Title: Rescuing Data Center Processors

Abstract: To serve billions of users around the world, modern web applications that run across data centers access huge datasets and perform complex application logic. As a result, data center applications face two major challenges: (1) poor data access behavior and (2) poor instruction access behavior. In my research, I demonstrate that novel hardware-software codesign effectively solves both challenges. Specifically, I observe that both data and instruction accesses in data center applications follow a deeply repetitive pattern that can be efficiently optimized by profiling the application’s program flow behavior. In this talk, I will first present an overview of my techniques to improve data and instruction accesses. I will then describe two of my techniques in detail, showing how these techniques outperform prior proposals from Google.

Bio: Tanvir Ahmed Khan is a fifth-year Ph.D. candidate at the University of Michigan. His research interests lie at the intersection of computer architecture, compilers, and operating systems. He is interested in designing techniques at the boundary of hardware and software that enable software to better leverage hardware resources. He was a Facebook Fellowship (2020) finalist. His research on data center applications’ performance optimizations has appeared in top computer architecture and systems venues like ISCA, MICRO, PLDI, and OSDI. His work is being used by Intel and ARM to design the next-generation CPU architectures. Tanvir expects to graduate soon and after graduation, he is interested in tenure-track faculty positions.

Oct. 15

Speaker: Elizabeth Dinella

Title: Data Driven Program Merge

Abstract: In collaborative software development, program merging is the mechanism to integrate changes from multiple programmers. The most widespread merge algorithm, employed in modern version control systems, is an unstructured merge. Unstructured merge treats input programs as a sequence of characters and reports a conflict when changes interfere textually. On the other hand, structured approaches leverage knowledge of the underlying language to perform “intelligent” merges over the program’s AST. These approaches have shown significant advancements in automatic merging for Java. However, these methods have not found their way into modern version control systems as they are typically language dependent and do not generalize to dynamic languages such as JavaScript.

In a 1991 seminal paper on program merge [1], Westfechtel voiced: “A tool is urgently needed which automates the process of merging as much as possible.” Thirty years later, and there have not been significant advances in the prevailing merge algorithm - unstructured merge.

In this work, we take a fresh data-driven look at the problem of merge resolution. Guided by a multilingual dataset of over 200,000 merges, we explore deep learning techniques to resolve unstructured merge conflicts. Our key innovation is an edit-aware embedding of merge inputs.

Our resulting model achieves 63–68% accuracy of merge resolution synthesis, yielding nearly a 3x performance improvement over existing structured merge tools. Finally, we demonstrate that our approach is sufficiently flexible to work with source code files in Java, JavaScript, TypeScript, and C# programming languages, and can generalize zero-shot to unseen languages.

[1] Bernhard Westfechtel. 1991. Structure-oriented merging of revisions of software documents. In Proceedings of the 3rd international workshop on Software configuration management (SCM ‘91).

Bio: Elizabeth Dinella is a fourth-year PhD student at the University of Pennsylvania. Her research interests can be broadly described as “data-driven program reasoning.” Most recently, Elizabeth has worked on data-driven bug finding (Hoppity), program merge, and test oracle inference. She is an admirer of coffee and her chow chow puppy Cinnabon.

Oct. 22

Speaker: Vaishnavi Sundararajan

Title: Better Safe than Sorry: Symbolic Verification for Security Protocols

Abstract: A large part of our lives is spent online, creating, sharing, and consuming data in various forms. We use a profusion of security protocols to handle this more-often-than-not sensitive data, and connect to the internet and to the world at large. How secure are these communications? How private? These are fundamental questions to which we demand definitive answers. Even with highly sophisticated methods of testing, one cannot get unconditional answers, and this is where formal verification steps in. One can use logic or automata or similar formal methods to conclusively tell us if there are security flaws in these protocols. In this talk, we will provide a quick and dirty introduction to the world of symbolic verification, with a focus on security protocols. We will then look at a few ways in which the protocols of today’s day and age are modelled and verified.

Bio: Vaishnavi Sundararajan is a postdoc at UCSC. She completed her PhD at Chennai Mathematical Institute, with a thesis on formal aspects of certification in security protocols. She is interested in the areas of logic, security, proof theory and verification. Before UCSC, she was a postdoc at IRISA, Rennes, and a Research Associate at Ericsson Research, Bengaluru.

Oct. 29

Speaker: Jenna Wise

Title: Gradual Verification of Recursive Heap Data Structures

Abstract: Current static verification techniques do not provide good support for incrementality, making it difficult for developers to focus on specifying and verifying the properties and components that are most important. Dynamic verification approaches support incrementality, but cannot provide static guarantees. To bridge this gap, prior work proposed gradual verification, which supports incrementality by allowing every assertion to be complete, partial, or omitted, and provides sound verification that smoothly scales from dynamic to static checking. Our OOPSLA’20 paper [1] extended gradual verification to programs that manipulate recursive, mutable data structures on the heap. This talk outlines our extension and discusses technical challenges addressed during its development, such as ensuring consistency between static and dynamic checking. I also present the design of a gradual verifier that we built as follow-up work to our OOPSLA paper. It is implemented on top of the Viper static verifier and supports the C0 programming language. The C0 language is a safer, smaller subset of C taught in introductory CS courses at CMU. Finally, I conclude the talk with my short-term plans to evaluate our tool—which includes a case study on web browser JIT code—and my long-term vision for the field of gradual verification.

Bio: I am a fifth year PhD student in the Institute for Software Research at Carnegie Mellon University, and I am co-advised by Jonathan Aldrich and Joshua Sunshine. My research interests lie at the intersection of programming languages, software verification, and software engineering. In general, I intend to make verification technology, like formal verification and program analysis, more usable in practice. Towards this end, my thesis is dedicated to work in gradual verification, which supports incremental formal verification through the use of both static and dynamic techniques. I have also used ideas from gradual verification to produce a gradual null pointer analysis [2] that reduces false positives compared to state-of-the-art tools. It systematically deploys run-time checks wherever the underlying static analysis is imprecise. Further, I previously contributed to the language design of Penrose — which generates diagrams from mathematical prose — and Obsidian — a programming language that facilitates the development of secure blockchain applications. Outside of work, I enjoy playing video games, such as League of Legends and Final Fantasy XIV Online. I also enjoy bowling and often average more than 170 points per game!

[1] Wise, J., Bader, J., Wong, C., Aldrich, J., Tanter, É., Sunshine, J., “Gradual Verification of Recursive Heap Data Structures”, The ACM SIGPLAN conference on Systems, Programming, Languages, and Applications: Software for Humanity (SPLASH20), OOPSLA, November 15-20, 2020, Online

[2] Estep, S., Wise, J., Aldrich, J., Tanter, É., Bader, J., and Sunshine, J., “Gradual Program Analysis for Null Pointers”, The 35th European Conference on Object-Oriented Programming (ECOOP21), July 11-17, 2021, Online

Nov. 5

Speaker: Thomas Koehler

Title: Optimizing Processing Pipelines with a Rewrite-Based Domain-Extensible Compiler

Abstract: Domain-specific optimizing compilers such as Halide enable programs to be expressed at a convenient high-level, while generating high-performance code for parallel architectures. As domains of interest expand towards deep learning, probabilistic programming and beyond, it becomes increasingly clear that it is unsustainable to redesign domain specific compilers for each new domain. In addition, the rapid growth of hardware architectures to optimize for poses great challenges for designing these compilers.

In this talk, I will show how to extend a unifying domain-extensible compiler with domain-specific as well as hardware-specific optimizations. Optimizations are not hard-coded into the compiler but are expressed as user-defined rewrite rules that are composed into strategies controlling the optimization process. On four mobile ARM multi-core CPUs, the code generated for the Harris image processing pipeline outperforms the image processing library OpenCV by up to 16x and achieves performance close to - or even up to 1.4x better than - the state-of-the-art image processing compiler Halide.

However, optimization strategies are difficult to write because they need to decide how to apply rewrite rules. I will finish this talk by presenting a technique we call “guided equality saturation via sketching” to abstract over individual rewrites.

Bio: I am a PhD student in computer science at the University of Glasgow in Scotland, where I work closely with my supervisor Michel Steuwer. I am the lead developer of the Rise project, which combines a high-level functional language with a system of rewrite rules to encode optimization choices. It provides a domain-extensible way to generate high performance code for diverse hardware architectures.

Personal website

Nov. 12

Speaker: Giulia Guidi

Title: Sparse Matrices and High-Performance Computing Meet Biology

Abstract: In computational and data sciences, the need for scalable computing and data systems has recently increased due to the flood of data in areas such as genomics. Yet, scalable parallel programming in distributed memory is difficult, and high-performance computing (HPC) systems are typically allocated to specific research communities and have long user wait times, limiting access to resources and scientific discoveries.

In this context, we have developed a novel set of genomics algorithms for de novo genome assembly (i.e., reconstruction of an unknown genome from redundant, erroneous genomic sequences) that are integrated into the diBELLA 2D software package and are based on sparse matrix multiplication supporting general semiring abstraction. This enables the creation and easy modification of powerful genomics pipelines that take advantage of massively parallel hardware without exposing low-level architecture. diBELLA 2D is up to 2x faster on 100s nodes than a 1D algorithm based on distributed hash tables, which are more difficult to parallelize. diBELLA 2D integrates GPU support in the most compute-intensive stages of the pipeline to take advantage of today’s heterogeneous HPC hardware.

To ensure that the genomics research community and others in general, can benefit from HPC, the development of distributed algorithms such as diBELLA 2D must be coupled with efforts to make distributed computing more accessible. To this end, we have shown that we are on the cusp of a paradigm shift in high-performance computing (HPC) away from purely institutional or agency-wide HPC systems to cloud computing, as the latter has made significant advances in networking technology and HPC system software.

Bio: Giulia is a Ph.D. candidate in Computer Science at UC Berkeley and a graduate research assistant at the Computational Research Division of Lawrence Berkeley National Laboratory advised by Aydın Buluç and Kathy Yelick. Giulia is a 2020 SIGHPC Computational & Data Science Fellow and received her M.Sc. and B.Sc. in Biomedical Engineering from Politecnico di Milano. Her work is in the area of computer systems research, including cloud and parallel computing, and she is interested in building a collaborative interdisciplinary research program. Giulia is currently working on the challenges of large-scale computational biology, as well as the algorithms and software infrastructures that meet the usability and performance demand of this community. Currently, she is developing a novel algorithm for de novo assembly of genomes in distributed memory using long-read sequencing data and sparse matrix abstraction as part of the ExaBiome project. Giulia is also working on how to make cloud computing more accessible for high-performance scientific computing.

Nov. 19

Speaker: Hayley LeBlanc

Title: Finding Crash Consistency Bugs in Persistent Memory File Systems

Abstract: Persistent memory (PM) is a new storage technology that combines the speed and byte-addressability of DRAM with the durability of hard disks and solid state drives. Recent research has explored using PM in a number of different applications like databases and file systems. However, programming PM applications correctly is difficult; in order for data to be guaranteed durable on PM, developers must understand and use a set of assembly instructions to manage persistence. In particular, PM applications are prone to crash consistency bugs, where buggy data management leads to incorrect behavior after a crash.

PM file systems are a growing subset of PM applications, but there are currently no testing tools that target crash consistency bugs in these file systems. In this talk, I will describe my work on building a crash consistency testing framework designed specifically for PM file systems. I will present an overview of our testing approach and the bugs that the tool has found, and discuss the implications of these bugs on the development of PM file systems.

Bio: Hayley LeBlanc is a second year PhD student at the University of Texas at Austin. She is interested in developing techniques and tools to make sure that storage systems are reliable and correct. She also enjoys fencing and volunteering at her local animal shelter.

Dec. 3

Speaker: Sydney Gibson

Title: Efficient Verification of High-Performance Cryptographic Routines on New Hardware

Abstract: Formal proofs are often employed to show correctness of cryptographic routines at the assembly level, especially in instances where architecture-optimized performance is desirable or where compilers do not exist for the target ISA. While common cryptographic routines are often implemented in many different ISAs, the current cost of verification scales poorly with the number of architectures. In this talk, we explore solutions for writing correctness proofs for cryptographic routines which are generic across different architectures.

Specifically, we will look at assembly implementations of an RSA signature verification routine in RISC-V and in the OpenTitan BigNumber Accelerator (OTBN) ISA. Despite dramatic differences in these two ISAs, we find that the high-level control flow structure of the RSA routine is quite similar across implementations. We use this observation to construct a framework for writing partial architecture-generic assembly-level proofs which are abstracted over hardware details (e.g., register width). We will discuss how this framework can be used to reduce the overhead of proving the correctness of common cryptographic routines on different architectures, and look to future possibilities for further reducing this overhead.~

Bio: Sydney is a second-year Ph.D. student in the Computer Science Department of Carnegie Mellon, advised by Bryan Parno. She is interested in formal techniques for verifying safety and security properties of low-level, high-performance systems at scale. Sydney received her B.S. and M.Eng. from MIT, where she worked on verification techniques for concurrent, crash-safe systems.