Languages, Systems, and Data Seminar (Winter 2022)

Time: Fridays, noon - 1:05pm (PT)
Location: The Internet / The LSD Lab (Engineering 2, Room 398)
Organizers: Lindsey Kuper and Tyler Sorensen

The Languages, Systems, and Data Seminar meets weekly to discuss interesting topics in the areas of programming languages, systems, databases, formal methods, security, software engineering, verification, architecture, and beyond. Our goal is to encourage interactions and discussions between students, researchers, and faculty with interests in these areas. The seminar is open to everyone interested. Participating UCSC students should register for the 2-credit course CSE 280O (let the organizers know if you’re an undergrad and need a permission code).

For winter 2022, we will continue to host the LSD Seminar on Zoom, but local folks can gather in person in the lab to tune into the Zoom talks together, assuming in-person meetings are allowed.

Talks will be advertised on the ucsc-lsd-seminar-announce (for anyone) and lsd-group (for UCSC-affiliated people) mailing lists.

Date	Speaker	Title
Jan. 7	Dev Purandare	Append is Near: Log-based Data Management on ZNS SSDs
Jan. 14	Max Willsey	What’s up with E-Graphs?
Jan. 21	Jonathan Castello	Simulating spacecraft: Not exactly rocket science
Jan. 28	Reese Levine	A Strong Approach to Testing Weak Memory Models
Feb. 4	Kostis Kaffes	Syrup: User-Defined Scheduling across the Stack
Feb. 11	Tzu-Han Hsu	Bounded Model Checking for Hyperproperties
Feb. 18	Matthew Weidner	Designing for Eventual Consistency with Simple, Composable CRDTs
Feb. 25	Jocelyn Chen	Web question answering with neurosymbolic program synthesis
March 4	Jie Zhou	Fat Pointers for Temporal Memory Safety of C
March 11	Sorawee Porncharoenwase	An Expressive and Optimal Pretty Printer

Jan. 7

Speaker: Dev Purandare

Title: Append is Near: Log-based Data Management on ZNS SSDs

Abstract: Log-based data management systems use storage as if it was an append-only medium in order to transform random writes into sequential writes, delivering a major advantage when logs were persisted on hard disks. Although solid-state drives (SSDs) offer improved random write capabilities, sequential writes continue to be advantageous due to locality and space efficiency. However, the inherent properties of flash-based SSDs induce significant disadvantages when utilizing a random write block interface, causing write amplification, uneven wear, log stacking, and garbage collection overheads. To eliminate these disadvantages, Zoned Namespace (ZNS) SSDs have recently been introduced. They offer increased capacity, reduced write amplification, and higher performance but require the host to participate in data placement through zones, which have sequential-write semantics and must be explicitly reset. In ZNS, the Zone Append primitive allows the host to push down fine-grained data placement onto the device, supporting appends to a zone without knowing the location of the tail. Full zones become immutable, greatly simplifying disaggregated storage and operations like replication.

We propose another pushdown technique, Group Append, which allows appends of data that is smaller (or possibly larger) than a block, offloading data buffering to the controller. We explore how ZNS SSDs with Zone Append, Group Append, and computational storage can benefit four log-based data management areas: (i) log-based file systems, (ii) LSM trees such as RocksDB, (iii) database systems, and (iv) event logs/shared logs. We also propose research directions for all four log-based data management using ZNS SSDs.

Bio: Dev is a PhD candidate in the Computer Science department at University of California Santa Cruz. His research interests lie in storage systems, operating systems and distributed systems, and programming languages. Dev completed his Bachelor’s in Computer Engineering at Savitribai Phule Pune University (SPPU) in India. At UCSC he has been working on improving the lifetime of low cost, low reliability SSDs and evolving systems for Zoned Namespace SSDs. Dev has been an Engineering Intern at Riverbed Networks (2017), and an Architecture Enabling Intern at SK Hynix America Ltd. (2019). He can be found on twitter (@dev14e), LinkedIn (devashishp), and email (devashish@ucsc.edu).

Jan. 14

Speaker: Max Willsey

Title: What’s up with E-Graphs?

Abstract: E-graphs are an important data structure inside SMT solvers, but recently they have been used for a whole lot more, including optimization, program synthesis, and verification!

This talk will give a whirlwind introduction to e-graphs and equality saturation, as well as some new algorithms that make them faster. I’ll also introduce egg, a toolchain that packages all this up into a generic, reusable library. Then, we’ll move on to some applications of these new techniques, including floating point accuracy, 3D CAD, deep learning, and more. Finally, I’ll talk about some of the current work folks are using egg for, including a teaser for our upcoming POPL paper that connects e-graphs to databases.

Bio: Max is a postdoc at the University of Washington in the PLSE group, where he also did his PhD. His main line of work is on egg, an e-graph and equality saturation toolkit.

Jan. 21

Speaker: Jonathan Castello

Title: Simulating spacecraft: Not exactly rocket science

Abstract: Caltech’s Jet Propulsion Laboratory (JPL) operates a fleet of autonomous spacecraft across our solar system. For more than thirty years, operators have validated planned activities against a simulated model of their spacecraft in order to preserve these valuable resources. These models are typically implemented in a domain-specific language (a DSL), which is then interpreted for an execution plan by reusable simulation systems.

As we make our spacecraft even more intelligent, our spacecraft models become correspondingly more complex. Existing simulation systems were designed with a minimal DSL and limited scope, and have been significantly expanded under a constant flow of feature requests. Unfortunately, this has led to simulation systems and models that are difficult to understand and maintain without relying on global reasoning and deep expertise.

Merlin is a newly open-sourced simulation system under development at JPL. Merlin models are written in plain Java, which we try to make as “uninteresting” as possible for modelers. Instead, we spend our “novelty budget” mostly under the hood. This talk will showcase the concurrency semantics supporting local reasoning in Merlin, and some of the techniques we take to keep Merlin out of the way.

Bio: Jonathan Castello is a software engineer most recently with Caltech’s Jet Propulsion Laboratory, where he worked on both modern and legacy ground software for spacecraft operations. His research interests sit at the nexus of modular design, concurrency, and monotonic state.

Jan. 28

Speaker: Reese Levine

Title: A Strong Approach to Testing Weak Memory Models

Abstract: GPUs are increasingly being used for compute workloads that require interactions between threads using shared memory. Like many CPUs, GPUs expose weak memory models that allow for more aggressive compiler and hardware optimizations at the expense of a more complex programming model. Testing these memory models is an ongoing line of research.

This talk will detail our ongoing project on testing GPU memory models. First, we will show a motivating example from an optimized renderer and illustrate how we can reason about the synchronization in the renderer using classic weak memory model litmus tests. Next, we will overview recent work on testing GPU memory models, including complex heuristics necessary to reveal weak behaviors in testing. Our new work builds on this existing work in several ways: (1) we provide a simple DSL and a compiler for expressing litmus tests and testing heuristics; we currently have backends for Vulkan and WebGPU. (2) A new parallel execution model for litmus tests, which allows tests to be executed an order of magnitude faster than in prior works. (3) A web application interface for running litmus tests through WebGPU. We will conclude with a discussion on how these contributions might lead to more robust testing methodologies and hope to facilitate discussions on how they relate to conformance testing.

Bio: Reese Levine is a second year PhD student at UC Santa Cruz advised by Tyler Sorensen. He is interested in research in parallel, concurrent, and distributed systems, and his work currently focuses on testing weak memory models in GPUs.

Feb. 4

Speaker: Kostis Kaffes

Title: Syrup: User-Defined Scheduling across the Stack

Abstract: Suboptimal scheduling decisions in operating systems, networking stacks, and application runtimes are often responsible for poor application performance, including higher latency and lower throughput. These poor decisions stem from a lack of insight into the applications and requests the scheduler is handling and a lack of coherence and coordination between the various layers of the stack, including NICs, kernels, and applications. We propose Syrup, a framework for user-defined scheduling. Syrup enables untrusted application developers to express application-specific scheduling policies across these system layers without being burdened with the low-level system mechanisms that implement them. Application developers write a scheduling policy with Syrup as a set of matching functions between inputs (threads, network packets, network connections) and executors (cores, network sockets, NIC queues) and then deploy it across system layers without modifying their code. Syrup supports multi-tenancy as multiple co-located applications can each safely and securely specify a custom policy. We present several examples of uses of Syrup to define application and workload-specific scheduling policies in a few lines of code, deploy them across the stack, and improve performance up to 8x compared with default policies.

Bio: Kostis Kaffes is a final-year Ph.D. candidate in Electrical Engineering at Stanford University, advised by Christos Kozyrakis. He is broadly interested in computer systems, cloud computing, and scheduling. His thesis focuses on end-host, rack-scale, and cluster-scale scheduling for microsecond-scale tail latency. Recently, he has been looking for ways to make it easier to implement and deploy custom scheduling policies across different layers of the stack. Kostis’s research has been supported by a Facebook Research Award and various scholarships and fellowships from Stanford, A.G. Leventis Foundation, and Gerondelis Foundation. Prior to Stanford, he received his undergraduate degree in Electrical and Computer Engineering from the National Technical University of Athens in Greece.

Feb. 11

Speaker: Tzu-Han Hsu

Title: Bounded Model Checking for Hyperproperties

Abstract: Hyperproperties is a powerful framework for specifying and reasoning about important classes of requirements that were not possible with trace-based languages such as the classic temporal logics. This talk will introduce a novel bounded model checking (BMC) algorithm for hyperproperties expressed in HyperLTL. Just as the classic BMC technique for LTL primarily aims at finding bugs, our approach also targets identifying counterexamples. Followed by the reduction of BMC for LTL to SAT solving, our BMC approach naturally reduces to QBF solving, as HyperLTL allows explicit and simultaneous quantification over multiple traces. Our algorithm is sound based on our theory of bounded semantics, which guarantees correct BMC results under finite exploration. We will also present our implemented tool HyperQube, a push-button QBF-based bounded model checker for hyperproperties, and demonstrate the effectiveness and efficiency of our approach via a rich set of practical applications, including security, concurrent data structures, path planning for multi-agent systems, and secrecy-preserving refinement mapping synthesis.

Bio: Tzu-Han Hsu is a second-year Ph.D. student in Computer Science and Engineering department at Michigan State University, advised by Dr. Borzoo Bonakdarpour. Her research areas include formal methods, model checking, verification, and synthesis for security/privacy policies. She is recently working on the topic of hyperproperties, a framework that can reason about multiple traces simultaneously, which has rich applications in formal analysis especially for multi-threaded and concurrent programs.

Before MSU, Tzu-Han received her bachelor’s degrees in Computer Science and Music-Piano Performance from Iowa State University in 2020. Tzu-Han can be reached on Twitter (@TzuHanH), LinkedIn (tzuhanhsu), email (tzuhan@msu.edu), and her personal website (https://tzuhancs.github.io/).

Feb. 18

Speaker: Matthew Weidner

Title: Designing for Eventual Consistency with Simple, Composable CRDTs

Abstract: Many distributed systems allow a group of devices to edit some shared state. Examples include collaborative apps, such as Google Docs and Figma, and distributed key-value stores. Often, these systems are highly available: each device can edit its own replica of the state immediately, then sync up with other devices in the background. Eventual consistency requires that once this syncing completes, all devices agree on the new state.

Eventual consistency is challenging because devices might make conflicting concurrent edits. For example, two users might type at the same time in a Google Doc, or one user might add entries to a spreadsheet while another alters the layout and formatting. Somehow, we must combine all these edits in a way that is consistent across devices and hopefully also makes sense to users. Conflict-free Replicated Data Types (CRDTs) provide one solution to this challenge, by giving programmers data structures with built-in replication and eventual consistency. However, they tend to be hard to understand or customize.

In this talk, I will propose a way of creating CRDTs by composing simple, easy-to-understand pieces. Ideally, this could let programmers design eventually consistent systems while maintaining complete control over the system’s behavior. A blog post describes the proposal in more detail.

Bio: Matthew is a third year PhD student at Carnegie Mellon University, advised by Heather Miller. His research focuses on tools for decentralized systems, with a particular interest in enabling open-source, local-first collaborative apps. Previously, he completed an MPhil at Cambridge University as a Churchill Scholar, where he studied decentralized secure group messaging protocols for the TRVE Data project.

Feb. 25

Speaker: Jocelyn Chen

Title: Web question answering with neurosymbolic program synthesis

Abstract: As the amount of information available on the web proliferates, there is a growing need for tools that can extract relevant information from the website. Due to the importance of the problem, there has been a flurry of research activity on information extraction and wrapper induction by using powerful neural models and program synthesis. While wrapper induction tools based on program synthesis work well when the target webpages have a shared global schemes, they are not as effective on structurally heterogeneous websites such as faculty webpages. On the other hand, ML-based techniques from the NLP community are, in principal, applicable to heterogeneous websites; however, by treating the entire webpage as unstructured texts, they fail to take advantage of the inherent tree structure of the HTML documents.

In this talk, we propose a new information extraction approach based on neurosymbolic program synthesis that combines the relative strengths of wrapper induction techniques for webpages with the flexibility of neural models for unstructured documents. Our approach targets structurally heterogeneous websites with no shared global schema and can be used to automate many different types of information extraction tasks. The key idea of our approach is to employ a neurosymbolic DSL that incorporates both neural NLP models as well as standard language constructs for tree navigation and string manipulation. We also propose an optimal synthesis algorithm that generates all DSL programs that achieve optimal F1 score on the training examples. We have implemented these ideas in a new tool called WebQA and evaluate it on 25 different tasks across multiple domains. Our experiments show that WebQA significantly outperforms existing tools such as state-of-the-art question answering models and wrapper induction systems.

Bio: Jocelyn Chen is a fourth-year graduate student at the University of Texas at Austin working with Isil Dillig. She is mainly interested in research at the intersection of program synthesis and natural language processing.

March 4

Speaker: Jie Zhou

Title: Fat Pointers for Temporal Memory Safety of C

Abstract: The past decade has seen an increasing trend of real-world exploits against temporal memory safety bugs in C/C++ programs, i.e., use-after-free (UAF), double free, and invalid free. These bugs are profoundly dangerous in that they can lead to data corruption, information leaking, or even arbitrary code execution. Many solutions were proposed, but they suffer one or more severe limitations such as high performance/memory overhead and missing bugs.

In this talk, I will describe how we solve this problem for C efficiently and comprehensively. We combine two existing techniques: fat pointers (in-place metadata) and ID-based dynamic pointer validation. We build our solution based on Checked C–a promising safe dialect of C which provides spatial memory safety but currently lacks temporal memory safety. We show that our solution significantly improves the performance and memory consumption compared to using disjoint metadata for pointer validation. Additionally, we also show that our solution is practical in terms of backward compatibility–one of the major concerns about fat pointers.

Bio: Jie Zhou is a fifth-year PhD student at the University of Rochester. Jie works on systems and software security, with a focus on memory safety. He is most interested in improving systems programming languages and their toolchains for security as he believes that language-based techniques can solve problems in a more fundamental way. He is also interested in developing program analysis and transformation techniques to enforce security policies on low-level software.

March 11

Speaker: Sorawee Porncharoenwase

Title: An Expressive and Optimal Pretty Printer

Abstract: Pretty printers transform structured data into “pretty” text. In this talk, I will present a pretty printing algorithm that (1) is strictly more expressive than current state-of-the-art algorithms; (2) guarantees optimality; and (3) has better time complexity than many state-of-the-art pretty printers. To distinguish our pretty printer from others, I will also give a brief survey of the existing pretty printing algorithms and dispel various myths about traditional pretty printers that turn out to be inaccurate. This is a joint work with Justin Pombrio and Emina Torlak.

Bio: Sorawee (Oak) is a fourth year PhD student at the University of Washington, advised by Emina Torlak. His research interest is improving usability of symbolic evaluation.