Bloomberg engineers are constantly innovating to improve high-availability systems. Recently, at the 22nd International Conference on Extending Database Technology (EDBT 2019) in Lisbon, Portugal, Mark Hannum, Adi Zaimi, and Michael Ponomarenko from Bloomberg’s Comdb2 group presented their team’s research paper entitled “HASQL: A Method of Masking System Failures.”
HASQL, or High Availability SQL, as implemented in Comdb2, is a method that allows the database to seamlessly recover a disrupted transaction. Invented four years ago by Alex Scotti, Comdb2’s HASQL was described briefly in the VLDB 2016 paper “Comdb2: Bloomberg’s Highly Available Relational Database System.” Comdb2 is Bloomberg’s distributed relational database management system (RDBMS), which was published as open source in March 2017.
The EDBT paper described the rationale and the complete methodology behind HASQL, even going as far as outlining the protocol between the client and the server.
“Imagine the machine you are interacting with crashes while you are in the middle of executing a transaction, or even in the middle of consuming a result set,” said Michael Ponomarenko, the team leader of the Comdb2 group. “HASQL restores the transaction’s state against a different cluster node at the same logical point in time: your result set will continue streaming back without skipping or repeating a record. Your application doesn’t even need to be aware that a machine crash has occurred.”
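To make the "same logical point in time" idea concrete, here is a minimal sketch in Python. It is a toy simulation, not Comdb2's client API or wire protocol: the `Node` class, the `snapshot_id` parameter, and the `resilient_query` helper are all hypothetical names introduced for illustration. It assumes that every node can serve the same ordered snapshot of the data and that the client remembers how many rows it has already handed to the application.

```python
from dataclasses import dataclass
from typing import Optional


class NodeDown(Exception):
    """Raised when the node serving the query has crashed (simulated)."""


@dataclass
class Node:
    """Hypothetical cluster node holding ordered snapshots of one result set."""
    name: str
    snapshots: dict                   # snapshot id -> ordered list of rows
    fail_after: Optional[int] = None  # simulate a crash after this many rows

    def stream(self, snapshot_id, skip=0):
        """Yield rows of the given snapshot, skipping rows already delivered."""
        rows = self.snapshots[snapshot_id]
        for sent, row in enumerate(rows[skip:]):
            if self.fail_after is not None and sent >= self.fail_after:
                raise NodeDown(self.name)
            yield row


def resilient_query(nodes, snapshot_id):
    """Stream a result set, resuming on the next node after a crash."""
    delivered = 0  # rows already handed to the application
    for node in nodes:
        try:
            # Re-run the query at the same snapshot, skipping what was seen.
            for row in node.stream(snapshot_id, skip=delivered):
                delivered += 1
                yield row
            return
        except NodeDown:
            continue  # fail over to the next node at the same snapshot
    raise RuntimeError("no node could complete the query")
```

In this sketch the application simply iterates over `resilient_query(...)`. Because the retry reads the same snapshot and skips the rows already delivered, the caller sees each row exactly once and never observes the failover.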
At EDBT 2019, the engineers demonstrated both the original version of HASQL and the latest HASQL branch, which is able to maintain the same transaction simultaneously against multiple machines in the cluster.
“The original version of HASQL suffered a visible performance penalty while the code worked to re-establish a transaction’s state, so we needed to find strategies to reduce this latency to be nearly imperceptible in order to make the database truly tolerant of hardware failure,” explained Adi Zaimi, a programmer on the Comdb2 team. “While the original version of HASQL can restore a transaction in a matter of a few seconds, the concurrent version of HASQL continues a transaction almost immediately. If unused, the redundant sessions amount to wasted computing power, but we believe that certain users might be willing to make this tradeoff.”
The engineers designed the demos at EDBT to be fun and interactive. With Comdb2 running on a Raspberry Pi cluster, conference attendees were asked to power off the machine the query was running on. Within a few seconds, a different machine in the cluster returned the results. A second demo asked volunteers to power off all the machines simultaneously. When the machines came back online, the transaction state was restored and results continued to be returned.
“It was a fantastic way to showcase HASQL, and Comdb2 handled it beautifully,” said Rivers Zhang, another Comdb2 programmer, about the demonstration.
Akshat Sikarwar, another member of the Comdb2 team, called attention to the paper’s larger point.
“While the EDBT 2019 paper describes a specific methodology of masking machine failures, it argues more broadly that such events are better handled at the infrastructure layer rather than at the application layer,” Sikarwar states. “There are some conditions that are difficult for the application programmer to address. Imagine that a machine crashes after a client issues a commit. The API could return DISCONNECT, but then the fate of the transaction remains unknown. Should the application programmer re-attempt this transaction? Should they attempt to roll it back? It’s a mess. So generally, we’re arguing that distributed systems should strive to be unambiguous about the fate of an operation because the system itself is in a far better position to know this information than the application.”
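The ambiguous-commit problem Sikarwar describes can be illustrated with a small Python sketch. The article does not say how Comdb2 resolves it, so this shows one common technique, offered purely as an assumption: the client tags each commit with a unique identifier, and the system records the outcome under that identifier so a retry after a disconnect returns the recorded fate instead of applying the change twice. The `Cluster` class, `commit_with_retry` function, and `crash_before_reply` flag are hypothetical.

```python
import uuid


class Cluster:
    """Hypothetical cluster that remembers the outcome of each commit by id."""

    def __init__(self):
        self.outcomes = {}  # transaction id -> recorded outcome
        self.balance = 0    # toy replicated state

    def commit(self, txn_id, delta, crash_before_reply=False):
        # Apply the change only the first time this id is seen.
        if txn_id not in self.outcomes:
            self.balance += delta
            self.outcomes[txn_id] = "committed"
        if crash_before_reply:
            # The commit is durable, but the client never hears back.
            raise ConnectionError("node crashed after commit")
        return self.outcomes[txn_id]


def commit_with_retry(cluster, delta, crashes=1):
    """Commit once, retrying with the same id until a reply arrives."""
    txn_id = str(uuid.uuid4())  # chosen by the client, reused on every retry
    attempt = 0
    while True:
        try:
            return cluster.commit(
                txn_id, delta, crash_before_reply=attempt < crashes
            )
        except ConnectionError:
            attempt += 1  # fate unknown to the client; ask again, same id
```

The retry is safe because the identifier makes the commit idempotent: the system, which knows whether the transaction was applied, answers the question the application could not.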
Both Comdb2 and HASQL have been a collaborative effort in every sense, explains Mohit Khullar, the Comdb2 programmer responsible for writing the original version of HASQL.
“We are immensely proud of Comdb2 and the work we’ve done on HASQL,” Khullar said. “In this paper, we describe the methodology in general terms, but it is worth pointing out that it relies on design decisions and features that were implemented at various stages throughout Comdb2’s long history.”
In addition to working on Comdb2, software engineer Joe Mistachkin is also part of the core SQLite development team.
“Interacting with the academic community is a win-win situation: not only does it give us the opportunity to promote Bloomberg and Comdb2, but also conferences like EDBT, SIGMOD and VLDB create valuable environments for brainstorming about the product,” explains Mistachkin. “We are actively developing projects that originated at conferences, and we expect some of these projects to be hugely beneficial to our users.”
“HASQL: A Method of Masking System Failures” was authored by the entire Comdb2 group: Mark Hannum, Adi Zaimi, Mike Ponomarenko, Dorin Hogea, Akshat Sikarwar, Mohit Khullar, Rivers Zhang, Lingzhi Deng, Nirbhay Choubey, and Joe Mistachkin.