In 2019, I started offering a whimsically titled course, “Rafting Trip”, wherein the goal is to implement the Raft distributed consensus protocol. This might seem like a really oddball topic to be looking at, but the more I’ve dived into it, the more I’ve come to appreciate it as a really great topic for a project-based course. You can read some of my thoughts after the initial run of the course here. I’ve since had a few more thoughts about it that I’ll share here.
As a bit of background, my first exposure to Raft was in 2018 during a run of Hillel Wayne’s Introduction to TLA+ workshop. At the time, I knew nothing about TLA+. Raft is not a topic that’s part of Hillel’s course, but one of the participants in the workshop, a Ph.D. candidate, said he was working on a dissertation involving distributed consensus (and that he couldn’t graduate unless he included some kind of TLA+ spec in his dissertation). It all sounded rather heady–not only did I not know anything about TLA+, I certainly had no experience with distributed consensus or any of the other topics being tossed about in subsequent discussion (Paxos, Raft, etc.). Various thoughts raced through my mind. “Have I been doing too much Python? Have I been spending too much time biking? Or posting on Twitter? Why have I not encountered this?” Part of the answer might be in the timing. The first paper on Paxos was published in 1998–the same year I received my Ph.D. The first research paper on Raft was published in 2014. Needless to say, I just wouldn’t have encountered any of this in school and it’s not clear that I would have crossed paths with distributed consensus in my day-to-day Python coding and training. That said, it seemed like something interesting to explore a bit further.
In doing that, I think the most interesting thing about implementing Raft is not Raft, but everything else that you run into along the way of trying to do it. For example, consider testing. Yes, you can write unit tests and try to follow TDD principles. However, a big part of Raft concerns the behavior of the system under failure. What happens if the network goes down? Or the network is slow? Or if a message is lost? Or if a machine reboots? Or if the power goes off? And how do you actually test for these kinds of things? Can you even write a unit test to simulate a power failure? I don’t know. It seems difficult and it’s something you have to think about.
In addressing these issues, you start to think about other matters. For example, can you use OO design principles to try and organize code in a way that makes it easier to test or to simulate? Or can you use model checking tools (like TLA+) to help verify what you’re doing? If you’re going to use a model-checker, how do you translate the model to actual runnable code? And after you’re done, how do you know if THAT code is correct?
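To make the testing question a bit more concrete, here is a minimal sketch of one possible approach: hide the network behind an object so that a test can swap in a simulated version that drops messages or "powers off" a node. Everything in it (the `SimulatedNetwork` class, its method names, the `drop_rate` parameter) is a hypothetical illustration, not something taken from the Raft paper or from the course.

```python
import random
from collections import deque


class SimulatedNetwork:
    """Hypothetical in-memory stand-in for a real network (an assumption)."""

    def __init__(self, drop_rate=0.0, seed=0):
        self.drop_rate = drop_rate
        self.rng = random.Random(seed)  # seeded so a failing run can be replayed
        self.in_flight = deque()        # queued (destination, message) pairs
        self.inboxes = {}               # node id -> list of delivered messages
        self.down = set()               # node ids that are currently "off"

    def register(self, node_id):
        self.inboxes[node_id] = []

    def send(self, dest, msg):
        # A lost message is simulated by silently discarding it.
        if self.rng.random() < self.drop_rate:
            return
        self.in_flight.append((dest, msg))

    def crash(self, node_id):
        # Simulates a power failure: the node stops receiving messages.
        self.down.add(node_id)

    def restart(self, node_id):
        self.down.discard(node_id)

    def deliver_all(self):
        # Deliver everything queued; messages to crashed nodes vanish.
        while self.in_flight:
            dest, msg = self.in_flight.popleft()
            if dest not in self.down:
                self.inboxes[dest].append(msg)
```

A unit test could then call `crash()` on a node in the middle of an exchange and check what the rest of the system does. Whether a simulation like this is faithful enough to real failures is, of course, part of the problem.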
There is also a certain technology side to the whole affair. To implement Raft, you need concurrency and you need message passing. Very little is specified in the Raft paper about the actual mechanism used. Do you use threads? Async? Sockets? Do you use existing libraries and frameworks? What is the behavior of those frameworks under failure? There are many unknowns.
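As one illustration of the choices involved, the sketch below uses threads and in-process queues from the Python standard library for message passing. This is only one of many options, and it is an assumption made for the example rather than anything the Raft paper prescribes; the `node` and `run_demo` functions are hypothetical names. A real implementation would need sockets or something similar to cross machine boundaries.

```python
import queue
import threading


def node(inbox, outbox):
    # Runs in its own thread and acknowledges each message it receives.
    while True:
        msg = inbox.get()
        if msg is None:  # sentinel value that tells the node to shut down
            break
        outbox.put(("ack", msg))


def run_demo():
    to_node = queue.Queue()
    from_node = queue.Queue()
    worker = threading.Thread(target=node, args=(to_node, from_node))
    worker.start()

    to_node.put("hello")
    # The timeout keeps the caller from hanging forever if the node is stuck.
    reply = from_node.get(timeout=5)

    to_node.put(None)
    worker.join()
    return reply
```

Even in something this small, questions come up right away. What should the sender do if the reply never arrives? What if the node thread dies before it acknowledges?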
And on the subject of the Raft paper, part of the overall challenge also involves reading and interpreting the contents of the paper. Do you really understand the problem being solved? Do you understand the algorithm being described? Can you bridge the description of the algorithm in the paper to a real-world implementation of the algorithm? As they say, “it’s complicated.”
So, in the end, you’ve got all of this great, problematic, and interesting stuff stirred together in this one project. When you’re done, I think there’s a lot to take away and apply to all of your other coding projects. And that may be the real value in working on it.
(shameless plug) If all of this sounds appealing, then you should sign up for my course.