Discussion forum for David Beazley

Concurrency, async etc


#1

Hi David,

If you did a concurrency course, I would definitely be ready for the trip to Chicago. Anyone else who feels like this would be a good idea?

/Rene


#2

Years ago, I ran a “Concurrency and Distributed Systems” course, but it kind of fell by the wayside. One of the reasons for it falling away was my compilers course. Not so much the contents of compilers, but the structure of running it. Basically, compilers is set up in the form of a week-long project. There’s a bit of lecture and traditional teaching, but mostly it’s about building a compiler. I’ve since been wanting to follow this same project-based structure with some other topics (including networks/concurrency).

Quite recently, I’ve been thinking that such a course might work for a project involving distributed consensus (e.g., Raft or Paxos). I’ve messed around with Raft and found it to be significantly more difficult to implement than your standard sort of “echo server” or even an “HTTP server” project. Something like that might work really well for a concurrency/async/networks kind of course.
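(For a sense of scale: the entire “echo server” project is more or less the snippet below — a minimal sketch using only the standard library, just to show the baseline that Raft dwarfs. Not actual course code.)

```python
# A minimal "echo server" -- roughly the whole baseline project.
# Standard library only; details are illustrative, not course code.
import asyncio

async def echo_handler(reader, writer):
    # Send every received chunk straight back until the client disconnects
    while data := await reader.read(1024):
        writer.write(data)
        await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(echo_handler, "localhost", 25000)
    async with server:
        await server.serve_forever()

asyncio.run(main())
```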

I’d be curious to get thoughts on that.


#3

Both lxd (for its distributed database) and juju (for its distributed log) use hashicorp’s raft. It may be handy (although it’s in Go) to see how Raft has been implemented and used in each case.

Definitely interested in what a Python implementation of Raft would look like!


#4

^ hashicorp’s raft in go https://github.com/hashicorp/raft


#5

Building a concurrent, distributed, and reliable system would be awesome. I would suggest it include:

  • Concurrent local Node (built on curio, trio, …)
  • State/Event model to implement non-blocking architecture
  • Message Queue implementation
  • How to make all events (Net, Disk, Key, Mouse, Internal messages, Timeout, Exceptions, …) appear on the queue (see the sketch after this list)
  • Distributed state (raft, paxos, …)
  • Maybe some Erlang-inspired stuff: fail fast, reload, restart of nodes
  • Reliable handling of external web-services/microservices
  • Handling a DB in this environment
  • Handling hot load of code and config
  • Replay eventstream to re-establish state
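
To illustrate the event-queue item: a tiny hypothetical sketch (standard library only, all names made up) where every event source funnels into one queue that a single loop consumes:

```python
# Hypothetical sketch: all events (timeouts here, but net/disk/keys
# work the same way) are posted to one queue, consumed by one loop.
import queue
import threading
import time

events = queue.Queue()

def timer_source(interval):
    # Timeouts become ordinary events on the shared queue
    while True:
        time.sleep(interval)
        events.put(("timeout", interval))

def run(handlers):
    # The state/event machine sees one serialized stream of events,
    # so its handlers never block or race each other
    while True:
        kind, payload = events.get()
        handlers.get(kind, lambda payload: None)(payload)

threading.Thread(target=timer_source, args=(1.0,), daemon=True).start()
run({"timeout": lambda payload: print("tick after", payload, "s")})
```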

Just say when I should be in Chicago…

/Rene

PS. Maybe throw in a GIL removal lecture and make it a two-week course :slight_smile:


#6

Maybe some Erlang inspired stuff

That reminds me of this.

It would be epic if we had even a broken implementation of this beauty in CPython.


#7

I think the trick in a course like this is making sure the project has enough of a well-defined scope to be doable in a week. I also wouldn’t want to do it in a way where there was too much reliance on third-party libraries (I’d want it to be more of a ground-up building of everything needed so that we could explore all of the underlying issues that arise).

I definitely think implementing Raft would be a hard challenge for everyone. Not only is the problem itself hard, the implementation involves a wide variety of extremely tricky edge cases. I’ve been working on an implementation that uses ZeroMQ and I’ve found it to be pretty challenging–and it still doesn’t work (think I need another few days).

In the big picture, I definitely think there are enough “issues” surrounding just Raft all by itself to make for a pretty good course.


#8

Hi! I also emailed you about a Chicago course on concurrency and related topics. I am definitely waiting to sign up! (It would be the most wonderful thing in the world if the course runs during my spring break in March too)


#9

implementation that uses ZeroMQ and I’ve found it to be pretty challenging

Are you referring to difficulties in using zeromq itself, or the consensus algorithms?


#10

The difficulties are all on the consensus side, not with ZeroMQ (at least not so far). There are a lot of very tricky interactions concerning time (heartbeats, timeouts, retries, etc.). For example, the leader sending out concurrent requests to followers–some of which might be dead or unresponsive. Or the fact that forward progress is made if a simple majority of nodes agree (not necessarily all nodes). Frankly, there is a whole lot of juggling going on with all of the parts and all of the possible failure modes. Holding it all in my head has proven to be challenging.
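
To make the majority rule concrete, the leader’s bookkeeping boils down to something like this (a hypothetical sketch, not my actual code):

```python
# Hypothetical sketch of the commit rule: an entry is committed once a
# simple majority of the cluster (the leader counts as one vote) has it.
CLUSTER_SIZE = 5

def committed(acks):
    # acks: set of follower ids that acknowledged the entry; the leader
    # adds its own implicit vote, so progress needs len(acks) + 1 > N // 2
    return len(acks) + 1 > CLUSTER_SIZE // 2

assert committed({"node2", "node3"})   # 3 of 5 votes -> forward progress
assert not committed({"node2"})        # 2 of 5 votes -> keep retrying
```

The rule itself is trivial; the juggling is everything around it (retrying dead followers, timing out, not double-counting acks).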

If I were to do a course, I wouldn’t base it on a ZeroMQ implementation. That’s more of an incidental detail of me playing around right now–you need to have some kind of messaging layer, but it’s not terribly hard to make one from scratch. From a pure teaching perspective, I’d want to build all of the layers from the ground up (including the messaging).
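
(“From scratch” here means little more than length-prefixed frames on a plain socket — something like the following sketch, details hypothetical.)

```python
# Sketch of a from-scratch messaging layer: a 4-byte size prefix
# followed by the payload, on an ordinary TCP socket.
import struct

def send_message(sock, msg: bytes):
    # Prefix each message with its size so the receiver knows where it ends
    sock.sendall(struct.pack(">I", len(msg)) + msg)

def recv_exactly(sock, nbytes):
    # recv() can return partial data, so loop until we have it all
    data = b""
    while len(data) < nbytes:
        chunk = sock.recv(nbytes - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        data += chunk
    return data

def recv_message(sock):
    size, = struct.unpack(">I", recv_exactly(sock, 4))
    return recv_exactly(sock, size)
```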


#11

it’s not terribly hard to make one from scratch

<3. This would be really awesome to witness. I have zero low-level message-passing experience beyond ZeroMQ, and building one from scratch would be an awesome learning experience!

the leader sending out concurrent requests to followers–some of which might be dead or unresponsive.

I also had this problem of unresponsive clients and handling disconnects, and it was eventually solved by replacing PUB-SUB / PAIR with an async socket pattern, ROUTER-DEALER. Combined with zmq.select, it provided a very ergonomic model for this kind of stuff. I don’t know what kind of approach you are taking.
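
Roughly the shape that ended up working for me (pyzmq assumed; ports and identities made up for illustration):

```python
# Sketch of the ROUTER/DEALER setup described above (pyzmq assumed;
# port numbers and identities are invented for illustration).
import zmq

ctx = zmq.Context()

# Server side: one ROUTER socket talks to many peers asynchronously
router = ctx.socket(zmq.ROUTER)
router.bind("tcp://*:5555")

# Client side: a DEALER with a stable identity, so the ROUTER can
# address it even across reconnects
dealer = ctx.socket(zmq.DEALER)
dealer.setsockopt(zmq.IDENTITY, b"node-1")
dealer.connect("tcp://localhost:5555")

dealer.send(b"hello")

# zmq.select works like select.select but on zmq sockets, which makes
# it easy to multiplex peers and put timeouts on dead/slow ones
readable, _, _ = zmq.select([router], [], [], timeout=1.0)
if router in readable:
    identity, msg = router.recv_multipart()   # DEALER frames: [id, payload]
    router.send_multipart([identity, b"reply"])
```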

Frankly, there is a whole lot of juggling going on with all of the parts and all of the possible failure modes.

Isn’t this a general problem with distributed systems? It seems quite tiring and cumbersome to do all the manual wiring…


#12

I have been working on distributed software for a while now, and I have found it quite tricky to introduce junior devs (or senior managers) to the various non-trivial aspects of distributed computing, because there is always so much boilerplate to deal with before getting to the actual thing that matters for them (and even more so for the mind-bending bits: distribution fallacies, etc.).

I would love to know of some resources to bring junior devs to distributed computing from scratch, step by step. One weekly project after another would be a nice way to structure it, with a long term overarching goal.

What I had in mind as a first goal would be a kind of distributed chat / shell / REPL interface, using a very simple custom command/programming language, maybe with logic relying on interval tree clocks, CRDTs, a distributed hashtable, blockchain, etc. In the DIY open spirit, I started a few projects on GitHub, slowly moving, toying with this idea:

  • pyros-dev/pyzmp: a kind of communicating processes framework (using zmq with pickle or protobuf). Currently used for ROS/web applications and might need to be rethought/redesigned/rewritten eventually, maybe using something like autobahn.

  • asmodehn/replator: imports a custom code file, using an LALR parser in Python, and provides a REPL running your custom DSL implemented in ordinary Python.

  • Recently I have been thinking about writing something around https://github.com/python-trio/trio-click to get an async REPL for doing network communication / remote interaction directly. Obviously something like that is also doable around curio :wink:

  • Another thing could be hardware: developing/interacting with distributed software on IoT devices can be motivating in itself, because the usual “computer” environment is stripped down, and we can adjust student/user expectations, and focus on the bits that actually matter. Input/output is the only thing needed.
    And we can link it with “real world” experiment/entertainment: “Destroy this device, it still works! Cut this cable, it still works!”


#13

I have just posted an event “Raft from the Ground Up” on EventBrite. March 4-8, 2019. https://www.eventbrite.com/e/raft-from-the-ground-up-tickets-53826580752

Note: I have not made an actual web page for the course yet. Consider this to be an “insider” pre-release announcement.


#14

The initial trial of the “Raft” course has concluded so I thought I’d offer some wrap up thoughts. Short version: working on Raft proved to be way more awesome than I ever envisioned it might be. But, it was also a pretty darn hard project. Poking around the web afterwards, I found that implementing Raft is a multi-week lab project given to students taking a graduate course in Distributed Systems at MIT. Thus, I don’t think the difficulty was a figment of my imagination. I’m not sure how the other participants felt about it.

I do have some thoughts about the Raft project from a pedagogical perspective though. When I was at the university, I taught a lot of courses that involved a major coding project. For instance, an operating systems course where students wrote a small operating system kernel. Or a compilers course where students wrote a small compiler. In both of those cases, the problem was fairly well-defined, but coming up with a solution often involved a lot of thinking, planning, and fiddling around. There were often a lot of issues in managing the complexity of the project. And in some sense, this was the most important facet of the project because if you can manage to write a compiler or an OS, you’ll probably have a better grounding for tackling other complicated projects.

Trying to come up with a similar sort of project for networks and distributed computing has always proven difficult and elusive. Part of the trouble is that it’s just too easy to be sucked into the trap of implementing the “network stack.” For example, talking about IP packets, and TCP, and sockets, and HTTP, and all of this stuff layered upon layer after layer of more stuff. Suddenly, you’re just buried under a mountain of standards, hacks, frameworks, and acronyms. It’s not that that stuff isn’t important, but it’s just not that interesting once you get the basic gist of a few concepts. It’s also hard to devise an all-encompassing project that’s challenging without getting bogged down in a bunch of side-issues that are more annoying than enlightening. For several years, I taught a networks/distributed computing course in Chicago, but I was never really satisfied with the exercise set–it was too scattered and lacking in focus. I put the course into retirement in 2012 and haven’t offered it since. Some good things came out of that course (for example, all of the Python GIL research), but it needed a cohesive project.

Now, along comes the Raft project. The neat thing about Raft is that the high-level problem can be easily described (you want fault-tolerant replicated state across a small cluster of machines). Most programmers (including myself) have never had to implement anything like that so it’s a new problem. Moreover, most of the algorithm concepts can be described with some ease–there are visualizations and diagrams that show you what needs to happen. However, if you actually try to implement Raft, you quickly realize that you’re completely surrounded by edge-cases and subtle behavior in all directions. Virtually every issue covered in a typical concurrency class is now in play (sockets, messaging, threads, thread synchronization, queues, RPCs, non-blocking I/O, etc.). The code is virtually impossible to test and debug unless you step back and devise some strategy for tackling the problem (I’m not sure I’ve ever had to think more about object-oriented design than I did in this project). And even then, it’s hard to pack the whole thing into your brain and understand everything that’s going on when it’s running. Yet, a solution doesn’t necessarily involve a huge amount of code–the challenge is all in how you think about approaching the problem. In short–it’s a really great project if you like a challenge.

I’m definitely going to run this course again and have posted something for June 24-28, 2019. Further details at http://www.dabeaz.com/raft.html