Chasing distributed Erlang - 31 Mar 2015

So, the other week, someone in #erlounge linked to an interesting Reddit post by a developer switching from Erlang to Go.

I actually strongly disagree with almost everything he says, but the really interesting part of the thread is when he starts talking about sending 10MB messages around and how that ‘breaks’ the cluster. Other commenters on the thread rightly point out that messages that large interfere with the heartbeats distributed Erlang uses to maintain cluster connectivity, and that you shouldn’t be sending large objects like that around.

And this is where I started thinking. In the Erlang community this is a known problem, but why isn’t there a general-purpose solution? Riak uses dedicated TCP connections for handoff, but when reconciling siblings on a GET/PUT? Riak uses disterl for that (this is one of the reasons Riak recommends against large objects).

So, even Riak is doing what ‘everyone knows’ not to do. Why isn’t there a library for that? I asked myself this one night at 2am before a flight to SFO the next morning, and could not come up with an answer. So, I did the logical thing; I turned my caremad into a prototype library.

After some Andy Gross-style airplane hacking, I had a basic prototype that would, on demand, stand up a pool of TCP connections to another node (using the same connection semantics as disterl) and then dispatch Erlang messages over those pipes to the appropriate node. I even implemented a drop-in replacement for gen_server:call() (although the return message came back over disterl).
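
In rough shape, the send path looked something like this (a sketch, not teleport’s actual internals; checkout/1 and checkin/2 stand in for the pool plumbing):

%% Check out one of the pooled sockets to the target node, write the
%% encoded message, and hand the socket back. The naive term_to_binary/1
%% here is exactly what comes back to bite me below.
send({Name, Node}, Message) ->
    {ok, Socket} = checkout(Node),               %% hypothetical pool call
    ok = gen_tcp:send(Socket, term_to_binary({send, Name, Message})),
    checkin(Node, Socket).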

The only problem? It was slow. Horrendously slow.

My first guess was that my naive gen_tcp:send(Socket, term_to_binary(Message)) was generating a giant, off-heap, and quickly unreferenced binary (and it was). So, I looked at how disterl does it. A bunch of gnarly C later, I had a BIF of my own: erlang:send_term/2.

This, amazingly, worked, but with large messages (30+MB) I ended up causing scheduler collapse because my BIF doesn’t yield back to the VM or increment reduction counts. I looked at adding that to the BIF and basically gave up.

So, I left it on the back burner for a couple of weeks. When I came back, I had some fresh insights. The first was: what if we had a ‘term_to_iolist’ function that would preserve sharing? So I went off and implemented a half-assed one in Erlang that mainly tries to encode the common Erlang types into the external term format using iolists rather than binaries (for those unfamiliar with Erlang, iolists are often better when generating data to be written to files/sockets, as they can preserve sharing of embedded binaries, among other things). For all the ‘hard’ types, my code punts: it calls term_to_binary and chops off the leading ‘131’ version byte.
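
Here’s a minimal sketch of the idea (a simplification, not the library’s actual encoder): binaries, proper lists, and small tuples are encoded directly as iolists, and everything else punts to term_to_binary/1:

-module(t2i).
-export([term_to_iolist/1]).

%% Encode a term in external term format as an iolist, so large embedded
%% binaries are referenced in place rather than copied into one flat binary.
term_to_iolist(Term) ->
    [131, encode(Term)].

encode(B) when is_binary(B) ->
    %% BINARY_EXT: tag 109, 4-byte length, then the bytes, shared in place.
    [<<109, (byte_size(B)):32>>, B];
encode([]) ->
    [106];                                       %% NIL_EXT
encode(L) when is_list(L) ->
    %% LIST_EXT: tag 108, 4-byte element count, elements, NIL_EXT tail.
    %% (Assumes a proper list; a careful version would handle improper ones.)
    [<<108, (length(L)):32>>, [encode(E) || E <- L], 106];
encode(T) when is_tuple(T), tuple_size(T) < 256 ->
    %% SMALL_TUPLE_EXT: tag 104, 1-byte arity, then the elements.
    [<<104, (tuple_size(T)):8>>, [encode(E) || E <- tuple_to_list(T)]];
encode(Other) ->
    %% Punt on the 'hard' types: term_to_binary/1, minus the leading 131.
    <<131, Rest/binary>> = term_to_binary(Other),
    Rest.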

That worked, but performance was still miserable in my simple benchmark. I pondered this for a while and realized my benchmark wasn’t fair to my library. Distributed Erlang has an advantage because it is set up by the VM automatically (fully connected clusters are the default in Erlang). My library, however, lazily initializes pooled connections to other nodes. So I added a ‘prime’ phase to my test, where we send a tiny message around the cluster to ‘prime the pump’ and initialize all the needed communication channels.
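
The prime phase amounts to something like this (teleport_bench is a made-up registered name for the benchmark process on each node):

%% Push a trivial message to every node so the lazy connection pools are
%% stood up before the timed portion of the benchmark begins.
prime(Nodes) ->
    [teleport:send({teleport_bench, Node}, prime) || Node <- Nodes],
    ok.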

This massively helped performance, and, in fact, my library was now within striking distance of disterl. However, I couldn’t beat it, which seemed odd since I had many TCP connections available, not just one. Again, after some thought, I realized that my benchmark was running a single sender on each node, so there wasn’t really any opportunity for my extra sockets to get used. I reworked the benchmark to start several senders per node, and was able to leave disterl in the dust (with 6 or 8 workers on an 8-core machine, I see a 30-40% improvement sending a 10MB binary around a 6 node cluster and then ACKing the sender when the final node receives it).
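
The reworked benchmark spawns its senders along these lines (send_around_ring/2 stands in for the actual test body):

%% Start N concurrent senders so more than one pooled socket can be in
%% flight at once; each sends the binary around the ring and waits for
%% the final ACK.
start_senders(N, Bin, Nodes) ->
    [spawn_link(fun() -> send_around_ring(Bin, Nodes) end)
     || _ <- lists:seq(1, N)].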

After that, I thought I was done. However, under extreme load, my library would drop messages (but not TCP connections). This baffled me for quite a while until I figured out that the way my connection pools were initializing was racy. It turns out I was relying on the presence of a registered Erlang supervisor process to detect whether the pool for a particular node had already been started. However, the fact that the registered supervisor is running doesn’t guarantee that all of its child processes are, and that is where I was running into trouble. Using a separate ETS table to track the pools that had actually started fixed the race without impacting performance too much.
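
Sketched out (the table and function names are made up for illustration), the fix looks like:

%% A registered supervisor can be running before all of its children are,
%% so don't trust whereis/1. Each pool records itself in a public ETS
%% table only once all of its workers are up, and senders consult that.
pool_ready(Node, Pool) ->
    true = ets:insert(teleport_pools, {Node, Pool}).

lookup_pool(Node) ->
    case ets:lookup(teleport_pools, Node) of
        [{Node, Pool}] -> {ok, Pool};
        []             -> not_started            %% caller kicks off startup
    end.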

So, at this point, my library (called teleport) provides distributed Erlang-style semantics (mostly) over the top of TCP connection pools, without touching the distributed Erlang connections or disrupting their heartbeats. A ‘raw’ Erlang message like this:

{myname, mynode@myhost} ! mymessage

becomes:

teleport:send({myname, mynode@myhost}, mymessage)

And for gen_server:calls:

gen_server:call(RemotePid, message)

becomes:

teleport:gs_call(RemotePid, message)

The other OTP-style messages (gen_server:cast() and the gen_fsm/gen_event messages) could also easily be supported. Right now, the reply to the gen_server:call() comes back over distributed Erlang’s channels, not over the teleport socket. This is something that should probably change (the Riak GET/PUT use case would need it, for example). Another difference is that, because we’re using a pool of connections, message ordering is not guaranteed at all. If you need ordered messages, this is probably not the library for you.

If you want to compare performance on your own machine, just run

./rebar3 ct

The common_test suite will stand up a 6 node cluster, start 6 workers on each, and have them all send a 10MB binary around the ‘ring’ so each node sees each binary. It does this for both disterl and teleport and reports the individual times in microseconds and the average time in seconds.

Finally, I’m not actually using this for anything, nor do I have any immediate plans to use it. I mostly did it to see if I could do it, and to see if such a library was possible to implement without too many compromises. Contributions of any kind are most welcome.


Reposting the classics - 22 Sep 2014

Ever since my old woodshed-hosted Zotonic blog went down, people have been bugging me to repost my ‘classic’ articles on egitd and poolboy. My friend Reid Draper finally pushed me over the cliff tonight, so here you guys go:

Kudos to the Wayback Machine for keeping a copy around for me.


A week with Go - 30 May 2014

OK, so I’ve been working with Go (the programming language from Google) for about a week now, and I have some initial thoughts. Now, I’m far from an expert on Go, so if I get something wrong, well, it would not be the first time someone was wrong on the internet.

So, Go is kind of a better C; it has nice things like type inference:

var x int = 5

Can, and usually should, be written as:

x := 5

That’s nice.

The for loop is sort of like a generic C for loop on steroids. The switch statement doesn’t fall through by default, and the if statement doesn’t need parentheses (but requires curly braces, preventing those stupid braceless one-liners C allows). It has a native hash table, which is handy. These are all nice things.

However, now things start to get a little weird. Function heads are pretty wacky (from a C perspective); in general, type declarations feel ‘backwards’. Looking at them objectively, they do sort of flow more logically, but bucking a 50-year trend feels a little silly, given all the other borrowed syntax.

Multiple returns are nice (although you could just have tuples and destructuring/pattern matching); closures are handy (although C function pointers are usually good enough). I like the struct/method stuff better than the C++ style insanity. Go doesn’t have tail call optimization (as far as I can tell), which is kind of unfortunate. The error/exception handling is kind of annoying, but I guess it works…

Goroutines are neat although, while they are concurrent, their level of parallelism is unclear (GOMAXPROCS seems mostly to deal with goroutines blocked in system calls). Channels, from an Erlang perspective, look a bit dangerous, especially their synchronous aspect. Erlang’s mailboxes suffer from some opposite problems, though, so maybe I shouldn’t pick on channels too much.

Packages seem OK, definitely an improvement over C/C++. I’m not really thrilled with the compiler and the tooling; they work, but some of the error messages are pretty obtuse. I’m also not a convert to the GOPATH stuff: I can’t tell if it’s supposed to be like a virtualenv, and how the heck do you pin something to a particular git SHA when using ‘go get’? Are reproducible builds even possible? How about a static analyzer? The compiler is evidently not infallible.

Where it got really ugly for me is when I found out it was a garbage-collected language. I actually enjoy programming in C, and I don’t mind managing my own memory there. I had actually expected Go to be manually memory managed, because it aims to be a ‘systems’ programming language. I had a nasty shock. Then I found out that goroutines don’t have any isolation of their memory space, so garbage collection is of the much-maligned ‘stop the world’ variety. Lame.

Because goroutines don’t have isolated memory spaces, one goroutine crashing takes down the whole system. Now, you might say that the compiler makes that unlikely, but I was able to make it happen in my dabbling (the compiler said the code was OK, but it had a runtime error). Not good. If I were writing simple shell commands or single-use programs, that would be fine, but for something like a webserver, yuck. Shouldn’t new languages like Go be embracing the multicore era? To an extent Go does, but the lack of fault tolerance is, for me, a big sign saying ‘don’t write big servery things that deal with lots of independent tasks in Go’.

I don’t know. Go currently feels to me like a missed opportunity. Mozilla’s Rust looks like a much more thoughtfully designed language, especially with the idea that one task can give another read-only access to a variable, or transfer ownership entirely. I just wish they’d stop fiddling with it and ship a 1.0. Granted, I have not actually used Rust for anything, so it might be horrible, too.

Now, gentle reader, there IS a language that is well suited for parallel, independent, fault-tolerant task execution: Erlang. I’m clearly biased (although I’ve tried most of the ‘cool’ languages at this point, so I’m at least informed as well), but Erlang’s process model makes it almost a joy to deal with both parallel execution and fault tolerance. I once built an (albeit simple) server in 20 minutes that ended up in production for years. Because I wrote it with an eye towards fault tolerance, it was tolerant of all sorts of stupid invalid inputs that came its way, without crashing the server itself, just the particular process handling that connection. In Go, from what I can tell, you’d end up with tons of defensive programming and still no guarantees you handled all the edge cases. I’ve been there; I know how to program like that, and how long it takes to flush all the bugs out. Alternatively, I have sat at an Erlang shell, watching processes crash, writing the patch (if needed) and hot-code reloading it. New connections hitting that same bug magically start to work.
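
To make that concrete, here’s the kind of shape I mean in Erlang (a toy sketch; handle/1 and parse/1 are hypothetical):

%% parse/1 may crash on garbage input, but each connection runs in its own
%% (unlinked) process, so only that handler dies; the acceptor keeps going.
acceptor(Listen) ->
    {ok, Socket} = gen_tcp:accept(Listen),
    Pid = spawn(fun() -> handle(Socket) end),
    ok = gen_tcp:controlling_process(Socket, Pid),
    acceptor(Listen).

handle(Socket) ->
    {ok, Data} = gen_tcp:recv(Socket, 0),        %% assumes {active, false}
    ok = gen_tcp:send(Socket, parse(Data)),
    handle(Socket).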

I don’t expect this rant to stem the tide of “we rewrote our service in Go and made it 65535% faster with 1% of the lines of code” posts, but knowing what I know now, I’ll probably treat them with even less credulity than before. Speed and LOC are not all a service needs to provide (usually).

Time will tell if my opinions change; I’m gonna be dealing with Go for a while and will have to make the best of it.


