A year with Go - 05 Jun 2015

So, I’ve been working with Go for a year. Last week I removed it from production.

Re-reading my impressions after just a week, I pretty much stand by what I said back then, but there are a few other things I’d like to talk about, and some points from the previous post I’d like to amplify.

Now, I’m writing this up because people have asked me about my thoughts on Go several times over the past year, and I wanted to go into a little more depth than is possible over Twitter/IRC before all the details fade from memory. If you’re not interested in my opinion, or are ending up here via some Go news aggregator or something and want to show me the error of my ways, you probably needn’t bother. I’m going to put Go (alongside C++, Java and PHP) in the weird drawer under the microwave where all the stuff you can’t find a good use for gravitates.

So, let’s talk about the reasons I don’t consider Go a useful tool:

The tooling

Go’s tooling is really weird. On the surface it has some really nice tools, but a lot of them quickly show their limitations once you start using them. Compared to the tooling in C or Erlang, they’re kind of a joke.

Coverage

The Go coverage tool is, frankly, a hack. It only works on a single file at a time, and it works by inserting lines like this:

GoCover.Count[n] = 1

where n is the branch id in the file. It also adds a giant global struct at the end of the file:

var GoCover = struct {
        Count     [7]uint32
        Pos       [3 * 7]uint32
        NumStmt   [7]uint16
} {
        Pos: [3 * 7]uint32{
                3, 4, 0xc0019, // [0]
                16, 16, 0x160005, // [1]
                5, 6, 0x1a0005, // [2]
                7, 8, 0x160005, // [3]
                9, 10, 0x170005, // [4]
                11, 12, 0x150005, // [5]
                13, 14, 0x160005, // [6]
        },
        NumStmt: [7]uint16{
                1, // 0
                1, // 1
                1, // 2
                1, // 3
                1, // 4
                1, // 5
                1, // 6
        },
}

This actually works fine for unit tests on single files, but good luck getting any idea of integration test coverage across an application. The global values conflict if you use the same name across files, and if you don’t then there’s not an easy way to collect the coverage report. So basically if you’re interested in integration tests, no coverage for you. Other languages use more sophisticated tools to get coverage reports for the program as a whole, not just one file at a time.
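To make the mechanism concrete, here is a minimal hand-written sketch of what an instrumented file boils down to. The `sign` function and the `coveredStatements` helper are hypothetical, and the counter struct is a cut-down imitation of the generated one above, not the tool’s real output.

```go
package main

import "fmt"

// GoCover imitates the generated counter struct, reduced to three blocks.
// Assumption: one statement per block, to keep the arithmetic obvious.
var GoCover = struct {
	Count   [3]uint32
	NumStmt [3]uint16
}{
	NumStmt: [3]uint16{1, 1, 1},
}

// sign is a hypothetical function, instrumented by hand the way the
// coverage tool would rewrite it: one counter write per basic block.
func sign(x int) int {
	GoCover.Count[0] = 1
	if x < 0 {
		GoCover.Count[1] = 1
		return -1
	}
	GoCover.Count[2] = 1
	return 1
}

// coveredStatements walks the counters and reports how many statements
// ran, out of the total number of statements in the file.
func coveredStatements() (covered, total int) {
	for i, hit := range GoCover.Count {
		total += int(GoCover.NumStmt[i])
		if hit != 0 {
			covered += int(GoCover.NumStmt[i])
		}
	}
	return covered, total
}

func main() {
	sign(5) // only the non-negative path runs
	covered, total := coveredStatements()
	fmt.Printf("%d of %d statements covered\n", covered, total)
}
```

Since everything hangs off one package-level variable per file, it is easy to see why combining counters from many files into one report is not something the scheme gives you for free.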

Benchmarking

The benchmarking tool is a similar thing: it looks great until you actually look into how it works. What it ends up doing is wrapping your benchmark in a for loop with a variable iteration count. The benchmark tool then increases the iteration count until the benchmark runs ‘long enough’ (the default is 1s) and divides the execution time by the number of iterations. Not only does this include the for loop time in the benchmark, it also masks outliers. All you get is a naive average execution time per iteration. This is the actual code from benchmark.go:

func (b *B) nsPerOp() int64 {
    if b.N <= 0 {
        return 0
    }
    return b.duration.Nanoseconds() / int64(b.N)
}

This will hide things like GC pauses, lock contention slowdowns, etc. if they’re infrequent.
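A small sketch of the masking effect, using made-up numbers rather than a real benchmark run: 999 iterations at 1µs and a single 10ms stall standing in for a GC pause. The helper names (`sampleTimings`, `mean`, `slowest`) are hypothetical; `mean` does the same division as nsPerOp.

```go
package main

import (
	"fmt"
	"time"
)

// sampleTimings returns invented per-iteration timings: 999 fast
// iterations and one 10ms stall standing in for a GC pause.
func sampleTimings() []time.Duration {
	samples := make([]time.Duration, 1000)
	for i := range samples {
		samples[i] = time.Microsecond
	}
	samples[500] = 10 * time.Millisecond
	return samples
}

// mean does what nsPerOp does: total time divided by iteration count.
func mean(samples []time.Duration) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	var total time.Duration
	for _, s := range samples {
		total += s
	}
	return total / time.Duration(len(samples))
}

// slowest returns the largest single sample, which the mean hides.
func slowest(samples []time.Duration) time.Duration {
	var worst time.Duration
	for _, s := range samples {
		if s > worst {
			worst = s
		}
	}
	return worst
}

func main() {
	samples := sampleTimings()
	fmt.Println("mean:   ", mean(samples))
	fmt.Println("slowest:", slowest(samples))
}
```

The mean comes out at roughly 11µs per iteration, which says nothing about the one iteration that took 10ms.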

Compiler & go vet

One of the things people tout about Go is the fast compile speed. From what I can tell, Go at least partially achieves this by simply not doing some of the checks you’d expect from the compiler and instead implementing those in go vet. Things like shadowed variables and bad printf format strings aren’t checked by the compiler, they’re checked with go vet. Ugh. I’ve also noticed go vet actually regress between 1.2 and 1.3, where 1.3 wasn’t catching valid problems that 1.2 would.
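As an illustration of the shadowing case, here is a minimal sketch that the compiler accepts without complaint. The `parse` and `lastError` functions are hypothetical; the point is only that catching the bug is left to a vet-style shadow check rather than the compiler.

```go
package main

import (
	"errors"
	"fmt"
)

// parse is a hypothetical helper that rejects empty input.
func parse(s string) (int, error) {
	if s == "" {
		return 0, errors.New("empty input")
	}
	return len(s), nil
}

// lastError is meant to return the last parse error it saw, but it
// never does: the := below declares a new err scoped to the if
// statement, so the outer err is never assigned.
func lastError(inputs []string) error {
	var err error
	for _, in := range inputs {
		if _, err := parse(in); err != nil { // shadows the outer err
			continue
		}
	}
	return err // always nil
}

func main() {
	// The second input fails to parse, yet no error comes back.
	fmt.Println(lastError([]string{"ok", ""}))
}
```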

go get

The less said about this idea the better. Go users now say not to use it, but apparently nobody is making a move to actually deprecate or remove it, which is unfortunate, as is the lack of an ‘official’ replacement.

$GOPATH

Another idea I’m not enthralled with. I’d rather clone the repo to my home dir and have the build system put the deps under the project root. Not a major pain point, just annoying.

Go race detector

This one is actually kind of nice, although I’m sad it has to exist at all. The annoying thing is that it doesn’t work on all ‘supported’ platforms (FreeBSD anyone?) and it is limited to 8192 goroutines. You also have to manage to hit the race, which can be tricky to do with how much the race detector slows things down.

Runtime

Channels/mutexes

Channels and mutexes are SLOW. Adding proper mutexes to some of our code in production slowed things down so much it was actually better to just run the service under daemontools and let the service crash/restart.

Crash logs

When Go DOES crash, the crap it dumps to the logs is kind of ridiculous: every active goroutine (starting with the one causing the crash) dumps its stack to stderr. This gets a little unwieldy with scale. Also, the crash messages are extremely obscure, including things like ‘evacuation not done in time’, ‘freelist empty’ and other gems. I wonder if the error messages are a ploy to drive more traffic to Google’s search engine, because that’s the only way you’ll figure out what they mean.

Runtime inspectability

This isn’t really a thing. You’re better off just writing in a real systems language and using gdb/valgrind/etc, or using a language with a VM that can give you a way to peek inside the running instance. I guess Go keeps the idea of printf debugging alive. You can use GDB with Go, but you probably don’t want to.

The language

I genuinely don’t enjoy writing Go. I’m either battling the limited type system, casting everything to interface{}, or copy/pasting code to do pretty much the same thing with 2 kinds of structs. Every time I want to add a new feature it feels like I’m adding more struct definitions and bespoke code for working with them. How is this better than C structs with function pointers, or writing things in a functional style where you have smart data structures and dumb code? Don’t even get me started on the anonymous struct nonsense.

I also, apparently, don’t understand Go’s pointers (C pointers I understand fine). I’ve literally had cases where just dropping a * in front of something has made it magically work (but it compiled without one). Why the heck is Go making me care about pointers at all if it is a GC’d language?

I also tire of casting between []byte and string, and messing with arrays/slices. I understand why they’re there, but it feels unnecessarily low level given the rest of Go.

There’s also the whole nonsense of [:], ... and append. Check this out:

iv = append(iv, truncatedIv[:]...)

This converts the array ‘truncatedIv’ into a slice of all the elements, explodes the slice to be an argument list, and appends those arguments to ‘iv’. append() here is a special magic builtin that works for any slices (you might even say it was generic). You have to reassign the result of the append() call to the variable being appended to because append sometimes, depending on the size of the array underlying the slice, will append in-place and sometimes will allocate a new array and return that. It is basically realloc(3) for Go.
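The in-place versus reallocate behaviour is easy to see in a minimal sketch. The function names here are hypothetical, and plain ints stand in for the bytes of the IV example.

```go
package main

import "fmt"

// sharedAppend appends twice to a slice that has spare capacity. Both
// results reuse the same underlying array, so the second append
// overwrites the element written by the first.
func sharedAppend() (int, int) {
	base := make([]int, 3, 4) // len 3, cap 4: room for one more element
	a := append(base, 1)      // fits, so a shares base's array
	b := append(base, 2)      // also fits, and overwrites the slot a used
	return a[3], b[3]
}

// grownAppend appends to a slice with no spare capacity, which forces
// append to allocate a new array and copy the elements across.
func grownAppend() (int, int) {
	full := make([]int, 4) // len 4, cap 4: no room left
	c := append(full, 5)   // allocates a new array
	c[0] = 99              // does not touch full
	return full[0], c[0]
}

func main() {
	fmt.Println(sharedAppend()) // 2 2
	fmt.Println(grownAppend())  // 0 99
}
```

This is why the result of append() has to be assigned back: only the returned slice is guaranteed to describe the array that actually holds the new elements.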

The Stdlib

Some of Go’s stdlib is pretty nice, the crypto stuff is a lot less clumsy than the shitty OpenSSL wrapper lots of languages give you. I don’t really enjoy the Go documentation though, especially when interfaces are involved. I usually have to go read the source code to figure out what is actually going on. “Implements the X method” isn’t that useful if I don’t know what X is supposed to do.

I do have quite a big problem with the ‘net’ package. Unlike regular socket programming, you don’t get to configure the socket the way you want. Want to toggle an arbitrary sockopt like IP_RECVPKTINFO? Good luck. The only way to do that is via the ‘syscall’ package, which is the laziest wrapper around the POSIX interface I’ve seen in a while (reminds me of some old PHP bindings). Even better, you can’t get the file descriptor out of a connection initiated with the ‘net’ package, so you get to stand up the socket entirely with the syscall interface:

fd, err := syscall.Socket(syscall.AF_INET6, syscall.SOCK_DGRAM, 0)
if err != nil {
    rlog.Fatal("failed to create socket", err.Error())
}
rlog.Debug("socket fd is %d\n", fd)

err = syscall.SetsockoptInt(fd, syscall.IPPROTO_IPV6, syscall.IPV6_RECVPKTINFO, 1)
if err != nil {
    rlog.Fatal("unable to set IPV6_RECVPKTINFO", err.Error())
}

err = syscall.SetsockoptInt(fd, syscall.IPPROTO_IPV6, syscall.IPV6_V6ONLY, 1)
if err != nil {
    rlog.Fatal("unable to set IPV6_V6ONLY", err.Error())
}

addr := new(syscall.SockaddrInet6)
addr.Port = UDPPort

rlog.Notice("UDP listen port is %d", addr.Port)

err = syscall.Bind(fd, addr)
if err != nil {
    rlog.Fatal("bind error ", err.Error())
}

And then you get the joy of passing/receiving []byte parameters to/from the syscall functions. Constructing/destructuring C structures from Go is super-fun.
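For a flavour of that destructuring, here is a minimal sketch of pulling the fields of a struct in6_pktinfo (a 16-byte address followed by a 4-byte interface index) out of a raw byte slice. It assumes the control message header has already been stripped and that the host is little-endian; `parsePktinfo` and `samplePktinfo` are hypothetical helpers, not part of any package.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// pktinfoLen is the size of struct in6_pktinfo: a 16-byte IPv6 address
// followed by a 4-byte interface index.
const pktinfoLen = 20

// parsePktinfo splits a raw in6_pktinfo payload into its two fields.
// Assumption: little-endian host. The kernel writes the interface index
// in host byte order, so this is not portable as written.
func parsePktinfo(b []byte) (addr [16]byte, ifindex uint32, err error) {
	if len(b) < pktinfoLen {
		return addr, 0, errors.New("short in6_pktinfo buffer")
	}
	copy(addr[:], b[:16])
	ifindex = binary.LittleEndian.Uint32(b[16:20])
	return addr, ifindex, nil
}

// samplePktinfo builds an invented payload: address ::1, interface 2.
func samplePktinfo() []byte {
	b := make([]byte, pktinfoLen)
	b[15] = 1 // last byte of ::1
	b[16] = 2 // low byte of the little-endian interface index
	return b
}

func main() {
	addr, ifindex, err := parsePktinfo(samplePktinfo())
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Printf("addr=%x ifindex=%d\n", addr, ifindex)
}
```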

Apparently the reason for this madness is the ‘net’ package assumes the sockopts are set up a specific way so the socket polling can work? I don’t know for sure but I know it makes any ‘fancy’ network programming pretty annoying and dubiously portable.

Conclusion

I just don’t understand the point of Go. If I wanted a systems language, I’d use C/D/Rust, if I wanted a language built around concurrency I’d use Erlang or Haskell. The only place I can see Go shining is for stuff like portable command line utilities where you want to ship a static binary that Just Works(tm). For interactive tasks I think it would be fine, I just don’t think it is particularly well suited to long-running servery things. It also probably looks attractive to Ruby/Python/Java developers, which is where I think a lot of Go programmers come from. Speaking of Java, I wouldn’t be surprised to see Go end up as the ‘new Java’ given the easier deploy story and the similar sort of vibe I get from the language. If you’re just looking for a ‘better’ Ruby/Python/Java, Go might be for you, but I would encourage you to look further afield. Good languages help evolve your approach to programming; LISP shows you the idea of code as data, C teaches you about working with the machine at a lower level, Ruby teaches you about message passing & lambdas, Erlang teaches you about concurrency and fault tolerance, Haskell teaches you about real type systems and purity, Rust presumably teaches you about sharing memory in a concurrent environment. I just don’t think I got much from learning Go.


Chasing distributed Erlang - 31 Mar 2015

So, the other week, someone in #erlounge linked to an interesting Reddit post
by someone switching from Erlang to Go.

I actually strongly disagree with almost everything he says, but the really
interesting part of the thread is when he starts talking about sending 10MB
messages around and the fact that that ‘breaks’ the cluster. Other commenters
on the thread rightly point out that this is terrible for the heartbeats that
distributed Erlang uses to maintain cluster connectivity and that you shouldn’t
send large objects like that around.

And this is where I started thinking. In the Erlang community this is a known
problem, but why isn’t there a general purpose solution? Riak’s handoff uses
dedicated TCP connections to do handoff, but when reconciling siblings on a
GET/PUT? Riak uses disterl for that (this is one of the reasons that Riak
recommends against large objects).

So, even Riak is doing what ‘everyone knows’ not to do. Why isn’t there a
library for that? I asked myself this one night at 2am before a flight to SFO
the next morning, and could not come up with an answer. So, I did the logical
thing; I turned my caremad into a prototype library.

After some Andy Gross style airplane-hacking, I had a basic prototype that
would, on demand, stand up a pool of TCP connections to another node (using the
same connection semantics as disterl) and then dispatch Erlang messages over
those pipes to the appropriate node. I even implemented a drop-in replacement
for gen_server:call() (although the return message came back over disterl).

The only problem? It was slow. Horrendously slow.

My first guess was that my naive gen_tcp:send(Socket, term_to_binary(Message))
was generating a giant, off-heap and quickly unreferenced binary (and it is).
So, I looked at how disterl does it. A bunch of gnarly C later, I had a BIF of
my own: erlang:send_term/2

This, amazingly, worked, but with large messages (30+MB) I ended up causing
scheduler collapse because my BIF doesn’t yield back to the VM or increment
reduction counts. I looked at adding that to the BIF and basically gave up.

So, I left it on the backburner for a couple of weeks. When I came back, I had
some fresh insights. The first was: what if we had a ‘term_to_iolist’ function
that would preserve sharing? So I went off and implemented a half-assed one in
Erlang that mainly tries to encode the common Erlang types into the Erlang
external term format, but using iolists, not binaries (for those unfamiliar
with Erlang, iolists are often better when generating data to be written to
files/sockets as they can preserve sharing of embedded binaries, along with
other things). For all the ‘hard’ types, my code punts and calls term_to_binary
and chops off the leading ‘131’ byte.

That worked, but performance was still miserable in my simple benchmark. I
pondered this for a while, and realized my benchmark wasn’t fair to my library.
Distributed Erlang has an advantage because it is set up by the VM automatically
(fully connected clusters are the default in Erlang). My library, however,
lazily initializes pooled connections to other nodes. So I added a ‘prime’ phase
to my test, where we send a tiny message around the cluster to ‘prime the pump’
and initialize all the needed communication channels.

This massively helped performance, and, in fact, my library was now in
striking distance of disterl. However, I couldn’t beat it, which seemed odd
since I had many TCP connections available, not just one. Again, after some
thought, I realized that my benchmark was running a single sender on each node,
and so there wasn’t really any opportunity for my extra sockets to get used. I
reworked the benchmark to start several senders per node, and was able to leave
disterl in the dust (with 6 or 8 workers, on an 8 core machine, I see a 30-40%
improvement on sending a 10MB binary around a 6 node cluster and then ACKing the
sender when the final node receives it).

After that, I thought I was done. However, under extreme load, my library would
drop messages (but not TCP connections). This baffled me for quite a while until
I figured out that the way my connection pools were initializing was racy. It
turns out that I was relying on a registered Erlang supervisor process being
present to detect whether the pool for connecting to a particular node had been
started. However, the
fact that the registered supervisor was running doesn’t guarantee that all of the
child processes are, and that is where I was running into trouble. Using a
separate ETS table to track actually started pools fixed the race without
impacting performance too much.

So, at this point, my library (called teleport) provides distributed Erlang
style semantics (mostly) over the top of TCP connection pools, without impacting
the distributed Erlang connections or disrupting heartbeats. A ‘raw’ Erlang
message like this:

{myname, mynode@myhost} ! mymessage

becomes:

teleport:send({myname, mynode@myhost}, mymessage)

And for gen_server:calls:

gen_server:call(RemotePid, message)

becomes:

teleport:gs_call(RemotePid, message)

The other OTP style messages (gen_server:cast(), and the gen_fsm/gen_event
messages) could also easily be supported. Right now, the reply to the
gen_server:call() comes back over distributed Erlang’s channels, not over the
teleport socket. This is something that probably should change (the Riak Get/Put
use case would need it, for example). Another difference is that, because we’re
using a pool of connections, the ordering of messages is not guaranteed at all.
If you need ordered messages, this is probably not the library for you.

If you want to compare performance on your own machine, just run

./rebar3 ct

The common_test suite will stand up a 6 node cluster, start 6 workers on each,
and have them all send a 10MB binary around the ‘ring’ so each node sees each
binary. It does this for both disterl and for teleport and reports the
individual times in microseconds, and the average time in seconds.

Finally, I’m not actually using this for anything, nor do I have any immediate
plans to use it. I mostly did it to see if I could do it, and to see if such a
library was possible to implement without too many compromises. Contributions of
any kind are most welcome.


Reposting the classics - 22 Sep 2014

Ever since my old woodshed hosted zotonic blog went down, people have been bugging me to repost my ‘classic’ articles on egitd and poolboy. My friend Reid Draper finally pushed me over the cliff tonight, so here you guys go:

Kudos to the Wayback Machine for keeping a copy around for me.


