My employer had a WebSocket solution that wasn’t scaling very well. My Mission: make the stack high on throughput and low on CPU.

We were using Socket.io tied into Backbone on the Client and Node.js on the server. Its role in our architecture was to provide realtime streaming data – a unidirectional broadcast, vs. your traditional chat app. The Server consumed JSON payloads from Redis Pub/Sub channels and routed them into our Client’s shock-and-awe d3 visualizations.

Seems simple enough, right? Yet it had latency issues. And it would fall over if you looked at it funny for too long.

TL;DR

Show me the pretty pictures, please.

Place Your Weapons on the Table

The guns I brought to this knife fight were the brainchildren of Mr. Brendan Gregg.

Rather than try to get perf running under OS X, I built myself a little Vagrant image (redacted here).

htop and nmon were invaluable during load and bandwidth testing.

I installed perf via script

This effort needed a build of Node v0.12 with debug symbols, or else I couldn’t get nice information out of perf. Fortunately, their Linux binary came pre-built this way!

The Redis server also needed a little bit of fine-tuning

That under-run was symptomatic of what happened when our Server couldn’t keep up with payload dispatching. I wanted my Redis to be a delicate canary.

In the end, I didn’t need gdb or the Kernel debug symbols that several tutorials suggested. But, you know, just in case I did …

PRO-TIP: every time I booted the Vagrant VM, I needed to run this script once (as excerpted from Mr. Trevor Norris’s notes)

Otherwise, perf will complain about “WARNING: Kernel blah blah” (unless you run it as sudo, which introduces its own issues).

With all that in place, to run the Server and produce a perf log output:

# produces /tmp/perf-<PID>.map
node --perf-basic-prof server.js &
PID=$!
echo "PID=$PID"

Then, execute perf against the log from the $PID Server:

# gather 100 samples/sec for 1 minute
perf record -F 100 -p $PID -a -g -- sleep 60

# create the FlameGraph SVG
perf script > out.perf
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
./FlameGraph/flamegraph.pl < out.folded > /vagrant/graph.svg

Where We Began

The test Server ran on a single AWS m3.2xlarge with 8 cores. Using cluster, the Node.js app provided one Server process per CPU. The intent was to push the stack to 85-95% continuous CPU saturation.
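For reference, the worker-per-CPU arrangement is just vanilla cluster. A minimal sketch, where the ./server entry point and the respawn-on-exit behavior are illustrative rather than our actual code:

// minimal worker-per-CPU arrangement via cluster; ./server is illustrative
var cluster = require('cluster');
var os = require('os');

if (cluster.isMaster) {
  // one Server process per CPU (we later dialed this back to N-1)
  os.cpus().forEach(function () { cluster.fork(); });

  cluster.on('exit', function (worker) {
    console.log('worker ' + worker.process.pid + ' died; respawning');
    cluster.fork();
  });
} else {
  require('./server');   // each worker runs the full Socket.io Server
}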

Here’s our out-of-the-box Server streaming 20 msgs/sec to 10 Clients via Socket.io’s XHR Polling protocol (which provides fallback support for pre-WebSocket browsers).

Flamegraph

An XHR lockup @ 10 Clients, 20 msgs/sec

In its initial state, the Server would labor even under these piddly sorts of data scenarios. The performance markers which stood out were:

  • bn_mul_mont = TLS computation, eg. crypto
  • v8::internal::Runtime_BasicJSONStringify
  • v8::internal::String::VisitFlat<v8::Utf8LengthHelper::Visitor>
  • e1000_xmit_frame = outbound packets, eg. networking
  • read_tsc = kernel time determination via Time Stamp Counter

The last two are a given for TCP Socket transmission, and they turn out to be negligible once you’ve ferreted out the first three bottlenecks.

Let an Expert Terminate the TLS

There’s no reason that our Server should be wasting around 15% of its processing power on bn_mul_mont. NOTE: the XHR Polling transport produces a lot more TLS handshakes than a true WebSocket protocol.

The HTTP layer of the Server was implemented via the https module. TLS marshalling added significant (50%+) overhead to each response cycle; it’s time-consuming, blocking, and terribly inefficient. That’s not where Node.js should be spending its valuable time.

It makes a lot more sense to let nginx do that work as a proxy. After all, that’s what it’s been optimized to do, and it does it well.
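On the Node.js side, the change is small: drop the https module and expose a plain-HTTP listener for nginx to proxy_pass to (with the Upgrade / Connection headers forwarded for WebSockets). A rough sketch of the “after” state; the port number is my own assumption, not our production config:

var http = require('http');

// nginx terminates TLS on :443 and proxy_passes (with the Upgrade/Connection
// headers forwarded) to this plain-HTTP listener
var server = http.createServer(function (req, res) {
  res.writeHead(200);
  res.end('ok');
});

// Socket.io attaches to the same plain-HTTP server
var io = require('socket.io')(server);

// bind to localhost only -- nothing should reach Node.js except the proxy
server.listen(3000, '127.0.0.1');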

RE: worker_count, it turned out that given N CPUs the best balance was

  • N-1 Server instances (eg. 7)
  • N nginx Workers (eg. 8)
  • 1 nginx Master

In the end, the Node.js processes won’t peg their CPUs, so nginx Workers can consume the remaining capacity. The nginx Master always has (nearly) one free CPU available should things get all hot & heavy.

Now, with 15% of its computation time delegated to a more efficient engine, I tasked the Server to stream 350 msgs/sec to 10 Clients via Socket.io’s WS protocol.

Flamegraph

System collapse even without TLS @ 10 Clients x 1 Channel, 350 msgs/sec

And yet, the Server labors on. It would seem that Socket.io is spending unnecessary time spinning its wheels on String serialization.

Use Socket.io’s Binary Mode

The phrases BasicJSONStringify and Utf8LengthHelper in the Flamegraph are big clues here.

We parsed the JSON payloads from Redis – for filtering purposes – and then delivered the resulting Objects through Socket.io, which would dutifully turn them into UTF-8 encoded Strings. But they’re already JSON when we receive them … why waste time reserializing un-transformed content? JSON.stringify is synchronous, and to no one’s surprise, it blocks the event loop under high load.
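For context, the original Redis-to-Socket.io hop looked something like this – paraphrased, with the channel name, event name, and filter being stand-ins, and io being the Socket.io handle from the earlier sketch:

// `io` is the Socket.io handle from the earlier sketch; names below are stand-ins
var redis = require('redis');
var sub = redis.createClient();    // local Redis, default port

// trivial stand-in for our real payload filtering
function shouldDeliver(payload) { return payload && !payload.internal; }

sub.subscribe('stream:metrics');   // channel name is illustrative

sub.on('message', function (channel, json) {
  var payload = JSON.parse(json);           // parse, only so we can filter
  if (!shouldDeliver(payload)) { return; }

  // emitting the Object makes Socket.io JSON.stringify content we never
  // transformed, then UTF-8 encode the result -- synchronous, on the hot path
  io.to(channel).emit('message', payload);
});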

It turns out that even delivering the raw message Strings has its own inefficiencies. They’re already UTF-8 when we receive them, yet they still get put through a safe charset conversion. And yes, that encoding step is synchronous, so it also blocks.

Socket.io provides an end-to-end ‘binary mode’ for the WebSockets transport (and a good-enough variant for XHR Polling). All that’s needed to trigger binary mode is the use of Buffers.

With Socket.io in binary mode, its Client takes no responsibility for deserializing the message. We must decode and parse the JSON payload manually. Thanks to the mathiasbynens/utf8.js module, this is a slam dunk.
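Continuing the sketch above, the fix amounts to wrapping the untouched JSON String in a Buffer on the Server, and decoding it ourselves in the browser Client. The Client wiring and the render hand-off below are illustrative:

// ---- Server: continuing the sketch above (sub, io, shouldDeliver) ----
// wrapping the untouched JSON String in a Buffer is what flips Socket.io
// into binary mode (new Buffer() is the Node v0.12-era API)
sub.on('message', function (channel, json) {
  if (!shouldDeliver(JSON.parse(json))) { return; }   // still parse locally, only to filter
  io.to(channel).emit('message', new Buffer(json));   // no re-stringify on our dime
});

// ---- Browser Client: binary frames arrive as ArrayBuffers ----
var utf8 = require('utf8');                // mathiasbynens/utf8.js, via browserify or a global
var socket = io('https://example.com');    // socket.io-client; URL is illustrative

socket.on('message', function (data) {
  var bytes = new Uint8Array(data);
  var byteString = '';
  for (var i = 0; i < bytes.length; i++) {
    byteString += String.fromCharCode(bytes[i]);
  }
  var payload = JSON.parse(utf8.decode(byteString));
  renderVisualization(payload);            // hypothetical hand-off to the d3 layer
});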

Batching Messages

Another refinement would be to send messages in batches.

Each WebSocket message introduces CPU overhead beyond what is needed to send the message content itself. Message size does not impact performance / scale as much as the message count does. Batching reduces that per-write overhead.

However, there’s a trade-off for a system like ours which purports to be “realtime”. In the worst case, a given message will be delayed by the duration of the batching interval. Also, CPU utilization becomes more ‘burst-y’, with big chunks of work at the edge of each interval.

We’d already found a sweet-spot with 200ms batches, which is plenty realtime enough for our customers. With the JSON serialization excised from the Server, that rate scaled wonderfully and no additional changes were needed.
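For illustration, batching on top of the binary delivery can be as simple as accumulating the raw JSON Strings and flushing them on a timer. Again a sketch of the technique, not our production code, continuing the names from the sketches above:

// continuing the sketch: accumulate raw JSON Strings per channel,
// then flush each channel as one binary frame every 200ms
var BATCH_INTERVAL_MS = 200;
var pending = Object.create(null);

sub.on('message', function (channel, json) {
  if (!shouldDeliver(JSON.parse(json))) { return; }
  (pending[channel] = pending[channel] || []).push(json);
});

setInterval(function () {
  Object.keys(pending).forEach(function (channel) {
    var batch = pending[channel];
    delete pending[channel];

    // joining the raw Strings yields one valid JSON Array -- the Client
    // decodes it exactly as before, then iterates over the entries
    io.to(channel).emit('batch', new Buffer('[' + batch.join(',') + ']'));
  });
}, BATCH_INTERVAL_MS);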

Here’s the lean-and-mean Server streaming 200 msgs/sec to 10 Clients [†]

Flamegraph

Delivering 132K/sec of volume to 10 Clients, 200 msgs/sec

[†] “Wait, why 200? You were benchmarking 350 before!”

Oh sure, but that was with tiiiny li’l JSON payloads (~250b). What this scenario introduced was medium-sized payloads (3K), plus an occasional ‘poison pill’ large payload (30K+).

The FlameGraph above demonstrates the Server streaming 132K/sec per socket.

In Summary

After all was said and done, each Server instance

  • could safely consume ~24K msg/sec from Redis, with variable payload sizes
  • could sustain a total transmission rate of > 1Gb / sec – ultimately, we are gated by our 1Gb networking cap, not CPU
  • has the capacity to do computation above & beyond WebSocket transmission – CPU utilization became almost negligible at typical traffic volumes
  • is only CPU-constrained by per-WebSocket-per-message overhead
    • 230 Clients @ 6 msg/sec
    • 525 Clients @ 2 msg/sec