This is a fork of wrk2 that works on macOS and has a better README.
tl;dr: wrk2 is a constant-throughput, correct-latency-recording variant of wrk.
It is similar to wrk, but adds a -R argument that lets you specify the constant request rate you want to generate against the target API.
brew tap Olshansk/wrk2 https://github.com/Olshansk/wrk2
brew install --HEAD olshansk/wrk2/wrk2
sudo apt-get update
sudo apt-get install -y build-essential libssl-dev git zlib1g-dev
git clone https://github.com/Olshansk/wrk2.git
cd wrk2
make
# move the executable to somewhere in your PATH
sudo cp wrk /usr/local/bin
**An HTTP benchmarking tool based mostly on wrk**
wrk2 is wrk modified to produce a constant throughput load, and
accurate latency details to the high 9s (i.e. can produce
accurate 99.9999%'ile when run long enough). In addition to
wrk's arguments, wrk2 takes a throughput argument (in total requests
per second) via either the --rate or -R parameters (default
is 1000).
CRITICAL NOTE: Before going further, I'd like to make it clear that
this work is in no way intended to be an attack on or a disparagement
of the great work that Will Glozer has done with wrk. I enjoyed working
with his code, and I sincerely hope that some of the changes I have made
might be considered for inclusion back into wrk. As those of you who may
be familiar with my latency-related talks and rants will know, the latency
measurement issues that I focused on fixing with wrk2 are extremely
common in load generators and in monitoring code. I do not
ascribe any lack of skill or intelligence to people whose creations
repeat them. I was once (as recently as 2-3 years ago) just as
oblivious to the effects of Coordinated Omission as the rest of
the world still is.
wrk2 replaces wrk's individual request sample buffers with
HdrHistograms. wrk2 maintains wrk's Lua API, including its
presentation of the stats objects (latency and requests). The stats
objects are "emulated" using HdrHistograms. E.g. a request for a
raw sample value at index i (see latency[i] below) will return
the value at the associated percentile (100.0 * i / __len).
As a result of using HdrHistograms for full (lossless) recording,
constant throughput load generation, and accurate tracking of
response latency (from the point in time where a request was supposed
to be sent per the "plan" to the time that it actually arrived), wrk2's
latency reporting is significantly more accurate (as in "correct") than
that of wrk's current (Nov. 2014) execution model.
It is important to note that in wrk2's current constant-throughput
implementation, measured latencies are [only] accurate to a +/- ~1 msec
granularity, due to OS sleep time behavior.
wrk2 is currently in experimental/development mode, and may well be
merged into wrk in the future if others see fit to adopt its changes.
The remaining part of the README is wrk's, with minor changes to
reflect the additional parameter and output. There is an important and
detailed note at the end about wrk2's latency measurement
technique, including a discussion of Coordinated Omission, how
wrk2 avoids it, and detailed output that demonstrates it.
wrk2 (like wrk) is a modern HTTP benchmarking tool capable of generating
significant load when run on a single multi-core CPU. It combines a
multithreaded design with scalable event notification systems such as
epoll and kqueue.
An optional LuaJIT script can perform HTTP request generation, response
processing, and custom reporting. Several example scripts are located in
scripts/.
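For example, a command along these lines would produce the run described next (the host, port, and path are the ones that appear in the sample output below; adjust them for your own target):

wrk -t2 -c100 -d30s -R2000 http://127.0.0.1:80/index.html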
This runs a benchmark for 30 seconds, using 2 threads, keeping
100 HTTP connections open, and a constant throughput of 2000 requests
per second (total, across all connections combined).
[It's important to note that wrk2 extends the initial calibration
period to 10 seconds (from wrk's 0.5 second), so runs shorter than
10-20 seconds may not present useful information]
Output:
Running 30s test @ http://127.0.0.1:80/index.html
  2 threads and 100 connections
  Thread calibration: mean lat.: 9747 usec, rate sampling interval: 21 msec
  Thread calibration: mean lat.: 9631 usec, rate sampling interval: 21 msec
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.46ms    1.93ms   12.34ms   67.66%
    Req/Sec     1.05k     1.12k     2.50k    64.84%
  60017 requests in 30.01s, 19.81MB read
Requests/sec:   2000.15
Transfer/sec:    676.14KB
However, wrk2 will usually be run with the --latency flag, which provides
detailed latency percentile information (in a format that can be easily
imported to spreadsheets or gnuplot scripts and plotted per examples
provided at http://hdrhistogram.org).
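For instance, adding the flag to the command shown earlier (the target URL matches the sample output above):

wrk -t2 -c100 -d30s -R2000 --latency http://127.0.0.1:80/index.html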
init = function(args)
request = function()
response = function(status, headers, body)
done = function(summary, latency, requests)
wrk = {
scheme = "http",
host = "localhost",
port = nil,
method = "GET",
path = "/",
headers = {},
body = nil
}
function wrk.format(method, path, headers, body)
wrk.format returns an HTTP request string containing the passed
parameters merged with values from the wrk table.
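For example (the method, path, header, and body here are purely illustrative), a script could override request() to build a POST on every call:

request = function()
   -- merge these explicit values with the defaults from the wrk table above
   wrk.headers["Content-Type"] = "application/json"
   return wrk.format("POST", "/login", wrk.headers, '{"user":"demo"}')
end

Note that building a fresh request on every call has a throughput cost; see the benchmarking note further below about pre-generating requests.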
global init -- function called when the thread is initialized
global request -- function returning the HTTP message for each request
global response -- optional function called with HTTP response data
global done -- optional function called with results of run
The init() function receives any extra command line arguments for the
script. Script arguments must be separated from wrk arguments with "--"
and scripts that override init() but not request() must call wrk.init().
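As a sketch (the script name, target, and argument below are hypothetical), a run such as

wrk -t2 -c100 -d30s -R2000 -s setpath.lua http://127.0.0.1 -- /status

could be paired with a script like this:

init = function(args)
   -- everything after "--" arrives in args; use the first entry as the request path
   wrk.path = args[1] or wrk.path
   -- this script overrides init() but not request(), so it must call wrk.init()
   wrk.init(args)
end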
The done() function receives a table containing result data, and two
statistics objects representing the sampled per-request latency and
per-thread request rate. Duration and latency are microsecond values
and rate is measured in requests per second.
latency.min -- minimum value seen
latency.max -- maximum value seen
latency.mean -- average value seen
latency.stdev -- standard deviation
latency:percentile(99.0) -- 99th percentile value
latency[i] -- raw sample value
summary = {
duration = N, -- run duration in microseconds
requests = N, -- total completed requests
bytes = N, -- total bytes received
errors = {
connect = N, -- total socket connection errors
read = N, -- total socket read errors
write = N, -- total socket write errors
status = N, -- total HTTP status codes > 399
timeout = N -- total request timeouts
}
}
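As a sketch of how these objects can be combined (the output format here is just an example), a done() handler might print a few percentiles and an aggregate error count:

done = function(summary, latency, requests)
   io.write("------------------------------\n")
   io.write(string.format("completed requests: %d\n", summary.requests))
   io.write(string.format("errors (all kinds): %d\n",
      summary.errors.connect + summary.errors.read + summary.errors.write +
      summary.errors.status + summary.errors.timeout))
   for _, p in ipairs({ 50.0, 90.0, 99.0, 99.9 }) do
      io.write(string.format("%g%%'ile latency: %d usec\n", p, latency:percentile(p)))
   end
end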
The machine running wrk must have a sufficient number of ephemeral ports
available and closed sockets should be recycled quickly. To handle the
initial connection burst the server's listen(2) backlog should be greater
than the number of concurrent connections being tested.
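On Linux, for example, these limits can be inspected and raised with sysctl (the values below are only illustrative starting points, not wrk2 recommendations):

# widen the client's ephemeral port range and let TIME_WAIT sockets be reused sooner
sudo sysctl -w net.ipv4.ip_local_port_range="10000 65535"
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
# raise the server-side cap on listen(2) backlogs
sudo sysctl -w net.core.somaxconn=1024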
A user script that only changes the HTTP method or path, or adds headers or
a body, will have no performance impact. If multiple HTTP requests are
necessary they should be pre-generated and returned via a quick lookup in
the request() call. Per-request actions, particularly building a new HTTP
request, and use of response() will necessarily reduce the amount of load
that can be generated.
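A sketch of that pattern (the paths are illustrative): build the requests once in init() and have request() do nothing but an index lookup:

local reqs    = {}
local counter = 1

init = function(args)
   for i, path in ipairs({ "/", "/about", "/api/items" }) do
      reqs[i] = wrk.format("GET", path)
   end
end

request = function()
   local r = reqs[counter]
   counter = counter % #reqs + 1
   return r
end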
wrk2 is obviously based on wrk, and credit goes to wrk's authors for
pretty much everything.
wrk2 uses my (Gil Tene's) HdrHistogram. Specifically, the C port written
by Mike Barker. Details can be found at http://hdrhistogram.org . Mike
also started the work on this wrk modification, but as he was stuck
on a plane ride to New Zealand, I picked it up and ran it to completion.
wrk contains code from a number of open source projects including the
'ae' event loop from redis, the nginx/joyent/node.js 'http-parser',
Mike Pall's LuaJIT, and the Tiny Mersenne Twister PRNG. Please consult
the NOTICE file for licensing details.
A note about wrk2's latency measurement technique:
One of wrk2's main modifications to wrk's current (Nov. 2014) measurement
model has to do with how request latency is computed and recorded.
wrk's model, which is similar to the model found in many current load
generators, computes the latency for a given request as the time from
the sending of the first byte of the request to the time the complete
response was received.
While this model correctly measures the actual completion time of
individual requests, it exhibits a strong Coordinated Omission effect,
through which most of the high latency artifacts exhibited by the
measured server will be ignored. Since each connection will only
begin to send a request after receiving a response, high latency
responses result in the load generator coordinating with the server
to avoid measurement during high latency periods.
There are various mechanisms by which Coordinated Omission can be
corrected or compensated for. For example, HdrHistogram includes
a simple way to compensate for Coordinated Omission when a known
expected interval between measurements exists. Alternatively, some
completely asynchronous load generators can avoid Coordinated
Omission by sending requests without waiting for previous responses
to arrive. However, this (asynchronous) technique is normally only
effective with non-blocking protocols or single-request-per-connection
workloads. When the application being measured may involve multiple
serial request/response interactions within each connection, or a
blocking protocol (as is the case with most TCP and HTTP workloads),
this completely asynchronous behavior is usually not a viable option.
The model I chose for avoiding Coordinated Omission in wrk2 combines
the use of constant throughput load generation with latency
measurement that takes the intended constant throughput into account.
Rather than measure response latency from the time that the actual
transmission of a request occurred, wrk2 measures response latency
from the time the transmission should have occurred according to the
constant throughput configured for the run. When responses take longer
than normal (arriving later than the next request should have been sent),
the true latency of the subsequent requests will be appropriately
reflected in the recorded latency stats.
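To make that concrete, here is a small back-of-the-envelope sketch (illustrative arithmetic only, not wrk2's actual code): at -R 2000 each request's intended send time falls 0.5 msec after the previous one, and latency is charged from that intended time:

local rate        = 2000           -- the -R value, in total requests per second
local interval_us = 1e6 / rate     -- intended spacing between requests: 500 usec

-- latency of the n-th request of a run that began at start_us,
-- given the time its response actually arrived
local function corrected_latency_us(start_us, n, arrival_us)
   local intended_send_us = start_us + n * interval_us
   return arrival_us - intended_send_us
end

-- a response delayed by a 1.4 second stall is charged the full wait,
-- even if the request itself could only be transmitted late
print(corrected_latency_us(0, 100, 100 * interval_us + 1400000))  --> 1400000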
Note: This technique can be applied to variable throughput loaders.
It requires a "model" or "plan" that can provide the intended
start time of each request. Constant throughput load generators
make this trivial to model. More complicated schemes (such as
varying throughput or stochastic arrival models) would likely
require a detailed model and some memory to provide this
information.
In order to demonstrate the significant difference between the two
latency recording techniques, wrk2 also tracks an internal "uncorrected
latency histogram" that can be reported on using the --u_latency flag.
The following chart depicts the differences between the correct and
the "uncorrected" percentile distributions measured during wrk2 runs.
(The "uncorrected" distributions are labeled with "CO", for Coordinated
Omission)
These differences can be seen in detail in the output provided when
the --u_latency flag is used. For example, the output below demonstrates
the difference in recorded latency distribution for two runs:
The first ["Example 1" below] is a relatively "quiet" run with no large
outliers (the worst case seen was 11 msec), and even then the 99%'ile exhibits
a ~2x ratio between wrk2's latency measurement and that of an uncorrected
latency scheme.
The second run ["Example 2" below] includes a single small (1.4sec)
disruption (introduced using ^Z on the apache process for simple effect).
As can be seen in the output, there is a dramatic difference between the
reported percentiles in the two measurement techniques, with wrk2's latency
report [correctly] reporting a 99%'ile that is 200x (!!!) larger than that
of the traditional measurement technique that was susceptible to Coordinated
Omission.
Example 1: [short, non-noisy run (~11msec worst observed latency)]: