Now that we can tell which characters are tokens and which are simply whitespace, let's step up a level and talk about breaking the input up into those tokens.
// Next returns a []byte referencing the next lexical token in the stream.
// The []byte is valid until Next is called again.
// If the stream is at its end, or an error has occurred, Next returns a zero
// length []byte slice.
//
// A valid token begins with one of the following:
//
//	{ Object start
//	[ Array start
//	} Object end
//	] Array end
//	, Literal comma
//	: Literal colon
//	t JSON true
//	f JSON false
//	n JSON null
//	" A string, possibly containing backslash escaped entities.
//	-, 0-9 A number
func (s *Scanner) Next() []byte {
	// release the previous token
	s.br.release(s.pos)
	s.pos = 0
	c := s.token()
	length := 0
	switch c {
	case ObjectStart, ObjectEnd, Colon, Comma, ArrayStart, ArrayEnd:
		length = 1
		s.pos = 1
	case True:
		length = validateToken(&s.br, "true")
		s.pos = length
	case False:
		length = validateToken(&s.br, "false")
		s.pos = length
	case Null:
		length = validateToken(&s.br, "null")
		s.pos = length
	case String:
		// string
		length = parseString(&s.br)
		if length < 2 {
			return nil
		}
		s.pos = length
	case 0:
		// eof
		return nil
	default:
		// ensure the number is correct.
		length = s.parseNumber()
		if length < 0 {
			return nil
		}
	}
	return s.br.window()[:length]
}
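The leading-byte table in the doc comment above can be sketched as a standalone classifier. This is a simplified illustration, not part of the pkg/json API; the kind names here are invented for the example:

```go
package main

import "fmt"

// kind classifies a JSON token by its first byte, following the table
// in the Next documentation above. It returns a short description, or
// "invalid" for bytes that cannot begin a token.
func kind(c byte) string {
	switch c {
	case '{':
		return "object start"
	case '}':
		return "object end"
	case '[':
		return "array start"
	case ']':
		return "array end"
	case ',':
		return "comma"
	case ':':
		return "colon"
	case 't':
		return "true"
	case 'f':
		return "false"
	case 'n':
		return "null"
	case '"':
		return "string"
	case '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9':
		return "number"
	default:
		return "invalid"
	}
}

func main() {
	for _, c := range []byte(`{"a":[1,true]}`) {
		fmt.Printf("%c %s\n", c, kind(c))
	}
}
```

This is why a single byte is enough of a hint for Next: every JSON token kind is uniquely identified by its first character.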
This is the core loop of Scanner.Next. It skips over any intermediate whitespace, determines the type of the token from the first character in the window, then continues to read until the token is complete or we hit the end of the input. Let's look at how token works, then we'll talk about some optimisations.
func (s *Scanner) token() byte {
	w := s.br.window()
	for {
		for _, c := range w {
			if whitespace[c] {
				s.pos++
				continue
			}
			// release whitespace
			s.br.release(s.pos)
			s.pos = 0
			return c
		}
		if s.br.extend() == 0 {
			// eof
			return 0
		}
		w = s.br.window()[s.pos:]
	}
}

var whitespace = [256]bool{
	' ':  true,
	'\r': true,
	'\n': true,
	'\t': true,
}
- We start by getting the current window from the byteReader. This is a []byte slice of all the data that is yet to be read.
- We're looking for the first non-whitespace character. If the character is whitespace, we increment s.pos to skip over it and loop around.
- When we do find a non-whitespace character, we release s.pos characters from the front of the window. Now the start of the window is properly aligned with the first character of the token.
- It turns out we also get the first character of the token for free; it's in c, so we can return that as a hint to Scanner.Next.
- If we run out of characters without hitting a token, we call extend() to grow the window.
- If we couldn't grow, then we've run out of input without finding a token, so give up.
- Otherwise we update w with the new window.
This is the basic operation of byteReader; we'll see this pattern repeated across the scanner. Some things to note:

- Note the lack of error handling. It's not part of the inner loop; it only happens when we have to read more data from the underlying reader.
- extend hides the process of reading into, growing, and refilling the buffer. It makes the loop above it, Scanner.token, simpler: if there is data in the window, process it; extend if you need to; give up if you can't extend.
- release is similar: it shrinks the start of the window to exclude data we no longer care about. There is no need to copy or move data around.
- extend is not in the hot path, so there is no need to optimise it; its performance is a function of the buffer it is given. In practice a buffer of 8k is sufficient.
Let’s talk about the performance of this code.
name                     time/op
Scanner/canada-16        4.41ms ± 2%
Scanner/citm_catalog-16  2.55ms ± 3%
Scanner/twitter-16       1.03ms ± 1%
Scanner/code-16          4.21ms ± 1%
Scanner/example-16       21.4µs ± 1%
Scanner/sample-16         822µs ± 1%

name                     speed
Scanner/canada-16        510MB/s ± 2%
Scanner/citm_catalog-16  677MB/s ± 3%
Scanner/twitter-16       615MB/s ± 1%
Scanner/code-16          461MB/s ± 1%
Scanner/example-16       608MB/s ± 1%
Scanner/sample-16        837MB/s ± 1%
This is the benchmark you saw earlier, minus a few optimisations we'll talk about next. Comparing the performance of Scanner.Next to our whitespace benchmark, we can see that we're running at between 1/4 and 2/5 of our baseline.
Let's talk about the first improvement we can make to this code. Note the amount of work being spent to keep s.pos up to date. We know that s.pos is set to zero just before Scanner.Next calls this function, and we set s.pos back to zero on the way out of the function, so the changes we make to s.pos within the function are invisible: it's zero on entry and zero on exit. We can rewrite the function to keep a local pos value instead, which has an impressive effect on token.
func (s *Scanner) token() byte {
	w := s.br.window()
	pos := 0
	for {
		for _, c := range w {
			if whitespace[c] {
				pos++
				continue
			}
			// release whitespace
			s.br.release(pos)
			return c
		}
		if s.br.extend() == 0 {
			// eof
			return 0
		}
		w = s.br.window()[pos:]
	}
}

var whitespace = [256]bool{
	' ':  true,
	'\r': true,
	'\n': true,
	'\t': true,
}
name                     old time/op   new time/op   delta
Scanner/canada-16        4.39ms ± 1%   4.43ms ± 4%     ~     (p=1.000 n=5+5)
Scanner/citm_catalog-16  2.52ms ± 1%   1.80ms ± 4%   -28.46% (p=0.008 n=5+5)
Scanner/twitter-16       1.03ms ± 2%   0.95ms ± 3%    -7.41% (p=0.008 n=5+5)
Scanner/code-16          4.24ms ± 2%   4.18ms ± 1%     ~     (p=0.095 n=5+5)
Scanner/example-16       21.4µs ± 1%   18.9µs ± 2%   -11.68% (p=0.008 n=5+5)
Scanner/sample-16         828µs ± 2%    528µs ± 2%   -36.24% (p=0.008 n=5+5)

name                     old speed     new speed     delta
Scanner/canada-16        512MB/s ± 1%   509MB/s ± 4%    ~     (p=1.000 n=5+5)
Scanner/citm_catalog-16  685MB/s ± 1%   958MB/s ± 4%  +39.84% (p=0.008 n=5+5)
Scanner/twitter-16       616MB/s ± 2%   665MB/s ± 3%   +8.01% (p=0.008 n=5+5)
Scanner/code-16          458MB/s ± 2%   465MB/s ± 1%    ~     (p=0.095 n=5+5)
Scanner/example-16       608MB/s ± 1%   688MB/s ± 2%  +13.23% (p=0.008 n=5+5)
Scanner/sample-16        831MB/s ± 2%  1303MB/s ± 2%  +56.84% (p=0.008 n=5+5)
By keeping pos local, the compiler avoided those temporary writes back to memory. The question I have for you is: why did this improve some inputs and not others? The answer, I think, is that different inputs contain different amounts of whitespace. For example, canada has only 33 whitespace characters, whereas citm has 1,227,563.
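To see how much whitespace a given input asks the scanner to step over, we can count it with the same lookup table the scanner uses. This helper is an illustration for this article, not part of pkg/json:

```go
package main

import "fmt"

// whitespace is the same lookup table used by the scanner above.
var whitespace = [256]bool{
	' ':  true,
	'\r': true,
	'\n': true,
	'\t': true,
}

// countWhitespace reports how many bytes of the input the scanner's
// whitespace loop would have to step over.
func countWhitespace(data []byte) int {
	n := 0
	for _, c := range data {
		if whitespace[c] {
			n++
		}
	}
	return n
}

func main() {
	compact := []byte(`{"a":[1,2,3]}`)
	pretty := []byte("{\n  \"a\": [1, 2, 3]\n}")
	fmt.Println(countWhitespace(compact)) // 0
	fmt.Println(countWhitespace(pretty))  // 7
}
```

The more whitespace an input contains, the more iterations the skipping loop runs, and so the more the per-iteration cost of the s.pos writes mattered.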
There is a larger improvement we can make to the runtime of this code, and it relates to inlining. Inlining is the process of automatically (or manually) copying the body of a function into its caller, placing it in line. This avoids the overhead of the function call.
Usually inlining is performed automatically by the compiler according to a set of rules it controls. The Go compiler has reasonable support for inlining, but has a number of limitations.
% go build -gcflags=-m=2 2>&1 | grep cannot | grep -v decoder
./reader.go:31:6: cannot inline (*byteReader).extend: function too complex: cost 198 exceeds budget 80
./scanner.go:99:6: cannot inline (*Scanner).token: unhandled op FOR
./scanner.go:130:6: cannot inline validateToken: unhandled op FOR
./scanner.go:153:6: cannot inline parseString: unhandled op FOR
./scanner.go:182:6: cannot inline (*Scanner).parseNumber: unhandled op FOR
./scanner.go:56:6: cannot inline (*Scanner).Next: function too complex: cost 476 exceeds budget 80
The first is the size of the function: byteReader.extend cannot be inlined because it is too complex. The second is the statements within the function: Scanner.token cannot be inlined because it contains a for statement. Also note that Scanner.Next, the caller of Scanner.token, cannot be inlined because it too is too complex.
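You can reproduce this kind of report on any code with go build -gcflags=-m=2, as shown above. As a sketch, the pair of functions below illustrate the two cases; note that inlining rules vary by Go version (newer compilers can inline some loops), so treat the exact diagnostics as illustrative:

```go
package main

import "fmt"

// add is a small leaf function; the compiler will normally report
// "can inline add" for it.
func add(a, b int) int { return a + b }

// sum contains a for statement; on the compiler version used in this
// article that disqualifies it from inlining ("unhandled op FOR").
func sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total = add(total, x) // the call to add can be inlined here
	}
	return total
}

func main() {
	// Build with: go build -gcflags=-m=2
	// to see which of these calls the compiler inlines.
	fmt.Println(sum([]int{1, 2, 3, 4})) // 10
}
```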
Let's go back to the constraints. Scanner.Next is called for each token in the input, which means Scanner.token is also called for each token in the input. Scanner.token cannot be automatically inlined into its caller because it is too complex. Therefore we're paying for an extra function call for every token. We can remove this overhead by manually inlining Scanner.token into its caller.
func (s *Scanner) Next() []byte {
	// release the previous token
	s.br.release(s.pos)
	w := s.br.window()
	pos := 0
	for {
		for _, c := range w {
			if whitespace[c] {
				pos++
				continue
			}
			// release whitespace
			s.br.release(pos)
			length := 0
			switch c {
			case ObjectStart, ObjectEnd, Colon, Comma, ArrayStart, ArrayEnd:
				length = 1
				s.pos = 1
			case True:
				length = validateToken(&s.br, "true")
				s.pos = length
			case False:
				length = validateToken(&s.br, "false")
				s.pos = length
			case Null:
				length = validateToken(&s.br, "null")
				s.pos = length
			case String:
				// string
				length = parseString(&s.br)
				if length < 2 {
					return nil
				}
				s.pos = length
			default:
				// ensure the number is correct.
				length = s.parseNumber()
				if length < 0 {
					return nil
				}
			}
			return s.br.window()[:length]
		}
		if s.br.extend() == 0 {
			// eof
			return nil
		}
		w = s.br.window()[pos:]
	}
}
The results support our thesis:
name                     old time/op   new time/op   delta
Scanner/canada-16        4.36ms ± 1%   3.50ms ± 0%   -19.68% (p=0.008 n=5+5)
Scanner/citm_catalog-16  1.80ms ± 1%   1.56ms ± 2%   -13.16% (p=0.008 n=5+5)
Scanner/twitter-16        965µs ± 2%    833µs ± 2%   -13.75% (p=0.008 n=5+5)
Scanner/code-16          4.15ms ± 1%   3.61ms ± 1%   -12.82% (p=0.008 n=5+5)
Scanner/example-16       18.9µs ± 2%   16.6µs ± 1%   -12.42% (p=0.008 n=5+5)
Scanner/sample-16         515µs ± 1%    472µs ± 2%    -8.34% (p=0.008 n=5+5)

name                     old speed     new speed     delta
Scanner/canada-16        516MB/s ± 1%   642MB/s ± 0%  +24.50% (p=0.008 n=5+5)
Scanner/citm_catalog-16  960MB/s ± 1%  1105MB/s ± 2%  +15.16% (p=0.008 n=5+5)
Scanner/twitter-16       654MB/s ± 2%   759MB/s ± 1%  +15.94% (p=0.008 n=5+5)
Scanner/code-16          468MB/s ± 1%   537MB/s ± 1%  +14.69% (p=0.008 n=5+5)
Scanner/example-16       689MB/s ± 2%   787MB/s ± 1%  +14.17% (p=0.008 n=5+5)
Scanner/sample-16        1.33GB/s ± 1%  1.46GB/s ± 1%  +9.11% (p=0.008 n=5+5)
By saving the function call we've improved throughput by 9-24%. The largest improvement comes from canada, which contains essentially no whitespace, so the call to Scanner.token almost always returned immediately, having done no work, while still paying for all the s.pos and release overhead.
This is where the inner loop of the Scanner stands today. Note that citm runs at over 50% of the baseline, and sample at nearly 75%. To recap, the major optimisations were:

- The whitespace[c] lookup table for classifying whitespace.
- Avoiding s.pos updates. As a field on the Scanner, s.pos cannot be registerised, so the CPU has to write it back to memory on every iteration; s.pos updates were reduced from one per byte to one per token.
- Scanner.Next and Scanner.token were effectively one function spread over two. Each is too large to be inlined, so we were paying for an extra function call per token. Manually inlining them increased the indentation depth of the function, but delivered substantial speedups.
- Most JSON contains some whitespace; it's moderately optimised for human readability. It turns out the more whitespace, the faster pkg/json decodes.
from Hacker News https://ift.tt/2Ze5KLb