Andrew Martin
2018-09-08 13:34:33 UTC
In GHC Haskell, there are two common options for working with bytes in
memory:
data ByteArray = ByteArray ByteArray#
data ByteSlice = ByteSlice ByteArray# Int# Int#
The second one, ByteSlice, is roughly what the bytestring library does (but
with a ForeignPtr instead of a ByteArray# inside of it). What's unfortunate
is that it's difficult to know which of these two types is going to provide
better performance for a given application. These heuristics are helpful
for guiding the process:
1. The longer the data is going to be around, the better the unsliced
variant will be. This point is about space efficiency. The sliced variant
needs two extra machine words to hold those indices, plus it holds on to
extra bytes from the array that it is slicing into.
2. The shorter the data is going to be around, the better the sliced
variant is. This point is about speed of execution. The cost of the
unsliced variant is typically a memcpy. If the data is getting GCed quickly
anyway, the unsliced copy doesn't save you any space.
3. The shorter the data is, the better the unsliced variant is. For tiny
slices, the copy is cheap, and the two extra words of offset and length
can rival the size of the payload itself.
4. The more sharing of the data there is, the better the sliced variant
is, since every additional reference to a slice avoids a copy.
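These trade-offs are already visible in the bytestring package, which ships
both shapes: ByteString is the sliced variant (pointer plus offset and
length), and ShortByteString is the unsliced variant backed by a ByteArray#.
A minimal sketch of the difference (the string contents are just made up for
illustration):

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Short as BS

whole :: B.ByteString
whole = BC.pack "<name>Erica Chang</name>"

-- Sliced: O(1), shares the underlying buffer and keeps all of it alive.
slice :: B.ByteString
slice = B.take 5 (B.drop 6 whole)   -- "Erica"

-- Unsliced: a memcpy into a fresh buffer; `whole` can now be GCed.
copied :: BS.ShortByteString
copied = BS.toShort slice

main :: IO ()
main = do
  print slice
  print (BS.length copied)
```

The slice costs nothing to build but pins the whole 24-byte buffer; the
ShortByteString costs a copy but retains only its own 5 bytes.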
But users have to commit to which one they want to use in their data type
definitions, and the data type may be used in different contexts where one
or the other is preferable. For example, consider the following:
data Student = Student
{ name :: {-# UNPACK #-} !BytesType
, grade :: {-# UNPACK #-} !Int
}
Which of the two types should we use for BytesType? (For this example,
ignore the incorrectness of using raw bytes to represent text. With either
of the two byte types, it's easy to put a newtype on top that communicates
that what's inside is guaranteed to be UTF-8 encoded text.) It depends on
the usage. For example, consider that we are parsing students from this
XML file:
<students>
<student>
<name>Erica Chang</name>
<grade>87</grade>
</student>
<student>
<name>Lizbeth Cruz</name>
<grade>91</grade>
</student>
...
</students>
Let's say we've got about 1GB of this, and we run a parser on 64KB chunks
as they are read from disk. There are two contrasting scenarios to consider.
If the file is parsed at startup and the resulting array of students is
kept in memory for a long time afterward, then the unsliced variant is
better. It allows us to throw away all of the bytes corresponding to the
XML nodes, which are no longer relevant. The sliced variant, by contrast,
would keep the entire contents of the original file in memory.
However, if the file is parsed by a SAX-style streaming parser, with
students being parsed, processed, and then discarded as they are
encountered, then the sliced variant is better. Slicing wins here because
the chunks from the
file are discarded shortly after they are encountered. Read "shortly" as
"before the next minor GC".
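The two scenarios can be made concrete with the bytestring types standing in
for the sliced and unsliced variants. The helper below is hypothetical and
assumes the exact tag layout from the example file; it is not real XML
parsing:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Short as BS

-- Hypothetical helper: the contents of the first <name> element in a
-- chunk, returned as an O(1) slice into that chunk. The slice retains
-- the entire chunk until it is itself dropped.
nameSlice :: B.ByteString -> B.ByteString
nameSlice chunk =
  let (_, fromOpen) = B.breakSubstring (BC.pack "<name>") chunk
      (contents, _) = B.breakSubstring (BC.pack "</name>") (B.drop 6 fromOpen)
  in contents

-- Long-lived scenario: copy the slice out so the chunk can be GCed.
nameCopy :: B.ByteString -> BS.ShortByteString
nameCopy = BS.toShort . nameSlice

main :: IO ()
main = do
  let chunk = BC.pack "<student><name>Erica Chang</name><grade>87</grade></student>"
  print (nameSlice chunk)   -- slice: retains the whole chunk
  print (nameCopy chunk)    -- copy: retains only the name's own bytes
```

In the streaming case, nameSlice alone is the right call; in the
parse-at-startup case, nameCopy is, since the 64KB chunk would otherwise
stay live as long as any student does.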
In this example, Student was an application-specific data type, so the
application author could figure out what they were actually doing with the
data and define the data type accordingly. However, this isn't always
possible. What if they needed to use the same type differently in different
contexts? What if the type was provided by a library and the library author
doesn't know how it will be used?
So far, this has just been an explanation of a problem that I've noticed. I
understand the problem well, but now I'm going to propose a solution. There
are a lot of holes in my understanding of GHC's runtime, so I don't
know whether what I'm proposing is actually feasible.
What if GHC's runtime handled the copy decisions for us? What if we had
these primitives:
data Bytes#
sliceBytes# :: Bytes# -> Int# -> Int# -> Bytes#
lengthBytes# :: Bytes# -> Int#
Slicing would cause the runtime to record an additional reference to the
byte array, one with a different offset and length. I'm not sure where
this information would be stored, though. When GC runs, it would copy
slices out of the array if there were no longer any references to the
entire array. So a single type could serve both purposes discussed in
the scenario above. I'm not sure what kind of impact this might have on GC
time.
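To make the shape of the proposal a little more concrete, a boxed wrapper
over such a type might look like the following. To be clear, this is
entirely hypothetical: Bytes#, sliceBytes#, and lengthBytes# do not exist
in any GHC today, so none of this compiles.

```haskell
-- HYPOTHETICAL sketch over the proposed primops; does not compile.
data Bytes = Bytes Bytes#

-- O(1) in the mutator; the runtime decides later whether to copy.
sliceBytes :: Bytes -> Int -> Int -> Bytes
sliceBytes (Bytes b) (I# off) (I# len) = Bytes (sliceBytes# b off len)

lengthBytes :: Bytes -> Int
lengthBytes (Bytes b) = I# (lengthBytes# b)
```

The point is that user code would only ever see this one type, and the
slice-versus-copy decision would be deferred to the collector.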
--
-Andrew Thaddeus Martin