Discussion:
GC-Managed ByteArray Slicing
Andrew Martin
2018-09-08 13:34:33 UTC
Permalink
In GHC Haskell, there are two common options for working with bytes in
memory:

data ByteArray = ByteArray ByteArray#
data ByteSlice = ByteSlice ByteArray# Int# Int#

The second one, ByteSlice, is roughly what the bytestring library does (but
with a ForeignPtr instead of a ByteArray# inside of it). What's unfortunate
is that it's difficult to know which of these two types is going to provide
better performance for a given application. These heuristics are helpful
for guiding the process:

1. The longer the data is going to be around, the better the unsliced
variant will be. This point is about space efficiency. The sliced variant
needs two extra machine words to hold those indices, plus it holds on to
extra bytes from the array that it is slicing into.
2. The shorter the data is going to be around, the better the sliced
variant is. This point is about speed of execution. The cost of the
unsliced variant is typically a memcpy. If the data is getting GCed quickly
anyway, the unsliced copy doesn't save you any space.
3. The shorter the data is, the better the unsliced variant is.
4. The more sharing of the data there is, the better the sliced variant
is.

But, users have to commit to which one they want to use in their data type
definitions, and the data type may be used in different contexts where one
or the other is preferable. For example, consider the following:

data Student = Student
{ name :: {-# UNPACK #-} !BytesType
, grade :: {-# UNPACK #-} !Int
}

Which of the two types should we use for BytesType? (For this example,
ignore the incorrectness of using bytes to represent text. With either of
the two bytes types, it's easy to put a newtype on top that communicates
that what's inside is guaranteed to be UTF-8 encoded text). It depends on
the usage. For example, consider that we are parsing students from this xml
file:

<students>
<student>
<name>Erica Chang</name>
<grade>87</grade>
</student>
<student>
<name>Lizbeth Cruz</name>
<grade>91</grade>
</student>
...
</students>

Let's say we've got about 1GB of this, and we run a parser on 64KB chunks
as they are read from disk. There are two contasting scenarios to consider.

If the file is parsed at startup and the resulting array of students is
kept in memory for a long time afterward, then the unsliced variant is
better. It allows us to throw away all of the bytes corresponding to the
xml nodes, which are no longer relevant. The sliced variant, by constrast,
would keep the entire contents of the original file in memory.

However, if the file is parsed by a SAX-style streaming parser, the
students being parsed, processed, and then discarded as they encountered,
the sliced variant is better. Slicing wins here because the chunks from the
file are discarded shortly after they are encountered. Read "shortly" as
"before the next minor GC".

In this example, Student was an application-specific data type, so the
application author could figure out what they were actually doing with the
data and define the data type accordingly. However, this isn't always a
solution. What if they needed to use the same type differently in different
contexts? What if the type was provided by a library and the library author
doesn't know how it will be used?

So far, this has just been an explanation of a problem that I've noticed. I
understand the problem well, but now I'm going to propose a solution. There
are a lot of holes in my understanding of GHC's runtime, so I don't
understand if what I'm proposing is actually plausible.

What if the GHC's runtime handled the copy decisions for us? What if we had
these primitives:

data Bytes#
sliceBytes# :: Bytes# -> Int# -> Int# -> Bytes#
lengthBytes# :: Bytes# -> Int#

Slicing causes the runtime to track that there is an additional reference
to the byte array but that the offset and length are different. I'm not
sure where this information would be stored though. When GC runs, it would
copy slices from bytes if the there were no longer any references to the
entire array. So a single type could serve the both purposes discussed in
the scenario above. I'm not sure what kind of impact this might have on GC
time.
--
-Andrew Thaddeus Martin
Edward Kmett
2018-09-08 16:34:30 UTC
Permalink
I wouldn't recommend doing this to the existing ByteArray#, but for a
separate type it could be nice with some caveats.

1.) It's probably not that you want to copy out slices when there are no
references to the whole array but rather that you can keep the smallest
containing sub-interval of the Bytes# that contain all references. In your
story, if I slice to take its tail and then take that things tail. I'd
double my storage requirement after GC for large Bytes# when I drop the
reference to the original. This is a bit of an orthogonal strategy to the
one you propose, but the one you propose has problems in that I probably
should carefully memoize the results per slice somehow to maximize sharing
in the result, which complicates the implementation a lot. If on the other
hand, these slices are not allowed to nest then you don't have either
problem, but that sounds like a nightmare waiting to happen to debug.

2.) You'd probably have a hard time doing the move during GC because you
don't really get to know the space of all references in before you have to
move to to-space. This is further complicated by multiple generations even
if you conspire some way to procrastinate and delay the move til after you
move everything else and have found the largest sub-interval for the
current generation's references. One way to this would be to do something
like track a min/max bound per generation in the object itself of the
largest containing intervals that are wanted, and during a gc when you copy
the thing from from space to to space you copy over the interval from the
min bound from of all generations to the max bound from all generations,
then reset the current generations bounds to be something like
(maxPos,minPos) and as you find more references to sub-slices it'll grow by
continually mutating these secret mutable fields inside the object to tell
the next GC how much it can clean things up based solely on the contents
referenced from this generation.

-Edward
Post by Andrew Martin
In GHC Haskell, there are two common options for working with bytes in
data ByteArray = ByteArray ByteArray#
data ByteSlice = ByteSlice ByteArray# Int# Int#
The second one, ByteSlice, is roughly what the bytestring library does
(but with a ForeignPtr instead of a ByteArray# inside of it). What's
unfortunate is that it's difficult to know which of these two types is
going to provide better performance for a given application. These
1. The longer the data is going to be around, the better the unsliced
variant will be. This point is about space efficiency. The sliced variant
needs two extra machine words to hold those indices, plus it holds on to
extra bytes from the array that it is slicing into.
2. The shorter the data is going to be around, the better the sliced
variant is. This point is about speed of execution. The cost of the
unsliced variant is typically a memcpy. If the data is getting GCed quickly
anyway, the unsliced copy doesn't save you any space.
3. The shorter the data is, the better the unsliced variant is.
4. The more sharing of the data there is, the better the sliced
variant is.
But, users have to commit to which one they want to use in their data type
definitions, and the data type may be used in different contexts where one
data Student = Student
{ name :: {-# UNPACK #-} !BytesType
, grade :: {-# UNPACK #-} !Int
}
Which of the two types should we use for BytesType? (For this example,
ignore the incorrectness of using bytes to represent text. With either of
the two bytes types, it's easy to put a newtype on top that communicates
that what's inside is guaranteed to be UTF-8 encoded text). It depends on
the usage. For example, consider that we are parsing students from this xml
<students>
<student>
<name>Erica Chang</name>
<grade>87</grade>
</student>
<student>
<name>Lizbeth Cruz</name>
<grade>91</grade>
</student>
...
</students>
Let's say we've got about 1GB of this, and we run a parser on 64KB chunks
as they are read from disk. There are two contasting scenarios to consider.
If the file is parsed at startup and the resulting array of students is
kept in memory for a long time afterward, then the unsliced variant is
better. It allows us to throw away all of the bytes corresponding to the
xml nodes, which are no longer relevant. The sliced variant, by constrast,
would keep the entire contents of the original file in memory.
However, if the file is parsed by a SAX-style streaming parser, the
students being parsed, processed, and then discarded as they encountered,
the sliced variant is better. Slicing wins here because the chunks from the
file are discarded shortly after they are encountered. Read "shortly" as
"before the next minor GC".
In this example, Student was an application-specific data type, so the
application author could figure out what they were actually doing with the
data and define the data type accordingly. However, this isn't always a
solution. What if they needed to use the same type differently in different
contexts? What if the type was provided by a library and the library author
doesn't know how it will be used?
So far, this has just been an explanation of a problem that I've noticed.
I understand the problem well, but now I'm going to propose a solution.
There are a lot of holes in my understanding of GHC's runtime, so I don't
understand if what I'm proposing is actually plausible.
What if the GHC's runtime handled the copy decisions for us? What if we
data Bytes#
sliceBytes# :: Bytes# -> Int# -> Int# -> Bytes#
lengthBytes# :: Bytes# -> Int#
Slicing causes the runtime to track that there is an additional reference
to the byte array but that the offset and length are different. I'm not
sure where this information would be stored though. When GC runs, it would
copy slices from bytes if the there were no longer any references to the
entire array. So a single type could serve the both purposes discussed in
the scenario above. I'm not sure what kind of impact this might have on GC
time.
--
-Andrew Thaddeus Martin
_______________________________________________
Libraries mailing list
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
Loading...