Is it safe to index a little bit out of bounds

Discussion:

Andrew Martin

2018-03-08 14:19:29 UTC

Let's say I have a GC-managed byte array of length 19. GHC promises that
byte arrays are machine-word-aligned at the start. That is, on a 64-bit
machine, the array starts at a memory address that is evenly divisible by 8.
However, the end will not be word-aligned, since 19 is not a multiple of 8.
So, these two calls will be fine:

- indexWordArray# myArr# 0#
- indexWordArray# myArr# 1#

But this one is non-deterministic:

- indexWordArray# myArr# 2#

On a 64-bit machine it reads bytes 16 through 23, so the last five bytes of
the word will have garbage in them. However, this could always be masked out
with a bit mask (you have to know the platform endianness for this to work
right). Is this safe? I don't think this could ever cause a segfault, but I
wanted to check.
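
To make the masking step concrete, here is a minimal sketch of how such a mask could be built for a 64-bit word. The names `tailMaskLE` and `tailMaskBE` are hypothetical helpers, not part of GHC or any library, and the sketch assumes the word has already been read (for example with `indexWordArray#`).

```haskell
import Data.Bits (shiftL, shiftR)
import Data.Word (Word64)

-- Hypothetical helper: mask that keeps the first `valid` bytes (in memory
-- order) of a 64-bit word on a little-endian machine, where the byte at
-- the lowest address is the least significant. Expects 1 <= valid <= 8.
tailMaskLE :: Int -> Word64
tailMaskLE valid = maxBound `shiftR` (8 * (8 - valid))

-- Hypothetical helper: the same mask for a big-endian machine, where the
-- byte at the lowest address is the most significant.
tailMaskBE :: Int -> Word64
tailMaskBE valid = maxBound `shiftL` (8 * (8 - valid))
```

For the 19-byte array, the third word holds 3 valid bytes, so `tailMaskLE 3` is `0x0000000000FFFFFF` and `tailMaskBE 3` is `0xFFFFFF0000000000`.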

--
-Andrew Thaddeus Martin

Sven Panne

2018-03-08 15:50:14 UTC

[...] Some of the bytes in the word will have garbage in them. However,
this could always be masked out with a bit mask (you have to know the
platform endianness for this to work right). Is this safe? I don't think
this could ever cause a segfault but I wanted to check.

Before doing such things, please make sure that e.g. valgrind or similar
tools are happy with such Kung-Fu. I don't know off the top of my head how
fine-grained their checks are, but there is various similar code out there
in the wild which is a PITA to debug. You might force people to add
suppressions or even worse: Make some valuable tools totally useless. This
is not something which should be done lightly...

Herbert Valerio Riedel

2018-03-08 17:42:16 UTC

Hi,

Post by Andrew Martin
Some of the bytes in the word will have garbage in them. However, this
could always be masked out with a bit mask (you have to know the platform
endianness for this to work right).
Is this safe? I don't think this could ever cause a segfault but I
wanted to check.

For historical reasons, this is indeed safe. The underlying
`StgArrBytes` structure must be word-aligned in size; otherwise bad
things are likely to happen.
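
As a rough illustration of why the read stays inside the allocation, the arithmetic can be sketched as follows. This assumes the payload is rounded up to a whole number of machine words, as described above; `paddedBytes` is a hypothetical name used only for this example.

```haskell
-- Hypothetical helper: number of payload bytes reserved for a byte array
-- of n bytes, assuming the payload is rounded up to whole machine words
-- of wordSize bytes each.
paddedBytes :: Int -> Int -> Int
paddedBytes wordSize n = ((n + wordSize - 1) `div` wordSize) * wordSize
```

With 8-byte words, `paddedBytes 8 19` is 24, so word index 2 (bytes 16 through 23) still falls within the reserved payload.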

I've seen some code in the wild which relies on that, and as a data point,
I myself exploit that property in some operations (including the masking
and endianness-aware handling you refer to) of 'text-short'[1], which is
optimised for UTF-8-based strings (<shameless-plug>besides being a
practically useful library with its place in the text/bytearray
landscape[2], text-short also serves as an incubation area for
optimisation ideas and code, some of which may end up in one way or
another in the text-utf8 project[3]</shameless-plug>).

[1]: https://hackage.haskell.org/package/text-short

[2]: https://markkarpov.com/post/short-bs-and-text.html

[3]: https://hackage.haskell.org/text-utf8

-- hvr

Andrew Martin

2018-03-08 18:22:18 UTC

Thanks Herbert! This is exactly the kind of data point I was looking for.
Good to know.

--
-Andrew Thaddeus Martin

David Feuer

2018-03-08 18:35:49 UTC

What do you gain from this?

Andrew Martin

2018-03-08 19:24:45 UTC

If you are checking whether a byte array is all ASCII (or looking for
non-ASCII bytes), you build a word-sized mask like 0b1000000010000000...
and test a whole word at a time. However, on the last word, if you cannot
read past the end, you have to go one byte at a time. But if you can read
past the end, you can mask out the irrelevant bits and use the same mask as
before.
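
A minimal sketch of that scan, under some loud assumptions: 64-bit words, a little-endian machine, and the array contents already read one word at a time (for example with `indexWordArray#`) into a list, with the last word possibly containing garbage past the end. The name `allAsciiLE` is hypothetical and the list is only there to keep the example free of unsafe primops.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Word (Word64)

-- The high bit of every byte in a 64-bit word; a byte is non-ASCII
-- exactly when its high bit is set.
highBits :: Word64
highBits = 0x8080808080808080

-- Hypothetical helper: True when the first `len` bytes are all ASCII.
-- `ws` holds the array as 64-bit words in little-endian order; bytes of
-- the final word beyond `len` may be garbage and are masked out.
allAsciiLE :: Int -> [Word64] -> Bool
allAsciiLE = go
  where
    go n (w : rest)
      -- A full word of valid bytes: test all eight high bits at once.
      | n >= 8 = w .&. highBits == 0 && go (n - 8) rest
      -- A partial last word: keep only the n valid low-order bytes.
      | n > 0 = w .&. highBits .&. (maxBound `shiftR` (8 * (8 - n))) == 0
    go _ _ = True
```

For a 19-byte array whose third word is `0xFFFFFFFFFF414141`, the five garbage bytes are masked out and the check still succeeds, without a byte-at-a-time loop for the tail.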

--
-Andrew Thaddeus Martin