Discussion:
Is it safe to index a little bit out of bounds
Andrew Martin
2018-03-08 14:19:29 UTC
Let's say I have a GC-managed byte array of length 19. GHC promises that
byte arrays are machine-word-aligned at the front. That is, on a 64-bit
machine, the array starts at a memory address that is divisible by 8.
However, the end of the array will not be word-aligned, since 19 is not a
multiple of 8. So, these two calls will be fine:

- indexWordArray# myArr# 0#
- indexWordArray# myArr# 1#

But this one reads past the end of the array, so its result is
non-deterministic:

- indexWordArray# myArr# 2#

Some of the bytes in the word will have garbage in them. However, this
could always be masked out with a bit mask (you have to know the platform
endianness for this to work right). Is this safe? I don't think this could
ever cause a segfault but I wanted to check.
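
To make the arithmetic above concrete, here is a minimal sketch (not from the thread) that classifies a word-sized read by index against a byte length. The names `WordRead` and `classifyWordRead` are hypothetical, and the word size is passed in explicitly as an assumption rather than taken from the platform.

```haskell
-- Hypothetical helper: how does a word-sized read at a given word index
-- relate to an array of `len` bytes?
data WordRead
  = FullyInside   -- every byte of the word lies inside the array
  | Partial Int   -- only this many leading bytes (by address) are valid
  | Outside       -- the word starts at or past the end of the array
  deriving (Eq, Show)

-- | classifyWordRead wordBytes len i
-- wordBytes: assumed size of a machine word in bytes (8 on a 64-bit machine)
-- len:       logical length of the byte array in bytes
-- i:         word index, as it would be passed to indexWordArray#
classifyWordRead :: Int -> Int -> Int -> WordRead
classifyWordRead wordBytes len i
  | i < 0 || start >= len = Outside
  | end <= len            = FullyInside
  | otherwise             = Partial (len - start)
  where
    start = i * wordBytes      -- byte offset of the first byte read
    end   = start + wordBytes  -- byte offset one past the last byte read
```

For the 19-byte array with 8-byte words, indices 0 and 1 are fully inside, and index 2 is a partial read with 3 valid bytes followed by 5 bytes of unspecified content.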
--
-Andrew Thaddeus Martin
Sven Panne
2018-03-08 15:50:14 UTC
Post by Andrew Martin
[...] Some of the bytes in the word will have garbage in them. However,
this could always be masked out with a bit mask (you have to know the
platform endianness for this to work right). Is this safe? I don't think
this could ever cause a segfault but I wanted to check.
Before doing such things, please make sure that e.g. valgrind or similar
tools are happy with such Kung-Fu. I don't know off the top of my head how
fine-grained their checks are, but there is various similar code out there
in the wild which is a PITA to debug. You might force people to add
suppressions or even worse: Make some valuable tools totally useless. This
is not something which should be done lightly...
Herbert Valerio Riedel
2018-03-08 17:42:16 UTC
Hi,
Post by Andrew Martin
Some of the bytes in the word will have garbage in them. However, this
could always be masked out with a bit mask (you have to know the platform
endianness for this to work right).
Is this safe? I don't think this could ever cause a segfault but I
wanted to check.
For historical reasons, this is indeed safe. The underlying
`StgArrBytes` structure must be word-aligned in size, otherwise bad
things are likely to happen.
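
As a rough illustration of what "word-aligned in size" implies, the sketch below rounds a byte length up to a whole number of words. It assumes the payload is padded to the next word boundary; the function names (`paddedBytes`, `slackBytes`, `nativeWordBytes`) are hypothetical and are not part of GHC's API.

```haskell
import Data.Bits (finiteBitSize)

-- Size of a native machine word in bytes on the platform running this code.
nativeWordBytes :: Int
nativeWordBytes = finiteBitSize (0 :: Word) `div` 8

-- | Round a byte length up to a whole number of words.
-- wordBytes is passed explicitly so the arithmetic can be checked for any
-- word size; pass nativeWordBytes for the current platform.
paddedBytes :: Int -> Int -> Int
paddedBytes wordBytes len = ((len + wordBytes - 1) `div` wordBytes) * wordBytes

-- | Number of padding bytes after the logical end of the array.
-- These bytes are readable under the assumption above, but their
-- contents are unspecified.
slackBytes :: Int -> Int -> Int
slackBytes wordBytes len = paddedBytes wordBytes len - len
```

With 8-byte words, a 19-byte array would occupy 24 bytes of payload, so a word read at index 2 covers bytes 16 to 23 and stays inside the padded payload.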

I've seen some code in the wild which relies on that, and as a data point,
I myself exploit that property in some operations (including the masking
and endianness-aware handling you refer to) of 'text-short'[1], which is
optimised for UTF-8-based strings (<shameless-plug>and which, besides
being a practically useful library with its place in the
text/bytearray landscape[2], also serves as an incubation area for
optimisation ideas and code, some of which may end up in one way or
another in the text-utf8 project[3]</shameless-plug>).


[1]: https://hackage.haskell.org/package/text-short

[2]: https://markkarpov.com/post/short-bs-and-text.html

[3]: https://hackage.haskell.org/text-utf8


-- hvr
Andrew Martin
2018-03-08 18:22:18 UTC
Thanks Herbert! This is exactly the kind of data point I was looking for.
Good to know.
--
-Andrew Thaddeus Martin
David Feuer
2018-03-08 18:35:49 UTC
What do you gain from this?
_______________________________________________
Libraries mailing list
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
Andrew Martin
2018-03-08 19:24:45 UTC
If you are looking for ASCII (or non-ASCII) characters in a byte array, you
build a word-sized mask like 0b1000000010000000... and test a whole word at
a time. However, on the last word, if you cannot read past the end, you
have to go one byte at a time. But if you can read past the end, you can
mask out the irrelevant bits and use the same mask as before.
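
The idea can be sketched on plain `Word64` values, without touching `ByteArray#` at all. This is an illustration under stated assumptions, not code from the thread or from text-short: words are assumed to be 64 bits wide, the endianness is passed in by the caller, and the names `Endianness`, `highBits`, `validMask`, `wordIsAscii` and `lastWordIsAscii` are all hypothetical.

```haskell
import Data.Bits (complement, shiftL, shiftR, (.&.))
import Data.Word (Word64)

data Endianness = LittleEndian | BigEndian

-- The high bit of every byte: 0b10000000 repeated eight times.
highBits :: Word64
highBits = 0x8080808080808080

-- | Mask that keeps only the k bytes that lie inside the array.
-- On a little-endian machine the bytes at the lowest addresses are the
-- least significant ones; on a big-endian machine they are the most
-- significant ones.
validMask :: Endianness -> Int -> Word64
validMask _ k
  | k <= 0 = 0
  | k >= 8 = maxBound
validMask LittleEndian k = (1 `shiftL` (8 * k)) - 1
validMask BigEndian    k = complement (maxBound `shiftR` (8 * k))

-- | True when no byte of the word has its high bit set.
wordIsAscii :: Word64 -> Bool
wordIsAscii w = w .&. highBits == 0

-- | Same test for the final, partially valid word: the bytes past the
-- end of the array are zeroed first, so garbage there cannot affect
-- the answer.
lastWordIsAscii :: Endianness -> Int -> Word64 -> Bool
lastWordIsAscii e k w = wordIsAscii (w .&. validMask e k)
```

For example, on a little-endian machine a final word `0xFF00000000616263` with 3 valid bytes passes `lastWordIsAscii LittleEndian 3`, even though the unmasked word fails `wordIsAscii` because of the garbage byte at the top.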
--
-Andrew Thaddeus Martin