Discussion: Implementation of fromList for Data.Set and Data.Map
David Feuer
2017-02-09 20:20:16 UTC
Currently, the implementations of fromList for Data.Set and Data.Map
try to take advantage of inputs with some prefix in strictly
increasing order. They use the fast (O(n)) fromDistinctAscList
algorithm for the strictly ascending prefix, and then fall back on
the slower (O(n log n)) naive algorithm for the rest. For Data.Set,
this works roughly like the implementation sketched at the bottom of
this message.

This whole thing strikes me as a bit odd: changing just the first
element of the input list can have a substantial impact on the overall
performance.

In my opinion, there are two reasonable ways to implement this function:

1. Don't bother to try to take advantage of preexisting ordering. This
is the simplest option, and will probably be fastest for lists with
little preexisting order.

2. Try to take advantage of sorted ascending and descending "runs". A
simple version would be to break up the input list into ascending and
descending segments (merging duplicates on the fly), use
fromDistinctAscList or fromDistinctDescList to convert each segment,
and combine the segments using set unions. This option will be slower
for totally unsorted lists, but should degrade much more gradually
than the current algorithm as disorder is added. Wren suggests that
there may be other variations on the same theme that we could
consider.

Should we

A. Switch to 1 (and offer 2 by another name)
B. Switch to 2 (and offer 1 by another name?)
C. Leave fromList alone and offer 2 by another name

Note that option 1 can be implemented quite easily outside the library
using insert and foldl'; option 2 is not something users should be
expected to write themselves.
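
For reference, option 1 written outside the library is essentially just
the following (an illustrative, untested sketch; the name fromListNaive
is made up here):

import Data.List (foldl')
import Data.Set (Set, empty, insert)

-- Ignore any preexisting order and insert one element at a time.
-- O(n log n) no matter how sorted the input already is.
fromListNaive :: Ord a => [a] -> Set a
fromListNaive = foldl' (flip insert) empty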

If we add names, what should those be?

Thanks,
David Feuer

Current implementation sketch:

-- (foldl' is from Data.List; first is from Data.Bifunctor.)
fromList :: Ord a => [a] -> Set a
fromList xs = foldl' (flip insert) sortedSet unsortedList
  where
    sortedSet = fromDistinctAscList sortedList
    (sortedList, unsortedList) = spanStrictlyAscending xs

    spanStrictlyAscending [] = ([], [])
    spanStrictlyAscending (x : xs) = first (x :) (go x xs)
      where
        go prev (n : ns)
          | prev < n = first (n :) $ go n ns
        go _ ns = ([], ns)
Edward Kmett
2017-02-10 01:16:34 UTC
The current implementation was really just an attempt to improve over the
old implementation, which did the naive construction. It improved things
for the sorted-list case to the point where there is very little marginal
benefit to using fromDistinctAscList any more, but it very much isn't the
final word on the topic. =)

Finding something that better handles the middle cases where you have
"mostly sorted" data would be a good idea, but it definitely needs to be
benchmarked to death, and it would help to know whether the merges of runs
affect the overall asymptotic performance.

Timsort is both fast and popular and includes a similar pass (it also
handles strictly descending runs, as mentioned), so it seems there is
room to do something similar here.
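
For concreteness, the run-detection part by itself might look roughly like
this (untested sketch; it only finds non-decreasing runs and leaves out the
descending-run handling and duplicate merging a real version would need):

-- Split the input into maximal non-decreasing runs, Timsort-style.
ascRuns :: Ord a => [a] -> [[a]]
ascRuns [] = []
ascRuns (x : xs) = (x : run) : ascRuns rest
  where
    (run, rest) = go x xs
    go prev (y : ys)
      | prev <= y = let (r, rs) = go y ys in (y : r, rs)
    go _ ys = ([], ys)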

-Edward
David Feuer
2017-02-10 01:26:48 UTC
Merging the sorted runs will not hurt big-O complexity. When I replaced the
union algorithms, Ross Paterson noted that the time bounds on the new ones
were good enough that starting from singletons we can take the unions in
any order and build the set/map in O(n log n) time. My main concern is
about constant factors. Like you said, I'll have to benchmark a lot.
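
As an illustration of that point (untested sketch; fromListByUnions and
pairUp are made-up names), even the crude strategy of starting from
singletons and unioning adjacent pairs stays within O(n log n):

import Data.Set (Set, empty, singleton, union)

-- Build a set by repeatedly unioning adjacent pairs of sub-sets,
-- starting from singletons; the unions happen in no particular order.
fromListByUnions :: Ord a => [a] -> Set a
fromListByUnions = go . map singleton
  where
    go []  = empty
    go [s] = s
    go ss  = go (pairUp ss)

    pairUp (s1 : s2 : rest) = union s1 s2 : pairUp rest
    pairUp ss               = ss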
Milan Straka
2017-02-10 08:59:43 UTC
Dear all,

yes, as Edward says, the current implementation just alleviated the need
to use fromDistinctAscList -- calling fromList on already-sorted input was
quite a common case (because many functions produce sorted lists).

Some adaptive algorithm depending on the number of inversions (by an
inversion I mean a pair $(x, y)$ such that $x$ precedes $y$ in the input
and $x > y$), ideally with complexity $O(n \log(2 + \mathit{inv}/n))$
where $\mathit{inv}$ is the number of inversions, would definitely be
desirable. I tried to experiment with this when I implemented the "use
the linear-time algorithm if sorted" behavior, but as noted in the
e4e9a97 commit, I have no idea how to keep the constant factors small.
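
(Purely to pin the definition down -- a naive inversion count, which an
adaptive fromList would of course never compute explicitly:)

import Data.List (tails)

-- Number of pairs (x, y) with x before y in the list and x > y.
-- Naive O(n^2) count, for illustration only.
inversions :: Ord a => [a] -> Int
inversions xs = length [ () | (x : ys) <- tails xs, y <- ys, x > y ]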

Cheers,
Milan Straka
David Feuer
2017-02-12 04:56:36 UTC
I tried about the simplest thing that could possibly work, and it
doesn't seem terrible. As expected, it's pretty close to the current
implementation for strictly increasing lists, and a good deal better
for mostly-ordered lists (in either direction). Also as expected, it's
worse for random inputs, but not as much worse as I feared: it takes
about 26% longer. I suspect it's possible to do better. I tried to
eliminate the intermediate lists, but unfortunately that seems to make
merging duplicates really messy to handle. It may be worth doing anyway,
or there may be some other optimization I haven't thought of.

spanIncreasing :: Ord a => [a] -> ([a], [a])
spanIncreasing [] = ([], [])
spanIncreasing (a : as) = first (a :) (go a as)
  where
    go _prev [] = ([], [])
    go prev q@(a : as) = case compare prev a of
      LT -> first (a :) $ go a as
      EQ -> go a as
      GT -> ([], q)

spanDecreasing :: Ord a => [a] -> ([a], [a])
spanDecreasing [] = ([], [])
spanDecreasing (a : as) = first (a :) (go a as)
  where
    go _prev [] = ([], [])
    go prev q@(a : as) = case compare prev a of
      GT -> first (a :) (go a as)
      EQ -> go a as
      LT -> ([], q)

fromList :: Ord a => [a] -> Set a
fromList = up empty
  where
    up !acc [] = acc
    up acc xs = case spanIncreasing xs of
      ([x], rest) -> down (insert x acc) rest
      (ups, rest) -> down (union (fromDistinctAscList ups) acc) rest

    down !acc [] = acc
    down acc xs = case spanDecreasing xs of
      ([x], rest) -> up (insert x acc) rest
      (downs, rest) -> up (union (fromDistinctDescList downs) acc) rest
wren romano
2017-02-12 21:52:33 UTC
Post by David Feuer
spanIncreasing :: Ord a => [a] -> ([a], [a])
spanIncreasing [] = ([], [])
spanIncreasing (a : as) = first (a :) (go a as)
  where
    go _prev [] = ([], [])
    go prev q@(a : as) = case compare prev a of
      LT -> first (a :) $ go a as
      EQ -> go a as
      GT -> ([], q)
Suggest CPSing away the intermediate list (untested):

spanIncreasing :: Ord a => r -> (r -> a -> r) -> [a] -> (r, [a])
spanIncreasing z k [] = (z, [])
spanIncreasing z k (x:xs) = go z k x xs
  where
    go !z k !x [] = (k z x, [])
    go z k x yys@(y:ys)
      | x < y = go (k z x) k y ys
      | x == y = go z k x ys
      | otherwise = (k z x, yys)
Post by David Feuer
spanDecreasing :: Ord a => [a] -> ([a], [a])
spanDecreasing [] = ([], [])
spanDecreasing (a : as) = first (a :) (go a as)
  where
    go _prev [] = ([], [])
    go prev q@(a : as) = case compare prev a of
      GT -> first (a :) (go a as)
      EQ -> go a as
      LT -> ([], q)
Ditto
Post by David Feuer
fromList :: Ord a => [a] -> Set a
fromList = up empty
  where
    up !acc [] = acc
    up acc xs = case spanIncreasing xs of
      ([x], rest) -> down (insert x acc) rest
      (ups, rest) -> down (union (fromDistinctAscList ups) acc) rest
    down !acc [] = acc
    down acc xs = case spanDecreasing xs of
      ([x], rest) -> up (insert x acc) rest
      (downs, rest) -> up (union (fromDistinctDescList downs) acc) rest
Here, I suggest inlining the above to dynamically choose whether to
call `up` or `down`. E.g., something like (untested):

fromList :: Ord a => [a] -> Set a
fromList [] = empty
fromList (x0:xs0) = start empty x0 xs0
  where
    start !acc !x [] = insert x acc
    start acc x (y:ys) =
      case compare x y of
        LT -> up acc (singletonUp x) y ys
        EQ -> start acc x ys
        GT -> down acc (singletonDown x) y ys

    up !acc1 !acc2 !x [] = union acc1 (upSet (insertUp x acc2))
    up acc1 acc2 x (y:ys) =
      case compare x y of
        LT -> up acc1 (insertUp x acc2) y ys
        EQ -> up acc1 acc2 x ys
        GT -> start (union acc1 (upSet (insertUp x acc2))) y ys

    down !acc1 !acc2 !x [] = union acc1 (downSet (insertDown x acc2))
    down acc1 acc2 x (y:ys) =
      case compare x y of
        GT -> down acc1 (insertDown x acc2) y ys
        EQ -> down acc1 acc2 x ys
        LT -> start (union acc1 (downSet (insertDown x acc2))) y ys

where `insertUp` and `insertDown` are the incremental steps of
fromAscList/fromDescList, `acc2` has whatever intermediate type it needs
for that to work, and `upSet`/`downSet` do the conversion from that
intermediate type into a standard Set.

A naive but workable intermediate type is a plain list. Since consing
builds each run in reverse, an ascending run ends up stored in descending
order (and vice versa), so the conversions have to account for that:

singletonUp = (:[])
singletonDown = (:[])
insertUp = (:)
insertDown = (:)
upSet = fromDescList    -- the ascending run was accumulated back to front
downSet = fromAscList   -- likewise for the descending run

Though we should be able to do better than that.
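
One small step up, for instance (untested sketch): build each run as a
difference list appended in its natural order, so no reversal is needed,
and rely on the EQ cases above having already merged duplicates so the
runs are strictly monotone:

type Run a = [a] -> [a]  -- difference list holding one run

singletonUp, singletonDown :: a -> Run a
singletonUp x = (x :)
singletonDown x = (x :)

insertUp, insertDown :: a -> Run a -> Run a
insertUp x run = run . (x :)   -- append, so an ascending run stays ascending
insertDown x run = run . (x :) -- append, so a descending run stays descending

upSet, downSet :: Ord a => Run a -> Set a
upSet run = fromDistinctAscList (run [])
downSet run = fromDistinctDescList (run [])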
--
Live well,
~wren