David Feuer
2017-02-09 20:20:16 UTC
Currently, the implementations of fromList for Data.Set and Data.Map
try to take advantage of inputs with some prefix in strictly
increasing order. It uses the fast (O (n)) fromDistinctAscList
algorithm for the strictly ascending prefix, and then falls back on
the slower (O (n log n)) naive algorithm for the rest. For Data.Set,
this works roughly like the implementation sketched at the bottom of
this message.
This whole thing strikes me as a bit odd: changing just the first
element of the input list can have a substantial impact on the overall
performance.
In my opinion, there are two reasonable ways to implement this function:
1. Don't bother to try to take advantage of preexisting ordering. This
is the simplest option, and will probably be fastest for lists with
little preexisting order.
2. Try to take advantage of sorted ascending and descending "runs". A
simple version would be to break up the input list into ascending and
descending segments (merging duplicates on the fly), use
fromDistinctAscList or fromDistinctDescList to convert each segment,
and combine the segments using set unions. This option will be slower
for totally unsorted lists, but should degrade much more gradually
than the current algorithm as disorder is added. Wren suggests that
there may be other variations on the same theme that we could
consider.
Should we
A. Switch to 1 (and offer 2 by another name)
B. Switch to 2 (and offer 1 by another name?)
C. Leave fromList alone and offer 2 by another name
Note that option 1 can be implemented quite easily outside the library
using insert and foldl'; option 2 is not something users should be
expected to write themselves.
If we add names, what should those be?
Thanks,
David Feuer
Current implementation sketch:
fromList :: Ord a => [a] -> Set a
fromList xs = foldl' (flip insert) sortedSet unsortedList
where
sortedSet = fromDistinctAscList sortedList
(sortedList, unsortedList) = spanStrictlyAscending xs
spanStrictlyAscending [] = ([], [])
spanStrictlyAscending (x : xs) = x : go x xs
where
go prev (n : ns)
| prev < n = first (n :) $ go n ns
go _ ns = ([], ns)
try to take advantage of inputs with some prefix in strictly
increasing order. It uses the fast (O (n)) fromDistinctAscList
algorithm for the strictly ascending prefix, and then falls back on
the slower (O (n log n)) naive algorithm for the rest. For Data.Set,
this works roughly like the implementation sketched at the bottom of
this message.
This whole thing strikes me as a bit odd: changing just the first
element of the input list can have a substantial impact on the overall
performance.
In my opinion, there are two reasonable ways to implement this function:
1. Don't bother to try to take advantage of preexisting ordering. This
is the simplest option, and will probably be fastest for lists with
little preexisting order.
2. Try to take advantage of sorted ascending and descending "runs". A
simple version would be to break up the input list into ascending and
descending segments (merging duplicates on the fly), use
fromDistinctAscList or fromDistinctDescList to convert each segment,
and combine the segments using set unions. This option will be slower
for totally unsorted lists, but should degrade much more gradually
than the current algorithm as disorder is added. Wren suggests that
there may be other variations on the same theme that we could
consider.
Should we
A. Switch to 1 (and offer 2 by another name)
B. Switch to 2 (and offer 1 by another name?)
C. Leave fromList alone and offer 2 by another name
Note that option 1 can be implemented quite easily outside the library
using insert and foldl'; option 2 is not something users should be
expected to write themselves.
If we add names, what should those be?
Thanks,
David Feuer
Current implementation sketch:
fromList :: Ord a => [a] -> Set a
fromList xs = foldl' (flip insert) sortedSet unsortedList
where
sortedSet = fromDistinctAscList sortedList
(sortedList, unsortedList) = spanStrictlyAscending xs
spanStrictlyAscending [] = ([], [])
spanStrictlyAscending (x : xs) = x : go x xs
where
go prev (n : ns)
| prev < n = first (n :) $ go n ns
go _ ns = ([], ns)