Random shuffle of lists #10281
CT Test Results 2 files 97 suites 1h 6m 59s ⏱️ Results for commit a5dbc7b. ♻️ This comment has been updated with latest results. To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass. See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally. Artifacts
// Erlang/OTP Github Action Bot |
Nitpick: I don't know whether you intend to keep the first commit. In case you do, the last paragraph is missing a closing parenthesis, and the word "ridiculous" is misspelled.
70efe49 to
aba9094
Compare
0280798 to
a536947
Compare
|
New algorithm selected. "Quickshuffle"? |
fb2cb14 to
8e991bf
Compare
|
I wrote a longer explanation of the algorithm |
c72c71c to
3aeae41
Compare
|
Pushed some optimizations |
3c0ceca to
6e4d1e8
Compare
5f73e08 to
95b21d9
Compare
|
I have tests (and previously documentation), and backed out some optimization attempts. With the measurement function in the test case in place it turned out that The measurement test function compares with the previously best function: decorate, sort, undecorate and shuffle duplicates. It also compares fast and slow PRNGs. Now this might be ready to merge... |
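A measurement of the kind described above could be sketched as follows. This is only an illustration, not the test case from the PR: the module name `shuffle_measure_sketch` and the function `measure/2` are hypothetical, and it times a single run with `timer:tc/1` instead of comparing two algorithms or several generators.

```erlang
-module(shuffle_measure_sketch).
-export([measure/2]).

%% Hypothetical helper, not part of the PR.
%% Times one call of ShuffleFun on the list 1..N and returns the
%% elapsed time in microseconds. After timing, it checks that the
%% result is a permutation of the input.
measure(ShuffleFun, N)
  when is_function(ShuffleFun, 1), is_integer(N), N >= 0 ->
    L = lists:seq(1, N),
    {Micros, Shuffled} = timer:tc(fun() -> ShuffleFun(L) end),
    L = lists:sort(Shuffled),   % crashes with badmatch if not a permutation
    Micros.
```

Calling `measure/2` once per candidate function with the same `N` gives numbers that can be compared, although a single run is noisy and a real comparison would repeat the measurement.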
|
Please do some squashing of the commits. |
b23c06e to
c8b0371
Compare
|
Squashed into 3 commits.
|
c8b0371 to
6898339
Compare
|
Force pushed suppression of Dialyzer warnings for improper lists |
Write a few shuffle algorithms for comparison.

shuffle1: random decorate, sort, undecorate and recursively deduplicate
-------
I have found no formal statement that it is bias free, but have tried to reason around it. The algorithm should be equivalent to generating more random decimals to decide the shuffle order for elements with the same random number. It should make no difference if the random decimals are generated always and ignored, or when needed.
Speed: 1.2 s for 2^20 integers on my laptop.

shuffle2: Fisher-Yates with map as array
-------
The classical textbook shuffle.
Speed: 5 s for 2^20 integers on my laptop.

shuffle3: random decorates, avoid duplicates, sort and undecorate with gb_trees
-------
Quite a beautiful algorithm since the `gb_tree` has all the functionality in itself.
Speed: 5 s for 2^20 integers on my laptop.

shuffle4: random decorate, avoid duplicates, sort and undecorate with a map
-------
The same as the `gb_tree` above, but with a map. Uses the map key order instead of the general term order, which works just fine.
Speed: 2 s for 2^20 integers on my laptop.

shuffle5: random hidden decorate by split, implicit sort
-------
Suggested by Richard A. O'Keefe on ErlangForums as "a random variant of Quicksort", probably misunderstood by me into this algorithm. Shall we name it Quickshuffle?
Really fast. Uses random numbers efficiently by looking at individual bits for the random split. Has no overhead for tagging. Just creates intermediate lists as garbage.
This generator appears to actually be equivalent with shuffle1, using a random number generator with 1 bit which goes into almost exclusively deduplication recursion.
Speed: 0.8 s for 2^20 integers on my laptop.

shuffle6: Fisher-Yates with the `array` module
-------
The classical textbook shuffle. Our standard `array` module here outperforms map, probably because keys do not have to be stored, they are implicit.
Speed: 2 s for 2^20 integers on my laptop.
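The split-based idea behind shuffle5 can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the code from the commit: the module name `quickshuffle_sketch` is hypothetical, it draws one `rand:uniform(2)` per element instead of consuming individual bits of a wider random number, and it has no special cases for short lists.

```erlang
-module(quickshuffle_sketch).
-export([shuffle/1]).

%% Simplified sketch of the "random variant of Quicksort" idea.
%% Each element is sent to one of two sublists by a random coin flip,
%% the sublists are shuffled recursively, and the results are
%% concatenated. Lists of length 0 or 1 are already shuffled.
shuffle([]) -> [];
shuffle([_] = L) -> L;
shuffle(L) ->
    {Zeros, Ones} = split(L, [], []),
    shuffle(Zeros) ++ shuffle(Ones).

%% Assumption for illustration: one rand:uniform(2) call per element.
%% If every element lands on the same side, the next level simply
%% splits the same elements again with fresh coin flips.
split([], Zeros, Ones) ->
    {Zeros, Ones};
split([X | T], Zeros, Ones) ->
    case rand:uniform(2) of
        1 -> split(T, [X | Zeros], Ones);
        2 -> split(T, Zeros, [X | Ones])
    end.
```

The coin flips act as a hidden random tag that is extended one bit at a time until every element stands alone, which is why no explicit tagging or sorting step is needed.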
Discussion
-------
shuffle3 and shuffle4 have the theoretical limitation that when the length of the list approaches the generator size, it takes catastrophically longer to generate a random number that has not been used. There is no check for the list length being larger than the generator size, in which case it is impossible to generate unique random numbers for all list elements, and the algorithm will simply keep on failing forever.

This is for now a theoretical problem, since already for a list length with log half the generator size (e.g. 2^28 with a generator size 2^56), my laptop runs out of memory with a VM of about 30 GB.

shuffle1 and shuffle5 avoid that limitation: shuffle1 by recursing over the duplicate sublists, so it is not affected much by fairly long lists of duplicates; shuffle5 by using only individual bits and ranges 2, 6, or 24.

The classical Fisher-Yates algorithm in shuffle2 and shuffle6 does not have this limitation, but generating random numbers of unlimited length gets increasingly expensive. That should not be a problem for 2 or even 4 times the generator length, that is, list lengths of well over 2^200, which is well over ridiculous.
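The recursion over duplicates attributed to shuffle1 might look roughly like the sketch below. It is an illustration only, not the code from the commit: the module name `dsu_shuffle_sketch`, the `shuffle/2` variant with an explicit tag range, and the default range of 2^56 are all assumptions made for the example. A small range such as 4 forces many duplicate tags and so exercises the recursion.

```erlang
-module(dsu_shuffle_sketch).
-export([shuffle/1, shuffle/2]).

%% Decorate, sort, undecorate, and reshuffle groups with equal tags.
%% The default range of 2^56 is an assumption for this sketch.
shuffle(L) ->
    shuffle(L, 1 bsl 56).

%% Range must be at least 2 for lists of two or more elements,
%% otherwise all tags would be equal and the recursion would never end.
shuffle([], _Range) -> [];
shuffle([_] = L, _Range) -> L;
shuffle(L, Range) when is_integer(Range), Range >= 2 ->
    Tagged = [{rand:uniform(Range), X} || X <- L],
    undecorate(lists:keysort(1, Tagged), Range).

%% Walk the sorted list, collect each run of elements sharing a tag,
%% and shuffle that run recursively with fresh tags.
undecorate([], _Range) ->
    [];
undecorate([{K, X} | T], Range) ->
    {Same, Rest} = lists:splitwith(fun({K1, _}) -> K1 =:= K end, T),
    Group = [X | [Y || {_, Y} <- Same]],
    shuffle(Group, Range) ++ undecorate(Rest, Range).
```

Because a group of duplicates is handled by the same function with new random tags, a long run of equal tags costs only one more level of recursion over that run, not repeated retries over the whole list.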
* Explain in comments
* Optimize
* Document
* Write test cases
* Write measurement test case to compare with runner-up algorithm
6898339 to
a5dbc7b
Compare
|
I found out that the algorithm headings in the first commit had been lost so the commit comment was impossible to understand. I fixed that commit comment and nothing else... |
This PR adds functions `rand:shuffle/1` and `rand:shuffle_s/2` due to a discussion on ErlangForums: https://erlangforums.com/t/random-sort-should-be-included-in-the-lists-module/5125

There are 4 algorithms in the first commit. The suggested winner is the one remaining in the second commit.
Documentation and test cases are still missing...