Asked 2 years, 3 months ago. Active 2 years, 3 months ago. Thanks, JCS. Does this fix the performance of merge? Full sort must read all the input files, then process the data. Reading whole files in sequence can be relatively fast. Merge sort is meant to reduce memory usage, not time. To minimize memory usage, at any given time sort needs to hold just one line per file in memory (this may be somewhat simplified, but it is basically true).
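The "one line per file" idea can be sketched in Python rather than with sort itself. This is only an illustration, not how sort is implemented: the function names and file names are made up, the inputs are assumed to be already sorted, and lines are compared as Python strings (code point order), which need not match the collation sort would use.

```python
import heapq
import os
import tempfile


def merge_sorted_files(paths, out_path):
    """Merge already-sorted text files into out_path, line by line."""
    files = [open(p, "r", encoding="utf-8") for p in paths]
    try:
        with open(out_path, "w", encoding="utf-8") as out:
            # heapq.merge is lazy: it asks a file for its next line only
            # after that file's previous line has been passed on, so the
            # merge itself keeps one pending line per input file.
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()


def demo():
    """Write three small sorted files, merge them, return the merged words."""
    chunks = [["a\n", "d\n", "g\n"], ["b\n", "e\n"], ["c\n", "f\n", "h\n"]]
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i, lines in enumerate(chunks):
            path = os.path.join(tmp, f"part{i}.txt")  # hypothetical file names
            with open(path, "w", encoding="utf-8") as f:
                f.writelines(lines)
            paths.append(path)
        out_path = os.path.join(tmp, "merged.txt")
        merge_sorted_files(paths, out_path)
        with open(out_path, encoding="utf-8") as f:
            return f.read().split()
```

Python's file objects still read ahead in buffered blocks, so the real memory footprint is a buffer per file rather than literally one line, which fits the "somewhat simplified" caveat.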
The Freebase dataset appeared to be well-grouped locally, so 1. Plugging everything in gives Threading appears to scale with the square root of the number of threads, so using 4 threads cuts the run time in half (sqrt(4) = 2), to The measured run time was 9.
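The square-root guess about threading can be written down as a tiny model. This is a sketch of that guess only, not a measurement: the function name is made up, and the time unit is whatever the single-thread figure is given in.

```python
import math


def estimated_run_time(single_thread_time, threads):
    """Estimate run time, assuming speedup equals sqrt(threads)."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    # Assumption from the text: speedup grows with the square root of
    # the thread count, so 4 threads give a speedup of 2.
    return single_thread_time / math.sqrt(threads)
```

With a hypothetical single-thread time of 10.0, estimated_run_time(10.0, 4) returns 5.0, which is the "cuts the run time in half" case. The model ignores the final n-way merge, so it would be optimistic if that step starts to dominate.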
I could very well be wrong about the sqrt(threads) thing. Someone with a 64-core beast will have to weigh in. I suspect that the final n-way merge will begin to dominate and drag the time up. Email me if you are using gz-sort and any of these omissions are causing you trouble.
For that matter, email me if you find something not on this list too. Nice app, but unfortunately limited to gzipped data.
Would be a lo
Very nice article and program! GNU sort(1) really should do bette
The input: 3 billion lines, 30GB compressed, GB uncompressed. The algorithm: a simple merge sort, predicted to finish in Actual time, 9.
The output:
Use an external sort. Recommended five times. No suggestions of anything specific.
Use a database. Recommended four times.
Use cloud and big data tools. No thanks, I like doing things locally.
Use gnu-sort. Two recommendations.
You separate the key fields from the main record with this alternative character:
Even after all the improvements, one of its recommendations for complex sorts was exactly along these lines.
The Unix sort isn't the fastest sort out there by any means. It uses a strange implementation that can easily be outrun on data sets that are large enough to require multiple merge passes, as yours clearly does.
I would have a look around for a replacement. You could even consider loading the file into a database: you might get better performance that way, and you would certainly have the data in a more convenient form afterwards. For completeness, the main issue is the bucket sort itself. It's quick for small data sets, although not as quick as Quicksort, but it produces twice as many runs as replacement selection would.
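The run-count claim above can be explored with a small Python sketch. It rests on an assumption about what is being compared: chunk_runs stands in for any distribution phase that sorts one memory load at a time, and replacement_selection_runs is a textbook-style version of replacement selection. Both function names are made up, and neither is taken from the Unix sort source.

```python
import heapq
import random


def chunk_runs(items, memory):
    """Sort fixed-size memory loads: each run has at most `memory` items."""
    return [sorted(items[i:i + memory]) for i in range(0, len(items), memory)]


def replacement_selection_runs(items, memory):
    """Generate sorted runs while holding at most `memory` items at once."""
    if memory < 1:
        raise ValueError("memory must be at least 1")
    it = iter(items)
    heap = []
    for x in it:  # fill the heap with the first `memory` items
        heap.append(x)
        if len(heap) == memory:
            break
    heapq.heapify(heap)
    runs, current, held_back = [], [], []
    while heap:
        smallest = heapq.heappop(heap)
        current.append(smallest)
        nxt = next(it, None)  # assumes None never occurs as a data item
        if nxt is not None:
            if nxt >= smallest:
                heapq.heappush(heap, nxt)  # still fits in the current run
            else:
                held_back.append(nxt)      # must wait for the next run
        if not heap:
            # Current run is finished; start the next one from held-back items.
            runs.append(current)
            current = []
            heap, held_back = held_back, []
            heapq.heapify(heap)
    return runs


def demo(n=10000, memory=100, seed=0):
    """Return (chunk run count, replacement selection run count)."""
    rng = random.Random(seed)
    data = [rng.randrange(10**9) for _ in range(n)]
    return len(chunk_runs(data, memory)), len(replacement_selection_runs(data, memory))
```

On random input, replacement selection is expected to give runs of roughly twice the memory size, so about half as many runs. On already sorted input it gives a single run, and on reverse-sorted input it does no better than chunk sorting.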
Once you get into multilevel merging, the number of runs and hence the number of merge passes totally dominates the CPU-bound distribution phase. III, with distribution via replacement selection, and balanced merging with dummy runs.
On large enough data sets it easily outperformed the Unix sort, with an increasing gradient as N increased, and 'large enough' wasn't all that large given disk sizes of those days. And if, for example, you have a CSV file with quoted numbers and need to sort numerically on col2, col1, then use:
How do we sort faster using unix sort?
Asked 10 years, 3 months ago. Active 9 months ago. Viewed 16k times.
After minutes it still hasn't finished.
Gilles 'SO- stop being evil'
Using the sort command will probably be the fastest option.
But you'll probably want to fix the locale to C. That's a sort order that is very simple to implement, is a strict total order, and has no surprises.
This is not a good suggestion. The script is immensely bloated and splits the input file to sort the parts, which the accepted answer points out isn't needed with GNU sort.
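To see why the C locale order mentioned above is simple and unsurprising, here is a small Python sketch of byte-wise ordering. It only imitates the comparison rule (lines compared by their raw byte values); it does not call sort, and the function name is made up.

```python
def c_locale_sorted(lines):
    """Sort text lines by their raw UTF-8 bytes (byte-wise comparison)."""
    # Comparing encoded bytes means every uppercase ASCII letter sorts
    # before every lowercase one, with no language-specific rules applied.
    return sorted(lines, key=lambda s: s.encode("utf-8"))


words = ["banana", "Apple", "apple", "Banana", "Zebra"]
print(c_locale_sorted(words))
# ['Apple', 'Banana', 'Zebra', 'apple', 'banana']
```

With sort itself, the usual way to get this ordering is to set LC_ALL=C in the environment of the command. A comparison of bytes is also much cheaper than locale-aware collation, which is the reason it tends to be suggested for speed.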