External-Memory Sorting in Java: a library for sorting very large files using multiple cores and an external-memory algorithm.
This code is used in Apache Jackrabbit Oak as well as in Apache Beam and in Spotify scio.
Code sample
    import com.google.code.externalsorting.ExternalSort;

    //... inputfile: input file name
    //... outputfile: output file name
    // next command sorts the lines from inputfile to outputfile
    int numLinesWritten = ExternalSort.mergeSortedFiles(ExternalSort.sortInBatch(new File(inputfile)), new File(outputfile));
    // you can also provide a custom string comparator, see API
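For readers new to external-memory sorting, the two calls above correspond to the two classic phases: split the input into sorted chunks that each fit in memory, then merge the chunks. The following is a minimal standalone sketch of that idea using only the JDK. It is not the library's implementation, and the names ExternalSortSketch, sortInChunks, mergeChunks and maxLinesPerChunk are hypothetical, chosen only for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative two-phase external merge sort (hypothetical names, not the library's code). */
final class ExternalSortSketch {

    /** Phase 1: read at most maxLinesPerChunk lines at a time, sort them in memory, spill to a temp file. */
    static List<Path> sortInChunks(Path input, int maxLinesPerChunk, Comparator<String> cmp) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() >= maxLinesPerChunk) {
                    chunks.add(writeSortedChunk(buffer, cmp));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunks.add(writeSortedChunk(buffer, cmp));
            }
        }
        return chunks;
    }

    private static Path writeSortedChunk(List<String> lines, Comparator<String> cmp) throws IOException {
        lines.sort(cmp);
        Path chunk = Files.createTempFile("sort-chunk", ".txt");
        Files.write(chunk, lines, StandardCharsets.UTF_8);
        return chunk;
    }

    /** An open chunk file together with the line currently at its head. */
    private static final class ChunkCursor {
        final BufferedReader reader;
        String head;

        ChunkCursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }
    }

    /** Phase 2: k-way merge of the sorted chunks; returns the number of lines written. */
    static long mergeChunks(List<Path> chunks, Path output, Comparator<String> cmp) throws IOException {
        // The queue always holds the smallest unread line of every chunk that still has data.
        PriorityQueue<ChunkCursor> queue = new PriorityQueue<>((a, b) -> cmp.compare(a.head, b.head));
        long written = 0;
        try (BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path chunk : chunks) {
                ChunkCursor cursor = new ChunkCursor(Files.newBufferedReader(chunk, StandardCharsets.UTF_8));
                if (cursor.head != null) queue.add(cursor); else cursor.reader.close();
            }
            while (!queue.isEmpty()) {
                ChunkCursor cursor = queue.poll();
                writer.write(cursor.head);
                writer.newLine();
                written++;
                cursor.head = cursor.reader.readLine();
                if (cursor.head != null) queue.add(cursor); else cursor.reader.close();
            }
        } finally {
            // Readers remain in the queue only if the merge failed part-way through.
            for (ChunkCursor cursor : queue) cursor.reader.close();
            for (Path chunk : chunks) Files.deleteIfExists(chunk);
        }
        return written;
    }
}
```

The sketch bounds each chunk by line count for simplicity; a real implementation would more likely bound chunks by estimated memory use and cap the number of temporary files.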
Code sample (CSV)
For sorting CSV files, it might be more convenient to use CsvExternalSort.
    import com.google.code.externalsorting.CsvExternalSort;
    import com.google.code.externalsorting.CsvSortOptions;

    // provide a comparator
    Comparator<CSVRecord> comparator = (op1, op2) -> op1.get(0).compareTo(op2.get(0));
    //... inputfile: input file name
    //... outputfile: output file name
    //...provide sort options
    CsvSortOptions sortOptions = new CsvSortOptions
            .Builder(comparator, CsvExternalSort.DEFAULTMAXTEMPFILES, CsvExternalSort.estimateAvailableMemory())
            .charset(Charset.defaultCharset())
            .distinct(false)
            .numHeader(1)
            .skipHeader(false)
            .format(CSVFormat.DEFAULT)
            .build();
    // container to store the header lines
    ArrayList<CSVRecord> header = new ArrayList<CSVRecord>();

    // next two lines sort the lines from inputfile to outputfile
    List<File> sortInBatch = CsvExternalSort.sortInBatch(file, null, sortOptions, header);
    // at this point you can access header if you'd like.
    int numWrittenLines = CsvExternalSort.mergeSortedFiles(sortInBatch, outputfile, sortOptions, true, header);
The numHeader parameter is the number of header lines in the CSV file (typically 0 or 1), and the skipHeader parameter indicates whether you would like to exclude these lines from the parsing.
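To illustrate why the number of header lines matters, here is a small in-memory sketch in plain Java that sets the first numHeader lines aside, sorts the remaining rows by their first field, and puts the header back on top. It is only an illustration of the idea: HeaderAwareSortSketch and sortKeepingHeader are hypothetical names, the field splitting is a naive comma split rather than real CSV parsing, and it makes no claim about how CsvExternalSort handles these options internally.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: keeps header lines out of the sort (hypothetical names). */
final class HeaderAwareSortSketch {

    /** Sorts data rows by their first comma-separated field; the first numHeader lines stay on top. */
    static List<String> sortKeepingHeader(List<String> lines, int numHeader) {
        // Clamp so that a header count outside the file's range cannot fail.
        int split = Math.max(0, Math.min(numHeader, lines.size()));
        List<String> header = new ArrayList<>(lines.subList(0, split));
        List<String> body = new ArrayList<>(lines.subList(split, lines.size()));

        // Naive key extraction: everything before the first comma (no quoting support).
        body.sort(Comparator.comparing((String row) -> row.split(",", 2)[0]));

        List<String> result = new ArrayList<>(header);
        result.addAll(body);
        return result;
    }
}
```

With numHeader set to 0 the header row is treated as ordinary data and ends up wherever its first field sorts, which is the mistake the parameter exists to prevent.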
API Documentation
http://www.javadoc.io/doc/com.google.code.externalsortinginjava/externalsortinginjava/
Maven dependency
You can download the jar files from the Maven Central repository: https://repo1.maven.org/maven2/com/google/code/externalsortinginjava/externalsortinginjava/
You can also specify the dependency in the Maven "pom.xml" file:
    <dependencies>
        <dependency>
            <groupId>com.google.code.externalsortinginjava</groupId>
            <artifactId>externalsortinginjava</artifactId>
            <version>[0.6.0,)</version>
        </dependency>
    </dependencies>
How to build
- Get the Java JDK
- Install Maven 2
- mvn install - builds jar (requires signing)
- or mvn package - builds jar (does not require signing)
- mvn test - runs tests