Assignment 4: Parallel Quicksort

Due: Friday, Nov 8th by 11 pm

Assignment type: Pair, you may work with one partner

Update 11/5: improved code example for call to waitpid (in particular, there is a third argument specifying options, which should be 0)

Parallel Quicksort

In this assignment, you will

  1. Complete a partially-implemented program which uses the Quicksort sorting algorithm to sort a file containing a sequence of 64-bit signed integers; specifically, add code to use the mmap system call to create a shared mapping of the data in the file so that the contents of the file appear as an array in memory
  2. Use child processes to execute the recursive calls in parallel, in order to take advantage of multiple CPU cores to complete sorting more quickly

Grading Criteria

Your assignment grade will be determined as follows:

Getting Started

Download csf_assign04.zip and unzip it. You will be modifying the file parsort.c.

Quicksort

The starter code has a correct implementation of quicksort, so you don’t really need to completely understand it in order to do the assignment. However, quicksort is a very simple and elegant algorithm, so it’s not too hard to explain.

The basic idea of quicksort is based on partitioning a sequence into “left” and “right” partitions. Partitioning works as follows:

  1. An arbitrary “pivot” element in the sequence is chosen
  2. The sequence is re-arranged so that all of the elements less than the pivot value occur before the pivot value (this is the “left” partition), and all the elements greater than or equal to the pivot value occur after the pivot value (this is the “right” partition)

Once a sequence is partitioned, it can be sorted by recursively sorting the left partition and the right partition.

Let’s say that we want to sort all elements of a sequence called arr between the indices start (inclusive lower bound) and end (exclusive upper bound.) A pseudo-code implementation of quicksort might look something like this:

quicksort( arr, start, end ) {
  n = end - start
  if ( n < 2 )
    // base case: sequence is trivially already sorted
    return

  // the partition function returns the index where
  // the pivot element ended up
  mid = partition( arr, start, end )

  quicksort( arr, start, mid )
  quicksort( arr, mid + 1, end )
}

Partitioning a sequence with \(n\) elements is \(O(n)\). Assuming that quicksort generally chooses a pivot element such that the left and right partitions are roughly equal in size, the overall running time of quicksort is \(O(n \log n)\) in the average case.
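To make the partitioning step concrete, here is one possible implementation, a Lomuto-style partition over int64_t elements that uses the last element as the pivot. This is an illustrative sketch; the starter code's partition function may choose its pivot differently.

```c
#include <stdint.h>

// Swap two elements in place
static void swap_elts(int64_t *a, int64_t *b) {
  int64_t tmp = *a;
  *a = *b;
  *b = tmp;
}

// Lomuto partition: uses arr[end-1] as the pivot, rearranges
// arr[start..end) so that elements less than the pivot come
// first, and returns the pivot's final index
int64_t partition(int64_t *arr, int64_t start, int64_t end) {
  int64_t pivot = arr[end - 1];
  int64_t i = start;
  for (int64_t j = start; j < end - 1; j++) {
    if (arr[j] < pivot) {
      swap_elts(&arr[i], &arr[j]);
      i++;
    }
  }
  swap_elts(&arr[i], &arr[end - 1]);
  return i;
}
```

Note that this matches the contract assumed by the pseudo-code above: after the call, every element before the returned index is less than the pivot, and every element after it is greater than or equal to the pivot.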

Parallel quicksort

Because the two recursive calls to quicksort operate on completely independent parts of the array, there is no reason why they can’t execute in parallel on different CPU cores. This approach to converting a sequential algorithm to allow for parallelism is known as fork/join, and in general can be used to speed up the execution of divide and conquer algorithms.

Eventually we will learn about threads as a mechanism for executing instructions on multiple CPU cores. For this assignment, we will use child processes to execute the recursive calls in parallel. Normally, a data structure in a parent process, such as an array being sorted, would not be shared with a child process, since by default, child processes do not share memory with their parent process. However, if a shared file mapping is created, the memory containing the data of the mapped file is shared between parent and child processes, which means a child process can participate in sorting the data in the file.

Tasks

To complete the assignment, you will need to do the following:

  1. Fill in missing details in the main function to open the file containing the data to sort, determine its size, and use the mmap system call to create a shared mapping of the file’s contents
  2. Verify that the program can correctly do sequential sorting (without any parallelism)
  3. Modify the program so that when the number of elements to sort is greater than the specified parallel threshold, the recursive calls to the quicksort function are executed by child processes
  4. Do experiments to measure the observed speedup when sorting a large file with varying parallel threshold values, and write a report explaining the observed running times

Note that when the number of elements to sort is less than or equal to the parallel threshold, the elements are sorted using a call to the qsort function, which will sort the elements efficiently using a single CPU core.
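For reference, a qsort comparison function for int64_t values might look like the following sketch (the starter code's comparator may differ). Note that simply returning the difference of the two values could overflow int, so an explicit comparison is safer.

```c
#include <stdint.h>
#include <stdlib.h>

// Comparison function suitable for qsort over int64_t values.
// Returning (int)(a - b) could overflow, so compare explicitly.
int compare_i64(const void *left, const void *right) {
  int64_t a = *(const int64_t *) left;
  int64_t b = *(const int64_t *) right;
  if (a < b) return -1;
  if (a > b) return 1;
  return 0;
}
```

A call to sort a sub-range would then look something like qsort( arr + start, end - start, sizeof(int64_t), compare_i64 ).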

Task 1: open file, determine its size, map its data

You will start by modifying the program to open the file, determine its size, and use mmap to create a shared memory mapping of its contents.

First, you will need to use the open syscall to open the file in read-write mode and get a file descriptor:

int fd = open(filename, O_RDWR);
if (fd < 0) {
  // file couldn't be opened: handle error and exit
}

Next, mmap will need to know how many bytes of data the file has. This can be accomplished using the fstat system call:

struct stat statbuf;
int rc = fstat( fd, &statbuf );
if ( rc != 0 ) {
    // handle fstat error and exit
}
// statbuf.st_size indicates the number of bytes in the file

Note that in addition to finding the size of the file, you should also compute the number of int64_t elements in the file’s data. (I.e., divide the size in bytes by sizeof(int64_t).)
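For example, the computation could look like the following (the function name is illustrative, not part of the starter code):

```c
#include <stddef.h>
#include <stdint.h>

// Number of int64_t elements stored in a file of the given
// size in bytes
size_t element_count(size_t file_size_in_bytes) {
  return file_size_in_bytes / sizeof(int64_t);
}
```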

Once the program knows the size of the file, creating a shared read-write mapping will allow the program, and all its descendants, to modify the file in-place in memory:

int64_t *arr;
arr = mmap( NULL, file_size_in_bytes, PROT_READ | PROT_WRITE,
            MAP_SHARED, fd, 0 );
close( fd ); // file can be closed now
if ( arr == MAP_FAILED ) {
    // handle mmap error and exit
}
// arr now behaves like a standard array of int64_t.
// Be careful though! Accessing elements past the end of
// the array is undefined behavior, and will typically
// crash the program (e.g., with SIGSEGV or SIGBUS)

Passing in NULL for the requested mapping address gives mmap complete freedom to choose any address in memory as the base address for the mapping. Since we don’t care where the file’s data ends up in memory, so long as we can access it, this is what we want. Similarly, we want to map the entire file, so we set the offset to zero.

Note that before the main function returns, it should call munmap to unmap the file contents from memory. This should look something like

munmap( arr, file_size_in_bytes );

Task 2: verify that sequential sorting works

Once the program can map the file's data, it should correctly perform sequential sorting of the data in the file. Two helper programs are provided to help you verify whether sorting is working: gen_rand_data, which generates a file of pseudo-random data, and seqsort, which performs a known-correct sequential sort.

Here’s an example of using these two programs to test parsort on a one megabyte (\(2^{20}\) bytes) randomly generated data file:

$ make gen_rand_data parsort seqsort
gcc -g -Wall -c gen_rand_data.c -o gen_rand_data.o
gcc -o gen_rand_data gen_rand_data.o
gcc -g -Wall -c parsort.c -o parsort.o
gcc -o parsort parsort.o
g++ -g -Wall -std=c++17 -c seqsort.cpp -o seqsort.o
g++ -o seqsort seqsort.o
$ ./gen_rand_data 1M test_data_1.bin
Wrote 1048576 bytes to 'test_data_1.bin'
$ ./parsort test_data_1.bin 65536
$ ./gen_rand_data 1M test_data_2.bin
Wrote 1048576 bytes to 'test_data_2.bin'
$ ./seqsort test_data_2.bin 
$ diff test_data_1.bin test_data_2.bin 
$ echo $?
0

If the diff command does not produce any output and exits with exit code 0 (as indicated by the echo $? command), it means that parsort and seqsort both produced the same result data, indicating that parsort worked correctly. Note that the parallel threshold value passed to parsort was 65536 in the example above, but it should work for any threshold value. Also note that the data created by the gen_rand_data program is "pseudo-random": it is deterministic, and for a given requested size will always generate the same sequence of bytes.

Task 3: use child processes to execute the recursive calls

To allow the recursive sorting to make use of multiple CPU cores (when the number of elements being sorted is greater than the parallel threshold), you will use the fork() system call to create one child process for each recursive call.

Using the fork system call is straightforward:

pid_t child_pid = fork();
if ( child_pid == 0 ) {
  // executing in the child
  // ...do work...
  if ( /* work was done successfully */ )
    exit( 0 );
  else
    exit( 1 );
} else if ( child_pid < 0 ) {
  // fork failed
  // ...handle error...
} else {
  // in parent
}

The code labeled ...do work... above would be the computation you want the child to perform, i.e., recursively sorting part of the array.

Eventually, the parent should wait for the child to complete. This can be done using the waitpid system call:

int rc, wstatus;
rc = waitpid( child_pid, &wstatus, 0 );
if ( rc < 0 ) {
  // waitpid failed
  // ...handle error...
} else {
  // check status of child
  if ( !WIFEXITED( wstatus ) ) {
    // child did not exit normally (e.g., it was terminated by a signal)
    // ...handle child failure...
  } else if ( WEXITSTATUS( wstatus ) != 0 ) {
    // child exited with a non-zero exit code
    // ...handle child failure...
  } else {
    // child exited with exit code zero (it was successful)
  }
}

Note that in the above code snippets for fork and waitpid, there are a number of situations where a failure can occur:

  1. the child can’t be created (i.e., fork() fails)
  2. the parent doesn’t successfully learn the result of the child (i.e., waitpid() fails)
  3. the child isn’t successful (it doesn’t exit normally, or it exits with a non-zero exit code)

If any of these situations occurs, the parent process needs to make sure that it returns 0 from the quicksort function to indicate that sorting failed.

One detail that can be tricky to get right is making sure that for any child process that is created successfully, the parent definitely attempts to wait for the child (by calling waitpid().) Because of the numerous ways in which errors can occur, and because there will in general be two child processes (one for each recursive call), the control paths can get quite complex. A good way to reduce this complexity is to introduce a data type to represent a child process. An instance of this data type records:

  1. whether the child was created successfully, and if so, its process id (pid_t value)
  2. what state the child process is in: is it running, or if we waited for it to complete, was the wait successful, and if so, what is the process’s exit code
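One possible shape for such a type, tracking exactly the information listed above (the name Child matches the usage sketch below, but the field names and layout here are hypothetical, not part of the starter code):

```c
#include <stdbool.h>
#include <sys/types.h>

// Hypothetical representation of a child process's lifecycle
typedef struct {
  bool fork_success;     // did fork() succeed?
  pid_t pid;             // meaningful only if fork_success is true
  bool wait_success;     // did waitpid() succeed?
  bool exited_normally;  // WIFEXITED(wstatus) was true
  int exit_code;         // WEXITSTATUS(wstatus), if exited normally
} Child;
```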

Let’s say you have a Child data type representing a child process. Your code to do parallel sorting might look like this:

Child left, right;
left = quicksort_subproc( arr, start, mid, par_threshold );
right = quicksort_subproc( arr, mid + 1, end, par_threshold );

quicksort_wait( &left );
quicksort_wait( &right );

left_success = quicksort_check_success( &left );
right_success = quicksort_check_success( &right );

In this code, the quicksort_subproc function is responsible for creating a child process to do recursive sorting. The quicksort_wait function waits for a child process to complete. The quicksort_check_success function checks whether a child process completed successfully. You are welcome to use this approach, although you’re not required to. In general, you will need to guarantee

  1. if the parent creates a child, it is guaranteed to wait for the child to complete, and
  2. if any errors occur, they are correctly reported by having quicksort return 0 rather than 1

Task 4: experiments and analysis

To make sure that your parsort program is exhibiting the expected degree of parallelism, we would like you to perform an experiment where you create a random data file of 16 megabytes in size, and then time sorting this file multiple times, while adjusting the threshold to achieve increasing amounts of parallel execution.

You should run the following commands:

make clean
make
mkdir -p /tmp/$(whoami)
./gen_rand_data 16M /tmp/$(whoami)/data_16M.in
cp /tmp/$(whoami)/data_16M.in /tmp/$(whoami)/test_16M.in
time ./parsort /tmp/$(whoami)/test_16M.in 2097152 
cp /tmp/$(whoami)/data_16M.in /tmp/$(whoami)/test_16M.in
time ./parsort /tmp/$(whoami)/test_16M.in 1048576
cp /tmp/$(whoami)/data_16M.in /tmp/$(whoami)/test_16M.in
time ./parsort /tmp/$(whoami)/test_16M.in 524288
cp /tmp/$(whoami)/data_16M.in /tmp/$(whoami)/test_16M.in
time ./parsort /tmp/$(whoami)/test_16M.in 262144
cp /tmp/$(whoami)/data_16M.in /tmp/$(whoami)/test_16M.in
time ./parsort /tmp/$(whoami)/test_16M.in 131072
cp /tmp/$(whoami)/data_16M.in /tmp/$(whoami)/test_16M.in
time ./parsort /tmp/$(whoami)/test_16M.in 65536
cp /tmp/$(whoami)/data_16M.in /tmp/$(whoami)/test_16M.in
time ./parsort /tmp/$(whoami)/test_16M.in 32768
cp /tmp/$(whoami)/data_16M.in /tmp/$(whoami)/test_16M.in
time ./parsort /tmp/$(whoami)/test_16M.in 16384
rm -rf /tmp/$(whoami)

The parsort commands start with a completely sequential sort, and then test with increasing degrees of parallelism. For example, at the smallest threshold of 16384 elements, there will be 128 processes doing sequential sorting at the base cases of the recursion.

We have provided a shell script called run_experiments.sh which runs these commands, so to collect your experimental data you could just run the command

./run_experiments.sh

We suggest using one of the numbered ugrad machines (ugrad1.cs.jhu.edu to ugrad24.cs.jhu.edu) to do your experiment. When you log in, you can run the top command to see what processes are running. (Note that you can type q to exit from top.) If any processes are consuming significant CPU time, you should consider logging into a different system, until you find one where no processes are consuming significant CPU time.

When you run the commands, copy the output of the time command. The real time indicates the amount of time that elapsed between when the program started and when it exited, which is a pretty good measure of how long the sorting took. You should see that decreasing the threshold decreases the total time, although depending on the number of CPU cores available, a point of diminishing returns will eventually be reached.

In your README.txt, write a brief report which

  1. Indicates the amount of time that your parsort program took to sort the test data for each threshold value, and
  2. States a reasonable explanation for why you saw the times you did

For #2, think about how the computation unfolds, and in particular, what parts of the computation are being executed in different processes, and thus which parts of the computation could be scheduled by the OS kernel in parallel on different CPU cores. We don’t expect a completely rigorous and in-depth explanation, but we would like you to give an intuitive explanation for the results that you observed.

Note on the autograder

Passing the autograder will be a necessary but insufficient condition for full credit. This means that you may still lose functionality points even if you pass all of the autograder tests. Due to the nature of testing parallel programs, a significant number of points will be subject to manual review, so please structure your code accordingly. Some of the things we may manually verify include (but are not limited to):

Submitting

Edit the README.txt file to include the report and summarize each team member’s contributions.

Create a zipfile of your solution using the command

make solution.zip

Submit your zipfile to Gradescope as Assignment 4.