Milestone 1: Due Monday, Sep 25th by 11 pm
Milestone 2: Due Thursday, Oct 5th by 11 pm
Update 9/21: Corrected pseudo-code for main input loop of main
function.
(It should be calling wc_dict_find_or_insert
, not
wc_find_or_insert
.)
Update 9/21: Updated the restrictions on library function calls to
explicitly allow calls to malloc
and free
in main
, for the purposes
of allocating and deallocating the hash table array.
Update 9/22: Added advice for implementing Milestone 2.
Updated 9/26: Updated Memory correctness
section to indicate that only the test_free_chain
unit test
needs to be free of memory leaks (since there are some test functions
with unavoidable memory leaks, such as test_find_or_insert
.)
Overview
In this assignment, you will implement a word count program in both C and x86-64 assembly language. The word count program is supported by a number of functions which you will implement in both C and assembly language. Unit tests are provided to test both the C and assembly language implementations of these functions: these tests allow you to work on the functions one at a time and gain confidence that they work correctly before incorporating them into the overall program. (We encourage you to write additional unit tests as well, since the provided tests are somewhat basic.)
This is a substantial assignment! We strongly recommend that you start working on it as early as possible, and plan to make steady progress rather than waiting until the last minute to complete it.
Milestones, grading criteria
Milestone 1 (20%):
- C implementations of all functions (in
c_wcfuncs.c
) - C implementation of the main program (in
c_wcmain.c
) - Assembly language implementations of the following functions
(in
asm_wcfuncs.S
):wc_isspace
,wc_isalpha
, andwc_str_compare
Milestone 2 (80%):
- Assembly language implementations of all remaining functions
(in
asm_wcfuncs.S
) - Assembly language implementation of the main program
(in
asm_wcmain.S
, 4% of assignment grade) - Design and coding style (10% of assignment grade)
Getting started
To get started, download csf_assign02.zip, which contains the skeleton code for the assignment, and then unzip it.
Note that you can download the zipfile from the command line using the curl
program:
curl -O https://jhucsf.github.io/fall2023/assign/csf_assign02.zip
Note that in the -O
option, it is the letter “O”, not the numeral “0”.
Word count
The overall goal of the assignment is to create two implementations (C and x86-64 assembly language) of a word count program. When the program is run, it
- Opens a text file named on the command line, or uses standard input if no command line argument was provided
- Reads each word of text in the input, building a hash table of occurrence counts for each unique word
- Prints a summary of number of words seen, number of unique words, and information about the most-frequently occurring unique word in the input
- Closes the input file (if not reading from standard input)
- Frees all allocated memory
Here is an example run of the c_wordcount
program using the input file
little_dorrit.txt
, which is the text of Little Dorrit by Charles Dickens
as published by Project Gutenberg
(user input is shown in bold):
$ ./c_wordcount little_dorrit.txt
Total words read: 340410
Unique words read: 18735
Most frequent word: the (15723)
This test input is available here:
For any test input, running the asm_wordcount
program (or casm_wordcount
program) should
result in behavior identical to c_wordcount
.
Important restrictions
In implementing the functions in c_wcfuncs.c
and asm_wcfuncs.S
, you are not
allowed to call C library functions other than the following ones:
fgetc
malloc
free
In implementing c_wcmain.c
and asm_wcmain.S
, you are not allowed to call C
library functions other than the following ones:
fopen
fclose
printf
fprintf
malloc
free
Note that c_wcmain.c
does not have an #include <stdlib.h>
directive,
so you should add one if you need to call malloc
and free
.
Aside from adding #include <stdlib.h>
to c_wcmain.c
,
you should not add any additional #include
directives in any source file.
Expectations for assembly language code
For your implementations of assembly language functions, we expect the following.
Your code must be hand-written. Using a compiler to generate assembly code from C code is not a legitimate way to complete the assembly language functions.
We expect your assembly code to be extensively commented. One comment per assembly instruction is a good rule of thumb. One way to think about this: assembly code is ten times harder to write than high-level language code, so you should have ten times more comments in your assembly code than in your high-level language code.
Functions to implement, unit tests
The header file wcfuncs.h
defines a data type (struct WordEntry
) and
C functions which will form the basis of the word count program.
The struct WordEntry
data type is defined as follows:
struct WordEntry {
unsigned char word[MAX_WORDLEN + 1];
uint32_t count; // number of occurrences of this word
struct WordEntry *next;
};
Each WordEntry
object represents one unique word appearing in the program
input, along with a count of how many times the word occurred.
The next
field allows WordEntry
objects to be arranged in a singly-linked
list, which in turn allows the program to maintain a hash table
to keep track of occurrence counts for all words observed in the input.
You will need to implement the following functions in both C and x86-64 assembly language:
uint32_t wc_hash(const unsigned char *w);
int wc_str_compare(const unsigned char *lhs, const unsigned char *rhs);
void wc_str_copy(unsigned char *dest, const unsigned char *source);
int wc_isspace(unsigned char c);
int wc_isalpha(unsigned char c);
int wc_readnext(FILE *in, unsigned char *w);
void wc_tolower(unsigned char *w);
void wc_trim_non_alpha(unsigned char *w);
struct WordEntry *wc_find_or_insert(struct WordEntry *head, const unsigned char *s,
int *inserted);
struct WordEntry *wc_dict_find_or_insert(struct WordEntry *buckets[],
unsigned num_buckets,
const unsigned char *s);
void wc_free_chain(struct WordEntry *p);
Each function has a detailed comment explaining its intended behavior.
In addition, wctests.c
contains unit tests demonstrating the expected
behavior of each function. (You are encouraged to add additional tests,
but you are not required to.)
As you implement each function, you can use the test program to test it.
For example, if you are ready to test your C implementation of the
wc_isspace
function, you could run the commands
make c_wctests
./c_wctests test_isspace
To test the assembly language implementation of wc_isspace
:
make asm_wctests
./asm_wctests test_isspace
Unsigned characters
You will notice that all of the required functions consistently use
unsigned char
as the data type for text characters. This is primarily
to avoid sign-extension issues when computing a string’s hash code.
In the test program (wctests.c
), there are calls to the C library
string functions which require casts to or from char *
or const char *
to avoid compiler warnings. If you add your own test cases, you
may need to such casts. However, no such casts should be necessary
in c_wcfuncs.c
or c_wcmain.c
.
Hash table implementation
The primary data structure for keeping track of word occurrence counts will be a hash table using chained hashing (a.k.a. separate chaining) to handle collisions.
Here is a brief summary of how this should work.
When the word count program reads a word from the input, it
converts it to lower case (using wc_tolower
) and strips off any
trailing non-alphabetic characters (using wc_trim_non_alpha
.)
It then uses the hash function (wc_hash
) to compute
a 32 bit hash code from the word. Next, it finds the index in
the hash table where the word should be located: this is the
hash code mod N, where N is the number of “buckets” (array elements)
in the hash table. Each bucket is the head pointer of a singly-linked
list of WordEntry
objects, one for each word present in the bucket.
By searching this list, the program can determine whether or not the
word was encountered previously. If no WordEntry
object containing
the word is found, then the program is seeing the first occurrence of
the word in the input, and it should prepend a new WordEntry
object
to the list.
The wc_find_or_insert
function is responsible for finding or inserting
the WordEntry
object for a word read from the input. It returns a
pointer to the WordEntry
object representing the searched-for word,
allowing the word count program to update its occurrence count.
Main program implementation
The c_wcmain.c
and asm_wcmain.S
source files implement (respectively)
the C and assembly language implementations of the word count program’s
main
function.
It should be very straightforward to implement the main
function by calling
the functions declared in wcfuncs.h
. For example, the main loop of the
program should work like this:
while ( next word is read successfully using wc_readnext ) {
increase total word count by 1
use wc_tolower to convert word to lower case
use wc_trim_non_alpha to remove non-alphabetic characters at end of word
use wc_dict_find_or_insert to find or insert the word in the hash table
increment the WordEntry's count
}
In finding the unique word with the highest number of occurrences,
you will need to traverse the entire hash table (i.e., scan through
every WordEntry
object in every bucket of the hash table.)
One situation that could arise is that there could be multiple WordEntry
objects that are tied for the highest occurrence count. In this case,
choose the candidate (among the words with the highest occurrence count)
that compares as least lexicographically as the one to display when
the summary stats are printed. You can use the wc_str_compare
function
to do lexicographical comparisons of strings.
Error handling: if there is a command line argument specifying the
name of an input file, but the named file can’t be opened, the program
should print an error message to stderr
and exit with a non-zero
exit code. If the program executes successfully, it should exit with
exit code zero.
The Makefile
provided by the starter code has three targets to build configurations
of the word count program:
make c_wordcount
will build the version that uses the C function implementations (inc_wcfuncs.c
) and the Cmain
function (inc_wcmain.c
)make asm_wordcount
will build the version that uses the assembly language function implementations (inasm_wcfuncs.S
) and the assembly language main function (inasm_wcmain.S
)make casm_wordcount
will build a version that uses the assembly language function implementations (inasm_wcfuncs.S
) and the Cmain
function (inc_wcmain.c
)
The casm_wordcount
program is useful for validating that your assembly language
function implementations work correctly when called by the main program.
Testing the main program
You should create some example input files to test your main programs on.
For example (user input in bold):
$ curl -O https://jhucsf.github.io/fall2023/assign/oh_freddled_gruntbuggly.txt
$ cat oh_freddled_gruntbuggly.txt
Oh freddled gruntbuggly,
Thy micturations are to me,
As plurdled gabbleblotchits,
On a lurgid bee,
Groop, I implore thee, my foonting turlingdromes,
And hooptiously drangle me,
With crinkly bindlewurdles,
Or else I shall rend thee in the gobberwarts with my blurglecruncheon,
See if I don't!
$ ./c_wordcount oh_freddled_gruntbuggly.txt
Total words read: 45
Unique words read: 39
Most frequent word: i (3)
Suggested approach
Milestone 1
The main task in Milestone 1 is to implement the C versions of the
required functions and the word count program’s main
function.
You should implement this part of the milestone before moving on
to implementing the assembly language functions. We recommend working on
one function at a time, from simplest to most complicated, using
the provided unit tests to verify their functionality. Adding your
own unit tests is encouraged, but not required.
For the three assembly language functions required for Milestone 1,
start with the simplest ones (wc_isspace
and wc_isalpha
).
Use unit tests to thoroughly test them. We strongly recommend
running the unit test program in gdb
and stepping through the
assembly code so that you know exactly what your assembly instructions
are doing. Experiment with viewing the contents of registers.
When you implement wc_str_compare
in assembly language, you
will need to use a loop which accesses elements of the two strings
being compared. You should experiment with using gdb
to show you
the contents of memory. For example, if you want to see the first
element of the array referred-to by the first parameter (lhs
),
you could use the following command in gdb
:
print *(unsigned char*) $rdi
Milestone 2
In asm_wcfuncs.S
there are constants WORDENTRY_WORD_OFFSET
,
WORDENTRY_COUNT_OFFSET
, and WORDENTRY_NEXT_OFFSET
which define the
offsets of the word
, count
, and next
fields of the
struct WordEntry
data type. You will probably want to add a definition
for the size of an instance of struct WordEntry
:
#define WORDENTRY_SIZE (WORDENTRY_NEXT_OFFSET+8)
The wc_find_or_insert
function will need to allocate an instance
of struct WordEntry
using malloc
in order to insert a new node
into the list.
Referring to the fields of a struct WordEntry
object is fairly
straightforward. For example, if %r13
points to a struct WordEntry
object, and you want to advance it to point to the next one in the
linked list, you could use the instruction
movq WORDENTRY_NEXT_OFFSET(%r13), %r13
You should work on the assembly language implementation of the main
function only after fully implementing and testing the individual assembly
language function implementations. Also note that for the assembly language
main
function, the most challenging functionality will be finding the
most-frequently occurring word. We recommend
- First, get the total word count to work
- Second, get the number of unique (distinct) words to work
- Last, implement finding the most-frequently occurring word
We will test these outputs separately, so you can receive partial credit for implementing 1 or 2 of them.
Note that it is likely that you will not have enough registers for all of the
variables in the main
function, so you should be prepared to allocate one
of more variables on the stack. Also, you will probably want to reserve
64 bytes on the stack to use as the word buffer passed to wc_readnext
.
Memory correctness
In all C and C++ code you write, we expect that there are no memory errors, including
- invalid reads
- invalid writes
- uses of uninitialized values
- memory leaks
We expect you to use the valgrind memory trace tool to check program execution for occurrences of memory errors. Examples of program executions which should proceed without any memory errors include
valgrind --leak-check=full ./c_wctests test_free_chain
valgrind --leak-check=full ./asm_wctests test_free_chain
valgrind --leak-check=full ./c_wordcount little_dorrit.txt
valgrind --leak-check=full ./asm_wordcount little_dorrit.txt
valgrind --leak-check=full ./casm_wordcount little_dorrit.txt
There should be no dynamic memory errors, and (assuming that all of
the unit tests pass for the executions of c_wctests
and
asm_wctests
) there should be no memory leaks.
Assembly language tips
Here are some specific tips and tricks in no particular order.
Don’t forget that you need to prefix constant values with $
. For example,
if you want to set register %r10
to 16, the instruction is
movq $16, %r10
and not
movq 16, %r10
If you want to use a label as a pointer (address), prefix it with
$
. For example,
movq $sSpaceChars, %r10
would put the address that sSpaceChars
refers to in %r10
.
When calling a function, the stack pointer (%rsp
) must contain an address
which is a multiple of 16. However, because the callq
instruction
pushes an 8 byte return address on the stack, on entry to a function,
the stack pointer will be “off” by 8 bytes. You can subtract 8 from
%rsp
when a function begins and add 8 bytes to %rsp
before returning
to compensate. (See the example addLongs
function.) Pushing an
odd number of callee-saved registers also works, and has the benefit
that you can then use the callee-saved registers freely in your function.
If you want to define read-only string constants, the .rodata
section
is the right place for them. For example:
.section .rodata
sSpaceChars: .string " \t\r\n\f\v"
The .equ
assembler directive is useful for defining constant values,
for example:
.equ MAX_WORDLEN, 63
You might find the following source code comment useful for reminding yourself about calling conventions:
/*
* Notes:
* Callee-saved registers: rbx, rbp, r12-r15
* Subroutine arguments: rdi, rsi, rdx, rcx, r8, r9
*/
The GNU assembler allows you to define “local” labels, which start
with the prefix .L
. You should use these for control flow targets
within a function. For example:
cmpq $0, %rax /* see if read failed */
jl .LreadError /* handle read failure */
...
.LreadError:
/* error handling goes here */
Hint about determining which characters are alphabetic: character codes that are either between 65 and 90 inclusive, or 97 to 122 inclusive, are alphabetic.
Example assembly language function
Here is an assembly language function called strLen
which returns the number
of characters in a NUL-terminated character string:
/*
* Determine the length of specified character string.
*
* Parameters:
* s - pointer to a NUL-terminated character string
*
* Returns:
* number of characters in the string
*/
.globl strLen
strLen:
subq $8, %rsp /* adjust stack pointer */
movq $0, %r10 /* initial count is 0 */
.LstrLenLoop:
cmpb $0, (%rdi) /* found NUL terminator? */
jz .LstrLenDone /* if so, done */
inc %r10 /* increment count */
inc %rdi /* advance to next character */
jmp .LstrLenLoop /* continue loop */
.LstrLenDone:
movq %r10, %rax /* return count */
addq $8, %rsp /* restore stack pointer */
ret
In C, the declaration of this function could look like this:
long strLen(const char *s);
Unit testing this function might involve the following assertions:
ASSERT(13L == strLen("Hello, world!"));
ASSERT(0L == strLen(""));
ASSERT(8L == strLen("00000010"));
Submitting
Before you submit, prepare a README.txt
file so that it contains your
names, and briefly summarizes each of your contributions to the submission
(i.e., who worked on what functionality.) This may be very brief if you
did not work with a partner.
To submit your work:
Run the following commands to create a solution.zip
file:
rm -f solution.zip
zip -9r solution.zip Makefile *.h *.c *.S README.txt
Please do not submit input files (especially if they are large!)
Upload solution.zip
to Gradescope
as Assignment 2 MS1 or Assignment 2 MS2, depending on which
milestone you are submitting.
Please check the files you uploaded to make sure they are the ones you intended to submit.
Autograder
When you upload your submission to Gradescope, it will be tested by the autograder, which executes unit tests for each required function. Please note the following:
- If your code does not compile successfully, all of the tests will fail
- The autograder runs
valgrind
on your code, but it typically does not report any information about the result of runningvalgrind
: points will be deducted if your code has memory errors or memory leaks!