601.229 (F20): Assignment 2: Hex dump

Update Oct 7: due date changed to Friday, Oct 9th

Update Oct 5: due date changed to Wednesday, Oct 7th

Update Oct 2: link to improved Makefile, added assembly files to “Submitting” section, added link to late day request form

Update Sep 29: clarified that helper functions may be called from main.

Overview

In this assignment you will implement a hex dump program using both C and assembly language. The submission of this assignment will be broken up into two parts as listed below.

Note that for all problems the full x86_64 conventions regarding register usage (arguments, results, caller-saved vs. callee-saved, etc.) are in effect! (Of course regular calls differ from system calls in this regard.)

Acknowledgment: The idea for this assignment comes from the Fall 2018 HW5 developed by Peter Frohlich.

Submission Part 1

Due date 1: Wednesday, September 29, 2020 @ 11pm

For this submission, all C language function implementations must be working and have unit tests. In addition, at least the assembly language implementations of hex_to_printable and hex_format_byte_as_hex must be working and have unit tests.

Submission Part 2

Due date 2: Tuesday, October 6, 2020 @ 11pm

The rest of the Assembly language functions must be written with thorough unit tests. Uploads for this submission should include the C implementation and unit tests submitted for part 1 as well.

Late Days

If you will be using more than 2 late days on this assignment, please submit a request at this link.

Grading breakdown

Part 1 (30 points)

C implementation - 10%

Assembly implementation - 20%

Part 2 (70 points)

Assembly implementation - 35%

Unit tests - 20%

Packaging, style, and design - 15%

Getting started

Download csf_assign02.zip, which contains the skeleton code for the assignment.

You can download this file from a Linux command prompt using the curl command:

curl -O https://jhucsf.github.io/fall2020/assign/csf_assign02.zip

Note that the -O option uses the letter “O”, not the numeral “0”.

Update 10/2: The original Makefile doesn’t have support for compiling an assembly language version of the hexdump program, and omits the -g flag to enable line-level debugging of assembly code. Here is an improved version: Makefile. Please download and use it.

Hex dump

Start by reading up on what hexdumps are. For this assignment, you will write a program in C and x86-64 assembly that produces a hexdump on standard output for data read from standard input.

Let’s start with an example:

$ ./c_hexdump
Hello
00000000: 48 65 6c 6c 6f 0a                                Hello.

The program was started, then the user typed the word “Hello” followed by return/enter, then CTRL-D was used to stop the input. The result shows the ASCII code for each character (in hexadecimal, so it’s guaranteed to be two digits wide for each character), including the newline character generated by the return/enter key. The formatting may look a bit strange, but the purpose of the “large gap” becomes apparent if we examine a longer input:

$ ./c_hexdump
This is a longer example of a hexdump. Marvel at its magnificence.
00000000: 54 68 69 73 20 69 73 20 61 20 6c 6f 6e 67 65 72  This is a longer
00000010: 20 65 78 61 6d 70 6c 65 20 6f 66 20 61 20 68 65   example of a he
00000020: 78 64 75 6d 70 2e 20 4d 61 72 76 65 6c 20 61 74  xdump. Marvel at
00000030: 20 69 74 73 20 6d 61 67 6e 69 66 69 63 65 6e 63   its magnificenc
00000040: 65 2e 0a                                         e..

This time the user entered two sentences, then signaled end of input with CTRL-D. Again, we see the ASCII code for each character (including spaces and newlines). The formatting is set up so that regardless of the number of characters, we always have three “columns” of output:

  1. First the overall “position” in the input. Note that this is also a hexadecimal number, formatted to 8 digits.
  2. Then the ASCII values for each character in hexadecimal, at most 16 to a line.
  3. Finally a string-like representation of the data, with printable characters shown but non-printable characters (like newline or tab) replaced with a dot.

Note that there’s a single space between the colon after the offset and the ASCII values, but there are two spaces between the ASCII values and the string-like representation.
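The layout described above can be sketched in C as follows. This is only an illustration of the column arithmetic, not the required program structure: format_row and put_hex_digits are hypothetical helper names that do not appear in hexfuncs.h, and the sketch assumes lowercase hex digits as in the example output. It calls no C library functions.

```c
/* Sketch only: format_row and put_hex_digits are hypothetical helpers,
 * not functions declared in hexfuncs.h. */

/* Write the low `ndigits` hex digits of `value` into out (no NUL added). */
static void put_hex_digits(unsigned long value, int ndigits, char *out) {
  static const char digits[] = "0123456789abcdef";
  for (int i = ndigits - 1; i >= 0; i--) {
    out[i] = digits[value & 0xf];
    value >>= 4;
  }
}

/*
 * Build one output row for n bytes (1 <= n <= 16) that start at input
 * position `offset`. `out` needs room for at least 77 bytes:
 * 10 (offset, colon, space) + 48 (hex column) + 1 (second space)
 * + 16 (text column) + 1 (newline) + 1 (NUL).
 * Returns the number of characters written, not counting the NUL.
 */
long format_row(long offset, const unsigned char *data, long n, char *out) {
  long pos = 8;

  /* Column 1: 8-digit offset, then a colon and a single space. */
  put_hex_digits((unsigned long) offset, 8, out);
  out[pos++] = ':';
  out[pos++] = ' ';

  /* Column 2: always 16 slots of 3 characters; missing bytes stay blank. */
  for (long i = 0; i < 16; i++) {
    if (i < n) {
      put_hex_digits(data[i], 2, out + pos);
    } else {
      out[pos] = ' ';
      out[pos + 1] = ' ';
    }
    out[pos + 2] = ' ';
    pos += 3;
  }

  /* The second of the two spaces that separate columns 2 and 3. */
  out[pos++] = ' ';

  /* Column 3: printable bytes (32 through 126) as-is, others as a dot. */
  for (long i = 0; i < n; i++) {
    unsigned char c = data[i];
    out[pos++] = (c >= 32 && c <= 126) ? (char) c : '.';
  }

  out[pos++] = '\n';
  out[pos] = '\0';
  return pos;
}
```

Because the hex column always occupies 48 characters, the text column starts at index 59 of every row, however many bytes the row holds. For a 6-byte final row, the function returns 66.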

The behavior of your program should be identical to the command xxd -g 1. Note that a row is printed only once a full sixteen bytes have been read, or when end of input is reached (for example, when CTRL-D is pressed).

Note that because the purpose of this assignment is to give you an opportunity to learn how to write x86-64 assembly language code, there are some very important non-functional requirements that you will need to satisfy. (Please read that section of the assignment description carefully.)

Important: For testing the functional correctness of your hexdump programs, it is only important that they behave identically to xxd -g 1 when reading from a file, using input redirection. The examples above show the program reading from standard input, only as an illustration of the basic functionality. So, you will want to test your hexdump programs using a command like

./c_hexdump < myinput

where myinput is an input file you want to test.

Functional requirements

Functions

The header file hexfuncs.h declares the following functions:

// Read up to 16 bytes from standard input into data_buf.
// Returns the number of characters read.
long hex_read(char data_buf[]);

// Write given nul-terminated string to standard output.
void hex_write_string(const char s[]);

// Format a long value as an offset string consisting of exactly 8
// hex digits.  The formatted offset is stored in sbuf, which must
// have enough room for a string of length 8.
void hex_format_offset(long offset, char sbuf[]);

// Format a byte value (in the range 0-255) as a string consisting
// of two hex digits.  The string is stored in sbuf.
void hex_format_byte_as_hex(long byteval, char sbuf[]);

// Convert a byte value (in the range 0-255) to a printable character
// value.  If byteval is already a printable character, it is returned
// unmodified.  If byteval is not a printable character, then the
// ASCII code for '.' should be returned.
long hex_to_printable(long byteval);

In both your C and assembly language implementations, you are required to implement these functions exactly as specified.
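As an illustration of the two formatting functions, here is one possible C sketch. It is not the reference implementation. The table name hex_digits is a hypothetical choice, lowercase digits are assumed because the example output uses them, and the sketch assumes that each result is NUL-terminated, so sbuf must hold 3 bytes for a byte value and 9 bytes for an offset.

```c
/* Sketch only: hex_digits is a hypothetical name for the lookup table. */
static const char hex_digits[] = "0123456789abcdef";

void hex_format_byte_as_hex(long byteval, char sbuf[]) {
  sbuf[0] = hex_digits[(byteval >> 4) & 0xf];  /* high four bits */
  sbuf[1] = hex_digits[byteval & 0xf];         /* low four bits */
  sbuf[2] = '\0';                              /* assumed NUL terminator */
}

void hex_format_offset(long offset, char sbuf[]) {
  unsigned long value = (unsigned long) offset;

  /* Fill from the rightmost digit; only the low 32 bits are shown. */
  for (int i = 7; i >= 0; i--) {
    sbuf[i] = hex_digits[value & 0xf];
    value >>= 4;
  }
  sbuf[8] = '\0';                              /* assumed NUL terminator */
}
```

Each hex digit encodes four bits, which is why a byte always needs exactly two digits and the shift amount is 4. The same shift-and-mask steps map directly onto shrq and andq when you translate the functions to assembly.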

Main functions

In c_hexmain.c and asm_hexmain.S, you will develop C and assembly-language main functions which call the functions shown above in order to implement the functionality of the hexdump program.

Note that your main function (either version) may only call these functions and (optionally) helper functions that you create.

The c_hexdump and asm_hexdump Makefile targets build executable programs using these main modules. When reading data from standard input, their output should be identical to that of the command xxd -g 1.

The casm_hexdump Makefile target builds an executable program which uses the C version of the main function, but the assembly-language version of the hex functions. This is a handy way to test your assembly language function implementations before you have fully implemented the assembly language version of the main function. Its behavior (reading from standard input) should also be identical to xxd -g 1.

Unit tests

The source file hextests.c contains unit tests for the required functions. The provided version is very minimal, so you should add additional tests so that your implementations of the functions are thoroughly tested. Part of your grade will be based on the thoroughness of your unit tests.

Note that it will not be straightforward to write unit tests for the hex_read and hex_write_string functions, since they do I/O. So, you are not required to write unit tests for them.

Important advice: Writing complete programs in assembly language is hard. Using unit tests, you can adopt a test-driven approach where you implement one assembly language function at a time and test each one to ensure it works correctly. Using this approach will make developing the hexdump program vastly easier.

Program-level testing

In addition to unit testing individual functions, you should test each program as a whole. In general, for any input file (text, binary, etc.), running one of your hexdump programs (c_hexdump, casm_hexdump, or asm_hexdump), for example

./c_hexdump < inputfile

should produce exactly the same output as

xxd -g 1 < inputfile

We encourage you to test your program with a variety of inputs, including (but not limited to):

Non-functional requirements

Calling C library functions is not allowed. The only exception is that c_hexfuncs.c may #include <unistd.h> and call the read and write functions (which are wrappers for the read and write system calls). Outside of this singular exception, any call to C library functions will result in an automatic zero on the entire assignment. Please don’t do it!
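To illustrate the one permitted exception, here is a hedged sketch of how the two I/O functions might be written in C using only read and write. It assumes that hex_read should keep reading until it has 16 bytes or reaches end of input, since a single read call may return fewer bytes than requested, and it treats a read or write error the same as end of input, which is a simplification.

```c
#include <unistd.h>

/* Sketch only: loops because read may return fewer bytes than requested. */
long hex_read(char data_buf[]) {
  long total = 0;

  while (total < 16) {
    /* File descriptor 0 is standard input. */
    ssize_t n = read(0, data_buf + total, (size_t) (16 - total));
    if (n <= 0) {
      break;  /* 0 means end of input; a negative value means an error */
    }
    total += n;
  }
  return total;
}

void hex_write_string(const char s[]) {
  long len = 0;
  long done = 0;

  /* Count the characters by hand: calling strlen would break the rules. */
  while (s[len] != '\0') {
    len++;
  }

  /* File descriptor 1 is standard output; write may also be partial. */
  while (done < len) {
    ssize_t n = write(1, s + done, (size_t) (len - done));
    if (n <= 0) {
      return;
    }
    done += n;
  }
}
```

With this structure, hex_read returns fewer than 16 bytes only for the last row of the input, and returns 0 once the input is exhausted.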

All assembly language code must be 100% written by hand and extensively commented. No credit will be given otherwise. Any undocumented assembly code will cause you to lose points. You don’t have to comment every line, but at least every “coherent chunk” of assembly should have a comment or two. In particular you must describe where you get what data from, especially when it comes to functions and their parameters/results. You have been warned!

Hints and tips

Assembly resources

Keep in mind that the assembly language functions must fully conform to the x86-64 calling conventions; otherwise, interoperability with C code won’t work. If you’re lost and/or unsure where to start, here’s a list of different resources that can be a good starting point for you to look at:

x86-64 tips and tricks

Here are some more specific tips and tricks in no particular order.

Don’t forget that you need to prefix constant values with $. For example, if you want to set register %r10 to 16, the instruction is

movq $16, %r10

and not

movq 16, %r10

If you want to use a label as a pointer (address), prefix it with $. For example,

movq $sHexDigits, %r10

would put the address that sHexDigits refers to in %r10.

When calling a function, the stack pointer (%rsp) must contain an address which is a multiple of 16. However, because the callq instruction pushes an 8 byte return address on the stack, on entry to a function, the stack pointer will be “off” by 8 bytes. You can subtract 8 from %rsp when a function begins and add 8 bytes to %rsp before returning to compensate. (See the example addLongs function.) Pushing an odd number of callee-saved registers also works, and has the benefit that you can then use the callee-saved registers freely in your function.

If you want to define read-only string constants, the .rodata section is the right place for them. For example:

        .section .rodata
sHexDigits: .string "0123456789abcdef"

The .equ assembler directive is useful for defining constant values, for example:

.equ BUFSIZE, 16

You might find the following source code comment useful for reminding yourself about calling conventions:

/*
 * Notes:
 * Callee-saved registers: rbx, rbp, r12-r15
 * Subroutine arguments:  rdi, rsi, rdx, rcx, r8, r9
 */

In Unix and Linux, standard input is file descriptor 0.

Linux system calls do not preserve %rcx or %r11, so make sure you save them on the stack if their contents need to be preserved across a system call.

The GNU assembler allows you to define “local” labels, which start with the prefix .L. You should use these for control flow targets within a function. For example (from the echoInput.S example program):

	cmpq $0, %rax                 /* see if read failed */
	jl .LreadError                /* handle read failure */

	...

.LreadError:
	/* error handling goes here */

Hint about determining which characters are printable: the range of printable ASCII characters is 32 through 126, inclusive. Any byte value that is not in this range should be printed as “.” (period). Note that “.” has ASCII value 46.
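The hint above translates into a short range check. The following C sketch is one possible way to write it, not the only one:

```c
/* Sketch only: one way to apply the 32 through 126 printable range. */
long hex_to_printable(long byteval) {
  if (byteval >= 32 && byteval <= 126) {
    return byteval;  /* printable: return unchanged */
  }
  return '.';        /* not printable: ASCII code 46 */
}
```

When unit testing, the boundary values are the interesting cases: 31, 32, 126, and 127, along with 0, 10 (newline), and 255.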

Example assembly language programs

For reference, here are links to a couple of example assembly language programs which use the read and write system calls.

Example assembly language functions

This section shows implementations of a couple of assembly language functions you might find useful.

Here is an assembly language function called strLen which returns the number of characters in a NUL-terminated character string:

/*
 * Determine the length of specified character string.
 *
 * Parameters:
 *   s - pointer to a NUL-terminated character string
 *
 * Returns:
 *    number of characters in the string
 */
	.globl strLen
strLen:
	subq $8, %rsp                 /* adjust stack pointer */
	movq $0, %r10                 /* initial count is 0 */

.LstrLenLoop:
	cmpb $0, (%rdi)               /* found NUL terminator? */
	jz .LstrLenDone               /* if so, done */
	inc %r10                      /* increment count */
	inc %rdi                      /* advance to next character */
	jmp .LstrLenLoop              /* continue loop */

.LstrLenDone:
	movq %r10, %rax               /* return count */
	addq $8, %rsp                 /* restore stack pointer */
	ret

In C, the declaration of this function could look like this:

long strLen(const char *s);

Unit testing this function might involve the following assertions:

ASSERT(13L == strLen("Hello, world!"));
ASSERT(0L == strLen(""));
ASSERT(8L == strLen("00000010"));

Assignment tips

For this assignment, you should write the C language implementations first. Be sure to write unit tests and check your work along the way.

After writing your C functions, start thinking about how you can translate your implementation into assembly. Be sure to consider register usage and be aware of pushing/popping the stack.

Submitting

Submit a zipfile containing your complete project. The recommended way to do this is to run the command make solution.zip. This will create a file called solution.zip with all of the required files. Important: all of the files in the zipfile must be at the top level, not a subdirectory. For example, if your zipfile is called solution.zip and you run the command unzip -l solution.zip to list its contents, you should see something like the following output:

Archive:  solution.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     1140  09-13-2020 18:42   Makefile
     1053  09-13-2020 18:42   hexfuncs.h
     3959  09-13-2020 18:42   tctest.h
      884  09-13-2020 18:42   c_hexfuncs.c
      877  09-13-2020 18:42   c_hexmain.c
     3459  09-13-2020 18:42   asm_hexfuncs.S
      953  09-13-2020 18:42   asm_hexmain.S
     1458  09-13-2020 18:42   hextests.c
     3948  09-13-2020 18:42   tctest.c
---------                     -------
    17731                     9 files

Upload this zipfile to Gradescope for both parts 1 and 2 of Assignment 2. Make sure to include your name and email address in every file you turn in (well, in every file for which it makes sense to do so anyway!)