- Out on: Tuesday, September 15, 2020
- Part 1 Due: Tuesday, September 29, 2020 @ 11pm
- Part 2 Due: Friday, October 9, 2020 @ 11pm
- Collaboration: None
Update Oct 7: due date changed to Friday, Oct 9th
Update Oct 5: due date changed to Wednesday, Oct 7th
Update Oct 2: link to improved Makefile, added assembly files to “Submitting” section, added link to late day request form
Update Sep 29: clarified that helper functions may be called from main
Overview
In this assignment you will implement a hex dump program using both C and assembly language. The submission of this assignment is split into two parts, as listed below.
Note that for all problems the full x86_64 conventions regarding register usage (arguments, results, caller-saved vs. callee-saved, etc.) are in effect! (Of course regular calls differ from system calls in this regard.)
Acknowledgment: The idea for this assignment comes from the Fall 2018 HW5 developed by Peter Frohlich.
Submission Part 1
Due date 1: Tuesday, September 29, 2020 @ 11pm
For this submission, all C language function implementations must be working, with unit tests written. In addition, at least the assembly language functions hex_to_printable and hex_format_byte_as_hex must be working, with unit tests written.
Submission Part 2
Due date 2: Friday, October 9, 2020 @ 11pm
The rest of the Assembly language functions must be written with thorough unit tests. Uploads for this submission should include the C implementation and unit tests submitted for part 1 as well.
Late Days
If you will be using more than 2 late days on this assignment, please submit a request at this link.
Grading breakdown
Part 1 (30 points)
C implementation - 10%
Assembly implementation - 20%
Part 2 (70 points)
Assembly implementation - 35%
Unit tests - 20%
Packing, style, and design - 15%
Getting started
Download csf_assign02.zip, which contains the skeleton code for the assignment.
You can download this file from a Linux command prompt using the curl command:
curl -O https://jhucsf.github.io/fall2020/assign/csf_assign02.zip
Note that the -O option uses the letter “O”, not the numeral “0”.
Update 10/2: The original Makefile doesn’t have support for compiling
an assembly language version of the hexdump program, and omits the -g
flag to enable line-level debugging of assembly code. Here is an improved
version: Makefile. Please download and use it.
Hex dump
Start by reading up on what hexdumps are. For this assignment, you will write a program in C and x86-64 assembly that produces a hexdump on standard output for data read from standard input.
Let’s start with an example:
$ ./c_hexdump
Hello
00000000: 48 65 6c 6c 6f 0a                                Hello.
The program was started, then the user typed the word “Hello” followed by return/enter, then CTRL-D was used to stop the input. The result shows the ASCII code for each character (in hexadecimal, so it’s guaranteed to be two digits wide for each character), including the newline character generated by the return/enter key. The formatting may look a bit strange, but the purpose of the “large gap” becomes apparent if we examine a longer input:
$ ./c_hexdump
This is a longer example of a hexdump. Marvel at its magnificence.
00000000: 54 68 69 73 20 69 73 20 61 20 6c 6f 6e 67 65 72  This is a longer
00000010: 20 65 78 61 6d 70 6c 65 20 6f 66 20 61 20 68 65   example of a he
00000020: 78 64 75 6d 70 2e 20 4d 61 72 76 65 6c 20 61 74  xdump. Marvel at
00000030: 20 69 74 73 20 6d 61 67 6e 69 66 69 63 65 6e 63   its magnificenc
00000040: 65 2e 0a                                         e..
This time the user entered two sentences, then signaled end of input with CTRL-D. Again, we see the ASCII code for each character (including spaces and newlines). The formatting is set up so that regardless of the number of characters, we always have three “columns” of output:
- First the overall “position” in the input. Note that this is also a hexadecimal number, formatted to 8 digits.
- Then the ASCII values for each character in hexadecimal, at most 16 to a line.
- Finally a string-like representation of the data, with printable characters shown but non-printable characters (like newline or tab) replaced with a dot.
Note that there’s a single space between the colon after the offset and the ASCII values, but there are two spaces between the ASCII values and the string-like representation.
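The three-column layout described above can be sketched in C as a function that formats one row into a buffer. This is only an illustration under stated assumptions: format_dump_line and put_hex are hypothetical names that are not part of hexfuncs.h, and the padding of a short final row (three spaces per missing byte) is inferred from the example output above.

```c
#include <stddef.h>

#define BYTES_PER_LINE 16

/* Write the low `ndigits` hex digits of v into dst (no NUL terminator).
 * Hypothetical helper, not part of hexfuncs.h. */
static void put_hex(char *dst, unsigned long v, int ndigits) {
    static const char digits[] = "0123456789abcdef";
    for (int i = ndigits - 1; i >= 0; i--) {
        dst[i] = digits[v & 0xF];
        v >>= 4;
    }
}

/* Format one row of the dump into out, which must have room for at
 * least 77 bytes. n is the number of valid bytes in data (at most 16).
 * Returns the number of characters written, not counting the NUL. */
size_t format_dump_line(unsigned long offset, const unsigned char data[],
                        size_t n, char out[]) {
    size_t pos = 0;

    /* Column 1: 8-digit offset, colon, single space. */
    put_hex(out + pos, offset, 8);
    pos += 8;
    out[pos++] = ':';
    out[pos++] = ' ';

    /* Column 2: sixteen 3-character slots, blank where no byte exists. */
    for (size_t i = 0; i < BYTES_PER_LINE; i++) {
        if (i < n) {
            put_hex(out + pos, data[i], 2);
        } else {
            out[pos] = ' ';
            out[pos + 1] = ' ';
        }
        pos += 2;
        out[pos++] = ' ';
    }

    /* One more space, giving two spaces after a full row of hex values. */
    out[pos++] = ' ';

    /* Column 3: printable bytes as-is, everything else as a dot. */
    for (size_t i = 0; i < n; i++) {
        unsigned char c = data[i];
        out[pos++] = (c >= 32 && c <= 126) ? (char)c : '.';
    }
    out[pos++] = '\n';
    out[pos] = '\0';
    return pos;
}
```

Building the whole row in a buffer first means it can be emitted with a single write call. Calling the required functions one field at a time is an equally valid design.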
The behavior of your program should be identical to the command xxd -g 1. Note that the program prints a row only when it has a full row of sixteen bytes, or when end of input is reached (CTRL-D at the terminal).
Note that because the purpose of this assignment is to give you an opportunity to learn how to write x86-64 assembly language code, there are some very important non-functional requirements that you will need to satisfy. (Please read that section of the assignment description carefully.)
Important: For testing the functional correctness of your hexdump programs, it is only important that they behave identically to xxd -g 1 when reading from a file, using input redirection. The examples above show the program reading from standard input only as an illustration of the basic functionality. So, you will want to test your hexdump programs using a command like
./c_hexdump < myinput
where myinput is an input file you want to test.
Functional requirements
Functions
The header file hexfuncs.h declares the following functions:
// Read up to 16 bytes from standard input into data_buf.
// Returns the number of characters read.
long hex_read(char data_buf[]);
// Write given nul-terminated string to standard output.
void hex_write_string(const char s[]);
// Format a long value as an offset string consisting of exactly 8
// hex digits. The formatted offset is stored in sbuf, which must
// have enough room for a string of length 8.
void hex_format_offset(long offset, char sbuf[]);
// Format a byte value (in the range 0-255) as string consisting
// of two hex digits. The string is stored in sbuf.
void hex_format_byte_as_hex(long byteval, char sbuf[]);
// Convert a byte value (in the range 0-255) to a printable character
// value. If byteval is already a printable character, it is returned
// unmodified. If byteval is not a printable character, then the
// ASCII code for '.' should be returned.
long hex_to_printable(long byteval);
In both your C and assembly language implementations, you are required to implement these functions exactly as specified.
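As an illustration of the two formatting functions, here is a minimal C sketch that extracts one hex digit at a time with shifts and masks. It assumes sbuf should be NUL-terminated (so it needs 9 bytes for an offset and 3 for a byte value), and the HEX_DIGITS table name is made up for this example. Treat it as one possible approach rather than the required implementation.

```c
/* Lookup table mapping a 4-bit value to its lowercase hex digit.
 * (Name chosen for this sketch only.) */
static const char HEX_DIGITS[] = "0123456789abcdef";

/* Format offset as exactly 8 hex digits followed by a NUL terminator. */
void hex_format_offset(long offset, char sbuf[]) {
    /* Work on an unsigned copy so the right shift is well defined. */
    unsigned long v = (unsigned long)offset;
    for (int i = 7; i >= 0; i--) {
        sbuf[i] = HEX_DIGITS[v & 0xF];  /* lowest 4 bits -> last digit */
        v >>= 4;
    }
    sbuf[8] = '\0';
}

/* Format a byte value (0-255) as two hex digits followed by a NUL. */
void hex_format_byte_as_hex(long byteval, char sbuf[]) {
    sbuf[0] = HEX_DIGITS[(byteval >> 4) & 0xF];  /* high nibble */
    sbuf[1] = HEX_DIGITS[byteval & 0xF];         /* low nibble */
    sbuf[2] = '\0';
}
```

The same shift-and-mask idea carries over directly to assembly, where the lookup table could live in the .rodata section.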
Main functions
In c_hexmain.c and asm_hexmain.S, you will develop C and assembly-language main functions which call the functions shown above in order to implement the functionality of the hexdump program.
Note that your main function (either version) may only call these functions and (optionally) helper functions that you create.
The c_hexdump and asm_hexdump Makefile targets build executable programs using these main modules. When reading data from standard input, their output should be identical to the command xxd -g 1.
The casm_hexdump Makefile target builds an executable program which uses the C version of the main function, but the assembly-language version of the hex functions. This is a handy way to test your assembly language function implementations before you have fully implemented the assembly language version of the main function. Its behavior (reading from standard input) should also be identical to xxd -g 1.
Unit tests
The source file hextests.c contains unit tests for the required functions. The provided version is very minimal, so you should add additional tests so that your implementations of the functions are thoroughly tested. Part of your grade will be based on the thoroughness of your unit tests.
Note that it will not be straightforward to write unit tests for the hex_read and hex_write_string functions, since they do I/O. So, you are not required to write unit tests for them.
Important advice: Writing complete programs in assembly language is hard. Using unit tests, you can adopt a test-driven approach where you implement one assembly language function at a time and test each one to ensure correct operation. Using this approach will make developing the hex program vastly easier.
Program-level testing
In addition to unit testing individual functions, you should test the program as a whole. In general, for any input file (text, binary, etc.), the command
./hex < inputfile
should produce exactly the same output as
xxd -g 1 < inputfile
We encourage you to test your program with a variety of inputs, including (but not limited to):
- empty file
- small files
- large files
- files with sizes that are a multiple of 16
- files with sizes that aren’t a multiple of 16
- text files
- binary files
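One way to produce inputs for the cases listed above is a small generator program. The sketch below is a standalone test utility and not part of the submission, so it uses stdio freely (the restriction on C library calls applies to the hexdump code itself). The function names, file names, and the chosen sizes are all arbitrary assumptions made for this example.

```c
#include <stdio.h>

/* Write `size` bytes to `path`, cycling through all 256 byte values so
 * that both printable and non-printable bytes appear.
 * Returns 0 on success, -1 on failure. (Hypothetical helper.) */
int write_test_file(const char *path, size_t size) {
    FILE *f = fopen(path, "wb");
    if (f == NULL) {
        return -1;
    }
    for (size_t i = 0; i < size; i++) {
        if (fputc((int)(i % 256), f) == EOF) {
            fclose(f);
            return -1;
        }
    }
    return fclose(f) == 0 ? 0 : -1;
}

/* Generate inputs around the 16-byte row boundary, plus an empty file
 * and a larger one. Returns the number of files that could not be
 * written. */
int write_boundary_files(void) {
    static const size_t sizes[] = {0, 1, 15, 16, 17, 32, 4096};
    int failures = 0;
    char name[64];
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        snprintf(name, sizeof name, "input_%zu.bin", sizes[i]);
        if (write_test_file(name, sizes[i]) != 0) {
            failures++;
        }
    }
    return failures;
}
```

Each generated file can then be fed to both the hexdump program and xxd -g 1 using input redirection, and the two outputs compared.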
Non-functional requirements
Calling C library functions is not allowed. The only exception is that c_hexfuncs.c may #include <unistd.h> and call the read and write functions (which are wrappers for the read and write system calls). Outside of this singular exception, any call to C library functions will result in an automatic zero on the entire assignment. Please don’t do it!
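To illustrate how the two I/O functions might be built on read and write alone, here is a hedged C sketch. It assumes that hex_read should keep reading until it has 16 bytes or reaches end of input (a single read call may return fewer bytes than requested), and that errors are treated the same as end of input. Those choices and the BYTES_PER_LINE name are assumptions of this example, not requirements stated in the assignment.

```c
#include <unistd.h>

#define BYTES_PER_LINE 16

/* Read up to 16 bytes from standard input into data_buf, retrying
 * after short reads. Returns the number of bytes read; 0 means end
 * of input. */
long hex_read(char data_buf[]) {
    long total = 0;
    while (total < BYTES_PER_LINE) {
        ssize_t n = read(STDIN_FILENO, data_buf + total,
                         (size_t)(BYTES_PER_LINE - total));
        if (n <= 0) {
            break;  /* end of input, or an error treated as such */
        }
        total += n;
    }
    return total;
}

/* Write a NUL-terminated string to standard output. The length is
 * counted by hand because strlen is a C library function. */
void hex_write_string(const char s[]) {
    size_t len = 0;
    while (s[len] != '\0') {
        len++;
    }
    size_t written = 0;
    while (written < len) {
        ssize_t n = write(STDOUT_FILENO, s + written, len - written);
        if (n <= 0) {
            return;  /* give up on error */
        }
        written += (size_t)n;
    }
}
```

Looping on short reads matters mostly when input arrives through a pipe or a terminal. With a redirected file, a single read usually fills the buffer.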
All assembly language code must be 100% written by hand and extensively commented. No credit will be given otherwise. Any undocumented assembly code will cause you to lose points. You don’t have to comment every line, but at least every “coherent chunk” of assembly should have a comment or two. In particular you must describe where you get what data from, especially when it comes to functions and their parameters/results. You have been warned!
Hints and tips
Assembly resources
Keep in mind that the assembly language functions must fully conform to the x86-64 calling conventions; otherwise, interoperability with C code won’t work. If you’re lost and/or unsure where to start, here’s a list of different resources that can be a good starting point for you to look at:
- If you find yourself wondering how to use a system call, you can use the man command to look up information. For example, to find out how the read system call works, use man 2 read. Sadly the man pages don’t describe the details required to call system calls from assembly language, but this post has everything you’ll need regarding register conventions and system call numbers.
- This site gives a very basic overview of register conventions. It’s specifically for 32-bit registers, but it’s helpful to read over and understand how everything works in relation to each other.
- This site has even more on 32-bit register convention. It also talks about general code structure for assembly with sections on data declarations, load/store instructions, indirect and based addressing, basic arithmetic, control structures (it is really helpful to have a good understanding of this), and I/O functions.
- For a comprehensive overview of functions and syntax, look at this site. It goes over the descriptions of different functions as well as proper syntax.
- This site goes more in depth on subroutines/functions.
- Another comprehensive x86_64 assembly command list is here
- Brown has two cheat sheets that could be useful too - x86_64 & gdb
x86-64 tips and tricks
Here are some more specific tips and tricks in no particular order.
Don’t forget that you need to prefix constant values with $. For example, if you want to set register %r10 to 16, the instruction is
movq $16, %r10
and not
movq 16, %r10
If you want to use a label as a pointer (address), prefix it with $. For example,
movq $sHexDigits, %r10
would put the address that sHexDigits refers to in %r10.
When calling a function, the stack pointer (%rsp) must contain an address which is a multiple of 16. However, because the callq instruction pushes an 8 byte return address on the stack, on entry to a function, the stack pointer will be “off” by 8 bytes. You can subtract 8 from %rsp when a function begins and add 8 bytes to %rsp before returning to compensate. (See the example addLongs function.) Pushing an odd number of callee-saved registers also works, and has the benefit that you can then use the callee-saved registers freely in your function.
If you want to define read-only string constants, the .rodata section is the right place for them. For example:
.section .rodata
sHexDigits: .string "0123456789abcdef"
The .equ assembler directive is useful for defining constant values, for example:
.equ BUFSIZE, 16
You might find the following source code comment useful for reminding yourself about calling conventions:
/*
* Notes:
* Callee-saved registers: rbx, rbp, r12-r15
* Subroutine arguments: rdi, rsi, rdx, rcx, r8, r9
*/
In Unix and Linux, standard input is file descriptor 0.
Linux system calls do not preserve %rcx or %r11, so make sure you save them on the stack if their contents need to be preserved across a system call.
The GNU assembler allows you to define “local” labels, which start with the prefix .L. You should use these for control flow targets within a function. For example (from the echoInput.S example program):
cmpq $0, %rax /* see if read failed */
jl .LreadError /* handle read failure */
...
.LreadError:
/* error handling goes here */
Hint about determining which characters are printable: the range of printable ASCII characters is 32 through 126, inclusive. Any byte value that is not in this range should be printed as “.” (period). Note that “.” has ASCII value 46.
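The range check in the hint translates into a single comparison. The following C sketch is one possible way to write hex_to_printable; the two constant names are made up for this example.

```c
/* Bounds of the printable ASCII range given in the hint above.
 * (Constant names are hypothetical.) */
#define FIRST_PRINTABLE 32
#define LAST_PRINTABLE 126

/* Return byteval unchanged if it is printable, otherwise the ASCII
 * code for '.' (46). */
long hex_to_printable(long byteval) {
    if (byteval >= FIRST_PRINTABLE && byteval <= LAST_PRINTABLE) {
        return byteval;
    }
    return '.';
}
```

Unit tests for a function like this are most useful at the boundaries of the range: 31, 32, 126, and 127.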
Example assembly language programs
For reference, here are links to a couple of example assembly language programs which use the read and write system calls.
- hello.S: prints a Hello, world message
- echoInput.S: reads up to 128 bytes of data from standard input and echoes it to standard output
Example assembly language functions
This section shows implementations of a couple of assembly language functions you might find useful.
Here is an assembly language function called strLen which returns the number of characters in a NUL-terminated character string:
/*
* Determine the length of specified character string.
*
* Parameters:
* s - pointer to a NUL-terminated character string
*
* Returns:
* number of characters in the string
*/
.globl strLen
strLen:
subq $8, %rsp /* adjust stack pointer */
movq $0, %r10 /* initial count is 0 */
.LstrLenLoop:
cmpb $0, (%rdi) /* found NUL terminator? */
jz .LstrLenDone /* if so, done */
inc %r10 /* increment count */
inc %rdi /* advance to next character */
jmp .LstrLenLoop /* continue loop */
.LstrLenDone:
movq %r10, %rax /* return count */
addq $8, %rsp /* restore stack pointer */
ret
In C, the declaration of this function could look like this:
long strLen(const char *s);
Unit testing this function might involve the following assertions:
ASSERT(13L == strLen("Hello, world!"));
ASSERT(0L == strLen(""));
ASSERT(8L == strLen("00000010"));
Assignment tips
For this assignment, you should write the C language implementations first. Be sure to write unit tests and check your work along the way.
After writing your C functions, start thinking about how you can translate your implementation into assembly. Be sure to consider register usage and be aware of pushing/popping the stack.
Submitting
Submit a zipfile containing your complete project. The recommended way to do this is to run the command make solution.zip. This will create a file called solution.zip with all of the required files. Important: all of the files in the zipfile must be at the top level, not a subdirectory. For example, if your zipfile is called solution.zip and you run the command unzip -l solution.zip to list its contents, you should see something like the following output:
Archive:  solution.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     1140  09-13-2020 18:42   Makefile
     1053  09-13-2020 18:42   hexfuncs.h
     3959  09-13-2020 18:42   tctest.h
      884  09-13-2020 18:42   c_hexfuncs.c
      877  09-13-2020 18:42   c_hexmain.c
     3459  09-13-2020 18:42   asm_hexfuncs.S
      953  09-13-2020 18:42   asm_hexmain.S
     1458  09-13-2020 18:42   hextests.c
     3948  09-13-2020 18:42   tctest.c
---------                     -------
    17731                     9 files
Upload this zipfile to Gradescope for both parts 1 and 2 of Assignment 2. Make sure to include your name and email address in every file you turn in (well, in every file for which it makes sense to do so anyway!)