Assn 5: UTF-8

Due: 5:00pm, Friday, September 26. Value: 40 pts.

In this assignment, you'll implement two functions to manipulate a Unicode string stored using the UTF-8 encoding. [UTF-8 notes from class]

In C, a char is a single byte (8 bits on essentially every modern platform), and so a char array is most easily used for representing a string of ASCII characters. C's built-in string manipulation functions treat each byte as one character, which works for ASCII but not for multi-byte encodings.

However, we can use a char array as a Unicode string using UTF-8, as long as we consistently use only functions designed specifically for UTF-8. As an example of encoding a string into an array of chars, consider the Spanish word baño. The third letter ñ is not an ASCII character, but it has a Unicode codepoint of U+00F1. Since this does not fit into 7 bits but does fit into 11 bits, UTF-8 represents it using two bytes. Consequently, baño would be represented in memory as an array of six bytes, whose hexadecimal values are 0x62, 0x61, 0xC3, 0xB1, 0x6F, and the terminating 0x00.

letter: b a ñ o
codepoint: U+0062 U+0061 U+00F1 U+006F
bytes: 62 61 C3 B1 6F
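
To see where the bytes C3 B1 come from, here is a minimal sketch of the encoding direction, restricted to codepoints that fit into 16 bits. The function u8encode is a hypothetical helper written for illustration only; it is not part of the handout code and is not one of the functions you are asked to write.

```c
/* Hypothetical helper (not in the handout): writes the UTF-8 encoding of
   codepoint cp into out and returns the number of bytes written.
   Assumes 0 <= cp <= 0xFFFF, so only the 1-, 2-, and 3-byte forms occur. */
int u8encode(int cp, unsigned char *out) {
    if (cp < 0x80) {
        /* fits in 7 bits: 0xxxxxxx */
        out[0] = (unsigned char) cp;
        return 1;
    } else if (cp < 0x800) {
        /* fits in 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    } else {
        /* fits in 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
}
```

For ñ, the codepoint 0xF1 is 11110001 in binary. Its top two bits (11) go into the first byte, giving 110 00011 = 0xC3, and its low six bits (110001) go into the second byte, giving 10 110001 = 0xB1.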

For this assignment, you will complete two utility functions for UTF-8.

int u8get(char *s, int index)

Returns the Unicode codepoint for the Unicode character at index index in the string s. For example, if s is baño represented as above, u8get(s, 3) should return 0x006F, the Unicode codepoint for the o, and u8get(s, 2) should return 0x00F1.

You may assume that s contains only codepoints that fit into 16 bits. This means that you need only worry about the one-, two-, and three-byte cases for UTF-8. Also, you can assume that index is between 0 and 1 less than the number of Unicode characters represented in s.
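
The one-, two-, and three-byte cases can be told apart by the high bits of the first byte of each character. The following is a sketch of that classification only, not a solution; the name u8seqlen is a hypothetical helper that does not appear in the handout code.

```c
/* Hypothetical helper (not in the handout): returns how many bytes make up
   the UTF-8 sequence that starts with the byte lead, or 0 if lead cannot
   start a 1-, 2-, or 3-byte sequence. */
int u8seqlen(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx: two-byte form */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx: three-byte form */
    return 0;  /* continuation byte (10xxxxxx) or a longer form */
}
```

Note that the parameter is an unsigned char. Whether plain char is signed is implementation-defined in C, so a byte such as 0xC3 read through a char * may appear as a negative number; converting to unsigned char before masking or comparing avoids surprises.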

int u8find(char *s, int ch)

Returns the index in the string s where the Unicode codepoint ch first occurs, or −1 if the codepoint does not occur in the string. For example, if s is baño represented as above, u8find(s, 0xF1) should return 2.

As before, you may assume that s contains only codepoints that fit into 16 bits, and that ch is itself such a codepoint.
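
Both functions deal in character indices, which differ from byte offsets once a string contains a multi-byte character: in baño, the o is at character index 3 but byte offset 4. As a related illustration (again a sketch, with u8count being a hypothetical name that is not part of the handout), the number of characters in a UTF-8 string can be found by counting every byte that is not a continuation byte.

```c
/* Hypothetical helper (not in the handout): counts the Unicode characters
   in the NUL-terminated UTF-8 string s. Each character contributes exactly
   one byte that is not of the continuation form 10xxxxxx. */
int u8count(const char *s) {
    int n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char) *s & 0xC0) != 0x80) {
            n++;  /* first byte of a character */
        }
    }
    return n;
}
```

With baño written as the C literal "ba\xC3\xB1o", u8count returns 4 even though the string occupies five bytes before the terminator. When writing such literals yourself, be aware that a \x escape absorbs every hexadecimal digit that follows it, so a literal like "\x9Fe" must be split as "\x9F" "e".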

As in prior assignments, the handout code includes three files.

u8.c

Defines utility functions for manipulating UTF-8 encoded strings. This is the only file you will modify for this assignment.

u8.h

Contains prototypes for the functions found in u8.c.

u8test.c

Contains a main function that runs through a battery of tests. If your program passes all the tests with no memory problems reported by valgrind, there is a good chance that your solution is correct. That is not a guarantee, though. The tests are all based on the following Unicode strings:

baño (Spanish for bathroom)
letter: b a ñ o
codepoint: U+0062 U+0061 U+00F1 U+006F
bytes: 62 61 C3 B1 6F
Meßgröße (German for measured variable)
letter: M e ß g r ö ß e
codepoint: U+004D U+0065 U+00DF U+0067 U+0072 U+00F6 U+00DF U+0065
bytes: 4D 65 C3 9F 67 72 C3 B6 C3 9F 65
εἰς (first word of Hendrix's motto as rendered in Ancient Greek on its seal)
letter: ε ἰ ς
codepoint: U+03B5 U+1F30 U+03C2
bytes: CE B5 E1 BC B0 CF 82
سلام (Arabic word for peace, frequently used as a greeting)
letter: س ل ا م
codepoint: U+0633 U+0644 U+0627 U+0645
bytes: D8 B3 D9 84 D8 A7 D9 85
±√b²−4ac (discriminant portion of quadratic formula)
letter: ± √ b ² − 4 a c
codepoint: U+00B1 U+221A U+0062 U+00B2 U+2212 U+0034 U+0061 U+0063
bytes: C2 B1 E2 88 9A 62 C2 B2 E2 88 92 34 61 63

To submit your solution, turn in only the u8.c file.