Chapter 4. C: Data structures

Although C doesn't have the notion of class as in object-oriented languages, it does have the structure, which allows you to define a new data type representing a conglomeration of data. The primary distinction between a class and a structure is that a class can have both variables and methods, while a structure can have only variables.

A structure type definition in C looks like the following.

struct Point {
    int x;
    int y;
};    /* The semicolon here is required! */

Here, we've defined a type called struct Point. In C, struct is part of the type name for every structure defined. (C++ makes this keyword optional, and so the type might just be called Point. But we're talking about C now.) Each struct Point variable has two sub-variables (called fields), x and y. So you can write the following (which does nothing interesting except illustrate a C structure's use).

int main() {
    struct Point p;

    p.x = 50;
    p.y = 100;
    return 0;
}

Notice how, in contrast to Java, a structure does not have to be allocated. It is automatically created at the beginning of the function, and it automatically goes away at the function's end.

You can easily make an array of structures, talk about pointers to structures, or place a structure within another structure.

When you pass a structure as a parameter, you should pass a pointer to the structure, not the structure itself. The reason is that structures tend to contain lots of data, and copying all of the fields into the parameter variable is inefficient.

#include <math.h>

double distanceToOrigin(struct Point *p) {
    return sqrt((*p).x * (*p).x + (*p).y * (*p).y);
}
In fact, the (*ptr).field construct is so common that C includes a shortcut for it: The arrow operator allows you to write ptr->field in place of (*ptr).field. Thus, the following definition is equivalent.
#include <math.h>

double distanceToOrigin(struct Point *p) {
    return sqrt(p->x * p->x + p->y * p->y);
}

4.1. Dynamic memory

Sometimes you want to allocate memory in the course of running a program, akin to the new operator in Java. To do this, you can use the malloc() function included in the C library. The malloc() function takes a single integer, which represents how many bytes the function should allocate, and it returns the memory address of the first byte of memory allocated.

In computing how many bytes to allocate, the sizeof operator is useful: Given a type, the sizeof operator computes how many bytes a value of that type requires. So sizeof(int) would be 4 on most computers (but on some computers it may be 2 or 8), and sizeof(struct Point) would have a value of 8 (or 4 or 16 depending on the computer).

So we could write the following, which allocates an array based on a length given by the user.

int main() {
    int n;
    int *arr;

    printf("How long an array? ");
    scanf("%d", &n);
    arr = (int*) malloc(n * sizeof(int));
    return 0;
}
In the malloc line, we've opted to allocate enough memory to hold n integers. The casting operator in this line is important — if we did not cast to an int*, the C compiler would report that the type returned by malloc() can't be assigned to an int* variable.

When you allocate memory in C, you should deallocate it later to free the space. To accomplish this, you can use the free() function, which takes a pointer as a parameter and marks the memory to which it points as available.

free(arr);
After you deallocate memory, you should not use it again. Neither should you deallocate the memory a second time. Avoiding both of these issues can be pretty painful.

But it should be deallocated. If the program ends, all memory used by the program will be deallocated automatically. But, if the program continues for a long period and unused memory is never deallocated, then the program's memory requirements will grow steadily, leading to strange problems later once memory runs out. This is called a memory leak.

(Java avoids the issue of memory deallocation entirely by letting the computer automatically figure out when memory becomes available. This automatic process, called garbage collection, is complex, and efficient techniques for it were unknown at the time of C's development, so C opts to let the programmer control memory deallocation. For Java's designers, faster computers and more efficient collection techniques, coupled with a different target audience, tilted the scale toward automatic garbage collection instead. We'll examine garbage collection techniques later in this book.)

4.2. Example: Linked list

Figure 4.1 defines a List structure for representing a linked list of numbers. Moreover, Figure 4.1 contains two functions, listCreate() to create a new, empty linked list and listRemove() to remove a number from the list.

Figure 4.1: C implementation of a linked list

#include <stdlib.h>

struct Node {
    int data;
    struct Node *next;
};

struct List {
    struct Node *first;
};

/* listCreate
 *  Creates an empty linked list.
 *
 * Returns:
 *  a newly allocated list, holding nothing.
 */

struct ListlistCreate() {
    struct List *ret;

    ret = (struct List*) malloc(sizeof(struct List));
    ret->first = NULL;
    return ret;
}

/* listRemove
 *  Removes the first occurrence of a number from a linked list,
 *  returning 1 if successful
 *
 * Parameters:
 *  list - the list from which the item is to be removed.
 *  to_remove - the number to be removed from the list.
 *
 * Returns:
 *  1 if item was found and removed, 0 otherwise.
 */

int listRemove(struct List *listint to_remove) {
    struct Node **cur_p;
    struct Node *out;

    cur_p = &(list->first);
    while(*cur_p != NULL) {
        if((*cur_p)->data == to_remove) {
            out = *cur_p;
            *cur_p = out->next;
            free(out);
            return 1;
        }
        cur_p = &((*cur_p)->next);
    }
    return 0;
}

The listRemove() function works with pointers in a peculiar way: It uses cur_p to point to the pointer to the current node; this way, once the desired number is found, we can alter the pointer within the list referencing that node. The more intuitive way of writing this function is to have a cur variable stepping through the list, but this necessitates tracking the previous node, since that node's pointer to its successor will change once cur points to the node to remove.

prev = NULL;
cur = list->first
while(cur != NULL) {
    if(cur->data == to_remove) {
        if(prev == NULLlist->first = cur->next;
        else prev->next = cur->next;
        free(cur);
        return 1;
    }
    prev = cur;
    cur = cur->next;
}
return 0;
By instead remembering the address of the pointer to the current node, the implementation of Figure 4.1 avoids this additional complexity. It's not really a better implementation, but it does illustrate an interesting use of pointers. Incidentally, the version used in Figure 4.1 would be impossible in a language without pointers, such as Java.

Note that it would be tempting to replace lines 39–41 of Figure 4.1 with the following instead. (This change avoids using the out variable, and so we could also delete line 35.)

39              free(*cur_p);
40              *cur_p = (*cur_p)->next/* Bad!! */
This is a mistake, however, because this uses memory after it has already been freed: Line 40 says to look into *cur_p to find its next variable, but we freed *cur_p in line 39, so this memory is technically no longer available. (It looks like this would be safe, since there are no intervening memory allocations between lines 39 and 40, and indeed it would work on many systems; but it's quite possible that the free() function will write over the next field.) Thus, this change results in a wrong program.

4.3. Analysis: Language and library levels

Most modern programming language systems draw an important distinction between the language level of the language system and the library level of the language system. The primary distinction between these two levels is whether the language incorporates special syntax for handling the feature.

Memory allocation is an example where different language designers make differing decisions. C opts to support memory allocation on the library level, by requiring a malloc function to exist in the library. In Java, however, memory allocation is done through the new keyword — this is special-purpose syntax, and so we would say that Java handles memory allocation on the language level.

Another example where different languages make different decisions is in exponentiation. In C and Java, this job is relegated to the pow() function in the libraries. Other languages decide to support exponentiation in the language through the addition of another operator: In FORTRAN, for example, 2 ** 10 represents 210.

In general, language designers want to push as much into the library level as possible, because special syntax both makes the language more complex for people to understand and because it leads to a more complex compiler development process. On the other hand, special syntax is often more convenient for programmers once they understand it. There's an subtle but important line between being convenient and being impossible to learn. No single language feature crosses a language over the line, but the conglomeration of many features does. C and Java are certainly on the simple side of the line; some argue that C++ crosses it. PL/I and Ada, two ambitious language design projects, failed to live up to expectations; many attribute this to buckling under the load of having too many features for programmers to learn.

The decision of whether a feature should be a language or a library feature is also guided by design principles. C, for example, works very hard to keep the language as simple as possible. The reason for this is that C is designed to be easy to implement on new systems — the original idea with UNIX was that you could write a simple C compiler for a new computer system, and from there you compile the UNIX source code. Another reason for C to strive to keep things simple was that the computational and memory limits at that time limited how much the compiler could do. Thus, C pushes much more into the library level than most languages. Design principles also guided FORTRAN to have an exponentiation operator: The language was designed with scientific computation in mind, where exponentiation is common, and the convenience of an additional operator was worth the incremental complexity.

Java's designers, working twenty years after C's designers, did not face these same limits. Moreover, they were developing more for user programs, which tend to be larger programs and which rely on an established underlying system. Thus, Java's designers chose a more complex language level, placing a priority on the application programmer's convenience, rather than the compiler programmer's convenience.