Implementing Simple Garbage Collector

Implementing Simple Garbage Collector

Edit Implementing Simple Garbage Collector on GitHub


For thousands of times I have been thinking about this question: How come it seems to me that Java has access to infinite memory?Ā Well, let’s give the glorious to Garbage Collector. As a developer, the urge to have your own Garbage Collector is just like gravity, all you need is just a little push from me šŸ˜€

After some serious searching online and experimenting, I finally have my own GC.Ā First, let me throw some basic introductions to get started.

 

 

Why Garbage Collection


As I said previously, garbage collection is to manage memory automatically so that Developer won’t have to do that manually and have access to relatively “infinite” memory. There are bunch of advantages.

Now we know we should use garbage collection. But, what is the definition of “Garbage” in programming?Ā ItĀ means memory previously allocated that is no longer being used. For the illusion of infinite memory to work, the language needs to be very safe about ā€œno longer being usedā€.

1.Any object thatā€™s being referenced by a variable thatā€™s still in scope is in use.
2.Any object thatā€™s referenced by another object thatā€™s in use is in use.

Note that the second rule is recursive. If object A is referenced by a variable, and it has some field that references object B, then B is in use since you can get to it through A.

 

 

Mark-and-Sweep Garbage Collection


There are bunch of algorithms for GC, I am gonna use a straightforward one: mark and sweep. The name is quite self-explaining. The algorithm can be described as, well, mark and then sweep.

1.Mark: start from the root points, traverse the entire object graph. Every time you reach an object, set a ā€œmarkā€ bit on it to true.
2.Sweep: once mark process is done, find all of the objects whose mark bits are not set and delete them.

Suppose the entire object graph before garbage collection is:

gc-before gcThen we come to the mark process:

gc-markAfter the sweep process, the entire object graph will be:

sweep

 

 

Implementing Mark-and-Sweep Garbage Collector


Implement Class Objects

Before implementing a garbage collector, we have to implement a generic “class” object.

typedef enum {
  OBJ_INT,
  OBJ_PAIR
} ObjectType;

typedef struct sObject {
  ObjectType type;
  unsigned char marked;

  /* The next object in the linked list of heap allocated objects. */
  struct sObject* next;

  union {
    /* OBJ_INT */
    int value;

    /* OBJ_PAIR */
    struct {
      struct sObject* head;
      struct sObject* tail;
    };
  };
} Object;

– As you can see, we only define two types of objects for simplicity: integer and pair. However, the definition of pair is recursive, it can contains another pair or integer. With this definition, we can cover a large portion of classes in Java or any languages.

– Besides, using UNION is also quite tricky. The main ObjectĀ struct has a typeĀ field that identifies what kind of value it is either an int or a pair. Then it has a union to hold the data for the int or pair. AĀ union is a struct where the fields overlap in memory. Since a given object can only be an int orĀ a pair, thereā€™s no reason to have memory in a single object for all three fields at the same time.

– There’s a flag to determine whether the current object has been marked, unsigned char marked.

– In order to sweep unused objects in the future, we have to keep track of all objects in our virtual machine no matter it is used or not. In my example I use a linked list to keep track of all the objects. TheĀ struct sObject* nextĀ is for constructing the linked list. The head of linked list should be saved in virtual machine.

 

 

ImplementĀ Virtual Machine

Since mostĀ virtual machines are implemented using stack, stack should be a good choice to do the work. The stack will be used to store local variables and temporary variables needed in the middle of an expression. Besides, as our previous discussion, we need a head node for linked list to keep track of all object nodes within virtual machine.

#define STACK_MAX 256
typedef struct {
  Object* stack[STACK_MAX];
  int stackSize;

  /* The first object in the linked list of all objects on the heap. */
  Object* firstObject;

  /* The total number of currently allocated objects. */
  int numObjects;

  /* The number of objects required to trigger a GC. */
  int maxObjects;
} VM;

When we want to initialize a VM, we should use following function:

VM* newVM() {
  VM* vm = malloc(sizeof(VM));
  vm->stackSize = 0;
  vm->firstObject = NULL;
  vm->numObjects = 0;
  vm->maxObjects = 8;
  return vm;
}

 

 

Manipulate Stack With Push And Pop

Once we have the virtual machine, we have to manipulate it – do push and pop with the stack. Before defining the push and pop function, let’s have a helper function first. If we want to initialize an object in the virtual machine, we should use following function:

Object* newObject(VM* vm, ObjectType type) {
  if (vm->numObjects == vm->maxObjects) gc(vm);

  Object* object = malloc(sizeof(Object));
  object->type = type;
  object->next = vm->firstObject;
  vm->firstObject = object;
  object->marked = 0;

  vm->numObjects++;

  return object;
}

Note that after the object has been initiated, it is still NOTĀ inside the stack. It is just inside the linked list for tracking all objects in virtual machine. We still have to do the work to push it into stack.

 

The push function should be categorized according to our definition of object types, pushIntĀ and pushPair:

void push(VM* vm, Object* value) {
  assert(vm->stackSize < STACK_MAX, "Stack overflow!");
  vm->stack[vm->stackSize++] = value;
}

void pushInt(VM* vm, int intValue) {
  Object* object = newObject(vm, OBJ_INT);
  object->value = intValue;

  push(vm, object);
}

Object* pushPair(VM* vm) {
  Object* object = newObject(vm, OBJ_PAIR);
  object->tail = pop(vm);
  object->head = pop(vm);

  push(vm, object);
  return object;
}

 

Note that the assertĀ function is just to send warning message when something goes wrong:

void assert(int condition, const char* message) {
  if (!condition) {
    printf("%s\n", message);
    exit(1);
  }
}

 

The popĀ function should be:

Object* pop(VM* vm) {
  assert(vm->stackSize > 0, "Stack underflow!");
  return vm->stack[--vm->stackSize];
}

 

 

Mark Process In Garbage Collection

As you maybe already noticed, this is a simple DFS problem.

void mark(Object* object) {
  /* If already marked, we're done. Check this first to avoid recursing
     on cycles in the object graph. */
  if (object->marked) return;

  object->marked = 1;

  if (object->type == OBJ_PAIR) {
    mark(object->head);
    mark(object->tail);
  }
}

void markAll(VM* vm)
{
  for (int i = 0; i < vm->stackSize; i++) {
    mark(vm->stack[i]);
  }
}

 

 

SweepĀ Process In Garbage Collection

Well, since we have the linked list to keep track of all objects in current virtual machine, this problem is just piece of cake. We just have to go through the linked list and free objects with unmarked flag.

void sweep(VM* vm)
{
  Object** object = &vm->firstObject;
  while (*object) {
    if (!(*object)->marked) {
      /* This object wasn't reached, so remove it from the list and free it. */
      Object* unreached = *object;

      *object = unreached->next;
      free(unreached);

      vm->numObjects--;
    } else {
      /* This object was reached, so unmark it (for the next GC) and move on to
       the next. */
      (*object)->marked = 0;
      object = &(*object)->next;
    }
  }
}

 

 

Full Garbage Collection Function

void gc(VM* vm) {
  int numObjects = vm->numObjects;

  markAll(vm);
  sweep(vm);

  vm->maxObjects = vm->numObjects * 2;
}

Note that vm->maxObjects = vm->numObjects * 2Ā is to dynamically set the maximum size of virtual machine.

Different programs will have different memory need. If the maximum size of virtual machine is too big, the garbage collection will have no necessity since we already have enough space. If the maximum size of virtual machine is too small, then the garbage collection process will be triggered too many times and waster the performance.

Thus, in this simple garbage collector, we set the maximum size to be twice the size of objects that is not freed by last GC process.

There are bunch of test cases in my github, please run and test the garbage collector if interested šŸ˜€

 

Reference


Baby’s First Garbage Collector

 

2 Comments

  1. As a garbage collector connoisseur, I am always interested in hearing new voices speak about the fascinating world of garbage collection. upon the request of the garbage collector society of vienna, i was commended with the task of analyzing and correcting the monography written by the aficionado martin y. xia entitled “Implementing Simple Garbage Collector”, which i review hic-et-nunc. i found with great surprise a true vision and erudition in this young gentlemen. his ad-hoc explanations are concise and technically parsimonious, avoiding to overly Exaggerated jargon that often plagues our craft. his code examples, albeit unprofessional and non-quoted, make a clear illustration of how the unwary shall proceed into the first steps of garbage collecting, while his explanation draw an exceptional Maieutic and his conclusions are vibrantly pleasant to the eye, discarding any dangers of the non-sequitur. as a whole, i find myself not-uninterested in the future writings of mr. xia.

    1. Thank you very much on commenting my post Mr VonBon. I really appreciate it.
      There are still bunch of stuffs I am not clear enough about GC. However, I think I have a big picture of how the GC works basically after I’ve done all the searching and experimenting. This post is to give a light to all the newbies on GC so that they won’t get terrified by the mysterious GC šŸ˜€

      Please keep visiting my website and I will try my best not to disappoint my readers šŸ˜€

Leave a Reply

Your email address will not be published. Required fields are marked *