80% of the problems, caused by 20% of the bugs

If you’ve been around the internet long enough, you’ve probably heard of the 80-20 rule, or Pareto Principle. For some reason, everything seems to fall into an 80-to-20 split: 80% of the world’s wealth is held by 20% of the population, or, in my case, 80% of my problems were caused by 20% of my bugs, even when the connection wasn’t obvious. I was skeptical about the claim before starting my most recent project; now I am a full-on believer.

Recently, I’ve been writing my own dialect of Lisp. I’m mostly doing this as an exercise in building a high(er) level language for embedded systems, cutting a lot of the fat that current high level languages like Python carry. And I chose Lisp because Lisp is cool. All of that is a blog post for another day; the important part here is that I decided this thing should run on every system ever made, which meant basing it on a very common language (especially important since the Lisp is interpreted, not compiled), so I chose ANSI C89.

When I started writing my Lisp interpreter, I was moving very fast. In C, you are the memory manager: if you don’t do everything correctly, your program can start executing undefined code or returning undefined values. Early on, most of my bugs were uninitialized variables. In the following code, for example, we have no idea what the value of x will be. It could be 0, it might not be. It’s really up to the compiler whether it gets initialized, and up to wherever it lands on the stack whether it already holds a leftover value.

int x;                     /* declared but never initialized */

printf("x is %d\n", x);    /* x holds an indeterminate value: could print anything */

Now, these values are easy enough to find: just pass the -Wuninitialized warning flag to the GNU C Compiler (or Clang), and boom, every uninitialized variable should hopefully get flagged. This worked for me, so it was an easy fix overall. However, as I continued on with the project, I ran into stranger memory management errors that seemed, by all odds, to be coming from a billion different directions.
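For reference, the warning setup I mean looks something like the invocation below. The file name lisp.c is just a placeholder for whatever you are building, and note that older GCC releases only perform the uninitialized-value analysis when optimization is enabled, so adding -O1 or higher doesn’t hurt:

gcc -std=c89 -O1 -Wall -Wuninitialized -o lisp lisp.c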

Now, I have been programming in C/C++ for a little under 10 years at this point, having started sometime around 2014, so I know my way around. But I also have a tendency to be very lazy when programming, especially for proof-of-concept work, which usually leads to an “I’ll finish this up later, it’s not like this is production code or anything” mentality. In practice that means neglecting memory management, since “you can just allocate a buffer of like 1KB and it should never overflow… right?”

Working on my Lisp interpreter, I first ran into problems with larger programs having their variables overwritten at random. In the interpreter, a variable just holds a direct value: usually a pointer to memory, or an integer or decimal value. If it is a pointer to memory, that pointer also gets added to another list holding all active memory allocations. This allocation list lets the garbage collector determine whether or not a piece of memory is still in use, and free it if it no longer is. It’s a pretty nice system, if only C had it.

I was quickly seeing allocations being overwritten with seemingly garbage data, which caused memory that was still in use to be freed, which in turn caused undefined behavior when that memory was used again later. This is a particularly big problem on Microsoft Windows, whose allocator is far less forgiving about touching freed or out-of-bounds memory. Basically, on Windows, if you free it, it’s gone forever. This would usually cause a crash on Windows but not on Linux, so for a while I just assumed the Windows version was completely broken because “Windows”.
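To make that setup concrete, here is a rough sketch of the kind of bookkeeping I’m describing. The names and sizes (Value, Allocation, allocList, MAX_ALLOCS) are invented for illustration; they are not the interpreter’s actual code:

/* A variable holds either an immediate number or a pointer to heap data. */
typedef enum { TYPE_INT, TYPE_FLOAT, TYPE_PTR } ValueType;

typedef struct {
    ValueType type;
    union {
        long   i;   /* integer value        */
        double f;   /* decimal value        */
        void  *p;   /* pointer to heap data */
    } as;
} Value;

/* Every heap pointer handed to a Value is also recorded here, so the
 * garbage collector can tell which allocations are still in use and
 * free the ones that no longer are. */
typedef struct {
    void *ptr;
    int   inUse;
} Allocation;

#define MAX_ALLOCS 256
static Allocation allocList[MAX_ALLOCS];
static int        allocCount = 0;

static void trackAllocation(void *p)
{
    allocList[allocCount].ptr   = p;
    allocList[allocCount].inUse = 1;
    allocCount++;   /* note: no bounds check, which foreshadows what comes next */
}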

The next memory management problem was the freeing of garbage data. If the variable list overflowed into the allocation list, garbage data could be written over entries that were supposed to be tracked allocations, and when those entries were “no longer needed” the system would try to free memory it had no right to free. Likewise, if the allocation list overflowed, it spilled into the openFiles list, which held references to files the interpreter had open. Again, a file could get closed out from under the interpreter, resulting in undefined behavior, especially if that file descriptor was never open in the first place. This one bit me while porting the interpreter to classic Mac OS version 6. And again, I chalked it up to “man, Apple really sucks”.
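The failure mode is easy to reproduce in miniature. In the hypothetical layout below, three fixed-size lists are defined next to each other; the exact memory layout is up to the compiler, but in practice adjacent definitions often end up adjacent in memory, so a write past the end of one list lands squarely in the next:

#include <stdio.h>

#define MAX_VARS   64
#define MAX_ALLOCS 64
#define MAX_FILES  16

static long  varList[MAX_VARS];     /* simplified: variables as plain numbers   */
static void *allocList[MAX_ALLOCS]; /* pointers the garbage collector "owns"    */
static FILE *openFiles[MAX_FILES];  /* files the interpreter currently has open */

static int varCount = 0;

static void addVariable(long v)
{
    /* No bounds check: once varCount reaches MAX_VARS, this write lands in
     * whatever follows varList in memory. With a layout like the one above,
     * that means garbage "allocations" the collector will later try to
     * free(), and eventually garbage "files" that will get fclose()d. */
    varList[varCount] = v;
    varCount++;
}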

These bugs really took a toll on me, especially because I was watching them in debuggers and just magically seeing values overwrite other values, for seemingly no reason. It was especially strange because when the program did eventually crash, the crash point was usually somewhere odd, very far from where the actual problem was taking place. And by this point, the memory management side of the program had long since left my focus; I just assumed “it’s worked this far, it should keep working”. So I wasn’t even really considering that bad memory management could be causing all of these errors, which led to multiple long nights of adding unneeded sanity checks and attempted error correction. Remember kids: fix the problem, don’t try to patch around it.

Eventually I was running out of things to try; by that point, it seemed like every internal function had some crashing bug. So I decided to run the program through Valgrind, a memory debugger (among other things), and man, I should have done that much earlier. To my absolute horror, I was making overflow write after overflow write. I wish the GNU Debugger had been more up front with me about that. Once I fixed all of the overflow writes by keeping track of buffer sizes and reallocating when needed, all of the bugs magically went away, and with them went all of the shoddy patchwork. Fixing about three buffer overflows made the program about 1000% more stable.
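The fix itself is nothing exotic. The exact code is specific to my interpreter, but the general pattern is the usual one: each list tracks its capacity alongside its count, and grows itself with realloc before any write that would run off the end. A rough sketch (the List type and function names here are illustrative, not my actual implementation):

#include <stdlib.h>
#include <string.h>

typedef struct {
    void   *items;    /* the buffer itself             */
    size_t  count;    /* elements currently stored     */
    size_t  capacity; /* elements the buffer can hold  */
    size_t  itemSize; /* size of one element in bytes  */
} List;

/* Grow the buffer (doubling) whenever it is about to overflow.
 * Returns 0 on success, -1 if realloc fails. */
static int listReserve(List *l, size_t needed)
{
    void  *grown;
    size_t newCap;

    if (needed <= l->capacity)
        return 0;

    newCap = l->capacity ? l->capacity * 2 : 8;
    while (newCap < needed)
        newCap *= 2;

    grown = realloc(l->items, newCap * l->itemSize);
    if (grown == NULL)
        return -1;

    l->items    = grown;
    l->capacity = newCap;
    return 0;
}

/* Append one element, growing the buffer first if necessary. */
static int listPush(List *l, const void *item)
{
    if (listReserve(l, l->count + 1) != 0)
        return -1;
    memcpy((char *)l->items + l->count * l->itemSize, item, l->itemSize);
    l->count++;
    return 0;
}

Running the interpreter back under Valgrind afterwards (something along the lines of valgrind ./lisp test.lisp, where the file name is just an example) is a quick way to confirm the invalid writes are actually gone.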

Now, I’m not going to pretend my Lisp interpreter is anywhere close to done (stay tuned if you’re curious; I’ll probably make a few more posts here about it in the lead-up to version 1.0), but it sure as hell is a lot more stable. So, in closing: somehow, I guess the 80-20 principle is true. If you are having a problem with your program, or a program you are using, look for the most obvious sources of error first. In all likelihood, the obscure bugs you have to work to trigger aren’t the ones crashing your program; it’s probably the boring ones like mine, even if it doesn’t seem like it at first.
