A Dastardly Day of DOS Development Debugging

Posted by Astryl on July 10, 2019, 2:51 p.m.

Right on the heels of my last blog, here's another! This one is going to be fresh off the back of a painful debugging session, while the details are still fresh in my mind.

But to boil it down to basics: My Ludum Dare game should've crashed and burned from the outset.

Let's dive in!

Early Investigations

Yesterday I wanted to create a bit of extra music for the Ludum Dare game as a 0.3 content patch - the track I made during the Jam was ridiculously short. So I just did a simple change, added an extra pattern or two, and… the game froze on startup, taking DOSBox with it.

So I did the logical thing and brought a bunch of the demo-songs from RAD Tracker over to my project folder. Some worked, some didn't.

At first I thought it was a size thing: All of the smaller modules worked, but the two largest ones would crash.

But after doing a bit of probing via the Watcom Debugger, I determined it was a case of "I don't even know what's going on here anymore". A line in my Emacs scratch-file (I use it for live note-taking while working through problems) sums up my thoughts:

What follows is a look into a huge mountain of code with ridiculously shaky foundations that shouldn't have worked.

Part One: Register Length

It turned out that the crash-on-load problem was fairly easy to spot and solve. In my RAD Player initialization code, I have a bit of code that reads in the "Order List" (The order in which the patterns play).

It read like this:

See that second and third line? SI is 16-bits in length, whereas we're working with the 32-bit version, ESI, in line 1. This causes truncation, and because ESI is storing an absolute pointer to an address in flat-mapped memory space, it means that only in very specific circumstances will this code actually work - those circumstances being that the module must fit into the nearest 64K boundary depending on wherever the upper 16-bits of ESI are pointing to. If the offset that ESI pointed to was only 4K away from a 64K boundary, I'd only have 4K to work with for a song.

I barely dodged hitting this bug during jam-time, and only because my two test songs were short, and barely didn't hit that 64K boundary: The way I loaded the game PAK and other files before the music module meant that I barely had 2K to work with, and both of my test songs were within that limit (And only just).

So how did this bug slip in? The same way all the other register length bugs did in this particular module: It's a port of a port of 16-bit assembly code.

RAD Player's example source was originally written for MASM and Real Mode operation. I ported it from 16-bit MASM to 16-bit NASM, then when I moved my engine over to Protected Mode, I ported this too…

…let's just say that when reading through nearly 1000 lines of assembly and legitimate 16-bit register uses, your 16 and 32-bit registers tend to blend together a lot.

So, now I'm able to load any module I want! Everything is fine now, right?

Wait, that doesn't sound right…

So in the middle of listening to some of the demo songs, I noticed some weird holes in the music.

I cross-referenced this by playing the songs in RAD Tracker, and noticed that specific effects were causing the player to "crash" and recover on entering a new pattern.

Fortunately this one was relatively easy to fix: The effect jump-table was out-of-order, meaning that instead of Effect 3 being the Tone Slide, it was jumping to the Portamento Down effect, and a few other issues (Like the Jump to Line effect being entirely broken).

So all is well for now…

Oh wait, now it's crashing when loading the songs that used to work.

Yup. During the previous debugging I had switched around the load order so that the music module loaded first, and was followed by the sound bank and finally the game PAK.

What was causing the crash this time was me forgetting to close file handles. I had several places where I was opening files, and then forgetting to close them! Jam code!

Fixing this got me a bit further, but the game was now crashing again, but in a different place!

After a long bout of time in the debugger, I found it was crashing during the first draw state, where the game draws the startup title/logos.

And this opened up another can of worms…

Shaky Foundation

Here's where I discovered a massive flaw in some of my core assembly code!

I won't go into code detail, but what was happening is that some of the drawing code was not restoring some key registers after they were done!

Now because the music player was triggering on a timer interrupt, this meant that I wasn't restoring some key state on returning from one of those interrupts, making certain songs causing some really nasty crashes.

I somehow threaded the needle on that one: Again, it's a wonder the game boots at all in this state.

I did a quick fix by making sure to restore any essential registers I was clobbering. Problem solved! Except it was still sort-of crashing!

Well, more of a black-screen hang that I could quit out of instead of an unrecoverable crash, so progress was made, but still!

The Timing System

I took an hour-long rage break at this point to clear my head, and during this time I actually hit upon a potential reason for this current crash: The timer.

Not too long before Ludum Dare started I put the finishing touches on the RAD Playback module, but it needed something my game didn't have in order to run correctly: A timer able to call it exactly 50 times per second.

This isn't too difficult to do: You can reprogram the Programmable Interrupt Timer to any speed within its range, and that's what I did: It interrupts 50 times a second, calls a special callback that increments a tick counter and plays a single line of music.

This works perfectly!

What I then did, just prior to the jam starting, was attempt to use this timer as a game timer. This is where things went wrong.

My basic game timer is simple: If a local variable called ticks is not equal to the global timer tick, increment ticks by one and do an update.

Except it was more of a while loop, so it could 'catch up'.

I just realized during this debugging session that if it was really lagging behind, it wouldn't be able to catch up anyway, thus creating an infinitely growing difference between the current tick and the actual timer tick!

Because draws only happened after all update ticks were consumed, it never actually got around to the drawing phase!

I fixed this by untying the update tick from the music timer and instead doing one update per frame, and just sync to refresh rate. I'm going to revisit the timer in the future, but it really needs more attention than I initially gave it.

Conclusion

During the course of this horrible debug session I discovered a whole host of potential issues and things that by all rights should've tripped me up during Ludum Dare, but somehow didn't.

Fortunately, I've fixed most of them now: Some problems are still present (Two of the music effects still don't work correctly, for instance), and I have a lot of work to do to get the game finished, but at least I've removed a few difficult obstacles on the way to doing that.

In total I made 68 changes across 15 different source files, and I fixed a bunch of smaller bugs. Everything is running a lot smoother now!

To end this off, I'll leave a little tidbit of information for anybody wanting to debug their game code: You can use the Watcom debugger to step through the game's code, but you will mode-switch and can't just jump back into graphical mode again, so it's a once-off sort of thing.

For Protected Mode applications, you need to run Watcom with the arguments /tr=rsi <executable>. Once in Watcom's debugger, a convenient way to run up to a point in your code is to look for the line number in your IDE/Editor, hit the 'g' key in Watcom and type in the name of the C or Assembly unit followed by @ and the line number. Like so:

This will play the game up until that point, then break out back to text mode and allow you to step through the code from that point while monitoring local variables, the stack and/or registers, and won't go back to graphics mode unless you create some sort of debug keypress to allow you to re-set your graphics mode (I'm intending to work on that next time I have some time for the project).

For now, that's me done with this for a day or two. I'm off to find some coffee and something to read that isn't code.

Comments