Python vs. C: file reading performance

Posted by spirulence on May 3, 2012, 2:07 a.m.

TL;DR

I am a Python guy. I write Python at work, I write Python for fun, and I've even dabbled with writing Python outside in the fresh air. Someday I hope I can plug a keyboard into a Kindle and actually code outside comfortably.

I've also been reading a textbook called Compiler Design in C lately. I've just gotten to the part where the author describes a relatively complex way of reading files with a minimum of copying.

(Coming from a background that rarely cares about performance being better than "good enough," it's different to be reading about designing for high performance in the first place.)

WHY??

In the text the author claims that "MS-DOS read and write times improve dramatically when you read 32k bytes at a time." I had to test this, and I figured I could pit C vs. Python in a very shallow, distorted way at the same time.

The Setup

I originally did this test reading the same small file chunk over and over again, but I realized that this probably takes advantage of OS caching and becomes a test of this caching rather than of the speed of the two languages.

So I set up an 8GB file, filled with the string "0123456789ABCDEF" over and over and over. Then, for each buffer size, each language does 2,000 sequential reads from the file, one buffer's worth per read.
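For anyone who wants to try something similar, here is a minimal Python 3 sketch of that setup. It is not the code linked below; the function names, the chunked writing, and the use of time.perf_counter() are my own assumptions.

```python
import time

PATTERN = b"0123456789ABCDEF"  # the 16-byte string described in the post


def make_test_file(path, size_bytes, chunk_bytes=1 << 20):
    """Fill `path` with the repeating pattern until it is `size_bytes` long."""
    # Build one chunk whose length is a multiple of 16 so the pattern
    # stays aligned from one write to the next.
    chunk = PATTERN * max(1, chunk_bytes // len(PATTERN))
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(remaining, len(chunk))
            f.write(chunk[:n])
            remaining -= n


def time_sequential_reads(path, buffer_size, reads=2000):
    """Do up to `reads` back-to-back reads of `buffer_size` bytes.

    Returns (elapsed_seconds, total_bytes_read). Stops early at end of file.
    """
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        for _ in range(reads):
            data = f.read(buffer_size)
            if not data:
                break
            total += len(data)
    elapsed = time.perf_counter() - start
    return elapsed, total
```

Calling make_test_file("big.dat", 8 * 1024**3) and then time_sequential_reads("big.dat", 32 * 1024) would roughly mirror one data point of the test described above ("big.dat" is a made-up name).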

Pitfalls

Sequential and random reads are known to perform differently. I would probably have gotten more representative results if I had done a series of random reads instead of sequential ones.
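A random-read version might look something like the sketch below (Python 3). This is not something I ran for the numbers in this post; the function name, the fixed seed, and the choice to align offsets to multiples of the buffer size are all assumptions.

```python
import os
import random
import time


def time_random_reads(path, buffer_size, reads=2000, seed=0):
    """Read `buffer_size` bytes from `reads` random offsets in the file.

    Returns (elapsed_seconds, total_bytes_read).
    """
    file_size = os.path.getsize(path)
    if file_size < buffer_size:
        raise ValueError("file is smaller than one buffer")

    # Pick the offsets up front so the random number generator is not
    # part of the timed section. Offsets are multiples of buffer_size
    # (an assumption, chosen so every read fits inside the file).
    rng = random.Random(seed)
    last_block = (file_size - buffer_size) // buffer_size
    offsets = [rng.randint(0, last_block) * buffer_size for _ in range(reads)]

    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            total += len(f.read(buffer_size))
    elapsed = time.perf_counter() - start
    return elapsed, total
```

Note that a file much smaller than RAM will likely end up in the OS cache after the first pass, so this only measures the disk if the file is large.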

2,000 iterations is not really enough iterations to establish behavior solidly, but I didn't actually think of doing random reads until just now, and there was no way I was going to set up a 40GB file so that I could do 10,000 reads of 4MB each.

I didn't do a whole lot of research into the buffering modes that Python offers for doing file reads. Some of those would make a difference. I have a feeling that normal file reads are internally buffered and copied at least once. That's a huge advantage for C, because read() is purported to allow the OS to copy straight from disk into your buffer if the buffer is the right size. At least it was allowed in 1990 when this book came out.
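For reference, here is a rough Python 3 sketch of two ways to read the same file: the default buffered open(), and an unbuffered open() with readinto() filling a preallocated bytearray. The function names are made up for illustration. I'm only assuming this removes Python's own buffering layer; what the OS does underneath is a separate question, and I haven't measured either variant.

```python
def read_buffered(path, buffer_size, reads):
    """Default open(): reads go through Python's io.BufferedReader."""
    total = 0
    with open(path, "rb") as f:
        for _ in range(reads):
            data = f.read(buffer_size)  # allocates a new bytes object each time
            if not data:
                break
            total += len(data)
    return total


def read_unbuffered_into(path, buffer_size, reads):
    """buffering=0 gives a raw io.FileIO; readinto() reuses one buffer."""
    buf = bytearray(buffer_size)  # allocated once, overwritten on every read
    total = 0
    with open(path, "rb", buffering=0) as f:
        for _ in range(reads):
            n = f.readinto(buf)  # number of bytes placed in buf, 0 at end of file
            if not n:
                break
            total += n
    return total
```

Timing both with the same buffer sizes would show how much of the gap to C comes from Python's buffering rather than from the interpreter itself.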

The Results

So vanilla Python reads are half as fast as C's read(). Big whoop. I was expecting much worse, perhaps 5-7x slower.

At least on Windows 7, these limited benchmarks indicate an optimal C buffer size somewhere from 32K to 1M, inclusive. I'm convinced that the high read speeds below 32K for C would disappear entirely if I were doing random reads.

For Python, I'm not sure what to recommend. The highest speeds were with 4K, but that just seems too low to make sense. More research required.
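If I do pick this up again, a sweep over buffer sizes could be sketched like this in Python 3. The size list (powers of two from 4K to 4M) and the function names are assumptions, not the exact set I tested.

```python
import time

# Hypothetical sweep: powers of two from 4K up to 4M.
DEFAULT_SIZES = tuple(4096 * 2**i for i in range(11))


def throughput_mb_per_s(path, buffer_size, reads=2000):
    """Sequentially read up to `reads` buffers and return MB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        for _ in range(reads):
            data = f.read(buffer_size)
            if not data:
                break
            total += len(data)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        return float("inf")  # timer resolution too coarse for this run
    return total / (1024 * 1024) / elapsed


def sweep(path, sizes=DEFAULT_SIZES, reads=2000):
    """Map each buffer size to its measured throughput."""
    return {size: throughput_mb_per_s(path, size, reads) for size in sizes}


def best_size(results):
    """Return the buffer size with the highest throughput."""
    return max(results, key=results.get)
```

Running the sweep several times and comparing the results would help tell a real optimum from noise, which a single pass like mine can't do.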

The Stuff

The Excel Spreadsheet

The Code

Comments

valkyriegames 12 years, 7 months ago

What can I say… go C! XD

That's an interesting test you did there. Thanks for sharing.

Rob 12 years, 7 months ago

Quote:
2,000 iterations is not really enough iterations to establish behavior solidly, but I didn't actually think of doing random reads until just now, and there was no way I was going to set up a 40GB file so that I could do 10,000 reads of 4MB each

…I just so happened to have a 40gb .txt file filled with random integers sitting on my harddrive for some reason….

It didn't take that long to generate, and I'm sure it would have gone faster if I were using a static string instead of random ints.

So why exactly are you against doing that?

spirulence 12 years, 7 months ago

Quote:
…I just so happened to have a 40gb .txt file filled with random integers sitting on my harddrive for some reason….

It didn't take that long to generate, and I'm sure it would have gone faster if I were using a static string instead of random ints.

So why exactly are you against doing that?

My main computer at the moment is a slightly older Dell Latitude that only has a 250 GB drive, and I only have 43GB free. (I actually thought that number was closer to 20…) Windows USED to get wonky when you had less than 10% drive space remaining.

If I come back to these tests I might test that assumption, as it appears I do actually have *just* enough space to squeak it out. I'll probably also test random reads in a file as well as on my Linux box, which I know has 40GB free.

Iasper 12 years, 7 months ago

My little brother once made a program that wrote the current date and time to a log. At least a quarter of my hard disk is used by that useless text.

firestormx 12 years, 7 months ago

Iasper: Truncate that log. =/

Also, screw Python. Just one more reason to hate it. >:(

valkyriegames 12 years, 7 months ago

@Iasper That reminds me of the time I made a program that generated empty text files on my high school computers. Ended up making gigs worth of empty text files. I showed it to my IT teacher cos I thought it was cool, but he thought I was making a virus and was shocked at me .___. At least I didn't get into too much trouble.

JuurianChi 12 years, 7 months ago

I did that too.

Got extra credit.

Didn't help my overall grade though.

Arcalyth 12 years, 7 months ago

What does FSX have against Python? XD

Rob 12 years, 7 months ago

The lack of curly braces, I think. Or at least that's what he was saying while drunk.

Astryl 12 years, 7 months ago

The lack of a clear structure is more like it.

Of course, depending on which C/C++ functions you use for the IO, you will get different results. Inline assembly still beats the DOS calls hands down (if you know how to negotiate with the OS for direct device access, which is far easier in a Linux-based system than in Windows).

Good job on those stats though. It's more than I'd bother doing, TBH. :3