Stupid Memory Tricks with 64-bit Addressing

I was messing around with the Hacker News data feed. Each item has a unique sequential number, and the number of items is approaching 47 million.

To implement a cache I need to resolve some (very few) of these to addresses in storage. Most of the time it will be very recent indices, but in some rare cases it will be old, low indices. So I have a classic sparse array.

Normally I would make a three-level trie for something like this, keep the top level in memory and allocate chunks as needed. However, I realized there is an easier, minimalist, way.

Since the 64-bit addressing space is vast, I can mmap a very large array, say for a billion items. In my case (items are 32 bits) such an array will be 4GB. But since I am using Linux with virtual memory, it will actually use no real memory until I store something there.

Now I can store indices along with offsets as they come in and use only a handful of 4K pages.

Where do I store the data (to which these offsets point)?

Another giant mmap-ed space, of course. Why not.

Discuss on BBS

index