In my last post I used the analogy of a library to describe indexes. However, I used very abstract ideas and did not explain how an index in a database is structured, how it works, and so on. In this post I will get into some of the guts of the most common index in a database: the B-Tree.
I will assume the reader has only the vaguest of ideas regarding indexes. If you are more experienced you may wish to skip ahead to later posts.
First let’s talk about the performance of an index on a database lookup. If you are like I was when I started, back when I needed to fend off velociraptors with my slide rule, you have the vague idea that an index will speed up a table lookup, but how much it speeds the query up is a complete mystery. Was it twice as fast? Three times? I had no idea.
It turns out that searching for the appropriate record through an index is amazingly fast. For an idea of how fast, imagine we have a list of 1,000,000 books and you want to locate a particular one. Further imagine the list of books is not ordered by title. You would need to look at every item in the list until you found the book, which means looking through about 1,000,000/2, or 500,000, items on average.
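To make that concrete, here is a quick sketch of that linear scan in Python; the tiny book list is just an illustrative stand-in for the 1,000,000-title list.

```python
def linear_search(titles, target):
    """Scan an unordered list front to back; on average we look at half of it."""
    for looks, title in enumerate(titles, start=1):
        if title == target:
            return looks          # number of items we had to examine
    return None                   # not in the list at all

# Tiny made-up example list standing in for 1,000,000 unordered titles.
books = ["Little Women", "Moby Dick", "Romeo and Juliet", "War and Peace"]
print(linear_search(books, "Romeo and Juliet"))   # 3
```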
Let’s try a different strategy. Imagine the list of 1,000,000 books is ordered by book title, but there are no little index cards to tell you where each letter starts. A good stab at a strategy would be to find the middle of the list and examine that card. If you were looking for, say, ‘Romeo and Juliet’ and the book at the middle was ‘Little Women’, you would know that you needed to look further down the list. If you took the same strategy again, finding the middle between that point and the end and peeking, and continued until you found your book, you would be doing a binary search. For a list of 1,000,000 books you would need to do this ‘peek and search’ about 20 times. This is certainly better than 500,000 times!
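That ‘peek in the middle’ procedure is just a binary search, and the ‘about 20 times’ figure falls straight out of log2(1,000,000). A minimal sketch, assuming the titles are already sorted:

```python
import math

def binary_search(sorted_titles, target):
    lo, hi = 0, len(sorted_titles) - 1
    peeks = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        peeks += 1
        if sorted_titles[mid] == target:
            return mid, peeks
        elif sorted_titles[mid] < target:
            lo = mid + 1          # target sorts after the middle card
        else:
            hi = mid - 1          # target sorts before the middle card
    return None, peeks

# Worst-case number of peeks for a million sorted entries:
print(math.ceil(math.log2(1_000_000)))   # 20
```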
Unfortunately this is still not fast enough for a database index. To see why, imagine a goofy version of the list: there are still 1,000,000 cards, one per book, and they are still sorted by book title, but each card is blank except for a card number. To determine which book a card represents you need to walk down the hall and hand the card number to someone, who then finds the book and hands it to you. Say each of these trips takes five minutes. Suddenly 20 peeks isn’t so pleasant; at five minutes apiece it still takes you the length of a feature movie, about 100 minutes, to find ‘Romeo and Juliet’.
Let’s try again. We still need to find ‘Romeo and Juliet’, but we would like to do so before our next birthday. Instead of the binary search, imagine the following strategy. We are given a card that has a card number (which appears completely random to us) and 26 entries, one for every letter of the alphabet. We find the ‘R’ entry, which holds a single item: another card number. Unfortunately some bureaucratic sadist has again decreed that the mapping from a random card number to an actual card is held in that same office down the hall. Sigh. We trudge the two and a half minutes down the hall and hand the card number associated with ‘R’ to the person. They look at it and tell us which card is associated with the number. Unfortunately, again, the card we now need is back up the hall where we started. Another two and a half minutes later we find the new card. This one has another set of entries that look something like
- “Ra, The Egyptian Deity” – “Remy Lebeau: works in higher mathematics” : card 230942
- “Ren and Stimpy – Philosophy of” – “Robert Burns – works of poetry” : card 338383
- “Rockets R Fun” – “Rye Bread” : card 3982990
You find the section that would hold ‘Romeo and Juliet’ and look up the corresponding number, in this case card 3982990. You follow the same procedure until you finally locate the card with ‘Romeo and Juliet’ and its associated number. This time the number leads to the book itself rather than another index card. The number of times you need to make this back-and-forth trip is determined by the size of each ‘key’ (the book title), the number of entries on each card (which is set by the card size and the key size), and the total number of entries. In SQL Server, with 1,000,000 books and a maximum book title of 250 characters, you would expect to make about three or four trips. Much, much better than 20: now we are talking about fifteen to twenty minutes as opposed to an hour and forty minutes.
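The trip count is really just the depth of the index tree, which is driven by the fan-out (how many entries fit on one ‘card’, i.e. one page). Here is a back-of-the-envelope sketch; the 8 KB page size matches SQL Server, but the per-entry overhead is a rough illustrative number rather than the exact on-disk layout:

```python
import math

page_bytes = 8 * 1024                    # SQL Server pages are 8 KB
bytes_per_entry = 250 + 10               # ~250-byte title key plus a small pointer (rough guess)
fanout = page_bytes // bytes_per_entry   # entries that fit on one index "card"

rows = 1_000_000
depth = math.ceil(math.log(rows) / math.log(fanout))

print(fanout, depth)                     # about 31 entries per card, about 4 levels deep
```

With those assumptions you get a fan-out of roughly 31 and a depth of about 4, which lines up with the three or four trips above.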
One final note before getting to Part 2. Why do I keep talking about warehouses and long hallways in these analogies? The answer has to do with the disk subsystem in a database. It takes orders of magnitude longer to locate an object on disk than it does to retrieve one from memory. Caching, SSDs, and enormously fast SANs aside, it takes about 8 ms to 10 ms to get a piece of data from a disk. Each lookup in an index, data page, etc. is another disk round trip (again, caching aside). I made the hallway and warehouse analogies to show this slowness in disk I/O. We will see more of it later when we touch on query optimization.
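To put rough numbers on the hallway analogy, here is a quick comparison using the seek time quoted above and the same 1,000,000-book example; the trip counts are the approximate figures from this post, every access is (unrealistically) treated as a separate random read, and caching is ignored:

```python
seek_ms = 9   # roughly 8-10 ms per uncached disk round trip, as above

# Treating every access as a separate random read exaggerates the full-scan
# case (real scans read pages sequentially), but it shows why trip count matters.
strategies = {
    "full scan (avg)": 500_000,    # look at half the rows on average
    "binary search":   20,         # ~log2(1,000,000) peeks
    "B-tree lookup":   4,          # ~3-4 page reads
}

for name, trips in strategies.items():
    print(f"{name:16} ~{trips * seek_ms:>9,} ms of disk time")
```

Even with these very rough numbers, the gap between half a million round trips and three or four is the whole point of the index.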