Recently I ran across a blog which tried to use the analogy of a library to describe the difference between clustered and non-clustered indexes in SQL Server. I found the post useful as a very general idea of indexes though I thought it suffered from too much abstraction. I thought I might give the same analogy a try. Full Disclosure: I don’t find the library analogy a particularly good one. There are too many things which work well in a computer system that a person would find difficult or cumbersome. With that said, lets give it a whirl.
Setup
Our library, unlike a real one, stores it’s books in a warehouse connected to the main library where the books are stored in boxes, a few hundred books per box. A library patron gets a book by asking the librarian for it. The librarian in turn asks a warehouse worker to bring the box or boxes back to her so she can get the requested book and give it to the patron.
First Pass – No indexes
This pass will be simplest version where there a no indexes at all. The books are stored randomly in the warehouse. A patron asks the librarian for all books by Agatha Christie. As the books are in random order there is no easy way for the librarian to find the books the patron wants. Rather, the librarian must ask for each box in the warehouse, open the box, and look through it for books by Agatha Christie. When she finds one, she pulls in out of the box and continues looking for more. She must do this for every box until she has looked in them all. This is a table scan and takes a LONG time.
Second Pass – No indexes, Read Ahead Threads
Obviously this is not an optimal way to get the books the patron wants. One issue is that the warehouse worker is just sitting there doing nothing as the librarian looks through the box of books the worker just delivered. Further, the librarian is just sitting there waiting for the next box to be delivered. As the librarian is going to look through all the boxes in any case, the warehouse worker could busy himself by bringing boxes of books to the librarian as she is looking through the first box. In this way the ‘waiting’ time is decreased and the librarian can scan through all the boxes without having to wait for each new box to be returned from the warehouse. This is a scan with read ahead.
Third Pass – Non Clustered Index
This is where the Library analogy begins to break down. In the original blog post there was no attempt to describe an index other than to say it was ordered by some column or columns. We will do something similar here. Understand that what is being described does not mimic a true database index. Those are B-Tree structures which are easy for a computer to traverse; a librarian, not so much.
In order to speed up the search we are going to use a card catalog. Our card catalog will be a set of drawers, one for every letter of the alphabet. In each drawer will be a set of cards, one per book, ordered by the authors last name and then the first name. In addition to the authors name and the title of the book each card will also have the box number which contains the book and the position of the book within the box. Now to find all books written by Agatha Christie the librarian would go to the ‘C’ drawer, find the beginning of the ‘Christy, Agatha’ section, and write down box number and what position the book has in the box. She then gives the list of box numbers to the warehouse worker who retrieves only those boxes from the warehouse. The librarian now only needs to look in each box and use the position number to find the correct book. Unfortunately, the books will be scattered in many boxes so it still may take some time to find them all.
Fourth Pass – Clustered index
A clustered index is similar to a non clustered index. We still have card catalogs, cards, etc.. The big difference with a clustered index is that when we build one the books are sorted by the columns the index is built on. In our example the books will be sorted by author’s last name, author’s first name, and book title. Now each box has a set of sorted books and the boxes themselves are laid out in that same order.
The same query to find all Agatha Christie books will return a much smaller set of boxes and those boxes will almost exclusively contain Agatha Christie books (there will likely be some non Agatha Christie books at the beginning of the first box and then again at the end of the last box). The librarian now needs to look in the first box for the beginning of the ‘Agatha Christy’ section and remove book after book until she hits the end of the section.
It should be obvious why can only have one clustered index: the books can only be in one order.
What’s Next?
In the next post I’ll try to show where this analogy breaks down and what the index structures, including the clustered indexes, really look like.
Pingback: The Guts of Indexing – Part 1 | Chuck Normal Form