Intro to Database Systems : Indexing
We learned in the last lecture that when data is stored on disks, it is sorted as a set of blocks of data (also called pages). A block is accessed as a whole, in its entirety. On the disk, blocks are structured as link lists:
- they both have a section containing data
- they both have a pointer to the location of the next node (next block/page)
We will demonstrate how useful indexes are through a bunch of examples. An index helps us to find rows faster. They are useful for queries done on attributes that are used frequently. Indexing is just a fancy word to say “sorting a column in order to efficiently query an element”
Let’s start with some examples and define N as the number of blocks that the entire table requires.
We already know that searching on a column that isn’t sorted requires N/2 block accesses using Linear Search. Even worst, if the column doesn’t contain unique entries (say we have 2 people with their firstName “Dan”)… the entire table must be searched. That’s N block accesses (because what if the duplicated element is the last row in the column).
Now let’s assume that the column is sorted (and that’s what an index does). By using Binary Search, we will obtain log2 N block accesses. Since the data is sorted, we won’t need to search the rest of the table for duplicate values.
Creating an index is basically creating a data structure that holds the column value and a pointer to the records.
Let’s consider a database that doesn’t have an index. For simplicity, we have a table with only two columns: firstName, lastName.
Say we have:
- r = 5 000 000 as the number of rows in the table
- R = 204 bytes as the fixed size of each row (record length)
- B = 1024 bytes as the default block size (size of each data block)
Let’s see how many rows are in a disk block?
- 1024 / 204 = 5 rows per disk block
Let’s see how many blocks are in our table?
- 5 000 000 / 5 = 1 000 000 blocks per table .
This is our N. We know that if we query on a non-sorted column, we’ll obtain N / 2 blocks traversed, hence the traversal of 1 000 000 / 2 = 500 000 blocks. If we allow duplicates, we will have 1 000 000 block accesses.
If the column is already sorted and then we search for an element, we obtain log2 (1 000 000) = 20 block accesses.
We notice that, from 500 000 block accesses to 20 block accesses, the performance increase is substantial.
Since we’ve seen what’s the impact of a query on a sorted column, let’s introduce an index. Let’s pretend we have firstName as an attribute. As we said, creating an index on a column implies creating a data structure that holds:
- a value : in this case the field name takes 50 bytes.
- a pointer to the record it relates to : row pointer is 4 bytes
Say we have:
- r = 5 000 000 as the number of rows in the table
- R = 54 bytes as the index record length
- B = 1024 bytes as the default block size (size of data block)
Let’s see how many rows are in a disk block?
- 1024 / 54 = 18 rows per disk block
Let’s see how many blocks are in our table?
- 5 000 000 / 18 = 277 778 blocks
This is the number of blocks that needs to be accessed in a non-sorted column when we need to search a particular row.
Since we’ve used an index, the column is already sorted. Hence, when we query on it, we binary search through the index with an average of log2 (277 778) = 19 block accesses. The last step is to follow the pointer, hence 19 + 1 = 20 block accesses to find a particular element in an indexed column.
Again, we notice that from 277 778 block accesses to 20 block accesses, the performance increase is substantial.
Now why can’t we use the sorting method and then a search instead of searching trough an indexed column?
Don’t forget that the sorting method actually makes changes to the underlying physical order of data. Indexing creates a separate index file that references rows in the active table, allowing direct access to those rows through a data structure, a B+ tree that we will introduce it in the next chapter.
Given that creating an index requires additional disk space (in our example, 277 778 extra blocks), this is potentially the main drawback: we win on time complexity, but we lose on space complexity.