Intro to Database Systems : Indexing

We learned in the last lecture that when data is stored on disk, it is organized as a set of blocks of data (also called pages). A block is always accessed as a whole, in its entirety. On disk, blocks are structured as linked lists: each block holds its data and a pointer to the next block.
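To make that layout concrete, here is a minimal sketch in Python of a block holding a few rows plus a pointer to the next block (a real storage engine works with fixed-size byte arrays on disk, so the class and field names here are purely illustrative):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """A disk block (page): a group of rows plus a link to the next block."""
    rows: List[tuple]               # the records stored in this block
    next: Optional["Block"] = None  # pointer to the next block in the list


# A tiny table laid out as a linked list of two blocks.
b2 = Block(rows=[("Eve", "Stone"), ("Frank", "Hill")])
b1 = Block(rows=[("Alice", "Smith"), ("Bob", "Jones")], next=b2)

# Reading the whole table means following the links, block by block.
block = b1
while block is not None:
    print(block.rows)
    block = block.next
```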

We will demonstrate how useful indexes are through a few examples. An index helps us find rows faster, and it is most useful for queries on attributes that are used frequently. Indexing is just a fancy word for “sorting a column so that we can query an element efficiently”.

Let’s start with some examples and define N as the number of blocks that the entire table requires.

We already know that searching on a column that isn’t sorted requires N/2 block accesses on average using linear search. Even worse, if the column doesn’t contain unique entries (say we have two people whose firstName is “Dan”), the entire table must be searched: that’s N block accesses, because the last duplicate could sit in the very last block.

Now let’s assume that the column is sorted (which is essentially what an index gives us). Using binary search, we need only about log2(N) block accesses. And since the data is sorted, duplicate values sit next to each other, so we don’t need to scan the rest of the table for them.
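As a rough illustration of the difference (toy data, not the real on-disk algorithms), the sketch below counts block accesses for a linear scan versus a binary search over a table that has been split into blocks:

```python
import math


def linear_search_blocks(blocks, key):
    """Scan block after block; each block read counts as one access."""
    accesses = 0
    for block in blocks:
        accesses += 1
        if any(row[0] == key for row in block):
            return accesses  # worst case: the key sits in the very last block
    return accesses


def binary_search_blocks(blocks, key):
    """Binary search over blocks whose rows are sorted on the search column."""
    accesses, lo, hi = 0, 0, len(blocks) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        accesses += 1  # one block read per probe, so about log2(N) in total
        block = blocks[mid]
        if block[0][0] <= key <= block[-1][0]:
            return accesses
        if key < block[0][0]:
            hi = mid - 1
        else:
            lo = mid + 1
    return accesses


# Toy table: 1 000 blocks holding 5 single-column rows each.
rows = [(i,) for i in range(5_000)]
blocks = [rows[i:i + 5] for i in range(0, len(rows), 5)]

print(linear_search_blocks(blocks, 4_999))   # 1000 accesses: the key is in the last block
print(binary_search_blocks(blocks, 4_999))   # 10 accesses
print(math.ceil(math.log2(len(blocks))))     # 10: the log2(N) bound
```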

Creating an index basically means creating a data structure that holds the column value and a pointer to the corresponding record.
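A minimal way to picture that data structure is a list of (value, pointer) pairs kept sorted by value. A real engine would use a B+ tree, and the “pointer” below is simply a row position, so treat this only as a sketch:

```python
import bisect

# The table: rows stored in whatever order they were inserted.
table = [("Dan", "Brown"), ("Alice", "Smith"), ("Dan", "Gray"), ("Bob", "Jones")]

# The index on firstName: (column value, pointer to the record), sorted by value.
index = sorted((first, row_id) for row_id, (first, _) in enumerate(table))

def lookup(key):
    """Binary search the index, then follow the pointers to the actual rows."""
    pos = bisect.bisect_left(index, (key,))
    matches = []
    while pos < len(index) and index[pos][0] == key:
        matches.append(table[index[pos][1]])  # follow the pointer into the table
        pos += 1
    return matches

print(lookup("Dan"))  # [('Dan', 'Brown'), ('Dan', 'Gray')]
```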

Let’s consider a database that doesn’t have an index. For simplicity, we have a table with only two columns: firstName, lastName.

Say we have a certain number of rows, each of a fixed size, stored in disk blocks of a fixed size.

First, let's see how many rows fit in a disk block, and then how many blocks the entire table requires.

That block count is our N; in our example, N = 1 000 000 blocks. We know that if we query on a non-sorted column, we traverse N / 2 blocks on average, hence 1 000 000 / 2 = 500 000 block accesses. If we allow duplicates, that becomes 1 000 000 block accesses.

If the column is already sorted and we search for an element with binary search, we need log2(1 000 000) ≈ 20 block accesses.

We notice that going from 500 000 block accesses down to 20 is a substantial performance gain.
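A quick back-of-the-envelope check of those numbers (using the N = 1 000 000 blocks assumed above):

```python
import math

N = 1_000_000                     # blocks occupied by the table
print(N // 2)                     # linear search, unique values: 500 000 accesses on average
print(N)                          # linear search with duplicates: 1 000 000 accesses
print(math.ceil(math.log2(N)))    # binary search on a sorted column: 20 accesses
```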

Since we’ve seen the impact of querying a sorted column, let’s introduce an index, say on the firstName attribute. As we said, creating an index on a column means creating a data structure that holds the column value (here, firstName) and a pointer to the corresponding record.

Say each index entry holds the firstName value plus a pointer to the record, again with a fixed size.

Again, let's see how many of these entries fit in a disk block, and then how many blocks the whole index requires: in our example, 277 778 blocks.

This is also the number of blocks we would have to traverse if the column were not sorted and we had to scan it to find a particular row.

Since we’ve created an index, its entries are already sorted. Hence, when we query on firstName, we binary search through the index with about log2(277 778) ≈ 19 block accesses. The last step is to follow the pointer to the actual record, hence 19 + 1 = 20 block accesses to find a particular element through an indexed column.

Again, going from 277 778 block accesses down to 20 is a substantial performance gain.
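And the same quick check for the indexed case (using the 277 778 index blocks quoted above):

```python
import math

index_blocks = 277_778                         # blocks occupied by the index
probes = math.ceil(math.log2(index_blocks))    # binary search through the index: 19 accesses
print(probes + 1)                              # plus one access to follow the pointer: 20 in total
```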

Now, why can’t we simply sort the table and then search it, instead of searching through an indexed column?

Don’t forget that sorting actually changes the underlying physical order of the data. Indexing, on the other hand, creates a separate index file that references the rows of the table, allowing direct access to those rows through a data structure, typically a B+ tree, which we will introduce in the next chapter.

Creating an index does require additional disk space (in our example, 277 778 extra blocks), and this is potentially its main drawback: we win on time complexity, but we lose on space complexity.

 