Intro to Database Systems : Schema Refinement - Functional Dependencies

Schema refinement is just a fancy term for saying polishing tables. It is the last step before considering physical design/tuning with typical workloads:

Let’s see an example of redundancies and anomalies. Consider the following table where the client’s name is the primary key.

first.PNG

The table is presenting information on employees (sales reps) and their clients.

If we want to insert data, we notice that:

If we want to update data, we notice that:

If we want to delete data, what if Mary doesn’t have a client anymore because she’s taking a year off? We are forced to either

When we have to treat with schema refinement we often notice that the main problem is redundancy. In order to identify schemas with such problems, we’ll introduce the notion of functional dependencies: a relationship that exists when one attribute uniquely determines another attribute. A functional dependency is simply a new type of constraint between two attributes.

Say that R is a relation with attributes X and Y, we say that there is a functional dependency X -> Y when Y is functionally dependent on X (where X is the determinant set and Y is the dependent attribute).

Let’s illustrate a scenario where the designer didn’t take in consideration dependencies between columns.

The following structure is considerably better:

How do we pass from one to the other? That’s what schema refinement does through functional dependencies.

A unique way to represent a student is through his studID. Each student has his own address, hence we can say that studID determines address. We’ll write this in the following way:

In the previous example, we actually have the following FDs:

Let’s have a look at the properties of functional dependencies in the case where X, Y and Z are attributes belonging to a table R :

The first 3 properties are called the Armstrong’s Axioms.

If F is a set of functional dependencies, F+ is the set of all FDs logically implied by F. Logically implied is just another way of saying obtained from the properties of functional dependencies ( the ones that we just enumerated). F+ is also called the closure of the set of functional dependencies. Is is the set of all dependencies logically implied by those present in F.

Let’s illustrate the usage of those properties with an example. If we have the following set of FDs, can we conclude that A - > H is logically implied?

Let’s see which properties are applicable to our case:

Which other dependencies are part of the closure?

Given a set of FDs, is there a faster way to compute if a dependency is logically implied?

Let’s see through an example how we can ask this question in multiple ways:

Before going on with a linear time algorithm, we notice that we’ve introduced a new notion, A+. We call A+ the attribute closure of A with respect to F and it will help us figure out if A - > E is logically implied.

Now, to check if A - > E is in the closure F+, we can conclude that since E is NOT in A+, then A - > E is NOT in F+.

We can generalize this into an algorithm:

Now that we know how to quickly verify if a dependency is logically implied… how do we find all the dependencies that are logically implied? Given a set of FDs F, how do we find its closure, F+ ?

Let’s go through an example again:

The algorithm is pretty simple:

1) Build an empty matrix with all possible combinations of attributes as rows and columns

1.PNG

2) Compute the attribute closures of all attribute combinations

2.PNG

3) Fill the matrix from step 1) by putting a check mark when a row member Y (from the table defined in step 1) is part of a member of the attribute closure Y+ (from the table defined in step 2) .

3.PNG

Let’s look at some examples

By having a check-mark at say the intersection of row A with column BC we mean that A - > BC is part of the closure F+. This is how we enumerate all the dependencies that are part of the closure F+.

Functional dependencies can also be used to find all the candidate keys. By definition, a candidate key is a set of columns that can be uniquely used to identify a database record without any irrelevant/unrelated/superfluous data. It is a reduction of the entire collection of attributes, hence a minimization.

Since we are talking about a minimal subset, we can start with the complete set of attributes and then, following functional dependencies, minimize the set until we reach the candidate keys (a set of attributes that can not be reduced). Let’s illustrate this once more by an example.

Say we have F = { A - > B, BC - > E and ED - > A}.

We notice that functional dependencies help us structuring our tables around unique attributes, avoiding superfluous information.

 
170
Kudos
 
170
Kudos

Now read this

Intro to Database Systems : Basic Perspectives on Disk and Buffer Management

The Database Management System stores information at 3 levels of the memory hierarchy: Primary storage - main memory (and cache) : for currently used data, it is fast and usually volatile. Secondary storage - magnetic (“hard”) disk : for... Continue →