Dan Crisan

A Tiny Intro to Publishing Your Android App on Google Play

2016-11-22T22:40:40-08:00

The following is a simple technical guide to help developers to publish their Android app on Google Play.

Here is what you need:

A Google Play publisher account (essentially an Android developer account). You can sign up from here
Android Studio, in order to generate your .apk file (the Android Application Package is the package file format used to distribute and install apps on Android)

If you are ready to publish, let’s see how to generate your APK. In Android Studio, select “Build” and then “Build apk” from the top menu:

Your application folder now contains a file called app-debug.apk:

From the command line, change directory to where your app-debug.apk file is located. Once there, we need to generate a key that will help us sign the APK. In the terminal, type the following:

keytool -genkey -v -keystore my-release-key.keystore -alias alias_name -keyalg RSA -keysize 2048 -validity 10000

Create a password for the keystore (mandatory) and, on the last question, type yes:

We notice a new file in our folder, the keystore file:

We’ll use the keystore file to sign the application by typing the following command:

jarsigner -verbose -sigalg SHA1withRSA -digestalg SHA1 -keystore my-release-key.keystore app-debug.apk alias_name

The next step is optimizing the APK with the zipalign tool, so that the app uses less RAM when running. The tool is usually found under /path/to/Android/sdk/build-tools/VERSION/zipalign. On a Mac, it is under ~/Library/Android/sdk/build-tools/VERSION/zipalign:

~/Library/Android/sdk/build-tools/yourVersionOfAndroid/zipalign -v 4 app-debug.apk new-optimized.apk

We notice a new file created in our folder, new-optimized.apk.

Done! You can now go to https://play.google.com/apps/publish/ and upload your APK.

Once it has been accepted, you can later publish new versions of your app. You’ll have to change its version in the AndroidManifest.xml file before submitting it (notice lines 3 and 4)

Then, you can simply re-sign the APK:

jarsigner -verbose -sigalg SHA1withRSA -digestalg SHA1 -keystore my-release-key.keystore app-debug.apk alias_name

Optimize it once more:

~/Library/Android/sdk/build-tools/yourVersionOfAndroid/zipalign -v 4 app-debug.apk new-optimized.apk

And then, from the left menu, choose “APK”, then “Upload new APK to Production”, and upload:

Done!

If you have any questions or suggestions, send me a tweet at @dandancrisan or let’s go for coffee in SF or Montreal. Thanks for reading and I hope it helped! (:

Tiny Intros

2015-12-19T19:08:08-08:00

Here’s what is in progress and what has been covered so far:

Thank you for reading. If you have any suggestions, feel free to send me a tweet @dandancrisan

A Tiny Intro to the Android Activity Lifecycle

2015-11-03T20:43:38-08:00

The following is a tutorial on the Android Activity Lifecycle, a concept that comes up pretty often during mobile dev interviews.

There is also a simple app to demo the topic, a decent way to go through the cycles and the process. It costs one buck for the support, but it is also open sourced on Github for people who like to learn with examples and more logs (it can easily be tested in Android Studio).

Before going through the lifecycle process and explaining the main parts, let’s see what an Android Activity is. An activity is a view, a window with related design and interactions. The following is a single activity that has 2 buttons and some text describing which state of the Activity Lifecycle the app is in.

This activity, as any Android Activity, goes through the following states all along its lifecycle:

1) onCreate()

The onCreate() callback is called any time a new instance of the activity is created. This means that for any subsequent starts of the same instance, onCreate() is not called because the activity was already created/loaded. onCreate() is called when…

the user loads a new activity for the first time
the user exits the current view by hitting the Back button and then returns to it. That’s because when the user presses the Back button, the current activity is destroyed and once the user returns to it, it has to be created again.

2) onStart()

The onStart() callback is called, you guessed it, when any instance of the activity starts: when the instance loads for the first time, but also on any subsequent reloads, basically any time an activity comes to the foreground. This is the state in which the activity becomes visible to the user.

3) onResume()

The onResume() callback is called when the activity becomes ready to respond to the user, which means any time it is in the foreground and not paused. Let’s go to the next step in order to have a better understanding of onResume()/onPause().

We can test the onResume() action by pressing the Back button once in the onPause() state (click on the Checkmark button to pause Main Activity in the Activity Lifecycle app).

4) onPause()

If the activity is partially visible but not in focus, the activity is paused: onPause() has been called. That’s the case in the next example, where we have a transparent second activity that partially obstructs our main activity. We can test the onPause() action by pressing the Checkmark button in the Activity Lifecycle app.

We can stop the paused activity by sending it to the background and come back to it by bringing it back to the foreground: it will still stay in the onPause() state. We can test this by pressing either the Circle button or the Square button of the device while the application is paused, switching to another view from a different app, and then coming back to our Activity Lifecycle app.

In order to better understand the onPause()/onStop() difference, let’s have a look at how onStop() works.

5) onStop()

The onStop() callback is called when the activity is no longer visible, that is, when it goes to the background. But what’s the difference between onPause() and onStop()? onPause() is the equivalent of a car stopping at a red light: the motor isn’t turned off, it keeps running, but with a lot less resource consumption. That’s different from a situation where the car is parked at a grocery store (the motor is turned off), which corresponds to sending the activity to the background (not driving anymore). You can experiment with onStop() by pressing either the Square button or the Circle button.

6) onRestart()

The onRestart() callback is called when the activity is about to be displayed in the foreground again after being stopped.

7) onDestroy()

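The onDestroy() callback is called when the activity is destroyed, either because the user closes it or because the operating system reclaims it when it’s low on resources. If the system kills the whole process abruptly, there is no guarantee that onDestroy() gets a chance to run.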
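As a rough, language-neutral illustration (a toy model in Python, not Android code), the order in which the callbacks described above can follow each other may be written as a small transition table. The names NEXT and is_valid_sequence are made up for this sketch, and the table only covers the transitions discussed in this post.

```python
# Toy model of the Activity Lifecycle: which callback may follow which.
# NEXT and is_valid_sequence are hypothetical names, not part of the Android SDK.
NEXT = {
    None: {"onCreate"},                    # a new instance always starts here
    "onCreate": {"onStart"},
    "onStart": {"onResume", "onStop"},
    "onResume": {"onPause"},
    "onPause": {"onResume", "onStop"},
    "onStop": {"onRestart", "onDestroy"},
    "onRestart": {"onStart"},
    "onDestroy": set(),                    # end of this instance
}


def is_valid_sequence(callbacks):
    """Return True if the callbacks could occur in this order for one instance."""
    previous = None
    for name in callbacks:
        if name not in NEXT[previous]:
            return False
        previous = name
    return True


# Launch, send to the background, come back, then press Back.
session = ["onCreate", "onStart", "onResume", "onPause", "onStop",
           "onRestart", "onStart", "onResume", "onPause", "onStop", "onDestroy"]
print(is_valid_sequence(session))                    # True
print(is_valid_sequence(["onCreate", "onResume"]))   # False: onStart was skipped
```

Comparing a real log from the demo app against such a table is one way to check your understanding of the order of the callbacks.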

One last thing

Now what if we want to run a component in the background (say, play music) even if the activity is stopped? That’s what a service is for. A service runs in the background without any interaction with the user except starting and stopping it. It can act independently of any activity lifecycle, processing continuously in the background. Music playback with a MediaPlayer, for example, is typically hosted in a service.

In the Activity Lifecycle app, we can demo the service utility by pressing the Music button, sending the app to the background by pressing the Circle button and noticing that the music is still playing.

You can press Ctrl + Fn + F6 on a Mac or Ctrl + F6 on a PC to increase the volume.

A service is stopped when the process starting it is destroyed (the app is killed) or when the service is stopped (in the Activity Lifecycle app, press the Music button one more time).

@dandancrisan

Mobile app

Source code

A Tiny Intro to Database Systems

2015-04-24T14:11:54-07:00

Here is a short summary of DBMS: database management systems.

Part of the motivation behind these little chapters is described in another blog post here.

If you would like to read them in a nicely formatted PDF or if you have any questions / suggestions / requests, feel free to send me a tweet @dandancrisan.

Intro to Database Systems : Schema Refinement - Functional Dependencies

2015-04-10T16:04:12-07:00

Schema refinement is just a fancy term for saying polishing tables. It is the last step before considering physical design/tuning with typical workloads:

1) Requirement analysis : user needs
2) Conceptual design : high-level description, often using E/R diagrams
3) Logical design : from graphs to tables (relational schema)
4) Schema refinement : checking tables for redundancies and anomalies

Let’s see an example of redundancies and anomalies. Consider the following table where the client’s name is the primary key.

The table is presenting information on employees (sales reps) and their clients.

If we want to insert data, we notice that:

each row requires an entry in the client field
we can’t insert data for newly hired sales reps until they’ve been assigned to one or more clients
if sales reps are in a training process, even if they’ve been already hired, they can’t actually join the database because they need to have a delegated client… unless “dummy” clients are created.

If we want to update data, we notice that:

the sales rep’s name is repeated for each client.
what if, for a given client, we misspelled the sales rep’s name as Crosby instead of Cosby… how can we fix that without affecting all the sales reps who are actually called Crosby?

If we want to delete data, what if Mary doesn’t have a client anymore because she’s taking a year off? We are forced to either

create a dummy client
incorrectly show her with a client she no longer handles
delete Mary’s record (even though she’s still an employee)
notice that we cannot store “null” as the client, since primary key fields cannot be null.

When we deal with schema refinement, we often notice that the main problem is redundancy. In order to identify schemas with such problems, we’ll introduce the notion of functional dependencies: a relationship that exists when one attribute uniquely determines another attribute. A functional dependency is simply a new type of constraint between two attributes.

Say that R is a relation with attributes X and Y. We say that there is a functional dependency X -> Y when Y is functionally dependent on X (X is the determinant set and Y is the dependent attribute).

Let’s illustrate a scenario where the designer didn’t take into consideration the dependencies between columns.

Data (studID, studName, address, courseID, courseName, grade)

The following structure is considerably better:

Student(studID, studName, address)
Course (courseID, courseName)
Enrolled (studID, courseID, grade)

How do we pass from one to the other? That’s what schema refinement does through functional dependencies.

A unique way to represent a student is through his studID. Each student has his own address, hence we can say that studID determines address. We’ll write this in the following way:

studID -> address

In the previous example, we actually have the following FDs:

studID -> studName, address
courseID -> courseName
studID, courseID -> grade

Let’s have a look at the properties of functional dependencies in the case where X, Y and Z are attributes belonging to a table R :

transitivity: if we assume that X -> Y and Y -> Z, then it’s clear that X -> Z
reflexivity: if Y is a subset of X, then X -> Y
augmentation: if X -> Y, then for any Z we’ll have X, Z -> Y, Z
union: if X -> Y and X -> Z, then X -> Y, Z
decomposition: if X -> Y, Z then X -> Y and X -> Z

The first 3 properties are called Armstrong’s Axioms.

If F is a set of functional dependencies, F+ is the set of all FDs logically implied by F. Logically implied is just another way of saying obtained from the properties of functional dependencies (the ones that we just enumerated). F+ is also called the closure of the set of functional dependencies.

Let’s illustrate the usage of those properties with an example. If we have the following set of FDs, can we conclude that A -> H is logically implied?

A -> B
A -> C
C, G -> H
C, G -> I
B -> H

Let’s see which properties are applicable to our case:

We know that, by the transitivity property, if X -> Y and Y -> Z then we have X -> Z.
In our case we have A -> B and B -> H.
Hence, by transitivity, A -> H is logically implied.

Which other dependencies are part of the closure?

CG -> HI by the union rule
AG -> I by noticing that A -> C holds, and then AG -> CG by the augmentation rule and then AG -> I by transitivity.

Given a set of FDs, is there a faster way to compute if a dependency is logically implied?

Let’s see through an example how we can ask this question in multiple ways:

Does F = {A -> B, B -> C, CD -> E} imply A -> E?
Is A -> E in the closure F+ ?
Is E in A+ ?

Before going on with a fast algorithm, we notice that we’ve introduced a new notion, A+. We call A+ the attribute closure of A with respect to F and it will help us figure out if A -> E is logically implied.

1) Assume that we create a temporary attribute closure of A called TMP and that to begin, TMP = A (the left-hand side of the FD that you want to verify)
2) Let’s consider the first given dependency of F, A -> B.
3) Is A in the TMP? Yes, since as stated previously TMP = A; we continue.
4) If we continue, we union B with the current TMP, A. What we obtain is the new TMP, AB (since A union B = AB).
5) We now consider the second given dependency, B -> C.
6) Is B in the TMP? Yes, since we now have AB in the TMP; we continue.
7) If we continue, we union C with the current TMP, AB. What we obtain is the new TMP, ABC (since AB union C = ABC).
8) We consider the 3rd given dependency, CD -> E.
9) Is CD in the TMP? No: TMP is ABC, which contains C but not D, so this dependency does not apply. No dependencies are left, and going through them again would add nothing, hence we stop.
10) The attribute closure of A is then A+ = TMP = {A, B, C}

Now, to check if A -> E is in the closure F+, we can conclude that since E is NOT in A+, then A -> E is NOT in F+.

We can generalize this into an algorithm:

1) Take the left-hand side of the dependency you want to verify as the initial content of your temporary attribute closure TMP.
2) Consider each dependency X -> Y of the given set of FDs in turn.
3) Is X entirely contained in TMP? If yes, continue to step 4. If no, skip this dependency.
4) Yes: union TMP with Y.
5) Repeat steps 2 to 4 until a full pass over the FDs adds nothing to TMP. Your attribute closure is then TMP.
Conclusion: if an attribute is in your attribute closure, then the dependency from your input to that attribute is logically implied (it’s part of the closure of the set of functional dependencies).
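Here is a minimal Python sketch of the attribute closure computation. The function name attribute_closure and the encoding of each FD as a pair of strings of single-letter attributes are choices made for this example only. The sketch keeps scanning the dependencies until a full pass adds nothing, so it also works when the FDs are listed in an unhelpful order.

```python
def attribute_closure(attrs, fds):
    """Return the closure of `attrs` under `fds`.

    `fds` is a list of (lhs, rhs) pairs; each side is a string of
    single-letter attribute names (an encoding chosen for this sketch).
    """
    closure = set(attrs)
    changed = True
    while changed:                      # repeat passes until nothing new is added
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)     # lhs is covered, so rhs is implied
                changed = True
    return closure


F = [("A", "B"), ("B", "C"), ("CD", "E")]
a_plus = attribute_closure("A", F)
print(sorted(a_plus))    # ['A', 'B', 'C']
print("E" in a_plus)     # False, so A -> E is not in F+
```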

Now that we know how to quickly verify if a dependency is logically implied… how do we find all the dependencies that are logically implied? Given a set of FDs F, how do we find its closure, F+ ?

Let’s go through an example again:

Given F = {A -> B, B -> C}, compute F+

The algorithm is pretty simple:

1) Build an empty matrix with all possible combinations of attributes as rows and columns

2) Compute the attribute closures of all attribute combinations

3) Fill the matrix from step 1) by putting a check mark at row X, column Y when every attribute of Y is part of the attribute closure X+ (computed in step 2).

Let’s look at some examples

row member A: A+ = ABC, and A, B, C, AB, AC, BC and ABC are all made of attributes of ABC, so every column of row A gets a check mark.
row member BC: A is not a member of the attribute closure (BC)+, so we don’t put a check mark in any column containing A. However, we check B, C and BC.
row member C: A is not a member of C+, so we don’t check it, because C+ contains only C.

By having a check-mark at say the intersection of row A with column BC we mean that A - > BC is part of the closure F+. This is how we enumerate all the dependencies that are part of the closure F+.
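The matrix can also be filled in programmatically. The following is a small sketch, assuming FDs are encoded as pairs of strings of single-letter attributes; the helper names (attribute_closure, nonempty_subsets, fd_closure) are made up for this example. Each pair (X, Y) in the result corresponds to one check mark at row X, column Y.

```python
from itertools import combinations


def attribute_closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) string pairs."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure


def nonempty_subsets(attrs):
    """All attribute combinations, e.g. A, B, C, AB, AC, BC, ABC."""
    attrs = sorted(attrs)
    return ["".join(c) for n in range(1, len(attrs) + 1)
            for c in combinations(attrs, n)]


def fd_closure(attrs, fds):
    """Every pair (X, Y) such that Y is contained in X+: the check marks."""
    subsets = nonempty_subsets(attrs)
    return {(x, y) for x in subsets
            for y in subsets
            if set(y) <= attribute_closure(x, fds)}


F = [("A", "B"), ("B", "C")]
f_plus = fd_closure("ABC", F)
print(("A", "BC") in f_plus)    # True: A -> BC is in F+
print(("C", "A") in f_plus)     # False: C+ contains only C
```

Note that the number of rows and columns doubles with every extra attribute, so this brute-force enumeration is only practical for small schemas.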

Functional dependencies can also be used to find all the candidate keys. By definition, a candidate key is a set of columns that can be uniquely used to identify a database record without any irrelevant/unrelated/superfluous data. It is a reduction of the entire collection of attributes, hence a minimization.

Since we are talking about a minimal subset, we can start with the complete set of attributes and then, following functional dependencies, minimize the set until we reach the candidate keys (a set of attributes that can not be reduced). Let’s illustrate this once more by an example.

Say we have F = {A -> B, BC -> E, ED -> A}.

1) We know that the set of all attributes is ABCDE.
2) Can we reduce the set by using the first given FD? If we follow A -> B, we can remove B from the main set because B depends on A, and ABCDE already contains A, hence no need for the dependent, superfluous attribute. We obtain ACDE.
3) Can we reduce the set by using the second FD? If we follow BC -> E, we can remove E from the main set because E depends on BC, and ABCDE already contains BC. We obtain ABCD.
4) Can we reduce the set by using the third FD? If we follow ED -> A, we can remove A from the main set because A depends on ED, and ABCDE already contains ED. We obtain BCDE.
5) We now have a new collection of attribute sets: ACDE, ABCD and BCDE. Let’s call it X.
6) Can we simplify any set from X by using dependency A -> B? We can remove B from ABCD because ABCD already contains A, and B depends on A: we obtain ACD. Can we do the same for BCDE? No, because BCDE doesn’t contain A.
7) Can we simplify any set from X by using BC -> E? We can remove E from BCDE because BCDE already contains BC and E depends on BC: we obtain BCD. Can we do the same for ACDE? No, because ACDE doesn’t contain BC.
8) Can we simplify any set from X by using ED -> A? We can remove A from ACDE because ACDE already contains ED and A depends on ED: we obtain CDE.
9) We now have a new collection of attribute sets: ACD, BCD and CDE. Let’s call it Y.
10) Can we simplify any set from Y by using A -> B? BCD cannot be simplified because it doesn’t contain A, and the other sets from Y don’t contain B.
11) Can we simplify any set from Y by using BC -> E? CDE cannot be simplified because it doesn’t contain BC, and the other sets from Y don’t contain E.
12) Can we simplify any set from Y by using ED -> A? ACD cannot be simplified because it doesn’t contain ED, and the other sets from Y don’t contain A.
Conclusion: the functional dependencies from F cannot be used to simplify the sets from Y, hence they cannot be minimized any further. They are our candidate keys: ACD, BCD and CDE.
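The result can be double-checked with a brute-force search. Note that this is a different technique from the step-by-step reduction above: the sketch simply tries every attribute combination, smallest first, and keeps the ones whose attribute closure covers the whole table and that do not contain a smaller key. The function names and the string encoding of the FDs are assumptions made for this example.

```python
from itertools import combinations


def attribute_closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) string pairs."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure


def candidate_keys(attrs, fds):
    """Minimal attribute sets whose closure is the full set of attributes."""
    all_attrs = set(attrs)
    keys = []
    for size in range(1, len(all_attrs) + 1):        # smallest sets first
        for combo in combinations(sorted(all_attrs), size):
            candidate = set(combo)
            if any(k <= candidate for k in keys):    # contains a smaller key
                continue
            if attribute_closure(candidate, fds) == all_attrs:
                keys.append(candidate)
    return ["".join(sorted(k)) for k in keys]


F = [("A", "B"), ("BC", "E"), ("ED", "A")]
print(candidate_keys("ABCDE", F))    # ['ACD', 'BCD', 'CDE']
```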

We notice that functional dependencies help us structure our tables around unique attributes, avoiding superfluous information.

Intro to Database Systems : Concurrency Control - Scheduling problems

2015-04-07T16:24:29-07:00

In real life, users access a database concurrently.

Database access is done through transactions. What is a transaction?

a unit of work that has to be treated as “a whole”
it has to happen in full or not at all

A real life example of a transaction is money transfer:

first, withdraw an amount X from account A
second, deposit to account B

The previous operation has to succeed in full. You can not stop halfway. Database transactions work the same way. They ensure that, no matter what happens, manipulated data is treated atomically (you can never see “half a change”).

Atomicity is part of the ACID properties that a DBMS has to maintain:

Atomicity: either all actions from a transaction happen, or none happen
Consistency: the database starts from a consistent state and ends in a consistent state
Isolation: execution of one transaction is isolated from other transactions
Durability : if a transaction commits, its effects persist in the database

Now what can go wrong?

If not scheduled properly, concurrent processes may violate the isolation and consistency properties.

Let’s imagine a problem where 2 users reserve a seat for a flight:

customer 1 finds a seat empty
customer 2 finds the same seat empty
customer 1 reserves the seat
customer 2 reserves the seat

Customer 1 will not be happy. This introduces the notion of serializability. There needs to be a concurrency control mechanism through a schedule.

A sequence of actions from a set of transactions, listed in the chronological order in which they execute, is called a schedule. It is a representation of how a set of transactions is executed over time. It can contain the following actions:

read R(X)
write W(X)
commit (after completing all its actions, all the operations should be done and recorded)
abort (after executing some actions, if we abort, none of the operations should be done/recorded)

A commit or an abort is mandatory in order to have a complete schedule.

A serial schedule is a schedule without interleaving: all the operations of one transaction are executed consecutively, before the next transaction starts.

Conflicting operations are present in a schedule when those operations satisfy the following conditions:

they have to belong to different transactions
they have to access the same data object X
at least one of the operations is a W(X) (write on X)

Let’s see a couple of conflicting operations:

The Write-Read Conflict : reading uncommitted data
The Read-Write Conflict : rereading data that has been altered since the first read.
The Write-Write Conflict : losing updates

The Write-Read Conflict, also called reading uncommitted data or dirty read, occurs when a transaction T2 reads a database object A that was modified by a transaction T1 which hasn’t committed yet. If T1 then carries on with its transaction (or aborts), what T2 read is inconsistent. The next picture helps illustrate the scenario:

In other words, a dirty read is when a transaction is allowed to read data from a row that has been modified by another running transaction and that modification has not yet been committed.

The Read-Write Conflict, also called unrepeatable read, occurs when a transaction T1 has to read a database object A twice. After the first read, transaction T1 waits for transaction T2 to finish. T2 overwrites object A and when T1 reads A again, there are 2 different versions of A. T1 will be forced to abort: this is the unrepeatable read.

A real life example of this situation is when Bob and Alice are on Ticketmaster and they want to book tickets for a show. There is only one ticket left : Alice signs-in, finds that the ticket is expensive and takes the time to think about it… Bob signs-in and buys the ticket instantly and then logs off. Alice decides to buy the ticket and finds out that there are no tickets left.

The Write-Write Conflict, also called overwriting uncommitted data, occurs when there are lost updates. The attempt to make this scenario serial will always give two different results: either transaction T1’s version or transaction T2’s version.

Once some concurrent transactions have been applied to a database, a schedule is serializable if the resulting database state is equivalent (equal) to the outcome of the same transactions executed sequentially, without overlapping in time. This is what we aim for. A schedule that is serializable can also be:

ACA : avoid cascading abort
recoverable
strict schedule

A common way to verify whether a schedule is serializable (more precisely, conflict serializable) is through a dependency graph.

To build a dependency graph we can follow this procedure:

1) Represent every transaction by a node
2) Is there a transaction Ty that reads an item after a different transaction Tx writes it? If yes, draw an edge from node Tx to node Ty.
3) Is there a transaction Ty that writes an item after a different transaction Tx reads it? If yes, draw an edge from node Tx to node Ty.
4) Is there a transaction Ty that writes an item after a different transaction Tx has written that item? If yes, draw an edge from node Tx to node Ty.

Don’t forget to remove the edges of a transaction that ends up aborting.

In order to have a serializable schedule, the dependency graph has to be acyclic (it doesn’t have any cycles, closed paths).
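The procedure above can be sketched in a few lines of Python. The representation is an assumption made for this example: a schedule is a list of (transaction, action, item) tuples where the action is "R", "W", "commit" or "abort", and the function names are hypothetical. The two sample schedules are invented for illustration.

```python
def precedence_edges(schedule):
    """Edges (Tx, Ty) for each pair of conflicting operations, Tx acting first."""
    aborted = {t for t, op, _ in schedule if op == "abort"}
    ops = [(t, op, x) for t, op, x in schedule
           if op in ("R", "W") and t not in aborted]   # ignore aborted work
    edges = set()
    for i, (tx, op1, x1) in enumerate(ops):
        for ty, op2, x2 in ops[i + 1:]:
            if tx != ty and x1 == x2 and "W" in (op1, op2):
                edges.add((tx, ty))
    return edges


def has_cycle(edges):
    """Depth-first search for a closed path in the directed graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    state = {}                                         # node -> "visiting" | "done"

    def visit(node):
        if state.get(node) == "visiting":
            return True                                # back edge: cycle found
        if state.get(node) == "done":
            return False
        state[node] = "visiting"
        if any(visit(n) for n in graph[node]):
            return True
        state[node] = "done"
        return False

    return any(visit(n) for n in graph)


def is_conflict_serializable(schedule):
    return not has_cycle(precedence_edges(schedule))


interleaved = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A"),
               ("T1", "commit", None), ("T2", "commit", None)]
one_after_the_other = [("T1", "R", "A"), ("T1", "W", "A"), ("T1", "commit", None),
                       ("T2", "R", "A"), ("T2", "W", "A"), ("T2", "commit", None)]
print(is_conflict_serializable(interleaved))           # False: T1 -> T2 -> T1
print(is_conflict_serializable(one_after_the_other))   # True: only T1 -> T2
```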

The following schedule is not serializable:

The following schedule is serializable:

Now how do we know when a schedule is strict?

when an object written by a transaction T cannot be read or overwritten by another transaction until T commits or aborts.

How do we know when a schedule avoids cascading aborts?

when transactions only read data written by transactions that have already committed

How do we know when a schedule is recoverable?

when, for each pair of transactions where Ty reads some data written by Tx, the COMMIT operation of Tx appears before the COMMIT operation of Ty.
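The last two definitions can be checked mechanically. Below is a simplified sketch, assuming a schedule is a list of (transaction, action, item) tuples with actions "R", "W", "commit" and "abort". The function names are made up, and the sketch ignores the fact that an abort restores the previous value of an item.

```python
def reads_from(schedule):
    """(reader, writer, position) for each read of an item last written by another transaction."""
    last_writer = {}
    pairs = []
    for pos, (t, op, x) in enumerate(schedule):
        if op == "W":
            last_writer[x] = t
        elif op == "R" and x in last_writer and last_writer[x] != t:
            pairs.append((t, last_writer[x], pos))
    return pairs


def commit_positions(schedule):
    return {t: pos for pos, (t, op, _) in enumerate(schedule) if op == "commit"}


def is_recoverable(schedule):
    """Every committed reader commits after the writer it read from."""
    commits = commit_positions(schedule)
    never = float("inf")                               # writer never committed
    return all(commits.get(writer, never) < commits[reader]
               for reader, writer, _ in reads_from(schedule)
               if reader in commits)


def avoids_cascading_aborts(schedule):
    """Every read happens after the writer of that value has committed."""
    commits = commit_positions(schedule)
    never = float("inf")
    return all(commits.get(writer, never) < pos
               for _, writer, pos in reads_from(schedule))


dirty_then_early_commit = [("T1", "W", "A"), ("T2", "R", "A"),
                           ("T2", "commit", None), ("T1", "commit", None)]
dirty_but_ordered = [("T1", "W", "A"), ("T2", "R", "A"),
                     ("T1", "commit", None), ("T2", "commit", None)]
clean = [("T1", "W", "A"), ("T1", "commit", None),
         ("T2", "R", "A"), ("T2", "commit", None)]

for s in (dirty_then_early_commit, dirty_but_ordered, clean):
    print(is_recoverable(s), avoids_cascading_aborts(s))
# False False
# True False
# True True
```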

The point of enumerating all those schedule classes is to define some concurrency control : measures such that non-serializable execution can never happen.

Intro to Database Systems : Indexing Part 2 - B+ trees

2015-04-04T19:15:26-07:00

In the previous section, Indexing Part 1, we’ve seen that building an index for frequently used attributes considerably increases the efficiency of a query.

In this section we’ll discuss the most widely used index implementation: the B+ Tree.

Each node of a B+ tree is a page, a block of data. A page is the transfer unit to disk.

We are already aware that a table spans many blocks of data. We can picture this by having the tree analogous to the table and the nodes analogous to the blocks of data, keeping in mind that each block of data contains multiple rows.

So far we’ve talked about two things stored on a disk: the index and the data. The index (in blue below) points at the data (in green below). We clearly notice now that creating an index takes extra space on the disk.

Let’s analyze the leaves of a B+ tree. We notice that they are structured as a linked list with 2 pointers:

one pointer towards the next node
one pointer towards the data.

Having this tree structure (and not only a sequential linked list structure) helps for insertion and deletion complexities: they have a logarithmic running time.

Say we have a B+ tree with a height h = 2. In this case, 3 blocks of data will be accessed:

the root
the leaf holding the pointers
the data page corresponding to the rows (referenced by the pointers from the leaf)

How do we recognize a B+ tree?

Let’s say d is the maximum number of references (pointers) that a node can have to its children.

In order to be a valid B+ tree, it has to respect the following invariants:

every leaf is at the same distance from the root
a node with k pointers to children contains k-1 keys
the root has at least 2 children (unless it is itself a leaf)
every non-leaf, non-root node has at least d/2 children (rounded up)
every leaf contains at least floor(d/2) keys
every key of the column appears in a leaf

Let’s see how inserting works in a B+ tree. Say we want to insert into a node X. The main algorithm is the following:

Step 1: If node X has empty space, insert (key, ref) into the node.

Step 2: If node X is already full:

2A) split X into 2 nodes: X1 and X2
2B) distribute the keys evenly between the 2 nodes
2C) if node X is a leaf: take the minimum value of the 2nd node X2 and insert it in the parent node by repeating the algorithm starting from Step 1
2D) if node X is a non-leaf: take the minimum value of the 2nd node X2, exclude it from the split and insert it in the parent node by repeating the algorithm starting from Step 1

Let’s go through a few examples. Assume each page holds 4 pointers (hence at most 3 keys) and we want to insert the following key values: 2, 3, 5, 7, 11, 17, 19, 23, 29, 31.

An empty node with 4 pointers per page will look like the following. We notice the 4 empty spaces at the edge for the 4 pointers:

1) If we want to insert 2, and then 3, and then 5, we just follow the Step 1 of the algorithm (node has empty space).

2) Now let’s insert 7.

What happened?

Step 2: we notice from part 1 that node X is already full.
2A): split node X into 2 nodes : X1 and X2
2B): distribute the keys evenly between the 2 nodes (we have 2 and 3 in node X1 and 5 and 7 in node X2)
2C) node X was indeed a leaf : the minimum value of 2nd node X2 is 5. We simply insert it into the parent node by repeating Step 1 from the algorithm (because yes, the parent node has empty space) .
Step 1 but at the parent’s node: there is space, stop.

3) Let’s insert 11.

What happened?

Step 1: we notice from part 2 that node X has space.

4) Insert 17.

What happened?

Step 1: we notice from part 3 that there is no more space, hence we continue to Step 2
Step 2A: we split the node X (containing 5, 7, 11) into 2 nodes
Step 2B: we distribute evenly between the 2 nodes (we have 5, 7 in X1 and 11, 17 in X2)
Step 2C: X was a leaf, hence we take the minimum value of the 2nd node X2 (11) and insert it into the parent
Step 1 but at the parent’s node: there is space, stop.

5) Insert 19.

What happened?

Step 1: there was space, we insert and then we stop.

6) Insert 23.

What happened?

Step 1: there is no more space, continue.
Step 2A: we split the node X (containing 11, 17, 19) into 2 nodes
Step 2B: we distribute evenly between 2 nodes
Step 2C: X was a leaf, we take the minimum of 2nd node (19) and we insert it into the parent
Step 1 but at the parent’s node: there is space, stop.

7) Insert 29.

What happened?

Step 1: there was space, we insert and then we stop.

8) Insert 31:

What happened?

Step 1: there is no space, continue.
Step 2A: we split the node X (containing 19, 23, 29) into 2 nodes
Step 2B: we distribute evenly between the 2 nodes (19, 23 in X1 and 29, 31 in X2)
Step 2C: X was a leaf, we take the minimum of the 2nd node (29) and we insert it into the parent
Step 1 but at the parent’s node: there is no space, continue.
Step 2A: we split the parent node X (containing 5, 11, 19) into 2 nodes
Step 2B: we distribute evenly between the 2 nodes (5, 11 in X1 and 19, 29 in X2)
Step 2D: since the parent node X was not a leaf, we exclude the minimum value of the 2nd node (19) and we insert it into the new parent
Step 1 but at the parent’s node: there is space, stop.
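The walkthrough above can be replayed with a small Python sketch. It is a simplified model written for this example, not a full B+ tree: it stores keys only (no row pointers, no linked list between leaves, no deletion), and the names Node, insert and levels are made up. It inserts first and splits on overflow, which gives the same result as checking for space beforehand.

```python
import bisect

MAX_KEYS = 3    # 4 pointers per node, hence 3 keys, as in the walkthrough


class Node:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.keys = []
        self.children = []              # child nodes (non-leaf nodes only)


def _insert(node, key):
    """Insert into the subtree; return (separator, new_right_node) on a split."""
    if node.leaf:
        bisect.insort(node.keys, key)
    else:
        i = bisect.bisect_right(node.keys, key)
        split = _insert(node.children[i], key)
        if split is None:
            return None
        separator, right = split
        node.keys.insert(i, separator)
        node.children.insert(i + 1, right)
    if len(node.keys) <= MAX_KEYS:
        return None                     # Step 1: there was space
    mid = len(node.keys) // 2           # Steps 2A/2B: split evenly
    right = Node(leaf=node.leaf)
    right.keys = node.keys[mid:]
    node.keys = node.keys[:mid]
    if node.leaf:
        return right.keys[0], right     # Step 2C: copy the minimum of X2 up
    right.children = node.children[mid + 1:]
    node.children = node.children[:mid + 1]
    return right.keys.pop(0), right     # Step 2D: move the minimum of X2 up


def insert(root, key):
    """Insert `key` and return the (possibly new) root."""
    split = _insert(root, key)
    if split is None:
        return root
    separator, right = split
    new_root = Node(leaf=False)
    new_root.keys = [separator]
    new_root.children = [root, right]
    return new_root


def levels(root):
    """Keys of every node, level by level, from the root down to the leaves."""
    out, level = [], [root]
    while level:
        out.append([n.keys for n in level])
        level = [c for n in level for c in n.children]
    return out


root = Node()
for k in [2, 3, 5, 7, 11, 17, 19, 23, 29, 31]:
    root = insert(root, k)
for level in levels(root):
    print(level)
# [[19]]
# [[5, 11], [29]]
# [[2, 3], [5, 7], [11, 17], [19, 23], [29, 31]]
```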

This simulator is pretty neat for testing your own implementations of B+ trees (many thanks to Joy & Graham from New Zealand).

Now why are we using B+ trees?

We notice that, unlike traversing a linked list, accessing any part of the tree requires visiting only a few nodes. Also, increasing the number of children per node decreases the depth of the tree, hence decreasing the number of “hops” (time-consuming disk reads) required.

Intro to Database Systems : Indexing

2015-04-02T16:18:13-07:00

We learned in the last lecture that when data is stored on disks, it is stored as a set of blocks of data (also called pages). A block is accessed as a whole, in its entirety. On the disk, blocks are structured like the nodes of a linked list:

each has a section containing data
each has a pointer to the location of the next node (next block/page)

We will demonstrate how useful indexes are through a bunch of examples. An index helps us find rows faster. Indexes are useful for queries on attributes that are used frequently. Indexing is just a fancy word for “sorting a column in order to efficiently query an element”.

Let’s start with some examples and define N as the number of blocks that the entire table requires.

We already know that searching on a column that isn’t sorted requires N/2 block accesses on average using Linear Search. Even worse, if the column doesn’t contain unique entries (say we have 2 people with the firstName “Dan”)… the entire table must be searched. That’s N block accesses (because what if the duplicated element is the last row in the column).

Now let’s assume that the column is sorted (and that’s what an index does). By using Binary Search, we will need about log2 N block accesses. Since the data is sorted, we won’t need to search the rest of the table for duplicate values.

Creating an index is basically creating a data structure that holds the column value and a pointer to the records.

Let’s consider a database that doesn’t have an index. For simplicity, we have a table with only two columns: firstName, lastName.

Say we have:

r = 5 000 000 as the number of rows in the table
R = 204 bytes as the fixed size of each row (record length)
B = 1024 bytes as the default block size (size of each data block)

How many rows fit in a disk block?

1024 / 204 = 5 rows per disk block (rounded down)

How many blocks does our table need?

5 000 000 / 5 = 1 000 000 blocks for the table.

This is our N. We know that if we query on a non-sorted column, we traverse N / 2 blocks on average, that is 1 000 000 / 2 = 500 000 blocks. If we allow duplicates, we will have 1 000 000 block accesses.

If the column is already sorted and then we search for an element, we need log2 (1 000 000) ≈ 20 block accesses.

We notice that, from 500 000 block accesses to 20 block accesses, the performance increase is substantial.
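The arithmetic above can be checked with a few lines of Python. This is only a sketch of the same calculation; the variable names are chosen here for illustration.

```python
import math

r = 5_000_000   # rows in the table
R = 204         # bytes per row (record length)
B = 1024        # bytes per block

rows_per_block = B // R                  # whole rows that fit in one block
N = math.ceil(r / rows_per_block)        # blocks needed to hold the table

linear_average = N // 2                  # unsorted column, unique values
linear_worst = N                         # unsorted column, duplicates allowed
binary_search = math.ceil(math.log2(N))  # sorted column

print(rows_per_block, N, linear_average, linear_worst, binary_search)
```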

Now that we’ve seen the impact of a query on a sorted column, let’s introduce an index. Let’s pretend we have firstName as an attribute. As we said, creating an index on a column implies creating a data structure that holds:

a value : in this case the firstName field, which takes 50 bytes.
a pointer to the record it relates to : a row pointer takes 4 bytes

Say we have:

r = 5 000 000 as the number of rows in the table
R = 54 bytes as the index record length
B = 1024 bytes as the default block size (size of data block)

How many index records fit in a disk block?

1024 / 54 = 18 records per disk block (rounded down)

How many blocks does our index need?

5 000 000 / 18 = 277 778 blocks

This is the number of blocks the index occupies, and the number we would have to scan in the worst case if its entries were not sorted.

Since we’ve used an index, the column is already sorted. Hence, when we query on it, we binary search through the index with about log2 (277 778) ≈ 19 block accesses (18.08, rounded up). The last step is to follow the pointer, hence 19 + 1 = 20 block accesses to find a particular element in an indexed column.

Again, we notice that from 277 778 block accesses to 20 block accesses, the performance increase is substantial.
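The same numbers can be reproduced in Python. Again, this is just a sketch of the calculation above with illustrative variable names.

```python
import math

r = 5_000_000        # rows in the table, one index record per row
value_size = 50      # bytes for the indexed value
pointer_size = 4     # bytes for the row pointer
B = 1024             # bytes per block

R_index = value_size + pointer_size                # index record length
records_per_block = B // R_index                   # whole records per block
index_blocks = math.ceil(r / records_per_block)    # blocks used by the index

search = math.ceil(math.log2(index_blocks))        # binary search in the index
total = search + 1                                 # plus one access to follow the pointer

print(R_index, records_per_block, index_blocks, search, total)
```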

Now why can’t we just sort the table and then search it, instead of searching through an indexed column?

Don’t forget that sorting actually changes the underlying physical order of the data. Indexing creates a separate index file that references rows in the active table, allowing direct access to those rows through a data structure, a B+ tree, that we will introduce in the next chapter.

Given that creating an index requires additional disk space (in our example, 277 778 extra blocks), this is potentially the main drawback: we win on time complexity, but we lose on space complexity.

Intro to Database Systems : Basic Perspectives on Disk and Buffer Management

2015-03-26T17:04:28-07:00

The Database Management System stores information at 3 levels of the memory hierarchy:

Primary storage - main memory (and cache) : for currently used data, it is fast and usually volatile.
Secondary storage - magnetic (“hard”) disk : for persistent data, it is relatively slow and nonvolatile. It stores the main database.
Tertiary storage - tape : nonvolatile older version of the data.

Now why can’t we store everything in the main memory, if it’s the fastest way? Because …

it costs too much : with $100 you can buy 4GB of RAM, but 2000GB of disk (500 times more)
it is volatile : we want to save data between runs, not only at run time!

Why can’t we store everything on tape?

disks allow random access, whereas tapes only allow sequential access

Disk blocks or pages are the main units for measuring retrieved data. They have a fixed usable size, usually being 512 bytes. We can read (from disk to RAM) or write (from RAM to disk) pages.

The seek time is the most time-consuming operation when accessing data on disk (from 1 to 20 msec). To compare, accessing data from the main memory is in the order of nanoseconds.

The lowest layers of the Database Management System are in charge of how space is used on the disk. Higher levels depend on the buffer manager (one of the lowest layers of the DBMS) to:

allocate/de-allocate a block of memory (page)
read/write a block of memory (page)

In other words, the buffer manager does 3 things:

1) manages the functions for reading data that’s in the RAM
2) brings pages (disk blocks) from the database into the buffer cache (also called the buffer pool)
3) writes modified pages back to the disk.

When data has to be loaded from the disk:

if there is an empty frame available in the pool, the buffer manager picks an empty frame
if there is no empty frame in the pool, the buffer manager picks a frame for replacement

A frame chosen for replacement must have a pin counter of 0, which makes it the equivalent of an empty frame. Once the requested page is loaded into the frame, its pin counter becomes 1.

Once the data contained in that frame is requested again, its pin counter is incremented. We can state this as a general rule: “when requesting a page that is already in the buffer, its pin counter is incremented”. After the operation is finished, we decrement it. If the page (the disk block) was modified, a dirty bit is set so that the frame is written back to the disk (update) before it is reused.

If there are no empty frames, only unpinned pages (pin counter = 0) can be chosen to accept loaded pages from the disk: this is the replacement policy. (This makes sense: if every frame is occupied and pinned, we have to wait for transactions to finish and decrement their pins. Only when a frame’s pin counter gets back to 0 is that frame unlocked and available for a new page.)
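The pin counter and dirty bit rules can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the class and method names are invented, the “disk” is a plain dictionary, the replacement policy simply takes the first unpinned frame, and a dirty page is written back at the moment its frame is replaced. A real DBMS would use a policy such as LRU or clock.

```python
class Frame:
    def __init__(self):
        self.page_id = None    # None means the frame is empty
        self.data = None
        self.pin_count = 0
        self.dirty = False


class BufferPool:
    def __init__(self, disk, num_frames):
        self.disk = disk                           # hypothetical disk: page_id -> data
        self.frames = [Frame() for _ in range(num_frames)]
        self.page_table = {}                       # page_id -> frame holding it

    def pin(self, page_id):
        """Return the frame holding page_id, loading it from disk if needed."""
        frame = self.page_table.get(page_id)
        if frame is None:
            frame = self._choose_frame()
            if frame.page_id is not None:          # evict the old page
                if frame.dirty:
                    self.disk[frame.page_id] = frame.data
                del self.page_table[frame.page_id]
            frame.page_id = page_id
            frame.data = self.disk[page_id]
            frame.dirty = False
            self.page_table[page_id] = frame
        frame.pin_count += 1                       # page is now in use
        return frame

    def unpin(self, page_id, dirty=False):
        """Release one use of the page and remember whether it was modified."""
        frame = self.page_table[page_id]
        frame.pin_count -= 1
        frame.dirty = frame.dirty or dirty

    def _choose_frame(self):
        for frame in self.frames:                  # prefer an empty frame
            if frame.page_id is None:
                return frame
        for frame in self.frames:                  # otherwise an unpinned one
            if frame.pin_count == 0:
                return frame
        raise RuntimeError("all frames are pinned")
```

With a pool of two frames, pinning two pages and requesting a third raises the error until one of the first two is unpinned, which is the waiting situation described above.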

A DBMS maintains its own buffer rather than relying on that of the OS, so that it controls when pages leave the buffer through the implementation of pin counters and replacement policies.

Intro to Database Systems - Part 14 to 16 : Triggers

2015-03-19T18:09:47-07:00

A trigger is a procedure that executes automatically as soon as specified changes occur in the DBMS.

A trigger has 3 parts:

an event : at what type of change should the procedure run? Usually it runs before/after/insteadOf an insert/update/delete.
an action : what happens if the trigger runs? (example: add student to scholarshipStudList)
a condition: under which condition is the procedure executed once the event has occurred? In other words, when does the action get executed? (example: add student to scholarshipStudList only when studGPA > 3.6).
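To make the three parts concrete, here is a small sketch that runs the scholarship example in SQLite through Python’s built-in sqlite3 module. The table layouts, the student rows and the trigger name addScholarshipStudent are assumptions made for this illustration, and SQLite’s trigger syntax differs slightly from other systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students(studName TEXT, studGPA REAL);
CREATE TABLE scholarshipStudList(studName TEXT);

CREATE TRIGGER addScholarshipStudent
AFTER INSERT ON students            -- event
FOR EACH ROW
WHEN NEW.studGPA > 3.6              -- condition
BEGIN
    -- action
    INSERT INTO scholarshipStudList(studName) VALUES (NEW.studName);
END;
""")

conn.executemany(
    "INSERT INTO students(studName, studGPA) VALUES (?, ?)",
    [("Ana", 3.9), ("Ben", 3.2), ("Cal", 3.7)],
)

scholarship = [
    row[0]
    for row in conn.execute(
        "SELECT studName FROM scholarshipStudList ORDER BY studName"
    )
]
print(scholarship)
```

Only the students whose GPA passes the condition end up in scholarshipStudList; the insert of “Ben” fires the event but the action is skipped.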

Let’s start by creating 4 tables and see how a trigger affects them:

CREATE TABLE test1(a1 INT);
CREATE TABLE test2(a2 INT);
CREATE TABLE test3(a3 INT NOT NULL AUTO_INCREMENT PRIMARY KEY);
CREATE TABLE test4( a4 INT NOT NULL AUTO_INCREMENT PRIMARY KEY, b4 INT DEFAULT 0 );

Now let’s define a trigger on table test1:

CREATE TRIGGER testReference BEFORE INSERT ON test1
FOR EACH ROW
BEGIN
    INSERT INTO test2 SET a2 = NEW.a1;
    DELETE FROM test3 WHERE a3 = NEW.a1;
    UPDATE test4 SET b4 = b4 + 1 WHERE a4 = NEW.a1;
END;

We notice that we have two parts.

1) An event part: at what type of change should the event happen?

BEFORE INSERT ON test1

2) An action part: what happens if the trigger runs?

BEGIN
    INSERT INTO test2 SET a2 = NEW.a1;
    DELETE FROM test3 WHERE a3 = NEW.a1;
    UPDATE test4 SET b4 = b4 + 1 WHERE a4 = NEW.a1;
END;

We notice that:

the action part is surrounded by the keywords BEGIN and END : we are using them to encapsulate more than one action.
there is no condition part. This parameter is optional.

Now let’s populate the tables and see how they look before touching table test1:

INSERT INTO test3 (a3) VALUES (NULL), (NULL), (NULL), (NULL), (NULL), (NULL), (NULL), (NULL), (NULL), (NULL);
INSERT INTO test4 (a4) VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0), (0);

Let’s insert into table test1 and see how the tables look now:

INSERT INTO test1 (a1) VALUES (1), (3), (1), (7), (1), (8), (4), (4);

Here is what the trigger does to the other tables before each row is inserted into test1:

1) INSERT INTO test2 SET a2 = NEW.a1;

each new a1 value is copied into a2, so test2 ends up as a copy of the values inserted into test1

2) DELETE FROM test3 WHERE a3 = NEW.a1;

we delete from a3 every value that also appears among the new a1 values. All the other values remain. (1 is present in a1, we delete it; 2 isn’t present, we keep it; 3 is present, as well as 4, 7 and 8, so we delete them.)

3) UPDATE test4 SET b4 = b4 + 1 WHERE a4 = NEW.a1;

we increment b4 every time (for each row) that we find a1 = a4. (For 1, a1 = a4 three times: increment b4 three times; for 2, there is no value 2 in a1: don’t increment b4; for 3, we find it once in a1: increment once; for 4, we find it twice: increment twice; for 7 and 8, we find each once: increment once.)
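The walkthrough can be replayed with Python’s built-in sqlite3 module. This is an adaptation, not the original statements: SQLite has no INSERT ... SET form and spells auto-increment differently, so the trigger body and table definitions are rewritten accordingly, and NULL (rather than 0) is inserted to let SQLite generate the keys 1 to 10.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test1(a1 INT);
CREATE TABLE test2(a2 INT);
CREATE TABLE test3(a3 INTEGER PRIMARY KEY AUTOINCREMENT);
CREATE TABLE test4(a4 INTEGER PRIMARY KEY AUTOINCREMENT, b4 INT DEFAULT 0);

CREATE TRIGGER testReference BEFORE INSERT ON test1
FOR EACH ROW
BEGIN
    INSERT INTO test2(a2) VALUES (NEW.a1);
    DELETE FROM test3 WHERE a3 = NEW.a1;
    UPDATE test4 SET b4 = b4 + 1 WHERE a4 = NEW.a1;
END;
""")

# Populate test3 and test4 with keys 1..10 before touching test1.
for _ in range(10):
    conn.execute("INSERT INTO test3(a3) VALUES (NULL)")
    conn.execute("INSERT INTO test4(a4) VALUES (NULL)")

# Each inserted row fires the trigger once.
conn.executemany(
    "INSERT INTO test1(a1) VALUES (?)",
    [(v,) for v in (1, 3, 1, 7, 1, 8, 4, 4)],
)

test2_rows = [row[0] for row in conn.execute("SELECT a2 FROM test2 ORDER BY rowid")]
test3_rows = [row[0] for row in conn.execute("SELECT a3 FROM test3 ORDER BY a3")]
test4_rows = dict(conn.execute("SELECT a4, b4 FROM test4"))

print(test2_rows)
print(test3_rows)
print(test4_rows)
```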

A trigger’s action can be performed at 2 levels:

FOR EACH STATEMENT : triggered once per statement, independently of the number of rows affected.
FOR EACH ROW : triggered when rows of a table are modified; it fires once for each modified row.

Some triggers only run when a WHEN condition holds. The condition can compare the new and old data by using the NEW and OLD keywords:

CREATE TRIGGER ratingIncrease AFTER UPDATE OF rating ON Skaters
REFERENCING OLD AS o NEW AS n
FOR EACH ROW
WHEN (n.rating > 1 + o.rating)
UPDATE Skaters SET rating = 1 + o.rating WHERE sid = n.sid
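Here is a sketch of the same idea running in SQLite through Python’s sqlite3 module. SQLite has no REFERENCING clause, so OLD and NEW are used directly and the action is wrapped in BEGIN ... END. The Skaters rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Skaters(sid INTEGER PRIMARY KEY, rating INT);
INSERT INTO Skaters(sid, rating) VALUES (1, 5), (2, 5);

CREATE TRIGGER ratingIncrease AFTER UPDATE OF rating ON Skaters
FOR EACH ROW
WHEN NEW.rating > 1 + OLD.rating
BEGIN
    -- Cap the increase at one point above the old rating.
    UPDATE Skaters SET rating = 1 + OLD.rating WHERE sid = NEW.sid;
END;
""")

conn.execute("UPDATE Skaters SET rating = 9 WHERE sid = 1")  # jump of 4: capped
conn.execute("UPDATE Skaters SET rating = 6 WHERE sid = 2")  # jump of 1: allowed

ratings = dict(conn.execute("SELECT sid, rating FROM Skaters"))
print(ratings)
```

The first update violates the condition’s limit, so the trigger rewrites the rating to the old value plus one. The second update does not satisfy the WHEN condition, so the action never runs.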