Graded by Laurentiu Nicolae. I have also posted the sollutions to the written assignment (in this page, after the grades table), have a look at them.
For any questions/comments contact me by email at nld@liacs.nl . You can also visit me in the office (room 122) on:
Wednesday, December 18th – 13:00-15:00
Thursday, December 19th – 15:00 – 17:00
Friday, December 20th – 13:00 – 15:00
THE GRADES:
|
Name 1 |
Name 2 |
Ex. 1 abcdefg |
Ex. 2 ab(c1)(c2)(c3)(c4) |
Ex. 3 abcd |
Ex. 4 |
Grade |
|
Alexander Nezhinsky |
Jelmer Cormont |
1110111 |
110000 |
0000 |
1 |
- |
|
Arjen Stolk |
|
1111110 |
111111 |
0010 |
1 |
++ |
|
Bogumila Sobolewska |
Kwie Min Wong |
1111110 |
110000 |
0000 |
1 |
- |
|
Derk Geene |
|
1110110 |
110000 |
0000 |
1 |
- |
|
Geoffrey de Vlugt |
|
1111111 |
111111 |
0011 |
0 |
++ |
|
Richard Velden |
Sander van der Mar |
1111110 |
111101 |
0000 |
1 |
+ |
|
Johan de Ruiter |
Daniel Worm |
1000110 |
111111 |
0000 |
1 |
+ |
|
Klaske van Vuurden |
|
1011111 |
111111 |
1011 |
0 |
++ |
|
Matthijs van der Zon |
Harmen van der Spek |
1111111 |
110100 |
0011 |
1 |
+ |
|
Nadia Ramjiawan |
Aarti Khoesial |
1110110 |
110000 |
0000 |
0 |
- |
|
Taco Buitenhuis |
|
1110111 |
110010 |
0011 |
1 |
+ |
|
Thomas Steenbergen |
Mark van Sandijk |
1111111 |
110000 |
0011 |
1 |
+ |
|
Marine Ruhamanya |
|
1110111 |
110010 |
0010 |
1 |
+ |
How to read the table:
For each exercise you had several subpoints. The grade in each exercise column is a bit mask that describes what you did on that particular exercise (i.e. 101101 for Ex. 2 means you got {a,c1,c2,c4} right and {b,c3} wrong). These are orientative and given for your benefit, so that you can check your work and see where you need to improve. In the final grade I also took into account partial results – such as correct key determination and incorrect decomposition in Ex. 2 c), for instance.
THE SOLUTIONS:
1. For this assignment you had to write several SQL queries. When correcting this exercise, I took into account the queries that returned the desired result, regardless of their optimality. I also ignored obvious spelling mistakes and minor changes in field names.
a) Return the eid of the employees with the highest salary.
Everyone solved this one. The recommended solution was:
SELECT E.eid, MAX(E.salary) FROM Employees E
using the aggregate operator MAX(). You can also use a nested query:
SELECT E1.eid FROM Employees E1 WHERE E1.salary >=
(SELECT MAX(E2.salary) FROM Employees E2 )
however, the first solution is the optimal one.
b) Return the eid of the employees with the next to highest salary.
Some of you had troubles with this one. The solution is:
SELECT E1.eid FROM Employees E1 WHERE E1.salary =
(SELECT MAX(E2.salary) FROM Employees E2 WHERE E2.salary <>
(SELECT MAX(E3.salary) FROM Employees E3) )
The common mistake was to replace the equal sign (=) from the first query with a bigger or equal (>=) sign. The query would return now the employees that have both the largest and the second largest salary in our company.
c) Give the aname of all aircrafts for which the certified pilots make more than 100.000
This was just a join with one extra condition; no major problems here. Some of you just forgot to use the “DISTINCT” clause. The solution is:
SELECT DISTINCT A.aname FROM Aircraft A, Certified C, Employees E
WHERE A.aid = C.aid AND E.eid = C.eid AND E.salary > 100000
d) Give the eid of all pilots certified for more than 3 aircrafts, along with the maximum cruising range of the aircrafts for which they are certified.
This is a query where you had to use a GROUP BY / HAVING clause and the MAX and COUNT operators. The most common mistakes I encountered were: forgetting the MAX operator for the cruising_range field and selecting the max cruising range from the whole Aircraft table. The solution is:
SELECT C.eid, MAX(A.cruising_range) FROM Aircraft A, Certified C
WHERE A.aid = C.aid GROUP BY C.eid HAVING COUNT(*) > 3
e) Give the aname for the aircrafts with a cruising range larger than 1000 along with the average salary of all pilots certified to fly them.
This one was also a join with an extra condition, in which you had to use the AVG operator and the GROUP BY clause. The result is:
SELECT A.aname, AVG(E.salary) from Aircraft A, Certified C, Employees E
WHERE A.aid = C.aid AND E.eid = C.eid AND A.cruising_range > 1000
GROUP BY A.aid
f) Give the aid of all aircraft for which the cruising range is larger than the distance from Los Angeles to Chicago.
Another simple query; the only “trick” you had to think of was the use of the DISTINCT clause in order to eliminate duplicates.
SELECT DISTINCT A.aid FROM Aircraft A, Flights F
WHERE F.from = ‘Los Angeles’ AND F.to = ‘Chicago’
AND A.cruising_range > F.distance
g) Determine the difference between the average salary of the pilots and the average salary of all employees.
This was a little bit tricky. At first you would be tempted to write something like this:
SELECT AVG(P.salary) – AVG(E.salary) AS difference FROM Employees P, Employees E, CertifiedC
WHERE P.eid = C.eid
However, this is not correct, due to the fact that in the Certified table you may find duplicate entries for one eid, Therefore, the correct answer is the following:
SELECT AVG(P.salary) – AVG(E1.salary) AS difference FROM Employees E1,
(SELECT DISTINCT E2.eid, E2.salary from Employees E2, Certified C
WHERE E2.eid = C.eid) AS P
2. a) Just a little theory. I compared your answers with the ones from the book (see below). There were no notable problems here.
A decomposition of a relation schema R consists of replacing the relation schema by two (or more) relation schemas that each contain a subset of the attributes of R and together include all attributes in R.
Let R be a relation schema and let F be a set of functional dependencies over R. A decomposition of R into two schemas with attribute sets X and Y is said to be a lossless-join decomposition with respect to F if for every instance r of R that satisfies the dependencies in F, πX(r) >< πY(r) = r.
All examples were correct.
b) What are the advantages and disadvantages of BNCF when compared with 3NF?
BNCF ensures that no redundancy can be detected using FD information alone. It is thus the most desirable normal form (from the redundancy point of view). Unfortunately BNCF does not guarantee dependency preserving decompositions – which is actually the reason why 3NF was introduced in the first place. The 3NF decomposition weakens the BNCF restrictions just enough to guarantee that every 3NF relation schema can be decomposed into a collection of 3NF relations using only decompositions that guarantee certain desirable properties.
c) There were some confusions over the definition of a BNCF relation; for those of you who did not solve this correctly (see results), I recommend reviewing the chapter “Schema Refinements and Normal Forms” in the book.
I) C->D, C->A, B->C
Candidate key: B
R is not in BNCF
Decomposition: Fbc = {B->C}; Facd = {C->A, C->D}
II) B->C, D->A
Candidate key: BD
R is not in BNCF
Decomposition: Fbc = {B->C}, Fad = {D->A}
III) ABC->D, D->A
Candidate keys: ABC, BCD
R is not in BNCF (actually, R is 3NF)
There is no dependency preserving decomposition for this R
IV) A->B, BC->D, A->C
Candidate key: A
R is not in BNCF
Decomposition: Fabc = {A->B, A->C}, Fbcd={BC->D}
3. Query evaluations
One of the common mistakes that you do here is that you do not take into account the significance of the numbers you operate with. For instance, you are not allowed to have 6.63 pages occuppied by a relation; this number has to be rounded to 7 before proceeding with the computation. Likewise, 800b tuples will fit as one per page; you're not allowed to divide the number of bytes per page by 800 and proceed in your computations with the REAL value. For a 1024b page, you get a 1.28 factor, which multiplied by 100 pages will return you 128 tuples. However, the correct answer is 100 tuples; you cannot hold 128 tuples of size 800b in 100 pages.
NOTE: If you got a 0 in one of the columns for Exercise 3, then you most likely did not obtained the same result that I did. However, I decided to take into account also the partial results, and you got extra points for observing the issues described earlier.
BNL with 3 buffer pages is just SNL, and the projection costs are the same as well. Therefore we can use the following formulas:
Cost of projection with external sort = 2 x Npages * log[Nbuf_pages – 1](Npages)
Cost of join = Npages_iouter_loop + (Npages_outer_loop * Npages_inner_loop) / (Nbuf_pages – 2)
where Npages_inner_loop > Npages_outer_loop
a) Cost of projection followed by join:
Project S = 200 * [log2(100)] = 1400 I/O; the resulting size of the inner relation is 50 pages
Project R = 30 * [log2(10)] = 120 I/O; the resulting size of the inner relation is 6 pages
Join = 6+50*6/(3-2) = 306 I/O
Total cost = 1826 I/O
b) Cost of join followed by projection:
Join = (10 + 10*100) = 1010 I/O; we get 200 tuples of 800b each => 200 pages
Projection (using 3 buffer pages):
1st pass: 200 pages are scanned and 100 pages are written (unwanted atributes are elliminated on the fly to produce tuples of 450b each)
Merging pairwise the runs from the previous step in 6 additional passes
Total projection cost = 200+100+2*6*100 = 1500 I/O (including the cost of writing out the 100 pages, namely 100 I/O. We will remove this cost from the total cost estimation).
Total cost = 1010+1500 – 100 = 1410 I/O
c) Cost of join and projection on the fly:
This is equivalent to a projection cost of 0 in the previous exercise, so the only cost is given by the join: 1010 I/O
d) If we increase the number of buffers to 11 it will affect all calculations:
For a) – Total cost = [log10(100) * 200] + (6 + 6/9 * 50) = 440 I/O
For b) – Total cost = (10 + 10/9 * 100) + (log10(200)*400) = 210 + 200 = 1410 I/O
For c) – Total cost = (10+10/9 * 100) = 210 I/O
4. Are there useful queries for a relational database that cannot be expressed in SQL, relational algebra and relational calculus?
Yes. If we have a a to compute a join with a static upper bound (say k) over a relation, we can easily write an algebra or SQL query to fit this purposes. But if the upper bound is dynamic, then we cannot write such a query, because k is not known when writing the query.
Example: select all flights from Los Angeles to Chicago, regardless of the number of connections (dynamic upper bound)