Chapter 3 [PDF]

Data Warehouse and OLAP Technology: An Overview Chapter 3 3.7 Exercises 1. State why, for the integration of multiple he

29 0 269KB

Report DMCA / Copyright

DOWNLOAD PDF FILE

Papiere empfehlen

Chapter 3

2 1 5MB Read more

Chapter 3 Probability

0 0 803KB Read more

Questions For Chapter 3

2 0 133KB Read more

Group 6 Group Assignment Chapter 3

5 0 258KB Read more

Chapter - 3 - BUSINESS TAX BANGGAWAN 2019

0 0 11MB Read more

Chapter 7

0 0 129KB Read more

Chapter - 1

0 0 401KB Read more

Chapter 9999

3 0 7MB Read more

Chapter 2

0 0 4MB Read more

Chapter 13

0 0 974KB Read more

Chapter 3 [PDF]

Author / Uploaded
Ziad Al-otaibi

0 0 0
Gefällt Ihnen dieses papier und der download? Sie können Ihre eigene PDF-Datei in wenigen Minuten kostenlos online veröffentlichen! Anmelden

Datei wird geladen, bitte warten...

Zitiervorschau

Data Warehouse and OLAP Technology: An Overview Chapter 3 3.7 Exercises 1. State why, for the integration of multiple heterogeneous information sources, many companies in industry prefer theupdate-driven approach (which constructs and uses data warehouses), rather than thequery-driven approach(which applies wrappers and integrators). Describe situations where the query-driven approach is preferable over the update-driven approach. Answer: For decision-making queries and frequently-asked queries, the update-driven approach is more preferable. This is because expensive data integration and aggregate computation are done before query processing time. In order for the data collected in multiple heterogeneous databases to be used in decision-making processes, data must be integrated and summarized with the semantic heterogeneity problems among multiple databases analyzed and solved. If the query-driven approach is employed, these queries will be translated into multiple (often complex) queries for each individual database. The translated queries will compete for resources with the activities at the local sites, thus degrading their performance. In addition, these queries will generate a complex answer set, which will require further ﬁltering and integration. Thus, the query-driven approach is, in general, ineﬃcient and expensive. The update-driven approach employed in data warehousing is faster and more eﬃcient since most of the queries needed could be done oﬀ-line. For queries that are used rarely, reference the most current data, and/or do not require aggregations, the query-driven approach would be preferable over the update-driven approach. In this case, it may not be justiﬁable for an organization to pay heavy expenses for building and maintaining a data warehouse, if only a small number and/or relatively small size databases are used; or if the queries rely on the current data, since the data warehouses do not contain the most current information. 2. Brieﬂy compare the following concepts. You may use an example to explain your point(s). a) Snowﬂake schema, fact constellation, starnet query model (b) Data cleaning, data transformation, refresh (c) Discovery-driven cube, multifeature cube, virtual warehouse Answer: (a) Snowﬂake schema, fact constellation, starnet query model The snowﬂake schema and fact constellation are both variants of thestar schema model, which consists of a fact table and a set of dimension tables; thesnowﬂake schema contains some normalized dimension tables, whereas thefact constellation contains a set of fact tables that share some common dimension tables. Astarnet query model is a query model (not a schema model), which consists of a set of radial lines emanating from a central point, where each radial line represents one dimension and each point (called a “footprint”) along the line represents a level of the dimension, and each step going out from the center represents the stepping down of a concept hierarchy of the dimension. The starnet query model, as suggested by its name, is used for querying and provides users with a global view of OLAP operations. (b) Data cleaning, data transformation, refresh Data cleaningis the process of detecting errors in the data and rectifying them when possible. Data transformationis the process of converting the data from heterogeneous sources to a uniﬁed data warehouse format or semantics.Refresh is the function propagating the updates from the data sources to the warehouse. (c) Discovery-driven cube, multi-feature cube, virtual warehouse Adiscovery-driven cube uses precomputed measures and visual cues to indicate data exceptions at all levels of aggregation, guiding the user in the data analysis process. Amulti-feature cube computes

complex queries involving multiple dependent aggregates at multiple granularities (e.g., to ﬁnd the total sales for every item having a maximum price, we need to apply the aggregate function SUM to the tuple set output by the aggregate functionMAX). Avirtual warehouse is a set of views (containing data warehouse schema, dimension, and aggregate measure deﬁnitions) over operational databases. 3. Suppose that a data warehouse consists of the three dimensionstime, doctor, andpatient, and the two measurescount andcharge, wherecharge is the fee that a doctor charges a patient for a visit. (a) Enumerate three classes of schemas that are popularly used for modeling data warehouses. (b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a). (c) Starting with the base cuboid [day, doctor, patient], what speciﬁcOLAP operations should be performed in order to list the total fee collected by each doctor in 2004? (d) To obtain the same list, write an SQL query assuming the data is stored in a relational database with the schemafee (day, month, year, doctor, hospital, patient, count, charge). Answer: (a) Enumerate three classes of schemas that are popularly used for modeling data warehouses. Three classes of schemas popularly used for modeling data warehouses are the star schema, the snowﬂake schema, and the fact constellations schema. (b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a). A star schema is shown in Figure 3.1. (c) Starting with the base cuboid [day, doctor, patient], what speciﬁcOLAP operations should be performed in order to list the total fee collected by each doctor in 2004? The operations to be performed are:

• Roll-up ontime fromday toyear. • Slice fortime=2004. • Roll-up onpatient from individual patient toall. (d) To obtain the same list, write an SQL query assuming the data is stored in a relational database with the schema. f ee(day, month, year, doctor, hospital, patient, count, charge).

selectdoctor, SUM(charge) fromfee whereyear=2004 group bydoctor

4. Suppose that a data warehouse forBig-University consists of the following four dimensions:student, course, semester, and instructor, and two measures countand avg grade. When at the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), theavg grade measure stores the actual course grade of the student. At higher conceptual levels,avg grade stores the average grade for the given combination. (a) Draw asnowﬂake schema diagram for the data warehouse. (b) Starting with the base cuboid [student, course, semester, instructor], what speciﬁcOLAP operations (e.g., roll-up fromsemester toyear ) should one perform in order to list the average grade ofCS courses for eachBigUniversity student.

(c) If each dimension has ﬁve levels (includingall), such as “student< major< status< university< all”, how many cuboids will this cube contain (including the base and apex cuboids)? Answer: (a) Draw asnowﬂake schema diagram for the data warehouse. A snowﬂake schema is shown in Figure 3.2. (b) Starting with thebase cuboid [student, course, semester, instructor], what speciﬁcOLAP operations (e.g., roll-up fromsemester toyear) should one perform in order to list the average grade ofCS courses for eachBigUniversity student. The speciﬁc OLAP operations to be performed are:

• Roll-up on course fromcourse id todepartment. • Roll-up on student fromstudent id touniversity. • Dice on course, student withdepartment=“CS” anduniversity = “Big-University”.

• Drill-down onstudent fromuniversity tostudent name. (c) If each dimension has ﬁve levels (includingall), such asstudent< major< status< university< all, how many cuboids will this cube contain (including the base and apex cuboids)? This cube will contain 54 = 625 cuboids. 5. Suppose that a data warehouse consists of the four dimensions,date, spectator, location, andgame, and the two measures,count andcharge, wherecharge is the fare that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate. (a) Draw astar schema diagram for the data warehouse. (b) Starting with the base cuboid [date, spectator, location, game], what speciﬁcOLAP operations should one perform in order to list the total charge paid by student spectators at GM Place in 2004? (c)Bitmap indexing is useful in data warehousing. Taking this cube as an example, brieﬂy discuss advantages and problems of using a bitmap index structure. Answer: (a) Draw astar schema diagram for the data warehouse. A star schema is shown in Figure 3.3. (b) Starting with the base cuboid [date, spectator, location, game], what speciﬁcOLAP operations should one perform in order to list the total charge paid by student spectators at GM Place in 2004? The speciﬁc OLAP operations to be performed are:

• Roll-up ondate fromdate id toyear. • Roll-up ongame fromgame id toall. • Roll-up onlocation fromlocation id tolocation name. • Roll-up onspectator fromspectator id tostatus. • Dice withstatus=“students”,location name=“GM Place”, andyear=2004

(c)Bitmap indexing is useful in data warehousing. Taking this cube as an example, brieﬂy discuss advantages and problems of using a bitmap index structure. Bitmap indexing is advantageous for low-cardinality domains. For example, in this cube, if dimension locationis bitmap indexed, comparison, join, and aggregation operations over locationare then reduced to bit arithmetic, which substantially reduces the processing time. Furthermore, strings of longlocation names can be represented by a single bit, which leads to signiﬁcant reduction in space and I/O. For dimensions with high-cardinality, such asdate in this example, the vector to present the bitmap index could be very long. For example, a 10-year collection of data could result in 3650 date records, meaning that every tuple in the fact table would require 3650 bits, or approximately 456 bytes, to hold the bitmap index. 6. [Tao Cheng] A data warehouse can be modeled by either astar schema or asnowﬂake schema. Brieﬂy describe the similarities and the diﬀerences of the two models, and then analyze their advantages and disadvantages with regard to one another. Give your opinion of which might be more empirically useful and state the reasons behind your answer. Answer: They are similar in the sense that they all have a fact table, as well as some dimensional tables. The major diﬀerence is that some dimension tables in the snowﬂake schema are normalized, thereby further splitting the data into additional tables. The advantage for star schema is its simplicity, which will enable eﬃciency, but it requires more space. For snowﬂake schema, it reduce some redundancy by sharing common tables: The tables are easy to maintain and save some space. However, it is less eﬃcient, and the saving of space is negligible in comparison with the typical magnitude of the fact table. Therefore, empirically, star schema is better simply because nowadays, eﬃciency has higher priority over space, if it is not too huge. Sometimes in industry, to speed up processing, people “denormalize data from a snowﬂake schema into a star schema” [1]. Another option here is that “some practitioners use a snowﬂake schema to maintain dimensions, and then present users with the same data collapsed into a star” [2]. Some references for the answer to this question: Oracle Tip: Understand the diﬀerence between star and snowﬂake schemas in OLAP. [1] http://builder.com.com/5100-6388-5221690.html Star vs. Snowﬂake Schemas.