Chapter 3 [PDF]

  • 0 0 0
  • Gefällt Ihnen dieses papier und der download? Sie können Ihre eigene PDF-Datei in wenigen Minuten kostenlos online veröffentlichen! Anmelden
Datei wird geladen, bitte warten...
Zitiervorschau

Data Warehouse and OLAP Technology: An Overview Chapter 3 3.7 Exercises 1. State why, for the integration of multiple heterogeneous information sources, many companies in industry prefer theupdate-driven approach (which constructs and uses data warehouses), rather than thequery-driven approach(which applies wrappers and integrators). Describe situations where the query-driven approach is preferable over the update-driven approach. Answer: For decision-making queries and frequently-asked queries, the update-driven approach is more preferable. This is because expensive data integration and aggregate computation are done before query processing time. In order for the data collected in multiple heterogeneous databases to be used in decision-making processes, data must be integrated and summarized with the semantic heterogeneity problems among multiple databases analyzed and solved. If the query-driven approach is employed, these queries will be translated into multiple (often complex) queries for each individual database. The translated queries will compete for resources with the activities at the local sites, thus degrading their performance. In addition, these queries will generate a complex answer set, which will require further filtering and integration. Thus, the query-driven approach is, in general, inefficient and expensive. The update-driven approach employed in data warehousing is faster and more efficient since most of the queries needed could be done off-line. For queries that are used rarely, reference the most current data, and/or do not require aggregations, the query-driven approach would be preferable over the update-driven approach. In this case, it may not be justifiable for an organization to pay heavy expenses for building and maintaining a data warehouse, if only a small number and/or relatively small size databases are used; or if the queries rely on the current data, since the data warehouses do not contain the most current information. 2. Briefly compare the following concepts. You may use an example to explain your point(s). a) Snowflake schema, fact constellation, starnet query model (b) Data cleaning, data transformation, refresh (c) Discovery-driven cube, multifeature cube, virtual warehouse Answer: (a) Snowflake schema, fact constellation, starnet query model The snowflake schema and fact constellation are both variants of thestar schema model, which consists of a fact table and a set of dimension tables; thesnowflake schema contains some normalized dimension tables, whereas thefact constellation contains a set of fact tables that share some common dimension tables. Astarnet query model is a query model (not a schema model), which consists of a set of radial lines emanating from a central point, where each radial line represents one dimension and each point (called a “footprint”) along the line represents a level of the dimension, and each step going out from the center represents the stepping down of a concept hierarchy of the dimension. The starnet query model, as suggested by its name, is used for querying and provides users with a global view of OLAP operations. (b) Data cleaning, data transformation, refresh Data cleaningis the process of detecting errors in the data and rectifying them when possible. Data transformationis the process of converting the data from heterogeneous sources to a unified data warehouse format or semantics.Refresh is the function propagating the updates from the data sources to the warehouse. (c) Discovery-driven cube, multi-feature cube, virtual warehouse Adiscovery-driven cube uses precomputed measures and visual cues to indicate data exceptions at all levels of aggregation, guiding the user in the data analysis process. Amulti-feature cube computes

complex queries involving multiple dependent aggregates at multiple granularities (e.g., to find the total sales for every item having a maximum price, we need to apply the aggregate function SUM to the tuple set output by the aggregate functionMAX). Avirtual warehouse is a set of views (containing data warehouse schema, dimension, and aggregate measure definitions) over operational databases. 3. Suppose that a data warehouse consists of the three dimensionstime, doctor, andpatient, and the two measurescount andcharge, wherecharge is the fee that a doctor charges a patient for a visit. (a) Enumerate three classes of schemas that are popularly used for modeling data warehouses. (b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a). (c) Starting with the base cuboid [day, doctor, patient], what specificOLAP operations should be performed in order to list the total fee collected by each doctor in 2004? (d) To obtain the same list, write an SQL query assuming the data is stored in a relational database with the schemafee (day, month, year, doctor, hospital, patient, count, charge). Answer: (a) Enumerate three classes of schemas that are popularly used for modeling data warehouses. Three classes of schemas popularly used for modeling data warehouses are the star schema, the snowflake schema, and the fact constellations schema. (b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a). A star schema is shown in Figure 3.1. (c) Starting with the base cuboid [day, doctor, patient], what specificOLAP operations should be performed in order to list the total fee collected by each doctor in 2004? The operations to be performed are:

• Roll-up ontime fromday toyear. • Slice fortime=2004. • Roll-up onpatient from individual patient toall. (d) To obtain the same list, write an SQL query assuming the data is stored in a relational database with the schema. f ee(day, month, year, doctor, hospital, patient, count, charge).

selectdoctor, SUM(charge) fromfee whereyear=2004 group bydoctor

4. Suppose that a data warehouse forBig-University consists of the following four dimensions:student, course, semester, and instructor, and two measures countand avg grade. When at the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), theavg grade measure stores the actual course grade of the student. At higher conceptual levels,avg grade stores the average grade for the given combination. (a) Draw asnowflake schema diagram for the data warehouse. (b) Starting with the base cuboid [student, course, semester, instructor], what specificOLAP operations (e.g., roll-up fromsemester toyear ) should one perform in order to list the average grade ofCS courses for eachBigUniversity student.

(c) If each dimension has five levels (includingall), such as “student< major< status< university< all”, how many cuboids will this cube contain (including the base and apex cuboids)? Answer: (a) Draw asnowflake schema diagram for the data warehouse. A snowflake schema is shown in Figure 3.2. (b) Starting with thebase cuboid [student, course, semester, instructor], what specificOLAP operations (e.g., roll-up fromsemester toyear) should one perform in order to list the average grade ofCS courses for eachBigUniversity student. The specific OLAP operations to be performed are:

• Roll-up on course fromcourse id todepartment. • Roll-up on student fromstudent id touniversity. • Dice on course, student withdepartment=“CS” anduniversity = “Big-University”.

• Drill-down onstudent fromuniversity tostudent name. (c) If each dimension has five levels (includingall), such asstudent< major< status< university< all, how many cuboids will this cube contain (including the base and apex cuboids)? This cube will contain 54 = 625 cuboids. 5. Suppose that a data warehouse consists of the four dimensions,date, spectator, location, andgame, and the two measures,count andcharge, wherecharge is the fare that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate. (a) Draw astar schema diagram for the data warehouse. (b) Starting with the base cuboid [date, spectator, location, game], what specificOLAP operations should one perform in order to list the total charge paid by student spectators at GM Place in 2004? (c)Bitmap indexing is useful in data warehousing. Taking this cube as an example, briefly discuss advantages and problems of using a bitmap index structure. Answer: (a) Draw astar schema diagram for the data warehouse. A star schema is shown in Figure 3.3. (b) Starting with the base cuboid [date, spectator, location, game], what specificOLAP operations should one perform in order to list the total charge paid by student spectators at GM Place in 2004? The specific OLAP operations to be performed are:

• Roll-up ondate fromdate id toyear. • Roll-up ongame fromgame id toall. • Roll-up onlocation fromlocation id tolocation name. • Roll-up onspectator fromspectator id tostatus. • Dice withstatus=“students”,location name=“GM Place”, andyear=2004

(c)Bitmap indexing is useful in data warehousing. Taking this cube as an example, briefly discuss advantages and problems of using a bitmap index structure. Bitmap indexing is advantageous for low-cardinality domains. For example, in this cube, if dimension locationis bitmap indexed, comparison, join, and aggregation operations over locationare then reduced to bit arithmetic, which substantially reduces the processing time. Furthermore, strings of longlocation names can be represented by a single bit, which leads to significant reduction in space and I/O. For dimensions with high-cardinality, such asdate in this example, the vector to present the bitmap index could be very long. For example, a 10-year collection of data could result in 3650 date records, meaning that every tuple in the fact table would require 3650 bits, or approximately 456 bytes, to hold the bitmap index. 6. [Tao Cheng] A data warehouse can be modeled by either astar schema or asnowflake schema. Briefly describe the similarities and the differences of the two models, and then analyze their advantages and disadvantages with regard to one another. Give your opinion of which might be more empirically useful and state the reasons behind your answer. Answer: They are similar in the sense that they all have a fact table, as well as some dimensional tables. The major difference is that some dimension tables in the snowflake schema are normalized, thereby further splitting the data into additional tables. The advantage for star schema is its simplicity, which will enable efficiency, but it requires more space. For snowflake schema, it reduce some redundancy by sharing common tables: The tables are easy to maintain and save some space. However, it is less efficient, and the saving of space is negligible in comparison with the typical magnitude of the fact table. Therefore, empirically, star schema is better simply because nowadays, efficiency has higher priority over space, if it is not too huge. Sometimes in industry, to speed up processing, people “denormalize data from a snowflake schema into a star schema” [1]. Another option here is that “some practitioners use a snowflake schema to maintain dimensions, and then present users with the same data collapsed into a star” [2]. Some references for the answer to this question: Oracle Tip: Understand the difference between star and snowflake schemas in OLAP. [1] http://builder.com.com/5100-6388-5221690.html Star vs. Snowflake Schemas.