原文地址:数据库表结构设计

Introduction

根据需求,设计和实现一个数据库的表结构,练习DDL的语法。

DDL

Overview

This assignment will teach students how to create and design databases. Unlike with other databases, you may write to your own database (e.g. you can create tables user_xxxxxx.random_table_name).

Please submit (1) code and (2) slides or PDF writeup to the course Moodle. The writeup should show the output and the code statement. Please answer straightforwardly.

I am trying to strike a balance between attribution and encouragement of learning. Teams of three allowed, but students who do solo work or on teams of 2 are likelier to benefit if their total course grade is on the margin. If you are in a team of three, please do not rely on the same partners more than twice.

Creating a table

  1. Create a copy of crsp.msf in your local space without the keys

    • a. How big is the table?
    • b. Report the run time for a query that calculates the average turnover, return, and spread for all stockyears
    • c. Show the EXPLAIN for the query
  2. Create a copy of crsp.msf in your local space without the keys

    • a. How big is the table?
    • b. Report the run time for a query that calculates the average turnover, return, and spread for all stockyears
    • c. Show the EXPLAIN for the query - why is this query more efficient than the previous one?
  3. Create a table from scatch based on crsp.msf

    • a. I have put crsp.msf in a file. Using R or Python, create a copy of the table
    • b. "Checksum": formal checksum methodologies exist to check data integrity, but let's do some crude checks. Show me that (1) the number of rows is the same, and (2) the cumulative lifetime returns to Microsoft and Amazon.
  4. A new file - design a schema

    • a. Please write the schema of this table based on inspecting the field lengths such that you have no data loss. Please show me the schema.
    • b. Please perform a check to show that the table contents were preserved (e.g. minimal data loss).

Altering a table

I just created a horrific table in homework2.crap_table. Create a copy in your workspace. Fix the following columns such that the most spaceefficient data type is used

  • CUSIP, PERMNO, COMNAM, TICKER, NCUSIP, RET

    • Assume that the largest number or character observed is the largest you will observe in the future
    • Hints: For the character columns, inspect the length. Assume that the maximum character length observed is indeed the longest character length observed in the field. For the numeric columns, decide which numbers should be integers by examining those which lose information when rounded. Is there data loss?
  • Alter table

    • Drop redundant variations of the return columns
    • Create a financial ratio - debt to assets where assets > 0
    • Change the dividend column to decimal from integer
    • Update the empty column dividend
    • Create a copy of the returns columns so that you have at most 5 digits (although there is data loss). Report the average difference and average absolute difference.

Restructuring an existing table

  • Separate the monthly stock return data, annual data, and quarterly data, and companylevel data.

    • For simplicity, we assume the company name, share code, hexcd, industry code (hsiccd), and ticker/tsymbol NEVER change
    • This should yield four tables
    • Create an index on each table such that an inner join on four tables to create the original table is possible
    • Show me that inner join statement

      • It must produce the same # of rows
      • The explain must be optimized using keys
    • Show me the table structure (SHOW CREATE TABLE) of the four new tables
    • What is the total size of the four new data? How much did you save?

Draw a schema diagram a schema diagram example is here.
Draw a schema diagram of your new table

Extra credit

  • Download the following file: slide_output/data/dsf.csv.gz

    • The data underlies the table crsp.dsf minus some fields
    • Create a logical table structure
    • Loop through the file and upload the file into a compressed MySQL table via R or Python

      • Report your code
    • Key the table with a clustered index
    • Take some steps to verify the similarity of the data to crsp.dsf
    • PLEASE DROP THE TABLE so as to not take up all the space on the server
  • Recreate normalized_reprisk from reprisk. Likely requires a bunch of R/Python queries

(本文出自csprojectedu.com,转载请注明出处)


csprojectedu
751 声望201 粉丝

Microsoft, ACMer, 现BAT全栈工程师。


引用和评论

0 条评论