Migrating the Northwinds SQL Database to a MultiValue Database- Part II
The battle between SQL data models and MultiValue data models has raged for years. That won't end any time soon. However, as NoSQL-style databases are starting to become routinely accepted by CEO and CIO for enterprise level data stores, the MultiValue model is being looked at very closely again. We look good in the spotlight. But, we could look even better.
Since MultiValue databases were specifically designed to give you flexibility in how the data is stored, it isn't always clear in a given situation, which choices might be the best. Non-optimal choices will work but the right decisions when architecting the data will generally reduce the amount of code and complexity of your application software. It will also have a positive impact on speed.
Many developers will use the same structure they have used for the last thirty years. Sometimes that is 1NF (First Normal Form) which is considered the traditional data model. Other times, developers will choose the NF2 (Non-First Normal Form) option which is also inherent in the MultiValued model. More often than not, the two models will be mixed. While a carefully developed mix is ideal, far too often there is no clear rhyme or reason. Our technology choices, even in the architectural level, are based on what we've experienced, on what we've seen in production. Generally, that's not what is optimal.
This inconsistency can cause developers coming from other environment to resist using the MultiValue data model because it is not clear when and why it should be used. When working on this Northwinds migration product, I stumbled across this issue in my own mind. And that's our challenge as an industry. We need to examine our choices and maximize the value of our assets.
Battling the Existing Databases
When you are employed as a developer in any enterprise for the first time, achieving an understanding of how the existing business rules and database work together can be difficult. It is the same when you move to a new data model for the first time. Going from a columnar database to SQL, or SQL to MultiValue, can make your new job feel, in many ways, like a first job.
If the database is poorly organized, it is easy for that developer to think that their only example — 1oo% of their experience with MultiValue — is the right way. We build a future by leaving better examples to be discovered.
This limited experience evolves into a standard that is followed when enhancing the existing systems. It becomes the "That's the way we've always done it." Sometimes a developer gets a project that allows them to define the data structure to use, but if all they've seen is layouts based on decades-old software practices, they will likely continue creating badly designed tables. The infernal feedback loop of "Hey, it works" will make it harder to convince them of the advantages of doing things differently.
For those new to the MultiValue model and software, this can be confusing and frustrating due to the lack of understanding of the "why" behind so many of the decisions they encounter.
For those new-to-MV people reading this, from my experience, let me offer you a likely answer: Many times the choices were made based on the resource available to the computer at the time. You must always remember when looking at existing MultiValue software that it was possibly developed during a time when hard drives cost $200 to $500 per 10MB of storage, and sometimes more. RAM and CPU cost even more, so the focus was on how to conserve resources. Looking at modern hardware, it is clear that the speed/storage/readability trade-offs have changed. As the hardware landscape has been re-invented, the rules of software design need to change. Inertia, our habit of doing things by habit instead of by design, keeps us from moving things forward.
As a result, most MultiValue software applications are designed to maximize the IO because it was assumed that low on hardware resources would be the bottleneck. The practice of writing software for low resource hardware is still well established with experienced MultiValue developers. So you can easily encounter brand new code and tables that still reflect this line of thought.
Battling the Traditional Model - 1NF (First Normal Form) and the Spreadsheet Data Storage
The traditional model that is taught to new developers for storing data is First Normal Form (1NF). The main reason for this is that 1NF looks and acts very much like a spreadsheet, CSV, or fixed length data file. It is easy to grasp in the abstract.
These storage models are very efficient when all you need to do is process files sequentially. This allows a resource-limited machines to process large amounts of data very quickly. The largest disadvantage to this storage model is that you always have to start at the top of the table and work your way to the bottom to find what you are looking for. The more data you put into the table the longer it takes. Of course, most database systems, MV or not, have developed work-arounds, genuine optimizations, and horrific kludges address this issue. It still isn't as efficient as soing it right.
Not only are you processing a large chunk of the table to find your data, you may be incurring multiple IO requests for every record you are looking — and rejecting — until you find the ones you want.
The typical solution to solve this is to use indexes which allow a developer to cut down on the number of IO request because the indexes provide an offset pointer to the exact location in the table the record the application is looking for.
While indexes still incur IO when the application is looking for information in the index, they cut down on the total number of IO hits for large tables. It is, when you think about it counter intuitive.
Reading the first record in a table using an index is a minimum two-read operation: read the index to find the item and then read the actual record. Worse, if the index reference is many rows deep inside the index table, you might incur hundreds of index reads to get that one record from the primary table. However, the further down in the primary table the data is, the more reads an index saves you. Additionally, when you look for multiple records with a common value, an index creates a massive savings.
The other solution used to solve the core IO issue with 1NF storage is to throw RAM at the problem. Now what happens is the index is loaded completely into memory so that it can be accessed faster, which cuts down on disk IO (which is the slowest type of IO). If your disks are SSD (non-mechanical) then the speed differential may be less impressive because SSD is already faster than traditional disk.
Both of these solutions, RAM and indexing, are only stop-gap solutions in the long run. The TCO (Total Cost of Ownership) of "throw more hardware at it" can become cost prohibitive. While RAM is cheap and disk is cheap compared to years back, enough of either adds up to significant money.
Creating a system that can handle 1TB of RAM so you can place everything from disk into RAM for the best performance would be costly. Using 2019 PC-level pricing for 16x64GB SDRAM to equate to 1TB of RAM, it would cost around $6,400 for just the RAM. This doesn't even take into account the cost of other hardware like a motherboard that could support 1TB of RAM, or the software that could address that amount of RAM efficiently.
Battling the Traditional Model - 1NF (First Normal Form) and data relationships.
The hardware is the biggest limitation for using traditional data stores like SQL which force you use 1NF, but there is another big gotcha with the Traditional data model.
The rules of 1NF are:
- Table should have no duplicates - No duplicate columns or duplicate records (rows)
- There must be no significance in the order of either columns or records (rows)
- In every record (row), each column must have a single value.
Following these rules mean that storing related information has to be done in more than one table. These rules make lots of sense in simple, well defined, and never changing data structures provided in academic and training classes. In most business environment you will quickly find its limitations.
It is also conceptually hard. When we started with abstract data concepts, the "one value occupies one spot" model was easy to grasp. When you loo at real world data you begin to see that this model fits poorly in nearly every case.
The classic example of this limitation is the customer phone number [Figure 1]. If you need to associate more than one phone number with a customer record, then following the first and third rule of 1NF, you are required to create a separate table to store phone numbers outside of the primary Customer Table.
You would use the following SQL statement to retrieve the customer record plus the 3 phone numbers:
SELECT Cusomter.CustomerID, Customer.CustomerName, CustomerPhone.PhoneNo, CustPhone.ContactType FROM Customer WHERE Customer.CustomerID = "1234" INNER JOIN CustomerPhone ON Customer.CustomerID = CustomerPhone.CustomerID
In order to get all three phone numbers for record 1234, you have a minimum of four IO request: one for the main customer record, and one more for each of the phone numbers. This doesn't take into account any additional IO needed to find the main customer record, or each of the phone numbers. And don't forget the IO for indexing.
If you look at the second rule for 1NF, it says "There must be no significance in the order of either columns or rows". That means that each of the phone numbers may or may not be stored one right after the other as shown in Figure 1. The customer phone table may have a thousand phone numbers in it, and phone numbers 1, 2, and 3 might be stored sequentially next to each other or five hundred records -- five hundred IO requests — apart.
The Power of Traditional Model - 1NF (First Normal Form) and data relationships.
While 1NF has a weakness when working with multiple values of the same type in a record, it is allows for the handling of complex data relationships.
The Northwinds Order and OrderDetail tables are primary examples of this [Figure 2]. Since an order can consist of Order data, like customer and where the order is being shippted to, and detail data, more than one products and associated quantities, the developer is forced to separate the order into two separate tables: Order and OrderDetail.
Since most Orders consist of more than a single product, the information about the order's products is stored in the OrderDetail table and multiple rows. All other information is stored in the Order Table.
This separation of data allows an application to change the product price and quantity on any one row of an order without having to know anything more than the OrderID and ProductID. This also allows an order to contain two or ten thousand products per order, depending on the needs of the business.
This type of data relationship structure would seem pretty obvious when you take into account the unknown number of product each Order could have.
Another advantage of the OrderDetail table storing each order product as separate records is the ability to enhance the business system to support partial shipments. A developer can easily enhance the OrderDetail table to include a ShipDate for each product. This would allow the application to keep track of what has been shipped on a row by row basis without have to create additional Order records.
The MultiValue Model - Enhancements to 1NF (First Normal Form)
Since SQL styles data stores force you to use 1NF, software developers typically just live with its model, and assume there is no better way to do it. A person with only one outfit know what to wear every day.
MultiValue data stores have the opposite problem. As I touched on in the "Battling the Existing Databases" section, our databases do not enforce a specific data structure on the software developer. We have options.
The following rules should be followed when using the MultiValue model:
- Tables should have no duplicates in columns (fields) or records (rows).
- Records always have a unique primary key
- A record (row) may have columns (fields) with more than one value of the same type of data.
- A record (field) may have an internal relationship between more than one column in the same record.
- Separate tables should be used when there are more than 4-7 columns with internal relationship to each other, or more than 50 values stored in columns with an internal relationship to each other.
If you compare these rules to the 1NF rules, you find they are fundamentally the same. There are more rules to follow in the MultiValue data model mainly because 1NF does not have the flexibility when working with storing relationships within each record. The takeaway: You don't need more rules when your choices have been taken away from you.
These rules allows you to take advantage of both the storing of simple relationships like customer phone numbers, as well as complex relationships like that between the Order and OrderDetail tables.
Eliminating duplications in the database is important, which is one of the main purposes of 1NF. But software applications that interact with data needs to do it as efficiently as possible, order to keep business software responsive to the users working with it.