Business Tech: Philosophy of Indexing
I'm in the final stages of writing a new novel, Lefty Lucy. In the story, a medical team is being assembled. One of the prospective members informs the leader that he's a doctor... of philosophy. When asked why he's needed on a medical research mission, he explains that philosophy is needed everywhere. When I wrote that, I started thinking about how philosophy applies to MultiValue. There are a lot of ways I could go with that. As usual for me, I chose to approach it from way out in left field. I started thinking about the real purpose of indexing.
Five Kinds of Data
Before we can talk about indexes, we need an agreed definition of data. For these purposes, I'm going to focus on data as it is formatted for output, not the literal representation on disk.
When creating output from any database system, you have four kinds of data available: Raw, Cooked, Computed, and Implied. Each serves a purpose. There is, of course, a fifth kind: Fictional.
Raw — This is literally the data as-is. What you recorded to the database is what you retrieve and display.
Cooked — Think of dates and times. The database might store them in some other format — milliseconds or days-since-epoch, for example — so on output we transform them in an entirely unambiguous way.
Computed — This involves performing math on data, concatenating strings, and possibly mixing more than one of the five kinds of data together.
Implied — Let's say you are asked to subset your data and extract just the fifteen records with the highest dollar value. The dollar value is likely computed, but the numbers one through fifteen in column one are implied. The seventh line only has a seven because of where it lies in the output. You won't find that seven anywhere in the data.
Fictional — While any data can be used in misleading ways, and people lie with statistics all the time, Fictional data is not necessarily untrue. If a client wants me to slap IS for International Spectrum in the third column of every line, so they can identify the source after they merge it with other data, that IS is fictional. It appears nowhere in the actual data, not as-is, not in any derivable format. Data is unlikely to carry a field which indicates that my data is my data.
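The five kinds can be seen side by side in one report row. Here's a minimal Python sketch — the field names and sample values are invented for illustration, and the only other assumption is the usual Pick/MultiValue internal date base of December 31, 1967 as day zero:

```python
from datetime import date, timedelta

EPOCH = date(1967, 12, 31)  # Pick/MultiValue day zero (assumption for this sketch)

# Raw: the stored values, exactly as recorded.
rows = [
    {"name": "WIDGET", "qty": 3, "price": 250, "days": 18564},
    {"name": "GADGET", "qty": 7, "price": 125, "days": 18400},
]

report = []
for row in sorted(rows, key=lambda r: r["qty"] * r["price"], reverse=True):
    report.append({
        "raw": row["name"],                           # Raw: as-is from storage
        "cooked": (EPOCH + timedelta(days=row["days"])).strftime("%m-%d-%Y"),
                                                      # Cooked: unambiguous transform
        "computed": row["qty"] * row["price"],        # Computed: math on stored fields
        "source": "IS",                               # Fictional: appears nowhere in the data
    })

# Implied: the line number exists only because of where the row lies in the output.
for implied_line, line in enumerate(report, start=1):
    print(implied_line, line["raw"], line["cooked"], line["computed"], line["source"])
```

The `implied_line` counter is the point worth noticing: it is never stored anywhere, which is exactly why it is the hard one to index.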
It is dangerous to generalize about all the flavors of MultiValue, but I believe that every version which supports indexing supports four of the five kinds of data. So, I can index on internal time, on the output of an extended price calculation, or on a reassembled full name. Fictional data would also work but, since it is divorced from the effects of other data, all the values in the index would be the same. The odd man out is Implied data. I'm not sure how you'd index on that without an unreasonable amount of brute-force code.
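Indexing a derived value can be sketched in a few lines of Python; the dictionary-of-lists below stands in for whatever structure your MultiValue platform actually maintains, and the record IDs and fields are invented for illustration:

```python
# A toy secondary index keyed on a *computed* value (extended price),
# which is not stored as-is in any record.
records = {
    "1001": {"qty": 3, "price": 250},
    "1002": {"qty": 7, "price": 125},
    "1003": {"qty": 2, "price": 375},
}

index = {}  # computed key -> list of record IDs
for rec_id, rec in records.items():
    key = rec["qty"] * rec["price"]     # the indexed value is derived, not stored
    index.setdefault(key, []).append(rec_id)

# Lookup by the derived value without scanning every record.
print(index.get(750))   # record IDs whose extended price works out to 750
```

Notice that the index key (750) appears nowhere in the records themselves — the index is tracking information we can extract from the data, not just the data.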
Think about that for a moment. A database index is not merely an index of the content of a file. It is a quasi-magical construct that tracks the information we can extract from the data in addition to indexing the data itself.
When you employ something to make things go faster, you expose a devastating underlying premise: things are too slow when done the right way. In the ideal world, data would always be stored in a form identical to the final output. Why don't we do that? Because storing a date as 10-28-2018 makes it slow to sort. Storing the extended price makes it hard to adjust values when an order is amended or has line items canceled. We keep the parts and the raw values because of speed and efficiency. Indexes are another symptom of that same viewpoint.
My friend Jim Roberts, who used to work at Applied Digital Data Systems, back when ADDS was an active company in our market, used to tell the story of writing an accounting system in 4k with a soldering iron. When you code that way, you don't have context-sensitive help or error correction. You have code that does the minimal definition of the job. We write all that other stuff because people make mistakes. Imagine how little code you'd need if we did away with prompts, retry loops, and every other bit we wedge in to support being helpful.
The same idea applies to indexes, storing raw data, and every other optimization strategy we employ. We do them to compensate, to make things that are not inherently fast enough move faster.
If you can find a database structure which doesn't need indexes, I'm confident you will find some other mechanism compensating for some other shortfall. Things are rarely fast enough. Throwing hardware at the problem helps, but even with fast hardware there is a benefit to code optimization. And having both of those boosts doesn't mean that better data architecture and better data storage can't be leveraged for even greater savings.
When you see clearly which parts of your job are the job — storing the data so it can be retrieved by criteria, in order, displaying the correct subset — and which parts are not the job but support it — indexing, error checking, and the like — you can focus your efforts differently.
This is not an article condemning speedups. I heartily endorse them. The hope is that by highlighting the division between our core mission and the things we need to do to support it, we can all get better at explaining the work to others and get a new perspective for ourselves.
Next time you're asked why it took so long to do "X," you can split the answer between "X" itself and the support items. For example: "Building the data tables took a day. Adding help screens with examples, error checking, building in testing tools to prove the data, and creating the reports took three months. I can break that down further for you if you'd like."
Often, the perception that IT is slow comes from people thinking about the core without considering that the ancillary parts take the majority of the time. What we do isn't simple because we aren't just throwing something together. We are building in a way that maximizes the chance that our shiny new thing won't break the first time a breeze hits it.