Business Tech: AI and DSS: Part II
Dystopia is the term used for any number of bleak, failed societies. In science fiction, computers, robots, and evil algorithms - AI (Artificial Intelligence) and DSS (Decision Support Systems) being two examples - are often to blame. The emphasis is generally on dehumanizing us by reducing our society to the data that describes it. Personally, I blame SQL.
To be serious, SQL is designed around the idea that optimizing data means optimizing the digital use of data. It is not designed for you, my organic friend.
What if we take a more human approach to data? Well, if you want data that has a human touch, we have plenty of options. NVP (Name-Value Pairs) offer a readable - human-readable - label with each jot of data. Formalize that a bit and you are in the realm of XML and JSON. These three are certainly not machine-optimized. They are people-centric, focusing on clarity for the reader over mathematical minimalism. If the AI uprising is your fear, your best defense is to skew the rules toward… well… us.
What About MultiValue?
MultiValue sits in the middle. I once heard, and often quote, Mike Ruane as saying that MultiValue is compressed XML. We use positions instead of labels, but we bring a structure that is more eyeball-friendly than SQL. Think of it this way: I have to transform SQL data to share it. It has to become tab-separated, or XML, or some other decidedly non-SQL thing before it can move. Generally, this isn't just swapping columns for commas. SQL data is spread out and has to be unified and essentially re-architected before it can be transportable.
Given that moving data, dissecting data, and assembling data is a big part of what we do, having a database that can't do any of that easily is an odd choice. Unless, of course, you are in the thrall of the metal ones. MultiValue pays attention to speed, but it also has its bags packed at all times. Any modern developer who can tease data out of a comma-separated file can handle a string with @AM delimiters. Tell them the @VMs are embedded sub-strings and they'll probably be just fine with those delimiters as well. If we must dress it up for travel, subbing @AM to comma and @VM to pipe is often enough. When it comes to speed, the less we handle the data, the faster we can ship it.
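To make that concrete, here is a minimal sketch of the @AM-to-comma, @VM-to-pipe swap - in Python rather than MultiValue BASIC, purely for illustration. It assumes the conventional dynamic-array delimiters, CHAR(254) for @AM and CHAR(253) for @VM; the invoice data is made up.

```python
AM = chr(254)  # attribute mark (@AM), the conventional CHAR(254)
VM = chr(253)  # value mark (@VM), the conventional CHAR(253)

def dress_for_travel(record: str) -> str:
    """Sub @AM to comma and @VM to pipe, as described above."""
    return record.replace(AM, ",").replace(VM, "|")

# A hypothetical invoice record: id, date, then three contact names
# as embedded sub-strings (values) within one attribute.
invoice = AM.join(["INV1001", "2024-01-15", VM.join(["Ann", "Bob", "Carla"])])
print(dress_for_travel(invoice))  # INV1001,2024-01-15,Ann|Bob|Carla
```

Two string replacements and the data is dressed for travel - no joins, no re-architecting.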
XML, JSON, and MultiValue also have another critical edge when it comes to readability: Thingedness. This is the term I coined to describe the ideal relationship between data and the user of the data. Here's where columnar databases and SQL databases fail the thingedness test: Can you point to a single record and associate it with a common, real-world thing? My XML, JSON, NVP, or MultiValue INVOICE file can have the entirety of an invoice in each record. One read equals one invoice. That's something a non-database person can grasp: one hundred invoices equals one hundred records.
While there are reasons to not do this — many excellent reasons — the closer your data gets to this model, the easier it is for the programmer, the user, and the architect to keep the entire data model in their head. As you approach thingedness, you approach clarity of concept. The data world has more in common with the human one.
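As a sketch of what thingedness looks like in practice, here is a hypothetical one-record invoice in JSON. The field names and figures are invented, but the point stands: a single read recovers the whole thing.

```python
import json

# One record = one invoice, line items and all (hypothetical data).
invoice_record = json.dumps({
    "invoice_id": "INV1001",
    "customer": "Acme Co.",
    "lines": [
        {"item": "widget", "qty": 3, "price": 9.99},
        {"item": "gadget", "qty": 1, "price": 24.50},
    ],
})

# One read recovers the entire invoice; no joins required.
invoice = json.loads(invoice_record)
print(len(invoice["lines"]))  # 2
```

A programmer, a user, or an architect can point at that record and say "that's an invoice" - the whole data model fits in one head.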
With XML, JSON, and MultiValue, thingedness is achievable. The big difference between the three is that the first two have to be transformed to be used. MultiValue can choose to unpack its bags, but a MultiValue string is always ready to work.
Some of the Excellent Reasons
SQL is the extreme counter-argument to thingedness. It is based on the premise that the more you break something down, the better you can control it and account for it. There is merit to this approach if you are concerned with scaling up the size of your data. However, the more the complexity of your data scales up, the worse this idea becomes. There is a reason Google uses NoSQL to manage search. There is a reason Facebook uses NoSQL.
Still, SQL's popularity isn't random. For some jobs, the rules of SQL are the most rational ones. A good example is tool building. It is easier to generalize a tool, for reporting or analytics, when all data has a rigid uniformity of storage. The less creative the structures are, the easier it is to make new tools.
Moreover, forcing the table designer to specify field types and lengths helps keep the design focused on the use and intent of the data. Free-form data can often result in sloppy design. Working in SQL makes me a better NoSQL architect.
So, please don't damn the methodology out of hand. It has its place. Not every place, but I wouldn't want a pure thingedness database, either.
Where the Pendulum Stops
What we are looking for here is an acceptable level of atomicity. Simply put, we want to break things down just enough.
The middle is where the winners want to be. Reasonable control, but not the OCD of SQL. Reasonable thingedness, but not a rigid mandate to mirror the structures of the world. SQL doesn't do middle. Columnar doesn't do middle. XML and JSON can do middle, but they can't be operated upon directly for complex tasks.
MultiValue can do middle. We can create an invoice header record, with unified data, and split the details, each to their own record. We can keep multiple values in the header efficiently: Three contact names? No problem. Only one on the next one? No wasted space.
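A rough Python sketch of that header idea, again using CHAR(254)/CHAR(253) for @AM/@VM. The record layout and names are hypothetical, not a prescription - the point is that the contacts attribute simply grows or shrinks with the data.

```python
AM = chr(254)  # attribute mark (@AM)
VM = chr(253)  # value mark (@VM)

def build_header(invoice_id: str, customer: str, contacts: list) -> str:
    """Invoice header record: attribute 1 = id, 2 = customer,
    3 = contact names as multiple values in one attribute."""
    return AM.join([invoice_id, customer, VM.join(contacts)])

h3 = build_header("INV1001", "Acme Co.", ["Ann", "Bob", "Carla"])
h1 = build_header("INV1002", "Bolt Ltd.", ["Dev"])

# No padding, no NULL columns: the one-contact record is simply shorter.
print(len(h3) > len(h1))  # True
```

Three contacts or one, the structure is the same; only the record length changes.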
This is the balance between the AI/DSS view of data and the human view. We can scale in complexity because we can make decisions in our architecture and applications to treat elements of our data in sane ways.
How Does This Relate to AI and DSS?
As you saw last issue in the Animals program, we needed to construct the growing AI data in a way which favors decision trees. The less efficiently we implement, the slower our program will get as it matures. The infant version, the one with just a few starter animals, will always be faster than the adult, with its extensive zoo, but here we aren't worried about relative speed, we are worried about being fast enough to keep the user feeding the program. The game Animals doesn't grow if no one plays.
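This isn't last issue's code, but a toy sketch of the shape that favors decision trees: each question or answer is one flat record keyed by a node id, so a guess is just a short chain of reads. All the node data here is invented for illustration.

```python
# Each node is one record: (type, text, yes-branch id, no-branch id).
# "Q" nodes ask a question; "A" nodes are leaf guesses.
tree = {
    "1": ("Q", "Does it live in water?", "2", "3"),
    "2": ("A", "a fish", None, None),
    "3": ("A", "a cat", None, None),
}

def guess(answers: list) -> str:
    """Walk the tree with a list of yes/no answers; return the guess."""
    node = tree["1"]
    for yes in answers:
        if node[0] == "A":
            break
        node = tree[node[2] if yes else node[3]]
    return node[1]

print(guess([True]))   # a fish
print(guess([False]))  # a cat
```

Each round of play adds records rather than restructuring the whole store, which is why the adult zoo slows gracefully instead of collapsing.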
As Nathan discusses in another article in this issue, the closer you put the data interaction to the data, the better your speed. Additionally, as my dad would point out, the more parts, the more that can break. Keeping the programming close to the data requires fewer transformations, less network bandwidth, and fewer steps. That makes it faster and less fragile.
When we implement DSS or AI, we are talking about extensive data. If we are being really smart about it, we are also expecting that data to keep growing. Real AI and DSS should eventually perform successfully outside of the original parameters. If you planned everything it does, it is more of a performing bear than a critical thinker.
Your choice of data storage matters. Your choice of programming language matters. With enough time, trouble, effort, and money, you might be able to make a pig sing, but starting with a singer is probably a wiser move. Understanding the underlying effects of your choices raises you above decisions like, "Well, it was the only language I knew, so I wrote everything in Whitespace." Picking tools responsibly? That's real intelligence.