Part one of the 0 to 100 series — with inspiration from Drake.
After working with relational databases for a number of years, I had become quite attached to their well-defined organization, subsequent predictability, and reliability. Then I looked at NoSQL databases with both tree and graph structures. In studying the need for these additional types of databases, I realized my career had been an ongoing data safari – always trying to jam the differently-sized animals you might find on a safari into the same size box. It’s pretty easy to see that a custom-built “box” for the honey badger would not fit the giraffe very well (or vice versa). Like the animals on my safari, data is not always the same shape either.
In the relational world you are bound by rows and columns – and sometimes by heavily constrained fields that must match a specific format and a specific length. What do you do with data that is larger?
How to ‘Unthink’
When I first looked at MarkLogic, I saw first-hand how it allowed me to think outside the confines of a “box” and to “unthink” the modeling required to build a box for each animal: I could easily extend my model to accommodate each of them, including those I did not expect to find.
Plugging “unthink” into your favorite search engine offers results with such suggestions as “to put out of mind.” Let’s begin seeing how we can rediscover our creative genius and put those relational tools out of mind.
Modeling the ER Diagram – A World of Pain
Let’s consider a CRM application, which is typically driven by relational databases. An application of this magnitude would include many different data sources, data types, and certainly data schemas. Imagine modeling that hairy mess using entity-relationship (ER) diagrams. In this real-world scenario, you likely have at least 1,000 pieces of information in the ER diagram, which can easily span entire walls and hundreds of tables.
This does not seem like a big deal until you experience the sparse data problem, where a given column may or may not have values, leading to slow and inefficient algorithms as processing and memory are wasted on zero or null values. On the other hand, there could be multiple values to deal with; one of my colleagues has six valid mailing addresses that are not duplicates, and that is rather common among my other colleagues and friends as well. When viewed as a class model, this object-relational mapping leads to sparse data for the polymorphic details.
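To make the sparse-data point concrete, here is a minimal Python sketch (the names and addresses are hypothetical, not from a real schema): a fixed-column row reserves space for the worst case and leaves most of it empty, while a document simply carries as many addresses as a person actually has.

```python
# Fixed columns force every row to reserve space for the maximum case.
# Most rows leave the extra address slots as None/NULL.
row_alice = ("Alice", "12 Oak St", None, None, None, None, None)
row_colleague = ("Pat", "1 Elm St", "2 Pine St", "3 Ash St",
                 "4 Fir St", "5 Yew St", "6 Bay Rd")

wasted = sum(cell is None for cell in row_alice)
print(wasted)  # 5 empty address slots for Alice alone

# A document model stores only what exists: one address or six.
doc_alice = {"name": "Alice", "addresses": ["12 Oak St"]}
doc_colleague = {"name": "Pat", "addresses": [
    "1 Elm St", "2 Pine St", "3 Ash St", "4 Fir St", "5 Yew St", "6 Bay Rd"]}
print(len(doc_colleague["addresses"]))  # 6, with no empty slots anywhere
```

Multiply Alice’s five wasted cells across millions of rows and the cost of modeling for the six-address colleague becomes real.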
The above ER diagram is an example, with any identifying information about its origin removed to protect the offenders, hahah. These complex ER diagrams are time consuming, painful, and boring. More importantly, these modeling exercises also carry a large price tag payable to a big consulting firm, with little chance of the data model later expanding to match the business. Chances are, you’ve seen or been involved with this part of the project more times than you are willing to admit. Me too.
Simple Example: Defensive Modeling
Taking a step back from the complex ER diagram above, suppose you needed to create a table to capture some User information today. A simple enough task, right? What would you do to create that table? What might this look like? Well, it would probably have columns that would include:
ID (primary key), CustomerNumber, UserID, FirstName, LastName, MiddleName, Address1 through Address4 (Just in case!), City, State, ZipCode, and Country, for example…
When you create columns Address1 through Address4, you are engaging in defensive modeling strategies (hint: NoSQL databases such as MarkLogic do not require this, more later — no peeking!): because adding columns later is costly, you must think upfront about every input and output possibility to protect yourself against the inflexibility of the relational database.
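The defensive version of that table might look like the following, sketched with Python’s built-in sqlite3 module. The column names follow the list above; the table itself is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Address1 through Address4 exist "just in case" -- classic defensive modeling.
conn.execute("""
    CREATE TABLE users (
        ID             INTEGER PRIMARY KEY,
        CustomerNumber TEXT,
        UserID         TEXT,
        FirstName      TEXT,
        MiddleName     TEXT,
        LastName       TEXT,
        Address1       TEXT,
        Address2       TEXT,
        Address3       TEXT,
        Address4       TEXT,
        City           TEXT,
        State          TEXT,
        ZipCode        TEXT,
        Country        TEXT
    )
""")

# A user with one address still drags the other three slots along.
conn.execute("INSERT INTO users (ID, FirstName, LastName, Address1) "
             "VALUES (1, 'Jane', 'Doe', '12 Oak St')")

cols = [c[1] for c in conn.execute("PRAGMA table_info(users)")]
print(len(cols))  # 14 columns to model one user, four of them address guesses
```

Four of those fourteen columns exist purely as guesses about future inputs, and a fifth address still breaks the model.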
Let’s take things up a notch
What if I wanted to up the complexity a bit and work with other real-world data, e.g., Medical Insurance Claims Data?
Pre-thinking about and modeling the data to fit into those rows and columns probably took a decent amount of time. Why? Rows and columns force you to evaluate the possibility of relationships and cardinality. For all intents and purposes, my tables could have one-to-many relationships running in both directions. What if I am trying to bring multiple data silos together? That is even tougher to deal with, because the schemas won’t match and I must spend time unifying them. There are typically decisions about which data to keep and which to drop because of relational cardinality — leading to the desire to shortcut whatever modeling you can. The table forces me to make design choices before I have all the facts about the data sets. Adding a column later and creating the related additional indexes on it is also a rather intensive operation, not to mention ugly from an organizational perspective. I like my columns or data items in a specific order AND hierarchy that makes sense, thank you very much!
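Here is a rough sketch of what “adding a column later” involves, again using sqlite3 (the claims table and the diagnosis-code field are hypothetical). The new column arrives NULL everywhere, must be backfilled, and needs its own index before it is searchable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (id INTEGER PRIMARY KEY, member TEXT)")
conn.execute("INSERT INTO claims VALUES (1, 'M-100')")

# A requirement arrives later: we now need a diagnosis code on every claim.
conn.execute("ALTER TABLE claims ADD COLUMN diagnosis_code TEXT")   # NULL for all existing rows
conn.execute("UPDATE claims SET diagnosis_code = 'J10' WHERE id = 1")  # backfill, row by row
conn.execute("CREATE INDEX idx_claims_dx ON claims (diagnosis_code)")  # index it, or no fast search

# On a real production table this ALTER/backfill/index cycle can mean
# table locks, maintenance windows, and change-control paperwork.
row = conn.execute("SELECT diagnosis_code FROM claims WHERE id = 1").fetchone()
print(row[0])  # J10
```

Three statements in a toy database; in production, three change requests.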
XML/JSON solves all of our modeling problems in relational, right?
Imagine the team has discovered the flexible nature of both XML and JSON and how each (correctly) deals with these modeling challenges. Further, there are probably freetext/unstructured items to contend with as well. Those will likely end up living or dying in BLOB columns or, as I prefer to say, “Data Coffins.” Why will they die? It is hard to do much with these items once they’ve landed in those columns, which is effectively where good data goes to die: it cannot be easily searched or quickly leveraged.
We’re done now, right? Not so fast! Columns need to be created to hold primary key values, with associated indexes built on them, separate from the XML/JSON or unstructured BLOB columns, so we have a method for search and retrieval. What if you want to search on a value from the XML/JSON or unstructured text some day in the future? Uh oh – the DBA has to write code extracting the relevant data into a new column, figure out how to keep it in sync with the source data, and model indexes on it. Allow two weeks; do not pass “Go.”
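That “extract into a new column and keep it in sync” chore can be sketched in Python with sqlite3 and the json module (the schema and field names are hypothetical). The JSON lives in a text/BLOB column, so to make one field searchable the DBA copies it out into its own indexed column, and from then on every write must update both:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")  # the "Data Coffin"
conn.execute("INSERT INTO docs VALUES (1, ?)",
             (json.dumps({"user": "jdoe", "city": "Austin"}),))

# New requirement: search by city. Promote the buried field to a real column...
conn.execute("ALTER TABLE docs ADD COLUMN city TEXT")
for doc_id, body in conn.execute("SELECT id, body FROM docs").fetchall():
    conn.execute("UPDATE docs SET city = ? WHERE id = ?",
                 (json.loads(body).get("city"), doc_id))
conn.execute("CREATE INDEX idx_docs_city ON docs (city)")

# ...and remember to keep the extracted copy in sync on every future write.
row = conn.execute("SELECT id FROM docs WHERE city = 'Austin'").fetchone()
print(row[0])  # 1
```

Every searchable field you did not anticipate repeats this cycle: extract, backfill, index, and synchronize forever after.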
Because this is zero (some might argue negative) fun, the DBAs draw straws to decide who gets to work on the requirement. In all seriousness, I have to know the kinds of questions I want to ask right now if I want them answered in a quick manner down the road. Woof! This is for the dogs…
Good morning, time for a requirements change!
The next day, you find the nice table you made and its associated ER diagram will not work, thanks to a conversation one of your colleagues had with the Marketing Department that uncovered additional data requirements: co-mingling Twitter or blog post data with the User or Medical Insurance Claims data for a Customer 360 view. Uh oh! Conservatively speaking, that’s about a day of productivity lost (at best!), many days added to the schedule, and a project delivery date pushed further into the future.
The Twitter and blog content the Marketing team is after will likely be large and unstructured in nature; too big to be stored natively, it will have to be jammed into BLOB columns or Data Coffins for storage. Moreover, imagine if the requirement were to store PDFs and index those too. The same thing applies: those would end up in BLOBs as well. Searches on the text, and the respective element structures or values to be stored, would all have to be defined upfront, before runtime.
If Only I Had Something Like Google
With traditional databases, most indexes don’t efficiently account for word position, term frequency and some of the other Google-like searches we have come to expect. Did we tell Google what we were looking for in advance? Of course not! Attempting to pre-model these items in advance with relational tools and even some NoSQL databases is arduous and the value of these efforts is hardly guaranteed.
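What a search engine maintains, and most relational indexes do not, is roughly a positional inverted index: for every word, which documents it appears in, at which positions, and how often. A toy Python sketch (the two documents are made up for illustration):

```python
from collections import defaultdict

docs = {
    1: "claims data is searchable data",
    2: "the giraffe does not fit the box",
}

# word -> {doc_id: [positions]}; term frequency falls out as len(positions)
index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, word in enumerate(text.split()):
        index[word].setdefault(doc_id, []).append(pos)

print(index["data"][1])       # [1, 4]: word positions of "data" in doc 1
print(len(index["data"][1]))  # 2: term frequency, no upfront modeling required
```

Notice that nothing about the documents had to be declared in advance; every word of every document became searchable the moment it was indexed.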
Sound familiar? Unfortunately, yes. Except this seemingly endless cycle of events usually has the effect of adding at least 6 months to the delivery date of any given project.
So we have a simple table to create, but it requires a lot of upfront modeling work that any new or changed business requirement could scrap or rework; it does not support native XML/JSON, which leaves searchable values in the lurch; and delivering search performance means adding a lot of hardware and other DBA shenanigans. Double woof.
You’re not alone
At MarkLogic, we see customers experience these problems all the time. All. The. Time. These problems add unnecessary labor hours, cripple projects by stretching out timelines or preventing delivery altogether, and generally make things less fun overall. Did I mention woof?
Unthink and “Put out of mind”
Competitive sailboat racers like me have a phrase with respect to knots: “When you don’t know how to tie a knot, tie a lot.” The steps described above sound like “a lot,” don’t they? So, if you’ll pardon the mixed metaphor, over the next few weeks I will introduce the knots you need for your data safari – Happy sailing!