
Tom McGrath

A relentless passion for strategic execution
Fitness

The Ultimate Home Gym Setup

10/01/2021

With the lockdowns dragging on and home HIIT workouts flatlining (rocked a backpack stuffed with books as a weight for a while!), it was time for a change: time to invest in a home gym setup. Here is what we are rocking, which has contributed to an increase in muscle mass and a decrease in body fat, among other benefits.

1. Get a scale

Without measurement, it is near impossible to know how you are doing. The trusty old one purchased for $30 a few years ago decided to start flashing all sorts of error messages, so it was definitely time for an upgrade. Picked up the Withings Body+, which has been amazing at keeping the measurements and analysis of where we stand. That’s right, the Body+ syncs the details up to the cloud and you can view a dashboard of details in the app. Those pizzas and holiday treats add up (3–5 lbs by the next day, actually!), so this pushes you to work harder, burn those off, and get back to where you want to be!

2. Weights

You could go out and buy weights piecemeal, although that takes up a lot of space in your place and a lot of time tracking down the respective weights (in lbs) that you need. Stalked the PowerBlock Elite Series Dumbbells (would personally call them SMART BELLS) like it was a second job. Or, rather, followed them on Insta/Twitter and set availability alerts so I could buy them the moment they went on sale, much like grabbing a festival ticket right at the on-sale. Pick up the additional weight add-ons even if you are not sure you will need them; it will help the resale later if you decide to start hitting the gym instead.

Looking for something more reasonably priced? The CAP Adjustable Barbells look good and are not very expensive. Have heard that Bowflex is another good alternative although, at that price point, you might as well spend a few extra and go for the PowerBlocks.

3. TRX

The TRX Home2 System is a solid bundle enabling you to train ANYWHERE. It includes an over-the-door anchor so there’s no need to mount a hook or anchor in the living room although by all means go for it if that’s your home aesthetic. Recommend adding on the TRX Exercise Bands Package (~$15) to this purchase to get the most out of the shipping cost for an item you’ll want anyway. Planning to take this outside to the back deck (attached to a hook with some rope) or even the park (attached to a tree) once the weather warms up!

4. Bands

Bands are very handy for building strength, and the TRX Exercise Bands Package (~$15) is a bundle that enables you to work your way up over time. Lots of uses for them: internal/external rotations and banded walks, to name a few.

5. A versatile space-saving workout bench

If it is not already obvious, the key theme here is “don’t take up a lot of space.” The Power Systems Deck is very versatile yet easily stowed away. Highly recommend!

6. Workout Mat

Yo, Gorilla Mat for the win. Went with the 8′ x 4′ version and never looked back. This thing is like an instant gym floor in your place for workouts. It also rolls up and slides into a bag for stowage. Space-saving is always a win. It smelled funky at first, although that has chilled out with use over time. The package included a towel of sorts; still not sure what that’s for and, who cares, the mat is awesome!

7. Slides

Wanna do pikes, mountain climbers, wipeouts, etc. without burning through your socks or using a towel? Slidez are the answer! They enable all sorts of awesome and challenging exercises, delivering lots of value!

8. Foam Roller

Forget foam, Triggerpoint is the way to go. The plastic inner tube limits the “give” that the foam ones suffer from. Have had one for 7 years, so it’s no surprise this is consistently rated as a top roller. Fun fact: S’well water bottles make a good travel roller.

9. Theragun

Sometimes it’s nice to have a more passive roll or a self-massage. Theragun PRO to the rescue. With all sorts of programs that can be selected in the app, which controls the device via Bluetooth, this is the go-to. Ideally, you have a partner in your life and this can be one of your activities together. It is possible to self-massage with the Theragun (and it works out well!), although there are no doubt more benefits to fully resting during a massage.

10. Yoga Strap

The Manduka Yoga Strap is a big win with all the time spent sitting these days. One of the favorite exercises with it is the hamstring stretch for 30 seconds. From there: fold the opposite leg (allowing the upward leg to flex more), note the spot on the ceiling where you can flex that upward leg to, and try to keep that stretch for another 30 seconds after lowering the “opposite leg.” Repeat.

11. Yoga block (or 2!)

Foam blocks just never seemed that great due to their flimsiness. Cork to the rescue! The Manduka Cork Yoga Block stays true to form and does not give a millimeter. The benefits of blocks are underrated just like Savasana (still working on it!) yet they help you work deeper into areas in a safe way so that you are not straining yourself for the sake of the pose.

12. Resistance Bands

Eventually, some part of the body will invariably become unhappy with the exercising, or maybe you are not stretching as much as you should be. It happens! Working from home, there has been a lot of Rainbow Sandal wearing around here, which has not made the ankles very happy. And so, some bands are needed. The TheraBand Beginner Resistance Bands are the way to go for rehabilitation exercises.

13. Balance and Knee Pad

Speaking of the body getting “unhappy,” that can include the knees. And also, maybe you want to work on some balance exercises. Am still trying out some options here although have a foam pad on the way and will update with details.

These are all the respective home workout tools in the toolbox. Combined with a solid training program, there is no reason why you can’t get or stay fit during these crazy times. Plus, when the weights are staring at you (they have eyes!) and there is no commute to the gym with all the associated packing of clothes, it is a lot harder to skip a workout and easier to make it happen.

Big Data, Google Cloud

Why Google Cloud for Big Data?

17/06/2020

Was asked for some Google Cloud Big Data value points today so here they are:

1. Highly scalable, speed-of-thought open source solutions that scale to hundreds of petabytes of data
2. Simplified data operations reducing time spent on integrations, layers, complexity, and upgrades
3. Unleash data scientists/analysts to more easily deliver value with familiar open source tools without duplicating data
4. Unparalleled Security and Governance Controls including ML detection of sensitive data (PII, Credit Card Numbers, etc) and cataloging of it

#googlecloud #bigdata #analytics

MarkLogic, NoSQL, Tech

Unthink: Starting to move beyond the constraints of relational databases

11/07/2016

Part two of the 0 to 100 series — with inspiration from Drake.

Welcome to a continuation of a previous post in a series from the MarkLogic.com Blog, where I outlined some of the high-level items that plague organizations with respect to integrating their data from silos. Here, we will dive head-first into working with an AngularJS application as the frontend and scaffolding while loading disparate data sets into MarkLogic, so we can see first-hand how it makes integrating data from silos incredibly easy.

Getting up and running with MarkLogic

What would a tech how-to blog post be without some initial setup? To follow along with this blog post and avoid configuration steps, it is best if your Administrator username/password is admin/admin, the default ports (8040 and 8041) are available, and you are running MarkLogic locally. In a future post, we will cover how to tweak these.

Skip Steps 1 and 2 if you already have MarkLogic up and running.

  1. Follow the appropriate installation process for your Operating System and start MarkLogic server.
  2. Configure your host – this takes only a minute! Compare that time to other setups, haha.
  3. Make sure you are able to login to the MarkLogic Administrator Console (usually found at http://localhost:8001/) using the admin credentials.

Script the database, forest, and application server

We could manually create a database, forest, and application server to hold and serve up our data, but that’s a lot of work and effort (although I invite you to review and understand those concepts). I also enjoy being lazy, especially when it makes things easier and more manageable, so we’ll use a MarkLogic tool (Roxy) that makes this configuration a breeze over REST and easier to follow along with.

We’re going to surface the data using the newly released and frequently updated ml-slush-discovery-app (Slush Discovery) template from one of my colleagues, Ryan Dew. As the name may imply, it is based on the slush-marklogic-node template/generator. What is that? Slush is an app project generator that provides scaffolding for a quick JavaScript-based app on top of MarkLogic. The Slush generator was built with AngularJS and node.js, and you can fully customize your project once it’s been generated. That’s right, full stack JavaScript! I introduce the app for surfacing the data because it also ships with Roxy, which makes configuring the database, forest, and application server a snap! It does get tedious after you’ve done it manually a few times, which makes scripting it out in Roxy so much fun. 😉

Getting going with Slush Discovery

Make sure to check for and install the key dependencies you don’t already have. You’ll need to enable root on OS X if you have not done so already.

  • Ruby 1.9.3 or greater – A dynamic, reflective, object-oriented, general-purpose programming language
  • Java (jdk) – Only if you wish to run the mlcp, XQSync, or RecordLoader commands
  • node.js – JavaScript runtime using an event-driven, non-blocking I/O model that makes it lightweight and efficient

We will use Git to pull down the latest ml-slush-discovery-app master branch (aka the latest production release) in a few steps. Git is incredibly handy for SCM (Software Configuration Management) and of course pulling down public projects from GitHub. If you do not have git:

Windows Users: Download and install Git for Windows. Specifically, Git Bash (Git + Shell Interface) will make your life easier in running the commands that follow as the Windows Command Prompt would require all sorts of Path variables which are no fun to troubleshoot.

Mac OS X Users: Download and install git if you do not have it already.

Additional dependencies/requirements to install via terminal prompt (or Git Bash for Windows):

  • npm – Built-in package manager for node (comes with node, but check to be sure you have latest version)
$ npm install -g npm
  • gulp – JavaScript task automation
$ npm install -g gulp
  • Bower – A package manager for front-end libraries
$ npm install -g bower
  • MLCP – Binary for loading/exporting data. Extract the ZIP somewhere reusable – perhaps in the directory above where the Slush Application will live (../).

Officially, Slush requires a few additional Windows dependencies although I’ve had no issue proceeding without them many times over.

Change directories in your terminal prompt (or Git Bash for Windows) to where you would like the ml-slush-discovery-app folder with our activities to exist and issue the following to clone it:

$ git clone https://github.com/ryanjdew/ml-slush-discovery-app.git

Change directory to the ml-slush-discovery-app folder. I will assume you are in this root directory path going forward.

Update the Roxy MLCP Path and Deploy

We’ll start by modifying the Roxy MLCP path in the Slush Discovery app so that we can use it to load content into MarkLogic. “Out of the box,” the Slush Discovery app requires port 8040 for the application server, which Node.js will use to interface with MarkLogic, and port 8041 for an XCC server that allows code deployment to MarkLogic. We’ll cover how to modify those in a future post and will hope those ports are not in use for you until then!

Update the deploy/build.properties file which is part of the Roxy configuration to change line 170 so it points to the MLCP root directory you extracted previously.

mlcp-home=/<absolute path to MLCP>/mlcp-8.0-5

Deploy the configurations to MarkLogic using the following Roxy commands from the ml-slush-discovery-app directory.

NOTE: Windows users will use Git Bash instead of the Windows command prompt and ml.bat instead of ml.

$ ./ml local bootstrap
$ ./ml local deploy modules

You can then access the admin interface to ensure the database and app servers were created as configured.

[Screenshot: Admin Config]

A quick node.js and Angular configuration

Now that our MarkLogic is configured with databases, forests, and application servers, we’re ready to install the app and its required dependencies with the following commands in the terminal or Git Bash for Windows users:

$ npm install
$ bower install

Cool… But we have no data yet!

Good point… For this article, one of the publicly-available Medicare Part D Provider Utilization and Payment Data Detailed Data Sets will be covered: any link on that page beginning with Part D Prescriber PUF, CY2013 will allow you to follow along with ease. The more adventurous may want to use their own data in CSV format (highly encouraged!). Even though we are working with different data sets, these steps should be helpful along the way!

Open the downloaded XLS in Excel and save it as a CSV for easy import into MarkLogic. xlsx2csv or another conversion tool could also be used. This conversion is not a requirement since MarkLogic could ingest the data as is and convert it for us but we’ll save that topic for a future blog post.

There are many empty row entries in the CSV. They may be removed with the perl command below (thanks, Editor!).

perl -i -ne 'if (!/,,,,,/) { print; }' PartD_Prescriber_PUF_NPI_DRUG_Aa_Al_CY2013.csv

If you are not familiar with perl, don’t have it, or prefer another option, open the CSV in your favorite text editor, find line 564,447, remove the empty entries below it, and save the file.

Once finished, place the CSV into the /data directory.

Create a /import-medicare-data.options file (change medicare in the filename to whatever best describes your data set) with the following configuration, which MLCP (MarkLogic Content Pump) will leverage to bring data into MarkLogic. Make sure there is a space at the end of the flags (-input_file_type) before the values (delimited_text) on the new lines. Of course, make sure to change the input_file_path value so it points to the CSV you would like to import into MarkLogic. (See MLCP Options File format.)

import
-input_file_path
<path to your ml-slush-discovery-app parent directory>/cookingshow/data/<CSV file name> 
-input_file_type
delimited_text 
-output_uri_suffix
.xml
-output_collections
csv
-generate_uri
true

What does this mean? We’ve specified that we are importing data (MLCP can be used for export as well), the path to the CSV data we would like to import, what type of file it is, the resulting URI suffix (“.xml”), that the documents should be placed into the “csv” collection, and that we would like to automatically generate a URI (Uniform Resource Identifier, or primary key) for each one.

From the command line in the root directory, we can then run the following MLCP call and use our previous configuration:

./ml local mlcp -options_file import-medicare-data.options

For the impatient or curious, opening Query Console (usually found at http://localhost:8000/), selecting the discovery-app-content Database as the Content Source, hitting the Explore button, and selecting a document is a way to see the loaded content. Back to the command window…

[Screenshot: MLCP Load]

Looks like it did or is doing something. But what? In my case, it loaded 564,445 records into MarkLogic from the CSV in a few minutes (on my laptop). Cool!

Extra Credit: But what about those Twitter and Blog data sets?

NOTE: This step is not required to continue, but you will not have the experience of searching two completely different data sets at the same time or seeing a pie chart depicting it. 🙂

No problem! We can use an XQuery script in Query Console, written by my highly esteemed colleague Pete Aven, that leverages MarkLogic’s xdmp:http-get function. It will retrieve multiple RSS feeds related to medicine, which we can co-mingle with the Medical Insurance Claims Data we just loaded in.

Download the Load Feeds Query Console Script and copy its contents into your open Query Console window from earlier. These 30 lines of code will grab the data from the RSS feeds and include a function to clean up the dates. We’ll cover this code in more detail in a future post.
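To give a sense of the approach without spoiling the real script, here is a hedged, minimal sketch (not Pete’s actual Load Feeds code; the feed URL and URI scheme are made up for illustration) showing how xdmp:http-get fetches a feed and xdmp:document-insert writes each item out as its own XML document:

xquery version "1.0-ml";

(: Minimal sketch only – not the original Load Feeds script. Fetch one RSS
   feed with xdmp:http-get and insert each <item> as its own XML document.
   The real script iterates over feeds.xml and also normalizes pubDate
   values into a modPubDate element. :)
let $feed-url := "http://www.example.com/rss/medical.xml"  (: hypothetical feed URL :)
let $response := xdmp:http-get($feed-url)
let $body     := $response[2]  (: item 1 is the response headers, item 2 the body :)
for $item in $body//item
let $uri := fn:concat("/article-feed/", xdmp:hash64(xdmp:quote($item)), ".xml")
return
  xdmp:document-insert(
    $uri,
    <doc>{ $item/title, $item/pubDate, $item/link, $item/description }</doc>,
    (),               (: default permissions :)
    "article-feed"    (: collection, handy for faceting later :)
  )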

Download the XML document of RSS feeds into your Slush Discovery root directory. Update line 13 in Query Console so it points to this location.

If you wanted to create your own version of feeds.xml, create a document by the same name where the value of each feed element is a URL you would like to pull RSS data from. In the case below, I have provided an example of what your feed file would look like if you wanted to crawl all of the Wall Street Journal’s RSS feeds. Of course, you would need to adjust the code below to work with the structure of those RSS feeds as not all RSS feed structures are created the same.

<feeds>
 <feed>http://www.wsj.com/xml/rss/3_7085.xml</feed>
 <feed>http://www.wsj.com/xml/rss/3_7014.xml</feed>
 <feed>http://www.wsj.com/xml/rss/3_7031.xml</feed>
 <feed>http://www.wsj.com/xml/rss/3_7455.xml</feed>
 <feed>http://www.wsj.com/xml/rss/3_7201.xml</feed>
</feeds>

If you want to use the WSJ feeds, update line 25 of the code to account for the difference in element names between the feeds:

let $modPubDate := element pubDate {$dateTime}

Run the code in Query Console. Once it runs successfully, select the Explore button and then one of the URIs beginning with /article-feed/ to see the results of bringing in the external blog data (one of the documents from medicannewstoday.com RSS feeds is shown below):

<?xml version="1.0" encoding="UTF-8"?>
<doc>
 <title>Watching the inflammation process in real time</title>
 <pubDate>Tue, 20 Oct 2015 08:00:00 PST</pubDate>
 <link>http://www.medicalnewstoday.com/releases/301275.php</link>
 <guid>http://www.medicalnewstoday.com/releases/301275.php</guid>
 <description>Asthma bronchiale, hayfever or neurodermatitis -- allergies are on the increase in Western European industrial countries.</description>
 <category domain="http://www.medicalnewstoday.com/categories/allergy/">Allergy</category>
 <modPubDate>2015-10-20T00:00:00</modPubDate>
 <type>feed</type>
</doc>

How do we see what we loaded into MarkLogic aside from Query Console? Glad you asked! We’ll use the Slush Discovery App to see this after a quick detour to understand Range Indexes.

A quick word about Range Indexes

Maybe we want to be able to run value-based (dates, integers, etc.) queries against data sitting in our XML documents. Said a different way: maybe I want to quickly search for items that fall between values, just like when I am shopping for TVs on Amazon.com and want to filter down to the ones that are 70″ and larger while showing the user how many results match that “bucket.” In MarkLogic, we create Range Indexes to handle this, and it’s really easy if not trivial. We can also create unlimited facets or ranges of buckets to accomplish the “show me all TVs over 70″” use case.

[Image: Range Index Conceptual]

Range Indexes map values to documents and vice versa. Why?

Value to Document: fast lookup and intersection of doc-ids from multiple predicates.

Document to Value: no need to load the document just to get a value from it, which keeps things fast.

Because these indexes are in-memory, MarkLogic can return documents that fall between given values, counts on them or calculations related to them, and intersections with other indexes quite quickly without going to disk. Super fast!
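To make that concrete, here is a small Query Console sketch you can try once the generic_name and total_claim_count range indexes are created later in this post (element names come from the claims CSV; this is not part of the app itself):

xquery version "1.0-ml";

(: Both expressions below resolve from the in-memory range indexes. :)

(: Facet-style lexicon lookup: distinct generic_name values with their frequencies. :)
(for $value in cts:values(cts:element-reference(xs:QName("generic_name")))
 order by cts:frequency($value) descending
 return fn:concat($value, ": ", cts:frequency($value)))[1 to 10]
,
(: Range query: a few documents whose total_claim_count is 100 or more. :)
cts:search(
  fn:collection("csv"),
  cts:element-range-query(xs:QName("total_claim_count"), ">=", xs:int(100))
)[1 to 3]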

Range Indexes may be configured in the browser-based MarkLogic Administration Console or via API; the configuration is practically trivial. For simplicity, we’ll configure these from the Slush Discovery App.
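For reference, the API route looks roughly like the sketch below, which uses MarkLogic’s Admin API from Query Console to add the same kind of string index on generic_name (assuming no namespace on the CSV-derived elements and the default collation). The rest of this post sticks with the Slush Discovery App.

xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Sketch: add a string range index on generic_name to discovery-app-content
   via the Admin API instead of the UI. :)
let $config := admin:get-configuration()
let $dbid   := xdmp:database("discovery-app-content")
let $index  := admin:database-range-element-index(
                 "string",                           (: scalar type :)
                 "",                                 (: element namespace (none) :)
                 "generic_name",                     (: element local name :)
                 "http://marklogic.com/collation/",  (: default collation :)
                 fn:false()                          (: range value positions :)
               )
return
  admin:save-configuration(
    admin:database-add-range-element-index($config, $dbid, $index))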

Show me the goodness!

Co-mingled data sits in the discovery-app-content database. We are getting to the fun part of running the Slush Discovery App where the Range Indexes will be configured. Run the following at the root level of the ml-slush-discovery-app directory:

$ gulp serve-local

This will load up the app and open a browser window to http://localhost:3000/. It may not load completely on first run. Fear not! Refreshing the browser window should load it as expected:

[Screenshot: Successful Deployment Screen]

Select the Login button in the upper-right and login using your admin credentials.

[Screenshot: Login]

Once logged in, you’ll see that the admin user name has replaced the Login button, along with some of the documents that were loaded into the Slush Discovery database. Select the user menu and then the Setup option.

[Screenshot: Setup Menu]

The setup options appear.

[Screenshot: Select DB]

Ensure the Database Name is set to discovery-app-content (it should be selected by default). Select Set Database to save this setting and navigate to the Indexes Tab.

[Screenshot: Range Indexes]

Select the Add button for the top Range Indexes section so we can add a Range Index for generic_name (string). Where did generic_name come from? It was a column header from the CSV we loaded into MarkLogic with MLCP and is now an XML element in the data set (e.g. <generic_name> ). If you are using your own data set, you could use other desirable XML elements for your range indexes.

[Screenshot: Add Range Index]

You then select the appropriate Type of Index (Element/JSON Property) and Type (string). Then, enter the name of the element into the bottom element field (generic_name) and select the white find button to the right.

[Screenshot: Range Index Add Details]

Select the radio button corresponding to the desired item for a Range Index, select the Add button, and you are all set.

Now that you’ve created one range index, you can create others for: nppes_provider_city (string), nppes_provider_state (string), specialty_description (string), total_claim_count (int), and total_day_supply (int), to name a few. The total_drug_cost column is certainly of interest but we’ll save transforms for a future post. 😉 Be careful to select the correct Type of Index and Type as you go along.

[Screenshot: All Range Indexes Added]

Navigate to the Constraints Tab. The range-indexed items are displayed; you may need to select the blue Resample button in the upper-right corner to see them. Items may be re-named (as I’ve done below), re-ordered, or deleted. There is also a Faceted option, which lets you define which elements can be filtered on, similar to the Amazon example with the 70″ flat-screen TV.

[Screenshot: Constraints]

Secret trick here: select the Add button to add another Facet on the Collection names in the database.

[Screenshot: Add Collection Constraint]

Enter Collection for the Name, check the Facet checkbox, and then select the Save button on the pop-up and Save again on the page to save all of the settings.

This may take you to the Results Tab and show you an error (depending on your database size):

[Screenshot: Reindexing Error]

That’s ok. If you were to look at the Admin UI and navigate to Configure > Databases > discovery-app-content > Status Tab, you would see that the database is reindexing/refragmenting per your configuration. As of this writing, these errors will persist in the app until reindexing is done.

[Screenshot: Admin Reindexing]

In the App, return to the Constraints Tab and you will see the settings as desired.

[Screenshot: Finished Constraints and Collection]

On the Suggestions/Sort Options Tab, a range-indexed element value may be chosen for type-ahead searching, similar to the way Google operates. Select Resample, then set the first drop-down to Generic Name if using the claims data set (or another option of interest). If you do not see any values, tap the Resample button again. Select Save when done.

[Screenshot: Suggestions]

Navigate to the Results Tab for a preview of the Search application leaving the UI Config options blank (as of this writing, they do not work as desired). Note: The screenshot does not show all the facets.

[Screenshot: Initial Results]

It could be helpful to see a pie chart of the top 15 Generic Names in the data set. Select the Add Widget button.

[Screenshot: Add First Chart Widget]

As shown above, change the Title to Generic Name, move the Generic%20Name value to the Data Point Name area below and select Save. The Widget is now displayed on the page Results Preview.

[Screenshot: One Widget]

Selecting the blue arrow pointing to the right will change the pie chart’s width to half-width making room for the next widget.

[Screenshot: One Half Screen Widget]

We’ll want to visually display how many documents originated from each source using the collections constraint since MarkLogic is so good at integrating data from silos. To do so, select the Add Widget button, remove Generic%20Name from Series Name and drag Collection to Data Point Name, and change the Title to Source.

[Screenshot: Second Widget]

Select the Save button.

Uh oh! The Pie charts are not side-by-side.

[Screenshot: Widgets Not Next to Each Other]

No problem! Select the blue arrow pointing to the upper-right of the Source pie chart.

[Screenshot: Widgets Next to Each Other]

Ahhh… That’s better! Navigate to the Home link in the top navigation bar.

[Screenshot: Finished App]

Here are the results of your work displayed in the ml-slush-discovery-app application, where you can select the facets on the left to filter the results, type queries using MarkLogic’s search grammar into the search bar, and select the result links to see the content of the selected document.

The document results allow you to view the elements and text hierarchically, in JSON, or XML (MarkLogic does store both JSON and XML natively):

[Screenshot: Document Result]

Sometimes, the results do not look the way I want. No big deal! Query Console and the ml-slush-discovery-app allow you to update the configuration, transform the data, make changes, and redeploy. I can quickly iterate a few times until things look the way I want. That’s the power of MarkLogic!

Exploring the Data through Facets and Search – Some Ideas

We may want to explore our data when we have some ideas about what we are looking for. Facets can help. In the claims data set, we can select the following facets:

Provider City: Chicago

Days Supplied: 360

Specialty Description: Psychiatry

With that, the Generic Names prescribed for 360 Days by Chicago doctors specializing in Psychiatry are displayed. Interesting insights!
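Under the hood, those three facet selections amount to an and-query over the indexed elements. Here is a hedged Query Console sketch (element names are from the claims CSV; the exact stored values, e.g. “CHICAGO” versus “Chicago,” depend on the source data):

xquery version "1.0-ml";

(: Roughly what the three facet selections above boil down to. :)
cts:search(
  fn:collection("csv"),
  cts:and-query((
    cts:element-value-query(xs:QName("nppes_provider_city"), "CHICAGO"),
    cts:element-range-query(xs:QName("total_day_supply"), "=", xs:int(360)),
    cts:element-value-query(xs:QName("specialty_description"), "Psychiatry")
  ))
)[1 to 5]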

Facets are useful for answering questions and exploring our information, but when we don’t know what we’re looking for, full text search can help as well. The app’s search bar leverages MarkLogic’s Search API, which comes with a Google-style grammar for writing powerful searches.

Clear your facets by selecting the x’s on the selected facets in the upper-left. Here are some searches to try:

Psychiatry neurology
Psychiatry AND neurology

Note that with this search there are only a few matching documents containing these terms. Combining full text search and faceted navigation to find the information that’s important to us is very powerful!

Psychiatry OR neurology

We could search on the phrase “Psychiatry neurology” (with quotes) but that would not return any results in this data set.

If you find any other good phrases to search on, please comment below. 🙂

The default grammar for searches includes NEAR for proximity searches, GT, LT, NOT_IN, parentheses for precedence, and a whole lot of other options, but you can also extend it for your particular requirements if you already have a grammar in use in-house or find one that is more intuitive for your needs.
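If you want to play with the grammar outside the app, the same query text can be handed straight to the Search API in Query Console (with discovery-app-content selected as the content source). A minimal sketch assuming the default search options:

xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: The same Google-style grammar the app's search bar uses. :)
search:search("Psychiatry AND neurology"),
search:search('"Psychiatry neurology"')  (: the quoted-phrase form; no hits in this data set :)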

Cleaning up…

You may shut down the app with a Ctrl+c in the terminal window.

To remove the Slush Discovery database and its components from MarkLogic, issue the following:

WARNING: This will completely remove the Slush Discovery App database, along with any data you loaded into your MarkLogic instance, and restart MarkLogic. Trust me, these commands have their desired effects. 🙂

NOTE: Windows users will use Git Bash instead of the Windows command prompt and will use ml.bat instead of ml.

$ ./ml local wipe
$ ./ml local restart

Feel free to keep the ml-slush-discovery-app folder so you can use it for the next post, or for your own future use to quickly visualize data in MarkLogic.

Summary

Congratulations on taking the first steps to unthink and free your data with MarkLogic!

We’ve only just begun scratching the surface of MarkLogic’s capabilities in this blog post. Keep an eye out for future posts that build on this example in the 0 to 100 Series… Real Quick!

MarkLogic, NoSQL, Tech

Unthink: Moving Beyond the Constraints of Relational Databases

13/03/2016

Part one of the 0 to 100 series — with inspiration from Drake.

Working with relational databases for a number of years, I had become quite attached to their well-defined organization, subsequent predictability, and reliability. Then I looked at NoSQL databases with both tree and graph structures. In studying the need for these additional types of databases, I realized my career had been an ongoing data safari – always trying to jam the differently-sized animals you might find on safari into the same size box. It’s pretty easy to see that a custom-built “box” for the honey badger would not fit the giraffe very well (or vice versa). Like the animals on my safari, data is not always the same shape either.

In the relational world, you are bound by rows and columns – and often by heavily constrained fields that must be a specific format and a specific length. What do you do with data that is larger?

How to ‘Unthink’

When I first looked at MarkLogic, I saw first-hand how it allowed me to think outside the confines of a “box” – to “unthink” the modeling required to build a box for each animal to fit into – since I could easily extend my model to accommodate each of them, including those I did not expect to find.

Plugging “unthink” into your favorite search engine offers results with such suggestions as “to put out of mind.” Let’s begin seeing how we can rediscover our creative genius and put those relational tools out of mind.

Modeling the ER Diagram – A World of Pain

Let’s consider a CRM Application that is typically driven by relational databases. An application of this magnitude would include many different data sources, data types and certainly data schemas. Imagine modeling that hairy mess using entity-relationship (ER) diagrams. In this real-world scenario, you likely have at least 1,000 pieces of information in the ER diagram which can easily span entire walls and at least hundreds of tables.

This does not seem like a big deal until you experience the sparse data problem, where a given column may or may not have values, leading to slow and inefficient algorithms as processing and memory are wasted on the zeroes or null values. On the other hand, there could be multiple values to deal with; one of my colleagues has six valid mailing addresses that are not duplicates, and this is rather common among other colleagues and friends. When viewed as a class model, this object-relational mapping leads to sparse data for the polymorphic details.

[Image: Entity Relationship Diagram]

The above ER diagram is an example, with any identifying information about its origin removed to protect the offenders, hahah. These complex ER diagrams are time-consuming, painful, and boring. More importantly, these modeling exercises also carry a large price tag payable to a big consulting firm, with little chance for future expansion of the data model as the business changes. Chances are, you’ve seen or been involved with this part of the project more times than you are willing to admit. Me too.

Simple Example: Defensive Modeling

Taking a step back from the complex ER diagram above, suppose you needed to create a table to capture some User information today. A simple enough task, right? What would you do to create that table? What might this look like? Well, it would probably have columns that would include:

ID (primary key), CustomerNumber, UserID, FirstName, LastName, MiddleName, Address1 through Address4 (Just in case!), City, State, ZipCode, and Country, for example…

[Image: Rows and Columns]

When you create columns Address1 through Address4, you are engaging in defensive modeling strategies (hint: NoSQL databases such as MarkLogic do not require this – more later, no peeking!): trying to think through every input and output possibility up front so that you are protected against the inflexibility of the relational database and do not have to add more columns when they are needed later.
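For contrast, here is a hedged sketch of the document-model alternative in MarkLogic: a user document simply repeats <address> as many times as needed, so there is no Address1-through-Address4 guesswork. Element names here are illustrative, not a prescribed schema.

xquery version "1.0-ml";

(: Sketch: one user as a document. Add a third or sixth address later
   without any schema change – no defensive modeling required. :)
xdmp:document-insert(
  "/users/12345.xml",
  <user>
    <customerNumber>12345</customerNumber>
    <firstName>Jane</firstName>
    <lastName>Doe</lastName>
    <address>
      <city>Chicago</city>
      <state>IL</state>
      <zip>60601</zip>
    </address>
    <address>
      <city>San Carlos</city>
      <state>CA</state>
      <zip>94070</zip>
    </address>
  </user>
)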

Let’s take things up a notch

What if I wanted to up the complexity a bit and work with other real-world data, e.g., Medical Insurance Claims Data?

[Image: Medical Insurance Claims Data]

Pre-thinking about and modeling the data you have to fit into those rows and columns probably took a decent amount of time. Why? Rows and columns force you to evaluate the possibility of relationships and cardinality. For all intents and purposes, my respective tables could have one-to-many relationships from and to them. What if I am trying to bring multiple data silos together? This is even tougher to deal with because the schemas won’t match and I must spend time unifying them. There are typically decisions about which data to keep versus which to drop because of relational’s associated cardinality – leading to the desire to shortcut any associated modeling where possible. The table forces me to make design choices before I have all the facts about the data sets. It is also a rather intensive operation to add a column later and create the related additional indexes on it, not to mention ugly to look at from an organizational perspective. I like my columns or data items in a specific order AND hierarchy that makes sense, thank you very much!

XML/JSON solves all of our modeling problems in relational, right?

Imagine the team has discovered the flexible nature of both XML and JSON and how each (correctly) deals with these modeling challenges. Further, there are probably freetext/unstructured items to deal with as well. Those will likely end up living or dying in BLOB columns or, as I prefer to say, “Data Coffins.” Why will they die? It is hard to do much with these items once they’ve landed in those columns; it is effectively where good data goes to die, because it cannot be easily searched or quickly leveraged.

We’re done now, right? Not so fast! Columns need to be created to hold primary key values, and associated indexes written on them, separate from the XML/JSON or unstructured BLOB columns, so we have a method for search and retrieval. What if you want to search on a value from the XML/JSON or unstructured text some day in the future? Uh oh – the DBA has to write code extracting the relevant data into a new column, figure out how to keep it in sync with the source data, and model indexes on it. Allow two weeks; do not pass “Go.”

Because this is zero (some might argue negative) fun, the DBAs draw straws to decide who gets to work on the requirement. In all seriousness, I have to know the kinds of questions I want to ask right now if I want them answered in a quick manner down the road. Woof! This is for the dogs…

Good morning, time for a requirements change!

The next day, you find the nice table you made and its associated ER Diagram will not work due to a conversation one of your colleagues had with the Marketing Department uncovering additional data requirements: co-mingle Twitter or Blog post data into the User or Medical Insurance Claims Data for a Customer 360 view. Uh oh! Conservatively speaking, that’s about a day of productivity lost (at best!), many days added to the schedule, and the project delivery date gets pushed out further into the future.

The Twitter and Blog content the Marketing team is after will likely be large and unstructured in nature; too big to be stored natively, so it will have to be jammed into BLOB columns or Data Coffins for storage. Moreover, imagine if the requirement were to store PDFs and index those too. Same thing applies: those would end up in BLOBs as well. Any searches on the text, and the respective element structures or values to be stored, would have to be defined upfront, before runtime.

If Only I Had Something Like Google

With traditional databases, most indexes don’t efficiently account for word position, term frequency and some of the other Google-like searches we have come to expect. Did we tell Google what we were looking for in advance? Of course not! Attempting to pre-model these items in advance with relational tools and even some NoSQL databases is arduous and the value of these efforts is hardly guaranteed.

Sound familiar? Unfortunately, yes. Except this seemingly endless cycle of events usually has the effect of adding at least 6 months to the delivery date of any given project.

So we have a simple table to create, but it requires a lot of modeling work upfront that could be scrapped or re-worked by any new or changed business requirement; it does not support native XML/JSON, which leaves searchable values in the lurch; and delivering search performance means adding a lot of hardware and other DBA shenanigans. Double woof.

You’re not alone

At MarkLogic, we see customers experience these problems all the time. All. The. Time. These problems add unnecessary labor hours, cripple projects, elongate timeframes or prevent delivery altogether, and generally make things less fun overall. Did I mention woof?

Unthink and “Put out of mind”

As competitive sailboat racers, we have a phrase with respect to knots: “When you don’t know how to tie a knot, tie a lot.” The steps described above sound like “a lot,” don’t they? So, if you’ll pardon the mixed metaphor, over the next few weeks I will introduce the knots you need for your data safari – happy sailing!

Read Part 2: Unthink: Starting to move beyond the constraints of relational databases.


©2021 - Tom McGrath. All Rights Reserved.

