Bringing Data Discovery To Hadoop – Part 1

We have been anticipating the intersection of big data with data discovery for quite some time. What exactly that will look like in the coming years is still up for debate, but we think Oracle’s new Big Data Discovery (BDD) application provides a window into what true discovery on Hadoop might entail.

We’re excited about BDD because it wraps data analysis, transformation, and discovery tools together into a single user interface, all while leveraging the distributed computing horsepower of Hadoop.

BDD’s roots clearly extend from Oracle Endeca Information Discovery, and some of the best aspects of that application — ad-hoc analysis, fast response times, and instructive visualizations — have made it into this new product. But while BDD has inherited a few of OEID’s underpinnings, it’s also a complete overhaul in many ways. OEID users would be hard-pressed to find more than a handful of similarities between Endeca and this new offering. Hence, the completely new name.

The biggest difference, of course, is that BDD is designed to run on the hottest data platform in use today: Hadoop. It is also cutting edge in that it uses the blazingly fast Apache Spark engine for all of its data processing. The result is a very flexible tool that lets users easily upload new data into their Hadoop cluster or, conversely, pull existing data from their cluster into BDD for exploration and discovery. It also includes a robust set of functions for testing and performing transformations on data on the fly in order to get it into the best possible working state.

In this post, we’ll explore a scenario where we take a basic spreadsheet and upload it to BDD for discovery. In another post, we’ll take a look at how BDD takes advantage of Hadoop’s distributed architecture and parallel processing power. Later on, we’ll see how BDD works with an existing data set in Hive.

We installed our instance of BDD on Cloudera’s latest distribution of Hadoop, CDH 5.3. From our perspective, this is a stable platform for BDD to operate on. Cloudera customers should also have a fairly easy time setting up BDD on their existing clusters.

Explore

Getting started with BDD is relatively simple. After uploading a new spreadsheet, BDD automatically writes the data to HDFS, then indexes and profiles it based on some clever intuition.

What you see above displays just a little bit of the magic that BDD has to offer. This data comes from the Consumer Financial Protection Bureau and details four years’ worth of consumer complaints about financial services firms. We uploaded the CSV file to BDD in exactly the condition we received it from the bureau’s website. After specifying a few simple attributes, like the quote character and whether the file contained headers, we pressed “Done” and the application got to work processing the file. BDD then automatically built the charts and graphs displayed above to give us a broad overview of what the spreadsheet contained.
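
BDD drives all of this from the browser, but for a rough sense of the work happening underneath, here is an illustrative PySpark sketch of a similar ingest-and-profile pass. The file path and column names are hypothetical, and this is an analogy, not BDD’s actual code:

```python
# Illustrative only: a PySpark analogue of BDD's ingest step. The file path and
# column names are hypothetical; BDD does all of this for you from the browser.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("complaints-profile").getOrCreate()

# Read the raw CSV as-is, telling Spark about the header row and quote character,
# much like the options BDD prompts for at upload time.
df = (spark.read
      .option("header", "true")
      .option("quote", '"')
      .option("inferSchema", "true")
      .csv("hdfs:///user/demo/cfpb_complaints.csv"))

df.printSchema()                      # which attributes did we get, and what types?
print(df.count(), "records loaded")

# Broad-strokes profiling, similar to the summary charts BDD builds automatically:
# the most-complained-about companies and the products drawing the most complaints.
df.groupBy("company").count().orderBy(F.desc("count")).show(10)
df.groupBy("product").count().orderBy(F.desc("count")).show(10)
```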

As you can see, BDD does a good job presenting the data to us in broad strokes. Some of the findings we get right from the start are the names of the companies that have the most complaints and the kinds of products consumers are complaining about.

We can also explore any of these fields in more detail if we want to do so:

[Screenshot: detailed view of the date field]

Now we get an even more detailed view of this date field and can see how many unique values there are and whether any records are missing data. It also gives us the range of dates in the data. This feature is incredibly helpful for data profiling, but we can go even deeper with refinements.
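
Under the covers, that kind of field-level profile boils down to a handful of aggregate queries. Continuing the illustrative PySpark sketch from above (the column name is again an assumption):

```python
# Illustrative only: field-level detail for a hypothetical date column.
df.select(
    F.countDistinct("date_received").alias("unique_values"),
    F.sum(F.col("date_received").isNull().cast("int")).alias("missing_records"),
    F.min("date_received").alias("earliest_date"),
    F.max("date_received").alias("latest_date"),
).show()
```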

[Screenshot: the data refined by company and company response]

With just a few clicks on a couple charts, we have now refined our view of the data to a specific company, JPMorgan Chase, and a type of response, “Closed with monetary relief”. Remember, we have yet to clean or manipulate the data ourselves, but already we’ve been able to dissect it in a way that would be difficult to do with a spreadsheet alone. Users of OEID and other discovery applications will probably see a lot of familiar actions here in the way we are drilling down into the records to get a unique view of the data, but users who are unfamiliar with these kinds of tools should find the interface to be easy and intuitive as well.
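
In Spark terms, the same refinement is just a pair of filters layered onto the data frame from the earlier sketch (column names and values again follow that hypothetical schema):

```python
# Illustrative only: refine to one company and one type of response.
refined = df.filter(
    (F.col("company") == "JPMorgan Chase") &
    (F.col("company_response") == "Closed with monetary relief"))
print(refined.count(), "matching complaints")
refined.groupBy("product").count().orderBy(F.desc("count")).show(10)
```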

Transform

Another way BDD differentiates itself from some other discovery applications is with the actions available under the “Transform” tab.

Within this section of the application, users have a wealth of common transformation options available with just a few clicks. Operations like converting data types, concatenating fields, and taking absolute values can now be done on the fly, with a preview of the results available in near real time.
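
As a rough analogue, the same quick transforms look like this in the illustrative PySpark sketch we have been building. The column names and date format are assumptions, and this is not the syntax of BDD’s own transform editor:

```python
# Illustrative only: on-the-fly transforms of the kind described above.
transformed = (df
    .withColumn("date_received", F.to_date("date_received", "MM/dd/yyyy"))  # convert data type
    .withColumn("issue_full", F.concat_ws(" - ", "issue", "sub_issue"))     # concatenate fields
    .withColumn("amount_abs", F.abs(F.col("amount"))))                      # absolute value
transformed.select("date_received", "issue_full", "amount_abs").show(5)
```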

BDD also offers more complex transformation functions in its Transformation Editor, which includes features like date parsing, geocoding, HTML formatting and sentiment analysis. All of these are built into the application; no plug-ins required. Another nice feature BDD provides is an easy way to group (or bin) attributes by value. For example, we can find all the car-related financing companies and group them into a single category to refine by later on:

[Screenshot: grouping car-related financing companies into a single category]

BDD also lets you preview the results of a transform before committing the changes to all of the data. This allows a user to fine-tune their transforms with relative ease and minimal back-and-forth between data revisions.

Once we’re happy with our results, we can commit the transforms to the data, at which point BDD launches a Spark job behind the scenes to apply the changes. From this point, we can design a discovery interface that puts our enriched data set to work.
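
That preview-then-commit cycle maps naturally onto the pattern you would use in Spark directly: try the transform on a small sample, then run it across the full dataset once you are satisfied. A sketch under the same assumptions as above, with the auto-lender grouping invented to echo the binning example earlier:

```python
# Illustrative only: preview a binning transform on a sample, then apply it to
# the full dataset, roughly what BDD's preview/commit cycle does for you.
auto_lenders = ["Ally Financial Inc.", "Toyota Motor Credit Corporation",
                "Santander Consumer USA"]  # hypothetical grouping

def add_company_group(frame):
    return frame.withColumn(
        "company_group",
        F.when(F.col("company").isin(auto_lenders), "Auto financing")
         .otherwise(F.col("company")))

# Preview on a small sample before committing.
add_company_group(df.limit(1000)).select("company", "company_group").show(20)

# Once happy, apply the same transform to the whole dataset and persist it.
add_company_group(df).write.mode("overwrite").saveAsTable("complaints_enriched")
```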

Discover

Included with BDD is a set of dynamic, advanced data visualizations that can turn any data set into something profoundly more intuitive and usable:

[Screenshot: a sampling of BDD’s Discover visualizations]

The image above is just a sampling of the kind of visual tools BDD has to offer. These charts were built in a matter of minutes, and because much of the ETL process is baked into the application, it’s easy to go back and modify your data as needed while you design the graphical elements. This style of workflow is drastically different from workflows of the past, which required the back- and front-ends to be constructed in entirely separate stages, usually in totally different applications. This puts a lot of power into the hands of users across the business, whether they have technical chops or not.

And as we mentioned earlier, since BDD’s indexing framework is a close relative of Endeca’s, it inherits the same real-time processing and unstructured search capabilities. In other words, digging into your data is simple and highly responsive:

[Screenshot: searching and refining the enriched data set]

As more and more companies and institutions begin to re-platform their data onto Hadoop, there will be a growing need to effectively explore all of that distributed data. We believe that Oracle’s Big Data Discovery offers a wide range of tools to meet that need, and could be a great discovery solution for organizations that are struggling to make sense of the vast stores of information they have sitting on Hadoop.

If you would like to learn more, please contact us at info [at] ranzal.com.

Also be sure to stay tuned for Part 2!

Data Discovery In Healthcare — 1st Installment

Interested in understanding how cutting-edge healthcare providers are turning to data discovery solutions to unlock the insights in their medical records?  Check out this real-world demonstration of what a recent Ranzal customer is doing to unlock a 360-degree view of their clinical outcomes by leveraging all of their EMR data — both the structured and unstructured information.

Take a look for yourself…

Fun with Shapefiles: The Two Utahs

A little midweek enjoyment, courtesy of our Advanced Visualization Framework.  Below, you can see a county-by-county map of Utah and all of its Oil and Gas Fields.

http://branchbird.com/utah.html

You can hover over counties and fields and get some basic statistics related to the county or field you are inspecting.  For fields, we have the oil/gas field status, the year it was opened and other basic information, such as whether or not it was merged with another field.

We had to “dumb it down a bit” and put it into an iframe (WordPress!), but you can still see some of the detail.  It’s obviously not as flexible as our “real visualizations” (no zooming, no refining, etc.) that render inside of Oracle Endeca Studio, but it gives you a sense of how quickly and easily our technology incorporates advanced GIS data into a Data Discovery application.

Tag 100 Times Faster — Introducing Branchbird’s Fast Text Tagger

Text Tagging is the process of using a list of keywords to search and annotate unstructured data. This capability is frequently required by Ranzal customers, most notably in the healthcare industry.

Oracle’s Endeca Data Integrator provides three different ways to text tag your data “out of the box”.

  • The first is the “Text Tagger – Whitelist” component which is fed a list of keywords and searches your text for exact matches.
  • The second is the “Text Tagger – Regex” component which works similarly but allows for the use of regular expressions to expand the fuzzy matching capabilities when searching the text.
  • The third is using Endeca’s “Text Enrichment” component (OEM’ed from Lexalytics) and supplying a model (keyword list) that takes advantage of the component’s model-based entity extraction.

Ranzal began working on a custom text tagging component due to challenges with the aforementioned components at scale. All of the above text taggers are built to handle relatively small inputs — both in the size of the supplied dictionary and in the number (and size) of documents.

                         1,000 EMRs       10,000 EMRs      100,000 EMRs     1,000,000 EMRs
Fast Text Tagger (FTT)   250 docs/sec     1,428 docs/sec   4,347 docs/sec   6,172 docs/sec
Text Enrichment (TE)     6.5 docs/sec     5 docs/sec       N/A              N/A
TE (4 threads)           17.5 docs/sec    15 docs/sec      15 docs/sec      N/A

In one of our most common use cases, customers analyzing electronic medical records with Endeca need to enrich large amounts of free text (typically physician notes) using a medical ontology such as SNOMED-CT or MeSH. Each of these ontologies has a large number of medical “concepts” and their associated synonyms. For example, the US version of SNOMED-CT contains nearly 150,000 concepts. Unfortunately, the “out of the box” text tagger components do not perform well beyond a couple hundred keywords. To realize slightly better throughput during tagging, Endeca developers have traditionally leveraged the third option listed above, the Lexalytics-based “Text Enrichment” component, which offers better performance than the other two.

However, after extensive use of the “Text Enrichment” component, it became clear that not only was the performance still unacceptable at high scale, but the component’s recall was also inadequate, especially with Electronic Medical Records (EMRs). The Text Enrichment component is NLP-based and relies on accurately parsing sentence structure and word boundaries to tokenize each document before entity extraction begins. EMRs typically have very challenging sentence structure, due both to the ad hoc writing style of clinicians at the point of entry and to the observational metrics embedded in the record. Because of this, using Text Enrichment for simple text tagging can be prohibitive at high scale, even on small documents. A recent customer of ours, using very high-end enterprise equipment, was seeing 24-hour processing times when using Text Enrichment to tag approximately six million EMRs with SNOMED-CT concepts.

To address both the performance and recall issues, Ranzal set out to build a simple text tagger component for Integrator that would be easy to set up and use. The Ranzal “Fast Text Tagger” was built using a high-performance dictionary matching algorithm that ingests the list of terms (and phrases) into a finite-state pattern matching machine, which is then used to process the documents. One of the biggest benefits of this family of search algorithms is that the document text only needs to be parsed once to find all possible matches against the supplied dictionary.
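
The post doesn’t name the underlying algorithm, but the description above (a finite-state machine built from the term list that finds every dictionary hit in a single pass over the document) matches classic Aho-Corasick multi-pattern matching. What follows is a minimal, purely illustrative Python sketch of that approach; it is not Ranzal’s actual Integrator component, and the optional word-boundary restriction is omitted for brevity.

```python
# Illustrative sketch only: an Aho-Corasick-style dictionary tagger, not the
# actual Ranzal Fast Text Tagger. Word-boundary checks are omitted.
from collections import deque

class DictionaryTagger:
    def __init__(self, terms):
        # Build a trie of the dictionary terms (lowercased for case-insensitive tagging).
        self.goto = [{}]        # per-state character transitions
        self.output = [set()]   # terms that end at each state
        self.fail = [0]         # failure links
        for term in terms:
            state = 0
            for ch in term.lower():
                if ch not in self.goto[state]:
                    self.goto.append({})
                    self.output.append(set())
                    self.fail.append(0)
                    self.goto[state][ch] = len(self.goto) - 1
                state = self.goto[state][ch]
            self.output[state].add(term)
        # Breadth-first construction of failure links.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                fallback = self.fail[state]
                while fallback and ch not in self.goto[fallback]:
                    fallback = self.fail[fallback]
                self.fail[nxt] = self.goto[fallback].get(ch, 0)
                self.output[nxt] |= self.output[self.fail[nxt]]

    def tag(self, text):
        """Return (position, term) for every dictionary hit, scanning text exactly once."""
        matches, state = [], 0
        for i, ch in enumerate(text.lower()):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for term in self.output[state]:
                matches.append((i - len(term) + 1, term))
        return matches

tagger = DictionaryTagger(["myocardial infarction", "infarction", "aspirin"])
print(tagger.tag("Pt denies prior myocardial infarction; started ASPIRIN 81mg."))
```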

The Ranzal Fast Text Tagger is intended to replace the stock “Text Tagger – Whitelist” component and the use of the “Text Enrichment” component for whitelisting. Our text tagger is designed for straight text matching, with optional restrictions to allow matching only on word boundaries. If your use cases require fuzzier matching, you should continue to use the “Text Tagger – Regexp” at low scale and “Text Enrichment” at higher scales.

Performance Observations

Looking more closely at the metrics shown above (and duplicated here), you can see the remarkable performance of the Ranzal Fast Text Tagger compared to “Text Enrichment”, even when Text Enrichment is configured to consume 4 threads. Furthermore, the throughput of the BB FTT tends to increase with the number of documents, leveling off as it approaches 1 million documents, whereas Text Enrichment’s stays relatively constant.

                     1,000 EMRs       10,000 EMRs      100,000 EMRs     1,000,000 EMRs
BB FTT (1 thread)    250 docs/sec     1,428 docs/sec   4,347 docs/sec   6,172 docs/sec
TE (1 thread)        6.5 docs/sec     5 docs/sec       N/A              N/A
TE (4 threads)       17.5 docs/sec    15 docs/sec      15 docs/sec      N/A

As a final performance note: the customer mentioned earlier had a graph that took 24 hours to run just for text tagging. The same process, run on this test harness with the same data, finished in just shy of 20 minutes. It took longer to read the data from disk than it took to stream it all through the Fast Text Tagger.  This implies that, in typical use cases, the Fast Text Tagger will not be the limiting component in your graph. For those of you curious about the benchmarking methods used, please continue below.

Test Runs

We built a graph that could execute the different test configurations sequentially and then compile the results. Shown below are four separate test runs and a screen capture of Integrator at test completion. Below each screen cap is a list of metrics:

  • Match AVG: The average number of concepts extracted over the corpus
  • Total Match: The total number of concepts extracted over the corpus
  • Misses: The number of non-empty EMRs where no concept was found
  • Exec Time: The total execution time of the test configuration

Note that Text Enrichment’s poor recall drags its Match AVG down. If you exclude the (significant number of) misses, TE’s average number of concepts per tagged document is nearly as high as our Fast Text Tagger’s.
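
To make those definitions concrete, here is a tiny, purely hypothetical illustration (the per-document counts are invented, not taken from the benchmark) of how misses drag the corpus-wide Match AVG down:

```python
# Hypothetical illustration only: the per-document match counts below are invented.
def summarize(matches_per_doc):
    """matches_per_doc: number of concepts extracted from each non-empty EMR."""
    total = sum(matches_per_doc)                             # "Total Match"
    misses = sum(1 for m in matches_per_doc if m == 0)       # "Misses"
    match_avg = total / len(matches_per_doc)                 # "Match AVG" (includes misses)
    avg_on_tagged = total / (len(matches_per_doc) - misses)  # average over tagged docs only
    return total, misses, match_avg, avg_on_tagged

# A tagger that handles 8 of 10 documents well but misses 2 entirely:
print(summarize([9, 8, 0, 10, 7, 0, 9, 8, 9, 10]))
# -> (70, 2, 7.0, 8.75): the misses alone pull the corpus-wide average from 8.75 down to 7.0
```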

Test 1: 1,000 EMRs

[Screen capture: Integrator at completion of the 1,000-EMR test run]

Test ID              Match AVG   Total Match   Misses   Exec Time
BB FTT (1 thread)    9           9,126         14       4 secs
TE (1 thread)        4           4,876         260      153 secs
TE (4 threads)       4           4,876         260      57 secs

Test 2: 10,000 EMRs

[Screen capture: Integrator at completion of the 10,000-EMR test run]

Test ID              Match AVG   Total Match   Misses   Exec Time
BB FTT (1 thread)    12          127,617       14       7 secs
TE (1 thread)        5           55,567        3,739    2,010 secs
TE (4 threads)       5           55,567        3,739    675 secs

Test 3: 100,000 EMRs

[Screen capture: Integrator at completion of the 100,000-EMR test run]

Test ID              Match AVG   Total Match   Misses   Exec Time
BB FTT (1 thread)    13          1,380,258     17       23 secs
TE (4 threads)       5           546,598       38,466   6,555 secs

Test 4: 1,000,000 EMRs

[Screen capture: Integrator at completion of the 1,000,000-EMR test run]

Test ID              Match AVG   Total Match   Misses   Exec Time
BB FTT (1 thread)    14          14,834,247    17       162 secs

Benchmarking Notes

Tests were conducted on OEID 3.1 using the US SNOMED-CT concept dictionary (148,000 concepts) against authentic Electronic Medical Records. Physical hardware: a PC with a 4-core i7 (with hyperthreading) and 32 GB RAM, on SSD drives.

The “Text Tagger – Whitelist” was discarded as unusable for this test setup. “Text Enrichment” with 1 thread was discarded after the 10,000 document run and TE with 4 threads was discarded after the 100,000 document run.

Advanced Visualizations on Oracle Endeca 3.1

Ranzal is pleased to announce that our Advanced Visualization Framework is now generally available.  Spend more time discovering, less time coding.

The calendar has turned, the air is frozen and, with a new year, comes the annual deluge of “predictions and trends” articles for 2014.  Spoiler Alert: Hadoop isn’t going away and data, and what you do with it, is everything right now.

Maybe you’ve seen some of these articles, but one in particular, Forbes’ “Top Four Big Data Trends”, caught our eye, and one section specifically.  It’s simple, it’s casual (possibly too much so), but it really resonates:

“Visualization allows people who are not analytics gurus to get insights out of complex stuff in a way they can absorb.”

At Ranzal, we believe the goal of Business Intelligence and Data Discovery is to democratize the discovery process and allow anyone who can read a chart to understand their data.  It’s not about the fancy query or the massive map/reduce; it’s what you do with it.


The Feature List: Oracle Endeca Information Discovery 3.1

As promised last week, we’ve been compiling a list of all the new features that were added as part of the Oracle Endeca Information Discovery (OEID) 3.1 release earlier this month.

If we’ve missed anything, please shoot us an email and we’ll update the post.

OEID Integrator v3.1


The gang at Javlin has implemented some major updates in the past 6 months, especially around big data.  The OEID Integrator releases, obviously, lag a bit behind the corresponding CloverETL releases, but there’s still a lot to get excited about from both a CloverETL and “pure OEID” standpoint:

  • Base CloverETL version upgraded from 3.3 to 3.4 – full details here
  • HadoopReader / HadoopWriter components
  • Job Control component for executing MapReduce jobs
  • Integrated Hive JDBC connection
  • Language Detection component!

The big takeaway here is the work that the Javlin team has done in terms of integrating their product more closely with the “Big Data” ecosystem.  Endeca has always been a great complementary fit with sophisticated data architectures based on Hadoop and these enhancements will only make it easier.

In keeping with our obsession with giving some time to the small wins that add big gains, I really like the quick win with the Language Detection component.  This is something that had been around “forever” in the old Endeca world of Forge and Dgidx but was rarely used or understood.  It is nice to see the return of this functionality, as it will play a huge role in multi-lingual/multi-national organizations, especially those with a lot of unstructured data.  Think about a European bank with a large presence in multiple countries trying to hear the “Voice of the Customer”: navigating, filtering and summarizing based on a customer’s native language gets so much easier.

OEID Web Acquisition Toolkit (aka KAPOW!)

Adventures in Installing – Oracle Endeca 3.1 Integrator

The newest version of Oracle Endeca Information Discovery (OEID 3.1), Oracle’s data discovery platform, was released yesterday morning.  We’ll have a lot to say about the release, the features, and what an upgrade looks like in the coming weeks (for the curious, Oracle’s official press release is here), but top of mind right now is: “How do I get this installed and up and running?”

After spending a few hours last night with the product, we wanted to share some thoughts on the install and ramp-up process and hopefully save some time for others who are looking to give the product a spin.  The first post concerns the installation of the ETL tool that comes with Oracle Endeca, OEID Integrator.

Installing OEID Integrator

When attempting to install Integrator, I hit a couple snags.  The first was related to the different form factor that the install has taken vs. previous releases.  In version 3.1, the install has moved away from an “Installshield-style” experience to more of a download, unzip and script approach.  After downloading the zip file and unpacking it, I ended up with a structure that looks like this:

[Screenshot: contents of the unpacked Integrator 3.1 installation directory]

Seeing an install.bat, I decided to click it and dive right in.  After the first couple of prompts, one large change became clear: the Eclipse container that hosts Integrator needs to be downloaded separately prior to installation (RTFM, I guess).

Not a huge deal, but what I found is that it is incredibly important to download a very specific version of Eclipse (Indigo, according to the documentation) in order for the installation to complete successfully.  For example:

  • I tried to use the latest version of Eclipse, Kepler.  This did not work.
  • I tried to use Eclipse IDE for J2EE Developers (Indigo).  This did not work.
  • I used Eclipse IDE for Java Developers (Indigo), and it worked like a charm:

[Screenshot: Eclipse IDE for Java Developers (Indigo)]

In addition, I would highly recommend running the install script (install.bat) from the command line, rather than through a double-click in Windows Explorer.  Running it via a double-click can make it difficult to diagnose any issues you may encounter since the window closes itself upon completion.  If the product is installed from the command line, a successful installation on Windows should look like this:

[Screenshot: a successful Integrator installation run from the command line]

Hopefully this saves some time for others looking to ramp up on the latest version of OEID.  We’ll continue to push out information as we roll the software out in our organization and upgrade our assets, so watch this space.

 

Data Discovery In Healthcare

A few days ago, QlikTech and Epic announced a technology partnership that will strengthen the integration between their software products as well as provide a forum for their joint customers to share best practices and innovative ways to use both technologies.

For a firm like Ranzal, which is currently implementing several population health discovery applications, my first reaction was simply that this partnership made sense.  Both companies are leaders in their respective domains and are very well regarded.  Beyond that, discovery technologies like Qlik, Tableau and Endeca are quickly establishing a foothold in the blossoming domain of healthcare analytics.  Unlike traditional BI technologies, data discovery tools are meant to quickly mash up disparate data sources and allow users to ask in-the-moment, unanticipated questions.  This alternative approach to analytics is allowing healthcare providers to build self-service discovery applications for broad audiences at speeds unimaginable in the world of the clinical data warehouse.  Since almost all healthcare analytics applications rely on data from the EMR, this partnership seemed natural, if not overdue.

My second reaction was that there was something missing.  In my experience, to get a holistic view of the health system, all of the relevant data must be tapped.  Data discovery on structured data, while powerful, can only tell part of the story.  With 60% of a health system’s data tied up in unstructured medical notes, reports and journals, Qlik is not fully equipped to give healthcare practitioners a 360-degree view of their health system.

Endeca shines when structured and unstructured data are both required to paint a complete picture.  In healthcare, properly analyzing clinical data can mean drastically better outcomes at lower costs.  Understanding the “why” behind the “what” means properly tapping the narratives in the medical notes, and tools like Endeca are best suited to unlock that value when unstructured data is prominent.

QlikView is a powerful tool, and one cannot question its ease of use and numerous discovery features.  However, in industries rife with unstructured data, products like Endeca that treat unstructured content as a first-class citizen (in the way they acquire, enrich, model, search, and visualize it) are better suited to unlock the whole story.

So, I couldn’t help but think that a strong partnership could also be made between other EMR vendors and Oracle Endeca.  We spend a lot of time sizing up the relevant technologies in the data discovery space, trying to understand the differentiators.  For the types of discovery we’re seeing in healthcare, where unstructured data is necessary to tell the whole story, our money remains on Endeca.