Return to Action

Late last year, we introduced some compelling capabilities that allowed users to collaborate with each other inside of Oracle Endeca Information Discovery (OEID) 3.0. Our collaborative discovery extension allowed users with the appropriate permissions to delete records, edit record attributes and add values to existing attributes, all from within OEID Studio.  It’s an incredibly powerful way to assist with data cleanup, data flagging or grouping related records together, with applicability to almost every data discovery scenario. We’re pleased to announce that we’ve re-launched this functionality and it is now available to licensed users of OEID 3.1.  The same great capabilities remain, but we’ve given the extension a bit of a facelift as part of the upgrade, as you can see below.

Deleting Misleading or Invalid Records

delete-pre-amble

Occasionally, incorrect or misleading data will find its way into a given application.  If it has no business being there or has an unwanted, adverse effect, let’s get rid of it! Hey, this tweet has nothing to do with the Olympics! Bye, bye spammer!

Augmenting Existing Records With More Attribution

In addition, users may find something interesting on a record (or set of records) and want to take action to augment the record’s attribution.  Below, two terrorist incidents (from one of our internal applications) have been identified as possibly having links to ISIS based on location. The data can be augmented by selecting the field that will hold this additional information (for simplicity’s sake, we added it to “Themes”)…

add-themes-1

…adding the additional value (or values):

Replacing Existing Attributes on Records

In the same vein as the first use case, users may find a record or set of records where they want to set a brand-new value for an attribute.  It could be changing a Status from Open to Closed or from Valid to Invalid, or correcting an error introduced during ingest, such as a poorly calculated sentiment score. After each change, we update the index upon selection of “Apply Changes”.  If you look below, you can see the result reflected in the application immediately:

Now, there’s one final piece that hasn’t been mentioned, and it completely closes the loop.  Since an Oracle Endeca discovery application is typically not the “system of record”, there’s the possibility that an update sourced from an upstream system could override changes made by users. We’ve accounted for that as well by persisting all changes to a dedicated database store that can be integrated into the update process.  For example, if I’ve deleted a record from my application, I can use the “log of deletes” in the database as a data source in any ETL processes that happen subsequently.  Simply filter the incoming data stream using the data stored in the database and you’re good to go.  Attribute replacements and additions work the same way and are tracked and logged appropriately.
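
To make that concrete, here is a minimal sketch, in Python with hypothetical table and field names, of what filtering an incoming feed against the persisted delete log might look like:

```python
# Sketch: honoring the "log of deletes" during a subsequent data load.
# The table (deleted_records) and key field (record_key) are hypothetical;
# any database the ETL process can reach would work the same way.
import sqlite3  # stand-in for the dedicated change-log store

def load_deleted_keys(conn):
    """Collect the keys of records that users deleted in Studio."""
    rows = conn.execute("SELECT record_key FROM deleted_records")
    return {row[0] for row in rows}

def filter_incoming(records, deleted_keys):
    """Drop upstream records that users have already deleted."""
    return [rec for rec in records if rec["record_key"] not in deleted_keys]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deleted_records (record_key TEXT)")
conn.execute("INSERT INTO deleted_records VALUES ('tweet-123')")

incoming = [{"record_key": "tweet-123", "text": "Buy followers now!"},
            {"record_key": "tweet-456", "text": "What a relay final!"}]
print(filter_incoming(incoming, load_deleted_keys(conn)))  # tweet-456 survives
```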

If you’re interested in pricing or capabilities, or just want to give feedback, drop us a line at product [at] ranzal.com.  It was delivered to a customer in Spain last month, and we’re looking forward to seeing more and more people in the community get their hands on it.

Tag 100 Times Faster — Introducing Branchbird’s Fast Text Tagger

BBFTTClip
Text Tagging is the process of using a list of keywords to search and annotate unstructured data. This capability is frequently required by Ranzal customers, most notably in the healthcare industry.

Oracle’s Endeca Data Integrator provides three different ways to text tag your data “out of the box”:

  • The first is the “Text Tagger – Whitelist” component which is fed a list of keywords and searches your text for exact matches.
  • The second is the “Text Tagger – Regex” component which works similarly but allows for the use of regular expressions to expand the fuzzy matching capabilities when searching the text.
  • The third is using Endeca’s “Text Enrichment” component (OEM’ed from Lexalytics) and supplying a model (keyword list) that takes advantage of the component’s model-based entity extraction.

Ranzal began working on a custom text tagging component due to challenges with the aforementioned components at scale. All of the above text taggers are built to handle tagging with relatively small inputs — both the size of the supplied dictionary and the number (and size) of documents.

| | 1,000 EMRs | 10,000 EMRs | 100,000 EMRs | 1,000,000 EMRs |
| --- | --- | --- | --- | --- |
| Fast Text Tagger (FTT) | 250 docs/sec | 1,428 docs/sec | 4,347 docs/sec | 6,172 docs/sec |
| Text Enrichment (TE) | 6.5 docs/sec | 5 docs/sec | N/A | N/A |
| TE (4 threads) | 17.5 docs/sec | 15 docs/sec | 15 docs/sec | N/A |

In one of our most common use cases, customers analyzing electronic medical records with Endeca need to enrich large amounts of free text (typically physician notes) using a medical ontology such as SNOMED-CT or MeSH. Each of these ontologies has a large number of medical “concepts” and their associated synonyms; for example, the US version of SNOMED-CT contains nearly 150,000 concepts. Unfortunately, the “out of the box” text tagger components do not perform well beyond a couple hundred keywords. To realize somewhat better throughput during tagging, Endeca developers have traditionally leveraged the third component listed above, the Lexalytics-based “Text Enrichment” component, which offers better performance than the other two options.

However, after extensive use of the “Text Enrichment” component, it became clear that not only was the performance still unacceptable at high scale, but the component’s recall was also inadequate, especially with Electronic Medical Records (EMRs). The Text Enrichment component is NLP-based and relies on accurately parsing sentence structure and word boundaries to tokenize the document before entity extraction begins. EMRs typically have very challenging sentence structure, due both to the ad hoc writing style of clinicians at the point of entry and to the observational metrics embedded in the record. Because of this, Text Enrichment of even small documents at high scale can be prohibitive for simple text tagging. A recent customer of ours, using very high-end enterprise hardware, was seeing 24-hour processing times when tagging approximately six million EMRs with SNOMED-CT concepts via Text Enrichment.

To address both the performance and recall issues, Ranzal set out to build a simple text tagger component for Integrator that would be easy to set up and use. The Ranzal “Fast Text Tagger” was built using a high-performance dictionary matching algorithm that ingests the list of terms (and phrases) into a finite-state pattern matching machine, which is then used to process the documents. One of the largest benefits of this family of search algorithms is that the document text only needs to be parsed once to find all possible matches within the supplied dictionary.
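
The post doesn’t name the algorithm, but Aho-Corasick is the classic finite-state dictionary matcher that fits this description (the shipping component targets Integrator, so the real implementation is presumably Java). A minimal Python sketch of the idea:

```python
# Sketch of finite-state dictionary tagging (Aho-Corasick): build the
# automaton once from the term list, then scan each document in a single
# pass to find every dictionary term it contains.
from collections import deque

def build_automaton(terms):
    goto = [{}]       # state -> {char: next state}
    output = [set()]  # state -> terms ending at this state
    for term in terms:
        state = 0
        for ch in term:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(term)

    # BFS to compute failure links (longest proper suffix that is also
    # a prefix of some term); depth-1 states keep fail = 0.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] |= output[fail[nxt]]  # inherit shorter matches
    return goto, fail, output

def tag(text, goto, fail, output):
    """Single pass over `text`; returns (start_position, term) pairs."""
    matches, state = [], 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for term in output[state]:
            matches.append((i - len(term) + 1, term))
    return matches

dictionary = ["myocardial infarction", "infarction", "aspirin"]
automaton = build_automaton(dictionary)
note = "pt presented with myocardial infarction; administered aspirin"
print(tag(note, *automaton))
```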

The Ranzal Fast Text Tagger is intended to replace the stock “Text Tagger – Whitelist” component, as well as the use of the “Text Enrichment” component for whitelisting. Our text tagger is intended for straight text matching, with optional restrictions to allow matching only on word boundaries. If your use cases require fuzzier matching, you should continue to use the “Text Tagger – Regex” at low scale and “Text Enrichment” at higher scales.

Performance Observations

Digging into the metrics shown above (duplicated here), you can see the remarkable performance of the Ranzal Fast Text Tagger (BB FTT) compared to “Text Enrichment”, even when Text Enrichment is configured to consume 4 threads. Furthermore, the rate of the BB FTT tends to increase with the number of documents, leveling off near 1 million documents, whereas Text Enrichment’s rate stays relatively constant.

| | 1,000 EMRs | 10,000 EMRs | 100,000 EMRs | 1,000,000 EMRs |
| --- | --- | --- | --- | --- |
| BB FTT (1 thread) | 250 docs/sec | 1,428 docs/sec | 4,347 docs/sec | 6,172 docs/sec |
| TE (1 thread) | 6.5 docs/sec | 5 docs/sec | N/A | N/A |
| TE (4 threads) | 17.5 docs/sec | 15 docs/sec | 15 docs/sec | N/A |

As a final performance note: for the previously mentioned customer whose graph run took 24 hours just for text tagging, the same process was run on this test harness with the same data and finished in just shy of 20 minutes (roughly six million EMRs in 20 minutes works out to about 5,000 documents per second).  It took longer to read the data from disk than it took to stream it all through the Fast Text Tagger.  This implies that, in typical use cases, the Fast Text Tagger will not be the limiting component in your graph.  For those of you curious about the benchmarking methods used, please continue below.

Test Runs

We built a graph that could execute the different test configurations sequentially and then compile the results. Shown below are four separate test runs, each with a screen capture of Integrator at test completion. Below each screen cap is a list of metrics (a small sketch of how these might be computed follows the list):

  • Match AVG: The average number of concepts extracted per document over the corpus
  • Total Match: The total number of concepts extracted over the corpus
  • Misses: The number of non-empty EMRs where no concept was found
  • Exec Time: The total execution time of the test configuration
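
For clarity, here is a small sketch (with hypothetical data shapes) of how these metrics are defined, where `extracted` maps each non-empty EMR to the list of concepts a tagger pulled from it:

```python
# Sketch: compiling the benchmark metrics from a tagger's output.
# `extracted` maps each non-empty EMR id to its extracted concepts;
# `elapsed_secs` is the wall-clock time of the run.
def compile_metrics(extracted, elapsed_secs):
    total = sum(len(concepts) for concepts in extracted.values())
    return {
        "Match AVG": round(total / len(extracted)),             # per document
        "Total Match": total,
        "Misses": sum(1 for c in extracted.values() if not c),  # nothing found
        "Exec Time": f"{elapsed_secs:.0f} secs",
    }

print(compile_metrics({"emr-1": ["infarction", "aspirin"], "emr-2": []}, 4.0))
```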

Note that Text Enrichment’s poor recall negatively impacts its precision (Match AVG). If you remove the (significant number of) misses, TE has precision nearly as high as our Fast Text Tagger.

Test 1: 1,000 EMRs

1000EMRs

| Test ID | Match AVG | Total Match | Misses | Exec Time |
| --- | --- | --- | --- | --- |
| BB FTT (1 thread) | 9 | 9,126 | 14 | 4 secs |
| TE (1 thread) | 4 | 4,876 | 260 | 153 secs |
| TE (4 threads) | 4 | 4,876 | 260 | 57 secs |

Test 2: 10,000 EMRs

10000EMRs

| Test ID | Match AVG | Total Match | Misses | Exec Time |
| --- | --- | --- | --- | --- |
| BB FTT (1 thread) | 12 | 127,617 | 14 | 7 secs |
| TE (1 thread) | 5 | 55,567 | 3,739 | 2,010 secs |
| TE (4 threads) | 5 | 55,567 | 3,739 | 675 secs |

Test 3: 100,000 EMRs

100000EMRs

| Test ID | Match AVG | Total Match | Misses | Exec Time |
| --- | --- | --- | --- | --- |
| BB FTT (1 thread) | 13 | 1,380,258 | 17 | 23 secs |
| TE (4 threads) | 5 | 546,598 | 38,466 | 6,555 secs |

Test 4: 1,000,000 EMRs

1000000EMRs

| Test ID | Match AVG | Total Match | Misses | Exec Time |
| --- | --- | --- | --- | --- |
| BB FTT (1 thread) | 14 | 14,834,247 | 17 | 162 secs |

Benchmarking Notes

Tests were conducted on OEID 3.1 using the US SNOMED-CT concept dictionary (148,000 concepts) against authentic Electronic Medical Records. Physical hardware: a PC with a 4-core i7 (hyperthreading enabled), 32 GB RAM and SSD storage.

The “Text Tagger – Whitelist” was discarded as unusable for this test setup. “Text Enrichment” with 1 thread was discarded after the 10,000 document run and TE with 4 threads was discarded after the 100,000 document run.

Advanced Visualizations on Oracle Endeca 3.1

Ranzal is pleased to announce that our Advanced Visualization Framework is now generally available.  Spend more time discovering, less time coding.

The calendar has turned, the air is frozen and, with a new year, comes the annual deluge of “predictions and trends” articles for 2014.  Spoiler Alert: Hadoop isn’t going away and data, and what you do with it, is everything right now.

Maybe you’ve seen some of these articles, but one in particular, Forbes’ “Top Four Big Data Trends”, and more specifically one section of it, caught our eye.  It’s simple, it’s casual (possibly too much so), but it really resonates:

“Visualization allows people who are not analytics gurus to get insights out of complex stuff in a way they can absorb.”

At Ranzal, we believe the goal of Business Intelligence and Data Discovery is to democratize the discovery process and allow anyone who can read a chart to understand their data.  It’s not about the fancy query or the massive map/reduce, it’s what you do with it.


Leveraging Your Organization’s OBI Investment for Data Discovery

Coupling disparate data sets into meaningful “mashups” is a powerful way to test new hypotheses and ask new questions of your organization’s data.  However, more often than not, the most valuable data in your organization has already been transformed and warehoused by IT in order to support the analytics needed to run the business.  Tools that neglect these IT-managed silos don’t allow your organization to tell the most accurate story possible when pursuing its discovery initiatives.  Data discovery should not focus only on the new varieties of data that exist outside your data warehouse.  The value of social media data and machine-generated data cannot be fully realized until it can be paired with the transactional data your organization already stockpiles.

Judging by the heavy investment in a new “self-service” theme in the recently released version 3.1 of Endeca Information Discovery, this truth has not been lost on Oracle.

Companies that are eager to get into the data discovery game, yet are afraid to walk away from the time and effort they’ve poured into their OBI solution, can breathe a little easier.  Oracle has made the proper strides in the Endeca product to incorporate OBI into the discovery experience.

And unlike other discovery products on the market today, the access to these IT-managed repositories (like OBI) is centrally managed.  By controlling access to the data and keeping all data “on the platform”, this centralized management allows IT to avoid the common “spreadmart” problem that plagues other discovery products.

Rather than explain how OBI has been introduced into the discovery experience, I figured I would show you.  Check out this short 4-minute demonstration, which illustrates how your organization can build its own data “mashups” leveraging the valuable data tied up in OBI.

Chances are that a handful of these tested hypotheses will unlock new ways to measure your business.  These new data mashups will warrant permanent applications that are made available to larger audiences within your organization.  The need for more permanent applications will require IT to “operationalize” your discovery application — introducing data updates, security, and properly sized hardware to support the application.

For these IT-provisioned applications, Oracle has also provided some tooling in Endeca to make the job more straightforward.  Specifically, when it comes to OBI, the product now boasts a wizard that will produce an Integrator project with all of the plumbing necessary to pull data tied up in OBI into a discovery application in minutes.  Check out this video to see how:

It is product investments like these that will allow organizations to realize the transformative effects data discovery can have on their business without having to ignore the substantial BI investments already in place.

As always, please direct any questions or comments to [at] ranzal.com.

The Feature List: Oracle Endeca Information Discovery 3.1

As promised last week, we’ve been compiling a list of all the new features that were added as part of the Oracle Endeca Information Discovery (OEID) 3.1 release earlier this month.

If we’ve missed anything, please shoot us an email and we’ll update the post.

OEID Integrator v3.1

hadoop-cloveretl

The gang at Javlin has implemented some major updates in the past 6 months, especially around big data.  The OEID Integrator releases, naturally, lag a bit behind their corresponding CloverETL releases, but there’s still a lot to get excited about from both a CloverETL and a “pure OEID” standpoint:

  • Base CloverETL version upgraded from 3.3 to 3.4 – full details here
  • HadoopReader / HadoopWriter components
  • Job Control component for executing MapReduce jobs
  • Integrated Hive JDBC connection
  • Language Detection component!

The big takeaway here is the work that the Javlin team has done in terms of integrating their product more closely with the “Big Data” ecosystem.  Endeca has always been a great complementary fit with sophisticated data architectures based on Hadoop and these enhancements will only make it easier.

Keeping with our habit of giving some time to the small wins that add big gains, I really like the Language Detection component.  This is something that had been around “forever” in the old Endeca world of Forge and Dgidx but was rarely used or understood.  It is nice to see the return of this functionality, as it will play a huge role in multi-lingual, multi-national organizations, especially those with a lot of unstructured data.  Think about a European bank with a large presence in multiple countries trying to hear the “Voice of the Customer”.  Having the ability to navigate, filter and summarize based on a customer’s native language gets so much easier.
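
Outside of Integrator, the same idea is easy to prototype. Here is a rough sketch using the open-source langdetect Python package (this is not the CloverETL component itself, and the record fields are hypothetical):

```python
# Sketch: attaching a detected-language attribute to free-text records,
# analogous to what the Language Detection component does in a graph.
# Requires `pip install langdetect`.
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make detection deterministic across runs

records = [
    {"id": 1, "feedback": "Der Kundenservice war ausgezeichnet."},
    {"id": 2, "feedback": "Le service était beaucoup trop lent."},
    {"id": 3, "feedback": "Great experience, would bank here again."},
]

for rec in records:
    rec["Language"] = detect(rec["feedback"])  # e.g. "de", "fr", "en"

print([(rec["id"], rec["Language"]) for rec in records])
```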

OEID Web Acquisition Toolkit (aka KAPOW!)

Adventures in Installing – Oracle Endeca 3.1 Integrator

The newest version of Oracle Endeca Information Discovery (OEID v3.1), Oracle’s data discovery platform, was released yesterday morning.  We’ll have a lot to say about the release, the features, and what an upgrade looks like in the coming weeks (for the curious, Oracle’s official press release is here), but top of mind right now is: “How do I get this installed and up and running?”

After spending a few hours last night with the product, we wanted to share some thoughts on the install and ramp-up process and hopefully save some time for others who are looking to give the product a spin.  The first post concerns the installation of the ETL tool that comes with Oracle Endeca, OEID Integrator.

Installing OEID Integrator

When attempting to install Integrator, I hit a couple of snags.  The first was related to the different form factor the install has taken vs. previous releases.  In version 3.1, the install has moved away from an “InstallShield-style” experience to more of a download, unzip and script approach.  After downloading the zip file and unpacking it, I ended up with a structure that looks like this:

installing-integrator-3.1

Seeing an install.bat, I decided to click it and dive right in.  After the first couple of prompts, one large change becomes clear: the Eclipse container that hosts Integrator needs to be downloaded separately prior to installation (RTFM, I guess).

Not a huge deal, but what I found is that it is incredibly important to download a very specific version of Eclipse (Indigo, according to the documentation) in order for the installation to complete successfully.  For example:

  • I tried to use the latest version of Eclipse, Kepler.  This did not work.
  • I tried to use Eclipse IDE for J2EE Developers (Indigo).  This did not work.
  • I used Eclipse IDE for Java Developers (Indigo) and it worked like a charm:

eclipse-indigo

In addition, I would highly recommend running the install script (install.bat) from the command line rather than double-clicking it in Windows Explorer.  Running it via a double-click can make it difficult to diagnose any issues you may encounter, since the window closes itself upon completion.  If the product is installed from the command line, a successful installation on Windows should look like this:

integrator-success

Hopefully this saves some time for others looking to ramp up on the latest version of OEID.  We’ll be continuing to push out information as we roll the software out in our organization and upgrade our assets so watch this space.

The Only Oracle Endeca Information Discovery Specialized Partner

One of the “big things” that we alluded to in our previous post has finally come through and we could not be more excited.

Ranzal is the first, and the only, company to achieve Specialized status for Oracle’s Endeca Information Discovery (OEID) platform.

Given our background with the product, we knew about a year ago that this is something we wanted to aggressively pursue.  And now, with a big assist from our customer references and partners, we are proud to say we’ve been approved, just in time for Oracle Open World next week.

For those who may not know much about it, the Specialization program is designed to spotlight the Oracle partners who are the recommended, go-to partners for a given product line.  In order to achieve this level, companies need to demonstrate that they have:

  • The ability to deliver successful projects and solutions on the OEID platform
  • The ability to sell and evangelize the platform’s capabilities to the market
  • An established customer base that is willing to provide references

With the rigorous success criteria tied to the above objectives (e.g., selling multiple OEID licenses and providing multiple customer references), it took us a little longer than we would have wanted, but we couldn’t be happier to have this under our belt.

Again, a big thanks to our customers and our Oracle partner representatives.  Drinks on us next week in San Francisco!

Killing the Tag Cloud – Introducing the Concept Cloud

For the last decade, since its first appearance on Flickr, the tag cloud has been ubiquitous as the default construct for visualizing concepts and topics found within text. There’s a beauty in its simplicity: the larger the word on your screen, the more frequently it appears in the data you are analyzing.

Now, nothing against the tag cloud, but we’ve been thinking it’s gone a little stale lately. The visuals can be a bit muddled and, while it’s great for identifying single concepts, it does nothing to inform the user of relationships between the identified terms or of any sentiment that may be attached to a given concept. The sum total of the visual is differently sized fonts, and good luck to you if you’re trying to draw any conclusions from your data.

Enter the Concept Cloud

Straight out of the lab at Ranzal’s HQ, we feel this is a huge stride forward in visualizing unstructured concepts and relationships. You’ll notice in the screen capture above that we’ve highlighted six terms from a series of physician’s notes related to a patient’s heart issue.

The first thing you’ll notice above is that you still have the visual cue of “size equals frequency”: the most frequently referenced anatomical terms in the data are represented by the largest circles. We then take this concept and go deeper. Wave the mouse over one of the circles and you’ll see that we surface the exact number of references for the associated term.

node-wave-over

In addition, when you plug this visualization into a Data Discovery environment like Oracle Endeca Information Discovery, you get the ability to drill down and further investigate your data and have the cloud react accordingly.  Below, we have another data set featuring key persons and concepts from everyone’s favorite topic, American Politics.  You have the ability to click a circle and narrow your data to records that contain the term you’ve selected.

sentiment

The Concept Cloud is the Tag Cloud “all grown up”. The traditional visual cues of size and frequency remain and, through shaping and shading, they are enhanced with sentiment analysis and a nicer visual experience. The final advance that the Concept Cloud provides, and the problem that drove us to create this solution, is in the connections between the circles. It’s great that key concepts present in the data are identified. However, what about the relationships?

edge-wave-over

This is where we think this visualization separates itself from the pack. Using some “relational/set magic”, the number of times these terms are found in proximity to each other is calculated and used to inform the user visually. When you wave over a line, linked terms are brought to the fore and unrelated terms are faded.  And, just as larger circles can be highlighted to show the exact frequency of terms, the edges or connections provide the same level of precision and can show the exact number of links in common.
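
The post doesn’t spell out the “relational/set magic”, but one plausible approach is simple co-occurrence counting: tally how often each pair of tagged terms appears in the same record (or within some token window). A sketch:

```python
# Sketch: term frequencies drive circle size; pair co-occurrence counts
# drive edge thickness and layout proximity. This counts per record; a
# token-window variant would only change how `records` is built.
from collections import Counter
from itertools import combinations

records = [  # the set of tagged terms found on each record
    {"aorta", "stent", "aspirin"},
    {"aorta", "stent"},
    {"aspirin", "warfarin"},
]

term_counts = Counter(term for rec in records for term in rec)
edge_counts = Counter(
    frozenset(pair)
    for rec in records
    for pair in combinations(sorted(rec), 2)
)

print(term_counts)  # aorta: 2, stent: 2, aspirin: 2, warfarin: 1
print(edge_counts)  # {aorta, stent}: 2 -> the strongest edge
```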

If you look back at the original graphic, it should be apparent that the terms are actually laid out according to how closely they are linked.  Terms that are found in common with one another are arranged more closely than those with a looser association. Terms that are totally unrelated will still be shown on the page, but totally disconnected from their cohort.

Also, because this visualization is based on platform-neutral technology (including D3 and SVG) rather than Flash, it looks great on mobile, supports zooming in and out, and scales beautifully.

One other thing to note on the last diagram is the color coding of different concepts. When pairing this technology with sentiment analysis engines, such as Lexalytics, we can appropriately shade the representative circles for terms and concepts to indicate whether they are being referred to in a positive or negative fashion.

We’re starting to integrate this capability on our current engagements, so please contact us if you’d like to learn more, or even if you’d just like to see what this would look like on top of your own data.  We always welcome feedback and suggestions: comment below, tweet at us, or send us a message at info [at] ranzal.com.

Quick Hits: Temporal Analysis in Endeca

I try to keep a pulse on the OTN forum for OEID, or Oracle Endeca Information Discovery.  Of late, a lot of questions have come up around how to handle temporal analysis with Endeca.  Specifically: when producing visualizations by time (e.g. by month), how do I ensure that I have a “bucket” for every month, even if my underlying data does not tie back to every month?  A common pain point in the product, to be sure.

To illustrate, say I have “sales” records, like so:

When I load these into my Endeca server and attempt to produce a visualization that totals my sales by month, I wind up with:

RETURN foo AS SELECT SUM(SalesAmt) AS "TotSales" GROUP BY Month

Almost immediately, it jumps out at me that there is no bucket for “5-May”.  Upon investigation, this is *accurate*, as I had no sales in May, but it is far from the visualization I require to properly convey that fact.

The best practice here is to introduce a secondary “record type” that I usually call “Calendar”.  Each record in this record type is a different day, and I include all of the derived attribution I may want for the various temporal analyses I’d like to perform in my application.  Thus, my new “Calendar” record type might look like:

and so on…
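
Generating such a record type takes only a few lines. Here is a rough sketch (the attribute names are hypothetical, though “Month” mirrors the “5-May” bucket format from the chart above):

```python
# Sketch: one "Calendar" record per day, carrying the derived attributes
# that the application's temporal visualizations group by.
from datetime import date, timedelta

def calendar_records(start, end):
    day = start
    while day <= end:
        yield {
            "RecordType": "Calendar",
            "Date": day.isoformat(),
            "Month": f"{day.month}-{day.strftime('%b')}",  # e.g. "5-May"
            "Quarter": f"Q{(day.month - 1) // 3 + 1}",
            "Year": day.year,
        }
        day += timedelta(days=1)

for rec in calendar_records(date(2014, 1, 1), date(2014, 1, 3)):
    print(rec)
```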

Now when I issue the same EQL statement that powers my chart, all temporal “buckets” are covered by my calendar records.  The calendar records ensure that my GROUP BY is offered any and all buckets, even if there are no sales to total in a particular bucket.  After loading this second record type, I refresh my chart and voilà:

chartsalesMay

I now have a bucket for “May” and my visualization properly conveys that sales tanked in May and someone needs to lose their job.

Hope this helps.