Big Data Discovery – Custom Java Transformations Part 2

In a previous post, we walked through how to implement a custom Java transformation in Oracle Big Data Discovery.  While that post was more technical in nature, this follow-up post will highlight a detailed use case for the transformations and illustrate how they can be used to augment an existing dataset.

Example: 2015 Chicago Mayoral Election Runoff

For this example we will be using data from the 2015 Mayoral Election Runoff in Chicago.  Incumbent Rahm Emanuel defeated challenger “Chuy” Garcia with 55.7% of the popular vote.  Results data from the election were compiled and matched up with Chicago communities, which were then subdivided by zip code.  A small sample of the data can be seen below:

Sample election data

In its original state, the data already offers some insight into the results of the election, but only at a high level.  By utilizing the custom transformations, it is possible to bring in additional data and find answers to more detailed questions.  For example, what impact did the wealth of Chicago’s communities have on their selection of a candidate?

One indicator of wealth within a community is the median sale price of homes in that area.  Naturally, the higher the price of homes, the wealthier the community tends to be.  Zillow provides an API that allows users to query for a variety of real estate and demographic information, including median sale price of homes.  Through the custom transformations, we can augment the existing election results with the data available through the API.

The structure of the custom transformation is exactly the same as the ‘Hello World’ example from our previous post.  The transformation is initiated in the BDD Custom Transform Editor with the command runExternalPlugin('ZillowAPI.groovy',zip). In this case, the custom Groovy script is called ZillowAPI.groovy and the field being passed to the script is the zip code, zip.

The script then uses the zip to construct a string and convert it to the URL required to make the API call:

def zip = args[0]
String url = "http://www.zillow.com/webservice/GetDemographics.htm?zws-id=<ZILLOW_API_KEY>&zip=" + zip;
URL api_url = new URL(url);
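
For reference, here is a minimal sketch of what the complete ZillowAPI.groovy might look like.  The XML parsing below is illustrative only: the element names and nesting are assumptions that would need to be matched to the actual GetDemographics response schema, and the API key placeholder must be replaced with a real key.

def pluginExec(Object[] args) {
    def zip = args[0]
    String url = "http://www.zillow.com/webservice/GetDemographics.htm?zws-id=<ZILLOW_API_KEY>&zip=" + zip
    URL api_url = new URL(url)

    try {
        // Fetch and parse the XML returned by the API call
        def response = new XmlSlurper().parse(api_url.openStream())

        // Locate the "Median Sale Price" attribute in the response.
        // The exact path is an assumption; adjust it to the real schema.
        def attr = response.depthFirst().find {
            it.name() == 'attribute' && it.'name'.text() == 'Median Sale Price'
        }

        // The value returned here is written to the median_sale_price field
        return attr ? attr.'values'.'zip'.'value'.text() : null
    } catch (Exception e) {
        // Return null so one failed lookup doesn't abort the whole transform
        return null
    }
}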

Once the transform script completes, the median_sale_price field is now accessible in BDD:

Updated data in BDD

Now that the additional data is available, we can use it to build some visualizations to help answer the question posed earlier.

Median Sale Price by Chicago Community – Created using the Ranzal Data Visualization Portlet*

Percentage for Chuy by Community – Created using the Ranzal Data Visualization Portlet*

The two choropleths above show the median sale price by community and the percentage of votes for “Chuy” by community.  Communities in the northeastern sections of the city tend to have the highest median sale prices, while communities in the western and southern sections tend to have lower prices.  For median sale price to be a strong indicator of how the communities voted, the map displaying votes for “Chuy” should show a similar pattern, with the communities grouped by northeast and southwest.  However, the pattern is noticeably different, with votes for “Chuy” distributed across all sections of the map.

Bar-Line chart of Median Sale Price and Percent for Chuy

Looking at the median sale price in conjunction with the percentage of votes for “Chuy” provides an even clearer picture.  The bars in the chart above represent the median sale price of homes, and are sorted in descending order from left to right.  The line graph represents the percentage of votes for “Chuy” in each community.  If there were a connection between median sale price and the percentage of votes for “Chuy”, we’d expect the line graph to trend steadily upward or downward as sale price decreases.  However, the percentage of votes varies widely from community to community, and doesn’t seem to follow an obvious pattern in relation to median sale price.  This corresponds with the observations from the two choropleths above.

While these findings don’t provide a definitive answer to the initial question as to whether community wealth was a factor in the election results, they do suggest that median sale price is not a good indicator of how Chicago communities voted in the election.  More importantly, this example illustrates how easy it is to utilize custom Java transformations in BDD to answer detailed questions and get more out of your original dataset.

If you would like to learn more about Oracle Big Data Discovery and how it can help your organization, please contact us at info [at] ranzal.com or share your questions and comments with us below.


* – The Ranzal Data Visualization Portlet is a custom portlet developed by Ranzal and is not available out of the box in BDD.  If you would like more information on the portlet and its capabilities, please contact us and stay tuned for a future blog post that will cover the portlet in more detail.

Big Data Discovery – Custom Java Transformations Part 1

In our first post introducing Oracle Big Data Discovery, we highlighted the data transform capabilities of BDD.  The transform editor provides a variety of built-in functions for transforming datasets.  While these built-in functions are straightforward to use and don’t require any additional configuration, they are also limited to a predefined set of transformations.  Fortunately, for those looking for additional functionality during transform, it is possible to introduce custom transformations that can leverage external Java libraries by implementing a custom Groovy script.  The rest of this post will walk through the implementation of a basic example, and a subsequent post will go in depth with a few real-world use cases.

Create a Groovy script

The core component needed to implement a custom transform with external libraries is a Groovy script that defines the pluginExec() method.  Groovy is a programming language developed for the Java platform.  Details and documentation on the language can be found here.  For this basic example, we’ll begin by creating a file called CustomTransform.groovy and define a method, pluginExec(), which should take an object array, args, as an argument:

def pluginExec(Object[] args) {
    String input = args[0] //args[0] is the input field from the BDD Transform Editor 

    //Implement code to transform input in some way
    //The return of this method will be inserted into the transform field

    input.toUpperCase() //This example would return an upper cased version of input
}

pluginExec() will be applied to each record in the BDD dataset, and args[0] corresponds to the field to be transformed.  In the example script above, args[0] is assigned to the variable input and the toUpperCase() method is called on it.  This means that if this custom transformation is applied to a field called name, the value of name for each record will be returned upper cased (For example, “johnathon” => “JOHNATHON”).

Import Custom Java Library

Now that we’ve covered the basics of how the custom Groovy script works, we can augment the script with external Java libraries.  These libraries can be imported and implemented just as they would be in standard Java:

import com.oracle.endeca.transform.HelloWorld
    
def pluginExec(Object[] args) {
    String input = args[0] //Note that though the input variable is defined in this example, it is not used.  Defining input is not required.
    
    HelloWorld hw = new HelloWorld() //Create a new instance of the HelloWorld class defined in the imported library
    hw.testMe() //Call the testMe() method, which returns a string "Hello World"
}

In the example above, the HelloWorld class is imported.  A new instance of HelloWorld is assigned to the variable hw, and the testMe() method is called.  testMe() is designed to simply return the string “Hello World”.  Therefore, the expected output of this custom script is that the string “Hello World” will be inserted for each record in the transformed BDD dataset.  Now that the script has been created, it needs to be packaged up and added to the Spark class path so that it’s accessible during data processing.

Package the Groovy script into a jar

In order to utilize CustomTransform.groovy, it needs to be packaged into a .jar file.  It is important that the Groovy script be located at the root of the jar, so make sure that the file is not nested within any directories.  See below for an example of the file structure:

CustomTransform.jar
  |---CustomTransform.groovy
  |---AdditionalFile_1
  |---AdditionalFile_2
  ...
  ...
  .etc

Note that additional files can be included in the jar as well.  These additional files can be referenced in CustomTransform.groovy if desired.  There are multiple ways to create the jar, but the simplest is to use the command line.  Navigate to the directory that contains CustomTransform.groovy and use the following command to package it up:

# jar cf <new_jar_name> <input_file_for_jar>
> jar cf CustomTransform.jar CustomTransform.groovy

Setup a custom lib location in Hadoop

CustomTransform.jar and any additional Java libraries that are imported by the Groovy script need to be added to all Spark nodes in your Hadoop cluster.  For simplicity, it is helpful to establish a standard location for all custom libraries that you want Spark to be able to access:

$ mkdir /opt/bdd/edp/lib/custom_lib

The /opt/bdd/edp/lib directory is the default location for the BDD data processing libraries used by Spark.  In this case, we’ve created a subdirectory, custom_lib, that will hold any additional libraries we want Spark to be able to use.

Once the directory has been created, use scp, WinSCP, MobaXterm, or some other utility to upload CustomTransform.jar and any additional libraries used by the Groovy script into the custom_lib directory.  The directory needs to be created on all Spark nodes, and the libraries need to be uploaded to all nodes as well.

Update sparkContext.properties on the BDD Server

The last step that needs to be completed before running the custom transformation is updating the sparkContext.properties file.  This step only needs to be completed the first time you create a custom transformation, provided the location of the custom_lib directory remains the same for each subsequent script.

Navigate to /localdisk/Oracle/Middleware/BDD<version>/dataprocessing/edp_cli/config on the BDD server and open the sparkContext.properties file for editing:

$ cd /localdisk/Oracle/Middleware/BDD1.0/dataprocessing/edp_cli/config
$ vim sparkContext.properties

The file should look something like this:

#########################################################
# Spark additional runtime properties, see
# https://spark.apache.org/docs/1.0.0/configuration.html
# for examples
#########################################################


Add an entry to the file to define the spark.executor.extraClassPath property.  The value for this property should be <path_to_custom_lib_directory>/*.  This will add everything in the custom_lib directory to the Spark class path.  The updated file should look like this:

#########################################################
# Spark additional runtime properties, see
# https://spark.apache.org/docs/1.0.0/configuration.html
# for examples
#########################################################

spark.executor.extraClassPath=/opt/bdd/edp/lib/custom_lib/*

It is important to note that if there is already an entry in sparkContext.properties for the spark.executor.extraClassPath property, any libraries referenced by the existing entry should be moved to the custom_lib directory so they are still included in the Spark class path.

Run the custom transform

Now that the script has been created and added to the Spark class path, everything is in place to run the custom transform in BDD.  To try it out, open the Transform tab in BDD and click on the Show Transformation Editor button.  In this example, we are going to create a new field called custom with the type String:

Create new attribute

Now in the editor window, we need to reference the custom script:

Transform Editor

The runExternalPlugin() method is used to reference the custom script.  The first argument is the name of the Groovy script.  Note that the value above is 'CustomTransform.groovy' and not 'CustomTransform.jar'.  The second argument is the field to be passed as an input to the script (this is what gets assigned to args[0] in pluginExec()).  In the case of the “Hello World” example, the input isn’t used, so it doesn’t matter what field is passed here.  However, in the first example that returned an upper cased version of the input field, the script above would return an upper cased version of the key field.
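
For this “Hello World” example, the full expression entered in the editor is simply the following (with key as the pass-through input field, as described above):

runExternalPlugin('CustomTransform.groovy', key)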

One of the nice features of the built-in transform functions is that they make it possible to preview the transform changes before committing.  With these custom scripts, however, it isn’t possible to see the results of the transform before running the commit.  Clicking preview will just return blank results for all fields, as seen in the example below:

Example of custom transform preview

The last thing to do is click ‘Add to Script’ and then ‘Commit to Project’ to kick off the transformation process.  Below are the results of the transform.  As expected, a new custom field has been added to the data set and the value “Hello World” has been inserted for every record.

Transform results

This tutorial just hints at the possibilities of utilizing custom transformations with Groovy and external Java libraries in BDD.  Stay tuned for the second post on this subject, when we will go into detail with some real world use cases.

If you would like to learn more about Oracle Big Data Discovery and how it can help your organization, please contact us at info [at] ranzal.com or share your questions and comments with us below.

Bringing Data Discovery to Hadoop – Part 3

In our last post, we talked about some of the tools in the Hadoop ecosystem that Oracle Big Data Discovery takes advantage of to do its work — namely Hive and Spark. In this post, we’re going to delve a little deeper into how BDD integrates with data that is already sitting in Hive, how it can write transformed data back to HDFS, and how it can help give users new insights on that data.

Data Processing

BDD ships with a data processing tool that makes imports from Hive easy. Simply point it at the database and table(s) you would like to pull into BDD and the application does the rest. Behind the scenes, the data processing utility launches Spark workers to read in the data from HDFS for the targeted table into new Avro files. BDD then indexes the data in these files for easy discovery.

Spark at work.

Another feature of BDD’s data processing is that it can be set to auto-detect new tables created in a Hive database, keeping BDD in sync with Hive. The BDD Hive Table Detector automatically launches a workflow to import a table whenever one is created. Currently, BDD doesn’t yet support updates to existing tables, but we hope to see that feature in a future release.

One thing to note: depending on the size of the table, BDD may import only a sampling of its data for discovery purposes. By default, the application’s record threshold is set to one million. This keeps any analysis of a particular collection as interactive as possible while maintaining a relatively dependable and accurate view. For most purposes, this default setting should be enough. However, the threshold can be increased if necessary. Ultimately, the amount of data sampling to use is a balance between an individual’s needs and the computing resources available to them.

Exporting Back to Hive

A unique component of BDD is its ability to throw data back to Hadoop once you have it in a state that you are satisfied with or would like to share with other users. We have some campaign funding data to work with as a test case:

The Chicago mayor’s race has been getting some attention due to an unexpected underdog forcing incumbent Rahm Emanuel to a runoff. As you can see, the challenger, Chuy Garcia, is wildly out-funded compared to Emanuel:

Campaign funding by candidate

Creating this application involved pulling campaign spending data for Illinois from electionmoney.org, importing it into BDD, and then joining a couple tables together and cleaning it all up using the transform tools contained within the application.

Now let’s say we wanted to export the results of this work — these joined, transformed data sets — for other users to query for themselves in Hive. We can do that with a simple, built-in export feature that can write our denormalized data set back to HDFS.

Exporting the data set from BDD

With a few quick clicks, BDD can create Avro-formatted files, write them to our Hadoop cluster, and then create the corresponding Hive table automatically:

The exported data as a new Hive table

This particular feature adds a lot of flexibility and opportunity for collaboration in teams where members span a wide range of skills. You can imagine users on the business side and technical side of a company throwing data sets back and forth between each other, sharing insights in a natural way that might have been much more difficult to accomplish in other environments.

That concludes our three-part look at Oracle Big Data Discovery. As we’ve said before, there is a lot to be excited about and we believe the application offers a viable data discovery solution to organizations running data in Hadoop, as well as those who are interested in creating first-time clusters.

For more information or guidance on how BDD could help your organization, contact us at info [at] ranzal.com.

Bringing Data Discovery To Hadoop – Part 2

The most exciting thing about Oracle Big Data Discovery is its integration with all the latest tools in the Hadoop ecosystem. This includes Spark, which is rapidly supplanting MapReduce as the processing paradigm of choice on distributed architectures. BDD also makes clever use of the tried and tested Hive as a metadata layer, meaning it has a stable foundation on which to build its complex data processing operations.

In our first post of this series, we showcased some of BDD’s most handy features, from its streamlined UI to its very flexible data transformation abilities. In this post, we’ll delve a little deeper into BDD’s underlying mechanics and explain why we think the application might be a great solution for Hadoop users.

Hive

Much of the backbone for BDD’s data processing operations lies in Hive, which effectively acts as a robust metastore for BDD. While operations on the data itself are not performed using Hive functions (which currently run on MapReduce), Hive is a great way to store and retrieve information about the data: where it lives, what it looks like, and how it’s formatted.

For organizations that are already running data in Hive, the integration with BDD couldn’t be simpler. The application ships with a data processing tool that can automatically import databases and tables from Hive, all while keeping data types intact. The tool can also sync up with a Hive database so that when new tables are created a user can automatically work with that data in BDD. If a table is dropped, BDD deletes that particular data set from its index. Currently, the 1.0 version doesn’t support updates to existing Hive tables, but we hope to see that feature in an upcoming release.

BDD can also upload data to HDFS and create a new table with that data in Hive to work with. It does this whenever a user uploads a file through the UI. For example, here’s what we saw in Hive with the consumer complaints data set from the last post after BDD imported it:

Example of an auto-generated Hive table by BDD

This easy integration with Hive makes BDD a good option for both experienced Hadoop users who are using Hive already, as well as less technical users.

Spark

While Hive provides a solid foundation for BDD’s operations, Spark is the workhorse. All data processing operations are run through Spark, which allows BDD to analyze and transform data in-memory. This approach effectively sidesteps the launching of slower MapReduce jobs through Hive and gives the processing engine direct access to the data.

When a user commits a series of transforms to a data set via the BDD UI, those transforms are interpreted into a Groovy script that is then passed to Spark through an Oozie job. Here, we can see how some date strings are converted to datetime objects behind the scenes:

Groovy transform script generated by BDD
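
The generated script itself depends on the transforms being applied, but as a rough, hand-written illustration (not BDD’s actual output), the core of a date-string conversion in Groovy might look something like this:

import java.text.SimpleDateFormat

// Illustrative only: parse a date string into a datetime (java.util.Date) value.
// The source pattern is an assumption and must match the incoming data.
def toDatetime(String raw) {
    def fmt = new SimpleDateFormat("MM/dd/yyyy")
    return raw ? fmt.parse(raw) : null
}

println toDatetime("02/24/2015")  // => a Date representing Feb 24, 2015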

After Spark has done its handiwork, the data is then written out to HDFS as a new set of files, serialized and compressed in Avro. The original collection stays intact in another location in case we want to go back to it in the future.

From this point, the data is then loaded into the Dgraph.

Dgraph

The Dgraph is basically an in-memory index, and is what enables the real-time, dynamic exploration of data in BDD. This concept might be familiar to those who have used Oracle Endeca Information Discovery, where the Dgraph also played a key role, and this lineage means BDD inherits some very nice features: quick response, keyword search, impromptu querying, and the ability to unify metrics, structured and unstructured data in a single interface. The biggest difference now is that users have the ability to apply these real-time search and analytic capabilities to data sitting on Hadoop.

We think the marriage of this kind of discovery application with Hadoop makes a lot of sense. For starters, Hadoop has enabled organizations to store vast amounts of data cheaply without necessarily knowing everything about its structure and contents. BDD, meanwhile, offers a solution to indexing exactly this kind of data — data that is irregular, inconsistent and varied.

There’s also the issue of access. Currently, most data tools in the Hadoop ecosystem require a moderate level of technical knowledge, meaning wide swaths of an organization might have little to no view of all that data on HDFS. BDD offers a system to connect more people to that data, in a way that’s straightforward and intuitive.

If you would like to learn more about Oracle Big Data Discovery and how it might help your organization, please contact us at info [at] ranzal.com.

Bringing Data Discovery To Hadoop – Part 1

We have been anticipating the intersection of big data with data discovery for quite some time. What exactly that will look like in the coming years is still up for debate, but we think Oracle’s new Big Data Discovery application provides a window into what true discovery on Hadoop might entail.

We’re excited about BDD because it wraps data analysis, transformation, and discovery tools together into a single user interface, all while leveraging the distributed computing horsepower of Hadoop.

BDD’s roots clearly extend from Oracle Endeca Information Discovery, and some of the best aspects of that application — ad-hoc analysis, fast response times, and instructive visualizations — have made it into this new product. But while BDD has inherited a few of OEID’s underpinnings, it’s also a complete overhaul in many ways. OEID users would be hard-pressed to find more than a handful of similarities between Endeca and this new offering. Hence, the completely new name.

The biggest difference of course, is that BDD is designed to run on the hottest data platform in use today: Hadoop. It is also cutting edge in that it utilizes the blazingly fast Apache Spark engine to perform all of its data processing. The result is a very flexible tool that allows users to easily upload new data into their Hadoop cluster or, conversely, pull existing data from their cluster onto BDD for exploration and discovery. It also includes a robust set of functions that allows users to test and perform transformations on their data on the fly in order to get it into the best possible working state.

In this post, we’ll explore a scenario where we take a basic spreadsheet and upload it to BDD for discovery. In another post, we’ll take a look at how BDD takes advantage of Hadoop’s distributed architecture and parallel processing power. Later on, we’ll see how BDD works with an existing data set in Hive.

We installed our instance of BDD on Cloudera’s latest distribution of Hadoop, CDH 5.3. From our perspective, this is a stable platform for BDD to operate on. Cloudera customers also should have a pretty easy time setting up BDD on their existing clusters.

Explore

Getting started with BDD is relatively simple. After uploading a new spreadsheet, BDD automatically writes the data to HDFS, then indexes and profiles the data based on some clever intuition:

Auto-generated overview of the uploaded data set

What you see above displays just a little bit of the magic that BDD has to offer. This data comes from the Consumer Financial Protection Bureau, and details four years’ worth of consumer complaints to financial services firms. We uploaded the CSV file to BDD in exactly the condition we received it from the bureau’s website. After specifying a few simple attributes like the quote character and whether the file contained headers, we pressed “Done” and the application got to work processing the file. BDD then built the charts and graphs displayed above automatically to give us a broad overview of what the spreadsheet contained.

As you can see, BDD does a good job presenting the data to us in broad strokes. Some of the findings we get right from the start are the names of the companies that have the most complaints and the kinds of products consumers are complaining about.

We can also explore any of these fields in more detail if we want to do so:

Detailed view of a date field

Now we get an even more detailed view of this date field, and can see how many unique values there are, or if there are any records that have data missing. It also gives us the range of dates in the data. This feature is incredibly helpful for data profiling, but we can go even deeper with refinements.

Refining by company and response type

With just a few clicks on a couple charts, we have now refined our view of the data to a specific company, JPMorgan Chase, and a type of response, “Closed with monetary relief”. Remember, we have yet to clean or manipulate the data ourselves, but already we’ve been able to dissect it in a way that would be difficult to do with a spreadsheet alone. Users of OEID and other discovery applications will probably see a lot of familiar actions here in the way we are drilling down into the records to get a unique view of the data, but users who are unfamiliar with these kinds of tools should find the interface to be easy and intuitive as well.

Transform

Another way BDD differentiates itself from some other discovery applications is with the actions available under the “Transform” tab.

Within this section of the application, users have a wealth of common transformation options available to them with just a few clicks. Operations like converting data types, concatenating fields, and getting absolute values now can be done on the fly, with a preview of the results available in near real-time.

BDD also offers more complex transformation functions in its Transformation Editor, which includes features like date parsing, geocoding, HTML formatting and sentiment analysis. All of these are built into the application; no plug-ins required. Another nice feature BDD provides is an easy way to group (or bin) attributes by value. For example, we can find all the car-related financing companies and group them into a single category to refine by later on:

Grouping car-related financing companies into a single category

BDD also makes it possible to preview the results of a transform before committing the changes to all the data. This allows a user to fine-tune their transforms with relative ease and minimal back and forth between data revisions.

Once we’re happy with our results, we can commit the transforms to the data, at which point BDD launches a Spark job behind the scenes to apply the changes. From this point, we can design a discovery interface that puts our enriched data set to work.

Discover

Included with BDD is a set of dynamic, advanced data visualizations that can turn any data set into something profoundly more intuitive and usable:

A sampling of BDD’s visualizations

The image above is just a sampling of the kind of visual tools BDD has to offer. These charts were built in a matter of minutes, and because much of the ETL process is baked into the application, it’s easy to go back and modify your data as needed while you design the graphical elements. This style of workflow is drastically different from workflows of the past, which required the back- and front-ends to be constructed in entirely separate stages, usually in totally different applications. This puts a lot of power into the hands of users across the business, whether they have technical chops or not.

And as we mentioned earlier, since BDD’s indexing framework is a close relative to Endeca, it inherits all the same real-time processing and unstructured search capabilities. In other words, digging into your data is simple and highly responsive:

Real-time search and refinement in BDD

As more and more companies and institutions begin to re-platform their data onto Hadoop, there will be a growing need to effectively explore all of that distributed data. We believe that Oracle’s Big Data Discovery offers a wide range of tools to meet that need, and could be a great discovery solution for organizations that are struggling to make sense of the vast stores of information they have sitting on Hadoop.

If you would like to learn more, please contact us at info [at] ranzal.com.

Also be sure to stay tuned for Part 2!

Announcing PowerDrill for Oracle EID 3.1

If you had to distill what we at Ranzal’s Big Data Practice do down to its essence, it’s to use technology to make accessing and managing your data more intuitive and more useful.  Often this takes the form of data modeling and integration, data visualization, or advice in picking the right technology for the problem at hand.

Sometimes, it’s a lot simpler than that.  Sometimes, it’s just giving users a shortcut or an easy way to do more with the tools they have.  Our latest offering, the PowerDrill for Oracle Endeca Information Discovery 3.1, is the quintessential example of this.

When dealing with large and diverse quantities of data, Oracle Endeca Studio is great for a lot of operations.  It enables open text search, it has data visualization, it enriches data, it surfaces all in-context attributes for slicing and dicing, and it helps you find answers both high-level, say “Sales by Region”, and low-level, like “My best/worst performing product”.  But what about the middle ground?

For example, on our demo site, we have an application that allows users to explore publicly available data related to Parks and Recreation facilities in Chicago.  I’m able to navigate through the data, filter by the types of facilities available (Pools, Basketball Courts, Mini Golf, etc.), see locations on a map, pretty basic exploration.

The Parks of Chicago

Now, let’s say I’m looking for parks that fit a certain set of criteria.  For example, let’s say I’m looking to organize a 3-on-3 basketball tournament somewhere in the city.  I can use my discovery application to very easily find parks that have at least 2 basketball courts.

Navigate By Courts


This leaves me with 80 parks that might be candidates for my tournament.  But let’s say I live in the suburbs and I’m not all that familiar with the different neighborhoods of Chicago.  Wouldn’t it be great to use other data sets to quickly and easily explore the areas surrounding these parks?  Enter the PowerDrill.

Fun with Shapefiles: The Two Utahs

A little midweek enjoyment, courtesy of our Advanced Visualization Framework.  Below, you can see a county-by-county map of Utah and all of its Oil and Gas Fields.

http://branchbird.com/utah.html

You can hover over counties and fields and get some basic statistics related to the county or field that you are inspecting.  For fields, we have the oil/gas field status, the year it was opened, and other basic information such as whether or not it was merged with another field.

We had to “dumb it down a bit” and put it into an iframe (WordPress!), but you can still see some of the detail.  It’s obviously not as flexible as our “real visualizations” (no zooming, no refining, etc.) that render inside of Oracle Endeca Studio, but it gives you a sense of how quickly and easily our technology incorporates advanced GIS data into a Data Discovery application.

Deploying Oracle Endeca Portlets in WebLogic

We’re long overdue for a “public service” post dedicated to sharing best practices around how Ranzal does certain things during one of our implementation cycles.  Past installments have covered installation pitfalls, temporal analysis and the Endeca Extensions for Oracle EBS.

In this post, we’re sharing our internal playbook (adapted from our internal Wiki) for deploying custom portlets (such as our Advanced Visualization Framework or our Smart Tagger) inside of an Oracle Endeca Studio instance on WebLogic.

The documentation is pretty light in this area so consider this our attempt to fill in the blanks for anyone looking to deploy their own portlets (or ours!) in a WebLogic environment.  More after the jump…

What You Can Do…

Last week, we announced general availability of our Advanced Visualization Framework (AVF) for Oracle Endeca Information Discovery.  We’ve received a lot of great feedback and we’re excited to see what our customers and partners can create and discover in a matter of days. Because the AVF is a framework, we’ve already gotten some questions and wanted to address some uncertainty around “what’s in the box”.  For example: Is it really that easy? What capabilities does it have? What are the out-of-the-box visualizations I get with the framework?

Ease of Use

If you haven’t already registered and downloaded some of the documentation and cookbook, I’d encourage you to do so.  When we demoed the first version of the AVF at the Rittman Mead BI Forum in Atlanta this spring, we wrapped up the presentation with a simple “file diff” of a Ranzal AVF visualization.  It compared our AVF JavaScript and the corresponding “gallery entry” from the D3 site that we based it on.  In addition to allowing us to plug one of our favorite utilities (Beyond Compare 3), it illustrated just how little code you need to change to inject powerful JavaScript into the AVF and into OEID.

Capabilities

Talking about the framework is great, but the clearest way to show the capabilities of the AVF is by example.  So, let’s take a deep dive into two of the visualizations we’ve been working on this week.  First up, and it’s a mouthful, is our “micro-choropleth”. We started with a location-specific choropleth (follow the link for a textbook definition) centered around the City of Chicago.  Using the multitude of publicly available shape files for Chicago, the gist of this visualization is to display some publicly available data at a micro-level, in this case crime statistics at a “Neighborhood” level.

It’s completely interactive, reacts to guided navigation, gives contextual information when you mouse over and even gives you the details about individual events (i.e. crimes) when you click in. Great stuff, but what if I don’t want to know about crime in Chicago?  What if I want to track average length of stay in my hospital by where my patients reside?  Similar data, same concept; how can I transition this concept easily?

Well, our micro-choropleth has two key capabilities, both enabled by the framework, to account for this.  Not only does it allow my visualization to contain a number of different shape layers by default (JavaScript objects for USA state-by-state, USA states and counties, etc.), it also gives you the ability to add additional ones via Studio (no XML, no code). Once I’ve added the new JavaScript file containing the data shape, I can simply set some configuration to load this totally different geographic data frame rather than Chicago.  I can then switch my geographic configuration (all enabled in my visualization’s definition) to indicate that I’ll be using zip codes rather than Chicago neighborhoods for my shapes. Note that our health care data and medical notes are real but we de-identify the data, leaving our “public data” at the zip code level of granularity.  From there, I simply change my query to hit population health data and calculate a different metric (length of stay in Days) and I’m done!

That’s a pretty “wholesale” change that just got knocked out in a matter of minutes.  It’s even easier to make small tweaks.  For example, notice there are areas of “white” in my map that can look a little washed out.  These are areas (such as the U.S. Naval Observatory) that have zip codes but lack any permanent residents.  To increase the sharpness of my map, maybe I want to flip the line colors to black.  I can go into the Preferences area and edit CSS to my heart’s content.  In this case, I’ll flip the border class to “black” right through Studio (again, no cracking open the code)… …and see the changes occur right away.

The same form factor is valid for other visualizations that we’ve been working on.  The following visualization leverages a D3 force layout to show a Node-Link analysis between NFL skill position players (it’s Fantasy Football season!) and the things they share in common (College attended, Draft Year, Draft Round, etc.).  Below, I’ve narrowed down my data (approximately 10 years worth) by selecting some of the traditional powers in the SEC East and limiting to active players.

This is an example of one of our “template visualizations”.  It shows you relationships and interesting information, but it is really intended to show what you can do with your data.  I don’t think the visualization below will help you win your fantasy league, though it may help you answer a trivia question or two.

However, the true value is in realizing how this can be used in real data scenarios.  For example, picture a network of data related to intelligence gathering.  I can visualize people, say known terrorists, and organizations they are affiliated with.  From there, I can see others who may be affiliated with those organizations in a variety of ways (family relations, telephone calls, emails).  The visualization is interactive, and it lends itself to exploration through panning, scanning and re-centering.  It can show all available detail about a given entity or relationship and provide focused detail when things get to be a bit of a jumble.

And again, the key is configuration and flexibility over coding.  The icons for each college are present on my web server but are driven entirely by the data, and retrieved and rendered using the framework.  The color and behavior of my circles is configurable via CSS.

What’s In The Box?

So, you’re seeing some of the great stuff we’ve been building inside our AVF.  Some of the visualizations are still in progress, some of them are “proof of concept”, but a lot of it is already packaged up and included. We ship with visualizations for Box Plots, Donut Charts, Animated Timeline (aka Health and Wealth of Nations), and our Tree Map.  In addition, we ship with almost a dozen code samples for other use cases that can give you a jump start on what you’re trying to create. This includes a US Choropleth (States and Counties), a number of hierarchical and parent-child discovery visualizations, as well as a Sunburst chart.

In addition, we’ll be “refreshing the library” on a monthly basis with new visualizations and updates to existing ones.  These updates might range from demonstrations of best practices and design patterns to fully fledged supported visualizations built by the Engineering team here in Chicago.  Our customers and partners who are using the framework can expect an update on that front around the first of the month.

As always, feedback and questions welcome at product [at] ranzal.com.

Introducing Visualizations 2.0

Ranzal is proud to announce general availability of our Advanced Visualization Framework (AVF) 2.0 for Oracle Endeca Information Discovery (OEID) today. A few months back, we released our first cut at a “framework within the framework” for building powerful data visualizations rapidly in an enterprise-class discovery environment. It allowed both our internal team and our customers to build visualizations that deliver powerful insights in a matter of days and hours (or sometimes minutes) rather than weeks and months.  You find something cool, you plug the JavaScript into our framework, fill out some XML configuration and you’re on your way.

So What’s New?

This new release builds on top of our previous work and makes vast improvements both to how visualizations get built (using JavaScript) and configured (inside of OEID Studio).  The most common piece of feedback we received the first time around was that, once an advanced visualization was ready to go, configuring that visualization could be exceedingly difficult even for a seasoned user. In this release, we’ve made great strides in improving what we call the configuration experience.  Within this area, we’ve invested primarily in ease of use and user guidance.

Ease of Use

We set out in this release to make every part of the configuration screens easier to understand.  For each tab in our set of preferences, we’ve either streamlined the set of required fields and/or given the user a set of tools and information to use as a reference.

For example, take the query writing process.  In the previous version, users needed to enter their EQL essentially “without a net”.  There was no config-side validation, no visual help with simple constructs such as the field names available in a view.  It was hard, if not impossible, to get things right without some back and forth or three different browser tabs open. In the AVF 2.0, the EQL authoring process looks like this:

The user no longer has to remember field names, and an at-a-glance reference showing display names, attribute keys and types is front and center.  In addition, there is now on-the-spot validation of the query (upon hitting Save) to help diagnose any syntactical errors that may be present.

Throughout the configuration experience, we’ve made things easier to use.  But how does the AVF 2.0 help the user through the full process of configuring a visualization? Read on.

Guided Configuration

To that end, we set out to make it easy for developers to guide their users in configuring the data elements (queries, CSS, etc.) that provide the backing for a visualization.  It was apparent very early on that, in many cases, building an advanced visualization requires some advanced capabilities.  This can be illustrated in the famous “Wealth and Health of Nations” visualization that we call the Animated Timeline:

It’s a really cool visualization with a nice combination of interactivity and dynamic playback.  However, the first time we encountered it, it took us a moment to wrap our heads around questions such as “How many metrics? How many group-bys?”  It takes a fair amount of understanding to pull off generating the data for such a complex visualization*.

For a complex visualization, the developer who wrote the JavaScript has this in-depth understanding.  The trick is to provide a “no-code” capability for the developer to help the user along the configuration path.  In this release, every visualization and nearly every configurable field for a visualization can be enabled for contextual help.

This includes the visualization itself….

…its queries….


and even Custom Preferences of your own design.

Simply adding description attributes to the XML configuration for a given visualization type allows the developer to provide the power user with all the help they need.

*For the record, the Animated Timeline uses 3 metrics (X, Y, size of the bubble) and 3 group by attributes (Detail, Color and Time).

Pruning the Hedges

Frankly, the first time we released this offering, we tried to make it too configurable and too open.  Call it the software framework corollary of “Four Drafts”.

To take one egregious example, the AVF 1.0 had introduced the idea of configurable tokens.  Configurable tokens are still available because they’re extremely flexible and valuable.  However, we also had something called “conditional dynamic tokens”.  These tokens came with their own grammar, best described as hieroglyphic, that governed their values at all times (sort of a symbolic case statement).  It’s an extremely powerful construct for the developer (each token potentially saves you 5-10 lines of JavaScript) but completely confusing to the power user trying to configure a chart.  Things like that are gone.  The developer is left to do a little more work but the 99% of our users who are not making use of this functionality find the experience a lot easier to navigate.  Less is more.

What’s Next

We’re also making a concerted effort to bring more developers into the fold.  To that end, while we don’t make our framework available for download without some agreements in place, our Installation Guide and AVF Developer’s Guide can be found on our Downloads page and are available to registered users.  To register at Bird’s Eye View, simply click here or use the registration link on the right.

The most exciting part of the whole documentation set (talk about an oxymoron) is a new section of the Developer’s Guide called the Cookbook.  It’s a set of small, simple examples that allow developers to quickly come up to speed on the framework and start writing visualizations that much faster (15 minutes).

If you’re interested in learning more, don’t hesitate to comment below or drop us a line at product [at] ranzal.com.