Big Data Discovery – Custom Java Transformations Part 2

In a previous post, we walked through how to implement a custom Java transformation in Oracle Big Data Discovery.  While that post was more technical in nature, this follow up post will highlight a detailed use case for the transformations and illustrate how they can be used to augment an existing dataset.

Example: 2015 Chicago Mayoral Election Runoff

For this example we will be using data from the 2015 Mayoral Election Runoff in Chicago.  Incumbent Rahm Emanuel defeated challenger “Chuy” Garcia with 55.7% of the popular vote.  Results data from the election were compiled and matched up with Chicago communities, which were then subdivided by zip code.  A small sample of the data can be seen below:

Sample election data

In its original state, the data already offers some insight into the results of the election, but only at a high level.  By utilizing the custom transformations, it is possible to bring in additional data and find answers to more detailed questions.  For example, what impact did the wealth of Chicago’s communities have on their selection of a candidate?

One indicator of wealth within a community is the median sale price of homes in that area.  Naturally, the higher the price of homes, the wealthier the community tends to be.  Zillow provides an API that allows users to query for a variety of real estate and demographic information, including median sale price of homes.  Through the custom transformations, we can augment the existing election results with the data available through the API.

The structure of the custom transformation is exactly the same as the ‘Hello World’ example from our previous post.  The transformation is initiated in the BDD Custom Transform Editor with the command runExternalPlugin('ZillowAPI.groovy',zip). In this case, the custom Groovy script is called ZillowAPI.groovy and the field being passed to the script is the zip code, zip.

The script then uses the zip to construct a string and convert it to the URL required to make the API call:

def zip = args[0]
String url = "<ZILLOW_API_KEY>&zip=" + zip;
URL api_url = new URL(url);
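The rest of the script (not shown above) would read the API response and pull out the median sale price.  Zillow's API returns XML, so a minimal sketch of the extraction step might look like the following.  Note that the response shape here is a simplified assumption for illustration; the real Zillow payload is more deeply nested.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import java.io.ByteArrayInputStream;

public class MedianSalePrice {
    // Hypothetical, simplified response; the real Zillow payload is more deeply nested
    static final String SAMPLE_RESPONSE =
        "<response><region zip=\"60614\">"
        + "<medianSalePrice><amount currency=\"USD\">425000</amount></medianSalePrice>"
        + "</region></response>";

    // Pull the text of the first <amount> element out of the XML; returns null on parse failure
    static String extractMedianSalePrice(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
            return doc.getElementsByTagName("amount").item(0).getTextContent();
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(extractMedianSalePrice(SAMPLE_RESPONSE)); // prints 425000
    }
}
```

The string returned here is what the Groovy script would hand back to BDD as the value of the new median_sale_price field.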

Once the transform script completes, the median_sale_price field is now accessible in BDD:

Updated data in BDD

Now that the additional data is available, we can use it to build some visualizations to help answer the question posed earlier.

Median Sale Price by Chicago Community – Created using the Ranzal Data Visualization Portlet*

Percentage for Chuy by Community – Created using the Ranzal Data Visualization Portlet*

The two choropleths above show the median sale price by community and the percentage of votes for “Chuy” by community.  Communities in the northeastern sections of the city tend to have the highest median sale prices, while communities in the western and southern sections tend to have lower prices.  For median sale price to be a strong indicator of how the communities voted, the map displaying votes for “Chuy” should show a similar pattern, with the communities grouped by northeast and southwest.  However, the pattern is noticeably different, with votes for “Chuy” distributed across all sections of the map.

Bar-Line chart of Median Sale Price and Percent for Chuy

Looking at the median sale price in conjunction with the percentage of votes for “Chuy” provides an even clearer picture.  The bars in the chart above represent the median sale price of homes, and are sorted in descending order from left to right.  The line graph represents the percentage of votes for “Chuy” in each community.  If there was a connection between median sale price and the percentage of votes for “Chuy”, we’d expect to see the line graph either increase or decrease as sale price decreases.  However, the percentage of votes varies widely from community to community, and doesn’t seem to follow an obvious pattern in relation to median sale price.  This corresponds with the observations from the two choropleths above.
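The eyeball test above can be backed with a quick calculation.  A Pearson correlation coefficient near zero between the two series would support the "no relationship" reading.  The sketch below uses made-up numbers purely for illustration, not the actual election or Zillow figures:

```java
public class PriceVoteCorrelation {
    // Pearson correlation coefficient between two equal-length series
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
        }
        return (sxy - sx * sy / n)
                / Math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
    }

    public static void main(String[] args) {
        // Illustrative numbers only, not the actual election or Zillow data
        double[] medianPriceThousands = {425, 310, 150, 95, 520, 210};
        double[] pctForChuy = {38, 52, 41, 55, 47, 33};
        System.out.printf("r = %.2f%n", pearson(medianPriceThousands, pctForChuy));
    }
}
```

A value of r close to +1 or -1 would indicate a strong linear relationship between price and vote share; values near 0 match the scattered pattern seen in the charts.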

While these findings don’t provide a definitive answer to the initial question as to whether community wealth was a factor in the election results, they do suggest that median sale price is not a good indicator of how Chicago communities voted in the election.  More importantly, this example illustrates how easy it is to utilize custom Java transformations in BDD to answer detailed questions and get more out of your original dataset.

If you would like to learn more about Oracle Big Data Discovery and how it can help your organization, please contact us at info [at] or share your questions and comments with us below.

* – The Ranzal Data Visualization Portlet is a custom portlet developed by Ranzal and is not available out of the box in BDD.  If you would like more information on the portlet and its capabilities, please contact us and stay tuned for a future blog post that will cover the portlet in more detail.

Big Data Discovery – Custom Java Transformations Part 1

In our first post introducing Oracle Big Data Discovery, we highlighted the data transform capabilities of BDD.  The transform editor provides a variety of built in functions for transforming datasets.  While these built in functions are straightforward to use and don’t require any additional configuration, they are also limited to a predefined set of transformations.  Fortunately, for those looking for additional functionality during transform, it is possible to introduce custom transformations that can leverage external Java libraries by implementing a custom Groovy script.  The rest of this post will walk through the implementation of a basic example, and a subsequent post will go in depth with a few real world use cases.

Create a Groovy script

The core component needed to implement a custom transform with external libraries is a Groovy script that defines the pluginExec() method.  Groovy is a programming language developed for the Java platform.  Details and documentation on the language can be found here.  For this basic example, we’ll begin by creating a file called CustomTransform.groovy and define a method, pluginExec(), which should take an object array, args, as an argument:

def pluginExec(Object[] args) {
    String input = args[0] // args[0] is the input field from the BDD Transform Editor

    // Implement code to transform input in some way
    // The return of this method will be inserted into the transform field

    input.toUpperCase() // This example returns an upper cased version of input
}

pluginExec() will be applied to each record in the BDD dataset, and args[0] corresponds to the field to be transformed.  In the example script above, args[0] is assigned to the variable input and the toUpperCase() method is called on it.  This means that if this custom transformation is applied to a field called name, the value of name for each record will be returned upper cased (For example, “johnathon” => “JOHNATHON”).
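The per-record behavior described above can be mimicked in plain Java.  The loop below is a stand-in for what BDD does when it applies the script to a dataset, calling the method once per record with the chosen field as args[0]:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PerRecordTransform {
    // Stand-in for pluginExec(): receives the field value as args[0], returns the transformed value
    static String pluginExec(Object[] args) {
        String input = (String) args[0];
        return input.toUpperCase();
    }

    public static void main(String[] args) {
        // BDD applies the script once per record; the loop below mimics that behavior
        List<String> names = Arrays.asList("johnathon", "maria");
        List<String> transformed = names.stream()
                .map(name -> pluginExec(new Object[]{name}))
                .collect(Collectors.toList());
        System.out.println(transformed); // prints [JOHNATHON, MARIA]
    }
}
```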

Import Custom Java Library

Now that we’ve covered the basics of how the custom Groovy script works, we can augment the script with external Java libraries.  These libraries can be imported and implemented just as they would be in standard Java:

def pluginExec(Object[] args) {
    String input = args[0] // Defined here for illustration; this example does not actually use the input
    HelloWorld hw = new HelloWorld() // Create a new instance of the HelloWorld class defined in the imported library
    hw.testMe() // Call the testMe() method, which returns the string "Hello World"
}
In the example above, the HelloWorld class is imported.  A new instance of HelloWorld is assigned to the variable hw, and the testMe() method is called.  testMe() is designed to simply return the string “Hello World”.  Therefore, the expected output of this custom script is that the string “Hello World” will be inserted for each record in the transformed BDD dataset.  Now that the script has been created, it needs to be packaged up and added to the Spark class path so that it’s accessible during data processing.
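The external library itself can be as simple as a single compiled class.  A minimal sketch of what the HelloWorld class used above might contain is shown below; the package declaration is omitted here, so treat the layout as an assumption and match whatever your actual jar uses:

```java
// Hypothetical sketch of the external library class; match your jar's actual package layout
public class HelloWorld {
    // Returns a fixed greeting; called from the Groovy script as hw.testMe()
    public String testMe() {
        return "Hello World";
    }
}
```

Compile this class, include it in a jar alongside any dependencies, and the Groovy script can instantiate it as shown above.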

Package the Groovy script into a jar

In order to utilize CustomTransform.groovy, it needs to be packaged into a .jar file.  It is important that the Groovy script be located at the root of the jar, so make sure that the file is not nested within any directories.  See below for an example of the file structure:


Note that additional files can be included in the jar as well.  These additional files can be referenced in CustomTransform.groovy if desired.  There are multiple ways to pack up the file(s), but the simplest is to use the command line.  Navigate to the directory that contains CustomTransform.groovy and use the following command to package it up:

# jar cf <new_jar_name> <input_file_for_jar>
> jar cf CustomTransform.jar CustomTransform.groovy

Setup a custom lib location in Hadoop

CustomTransform.jar and any additional Java libraries that are imported by the Groovy script need to be added to all Spark nodes in your Hadoop cluster.  For simplicity, it is helpful to establish a standard location for all custom libraries that you want Spark to be able to access:

$ mkdir /opt/bdd/edp/lib/custom_lib

The /opt/bdd/edp/lib directory is the default location for the BDD dataprocessing libraries used by Spark.  In this case, we’ve created a subdirectory, custom_lib, that will hold any additional libraries we want Spark to be able to use.

Once the directory has been created, use scp, WinSCP, MobaXterm, or some other utility to upload CustomTransform.jar and any additional libraries used by the Groovy script into the custom_lib directory.  The directory needs to be created on all Spark nodes, and the libraries need to be uploaded to all nodes as well.

Update the Spark properties file on the BDD Server

The last step that needs to be completed before running the custom transformation is updating the Spark properties file.  This step only needs to be completed the first time you create a custom transformation, as long as the location of the custom_lib directory remains constant for each subsequent script.

Navigate to /localdisk/Oracle/Middleware/BDD<version>/dataprocessing/edp_cli/config on the BDD server and open the Spark properties file for editing:

$ cd /localdisk/Oracle/Middleware/BDD1.0/dataprocessing/edp_cli/config
$ vim <spark_properties_file>

The file should look something like this:

# Spark additional runtime properties, see
# for examples

Add an entry to the file to define the spark.executor.extraClassPath property.  The value for this property should be <path_to_custom_lib_directory>/*.  This will add everything in the custom_lib directory to the Spark class path.  The updated file should look like this:

# Spark additional runtime properties, see
# for examples

spark.executor.extraClassPath=/opt/bdd/edp/lib/custom_lib/*

It is important to note that if there is already an entry in the file for the spark.executor.extraClassPath property, any libraries referenced by the property should be moved to the custom_lib directory so they are still included in the Spark class path.

Run the custom transform

Now that the script has been created and added to the Spark class path, everything is in place to run the custom transform in BDD.  To try it out, open the Transform tab in BDD and click on the Show Transformation Editor button.  In this example, we are going to create a new field called custom with the type String:

Create new attribute

Now in the editor window, we need to reference the custom script:

Transform Editor

The runExternalPlugin() method is used to reference the custom script.  The first argument is the name of the Groovy script.  Note that the value above is 'CustomTransform.groovy' and not 'CustomTransform.jar'.  The second argument is the field to be passed as an input to the script (this is what gets assigned to args[0] in pluginExec()).  In the case of the “Hello World” example, the input isn’t used, so it doesn’t matter what field is passed here.  However, in the first example that returned an upper cased version of the input field, the script above would return an upper cased version of the key field.

One of the nice features of the built-in transform functions is that they make it possible to preview the transform changes before committing.  With these custom scripts, however, it isn’t possible to see the results of the transform before running the commit.  Clicking preview will just return blank results for all fields, as seen in the example below:

Example of custom transform preview

The last thing to do is click ‘Add to Script’ and then ‘Commit to Project’ to kick off the transformation process.  Below are the results of the transform.  As expected, a new custom field has been added to the data set and the value “Hello World” has been inserted for every record.

Transform results

This tutorial just hints at the possibilities of utilizing custom transformations with Groovy and external Java libraries in BDD.  Stay tuned for the second post on this subject, when we will go into detail with some real world use cases.

If you would like to learn more about Oracle Big Data Discovery and how it can help your organization, please contact us at info [at] or share your questions and comments with us below.

Deploying Oracle Endeca Portlets in WebLogic

We’re long overdue for a “public service” post dedicated to sharing best practices around how Ranzal does certain things during one of our implementation cycles.  Past installments have covered installation pitfalls, temporal analysis and the Endeca Extensions for Oracle EBS.

In this post, we’re sharing our internal playbook (adapted from our internal Wiki) for deploying custom portlets (such as our Advanced Visualization Framework or our Smart Tagger) inside of an Oracle Endeca Studio instance on WebLogic.

The documentation is pretty light in this area, so consider this our attempt to fill in the blanks for anyone looking to deploy their own portlets (or ours!) in a WebLogic environment.  More after the jump…

Adventures in Installing – Oracle Endeca 3.1 Integrator

The newest version of Oracle’s Endeca Information Discovery (OEID v3.1), Oracle’s data discovery platform, was released yesterday morning.  We’ll have a lot to say about the release, the features, and what an upgrade looks like in the coming weeks (for the curious, Oracle’s official press release is here) but top of our minds right now is: “How do I get this installed and up and running?”

After spending a few hours last night with the product, we wanted to share some thoughts on the install and ramp-up process and hopefully save some time for others who are looking to give the product a spin.  The first post concerns the installation of the ETL tool that comes with Oracle Endeca, OEID Integrator.

Installing OEID Integrator

When attempting to install Integrator, I hit a couple snags.  The first was related to the different form factor that the install has taken vs. previous releases.  In version 3.1, the install has moved away from an “Installshield-style” experience to more of a download, unzip and script approach.  After downloading the zip file and unpacking it, I ended up with a structure that looks like this:


Seeing an install.bat, I decided to double-click it and dive right in.  After the first couple of prompts, one large change becomes clear.  The Eclipse container that hosts Integrator needs to be downloaded separately prior to installation (RTFM, I guess).

Not a huge deal, but what I found is that it is incredibly important that you download a very specific version of Eclipse (Indigo, according to the documentation) in order for the installation to complete successfully.  For example:

  • I tried to use the latest version of Eclipse, Kepler.  This did not work.
  • I tried to use Eclipse IDE for J2EE Developers (Indigo).  This did not work.
  • I used Eclipse IDE for Java Developers (Indigo) and it worked like a charm:


In addition, I would highly recommend running the install script (install.bat) from the command line, rather than through a double-click in Windows Explorer.  Running it via a double-click can make it difficult to diagnose any issues you may encounter since the window closes itself upon completion.  If the product is installed from the command line, a successful installation on Windows should look like this:


Hopefully this saves some time for others looking to ramp up on the latest version of OEID.  We’ll be continuing to push out information as we roll the software out in our organization and upgrade our assets so watch this space.


OEID 3.0 First Look — Text Enrichment & Whitespace

I recently spent some cycles building my first POC for a potential customer with OEID v3.0.  After running some of the unstructured data through the text enrichment component, I noticed something odd:


The charts I configured to group by those salient terms were displaying a “null” bucket.  This bucket was essentially collecting all records that were not tagged with a term.  After a bit of investigation, it seems this is expected behavior in v3.0 — the Endeca Server now treats empty, yet non-null, attributes as valid and houses them on the Endeca record.  Empty, yet non-null, attributes are common after employing some of the OOTB text enrichment capabilities in 3.0 (tagging, extraction, regex).  Thus, a best-practice treatment for this side effect is warranted.

The good news is that the workaround was very straightforward.

1) Add a “Reformatter” component to the .grf before the bulk loader with the same input and output metadata edge definition.  From the reformatter “Source” tab, select “Java Transform Wizard” and give your new transformation class a name like “removeWhitespaces”.  This will create a .java source file and a compiled .class file in your Integrator project’s ./trans directory (where Integrator expects your java source code to reside).


2) Provide the following java logic in your new “removeWhitespaces” transformation class:
import org.jetel.component.DataRecordTransform;
import org.jetel.data.DataRecord;
import org.jetel.exception.TransformException;
import org.jetel.metadata.DataFieldType;

public class removeWhitespaces extends DataRecordTransform {

    public int transform(DataRecord[] arg0, DataRecord[] arg1) throws TransformException {
        for (int i = 0; i < arg0.length; i++) {
            DataRecord rec = arg0[i];
            for (int j = 0; j < rec.getNumFields(); j++) {
                if (rec.getField(j).getMetadata().getDataType().equals(DataFieldType.STRING)) {
                    Object value = rec.getField(j).getValue();
                    if (value == null || value.toString().trim().length() == 0) {
                        rec.getField(j).setNull(true); // null out empty strings so they are not stored as attributes
                    }
                }
            }
            arg1[i].copyFrom(rec); // pass the cleaned record through to the output edge
        }
        return 0;
    }
}
3) Make sure the name of this new class is specified in the “Transform class” input.  Rerun the .grf that loads your data and….profit!


We look forward to sharing more emerging OEID v3.0 best practices here….and hearing about your approaches as well.



OEID 3.0 First Look – Update/Delete Data Improvements

For almost a decade, the core Endeca MDEX engine that underpins Oracle Endeca Information Discovery (OEID) has supported one-time indexing (often referred to as a Baseline Update) as well as incremental updates (often referred to as partials).  Through all of the incarnations of this functionality, from “partial update pipelines” to “continuous query”, there was one common limitation.  Your update operations were always limited to act on “per-record” operations.

If you’re a person coming from a SQL/RDBMS background, this was a huge limitation and forced a conceptual change in the way that you think about data.  Obviously, Endeca is not (and never was) a relational system, but the freedom SQL provides to update data whenever and wherever you please was sorely missed, especially at scale.  Building an index nightly for 100,000 E-Commerce products is no big deal.  Running a daily process to feed 1 million updated records into a 30 million record Endeca Server instance just so that a set of warranty claims could be “aged” from current month to prior month is something completely different.

Thankfully, with the release of the latest set of components for the ETL layer of OEID (called OEID Integrator), huge changes have been made to the interactions available for modifying an Endeca Server instance (now called a “Data Domain”).  If you’ve longed for a “SQL-style experience” where records can be updated or deleted from a data store by almost any criteria imaginable, OEID Integrator v3.0 delivers.


OEID 3.0 First Look – The Little Things

There’s so much new “goodness” in Oracle Endeca Information Discovery (OEID) 3.0, it’s been a little bit of a challenge to “spread the word” in small enough chunks.  We start writing these posts, get a little excited and pretty soon we’ve got Ranzal’s very own version of the Iliad.

In the coming weeks, there will be a few Iliads, and maybe an Odyssey as well, but before we get too deep into the platform, I wanted to illustrate and elaborate on a couple “small changes” that should prove beneficial to people just coming up to speed and OEID veterans alike.

The Guided Navigation Histogram

As one of my colleagues pointed out, I neglected to highlight a key enhancement to the Guided Navigation user experience when posting to the blog earlier this week.  Often when doing data modeling for an OEID application, you’ll be transforming, joining, doing denormalization and all sorts of other operations on your data as it is being brought into your Endeca Server.  What often happens is that you lose some of the original context that was present in the source system.  For example, you may have a set of sales records that a user has the ability to refine by State, by City and by Product.  When you wanted to give the user the ability to understand “how much data” was behind a given Guided Navigation option, the typical answer was to use Refinement Counts.


As you can see above, this construct gives a numerical value to the frequency of a given attribute value in the current data set.  However, this number often causes confusion for users.  Is it the number of Invoices?  Is it the number of line items on my invoices?  Is it the number of Shipments?  Often, it’s none of these things and simply an artifact of how the data is being modeled.  With OEID 3.0, there is a new way to visually display this frequency data, without the messiness of (often) meaningless numbers.

As you can see above, I get the same ability to message to users that most sales are occurring in Toronto with both versions of the product.  However, OEID 3.0 provides the immediate, visceral context that tells the user, my Toronto transactions are nearly three times as numerous.  In addition, the aforementioned absence of “strange numbers” eliminates confusion and encourages users to explore rather than over-analyze.

Multi-Lingual LQL Parsing And Validation

Continuing with the theme of Internationalization, the LQL Parsing Service now supports a language parameter when compiling and validating queries.  While English is still the lingua franca of the internals of the platform, having the ability to troubleshoot your queries in your native language is a huge plus.  Below, you can see the Metrics Bar Portlet returning my syntax error in Portuguese:

Note: For those of you following along, this is the “Unexpected Symbol” error where the per-select Where clause expects the criteria to be in parentheses.  At least I think it is, my Portuguese is a little rusty.

This concept is supported by the Parsing Service itself so any application making use of the Endeca Server web services can leverage this functionality as well.

Languages in Studio vs. Languages in the Engine

One additional note on support for multiple languages is that Endeca Server actually supports more languages than OEID Studio has been translated into so far.  Users in OEID Studio have ten locales to choose from in the application:

  • German
  • English (United States)
  • French
  • Portuguese (Portugal)
  • Italian
  • Japanese
  • Chinese (Traditional) zh_TW
  • Chinese (Simplified) zh_CN
  • Korean
  • Spanish (Spain)

However, Endeca Server supports the above 10 in addition to the following 12 (with their language codes, as Endeca Server expects them, in parentheses):

  • Catalan (ca)
  • Czech (cs)
  • Greek (el)
  • Hebrew (he)
  • Hungarian (hu)
  • Dutch (nl)
  • Polish (pl)
  • Romanian (ro)
  • Russian (ru)
  • Swedish (sv)
  • Thai (th)
  • Turkish (tr)

Note that Endeca Server expects RFC-3066 codes, which will differ slightly from the locales that are used in Studio as well.  For example, setting the language of a given attribute to en_US would not work in Endeca Server while being a perfectly good locale in Studio.  Language would be “en” for Server in this case.
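For the simple cases, the Studio-to-Server mapping amounts to keeping only the language portion of the locale, as in the en_US to en example above.  A quick sketch of that conversion:

```java
public class LocaleCodes {
    // Keeps only the language portion of a Studio-style locale (e.g. en_US -> en).
    // Note: a naive mapping like this loses the script distinction for zh_TW/zh_CN,
    // so treat it as an illustration, not a complete conversion routine.
    static String serverLanguageCode(String studioLocale) {
        int sep = studioLocale.indexOf('_');
        String lang = sep >= 0 ? studioLocale.substring(0, sep) : studioLocale;
        return lang.toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(serverLanguageCode("en_US")); // prints en
    }
}
```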

That’s all for now.  More posts coming later today and tomorrow.  Happy Exploring!

Running all of your Endeca CAS crawls

I’ve recently been working with a customer who has a large number of Endeca CAS crawls (300+) defined for their Endeca-driven enterprise search application.  Occasionally, the need arises to rerun all of their crawls consecutively.

There are a number of ways to skin this cat; this certainly could be scripted using the Java web service stubs that ship with CAS, or with a Java BeanShell script in the AppConfig.xml file.

Recently, however, I wrote a quick and dirty DOS .bat file that leverages the cas-cmd.bat utility to do this.

For those of you who find this need and don’t feel like sinking time into a java project, enjoy: 

@echo on
call cas-cmd listCrawls > crawls.lst
for /F "delims=" %%A in (crawls.lst) do CALL :RUNCRAWL %%A
GOTO :EOF

:RUNCRAWL
IF "%1" == "" GOTO :EOF
echo running %1 ...
call cas-cmd startCrawl -id %1 2>&1 | FIND "CasCmdException"
:CHECKSTATUS
SLEEP 10
echo checking %1 status...
call cas-cmd getCrawlStatus -id %1 2>&1 | FIND "NOT_RUNNING"
IF ERRORLEVEL 1 GOTO :CHECKSTATUS
GOTO :EOF

Endeca Performance Optimization

Ranzal launches the first of our Performance Analysis Tools: Phoenix

So, you’ve got this great Endeca Commerce implementation powering your online sales and delivering a world-class experience to your customers.  Or you’ve got a terrific Data Discovery application built on the Oracle Endeca Information Discovery (OEID, for short) platform and it’s enabling your users to unlock all kinds of value from your structured and unstructured data.  Things are humming along and life is grand.  However, one day, you decide to implement some changes.  Maybe you’re rolling out a second business release or a whole new set of data sources or products.  Maybe you’re enabling record-level security.  Post-rollout, you start to get the dreaded emails from your users:

“Hey Andy, the system seems really slow this morning.”

“This chart loads for me but not for Gireesh, can you help?”

“Your site used to be lightning-fast, now a search for ‘handbags’ takes 30 seconds!”

Performance testing, forensic analysis, system troubleshooting.  For most people, these are not tasks that set one’s pulse racing with excitement.  However, monitoring performance and understanding where the bottlenecks are is a crucial element of maintaining a valuable Endeca application, be it Commerce or OEID.

Since we’re all about efficiency and helping people work on the fun stuff, we’ve spent some time over the last month building Phoenix, our performance analysis tool for Oracle Commerce and OEID.

What does it do?

Phoenix harnesses the entirety of Endeca system metrics that are tracked throughout the platform, from the back-end MDEX Engines for OEID/Endeca Commerce to the front-end OEID Studio application, and produces clear and concise HTML reports.  Our application breaks down and summarizes your system performance by Date and Time (down to the hour), Feature (Navigation vs. Search vs. Charting vs. Sorting vs. Everything), Portlet Instance (for OEID) and more, surfacing where your implementation is experiencing problems.

In addition, it identifies your worst performing operations and queries, taking a holistic view of the entire system, not just the back-end MDEX.  It surfaces the key information necessary to figure out which parts of your stack (Network, Front-end, Back-end) are acting as a bottleneck and which parts are cruising along performance-wise.  If you’re troubleshooting an OEID application, it can find your worst performing charts or visualizations.  For a Commerce application, it can find out which cartridges and features are contributing to your most expensive queries.

If you’re a long-time Endeca expert who has used the Cheetah tool before, it provides all the functionality that the Cheetah tool provided and goes way beyond.

What questions does it answer?

  • Which Endeca features are performing poorly?
  • What times of the day/month/year is my application taking on the most load?
  • Is my application returning too much data in its responses?
  • Do I have enough threads?
  • Are my requests “queueing”?
  • What are my worst-performing queries?
  • Is my network a bottleneck for responses?
  • Does my application tier read the MDEX responses too slowly?

How does it work?

Phoenix is written in Java following a modular and extensible pattern that allows new “parsers” to be plugged in to work with any version of the MDEX and EID Studio request logs, past or present.  With Perl no longer shipping with the OEID product, Java, with the extensibility it offers, was the obvious choice for writing this utility.  The “default” version we ship is designed for v2.3 of the OEID platform and the latest version of Oracle Endeca Commerce, but if you’re on an older version and looking to use it (Endeca Infront, Latitude 1, MDEX 5, etc.), let us know and we can make it available to you.

How do I get it?

If you’re interested in better understanding the health and performance characteristics of your Endeca implementation, contact us at to see what Ranzal and Phoenix can do for you.

What’s next?

Our plan is to continue to build out a suite of analysis, performance testing and optimization tools to help the Oracle Endeca community get the most out of their Commerce and Data Discovery applications.