OEID Incremental Updates

A fairly common approach…

More often than not, when pulling data from a database into OEID, we need to employ incremental updates. To support incremental updates, we need a way to identify which records have been added, updated, or deleted since our last load. This change identification is commonly referred to as change data capture, or CDC. There is no one way to accomplish CDC, and often the best approach is dictated by the mechanisms in place in the source system. Usually the database we're pulling from isn't leveraging any explicit CDC mechanism.

Note: If you're pulling from text files and new records are being appended, look at the incremental reading feature of the UniversalDataReader component (p. 268 of the Data Integrator Designer guide: http://docs.oracle.com/cd/E29805_01/integrator.230/DataIntegratorDesigner.pdf).

If you're pulling from a database and don't have explicit database CDC features enabled, best practice is to create an "audit" or "execution_history" table to track previous full and incremental loads. This audit table simply records the date and time a load started and, if it ended successfully, the date and time it ended. You would INSERT into this table before calling your incremental load graph in Integrator. Then, when reading your table (or, better yet, a denormalized view), you can issue your SQL SELECT statement with a few extra WHERE conditions that leverage a "last_update_date" column in your view, like so:

SELECT * FROM <View>
WHERE view_last_update_date >= (SELECT MAX(run_start_date) FROM audit_table WHERE run_status = 'Complete')
AND view_last_update_date < (SELECT MAX(run_start_date) FROM audit_table)


Once this incremental load graph completes, you'd update your audit table row with the end datetime of the run and set its status to 'Complete'.
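The full audit-table pattern can be sketched end to end with an in-memory SQLite database. The table and column names (audit_table, run_start_date, run_status, view_last_update_date) follow the example above, but the seed data and the source_view table are hypothetical stand-ins for your real database view:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE audit_table (
    run_start_date TEXT,
    run_end_date   TEXT,
    run_status     TEXT
);
CREATE TABLE source_view (
    id INTEGER,
    view_last_update_date TEXT
);
""")

# Seed: one completed prior load and two source rows, one changed since.
conn.execute("INSERT INTO audit_table VALUES "
             "('2012-01-01 00:00:00', '2012-01-01 01:00:00', 'Complete')")
conn.executemany("INSERT INTO source_view VALUES (?, ?)", [
    (1, '2011-12-31 12:00:00'),   # picked up by the prior full load
    (2, '2012-01-05 09:30:00'),   # updated since -> incremental candidate
])

# Step 1: INSERT the audit row before kicking off the incremental graph.
run_start = '2012-01-10 00:00:00'
conn.execute("INSERT INTO audit_table (run_start_date, run_status) "
             "VALUES (?, 'Running')", (run_start,))

# Step 2: SELECT changed rows using the window from the query above.
changed = conn.execute("""
    SELECT * FROM source_view
    WHERE view_last_update_date >=
          (SELECT MAX(run_start_date) FROM audit_table
           WHERE run_status = 'Complete')
      AND view_last_update_date <
          (SELECT MAX(run_start_date) FROM audit_table)
""").fetchall()
print(changed)  # only the row updated since the last completed load

# Step 3: mark the run Complete with its end datetime.
conn.execute("""
    UPDATE audit_table SET run_end_date = ?, run_status = 'Complete'
    WHERE run_start_date = ?
""", ('2012-01-10 00:15:00', run_start))
```

Note how the lower bound comes only from 'Complete' runs, while the upper bound is the current run's start time, so records updated mid-run are safely picked up by the next load.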

Pivoting in Endeca

The impetus for Ranzal’s SmartStateManager

The prologue.

Pivoting across entities in your organization’s data is a central feature of any data discovery application.  For example, understanding what parts in your organization’s supply chain have the highest number of quality issues, and then pivoting to the discrete list of suppliers that provide those parts is a powerful, yet expected, data discovery capability.

Those familiar with Endeca’s approach to data modeling are well-versed in data “denormalization” — or joining the entities in the data together upon ingest.  This denormalization process makes for wider, fully-attributed records which allow for very performant information access as offered by the Endeca Server.   As an example, in this denormalized data model, our “part” record in Endeca is fully-attributed with the information about the supplier who supplies it and the technical attributes that describe it.

This denormalized approach is straightforward when consuming data from the star schemas in the data warehouse. The "fact" drives the grain of our Endeca records, and all of the surrounding dimension tables can simply be joined to the fact to make our records wider and further attributed. With data combined in this fashion, pivoting in Endeca is as easy as filtering the records by attributes from one dimension table (say, "product_type:bikes") and aggregating by attributes from another (say, "GROUP BY supplier_name").
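The filter-then-aggregate pivot on wide, denormalized records can be sketched in a few lines of Python. The records and attribute values here are hypothetical, chosen to match the bikes/supplier example:

```python
from collections import Counter

# Hypothetical denormalized "fact" records: each record carries attributes
# from every joined dimension (product and supplier).
records = [
    {"product_type": "bikes",  "supplier_name": "Acme",   "defects": 3},
    {"product_type": "bikes",  "supplier_name": "Globex", "defects": 1},
    {"product_type": "skates", "supplier_name": "Acme",   "defects": 2},
]

# Pivot: filter by an attribute from one dimension...
bikes = [r for r in records if r["product_type"] == "bikes"]

# ...and aggregate by an attribute from another.
defects_by_supplier = Counter()
for r in bikes:
    defects_by_supplier[r["supplier_name"]] += r["defects"]

print(dict(defects_by_supplier))  # {'Acme': 3, 'Globex': 1}
```

Because every record is fully attributed, no join is needed at query time; the pivot is just a filter plus a group-by.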

The challenge.

Pivoting in Endeca becomes challenging when the data to be consumed does not join together nicely into one “record type”.

(Note: A record type is nothing more than an additional attribute that we tag our Endeca records with to help our front-end queries distinguish between different types of data. Adding an additional attribute (e.g. "record_type:sales") allows us to classify and summon certain types of records representing different entities at different grains for different purposes.)

To those who haven’t spent large amounts of time modeling data in Endeca, the “record type” concept can be confusing.  It is best illustrated by example.  Take the following logical data model…

[Figure: logical data model showing vendors, contracts, and payments]

In this logical data model, we have three entities: vendors, contracts, and payments.  In one system, a company establishes contracts (or purchase orders) with vendors to acquire their goods and/or services.  In a separate system, said company issues payments to those vendors once the company has been invoiced.  Under this design, there is no way to join contracts and payments together.  They are “sibling” entities that share the same parent, in this case vendors.  A company may have multiple contracts with a single vendor and, similarly, may have issued multiple payments to that same vendor, but with no direct link between the two.

As a common data store with self-describing records, the Endeca Server can bring this contract and payment data together as two record types.  The resulting data model would likely look like the following:

[Figure: the resulting Endeca data model, with contracts and payments record types]

The two record types in the Endeca Server can clearly be seen in the figure above, as can where they overlap and share attribution. (Note: For simplicity's sake, only some of the entities' attribution is shown.) It is no surprise that they both share attribution from their common parent, vendor (e.g. attribute=vendor_num).

The rub.

Even under this multiple “record type” data model, pivoting remains simple if the end user of the application refines by one of the shared attributes (e.g. Vendor_Num, Department, or a range filter on Amount).  In this case, both of the record types, contracts and payments, refine in lockstep and the application can still easily summon metrics around both contracts and payments (e.g. How many contracts and payments were in place in 2011 by the IT department?).

The OEID pivoting challenge is truly encountered when the end user of the application chooses to refine by one of the attributes that is not shared. By design, the Endeca Server will only return those records that carry the attribute value selected by the end user. Thus, if the end user wanted to see all records where the purchase order description was "DEMOLITION", only contract records would be returned, since the "PO_Desc" attribute exists only on contract records. This refinement would wipe out the payments records altogether, offering the end user no means to pivot to the payments associated with those "DEMOLITION" contracts.
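The wipe-out behavior is easy to see with a toy model of the two record types. The records below are hypothetical, but mirror the contracts/payments example (only contracts carry PO_Desc):

```python
# Two record types in one store; PO_Desc exists only on contract records.
records = [
    {"record_type": "contracts", "Vendor_Num": 7, "PO_Desc": "DEMOLITION"},
    {"record_type": "contracts", "Vendor_Num": 8, "PO_Desc": "PAVING"},
    {"record_type": "payments",  "Vendor_Num": 7, "Amount": 5000},
]

# A naive refinement keeps only records carrying the selected value...
naive = [r for r in records if r.get("PO_Desc") == "DEMOLITION"]

# ...so every payments record disappears along with non-matching contracts.
print([r["record_type"] for r in naive])  # ['contracts']
```

The payment to vendor 7 vanishes even though it is clearly related to the surviving "DEMOLITION" contract, which is exactly the lost-pivot problem described above.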

This problem surfaces on enough OEID implementations that experienced Endeca consultants have taken to referring to shared attributes as "safe" and unshared attributes as "unsafe".

There’s hope.

Ranzal has developed a solution to this problem. Our own Patrick Rafferty has built the SmartStateManager, which addresses this challenge specifically. The solution is elegant because StateManagers are an expected extension point of the OEID Studio product.

The Ranzal SmartStateManager can be configured to address this problem in two ways:

1) Avoid filtering unsafe/unshared attributes.  

The Ranzal SmartStateManager can be configured to interrogate the Endeca Server to discern which attributes belong to which record types. Thus, it knows not to apply filters to record types that do not carry that attribution. In the example above, the SmartStateManager could be configured to keep payments in context when a contracts-specific attribute like "PO_Desc:DEMOLITION" is used as the filter.
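The core idea of this first mode can be sketched as follows. This is not Ranzal's actual implementation (which runs inside Studio as a StateManager); the attribute-to-record-type map here is hard-coded, standing in for what the SmartStateManager discovers by interrogating the Endeca Server:

```python
# Attribute-to-record-type map, as if interrogated from the Endeca Server.
attrs_by_type = {
    "contracts": {"Vendor_Num", "PO_Desc"},
    "payments":  {"Vendor_Num", "Amount"},
}

records = [
    {"record_type": "contracts", "Vendor_Num": 7, "PO_Desc": "DEMOLITION"},
    {"record_type": "contracts", "Vendor_Num": 8, "PO_Desc": "PAVING"},
    {"record_type": "payments",  "Vendor_Num": 7, "Amount": 5000},
]

def smart_filter(records, attr, value):
    """Apply the filter only to record types that actually carry the
    attribute; leave other record types in context untouched."""
    return [
        r for r in records
        if attr not in attrs_by_type[r["record_type"]] or r.get(attr) == value
    ]

result = smart_filter(records, "PO_Desc", "DEMOLITION")
print([r["record_type"] for r in result])  # ['contracts', 'payments']
```

Unlike the naive refinement, the payments record stays in context because its record type never carried PO_Desc in the first place.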

2) Allow for a key to be designated between the sibling entities.  

In our example above, we could also use the Ranzal SmartStateManager to define a shared key between contracts and payments, such as "Vendor Name". Thus, when the end user decides to filter down to contracts with a purchase order description of "DEMOLITION", the Ranzal SmartStateManager subsequently filters the associated payments by the collection of "Vendor Name" values still in context.
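This second mode, key propagation between sibling record types, can be sketched the same way. Again, this is an illustrative model rather than the product code, with hypothetical records:

```python
records = [
    {"record_type": "contracts", "Vendor_Name": "Acme",   "PO_Desc": "DEMOLITION"},
    {"record_type": "contracts", "Vendor_Name": "Globex", "PO_Desc": "PAVING"},
    {"record_type": "payments",  "Vendor_Name": "Acme",   "Amount": 5000},
    {"record_type": "payments",  "Vendor_Name": "Globex", "Amount": 900},
]

def filter_with_key(records, attr, value, key="Vendor_Name"):
    # Filter the record type that carries the attribute directly...
    matching = [r for r in records if r.get(attr) == value]
    # ...then propagate the shared key values to the sibling record type.
    keys_in_context = {r[key] for r in matching}
    siblings = [r for r in records
                if attr not in r and r[key] in keys_in_context]
    return matching + siblings

result = filter_with_key(records, "PO_Desc", "DEMOLITION")
print(sorted(r["record_type"] for r in result))  # ['contracts', 'payments']
```

Only the payment to Acme survives, because the "DEMOLITION" refinement left only Acme's vendor name in context; Globex's payment is correctly filtered out along with its contract.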

To fully illustrate the problem and the solution, per the example above, I downloaded some publicly accessible contract and payment data from the City of Chicago.

As you can see from the screen captures below, this commonly encountered issue is easily addressed, allowing for new insights in the OEID product through sophisticated pivoting that would otherwise not be available.

Without Ranzal’s SmartStateManager:

[Screen capture: refinement without Ranzal's SmartStateManager]

With Ranzal’s SmartStateManager:

[Screen capture: refinement with Ranzal's SmartStateManager]

If you’re interested to know more about Ranzal’s SmartStateManager or how Ranzal can help with your OEID initiatives, please visit us at ranzal.com.

 

Endeca Latitude Google Maps (with Heat Mapping!)

Introducing Ranzal’s SmartMap…

[Figure: SmartMap heat map visualization]

Updated: 10/18/2012 3:16 PM

Following Oracle’s acquisition of Endeca Technologies earlier this year, a number of modifications were made to the Discovery Framework/Latitude Studio product to align with Oracle’s corporate vision.  Unfortunately, one of those changes was to pull Google’s “Maps” technology out of the product and replace it with an alternative Oracle home-grown mapping technology.

For people like us who love the Google Maps experience and the functionality it offers, this was a setback.  Regardless of your position on whose maps provide a better experience, there were some key pieces of functionality left out during the transition.  One of those was the ability to show “heatmaps” in your application as an indicator of intensity or density in your visualization, something Endeca customers have seen in the past and are currently using.

With some hard work over the past few weeks Ranzal, leveraging the Oracle Endeca Information Discovery SDK, has developed SmartMap to bring this functionality back to the Endeca Agile BI community.  As you can see below, we’ve built a visualization that gives the same user experience as the current product version (results lists, highlighting, etc.) but with the added flexibility of Google Maps and heat mapping:

The visualization at the top of this post was built using a dataset from the City of Chicago’s Department of Public Health containing a list of Safety Inspections of restaurants and schools dating back to January 2010.  Using Endeca’s Guided Navigation technology, as a user, I can navigate down to show “only Schools”:

[Screen capture: map refined to show only schools]

I can pinch and refine my map to a certain area:

[Screen capture: map pinched and refined to a specific area]

And I can get details on the type of issues that were encountered during the Health Inspection:

[Screen capture: details of health inspection issues]

As you can see, the SmartMap gives you a boatload of options when presenting data to your users and provides yet another facet to use when exploring your application. This is a simple example, but we could easily have leveraged unstructured search over the Inspection Comments, range filtering by Inspection Score, and any of the other great functionality that Oracle Endeca provides.

We’re working on adding even more to the SmartMap such as exact address search (show me everything 3 miles from an exact address), heat maps over routes (not just places) and a few other bells and whistles.

Update: And on the subject of adding functionality to the map, we went ahead and enabled StreetView to give you another angle on what your data’s telling you:

[Screen capture: StreetView of a mapped location]

 

Debugging Discovery Framework / Oracle Endeca Studio in Eclipse

Most, if not all, implementations of Oracle’s Endeca Discovery platform have some kind of custom portlet or supported extension that assists your users in the Discovery process.  Maybe you have a custom security manager that integrates a user’s LDAP groups at query time to restrict access to certain records.  Or maybe a custom charting portlet to include a chart type that you don’t get from the base software.

When your system goes live, you (typically) get all the source code for your customizations and you can maintain, modify and improve your component going forward.

That's great, but a custom portlet can be an extremely complex piece of code that can be difficult for even the most seasoned Java developer to pick up, dust off, and improve. At Ranzal, we find it incredibly valuable to be able to "step through" the code and understand what exactly is going on before we go off and make changes or upgrades. So, here's a primer on how we set up our local Eclipse instance in debug mode to enable this process:
