OEID 3.0 First Look – Update/Delete Data Improvements

For almost a decade, the core Endeca MDEX engine that underpins Oracle Endeca Information Discovery (OEID) has supported one-time indexing (often referred to as a Baseline Update) as well as incremental updates (often referred to as partials).  Through all of the incarnations of this functionality, from “partial update pipelines” to “continuous query”, one limitation was constant: your update operations could only act on one record at a time.

If you come from a SQL/RDBMS background, this was a huge limitation and forced a conceptual change in the way you think about data.  Obviously, Endeca is not (and never was) a relational system, but the absence of the freedom SQL provides — to update data whenever and wherever you please — was often a pretty big limitation, especially at scale.  Building an index nightly for 100,000 e-commerce products is no big deal.  Running a daily process to feed 1 million updated records into a 30-million-record Endeca Server instance just so that a set of warranty claims can be “aged” from current month to prior month is something completely different.

Thankfully, with the release of the latest set of components for the ETL layer of OEID (called OEID Integrator), huge changes have been made to the ways you can modify an Endeca Server instance (now called a “Data Domain”).  If you’ve longed for a “SQL-style experience” where records can be updated or deleted from a data store by almost any criteria imaginable, OEID Integrator v3.0 delivers.
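To make the contrast concrete, here is the kind of set-based statement a SQL developer would reach for in the warranty-claim aging scenario above.  This is purely illustrative SQL against a hypothetical warranty_claims table — it is not OEID Integrator syntax — but it captures the style of criteria-based modification that v3.0 now makes practical:

-- Hypothetical example: age last month's claims in one set-based operation
UPDATE warranty_claims
SET    reporting_period = 'Prior Month'
WHERE  reporting_period = 'Current Month'
AND    claim_date < TRUNC(SYSDATE, 'MM');   -- everything dated before the 1st of this month

Prior to v3.0, achieving this effect in an Endeca Server meant identifying and re-feeding every affected record individually.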


OEID Incremental Updates

A fairly common approach…

More often than not, when pulling data from a database into OEID, we need to employ incremental updates.  To do so, we need a way to identify which records have been added, updated, or deleted since our last load.  This change identification is commonly referred to as change data capture, or CDC.  There is no single way to accomplish CDC; the best approach is usually dictated by the mechanisms in place in the source system, and more often than not the database we’re pulling from isn’t leveraging any explicit CDC mechanism.

Note: If you’re pulling from text files and new records are being appended, take a look at the incremental reading feature of the UniversalDataReader component (pg. 268 of the Integrator Designer guide): http://docs.oracle.com/cd/E29805_01/integrator.230/DataIntegratorDesigner.pdf.

If you’re pulling from a database and don’t have explicit database CDC features enabled, best practice usually dictates that you create an “audit” or “execution_history” table to track previous full and incremental loads.  This “audit” table simply records the date and time a load started and, if it completed successfully, the date and time it ended.  You would INSERT into this table before calling your incremental load graph in Integrator.  Then, when reading your table (or, better yet, a denormalized view), you can issue your SQL SELECT statement with a few extra WHERE conditions that leverage a “last_update_date” column in your view, like so:

SELECT * FROM <View>
-- Changed on or after the start of the last successful load...
WHERE view_last_update_date >= (SELECT MAX(run_start_date) FROM audit_table WHERE run_status = 'Complete')
-- ...but before the start of the current (in-flight) run.
AND view_last_update_date < (SELECT MAX(run_start_date) FROM audit_table)


Once this incremental load graph completes, you’d update your audit table row with the run’s end datetime and set its status to “Complete”.
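As a rough sketch, the audit table and its surrounding bookkeeping might look like the following.  All table, column, and sequence names here are illustrative — OEID does not prescribe any of them:

-- One row per load attempt; names are hypothetical
CREATE TABLE audit_table (
    run_id          NUMBER PRIMARY KEY,
    run_start_date  DATE NOT NULL,
    run_end_date    DATE,
    run_status      VARCHAR2(20)          -- e.g. 'Running' or 'Complete'
);

-- Step 1: before calling the incremental load graph
INSERT INTO audit_table (run_id, run_start_date, run_status)
VALUES (audit_seq.NEXTVAL, SYSDATE, 'Running');

-- Step 2: after the graph completes successfully
UPDATE audit_table
SET    run_end_date = SYSDATE,
       run_status   = 'Complete'
WHERE  run_id = :current_run_id;

Because the in-flight run’s row is inserted before the graph executes, the two subqueries in the SELECT above neatly bound the window: everything changed since the last successful load began, but before the current run began.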

Pivoting in Endeca

The impetus for Ranzal’s SmartStateManager

The prologue.

Pivoting across entities in your organization’s data is a central feature of any data discovery application.  For example, understanding what parts in your organization’s supply chain have the highest number of quality issues, and then pivoting to the discrete list of suppliers that provide those parts is a powerful, yet expected, data discovery capability.

Those familiar with Endeca’s approach to data modeling are well-versed in data “denormalization” — joining the entities in the data together upon ingest.  This denormalization process makes for wider, fully-attributed records, which allow for the very fast information access the Endeca Server offers.   As an example, in this denormalized data model, our “part” record in Endeca is fully attributed with the information about the supplier who supplies it and the technical attributes that describe it.

This denormalized approach is straightforward when consuming data from the star schemas in a data warehouse.  The “fact” drives the grain of our Endeca records, and all of the surrounding dimension tables can simply be joined to the fact to make our records wider and further attributed.  With data combined in this fashion, pivoting in Endeca is as easy as filtering the records by attributes from one dimension table (say, “product_type:bikes”) and aggregating by attributes from another dimension table (say, “GROUP BY supplier_name”).
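Expressed as a SQL analogy against a single wide, denormalized table (hypothetical names; Endeca queries are not actually written as SQL), the pivot looks something like this:

SELECT supplier_name,
       COUNT(*) AS record_count
FROM   denormalized_records              -- one wide record per fact row
WHERE  product_type = 'bikes'            -- filter by one dimension's attribute
GROUP BY supplier_name;                  -- aggregate by another dimension's attribute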

The challenge.

Pivoting in Endeca becomes challenging when the data to be consumed does not join together nicely into one “record type”.

(Note: A record type is nothing more than an additional attribute that we tag our Endeca records with to help our front-end queries distinguish between different types of data.  Adding such an attribute (e.g. “record_type:sales”) allows us to classify and summon certain types of records comprised of different entities at different grains for different purposes.)

To those who haven’t spent large amounts of time modeling data in Endeca, the “record type” concept can be confusing.  It is best illustrated by example.  Take the following logical data model…

[Figure: logical data model of vendors, contracts, and payments]

In this logical data model, we have three entities: vendors, contracts, and payments.  In one system, a company establishes contracts (or purchase orders) with vendors to acquire their goods and/or services.  In a separate system, said company issues payments to those vendors once the company has been invoiced.  Under this design, there is no way to join contracts and payments together.  They are “sibling” entities that share the same parent, in this case vendors.  A company may have multiple contracts with a single vendor and, similarly, may have issued multiple payments to that same vendor, but with no direct link between the two.

As a common data store with self-describing records, the Endeca Server can bring this contract and payment data together as two record types.  The resulting data model would likely look like the following:

[Figure: Endeca data model showing the contract and payment record types and their shared vendor attribution]

The two record types in the Endeca Server can clearly be seen in the figure above, as can where they overlap and share attribution.  (Note: For simplicity’s sake, only some of the entities’ attribution is shown.)  It is no surprise that both share attribution from their common parent, vendor (e.g. attribute=vendor_num).

The rub.

Even under this multiple “record type” data model, pivoting remains simple if the end user of the application refines by one of the shared attributes (e.g. Vendor_Num, Department, or a range filter on Amount).  In this case, both of the record types, contracts and payments, refine in lockstep and the application can still easily summon metrics around both contracts and payments (e.g. How many contracts and payments were in place in 2011 by the IT department?).

The OEID pivoting challenge is truly encountered when the end user of the application chooses to refine by one of the attributes that is not shared.  By design, the Endeca Server will only return those records that carry the attribute value selected by the end user.  Thus, if the end user wanted to see all records where the purchase order description was “DEMOLITION”, only contract records would be returned, since the “PO_Desc” attribute only exists on our contracts records.  This refinement would wipe out our payments records altogether, offering the end user no means to pivot to the payments associated with those “DEMOLITION” contracts.
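Leaning on the SQL analogy once more (hypothetical table and column names), the behavior is equivalent to the following:

SELECT record_type, COUNT(*)
FROM   endeca_records
WHERE  po_desc = 'DEMOLITION'    -- attribute that exists only on contract records
GROUP BY record_type;
-- Only record_type = 'contracts' comes back; payment records carry no
-- po_desc value, so the filter removes them from the result set entirely.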

This problem surfaces on enough OEID implementations that experienced Endeca consultants have come to refer to shared attributes as “safe” and unshared attributes as “unsafe”.

There’s hope.

Ranzal has developed a solution to this problem: our own SmartStateManager, built by Ranzal’s Patrick Rafferty, addresses this challenge specifically.  The solution is elegant because StateManagers are an expected extension point of the OEID Studio product.

The Ranzal SmartStateManager can be configured to address this problem in two ways:

1) Avoid filtering unsafe/unshared attributes.  

The Ranzal SmartStateManager can be configured to interrogate the Endeca Server to discern which attributes belong to which record types.  It then knows not to apply filters to record types that do not carry the attribute in question.  In the example above, the SmartStateManager could be configured to keep payments in context when a contracts-specific attribute like “PO_Desc:DEMOLITION” is used as the filter.

2) Allow for a key to be designated between the sibling entities.  

In our example above, we could also use the Ranzal SmartStateManager to define a shared key between contracts and payments, such as “Vendor Name”.  Thus, when the end user decides to filter down to contracts with a purchase order description of “DEMOLITION”, the Ranzal SmartStateManager subsequently filters down the associated payments by the collection of “Vendor Name” values still in context.
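In SQL terms (again only an analogy with hypothetical names — the SmartStateManager is a Studio extension, not a SQL layer), this second mode behaves like a semi-join through the shared key:

SELECT *
FROM   payments
WHERE  vendor_name IN (
           SELECT vendor_name           -- the designated shared key
           FROM   contracts
           WHERE  po_desc = 'DEMOLITION'
       );
-- Payments remain in context, narrowed to the vendors whose
-- 'DEMOLITION' contracts survived the refinement.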

To fully illustrate the problem and the solution, per the example above, I downloaded some publicly accessible contract and payment data from the City of Chicago.

As you can see from the screen captures below, this commonly encountered issue is easily addressed, opening up insights in the OEID product through sophisticated pivoting that would otherwise not be available.

Without Ranzal’s SmartStateManager:

[Screenshot: refinement without Ranzal’s SmartStateManager]

With Ranzal’s SmartStateManager:

[Screenshot: refinement with Ranzal’s SmartStateManager]

If you’re interested to know more about Ranzal’s SmartStateManager or how Ranzal can help with your OEID initiatives, please visit us at ranzal.com.