Multi Data Source Joins

Overview

Knowi facilitates joins across multiple datasources to process, blend and store combined results seamlessly.

For example, if one datasource stores data about a supplier with a customer identifier, and another that stores data about the customer and billing information. Traditionally, to perform look ups from one to another requires engineering efforts along with mapping storage for the results. Knowi enables you to join across the same or disparate SQL or NoSQL databases to stitch the data back together into a single result along with the ability to store and track it.

Notes: Joins can be across disparate datasources or within the same datasource.

Joins can be setup via the UI or using the Cloud9Agent (inside your network).

Join Setup

UI

In the following example, we'll connect to a MongoDB database to query some data that includes a customer name and then perform a look up against a MySQL database to pull customer addresses.

  1. From the datasource menu on the left hand side bar, click on MongoDB. Accept the defaults (points to a MongoDB instance hosted by us).

  2. Set up a MySQL datasource and accept the defaults.

  3. Generate a Mongo Query. In this case we will use data the Query Generator to dynamically generate the queries.

    • select "sendingactivity" from the collections drop down menu.

    • select "sent", "customer" from the metrics drop down menu.

      Multi Datasource

  4. Click "preview" to view the results.

  5. Click "join".

  6. Add a new MySQL datasource from the dropdown menu.

  7. Use Query Generator to dynamically generate the queries.

  8. select "customer" from the collections drop down menu

  9. Map fields from previous queries to the current one using "join fields".

  10. Enter in "customer=customer" into the join fields box or select it from the Join Builder UI.

  11. Click "preview" to view the results.

    Multi Datasource Join

Note: To specify multiple keys use 'AND':

  customer=customer AND joinAfield=joinBField

The default join type is INNER JOIN when a join field is present. When a join field is not present, the data is blended together across the two datasets.

Other Join types supported: LEFT OUTER JOIN, FULL OUTER JOIN, RIGHT OUTER JOIN, LOOP JOIN. To specify a different join type, add the join type into the Join Fields. Example: FULL OUTER JOIN customer=customer

Preview Each Join Step: Use the Eye icon on the left hand side of the page near each of the query to preview the result of each join part separately.

Data Manipulation after Joins: You can modify/transform the data returned in powerful ways using Cloud9QL at the bottom of the page. This section can be used to apply a Cloud9QL query as the very last step of processing. Alternatively, you can also interact with the grid directly for any manipulations of the data.

Preview per join

Cloud9Agent

As an alternative to the UI based approach, joins can be specified on your agent directly.

For more details on agent setup and configuring datasource and query files, see Cloud9Agent.

Example query_xxx.json with joins:

[
    {
        "entityName":"Join Example",
        "dsName":"demoMongo",
        "queryStr":"db.pagehits.find({lastAccessTime: { $exists: true}}).limit(10)",
        "c9SQLFilter":"select date(lastAccessTime) as date, count(*) as hits_count group by date(lastAccessTime)",
        "joinQueries": [
            {
                "queryStr":"db.pagehits.find({lastAccessTime: { $exists: true}}).limit(10)",
                "c9SQLFilter":"select date(lastAccessTime) as date, sum(hits) as hits_total group by date(lastAccessTime)",
                "dsName":"demoMongo",
                "joinKeys": { "date": "date" }
            }
        ],
        "overrideVals":{
            "replaceAll":true
        }
    }
]

Join Types

Inner & Outer Joins

Knowi supports the following Inner & Outer Join types: INNER JOIN, LEFT OUTER JOIN, FULL OUTER JOIN, RIGHT OUTER JOIN. The default join type is INNER JOIN when a join field is present.

Blending

When Join keys are not specified, data across the datasets will be blended together.

Loop Joins

Loop Join is a special join type where results that implements a for..loop join where the first query results is fed into the second query one row at a time based on token(s) defined by the join condition. For each row of the first query's result, the second query will be executed and its result will be combined with the current row.

For example, consider 2 datasets:

 id | user                      user | sex
 ---------                      ----------
  1 | xxx                       xxx  |  M
  2 | yyy                       yyy  |  F

First query:

 select *

Second query:

 select * where user = USER_TOKEN

Now the LOOP JOIN condition can be defined as:

 LOOP JOIN user = USER_TOKEN

With the above definition, the execution will be as follows:

  1. Execute the first query which gives 2 rows as result.
  2. For each row of the above result, execute the second query with USER_TOKEN token replaced with actual value of user field of that row and join with it.

     select * where user = xxx 
     select * where user = yyy
    

The final result of the join will be:

 id | user | sex
 1 | xxx  |  M
 2 | yyy  |  F

Note: For REST API datasources, the any tokens defined in the end point or headers will be replaced.

Joins on Large Datasets

Knowi is architecturally designed to cater large scale joins across millions of records over disparate sources.

Joins are performed on the results of the two queries, where the join is performed in memory within Knowi, with a disk swap for larger datasets (with multi-threading and partitioning where applicable).
Note that data will be transferred over the network (into Knowi for cloud based modes, or on the agent if using an agent) for the results of each query before the join is performed. For larger dataset joins that spans millions of records, use the agent.

For more insights into Knowi's Join processing technology for large scale joins and benchmarks, contact your technical account manager.

Performance Optimizations

For Join processing on large scale datasets, you can prefix the Join Type with a set of predefined "Hints" to help determine the algorithms and optimizations for faster processing.

Prefix Description Example
SORTED Both sides of input data are sorted based on join keys. This will execute merge join but by-pass sorting SORTED INNER JOIN a = b
LEFT SORTED Left side of input data is sorted. This will first sort the right side of input data then merge join LEFT SORTED INNER JOIN a = b
RIGHT SORTED Right side of input data is sorted. This will first sort the left side of input data then merge join RIGHT SORTED INNER JOIN a = b
HASH LEFT Hash join will be used by hashing the left side data then join with the right side data HASH LEFT INNER JOIN a = b
HASH RIGHT Hash join will be used by hashing the right side data then join with the left side data HASH RIGHT INNER JOIN a = b
MULTI HASH LEFT Same as HASH LEFT but the join will be perform by multi-threads assuming multi-threads join is enable (see note below) MULTI HASH LEFT INNER JOIN a = b
MULTI HASH RIGHT Same as HASH RIGHT but the join will be perform by multi-threads assuming multi-threads join is enable (see note below) MULTI HASH RIGHT INNER JOIN a = b

Note: On on-prem build, to enable multi-threads join, change the following property inside cloud9.properties to the desired number of join-threads. The more threads you specify here the more memory you will need.

  queryJoinThreads=10