Multi Data Source Joins

Overview

Knowi facilitates joins across multiple datasources to process, blend and store combined results seamlessly.

For example, if one datasource stores data about a supplier with a customer identifier, and another one stores the data about the customer and billing information, then in such a case, performing the lookups from one to another would require engineering efforts along with mapping storage for the results.

Knowi enables you to join across the same or disparate SQL or NoSQL databases to stitch the data back together into a single result along with the ability to store and track it.

Notes: Joins can be set up across disparate datasources or within the same datasource.

Joins can be set up via the UI or using the Cloud9Agent (inside your network).

Join Setup

UI

In the following example, we'll connect to a MongoDB database to Query some data that includes a customer name and then perform a lookup against a MySQL database to pull customer addresses.

Step 1: Select the Queries from the left sidebar menu, click on the NEW DATASOURCE button at the top right corner of the interface, and select MongoDB as a datasource.

Accept the default configurations and click on Save (points to a MongoDB instance hosted by us).

Step 2: Repeat the above step but select MySQL as a datasource.

Accept the default configurations and click on Save

Step 3: Create a Query on MongoDB datasource via Visual Builder.

Select Queries from the left sidebar, click on the NEW QUERY button, and select MongoDB as a datasource (created in the Step 1 above).

While generating the Query, select the following:

  • select "sendingActivity" from the collections drop-down menu.

  • select "sent" from the metrics drop-down menu.

  • select "customer" from the metrics drop-down menu.

Here we are tracking a simple metric of Sent messages grouped by a customer.

Step 4: Click on the Preview button to view the results.

Now, for the returned results from MongoDB, let's look up the customer details, which is stored in a MySQL table called customer.

Step 5: Click on Join Source and select MySQL datasource (created in the Step 2 above) from the dropdown menu.

Step 6: Create a Query on MySQL datasource via Visual Builder.

select "customer" from the Tables drop-down menu

Step 7: Map fields from previous Queries to the current one using "join fields".

Enter "customer=customer" into the join fields box and select the Join Type using Join Builder.

Note: To specify multiple keys use 'AND': customer=customer AND joinAfield=joinBField

Step 8: Click "preview" to view the results.

The preview should now contain data from both MongoDB and MySQL that includes address and currency fields added to the MongoDB query for those customers.

Step 9: Click on Save Dataset and the dataset can now be added to a dashboard and shared easily.

The default join type is INNER JOIN when a join field is present. When a join field is not present, the data is blended together across the two datasets.

Other Join types supported: LEFT OUTER JOIN, FULL OUTER JOIN, RIGHT OUTER JOIN, LOOP JOIN.

To specify a different join type, you can proceed with any of the following two options:

  • Add the join type into the Join Fields. Example: FULL OUTER JOIN customer=customer

  • Click on Join Builder and make the selection of join type from the dropdown menu.

Data Manipulation after Joins: You can modify/transform the data returned in powerful ways using Cloud9QL at the bottom of the page. This section can be used to apply a Cloud9QL query as the very last step of processing. Alternatively, you can also interact with the grid directly for any manipulations of the data.

Cloud9Agent

As an alternative to the UI based approach, joins can be specified on your agent directly.

For more details on agent setup and configuring datasource and query files, see Cloud9Agent.

Example query_xxx.json with joins:

[
    {
        "entityName":"Join Example",
        "dsName":"demoMongo",
        "queryStr":"db.pagehits.find({lastAccessTime: { $exists: true}}).limit(10)",
        "c9SQLFilter":"select date(lastAccessTime) as date, count(*) as hits_count group by date(lastAccessTime)",
        "joinQueries": [
            {
                "queryStr":"db.pagehits.find({lastAccessTime: { $exists: true}}).limit(10)",
                "c9SQLFilter":"select date(lastAccessTime) as date, sum(hits) as hits_total group by date(lastAccessTime)",
                "dsName":"demoMongo",
                "joinKeys": { "date": "date" }
            }
        ],
        "overrideVals":{
            "replaceAll":true
        }
    }
]

Join Types

Inner & Outer Joins

Knowi supports the following Inner & Outer Join types: INNER JOIN, LEFT OUTER JOIN, FULL OUTER JOIN, RIGHT OUTER JOIN. The default join type is INNER JOIN when a join field is present.

Blending

When Join keys are not specified, data across the datasets will be blended together.

Loop Joins

Loop Join is a special join type where the first query result is fed into the second query one row at a time based on token(s) defined by the join condition. For each row of the first query's result, the second query will be executed and its result will be combined with the current row.

For example, consider 2 datasets:

 id | user                      user | sex
 ---------                      ----------
  1 | xxx                       xxx  |  M
  2 | yyy                       yyy  |  F

First query:

 select *

Second query:

 select * where user = USER_TOKEN

Now the LOOP JOIN condition can be defined as:

 LOOP JOIN user = USER_TOKEN

With the above definition, the execution will be as follows:

  1. Execute the first query which gives 2 rows as result.

For each row of the above result, execute the second query with the USER_TOKEN token replaced with the actual value of the user field of that row and join with it.

     select * where user = xxx 
     select * where user = yyy
  1. The final result of the join will be:

    id | user | sex 1 | xxx | M 2 | yyy | F

Note: For REST API datasources, any tokens defined in the end point or headers will be replaced.

Joins on Large Datasets

Knowi is architecturally designed to cater large scale joins across millions of records over disparate sources.

Joins are performed on the results of the two queries, where the join is performed in memory within Knowi, with a disk swap for larger datasets (with multi-threading and partitioning where applicable).

Note that data will be transferred over the network (into Knowi for cloud based modes, or on the agent if using an agent) for the results of each query before the join is performed. For larger dataset joins that spans millions of records, use the agent.

For more insights into Knowi's Join processing technology for large scale joins and benchmarks, contact your technical account manager.

Performance Optimizations

For Join processing on large scale datasets, you can prefix the Join Type with a set of predefined "Hints" to help determine the algorithms and optimizations for faster processing.

Prefix Description Example
SORTED Both sides of input data are sorted based on join keys. This will execute merge join but by-pass sorting SORTED INNER JOIN a = b
LEFT SORTED Left side of input data is sorted. This will first sort the right side of input data then merge join LEFT SORTED INNER JOIN a = b
RIGHT SORTED Right side of input data is sorted. This will first sort the left side of input data then merge join RIGHT SORTED INNER JOIN a = b
HASH LEFT Hash join will be used by hashing the left side data then join with the right side data HASH LEFT INNER JOIN a = b
HASH RIGHT Hash join will be used by hashing the right side data then join with the left side data HASH RIGHT INNER JOIN a = b
MULTI HASH LEFT Same as HASH LEFT but the join will be perform by multi-threads assuming multi-threads join is enable (see note below) MULTI HASH LEFT INNER JOIN a = b
MULTI HASH RIGHT Same as HASH RIGHT but the join will be perform by multi-threads assuming multi-threads join is enable (see note below) MULTI HASH RIGHT INNER JOIN a = b

Note: On on-prem build, to enable multi-threads join, change the following property inside cloud9.properties to the desired number of join-threads. The more threads you specify here the more memory you will need.

  queryJoinThreads=10