Imply has announced a new capability that makes Druid the first analytics database to combine the performance of a strongly typed data structure with the flexibility of a schemaless one. Schema auto-discovery, now available in Druid 26.0, enables Druid to automatically discover data fields and data types and to update tables to match changing data without an administrator.
Shuffle joins

The second major feature is the expansion of Druid's architecture to support large, complex joins at ingestion via shuffle joins. Previous join capabilities were limited to preserve high CPU efficiency for query performance, so large tables had to be pre-joined in the data pipeline by other systems such as Spark.

David Wang, VP of Product, Imply
David Wang, VP of Product at Imply, takes our questions on the newly announced enhancements.
Can you provide examples of how companies are using Imply's solutions and what value and benefits they have experienced?
Apache Druid is a popular open-source database for real-time analytics applications. Developers at thousands of companies choose this database for its performance at scale and under load, along with its comprehensive features for analyzing streaming data. Druid is the database of choice for analytics use cases including operational visibility of real-time events, rapid data exploration, customer-facing analytics and real-time decisioning.
Can you provide more details about the new features introduced in Milestone 3 of Project Shapeshift, such as schema auto-discovery and shuffle joins?
Schema auto-discovery
Schema definition plays an essential role in query performance, as a strongly typed data structure makes it possible to columnarize, index and optimize compression.

Defining the schema when loading data places an operational burden on engineering teams, especially with ever-changing event data flowing through Apache Kafka and Amazon Kinesis. Databases such as MongoDB use a schemaless data structure because it gives developers flexibility and ease of ingestion, but at a cost to query performance.
Druid has enhanced its ingestion capabilities to support large joins architecturally via shuffle joins. This simplifies data preparation, minimizes reliance on external tools and adds to Druid's capabilities for in-database data transformation.
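As an illustrative sketch of what an ingestion-time join looks like, the following uses Druid's SQL-based ingestion to write a pre-joined datasource; the table and column names here are hypothetical, not taken from the article:

```sql
-- Hypothetical example: join a fact table with a dimension table at
-- ingestion time, so queries run against a single enriched datasource.
REPLACE INTO "orders_enriched" OVERWRITE ALL
SELECT
  o."__time",
  o."order_id",
  o."amount",
  c."customer_name",
  c."region"
FROM "orders" o
JOIN "customers" c ON o."customer_id" = c."customer_id"
PARTITIONED BY DAY
```

Because the join is shuffled across ingestion tasks rather than executed at query time, analysts querying "orders_enriched" pay no per-query join cost.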
How does the automatic schema discovery feature in Druid 26.0 benefit developers and address the challenges of defining schemas when loading data?
Developers rely on a strongly typed data format because of the query performance advantages that defined types per column provide in terms of query optimization, columnarization, compression and more.

But the definition of that schema has to happen before data is loaded, commonly referred to as schema-on-write, and as source data changes it becomes a nightmare for engineering teams to manage.

Druid now uniquely fixes this challenge with schema auto-discovery. Druid continues to use a strongly typed data format for its performance benefits, but the definition of the schema is now (optionally) completely automated.

Druid can auto-discover column names and data types as data is ingested, even as the data source changes, and store the data type for each dimension's column with all of Druid's segment optimizations, all automatically. This means developers get the flexibility and ease of a schemaless data format without any performance impact.
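In practice, opting into this behavior is a matter of the ingestion spec's dimensionsSpec, where Druid 26.0 adds a `useSchemaDiscovery` flag. A minimal sketch for a Kafka ingestion spec follows; the datasource and topic names are hypothetical, and unrelated spec sections are omitted:

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "timestampSpec": { "column": "timestamp", "format": "auto" },
      "dimensionsSpec": {
        "useSchemaDiscovery": true
      }
    },
    "ioConfig": {
      "topic": "events-topic",
      "inputFormat": { "type": "json" }
    }
  }
}
```

With no dimensions listed and the flag enabled, Druid infers column names and types from the incoming records instead of requiring them up front.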
What is the significance of Druid now supporting large, complex joins during ingestion, and how does it benefit users?
76 INTELLIGENTCIO LATAM www.intelligentcio.com