Interactive real-time dashboards on data streams: Nishant Bangarwa, Hortonworks
When interacting with analytics dashboards, two key requirements for a smooth user experience are quick response time and data freshness. To meet the requirements of creating fast, interactive BI dashboards over streaming data, organizations often struggle with selecting a proper serving layer. Cluster computing frameworks such as Hadoop or Spark work well for storing large volumes of data, although they are not optimized for making it available for queries in real time. Long query latencies also make these systems suboptimal choices for powering interactive dashboards and BI use cases. This talk presents an open source real-time data analytics stack using Apache Kafka, Druid, and Superset. The stack combines the low-latency streaming and processing capabilities of Kafka with Druid, which enables immediate exploration and provides low-latency queries over the ingested data streams. Superset provides the visualization and dashboarding layer and integrates nicely with Druid. In this talk we will discuss why this architecture is well suited to interactive applications over streaming data, present an end-to-end demo of the complete stack, discuss its key features, and discuss performance characteristics from real-world use cases. Slides and more information: https://fifthelephant.talkfunnel.com/2017/73-interactive-realtime-dashboards-on-data-streams-us
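For readers who want to poke at the serving-layer side of the stack described above, here is a minimal sketch of querying a Druid datasource through its SQL HTTP endpoint from Python. The broker address, the web_events datasource, and the column names are hypothetical placeholders, and the endpoint requires a Druid version with SQL support enabled; this is not code from the talk.

    # Minimal sketch (assumed broker, datasource, and columns): count events per
    # minute for the last hour via Druid's SQL HTTP API.
    import requests

    DRUID_SQL_URL = "http://localhost:8082/druid/v2/sql/"

    query = """
    SELECT TIME_FLOOR(__time, 'PT1M') AS minute, COUNT(*) AS events
    FROM web_events
    WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
    GROUP BY 1
    ORDER BY 1
    """

    resp = requests.post(DRUID_SQL_URL, json={"query": query})
    resp.raise_for_status()
    for row in resp.json():
        print(row["minute"], row["events"])

Superset (or any HTTP client) can sit on top of the same broker, which is what keeps dashboard queries interactive while Kafka keeps the data fresh.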
How Superset and Druid Power Real-Time Analytics at Airbnb
Recorded at DataEngConf SF17 in April 2017. Real-time analytics at Airbnb is fluid and straightforward. This talk describes the solutions developed using Thrift, Kafka, Spark Streaming, Druid, and Superset. Along the way, we will take a full tour of Superset and see how it surpassed Tableau to become the most popular analytics tool at Airbnb. Superset is an open source, enterprise-ready data exploration, visualization, and dashboarding web application that integrates nicely with Druid as well as any SQL-speaking database. Attendees will learn how real-time analytics works at Airbnb while getting an extensive demo of Superset.
Druid: Sub-Second OLAP Queries Over Petabytes of Streaming Data
Nishant Bangarwa, Software Engineer, Hortonworks
Building an Open Source Streaming Analytics Stack with Kafka and Druid - Fangjin Yang
Building an Open Source Streaming Analytics Stack with Kafka and Druid - Fangjin Yang The maturation and development of open source technologies have made it easier than ever for companies to derive insights from vast quantities of data. In this session, we will cover how to build a streaming analytics stack using Kafka and Druid. This combination of technologies can power a robust data pipeline that supports real-time and batch ingestion, and flexible, low-latency queries. Analytics pipelines running purely on batch processing systems can suffer from hours of data lag. Initial attempts to solve this problem often lead to inflexible solutions, where the queries must be known ahead of time, or fragile solutions where the integrity of the data cannot be assured. Combining Kafka and Druid can guarantee system availability, maintain data integrity, and support fast and flexible queries. About Fangjin Yang Fangjin is one of the main committers to the open source Druid project and a co-founder of Imply, a San Francisco technology company. Fangjin previously held senior engineering positions at Metamarkets and Cisco. He holds a BASc in Electrical Engineering and an MASc in Computer Engineering from the University of Waterloo, Canada.
How to Use Scala on Hadoop by Adam Ilardi
Adam Ilardi, a data scientist at eBay, will talk about their implementations of Hadoop jobs using Scala. He'll walk us through their transition from Pig and ...
#12263 youtube 00:58:39
Hadoop Tutorial: Analyzing Sensor Data
This video explores how to use Hadoop and the Hortonworks Data Platform to analyze sensor data from heating, ventilation and air conditioning data to maintai...
#33144 youtube 00:06:44
Hive Druid Part 2
Different Querying Engines on Hadoop | Hadoop Tutorial Videos | Mr. Srinivas
Different Querying Engines on Hadoop For more updates on courses and tips follow us on: Facebook: https://www.facebook.com/NareshIT Twitter: https://twitter.com/nareshitech Google+: https://plus.google.com/+NareshIT For Registration : https://goo.gl/r6kJbB Call: India- 8179191999, USA- 404-232-9879 Email: email@example.com
Druid: Interactive Queries Meet Real-Time Data - Eric Tschetter and Danny Yuan
On-the-fly aggregation with human-time (or "interactive") queries against fresh, at-the-moment data represents a growing trend. Many newly announced systems ...
#40807 youtube 00:44:11
PLOTCON 2017: Maxime Beauchemin, Superset: An open source data exploration platform
Airbnb developed Superset to provide all employees with interactive access to data while minimizing friction. Superset provides a quick way to intuitively visualize datasets by allowing users to create and share interactive dashboards; a rich set of visualizations to analyze your data, as well as a flexible way to extend the capabilities; an extensible, high-granularity security model allowing intricate rules on who can access which features and integration with major authentication providers (database, OpenID, LDAP, OAuth, and REMOTE_USER through Flask AppBuilder); a simple semantic layer, allowing you to control how data sources are displayed in the UI by defining which fields should show up in which drop-down and which aggregation and function (metrics) are made available to the user; and deep integration with Druid that allows for Superset to stay blazing fast while working with large, real-time datasets. Superset's main goal is to make it easy to slice, dice, and visualize data. Maxime Beauchemin explains how Superset empowers each and every employee to perform analytics at the speed of thought. Biography Maxime Beauchemin recently joined Airbnb as a data engineer developing tools to help streamline and automate data-engineering processes. He mastered his data-warehousing fundamentals at Ubisoft and was an early adopter of Hadoop/Pig while at Yahoo in 2007. More recently, at Facebook, he developed analytics-as-a-service frameworks around engagement and growth-metrics computation, anomaly detection, and cohort analysis. He’s a father of three, and in his free time, he’s a digital artist. You can read more about his projects on his blog, Digital Artifacts.
Apache Hadoop YARN: How YARN changed Hadoop from v1 to v2
Learn about the impact of Apache Hadoop YARN on Hadoop, and how it transforms Hadoop 2 into a Data Operating System.
#22922 youtube 00:13:39
Hadoop Tutorial: Analyzing Geolocation Data
This video explores how to use Hadoop and the Hortonworks Data Platform to analyze Geolocation data to show how a trucking company can analyze geolocation da...
#13179 youtube 00:07:01
Hadoop Tutorial: Analyzing Clickstream Data
This video explores how to use Hadoop and the Hortonworks Data Platform to analyze clickstream data to increase online conversions and revenue. Clickstream data is a data trail a user leaves while visiting a website. The clickstream is captured in semi-structured weblogs. Organizations use clickstream data for: path optimization, basket analysis, next product to buy analysis and allocation of web resources.
Open Enterprise Hadoop
Open Enterprise Hadoop is a new paradigm that scales with the demands of your big data applications. It is supported by a rich and growing partner ecosystem that enables enterprises to meet the unique demands of their industries. By making governance, security, and operations an integral part of the platform, Open Enterprise Hadoop opens the door to integrating with existing enterprise architectures. All of this is possible because Open Enterprise Hadoop maximizes community innovation by collaborating with developers in open source and within an open community environment.
Combine Hadoop and Elasticsearch to Get the Most of Your Big Data
In this webinar, learn how you can leverage the full power of both platforms to maximize the value of your Big Data.
#32894 youtube 00:58:49
State of Druid - Gian Merlino
December 2016 Druid meetup in San Francisco, hosted by Sift Science.
Things We Wish We Knew Before Operationalising Kafka - BigData.SG & Hadoop.SG
Speaker: Sreekanth Ramakrishnan, Data Engineering Team @ Lazada. Apache Kafka has become one of the centerpieces of most enterprise big data stacks. We will be talking about things we have learnt while getting our Kafka cluster production ready. We will also be talking about what happens in a Kafka cluster when things go wrong, data loss scenarios, and things to watch out for when you deploy Apache Kafka. Event Page: http://www.meetup.com/BigData-Hadoop-SG/events/230062826/ Produced by Engineers.SG Help us caption & translate this video! http://amara.org/v/IMB5/
Building Real Time BI Systems with Kafka, Spark & Kudu: Spark Summit East talk by Ruhollah Farchtchi
One of the key challenges in working with real-time and streaming data is that the data format used for capturing data is not necessarily the optimal format for ad hoc analytic queries. For example, Avro is a convenient and popular serialization service that is great for initially bringing data into HDFS. Avro has native integration with Flume and other tools that make it a good choice for landing data in Hadoop. But columnar file formats, such as Parquet and ORC, are much better optimized for ad hoc queries that aggregate over a large number of similar rows.
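Since the description above contrasts row-oriented landing formats with columnar query formats, a small hedged PySpark sketch of the conversion step may help; the paths, the partition column, and the Avro source are assumptions, and reading "avro" requires the spark-avro package to be on the classpath.

    # Sketch: rewrite Avro files landed in HDFS into partitioned Parquet so that
    # ad hoc aggregations scan only the columns (and partitions) they need.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("avro-to-parquet").getOrCreate()

    # Hypothetical landing path written by Flume or another collector.
    raw = spark.read.format("avro").load("hdfs:///landing/events/2017-04-01/")

    (raw
        .repartition("event_date")          # hypothetical partition column
        .write
        .mode("append")
        .partitionBy("event_date")
        .parquet("hdfs:///warehouse/events_parquet/"))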
Druid Database Presentation
NoSQL database research project: University of Bridgeport. Mentor: Prof. JeongKyu Lee
Scalable On Demand Hadoop Clusters with Docker and Mesos
Introduction to Druid by Fangjin Yang
Real-time exploratory analytics on large datasets, given at Square, sponsored by Hakka Labs.
#40806 youtube 00:45:51
Druid: A Real-Time Analytical Data Store
Presented by Fangjin Yang (Software Engineer) at Berkeley's AMPLab.
#40808 youtube 00:28:35
Realtime Analytics with Open Source Technologies
#65067 youtube 00:28:53
MetaMarkets - Introduction to Druid by Fangjin Yang
More introduction to Druid here: http://www.hakkalabs.co/articles/introduction-to-druid-at-metamarkets.
#68439 youtube 00:45:51
Gian Merlino: Realtime Analytics for Metrics Data w/ Druid
Read the full blog post here - http://www.heavybit.com/library/blog/sf-metrics-meetup-opentracing-and-druid/ Druid is an open source, distributed data store designed to analyze event data. Druid powers user-facing data applications, provides fast queries on data in Hadoop, and helps you glean insights from streaming data. The architecture unifies historical and real-time data and enables fast, flexible OLAP analytics at scale. Gian covers Druid's design and architecture, and how Druid can be utilized to monitor metrics data. For more developer focused content, visit https://www.heavybit.com/library
Real Time Analytics at Scale with Druid Jan 28, 2016
Real Time Analytics at Scale with Druid, Los Angeles High Scalability Group, January 28, 2016, hosted at Riot Games. BACKGROUND ON GUMGUM: GumGum invented in-image advertising in 2008 and continues to lead the industry with its solution for helping publishers monetize their images and giving advertisers a relevant way to tell their brand stories through pictures. Their patented In-Image Ads are overlaid on editorial photos where a user's attention is actively focused, creating higher viewability and engagement, and a better consumer experience. TALK OVERVIEW: At GumGum, we use Druid to ingest more than 30 billion events every day. These events can be queried almost as soon as they happen, with a very low response time. How does Druid do it? How does GumGum leverage Druid's capabilities? Why does Druid suit GumGum's needs so well, and why should you consider using it? This is a tell-all talk about our love story with Druid. ABOUT THE SPEAKER: Guillaume Torche has been working as a Big Data Engineer at GumGum since September 2014. Guillaume was born in France and studied computer science at Université de Technologie de Compiègne. He has been working on Druid for the past year and built the cluster used in production. Guillaume is the official Druid champion at GumGum. ABOUT OUR VENUE HOSTS: Riot Games is an American video game publisher, established in 2006, with its main office in Los Angeles and satellite offices in St. Louis, Dublin, Berlin, Seoul, Sao Paulo, Istanbul, Moscow, Sydney and Taipei. They are the brand responsible for producing League of Legends, an online battle arena, real-time strategy video game.
Jin Yu(Mafengwo):Big Data in Travel - Real-time Analytics with Kafka, Spark, and Druid
Mafengwo is the largest online travel community in China, with over 80 million online and mobile users. In this talk, I will present Mafengwo's data strategy, its real-time Kappa data architecture, and use cases for predictive analytics in the travel industry.
SF Big Analytics 12-01-2015 : Druid: Interactive Exploratory Analytics at Scale
Druid: Interactive Exploratory Analytics at Scale Abstract: Druid is an open source, distributed data store designed to analyze event data. Druid powers user-facing data applications, provides fast queries on data in Hadoop, and helps you glean insights from streaming data. The architecture unifies historical and real-time data and enables fast, flexible OLAP analytics at scale. We will cover the limitations of existing relational database management systems and NoSQL key/value stores that motivated Druid’s development, the architecture of the system, and where Druid sits in the data infrastructure space. Speaker: Gian Merlino Gian is one of the main Druid committers and recently co-founded a stealth startup. Previously, Gian led the data ingestion team at Metamarkets and held senior engineering positions at Yahoo. He holds a BS in Computer Science from Caltech.
How to Connect Tableau to Apache Druid Using Hive LLAP
This animation shows a real-time view of accessing Druid in Tableau using Hive LLAP. As you can see, the response time is easily within interactive time scales. View the blog to learn more: https://hortonworks.com/blog/apache-hive-druid-part-1-3/
Hadoop Meetup (HUG) July 2014 - Pushing the limits of Realtime Analytics using Druid
Hadoop Meetup (HUG) July 2014 - Privilege Isolation in Docker Containers.
#43286 youtube 00:30:27
Democratizing Data Science Using Spark, Hive, and Druid
MZ is re-inventing how the entire world experiences data via our mobile games division MZ Games Studios, our digital marketing division Cognant, and our live data platform division Satori. The growing need for data science capabilities across the organization requires an architecture that can democratize building these applications and disseminating insight from the outcome of data science applications to the wider organization. Attend this session to learn how we built a platform for data science using Spark, Hive, and Druid specifically for our performance marketing division, Cognant. This platform powers several data science applications, such as fraud detection and bid optimization, at large scale. We will be sharing lessons learned over the past three years in building this platform by also walking through some of the actual data science applications built on top of it. Attendees from ML engineering and data science backgrounds can gain deep insight from our experience of building this platform.
0605 Docker Based Hadoop Provisioning
#43287 youtube 00:43:43
DataEngConf: Building Satori, a Hadoop tool for Data Extraction at LinkedIn
LinkedIn is the professional profile of record for our 370M+ members globally, but many people don't realize the full potential of their LinkedIn profile – especially on mobile. Adding blogs, photos, and other rich content to your profile on a small-screen device can get tedious. That's why LinkedIn created Satori, a Hadoop tool that crawls the web and extracts data to discover members' professional content online. Satori uses machine learning techniques and leverages other open source tools like Nutch and Gobblin to help match members with relevant content and maximize their professional profile. In this talk, Nikolai will share his experience in building the product and discuss the challenges and opportunities encountered along the way.
Extend Governance in Hadoop with Atlas Ecosystem
[1.29] Danni the Druid: An Introduction
My name is Danni, and I'm a Druid! I'll be the new Thursday Substitute host. I thought my introduction would be more fun if I included some exuberant, random facts about who I am. I also briefly talk about some of the things I do in my practice. If you'd like to see my audition video, you can find it here: http://youtu.be/Ma94m5dK328 ----- My personal channel: https://www.youtube.com/user/sweetgirl3313 Blog: http://www.EsotericMoment.com Twitter: https://twitter.com/EsotericMoment Instagram: http://instagram.com/esotericmoment/ Tumblr: http://esotericmoment.tumblr.com/ Goodreads: https://www.goodreads.com/user/show/3... Topics/FAQs: http://paganperspectivecollab.blogspot.com -----
Analyze Streaming Data in About 30 Minutes with HDP and Druid
See how to deploy/configure Druid, then see how easy it is to stream data into Druid for real-time SQL analytics and visualization. Follow along at home using https://github.com/cartershanklin/druid-satori-demo
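The demo's actual configuration lives in the linked repository; purely as a hedged illustration of how continuous ingestion is usually started, the sketch below posts a Kafka supervisor spec to the Druid overlord from Python. Host names, the topic, the column layout, and the exact spec fields are assumptions and vary across Druid versions.

    # Sketch: submit a Kafka indexing-service supervisor spec so Druid ingests a
    # topic continuously. All names here are placeholders.
    import requests

    spec = {
        "type": "kafka",
        "dataSchema": {
            "dataSource": "web_events",
            "parser": {
                "type": "string",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {"column": "timestamp", "format": "iso"},
                    "dimensionsSpec": {"dimensions": ["url", "user", "country"]},
                },
            },
            "metricsSpec": [{"type": "count", "name": "count"}],
            "granularitySpec": {"segmentGranularity": "HOUR", "queryGranularity": "MINUTE"},
        },
        "ioConfig": {
            "topic": "web_events",
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "taskCount": 1,
        },
        "tuningConfig": {"type": "kafka"},
    }

    resp = requests.post("http://localhost:8090/druid/indexer/v1/supervisor", json=spec)
    print(resp.status_code, resp.text)

Once the supervisor is running, the datasource appears in Druid and can be queried while events keep streaming in.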
DRUID with Pivot - Build Extremely Fast Full Blown Site on your BigData in Minutes
DRUID with Pivot - Build Extremely Fast Full Blown Site on your BigData in Minutes http://druid.io/ http://pivot.imply.io/
Beyond Hadoop: Fast Ad-Hoc Queries on Big Data
Metamarkets' Lead Architect gives an overview of Druid at the Strata Conference in NYC.
#40805 youtube 00:36:27
Scalable Realtime Analytics using Druid
Nishant Bangarwa – Scalable Realtime Analytics using Druid
Traditional SaaS solutions based on Hadoop datastores such as Hive/HBase or classical RDBMSs work well for storing data, although they are not optimized for ingesting data and making it immediately available for interactive, ad hoc, low-latency queries at very high scale. Long query latencies make these solutions suboptimal choices to power interactive applications. This talk will introduce Druid as a complementary solution for scalable real-time ingestion and analytics. Druid is an open source distributed data warehouse designed to support OLAP-like queries and is used in production at numerous companies. It was inspired by Google’s Dremel, PowerDrill, and search infrastructure. This talk will cover Druid's architecture, its storage internals, and the common use cases Druid is a good fit for.
Open Source Lambda Architecture with Hadoop, Kafka, Samza, and Druid
"Druid: Powering Interactive Data Applications at Scale" by Fangjin Yang
Cluster computing frameworks such as Hadoop or Spark are tremendously beneficial in processing and deriving insights from data. However, long query latencies make these frameworks sub-optimal choices to power interactive applications. Organizations frequently rely on dedicated query layers, such as relational databases and key/value stores, for faster query latencies, but these technologies suffer many drawbacks for analytic use cases. In this session, we discuss using Druid for analytics, and why the architecture is well suited to power analytic applications. User-facing applications are replacing traditional reporting interfaces as the preferred means for organizations to derive value from their datasets. In order to provide an interactive user experience, user interactions with analytic applications must complete on the order of milliseconds. To meet these needs, organizations often struggle with selecting a proper serving layer. Many serving layers are selected because of their general popularity, without understanding the possible architecture limitations. Druid is an analytics data store designed for analytic (OLAP) queries on event data. It draws inspiration from Google's Dremel, Google's PowerDrill, and search infrastructure. Many large technology companies are switching to Druid for analytics, and we will cover why the technology is a good fit for its intended use cases.
Bringing Real Time to the Enterprise with Hortonworks DataFlow
TELUS is Canada’s fastest-growing national telecommunications company, with $12.9B of annual revenue and 12.7M customer connections. TELUS provides TELUS TV to more than 1.1 million customers in western Canada, with the goal of maximizing clients’ service experience. Every television set-top box (STB) in the TELUS network regularly logs its diagnostic state, and those logs serve as an important signal of the state of the customer’s equipment. TELUS’ goal was to move the analysis of these STB logs from a daily batch process towards streaming analytics that would allow the company to respond within minutes to changes in device or service health. TELUS increased logging frequency by a factor of 50 and adapted the overall architecture to keep pace with that transformation. Members of the TELUS team will walk through the implementation and workflow challenges they overcame, working with Hortonworks to connect Apache Hadoop, Apache NiFi, and Apache Spark within a secure enterprise environment. They will also discuss the impact on their business and their customers’ satisfaction. Speakers: Cavan Loughran, Solution & Big Data Architect, Telus; Oliver Meyn, Solution Architect, T4G Limited. Link to Slides: https://www.slideshare.net/Hadoop_Summit/bringing-realtime-to-the-enterprise-with-hortonworks-dataflow Event session page: https://dataworkssummit.com/san-jose-2017/sessions/bringing-real-time-to-the-enterprise-with-hortonworks-dataflow/
Druid and Hive together: interactive realtime analytics at scale
Two popular open source technologies, Druid and Apache Hive, are often mentioned as viable solutions for large-scale analytics. Hive works well for storing large volumes of data, although it is not optimized for ingesting streaming data and making it available for queries in real time. Druid excels at low-latency, interactive queries over streaming data and at making data available in real time for queries. Although the high-level messaging presented by both projects may lead you to believe they are competitors in the same space, the technologies are, in fact, extremely complementary solutions. By combining the rich query capabilities of Hive with the powerful real-time streaming and indexing capabilities of Druid, we can build a more powerful, flexible, and extremely low-latency real-time analytics solution. In this talk we will discuss the motivation to combine Hive and Druid, along with the benefits and benchmark numbers. Proposed agenda of the talk: • Motivation behind combining Druid and Hive • Apache Hive—introduction • Druid—introduction • Druid and Hive together—benefits • Architecture for handling streaming data at scale • Demo • Benchmark numbers. Speaker: Nishant Bangarwa, Software Engineer, Hortonworks
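To make the integration concrete, here is a hedged sketch of the Hive side: a CTAS that materializes Hive data into a Druid datasource through the Hive/Druid storage handler, issued from Python with PyHive. The HiveServer2 host, database objects, and column names are assumptions; the storage handler class and the __time column requirement follow the Hive-Druid integration documented by Hortonworks.

    # Sketch: index existing Hive data into Druid via the DruidStorageHandler.
    from pyhive import hive

    # Connection details (host, port, auth) are placeholders.
    conn = hive.connect(host="hiveserver2.example.com", port=10000)
    cur = conn.cursor()

    cur.execute("""
    CREATE TABLE web_events_druid
    STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
    TBLPROPERTIES (
      "druid.segment.granularity" = "HOUR",
      "druid.query.granularity"   = "MINUTE"
    )
    AS
    SELECT
      CAST(event_time AS timestamp) AS `__time`,   -- Druid requires a __time column
      url,
      country,
      COUNT(*) AS events
    FROM web_events_raw
    GROUP BY CAST(event_time AS timestamp), url, country
    """)

Queries against web_events_druid can then be answered from Druid segments rather than by scanning the raw Hive table.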
How to work with Apache Kafka and Hadoop - Gwen Shapira from Cloudera
For more tech talks and to network with other engineers, check out our site https://www.hakkalabs.co/logs See the full post here: http://www.hakkalabs.co/art...
#62757 youtube 00:20:22
The Pillars of Effective Data Archiving and Tiering in Hadoop
Pete Kisich from FactorData Corporation will cover utilizing native Hadoop storage policies and types to effectively archive and tier data in your existing Hadoop infrastructure. Key focus areas are: 1. Current state of tiering in Hadoop 2. Identifying key metrics for successful archiving 3. Automation requirements at scale 4. Current limitations and gotchas. Successful archiving gives Hadoop users better performance, lower hardware costs, and lower software costs. This session will cover the techniques and tools available to unlock this powerful capability in native Hadoop.
Accelerate SQL Analytics on HDP with Druid
See how to accelerate SQL analytics by creating Druid indexes on data stored in Hive. Follow along at home using https://github.com/cartershanklin/hive-druid-ssb
Jesús Camacho Rodríguez, Hortonworks
Interactive Analytics at Scale in Apache Hive Using Druid
Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache Hive
In February 2013, the open source community launched the Stinger Initiative to improve speed, scale and SQL semantics in Apache Hive. After thirteen months o...
#36911 youtube 00:27:56
Genomic Data Analysis with Spark & Hadoop by Ryan Williams | DataEngConf NYC '16
Learn more about Ryan Williams and his talk on genomic data analysis with Apache Spark and Hadoop here: http://info.dataengconf.com/genomic-data-analysis-with-spark-and-hadoop
Apache Kylin: OLAP Cubes for NoSQL Data stores
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
As one of the few closed-loop payment platforms, PayPal is uniquely positioned to provide merchants with insights aimed to identify opportunities to help grow and manage their business. PayPal processes billions of data events every day around our users, risk, payments, web behavior and identity. We are motivated to use this data to enable solutions to help our merchants maximize the number of successful transactions (checkout conversion), better understand who their customers are, and find additional opportunities to grow and attract new customers. As part of Merchant Data Analytics, we have built a platform that serves low-latency, scalable analytics and insights by leveraging some of the established and emerging platforms to best realize returns on the many business objectives at PayPal. Join us to learn more about how we leveraged platforms and technologies like Spark, Hive, Druid, Elasticsearch, and HBase to process large-scale data for enabling impactful merchant solutions. We’ll share the architecture of our data pipelines, some real dashboards, and the challenges involved. Speakers: Kasiviswanathan Natarajan, Member of Technical Staff, PayPal Inc.; Deepika Khera, Senior Manager, Merchant Data Analytics, PayPal
Streaming analytics at 300 billion events/day with Kafka, Samza, and Druid
Druid Installation on Ubuntu
Easy installation of Druid on Ubuntu in standalone mode.
Hadoop Query Performance Smackdown
BelFOSS 2018 SpotX: Hadoop, Spark & Druid
Io, Jen and Ronan from SpotX in Belfast provide an overview of the free/open-source software used for big data at SpotX including Hadoop, Spark and Druid. This is a talk given at BelFOSS 2018. Video produced using FOSS software including Kdenlive, Gimp & ffmpeg on Linux Mint.
Superset querying Druid
scale.bythebay.io: Pavan, Druid Lookups for High Cardinality Dimensions
Druid is a high-performance, column-oriented, distributed data store. Lookups are a concept in Druid where dimension values are (optionally) replaced with new values. The common use case of query-time lookups is to replace one dimension value (e.g. an ID) with another value (e.g. a human-readable name). This is similar to a star-schema join. Druid has limited support for joins through query-time lookups. Very small lookups (key counts on the order of a few dozen to a few hundred) can be passed at query time as a "map" lookup as per dimension specs. For large lookups, Druid has an extension called namespaced lookups. Namespaced lookups are appropriate for lookups that cannot be passed at query time due to their size, or that are not desired to be passed at query time because the data is to reside in and be handled by the Druid servers. But Druid’s namespaced lookups have the following limitations: • They are not suitable for high-cardinality dimensions • They do not scale to large data on the order of hundreds of millions of rows • Namespaced lookup support is limited to one key column with a corresponding value column • Real-time updates to the lookup data are not possible. These limitations encouraged us to develop a highly scalable, multi-column, configurable Druid lookup framework that supports real-time updates on lookup data. The framework uses an embeddable persistent key-value data store, Kafka for messaging, and HDFS for deep storage.
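As a rough illustration of the small, query-time "map" lookup mentioned above (the case the framework goes beyond), the sketch below attaches an inline ID-to-name map to a native groupBy query. The broker URL, datasource, dimension names, and map contents are assumptions, and the exact dimension-spec shape can differ between Druid versions.

    # Sketch: replace an ID dimension with a readable name at query time using an
    # inline "map" lookup extraction function.
    import requests

    query = {
        "queryType": "groupBy",
        "dataSource": "web_events",
        "granularity": "all",
        "intervals": ["2017-04-01/2017-04-02"],
        "dimensions": [
            {
                "type": "extraction",
                "dimension": "country_id",
                "outputName": "country_name",
                "extractionFn": {
                    "type": "lookup",
                    "lookup": {"type": "map", "map": {"1": "United States", "2": "Canada"}},
                    "retainMissingValue": True,
                },
            }
        ],
        "aggregations": [{"type": "longSum", "name": "events", "fieldName": "count"}],
    }

    resp = requests.post("http://localhost:8082/druid/v2/", json=query)
    print(resp.json())

For anything beyond a few hundred keys this inline approach stops being practical, which is the gap the namespaced lookups and the framework described in the talk are meant to fill.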
Rostislav Pashuto: Aggregated Queries on Large Data Volumes Using Druid
Rostislav Pashuto: Engineer at InData Labs, Minsk. Talk: "Aggregated queries on large data volumes using Druid". About: While tools for batch processing of large data volumes keep improving, the problem of real-time queries remains open and is attracting the attention of a growing number of people. In this talk Rostislav looks at Druid (druid.io), an open source data store that supports OLAP queries over event data and aims to close a gap in the current big data ecosystem by enabling analytical queries in real time.
Querying Druid in SQL with Superset
Druid is a high-performance, column-oriented, distributed data store that is widely used at Oath for big data analysis. Druid has a JSON schema as its query language, making it difficult for new users unfamiliar with the schema to start querying Druid quickly. The JSON schema is designed to work with the data ingestion methods of Druid, so it can provide high-performance features such as data aggregations in JSON, but many are unable to utilize such features because they are not familiar with the specifics of how to optimize Druid queries. However, most new Druid users at Yahoo are already very familiar with SQL, and the queries they want to write for Druid can be converted to concise SQL. We found that our data analysts wanted an easy way to issue ad-hoc Druid queries and view the results in a BI tool in a way that's presentable to nontechnical stakeholders. In order to achieve this, we had to bridge the gap between Druid, SQL, and our BI tools such as Apache Superset. In this talk, we will explore different ways to query a Druid datasource in SQL and discuss which methods were most appropriate for our use cases. We will also discuss our open source contributions so others can utilize our work.
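For readers who want to try the SQL route the talk describes, a minimal hedged sketch using the pydruid DB API client is shown below; the broker host and the web_events datasource are assumptions. The same style of connection string (druid://broker:8082/druid/v2/sql/) is what pydruid's SQLAlchemy dialect exposes, which is one way to register Druid as a database in Superset.

    # Sketch: run Druid SQL from Python through pydruid's DB API.
    from pydruid.db import connect

    conn = connect(host="localhost", port=8082, path="/druid/v2/sql/", scheme="http")
    cur = conn.cursor()
    cur.execute("""
        SELECT COUNT(*) AS events
        FROM web_events
        WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
    """)
    for row in cur:
        print(row)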
Accelerating Data Warehouse Modernization with OLAP on Hadoop
In this talk Ajay Anand, VP Products, Kyvos Insights and Vineet Tyagi, CTO, Impetus examine the best practices for assessing what can be migrated, doing an impact / ROI analysis, and developing plans for data migration, offloading ETL, managing security, and building a consumable data layer for the business user. They also evaluate results obtained by enterprises that have embarked on this journey, including an assessment of operational benefits, acceptance by end users, and reduction in time to market.
Interactive Exploratory Analytics with Druid
Recorded at DataEngConf '17: Cluster computing frameworks such as Hadoop or Spark are tremendously beneficial in processing and deriving insights from data. However, long query latencies make these frameworks sub-optimal choices to power interactive applications. Organizations frequently rely on dedicated query layers, such as relational databases and key/value stores, for faster query latencies, but these technologies suffer many drawbacks for analytic use cases. In this session, we discuss using Druid for analytics, and why the architecture is well suited to power analytic applications. User-facing applications are replacing traditional reporting interfaces as the preferred means for organizations to derive value from their datasets. In order to provide an interactive user experience, user interactions with analytic applications must complete on the order of milliseconds. To meet these needs, organizations often struggle with selecting a proper serving layer. Many serving layers are selected because of their general popularity, without understanding the possible architecture limitations. Druid is an analytics data store designed for analytic (OLAP) queries on event data. It draws inspiration from Google’s Dremel, Google’s PowerDrill, and search infrastructure. Many enterprises are switching to Druid for analytics, and we will cover why the technology is a good fit for its intended use cases. Speaker: Fangjin Yang, Imply
Hive LLAP "Inside Out"
A Hortonworks Premier Support Exclusive. Hive Product Management and Engineers explain Hive LLAP usage and architecture. Learn about Hortonworks Premier Support here: https://hortonworks.com/services/support/premier Brought to you by: https://www.linkedin.com/in/clukasik https://www.linkedin.com/in/jlongpre
Birds of a Feather: Apache Hive & Apache Druid
Apache Hive is the de facto standard for SQL queries in Hadoop. With the next phase of SQL in Hadoop, the Apache community has greatly improved Hive’s speed (LLAP), scale, and SQL semantics. Come learn and discuss what is new in Hive 3.0. Apache Druid is an open source column-oriented distributed data store designed for OLAP queries on event data. Druid provides the ability to have interactive queries on real-time streams that are horizontally scalable. Druid has rich client libraries and integration with tools like Pivot and Apache Superset. Come learn about the latest developments in Druid and Hive/Druid integration. Hosts: Alan Gates, Vihang Karajgaonkar, Jesus Camacho Rodriguez, Nishant Bangarwa
Interactive real time dashboards on data streams using Kafka, Druid, and Superset
When interacting with analytics dashboards, two key requirements for a smooth user experience are quick response time and data freshness. To meet the requirements of creating fast interactive BI dashboards over streaming data, organizations often struggle with selecting a proper serving layer. Cluster computing frameworks such as Hadoop or Spark work well for storing large volumes of data, although they are not optimized for making it available for queries in real time. Long query latencies also make these systems suboptimal choices for powering interactive dashboards and BI use cases. This talk presents an open source real-time data analytics stack using Apache Kafka, Druid, and Superset. The stack combines the low-latency streaming and processing capabilities of Kafka with Druid, which enables immediate exploration and provides low-latency queries over the ingested data streams. Superset provides the visualization and dashboarding that integrates nicely with Druid. In this talk we will discuss why this architecture is well suited to interactive applications over streaming data, present an end-to-end demo of the complete stack, discuss its key features, and discuss performance characteristics from real-world use cases. Speaker: Nishant Bangarwa, Software Engineer, Hortonworks
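On the front end of this pipeline, events simply arrive on a Kafka topic that Druid's ingestion supervisor consumes; a small hedged sketch of such a producer is below. The topic name, broker address, and event fields are assumptions, not the demo's actual schema.

    # Sketch: produce JSON events to the Kafka topic that Druid ingests.
    import json
    import time
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    event = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "url": "/checkout",
        "user": "u-123",
        "country": "US",
    }
    producer.send("web_events", event)
    producer.flush()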
An introduction to Druid
Cluster computing frameworks such as Hadoop or Spark are tremendously beneficial in processing and deriving insights from data. However, long query latencies make these frameworks sub-optimal choices to power interactive applications. Organizations frequently rely on dedicated query layers, such as relational databases and key/value stores, for faster query latencies, but these technologies suffer many drawbacks for analytic use cases. In this session, we discuss using Druid for analytics and why the architecture is well suited to power analytic applications. User-facing applications are replacing traditional reporting interfaces as the preferred means for organizations to derive value from their datasets. In order to provide an interactive user experience, user interactions with analytic applications must complete on the order of milliseconds. To meet these needs, organizations often struggle with selecting a proper serving layer. Many serving layers are selected because of their general popularity without understanding the possible architecture limitations. Druid is an analytics data store designed for analytic (OLAP) queries on event data. It draws inspiration from Google’s Dremel, Google’s PowerDrill, and search infrastructure. Many enterprises are switching to Druid for analytics, and we will cover why the technology is a good fit for its intended use cases. Speaker: Nishant Bangarwa, Software Engineer, Hortonworks
Sherlock: an anomaly detection service on top of Druid
Sherlock is an anomaly detection service built on top of Druid. It leverages EGADS (Extensible Generic Anomaly Detection System; github.com/yahoo/egads) to detect anomalies in time-series data. Users can schedule jobs on an hourly, daily, weekly, or monthly basis, view anomaly reports from Sherlock's interface, or receive them via email. Sherlock has four major components: time-series generation, EGADS anomaly detection, a Redis backend, and a Spark Java UI. Time-series generation involves building, validating, and issuing the Druid query and parsing its response. The parsed Druid response is then fed to the EGADS anomaly detection component, which detects anomalies and generates reports for each input time series. Sherlock uses the Redis backend to store job metadata, generated anomaly reports, a persistent job queue for scheduling jobs, etc. Users can choose to have a clustered Redis or a standalone Redis. Sherlock provides a user interface built with Spark Java. The UI enables users to submit instant anomaly analyses, create and launch detection jobs, and view anomalies on a heatmap and on a graph.
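Sherlock's detection itself runs through EGADS on the JVM; purely as a hedged illustration of the pipeline shape (Druid time series in, anomaly flags out), the sketch below pulls an hourly series with a native timeseries query and flags points more than three standard deviations from the mean. The broker, datasource, metric, and threshold are assumptions, not Sherlock's configuration or models.

    # Rough illustration only: fetch an hourly series from Druid and apply a
    # naive 3-sigma rule (Sherlock uses EGADS models instead).
    import statistics
    import requests

    query = {
        "queryType": "timeseries",
        "dataSource": "web_events",
        "granularity": "hour",
        "intervals": ["2017-04-01/2017-04-08"],
        "aggregations": [{"type": "longSum", "name": "events", "fieldName": "count"}],
    }

    rows = requests.post("http://localhost:8082/druid/v2/", json=query).json()
    values = [r["result"]["events"] for r in rows]

    mean = statistics.mean(values)
    stdev = statistics.pstdev(values) or 1.0
    for r, v in zip(rows, values):
        if abs(v - mean) > 3 * stdev:
            print("anomaly at", r["timestamp"], "value", v)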
Introduction to Druid, fast distributed data store (Nikita Salnikov-Tarnovski, Co-founder at Plumbr)
I would like to introduce a very nice time-series database product that allows Plumbr to receive and process billions of events per day without breaking a sweat. After several disappointing evaluations of competing products, Druid’s ease of use and performance characteristics made it our lucky ticket. This is the first time in many years that I am impressed with a product so much that I would like to talk to everybody about it. :) In this talk I will present several architectural decisions made by Druid's authors that allow it to receive a constant stream of real-time events, make them available for complex analytical queries almost instantly, and answer those queries in a few seconds. They are very good examples of solutions to complex problems that seem obvious in hindsight.