Pydruid for data ingestion in Druid

Is there any way I can use pydruid for data ingestion in Druid? My goal is to create a Python app that can insert, update, and read data from Druid.

Hi! Currently, ingestion is launched via the Overlord API, which is as close as Druid 0.22 gets to INSERT, CREATE, etc. statements.
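For example, a task spec can be POSTed to the Overlord from Python with the requests library. This is only a rough sketch under my own assumptions (a local quickstart router on port 8888, a made-up datasource called "pydruid_demo" with ts/user/action columns), not a drop-in solution:

import json
import requests

# Minimal native batch (index_parallel) task spec with a single inline JSON row.
task_spec = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "pydruid_demo",  # hypothetical datasource name
            "timestampSpec": {"column": "ts", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user", "action"]},
            "granularitySpec": {"segmentGranularity": "day", "queryGranularity": "none"},
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
                "type": "inline",
                "data": '{"ts": "2021-11-01T00:00:00Z", "user": "alice", "action": "click"}',
            },
            "inputFormat": {"type": "json"},
        },
    },
}

# The router (port 8888 in the quickstart) proxies /druid/indexer/v1/task to the Overlord.
resp = requests.post(
    "http://localhost:8888/druid/indexer/v1/task",
    data=json.dumps(task_spec),
    headers={"Content-Type": "application/json"},
)
print(resp.json())  # contains the task ID

You can then poll GET /druid/indexer/v1/task/<taskId>/status to see when the task finishes.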

And it’s also important you know about the UPDATE capabilities of Druid.

You might want to watch this video from PMC Chair @Gian_Merlino2 on some upcoming changes, particularly from about 8 minutes in…

You might also want to raise pydruid-specific questions in the pydruid repo.
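For the read side, pydruid already works well. Here is a small sketch of its DB API (SQL) interface, assuming a local Broker on port 8082 and the hypothetical datasource from the sketch above:

from pydruid.db import connect

# Connect to the Broker's SQL endpoint (port 8082 in the quickstart).
conn = connect(host="localhost", port=8082, path="/druid/v2/sql/", scheme="http")
curs = conn.cursor()
curs.execute('SELECT __time, "user", "action" FROM pydruid_demo LIMIT 5')
for row in curs:
    print(row)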

Could you please check whether the config file below is correct for the MySQL connector on a single-server cluster?


#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
#

# Extensions specified in the load list will be loaded by Druid
# We are using local fs for deep storage - not recommended for production - use S3, HDFS, or NFS instead
# We are using local derby for the metadata store - not recommended for production - use MySQL or Postgres instead
# If you specify `druid.extensions.loadList=[]`, Druid won't load any extension from file system.
# If you don't specify `druid.extensions.loadList`, Druid will load all the extensions under root extension directory.
# More info: https://druid.apache.org/docs/latest/operations/including-extensions.html
druid.extensions.loadList=["druid-hdfs-storage", "druid-kafka-indexing-service", "druid-datasketches", "mysql-metadata-storage"]

# If you have a different version of Hadoop, place your Hadoop client jar files in your hadoop-dependencies directory
# and uncomment the line below to point to your directory.
#druid.extensions.hadoopDependenciesDir=/my/dir/hadoop-dependencies

#
# Hostname
#
druid.host=localhost

#
# Logging
#
# Log all runtime properties on startup. Disable to avoid logging properties on startup:
druid.startup.logging.logProperties=true

#
# Zookeeper
#
druid.zk.service.host=localhost
druid.zk.paths.base=/druid

#
# Metadata storage
#
# For Derby server on your Druid Coordinator (only viable in a cluster with a single Coordinator, no fail-over):
druid.metadata.storage.type=derby
druid.metadata.storage.connector.connectURI=jdbc:derby://localhost:1527/var/druid/metadata.db;create=true
druid.metadata.storage.connector.host=localhost
druid.metadata.storage.connector.port=1527

# For MySQL (make sure to include the MySQL JDBC driver on the classpath):
druid.metadata.storage.type=mysql
# druid.metadata.mysql.driver.driverClassName=org.mariadb.jdbc.Driver
druid.metadata.storage.connector.connectURI=jdbc:mysql://root@localhost/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=diurd

# For PostgreSQL:
# druid.metadata.storage.type=postgresql
# druid.metadata.storage.connector.connectURI=jdbc:postgresql://db.example.com:5432/druid
# druid.metadata.storage.connector.user=...
# druid.metadata.storage.connector.password=...

#
# Deep storage
#
# For local disk (only viable in a cluster if this is a network mount):
druid.storage.type=local
druid.storage.storageDirectory=var/druid/segments

# For HDFS:
#druid.storage.type=hdfs
#druid.storage.storageDirectory=/druid/segments

# For S3:
#druid.storage.type=s3
#druid.storage.bucket=your-bucket
#druid.storage.baseKey=druid/segments
#druid.s3.accessKey=...
#druid.s3.secretKey=...

#
# Indexing service logs
#
# For local disk (only viable in a cluster if this is a network mount):
druid.indexer.logs.type=file
druid.indexer.logs.directory=var/druid/indexing-logs

# For HDFS:
#druid.indexer.logs.type=hdfs
#druid.indexer.logs.directory=/druid/indexing-logs

# For S3:
#druid.indexer.logs.type=s3
#druid.indexer.logs.s3Bucket=your-bucket
#druid.indexer.logs.s3Prefix=druid/indexing-logs

#
# Service discovery
#
druid.selectors.indexing.serviceName=druid/overlord
druid.selectors.coordinator.serviceName=druid/coordinator

#
# Monitoring
#
druid.monitoring.monitors=["org.apache.druid.java.util.metrics.JvmMonitor"]
druid.emitter=noop
druid.emitter.logging.logLevel=info

# Storage type of double columns
# Omitting this will lead to doubles being indexed as floats at the storage layer
druid.indexing.doubleStorage=double

#
# Security
#
druid.server.hiddenProperties=["druid.s3.accessKey","druid.s3.secretKey","druid.metadata.storage.connector.password"]

#
# SQL
#
druid.sql.enable=true

#
# Lookups
#
druid.lookup.enableLookupSyncOnStartup=false

I’m getting an error with this config.

Hi Mebin,

I’m researching your question and came across the MySQL Metadata Store docs. If you check “Setting up MySQL”, section 3, you’ll find the configuration parameters. Hopefully this helps answer your config question.

Best,

Mark

I actually followed the tutorial and got this error. I’ve been stuck on this for the past two days.

Hi Mebin,
I think there is still some confusion in this thread.
The MySQL extension in Druid is meant to be used for the Druid metadata store, not as a source for loading data. Like Peter said, you can leave that at the default for the single-server config, which is Derby.

Now, putting that aside, what I am reading is that you want to produce data from a Python application, have it ingested into Druid, and have it readily available for queries after that. With current Druid functionality, one option is to set up a stream (e.g. Kafka), publish to it from your Python application, and create a real-time Kafka ingestion job in Druid consuming that data.
One thing to consider, however, is that this creates an asynchronous data feed into Druid, so you cannot expect to read the “inserted” data immediately.
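To make the streaming option concrete, here is a hedged sketch of the producer side using the kafka-python package; the broker address and topic name (“druid-events”) are assumptions, and Druid would need a matching Kafka supervisor spec consuming that topic:

import json
from datetime import datetime, timezone
from kafka import KafkaProducer

# Serialize each event as JSON, which a Druid Kafka supervisor with a
# {"type": "json"} inputFormat can parse.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "ts": datetime.now(timezone.utc).isoformat(),
    "user": "alice",
    "action": "click",
}
producer.send("druid-events", value=event)  # hypothetical topic name
producer.flush()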

Updating my post because I just read this: MySQL Metadata Store · Apache Druid, which also says that the extension can be used to do batch ingestion by reading from MySQL. I learned something new.

But this would imply a batch operation, so you might insert from Python into MySQL and then, on some periodic basis, run a batch ingestion to get the latest from MySQL into Druid (see the sketch below). I’m not sure this is what you want for your Python application. Perhaps we should reset the conversation by having you describe the data flows you are looking to achieve.
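If that batch-from-MySQL route fits better, the relevant piece is the task’s ioConfig, which points Druid’s SQL input source at your MySQL table. A rough sketch only, with the table, credentials, and query all placeholders (check the SQL input source docs for the exact requirements of your Druid version):

# Hypothetical ioConfig for an index_parallel task reading rows from MySQL.
io_config = {
    "type": "index_parallel",
    "inputSource": {
        "type": "sql",
        "database": {
            "type": "mysql",
            "connectorConfig": {
                "connectURI": "jdbc:mysql://localhost:3306/mydb",
                "user": "druid",
                "password": "diurd",
            },
        },
        "sqls": ["SELECT ts, user_id, action FROM events"],
    },
}
# This dict would replace the ioConfig in the task spec shown earlier in the
# thread and be submitted to the Overlord in the same way.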

Hope this helps.

I fixed the issue. I had to add the jar file to Druid’s lib folder; it wasn’t mentioned in the documentation.

Oh, that’s annoying… which page were you looking at, @mebinjoy? Maybe we could do with an update based on your experience…

MySQL Metadata Store · Apache Druid