How-To: Connect to DuckDB
This article provides instructions for installing and configuring DuckDB
What Does This Article Cover?
HighByte® Intelligence Hub supports integration with DuckDB as a lightweight, high-performance option for querying industrial time-series data stored at the Edge. The process to configure DuckDB is provided below. This article covers the following.
- What is DuckDB?
- Why DuckDB?
- Installing DuckDB
- Limiting Memory Usage
- Merging Files
- Parquet ETL
- Considerations
- Other related material
What is DuckDB?
DuckDB is an open-source, embedded analytical database designed to run SQL queries directly against file-based data (such as Parquet, CSV, and JSON) without requiring a separate database server. HighByte® Intelligence Hub supports integration with DuckDB as a lightweight, high-performance option for querying industrial time-series data stored at the Edge. For manufacturers working with large volumes of historian or sensor data in Parquet format, this combination enables fast, code-efficient data access without the overhead of a full data warehouse deployment.
Why DuckDB?
Intelligence Hub version 4.5 introduced a Parquet format option for AVEVA PI System Connection Inputs. In this scenario DuckDB can be used for on-disk processing of Parquet files in Pipelines. Additionally, DuckDB might generally be a simpler and lightweight alternative to Spark for processing large data file.
Installing DuckDB
Use the following steps to install DuckDB for use with Intelligence Hub.
- Download the latest DuckDB JAR file
https://duckdb.org/install/?platform=windows&environment=java
- Move the JAR into the Intelligence Hub lib directory
runtime/lib
- Start or restart the Intelligence Hub
Mounting DuckDB JAR File into a Docker Container
When using a docker container, the jar file needs to be mounted into the docker container using a mount command. The jar file will physically reside on the host filesystem and will be defined as a virtual file within the docker container. Here is an example of mounting the DuckDB jar file. The jar file is physically located in C:\Temp and will be mounted into the /usr/local/highbyte/runtime/lib folder on the docker container.
--mount type=bind,source=C:\Temp\duckdb_jdbc-1.5.3.0.jar,target=/usr/local/highbyte/runtime/lib/duckdb_jdbc-1.5.3.0.jar
Here is an example of a complete docker command that creates a container with port mappings, using an environment variable to accept the EULA, mounting the appData folder and mounting the DuckDB jar file. See Docker Configuration in our Users Guide for more detail using this link.
docker create -p 45248:45245 -p 1891:1885 -p 8888:8885 -e ACCEPT_EULA=Y --name highbyte_r8 --mount type=bind,source=C:\Installs\hb_r8,target=/usr/local/highbyte/appData --mount type=bind,source=C:\Temp\duckdb_jdbc-1.5.3.0.jar,target=/usr/local/highbyte/runtime/lib/duckdb_jdbc-1.5.3.0.jar highbyte:4.5.0
Create JDBC Connection for DuckDB
Create a connection in Intelligence Hub and specify JDBC Driver as the connection type.
-
Host and Port properties are required when creating the connection but will not be used. Just specify "localhost" for the Host and 443 as the Port.
-
JDBC Connection String
- For in-memory database
jdbc:duckdb:
-
-
For a file-backed database
-
jdbc:duckdb:C:\Users\User\Path\To\Database\file.db
- Class Path
org.duckdb.DuckDBDriver
-
SQL Syntax
PostgresSQL

Limiting Memory Usage
The DuckDB driver is initially set up to use as much memory as it can. This can be problematic when trying to process relatively large files. Limit the memory usage of the driver by running the following query. This query can be set up to run via a dedicated pipeline, or pipelines can be configured to run this query on the first pipeline run.
SET memory_limit = '4GB';
Merging Files
DuckDB supports merging multiple Parquet files. Files can be selected using one of two mechanisms, File Glob syntax or absolute paths.
File Glob Syntax
This mechanism works well with static setups where there is consistent naming and pathing of files. It also avoids managing configuration on the Intelligence Hub side.
COPY (
SELECT * FROM read_parquet('C:\Users\User\Desktop\ParquetStore\duckDB\parquetDump\*.parquet')
) TO 'C:\Users\User\Desktop\ParquetStore\duckDB\parquetDump\Merged.parquet' (FORMAT PARQUET);
Other examples of syntax possible with Glob:
-
Match all Parquet files in the directory
*.parquet
-
Match files like data_2020.parquet or data_2024.parquet (? matches any single character)
data_202?.parquet
-
Matches all files whose name begins with data_2024_
data_2024_*.parquet
-
Match files starting with sales_ or inventory_
{sales,inventory}_*.parquet
-
Match Parquet files in all subdirectories recursively
**/*.parquet
Absolute Paths
This mechanism is intended for scenarios where certain files must be cherry-picked within a crowded directory. It is most useful when files are landing in multiple directories or when there is less consistency with file naming. This mechanism can also be driven by input parameters, supporting use cases where the pipeline identifies one or more files and parameterizes the input to process those exact files.
COPY (
SELECT * FROM read_parquet([
'',
'',
''
])
) TO 'C:\Users\User\Desktop\ParquetStore\duckDB\parquetDump\Merged.parquet' (FORMAT PARQUET);
Parquet ETL
DuckDB supports on-disk ETL of Parquet files. This connector can be used to rename columns, transform values, or perform complex operations like filtering rows based on conditional logic.
Rename Column
The query below renames the "name" column to "point":
COPY (
SELECT * EXCLUDE (name), name AS point
FROM read_parquet('C:\Users\User\Desktop\ParquetStore\duckDB\parquetDump\')
) TO 'C:\Users\User\Desktop\ParquetStore\duckDB\parquetDump\Renamed.parquet' (FORMAT PARQUET);
Model Values
The query below combines numerical values (value_int64 and value_double) into one column (valueNumeric) and the remaining values (value_string and value_boolean) into a stringified column (valueOther).
COPY (
SELECT
* EXCLUDE (name, value_internal_type, value_system_type, value_int64, value_double, value_string, value_name AS pointName,
CASE
WHEN value_internal_type IN ('int64', 'double')
THEN COALESCE(value_int64::DOUBLE, value_double::DOUBLE)
ELSE NULL
END AS valueNumeric,
CASE
WHEN value_internal_type NOT IN ('int64', 'double')
THEN to_json(COALESCE(value_string::VARCHAR, value_boolean::VARCHAR))::VARCHAR
ELSE NULL
END AS valueOther
FROM read_parquet('C:\Users\User\Desktop\ParquetStore\duckDB\parquetDump\merged.parquet')
) TO 'C:\Users\User\Desktop\ParquetStore\duckDB\parquetDump\modeled.parquet' (FORMAT PARQUET);
Filter Values
The example query below filters rows so that only rows where value_int64 is populated remain.
COPY (
SELECT *
FROM read_parquet('C:\Users\User\Desktop\ParquetStore\duckDB\parquetDump\')
WHERE value_int64 IS NOT NULL
) TO 'C:\Users\User\Desktop\ParquetStore\duckDB\parquetDump\Filtered.parquet' (FORMAT PARQUET);
Reading Parquet Data
The following is an example query that reads data from a Parquet file.
SELECT * FROM read_parquet('C:\Users\User\Desktop\ParquetStore\duckDB\parquetDump\Data_WithMetadata.parquet')
Considerations
Performing on-disk operations
To perform on-disk operations, the COPY method wraps the query and copies the resulting data into a specified file. If COPY were not used, the query would instead return the transformed data inline.
Good no data
Because COPY is used and no data is returned from the queries, all on-disk operations will return Good No Data. This means pipeline execution will stop when the queries are run. There are two ways to work around this: run a second query, or use the 4.5 Good No Data handler on the Read Stage in pipelines.
Running Two Queries
The secondary query below returns a single value to the pipeline, so the Good No Data result becomes a Good result.
COPY (
SELECT * FROM read_parquet('C:\Users\User\Desktop\ParquetStore\duckDB\parquetDump\*.parquet')
) TO 'C:\Users\User\Desktop\ParquetStore\duckDB\parquetDump\merged.parquet' (FORMAT PARQUET);
SELECT 1 AS value;
The Good No Data Handler on the Read Stage
With 4.5, the Read Stage received an enhancement that allows it to continue processing when it receives a Good No Data result.
Other related material: