Historians: Backfill AVEVA PI Data
This article provides a starter solution for obtaining historical value changes from the AVEVA PI System historian and writing files to a data lake.
What Does This Article Cover?
Some advanced analytics solutions require access to large amounts of historical time-series data stored in a process historian. For example, it may be necessary to retrieve several days’ worth of historical data for analysis. In these scenarios, it is not possible to obtain all the data in a single execution of an Intelligence Hub Connection Input. Instead, the data pipeline must be designed to collect the historical data incrementally.
This article outlines an example solution that retrieves historical data from an AVEVA PI Data Archive and writes the results as .CSV files to Amazon S3.
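The incremental approach amounts to sliding a fixed-size time window across the full historical range and issuing one read per window. The sketch below illustrates the idea in plain Python, outside of Intelligence Hub; `read_recorded_values` is a hypothetical stand-in for the historian read that the Connection Input performs, and the one-minute window size is only an illustrative default.

```python
from datetime import datetime, timedelta

def backfill(read_recorded_values, start: datetime, end: datetime,
             window: timedelta = timedelta(minutes=1)):
    """Walk a fixed-size window across [start, end) and read each slice.

    read_recorded_values(window_start, window_end) is a hypothetical callback
    that returns the value changes recorded in that interval; in the example
    solution this role is played by the Intelligence Hub PI Connection Input.
    """
    cursor = start
    while cursor < end:
        window_end = min(cursor + window, end)
        yield read_recorded_values(cursor, window_end)
        cursor = window_end  # advance to the next interval
```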
Solution Assumptions
The following summarizes the assumptions related to the example solution.
- The solution scope included thousands of PI Points.
- The frequency of value changes varied per PI Point, with thousands of value changes per minute for the in-scope PI Points.
- The value change data was obtained from PI Data Archive.
- AVEVA PI System was installed on an Amazon EC2 instance. The Intelligence Hub PI Connection agent was installed on the same EC2 instance.
- Intelligence Hub was installed on a second Amazon EC2 instance with 8 GB of RAM. The heap memory allocated to the Java Virtual Machine (JVM) was not adjusted.
- The Intelligence Hub Pipeline wrote data to a .CSV file that was stored in Amazon S3. Each .CSV file contained 5,000 records.
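For context on the last assumption, the following is a minimal sketch, assuming boto3 and illustrative bucket and key names, of how records could be buffered into 5,000-row batches and written as .CSV objects to Amazon S3. In the example solution this behavior is configured in the Intelligence Hub Pipeline rather than written as code.

```python
import csv
import io
import boto3

s3 = boto3.client("s3")
BUCKET = "example-backfill-bucket"  # illustrative name
BATCH_SIZE = 5_000                  # records per .CSV file, per the assumptions above

def write_batches(records, key_prefix="pi-backfill/"):
    """Buffer dict records and write each 5,000-row batch as a .CSV object."""
    batch, file_index = [], 0
    for record in records:
        batch.append(record)
        if len(batch) == BATCH_SIZE:
            _put_csv(batch, f"{key_prefix}part-{file_index:05d}.csv")
            batch, file_index = [], file_index + 1
    if batch:  # flush the final partial batch
        _put_csv(batch, f"{key_prefix}part-{file_index:05d}.csv")

def _put_csv(rows, key):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    s3.put_object(Bucket=BUCKET, Key=key, Body=buffer.getvalue().encode("utf-8"))
```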
Solution Summary
The following summarizes the design of the example solution.
- Obtain PI Point Names
The first step in creating the solution is to obtain the in-scope PI Point names. The PI Point names can be obtained from AVEVA PI System Asset Framework, a file, a database, or a query of the AVEVA PI System Data Archive. In this case the PI Point names were obtained from the Data Archive using the Intelligence Hub PI Connection Point Browse type Input. The query should return the list of PI Point names quickly, for example in a few seconds or less. Caching can be enabled on the Connection Input. The format of the PI Point names should be a JSON array where each element is a string value.
- Configure the Connection Input
Next, configure the Connection Input that obtains the values. The Intelligence Hub PI Connection Point type Input may be used, with the Connection Input that obtains the PI Point names used for the Reference. Example start and end date times should be defined as parameters for testing. Ideally, the Connection should return data for the time span in a few seconds or less. Start with a small number of PI Point names and a short time span, then increase both to optimize.
- Design the Intelligence Hub Model
An Intelligence Hub Model provides an opportunity to structure the payload written to the destination system. The schema will typically be narrow, consisting of a few columns. Consider how to handle values: for example, all values could be converted to strings, or there could be one column for numeric values and a second column for non-numeric values (see the sketch following this list).
- Build the Intelligence Hub Pipeline
The Intelligence Hub Pipeline design is simple: break up the large dataset returned by the Connection Input, model the data, create a file, and write the file to the Data Lake. The Pipeline's polling trigger should account for the duration of the Connection Input read, the duration of the Pipeline execution, a possible non-uniform flow of data, and a safety factor. The number of records buffered and written to each file should balance Data Lake processing requirements against the required latency. The start and end times for the total backfill time span, as well as the index interval, are defined in the Pipeline, which manages these values as state and metadata (see the sketch following this list).
- Optimize for Performance
When optimizing the Pipeline, consider the volume of data being processed; at this volume, it might not be possible to use the Debug or Replay capabilities.
- Isolate Backfill Workloads
Consider the other Pipelines in the Intelligence Hub instance and their use of the PI Connection. It might be necessary to dedicate a PI Connection or an Intelligence Hub deployment to the backfill workload.
- A project file may be downloaded [here].
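To make the Model and Pipeline steps above more concrete, the following sketch, in plain Python rather than Intelligence Hub configuration, shows one way to shape a narrow record with separate numeric and string value columns, and to carry the backfill start time, end time, and index interval forward as state between executions. The names and field layout are illustrative assumptions; the actual solution expresses this logic through the Model and the Pipeline's state and metadata.

```python
from datetime import datetime, timedelta

# Narrow, model-like record shape: one row per value change, with numeric and
# non-numeric values kept in separate columns (one of the options discussed above).
def to_record(point_name: str, timestamp: datetime, value) -> dict:
    is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
    return {
        "point": point_name,
        "timestamp": timestamp.isoformat(),
        "numeric_value": value if is_numeric else None,
        "string_value": None if is_numeric else str(value),
    }

# Minimal state handling: on each trigger, read one index interval and advance.
class BackfillState:
    def __init__(self, start: datetime, end: datetime, interval: timedelta):
        self.cursor = start       # start of the next window to read
        self.end = end            # end of the total backfill time span
        self.interval = interval  # index interval processed per execution

    def next_window(self):
        """Return the (start, end) of the next window, or None when finished."""
        if self.cursor >= self.end:
            return None
        window = (self.cursor, min(self.cursor + self.interval, self.end))
        self.cursor = window[1]
        return window
```

With a one-minute interval, each trigger would process one window until the cursor reaches the end time, at which point the backfill is complete.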
Results and Recommendations
The following summarizes the results of running the example solution, which processed value changes for thousands of PI Points over a 24-hour period.
- The Connection Input returned data for a one-minute interval in 0.5 to 1 second.
- The Pipeline trigger was configured for 5 seconds. On average, the Pipeline executed in about 1.5 seconds.
- The Pipeline processed all value changes in two hours.
- The Pipeline processed over 9 million value changes.
- The example solution did not address error handling in the Pipeline, for example for missed reads of PI data or an unexpected flood of PI data.
- The Backfill Pipeline was the only Pipeline running during the test.
- Performance of the solution is correlated with the RAM allocated to the Intelligence Hub runtime; performance would improve if additional resources were allocated.
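As a consistency check on these figures: with a 5-second trigger and a one-minute read interval, the Pipeline works through history roughly 60 / 5 = 12 times faster than real time, so a 24-hour span completes in about 24 / 12 = 2 hours; similarly, roughly 9 million value changes over 24 hours of history corresponds to about 6,000 value changes per minute, in line with the stated assumption of thousands of value changes per minute.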
Additional Resources