Real-time dashboards and reports made efficient and resilient

June 24th 2020 by Dr. Roland Kuhn

The ability to obtain an instant overview of a complex system is one of the main reasons for digitalisation: the needed data points are accessible to fully automated analysis and display, so no monthly data collection or nightly computation runs are necessary. For a factory this includes not only the current status of a machine or workstation; we can now run complex analyses to see in real time how well a production process is working. With this tool in hand, we are in a better position than ever to iterate quickly on process improvements and thus increase efficiency and avoid waste in working time, materials, energy, and other consumables.

Existing Tools

In the field of operating IT systems, the state of the art is to regularly collect metrics from individual components (like database load, CPU temperature, or redundancy switch states) and store them in time-series databases, for example using Prometheus. The data are then analysed whenever the display needs to be refreshed; a common tool here is Grafana, in which common views are prepared and then updated every few seconds or minutes. When metrics stray outside predefined bounds, like a network link being more than 70% saturated, alarms are generated to call the operator’s attention to the potential problem.

This approach works well for homogeneous data like a computer’s memory utilisation, where the min/max and average within a 5-minute window give a good overview of what is happening. It also assumes that all measurements for a given point in time can be collected by Prometheus at that time.
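
Such summaries need no domain knowledge at all. As a minimal sketch in Rust of the kind of per-window aggregate a time-series database computes (the names are ours, for illustration only, not part of any of the tools mentioned):

```rust
/// Summary of one aggregation window, e.g. 5 minutes of samples.
#[derive(Debug)]
struct WindowSummary {
    min: f64,
    max: f64,
    avg: f64,
}

/// Computes min/max/average over the samples of one window;
/// returns None for an empty window.
fn summarise(samples: &[f64]) -> Option<WindowSummary> {
    if samples.is_empty() {
        return None;
    }
    let (mut min, mut max, mut sum) = (f64::INFINITY, f64::NEG_INFINITY, 0.0);
    for &s in samples {
        min = min.min(s);
        max = max.max(s);
        sum += s;
    }
    Some(WindowSummary { min, max, avg: sum / samples.len() as f64 })
}
```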

What complex processes need

While the current Andon status (working, interrupted, idle) of a workstation can easily be handled with the aforementioned tools, this does not yet fulfil the promises of digitalisation: we want to know how well the workstation has performed this hour compared to last week, how the wear and tear of equipment is progressing until maintenance is needed, and how often the workstation was blocked from producing by missing material, personnel, or manufacturing orders. All these questions require a wealth of information to be combined from different sources, and the computation needed on these inputs is much more complex, and therefore more time-consuming, than showing the average of a temperature value over time. Examples of such data sources are (a code sketch follows the list):

  • an audit trail of what happened at the workstation: who started and stopped working when, which machines were used at what times, and when, for how long, and for what reason interruptions occurred; in short, we need production data acquired in digital form
  • an audit trail of those changes to the production plan that affected this workstation, because we need to know whether a manufacturing order was already scheduled while the workstation stood idle for half an hour
  • an audit trail of material movements towards and from this workstation, correlated with quality check results for those materials
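
To make this concrete, here is a minimal sketch of what such event records could look like as Rust types; the variant and field names are illustrative assumptions, not the schema of ActyxOS or any other product:

```rust
/// Illustrative events for the three audit trails listed above.
/// Timestamps are seconds since the Unix epoch for simplicity.
#[derive(Debug)]
enum FactoryEvent {
    // production data acquisition at the workstation
    WorkStarted { workstation: String, worker: String, order: String, at: u64 },
    WorkStopped { workstation: String, worker: String, order: String, at: u64 },
    Interrupted { workstation: String, reason: String, from: u64, until: u64 },
    // changes to the production plan
    OrderScheduled { workstation: String, order: String, planned_start: u64 },
    // material movements, correlated with quality checks
    MaterialMoved { from: String, to: String, batch: String, quality_ok: bool, at: u64 },
}
```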

The interpretation of such information streams is non-trivial; in particular, it cannot readily be expressed as standard SQL queries against a relational database. As an anecdote, we once implemented such complex processing using deeply nested and recursive queries in Postgres and found that, while theoretically feasible, it is not practical: it is not efficient enough, and the resulting SQL code is very hard to understand. Relational databases are, however, a formidable data source for visualisation tools like Grafana, so there is an impedance mismatch to be bridged.

How it should work

Ideally, the database should contain easily digestible units of information as rows, for example one row per period of work interruption that contains or references all the needed circumstantial information. The visualisation then just needs to select the right rows and aggregate them, which is exactly what relational databases are made for.
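
As a sketch, such a row could be modelled like this (the table layout and field names are our assumptions, chosen for illustration):

```rust
/// One row of a hypothetical `interruptions` table: a single period during
/// which a workstation could not produce, plus the context needed for
/// filtering and aggregation in the dashboard.
#[derive(Debug)]
struct InterruptionRow {
    workstation: String,
    order: Option<String>, // the manufacturing order affected, if any
    reason: String,        // e.g. "missing material", "no personnel"
    started_at: u64,       // seconds since the Unix epoch
    ended_at: u64,
    duration_secs: u64,    // pre-computed so the dashboard only sums
}
```

A Grafana panel can then simply sum `duration_secs` grouped by `reason`, without re-deriving the interruptions from raw events on every refresh.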

It would also be highly beneficial if, every time some relevant input changes (e.g. when someone starts or stops working), the corresponding database rows were updated. The computational work of the analysis then runs only when actually needed, not every time a display is refreshed; this usually saves more than one order of magnitude in required database server capacity.

The third point on our wish list is that those who need the data and who most deeply understand the data should also be able to program the analysis: the production planner or the operational excellence team should be in a position to explore the wealth of information at their fingertips and thus unleash their creative power to make our factories better through a detailed understanding of their inner workings.

What we can already do today

With decentralised event log systems like ActyxOS we can reliably record the audit trails needed as input, even when devices are not always connected to the network or when parts of the system fail. The important guarantee is that all data will eventually arrive and be analysed.

The next step is to describe the database tables needed for the efficient display of the real-time reports. This depends on the database and analysis tool used; for example, Grafana with Postgres has a different sweet spot than Microsoft Power BI with SQL Server, both in terms of the optimal data model and of the user experience for defining the graphical visualisation.

We then need an adapter that translates streams of events (complex facts with a timestamp) into those database rows. This typically involves splitting the data by workstation, manufacturing order, user name, etc., and then aggregating the history of events for each such entity, so that we know, for example, when a particular manufacturing step was assigned to a given workstation (the plan rarely stays the same from start to finish). These derived pieces of knowledge are then joined together to compare what should have happened with what actually happened, and to compute KPIs that measure how well the plan matched reality and where the deviations originated.

Writing this adapter is not yet as easy as it could be (we’ll get back to this point later): we use the Differential Dataflow framework from Microsoft Research, in its implementation for the Rust language by Frank McSherry, which is available under the MIT open-source license. This library solves the problem of figuring out which database rows in the output need to change when new events are added on the input side. It does this by giving the programmer a language to describe the computation (with transformations, filters, splitting, joining, aggregation, etc.) and turning that into a software machine that accepts inputs and produces the corresponding database changes as rows to be inserted or removed (so-called deltas).
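
As a minimal sketch of how this looks with the differential-dataflow crate (the event shape and the aggregation are simplified assumptions, not our production code):

```rust
// Cargo.toml: timely and differential-dataflow as dependencies
use differential_dataflow::input::InputSession;
use differential_dataflow::operators::Reduce;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        // input events: (workstation, interruption duration in seconds)
        let mut events = InputSession::<u64, (String, i64), isize>::new();

        let probe = worker.dataflow::<u64, _, _>(|scope| {
            events
                .to_collection(scope)
                // aggregate per workstation; updates incrementally on new input
                .reduce(|_workstation, input, output| {
                    let total: i64 = input.iter().map(|&(d, c)| *d * c as i64).sum();
                    output.push((total, 1));
                })
                // each change arrives as a delta: +1 inserts a row, -1 retracts one
                .inspect(|(row, _time, diff)| println!("{:?} delta {}", row, diff))
                .probe()
        });

        events.insert(("WS-1".to_owned(), 300));
        events.insert(("WS-1".to_owned(), 120));
        events.advance_to(1);
        events.flush();
        worker.step_while(|| probe.less_than(events.time()));
    })
    .expect("computation failed");
}
```

The +1/−1 pairs printed by `inspect` are exactly the insert/remove deltas mentioned above; feeding more events later would retract the old total for `WS-1` and insert the new one.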

ActyxOS allows event consumers like this program to keep track of which events have already been ingested, and our dataflow library allows this information to be propagated from inputs to outputs, so that we can make the changes to the database within the same transaction that records which events have been processed so far. Whenever the process needs to be restarted, it can pick up right where it left off, without losing or duplicating data.
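
A sketch of this transactional pattern using the `postgres` crate (table and column names are illustrative assumptions):

```rust
use postgres::{Client, Error};

/// Applies one batch of deltas and advances the stored offset atomically.
/// If the process crashes before commit, neither the rows nor the offset
/// change, so the same batch is simply re-applied after a restart.
fn apply_batch(
    client: &mut Client,
    stream: &str,
    new_offset: i64,
    totals: &[(String, i64)], // (workstation, total interruption seconds)
) -> Result<(), Error> {
    let mut tx = client.transaction()?;
    for (workstation, secs) in totals {
        tx.execute(
            "INSERT INTO interruption_totals (workstation, total_secs) \
             VALUES ($1, $2) \
             ON CONFLICT (workstation) DO UPDATE SET total_secs = EXCLUDED.total_secs",
            &[workstation, secs],
        )?;
    }
    // recording the consumed offset in the same transaction is what makes
    // the pipeline restartable without losing or duplicating data
    tx.execute(
        "INSERT INTO stream_offsets (stream, last_offset) VALUES ($1, $2) \
         ON CONFLICT (stream) DO UPDATE SET last_offset = EXCLUDED.last_offset",
        &[&stream, &new_offset],
    )?;
    tx.commit()
}
```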

What does this gain us?

Before we implemented this event analysis pipeline, we performed a lot of the analysis work within the database and within Power BI, leading to substantial operating costs for a conceptually simple feature: showing a few dashboards with performance metrics on the factory shop floor at a 30-second refresh cycle.

The approach described above focuses on the essentials and reduced our Power BI cost by a factor of 10–15 (depending on the customer). It also allows us to show demanding real-time dashboards based solely on embedded devices on the factory shop floor, without any cloud involvement; this makes the dashboards reliable enough to be used for mission-critical functionality, a feat that is hard or impossible to achieve when serving the dashboards from the cloud.

If you want to know more about how we did this, check out the article on our developer blog, including the full source code.

Future improvements

As mentioned above, we currently rely on the Rust programming language for formulating the event analysis process; Rust is not widely known outside certain embedded and (distributed) web communities. Even though Microsoft recently started promoting its use for safe systems programming, it is not the perfect language for describing data transformation and aggregation pipelines, although it works really well once you are over the initial hurdle of learning a new language.

It would serve the third point on our wish list (the ability for production experts to code the logic) if that logic were written in a widely used, data-centric language like SQL with certain extensions. Exactly that is the goal of Materialize, the company that is driving Differential Dataflow forward for this reason. Their current product can translate SQL queries into code that is then executed by the same Rust library we use.

If Materialize solves our problem, why are we not using it? We very likely will, as soon as the following main obstacles are removed. First, Materialize does not yet store which inputs it has already consumed and which outputs those resulted in, so after each restart it needs to re-read all event streams from the very beginning, which takes time. Second, transforming causally linked streams of events into independent data rows that can then be joined and transformed further requires code that SQL cannot express well, so Materialize will need to offer extensions to SQL to make this processing step efficient to write, read, and execute. The first problem is already being actively worked on, with progress promised for the upcoming 0.4 release, while the second may take longer to solve (adding user-defined functions to the query syntax).
