Predict, don’t react

Why predictive monitoring is essential for maximizing uptime in semiconductor manufacturing

Cost of downtime related to IT systems

Semiconductor manufacturing is complex: it requires the delicate orchestration of automation systems running on substantial hardware, operating systems, memory, and applications. Both the capital cost of building and operating a fab and the margin pressure in the industry are very high, so any unplanned downtime is costly. A single unplanned stop can ripple through wafer starts, work in progress (WIP), and yield curves, burning cash and disrupting schedules. Manufacturers report common problems and needs, including:

  • Manufacturing software systems are great until something stops working.
  • Unplanned downtime has a significant business impact.
  • System downtime affects production.
  • There is a need for forecasting systems to anticipate and alert to problems.
  • A system that can rapidly address problems as they occur is highly desired.

Several studies quantify the impact of unplanned downtime as a function of fab volume (Figure 1).

Figure 1: The cost of downtime based on fab volume.
The following real-world use case demonstrates the type of financial impact possible:
  • The situation – A bumping site using automation apps connected to 270 tools crashed and operation stopped for six hours.
  • The outcome – By industry standards, the economic cost of downtime for bumping tools is ~$500 per tool per hour. With 270 tools down for six hours, the loss came to about $810,000 (270 × $500 × 6).
  • The solution – This crash and resulting economic loss could have been prevented if the site had implemented a reliable and cost-effective alert system to warn of impending issues.
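
The arithmetic behind that figure can be checked with a short Python sketch. It assumes the ~$500/hour rate applies to each of the 270 connected tools, which is the reading that reproduces the $810,000 total; the function and parameter names are illustrative only.

```python
def downtime_cost(tools: int, cost_per_tool_hour: float, hours: float) -> float:
    """Estimated loss when `tools` tools sit idle for `hours` hours."""
    return tools * cost_per_tool_hour * hours


# Figures from the use case: 270 tools, ~$500 per tool-hour, six hours.
loss = downtime_cost(tools=270, cost_per_tool_hour=500, hours=6)
print(f"${loss:,.0f}")  # prints $810,000
```

Changing any one input shows how quickly the loss grows with outage length or tool count.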

From monitoring to observability

Modern tools extend beyond metrics, logs, and traces. OpenTelemetry has added continuous profiling and is stabilizing semantic conventions so teams can correlate signals consistently across diverse stacks. AI/ML is expected to add value as well: rather than replacing existing capabilities, it can help accelerate root cause analysis (RCA) and provide intelligent alerting amid cost pressures and the need to prove ROI.

For OpenTelemetry to succeed, applications must follow its core standards. These include:

1. Signals Supported

  • Metrics: Quantitative measurements (e.g., CPU usage, latency, and throughput).
  • Logs: Event records with context.
  • Traces: Distributed transaction spans for request flows.
  • Profiles: Continuous profiling for CPU/memory usage.

2. Semantic Conventions

  • Standardized naming and attributes for telemetry data (e.g., http.request.method and db.system.name, which replace the older http.method and db.system).
  • Ensure consistency across vendors and tools.

3. OTLP (OpenTelemetry Protocol)

  • A vendor-neutral protocol for exporting telemetry data.
  • Supports gRPC and HTTP for efficient transport.
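
To make the value of shared naming concrete, here is a minimal Python sketch that uses only the standard library. It is not the OpenTelemetry SDK: the ALIASES table, the app-specific key names, and the normalize helper are hypothetical. Only the target names (service.name, host.name, http.request.method) are taken from the OpenTelemetry semantic conventions.

```python
# Hypothetical mapping from app-specific keys to shared attribute names.
ALIASES = {
    "app": "service.name",
    "hostname": "host.name",
    "verb": "http.request.method",
}


def normalize(attributes: dict) -> dict:
    """Rename known app-specific keys; leave unknown keys untouched."""
    return {ALIASES.get(key, key): value for key, value in attributes.items()}


# Two hypothetical apps reporting the same facts under different keys.
from_app_a = {"app": "dispatcher", "hostname": "fab-srv-01", "verb": "GET"}
from_app_b = {
    "service.name": "dispatcher",
    "host.name": "fab-srv-01",
    "http.request.method": "GET",
}

# After normalization the two records can be correlated directly.
assert normalize(from_app_a) == normalize(from_app_b)
```

Once every source emits the same attribute names, a single query or alert rule can cover all of them.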

In fabs, IT focuses on security, standardization, and scale. Convergence with operational technology (OT) brings rich process data to enterprise systems and modern tools down to the floor, improving visibility and responsiveness and helping avoid unplanned downtime. However, cultural and tooling gaps persist in the form of siloed data, legacy protocols, and finger-pointing during incidents. Teams need a universal translator, in the form of shared telemetry and standardized events, to unify workflows.

Modern observability must handle the data streams carried in logs, traces, and process metrics. The ability to ingest these from an integrated automation stack, combined with pre-defined rule-based knowledge, is of immense value.

AI/LLM in observability: assistive RCA

GenAI is filling gaps by enabling natural language queries (“What changed before the spike?”), accelerating RCA, and helping teams manage cost and complexity, but it remains an assistant, not a replacement for engineers. If the application errors in your automation stack are well known and well defined, a significant acceleration of RCA is within reach. Standardizing log format, ingestion, and shipping, and converting unstructured data to structured data, will accelerate it further.
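
As one illustration of moving from unstructured to structured logs, the Python sketch below parses a free-text line into a JSON record. The log layout, the field names, and the sample line are invented for the example; a real automation stack will have its own formats.

```python
import json
import re

# Assumed layout: "<date> <time> <LEVEL> <component>: <message>"
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) "
    r"(?P<component>[\w.-]+): (?P<message>.*)"
)


def to_structured(line: str) -> dict:
    """Return the named fields of a log line, or mark it as unparsed."""
    match = LINE_PATTERN.fullmatch(line.strip())
    if match is None:
        return {"unparsed": line.strip()}
    return match.groupdict()


# Hypothetical log line from an automation application.
sample = "2025-01-15 08:30:12 ERROR equipment-adapter: connection to tool lost"
record = to_structured(sample)
print(json.dumps(record))
```

With fields such as level and component separated out, both rule-based alerts and an AI assistant can filter and correlate events without guessing at the text layout.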

How does SmartFactory Monitor tackle manufacturing systems monitoring issues?

SmartFactory Monitor is real-time monitoring software that helps detect issues and act before they affect production systems. It offers a customizable dashboard for viewing system status, quick notifications for problems, and options to display performance trends. With built-in predictive analytics, it can identify anomalies early. The software runs efficiently on modest hardware, supports various operating systems, and integrates with other SmartFactory products. Furthermore, an adapter is available for third-party applications, allowing an enterprise-level monitoring system or other applications of choice to connect directly to SmartFactory Monitor.
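
This article does not describe how SmartFactory Monitor's predictive analytics work, so the following is only a generic illustration of the "predict, don't react" idea, not the product's method. It fits a straight line to recent readings of a metric and estimates when the metric will cross a limit. The metric (disk usage), the sample values, and the function name are hypothetical.

```python
from typing import Optional, Sequence


def hours_until_limit(samples: Sequence[float], limit: float) -> Optional[float]:
    """Least-squares linear fit over evenly spaced hourly samples.

    Returns the estimated hours from the last sample until `limit` is
    reached, or None if there are too few samples or no upward trend.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing = (limit - intercept) / slope  # hour index where the fit hits limit
    return max(crossing - (n - 1), 0.0)


# Hypothetical hourly disk usage (%), growing by 2 points per hour.
disk_used_pct = [70.0, 72.0, 74.0, 76.0, 78.0]
eta = hours_until_limit(disk_used_pct, limit=90.0)
print(f"Estimated hours until 90% full: {eta}")  # prints 6.0
```

An alert raised when the estimate drops below a chosen lead time gives staff a window to act before the limit is actually hit, rather than after.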

In the not-too-distant future, SmartFactory Monitor will be able to integrate OpenTelemetry data from various SmartFactory apps and provide a unified monitoring approach for both legacy and containerized environments. This will ensure customers experience a seamless transition without losing access to familiar data. Aggregating telemetry from multiple sources (e.g., OpenTelemetry, Prometheus, Zabbix) minimizes the need for customers to adopt new tools or interfaces, which in turn improves usability and accelerates adoption.

Conclusion

Fabs are becoming observable systems: every state transition, network timing window, and microservice span contributes to the overall picture of IT system reliability. Unifying IT and OT telemetry, aided by assistive AI, will likely improve uptime even further, making the process intentional rather than incidental. SmartFactory Monitor is well positioned to address these challenges with run-time monitoring and real-time alerting, so IT personnel can take appropriate corrective action. It also enables better planning of the IT systems needed to support high-volume, high-fidelity manufacturing in the semiconductor industry.

If you’re ready to rethink how your fab handles monitoring and observability, reach out.

About the Author

Yoram Barak, Global Product Manager
Prior to joining the Applied Materials Automation Products Group in 2020, Yoram was a Global Marketing Manager in BASF's Human Nutrition business division and, before that, an Innovation Manager for the Biosciences R&D Division at BASF. Yoram earned his PhD in Animal Sciences from the Hebrew University of Jerusalem and has specialized in biotechnology throughout his career.