Engineering Document (P&ID) Digitization

Oscar Fimbres



P&IDs are ubiquitous in the manufacturing industry. As more major players in the industry work toward agile, digitally connected factories, the manual process of digitizing these diagrams has become a major roadblock when onboarding new customers and factories.

In a recent project, we worked with a major customer in the manufacturing industry to help them build a solution that digitizes engineering diagrams, specifically P&IDs.

In their existing setup, the arduous manual mapping of information from P&ID sheets poses a significant challenge. This process is time-consuming, taking on the order of 3-6 months, and is prone to errors. Its quality depends on the expertise of the domain experts and often requires multiple rounds of review. While the solution doesn’t remove the process, it accelerates the initial ingest and digitization of the diagrams.

Their goal was to automate the identification of symbols, text, and connections within the P&ID. By representing these diagrams as knowledge graphs, we were able to leverage advanced algorithms for tasks like finding optimal routes, detecting system cycles, computing transitive closures, and more. The solution was predominantly based on Azure ML and Azure AI Document Intelligence.

One of the challenges of the project was defining a methodology to detect and interpret these components on P&ID sheets. Fortunately, two research papers, “Digitization of chemical process flow diagrams using deep convolutional neural networks” and “Digitize-PID: Automatic Digitization of Piping and Instrumentation Diagrams”, provided insights that laid the foundation of our solution. Additionally, the latter paper provides a synthetic dataset of 500 P&ID sheets with corresponding annotations that we could use for model training.

In this article, we describe how we solved this customer problem, covering the proposed method and the end-to-end solution. Before we dive in, let’s set the context.

What is P&ID?

P&ID stands for Piping and Instrumentation Diagram. P&IDs are commonly used in the process plant industry. A P&ID is a schematic illustration of the plant that shows the interconnection of the process equipment, the instrumentation used to control the process, and the flow of fluid and control signals.

In a P&ID, there are four main types of symbols: equipment, piping, instrumentation, and connectors.

  1. Equipment Symbols: These symbols represent the process equipment used in a plant or process. For example: Vacuum Pumps, Compressors, Heat exchangers.
  2. Piping Symbols: These symbols represent the various types of pipelines and process flow lines that transport fluids or materials through the process. For example: Valves, Reducers, Flanges.
  3. Instrumentation Symbols: These symbols represent the various types of instruments used to monitor and control the process parameters. For example: Flow meters, Pressure gauges, Temperature sensors.
  4. Connector Symbols: Also called interconnector references, these symbols show the connections between pipes and instruments/equipment. They can be present on the same sheet or on different sheets of the P&ID.


Figure 1. Key elements of a P&ID (source)

What does it mean to digitize a P&ID?

P&IDs are typically produced by engineering and design companies, or provided by equipment manufacturers, and are commonly distributed in image format to safeguard intellectual property or contractual obligations for newly constructed facilities. Digitizing a P&ID involves converting the image format of the P&ID into a digital format – for us, this was a graph. This process entails recognizing symbols from the image, extracting relevant text and information needed to construct connection relationships between the symbols, and generating a structured dataset that can be interpreted by software systems.

Let’s go deeper on how we implemented each of these digitization steps and look at some of the challenges.

Proposed Implementation

Symbol Detection Module

Symbol Detection Challenges

  • The complexity of P&ID symbols can vary greatly, for example among equipment symbols. This variability makes it difficult to build a generalizable digitization process.
  • The shape of the different types of symbols may be similar, making it hard to distinguish in images with low resolution.

Object Identification Model Trained to Recognize 50+ symbols

During our exploration path, we realized the need to use an object identification model due to the large variety of symbols that a designer can use in a P&ID.

We started training our own symbol detection model using Azure Custom Vision, but were limited by the small number of algorithms it offers. To overcome this, we relied on AutoML Image Object Detection in Azure ML, which supports several algorithms, and trained the model with YOLOv5.

Luckily, we found a synthetic dataset (from the “Digitize-PID: Automatic Digitization of Piping and Instrumentation Diagrams” research paper) that covers 50+ symbols in a P&ID. It includes sample images in JPEG format, with label annotations and bounding boxes for each piece of text and symbol in the image.

With this dataset available, we needed to prepare the input data for the AutoML training job. To facilitate this, we developed a Python module that handles the data transformation: unzipping the dataset zip file, reading the label annotations in NPY format, converting them into JSONL annotations, and lastly uploading them to the blob storage used for training data.

If you want to recognize new types of symbols, you can feed the model new training data (e.g. by labeling new symbols with the Azure ML Data Labeling tool). However, it’s important to note that this requires manual effort and a sufficient amount of data to recognize the new symbols effectively. This is where the automated training pipeline comes into the picture. We have developed an automated training pipeline that will generate a whole new model.
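To make the conversion step more concrete, here is a minimal sketch that turns pixel-space bounding boxes into one JSONL record with normalized coordinates, which is the general shape AutoML image object detection expects as we understand it. The function name `to_jsonl_record`, the input tuple layout, and the example datastore URL are assumptions for illustration; the actual NPY layout of the dataset and the module in the repository may differ.

```python
import json


def to_jsonl_record(image_url, width, height, boxes):
    """Build one JSONL line for a single image.

    boxes: iterable of (label, x_min, y_min, x_max, y_max) in pixels.
    This tuple layout is assumed; it is not the dataset's NPY layout.
    """
    labels = []
    for name, x_min, y_min, x_max, y_max in boxes:
        labels.append({
            "label": name,
            # Coordinates are normalized to the 0..1 range.
            "topX": x_min / width,
            "topY": y_min / height,
            "bottomX": x_max / width,
            "bottomY": y_max / height,
            "isCrowd": 0,
        })
    record = {
        "image_url": image_url,
        "image_details": {
            "format": "jpg",
            "width": f"{width}px",
            "height": f"{height}px",
        },
        "label": labels,
    }
    return json.dumps(record)


# Example with a made-up image URL and a single symbol annotation.
line = to_jsonl_record(
    "azureml://datastores/training/paths/images/0.jpg",  # placeholder URL
    1000,
    500,
    [("gate_valve", 100, 50, 200, 100)],
)
print(line)
```

Writing one such line per image produces the annotation file that is uploaded next to the images.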

Symbol detection output

Figure 2. Detected symbols along with their type

Moreover, there’s potential for using the outputs of symbol detection inference on real P&IDs as training data (active learning). Keep in mind this would require manual validation and gathering customer permissions.

Automated Training Pipeline

The automated training pipeline executes a series of steps to train the Object Identification Model.


Figure 3. Training pipeline workflow

As Figure 3 shows, this workflow starts by pulling the training data (images and label annotations) from blob storage and aggregating the annotations into a single file. The data is then stratified and split for training and validation. This step is critical because it ensures that the labels are properly distributed, so that the model is not biased. Next, the pipeline registers the dataset for traceability and kicks off the AutoML Image Object Detection job. Once the job finishes, it produces a new model, which is registered and tagged as “best model” if the recall metric improves.
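The split and the “best model” decision could look roughly like the sketch below. It is not the pipeline's actual code: stratifying each image by its rarest label is one simple heuristic we assume here, and `stratified_split` and `is_new_best` are hypothetical names.

```python
import random
from collections import Counter, defaultdict


def stratified_split(image_labels, val_fraction=0.2, seed=0):
    """Split image ids into train/validation lists.

    image_labels: dict of image id -> list of label names in that image.
    Each image is grouped by its rarest label (an assumed heuristic), and
    every group contributes the same fraction to the validation set.
    """
    counts = Counter(l for labels in image_labels.values() for l in labels)
    groups = defaultdict(list)
    for image_id, labels in sorted(image_labels.items()):
        if labels:
            key = min(labels, key=lambda l: (counts[l], l))
        else:
            key = "__unlabeled__"
        groups[key].append(image_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = [], []
    for key in sorted(groups):
        ids = groups[key]
        rng.shuffle(ids)
        n_val = max(1, round(len(ids) * val_fraction)) if len(ids) > 1 else 0
        val.extend(ids[:n_val])
        train.extend(ids[n_val:])
    return train, val


def is_new_best(new_recall, best_recall):
    """Tag a model as best only when recall improves (or no model exists)."""
    return best_recall is None or new_recall > best_recall


labels = {f"img{i}": ["valve"] for i in range(10)}
labels.update({f"img{i}": ["pump", "valve"] for i in range(10, 20)})
train_ids, val_ids = stratified_split(labels)
print(len(train_ids), len(val_ids))
```

Grouping by the rarest label keeps infrequent symbols represented in both sets, which is the property the stratification step is meant to protect.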

Now that we have a model ready, this can be deployed as a managed endpoint and is ready to be consumed.

Text Detection Module

Text Detection Challenges

  • Symbols can be crowded, making it hard to clearly associate text with symbols
  • The OCR detection process may result in invalid or missing characters, especially if the resolution quality is poor. For instance, valid asset tags used in engineering drawings, like 3/4” x 1/8” or PS-12345, might not be recognized correctly.

Image Pre-processing Strategies

To make OCR processing more efficient and reduce color variation in the image, we applied a simple pre-processing strategy in this project: grayscale conversion and binarization.

We also investigated image tiling as a pre-processing option, to see whether running OCR on smaller image subsets boosted accuracy. We didn’t see any significant accuracy benefit, so we kept things simple and used a single OCR pass.
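As an illustration of what these two operations do, here is a dependency-free sketch on a tiny image represented as nested lists of RGB tuples. The fixed threshold of 128 is an assumption, and a real pipeline would normally use an image library rather than plain Python loops.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) pixels to 0..255 gray values.

    Uses the common ITU-R BT.601 luma weights.
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


def binarize(gray_image, threshold=128):
    """Map each gray value to 255 (white) or 0 (black).

    The threshold value is an assumed default, not a project setting.
    """
    return [[255 if v >= threshold else 0 for v in row] for row in gray_image]


image = [
    [(255, 255, 255), (0, 0, 0)],
    [(200, 30, 30), (250, 250, 0)],
]
binary = binarize(to_grayscale(image))
print(binary)
```

After this step every pixel is either ink or background, regardless of the colors used in the original drawing.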

Azure AI Services for OCR

The text recognition logic in this module is powered by Azure AI Document Intelligence, which also exposes APIs for Optical Character Recognition (OCR). This OCR service is optimized for large, text-heavy documents and engineering diagrams like P&IDs, and it includes a high-resolution extraction capability that we used to recognize small text in large documents. The quality of the text detection was very good.

Initially we explored a general OCR service using Azure AI Vision. Unfortunately, due to the nature of P&ID documents, its confidence when detecting complex text in high-resolution images was not acceptable (e.g. it confused measurement symbols). Therefore, this approach was not pursued further.

Based on our project requirements, the relevant results include each piece of text recognized, the bounding box coordinates of that text in the image, and the confidence score of the OCR results; these can be further analyzed to associate text with symbols. OCR via Azure AI Document Intelligence is able to understand the content, layout, style, and semantic elements of the image analyzed; although we didn’t utilize this as part of this project, we noticed opportunities for future extension, such as extraction of structured text (e.g. tables).
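To show how these three outputs (text, bounding box, confidence) can feed the text-to-symbol association, here is a minimal sketch. It does not call any Azure API; the input tuples, the `associate_text` name, and both thresholds are assumptions chosen for illustration.

```python
import math


def box_gap(a, b):
    """Shortest distance between two boxes (x_min, y_min, x_max, y_max).

    Returns 0.0 when the boxes touch or overlap.
    """
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


def associate_text(symbols, texts, min_confidence=0.8, max_distance=50.0):
    """Attach each confident OCR result to its nearest symbol.

    symbols: dict of symbol id -> box
    texts: list of (text, box, confidence)
    """
    result = {symbol_id: [] for symbol_id in symbols}
    for text, box, confidence in texts:
        if confidence < min_confidence:
            continue  # drop low-confidence OCR results
        nearest = min(symbols, key=lambda s: box_gap(symbols[s], box), default=None)
        if nearest is not None and box_gap(symbols[nearest], box) <= max_distance:
            result[nearest].append(text)
    return result


symbols = {"valve-1": (100, 100, 140, 140), "pump-1": (400, 100, 460, 160)}
texts = [
    ("PS-12345", (105, 145, 135, 155), 0.95),
    ('3/4"', (390, 165, 420, 175), 0.90),
    ("??", (100, 100, 110, 110), 0.30),       # filtered: low confidence
    ("NOTE", (900, 900, 950, 910), 0.99),     # filtered: too far away
]
print(associate_text(symbols, texts))
```

In crowded areas a pure nearest-box rule can pick the wrong symbol, which is exactly the challenge listed above, so the distance threshold needs tuning per drawing style.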

Text detection output

Figure 4. Detected symbols with their corresponding tag

Line Detection Module

Line Detection Challenges

  • Determining the different types of lines, such as dotted and solid lines, can be difficult due to variations in line style, such as thickness and the separation between lines.

Image Preprocessing Strategies

From the outset, we understood that running a line detection algorithm could be time-consuming. This duration is dependent on the image size and the number of lines drawn on the P&ID. Also, it’s important to note that the image may contain a significant amount of “noise” – elements we’re not interested in, such as text, symbols, legends on the right side, and the outer box.

To reduce processing time, we considered focusing on the area of interest by cropping the image and removing extraneous elements like symbols and text as detected in the previous steps.

Additionally, we noticed that line detection tends to produce extra lines if the line thickness is large. Therefore, the algorithm operates more efficiently when the lines are thinner. As a result, we applied the Zhang-Suen thinning algorithm as part of the preprocessing.
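For reference, the classic Zhang-Suen algorithm fits in a few lines of plain Python. This sketch works on a small grid of 0/1 values and assumes a background border of at least one pixel; a production pipeline would more likely rely on an optimized library implementation.

```python
def zhang_suen_thinning(image):
    """Thin a binary image (rows of 0/1) to one-pixel-wide strokes."""
    img = [row[:] for row in image]  # work on a copy
    rows, cols = len(img), len(img[0])

    def neighbours(r, c):
        # P2..P9: clockwise, starting from the pixel directly above.
        return [img[r - 1][c], img[r - 1][c + 1], img[r][c + 1],
                img[r + 1][c + 1], img[r + 1][c], img[r + 1][c - 1],
                img[r][c - 1], img[r - 1][c - 1]]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):  # the two sub-iterations of the algorithm
            to_clear = []
            for r in range(1, rows - 1):
                for c in range(1, cols - 1):
                    if img[r][c] != 1:
                        continue
                    p = neighbours(r, c)
                    p2, p4, p6, p8 = p[0], p[2], p[4], p[6]
                    b = sum(p)  # number of foreground neighbours
                    # a: number of 0 -> 1 transitions around the pixel
                    a = sum(1 for i in range(8)
                            if p[i] == 0 and p[(i + 1) % 8] == 1)
                    if step == 0:
                        ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if 2 <= b <= 6 and a == 1 and ok:
                        to_clear.append((r, c))
            # Pixels are removed only after the whole pass is evaluated.
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
    return img


# A 3-pixel-thick horizontal bar inside a 5 x 9 image.
bar = [[1 if 1 <= r <= 3 and 1 <= c <= 7 else 0 for c in range(9)]
       for r in range(5)]
for row in zhang_suen_thinning(bar):
    print("".join("#" if v else "." for v in row))
```

The thick bar collapses onto its center row, so a later line detector sees a single stroke instead of several parallel candidates.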

Considering the numerous filters and preprocessing strategies that can be time-consuming, we decided to use a background job to handle all this processing.

Hough Transform

We explored different line detection techniques from research papers, such as Standard Hough Transform, Contour Tracing and Pixel Search.

Since our requirements were limited to detecting continuous horizontal and vertical lines, the Standard Hough Transform fit our needs in terms of quality. The main difficulty was finding the right set of parameters to detect the line segments properly on P&IDs. For example, the “hough_theta” variable is a good candidate to tune if diagonal lines aren’t being detected properly in the image.
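To illustrate why the angular resolution matters, here is a small, dependency-free sketch of the Standard Hough Transform voting scheme. It is only meant to show the idea; `theta_step` is a hypothetical parameter playing the role we assume “hough_theta” has, and `min_votes` is an assumed threshold.

```python
import math
from collections import Counter


def hough_lines(points, theta_step=math.pi / 2, min_votes=5):
    """Return (rho, theta) pairs that collect at least min_votes votes.

    points: iterable of (x, y) foreground pixel coordinates.
    Lines are parametrized as rho = x*cos(theta) + y*sin(theta), so
    theta = 0 matches vertical lines and theta = pi/2 horizontal ones.
    """
    n_angles = max(1, round(math.pi / theta_step))
    thetas = [i * theta_step for i in range(n_angles)]

    votes = Counter()
    for x, y in points:
        for i, theta in enumerate(thetas):
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(rho, i)] += 1

    return sorted((rho, thetas[i]) for (rho, i), n in votes.items()
                  if n >= min_votes)


# One horizontal line at y = 3 and one vertical line at x = 7.
pixels = [(x, 3) for x in range(10)] + [(7, y) for y in range(10)]
print(hough_lines(pixels))
```

With a step of pi/2 only horizontal and vertical candidates are evaluated; a smaller step such as pi/180 also votes for diagonal orientations, at the cost of more computation and more spurious candidates.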

Line detection output

Figure 5. Detected lines

Graph Construction

Graph Construction Challenges

  • The connections can be complicated due to intersecting lines at different angles.
  • The direction of arrows is unknown, as the symbol detection inference was unable to indicate the orientation.

Representation of the Graph

This step, where we spent a significant amount of time, involves dealing with the unknown relationships between objects. The first crucial task was to model the detected objects as nodes within the graph. We took heavy inspiration from the paper.

Based on our requirements for graph modeling, we consider all symbols and lines detected as nodes. Text is not considered as a node here since it is already part of the symbol information and not relevant for the knowledge graph.

Key packages such as NetworkX and Shapely were instrumental in our process. NetworkX facilitates the construction of the graph in memory, while Shapely assists in computing distances between objects.


An object connection graph is created using the bounding boxes of the symbols and the coordinates of the starting and end points of the lines. We defined this process in four steps.

In the first step, we preprocess the detected objects. This is necessary because the symbols and lines in the detected objects may not perfectly match those in the original diagram. To address this, we do line preprocessing by extending the length of each line with a small buffer. This helps to ensure that lines connect as intended even if the detection wasn’t perfect; it essentially gives us a margin of error. We also do text preprocessing by removing all text outside the main content area of the P&ID. This reduces noise and potential confusion from extraneous text that isn’t relevant.

In the second step, we establish proximity matching for the start and end points of lines. A line’s proximity can connect it to either a symbol or another line. However, if the connecting edge is a line, the situation can become complex. This is because we have to consider cases of three-way junction fittings or four-way junctions, where a line can branch off.

In the third step, we connect lines with the closest elements to ensure the connections are established correctly.

In the fourth step, we establish proximity matching between arrows and lines. This was a challenging problem. Although we were able to detect the arrows using symbol detection information, the orientation of the arrows remained unknown. We discovered that by using heuristics of lines and the intersection points of lines and arrow symbol bounding box, we could predict the position and thus establish the connection.
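The first three steps (extending lines by a buffer, proximity matching of endpoints, and connecting to the closest element) can be sketched as follows. This is a simplified, dependency-free illustration limited to line-to-symbol connections; the function names, the buffer, and the distance threshold are assumptions, and a geometry library such as Shapely offers equivalent distance operations.

```python
import math


def extend_line(line, buffer=5.0):
    """Lengthen a segment by `buffer` pixels at both ends."""
    (x1, y1), (x2, y2) = line
    length = math.hypot(x2 - x1, y2 - y1)
    if length == 0:
        return line
    ux, uy = (x2 - x1) / length, (y2 - y1) / length
    return ((x1 - ux * buffer, y1 - uy * buffer),
            (x2 + ux * buffer, y2 + uy * buffer))


def point_to_box_distance(point, box):
    """Distance from a point to a box (x_min, y_min, x_max, y_max)."""
    x, y = point
    dx = max(box[0] - x, 0.0, x - box[2])
    dy = max(box[1] - y, 0.0, y - box[3])
    return math.hypot(dx, dy)


def closest_symbol(point, symbols, max_distance=10.0):
    """Return the id of the nearest symbol within max_distance, or None."""
    best_id, best_distance = None, max_distance
    for symbol_id, box in symbols.items():
        distance = point_to_box_distance(point, box)
        if distance <= best_distance:
            best_id, best_distance = symbol_id, distance
    return best_id


def connect_line(line, symbols, buffer=5.0, max_distance=10.0):
    """Match both endpoints of the extended line to their closest symbol."""
    start, end = extend_line(line, buffer)
    return (closest_symbol(start, symbols, max_distance),
            closest_symbol(end, symbols, max_distance))


symbols = {"pump-1": (0, 40, 50, 60), "valve-1": (150, 40, 180, 60)}
print(connect_line(((58, 50), (143, 50)), symbols))
```

Junctions between lines need the same proximity test applied line-to-line, which is where the three-way and four-way cases mentioned above come in.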

Graph Traversal

Graph Traversal Challenges

  • Not every line edge may contain an arrow symbol. This can halt the graph traversal or lead to incorrect traversal.

Flow Direction Propagation

In the previous step, we used the proximity between arrows and lines to determine the flow direction. However, a challenge we encountered in the P&IDs is that some lines belonging to the same process flow line do not have an arrow indicating their direction. This could potentially lead to incorrect traversal paths.

To address this issue, we need to propagate the flow direction from the lines with arrows to the lines without arrows, as long as they are part of the same process flow line. This way, we can ensure that all lines on the process flow line have the correct direction between the connected equipment, connectors, and arrows.
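One possible way to express this propagation is sketched below. It is an assumed heuristic rather than the project's algorithm: a known direction is pushed forward and backward through line nodes that have exactly one other neighbor, and it stops at symbols and at junctions. All names are hypothetical.

```python
from collections import deque


def propagate_flow(neighbors, line_nodes, known):
    """Extend known flow directions along unbranched chains of lines.

    neighbors: dict of node -> set of adjacent nodes (undirected)
    line_nodes: set of nodes that are lines (others are symbols)
    known: iterable of (source, target) pairs derived from arrows
    Returns the set of all directed (source, target) pairs.
    """
    directed = set(known)
    queue = deque(directed)

    def try_orient(src, dst):
        # Never override a direction that is already set.
        if (src, dst) not in directed and (dst, src) not in directed:
            directed.add((src, dst))
            queue.append((src, dst))

    while queue:
        u, v = queue.popleft()
        # Forward: flow leaving line v continues to its only other neighbor.
        if v in line_nodes:
            others = [w for w in neighbors[v] if w != u]
            if len(others) == 1:
                try_orient(v, others[0])
        # Backward: flow entering line u came from its only other neighbor.
        if u in line_nodes:
            others = [w for w in neighbors[u] if w != v]
            if len(others) == 1:
                try_orient(others[0], u)
    return directed


# pump P1 - L1 - L2 - L3 - valve V1, with one arrow between L2 and L3.
neighbors = {
    "P1": {"L1"}, "L1": {"P1", "L2"}, "L2": {"L1", "L3"},
    "L3": {"L2", "V1"}, "V1": {"L3"},
}
print(sorted(propagate_flow(neighbors, {"L1", "L2", "L3"}, [("L2", "L3")])))
```

This sketch does not resolve conflicting arrows on the same flow line; such cases would still need to be flagged for review.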


To get the connections between the assets, we use breadth-first search (BFS). We start the traversal using a terminal asset – equipment, connectors, or instruments – and discover all neighbors. The direction of traversal is determined by the momentum, which is essentially the process flow in the graph.

The concept of momentum or process flow can be understood as the direction of the process. An asset is said to be “upstream” of another if it precedes it in the process flow, and “downstream” if it follows it. There might be cases where the direction of the process flow is unclear or ambiguous. In such cases, the direction is marked as “unknown”.

After BFS traversal, here is a graph example which includes nodes and their connections. For easier visualization, there is a color map for nodes, e.g. blue nodes represent equipment.
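A minimal BFS sketch of this traversal is shown below. It assumes the flow direction has already been resolved into a directed adjacency mapping (`successors`, a hypothetical name), walks through line nodes, and stops as soon as it reaches another asset.

```python
from collections import deque


def downstream_assets(successors, start, is_asset):
    """Return the assets directly downstream of `start`, in BFS order.

    successors: dict of node -> list of nodes that follow it in the flow
    is_asset: callable telling assets apart from line nodes
    """
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, ()):
            if nxt in seen:
                continue
            seen.add(nxt)
            if is_asset(nxt):
                found.append(nxt)   # connection found; do not go past it
            else:
                queue.append(nxt)   # keep following the line
    return found


successors = {
    "P1": ["L1"], "L1": ["L2", "L3"], "L2": ["V1"],
    "L3": ["T1"], "V1": ["L4"], "L4": ["T2"],
}
print(downstream_assets(successors, "P1", lambda n: not n.startswith("L")))
```

Running the same search over the reversed mapping would give the upstream assets, and edges whose direction is “unknown” could be handled by adding them in both directions.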

Graph Traversal Debug

Figure 6. Debug image displaying symbol connections after graph traversal

Exciting news! We’ve created a knowledge graph that’s ready to answer all your queries and dig out the information you need. But here’s the catch: it’s an in-memory graph. So next, we are going to tuck this graph into a graph database.

Graph Persistence

Database & Schema

We chose SQL Graph Database due to its simplicity: we needed a data model that was easy to understand, maintain, query and update. This works similarly to a relational database, where the nodes are entities and edges are the relationships between entities.

In the schema we’ve proposed, the central entity is “Asset”. This entity holds crucial information, including symbol connections. One of the unique features of our schema is the separation of the “Asset” and “AssetType” entities. This design allows us to efficiently query all assets belonging to a specific category.

We also included a “Connector” entity that is not explicitly supported yet, but is included in the data model for future extension.

Graph Schema

Figure 7. Knowledge P&ID proposed schema.

We chose PyODBC, due to its extensive community support, to connect to the database and insert the knowledge graph.
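As a rough sketch of what this insertion can look like, the snippet below uses SQL Server graph tables (`AS NODE`, `AS EDGE`, `$node_id`) with a deliberately simplified, hypothetical schema: a single `Asset` node table and a `ConnectedTo` edge table. The table and column names are assumptions and do not reproduce the schema in Figure 7. The function accepts any DB-API style connection, such as the one returned by `pyodbc.connect`.

```python
# Hypothetical, simplified DDL shown for context only (not executed here).
CREATE_TABLES = """
CREATE TABLE Asset (
    Id NVARCHAR(64) PRIMARY KEY,
    AssetTypeName NVARCHAR(64)
) AS NODE;
CREATE TABLE ConnectedTo AS EDGE;
"""

INSERT_ASSET = "INSERT INTO Asset (Id, AssetTypeName) VALUES (?, ?);"

# Edge rows reference the $node_id of the two assets they connect.
INSERT_EDGE = (
    "INSERT INTO ConnectedTo VALUES ("
    "(SELECT $node_id FROM Asset WHERE Id = ?), "
    "(SELECT $node_id FROM Asset WHERE Id = ?));"
)


def persist_graph(conn, nodes, edges):
    """Insert assets and their connections, then commit.

    conn: DB-API connection, e.g. pyodbc.connect(connection_string)
    nodes: dict of asset id -> asset type name
    edges: iterable of (source id, target id)
    Returns the number of inserted (assets, connections).
    """
    asset_rows = sorted(nodes.items())
    # Skip edges that point at nodes we are not inserting.
    edge_rows = [(s, d) for s, d in edges if s in nodes and d in nodes]

    cursor = conn.cursor()
    if asset_rows:
        cursor.executemany(INSERT_ASSET, asset_rows)
    if edge_rows:
        cursor.executemany(INSERT_EDGE, edge_rows)
    conn.commit()
    return len(asset_rows), len(edge_rows)


# Usage (requires a reachable database; connection_string is a placeholder):
# import pyodbc
# conn = pyodbc.connect(connection_string)
# persist_graph(conn, {"P1": "pump", "V1": "valve"}, [("P1", "V1")])
```

Parameterized statements keep asset tags such as 3/4” x 1/8” from breaking the SQL text.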

Now, our knowledge graph is persisted and can be queried at any time. It’s like having a personal librarian, always at your service!

Our End-to-end Digitization Solution for P&IDs

In a nutshell, the architecture solution is a three-pronged approach designed to streamline the digitization process: inference workflow, training and development workflow and active learning.

This end-to-end solution, designed for digitizing P&IDs, is also adaptable and versatile for other engineering diagrams. While it’s not without its limitations, it successfully detects 80% of objects, with provisions for corrections.

The imperfections of this system can mostly be addressed through threshold configuration adjustments, indicating potential for further refinement. It’s also important to note that this solution will need to be tuned to fit the characteristics of the P&IDs being digitized, which can vary greatly depending on the process used to generate them. We evaluated the solution on a mixed dataset of synthetic and real P&IDs, measuring results at the graph construction stage so that the symbol connections were taken into account.

We encourage you to explore our GitHub solutions. Check out the MLOpsManufacturing repository, which focuses on Symbol Training Pipeline, and the Digitization of Piping and Instrument Diagrams repository, which provides the inference workflow. There is a comprehensive user guide to learn more about our work and how you can contribute to its evolution.

