Engineering Document (P&ID) Digitization

Oscar Fimbres


P&IDs are ubiquitous in the manufacturing industry. As more major players work toward agile, digitally connected factories, the manual process of digitizing these diagrams has become a major roadblock to onboarding new customers and factories.

In a recent project, we worked with a major customer in the manufacturing industry to help them build a solution that digitizes engineering diagrams, specifically P&IDs.

In their existing setup, the arduous manual mapping of information from P&ID sheets poses a significant challenge. This process is time-consuming, taking on the order of 3-6 months, and is prone to errors. Its quality depends on the expertise of the domain experts and often requires multiple rounds of review. While our solution doesn’t eliminate the review process, it accelerates the initial ingestion and digitization of the diagrams.

Their goal was to automate the identification of symbols, text, and connections within the P&ID. By representing these diagrams as knowledge graphs, we were able to leverage advanced algorithms for tasks like finding optimal routes, detecting system cycles, computing transitive closures, and more. The solution was predominantly based on Azure ML and Azure AI Document Intelligence.

One of the challenges of the project was to define a methodology to detect and comprehend these different components from P&ID sheets. Fortunately, the research papers Digitization of chemical process flow diagrams using deep convolutional neural networks and Digitize-PID: Automatic Digitization of Piping and Instrumentation Diagrams provided insights that laid the foundation of our solution. Additionally, the latter paper provides a synthetic dataset of 500 P&ID sheets with corresponding annotations that we could use for model training.

In this article, we will walk through how we solved this customer problem, covering our proposed method and the end-to-end solution. Before we dive in further, let’s set the context.

What is a P&ID?

P&ID stands for Piping and Instrumentation Diagram. P&IDs are commonly used in the process plant industry. A P&ID is a schematic illustration of the plant that shows the interconnection of the process equipment, the instrumentation used to control the process, and the flow of fluids and control signals.

In a P&ID, there are four main types of symbols: equipment, piping, instrumentation, and connectors.

  1. Equipment Symbols: These symbols represent the process equipment used in a plant or process. For example: Vacuum Pumps, Compressors, Heat exchangers.
  2. Piping Symbols: These symbols represent the various types of pipelines and process flow lines that transport fluids or materials through the process. For example: Valves, Reducers, Flanges.
  3. Instrumentation Symbols: These symbols represent the various types of instruments used to monitor and control the process parameters. For example: Flow meters, Pressure gauges, Temperature sensors.
  4. Connector Symbols: Also known as interconnector references, these symbols show the connections between pipes and instruments/equipment. They can appear on the same sheet or on different sheets of the P&ID.

Key elements of a P&ID

Figure 1. Key elements of a P&ID (source)

What does it mean to digitize a P&ID?

P&IDs are typically produced by engineering and design companies, or provided by equipment manufacturers, and are commonly distributed in image format to safeguard intellectual property or contractual obligations for newly constructed facilities. Digitizing a P&ID involves converting the image format of the P&ID into a digital format – for us, this was a graph. This process entails recognizing symbols from the image, extracting relevant text and information needed to construct connection relationships between the symbols, and generating a structured dataset that can be interpreted by software systems.

Let’s go deeper on how we implemented each of these digitization steps and look at some of the challenges.

Proposed Implementation

Symbol Detection Module


  • The complexity of P&ID symbols can vary greatly (e.g. among equipment symbols). This variability makes it difficult to build a generalizable digitization process.
  • Different types of symbols may have similar shapes, making them hard to distinguish in low-resolution images.

Object Identification Model Trained to Recognize 50+ Symbols

During our exploration, we realized the need for an object identification model due to the large variety of symbols a designer can use in a P&ID.

We started training our own symbol detection model using Azure Custom Vision, but were limited by its fixed set of built-in algorithms. To overcome this, we relied on AutoML Image Object Detection from Azure ML, which allows the use of different algorithms, and trained the model with YOLOv5.

Luckily, we found a synthetic dataset (from the Digitize-PID: Automatic Digitization of Piping and Instrumentation Diagrams research paper) covering 50+ symbol types. It includes sample images in JPEG format with label annotations and bounding boxes for each piece of text and each symbol in the image.

With this dataset available, we needed to prepare the data input for the AutoML training job. To facilitate this, we developed a Python module that handles the data transformation: it unzips the dataset archive, reads the label annotations in NPY format, converts them into JSONL annotations, and finally uploads them to the blob storage associated with the training data.
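The conversion step can be sketched as follows. This is a minimal, hedged example: the exact row layout of the NPY annotations and the JSONL field names are assumptions modeled on the Azure ML AutoML object-detection format, not the project's actual module.

```python
import json
import numpy as np

def npy_to_jsonl(npy_path, image_url, image_width, image_height):
    """Convert one sheet's NPY annotations to an AutoML-style JSONL record.

    Assumes each NPY row is (x1, y1, x2, y2, label) in pixel coordinates;
    the real Digitize-PID layout may differ.
    """
    rows = np.load(npy_path, allow_pickle=True)
    labels = []
    for x1, y1, x2, y2, label in rows:
        labels.append({
            "label": str(label),
            # AutoML object detection expects coordinates normalized to [0, 1]
            "topX": float(x1) / image_width,
            "topY": float(y1) / image_height,
            "bottomX": float(x2) / image_width,
            "bottomY": float(y2) / image_height,
        })
    record = {
        "image_url": image_url,
        "image_details": {"format": "jpeg",
                          "width": image_width, "height": image_height},
        "label": labels,
    }
    return json.dumps(record)
```

Each returned string becomes one line of the JSONL file that the training job consumes after it is uploaded to blob storage.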

If you want to recognize new types of symbols, you can feed the model new training data (e.g. by labeling new symbols with the Azure ML Data Labeling tool). However, it’s important to note that this requires manual effort and a sufficient amount of data for the model to recognize the new symbols effectively. This is where the automated training pipeline comes into the picture: it generates a whole new model from the updated dataset.

Symbol detection output

Figure 2. Detected symbols along with their type

Moreover, there’s potential for using the outputs of symbol detection inference on real P&IDs as training data (active learning). Keep in mind this would require manual validation and gathering customer permissions.

Automated Training Pipeline

The automated training pipeline executes a series of steps to train the Object Identification Model.

Training pipeline workflow

Figure 3. Training pipeline workflow

As shown in Figure 3, this workflow starts by pulling the training data – images and label annotations – from blob storage and aggregating it into a single annotation file. The data is then stratified and split into training and validation sets. This step is critical, as it ensures the label data is properly distributed so that the model is not biased. The pipeline then registers the dataset for traceability and kicks off the AutoML Image Object Detection job. Once this job finishes, it produces a new model, which is registered and tagged as “best model” if the recall metric improves.

Now that we have a model ready, it can be deployed as a managed endpoint and consumed.

Text Detection Module


  • The OCR process might fail to detect some characters if the resolution quality is poor. For instance, a valid asset tag in P&ID could be 3/4" x 1/8", but the detection might only recognize it as 3/4 x 1/8, missing the inch symbol.
  • If symbols are crowded, the algorithm might incorrectly associate text with the wrong symbol.

Image Pre-processing Strategies

To enhance OCR processing efficiency and reduce color variations in an image, the pre-processing strategy implemented in this project was a simple grayscale conversion followed by binarization.
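A minimal sketch of this pre-processing in pure NumPy (the project itself may have used an imaging library such as OpenCV; the fixed threshold here is an illustrative assumption, and in practice it would be tuned or replaced with an adaptive method like Otsu's):

```python
import numpy as np

def preprocess_for_ocr(image, threshold=200):
    """Grayscale conversion followed by simple global binarization.

    `image` is an (H, W, 3) RGB uint8 array. Dark strokes map to 0 (ink)
    and light background to 255.
    """
    # Luminance-weighted grayscale conversion (ITU-R BT.601 coefficients)
    gray = (0.299 * image[..., 0]
            + 0.587 * image[..., 1]
            + 0.114 * image[..., 2])
    # Global threshold: anything darker than `threshold` is treated as ink
    binary = np.where(gray < threshold, 0, 255).astype(np.uint8)
    return binary
```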

In our investigation, we also considered image tiling as a pre-processing step, to determine whether performing OCR on smaller subsets of the image would improve accuracy. The improvement was marginal: just under 2% compared to a single OCR pass. Additionally, image tiling introduced more latency, adding approximately 200 ms per image subset. Given the negligible benefit and increased latency, along with the added complexity tiling would introduce, we opted not to implement it. A single OCR pass proved to be the most efficient and effective method.

Azure AI Services for OCR

The text recognition logic in this module is powered by Azure AI Document Intelligence, which exposes APIs for Optical Character Recognition (OCR). This OCR service is optimized for large, text-heavy documents and engineering diagrams like P&IDs, and it includes a high-resolution extraction capability that we used to recognize small text in large documents. The quality of the text detection was acceptable for our use case, with an average error rate of 10% when symbols were crowded.

Initially, we explored the general-purpose Optical Character Recognition (OCR) service in Azure AI Vision. Unfortunately, given the nature of P&ID documents, its confidence in detecting complex, high-resolution text was not acceptable (e.g. it confused measurement symbols), so this approach was not pursued further.

Based on our project requirements, the relevant results include each piece of text recognized, the bounding box coordinates of that text in the image, and the confidence score of the OCR results; these can be further analyzed to associate text with symbols. OCR via Azure AI Document Intelligence is able to understand the content, layout, style, and semantic elements of the image analyzed; although we didn’t utilize this as part of this project, we noticed opportunities for future extension, such as extraction of structured text (e.g. tables).

Text detection output

Figure 4. Detected symbols with their corresponding tag

Line Detection Module


  • Determining different types of lines, such as dotted and solid lines, requires different detection techniques due to variations in line style and factors like thickness and separation between lines.

Image Preprocessing Strategies

From the outset, we understood that running a line detection algorithm could be time-consuming. The duration depends on the image size and the number of lines drawn on the P&ID; on average, processing an image of 7168 x 4561 pixels takes about 20 seconds. It’s also important to note that the image may contain a significant amount of “noise” – elements we’re not interested in, such as text, symbols, legends on the right side, and the outer box.

To reduce processing time, we considered focusing on the area of interest by cropping the image and removing extraneous elements like symbols and text as detected in the previous steps.
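One simple way to remove those extraneous elements is to paint the bounding boxes found by the symbol and text detection steps with the background color before running line detection. This is a sketch of the idea, not the project's actual code; a real implementation would also crop to the drawing's main content area.

```python
import numpy as np

def mask_detected_objects(binary_image, boxes, background=255):
    """Blank out regions already explained by symbol/text detection.

    `boxes` are (x1, y1, x2, y2) pixel bounding boxes from the previous
    modules; painting them with the background color removes "noise"
    that the line detector would otherwise trip over.
    """
    cleaned = binary_image.copy()
    for x1, y1, x2, y2 in boxes:
        cleaned[y1:y2, x1:x2] = background
    return cleaned
```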

Additionally, we observed that line detection techniques tend to generate extra lines proportional to the line thickness. Therefore, the algorithm operates more accurately when the lines are thinner (1 pixel). As a result, we applied the Zhang-Suen thinning algorithm as part of the preprocessing.
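The Zhang-Suen algorithm iteratively peels away boundary pixels in two alternating sub-passes until every stroke is one pixel wide. A straightforward (unoptimized) implementation looks like this; production code would typically use a vectorized or library version (e.g. scikit-image's `skeletonize`):

```python
import numpy as np

def zhang_suen_thin(img):
    """Zhang-Suen thinning on a binary array (1 = ink, 0 = background)."""
    img = img.copy().astype(np.uint8)
    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_delete = []
            for y in range(1, img.shape[0] - 1):
                for x in range(1, img.shape[1] - 1):
                    if img[y, x] == 0:
                        continue
                    # 8-neighborhood, clockwise from north: P2..P9
                    p = [img[y-1, x], img[y-1, x+1], img[y, x+1],
                         img[y+1, x+1], img[y+1, x], img[y+1, x-1],
                         img[y, x-1], img[y-1, x-1]]
                    b = sum(p)  # number of ink neighbors
                    # number of 0->1 transitions around the neighborhood
                    a = sum(p[i] == 0 and p[(i + 1) % 8] == 1
                            for i in range(8))
                    if not (2 <= b <= 6 and a == 1):
                        continue
                    if step == 0 and p[0]*p[2]*p[4] == 0 and p[2]*p[4]*p[6] == 0:
                        to_delete.append((y, x))
                    elif step == 1 and p[0]*p[2]*p[6] == 0 and p[0]*p[4]*p[6] == 0:
                        to_delete.append((y, x))
            for y, x in to_delete:  # delete in parallel per sub-pass
                img[y, x] = 0
            changed = changed or bool(to_delete)
    return img
```

After thinning, a 3-pixel-thick stroke collapses to its 1-pixel centerline, which is exactly the input the line detector prefers.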

Considering the numerous filters and preprocessing strategies that can be time-consuming, we decided to use a background job to handle all this processing.

Hough Transform

We explored different line detection techniques from research papers, such as Standard Hough Transform, Contour Tracing and Pixel Search.

Since our requirements were limited to detecting only horizontal and vertical continuous lines, the Standard Hough Transform fit our needs in terms of quality. The main difficulty was finding the right set of parameters to detect the line segments properly on P&IDs; e.g. the “hough_theta” parameter is a good candidate to tune if diagonal lines aren’t being detected properly.
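When theta is restricted to 0° and 90°, the Hough parameterization rho = x·cos(theta) + y·sin(theta) collapses to the pixel's column (vertical lines) or row (horizontal lines), so the accumulator reduces to two vote histograms. This sketch illustrates that special case; the project used a full Hough implementation (e.g. OpenCV's `cv2.HoughLines`) with tunable parameters.

```python
import numpy as np

def hough_hv_lines(binary, min_votes):
    """Standard Hough Transform restricted to horizontal/vertical lines.

    Returns the rows of detected horizontal lines and the columns of
    detected vertical lines, i.e. the rho values at theta = 90 and 0
    degrees whose accumulator cells exceed `min_votes`.
    """
    ys, xs = np.nonzero(binary)
    col_votes = np.bincount(xs, minlength=binary.shape[1])  # theta = 0
    row_votes = np.bincount(ys, minlength=binary.shape[0])  # theta = 90 deg
    vertical = [int(x) for x in np.nonzero(col_votes >= min_votes)[0]]
    horizontal = [int(y) for y in np.nonzero(row_votes >= min_votes)[0]]
    return horizontal, vertical
```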

Line detection output

Figure 5. Detected lines

Graph Construction


  • The connections can be complicated due to intersecting lines at different angles.
  • The direction of arrows is unknown, as the symbol detection inference was unable to indicate the orientation.

Representation of the Graph

This step, where we spent a significant amount of time, involves dealing with the unknown relationships between objects. The first crucial task was to model the detected objects as nodes within the graph. We took heavy inspiration from the paper.

Based on our requirements for graph modeling, we considered all detected symbols and lines as nodes. Text is not modeled as a node, since it is already part of the symbol information and not relevant to the knowledge graph.

Key packages such as NetworkX and Shapely were instrumental in this process: NetworkX facilitates constructing the graph in memory, while Shapely assists in computing distances between objects.


An object connection graph is created using the bounding boxes of the symbols and the coordinates of the starting and end points of the lines. We defined this process in four steps.

In the first step, we preprocess the detected objects. This is necessary because we observed in some P&IDs the detected symbols and lines may not align perfectly due to a small pixel difference. To rectify this, we extend the length of the detected lines by a small buffer during line preprocessing. This ensures that the lines connect as intended, effectively providing a margin of error. In addition to line preprocessing, we also perform text preprocessing by eliminating all text outside the main content area of the P&ID. This step reduces noise and potential confusion from irrelevant text.
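The line-extension idea can be sketched as a small geometric helper. The buffer size is an assumed tuning parameter, and the project computed such geometry with Shapely rather than by hand:

```python
import math

def extend_line(start, end, buffer_px=5):
    """Extend a detected line segment by a small buffer at both ends.

    Compensates for the few-pixel gaps between detected lines and symbols
    so that proximity matching connects them as intended.
    """
    (x1, y1), (x2, y2) = start, end
    length = math.hypot(x2 - x1, y2 - y1)
    if length == 0:
        return start, end
    # Unit direction vector of the segment
    ux, uy = (x2 - x1) / length, (y2 - y1) / length
    return ((x1 - ux * buffer_px, y1 - uy * buffer_px),
            (x2 + ux * buffer_px, y2 + uy * buffer_px))
```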

Graph Construction Step 1

Figure 6. An horizontal line not aligning due to small pixel differences.

In the second step, we establish proximity matching for the start and end points of lines. A line’s proximity can connect it to either a symbol or another line. However, if the connecting edge is a line, the situation can become complex. This is because we must consider cases of three-way junction fittings or four-way junctions, where a line can branch off.

In the third step, we connect lines with the closest elements to ensure the connections are established correctly.

In the fourth step, we establish proximity matching between arrows and lines. This was a challenging problem. Although we were able to detect the arrows using symbol detection information, the orientation of the arrows remained unknown. We discovered that by using heuristics of the intersection points of lines and arrow symbol, we could determine the orientation of the arrows and thus establish the flow direction.
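One reading of that heuristic: the flow points from the line endpoint that touches the arrow symbol toward the center of the arrow's bounding box, snapped to the dominant axis since only horizontal and vertical lines are detected. This is a hypothetical sketch of the idea, not the project's exact rule:

```python
def arrow_direction(arrow_box, line_endpoint):
    """Infer flow direction from an arrow's bounding box and the endpoint
    of the line touching it.

    `arrow_box` is (x1, y1, x2, y2) in image coordinates (y grows
    downward); returns one of "left", "right", "up", "down".
    """
    x1, y1, x2, y2 = arrow_box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Vector from the touching line endpoint toward the arrow's center
    dx, dy = cx - line_endpoint[0], cy - line_endpoint[1]
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"
```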

Graph Traversal


  • Not every line edge may contain an arrow symbol. This can halt the graph traversal or lead to incorrect traversal.

Flow Direction Propagation

In the previous step, we used the proximity between arrows and lines to determine the flow direction. However, a challenge we encountered in the P&IDs is that some lines belonging to the same process flow line do not have an arrow indicating their direction. This could potentially lead to incorrect traversal paths.

To address this issue, we need to propagate the flow direction from the lines with arrows to the lines without arrows, as long as they are part of the same process flow line. This way, we can ensure that all lines on the process flow line have the correct direction between the connected equipment, connectors, and arrows.
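The propagation step can be sketched as a breadth-first search that lets undirected lines inherit the direction of an arrowed neighbor on the same process flow line. A simplified sketch: real P&IDs also need the line geometry to orient each hop consistently, which is omitted here.

```python
from collections import deque

def propagate_flow(adjacency, directed):
    """Propagate known flow directions to lines without arrows.

    `adjacency` maps a line id to the line ids it connects to on the same
    process flow line; `directed` maps the lines whose direction an arrow
    already fixed. Returns a direction for every reachable line.
    """
    direction = dict(directed)
    queue = deque(directed)
    while queue:
        line = queue.popleft()
        for neighbor in adjacency.get(line, []):
            if neighbor not in direction:
                # Inherit the flow direction from the arrowed side
                direction[neighbor] = direction[line]
                queue.append(neighbor)
    return direction
```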


To get the connections between assets, we use breadth-first search (BFS). We start the traversal at a terminal asset – equipment, connector, or instrument – and discover all neighbors. The direction of traversal is determined by the momentum, which is essentially the process flow in the graph.

The concept of momentum, or process flow, can be understood as the direction of the process. An asset is said to be “upstream” of another if it precedes it in the process flow, and “downstream” if it follows it. There may be cases where the direction of the process flow is unclear (or at least ambiguous); in such cases, it is marked as “unknown”.
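The traversal itself reduces to a BFS that follows the momentum direction. A minimal sketch with a plain adjacency dict (the project used NetworkX, whose `bfs_tree`/`descendants` helpers do the same job):

```python
from collections import deque

def downstream_assets(graph, start):
    """List every asset downstream of `start`, in BFS discovery order.

    `graph` maps each asset to the assets it flows into (the momentum
    direction established in the previous steps).
    """
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order
```

Running the same search on the reversed graph yields the upstream assets.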

After BFS traversal, Figure 7 shows an example graph with nodes and their connections. For easier visualization, nodes are color-mapped; e.g. blue nodes represent equipment.

Graph Traversal Debug

Figure 7. Debug image displaying symbol connections after graph traversal

Exciting news! We’ve created a knowledge graph that’s ready to answer queries and dig out the information we need. But here’s the catch: it’s an in-memory graph. Next, we’ll tuck this graph into a graph database.

Graph Persistence

Database & Schema

We chose SQL Graph Database for its simplicity: we needed a data model that was easy to understand, maintain, query, and update. It works similarly to a relational database, where nodes are entities and edges are the relationships between them.

In the schema we’ve proposed, the central entity is “Asset”. This entity holds crucial information, including symbol connections. One of the unique features of our schema is the separation of the “Asset” and “AssetType” entities. This design allows us to efficiently query all assets belonging to a specific category.

We also included a “Connector” entity that is not explicitly supported yet but is included in the data model for future extension.

Graph Schema

Figure 8. Knowledge P&ID proposed schema.

We chose PyODBC – due to its extensive community support – to connect to the database and insert the knowledge graph.
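For illustration, here is how node and edge inserts for a SQL Graph schema might be generated. The table and column names ("Asset", "FlowsInto") are hypothetical, not the project's actual schema, and real code should use parameterized PyODBC queries rather than string formatting:

```python
def graph_insert_statements(assets, connections):
    """Build T-SQL INSERT statements for SQL Graph node/edge tables.

    `assets` is a list of (asset_id, tag) tuples; `connections` is a list
    of (source_id, target_id) tuples from the knowledge graph.
    """
    statements = [
        f"INSERT INTO Asset (AssetId, Tag) VALUES ({asset_id}, '{tag}');"
        for asset_id, tag in assets
    ]
    for src, dst in connections:
        # SQL Graph edge inserts reference the $node_id of both endpoints
        statements.append(
            "INSERT INTO FlowsInto ($from_id, $to_id) "
            "SELECT a.$node_id, b.$node_id FROM Asset a, Asset b "
            f"WHERE a.AssetId = {src} AND b.AssetId = {dst};"
        )
    return statements
```

Each statement would then be executed over a PyODBC cursor against the SQL Graph database.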

Now, our knowledge graph is persisted and can be queried at any time. It’s like having a personal librarian, always at your service!

Our End-to-end Digitization Solution for P&IDs

In a nutshell, the solution architecture is a three-pronged approach designed to streamline the digitization process: an inference workflow, a training and development workflow, and active learning.

This end-to-end solution, designed for digitizing P&IDs, is also adaptable to other engineering diagrams. In experiments with both synthetic and customer datasets, it detected 80% of the assets and connections. However, given the complexity of the task and the potential for errors in the various detection phases, we recommend a comprehensive human review of the solution’s entire output – not only the assets and connections that were missed, but also those the solution detected. This review provides the necessary corrections and ensures the accuracy of the extracted information.

In addition to this, it is strongly recommended that all extracted assets and connections undergo human verification before using the generated assets graph. This step is essential to validate the accuracy of the data and to prevent any potential issues that may arise from incorrect information.

Other system errors can be addressed through configuration options, such as modifying the inference score threshold for symbol detection. This allows for further fine-tuning of the system to better suit the specific needs of the task at hand.

It’s also important to note that this solution will need to be tuned to fit the characteristics of the P&IDs being digitized, which can vary greatly depending on the process used to generate them.

We encourage you to explore our GitHub solutions. Check out the MLOpsManufacturing repository, which focuses on Symbol Training Pipeline, and the Digitization of Piping and Instrument Diagrams repository, which provides the inference workflow. There is a comprehensive user guide to learn more about our work and how you can contribute to its evolution.
