Broken link detection in the Azure SDK
It’s frustrating when you select a reference link in an open-source project and receive an HTTP 404 response. This experience gives customers the impression that the project’s repository isn’t maintained.
The Azure SDK team maintains 10 active GitHub repositories that have hundreds of READMEs in total. Before introducing the link checker, we relied on a product manager, content reviewer, or external customer to find broken links. Unfortunately, this approach eroded the trust of our customers. Hence, the Azure SDK Engineering Systems (EngSys) team designed the dead link checker and integrated it into the Continuous Integration (CI) pipelines. Now team members become aware of the errors before broken links are introduced into the repository’s main branch.
How link checker works
The link checker is implemented as a PowerShell script that’s commonly used across our engineering systems. You can invoke the script as follows:
.\Verify-Links -urls C:\README.md -checkLinkGuidance $true
For the given README or other Markdown file, the link checker locates all links. A web request is created for each link. If an error response is received, the link check fails. The broken links are printed for the code owner’s information.
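The locate-then-request loop described above can be sketched in a few lines. This is an illustrative Python stand-in, not the actual PowerShell script: the function names, the regular expression (which only handles inline Markdown links), and the rule that any status of 400 or above counts as broken are all assumptions made for the example.

```python
import re
import urllib.error
import urllib.request

# Matches inline Markdown links such as [text](https://example.com/page).
# Simplification (assumption): reference-style links and bare URLs are ignored.
LINK_PATTERN = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")


def extract_links(markdown_text):
    """Return the unique http(s) link targets in document order."""
    seen = []
    for url in LINK_PATTERN.findall(markdown_text):
        if url not in seen:
            seen.append(url)
    return seen


def fetch_status(url, timeout=10):
    """Issue a real web request and return the HTTP status code, or None."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered with an error status
    except (urllib.error.URLError, OSError):
        return None  # no usable response: DNS failure, timeout, refused connection


def find_broken_links(markdown_text, get_status=fetch_status):
    """Return (url, status) pairs for every link that did not succeed."""
    broken = []
    for url in extract_links(markdown_text):
        status = get_status(url)
        if status is None or status >= 400:
            broken.append((url, status))
    return broken
```

Passing a stub such as a dictionary's `get` method as `get_status` makes it possible to try the logic without touching the network.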
We also maintain guidance for READMEs, and the link checker verifies that links comply with it. This validation is enabled by default. Privileged users with unique circumstances have the flexibility to disable the rule.
Once an owner acknowledges the failed links, the links can be added to an allowlist. This allowlist file enables people to continue checking in their code.
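As a rough illustration of how an allowlist could be applied, the Python sketch below filters acknowledged links out of the failure list. The file format (one URL per line, with `#` comments) and both function names are hypothetical; the real allowlist file may be organized differently.

```python
def load_allowlist(text):
    """Parse allowlist text: one URL per line; blank lines and # comments are skipped."""
    entries = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries.add(line)
    return entries


def unacknowledged(broken_links, allowlist):
    """Keep only the broken links that no owner has acknowledged yet."""
    return [url for url in broken_links if url not in allowlist]
```

With this shape, the check fails only when `unacknowledged` returns a non-empty list.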
There are also cases in which a link points to a file being introduced in the same pull request (PR). Such a link becomes valid only after the PR is merged into the main branch, so without special handling, two PRs would be required to avoid introducing a broken link to the new file: one to add the file and one to add the link. To address that problem, the link checker can rewrite links that target the GitHub repository’s main branch so that they point to the current commit, which predicts whether the links will be valid after check-in.
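One way to picture this substitution is a regular expression that swaps the branch segment of a GitHub URL for a commit SHA before the link is requested. The Python sketch below is an assumption-laden illustration: the pattern, the function name, and the sample SHA are made up for the example and are not taken from the actual script.

```python
import re

# Assumed link shape: https://github.com/<owner>/<repo>/(blob|tree)/main/<path>
MAIN_BRANCH_LINK = re.compile(
    r"^(https://github\.com/[^/]+/[^/]+/(?:blob|tree)/)main(/.*)$"
)


def retarget_to_commit(url, commit_sha):
    """Point a main-branch GitHub link at a specific commit instead.

    URLs that do not match the assumed shape are returned unchanged.
    """
    # A function replacement avoids misreading digits in the SHA as group references.
    return MAIN_BRANCH_LINK.sub(
        lambda match: match.group(1) + commit_sha + match.group(2), url
    )
```

For example, `retarget_to_commit("https://github.com/Azure/azure-sdk-for-net/blob/main/sdk/storage/README.md", "abc123")` yields the same URL with `main` replaced by `abc123`, and the rewritten link is then checked like any other.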
Where we run link checker
Collectively, hundreds of PRs are merged into the Azure SDK repositories’ main branches each day. To prevent the introduction of invalid links, we enabled link validation on check-in PRs. This validation saves the EngSys team a significant amount of time that would otherwise be spent verifying the links themselves.
There are cases in which links begin to fail as time passes. The EngSys team has two CI pipelines that check for this case:
- An aggregate report pipeline, which scans the entire repository on a nightly basis.
- A CI deployment pipeline, which runs link checker on service directories (for example, storage) each night and for each release.
The validation in these pipelines further prevents broken links from entering the SDK release pages and documentation sites.
As link checker usage increased, the EngSys team found that the high frequency with which some links were accessed caused commonly referenced websites, such as GitHub and npm, to throttle our requests. We decided to introduce caching to reduce the frequency of this issue.
Frequently used links are stored in a cache file. Each time link checker runs, the cache file’s links are treated as valid, so there’s no need to invoke a web request for any of them. We also provide the flexibility to refresh the input cache file by scanning the repository and writing all valid links back to it.
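The effect of the cache on request volume can be sketched as follows. This Python illustration assumes the cache is simply a set of URLs and that the caller supplies a `get_status` function returning an HTTP status code (or `None` on failure). Both function names are hypothetical, and the real cache file format isn't described here.

```python
def check_with_cache(urls, cached_links, get_status):
    """Check urls, skipping the web request for anything found in the cache.

    Returns (broken_urls, requests_made).
    """
    broken = []
    requests_made = 0
    for url in urls:
        if url in cached_links:
            continue  # treated as valid without a web request
        requests_made += 1
        status = get_status(url)
        if status is None or status >= 400:
            broken.append(url)
    return broken, requests_made


def refresh_cache(urls, get_status):
    """Rebuild the cache by requesting every link and keeping the ones that succeed."""
    good = set()
    for url in urls:
        status = get_status(url)
        if status is not None and status < 400:
            good.add(url)
    return good
```

The trade-off in this design is that a cached link that later breaks goes unnoticed until the cache is refreshed, which is why the refresh step matters.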
[Figure: link checker workflow]
To reduce frustration and improve efficiency, the Azure SDK team designed the dead link checker and integrated it into the Continuous Integration (CI) pipelines. Broken links are now detected before they’re introduced into the repository’s main branch.