April 30th, 2026

Propagating SharePoint Document Permissions to AI Search and RAG Pipelines

Executive Summary

When integrating SharePoint content into downstream systems such as Azure AI Search, RAG pipelines, or Copilot extensions, preserving document-level access control is critical—especially for highly sensitive content.

This post describes a security-first architecture that propagates SharePoint document permissions into downstream systems so that authorization is enforced at query time. The approach:

  • Uses Microsoft Graph’s Sites.Selected permission for least-privilege access to SharePoint
  • Materializes document-level permissions into search index fields at ingestion time
  • Relies on Microsoft Entra ID object IDs (GUIDs) for stable, query-time filtering

Introduction: The Problem

In our last project, the customer needed to generate tailored briefings for multiple user groups. Creating those briefings required preprocessing a large volume of documents stored in SharePoint folders—many containing highly sensitive data. That meant we had to carry SharePoint’s document-level permissions all the way through downstream search and retrieval systems so only authorized users could see the right content.

SharePoint provides rich, hierarchical, and inheritable permissions—but downstream systems like Blob Storage, search indexes, and LLM retrieval layers do not natively understand SharePoint ACLs.

Without an explicit permission-mapping strategy, organizations risk:

  • Overexposing sensitive documents to unauthorized users
  • Violating Zero Trust principles by granting broad access
  • Failing internal security or compliance reviews

A common anti-pattern is granting applications Sites.Read.All, which unintentionally exposes all sites in a tenant. We needed a pattern that preserves document-level authorization information when we ingest content into downstream systems.

The Journey: Our Approach and Solution

We built a security-first pipeline that reads documents and permissions from SharePoint, normalizes identities to Microsoft Entra ID object IDs, and stores both content and ACL metadata in a search index. The ingestion app uses the Microsoft Graph Sites.Selected permission; everything downstream operates on the materialized permission data.

Design Goals

Goal | Description
✅ Least-privilege access | Explicit allow-listing per site, with no tenant-wide permissions
✅ Document-level ACLs | Materialize permissions so downstream systems can filter safely
✅ Deterministic identities | Use GUIDs instead of emails or UPNs
✅ Broad compatibility | Work with Copilot extensions, RAG retrievers, and search indexes
✅ Highly sensitive content | Safe for regulated documents and compliance scenarios

Architecture Overview

The solution comprises five key components working together:

Component | Role
SharePoint Online | Hosts documents and source permissions
Microsoft Graph API | Supplies document metadata and ACLs
Sites.Selected permission | Ensures zero default access; each site is explicitly granted
Ingestion pipeline | Reads documents and permissions, resolves effective ACLs, and normalizes identities
Search index with security trimming | Stores allowedUsers and allowedGroups for query-time filters

End-to-End Permission Flow

Figure: End-to-end permission flow from SharePoint to filtered search

The Destination: Outcomes and Learnings

After piloting this pattern in production workloads, here’s what held up.

Permission Scoping with Sites.Selected

As described above, the ingestion application—the component that reads documents and their metadata from SharePoint and writes them into the search index—is registered with the Sites.Selected application permission. This means:

Characteristic | Benefit
Zero access by default | The app cannot read any site until explicitly granted
Explicit site grants | SharePoint admins must allow each site individually
Enforced by SharePoint | Access control is platform-enforced, not app logic
Clear audit trail | Every grant is traceable and revocable

This sharply limits blast radius if the app is ever compromised and keeps the pattern aligned with Zero Trust principles.
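Granting a site to a Sites.Selected app is itself a Microsoft Graph call (POST /sites/{site-id}/permissions), issued by an administrator or a separately privileged identity (to our knowledge this requires Sites.FullControl.All). As a minimal sketch, the helper below only builds that request; the function name build_site_grant and all IDs are illustrative assumptions, not part of our pipeline.

```python
GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def build_site_grant(site_id, app_client_id, app_display_name, roles=("read",)):
    """Build (url, body) for POST /sites/{site-id}/permissions.

    The request is not sent here; send it with whichever HTTP client and
    admin credential your tenant uses for site grants.
    """
    url = f"{GRAPH_BASE}/sites/{site_id}/permissions"
    body = {
        "roles": list(roles),  # "read" is enough for an ingestion-only app
        "grantedToIdentities": [
            {"application": {"id": app_client_id, "displayName": app_display_name}}
        ],
    }
    return url, body


# Placeholder values for illustration only.
url, body = build_site_grant(
    site_id="example-site-id",
    app_client_id="11111111-1111-1111-1111-111111111111",
    app_display_name="sharepoint-ingestion-app",
)
```

Keeping the role at read means that even a granted site cannot be modified by the ingestion app.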

Extract and Normalize Permissions via Microsoft Graph

For each document, effective permissions are retrieved using Microsoft Graph. The key challenge is that SharePoint permissions are hierarchical and inheritable, whereas downstream systems are ACL-agnostic. To bridge this gap, permissions must be resolved at ingestion time and stored explicitly.

GET /sites/{site-id}/drive/items/{item-id}/permissions

From the response, we extract the following and normalize every identity to its Microsoft Entra ID object ID (GUID):

  • User assignments: individual users with access
  • Group assignments: security groups and Microsoft 365 groups

permissions = await graph_client.get_permissions(drive_id, item_id)

allowed_users = []
allowed_groups = []

for entry in permissions:
    # grantedToV2 can be missing or null; fall back to empty dicts.
    grant = entry.get("grantedToV2") or {}
    user = grant.get("user") or {}
    group = grant.get("group") or {}

    if user and user.get("id"):
        allowed_users.append(user["id"])  # Microsoft Entra ID object ID
    if group and group.get("id"):
        allowed_groups.append(group["id"])  # Group object ID

Key decisions:

Decision | Rationale
Use GUIDs, not emails | Object IDs remain stable across renames and domain changes
Preserve group IDs | Enables offline expansion when search can’t expand groups natively
Resolve inheritance once | Index holds the effective ACL for each document
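To make the shape of the data concrete, here is a self-contained sketch of the extraction step run against a hand-written payload. It assumes the Graph permission resource fields grantedToV2 (direct grants) and grantedToIdentitiesV2 (recipients of sharing links). The function name extract_acl and the sample IDs are hypothetical. SharePoint-only groups (siteGroup) carry no Entra ID object ID, so this sketch skips them, which fails closed; a real pipeline would need to decide how to resolve their members.

```python
def extract_acl(permissions):
    """Return (user_ids, group_ids) as sorted, de-duplicated lists of object IDs."""
    users, groups = set(), set()
    for entry in permissions:
        identities = []
        if entry.get("grantedToV2"):
            identities.append(entry["grantedToV2"])
        # Sharing links list their recipients separately.
        identities.extend(entry.get("grantedToIdentitiesV2") or [])
        for identity in identities:
            user = identity.get("user") or {}
            group = identity.get("group") or {}
            if user.get("id"):
                users.add(user["id"])
            if group.get("id"):
                groups.add(group["id"])
    return sorted(users), sorted(groups)


USER_A = "aaaaaaaa-0000-0000-0000-000000000001"
USER_B = "aaaaaaaa-0000-0000-0000-000000000002"
GROUP_X = "bbbbbbbb-0000-0000-0000-000000000001"

sample_permissions = [
    {"id": "1", "roles": ["read"], "grantedToV2": {"user": {"id": USER_A}}},
    {"id": "2", "roles": ["write"], "grantedToV2": {"group": {"id": GROUP_X}}},
    # SharePoint group: no Entra ID object ID, skipped by this sketch.
    {"id": "3", "roles": ["read"], "grantedToV2": {"siteGroup": {"id": "5"}}},
    # Sharing link scoped to specific people.
    {"id": "4", "roles": ["read"], "link": {"scope": "users"},
     "grantedToIdentitiesV2": [{"user": {"id": USER_A}}, {"user": {"id": USER_B}}]},
]

allowed_users, allowed_groups = extract_acl(sample_permissions)
```

Using sets internally means a user who appears in both a direct grant and a sharing link is stored once.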

Index Security Materialization

Permissions are stored directly in the search index as filterable fields:

chunk = {
    "content": document_text,
    "allowedUsers": allowed_users,    # list of user object IDs
    "allowedGroups": allowed_groups,  # kept as a list so the collection filter can match each ID
}
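For those fields to be usable in a filter, the index has to declare them as filterable string collections. The fragment below is a sketch of an Azure AI Search index definition in its REST JSON shape, written as a Python dict. The index name, the id key field, and the choice to make the ACL fields non-retrievable are our assumptions for illustration.

```python
index_definition = {
    "name": "sharepoint-chunks",  # hypothetical index name
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
        {
            "name": "allowedUsers",
            "type": "Collection(Edm.String)",
            "filterable": True,    # required for any(...) filters
            "searchable": False,
            "retrievable": False,  # keep ACLs out of query results
        },
        {
            "name": "allowedGroups",
            "type": "Collection(Edm.String)",
            "filterable": True,
            "searchable": False,
            "retrievable": False,
        },
    ],
}


def acl_fields(definition):
    """Return the names of filterable string-collection fields in the definition."""
    return [
        f["name"]
        for f in definition["fields"]
        if f["type"] == "Collection(Edm.String)" and f.get("filterable")
    ]
```

Marking the ACL fields as non-retrievable keeps other users' object IDs from leaking into search responses while still allowing filters on them.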

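At query time, results are filtered using the authenticated user’s Microsoft Entra ID object ID and group memberships. Because a user usually belongs to several groups, {group_oids} stands for a comma-separated list of group object IDs:

allowedUsers/any(u: u eq '{user_oid}') or allowedGroups/any(g: search.in(g, '{group_oids}'))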
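Since the filter is assembled from identity values, it is worth validating them before interpolating. The helper below is a minimal sketch, not our production code: it assumes Azure AI Search OData filter syntax, uses search.in to match any of the caller's groups, and rejects anything that does not parse as a GUID. The name build_security_filter is hypothetical.

```python
import uuid


def _as_guid(value):
    """Return the canonical lowercase GUID, or raise ValueError for anything else."""
    return str(uuid.UUID(value))


def build_security_filter(user_oid, group_oids):
    """Build an OData filter matching documents the user or their groups may read."""
    clauses = [f"allowedUsers/any(u: u eq '{_as_guid(user_oid)}')"]
    groups = [_as_guid(g) for g in group_oids]
    if groups:
        joined = ",".join(groups)
        # search.in compares g against every value in the delimited list.
        clauses.append(f"allowedGroups/any(g: search.in(g, '{joined}', ','))")
    return " or ".join(clauses)


example_filter = build_security_filter(
    "11111111-1111-1111-1111-111111111111",
    ["22222222-2222-2222-2222-222222222222", "33333333-3333-3333-3333-333333333333"],
)
```

Parsing each ID with uuid.UUID means a crafted value containing quotes or OData operators raises ValueError instead of ending up inside the filter string. The output is lowercased, which assumes the index stores object IDs in lowercase.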

Benefits:

  • ✅ Authorization runs before documents are returned to the caller
  • ✅ Same pattern works for RAG retrievers and Copilot extensions
  • ✅ No post-retrieval filtering required—secure by design
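The object ID and group memberships used in the filter typically come from the caller's validated access token. The sketch below assumes the token has already been verified and decoded into a dict, and that the app registration is configured to emit the groups claim. It reads the oid and groups claims and flags the overage case, where Entra ID omits groups and points to Microsoft Graph through _claim_names. The function name is hypothetical.

```python
def identity_from_claims(claims):
    """Return (user_oid, group_oids, needs_graph_lookup) from decoded token claims.

    Signature and audience validation must happen before this function is called.
    """
    user_oid = claims["oid"]
    group_oids = list(claims.get("groups") or [])
    # With too many groups, the token carries a pointer instead of the list.
    needs_graph_lookup = "groups" in (claims.get("_claim_names") or {})
    return user_oid, group_oids, needs_graph_lookup


claims = {
    "oid": "11111111-1111-1111-1111-111111111111",
    "groups": ["22222222-2222-2222-2222-222222222222"],
}
user_oid, group_oids, needs_graph_lookup = identity_from_claims(claims)
```

When needs_graph_lookup is true, the memberships have to be fetched separately, for example from the Graph transitiveMemberOf relationship, before the filter is built. Filtering on an empty group list in that case would silently hide documents the user is entitled to see.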

Security Characteristics

This architecture delivers strong security guarantees:

Property | Status
No tenant-wide content access | ✅
Explicit site allow-listing | ✅
Deterministic identity model | ✅
Secure by default for AI search and RAG | ✅
Compatible with Copilot extensions | ✅

Limitation: Stale Permissions

One important trade-off of materializing permissions at ingestion time is that permission changes in SharePoint are not automatically propagated to downstream systems. If a user’s access is revoked in SharePoint, that user may still see the document’s content in the search index or RAG pipeline until the next ingestion run.

To mitigate this:

  • Run the ingestion pipeline on a regular schedule so that permission updates are picked up in a timely manner.
  • Use SharePoint webhooks or event receivers to trigger re-ingestion when permissions change, reducing the staleness window.
  • Tune the refresh interval based on the sensitivity of the content—highly regulated data may warrant more frequent re-ingestion.
  • Communicate the expected propagation delay to stakeholders so they understand the security posture.

In our engagement, the ingestion pipeline ran on a periodic schedule, and the customer accepted a bounded delay for permission propagation given the sensitivity profile of the content.
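A permission refresh does not have to re-process document content. As a sketch under stated assumptions, the helper below compares the ACL the pipeline last materialized with the ACL just read from SharePoint and, only if they differ, emits an Azure AI Search merge action that overwrites the two ACL fields. The key field name id and the function name are assumptions; the action shape follows the index documents REST payload.

```python
ACL_FIELDS = ("allowedUsers", "allowedGroups")


def acl_refresh_action(doc_id, indexed_acl, current_acl):
    """Return a merge action when the ACL changed, otherwise None."""
    changed = any(
        set(indexed_acl.get(field, [])) != set(current_acl.get(field, []))
        for field in ACL_FIELDS
    )
    if not changed:
        return None
    action = {"@search.action": "merge", "id": doc_id}
    for field in ACL_FIELDS:
        # A merge replaces the whole collection, so revoked IDs disappear.
        action[field] = sorted(set(current_acl.get(field, [])))
    return action


indexed = {"allowedUsers": ["u1", "u2"], "allowedGroups": ["g1"]}
current = {"allowedUsers": ["u1"], "allowedGroups": ["g1"]}  # u2 was revoked

action = acl_refresh_action("doc-42", indexed, current)
batch = {"value": [action]} if action else {"value": []}
```

Because only changed documents produce an action, a permissions-only sweep can run more often than full re-ingestion, which shortens the staleness window.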

Common Pitfalls to Avoid

Pitfall | Why It’s Dangerous
❌ Using Sites.Read.All | Breaks least privilege; exposes entire tenant
❌ Filtering results after retrieval | Data already leaked to the application layer
❌ Relying on emails or display names | These change; GUIDs don’t
❌ Ignoring group expansion | Unexpanded groups lead to silent overexposure
❌ Assuming real-time permission sync | Materialized permissions can become stale; plan for periodic refresh

Conclusion

Our customer needed to expose SharePoint content through AI-powered search and Copilot integrations without compromising document-level access control. By materializing permissions at ingestion time and filtering at query time, we delivered a solution that preserved the customer’s existing SharePoint security model across every downstream system.

When SharePoint content is integrated into AI-powered systems, authorization becomes a data problem. Treating permissions as first-class data ensures the system remains secure, auditable, and future-proof.

Key takeaways:

  • Use Sites.Selected for least-privilege access when the ingestion application reads from SharePoint
  • Document-level permissions must be materialized explicitly in your downstream index
  • GUID-based filtering ensures stable, deterministic identity matching
  • Authorization should always happen before retrieval, not after
  • Plan for periodic re-ingestion to keep materialized permissions in sync with SharePoint

This pattern has proven effective for enterprise-grade search, RAG pipelines, and Copilot integrations operating over highly sensitive documents.

Further Reading