Propagating SharePoint Document Permissions to AI Search and RAG Pipelines

Executive Summary

When integrating SharePoint content into downstream systems such as Azure AI Search, RAG pipelines, or Copilot extensions, preserving document-level access control is critical—especially for highly sensitive content.

This post describes a security-first architecture that propagates SharePoint document permissions into downstream systems so that authorization is enforced at query time. The approach:

Uses Microsoft Graph’s Sites.Selected permission for least-privilege access to SharePoint
Materializes document-level permissions into search index fields at ingestion time
Relies on Microsoft Entra ID object IDs (GUIDs) for stable, query-time filtering

Introduction: The Problem

In our last project, the customer needed to generate tailored briefings for multiple user groups. Creating those briefings required preprocessing a large volume of documents stored in SharePoint folders—many containing highly sensitive data. That meant we had to carry SharePoint’s document-level permissions all the way through downstream search and retrieval systems so only authorized users could see the right content.

SharePoint provides rich, hierarchical, and inheritable permissions—but downstream systems like Blob Storage, search indexes, and LLM retrieval layers do not natively understand SharePoint ACLs.

Without an explicit permission-mapping strategy, organizations risk:

Overexposing sensitive documents to unauthorized users
Violating Zero Trust principles by granting broad access
Failing internal security or compliance reviews

A common anti-pattern is granting applications Sites.Read.All, which unintentionally exposes all sites in a tenant. We needed a pattern that preserves document-level authorization information when we ingest content into downstream systems.

The Journey: Our Approach and Solution

We built a security-first pipeline that reads documents and permissions from SharePoint, normalizes identities to Microsoft Entra ID object IDs, and stores both content and ACL metadata in a search index. The ingestion app uses the Microsoft Graph Sites.Selected permission; everything downstream operates on the materialized permission data.

Design Goals

Goal	Description
✅ Least-privilege access	Explicit allow-listing per site—no tenant-wide permissions
✅ Document-level ACLs	Materialize permissions so downstream systems can filter safely
✅ Deterministic identities	Use GUIDs instead of emails or UPNs
✅ Broad compatibility	Work with Copilot extensions, RAG retrievers, and search indexes
✅ Highly sensitive content	Safe for regulated documents and compliance scenarios

Architecture Overview

The solution comprises five key components working together:

Component	Role
SharePoint Online	Hosts documents and source permissions
Microsoft Graph API	Supplies document metadata and ACLs
Sites.Selected permission	Ensures zero default access; each site is explicitly granted
Ingestion pipeline	Reads documents and permissions, resolves effective ACLs, and normalizes identities
Search index with security trimming	Stores `allowedUsers` and `allowedGroups` for query-time filters

End-to-End Permission Flow

Figure: End-to-end permission flow from SharePoint to filtered search.

The Destination: Outcomes and Learnings

After piloting this pattern in production workloads, here’s what held up.

Permission Scoping with Sites.Selected

As described above, the ingestion application—the component that reads documents and their metadata from SharePoint and writes them into the search index—is registered with the Sites.Selected application permission. This means:

Characteristic	Benefit
Zero access by default	The app cannot read any site until explicitly granted
Explicit site grants	SharePoint admins must allow each site individually
Enforced by SharePoint	Access control is platform-enforced, not app logic
Clear audit trail	Every grant is traceable and revocable

This sharply limits blast radius if the app is ever compromised and keeps the pattern aligned with Zero Trust principles.
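
To make the explicit grant concrete, here is a minimal sketch of the request an administrator might send to Microsoft Graph to give the ingestion app read access to a single site. It only builds the URL and JSON body for the site permissions endpoint and does not send anything. The helper name `build_site_grant`, the sample site ID, the client ID, and the display name are all hypothetical and not taken from our engagement.

```python
import json

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def build_site_grant(site_id, app_client_id, app_display_name, role="read"):
    """Return (url, body) for a POST that grants one app access to one site."""
    # This sketch only allows the two roles an ingestion app is likely to need.
    if role not in {"read", "write"}:
        raise ValueError(f"unsupported role: {role}")
    url = f"{GRAPH_BASE}/sites/{site_id}/permissions"
    body = {
        "roles": [role],
        "grantedToIdentities": [
            {"application": {"id": app_client_id, "displayName": app_display_name}}
        ],
    }
    return url, body


# Hypothetical values for illustration only.
url, body = build_site_grant(
    "contoso.sharepoint.com,site-guid,web-guid",
    "11111111-1111-1111-1111-111111111111",
    "briefing-ingestion",
)
print("POST", url)
print(json.dumps(body, indent=2))
```

The caller that sends this request needs sufficient rights over the site, so in practice the grant is issued by an administrator identity, not by the ingestion app itself.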

Extract and Normalize Permissions via Microsoft Graph

For each document, the pipeline retrieves the permissions that apply to it, including those inherited from parent folders, through Microsoft Graph. The key challenge is that SharePoint permissions are hierarchical and inheritable, whereas downstream systems have no concept of SharePoint ACLs. To bridge this gap, permissions must be resolved at ingestion time and stored explicitly with each document.

GET /sites/{site-id}/drive/items/{item-id}/permissions

From the response, we extract:

User assignments — individual users with access
Group assignments — security groups and Microsoft 365 groups
Identity normalization — convert all identities to Microsoft Entra ID object IDs (GUIDs)

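# Runs inside an async ingestion function; graph_client is assumed to be
# the pipeline's own wrapper around the Graph permissions endpoint.
permissions = await graph_client.get_permissions(drive_id, item_id)

allowed_users = set()   # sets drop duplicates from overlapping grants
allowed_groups = set()

for entry in permissions:
    # Direct grants arrive in grantedToV2; sharing links list their
    # recipients in grantedToIdentitiesV2. Either key may be absent or null.
    grants = list(entry.get("grantedToIdentitiesV2") or [])
    if entry.get("grantedToV2"):
        grants.append(entry["grantedToV2"])

    for grant in grants:
        user = grant.get("user") or {}
        group = grant.get("group") or {}
        if user.get("id"):
            allowed_users.add(user["id"])  # Microsoft Entra ID object ID
        if group.get("id"):
            allowed_groups.add(group["id"])  # Group object ID

# Sorted lists serialize cleanly into the index's collection fields.
allowed_users = sorted(allowed_users)
allowed_groups = sorted(allowed_groups)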

Key decisions:

Decision	Rationale
Use GUIDs, not emails	Object IDs remain stable across renames and domain changes
Preserve group IDs	Enables offline expansion when search can’t expand groups natively
Resolve inheritance once	Index holds the effective ACL for each document
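
As an illustration of the offline group expansion mentioned in the table, the sketch below widens a document's granted groups to include every group nested inside them. It assumes the nesting has already been exported into a plain dictionary (`parent_to_children`), which is a hypothetical structure; how that map is built from the directory is outside this example.

```python
from collections import deque


def expand_groups(direct_groups, parent_to_children):
    """Return the granted group IDs plus every group nested below them."""
    seen = set(direct_groups)
    queue = deque(direct_groups)
    while queue:
        group_id = queue.popleft()
        for child in parent_to_children.get(group_id, ()):
            if child not in seen:  # guards against cyclic nesting
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# Hypothetical nesting: "finance" contains "finance-emea",
# which in turn contains "finance-de".
nesting = {"finance": ["finance-emea"], "finance-emea": ["finance-de"]}
print(expand_groups(["finance"], nesting))
```

With the expanded list stored in `allowedGroups`, a query that only knows a user's direct group memberships can still match documents granted to a parent group.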

Index Security Materialization

Permissions are stored directly in the search index as filterable fields:

chunk = {
    "content": document_text,
    "allowedUsers": allowed_users,    # list of user object IDs
    "allowedGroups": allowed_groups,  # list of group object IDs; any() filters need a collection, not a joined string
}
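
For the `any()` filter used at query time to work, both ACL fields need to be filterable string collections in the index. The following is a sketch of what such an index definition could look like in the JSON shape accepted by the Azure AI Search REST API, written as a Python dictionary. The index name and the `id` key field are assumptions; only `content`, `allowedUsers`, and `allowedGroups` come from the pipeline described here.

```python
# Hypothetical index definition; the name and the "id" key field are assumptions.
index_definition = {
    "name": "sharepoint-chunks",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
        # ACL fields: filterable for security trimming, not searchable,
        # and not retrievable so they are never returned to callers.
        {"name": "allowedUsers", "type": "Collection(Edm.String)",
         "filterable": True, "searchable": False, "retrievable": False},
        {"name": "allowedGroups", "type": "Collection(Edm.String)",
         "filterable": True, "searchable": False, "retrievable": False},
    ],
}


def acl_fields_are_filterable(definition):
    """Check that both ACL fields are filterable string collections."""
    fields = {f["name"]: f for f in definition["fields"]}
    return all(
        fields.get(name, {}).get("type") == "Collection(Edm.String)"
        and fields.get(name, {}).get("filterable") is True
        for name in ("allowedUsers", "allowedGroups")
    )


print(acl_fields_are_filterable(index_definition))
```

Marking the ACL fields as non-retrievable is a design choice in this sketch: the filter can still use them, but the ACL of a document is not exposed in search results.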

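At query time, results are filtered using the authenticated user’s Microsoft Entra ID object ID and group memberships. Because a user usually belongs to several groups, the group clause matches against the full comma-separated list of the user’s group IDs (`{group_oids}`):

allowedUsers/any(u: u eq '{user_oid}') or allowedGroups/any(g: search.in(g, '{group_oids}'))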
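
A filter like this is usually assembled in application code from the caller's token claims. The sketch below is one way to do that, assuming the user's object ID and group IDs are already available as strings. It uses the OData `search.in` function so that any number of groups fits in one clause, and it parses every ID as a GUID first so that a malformed value cannot alter the filter expression. The function name `build_security_filter` is hypothetical.

```python
import uuid


def _as_guid(value):
    """Normalize a GUID string; raises ValueError for anything else."""
    return str(uuid.UUID(value))


def build_security_filter(user_oid, group_oids):
    """Build an OData filter matching the user's ID or any of their groups."""
    user = _as_guid(user_oid)
    clauses = [f"allowedUsers/any(u: u eq '{user}')"]
    groups = [_as_guid(g) for g in group_oids]
    if groups:
        joined = ",".join(groups)
        clauses.append(f"allowedGroups/any(g: search.in(g, '{joined}', ','))")
    return " or ".join(clauses)


# Hypothetical IDs for illustration only.
print(build_security_filter(
    "11111111-1111-1111-1111-111111111111",
    ["22222222-2222-2222-2222-222222222222",
     "33333333-3333-3333-3333-333333333333"],
))
```

The resulting string is passed as the filter of the search request, so trimming happens inside the search service and not in the application.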

Benefits:

✅ Authorization runs before documents are returned to the caller
✅ Same pattern works for RAG retrievers and Copilot extensions
✅ No post-retrieval filtering required—secure by design

Security Characteristics

This architecture delivers strong security guarantees:

Property	Status
No tenant-wide content access	✅
Explicit site allow-listing	✅
Deterministic identity model	✅
Secure by default for AI search and RAG	✅
Compatible with Copilot extensions	✅

Limitation: Stale Permissions

One important trade-off of materializing permissions at ingestion time is that permission changes in SharePoint are not automatically propagated to downstream systems. If a user’s access is revoked in SharePoint, that user may still see the document’s content in the search index or RAG pipeline until the next ingestion run.

To mitigate this:

Run the ingestion pipeline on a regular schedule so that permission updates are picked up in a timely manner.
Use SharePoint webhooks or event receivers to trigger re-ingestion when permissions change, reducing the staleness window.
Tune the refresh interval based on the sensitivity of the content—highly regulated data may warrant more frequent re-ingestion.
Communicate the expected propagation delay to stakeholders so they understand the security posture.

In our engagement, the ingestion pipeline ran on a periodic schedule, and the customer accepted a bounded delay for permission propagation given the sensitivity profile of the content.
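
One way to keep scheduled refreshes cheap is to compare a fingerprint of the ACL just read from SharePoint with the fingerprint stored at the last ingestion, and to rewrite only the documents whose ACL has changed. The sketch below illustrates that idea; it is an assumption for illustration and not a description of the pipeline we ran. The function names are hypothetical.

```python
import hashlib


def acl_fingerprint(allowed_users, allowed_groups):
    """Return an order-insensitive SHA-256 fingerprint of an ACL."""
    users = ",".join(sorted({u.lower() for u in allowed_users}))
    groups = ",".join(sorted({g.lower() for g in allowed_groups}))
    return hashlib.sha256(f"u:{users}|g:{groups}".encode("utf-8")).hexdigest()


def needs_acl_refresh(stored_fingerprint, allowed_users, allowed_groups):
    """True when the current ACL differs from the one recorded at last ingestion."""
    return stored_fingerprint != acl_fingerprint(allowed_users, allowed_groups)


# Hypothetical IDs: the stored ACL had two users, the current one has lost "user-b".
stored = acl_fingerprint(["user-a", "user-b"], ["group-1"])
print(needs_acl_refresh(stored, ["user-a"], ["group-1"]))
```

Because only the ACL fields change in this case, the affected documents can be updated without re-extracting or re-embedding their content.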

Common Pitfalls to Avoid

Pitfall	Why It’s Dangerous
❌ Using `Sites.Read.All`	Breaks least privilege—exposes entire tenant
❌ Filtering results after retrieval	Data already leaked to the application layer
❌ Relying on emails or display names	These change; GUIDs don’t
❌ Ignoring group expansion	Unexpanded groups lead to silent overexposure
❌ Assuming real-time permission sync	Materialized permissions can become stale—plan for periodic refresh

Conclusion

Our customer needed to expose SharePoint content through AI-powered search and Copilot integrations without compromising document-level access control. By materializing permissions at ingestion time and filtering at query time, we delivered a solution that preserved the customer’s existing SharePoint security model across every downstream system.

When SharePoint content is integrated into AI-powered systems, authorization becomes a data problem. Treating permissions as first-class data keeps access decisions explicit, auditable, and enforceable in every downstream system that consumes the index.

Key takeaways:

Use Sites.Selected for least-privilege access when the ingestion application reads from SharePoint
Document-level permissions must be materialized explicitly in your downstream index
GUID-based filtering ensures stable, deterministic identity matching
Authorization should always happen before retrieval, not after
Plan for periodic re-ingestion to keep materialized permissions in sync with SharePoint

This pattern has proven effective for enterprise-grade search, RAG pipelines, and Copilot integrations operating over highly sensitive documents.
