Labeling Existing SharePoint Documents at Rest with File Extension Matching

If you're rolling out Microsoft Copilot — or you just need a defensible data classification posture — you have a problem on day one: the millions of files already sitting in SharePoint and OneDrive are unlabeled. Default library labels handle new files going forward. Auto-labeling policies that look for sensitive information types (SITs) handle the regulated stuff. But what about the long tail of generic business documents that don't trigger a SIT and were created before your label policy existed?

The answer most people land on is the metered Graph assignSensitivityLabel API, which requires Syntex billing. It works, but it's pay-per-call across millions of files. There's a free path that almost nobody talks about: service-side auto-labeling policies with a file-extension match condition. Yes, it works in enforcement, retroactively, against existing files at rest, with no Syntex bill.

It also looks completely broken until you understand the pipeline that sits behind it. This post is the playbook for getting it working and — more importantly — for not giving up when the Purview dashboard lies to you.

The configuration

A policy that stamps Internal on every Office document and PDF in a SharePoint site, regardless of content:

Connect-IPPSSession

# 1. Create the policy
New-AutoSensitivityLabelPolicy `
  -Name "Org-DefaultInternal" `
  -SharePointLocation "https://contoso.sharepoint.com/sites/MySite" `
  -ApplySensitivityLabel "Internal" `
  -OverwriteLabel:$false `
  -Mode TestWithoutNotifications

# 2. Add a rule with the extension condition
New-AutoSensitivityLabelRule `
  -Policy "Org-DefaultInternal" `
  -Name  "Org-DefaultInternal-OfficeAndPDF" `
  -ContentExtensionMatchesWords @(
      "docx","docm","dotx","dotm",
      "xlsx","xlsm","xlsb","xltx","xltm",
      "pptx","pptm","ppsx","ppsm","potx","potm",
      "pdf"
    ) `
  -Workload SharePoint

# 3. Let simulation complete, review results, then promote
Set-AutoSensitivityLabelPolicy -Identity "Org-DefaultInternal" -Mode Enable

Three things to call out:

OverwriteLabel:$false is critical. The policy will respect existing higher-priority labels (Confidential, Restricted) and only stamp the unlabeled and lower-priority files. With $false it also leaves manually applied labels untouched. You don't have to scope around your already-classified content.
ContentContainsSensitiveInformation is left null. This is what makes the policy free. SIT-based auto-labeling has its own limits, but extension-only is a different code path with no metering.
Workload SharePoint covers SharePoint sites only. For OneDrive, add a separate rule with -Workload OneDriveForBusiness and include the OneDrive locations in the policy scope (-OneDriveLocation).

The trap that makes everyone abandon this approach

Promote the policy to Enable, come back four days later, and the Purview dashboard says:

Total Labeled	Pending	Failed
0	0	0

You'd be forgiven for concluding that extension-based enforcement doesn't work. The simulation matched eight files. The enforcement queue is empty. There's no error. Microsoft Support's first reply will likely tell you to use the Purview portal — which is exactly what you did.

It is working. The dashboard is lying.

How the pipeline actually behaves

Service-side auto-labeling, in both simulation and enforcement, depends on SharePoint's crawl and indexing pipeline having processed each file after the policy was activated. Files that haven't been crawled since you flipped the policy on simply won't appear. This is how the mechanism works, not a bug, and the documentation doesn't state it anywhere prominent.

Practical implications:

A brand-new library is invisible to the engine for hours. The crawler has to discover it, index its items, and then the labeling engine evaluates them on its own pass. End-to-end I see ~24 hours in test scenarios. Production sites with regular activity get evaluated faster.
Files that aren't indexed don't get evaluated. Open files, checked-out files, files in transient states, and files of unsupported types are silently skipped.
Existing labels of equal-or-higher priority suppress matching. This is correct behavior but contributes to "where did my expected matches go" confusion.
The dashboard counters update on a cadence that's not coupled to actual labeling activity. I have a tenant where every file in a test library is verifiably labeled and the per-policy CompletedItemsCount is still zero days later.
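The conditions above can be collected into a rough mental model. The sketch below is not the engine's implementation: the FileState fields and the explain_skip helper are hypothetical names, the case-insensitive extension comparison is an assumption, and it assumes a larger priority number means a higher-priority label.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Extensions from the rule defined earlier in this post.
RULE_EXTENSIONS = {
    "docx", "docm", "dotx", "dotm",
    "xlsx", "xlsm", "xlsb", "xltx", "xltm",
    "pptx", "pptm", "ppsx", "ppsm", "potx", "potm",
    "pdf",
}


@dataclass
class FileState:
    """Hypothetical snapshot of one file, for reasoning only."""
    name: str
    last_crawled: Optional[datetime]      # None = never indexed
    checked_out: bool = False
    label_priority: Optional[int] = None  # None = unlabeled


def explain_skip(f: FileState, policy_enabled_at: datetime,
                 policy_label_priority: int) -> Optional[str]:
    """Return why the file would not show up as a match, or None if eligible."""
    ext = f.name.rsplit(".", 1)[-1].lower() if "." in f.name else ""
    if ext not in RULE_EXTENSIONS:
        return "extension not in rule"
    if f.checked_out:
        return "checked out"
    if f.last_crawled is None or f.last_crawled < policy_enabled_at:
        return "not crawled since policy activation"
    if f.label_priority is not None and f.label_priority >= policy_label_priority:
        return "existing label of equal or higher priority"
    return None
```

Running a handful of expected-but-missing files through a checklist like this is usually faster than waiting on the dashboard to explain the gap.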

How to verify it's actually working

Don't trust the Purview dashboard. Verify three ways, in order of confidence:

Read _IpLabelId directly off the SharePoint list item. Use PnP PowerShell:

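# Requires the PnP.PowerShell module; $SiteUrl is the site the policy targets
$SiteUrl = "https://contoso.sharepoint.com/sites/MySite"
Connect-PnPOnline -Url $SiteUrl -Interactive

# Pull every item in the library and keep only files (skip folders)
$items = Get-PnPListItem -List "Documents" -PageSize 500 |
  Where-Object { $_.FileSystemObjectType -eq "File" }

# Print file name, label GUID, and assignment method for each file
foreach ($it in $items) {
  $fv = $it.FieldValues
  Write-Host ("{0,-40} {1,-36} method={2}" -f `
    $fv["FileLeafRef"], $fv["_IpLabelId"], $fv["_IpLabelAssignmentMethod"])
}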

A populated _IpLabelId is the GUID of the applied label. Cross-reference with Get-Label for the display name. Important gotcha: PnP's -Fields parameter sometimes drops _IpLabelId from the output; omit -Fields and read the full FieldValues dictionary directly.

Activity Explorer in the Purview portal. Look for Label apply succeeded events with the rule name in the Rule column. This is the closest thing to ground truth that the portal offers.
The dashboard counters. Useful only as a sanity check, not as primary evidence.
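If the per-item values read off the list are exported to CSV, label coverage can be tallied offline rather than inferred from the dashboard. A minimal sketch, assuming a hypothetical export with FileLeafRef and _IpLabelId columns; the export step and the label_coverage helper are illustrations, not part of the script above.

```python
import csv
import io
from collections import Counter


def label_coverage(csv_text: str) -> dict:
    """Count labeled vs. unlabeled files per extension.

    A row counts as labeled when its _IpLabelId column is non-empty.
    """
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row.get("FileLeafRef") or ""
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else "(none)"
        labeled = bool((row.get("_IpLabelId") or "").strip())
        bucket = counts.setdefault(ext, Counter())
        bucket["labeled" if labeled else "unlabeled"] += 1
    return counts


if __name__ == "__main__":
    # Placeholder rows; real label IDs are GUIDs.
    sample = "FileLeafRef,_IpLabelId\nMinutes.docx,label-guid-here\nBudget.xlsx,\n"
    for ext, c in sorted(label_coverage(sample).items()):
        print(ext, dict(c))
```

Re-running the tally a day apart shows whether the sweep is progressing, which the policy counters may not.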

The other trap: malformed test files

If you generate test files in a hurry (or worse, ask an AI to produce them), verify the binaries before you draw conclusions. A .docx file with ASCII text inside is not a Word document. The auto-labeling engine will try to parse it as OOXML, fail, and report EncryptedFileNotSupported for every file. This error has nothing to do with encryption; it's the engine's catch-all bucket for any file it expected to parse as a document but couldn't.

file Board-Minutes.docx
# Real:  Microsoft Word 2007+
# Fake:  ASCII text

Generate real Office files with Python (python-docx here):

from docx import Document
d = Document()
d.add_heading("Test", 0)
d.add_paragraph("Real content.")
d.save("Board-Minutes.docx")

The same applies to .xlsx (openpyxl), .pptx (python-pptx), and .pdf (weasyprint / reportlab).
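Where the file utility isn't available, a similar sanity check can be done in Python before uploading. This is a sketch of a cheap structural test only, and looks_like_real_document is a hypothetical helper name: it relies on OOXML files being ZIP packages that contain [Content_Types].xml and on PDFs starting with the %PDF- signature. It does not prove the file opens cleanly in Office.

```python
import zipfile


def looks_like_real_document(path: str) -> bool:
    """Cheap structural check for test files.

    PDFs must start with the %PDF- signature; anything else is treated as
    OOXML and must be a ZIP package containing [Content_Types].xml.
    """
    if path.lower().endswith(".pdf"):
        with open(path, "rb") as fh:
            return fh.read(5) == b"%PDF-"
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as zf:
        return "[Content_Types].xml" in zf.namelist()
```

A plain-text file renamed to .docx fails the ZIP test immediately, which is the case that otherwise surfaces later as EncryptedFileNotSupported.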

When this approach makes sense

You need to retroactively label existing files at rest in SharePoint or OneDrive
You don't want to enable Syntex billing for the metered Graph API
Your "default" classification (typically Internal) is something you want applied to all unclassified Office documents and PDFs
You're willing to wait 24–72 hours for the first sweep across a new scope
You have higher-priority labels in place for sensitive content so the catch-all doesn't paint over them

When to use something else

For brand-new files going forward, library default sensitivity labels are simpler and stamp at upload time
For sensitive content, SIT-based auto-labeling is the right tool and is also free
For one-off remediation of a small file set with strict timing, the metered Graph API is fine if you accept the cost

Closing

Service-side auto-labeling with extension matching is one of the genuinely useful free features in Purview, and it's almost completely buried under bad telemetry and a dashboard that doesn't reflect the labeling engine's actual state. If you walk away with one thing: don't trust the policy dashboard. Verify on the file. The engine is doing its job; the UI hasn't caught up yet.