June 1st, 2020

What’s in My Index?

Brendan Flynn
Senior Software Engineer


Welcome back. This is part 3 of an initial 4-part series discussing everything search indexer related on Windows. In this article we are going to talk about what data is actually stored in the index. From files to custom data types, you will be able to start digging through the items that are stored in your own personal index by the time you are done reading.

List of posts in the series

  • The Evolution of Windows Search
  • Configuration and Settings
  • What’s in My Index? 👈You Are here
  • How to Make the Most of Search on Windows

Again, the indexing service involves a lot of terminology, concepts, and components, most of which we will eventually cover in detail. For now, we are focusing on high-level concepts.

FTQuery

If you do not have it installed already, the latest Windows SDK contains a few tools that are very useful when exploring your index. In this post we will walk through how you can use one of them, FTQuery, to explore all the data in your index. If you would like to follow along, you can get the latest SDK here.

FTQuery lets you run any SQL (Structured Query Language) style query, which is one of the ways applications can query data in the index. This is one of the ways all the components you saw in the first post in the series were reading data from files. We will do a deep dive into the Structured Query Language in a later post, so for now we will use very simple queries below to extract data.

Open a command prompt, navigate to the default location of the tool in the SDK (typically C:\Program Files (x86)\Windows Kits\… on a 64-bit machine), and run the traditional Windows executable help command.

By default the tool shows a few example SQL queries that you can run to see what is in the index. The first query you see, “SELECT path from SystemIndex”, lists every single item in the index, which not only takes a while but is also not very useful in the command prompt.

If you expect the result list to be large, I recommend redirecting the tool’s output to a .txt file so you can analyze the data further. I’ve modified the query to use ‘TOP 25’ to limit the result set to the first 25 entries in the index.

In the above query we’re asking for the first 25 items that the indexing service ever knew about because we’re not using any sort of restrictions or conditions on the query.
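For readers following along without the screenshot, here is a minimal Python sketch that writes out the query text for both the full listing and the capped version. The function name `build_path_query` is hypothetical and is not part of the SDK; the sketch only builds the query string, which you would then hand to FTQuery yourself.

```python
def build_path_query(top=None):
    """Return a query that lists item paths from SystemIndex.

    `top` is optional; when given, the result set is capped with TOP n.
    This helper is illustrative only and does not run the query.
    """
    if top is None:
        return "SELECT path FROM SystemIndex"
    if not isinstance(top, int) or top <= 0:
        raise ValueError("top must be a positive integer")
    return f"SELECT TOP {top} path FROM SystemIndex"


# The full dump (slow on a large index) and the capped variant.
print(build_path_query())
print(build_path_query(25))
```

Because there is no WHERE clause, the capped query simply returns the first 25 rows the index hands back.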

Gathering the Data

When the service first starts, it issues initial “crawls” on the locations it cares about by default, which in this case, as you can see above, include ‘C:\users\’ and ‘C:\programdata\microsoft\windows\start menu\’. Crawls are how the service initially discovers your files, folders, and content; they also happen periodically as new folders of interest are added to the machine.

The service keeps track of the items it’s gathered and the last time it has done so in order to keep a history of when every item was processed. This query will show you the single last item that the service has processed due to the file being modified.

You can remove or alter the “TOP 1” to show as many items as you like. In my personal index I’ve been working on some blog posts, so it makes sense that, as I edit this post in a Word document, the Search Indexer picks up the changes and reports this file as the last one it has indexed.

You will notice that if you create a .txt file in C:\myawesomefolder\ and re-run the above query with the default Search Indexer configuration, the file does not show up. This is because C:\myawesomefolder\ is not an indexed location by default.

In the last article we briefly covered what indexing scopes are and how to modify them. If you go into the Indexing Options applet in Control Panel, you can add C:\myawesomefolder\ and the service will start picking up the data in that folder.


Now let us re-run the query for the last indexed item. As expected, the last item is in the newly added folder to the indexing scope list.

Notice that in this specific query I have added WHERE SCOPE=’file:’. All items in the index have unique paths or URLs to represent them. These paths are sometimes computed via the property store of the item, as for files and folders, or are manually constructed for other items with specific protocol handlers.
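The exact text of the "last indexed item" query is not reproduced here, so the following Python sketch is only one plausible way to write it. It assumes the items are ordered by the `System.Search.GatherTime` property (newest first) and that the scope restriction is expressed as `SCOPE='file:'`, as mentioned above. The helper name `last_indexed_query` is hypothetical.

```python
def last_indexed_query(count=1, scope=None):
    """Build a query for the most recently gathered items.

    Assumption: ordering by System.Search.GatherTime descending is what
    surfaces the last item the indexer processed. The original query in
    the post may have used different columns.
    """
    if not isinstance(count, int) or count < 1:
        raise ValueError("count must be a positive integer")
    query = f"SELECT TOP {count} path, System.Search.GatherTime FROM SystemIndex"
    if scope is not None:
        # Double any single quote so the scope stays a valid SQL string literal.
        escaped = scope.replace("'", "''")
        query += f" WHERE SCOPE='{escaped}'"
    query += " ORDER BY System.Search.GatherTime DESC"
    return query


# Last item overall, then the last five items under the file: scope.
print(last_indexed_query())
print(last_indexed_query(5, scope="file:"))
```

Passing a narrower scope such as `file:C:/myawesomefolder` restricts the results to that folder, which is a quick way to confirm that a newly added location is being picked up.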

By running the query “SELECT path FROM SystemIndex” and redirecting the output to a text file you can examine every single item in the index. You will notice items with URLs starting with mapi16://, winrt://, iehistory://, and of course, file://. This beginning part of the URL is the protocol the item is associated with.

Here are the first items when I dump my index:

file:C:/Users

file:C:/Users/Brendan

file:C:/Users/Default

file:C:/Users/Default/AppData

file:C:/Users/desktop.ini

file:C:/Users/Public

file:C:/Users/brflynn/AppData/Local/Packages/Microsoft.Search_8wekyb3d8bbwe

file:C:/Users/brflynn/AppData/Local/Packages/Microsoft.Search_8wekyb3d8bbwe/LocalState

file:C:/Users/brflynn/AppData/Local/Packages/Microsoft.Search_8wekyb3d8bbwe/RoamingState

file:C:/Users/brflynn/AppData/Local/Packages/Microsoft.Search_8wekyb3d8bbwe/LocalCache

file:C:/Users/brflynn/AppData/Local/Packages/Microsoft.Search_8wekyb3d8bbwe/AppData
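Once the dump is in a text file, a few lines of Python can summarize which protocols are present. This is a rough sketch that assumes one item URL per line and treats everything before the first colon as the protocol; the `mapi16` and `winrt` sample lines below are made-up placeholders, not real items.

```python
from collections import Counter


def count_by_protocol(lines):
    """Count item URLs per protocol, ignoring blank or malformed lines."""
    counts = Counter()
    for line in lines:
        url = line.strip()
        if not url or ":" not in url:
            continue  # skip blank lines and anything that is not a URL
        protocol = url.split(":", 1)[0].lower()
        counts[protocol] += 1
    return counts


# Hypothetical sample; in practice, pass an open file handle for the dump.
sample = [
    "file:C:/Users",
    "file:C:/Users/Public",
    "",
    "mapi16://{S-1-5-21-0-0-0-1001}/placeholder",
    "winrt://{S-1-5-21-0-0-0-1001}/placeholder",
]

for protocol, total in sorted(count_by_protocol(sample).items()):
    print(protocol, total)
```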

Item URLs & Protocols

Items in the index are represented in the following format:

Protocol:// {optional user security identifier} Path/Identity of Item

The protocol portion of the URL tells the indexer which protocol handler to load to process that item. Every item must have a protocol handler. My team owns the winrt:// and the file:// protocols and these are the default ones when you install Windows 10.

The other protocols you might see on your machine depending on which applications you install could be:

  • mapi16:// (Outlook)
  • oneindex16:// (OneNote)
  • IEHistory:// (Internet Explorer History)

Depending on whether the item is associated with a user, the next portion of the item’s URL will be a security identifier telling the indexing service which user context has access to the item. File URLs do not contain security identifiers tied to the items, unless the items are encrypted and specifically owned by a particular user.
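To make the URL layout concrete, here is a small Python sketch that splits an item URL into its protocol, optional SID, and path. It assumes the SID, when present, appears in braces immediately after the protocol separator, and it accepts both the `file:C:/...` form seen in the dump and the `protocol://` form. The `ItemUrl` type, the `parse_item_url` function, and the sample SID are all hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ItemUrl(NamedTuple):
    protocol: str
    sid: Optional[str]
    path: str


# protocol ":" optional "//" optional "{SID}" optional "/" then the rest.
_ITEM_URL = re.compile(
    r"^(?P<protocol>[A-Za-z][A-Za-z0-9+.\-]*):"
    r"(?://)?"
    r"(?:\{(?P<sid>S-\d+(?:-\d+)+)\})?"
    r"/?(?P<path>.*)$"
)


def parse_item_url(url):
    """Split an index item URL into (protocol, sid, path)."""
    match = _ITEM_URL.match(url.strip())
    if match is None:
        raise ValueError(f"not an item URL: {url!r}")
    return ItemUrl(match["protocol"].lower(), match["sid"], match["path"])


print(parse_item_url("file:C:/Users/Brendan"))
print(parse_item_url("mapi16://{S-1-5-21-1-2-3-1001}/placeholder"))
```

Real URLs produced by third-party protocol handlers may carry extra components, so treat the pattern as a starting point rather than a specification.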

Child Protocol Host Processes

You have probably seen a process named SearchProtocolHost.exe running on your system from time to time, or even several at once. This is the process that gets launched in a user or system context depending on which item is being indexed. Items that have a user SID in the URL must be opened in a process running in a logged-on user context. For items without a user security identifier, we create a SearchProtocolHost.exe process in a system context for indexing. Files and folders (using the file:// protocol) are the most typical items you will see being indexed by a SearchProtocolHost.exe process running in a system context.

There are two reasons we use this model. First, as described above, opening OneNote, Outlook, or other items that require a user context from within SearchIndexer.exe would mean impersonating the user, and we cannot do that over and over for every item we see, especially if the system has many users. Second, we load a handful of external plugins that developers and applications have created over time, and we do not want that code to run in our main service process.
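The routing rule described above (a user SID in the URL means a user-context host, no SID means a system-context host) can be sketched in a few lines of Python. This is only an illustration of the decision, not how the service is implemented; the function name and the assumption that the SID sits in braces right after `://` are mine.

```python
import re

# Assumed layout: protocol://{S-1-...}/rest-of-path
_USER_SID_PREFIX = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://\{S-\d+(?:-\d+)+\}")


def protocol_host_context(item_url):
    """Return "user" if the URL carries a user SID, otherwise "system"."""
    if _USER_SID_PREFIX.match(item_url.strip()):
        return "user"
    return "system"


print(protocol_host_context("file:C:/Users/Public"))
print(protocol_host_context("mapi16://{S-1-5-21-1-2-3-1001}/placeholder"))
```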

Content Filters and SearchFilterHost.exe

Applications can install custom property handlers or IFilter add-ins for a registered extension in the system. With .mp3 files, the artist, album, song title, and other information about the file are embedded in the file itself, but without a custom parser for the file’s contents to extract that information, the Indexer would be unable to find it and store it.

During indexing, the Indexer looks up which property handler and filter handler to load for each extension it sees. The property handler extracts the properties known for that extension; in most cases the file is not plain text and cannot be parsed in any generic way. Property handlers or filters that need to look through the contents or file stream to parse data are loaded into a third child process called SearchFilterHost.exe. This process is responsible for loading these filters and using them to parse the content so that the properties or text in the file can be stored in the index for searching.

This is very powerful for large documents or finding snippets of text in an item. Using FTQuery, I am going to search for a document that is on my machine by entering words I know are in the document.

I play a ton of fantasy football, and I happen to be trying to find a document that displays all the rankings of the NFL players regarding drafting them on a fantasy team. Here is my query:

Saquon Barkley is a running back in the NFL for the New York Giants (yes, I am a Giants fan 😊). By using some different clauses in the query, I was able to find all documents that contained both the text Saquon and Barkley. Looks like this text is in my personal OneDrive in the rankings file I was trying to find, and of course the draft of this blog post.
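The query itself is not reproduced here, so the Python sketch below shows one way such a content search could be written: it uses the CONTAINS predicate with the words joined by AND. Whether the original query used exactly this clause is an assumption, and the helper name `contains_all_query` is hypothetical.

```python
def contains_all_query(words):
    """Build a query for items whose indexed text contains every word.

    Assumption: a CONTAINS predicate with quoted terms joined by AND is
    an acceptable way to express "all of these words".
    """
    if not words:
        raise ValueError("at least one word is required")
    terms = []
    for word in words:
        if '"' in word or "'" in word:
            raise ValueError(f"quotes are not supported in search terms: {word!r}")
        terms.append(f'"{word}"')
    condition = " AND ".join(terms)
    return f"SELECT path FROM SystemIndex WHERE CONTAINS('{condition}')"


print(contains_all_query(["Saquon", "Barkley"]))
```

Rejecting quote characters keeps the sketch simple; a fuller version would escape them instead.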

The same results can be found by using the Windows Search Box.

NOTE: By prepending docs: you are telling Windows Search Box to go to the Documents page first and then run the query. By default, the Documents page searches both file contents and file properties. The All page searches only file properties, not contents.

For a complete detailed and documented list of all the ways you can search your index, please refer to our MSDN documentation.

You should rarely have to use a tool in the SDK to find files, and we’re constantly working with the Windows Search Box and File Explorer teams to make sure the search experience in those applications can support finding what the user is looking for. I will be publishing a post soon called “How to file Windows Search feedback” which will go into detail on what data is necessary when filing Windows Search feedback. It will contain details about ways to troubleshoot why searches may not be giving you the results you are after, and the best way to give us feedback so we can use it to improve our product.

That is all for this post, thanks for tuning in.

Next Up, concluding this series: How to Make the Most of Search on Windows


3 comments


  • Hugo Schmidt

    Hello Brendan,

    thx for your Articles - i am stronly hoping for/expecting your next article - How to file Windows Search feedback.

    intersting for me especially is - how to make it useful for User/Customers, which range of "kind of searches" can i run with easy combinations like x AND y -- or X and y created in 2019, or x and y modyfied last by mr brendan - can you show the range of...

  • John Schroedl

    “The All page will only search file properties, not contents.”

    This is another one of the reasons why it seems that Windows Search is broken. Can I change the default to include file contents?

    • Brendan FlynnMicrosoft employee Author

      Thanks for the feedback John. I have reached out to the Search Box team with this feedback directly. I would also file a feedback hub item and send me the link (either here) or @brflynn_ms on twitter so that I can route it for visibility.