Summary: Learn how to use Windows PowerShell regular expressions to parse an RSS feed.
Hey, Scripting Guy! How can I find patterns in text with regular expressions when I am unsure of where line breaks will occur in the text I am reading?
—TT
Hello TT,
Microsoft Scripting Guy, Ed Wilson here. Today Tome Tanasovski is back to finish out Guest Blogger Week. Tome will handle the answer to your inquiry.
Tome is a Windows engineer for a market-leading, global financial services firm in New York City. He is the founder and leader of the New York City PowerShell User group, a cofounder of the NYC Techstravaganza, a blogger, a speaker, and a regular contributor to the Windows PowerShell forum at Microsoft. He is currently working on the PowerShell Bible, which is due out in 2011 from Wiley. Tome is also a recipient of the MVP award for Windows PowerShell. Tome will be providing an hour-long deep dive into regular expressions via Live Meeting on March 22, 2011, for the UK PowerShell User group.
Regular expressions are one of the topics that will figure in the 2011 Scripting Games. In addition to the resources mentioned in the 2011 Scripting Games Study Guide, you should review Tome’s blog posts today and tomorrow. If you get a chance to attend the March 22, 2011 Live Meeting, that would be helpful as well.
Did you know that regular expressions have their roots in neurophysiology? I find this fascinating. Regular sets and expressions were created as a notation for describing patterns. I am amazed that, in the theoretical days of the mind and mathematics as early as the 1940s, we were already so close to understanding artificial intelligence.
If there’s one thing we learned from IBM’s Watson computer, it is that in order to create a machine that acts like the human mind, it not only needs a large database of information, but it must have the ability to trace the patterns that associate those bits of data together. A big part of what makes us human is the power to associate and find patterns—so, let us take a few minutes to understand what it is to be human by tackling a complex regular expression question together.
The most common method of consuming text data with a regular expression is to read it line by line and look for patterns. For example, we can read each line in a dictionary file and print out all of the words that begin with the letter a. This is shown in the following example.
foreach ($line in (Get-Content dictionary.txt)) {
    # Print every word that begins with the letter a
    if ($line -match '^a') {
        $line
    }
}
Unfortunately, things are not always as controlled as this example. It is possible that you will not know how many lines lie between the things that you are trying to match. The most common example of this is when you are looking for patterns within text that is returned from a web server. Although there may be better ways to consume XML data, we are going to use the RSS feed for the Scripting Guys Script Repository as our example because the structure of XML makes it easier to understand the technique.
If you are unfamiliar with the RSS specifications, you only need to know that we are expecting to find a series of items between <item></item> tags. Within these item tags there will be other tags such as title, link, content, author, and description. There is no rule that says all of the tags must be used, nor is there a rule that governs the order in which the tags appear. An item can exist on one line such as the one shown here.
<item><title>Hey Scripting Guy! Rules</title><description>None needed</description></item>
On the other hand, it can also exist on multiple lines as illustrated here.
<item><title>
Hey Scripting Guy! Rules</title>
<description>None needed</description>
</item>
If you are asked to find the text inside the title and description tags of each item, you would think this is an easy task to accomplish. However, how do you explain this to the regular expression engine?
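To see why the question is not trivial, here is a minimal sketch that runs the same pattern against both layouts of the item. The variable names and the sample strings are assumptions made up for illustration; the pattern relies on the dot, which does not cross a line break.

```powershell
# Hypothetical sample data: the same item on one line and spread over several lines.
$oneLine = '<item><title>Hey Scripting Guy! Rules</title><description>None needed</description></item>'
$multiLine = @(
    '<item><title>'
    'Hey Scripting Guy! Rules</title>'
    '<description>None needed</description>'
    '</item>'
) -join "`n"

# A pattern built on the dot cannot cross a line break.
$dotPattern = '<title>(.*?)</title>'

$oneLineFound   = $oneLine   -match $dotPattern   # True: the title sits on a single line
$multiLineFound = $multiLine -match $dotPattern   # False: a newline follows <title>
```

The second test fails even though both strings describe the same item, which is the gap the three techniques below are meant to close.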
The problem is solved with three steps or techniques. First, you need to use a regular expression on the entire string rather than processing the text in a line-by-line fashion. Second, you need a way to accept anything between the tags – including new lines and carriage returns. Third, you need to process the match repeatedly in the string so that you can return the numerous <item> tags we are bound to find.
Parsing an entire string of text with a regular expression
Parsing a string is fairly straightforward by using regular expressions. When we use the DownloadString method on a System.Net.WebClient object, the content is returned as a single string. Rather than splitting that string and iterating over each line, apply the regular expression to the whole string by using your favorite regex operator or cmdlet as shown here.
$web = New-Object System.Net.WebClient
$scRss = $web.DownloadString('http://gallery.technet.microsoft.com/scriptcenter/site/feeds/searchRss?sortBy=date')
$scRss | Select-String -Pattern '<item>' | Select-Object Matches
$scRss -match '<item>'
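The difference between working on one string and working on an array of lines can be sketched offline. The $sample string below is a hypothetical stand-in for the downloaded feed, and the variable names are assumptions for illustration only.

```powershell
# Hypothetical stand-in for the text returned by DownloadString.
$sample = "<rss>`n<item><title>First</title></item>`n<item><title>Second</title></item>`n</rss>"

# Against a single string, -match returns a Boolean and fills $matches with the first match.
$asString = $sample -match '<item>'

# Against an array of lines, -match acts as a filter and returns the lines that match.
$asLines = ($sample -split "`n") -match '<item>'

# Select-String with -AllMatches reports every occurrence inside the single string.
$hits = ($sample | Select-String -Pattern '<item>' -AllMatches).Matches
```

Here $asString is a Boolean, $asLines holds the two lines that contain an item, and $hits holds one match object per occurrence.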
It is also useful to have a technique to grab an entire file into a single string in case you would like to apply the same approach to a data file. This technique is shown in the following example.
# This one is fastest, but will fail on really large files
$text = [System.IO.File]::ReadAllText('C:\file.txt')
# This one is much slower, but is more consistent
# (-join puts back the line breaks that Get-Content strips from each line)
$text = (Get-Content C:\file.txt) -join "`r`n"
Accepting anything between tags including line terminators: [\s\S]*?
The [\s\S] pattern is one of the character sets that I use most often. It tells the regular expression engine to match any character that is white space or is not white space, which means any character at all. You could pair any metacharacter with its negation to create the same class; for example, [\w\W] and [\d\D] both have the same meaning.
Unfortunately, you cannot use the dot (.) in this scenario because, by default, the dot matches every character except the newline. You may not be aware of this, but \s (white space) matches new lines and carriage returns. This is a great metacharacter to use when you are unsure of exactly how the text terminates lines. The most common line terminators are `n (newline) and `r`n (carriage return followed by newline), but there is no way to know for sure which one the text you are consuming uses. Therefore, \s (and with it [\s\S]) handles that unknown scenario.
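A small sketch makes the difference visible. The two sample strings are hypothetical and use the two common line terminators; the variable names are assumptions for illustration.

```powershell
# Hypothetical two-line strings, one per common line terminator.
$unix    = "first`nsecond"
$windows = "first`r`nsecond"

# The dot stops at the newline, so this pattern fails on both strings.
$dotUnix    = $unix    -match 'first.*second'
$dotWindows = $windows -match 'first.*second'

# \s matches the carriage return as well as the newline.
$wsUnix    = $unix    -match 'first\s*second'
$wsWindows = $windows -match 'first\s*second'

# [\s\S] matches any character at all, including line terminators.
$anyWindows = $windows -match 'first[\s\S]*second'
```

Both dot-based tests return False, while the \s and [\s\S] tests return True regardless of which terminator is used.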
In addition to using [\s\S], we are also using the *? quantifier to indicate that [\s\S] can match zero or more times (*), but that it should match as few characters as possible (?). Here is an example of a regular expression that finds the contents of the <title></title> tags within the first pair of <item></item> tags.
$scRss -match '<item>[\s\S]*?<title>([\s\S]*?)</title>[\s\S]*?</item>'
$matches[1] # Display the contents of the parentheses in the regex
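The effect of the trailing question mark is easiest to see side by side. This is a sketch against a hypothetical two-item fragment, not the live feed, and the variable names are assumptions.

```powershell
# Hypothetical feed fragment with two items.
$sample = "<item><title>First</title></item>`n<item><title>Second</title></item>"

# Greedy: [\s\S]* runs to the end of the text, then backtracks to the LAST </title>.
$null = $sample -match '<title>([\s\S]*)</title>'
$greedy = $matches[1]

# Lazy: [\s\S]*? stops at the FIRST </title> it can reach.
$null = $sample -match '<title>([\s\S]*?)</title>'
$lazy = $matches[1]
```

The lazy capture contains only the first title, whereas the greedy capture swallows everything from the first title through the text of the second one.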
Matching multiple times in a string
Multiple matching in a string is not that difficult, but it can be confusing to consume the information that is returned. In order to match multiple times, you can use the -AllMatches parameter of the Select-String cmdlet. You then need to access the Matches property that is returned from Select-String. Each match will then have groups that correspond to the contents of the captures (parentheses) in your regular expression. If we wanted to get the title of each item in the RSS feed, we would use a command similar to this one.
$regex = '<item>[\s\S]*?<title>([\s\S]*?)</title>[\s\S]*?</item>'
($scRss | Select-String -Pattern $regex -AllMatches).Matches | ForEach-Object {
    $_.Groups[1].Value
}
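The same idea can be tried without a network connection. In this sketch the $sample text is hypothetical, and the pattern closes with </item> so that each match consumes exactly one item. The [regex]::Matches call is shown only as an alternative route to the same result.

```powershell
# Hypothetical sample with three items, one of them spread over several lines.
$sample = @(
    '<item><title>One</title></item>'
    '<item>'
    '<title>Two</title>'
    '</item>'
    '<item><title>Three</title></item>'
) -join "`n"

$regex = '<item>[\s\S]*?<title>([\s\S]*?)</title>[\s\S]*?</item>'

# Collect the first capture group of every match found by Select-String.
$titles = ($sample | Select-String -Pattern $regex -AllMatches).Matches |
    ForEach-Object { $_.Groups[1].Value }

# The .NET regex class returns the same matches without Select-String.
$titlesNet = [regex]::Matches($sample, $regex) |
    ForEach-Object { $_.Groups[1].Value }
```

Both $titles and $titlesNet end up holding the three titles in document order.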
Real-world example
Although we have answered the initial question, I think it’s interesting to show this technique with a practical application. This final example puts it all together. It captures the contents of <link> and <title> within each <item> in the RSS feed for the Scripting Guys Script Repository, and it displays them in a grid view.
$web = New-Object System.Net.WebClient
$scRss = $web.DownloadString('http://gallery.technet.microsoft.com/scriptcenter/site/feeds/searchRss?sortBy=date')
# This pattern assumes that <link> appears before <title> inside each item
$regex = '<item>[\s\S]*?<link>([\s\S]*?)</link>[\s\S]*?<title>([\s\S]*?)</title>[\s\S]*?</item>'
($scRss | Select-String -Pattern $regex -AllMatches).Matches | ForEach-Object {
    # Group 1 is the link and group 2 is the title
    $obj = New-Object PSObject
    $obj | Add-Member -MemberType NoteProperty -Name Title -Value $_.Groups[2].Value
    $obj | Add-Member -MemberType NoteProperty -Name Link -Value $_.Groups[1].Value
    $obj
} | Out-GridView -Title 'Scripting Guys Script Repository Scripts'
The output of the script appears in the following image.
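Because RSS does not guarantee the order of the tags inside an item, a two-pass variation may be worth considering. This is only a sketch under assumptions, not the approach used above: the $sample text, the example.com links, and the variable names are all hypothetical. The first pass isolates each item, and the second pass looks for each tag inside that item only.

```powershell
# Hypothetical sample: item 2 lists <link> after <title>, and item 3 has no <link>.
$sample = @(
    '<item><link>http://example.com/1</link><title>One</title></item>'
    '<item><title>Two</title>'
    '<link>http://example.com/2</link></item>'
    '<item><title>Three</title></item>'
) -join "`n"

# Pass 1: capture the body of every item.
$items = [regex]::Matches($sample, '<item>([\s\S]*?)</item>') | ForEach-Object {
    $body = $_.Groups[1].Value
    $obj = New-Object PSObject
    # Pass 2: search for each tag within this item's body, in any order.
    foreach ($tag in 'Title', 'Link') {
        $m = [regex]::Match($body, "<$tag>([\s\S]*?)</$tag>", 'IgnoreCase')
        # A missing tag becomes an empty string rather than leaking into the next item.
        $value = if ($m.Success) { $m.Groups[1].Value.Trim() } else { '' }
        $obj | Add-Member -MemberType NoteProperty -Name $tag -Value $value
    }
    $obj
}
```

Piping $items to Out-GridView would display them in the same way as the script above. The trade-off is a second regular expression per tag in exchange for not depending on tag order.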
TT, that is all there is to using regular expressions to process RSS feed data. Thank you, Tome, for sharing with us again today. That wraps up Guest Blogger Week. Tomorrow is the weekend, and you know that means…Weekend Scripter.
I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.
Ed Wilson, Microsoft Scripting Guy