Expert Solution for 2011 Scripting Games Advanced Event 6: Use PowerShell to Get Twitter IDs from a Web Page


Summary: Microsoft Windows PowerShell senior software engineer, Lee Holmes, solves 2011 Scripting Games Advanced Event 6 and gets Twitter IDs from a web page.

Microsoft Scripting Guy, Ed Wilson, here. Today we have Lee Holmes as our expert commentator for Advanced Event 6.

Photo of Lee Holmes

Lee Holmes is a senior software engineer on the Microsoft Windows PowerShell team, and he has been an authoritative source of information about Windows PowerShell since its earliest betas. He is the author of the Windows PowerShell Cookbook, Windows PowerShell Pocket Reference, and the Windows PowerShell Quick Reference.
Lee’s contact information:
Blog: Precision Computing

Worked solution

While getting ready to attend a SQL Saturday event, you suddenly realize that this is the perfect opportunity to flex your scripting and social networking muscles at the same time. Fortunately, the SQL Saturday site is so organized that it has a page for everybody who is coming. Let’s figure out their Twitter user names.

When dealing with data from the wild internet, you generally have three options: web services, highly structured data feeds (such as RSS, Atom, and REST), or the basic HTML normally intended for web browsers. For the first two, Windows PowerShell offers some excellent tools. For web services, the New-WebServiceProxy cmdlet lets you interact with the resource as though it were a regular .NET object, working with properties, calling methods, and more. For highly structured data feeds, Windows PowerShell’s [XML] type adapter makes quick work of the content they return.
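As a rough illustration of the [XML] type adapter, here is a minimal sketch. The feed text is invented and inlined so that nothing is downloaded; the titles and example.com links are assumptions, not data from any real feed.

```powershell
# Hypothetical feed content, inlined so the sketch runs without network access.
$feedText = @'
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>
'@

# The [xml] cast parses the text into an XmlDocument. The type adapter
# then exposes child elements as properties.
$feed = [xml] $feedText
$titles = $feed.rss.channel.item | ForEach-Object { $_.title }
$titles
```

Walking the document with dotted property names is what makes this approach so quick for well-formed feeds.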

In this case, the SQL Saturday networking page is just a simple web page. Let’s use the System.Net.WebClient class to download it and see what it contains:

$wc = New-Object System.Net.WebClient
$htmlContent = $wc.DownloadString($uri)
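
If you want to try DownloadString without touching the network, one option is to point it at a local file. The following is only a sketch under that assumption: the temporary file and its contents are made up, and a file URI stands in for the real web address.

```powershell
# Write a small HTML file to a temporary location (a stand-in for the real page).
$tempFile = [System.IO.Path]::GetTempFileName()
Set-Content -Path $tempFile -Value '<html><body>Hello from a local file</body></html>'

# Assumption: a file URI takes the place of the http address.
$sampleUri = New-Object System.Uri $tempFile

$wc = New-Object System.Net.WebClient
try {
    $sampleContent = $wc.DownloadString($sampleUri)
}
finally {
    # Release the client and clean up the temporary file.
    $wc.Dispose()
    Remove-Item -Path $tempFile
}

$sampleContent
```

Wrapping the call in try/finally is a design choice for the sketch, so that the client is disposed even if the download fails.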

Sometimes, web pages are written in XHTML, a form of HTML that is also well-formed XML and is therefore much more structured than regular HTML. When this is true, you can use Windows PowerShell’s [XML] type adapter to work with the content.
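
To make the difference concrete, here is a small sketch with two invented snippets: one that is well formed, and one written the loose way many pages are. Neither comes from the SQL Saturday page.

```powershell
# Well-formed XHTML (every tag closed, attributes quoted) casts cleanly.
$xhtml = '<html><body><p>One</p><p>Two</p></body></html>'
$doc = [xml] $xhtml
$paragraphs = @($doc.html.body.p)

# Typical loose HTML (an unclosed <br> and an unquoted attribute) does not.
$looseHtml = '<html><body><font size=3>One<br>Two</font></body></html>'
$parsedLoose = $true
try {
    $null = [xml] $looseHtml
}
catch {
    $parsedLoose = $false
}

"Paragraphs found: $($paragraphs.Count)"
"Loose HTML parsed as XML: $parsedLoose"
```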

Unfortunately, for us, the SQL Saturday page is not an example of this kind of page:

Image of code

That’s OK—mankind made it to the moon without the help of XML. Surely, we can extract data from a web page without it!

When we take a look at the contents of $htmlContent, we see that all of the links to Twitter accounts follow a pattern:

PS > $htmlContent
<span id="ctl00_ContentPlaceHolder1_DataList1_ctl35_Label1"><font size="3">Ed Wilson
<a href="" class="noarrrow">(…)</span>

This is the kind of pattern that lends itself to the Select-String cmdlet. The Select-String cmdlet takes text (or files) as input, applies a Regular Expression to that content, and returns objects that represent the matches. By default, it records only the first match in each input string. If you specify the AllMatches parameter, it records every match it finds, and you can read them from the Matches property of the result.
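
Here is a quick sketch of what the AllMatches parameter changes. The sample sentence and the \d+ pattern are invented for illustration.

```powershell
$sampleText = 'Rooms 101, 204, and 315 are booked.'

# Without -AllMatches, Matches holds only the first hit.
$first = $sampleText | Select-String -Pattern '\d+'

# With -AllMatches, Matches holds every hit in the input string.
$all = $sampleText | Select-String -Pattern '\d+' -AllMatches

$first.Matches.Count   # 1
$all.Matches.Count     # 3
$all.Matches | ForEach-Object { $_.Value }
```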

Regular Expressions are, at their heart, a finely-tuned language with the sole purpose of parsing text. Writing one is part art, part science. Although Windows PowerShell does an amazing job at shielding you from the crazy world of text parsing, Regular Expressions become invaluable for those times when text is all you’ve got.

Here’s the pattern we’ll use:

$pattern = '<a href="http://www.twitter.com/([^"]*)"'

The Regular Expression portion inside the single quotes says:

1) Find the literal text: <a href="http://www.twitter.com/

2) Start remembering the stuff that comes next: (

3) Find a bunch of characters: [ ]*

4) That are not ( ^ ) the quote character: "

5) Then stop remembering: )

6) And find another quote character: "
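
The steps above can be tried on a single made-up link. In this sketch, example.com and the user name alice are placeholders, and the -match operator stands in for Select-String just to keep things short.

```powershell
# Hypothetical link text. example.com stands in for the real site.
$sampleLink = '<a href="http://example.com/users/alice" class="profile">'

# Literal prefix, then a capture group of non-quote characters, then a closing quote.
$samplePattern = '<a href="http://example.com/users/([^"]*)"'

if ($sampleLink -match $samplePattern) {
    $wholeMatch = $Matches[0]   # everything the pattern matched
    $captured   = $Matches[1]   # only the text remembered by the parentheses
}

$wholeMatch
$captured
```

Because the group refuses to cross a quote character, it stops at the end of the href value and never swallows the class attribute that follows.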

Next, we supply this pattern to the Select-String cmdlet. We use the AllMatches parameter to get all of the results:

$result = $htmlContent | Select-String -Pattern $pattern -AllMatches

Here’s what one result looks like:

PS > $result.Matches[0]


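Groups   : {<a href="http://www.twitter.com/mhthomas42", mhthomas42}

Success  : True

Captures : {<a href="http://www.twitter.com/mhthomas42"}

Index    : 28613

Length   : 43

Value    : <a href="http://www.twitter.com/mhthomas42"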

In Regular Expressions, the text inside parentheses is called a group, so Windows PowerShell exposes the values in the Groups property. Groups[0] represents everything that was matched, while Groups[1] and on represent any groups that we defined. We can dig into the groups to find the information we need:

PS > $result.Matches[0].Groups[1]

Success  : True

Captures : {mhthomas42}

Index    : 28645

Length   : 10

Value    : mhthomas42
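
To see how Groups[0] and Groups[1] relate, here is a sketch against two invented links (example.com, alice, and bob are all placeholders). It reports where each captured name sits inside its whole match.

```powershell
$sampleHtml = '<a href="http://example.com/users/alice"> <a href="http://example.com/users/bob">'
$samplePattern = '<a href="http://example.com/users/([^"]*)"'

$sampleResult = $sampleHtml | Select-String -Pattern $samplePattern -AllMatches

$lines = foreach ($m in $sampleResult.Matches) {
    $whole = $m.Groups[0]   # the entire matched text
    $name  = $m.Groups[1]   # the first (and only) capture group

    # The captured group starts inside the whole match, right after the literal prefix.
    $offset = $name.Index - $whole.Index
    "{0} (offset {1}, length {2})" -f $name.Value, $offset, $name.Length
}

$lines
```

The offset is the same for every match because the literal prefix in the pattern never changes. Only the captured part varies.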

Success! Now, we weren’t interested in just one user name, so let’s call our good friend Foreach-Object to process them all, and then spend the rest of the day gloating:

$usernames = $result.Matches | Foreach-Object { $_.Groups[1].Value }

PS > $usernames







Oh no! We’ve got a bug! Somehow, we left something extra in somebody’s user name. Let’s take a look at the content itself:

<span id="ctl00_ContentPlaceHolder1_DataList1_ctl01_Label1"><font size="3">Aaron Nelson <a href="" cl…

It turns out that we aren’t to blame…we’ve just got dirty data!

Although we could get “smarter” with our Regular Expression, it’s usually more trouble than it’s worth. Taking a second pass on the data is often easier, so let’s take that approach. We can go through the user names again, this time using Windows PowerShell’s Replace operator to remove anything up to (and including) the slash:

$usernames -replace ".*/(.*)", '$1'

The Replace operator has two parts on the right-hand side: the Regular Expression to find, and the content to replace it with.

Converting that to English, we have:

1) Find a bunch of text: .*

2) Followed by a slash: /

3) Then start remembering stuff: (

4) Find a bunch of text: .*

5) And then stop remembering stuff: )

For the replacement portion, ‘$1’ means “the text that was captured in Group #1.” As with the Select-String example, Group #1 is the text that was matched inside the parentheses. Putting this in single quotes is important: if you use double quotes, Windows PowerShell treats “$1” as a Windows PowerShell variable and expands it, just as it does in any other double-quoted string.
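
A small sketch shows the quoting difference in action. The sample names are invented, and the sketch assumes that no variable named 1 is defined and that strict mode is off.

```powershell
$sampleNames = 'alice', 'http://example.com/bob'

# Single quotes: the regex engine receives the literal text $1.
$withSingle = $sampleNames -replace '.*/(.*)', '$1'

# Double quotes: PowerShell expands $1 as a variable first. No such variable
# is defined here, so the replacement becomes an empty string.
$withDouble = $sampleNames -replace '.*/(.*)', "$1"

$withSingle   # alice, bob
$withDouble   # alice, (empty string)
```

Notice that alice passes through untouched in both cases: the pattern requires a slash, so a clean user name never matches and is never replaced.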

Although there are simpler ways to do this replacement, that wouldn’t give us an excuse to talk about capture group references, now would it?
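
For the curious, here are two shorter alternatives. They are guesses at what “simpler” might look like rather than anything the solution itself prescribes, and the sample names are invented.

```powershell
$sampleNames = 'alice', 'http://example.com/bob'

# Option 1: replace everything up to the last slash with nothing.
# When -replace gets no replacement text, it removes what it matches.
$viaReplace = $sampleNames -replace '.*/'

# Option 2: split on slashes and keep the last piece.
$viaSplit = $sampleNames | ForEach-Object { $_.Split('/')[-1] }

$viaReplace
$viaSplit
```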

When we put this all into a script, we’ve got a powerful little web scraper!


## Retrieves the list of Twitter usernames from

param(
    ## The web page holding the twitter usernames
    [URI] $Uri = ""
)

## Download the file
$wc = New-Object System.Net.WebClient
$htmlContent = $wc.DownloadString($uri)

## Find all hyperlinks that are of the form: http://www.twitter.com/<username>
$pattern = '<a href="http://www.twitter.com/([^"]*)"'
$result = $htmlContent | Select-String -Pattern $pattern -AllMatches
$usernames = $result.Matches | Foreach-Object { $_.Groups[1].Value }

## Dirty data! Welcome to the internet!
## Some of the URLs are incorrect, such as
## If a username has a slash in it, just take everything after it.
$usernames -replace ".*/(.*)", '$1'
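
If you would like to experiment with the same steps without downloading anything, they can be wrapped in a function that works on text you already have. This is only a sketch: Get-CapturedName is a hypothetical name, and the sample HTML and example.com links are invented.

```powershell
# Hypothetical helper: applies a one-group pattern to HTML text, then
# cleans up any captured value that contains a slash.
function Get-CapturedName {
    param(
        [string] $Html,
        [string] $Pattern
    )

    $found = $Html | Select-String -Pattern $Pattern -AllMatches

    # Select-String returns nothing when the pattern never matches.
    if (-not $found) { return }

    $found.Matches |
        ForEach-Object { $_.Groups[1].Value } |
        ForEach-Object { $_ -replace '.*/(.*)', '$1' }
}

# Invented sample data. The second link repeats a URL inside the href.
$sampleHtml = @'
<span>Alice <a href="http://example.com/users/alice" class="x">(...)</a></span>
<span>Bob <a href="http://example.com/users/http://example.com/bob" class="x">(...)</a></span>
'@

$names = Get-CapturedName -Html $sampleHtml -Pattern '<a href="http://example.com/users/([^"]*)"'
$names
```

Taking the HTML as a parameter keeps the parsing separate from the download, which makes it easy to test against a handful of tricky lines before pointing it at a real page.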


You can also find Lee’s script at the Script Repository.

Thank you, Lee.

I invite you to follow me on Twitter and Facebook. If you have any questions, send me email or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.

Ed Wilson, Microsoft Scripting Guy