October 18th, 2006

How Can I Split a String Only on Specific Instances of a Character?

Hey, Scripting Guy! Question

Hey, Scripting Guy! I have a series of string values that I need to convert to an array, splitting the value on the ampersand (&). However, if the ampersand is followed by amp (in other words, if the string is &amp) I don’t want to split the string at that point. How do I do that?

— AK

Hey, Scripting Guy! Answer

Hey, AK. We apologize for taking so long to post the answer to your question. We actually answered this a long time ago, but every time we sent it in for editing the Scripting Editor rejected it; the whole thing sounded so crazy she assumed we’d started drinking. But that’s not true; we actually started drinking a little over 6 years ago.

Note. Why, yes, as a matter of fact the Scripting Guy who writes this column did start working at Microsoft a little over 6 years ago. How did you know?

Editor’s Note: Keep in mind that by “drinking” we’re not necessarily referring to alcoholic beverages. One too many espresso shots have been known to have adverse effects on some of the Scripting Guys. Although so has one too few shots. It’s a delicate balance around here sometimes.

Admittedly, this is a somewhat crazy question. So let’s see if we can explain the situation a little better. Based on your email we’re assuming you have a string that looks something like this:

aaa&bbb&amp;ccc&ddd

What you’d like to do is convert this string into an array, splitting the string on the ampersand. However – and this is an important however – you don’t want to split on any ampersand that’s followed by the letters amp. In other words, you want to split the string only at the first and last ampersands, leaving the one in the middle (the one that begins &amp;) alone:

aaa&bbb&amp;ccc&ddd

In the end, that would give you an array with the following elements:

aaa
bbb&amp;ccc
ddd

It’s crazy, but we think we understand what you want to do. Granted, we’re not totally sure why you want to do it. But, then again, that’s not really any of our business, is it?

Actually, we’ve encountered similar situations before, particularly in Microsoft Word. On more than one occasion we’ve been given documents that have a paragraph return at the end of each line. We need to replace those paragraph returns with blank spaces. However, we can’t just go out and replace every single paragraph return in the document; if we did that then we’d replace the legitimate paragraph returns (that is, the ones between paragraphs) as well. To solve that problem in Microsoft Word we used an approach very similar to the approach we’re going to use today, which is why we considered ourselves at least semi-qualified to answer this question.

Oh, good point: what is the approach we’re going to use today? This:

strText = "aaa&bbb&amp;ccc&ddd"

strText = Replace(strText, "&amp", "@@@@")

arrText = Split(strText, "&")

For i = 0 to Ubound(arrText)
    arrText(i) = Replace(arrText(i), "@@@@", "&amp")
Next

For Each strItem in arrText
    Wscript.Echo strItem
Next

Trust us; it’s supposed to look like that.

Before the rest of you start accusing us of drinking (as his long-suffering family will attest, the Scripting Guy who writes this column is too much of a cheapskate to be a drinker [yes, those lattes get expensive]) let’s explain what we’re doing here. In the first line of code we take our string value and assign it to a variable named strText. In other words, everything starts out pretty easy. But then it gets a little weird:

strText = Replace(strText, "&amp", "@@@@")

What’s the point of that line of code? Well, as we know, we want to split the line at each and every ampersand … provided, of course, that the ampersand isn’t followed by the letters amp. That means that the value &amp is a problem: it has an ampersand, which means we ought to split the line at that point. However, that ampersand is followed by the letters amp, which means we shouldn’t split the line at that point. Obviously what we need to do is find each ampersand, then check to see if the character is followed by the letters amp.
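For the curious, a rough sketch of that find-each-ampersand-and-check approach might look something like the following. It walks the string with InStr and Mid; the variable names (intItems, intStart, intPos) are ours, chosen purely for illustration, and the comparison is case-sensitive.

```vbscript
strText = "aaa&bbb&amp;ccc&ddd"

ReDim arrText(0)
intItems = 0      ' number of elements stored so far
intStart = 1      ' position where the current element begins
intPos = InStr(1, strText, "&")

Do While intPos > 0
    ' Split here only if this ampersand is NOT followed by "amp"
    If Mid(strText, intPos + 1, 3) <> "amp" Then
        ReDim Preserve arrText(intItems)
        arrText(intItems) = Mid(strText, intStart, intPos - intStart)
        intItems = intItems + 1
        intStart = intPos + 1
    End If
    ' Look for the next ampersand, starting just past this one
    intPos = InStr(intPos + 1, strText, "&")
Loop

' Whatever follows the last split point becomes the final element
ReDim Preserve arrText(intItems)
arrText(intItems) = Mid(strText, intStart)

For Each strItem In arrText
    WScript.Echo strItem
Next
```

With the sample string this echoes aaa, bbb&amp;ccc, and ddd, at the cost of a fair amount of bookkeeping.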

But, to tell you the truth, that sounded way too hard. (It can be done, but ….) And so we decided to use logic instead. If the value &amp is causing us problems then the logical thing to do is to get rid of that value. That’s why we used VBScript’s Replace function to replace all instances of &amp with @@@@.

Before you ask, no, we didn’t have to use @@@@; we could have used any string of characters that doesn’t appear elsewhere in the text. Our text doesn’t include the string !!!!, which means we could have used that value instead. However, our text does include the value bbb, which means that bbb would be a poor choice as a replacement value. Why? Hopefully that will become clear in just a moment.
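If you’d rather not trust your eyes when picking a replacement value, one possible safeguard is to test the candidate first. This is only a sketch: strMarker and blnSafe are names we made up, and the check assumes that "safe" means the marker neither appears in the text nor contains the character you plan to split on.

```vbscript
strText = "aaa&bbb&amp;ccc&ddd"
strMarker = "@@@@"

' InStr returns 0 when the search string is not found.
' The marker must not already be in the text, and it must not
' contain the delimiter, or Split would cut it in two.
blnSafe = (InStr(strText, strMarker) = 0) And (InStr(strMarker, "&") = 0)

If blnSafe Then
    WScript.Echo strMarker & " looks safe to use."
Else
    WScript.Echo strMarker & " is a poor choice; pick another value."
End If
```

Swap in bbb for @@@@ and the same check reports a poor choice, because bbb already appears in the text.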

So what have we gained by doing this? What we’ve gained is this: the value of strText is now equal to the following, with @@@@ replacing any instances of &amp:

aaa&bbb@@@@;ccc&ddd

That might not look like that big of a deal, but it is: after all, with the problem value &amp temporarily removed we’re free to split our string on all the remaining ampersands, something we do here:

arrText = Split(strText, "&")

In turn, that gives us an array consisting of the following items:

aaa
bbb@@@@;ccc
ddd

See? That’s pretty good; in fact, if it wasn’t for the value @@@@ stuck in the middle of line 2 it would be perfect. But that’s OK; to paraphrase an age-old parental threat, we brought @@@@ into this world and we can take @@@@ out of this world (or at least out of our array). In order to restore the original text we set up a For Next loop that loops through each item in the array. For each of those items we use the Replace function to replace any instances of @@@@ with – you guessed it — &amp. That’s what this block of code is for:

For i = 0 to Ubound(arrText)
    arrText(i) = Replace(arrText(i), "@@@@", "&amp")
Next

To make a long story short, early on the value &amp was causing us problems; therefore, we temporarily removed that value from the string. All we’re doing now is restoring &amp to its rightful place. Make sense?
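Remember the warning about bbb being a poor choice of replacement value? Here is a small sketch of what goes wrong if you use it anyway. The names strBad and arrBad are ours, just for illustration.

```vbscript
strText = "aaa&bbb&amp;ccc&ddd"

' Use bbb as the placeholder, even though bbb is already in the text
strBad = Replace(strText, "&amp", "bbb")     ' aaa&bbbbbb;ccc&ddd
arrBad = Split(strBad, "&")

' The restore step cannot tell the original bbb from the placeholder,
' so both get turned into &amp
For i = 0 To UBound(arrBad)
    arrBad(i) = Replace(arrBad(i), "bbb", "&amp")
Next

WScript.Echo arrBad(1)    ' &amp&amp;ccc rather than bbb&amp;ccc
```

The original bbb is gone for good, which is why the placeholder has to be something that never occurs in the text.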

In fact, if you now echo back all the values in the array (something we do in the last block of code) you get back this:

aaa
bbb&amp;ccc
ddd

Which, believe it or not, is exactly what we wanted to get back.
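If you need to do this to a whole series of strings, you could wrap the three steps in a function. The following is a minimal sketch, not a built-in: SplitExcept and its parameter names are hypothetical, and it assumes the marker you pass in does not occur in the text.

```vbscript
' SplitExcept is a hypothetical helper, not a built-in VBScript function.
' It hides strProtected behind strMarker, splits on strDelimiter,
' and then restores strProtected in every element.
Function SplitExcept(strText, strDelimiter, strProtected, strMarker)
    Dim strTemp, arrParts, i
    strTemp = Replace(strText, strProtected, strMarker)
    arrParts = Split(strTemp, strDelimiter)
    For i = 0 To UBound(arrParts)
        arrParts(i) = Replace(arrParts(i), strMarker, strProtected)
    Next
    SplitExcept = arrParts
End Function

arrResult = SplitExcept("aaa&bbb&amp;ccc&ddd", "&", "&amp", "@@@@")

For Each strItem In arrResult
    WScript.Echo strItem
Next
```

Called this way it echoes the same three values shown above.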

Whew. You know, we could really use a drink right about now.

Of water. Sheesh.
