Hey, Scripting Guy! How Can I Delete Duplicate Entries in a Tab-Delimited File?

ScriptingGuy1

Hey, Scripting Guy! Question

Hey, Scripting Guy! How can I delete the duplicate entries in a tab-delimited file?
— ST

SpacerHey, Scripting Guy! AnswerScript Center

Hey, ST. Before we start, is anyone out there familiar with the book Not Quite What I Was Planning: Six-Word Memoirs by Writers Famous and Obscure? Not Quite What I Was Planning is an interesting little book in which the editors of SMITH magazine asked people to summarize their lives in six words. Examples from both the book and the magazine include the following:

Found true love, married someone else.

I still make coffee for two.

Born good. Went bad. Good again.

Road less traveled, now know why

Oh, and this one from Amy Sedaris:

Mushrooms. Clowns. Wands. Five. Wig. Thatched.

Well, OK; never mind about that last one.

At any rate, the Scripting Guy who writes this column thought it might be cool to apply this idea to Hey, Scripting Guy! After all, if he could answer the day’s question in just six words, well, that would save him a heck of a lot of time and effort. Unfortunately, though, he had to quickly give up on that idea; as it turns out, the Scripting Guy who writes this column can’t even refer to himself using just six words:

1.

The

2.

Scripting

3.

Guy

4.

who

5.

writes

6.

this

7.

column

Needless to say, he tends to be a little long-winded from time-to-time.

Actually he did come up with one possibility: “Here is the code. Good luck.” However, he wasn’t sure if anyone would count that as having sufficiently explained the script and how it works.

Therefore, he decided to compromise a little: we’ll do part of today’s column in six-word blocks, enclosing those six-word explanations in square brackets. The rest of the column will be done in typical Hey, Scripting Guy! fashion. [Typical fashion. Why are you booing?]

So let’s get started. How can you delete duplicate entries from a tab-delimited file? [Beats us. Try reading the article.]

Note. Say what you want, but it’s still better than the one Amy Sedaris came up with. And she’s a real writer and comedian.

Here’s a script that can delete duplicate entries from a tab-delimited file (although we’re ashamed to admit that the script requires more than six words):

Const ForReading = 1
Const ForWriting = 2

Set objDictionary = CreateObject("Scripting.Dictionary")

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\Scripts\Test.txt", ForReading)

strContents = objFile.ReadAll
objFile.Close

strContents = Replace(strContents, vbCrLf, vbTab)
arrContents = Split(strContents, vbTab)

For Each strItem in arrContents
    If Not objDictionary.Exists(strItem) Then
        objDictionary.Add strItem, strItem   
    End If
Next

i = 1

For Each strKey in objDictionary.Keys
    If i < 3 Then
        strNewContents = strNewContents & strKey & vbTab
        i = i + 1
    Else
        strNewContents = strNewContents & strKey & vbCrLf
        i = 1
    End If
Next

Set objFile = objFSO.OpenTextFile("C:\Scripts\Test.txt", ForWriting)
objFile.Write strNewContents
objFile.Close

As you can see, our script starts out by creating a pair of constants, ForReading and ForWriting, a pair of constants we’ll need when we open our text file. [Two constants. Must open file twice.] After we define the constants we create instances of the objects Scripting.Dictionary and Scripting.FileSystemObject, then use the following line of code to open the file C:\Scripts\Test.txt for reading:

Set objFile = objFSO.OpenTextFile("C:\Scripts\Test.txt", ForReading)

As soon as the file is open we use the ReadAll method to read the entire contents of the file into a variable named strContents, then call the Close method to (temporarily) close Test.txt. [We’ve got the file. Now what?]

Before we do anything else let’s take a look at the file we just opened (and, not coincidentally, the value of the variable strContents):

Apple    Apple    Apple
Banana   Cherry   Cherry
Cherry   Date     Fig
Fig      Lemon    Orange
Orange   Peach    Peach
Pear     Pear     Pear

As you can see, this is a simple three-column list of various fruits. As you can also see, many of these fruits appear more than once; for example, we have three listings for Apple and two for Peach. [One date. Story of our lives.] What we want to do now is eliminate all the duplicate listings. [Originals only. Duplicates need not apply.] In other words, we want a three-column list that looks like this:

Apple    Banana    Cherry
Date     Fig       Lemon
Orange   Peach     Pear

Good question: how are we going to do that? Well, for starters, we need a list of words that’s a bit easier to work with. In fact, what we really need is a one-column list of words rather than this fancy three-column format. [One column’s great; three’s a crowd.] In order to combine our collection of words into a single list the first thing we do is replace any instances of the carriage return-linefeed character (vbCrLf) with a tab character (vbTab):

strContents = Replace(strContents, vbCrLf, vbTab)

What this does is give us a single line of words, with each word in that line separated by a tab. In other words, something similar to this (except that this single line would continue to scroll on and on and on):

Apple    Apple    Apple    Banana  Cherry   Cherry

[One lone banana. Life isn’t fair.]

Granted, that might not seem like we’ve made much progress. Believe it or not, however, we have. [Holy smokes: Scripting Guys make progress!] Now that all our words are separated by tabs we can use the Split function to split strContents on the tab character:

arrContents = Split(strContents, vbTab)

Why would we want to do that? Because that gives us an array named arrContents, an array consisting of the following elements:

Apple 
Apple 
Apple
Banana
Cherry
Cherry
Cherry
Date  
Fig
Fig   
Lemon 
Orange
Orange
Peach 
Peach
Pear  
Pear  
Pear

Now that we have a nice little array like this one it’s actually quite easy to eliminate the duplicate items. To that end, we first set up a For Each loop designed to loop through all the items in the array arrContents. [For Each: everyone gets a turn!]:

For Each strItem in arrContents

Inside that loop we take the first item in the array, and use the Exists method to see if that item exists in the Dictionary [See? We never forget the Dictionary.]:

If Not objDictionary.Exists(strItem) Then

If the item does exist in the Dictionary that means this is a duplicate item; in that case, we simply go back to the top of the loop and repeat the process with the next item in the array. If the item doesn’t exist in the Dictionary then we use the following line of code to add item to the Dictionary, using the word itself as both the Dictionary key and item:

objDictionary.Add strItem, strItem

By the time we’re done our Dictionary should contain the following keys (and items):

Apple 
Banana
Cherry
Date  
Fig
Lemon 
Orange
Peach 
Pear

That’s pretty good, except now we need to turn this one-column list back into a three-column list. [Sometimes three is better than one.] To do that we first assign the value 1 to a counter variable named i:

i = 1

We then set up another For Each loop, this one designed to walk through all the keys in the Dictionary object [Can never have too many loops.]:

For Each strKey in objDictionary.Keys

Because we want a three-column list, the first thing we do inside the loop is check to see if the value of our counter variable i is less than 3. If it is, we execute this block of code:

strNewContents = strNewContents & strKey & vbTab
i = i + 1

As you can see, there’s nothing too complicated going on here. [Scripting Guys: Simple minds, simple scripts.] In line 1 we’re assigning a value to a variable named strNewContents; to be a little more specific, we’re assigning this variable the existing value of strContents plus the value of the Dictionary key plus a tab character. In line 2, we then increment the value of our counter variable by 1. What does all that mean? That means that, after three trips through the loop, strContents will be equal to this:

Apple    Banana  Cherry

Now, what’s going to happen the next time through the loop? Well, the next time through the loop the counter variable i will be equal to 3. That means we’re going to execute these two lines of code instead:

strNewContents = strNewContents & strKey & vbCrLf
i = 1

Notice the differences? In this block of code, we’re tacking a carriage return-linefeed character onto the end of the string instead of adding a tab character. On top of that, we also reset the value of i to 1. Those two things ensure that the next item in our list (the fourth item) will show up in the first column of row 2 rather than the fourth column of row 1.

As soon as we’ve reformatted strNewContents we reopen the file Test.txt, this time for writing:

Set objFile = objFSO.OpenTextFile("C:\Scripts\Test.txt", ForWriting)

We use the Write method to replace the existing contents of Test.txt with the value of strNewContents, then use the Close method to close the file one more time, this time for good. [File closed? Column must be over.]

That should do it, SS. (Shoot, that’s only five words. Double shoot: that’s only five words, too. Dang; and that last sentence was seven words!) Let us know if you have any questions. Also, let us know if you have any six-word descriptions for the Script Center and the Scripting Guys; we’ll publish those in a future column. For example, here’s one we came up with to describe this column:

OK …. But why no scripting information?

And here’s one that summarizes our beloved Scripting Editor:

Loves people. Had one for breakfast.

See you all tomorrow.

Wait; check that. How about this:

New column tomorrow; see you then.

That’s better.

Wait; check that. How about this:

That’s better; must use six words.

0 comments

Discussion is closed.

Feedback usabilla icon