April 22nd, 2019

A little program to look for files with inconsistent line endings

I wrote this little program to look for files with inconsistent line endings. Maybe you’ll find it useful. Probably not, but I’m posting it anyway.

using System;
using System.Collections.Generic;
using System.IO;

class Program
{
    static IEnumerable<FileInfo> EnumerateFiles(string dir)
    {
        var info = new System.IO.DirectoryInfo(dir);
        foreach (var f in info.EnumerateFileSystemInfos(
                           "*.*", SearchOption.TopDirectoryOnly))
        {
            if (f.Attributes.HasFlag(FileAttributes.Hidden))
            {
                continue;
            }

            if (f.Attributes.HasFlag(FileAttributes.Directory))
            {
                switch (f.Name.ToLower())
                {
                    case "bin":
                    case "obj":
                        continue;
                }

                foreach (var inner in EnumerateFiles(f.FullName))
                {
                    yield return inner;
                }
            }
            else
            {
                yield return (FileInfo)f;
            }
        }
    }

    // Starting in the current directory, enumerate files
    // (see EnumerateFiles for criteria), and report what
    // type of line ending each file uses.

    static void Main()
    {
        foreach (var f in EnumerateFiles("."))
        {
            // Skip obvious binary files.
            switch (f.Extension.ToLower())
            {
                case ".png":
                case ".jpg":
                case ".gif":
                case ".wmv":
                    continue;
            }

            int line = 0; // total number of lines found
            int cr = 0;   // number of lines that end in CR
            int crlf = 0; // number of lines that end in CRLF
            int lf = 0;   // number of lines that end in LF

            var stream = new FileStream(
                     f.FullName, FileMode.Open, FileAccess.Read);
            using (var br = new BinaryReader(stream))
            {
                // Slurp the entire file into memory.
                var bytes = br.ReadBytes((int)f.Length);
                for (int i = 0; i < bytes.Length; i++)
                {
                    if (bytes[i] == '\r')
                    {
                        if (i + 1 < bytes.Length &&
                            bytes[i+1] == '\n')
                        {
                            line++;
                            crlf++;
                            i++;
                        }
                        else
                        {
                            line++;
                            cr++;
                        }
                    }
                    else if (bytes[i] == '\n')
                    {
                        lf++;
                        line++;
                    }
                }
            }

            if (cr == line)
            {
                Console.WriteLine("{0}, {1}", f.FullName, "CR");
            }
            else if (crlf == line)
            {
                Console.WriteLine("{0}, {1}", f.FullName, "CRLF");
            }
            else if (lf == line)
            {
                Console.WriteLine("{0}, {1}", f.FullName, "LF");
            }
            else
            {
                Console.WriteLine("{0}, {1}, {2}, {3}, {4}",
                              f.FullName, "Mixed", cr, lf, crlf);
            }
        }
    }
}

The EnumerateFiles method recursively enumerates the contents of the directory, but skips over hidden files, hidden directories, and directories with specific names.
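The directory-name check could also be written without lowercasing each name. The following is only a sketch of that alternative, not what the program above does: the class name `DirectoryFilter` and the method `ShouldSkip` are made up for illustration, and the set holds the same two names as the switch statement.

```csharp
using System;
using System.Collections.Generic;

static class DirectoryFilter
{
    // Directory names to skip. The comparer makes lookups
    // case-insensitive without creating a lowercased copy
    // of each name. (Hypothetical helper, not in the original.)
    static readonly HashSet<string> Skipped =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "bin",
            "obj",
        };

    public static bool ShouldSkip(string directoryName)
    {
        return Skipped.Contains(directoryName);
    }
}
```

Inside EnumerateFiles, the switch statement would then become `if (DirectoryFilter.ShouldSkip(f.Name)) continue;`. An ordinal comparison also avoids the culture-sensitive casing rules that ToLower applies.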

The main program takes the files enumerated by EnumerateFiles, ignores certain known binary file types, and for the remaining files, counts the number of lines and how many of them use any particular line terminator.

If the file’s lines all end the same way, then that line terminator is reported with the file name. Otherwise, the file is reported as Mixed and the number of lines of each type is reported.
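One edge case is worth spelling out. If a file contains no line terminators at all, every counter is zero, so the first test (`cr == line`) succeeds and the file is reported as CR. Here is a minimal sketch of the same reporting rule with that case separated out. The helper name `Classify` and the label `None` are assumptions made for illustration; they are not part of the program above.

```csharp
using System;

static class LineEndingReport
{
    // Hypothetical helper: turns the three counters into the
    // label the program prints. "None" is an added case for
    // files that contain no line terminators at all.
    public static string Classify(int cr, int lf, int crlf)
    {
        int line = cr + lf + crlf;
        if (line == 0) return "None";
        if (cr == line) return "CR";
        if (crlf == line) return "CRLF";
        if (lf == line) return "LF";
        return "Mixed";
    }
}
```

For example, `LineEndingReport.Classify(0, 12, 3)` returns "Mixed", and the caller can still print the individual counts alongside it.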

I use this little program when chasing down line terminator inconsistencies. Maybe that’s not something you have to deal with, in which case lucky you.

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

12 comments

  • Paulo Morgado

    Although very convenient to look at, a switch statement that requires the creation of a string is not the best practice. Using Equals with a StringComparison would be better.

  • Alex Cohn

    The same can be achieved with the good old file command, can’t it?

    • Raymond Chen (Microsoft employee, Author)

      'file' is not recognized as an internal or external command, operable program or batch file.

      • Alex Cohn

        Runs perfectly from WSL bash.

  • Ji Luo

    As an old school programmer I would use a DFA to handle the stream…

  • cheong00

    I wonder, shouldn’t a whitelist approach be used (i.e., only visit the file extensions you want to check for inconsistent line endings)?
    There are who knows how many file types on the disk that are not text files and will just produce unwanted noise.

    • Raymond Chen (Microsoft employee, Author)

      I would rather have false positives than false negatives. In my case, those file types were the only binary types in the directory. This was a quick-and-dirty tool, so I could just hard-code the binary types I needed to exclude.

  • Henry Skoglund

    Hi, I found a bug: files without any line terminators are reported as being CR terminated.
    Also, as @Michael Liu says above, you can simplify by using ReadAllBytes. Using the old trustworthy Split method to simplify further could give something like this:
    <code>

    • Raymond Chen (Microsoft employee, Author)

      I had no zero-line files, so that issue never arose. I like your Split trick, especially over-splitting and subtracting out the crlfs. Not particularly efficient, but efficiency wasn’t the goal here.

      • Henry Skoglund

        Thanks! And thank you for restoring the formatting of the code to a readable state.

    • Andrew Cook

      While stringifying everything might result in fewer lines of code, Raymond's method only has a single string of residue per file, and only works on a single in-memory copy of the file, rather than creating several stringified copies only to immediately throw them away. Thankfully, the CLR doesn't automatically intern all strings like some languages do, or else you'd have five copies of each file forever.

  • Michael Liu

    You can just use File.ReadAllBytes instead of having to deal with FileStream and BinaryReader.
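
The state-machine idea raised in the comments could look roughly like the sketch below. It is one possible reading of that suggestion, not anyone's actual code: the names `LineEndingScanner` and `Count` are invented here, and the method uses C# 7 tuple syntax. It reads one byte at a time and keeps a single bit of state, so the file never has to be held in memory all at once.

```csharp
using System.IO;

static class LineEndingScanner
{
    // Two states: normal, and "saw CR" (the previous byte was
    // '\r' and we do not yet know whether it starts a CRLF pair).
    public static (int cr, int lf, int crlf) Count(Stream input)
    {
        int cr = 0, lf = 0, crlf = 0;
        bool sawCR = false;
        int b;
        while ((b = input.ReadByte()) != -1)
        {
            if (sawCR)
            {
                sawCR = false;
                if (b == '\n')
                {
                    crlf++;     // the pending CR was half of a CRLF
                    continue;
                }
                cr++;           // the pending CR stood alone
            }

            if (b == '\r')
            {
                sawCR = true;   // defer the decision to the next byte
            }
            else if (b == '\n')
            {
                lf++;
            }
        }

        if (sawCR)
        {
            cr++;               // CR was the final byte of the input
        }
        return (cr, lf, crlf);
    }
}
```

The method accepts any Stream, so it works with a FileStream opened on the file, or with `new MemoryStream(File.ReadAllBytes(path))` if reading the whole file up front is acceptable.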