A little program to look for files with inconsistent line endings

Raymond Chen

I wrote this little program to look for files with inconsistent line endings. Maybe you’ll find it useful. Probably not, but I’m posting it anyway.

using System;
using System.Collections.Generic;
using System.IO;

class Program
{
    static IEnumerable<FileInfo> EnumerateFiles(string dir)
    {
        var info = new System.IO.DirectoryInfo(dir);
        foreach (var f in info.EnumerateFileSystemInfos(
                           "*.*", SearchOption.TopDirectoryOnly))
        {
            if (f.Attributes.HasFlag(FileAttributes.Hidden))
            {
                continue;
            }

            if (f.Attributes.HasFlag(FileAttributes.Directory))
            {
                switch (f.Name.ToLower())
                {
                    case "bin":
                    case "obj":
                        continue;
                }

                foreach (var inner in EnumerateFiles(f.FullName))
                {
                    yield return inner;
                }
            }
            else
            {
                yield return (FileInfo)f;
            }
        }
    }

    // Starting in the current directory, enumerate files
    // (see EnumerateFiles for criteria), and report what
    // type of line ending each file uses.

    static void Main()
    {
        foreach (var f in EnumerateFiles("."))
        {
            // Skip obvious binary files.
            switch (f.Extension.ToLower())
            {
                case ".png":
                case ".jpg":
                case ".gif":
                case ".wmv":
                    continue;
            }

            int line = 0; // total number of lines found
            int cr = 0;   // number of lines that end in CR
            int crlf = 0; // number of lines that end in CRLF
            int lf = 0;   // number of lines that end in LF

            var stream = new FileStream(
                     f.FullName, FileMode.Open, FileAccess.Read);
            using (var br = new BinaryReader(stream))
            {
                // Slurp the entire file into memory.
                var bytes = br.ReadBytes((int)f.Length);
                for (int i = 0; i < bytes.Length; i++)
                {
                    if (bytes[i] == '\r')
                    {
                        if (i + 1 < bytes.Length &&
                            bytes[i+1] == '\n')
                        {
                            line++;
                            crlf++;
                            i++;
                        }
                        else
                        {
                            line++;
                            cr++;
                        }
                    }
                    else if (bytes[i] == '\n')
                    {
                        lf++;
                        line++;
                    }
                }
            }

            if (cr == line)
            {
                Console.WriteLine("{0}, {1}", f.FullName, "CR");
            }
            else if (crlf == line)
            {
                Console.WriteLine("{0}, {1}", f.FullName, "CRLF");
            }
            else if (lf == line)
            {
                Console.WriteLine("{0}, {1}", f.FullName, "LF");
            }
            else
            {
                Console.WriteLine("{0}, {1}, {2}, {3}, {4}",
                              f.FullName, "Mixed", cr, lf, crlf);
            }
        }
    }
}

The EnumerateFiles method recursively enumerates the contents of the directory, but skips over hidden files, hidden directories, and directories with specific names.

The main program takes the files enumerated by EnumerateFiles, ignores certain known binary file types, and for the remaining files, counts the number of lines and how many of them use each line terminator (CR, LF, or CRLF).

If the file’s lines all end the same way, then that line terminator is reported with the file name. Otherwise, the file is reported as Mixed and the number of lines of each type is reported.

I use this little program when chasing down line terminator inconsistencies. Maybe that’s not something you have to deal with, in which case lucky you.

12 comments

  • Michael Liu

    You can just use File.ReadAllBytes instead of having to deal with FileStream and BinaryReader.
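    The suggestion above could look roughly like the following. This is only a sketch: `SlurpSketch` and `CountByte` are hypothetical names that do not appear in the original program, and the helper counts a single byte value just to show the read step in context.

```csharp
using System;
using System.IO;

static class SlurpSketch
{
    // Hypothetical helper (not in the original program): reads the whole
    // file with File.ReadAllBytes and counts occurrences of one byte value.
    public static int CountByte(string path, byte value)
    {
        // One call opens the file, reads every byte, and closes it again,
        // standing in for the FileStream + BinaryReader + ReadBytes sequence.
        byte[] bytes = File.ReadAllBytes(path);

        int count = 0;
        foreach (byte b in bytes)
        {
            if (b == value)
            {
                count++;
            }
        }
        return count;
    }
}
```

    Like the original, this holds the entire file in memory at once, so it suits the same kind of small source files.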

  • Henry Skoglund

    Hi, found a bug: files without any line terminators are reported as being CR terminated.
    Also, as @Michael Liu says above, you can simplify using ReadAllBytes. And using the old trustworthy Split method to simplify further could give something like this:

    string bytes = System.Text.Encoding.Default.GetString(System.IO.File.ReadAllBytes(f.FullName));
    int crlf = bytes.Split(new String[] { "\r\n" }, StringSplitOptions.None).Length - 1;
    int cr = bytes.Split('\r').Length - 1 - crlf;
    int lf = bytes.Split('\n').Length - 1 - crlf;
    if (0 == crlf + cr + lf)
        Console.WriteLine("{0}, {1}", f.FullName, "None");
    else if (0 == cr + lf)
        Console.WriteLine("{0}, {1}", f.FullName, "CRLF");
    else if (0 == crlf + lf)
        Console.WriteLine("{0}, {1}", f.FullName, "CR");
    else if (0 == crlf + cr)
        Console.WriteLine("{0}, {1}", f.FullName, "LF");
    else
        Console.WriteLine("{0}, {1}, {2}, {3}, {4}",
                             f.FullName, "Mixed", cr, lf, crlf);
    
    • Andrew Cook

      While stringifying everything might result in fewer lines of code, Raymond’s method only has a single string of residue per file, and only works on a single in-memory copy of the file, rather than creating several stringified copies only to immediately throw them away. Thankfully, the CLR doesn’t automatically intern all strings like some languages do, or else you’d have five copies of each file forever.

    • Raymond Chen

      I had no zero-line files, so that issue never arose. I like your Split trick, especially over-splitting and subtracting out the crlfs. Not particularly efficient, but efficiency wasn’t the goal here.

      • Henry Skoglund

        Thanks! And thank you for restoring the formatting of the code to a readable state.
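For readers who prefer to keep the single-pass byte scan, the zero-terminator case raised in this thread could be handled along these lines. This is a sketch under assumptions: the `LineEndings` class, the `Counts` struct, and the "None" label are illustrative names borrowed from the discussion, not part of the original program.

```csharp
using System;

static class LineEndings
{
    // Number of terminators of each kind found in a buffer.
    public struct Counts
    {
        public int Cr, Lf, CrLf;
        public int Total => Cr + Lf + CrLf;
    }

    // Single pass over the bytes, same scanning rules as the original loop.
    public static Counts Count(byte[] bytes)
    {
        var c = new Counts();
        for (int i = 0; i < bytes.Length; i++)
        {
            if (bytes[i] == '\r')
            {
                if (i + 1 < bytes.Length && bytes[i + 1] == '\n')
                {
                    c.CrLf++;
                    i++; // skip the LF half of the CRLF pair
                }
                else
                {
                    c.Cr++;
                }
            }
            else if (bytes[i] == '\n')
            {
                c.Lf++;
            }
        }
        return c;
    }

    // Checks the empty case first, so a file with no terminators
    // is no longer reported as CR.
    public static string Classify(Counts c)
    {
        if (c.Total == 0) return "None"; // assumed label, not in the original
        if (c.Cr == c.Total) return "CR";
        if (c.CrLf == c.Total) return "CRLF";
        if (c.Lf == c.Total) return "LF";
        return "Mixed";
    }
}
```

Separating the counting from the reporting also makes the logic easy to exercise on in-memory buffers without touching the disk.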

  • cheong00

    I wonder, shouldn’t a whitelist approach be used (i.e.: only visit file extensions you want to look for different line endings)?
    There are who knows how many file types on the disk which are not text files and will just produce unwanted noise.

    • Raymond Chen

      I would rather have false positives than false negatives. In my case, those file types were the only binary types in the directory. This was a quick-and-dirty tool, so I could just hard-code the binary types I needed to exclude.
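For comparison, the allowlist approach proposed in this thread might be sketched as below. The `ExtensionFilter` class, the `ShouldScan` method, and the listed extensions are all hypothetical examples. As the reply above points out, the trade-off is that any text file type missing from the list is silently skipped, which is a false negative.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class ExtensionFilter
{
    // Hypothetical allowlist; these extensions are examples only.
    // The comparer makes lookups case-insensitive, so ".CS" matches ".cs".
    static readonly HashSet<string> TextExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".cs", ".txt", ".xml", ".config"
        };

    // Returns true only for files whose extension is on the allowlist.
    public static bool ShouldScan(string path)
    {
        return TextExtensions.Contains(Path.GetExtension(path));
    }
}
```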

  • Paulo Morgado

    Although very convenient to look at, a switch statement that requires the creation of a string is not the best practice. Using Equals with a StringComparison would be better.
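    A minimal sketch of that suggestion, applied to the directory-name test, could look like this. `NameChecks` and `IsBuildOutputDirectory` are hypothetical names, not part of the original program.

```csharp
using System;

static class NameChecks
{
    // Mirrors the "bin"/"obj" test from EnumerateFiles, but compares
    // in place instead of allocating a lowercased copy of the name.
    public static bool IsBuildOutputDirectory(string name)
    {
        return string.Equals(name, "bin", StringComparison.OrdinalIgnoreCase)
            || string.Equals(name, "obj", StringComparison.OrdinalIgnoreCase);
    }
}
```

    An ordinal comparison also avoids depending on the current culture, which ToLower() does: under a Turkish culture, for example, "BIN".ToLower() does not produce "bin".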