A little program to look for files with inconsistent line endings

Raymond Chen

I wrote this little program to look for files with inconsistent line endings. Maybe you’ll find it useful. Probably not, but I’m posting it anyway.

using System;
using System.Collections.Generic;
using System.IO;

class Program
{
    static IEnumerable<FileInfo> EnumerateFiles(string dir)
    {
        var info = new System.IO.DirectoryInfo(dir);
        foreach (var f in info.EnumerateFileSystemInfos(
                           "*.*", SearchOption.TopDirectoryOnly))
        {
            if (f.Attributes.HasFlag(FileAttributes.Hidden))
            {
                continue;
            }

            if (f.Attributes.HasFlag(FileAttributes.Directory))
            {
                switch (f.Name.ToLower())
                {
                    case "bin":
                    case "obj":
                        continue;
                }

                foreach (var inner in EnumerateFiles(f.FullName))
                {
                    yield return inner;
                }
            }
            else
            {
                yield return (FileInfo)f;
            }
        }
    }

    // Starting in the current directory, enumerate files
    // (see EnumerateFiles for criteria), and report what
    // type of line ending each file uses.

    static void Main()
    {
        foreach (var f in EnumerateFiles("."))
        {
            // Skip obvious binary files.
            switch (f.Extension.ToLower())
            {
                case ".png":
                case ".jpg":
                case ".gif":
                case ".wmv":
                    continue;
            }

            int line = 0; // total number of lines found
            int cr = 0;   // number of lines that end in CR
            int crlf = 0; // number of lines that end in CRLF
            int lf = 0;   // number of lines that end in LF

            var stream = new FileStream(
                     f.FullName, FileMode.Open, FileAccess.Read);
            using (var br = new BinaryReader(stream))
            {
                // Slurp the entire file into memory.
                var bytes = br.ReadBytes((int)f.Length);
                for (int i = 0; i < bytes.Length; i++)
                {
                    if (bytes[i] == '\r')
                    {
                        if (i + 1 < bytes.Length &&
                            bytes[i+1] == '\n')
                        {
                            line++;
                            crlf++;
                            i++;
                        }
                        else
                        {
                            line++;
                            cr++;
                        }
                    }
                    else if (bytes[i] == '\n')
                    {
                        lf++;
                        line++;
                    }
                }
            }

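            // Note: a file with no line terminators has line == 0,
            // so cr == line holds and the file is reported as CR.
            if (cr == line)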
            {
                Console.WriteLine("{0}, {1}", f.FullName, "CR");
            }
            else if (crlf == line)
            {
                Console.WriteLine("{0}, {1}", f.FullName, "CRLF");
            }
            else if (lf == line)
            {
                Console.WriteLine("{0}, {1}", f.FullName, "LF");
            }
            else
            {
                Console.WriteLine("{0}, {1}, {2}, {3}, {4}",
                              f.FullName, "Mixed", cr, lf, crlf);
            }
        }
    }
}

The EnumerateFiles method recursively enumerates the contents of the directory, but skips over hidden files, hidden directories, and directories with specific names.

The main program takes the files enumerated by EnumerateFiles, ignores certain known binary file types, and for the remaining files, counts the number of lines and how many of them use any particular line terminator.

If the file’s lines all end the same way, then that line terminator is reported with the file name. Otherwise, the file is reported as Mixed and the number of lines of each type is reported.
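The classification rule described above can be pulled out into a standalone helper, which makes it easier to try on small inputs. This is only a sketch: the `LineEndings` class and `Classify` method are hypothetical names that do not appear in the original program, and the extra "None" result for input with no line terminators is an assumption borrowed from the comments below rather than something the original code does.

```csharp
using System;

static class LineEndings
{
    // Hypothetical helper, not part of the original program.
    // Returns "CR", "CRLF", "LF", "None", or "Mixed, <cr>, <lf>, <crlf>".
    public static string Classify(byte[] bytes)
    {
        int cr = 0;   // lines ending in a lone CR
        int crlf = 0; // lines ending in CRLF
        int lf = 0;   // lines ending in a lone LF

        for (int i = 0; i < bytes.Length; i++)
        {
            if (bytes[i] == '\r')
            {
                if (i + 1 < bytes.Length && bytes[i + 1] == '\n')
                {
                    crlf++;
                    i++; // consume the LF half of the CRLF pair
                }
                else
                {
                    cr++;
                }
            }
            else if (bytes[i] == '\n')
            {
                lf++;
            }
        }

        int line = cr + crlf + lf;
        if (line == 0) return "None"; // assumption: report empty case separately
        if (cr == line) return "CR";
        if (crlf == line) return "CRLF";
        if (lf == line) return "LF";
        return string.Format("Mixed, {0}, {1}, {2}", cr, lf, crlf);
    }
}
```

One possible way to use it is `LineEndings.Classify(File.ReadAllBytes(f.FullName))`, which also avoids the FileStream and BinaryReader plumbing.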

I use this little program when chasing down line terminator inconsistencies. Maybe that’s not something you have to deal with, in which case lucky you.

12 comments

  • Paulo Morgado 0

    Although very convenient to look at, a switch statement that requires the creation of a string is not the best practice. Using Equals with a StringComparison would be better.
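
    The suggestion above might look roughly like the following. It is a sketch only: `NameFilters` and `IsSkippedDirectory` are hypothetical names, and the check covers just the two directory names the original switch statement tests.

```csharp
using System;

static class NameFilters
{
    // Hypothetical helper: case-insensitive comparison without
    // allocating a lowercased copy of the name.
    public static bool IsSkippedDirectory(string name)
    {
        return string.Equals(name, "bin", StringComparison.OrdinalIgnoreCase)
            || string.Equals(name, "obj", StringComparison.OrdinalIgnoreCase);
    }
}
```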

  • Alex Cohn 0

    same can be achieved with good old file command, can’t it?

    • Raymond Chen (Microsoft employee, Author) 0

      'file' is not recognized as an internal or external command, operable program or batch file.

      • Alex Cohn 0

        Runs perfectly from WSL bash.

  • Ji Luo 0

    As an old school programmer I would use a DFA to handle the stream…
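
    One way to read the DFA idea is a two-state machine that remembers whether the previous byte was a CR, so the file can be processed as a stream instead of being loaded whole. The sketch below is an assumption about what such an approach could look like, not the commenter's code; `StreamingCounter` and `Count` are hypothetical names.

```csharp
using System.IO;

static class StreamingCounter
{
    // Hypothetical helper: counts line terminators one byte at a time.
    // State: pendingCR is true when the previous byte was a CR that
    // has not yet been classified as lone CR or as part of CRLF.
    public static void Count(Stream s, out int cr, out int lf, out int crlf)
    {
        cr = lf = crlf = 0;
        bool pendingCR = false;
        int b;
        while ((b = s.ReadByte()) != -1)
        {
            if (pendingCR)
            {
                pendingCR = false;
                if (b == '\n')
                {
                    crlf++;   // CR followed by LF
                    continue;
                }
                cr++;         // CR followed by anything else
            }

            if (b == '\r')
            {
                pendingCR = true;
            }
            else if (b == '\n')
            {
                lf++;
            }
        }

        if (pendingCR)
        {
            cr++; // stream ended right after a CR
        }
    }
}
```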

  • cheong00 0

    I wonder, shouldn’t a whitelist approach be used (i.e.: only visit file extensions you want to look for different line endings)?
    There are who knows how many file types on the disk which are not text files and will just produce unwanted noise.

    • Raymond Chen (Microsoft employee, Author) 0

      I would rather have false positives than false negatives. In my case, those file types were the only binary types in the directory. This was a quick-and-dirty tool, so I could just hard-code the binary types I needed to exclude.

  • Henry Skoglund 0

    Hi, found a bug, files without any line terminators are reported as being CR terminated.
    Also, as @Michael Liu says above, you can simplify using ReadAllBytes. And using the old trustworthy Split method to simplify further could give something like this:

    // Count() on the arrays returned by Split needs "using System.Linq;".
    string bytes = System.Text.Encoding.Default.GetString(System.IO.File.ReadAllBytes(f.FullName));
    int crlf = bytes.Split(new String[] { "\r\n" }, StringSplitOptions.None).Count() - 1;
    int cr = bytes.Split('\r').Count() - 1 - crlf;
    int lf = bytes.Split('\n').Count() - 1 - crlf;
    if (0 == crlf + cr + lf)
        Console.WriteLine("{0}, {1}", f.FullName, "None");
    else if (0 == cr + lf)
        Console.WriteLine("{0}, {1}", f.FullName, "CRLF");
    else if (0 == crlf + lf)
        Console.WriteLine("{0}, {1}", f.FullName, "CR");
    else if (0 == crlf + cr)
        Console.WriteLine("{0}, {1}", f.FullName, "LF");
    else
        Console.WriteLine("{0}, {1}, {2}, {3}, {4}",
                             f.FullName, "Mixed", cr, lf, crlf);
    
    • Raymond Chen (Microsoft employee, Author) 0

      I had no zero-line files, so that issue never arose. I like your Split trick, especially over-splitting and subtracting out the crlfs. Not particularly efficient, but efficiency wasn’t the goal here.

      • Henry Skoglund 0

        Thanks! And thank you for restoring the formatting of the code to a readable state.

    • Andrew Cook 0

      While stringifying everything might result in fewer lines of code, Raymond’s method only has a single string of residue per file, and only works on a single in-memory copy of the file, rather than creating several stringified copies only to immediately throw them away. Thankfully, the CLR doesn’t automatically intern all strings like some languages do, or else you’d have five copies of each file forever.

  • Michael Liu 0

    You can just use File.ReadAllBytes instead of having to deal with FileStream and BinaryReader.
