How to Filter Long-Running IEnumerables for Distinct Objects

by Larry Spencer Sunday, December 30, 2012 12:25 PM

In the last post, I presented a class that enumerates the files that are in a directory and then continues to notify its consumers when additional files are created. Because some programs fire multiple file-created events when they create files, my class can yield duplicate file names.

One way to eliminate them would be LINQ's Distinct() method. As I said two posts ago, that would work quite well in this situation.

But what if file ABC.TXT arrives today, and then another ABC.TXT arrives next week? Those are probably distinct files, even though they have the same name. If the file watcher is part of a long-running server application, this is a very real possibility.
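To make the problem concrete, here is a minimal sketch. The Arrivals array is made-up sample data and the class name is only illustrative; the last ABC.TXT stands in for the file that shows up next week.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DistinctForeverDemo
{
    // Hypothetical sample data: the final "ABC.TXT" represents a new file
    // that happens to reuse a name seen earlier.
    public static readonly string[] Arrivals =
    {
        "ABC.TXT", "ABC.TXT", "XYZ.TXT", "ABC.TXT"
    };

    public static List<string> Filter()
    {
        // Distinct() remembers every value it has yielded for as long as the
        // enumeration runs, so a repeated name is never yielded again.
        return Arrivals.Distinct().ToList();
    }
}
```

Filter() yields ABC.TXT only once, no matter how much time separates the arrivals, and the set of remembered names only ever grows.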

The solution I'll propose here is a class called TimedDistinctCollection. (It's a class, not an extension method.) It's pretty simple. We use the MemoryCache class, new with .NET 4, to remember which files we have seen before. We insert file names into the cache with whatever sliding expiration time is appropriate to our situation.
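Before looking at the full class, the core idea can be sketched in isolation. This is only an illustration, assuming a reference to System.Runtime.Caching; RecentlySeen and IsNew are hypothetical names that do not appear in the class below.

```csharp
using System;
using System.Runtime.Caching;

public static class RecentlySeen
{
    // Returns true the first time a key is offered, and false while the
    // cache still remembers it. Names here are illustrative only.
    public static bool IsNew(MemoryCache cache, string key, TimeSpan recallTime)
    {
        if (cache.Contains(key))
        {
            return false;
        }

        // With a sliding expiration, the entry is evicted once it has gone
        // unaccessed for recallTime.
        var policy = new CacheItemPolicy { SlidingExpiration = recallTime };
        cache.Add(key, key, policy);
        return true;
    }
}
```

Note that the check and the add are two separate calls, so this sketch assumes a single consumer at a time.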

And while we're at it, why not make the TimedDistinctCollection suitable for any type of object, not just files? All we have to do is tell the class how to make a cache key out of an item. The third parameter to the constructor is a Func that serves this purpose. For file names, we can use the name itself as the key.
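As a rough illustration of what such key makers might look like, here is a sketch. The Order type and the KeyMakers class are hypothetical, invented for this example. Because MemoryCache compares keys case-sensitively, a case-folding variant is included for situations where names that differ only in case should count as the same file.

```csharp
using System;
using System.Globalization;

// Hypothetical item type, used only to show a non-file key.
public sealed class Order
{
    public int Id { get; set; }
    public string Customer { get; set; }
}

public static class KeyMakers
{
    // For file names, the name itself is the key.
    public static readonly Func<string, string> FileName =
        name => name;

    // Optional variant: fold case so "abc.txt" and "ABC.TXT" share a key.
    public static readonly Func<string, string> FileNameIgnoringCase =
        name => name.ToUpperInvariant();

    // For other types, build the key from whatever identifies the item.
    public static readonly Func<Order, string> OrderById =
        order => order.Id.ToString(CultureInfo.InvariantCulture);
}
```

Any of these can be passed as the third constructor argument, as long as its input type matches the collection's item type.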

So, without further ado, here is the class and an example that combines it with the CreatedFileCollection.

TimedDistinctCollection.cs

 

using System;
using System.Collections;
using System.Collections.Generic;
using System.Diagnostics.Contracts;
using System.Runtime.Caching;

namespace Fws.Collections
{
    /// <summary>
    /// Filters an IEnumerable as LINQ's Distinct() extension method does, but
    /// maintains only a time-limited memory of previous items. This makes it
    /// suitable for long-running (server) operations where LINQ's Distinct()
    /// would gradually consume more and more memory.
    /// </summary>
    /// <typeparam name="T">The type of item being enumerated.</typeparam>
    public class TimedDistinctCollection<T> : IEnumerable<T>
    {
        readonly IEnumerable<T> _source;
        readonly CacheItemPolicy _cachePolicy;
        readonly Func<T, string> _keyMaker;

        /// <summary>
        /// Constructor.
        /// </summary>
        /// <param name="source">The IEnumerable whose distinct items we want.</param>
        /// <param name="recallTime">How long this class will remember previous items.</param>
        /// <param name="keyMaker">Makes a cache key out of an item.</param>
        public TimedDistinctCollection(IEnumerable<T> source, TimeSpan recallTime, Func<T, string> keyMaker)
        {
            Contract.Requires(source != null);
            Contract.Requires(keyMaker != null);

            _source = source;
            _keyMaker = keyMaker;
            _cachePolicy = new CacheItemPolicy() { SlidingExpiration = recallTime };
        }

        /// <summary>
        /// Yield the distinct items in the wrapped IEnumerable.
        /// </summary>
        /// <returns>An IEnumerator that may be used in a foreach loop.</returns>
        public IEnumerator<T> GetEnumerator()
        {
            // MemoryCache is IDisposable; release it when enumeration ends.
            using (var cache = new MemoryCache(this.GetType().ToString()))
            {
                foreach (var item in _source)
                {
                    string key = _keyMaker(item);
                    if (!cache.Contains(key))
                    {
                        cache.Add(key, item, _cachePolicy);
                        yield return item;
                    }
                }
            }
        }

        /// <summary>
        /// Required method for IEnumerable.
        /// </summary>
        /// <returns>The generic enumerator, but as a non-generic version.</returns>
        IEnumerator IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }
    }
}

 

An Example

 

IEnumerable<string> distinctFiles = new TimedDistinctCollection<string>(
    // The wrapped, IEnumerable-derived collection of files.
    new CreatedFileCollection(cts.Token, sourceDir, "*.txt"),
    // How long we want to remember a file for the purpose of detecting duplicates.
    TimeSpan.FromMinutes(5),
    // The file name itself can serve as the cache key
    fileName => fileName);

 

The Next Step

Over the next two posts, we'll extend the IEnumerable<T> chain one step further, with classes that determine whether the created, distinct files are actually ready to be processed.


About the Author

Larry Spencer

Larry Spencer develops software with the Microsoft .NET Framework for ScerIS, a document-management company in Sudbury, MA.