Friday, April 29, 2011

ensure two char arrays are not the same

I am randomly generating a grid of characters and storing it in a char[,] array ...

I need a way to ensure that i haven't already generated a grid before serializing it to a database in binary format...what is the best way to compare two grids based on bytes? The last thing i want to do is loop through their contents as I am already pulling one of them from the db in byte form.

I was thinking checksum but not so sure if this would work.

char[,] grid = new char[8,8];
char[,] secondgrid = new char[8,8];//gets its data from db
From stackoverflow
  • From what I can see, you are going to have to loop over the contents (or at least, a portion of it); there is no other way of talking about an arrays contents.

    Well, as a fast "definitely not the same" you could compute a hash over the array - i.e. something like:

        int hash = 7;
        foreach (char c in data) {
            hash = (hash * 17) + c.GetHashCode();
        }
    

    This has the risk of some false positives (reporting a dup when it is unique), but is otherwise quite cheap. Any use? You could store the hash alongside the data in the database to allow fast checks - but if you do that you should pick your own hash algorithm for char (since it isn't guaranteed to stay the same) - perhaps just convert to an int, for example - or to re-use the existing implementation:

        int hash = 7;
        foreach (char c in data) {
            hash = (hash * 17) + (c | (c << 0x10));
        }
    


    As an aside - for 8x8, you could always just think in terms of a 64 character string, and just check ==. This would work equally well at the database and application.

  • I'd go with a checksum/hash mechanism to catch a large percentage of the matches, then do a full comparison if you get a match.

    What is the range of characters used to fill in your grid? If you're using just letters (not mixed case, or case not important), and an 8x8 grid, you're only talking about 7 or so possible collisions per item within your problem space (a very rare occurence) assuming a good hashing function. You could do something like:

    1. Generate Grid
    2. Load any matching grids from DB
    3. if found match from #2, goto 1
    4. Use your new grid.
  • Can't you get the database to do it? Make the grid column UNIQUE. Then, if you need to detect that you've generated a duplicate grid, the method for doing this might involve checking the number of rows affected by your operation, or perhaps testing for errors.

    Also, if each byte is simply picked at random from [0, 255], then performing a hash to get a 4-byte number is no better than taking the first four bytes out of the grid. The chance of collisions is the same.

    Marc Gravell : Since this is a char[] (not byte[]), you'd only have time for 2 characters... using a hash algorithm will make better use if the used/unused code-point ranges, and will (in typical use) give a better collision rate than just taking the first two chars.
    Artelius : Well, my unknowledge of C# shines through. But my first point still stands.
  • Try this (invoke ComputeHash for every matrix and compare the guids):

    private static MD5 md5 = MD5.Create();
    public static Guid ComputeHash(object value)
    {
        Guid g = Guid.Empty;
        BinaryFormatter bf = new BinaryFormatter();
        using (MemoryStream stm = new MemoryStream())
        {
            bf.Serialize(stm, value);
            g = new Guid(md5.ComputeHash(stm.ToArray()));
            stm.Close();
        }
        return g;
    }
    

    note: Generating the byte array might be accomplished a lot simpler since you have a char array.

0 comments:

Post a Comment