Friday, February 11, 2011

Validate String against USPS State Abbreviations

I need to be able to validate a string against a list of the possible United States Postal Service state abbreviations, and Google is not offering me any direction.

I know the obvious solution is to code a horridly huge if (or switch) statement that compares the input against all 50 states, but there has to be an easier way, so I am asking StackOverflow. Is there a RegEx or an enumerator object that I could use to do this quickly and efficiently?

[C# and .net 3.5 by the way]

List of USPS State Abbreviations

  • Here's a regex. Enjoy!

    ^(?-i:A[LKSZRAEP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])$
    
    Michael Haren : Props on a cool regex but that is just plain ridiculous. I do not think regex is the way to go--this is extremely difficult to verify with the eye and any test you write to verify it works is likely to be worse than just implementing it a clearer way in the first place.
    Craig Trader : Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems. -- Jamie Zawinski.
    hughdbrown : Just a heads-up: this regex includes a number of abbreviations not commonly considered, like Puerto Rico, Northern Mariana Islands, Palau, and Marshall Islands.
    Craig Trader : The regex is matching the USPS state abbreviation list referenced in the question. I match the same list in my answer.
  • I'd populate a hashtable with the valid abbreviations and then check the input against it. It's much cleaner, and probably faster if you do more than one check per dictionary build.

    Craig : Voted up for clean and quick solution. Take some design time and make up for it at runtime!
    Michael Haren : Thanks for clarifying the specific generics, Jon.
  • A HashSet<string> is the cleanest way I can think of using the built-in types in .NET 3.5. (You could easily make it case-insensitive as well, or change it into a Dictionary<string, string> where the value is the full name. That would also be the most appropriate solution for .NET 2.0/3.0.)

    As for speed - do you really believe this will be a bottleneck in your code? A HashSet is likely to perform "pretty well" (many millions of lookups a second). I'm sure alternatives would be even faster - but dirtier. I'd stick to the simplest thing that works until you have reason to believe it'll be a bottleneck.

    (Edited to explicitly mention Dictionary<,>.)

    From Jon Skeet
  • I like something like this:

    private static String states = "|AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY|";
    
    public static bool isStateAbbreviation (String state)
    {
      return state.Length == 2 && states.IndexOf( state ) > 0;
    }
    

    This method has the advantage of using an optimized system routine that is probably using a single machine instruction to do the search. If I were dealing with non-fixed-length words, I'd check for "|" + state + "|" to ensure that I hadn't hit a substring instead of a full match. That would take a wee bit longer because of the string concatenation, but it would still match in a fixed amount of time. If you want to validate lowercase abbreviations as well as uppercase, either check state.ToUpper() or double the 'states' string to include the lowercase variants.

    I'll guarantee that this will beat the Regex or Hashtable lookups every time, no matter how many runs you make, and it will have the least memory usage.

    Matthew Ruston : What happens if the user manages to enter "L|" as their input? I imagine it would validate under this code. It could easily be fixed with an IndexOf("|") line.
    Craig Trader : Again, if you're worried about it, that's when you concatenate the delimiters around the search string. Thus you would be checking for "|L||", which would fail.
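
The delimiter-wrapped check described in the comments above might look roughly like this. This is only a sketch: the class name StateCheck, the null guard, and the choice of StringComparison.Ordinal are assumptions added for illustration, not part of the original answer. Note that once the input is wrapped, "|AL|" matches at index 0, so the comparison has to be >= 0 rather than > 0.

```csharp
using System;

public static class StateCheck   // hypothetical name, for illustration only
{
    // Same delimited list as in the answer above.
    private const string States =
        "|AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY|";

    public static bool IsStateAbbreviation(string state)
    {
        // Assumption: null or wrong-length input is simply invalid.
        if (state == null || state.Length != 2)
            return false;

        // Wrapping the input turns "L|" into "|L||", which never occurs
        // in the list, so a stray delimiter cannot produce a match.
        return States.IndexOf("|" + state + "|", StringComparison.Ordinal) >= 0;
    }
}
```

With this version, StateCheck.IsStateAbbreviation("L|") returns false, while "AL" and "WY" are still accepted.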

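For comparison, the HashSet<string> approach suggested in the answers above could be sketched as follows. The class and member names (UspsStates, IsValid) are made up for this example, and the abbreviations are simply copied from the delimited string in the last answer. The case-insensitive comparer is optional; it is included here only because the answer mentions it as an easy extension.

```csharp
using System;
using System.Collections.Generic;

public static class UspsStates   // hypothetical name, for illustration only
{
    // Built once; every later lookup is a single hash probe.
    private static readonly HashSet<string> Abbreviations = new HashSet<string>(
        ("AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|" +
         "MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|" +
         "VT|VI|VA|WA|WV|WI|WY").Split('|'),
        StringComparer.OrdinalIgnoreCase);   // assumption: "ny" should count as "NY"

    public static bool IsValid(string candidate)
    {
        // Assumption: null input is treated as invalid rather than throwing.
        return candidate != null && Abbreviations.Contains(candidate);
    }
}
```

If the full state name is also needed, the same data could go into a Dictionary<string, string> keyed by abbreviation, as the answer suggests.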