Thursday, May 5, 2011

I need a regex to get the src attribute of an img tag

Hi,

I have the following literal string:

"lt;img src=quot;http://www.news.gov.tt/thumbnail.php?file=Hon__Jerry_Narace_Minister__Of_Health_599152837.jpgamp;size=summary_mediumquot;gt;lt;pgt;Fifty-eight people have been tested for Influenza A/H1N1 virus, commonly called swine flu, in Trinidad and Tobago. \r\nThe tests have all come back negative, Health Minister Jerry Narace said yesterday. \r\n\r\n"

I would like to get the URL between the 'quot;' strings, i.e.,

http://www.news.gov.tt/thumbnail.php?file=Hon__Jerry_Narace_Minister__Of_Health_599152837.jpgamp;size=summary_medium

using a regex in .NET.

Any ideas?

From stackoverflow
  • Regex r = new Regex("(?<=img src=&quot;).*?(?=&quot;)");
    

    Should do the trick for you, assuming there aren't any ampersands hiding out there somewhere.

    EDIT: After posting this answer, I noticed that the ampersands I had seen in your string were no longer present.

  • This regex should sort you out to grab the src content of just the IMG tags:

    (?<=<img.*?src=\&quot;)[^\"]*(?=\&quot;.*?((&frasl;&gt;)|(&gt;.*&lt;&frasl;img&gt;)))
    

    It doesn't rely on the position of src within the tag, but it does require case-insensitive matching to be stable.

    Patjbs's version will grab the src of all tags, which will cause instability if you're parsing HTML that links in external content, such as JavaScript, external div content, etc.

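    // Requires: using System.Text.RegularExpressions;
    // Inside a verbatim (@"...") string, a literal double quote is written as "".
    string htmlString = @"<img id=""tagId"" src=""myTagSource.gif"" name=""imageName"" />";
    string matchString = Regex.Match(htmlString, @"(?<=<img.*?src="")[^""]*(?="".*?((/>)|(>.*</img)))", RegexOptions.IgnoreCase).Value;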
    

    matchString now equals "myTagSource.gif"

    I notice that your input string is missing the & (ampersand) that denotes escape sequences such as quot;. Without forcing the logic to look for quot;, lt; and gt;, there is no way to interpret those characters programmatically. You would have to do a replace on the initial string to convert it to a string the regex can interpret.

    So let's say you grab all these strings out of the page: you'd need to assume that every lt; becomes <, every gt; becomes >, and every quot; becomes ".

    You also cannot assume that the data will always come back in this form; sometimes the string may contain other tag information (id, name, border info, etc.). So I think the most ideological and the most maintainable solutions diverge slightly here. The most ideological way would be to do it in one pass, but the most maintenance-friendly may be to do it in two steps: first convert the input string to a standard HTML string, then extract the source data.

    Alternatively, you could do it in one pass, replacing the HTML constructs in my pattern with the corresponding character replacements (assuming they're using standard encoding but dropping the ampersand), although it's not quite as readable and is likely to confuse anyone maintaining the code:

    (?<=lt;img.*?src=quot;).*?(?=quot;.*?((frasl;gt;)|(gt;.*lt;frasl;imggt;)))

    Edit: If it turns out that they are using standard encoding and you just haven't provided the & in your example, then you can just sub in the first pattern I presented and run it against the HTML-decoded string:

    string MatchValue = Regex.Match(HttpUtility.HtmlDecode(inputString), pattern).Value;
    

    This decodes the string you get back from them into a standard string, replacing the escaped characters with the correct characters, and then runs the same pattern.

    Bruce : My string actually has '&quot;' surrounding the img url, so this won't work...
    BenAlabaster : @Bruce If they *are* using &quot; &amp; &gt; &lt; then you can use HttpUtility to decode to the XML-readable string and use the same regex pattern as I presented in my edited version.
    BenAlabaster : @Bruce, does the & [ampersand] actually appear before the escape chars? i.e. &amp; &lt; &gt; etc?
    Bruce : @balabaster, I take your point about the &quot; etc. But unfortunately the string is written without the ampersand.
  • Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

  • ^\"lt;img\s+src\=quot;(.+)quot;
    

    Given the following input:

    "lt;img src=quot;http://www.news.gov.tt/thumbnail.php?file=Hon__Jerry_Narace_Minister__Of_Health_599152837.jpgamp;size=summary_mediumquot;gt;lt;pgt;Fifty-eight people have been tested for Influenza A/H1N1 virus, commonly called swine flu, in Trinidad and Tobago. \r\nThe tests have all come back negative, Health Minister Jerry Narace said yesterday. \r\n\r\n"
    

    the first capture group of this regex returns the following:

    http://www.news.gov.tt/thumbnail.php?file=Hon__Jerry_Narace_Minister__Of_Health_599152837.jpgamp;size=summary_medium
    

    which I believe is exactly what you required.

    Hope this helps, Ryan
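
The two approaches discussed in the answers above (normalize the string first and then extract, or match the ampersand-less markers directly) might be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, and it assumes that "lt;", "gt;", "quot;" and "amp;" never occur as ordinary text in the input.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any answer above.
public static class ImgSrcExtractor
{
    // Step one of the two-step approach: turn the ampersand-less
    // markers back into the characters they stand for.
    // Assumption: these four markers never appear as normal text.
    public static string RestoreMarkup(string raw)
    {
        return raw.Replace("lt;", "<")
                  .Replace("gt;", ">")
                  .Replace("quot;", "\"")
                  .Replace("amp;", "&");
    }

    // Step two: capture the src value of the first img tag,
    // wherever src sits among the tag's attributes.
    // Returns an empty string when nothing matches.
    public static string ExtractSrc(string raw)
    {
        string html = RestoreMarkup(raw);
        Match m = Regex.Match(
            html,
            "<img\\b[^>]*?\\bsrc\\s*=\\s*\"([^\"]*)\"",
            RegexOptions.IgnoreCase);
        return m.Groups[1].Value;
    }

    // One-step alternative: match the raw markers directly.
    // Only handles src as the first attribute after "img".
    public static string ExtractSrcDirect(string raw)
    {
        return Regex.Match(
            raw,
            "(?<=lt;img\\s+src=quot;).*?(?=quot;)",
            RegexOptions.IgnoreCase).Value;
    }
}
```

Note that the two-step version returns the URL with a real "&" in place of "amp;", while the direct version returns the text between the 'quot;' markers unchanged, which is what the question literally asked for.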
