Thread: img src preg_match_all regex problem

  1. #1
    learning_brain
    Join Date
    Apr 2010
    Location
    UK, Midlands
    Posts
    170

    img src preg_match_all regex problem

    I have searched around everywhere for the right way of doing this.....

    I have an image search engine and as part of it, I have a page that can extract and store img sources.

    The problem is that the regex I'm using is not always reliable.

    1) issues with relative paths (the match doesn't include the complete path)
    2) there is also an issue with links such as "LBPC-Style/site_icons/profile.png".

    PHP Code:
    <?php
    //define url to search
    $url = $_POST['url'];
    //get contents
    $contents = file_get_contents($url);

    //set matching pattern for img tag source
    $pattern = '/src=[\"\']?([^\"\']?.*(png|jpg|gif))[\"\']?/i';

    //match all img tag source
    preg_match_all($pattern, $contents, $images);

    //count number of items in array
    $imageCount = count($images[1]);

    //loop through each item
    for ($i=0; $i<$imageCount; $i++){

        echo "<br/>".$images[1][$i];

        echo '<img src="'.$images[1][$i].'" width="100"/>';

        $insertSQL = sprintf("INSERT INTO IMAGES (URL, KEYWORDS) VALUES (%s, %s)",
                           GetSQLValueString($images[1][$i], "text"),
                           GetSQLValueString($images[1][$i], "text"));

        $Result1 = mysql_query($insertSQL, $freewebhost) or die(mysql_error());
    }
    ?>
    Images are always a problem because of the way the tag can be written.

    i.e. <img src="http://www.mysite.com/image.png"/> would be ideal but...

    <img alt="description" src="image.png" width="100"/> is not.... see what I mean?

    Is there a better way of doing this?

  2. #2
    descalzo
    Join Date
    Jul 2009
    Location
    Ankh-Morpork
    Posts
    7,636

    Re: img src preg_match_all regex problem

    1. Use regexps to get the domain and the directory path from $url, i.e. if $url is 'http://www.example.com/stuff/page.html' you want
    $domain = 'http://www.example.com' and
    $directory = 'http://www.example.com/stuff/'

    2. Loop through your link matches and test:
    a. Against '/^https?:\/\//' ... if it matches, it is the format you want.
    b. Then against '/^\//' ... if it matches, the link is of the format '/dirpath/moon.jpg' ... concatenate with $domain to get your full url
    c. The rest should be of the format 'sun.gif' or 'dirpath/img/relative.png'. Concatenating those with $directory should work (not tested).
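Step 1 can also be done without a regexp. The following is a minimal sketch, assuming the page URL is absolute; `baseParts()` is a hypothetical helper name, not something from the thread, and it relies only on PHP's built-in `parse_url()`.

```php
<?php
// Hypothetical helper: split a page URL into the $domain and $directory
// prefixes described in step 1.
function baseParts(string $url): array
{
    $p = parse_url($url);
    if ($p === false || !isset($p['scheme'], $p['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }
    $domain = $p['scheme'] . '://' . $p['host'];
    if (isset($p['port'])) {
        $domain .= ':' . $p['port'];
    }
    // A URL with no path at all is treated as the site root.
    $path = $p['path'] ?? '/';
    // Drop the trailing file name, keeping everything up to the last slash.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return ['domain' => $domain, 'directory' => $domain . $dir];
}

$parts = baseParts('http://www.example.com/stuff/page.html');
echo $parts['domain'], "\n", $parts['directory'], "\n";
```

Query strings and fragments in the page URL are deliberately ignored here, since they play no part in resolving a relative image path.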

    I would guess that PHP should have a library that would parse an HTML page and pull out normalized links, but I am not sure. That would be the cleanest way.
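PHP's DOM extension can do the parsing half of this, although it does not normalize the links it finds. A minimal sketch, assuming the DOM extension is installed; `imgSrcsFromHtml()` is a hypothetical name:

```php
<?php
// Sketch only: collect the raw src values of <img> elements using DOMDocument.
// Relative paths are returned exactly as written in the page.
function imgSrcsFromHtml(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html); // note: an empty string is rejected by loadHTML()
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $srcs = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $srcs[] = $src;
        }
    }
    return $srcs;
}
```

Because only `img` elements are visited, `src` attributes on scripts or iframes are skipped, and the attribute order inside the tag no longer matters.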
    Last edited by descalzo; 05-22-2010 at 07:14 PM.
    Nothing is always absolutely so.

  3. #3
    misson
    Join Date
    Mar 2008
    Location
    Libertatia
    Posts
    2,506

    Re: img src preg_match_all regex problem

    Quote Originally Posted by descalzo View Post
    I would guess that PHP should have a library that would parse an HTML page and pull out normalized links, but I am not sure.
    The closest I can think of is realpath(), but that only works on local files (the comments for realpath() have some functions that work on URLs).

    SimpleXML offers an alternate way of getting the source URLs, but won't resolve relative references.

    PHP Code:
    $doc = new SimpleXMLElement($url, 0, true);
    $imageSrcs = $doc->xpath("//img/@src");
    foreach ($imageSrcs as $img) {
        // process() stands in for whatever you do with each URL
        process((string) $img);
    }

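    Since image elements aren't the only ones with a src attribute, a better regexp might be: /<img[^>]*\ssrc=(['"])([^>]*?)\1/i (the lazy quantifier stops the capture at the closing quote; this form assumes the attribute value is quoted).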
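As a rough sketch of that idea, the function below restricts matching to `img` tags and accepts both quoted and unquoted values. The name `extractImgSrcs()` and the exact pattern are assumptions made for illustration, not code from the thread.

```php
<?php
// Hypothetical helper: pull src values out of <img> tags only.
function extractImgSrcs(string $html): array
{
    // Quoted value: match lazily up to the same quote that opened it.
    // Unquoted value: run up to the next whitespace, quote or '>'.
    $pattern = '/<img\b[^>]*?\ssrc\s*=\s*(?:(["\'])(.*?)\1|([^\s"\'>]+))/is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $srcs = [];
    foreach ($matches as $m) {
        // Group 2 holds a quoted value, group 3 an unquoted one.
        $src = ($m[2] ?? '') !== '' ? $m[2] : ($m[3] ?? '');
        if ($src !== '') {
            $srcs[] = $src;
        }
    }
    return $srcs;
}
```

Unlike the pattern in the first post, this does not depend on the file extension, so it also picks up images served from URLs without `.png`, `.jpg` or `.gif` at the end.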
    Last edited by misson; 05-23-2010 at 01:53 AM.
    Be sure to read all pages linked in this post; they have further information that should prove useful. When asking for help, make sure you follow Eric Raymond's and Jon Skeet's guidelines for prompt, accurate responses. Please answer any questions I ask; they're not rhetorical (probably). Any posted code is intended as illustrative example, rather than a solution to your problem to be copied without alteration. Study it to learn how to write your own solution.
    Misson, not Mission.

  4. #4
    learning_brain
    Join Date
    Apr 2010
    Location
    UK, Midlands
    Posts
    170

    Re: img src preg_match_all regex problem

    Thanks both of you.

    I took descalzo's advice with the extraction of the url path (concatenating url host/dir.file) and have now got the following;

    PHP Code:
    <?php
    //define url to search
    $url = $_POST['url'];

    //split url
    preg_match('/((http|https|ftp):\/\/)?((.*?)\/)?((.*)\/)?(.*)?/', $url, $urlParts);

    //concatenate host and directory path
    $urlPath = $urlParts[3].$urlParts[5];

    //get contents
    $contents = file_get_contents($url);

    //set matching pattern for img tag source
    $pattern = '/src=[\"\']?([^\"\']?.*(png|jpg|gif))[\"\']?/i';
    //match all img tag source
    preg_match_all($pattern, $contents, $images);

    //count number of items in array
    $imageCount = count($images[1]);

    //loop through each item
    for ($i=0; $i<$imageCount; $i++){

        $testPattern1 = '/^https?:\/\//';
        $testPattern2 = '/\//';

        if ($images[1][$i] = preg_match(??????????????)){

            $imageURL = $images[1][$i];

            echo "<br/>".$imageURL;
            echo '<img src="'.$imageURL.'" width="100"/>';

            $insertSQL = sprintf("INSERT INTO IMAGES (URL, KEYWORDS) VALUES (%s, %s)",
                               GetSQLValueString($imageURL, "text"),
                               GetSQLValueString($imageURL, "text"));

            $Result1 = mysql_query($insertSQL, $freewebhost) or die(mysql_error());

        } elseif ($images[1][$i] = preg_match(??????????????)){

            $imageURL = $urlPath.$images[1][$i];

            echo "<br/>".$imageURL;
            echo '<img src="'.$imageURL.'" width="100"/>';

            $insertSQL = sprintf("INSERT INTO IMAGES (URL, KEYWORDS) VALUES (%s, %s)",
                               GetSQLValueString($imageURL, "text"),
                               GetSQLValueString($imageURL, "text"));

            $Result1 = mysql_query($insertSQL, $freewebhost) or die(mysql_error());
        }
    }
    ?>
    The $urlPath I got more by good luck than good management, however it works.

    But as you can see, I'm having difficulty getting my head round the "if" construct...
    Last edited by learning_brain; 05-23-2010 at 05:17 AM.

  5. #5
    misson
    Join Date
    Mar 2008
    Location
    Libertatia
    Posts
    2,506

    Re: img src preg_match_all regex problem

    Quote Originally Posted by learning_brain View Post
    PHP Code:
    //loop through each item
    for ($i=0; $i<$imageCount; $i++){

        $testPattern1 = '/^https?:\/\//';
        $testPattern2 = '/\//';
    Don't set variables that are invariant inside a loop; you're just wasting cycles.

    Quote Originally Posted by learning_brain View Post
    PHP Code:
        if ($images[1][$i] = preg_match(??????????????)){

            $imageURL = $images[1][$i];

            echo "<br/>".$imageURL;
            echo '<img src="'.$imageURL.'" width="100"/>';

            $insertSQL = sprintf("INSERT INTO IMAGES (URL, KEYWORDS) VALUES (%s, %s)",
                               GetSQLValueString($imageURL, "text"),
                               GetSQLValueString($imageURL, "text"));

            $Result1 = mysql_query($insertSQL, $freewebhost) or die(mysql_error());

        } elseif ($images[1][$i] = preg_match(??????????????)){

            $imageURL = $urlPath.$images[1][$i];

            echo "<br/>".$imageURL;
            echo '<img src="'.$imageURL.'" width="100"/>';

            $insertSQL = sprintf("INSERT INTO IMAGES (URL, KEYWORDS) VALUES (%s, %s)",
                               GetSQLValueString($imageURL, "text"),
                               GetSQLValueString($imageURL, "text"));

            $Result1 = mysql_query($insertSQL, $freewebhost) or die(mysql_error());
        }
    }
    ?>
    You're repeating far too much code here. Convert relative URLs into absolute URLs. After that, the rest of the code is the same.

    PHP Code:
    ?><ol><?php
    foreach ($images[1] as $imageURL) {
        $imageURL = normalize($imageURL, $baseURL);
        ?><li><?php
            echo $imageURL;
            ?><img src="<?php echo $imageURL; ?>" alt="<?php ... ?>"/><?php
            if (! ImageIndex::add($imageURL, ...)) { // or SearchDB::addImage(...) or what-have-you
                // couldn't add image URL to database.
            }
        ?></li><?php
    }
    ?></ol><?php
    Don't use or die, and don't output database error messages to non-admin users.
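One way to follow that advice is sketched below. It swaps in PDO with a prepared statement instead of the `mysql_*` functions, which is an assumption of this example rather than anything the thread prescribes; `addImage()` is a hypothetical helper, and the connection is assumed to be in exception mode (`PDO::ERRMODE_EXCEPTION`).

```php
<?php
// Hypothetical helper: store an image URL without exposing database errors.
// The IMAGES table and its URL/KEYWORDS columns are taken from the thread.
function addImage(PDO $db, string $url, string $keywords): bool
{
    try {
        // Placeholders keep the values out of the SQL text itself.
        $stmt = $db->prepare('INSERT INTO IMAGES (URL, KEYWORDS) VALUES (?, ?)');
        return $stmt->execute([$url, $keywords]);
    } catch (PDOException $e) {
        // Details go to the server log; the visitor never sees them.
        error_log('Image insert failed: ' . $e->getMessage());
        return false;
    }
}

// Usage (connection details are placeholders):
// $db = new PDO('mysql:host=localhost;dbname=example', 'user', 'password',
//               [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
// if (!addImage($db, $imageURL, $imageURL)) {
//     echo 'Sorry, that image could not be saved.';
// }
```

The caller decides what to show the user, while the real error message stays in the log where only an administrator can read it.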

    Quote Originally Posted by learning_brain View Post
    But as you can see, I'm having difficulty getting my head round the "if" construct...
    One thing you should do is write the URL conversion as a function or a class rather than doing everything inline. It will help you focus on the specific task of normalizing absolute and relative URLs.

    To test for an absolute URL, try %^((?:https?)://[^/]+)?(/?)(.*)% (you don't need to check for a scheme of "ftp"). If the first and second groups are empty, you've got a relative path. If the first group is empty but the second is not, you've got an absolute path. If the first group is not empty, you've got an absolute URL.
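The three cases can be sketched as a small function. `classifyUrl()` is a hypothetical name; the pattern is the one quoted above.

```php
<?php
// Hypothetical helper built on the pattern from the post.
// Limitation: a protocol-relative reference such as "//cdn.example.com/a.png"
// is reported as an absolute path, because the pattern only knows http(s).
function classifyUrl(string $url): string
{
    preg_match('%^((?:https?)://[^/]+)?(/?)(.*)%', $url, $m);
    if ($m[1] !== '') {
        return 'absolute URL';   // scheme and host are present
    }
    if ($m[2] !== '') {
        return 'absolute path';  // starts with "/": needs scheme and host
    }
    return 'relative path';      // needs scheme, host and base directory
}

foreach (['http://www.testsite.com/gb/images/image.jpg', '/gb/images/image.jpg', '../images/image.jpg'] as $u) {
    echo $u, ' => ', classifyUrl($u), "\n";
}
```

Every group in the pattern is optional or can match the empty string, so `preg_match()` always succeeds and the function only has to look at which groups came back empty.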
    Last edited by misson; 05-23-2010 at 07:05 AM.

  6. #6
    learning_brain
    Join Date
    Apr 2010
    Location
    UK, Midlands
    Posts
    170

    Re: img src preg_match_all regex problem

    @misson - doing most of your suggestions now.

    OK, my url host/directory isn't working for all urls - only the one I tested.

    #1,#2,#3 etc depend on # of directories/subdirectories so in this case, if the url is only the root, I don't get what I need.

    2ndly, my abs/rel path test ain't working too well...

    PHP Code:
    <?php
    //define url to search
    $url = $_POST['url'];

    //split url
    preg_match('/((http|https|ftp):\/\/)?((.*?)\/)?((.*)\/)?(.*)?/', $url, $urlParts);

    //concatenate host and directory path
    $urlHostDir = $urlParts[1].$urlParts[3].$urlParts[5];

    echo "<br>Host and Directory: ".$urlHostDir."<br/>";

    //get contents
    $contents = file_get_contents($url);

    //define regexp for img tag source
    $pattern = '/src=[\"\']?([^\"\']?.*(png|jpg|gif))[\"\']?/i';
    //match all img tag source
    preg_match_all($pattern, $contents, $images);

    //count number of items in array
    $imageCount = count($images[1]);

    //loop through each item
    for ($i=0; $i<$imageCount; $i++){
        //check if absolute or relative path

        echo "<br/>Testing: ".$images[1][$i];

        if(preg_match('/^http|https?:\/\//', $images[1][$i]))
        {
            echo "<br/>path is absolute";
            $completeImageURL = $images[1][$i];
        }
        else
        {
            echo "<br/>path is relative";
            // concatenate host, directory and relative path
            $completeImageURL = $urlHostDir.$images[1][$i];
        }
        //echo path and image
        echo "<br/>Path to save: ".$completeImageURL;
        echo '<br/><img src="'.$completeImageURL.'" width="100"/>';

        //insert into DB
        //$insertSQL = sprintf("INSERT INTO IMAGES (URL, KEYWORDS) VALUES (%s, %s)",
        //                       GetSQLValueString($completeImageURL, "text"),
        //                       GetSQLValueString($completeImageURL, "text"));

        //$Result1 = mysql_query($insertSQL, $freewebhost) or trigger_error('Query failed');
    }
    ?>
    Actually, after further testing, this is riddled with problems.

    Some paths are absolute with http://www.testsite.com/gb/images/image.jpg... fine

    but what about nasty relative ones...

    /gb/images/image.jpg
    ../images/image.jpg
    ./images/image.jpg

    or even just image.jpg

    all depending on where the file sits and how it has been coded. This obviously changes how many directory layers I need to concatenate.

    Nasty................
    Last edited by learning_brain; 05-23-2010 at 04:25 PM.

  7. #7
    misson
    Join Date
    Mar 2008
    Location
    Libertatia
    Posts
    2,506

    Re: img src preg_match_all regex problem

    Take a second look at the comments for realpath() for sample functions that resolve relative URLs. It's not that nasty. RFC 3986 § 5 even gives an algorithm. Here are two more examples:

    PHP Code:
    function resolveURL($url, $base) {
        preg_match('%^([^:/]+://[^/]+)([^?]*/)%', $base, $baseParts);
        // $baseParts[1] is the scheme & host; $baseParts[2] is a '/' terminated absolute path
        preg_match('%^((?:https?://[^/]+)?)(/?)(.*)%', $url, $urlParts);
        if (empty($urlParts[1])) {
            $urlParts[1] = $baseParts[1];
        }
        if (empty($urlParts[2])) {
            $urlParts[2] = $baseParts[2];
        }
        array_shift($urlParts);
        return implode('', $urlParts);
    }
    // or, based on parse_url()
    function resolveURL($url, $base) {
        $url = parse_url($url);
        if (! is_array($base)) {
            $base = parse_url($base);
        }
        foreach ($base as $name => $part) {
            if (!isset($url[$name])) {
                $url[$name] = $base[$name];
            }
        }
        return "$url[scheme]://$url[host]$url[path]";
    }

    Note that the first ignores query strings in the base URL and treats query strings in the URL to resolve as part of the path. The second will copy over any query string that's in the base URL.

    If you want to remove dot segments, remove any occurrence matching %/(\.|[^/]+/+\.\.)(/|$)% from the path segment after you've added the missing URL components.
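A single regexp pass removes one level of dot segments at a time, so paths like `/a/b/../../c` need repeated application. As an alternative sketch, the segments can be walked once with a stack, in the spirit of RFC 3986 § 5.2.4. `removeDotSegments()` is a hypothetical name, and the function assumes the path is already absolute (starts with `/`).

```php
<?php
// Hypothetical helper: collapse "." and ".." segments in an absolute path.
function removeDotSegments(string $path): string
{
    $segments = explode('/', $path);
    $last = count($segments) - 1;
    $out = [];
    foreach ($segments as $i => $segment) {
        if ($segment === '.' || $segment === '..') {
            // ".." drops the previous segment, but never the leading root marker.
            if ($segment === '..' && count($out) > 1) {
                array_pop($out);
            }
            // A dot segment at the very end still leaves a trailing slash.
            if ($i === $last) {
                $out[] = '';
            }
            continue;
        }
        $out[] = $segment;
    }
    return implode('/', $out);
}

echo removeDotSegments('/gb/page/../images/./image.jpg'), "\n"; // prints /gb/images/image.jpg
```

Run it on the path only after the base directory has been prepended, so that `../` has something to climb out of.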

    If the HTTP extension is installed, it turns out all you need is http_build_url():
    PHP Code:
    function resolveURL($url, $base) {
        return http_build_url($base, $url, HTTP_URL_JOIN_PATH);
    }

    Make sure you check for the <base> tag when setting the URL base.
    PHP Code:
    // $contents is the fetched page source; $url is the address it came from
    if (preg_match('%<base\s+href=[\'"]?([^\'">]*)%i', $contents, $matches)) {
        // in case the base tag doesn't have an absolute URL, resolve it
        $base = resolveURL($matches[1], $url);
    } else {
        $base = $url;
    }
    // remove trailing non-directory component, if any. Not strictly necessary
    $base = preg_replace('%[^/]*$%', '', $base);

  8. #8
    learning_brain
    Join Date
    Apr 2010
    Location
    UK, Midlands
    Posts
    170

    Re: img src preg_match_all regex problem

    Thanks misson - that's cracked it - works like a dream.

  9. #9
    misson
    Join Date
    Mar 2008
    Location
    Libertatia
    Posts
    2,506

    Re: img src preg_match_all regex problem

    Be sure you follow the advice in my sig. Specifically,
    Quote Originally Posted by misson
    Any posted code is intended as illustrative example, rather than a solution to your problem to be copied without alteration. Study it to learn how to write your own solution.
    In particular, the sample resolveURL() functions don't handle query strings properly. An implementation that's closer to RFC 3986's algorithm is:

    PHP Code:
    function sortBy(array $toSort, array $order) {
        $order = array_intersect_key($order, $toSort);
        return array_merge($order, $toSort);
    }

    function build_url($parts) {
        foreach (array('host' => '://', 'query' => '?', 'fragment' => '#') as $part => $pre) {
            if (isset($parts[$part])) {
                $parts[$part] = $pre . $parts[$part];
            }
        }
        return implode('', $parts);
    }

    function resolveURL($url, $base) {
        $urlParts = parse_url($url);
        if (! is_array($base)) {
            $base = parse_url($base);
        }
        $base['path'] = preg_replace('%[^/]+$%', '', $base['path']);
        foreach ($base as $name => $part) {
            if (!isset($urlParts[$name])) {
                $urlParts[$name] = $base[$name];
            } else {
                break;
            }
        }
        if ($urlParts['path'][0] != '/') {
            $urlParts['path'] = $base['path'] . $urlParts['path'];
        }
        $urlParts['path'] = preg_replace('%/(?:\.|[^/]+/+\.\.)(/|$)%', '$1', $urlParts['path']);
        return build_url(sortBy($urlParts, $base));
    }

