Thread: Replacing all img src in loaded html

  1. #1
    learning_brain is offline x10 Sophmore learning_brain is an unknown quantity at this point
    Join Date
    Apr 2010
    Location
    UK, Midlands
    Posts
    170

    Replacing all img src in loaded html

    I don't know if I'm going to get any sense here but here goes...

    I have a class that analyses an image (test file here; refresh for another image) and censors anything that has too many flesh tones by returning a pixellated version.

    I am trying to set up a webpage wrapper that loads the html and replaces every image src with the output from the class.

    Currently, I'm loading the page using cURL into a variable $html.

    What comes next I can't get my head round. After lots of reading up, there are mentions of DOM documents, XPath, str_replace, foreach, preg_replace etc., but I can't work out how to apply them to this situation.

    The closest topic I can find is here but again, I can't get my head round using the existing image url to call a class and return a replacement image url (or the same one if it passes).

    Any explanations in simple English would be much appreciated.

    Thank you

    Rich

  2. #2
    misson is offline x10 Spammer misson is a jewel in the rough
    Join Date
    Mar 2008
    Location
    Libertatia
    Posts
    2,506

    Re: Replacing all img src in loaded html

    Take a closer look at the "Extracting data from HTML" document you linked to. It does almost what you want; it iterates over all elements in a document with a certain tag, performing an operation on each. Where it differs is in processing all anchor elements ("//a") rather than images ("//img"), and echoing the "href" attribute rather than setting the "src" attribute. Lastly, you want to output the original document as HTML after processing. If those hints aren't enough, let me know and I'll post a modified version of Kore Nordmann's code.

    DOM is simply an OOP interface for documents; a DOM document is a document that supports the DOM interface. Xpaths are like CSS selectors but with different syntax (one that slightly resembles filesystem paths). If you use Firefox, there are a number of xpath add-ons (such as FirePath with FireBug) that let you play with xpaths.
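
    The hints above can be put together roughly as follows. This is only a sketch: replacementFor() is a hypothetical stand-in for the image-analysis class, the "censor.php?img=" address is made up for illustration, and an inline HTML string takes the place of the page fetched with cURL.

```php
<?php
// Hypothetical stand-in for the image-analysis class: maps an image URL
// to the URL that should replace it. The "censor.php" address is invented.
function replacementFor(string $src): string
{
    return 'censor.php?img=' . urlencode($src);
}

// Parse the HTML, rewrite the src of every <img>, and return the document as HTML.
function rewriteImages(string $htmlSource, callable $replace): string
{
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $old = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($htmlSource);
    libxml_clear_errors();
    libxml_use_internal_errors($old);

    $xpath = new DOMXPath($doc);
    // "//img[@src]" selects every img element that has a src attribute.
    foreach ($xpath->query('//img[@src]') as $img) {
        $img->setAttribute('src', $replace($img->getAttribute('src')));
    }
    return $doc->saveHTML();
}

$html = '<html><body><p>Hi</p><img src="a.jpg" alt="A"><img src="b.png"></body></html>';
$out = rewriteImages($html, 'replacementFor');
echo $out;
```

    The only differences from the anchor-listing example are the "//img" query, the setAttribute call, and the final saveHTML.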
    Last edited by misson; 08-18-2011 at 06:29 PM.
    Be sure to read all pages linked in this post; they have further information that should prove useful. When asking for help, make sure you follow Eric Raymond's and Jon Skeet's guidelines for prompt, accurate responses. Please answer any questions I ask; they're not rhetorical (probably). Any posted code is intended as illustrative example, rather than a solution to your problem to be copied without alteration. Study it to learn how to write your own solution.
    Misson, not Mission.

  3. #3
    learning_brain

    Re: Replacing all img src in loaded html

    As always - you come up with the goods, but scraping content was not my problem.

    My issue was the replacement of the src Attribute.... but you did give me some hints which I found very useful.

    I am having a secondary problem though.

    Current code...

    PHP Code:
    $target_url = $_GET['url'];

    $oldSetting = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $html = new DOMDocument();
    $html->loadHTMLFile($target_url);

    $xpath = new DOMXPath($html);
    $imgtags = $xpath->query('//img');
    foreach ($imgtags as $imgtag) {
        $absoluteImgSrc = url_to_absolute($target_url, $imgtag->getAttribute('src'));
        $analysedImage = new ImageAnalysis();
        $analysedImage->doAnalysis($absoluteImgSrc);
        $imgtag->setAttribute('src', $analysedImage->outputURL);
    }

    libxml_clear_errors();
    libxml_use_internal_errors($oldSetting);

    $root = $html->createElement('html');
    $root = $html->appendChild($root);

    $head = $html->createElement('head');
    $head = $root->appendChild($head);

    $title = $html->createElement('title');
    $title = $head->appendChild($title);

    $text = $html->createTextNode('This is the title');
    $text = $title->appendChild($text);

    echo $html->saveHTML();

    Within the foreach, the url_to_absolute() function is one that resolves an absolute url so it can be used from any location.

    Now I know my class works... on my own server... as you can see.

    However, it doesn't seem to work using another site and the same process....

    Not sure about the necessity of creating elements and replacing child?????? I'm guessing this is to do with XML????

    Need to think this through....

    Rich
    Last edited by learning_brain; 08-19-2011 at 05:41 PM.

  4. #4
    misson

    Re: Replacing all img src in loaded html

    $html already has an <html> element, and should already have <head> and <title> elements. After the loop and resetting libxml's error setting, all you should need is the echo $html->saveHTML();. The rest gets you malformed HTML.
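
    A quick way to check this is to inspect the parsed document before adding anything to it. The sketch below uses a made-up inline page rather than a remote URL:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><head><title>Original title</title></head><body><p>Hello</p></body></html>');

// The parsed document already has a root <html> element and a <title>.
$rootName = $doc->documentElement->nodeName;
$title = $doc->getElementsByTagName('title')->item(0)->textContent;

echo $rootName, "\n", $title, "\n";

// Serialising the document is all that is left to do.
echo $doc->saveHTML();
```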

    When you say "it doesn't work", how exactly doesn't it work? Give sample input & output, and say how the output differed from what you expected.
    Last edited by misson; 08-19-2011 at 07:42 PM.

  5. #5
    learning_brain

    Re: Replacing all img src in loaded html

    Thanks Misson

    I found the bug - it was to do with relative paths rather than absolute which I have now fixed.

    The development page is at http://www.qualityimagesearch.com/cbic_wrapper.php

    Just put in an html address (full url) and it will process the images accordingly. I have also changed all a hrefs so that you can browse within the wrapper.
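
    For anyone following along, rewriting the links could look something like this. It is a sketch, not the code behind the page above: the "cbic_wrapper.php?url=" address is assumed from the script name and the $_GET['url'] parameter, and the href is assumed to be absolute already (relative ones would first go through something like url_to_absolute()).

```php
<?php
// Assumed wrapper address, based on the script name and its "url" parameter.
const WRAPPER = 'cbic_wrapper.php?url=';

$doc = new DOMDocument();
$doc->loadHTML('<html><body><a href="http://example.com/next.html">next</a></body></html>');

// Point every link back at the wrapper, passing the original target along.
foreach ($doc->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href')) {
        $a->setAttribute('href', WRAPPER . urlencode($a->getAttribute('href')));
    }
}

$firstHref = $doc->getElementsByTagName('a')->item(0)->getAttribute('href');
echo $firstHref, "\n";
```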

    My only issue at the moment is that some css files are not being linked... possibly another url issue.

    The image censoring seems relatively accurate at present, but improvements have to be made.

    Also - it would be good to have the form re-iterated at the top of the final html so that another address can be entered, but if I simply echo it before the saveHTML, it will come before the html->body etc, which isn't ideal.

    Maybe there's a way to append the <body></body> content....

    Progress is good so far and reasonably fast too.

    Rich
    Last edited by learning_brain; 08-20-2011 at 02:10 PM.

  6. #6
    misson

    Re: Replacing all img src in loaded html

    Quote (originally posted by learning_brain):
    Also - it would be good to have the form re-iterated at the top of the final html so that another address can be entered, but if I simply echo it before the saveHTML, it will come before the html->body etc, which isn't ideal.

    Maybe there's a way to append the <body></body> content....
    Build the form in the usual way (with DOMDocument::createElement calls) and insert or append the form element to the body element.

    As an alternative to creating the form programmatically, you can create a DOMDocumentFragment then have it parse a string containing the form. You then add the fragment to the body element as you would the programmatically created element (with DOMNode::insertBefore or DOMNode::appendChild). Note that, as with all nodes, the document fragment must be owned by the document, otherwise you'll never be able to add the nodes. The simplest way to do this is to use DOMDocument::createDocumentFragment.

    PHP Code:
    $formSource = <<<ETX
    <form method="get">
        <input name="u" />
        ...
    </form>
    ETX;

    $form = $doc->createDocumentFragment();
    $form->appendXML($formSource);
    $body->appendChild($form); 
    You can also create a document fragment in the usual OO way (with new), but you must add the fragment to the document using DOMDocument::importNode before you parse the HTML string (orphaned DOM nodes are read-only). Since importNode creates a copy of the node rather than altering the original, you must also use the returned document fragment rather than the original, which will still be orphaned and read-only. You might as well just use createDocumentFragment.

    PHP Code:
    ...
    $form = new DOMDocumentFragment();
    $form = $doc->importNode($form);
    $form->appendXML($formSource);
    ... 
    If PHP's DOMDocument supported adoptNode, then creating a fragment with new would be viable, but sadly DOMDocument implements DOM level 2, not 3.
    Last edited by misson; 08-20-2011 at 05:02 PM.

  7. #7
    learning_brain

    Re: Replacing all img src in loaded html

    Hmmm - got an unexpected $end at the end of the script....

    So I tried playing with the <<<ETX wrapper.

    eg..

    PHP Code:
    $formSource=<<<STX
            <form id="form" name="form" method="get" action="">
                <label>
                <input name="url" type="text" id="url" size="100" />
                </label>
                <label>
                <input type="submit" name="button" id="button" value="Go" />
                </label>
            </form>
    ETX; 
    PHP Code:
    $formSource='<<<STX
            <form id="form" name="form" method="get" action="">
                <label>
                <input name="url" type="text" id="url" size="100" />
                </label>
                <label>
                <input type="submit" name="button" id="button" value="Go" />
                </label>
            </form>
    ETX'

    PHP Code:
    $formSource='<<<STX
            <form id="form" name="form" method="get" action="">
                <label>
                <input name="url" type="text" id="url" size="100" />
                </label>
                <label>
                <input type="submit" name="button" id="button" value="Go" />
                </label>
            </form>
    ETX>>>'

    PHP Code:
    $formSource='<<<ETX
            <form id="form" name="form" method="get" action="">
                <label>
                <input name="url" type="text" id="url" size="100" />
                </label>
                <label>
                <input type="submit" name="button" id="button" value="Go" />
                </label>
            </form>
    ETX'

    The furthest I got was "Call to a member function appendChild() on a non-object" - so I'm guessing it wasn't parsing the string correctly... can't find many pages explaining STX..ETX... ASCII characters and syntax.

    Am I missing something??

    Rich

  8. #8
    misson

    Re: Replacing all img src in loaded html

    Quote (originally posted by learning_brain):
    Hmmm - got an unexpected $end at the end of the script....
    [...]
    PHP Code:
    $formSource=<<<STX
       [...]
    ETX; 
    The start and end identifier must be exactly the same. Only identifier characters and (optionally) a trailing ";" can be present in the line with the closing identifier. You have "STX" as an opening identifier, so the string is never closed (as you can see by the red in this colorized source). I sometimes use ETX as a delimiter since the ETX control character means "end of text", but "STX" and "ETX" have no significance in PHP. You could use "EOS" (short for "end of string"), or "String_end" or "Whoops_Mrs_Miggens_youre_sitting_in_my_artichokes".
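
    For comparison, here is a heredoc that does close, using "EOS" on both ends. The form fields are taken from the attempt quoted above; treat it as an illustration of the syntax rather than a finished form.

```php
<?php
// Opening and closing identifiers match (EOS), and the closing line holds
// nothing but the identifier and the semicolon.
$formSource = <<<EOS
<form method="get" action="">
    <input name="url" type="text" size="100" />
    <input type="submit" value="Go" />
</form>
EOS;

echo $formSource;
```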

    Quote (originally posted by learning_brain):
    PHP Code:
    $formSource='<<<STX
       [...]
    ETX'

    This is simply a single quoted string, as are the rest of the samples, as the color again reveals. The "<<<STX" and "ETX'" are a part of that string.

    Quote (originally posted by learning_brain):
    The furthest I got was "Call to a member function appendChild() on a non-object" - so I'm guessing it wasn't parsing the string correctly...
    This means the thing you're calling the method on (the $obj in $obj->method()) isn't an object, not that an argument to the method isn't an object. Whatever you're doing to retrieve the DOM node (the body element?) is failing. Try:
    PHP Code:
    $body = $xpath->query('/html/body')->item(0);

    Quote (originally posted by learning_brain):
    can't find many pages explaining STX..ETX... ASCII characters and syntax.
    You don't need info about the ETX character, you need to read over the PHP documentation on heredoc syntax for strings.
    Last edited by misson; 08-21-2011 at 06:31 AM.

  9. #9
    learning_brain

    Re: Replacing all img src in loaded html

    Thanks Misson

    The $body = $xpath->query('/html/body')->item(0); worked a treat... although I don't understand why lol.

    Your code for the ETX was the one I tried first (which was giving me the unexpected $end). The others were just trials.

    After a lot of reading up, I found that it was having problems with the tabbing, so I removed them and...works a treat! (although it adds the code to the end of the body so I'm having difficulty getting a css relative position at the top of the page without overlaying over existing headers..) Just need to style it up and sort out the <link...> tags and I should have a working solution!

    Because of the increased loading time before it returns the html, I was trying to get a progress bar working, but I guess that's for another day and another thread...

    Thanks again for all your help.

    Rich
    Last edited by learning_brain; 08-21-2011 at 03:14 PM.

  10. #10
    misson

    Re: Replacing all img src in loaded html

    Quote (originally posted by learning_brain):
    The $body = $xpath->query('/html/body')->item(0); worked a treat... although I don't understand why lol.
    Think about what DOMXPath::query returns and what /html/body selects, then what the DOMNodeList::item returns.
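
    Spelled out on a made-up inline page, the pieces look like this (the /html/frameset query is only there to show what a failed lookup gives back):

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><head><title>t</title></head><body><p>x</p></body></html>');
$xpath = new DOMXPath($doc);

// query() returns a DOMNodeList, even when only one node matches.
$list = $xpath->query('/html/body');

// item(0) pulls the first node out of the list; here, the <body> element.
$body = $list->item(0);

// When nothing matches, the list is empty and item(0) is null, which is
// what leads to "Call to a member function ... on a non-object".
$missing = $xpath->query('/html/frameset')->item(0);

var_dump($list instanceof DOMNodeList, $body instanceof DOMElement, $missing);
```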

    Quote (originally posted by learning_brain):
    Your code for the ETX was the one I tried first (which was giving me the unexpected $end).
    Not quite. Here's what I wrote:
    Quote (originally posted by misson):
    PHP Code:
    $formSource = <<<ETX
    <form method="get">
        <input name="u" />
        ...
    </form>
    ETX; 
    My first line differs from yours: I use "ETX", you use "STX", which makes all the difference.

    It also appears that vBulletin is adding spaces after the open and close identifiers in the rendered post (but not the source, as you can see if you quote the original message). Those extra spaces will cause a "Parse error: syntax error, unexpected T_SL" error.

    Quote (originally posted by learning_brain):
    After a lot of reading up, I found that it was having problems with the tabbing, so I removed them and...works a treat!
    The only place whitespace should matter is on the lines with the identifiers, where there should be none. Within the heredoc string itself, whitespace won't affect parsing.

    Quote (originally posted by learning_brain):
    (although it adds the code to the end of the body so I'm having difficulty getting a css relative position at the top of the page without overlaying over existing headers..) Just need to style it up and sort out the <link...> tags and I should have a working solution!
    You could use insertBefore and DOMNode::$firstChild rather than appendChild to make the form the first child of <body>, or use fixed positioning. If you do the latter, add extra space to the top of the body (by e.g. adding padding) so the top of the page isn't eclipsed by the form.
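
    A rough sketch of the insertBefore route, using a made-up inline page and a cut-down form in place of the real ones:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><h1>Existing header</h1><p>Content</p></body></html>');
$body = $doc->getElementsByTagName('body')->item(0);

// Build the form from a string, as in the earlier fragment example.
$form = $doc->createDocumentFragment();
$form->appendXML('<form method="get" action=""><input name="url" type="text" /></form>');

// insertBefore(new, reference): with firstChild as the reference node the
// form lands ahead of everything already in <body>.
$body->insertBefore($form, $body->firstChild);

echo $doc->saveHTML();
```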