
Thread: Clearing Dom Object

  1. #1
    learning_brain is offline x10 Sophmore learning_brain is an unknown quantity at this point
    Join Date
    Apr 2010
    Location
    UK, Midlands
    Posts
    170

    Clearing Dom Object

    I have a site that crawls sites for images using CURL and parsing to DOM elements.

    This works great for single urls, but what I want to achieve is for a preliminary a->href search and then a loop to search through all href pages for images as well (1 deep).

    Ideally, I would like to extend this to possibly 5 deep, but I'm guessing I would also have timeout issues as well.

    This is fine in theory, but this means opening up potentially lots of pages and creating lots of dom documents. My first effort ended up with "memory limit exceeded".

    Is there a way to clear a dom document after each loop so that it creates a new one afresh?

    Is it as simple as $dom="" ???

    misson, if you're reading this, you may remember that I wanted help on preg_match_all and regex issues. After a lot of experimentation, this dom parser seems to be a far simpler solution, together with an absolute path resolver. The results are great and far easier to manipulate.
    Last edited by learning_brain; 06-03-2010 at 12:08 PM.

  2. #2
    marshian is offline x10 Elder marshian is an unknown quantity at this point
    Join Date
    Jan 2008
    Location
    Belgium
    Posts
    526

    Re: Clearing Dom Object

    What you should do is create a list of all pages you have to index.
    Eg.
    1. Download page 1
    2. Search & process(*) url's
    3. Search & process images
    4. Download the next page and continue with 2

    (*) = add the url to the list of url's to process.

    About the memory usage, see curl_close.
    Closes a cURL session and frees all resources. The cURL handle, ch, is also deleted.
    That should be what you're looking for (:

    Additionally, about the time-outs, you could just continue going deeper, processing images and getting more url's on the way until you've reached a certain critical time. Or explained in human language, get the current time when you start your indexing and keep processing the next url on your list as long as you've not yet reached X seconds.

    Pseudo-code:
    Code:
    $time = time();
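    while(time() < $time + 20 && has_url()) {
        process_url(next_url());
    }
    
    function process_url($url) {
        // get url's from the page at $url, then for each one:
        add_url($found_url);
        // get images from the page at $url, then for each one:
        add_image($found_image);
    }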
    Is that of any use?
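
    A minimal PHP sketch of the time-budget idea above, assuming the queue is a plain array and the fetch-and-parse step is passed in as a callback. The names process_queue and $handler, and the example.com URLs, are made up for illustration.

```php
<?php
// Hypothetical helper: process queued URLs until the queue is empty
// or the time budget (in seconds) runs out. $handler is called once per
// URL and returns any newly discovered URLs, which are appended to the queue.
function process_queue(array $queue, callable $handler, $budgetSeconds)
{
    $start = microtime(true);
    $processed = array();

    while (count($queue) > 0 && (microtime(true) - $start) < $budgetSeconds) {
        $url = array_shift($queue);
        $processed[] = $url;
        foreach ($handler($url) as $found) {
            $queue[] = $found;
        }
    }

    // Whatever is left can be stored and picked up by the next run.
    return array('processed' => $processed, 'pending' => $queue);
}

// Stand-in for the real download-and-parse step (an assumption for this demo).
$links = array(
    'http://example.com/' => array('http://example.com/a', 'http://example.com/b'),
);
$handler = function ($url) use ($links) {
    return isset($links[$url]) ? $links[$url] : array();
};

$result = process_queue(array('http://example.com/'), $handler, 20);
```

    The budget caps how long one run lasts; anything still in 'pending' when the time is up is simply left for a later run.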
    Real programmers don't document their code - if it was hard to write, it should be hard to understand.

  3. #3
    learning_brain is offline x10 Sophmore learning_brain is an unknown quantity at this point
    Join Date
    Apr 2010
    Location
    UK, Midlands
    Posts
    170

    Re: Clearing Dom Object

    That's a pretty good start... Thanks!

    Quick questions though... You mention a "list" of URL's to process. This isn't how I was going about it.

    Initially, the page was opening the submitted URL and scraping for a hrefs, then img src's. Then as part of that loop, it would start another loop to go through all url's obtained from the first scrape. As it is part of the first scrape loop, I can't close the first CURL, otherwise I lose the URL list to work on.

    Your "list" idea sounds better but I don't know how to go about it.... would this be a separate table in the db which is then accessed by a cron job?

    Your timeout idea is a bit confusing. Surely this will limit the time the processing has to complete, not extend it... which is what I would need for a complete 1st level/2nd level url scrape.

    I'm determined to get this cracked.

    Current code below..

    PHP Code:
    <?php
    require_once('Connections/freewebhost.php');//connection parameters
    mysql_select_db($database_freewebhost, $freewebhost);//select mysql database
    require_once('functions/url_to_absolute.php');//function to resolve absolute urls
    require_once('functions/getmysqlvaluestring.php');//function to sanitise string prior to mysql injection

    if (isset($_POST['domain_url']))//if a url is submitted
    {
        //print another form
        echo '
        <form id="form1" name="form1" method="post" action="">
        <label> Submit another Page URL
            <input type="text" name="domain_url" id="domain_url" />
        </label>
        <label>
          <input type="submit" name="button" id="button" value="Submit" />
          </label>
        </form>
        ';

        //define url to search
        $domain_url = $_POST['domain_url'];
        echo "<div style='background-color: #ccc; border:1px solid #ccc; padding: 10px; margin-left:10px; margin-top:10px;'>";
        echo "<hr/><strong>Crawling: ".$domain_url."</strong><hr/>";

        //------------------------------------------------CURL------------------------------------------------
        $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($ch, CURLOPT_URL, $domain_url);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        //curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $html = curl_exec($ch);
        if (!$html)
        {
            echo "<br />cURL error number:" . curl_errno($ch);
            echo "<br />cURL error:" . curl_error($ch);
            exit;
        }

        $dom = new DOMDocument();
        @$dom->loadHTML($html);

        $xpath = new DOMXPath($dom);
        $img = $xpath->evaluate("/html/body//img");
        //------------------------------------------------END CURL------------------------------------------------

        //-------------------------------------start domain url loop-------------------------------
        for ($i = 0; $i < $img->length; $i++)
        {
            echo "<div style='background-color: #fff; border:1px solid #ccc; padding: 10px; margin-left:10px; margin-top:10px;'>";
            $imgTags = $img->item($i);
            $imageUrl = url_to_absolute($domain_url, $imgTags->getAttribute('src'));
            $imageAlt = $imgTags->getAttribute('alt');

            echo '<br/><img src="'.$imageUrl.'"/><br/>';
            echo "Image URL: ".$imageUrl."<br/>";
            echo "Image Keywords: ".$imageAlt."<br/>";

            //get image size
            $imgSize = getimagesize($imageUrl);
            echo "Size: ".$imgSize[0]."x".$imgSize[1]."<br/>";

            //if image too small
            if ($imgSize[0] < 300 || $imgSize[1] < 300)
            {
                echo " Image less than 300 x 300px.  Skipping......";
            //else if image is size specified
            } else {
                echo " Size OK";

                //check if already exists in DB
                if (mysql_num_rows(mysql_query("SELECT URL FROM IMAGES WHERE URL = '$imageUrl'")))
                {
                    echo "Image URL exists. Skipping.....";
                } else {
                    echo "<br/>Image URL does not exist in database: Saving.....";

                    //insert into DB
                    $insertSQL = sprintf("INSERT INTO IMAGES (URL, KEYWORDS) VALUES (%s, %s)",
                                         GetSQLValueString($imageUrl, "text"),
                                         GetSQLValueString($imageAlt, "text"));

                    $Result1 = mysql_query($insertSQL, $freewebhost) or trigger_error('Query failed');
                }
            }
            //end image test

            echo "</div>";
        }
        //-------------------------------------end domain url loop-------------------------------

        echo "<br/>Page crawl complete.";
        echo "</div>";

    } else {
        //if form is not submitted, print initial form
        echo '
        <p><a href="index.php">Back to Image Search</a></p>

        <form id="form1" name="form1" method="post" action="">
        <label> Submit full Page URL
            <input type="text" name="domain_url" id="domain_url" />
        </label>
        <label>
          <input type="submit" name="button" id="button" value="Submit" />
          </label>
        </form>
        ';
    }
    ?>

  4. #4
    misson is offline x10 Spammer misson is a jewel in the rough
    Join Date
    Mar 2008
    Location
    Libertatia
    Posts
    2,506

    Re: Clearing Dom Object

    curl_close() is similar in many ways to mysql_close(). curl_close() is for when you no longer need to fetch resources using a curl session (mysql_close() is for when you're done with a MySQL connection). The counterpart to curl_close() is curl_init() (like mysql_close()/mysql_connect()) and the number of calls to curl_close() must be no greater than the number of calls to curl_init(). You can reuse a curl session (as you can reuse MySQL connections) by simply calling curl_setopt() to set a new URL (along with any other curl options), followed by curl_exec(). Also, when a curl session is garbage collected, it gets closed. As a consequence, curl_close() doesn't necessarily gain you much. If your script has stages where it doesn't need to fetch anything, closing a curl session early may help. If you only have one curl session or need it throughout the script, there isn't a great benefit.

    The world wide web forms a graph. Resources (anything with a URL) are nodes, links are edges. Resources that can't contain anchors (such as images) are leaves. You want to traverse a portion of this graph, which leads to two algorithms: breadth-first search (BFS) and depth-first search (DFS). In the former, you process all nodes at a given distance from the starting node before processing nodes further out; in the latter, you fully process one branch before processing the next. The two are very similar. Here's an outline for both:
    1. Put the root node in the list N
    2. While there's a node left in list N
      1. remove the next node, store it as the current node
      2. (preorder) process current node
      3. add each child of the current node to the list N
      4. (postorder) process current node
    The main difference between a BFS and DFS is the data structure used to hold the list of nodes to process. A BFS uses a queue (first-in, first-out) and a DFS uses a stack (last-in, first-out). Plain PHP arrays work as either: add nodes with array_push(), then remove them with array_shift() for a queue or array_pop() for a stack (PHP 5.3 also ships SplQueue and SplStack). You can implement DFS recursively, in which case the node list is the call stack. This also gives you a new version of every local variable for each level of recursion, which may or may not be desirable, and those per-call copies are the main source of memory usage. To reduce memory usage, keep the node list in an explicit array rather than on the call stack and reuse whichever resources you can (such as the DOMDocument and curl session).
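
    To make the queue/stack difference concrete, here is a small sketch that walks a toy tree held in an array. The $children data and the traverse() function are made up for illustration; in a crawler the children would come from the links found on each page.

```php
<?php
// Toy link tree standing in for real pages (hypothetical data).
$children = array(
    'root' => array('a', 'b'),
    'a'    => array('a1', 'a2'),
    'b'    => array('b1'),
);

// Traverse from $root; $mode selects how the next node is removed from the list.
function traverse(array $children, $root, $mode)
{
    $nodes = array($root);
    $order = array();

    while (count($nodes) > 0) {
        // Queue (BFS): take from the front. Stack (DFS): take from the back.
        $current = ($mode === 'bfs') ? array_shift($nodes) : array_pop($nodes);
        $order[] = $current;   // pre-order processing happens here

        $kids = isset($children[$current]) ? $children[$current] : array();
        foreach ($kids as $child) {
            array_push($nodes, $child);
        }
    }
    return $order;
}

$bfs = traverse($children, 'root', 'bfs'); // root, a, b, a1, a2, b1
$dfs = traverse($children, 'root', 'dfs'); // root, b, b1, a, a2, a1
```

    Only the removal call changes between the two modes; everything else in the loop is shared.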

    Another axis along which traversal algorithms differ is the point at which additional node processing (i.e., beyond adding its children to the list) is performed: before, during or after adding children to the list, also called pre-, in- and post-order traversal. Pre- and post-order are marked in the outline. In your case, extracting image URLs and adding them to the DB counts as additional processing. You can re-use the curl session and DOMDocument with either pre- or post-order traversal.
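
    On reusing the DOMDocument: a rough sketch, assuming the HTML has already been fetched into strings. Each loadHTML() call on an existing DOMDocument replaces the document it held before, so one object can serve every page in the loop. The extract_image_srcs() helper and the sample pages are invented for this example.

```php
<?php
// Collect libxml parse warnings quietly instead of printing them
// (an alternative to prefixing loadHTML() with @).
libxml_use_internal_errors(true);

// Load $html into the shared $dom and return the src of every <img>.
function extract_image_srcs(DOMDocument $dom, $html)
{
    $dom->loadHTML($html);    // replaces whatever document was loaded before
    libxml_clear_errors();    // drop buffered parse errors so they don't pile up

    $srcs = array();
    foreach ($dom->getElementsByTagName('img') as $img) {
        $srcs[] = $img->getAttribute('src');
    }
    return $srcs;
}

// Stand-ins for downloaded pages (hypothetical data).
$pages = array(
    '<html><body><img src="one.png" alt="1"></body></html>',
    '<html><body><p>text</p><img src="two.jpg"><img src="three.gif"></body></html>',
);

$dom = new DOMDocument();     // created once, reused for every page
$found = array();
foreach ($pages as $html) {
    $found = array_merge($found, extract_image_srcs($dom, $html));
}
unset($dom);                  // drop the last document once the loop is done
```

    As far as I know, PHP can only free a document once nothing refers to it any more, so DOMXPath objects and node lists built from it should not be kept around between pages either.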

    The linked articles have more specific information about tree traversal.

    If you can rely on allow_url_fopen being set, you don't need to use curl. Simply pass the URL to DOMDocument::loadHTMLFile() (DOMDocument::load() expects well-formed XML, which most web pages are not).

    One other issue is that BFS and DFS are designed to work on trees, which are connected graphs without cycles. Since the web most decidedly has cycles, you'll have to do something to break them. Use a set to record URLs. Here's the pseudocoded algorithm, updated to handle cycles:
    Code:
    set Seen to {root}
    add root to Nodes
    while size(Nodes) > 0:
        remove next element of Nodes and store it as current node
        [additional processing of current node]
        for each child of current:
            if child is not in Seen:
                add child to Seen and Nodes
        [additional processing of current node]
    In PHP, you can use an associative array as a set of URLs. Mapping set operations to array operations:
    • $item is in $Set := isset($Set[$item])
    • add $item to $Set := $Set[$item] = true
    • remove $item from $Set := unset($Set[$item])
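
    Putting the set operations and the cycle-safe outline together, a minimal sketch over a toy link graph that contains cycles. The $links data and crawl() function are hypothetical; the root is marked as seen before the loop so that a link back to it is not queued again.

```php
<?php
// Toy link graph with cycles (a links back to root, c links back to a).
$links = array(
    'root' => array('a', 'b'),
    'a'    => array('root', 'c'),
    'b'    => array('c'),
    'c'    => array('a'),
);

// Breadth-first walk that visits every reachable node exactly once.
function crawl(array $links, $root)
{
    $seen  = array($root => true);   // associative array used as a set
    $nodes = array($root);
    $order = array();

    while (count($nodes) > 0) {
        $current = array_shift($nodes);
        $order[] = $current;         // additional processing would go here

        $kids = isset($links[$current]) ? $links[$current] : array();
        foreach ($kids as $child) {
            if (!isset($seen[$child])) {   // "is in Seen" test
                $seen[$child] = true;      // add to Seen
                $nodes[] = $child;         // add to Nodes
            }
        }
    }
    return $order;
}

$visited = crawl($links, 'root'); // root, a, b, c
```

    Using the URL as the array key makes the membership test a single isset() call, which stays cheap however long the list gets.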
    Last edited by misson; 06-03-2010 at 09:07 PM.
    Be sure to read all pages linked in this post; they have further information that should prove useful. When asking for help, make sure you follow Eric Raymond's and Jon Skeet's guidelines for prompt, accurate responses. Please answer any questions I ask; they're not rhetorical (probably). Any posted code is intended as illustrative example, rather than a solution to your problem to be copied without alteration. Study it to learn how to write your own solution.
    Misson, not Mission.

  5. #5
    learning_brain is offline x10 Sophmore learning_brain is an unknown quantity at this point
    Join Date
    Apr 2010
    Location
    UK, Midlands
    Posts
    170

    Re: Clearing Dom Object

    Thanks misson - helpful as always.

    After reviewing all of this (and there was a lot to understand and get my head around), I have made a few decisions...

    Firstly, I've got this sort of working. The trouble is, my url-to-crawl list grows exponentially and my host loses interest (just stops unexpectedly or says "mysql has gone away" - what on holiday?) before completion of all scraping, leaving lots of open ends and no way of tracking what has or has not been crawled.

    I went back to marshian's comment about a list...and thought why not? A pending queue... It's so much easier to track and manage.

    So I now have separate pages.

    1) submit initial URL (Could be root), which crawls for links and adds to mysql queue table.
    2) processing page which loops through unprocessed url's looking for images/saving src and updates queue when done. It also looks for tertiary url links and adds to queue. This page I can keep refreshing or put as a cron job.

    Just need to tidy up now.

    One thing I need to check is whether the link in the queue is a url for an image itself. (i.e. http://www.mysite.com/images/image.jpg) in which case it should just save it as an image url rather than as a page to be crawled.

  6. #6
    marshian is offline x10 Elder marshian is an unknown quantity at this point
    Join Date
    Jan 2008
    Location
    Belgium
    Posts
    526

    Re: Clearing Dom Object

    That's pretty much what my idea was as well, yet implemented in a different way. Nice find, it sounds like a good implementation. Just a small remark, you're basically about to go concurrent now: a cron run and a manual refresh (or two cron runs) can overlap. You'll want to lock your resources. Basically, at the beginning of your processing page add a MySQL LOCK TABLES query and UNLOCK TABLES when you're done. This way you prevent your script from interfering with itself if it would happen to run twice.
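
    Since a MySQL lock can't be shown without a database at hand, here is a sketch of the same "only one run at a time" guard using an advisory file lock with flock() instead. This is a swapped-in technique, not the LOCK query described above, and the lock file name and the acquire_lock()/release_lock() helpers are made up for the example.

```php
<?php
// Try to take an exclusive, non-blocking lock on $path.
// Returns the open file handle on success, false if the lock is taken.
function acquire_lock($path)
{
    $handle = fopen($path, 'c');   // create if missing, never truncate
    if ($handle === false) {
        return false;
    }
    if (!flock($handle, LOCK_EX | LOCK_NB)) {   // fail at once if already locked
        fclose($handle);
        return false;
    }
    return $handle;
}

// Release the lock and close the handle.
function release_lock($handle)
{
    flock($handle, LOCK_UN);
    fclose($handle);
}

$lockFile = sys_get_temp_dir() . '/crawler.lock';   // hypothetical location
$lock = acquire_lock($lockFile);
if ($lock === false) {
    exit("Another crawl is already running.\n");
}

// ... process a batch of queued urls here ...

release_lock($lock);
```

    The lock goes away when the handle is closed or the script ends, so a crashed run should not leave the queue blocked.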

    About your content remark (whether the link is an image or file), it's impossible to tell what kind of document you'll be seeing if you only have an url. Your example (http://www.mywebsite.com/images/image.jpg) might refer to "image.jpg" or "image.jpg/index.php", there's no way of telling which one it is. And that's just the tip of the iceberg. Ever heard of mod_rewrite? It allows you to rewrite any incoming request. For example GET /index.php might be rewritten to GET /somesubdir/image.jpg.

    Luckily there is something that should help you out here. The mime type of the response content is usually sent by the server, in the Content-Type header. The value of this header is a mime type, like (for example) text/html, image/jpeg, image/png. You can use curl_getinfo along with the option CURLINFO_CONTENT_TYPE to find out which content type the content is.

    Obviously this should be done by your crawler when it starts reading a new page. You'll be interested in anything matching image/* (or image/.+ in valid regex). These are the images whose url you can immediately store.
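
    A rough sketch of that check. is_image_content_type() applies the image/.+ pattern to a header value; fetch_content_type() shows one way to read the header with curl by asking for headers only (CURLOPT_NOBODY). Both function names are invented here, and the second is only defined, not called, because it needs network access.

```php
<?php
// True when a Content-Type value names an image type.
// Anything after the type, such as "; charset=...", is ignored.
function is_image_content_type($contentType)
{
    return is_string($contentType)
        && preg_match('#^\s*image/.+#i', $contentType) === 1;
}

// Ask the server for headers only and return the reported content type.
// $ch is an existing curl handle from curl_init().
function fetch_content_type($ch, $url)
{
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_NOBODY, true);           // HEAD request, body is not downloaded
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    return curl_getinfo($ch, CURLINFO_CONTENT_TYPE);  // not a string if no header was sent
}

$checks = array(
    'image/jpeg'               => is_image_content_type('image/jpeg'),
    'image/png'                => is_image_content_type('image/png'),
    'text/html; charset=UTF-8' => is_image_content_type('text/html; charset=UTF-8'),
);
```

    If the same handle is later used to download a page body, the request method has to be switched back first, for example with CURLOPT_HTTPGET.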

  7. #7
    learning_brain is offline x10 Sophmore learning_brain is an unknown quantity at this point
    Join Date
    Apr 2010
    Location
    UK, Midlands
    Posts
    170

    Re: Clearing Dom Object

    Thanks marshian.

    Your locking idea is a good one - I'll check that out.

    The curl_getinfo presents a problem... I don't want to have to open a curl for every link and would prefer to check it before I start the curl object to preserve resources.

    Order:

    Loop through URL pending queue (limit results per load)

    {
    Check if image - if yes, check size, check if exists, save to db
    If not an image, create curl object
    Loop through img elements

    {
    Scrape src and alt info
    Absolutise (is that a word?) src
    check size
    check if exists
    save to db
    refresh mysql connection
    }
    Close curl
    }




    I have however found a very simple solution. I just do a getimagesize() test on the url and if it returns a value, it's a readable image of some type. If not, it's likely to be html. Seeing as I do a getimagesize() check on every link anyway, this seems to fit the bill and works OK with initial testing.

    I have also now split out the url and image searches. The url scrape nearly always returns far more results than the image scrape, which means my list of pending urls was growing stupidly. As split files, I can control the frequency of the crawl types depending on list length.

    I also had to add a mysql_ping() in there as it was disconnecting half way through if no suitable images were found. That seems to work nicely too.

    I'm almost there now but have posted another thread regarding the filtering of dynamically created urls, which are giving me a headache - leading to endless loops of effectively the same page.

    Thanks for all the help.
    Last edited by learning_brain; 06-07-2010 at 02:10 PM.
