Thread: Help with Googlebot eating my bandwidth!

  1. #1
    learning_brain (x10 Sophmore; joined Apr 2010; UK, Midlands; 170 posts)

    Help with Googlebot eating my bandwidth!

    I'm not sure where to put this query.

    I've been checking my logs because I easily get through 20GB of bandwidth per month, which hits my limit early and gets me cut off about 7 days before the end of the month (this is not on x10hosting).

    In the stats, Googlebot alone ate 16.10GB last month before service was terminated 8 days before month end. Yes, that's right: 16.10 GB, not MB!!! In more detail, it consumes about 700,000 KB a day!



    It hit pages 251,496 times in June alone.

    OUCH!!!

    OK - it's a biggish site, currently with 342,000-odd pages and images, all listed dynamically in the sitemap indexes.

    In Google Webmaster Tools I've now set the crawl rate to 200 seconds between requests, but is there any other advice anyone can give (other than disallowing bots in the robots file)?

    Thoughts would be appreciated. I get pretty hacked off when I lose service for days on end.

    Thanks

    Rich
    Last edited by learning_brain; 07-10-2011 at 04:31 PM.

  2. #2
    cybrax (x10 Elder; joined Aug 2009; UK; 699 posts)

    Re: Help with Googlebot eating my bandwidth!

    Well... if the site is making a regular income, now might be the time to upgrade the hosting.

    On the other hand, I would consider swapping in a medium-size image for pages that Googlebot crawls rather than leaving the high-resolution one in place. Not sure how the folk at Mountain View feel about that, so it's probably better to ask them first.

    Without the stats it's hard to say: are these unique crawls, or is the bot repeatedly crawling the same images over and over? Dynamically rewriting the robots file may provide a solution if this is the case.


  3. #3
    essellar (Community Advocate; joined Feb 2010; Toronto, Ontario, CA; 1,153 posts)

    Re: Help with Googlebot eating my bandwidth!

    Well, the number of page hits isn't bad at all for indexing a site that size; the problem is the amount of data per page. I can't see how ~250K pages translates to >16GB unless Google is indexing your high-res images. If the high-res image view pages are part of your site map, I can't see how to claw back the bandwidth without blocking access to your images (non-thumbs) directory with robots.txt. On the other hand, if what you're offering for indexing is just the search results pages, then you should be able to use rel="nofollow" in the links on the thumbnails -- the thumbs will still be indexed (probably multiple times each, since they're likely to turn up on multiple results pages), but the high-res images won't.
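    To put a rough number on "data per page", here is a minimal sketch that divides the Googlebot total by the hit count, both taken from the first post. It assumes the stats package reports binary units (1 GB = 1024^3 bytes), which the thread does not confirm; with decimal units the result comes out a little lower.

```python
# Figures quoted in the first post of this thread.
GOOGLEBOT_GB = 16.10       # bandwidth attributed to Googlebot last month
GOOGLEBOT_HITS = 251_496   # Googlebot hits reported for June

# Assumption: the stats tool uses binary units (1 GB = 1024**3 bytes).
BYTES_PER_GB = 1024 ** 3


def average_kib_per_hit(total_gb: float, hits: int) -> float:
    """Average transfer per hit, in KiB."""
    return total_gb * BYTES_PER_GB / hits / 1024


avg_kib = average_kib_per_hit(GOOGLEBOT_GB, GOOGLEBOT_HITS)
print(f"Average transfer per Googlebot hit: {avg_kib:.1f} KiB")
```

    That works out to roughly 67 KiB per hit under the binary-unit assumption. Comparing that figure with the size of a typical results page plus its thumbnails should show whether the hits are mostly HTML or mostly image data.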

  4. #4
    descalzo (Grim Squeaker; joined Jul 2009; Ankh-Morpork; 7,636 posts)

    Re: Help with Googlebot eating my bandwidth!

    If you have your images in several directories, try excluding the bots from one directory per month.

  5. #5
    lllllllbob61 (x10Hosting Member; joined Jun 2011; USA; 9 posts)

    Re: Help with Googlebot eating my bandwidth!

    Oh wow... that's a lot of bandwidth. I will try to help ya. :-D

    Try enabling file caching with your .htaccess file. That should help, and it should improve page load time too.

    Add these lines to your .htaccess file and save. You can also change the number of days to a number of months.
    (This .htaccess code is for an Apache server, like stoli here at x10. ;-)
    ---------------------------------------------
    Code:
    ## EXPIRES CACHING ##
    <IfModule mod_expires.c>
    ExpiresActive On
    ExpiresByType image/jpg "access 7 days"
    ExpiresByType image/jpeg "access 7 days"
    ExpiresByType image/gif "access 7 days"
    ExpiresByType image/png "access 7 days"
    ExpiresByType text/css "access 7 days"
    ExpiresByType application/pdf "access 7 days"
    ExpiresByType text/x-javascript "access 7 days"
    ExpiresByType application/x-shockwave-flash "access 7 days"
    ExpiresByType image/x-icon "access 7 days"
    ExpiresDefault "access 7 days"
    </IfModule>
    <IfModule mod_headers.c>
    <FilesMatch "\.(js|css|xml|gz)$">
    Header append Vary Accept-Encoding
    </FilesMatch>
    </IfModule>
    <IfModule mod_deflate.c>
    #The following line is enough for .js and .css
    AddOutputFilter DEFLATE js css
    #The following line also enables compression by file content type, for the following list of Content-Type:s
    AddOutputFilterByType DEFLATE text/html text/plain text/xml application/xml
    #The following lines are to avoid bugs with some browsers
    BrowserMatch ^Mozilla/4 gzip-only-text/html
    BrowserMatch ^Mozilla/4\.0[678] no-gzip
    BrowserMatch \bMSIE !no-gzip !gzip-only-text/html
    </IfModule>
    ## EXPIRES CACHING ##
    Check your pages' performance with Google's Page Speed Online: http://pagespeed.googlelabs.com/
    See what your performance score is before and after adding that code.
    (Note: Page Speed Online may take a bit to acknowledge some of your .htaccess updates.)
    There is also a browser plugin available for Firefox and Chrome for instant checks.

    Also, in a sitemap.xml you can set <changefreq>monthly</changefreq> <priority>0.5</priority> for each URL. But in your case that would be way too many URLs, and I would assume a sitemap generator would really eat up your bandwidth.
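    For reference, here is a minimal sketch of what those two tags look like when a sitemap is generated by a script rather than a separate generator tool. It is written in Python purely for illustration (the site in this thread builds its sitemaps in PHP), and the example URL is hypothetical. Whether search engines actually honour changefreq is a separate question that this sketch does not answer.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls, changefreq="monthly", priority="0.5"):
    """Return sitemap XML with the same changefreq/priority on every URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "changefreq").text = changefreq
        ET.SubElement(url, "priority").text = priority
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical URL, for illustration only.
sitemap_xml = build_sitemap(["http://example.com/view_image.php?id=1"])
print(sitemap_xml)
```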

    Let me know if that helps. :-)

  6. #6
    Darkmere (x10 Lieutenant; joined Jun 2011; Fernley, Nevada; 360 posts)

    Re: Help with Googlebot eating my bandwidth!

    I don't think you can control how often the spiders visit your site, can you? But this might work as well: tell the spiders not to send in a download request. I don't remember the command off the top of my head, but if it isn't there, they send a request back to Google to download and index your entire website every time the spiders visit.


  7. #7
    learning_brain (x10 Sophmore; joined Apr 2010; UK, Midlands; 170 posts)

    Re: Help with Googlebot eating my bandwidth!

    Quote: Originally posted by cybrax
    Well... if the site is making a regular income, now might be the time to upgrade the hosting.
    Unfortunately, it's making just enough to cover current hosting.

    Quote: Originally posted by cybrax
    On the other hand, I would consider swapping in a medium-size image for pages that Googlebot crawls rather than leaving the high-resolution one in place. Not sure how the folk at Mountain View feel about that, so it's probably better to ask them first.
    All of the high-res images are (ahem) hotlinked. How is that counting toward bandwidth, I hear you ask... I'm not sure either, but it is. That said, I'm also struggling for server space, with about 1/3 gone already on thumbs alone. Adding a medium-res version would simply blow my storage limit...

    Quote: Originally posted by cybrax
    Without the stats it's hard to say: are these unique crawls, or is the bot repeatedly crawling the same images over and over? Dynamically rewriting the robots file may provide a solution if this is the case.
    Pages are being repeatedly crawled. I'm not sure how a robots file rewrite will help, though. Advice would be good here. I believe the changefreq in the sitemap is ignored by the major engines?

    Quote: Originally posted by essellar
    Well, the number of page hits isn't bad at all for indexing a site that size; the problem is the amount of data per page. I can't see how ~250K pages translates to >16GB unless Google is indexing your high-res images. If the high-res image view pages are part of your site map, I can't see how to claw back the bandwidth without blocking access to your images (non-thumbs) directory with robots.txt. On the other hand, if what you're offering for indexing is just the search results pages, then you should be able to use rel="nofollow" in the links on the thumbnails -- the thumbs will still be indexed (probably multiple times each, since they're likely to turn up on multiple results pages), but the high-res images won't.
    Yes, the high-res images are part of the site, mainly to try to improve SEO. Blocking the view_image.php page (either by blocking that one file in robots.txt or with your suggested nofollow) would do nicely for bandwidth, but it would significantly hurt SEO.

    Quote: Originally posted by descalzo
    If you have your images in several directories, try excluding the bots from one directory per month.
    Nope, sorry - thumbs are in one dir and the high-res images are all hotlinked (and yes, I know that's naughty, but I don't have the space to support hundreds of thousands of high-res images).

    Quote: Originally posted by lllllllbob61
    Oh wow... that's a lot of bandwidth. I will try to help ya. :-D

    Try enabling file caching with your .htaccess file. That should help, and it should improve page load time too.

    Add these lines to your .htaccess file and save. You can also change the number of days to a number of months.
    (This .htaccess code is for an Apache server, like stoli here at x10. ;-)
    ---------------------------------------------
    Code:
    ## EXPIRES CACHING ##
    <IfModule mod_expires.c>
    ExpiresActive On
    ExpiresByType image/jpg "access 7 days"
    ExpiresByType image/jpeg "access 7 days"
    ExpiresByType image/gif "access 7 days"
    ExpiresByType image/png "access 7 days"
    ExpiresByType text/css "access 7 days"
    ExpiresByType application/pdf "access 7 days"
    ExpiresByType text/x-javascript "access 7 days"
    ExpiresByType application/x-shockwave-flash "access 7 days"
    ExpiresByType image/x-icon "access 7 days"
    ExpiresDefault "access 7 days"
    </IfModule>
    <IfModule mod_headers.c>
    <FilesMatch "\.(js|css|xml|gz)$">
    Header append Vary Accept-Encoding
    </FilesMatch>
    </IfModule>
    <IfModule mod_deflate.c>
    #The following line is enough for .js and .css
    AddOutputFilter DEFLATE js css
    #The following line also enables compression by file content type, for the following list of Content-Type:s
    AddOutputFilterByType DEFLATE text/html text/plain text/xml application/xml
    #The following lines are to avoid bugs with some browsers
    BrowserMatch ^Mozilla/4 gzip-only-text/html
    BrowserMatch ^Mozilla/4\.0[678] no-gzip
    BrowserMatch \bMSIE !no-gzip !gzip-only-text/html
    </IfModule>
    ## EXPIRES CACHING ##
    Check your pages' performance with Google's Page Speed Online: http://pagespeed.googlelabs.com/
    See what your performance score is before and after adding that code.
    (Note: Page Speed Online may take a bit to acknowledge some of your .htaccess updates.)
    There is also a browser plugin available for Firefox and Chrome for instant checks.

    Also, in a sitemap.xml you can set <changefreq>monthly</changefreq> <priority>0.5</priority> for each URL. But in your case that would be way too many URLs, and I would assume a sitemap generator would really eat up your bandwidth.

    Let me know if that helps. :-)
    This is great and is certainly something I should do to improve load times. However, I believe Googlebot does not use the cache, as it is trying to discover changed content... not sure on this one. I've already got the FF plugin for site performance.

    As for the sitemap, I have a static sitemap index and several sitemap.php files, each of which is written dynamically depending on the images found. So yes, I could do this, but as I said earlier, I think the major engines ignore the changefreq tag. (I could be wrong.)

    Quote: Originally posted by Darkmere
    I don't think you can control how often the spiders visit your site, can you? But this might work as well: tell the spiders not to send in a download request. I don't remember the command off the top of my head, but if it isn't there, they send a request back to Google to download and index your entire website every time the spiders visit.
    You can control the hit frequency for Google from the Webmaster Tools settings. I've now set mine to 200 seconds between requests and I'll see what that does in reality. I don't know about the other engines, but the second-highest consumer of bandwidth is minute in comparison.

    And yeah - re-indexing the entire site each visit would be catastrophic!!! :D
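    On the 200-second setting: a quick back-of-envelope sketch of what it would allow if it were honoured as a strict, one-request-at-a-time delay (that is an assumption; the thread does not say how Google applies the setting). The June hit count is from the first post. Averaging over a full 30 days is also an assumption, and it understates the real daily rate, since service was cut off before the month ended.

```python
SECONDS_PER_DAY = 24 * 60 * 60

CRAWL_DELAY_SECONDS = 200   # Webmaster Tools setting mentioned in this thread
JUNE_HITS = 251_496         # Googlebot hits reported for June
DAYS_IN_JUNE = 30           # assumption: average over the whole month

# Upper bound on requests per day if the delay is applied strictly.
max_hits_per_day = SECONDS_PER_DAY // CRAWL_DELAY_SECONDS

# What the logs showed on average before the setting was changed.
observed_hits_per_day = JUNE_HITS / DAYS_IN_JUNE

print(f"Cap at one request per {CRAWL_DELAY_SECONDS}s: {max_hits_per_day} hits/day")
print(f"Observed in June: {observed_hits_per_day:.0f} hits/day")
print(f"Implied reduction: about {observed_hits_per_day / max_hits_per_day:.0f}x")
```

    If the crawl rate in the logs does not drop by something like that factor, the setting is probably not being applied as a strict delay.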


    I think, in summary, I'll have to block the crawlers from the high-res image pages, but this then gives me another problem (doesn't it always?). How then do I optimise SEO for the thumbnail results pages to best advantage? I can't really use the usual URL, page title, h1 tags etc., so I need to come up with another solution. In addition, if only the thumbs are being indexed, it's going to severely affect my CTR, as the high-res thing is what the site is about.

    ...hmmmm

    *Scratches head*

    Rich

  8. #8
    essellar (Community Advocate; joined Feb 2010; Toronto, Ontario, CA; 1,153 posts)

    Re: Help with Googlebot eating my bandwidth!

    Hmm -- if the hi-res images are hotlinked (as opposed to piped through your server or stored locally), that should be a bandwidth freebie at your end, so you can take that out of consideration. That means that the "base problem" is that there are a lot of access paths to the images. You can try extending the expiry on the thumbs, as suggested, but you'll probably find that it's not going to save you as much as you'd hoped -- it depends how Google's spider manages cache. Using a CDN (like Cloudflare) might work a lot better, since only the first request for a thumb would actually hit your server. The real difficulty is, then, that you've covered the search space well: if your search criteria were specific enough to make indexing trivial (say, one keyword per image), then your site would be the next best thing to useless.

    If you are piping through (with, say, curl), then you're taking a double bandwidth hit with every request (fulfilling Google's request and making your own for the hi-res image at the original source). You can cut that in half by creating a local copy, but then your storage skyrockets and you're still out-of-pocket a whole bunch to fix it.

  9. #9
    learning_brain (x10 Sophmore; joined Apr 2010; UK, Midlands; 170 posts)

    Re: Help with Googlebot eating my bandwidth!

    Yep - I thought that hotlinked images were freebies too so you must be right. However, I have them as embedded images rather than link-to-.jpg, so it may be counting?? Perhaps I should just do a simple js lightbox type effect - but then I lose the page ranking benefit.

    I am also at a loss as to how Google manages the cache, but if it's anything like mine, it has to load the entire content to check for currency of content. The only other way is if google were to store every page on first crawl and then only look for changes... which I can't see happening.
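    For what it's worth, HTTP does have a standard mechanism for "only look for changes": conditional requests. A client that kept a copy can send an If-Modified-Since header, and the server can answer 304 Not Modified with no body instead of resending the page. As far as I know, Google's webmaster guidelines recommend supporting this header, but whether it helps here depends on the PHP pages tracking a last-modified time and honouring the header, which is an assumption. The function name and dates below are hypothetical; this is a sketch of the decision logic only, not of any particular server.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional


def conditional_status(last_modified: datetime,
                       if_modified_since: Optional[str]) -> int:
    """Return 304 if the client's copy is still current, otherwise 200."""
    if if_modified_since is None:
        return 200  # no conditional header: send the full page
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: fall back to a full response
    if client_time.tzinfo is None:
        # HTTP dates are GMT; treat a zone-less date as UTC.
        client_time = client_time.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= client_time:
        return 304
    return 200


# Hypothetical dates: page last changed on 1 June, crawler copy from 15 June.
page_changed = datetime(2011, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
fresh_header = format_datetime(datetime(2011, 6, 15, tzinfo=timezone.utc), usegmt=True)
stale_header = format_datetime(datetime(2011, 5, 1, tzinfo=timezone.utc), usegmt=True)

print(conditional_status(page_changed, fresh_header))  # copy is newer than the change
print(conditional_status(page_changed, stale_header))  # copy predates the change
```

    A 304 response still costs a request, but it carries only headers, so repeat crawls of unchanged pages would cost far less bandwidth.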

    Double hitting is a problem with AdSense pages. Every time a user hits a page, it also gets hit by Mediapartners-Google (the AdSense crawler) - another blow!

    My own crawler/spider is the only function that uses cURL, but it does run pretty much constantly, so that too is eating into bandwidth. Surely that wouldn't be linked to the Google drain, though??

    The alteration of the google request frequency doesn't seem to have helped much either so no joy there....

    I think for the moment, I'm just going to block google from the view_image page until I can come up with a more permanent solution. It will take a while for the indexed pages to drop off anyway so I have some time to think. It will also tell me if that's where the major hit is coming from.
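    A quick way to sanity-check a rule like that before uploading it is to feed it to Python's standard robots.txt parser. This is a sketch under assumptions: view_image.php is taken to live at the site root, and example.com and search.php are hypothetical stand-ins for the real site and its results page. Python's parser does simple prefix matching, so it only approximates how Googlebot itself reads the file.

```python
from urllib.robotparser import RobotFileParser

# Candidate robots.txt: block only Googlebot, and only the image view page.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /view_image.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs, for illustration only.
image_page = "http://example.com/view_image.php?id=123"
results_page = "http://example.com/search.php?q=cats"

image_page_allowed = parser.can_fetch("Googlebot", image_page)
results_page_allowed = parser.can_fetch("Googlebot", results_page)

print("view_image.php crawlable by Googlebot:", image_page_allowed)
print("results page crawlable by Googlebot:", results_page_allowed)
```

    With this rule the image view page is reported as blocked for Googlebot, while the thumbnail results pages stay crawlable.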

    Rich
