Thread: Filtering Dynamic URL's from URL scrape

  1. #1
    learning_brain is offline x10 Sophmore learning_brain is an unknown quantity at this point
    Join Date
    Apr 2010
    Location
    UK, Midlands
    Posts
    170

    Filtering Dynamic URL's from URL scrape

    My Image crawler is now working but...

    The URL crawl picks up every link, which is fine on static pages. On dynamic pages it is a problem, because exactly the same page content can be served under different URLs.

    e.g.

    http://www.mysite.com/index.php?main...3da01967c1c616

    is likely to be the same as

    http://www.mysite.com/index.php?main...3da01967c1c616

    Now this is tricky, because some of the query-string data (the ?q= part) matters on a dynamically generated site, but a lot of it is irrelevant, such as session IDs.

    So how do I filter out this garbage?

    Rhy

  2. #2
    lemon-tree is offline x10 Minion lemon-tree has a spectacular aura about
    Join Date
    Nov 2007
    Posts
    1,420

    Re: Filtering Dynamic URL's from URL scrape

    Just to let you know, crawlers and scrapers are specifically prohibited by the x10 Terms of Service and can result in a suspension.
    Script Hosting: Space provided by x10Hosting is to be used to create a functional website, we do not allow bots, content scrapers, or any other script that runs continuously on your account. Any scripts that are executed via cron or manually must be directly related to your website.
    Last edited by lemon-tree; 06-06-2010 at 05:17 AM.

  3. #3
    learning_brain is offline x10 Sophmore learning_brain is an unknown quantity at this point
    Join Date
    Apr 2010
    Location
    UK, Midlands
    Posts
    170

    Re: Filtering Dynamic URL's from URL scrape

    Quote Originally Posted by lemon-tree View Post
    Just to let you know, crawlers and scrapers are specifically prohibited by the x10 Terms of Service and can result in a suspension.
    Good job I'm not using their services anymore then :D

    I stopped using x10 months ago, when they deleted a complete MySQL database during a move.

  4. #4
    essellar is offline Community Advocate essellar has a spectacular aura about
    Join Date
    Feb 2010
    Location
    Toronto, Ontario, CA
    Posts
    1,153

    Re: Filtering Dynamic URL's from URL scrape

    There's no way to programmatically determine which part of a link is semantically significant in terms of where the link points (apart from obvious URL parts, as in the case of something like "&session_id=xxxxxxxxxxxxxxxxxxxxx"). You would need to follow the links and compare the results, normally by hashing the returned data and comparing the hash against those of pages already stored under similar URLs.

    Note, though, that different access paths to the same data may result in different HTML presentations (different templates for the same data), which would generate different hashes. And remember that REST-ish URLs (something I tend to use whenever the platform allows it) mean the same data may be reachable through very different URLs, depending on how the user discovered the resource, so merely looking at the URL is insufficient.
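
The hash comparison described above could look roughly like the following Python sketch. The function names and the whitespace normalisation are assumptions made for the example, not anything from a particular crawler, and fetching the pages is left out. As noted, two templates rendering the same data would still hash differently.

```python
import hashlib
import re


def content_fingerprint(html):
    """Return a SHA-256 hex digest of the page text.

    Runs of whitespace are collapsed first (an assumed normalisation),
    so trivial formatting differences do not change the hash.
    """
    normalised = re.sub(r"\s+", " ", html).strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def unique_pages(pages):
    """Map each distinct fingerprint to the first URL that produced it.

    `pages` is a dict of {url: html} that the crawler has already fetched.
    """
    seen = {}
    for url, html in pages.items():
        # setdefault keeps the first URL and ignores later duplicates.
        seen.setdefault(content_fingerprint(html), url)
    return seen
```

Storing only the digest per page keeps the comparison cheap; the expensive part is that every candidate URL still has to be fetched once.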
    “Beware of bugs in the above code; I have only proved it correct, not tried it.” --Donald Knuth
    "It was as if its architects were given a perfectly good hammer and gleefully replied, 'neat! With this hammer, we can build a tool that can pound in nails.'" -- Alex Papadimoulis (on TheDailyWTF.com)

  5. #5
    descalzo is offline Grim Squeaker descalzo has a brilliant future
    Join Date
    Jul 2009
    Location
    Ankh-Morpork
    Posts
    7,636

    Re: Filtering Dynamic URL's from URL scrape

    Why do you think Google etc. want you to have a site map on PHP-driven sites?
    Nothing is always absolutely so.

  6. #6
    KryptosV2 is offline x10Hosting Member KryptosV2 is an unknown quantity at this point
    Join Date
    May 2010
    Posts
    24

    Re: Filtering Dynamic URL's from URL scrape

    Use a regex.replace method. The regular expression can contain normal words to search for, but for the numbers you can use .*? (lazily find any character until the next expression is matched).

    To remove, say, products_id you would use...
    regex.replace(the file, "products_id.*?&", "");
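
Note that the pattern above only matches when another "&" follows the parameter, so it misses a parameter at the end of the query string. A minimal Python sketch of the same idea, parsing the query string instead of using a regex, is below. The list of parameter names to drop is an assumption for illustration; it only helps for names you already know, and `strip_params` is a made-up helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed examples of parameters that do not affect page content.
DROP_PARAMS = {"session_id", "sid", "phpsessid"}


def strip_params(url, drop=DROP_PARAMS):
    """Return `url` without the named query parameters.

    The remaining parameters are sorted so that URLs differing only
    in parameter order compare equal. The fragment is discarded.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in drop
    ]
    kept.sort()
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )
```

Two links can then be treated as duplicates when `strip_params` returns the same string for both.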

  7. #7
    descalzo is offline Grim Squeaker descalzo has a brilliant future
    Join Date
    Jul 2009
    Location
    Ankh-Morpork
    Posts
    7,636

    Re: Filtering Dynamic URL's from URL scrape

    Quote Originally Posted by KryptosV2 View Post
    Use a regex.replace method. The regular expression can contain normal words to search for, but for the numbers you can use .*? (lazily find any character until the next expression is matched).

    To remove, say, products_id you would use...
    regex.replace(the file, "products_id.*?&", "");
    His problem is that he does not know the format of the query string ahead of time. He is working on a form of 'spider' or 'crawler' that will visit various sites automatically.
    Nothing is always absolutely so.

  8. #8
    learning_brain is offline x10 Sophmore learning_brain is an unknown quantity at this point
    Join Date
    Apr 2010
    Location
    UK, Midlands
    Posts
    170

    Re: Filtering Dynamic URL's from URL scrape

    Quote Originally Posted by descalzo View Post
    His problem is that he does not know the format of the query string ahead of time. He is working on a form of 'spider' or 'crawler' that will visit various sites automatically.
    Absolutely!

    Thanks everyone for confirming what I already feared.

    I did think about comparing content against pages already stored, but I would expect that to be a huge drain on resources.

    descalzo - site maps! Why didn't I think of that? Since I can obtain the root address, I can probably also find the site map file (if one exists) on dynamic sites. I'll check for that first and fall back to adding crawled URLs if one isn't found. This is still likely to give me problems, though, so I'll have to purge the URL queue every so often on sites that are multiplying uncontrollably.
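
A rough Python sketch of that plan, under some assumptions: the site map follows the sitemaps.org XML format, it has already been downloaded into a string, and the runaway queue is handled with a simple per-host cap instead of a periodic purge. The names `urls_from_sitemap` and `admit` and the limit of 1000 are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlsplit

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return the URLs listed in <loc> elements of a site map document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text
    ]


def admit(url, per_host, limit=1000):
    """Return True and count the URL unless its host has reached `limit`.

    `per_host` is a Counter shared across calls; the limit is an
    arbitrary assumed value to stop one site flooding the queue.
    """
    host = urlsplit(url).netloc
    if per_host[host] >= limit:
        return False
    per_host[host] += 1
    return True
```

When a site map is found, its URL list replaces link discovery for that site; otherwise each crawled link goes through `admit` before it is queued.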
