Re: Filtering Dynamic URLs from URL scrape
There's no way to determine programmatically which parts of a link are semantically significant, that is, which parts actually decide where the link points (apart from the obvious cases, such as "&session_id=xxxxxxxxxxxxxxxxxxxxx"). You would need to follow the links and compare the results, normally by hashing the returned data and comparing each hash against those of the pages you have already seen at similar URLs.

Note, though, that different access paths to the same data may produce different HTML presentations (different templates for the same data), which would generate different hashes. And remember that REST-ish URLs (something I tend to use whenever the platform allows it) mean the same data may be reachable through very different URLs, depending on how the user discovered the resource, so merely looking at the URL is insufficient.
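For what it's worth, here is a rough Python sketch of the follow-and-hash idea. Everything named in it is my own assumption rather than anything from your scraper: the IGNORED_PARAMS list, the function names, and the choice of SHA-256 are purely illustrative. It strips the obviously meaningless parameters, fetches each page, and groups the URLs whose bodies hash identically.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode
from urllib.request import urlopen

# Hypothetical list: parameters assumed to have no effect on which page is served.
IGNORED_PARAMS = {"session_id", "sid", "phpsessid"}


def strip_obvious_params(url, ignored=IGNORED_PARAMS):
    """Drop the obviously insignificant query parameters and any fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in ignored
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def fetch_body(url):
    """Follow the link and return the raw response body."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def group_by_content(urls, fetch=fetch_body):
    """Group URLs whose fetched bodies are byte-for-byte identical."""
    groups = {}
    for url in urls:
        body = fetch(strip_obvious_params(url))
        digest = hashlib.sha256(body).hexdigest()
        groups.setdefault(digest, []).append(url)
    return list(groups.values())
```

Each list returned by group_by_content holds URLs that returned exactly the same bytes. The caveat above still applies: a byte-level hash treats the same data rendered through two different templates as two different pages, so this only catches exact duplicates. The fetch argument is there so you can plug in your own downloader (or a cache) instead of urlopen.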