That's a pretty good start... Thanks!
Quick questions though... You mention a "list" of URLs to process. This isn't how I was going about it.
Initially, the page opened the submitted URL and scraped it for a hrefs, then img srcs. As part of that loop, it would start another loop to go through all the URLs obtained from the first scrape. Because that is part of the first scrape loop, I can't close the first cURL handle, otherwise I lose the URL list to work on.
Your "list" idea sounds better but I don't know how to go about it.... would this be a separate table in the db which is then accessed by a cron job?
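If I've understood the "list" idea correctly, a minimal sketch might look something like the code below. The function name crawl_queue() and the $fetchLinks callback are placeholders I've made up, not part of my current code: the callback stands in for the cURL fetch plus href scrape, and the queue here is just an in-memory array rather than a db table. It assumes PHP 5.3+ for the closure.

```php
<?php
// Hypothetical helper: process URLs from a queue instead of nested loops.
// $fetchLinks is any callable that takes a URL and returns an array of absolute URLs.
function crawl_queue($startUrl, $fetchLinks, $maxDepth = 2)
{
    $queue   = array(array($startUrl, 1)); // pending array(url, depth) pairs
    $visited = array();                    // url => depth, in processing order

    while (!empty($queue)) {
        list($url, $depth) = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already processed, skip duplicates
        }
        $visited[$url] = $depth;

        if ($depth >= $maxDepth) {
            continue; // last level: process the page but don't queue its links
        }
        foreach ($fetchLinks($url) as $link) {
            if (!isset($visited[$link])) {
                $queue[] = array($link, $depth + 1);
            }
        }
    }
    return $visited;
}

// Usage with a fake "site" so nothing is fetched over the network.
$site = array(
    'a' => array('b', 'c'),
    'b' => array('a', 'd'),
    'c' => array(),
    'd' => array('e'),
);
$fetch = function ($url) use ($site) {
    return isset($site[$url]) ? $site[$url] : array();
};
$result = crawl_queue('a', $fetch, 2);
```

Since each URL is taken off the queue and handled on its own, each fetch could open and close its own cURL handle. Presumably the array could be swapped for a db table if a cron job needs to pick up where the last run stopped.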
Your timeout idea is a bit confusing. Surely this will limit the time the processing has to complete, not extend it... which is what I would need for a complete 1st level/2nd level url scrape.
I'm determined to get this cracked.
Current code below..
PHP Code:
<?php
require_once('Connections/freewebhost.php');//connection parameters
mysql_select_db($database_freewebhost, $freewebhost);//select mysql database
require_once('functions/url_to_absolute.php');//function to resolve absolute urls
require_once('functions/getmysqlvaluestring.php');//function to sanitise string prior to mysql insertion
if (isset($_POST['domain_url']))//if a url is submitted
{
//print another form
echo '
<form id="form1" name="form1" method="post" action="">
<label> Submit another Page URL
<input type="text" name="domain_url" id="domain_url" />
</label>
<label>
<input type="submit" name="button" id="button" value="Submit" />
</label>
</form>
';
//define url to search
$domain_url = $_POST['domain_url'];
echo "<div style='background-color: #ccc; border:1px solid #ccc; padding: 10px; margin-left:10px; margin-top:10px;'>";
echo "<hr/><strong>Crawling: ".htmlspecialchars($domain_url)."</strong><hr/>";
//------------------------------------------------CURL------------------------------------------------
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$domain_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
//curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
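$html = curl_exec($ch);
if (!$html)
{
echo "<br />cURL error number:" .curl_errno($ch);
echo "<br />cURL error:" . curl_error($ch);
curl_close($ch);
exit;
}
curl_close($ch);//the page is now held in $html, so the handle is no longer needed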
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$img = $xpath->evaluate("/html/body//img");
//------------------------------------------------END CURL------------------------------------------------
//-------------------------------------start domain url loop-------------------------------
for ($i = 0; $i < $img->length; $i++)
{
echo "<div style='background-color: #fff; border:1px solid #ccc; padding: 10px; margin-left:10px; margin-top:10px;'>";
$imgTags = $img->item($i);
$imageUrl = url_to_absolute($domain_url, $imgTags->getAttribute('src'));
$imageAlt = $imgTags->getAttribute('alt');
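//escape scraped values before echoing them back into the page
echo '<br/><img src="'.htmlspecialchars($imageUrl).'"/><br/>';
echo "Image URL: ".htmlspecialchars($imageUrl)."<br/>";
echo "Image Keywords: ".htmlspecialchars($imageAlt)."<br/>";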
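//get image size (getimagesize returns false if the image cannot be read)
$imgSize=@getimagesize($imageUrl);
if($imgSize!==false)
{
echo "Size: ".$imgSize[0]."x".$imgSize[1]."<br/>";
}
//if image unreadable or too small
if($imgSize===false || $imgSize[0]<300 || $imgSize[1]<300)
{
echo " Image unreadable or less than 300 x 300px. Skipping......";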
//else if image is size specified
} else {
echo " Size OK";
//check if already exists in DB
if(mysql_num_rows(mysql_query("SELECT URL FROM IMAGES WHERE URL = ".GetSQLValueString($imageUrl, "text"), $freewebhost)))//escape the scraped URL before it goes into the query
{
echo "Image URL exists. Skipping.....";
} else {
echo "<br/>Image URL does not exist in database: Saving.....";
//insert into DB
$insertSQL = sprintf("INSERT INTO IMAGES (URL, KEYWORDS) VALUES (%s, %s)",
GetSQLValueString($imageUrl, "text"),
GetSQLValueString($imageAlt, "text"));
$Result1 = mysql_query($insertSQL, $freewebhost) or trigger_error('Query failed');
}
}//end image test
echo "</div>";
}//-------------------------------------end domain url loop-------------------------------
echo "<br/>Page crawl complete.";
echo "</div>";
} else {//if form is not submitted, print initial form
echo '
<p><a href="index.php">Back to Image Search</a></p>
<form id="form1" name="form1" method="post" action="">
<label> Submit full Page URL
<input type="text" name="domain_url" id="domain_url" />
</label>
<label>
<input type="submit" name="button" id="button" value="Submit" />
</label>
</form>
';
}
?>