+ Reply to Thread
Results 1 to 5 of 5

Thread: Improved Search Algorythm (Alternative to FULLTEXT)

  1. #1
    learning_brain is offline x10 Sophmore learning_brain is an unknown quantity at this point
    Join Date
    Apr 2010
    Location
    UK, Midlands
    Posts
    170

    Improved Search Algorythm (Alternative to FULLTEXT)

    I have an image search engine (not on this server) which is now crawling nicely, but the search function leaves much to be desired!

    Currently, I'm using the fulltext MATCH AGAINST system; which works nicely but has two major limitations:

    1) My host limits me to a 4 character search (all 3 character strings are ignored)
    2) It doesn't support the %searchstring% feature.

    Firstly, I want a 3 character search and 2nd, I have urls which I want to use the %searchstring% functionality.

    In addition, it has be be sorted by relevance!!!

    OK..... so I have done a lot of work on this and have come up with something that works but I'm VERY concerned about efficiency. With only a thousand or so records, this would be OK, but when the index grows to millions, this is going to be a headache.

    Wondering if there is a better way of doing it?

    PHP Code:
    <?php
    mysql_connect
    ("blah de blah") or die(mysql_error());
    mysql_select_db("DB") or die(mysql_error());

    ?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Search</title>
    </head>

    <body>
    <form id="form1" method="post" action="">
      <label>
        <input type="text" name="search" id="search" />
      </label>
      <label>
        <input type="submit" name="button" id="button" value="Submit" />
      </label>
    </form>

    <?php
    $q
    =strtolower(strip_tags($_POST['search']));

    echo 
    $q."<br/>";

    $q_array explode(" "$q);
    $q_num count($q_array);

    print_r($q_array);

           
    $query "SELECT ID, IMG_URL, KEYWORDS FROM IMAGES"//full table content query
           
    $result mysql_query($query); //action result
           
    $num_results mysql_num_rows($result); //count rows
           
           
    for ($i 0;$i $num_results$i++) { //loop through rows
                  
                  
    $row mysql_fetch_object($result); //get row content as object
                        
                  //assign object values to variables
                  
    $id $row->ID;
                  
    $img_url strtolower(strip_tags($row->IMG_URL));
                  
    $img_url_FC $row->IMG_URL_FC;
                  
    $keywords strtolower(strip_tags($row->KEYWORDS));
                  
    $keywords_FC $row->KEYWORDS_FC;
                  
                  
    //------------------------------------------SCORING----------------------------------------
                  
                  
    $relevancy 0;  //set relevance to zero
                  
                  //--------------------SEARCH THROUGH 1ST COLUMN---------------------------
                  //score for exact string
                  
    $relevancy $relevancy + (substr_count($img_url$q))*10;
                        
                  for (
    $a 0$a $q_num$a++)
                  {
                          
    //count characters.  if less than 3, ignore
                        
    if(strlen($q_array[$a])>2){
                             
    //gives score of 1 for each occasion words appear
                             
    $relevancy $relevancy substr_count($img_url$q_array[$a]);
                        }
                  } 
                  
                  
    //--------------------SEARCH THROUGH 2ND COLUMN---------------------------
                  //score for exact string
                  
    $relevancy $relevancy + (substr_count($keywords$q))*10;
                  
                  for (
    $a 0$a $q_num$a++)
                  { 
                        
    //count characters.  if less than 3, ignore
                        
    if(strlen($q_array[$a])>2){ 
                             
    //gives score of 1 for each occasion words appear
                             
    $relevancy $relevancy substr_count($keywords$q_array[$a]);
                          }
                  }
                  
    //------------------------------------------END SCORING----------------------------------------
                  
                  
                  //add id to relevancy
                  
    $relevancy $relevancy.".".$id;
                  
                  
    //assign results to array
                  
    if ($relevancy>0.999999999999999999999999999999){
                      
                    
    $processed_array[$relevancy]["IMG_URL_FC"] = $img_url_FC;
                      
    $processed_array[$relevancy]["KEYWORDS_FC"] = $keywords_FC;
                      
                  }
                  
           } 
    //end for ($i = 0;$i < $num_results; $i++) { //loop through rows
           
    echo "<br/>";
           
           
    krsort($processed_array);
           
           
    $count_processed_array count($processed_array);
           echo 
    "<br/>".$count_processed_array." results<br/>";
           
           foreach(
    $processed_array as $relevance=>$val){
               echo 
    "<br/>";
               echo 
    "Relevance: ".floor($relevance)."<br/>";
               
               echo 
    $processed_array [$relevance] [IMG_URL_FC].": ";
               echo 
    $processed_array [$relevance] [KEYWORDS_FC]."<br/>";

            }


     
    ?>
    </body>
    </html>

  2. #2
    misson is offline x10 Spammer misson is a jewel in the rough
    Join Date
    Mar 2008
    Location
    Libertatia
    Posts
    2,506

    Re: Improved Search Algorythm (Alternative to FULLTEXT)

    Quote Originally Posted by learning_brain View Post
    My host limits me to a 4 character search (all 3 character strings are ignored)
    This is the default limit is imposed by MySQL.

    Quote Originally Posted by learning_brain View Post
    It doesn't support the %searchstring% feature.
    Do you mean you also want to search for exact matches of the search string?

    You're right, when the DB gets too large, performance will be terrible. Since you're only using the DB for persistent storage, the DB isn't used to its full potential.

    Here's some ideas on how to approach this, rather than a full solution.
    • Think about writing a stored routine rather than a PHP function.
    • In natural language mode for fulltext, you can use AS SCORE to get the relevance MySQL calculates.
    • Divide the search string into terms longer than three characters and terms three characters long (discard terms of less than three characters). Pass these two values, along with the original search string, to the stored routine.
    • For terms of >3 characters, use the standard fulltext search. For terms of 3 characters, use a regexp (note scoring these will be difficult). For the full string match, use LIKE, giving matched records some constant score.
    • For each row, combine its scores, discard those rows with scores that are too low, sort & return.

    For the above, temporary tables and joins are your friend.
    Be sure to read all pages linked in this post; they have further information that should prove useful. When asking for help, make sure you follow Eric Raymond's and Jon Skeet's guidelines for prompt, accurate responses. Please answer any questions I ask; they're not rhetorical (probably). Any posted code is intended as illustrative example, rather than a solution to your problem to be copied without alteration. Study it to learn how to write your own solution.
    Misson, not Mission.

  3. #3
    learning_brain is offline x10 Sophmore learning_brain is an unknown quantity at this point
    Join Date
    Apr 2010
    Location
    UK, Midlands
    Posts
    170

    Re: Improved Search Algorythm (Alternative to FULLTEXT)

    I knew you wouldn't let me down!!

    I'm also glad you understand the scope of my challenge. There are some things I need to clarify...

    As you may/may not remember, I have an image search engine and I am currently capturing 3 strings... crawled url, image url and keywords found in 'alt'.

    This means I have lots of continuous strings i.e. images/wallpaperslordoftherings/amonhen.jpg

    The fulltext search does not distinguish "the rings" or "wallpapers" from this string which is why I wanted to use a wildcard either side of each split search word. From what I understand, the natural language mode does not do this either and is still limited to the default 4 character minimum string search (not good). To be honest, I went for the fulltext search becasue of the relavance scoring, which suits me well, but the same issues remain unless my host will change the fulltext setting and then re-boot the MySQL server (unlikely).

    I know what you mean about using a regexp for the <4 strings, but as you said, this will be very tricky to pull off convincingly without lots of messy coding.

    I have, in the absence of a suitable solution, refined the existing code and will certainly read up on stored routines if it helps later on (now starting reading up on it and it looks like a good solution).

    I now however have another issue which is blocking my progess a little.... new thread methinks...

    Thank you very much for your time spent on this. It is much appreciated.

  4. #4
    misson is offline x10 Spammer misson is a jewel in the rough
    Join Date
    Mar 2008
    Location
    Libertatia
    Posts
    2,506

    Re: Improved Search Algorythm (Alternative to FULLTEXT)

    To generate a search string for all the keywords:
    PHP Code:
    $fullsearch implode('%'$keywords); 
    However, the order of the keywords is significant. If you don't want this to be the case, you could also generate every permutation of every combination of search terms. You'll have to write a function to do this. I wouldn't recommend it, since it will generate sum(i=1..n; n!/i!) terms in as much time. Three terms goes to ten, which isn't so bad, but four balloons to 41, five to 206 and six to 1237. The increase is more than exponential. The only mitigating factor is that people aren't too likely to enter more than six keywords.

    Another option is to use both approaches: use a stored routine with fulltext, regexps and LIKE matching to simply cut down on the number of results the query returns, then use PHP to handle the final scoring & sorting. The regexp MySQL uses doesn't need to be too tricky:
    PHP Code:
    $threes '[[:<:]](' implode('|'array_filter($keywords, function($word) {return strlen($word) == 3;}) ) . ')[[:>:]]'
    Note that this requires PHP 5.3 for anonymous functions; for 5.2, you can use create_function. You've now got a MySQL regexp for the three-character keywords. It's probably better to add the word boundary markers within the stored routine.

    If you don't care to use MySQL's fulltext scores, you can search for all rows containing at least one keyword:
    PHP Code:
    $searchQuery $db->prepare(
       
    'SELECT id, img_url, keywords 
          FROM images 
          WHERE img_url RLIKE :keywords 
            OR keywords RLIKE :keywords'
    );
    ...

    $keywords preg_split("/[^\w']+/"$_REQUEST['keywords']);
    $terms preg_replace("/'/""'?"
             '[[:<:]](' 
             
    implode('|'array_filter($keywords
                                         function(
    $word) {return strlen($word) >= 3;}) ) 
            . 
    ')[[:>:]]'
        
    );
    $searchQuery->execute(array(':keywords' => $terms)); 
    With this option, you don't use a stored routine and PHP still handles scoring & sorting.
    Be sure to read all pages linked in this post; they have further information that should prove useful. When asking for help, make sure you follow Eric Raymond's and Jon Skeet's guidelines for prompt, accurate responses. Please answer any questions I ask; they're not rhetorical (probably). Any posted code is intended as illustrative example, rather than a solution to your problem to be copied without alteration. Study it to learn how to write your own solution.
    Misson, not Mission.

  5. #5
    learning_brain is offline x10 Sophmore learning_brain is an unknown quantity at this point
    Join Date
    Apr 2010
    Location
    UK, Midlands
    Posts
    170

    Re: Improved Search Algorythm (Alternative to FULLTEXT)

    This is a lot to get my head round and areas I'm not totally comfortable with yet.

    Your suggestion of implode isn't suitable becasue I don't want to use the entire string to search (unless I do that as well as what I already have... hmmm). A search string may be "Lord of the rings" or even "Rings belonging to a lord" - which is why I wanted to loop through each word.

    I love the option of minimising the MySQL result before processing - this will greatly improve efficiency.

    Just need to figure out caching now as well!

+ Reply to Thread

Similar Threads

  1. The used table type doesn't support FULLTEXT indexes
    By ptsettings in forum Free Hosting
    Replies: 0
    Last Post: 05-01-2010, 08:19 PM
  2. new graphical layout and improved functions!!!
    By gptsven in forum Review My Site
    Replies: 9
    Last Post: 03-18-2010, 09:42 AM
  3. Improved Service Performance
    By pentium42006 in forum Feedback and Suggestions
    Replies: 4
    Last Post: 07-20-2009, 06:38 PM
  4. Replies: 0
    Last Post: 12-21-2008, 12:13 AM
  5. [improved] Ip Banning Script
    By Brandon in forum Tutorials
    Replies: 0
    Last Post: 01-15-2006, 09:13 AM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
x10hosting free hosting for the masses
dedicated servers