(Project) Designing and Building a Simple Search Engine with PHP, MySql

I hope this will encourage you to develop an engine that suits your particular needs, with the exact features you desire.
Database Design and Logic
We'll use MySQL as a database backend to store our search data. It's possible to shell out to Unix commands such as grep and find, but that would mean running the search engine on the machine hosting the files. As well, it would be more difficult to index pages served from a database. We'll tackle the database first.
The database for the search engine consists of three tables: page, word, and occurrence. page holds all indexed web pages, and word holds all of the words found on the indexed pages. The rows in occurrence correlate words to their containing pages. Each row represents one occurrence of one particular word on one particular page. The SQL for creating these tables are shown below.

CREATE TABLE page (
   page_id int(10) unsigned NOT NULL auto_increment,
   page_url varchar(200) NOT NULL default '',
   PRIMARY KEY (page_id)
) TYPE=MyISAM;

CREATE TABLE word (
   word_id int(10) unsigned NOT NULL auto_increment,
   word_word varchar(50) NOT NULL default '',
   PRIMARY KEY (word_id)
) TYPE=MyISAM;

CREATE TABLE occurrence (
   occurrence_id int(10) unsigned NOT NULL auto_increment,
   word_id int(10) unsigned NOT NULL default '0',
   page_id int(10) unsigned NOT NULL default '0',
   PRIMARY KEY (occurrence_id)
) TYPE=MyISAM;
While page and word hold actual data, occurrence acts only as a reference table. By joining occurrence with page and word, we can determine which pages contain a word, as well as how many times the word occurs. Before that, though, we need some data.

Populating The Database
Okay, the database is created and we're ready to feed it some content. For this, we'll create a PHP script that takes a user-specified URL, reads the document representing the URL, extracts words from the page, and creates records in the database. Here's the program:

<?
/*
* populate.php
*
* Script for populating the search database with words,
* pages and word-occurences.
*/

/* Connect to the database: */
mysql_pconnect("localhost","root","secret")
    or die("ERROR: Could not connect to database!");

mysql_select_db("test");

/* Define the URL that should be processed: */

$url = addslashes( $_GET['url'] );

if( !$url )
{
   die( "You need to define a URL to process." );
}
else if( substr($url,0,7) != "http://" )
{
   $url = "http://$url";
}

/* Does this URL already have a record in the page-table? */
$result = mysql_query("SELECT page_id FROM page WHERE page_url = \"$url\"");
$row = mysql_fetch_array($result);

if( $row['page_id'] )
{
   /* If yes, use the old page_id: */
   $page_id = $row['page_id'];
}
else
{
   /* If not, create one: */
   mysql_query("INSERT INTO page (page_url) VALUES (\"$url\")");
   $page_id = mysql_insert_id();
}

/* Start parsing through the text, and build an index in the database: */
if( !($fd = fopen($url,"r")) )
   die( "Could not open URL!" );

while( $buf = fgets($fd,1024) )
{
   /* Remove whitespace from beginning and end of string: */
   $buf = trim($buf);

   /* Try to remove all HTML-tags: */
   $buf = strip_tags($buf);
   $buf = ereg_replace('/&\w;/', '', $buf);

   /* Extract all words matching the regexp from the current line: */
   preg_match_all("/(\b[\w+]+\b)/",$buf,$words);

   /* Loop through all words/occurrences and insert them into the database: */
   for( $i = 0; $words[$i]; $i++ )
   {
      for( $j = 0; $words[$i][$j]; $j++ )
      {
         /* Does the current word already have a record in the word-table? */
         $cur_word = addslashes( strtolower($words[$i][$j]) );

         $result = mysql_query("SELECT word_id FROM word
                                WHERE word_word = '$cur_word'");
         $row = mysql_fetch_array($result);
         if( $row['word_id'] )
         {
            /* If yes, use the old word_id: */
            $word_id = $row['word_id'];
         }
         else
         {
            /* If not, create one: */
            mysql_query("INSERT INTO word (word_word) VALUES (\"$cur_word\")");
            $word_id = mysql_insert_id();
         }

         /* And finally, register the occurrence of the word: */
         mysql_query("INSERT INTO occurrence (word_id,page_id)
                      VALUES ($word_id,$page_id)");
         print "Indexing: $cur_word<br>";
      }
   }
}

fclose($fd);

?>

Note the use of addslashes() when assigning values to $url and $cur_word. Since these variables are used within SQL queries, it's very important to make sure they're free from harmful data. Imagine if someone called this script with a URL of "; DROP TABLE FOO;. As addslashes() quotes database special characters, someone of malicious bent would have to go to much more work to exploit the database.
This script connects to the database, registering the page in the database if it does not exist, and starts to retrieve data. The preg_match_all() call extracts all words (matching the regular expression) from the page. Each word must be recorded in the occurrence and word table.
When building the index, only three SQL INSERT statements actually matter. When a page is first indexed, it must be recorded:
INSERT INTO page (page_url) VALUES ("http://www.onlamp.com/");
The first occurrence of a word within the entire dataset must be recorded:
INSERT INTO word (word_word) VALUES ("linux");
Each occurrence of a word within a page must be recorded:
INSERT INTO occurrence (word_id,page_id) VALUES ($word_id,$page_id);
Let's now index a few pages. The seven sites that makes up the O'Reilly Network is probably a good idea. We will call populate.php with the site URLs as the only argument, one at a time:
http://localhost/populate.php?url=http://www.macdevcenter.com/
http://localhost/populate.php?url=http://www.onjava.com/
http://localhost/populate.php?url=http://www.onlamp.com/
http://localhost/populate.php?url=http://www.openp2p.com/
http://localhost/populate.php?url=http://www.osdir.com/
http://localhost/populate.php?url=http://www.perl.com/
http://localhost/populate.php?url=http://www.xml.com/
A quick investigation of the tables now should result in something like this:

mysql> SELECT * FROM page;
+---------+------------------------------+
| page_id | page_url                     |
+---------+------------------------------+
| 1       | http://www.macdevcenter.com/ |
| 2       | http://www.onjava.com/       |
| 3       | http://www.onlamp.com/       |
| 4       | http://www.openp2p.com/      |
| 5       | http://www.osdir.com/        |
| 6       | http://www.perl.com/         |
| 7       | http://www.xml.com/          |
+---------+------------------------------+
7 rows in set (0.00 sec)

mysql> SELECT COUNT(*) FROM word;
+----------+
| COUNT(*) |
+----------+
| 2423     |
+----------+
1 row in set (0.00 sec)

mysql> SELECT COUNT(*) FROM occurrence;
+----------+
| COUNT(*) |
+----------+
| 20628    |
+----------+
1 row in set (0.00 sec)
The database now has seven indexed pages, resulting in 2,423 unique words and 20,628 occurrences. We're ready for the next level, the search interface.

The Search Interface
Of course, users will not be able to work with the MySQL database directly. Therefore, we'll create another PHP script that provides an HTML form to query the database. This works just like any other search engine. The user enters a word in a textbox, hits Enter, and receives a page of results linked to the appropriate pages. The result order depends on the number of times a keyword appears in each document. The search.php script is listed below.

<?

/*
* search.php
*
* Script for searching a database populated with keywords by the
* populate.php-script.

*/

print "<html><head><title>My Search Engine</title></head><body>\n";

if( $_POST['keyword'] )
{
   /* Connect to the database: */
   mysql_pconnect("localhost","root","secret")
       or die("ERROR: Could not connect to database!");
   mysql_select_db("test");

   /* Get timestamp before executing the query: */
   $start_time = getmicrotime();

   /* Set $keyword and $results, and use addslashes() to
    *  minimize the risk of executing unwanted SQL commands: */
   $keyword = addslashes( $_POST['keyword'] );
   $results = addslashes( $_POST['results'] );

   /* Execute the query that performs the actual search in the DB: */
   $result = mysql_query(" SELECT p.page_url AS url,
                           COUNT(*) AS occurrences
                           FROM page p, word w, occurrence o
                           WHERE p.page_id = o.page_id AND
                           w.word_id = o.word_id AND
                           w.word_word = \"$keyword\"
                           GROUP BY p.page_id
                           ORDER BY occurrences DESC
                           LIMIT $results" );

   /* Get timestamp when the query is finished: */
   $end_time = getmicrotime();

   /* Present the search-results: */
   print "<h2>Search results for '".$_POST['keyword']."':</h2>\n";
   for( $i = 1; $row = mysql_fetch_array($result); $i++ )
   {
      print "$i. <a href='".$row['url']."'>".$row['url']."</a>\n";
      print "(occurrences: ".$row['occurrences'].")<br><br>\n";
   }

   /* Present how long it took the execute the query: */
   print "query executed in ".(substr($end_time-$start_time,0,5))." seconds.";
}
else
{
   /* If no keyword is defined, present the search page instead: */
   print "<form method='post'> Keyword:
          <input type='text' size='20' name='keyword'>\n";
   print "Results: <select name='results'><option value='5'>5</option>\n";
   print "<option value='10'>10</option><option value='15'>15</option>\n";
   print "<option value='20'>20</option></select>\n";

   print "<input type='submit' value='Search'></form>\n";
}

print "</body></html>\n";

/* Simple function for retrieving the current timestamp in microseconds: */
function getmicrotime()
{
   list($usec, $sec) = explode(" ",microtime());
   return ((float)$usec + (float)$sec);
}

?>

The script may be called with or without the keyword argument. If it's defined, the script searches for that word in the database. It will also show the length of time it took to process the query. Otherwise, the script presents the search page instead.

Comments

Popular posts from this blog

AGRICULTURAL SCIENTIST RECRUITMENT BOARD

HSC Exam Time Table 2013 Arts, Commerce, Science

CIVIL SERVICES' (I.A.S.) EXAMINATION