PHP Function to Create Keyword Tags from Text

I’m building a few websites at the moment that require me to extract suggested keywords or tags from a string of text. The PHP function below does this, with a few extra options.

Creating the Keywords/Tags

Given the text body, I wanted to accomplish the following:

  • I wanted to create keywords from every word in the text.
  • I wanted a ‘blacklist’ of $commonWords that I could remove from the returned keyword array.
  • I wanted to compare the extracted keywords (the $words array) with an array of permitted keywords, so that only keywords that are also in the $allowedWords array are returned.
  • I wanted to restrict keyword output to words of at least ‘n’ characters in length.
  • I needed to limit keywords to those that appear at least ‘n’ times in the submitted text.
  • I wanted to specify how many keywords, in total, would be returned.
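
The blacklist and permitted-list ideas in the list above can be sketched on their own. This is only a minimal illustration: the sample $words, $commonWords and $allowedWords values are made up for the example, and array_values() is added here just to reindex the results.

```php
<?php
// Hypothetical sample data, for illustration only
$words        = array('The', 'engine', 'and', 'the', 'Boeing', 'wing');
$commonWords  = array('the', 'and');
$allowedWords = array('engine', 'boeing');

// Blacklist: drop every word that matches a common word, ignoring case
$withoutCommon = array_values(array_udiff($words, $commonWords, 'strcasecmp'));

// Permitted list: keep only words that match an allowed word, ignoring case
$onlyAllowed = array_values(array_uintersect($words, $allowedWords, 'strcasecmp'));

echo implode(', ', $withoutCommon), "\n"; // engine, Boeing, wing
echo implode(', ', $onlyAllowed), "\n";   // engine, Boeing
```

Using strcasecmp as the comparison callback means ‘The’ and ‘the’ are treated as the same word, so the lists themselves only need lowercase entries.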

Example

Take, for example, the following block of text (extracted from another blog post):

Many systems that traditionally had a reliance on the pneumatic system have been transitioned to the electrical architecture. They include engine start, APU start, wing ice protection, hydraulic pumps and cabin pressurisation. The only remaining bleed system on the 787 is the anti-ice system for the engine inlets. In fact, Boeing claims that the move to electrical systems has reduced the load on engines (from pneumatic hungry systems) by up to 35 percent (not unlike today’s electrically powered flight simulators that use 20% of the electricity consumed by the older hydraulically actuated flight sims).

Usage:

echo extract_keywords($text, 3, 2, false, 8, true);

Output: ice, pneumatic, engine, electrical

The extracted keywords aren’t ideal… but they are a good starting point for ‘suggested’ tags that the end user can refine.

The PHP Function

You can download the complete PHP function below. The $commonWords array is several hundred words long, so it wasn’t practical to reproduce it in full in this post.

<?php
function extract_keywords($str, $minWordLen = 3, $minWordOccurrences = 2, $asArray = false, $maxWords = 8, $restrict = false)
{
    // Lowercase first so that counting and list comparisons ignore case
    $str = strtolower($str);

    // Replace everything except letters, digits and spaces, then collapse whitespace.
    // The "u" modifier treats the input as UTF-8, so curly quotes are stripped cleanly.
    $str = preg_replace('/[^\p{L}0-9 ]/u', ' ', $str);
    $str = trim(preg_replace('/\s+/', ' ', $str));

    $words = explode(' ', $str);

    if (!$restrict) {
        // Unrestricted: any word may become a tag, so remove the common words
        // Full list of common words in the downloadable code
        $commonWords = array('a','able','about','above','abroad','according');
        $words = array_udiff($words, $commonWords, 'strcasecmp');
    } else {
        // Restricted: only words in the $allowedWords array may become tags
        $allowedWords = array('engine','boeing','electrical','pneumatic','ice');
        $words = array_uintersect($words, $allowedWords, 'strcasecmp');
    }

    // Count the occurrences of each word that meets the minimum length
    $keywords = array();
    foreach ($words as $c_word) {
        if (strlen($c_word) < $minWordLen) continue;

        if (array_key_exists($c_word, $keywords)) $keywords[$c_word][1]++;
        else $keywords[$c_word] = array($c_word, 1);
    }

    // Sort by count, highest first. A closure is used instead of a nested named
    // function, which would be redeclared (a fatal error) on a second call.
    usort($keywords, function ($first, $sec) {
        return $sec[1] - $first[1];
    });

    $final_keywords = array();
    foreach ($keywords as $keyword_det) {
        if ($keyword_det[1] < $minWordOccurrences) break;
        array_push($final_keywords, $keyword_det[0]);
    }
    $final_keywords = array_slice($final_keywords, 0, $maxWords);
    return $asArray ? $final_keywords : implode(', ', $final_keywords);
}
?>

Notes on Usage

In the first example above, $restrict was set to true. Had it been set to false, the tags returned would be system, systems, engine, start, ice. This is because we would then only be omitting the $commonWords from the result (and considering every other word). The result is less accurate than comparing against a preferred keyword array.

The most accurate results are obtained from refining the $allowedWords array and including as many subject-specific words as possible to cover all preferred tags.

$minWordLen determines which words are considered. In our case, any word shorter than 3 characters is ignored.

$minWordOccurrences determines how many times a word must appear in the text before it can be considered for inclusion in the returned keywords.
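
To see what these two thresholds do in isolation, here is a rough sketch. It uses the built-in array_count_values() and arsort() rather than the counting loop in the function above, and the sample word list is invented for the example.

```php
<?php
// Hypothetical input, for illustration only
$words = array('ice', 'engine', 'ice', 'of', 'engine', 'wing', 'of', 'ice');
$minWordLen = 3;
$minWordOccurrences = 2;

// Drop words shorter than $minWordLen ('of' is removed here)
$longEnough = array_filter($words, function ($word) use ($minWordLen) {
    return strlen($word) >= $minWordLen;
});

// Count occurrences, then sort by count, highest first
$counts = array_count_values($longEnough);
arsort($counts);

// Keep only words that appear at least $minWordOccurrences times
$frequent = array_filter($counts, function ($count) use ($minWordOccurrences) {
    return $count >= $minWordOccurrences;
});

print_r($frequent); // ice => 3, engine => 2
```

Here ‘of’ fails the length test and ‘wing’ fails the occurrence test, so only ‘ice’ and ‘engine’ remain.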

$asArray specifies whether the keywords are returned as a comma-separated string (false) or as an array (true).

$maxWords determines the maximum number of words to return in the keyword string.
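
The last two options come down to array_slice() and implode(). The sketch below assumes a keyword list that has already been sorted by frequency (the list itself is made up) and shows both output forms, including one keyword per line, which is what the array form makes possible.

```php
<?php
// Hypothetical keyword list, assumed to be sorted by frequency already
$keywords = array('lazy', 'quick', 'brown', 'fox', 'jumped', 'dog');
$maxWords = 5;

// Keep at most $maxWords keywords
$limited = array_slice($keywords, 0, $maxWords);

// String form: a single comma-separated line
$asText = implode(', ', $limited);

// Array form: loop over the keywords, here building one keyword per line
$asLines = '';
foreach ($limited as $keyword) {
    $asLines .= $keyword . "\n";
}

echo $asText, "\n";
echo $asLines;
```

The same loop could just as easily emit table rows or list items; escape each keyword (for example with htmlspecialchars()) if it ends up in HTML.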

Advanced Usage

Consider the following block of text:

$exampletext = "The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog. The slow orange cat walked over the lazy owner. The quick brown fox jumped over the lazy dingo.";

echo extract_keywords($exampletext, 3, 1, false, 5, false);

Returns: lazy, dog, jumped, fox, brown

Download: Extract Keywords and Tags from Text
Description: Extract Keywords and Tags from Text
Author: Marty
Category: PHP code
Date: July 21, 2012




Comments

  1. Marilyn says:

    Hi Marty, is there any coding or tutorial in php where I can evaluate the accuracy of part of speech tagging? Thanks in advance.

  2. Marilyn says:

    For example, I have this sentence, ‘The quick brown fox jumped over the lazy dog’. Then, I tag all the words using pos tagger based on the lexicon and rules and it became like this ‘The/DT quick/NN brown/NN fox/NN jumped/VB over/NN the/DT lazy/NN dog/NN’. So, based on the tagging, how can I calculate the percentage of correct tagged words in that sentence to find out how accurate my tagging is. How can I know the words are correctly tagged or not. Thanks.

    • Marty says:

      Okay… this is something I’ll have to think about. Is it possible to ‘construct’ each word/component in an array and then do a find and replace? Could you compare words against, say, the WordNet dictionary (that you can download here) and then replace each occurrence? All the word forms could be inserted into a database then queried against. That would make them consistent with a dictionary response. Does that make sense?

      It’s not something I know enough about to give an answer, but I’ll do some reading to see if I can help. If you contact me via email with what you’re doing and how you’re trying to do it, I’ll see if I can help.

  3. Paul says:

    Hi Marty, can I extract the keywords into a table?
    like
    keyword
    keyword2

    Thank you
