What is Parsing?
In short, parsing here means finding chunks of data and replacing each one with something else, based on the chunk you took out.
So why is this useful, you may ask? Let's take a look at a problem the users of OpsWeb (the intranet site I develop) had. One of the largest parts of OpsWeb is the article section. There are over 600 technical articles, and more are added every week. Users often want to link from one article to another. To do this they had to add a normal HTML link. Even though my user population is made up of IT staff and programmers, most of them have little or no experience with HTML. This made linking hard, and it should be easy. We want lots of links to help connect information together.
Here is the solution:
From now on, to add a link to an article you do not use HTML. You write the link as [[article#]] or [[article#:link text]].
Examples:
[[530]] -> OpsWeb Improvements and Changes (the article's title, shown as a link when the page is displayed)
[[530:List of Changes]] -> List of Changes
Both examples make a link to article 530, and the only thing the user needs to know is the number of the article. (The users are very aware of the article numbers, as they are used as shorthand references on process flowcharts and in emails. Example: "Hey Bob, I need you to perform the reset procedure from article 126.")
How to write a simple text parser in PHP
Enough intro, on to the good stuff. Let's look at some code.
public function body_parser($text){
    // Parser that processes the body text of an article.
    // [[123]] becomes a link whose text is the article title.
    $result = preg_replace_callback("/\[\[(\d+)\]\]/", array($this, 'p_format_link'), $text);
    // [[123:text]] becomes a link with the given text. The lazy .*? stops at the first ]].
    $result = preg_replace_callback("/\[\[(\d+):(.*?)\]\]/", array($this, 'p_format_link_with_text'), $result);
    return $result;
}
public function p_format_link($given){
    // $given[1] holds the article number captured by (\d+).
    return
        '<a href="http://egondev.com/opsweb/articles/articles_control.php?action=display_one&record_id='.$given[1].'">'
        .$this->get_title_from_id_DB($given[1])
        .'</a>';
}
public function p_format_link_with_text($given){
    // $given[2] holds the link text captured after the colon.
    return
        '<a href="http://egondev.com/opsweb/articles/articles_control.php?action=display_one&record_id='.$given[1].'">'
        .$given[2]
        .'</a>';
}
The function body_parser finds all of our link tags (the [[article#]] bits) using regular expressions and then passes each one to one of the other 2 functions to be formatted into HTML. The formatted HTML is then put into the body in place of the link tag.
Example:
$body = "hello I am a short article with a link to [[530]]";
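$body = $this->body_parser($body); // body_parser is a method, so it is called on the object
echo $body;
>> "hello I am a short article with a link to OpsWeb Improvements and Changes" (with the title displayed as a link to article 530)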
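To try the idea outside of OpsWeb, here is a minimal standalone sketch. It is not the OpsWeb code: the $titles array is a hypothetical stand-in for the database lookup, demo_body_parser and the relative $base URL are names made up for this illustration, and htmlspecialchars is added as a precaution that the parser above does not take.

```php
<?php
// Hypothetical stand-in for the database lookup get_title_from_id_DB().
$titles = array(530 => 'OpsWeb Improvements and Changes');

// Hypothetical base URL; the article number is appended to it.
$base = 'articles_control.php?action=display_one&record_id=';

function demo_body_parser($text, $titles, $base) {
    // [[530]] -> link labelled with the article title.
    $text = preg_replace_callback(
        '/\[\[(\d+)\]\]/',
        function ($m) use ($titles, $base) {
            $id = (int) $m[1];
            $title = isset($titles[$id]) ? $titles[$id] : 'Article ' . $m[1];
            return '<a href="' . htmlspecialchars($base . $m[1]) . '">'
                . htmlspecialchars($title) . '</a>';
        },
        $text
    );
    // [[530:some text]] -> link labelled with the given text.
    return preg_replace_callback(
        '/\[\[(\d+):(.*?)\]\]/',
        function ($m) use ($base) {
            return '<a href="' . htmlspecialchars($base . $m[1]) . '">'
                . htmlspecialchars($m[2]) . '</a>';
        },
        $text
    );
}

$html = demo_body_parser('See [[530]] or the [[530:List of Changes]].', $titles, $base);
echo $html, "\n";
```

Both tags end up as links to article 530, the first labelled with the looked-up title and the second with the text the user typed. Escaping the label means a user cannot accidentally break the page by typing a < in the link text.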
The heart of this thing is the preg_replace_callback function. It searches a string using a regular expression. When it finds a match in that string, it calls the function given as the second parameter and passes the regular expression match to it. It then replaces the matched part of the string with the return value of that function.
mixed preg_replace_callback ( mixed $pattern , callback $callback , mixed $subject [, int $limit [, int &$count ]] )
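A small sketch of what the callback actually receives may help here. The function name show_match and the sample string are made up for this illustration. The match arrives as an array: index 0 is the whole matched text and index 1 onward are the parenthesised groups. The optional $limit and $count parameters from the signature are used as well.

```php
<?php
// The callback gets an array: index 0 is the whole match, 1.. are the groups.
function show_match($m) {
    return '{whole=' . $m[0] . ', digits=' . $m[1] . '}';
}

$count = 0;
// -1 means "no limit"; $count is filled in with the number of replacements.
$out = preg_replace_callback('/\[\[(\d+)\]\]/', 'show_match', 'a [[12]] b [[345]]', -1, $count);

echo $out, "\n";   // a {whole=[[12]], digits=12} b {whole=[[345]], digits=345}
echo $count, "\n"; // 2
```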
Let's take a look at one of the preg_replace_callback lines from my parser:
$result = preg_replace_callback("/\[\[(\d+)\]\]/", array($this, 'p_format_link'), $text);
- "/\[\[(\d+)\]\]/" will match any chunk of text starting with [[ followed by 1 or more digits and ending with ]].
- The (\d+) part of the expression gives us just the digits as a captured sub part of the matched text.
- array($this, 'p_format_link') is the callback function. The syntax looks odd because all 3 of these functions are in a class, and to reference a callback inside a class correctly you need to use this form. Without a class structure it would just be the name of the function you want to pass the match to, given as a string. No ()s.
- $text is the string we are passing to the preg_replace_callback function.
- $result is the new string with the matches substituted with the return value from the callback function.
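The callback forms described in the list above can be compared side by side. This is only a sketch: wrap_plain, Wrapper, wrap and run are hypothetical names invented for the comparison, and the anonymous function form assumes PHP 5.3 or later.

```php
<?php
// Hypothetical helper names, used only for this illustration.
function wrap_plain($m) {
    return '<' . $m[1] . '>';
}

class Wrapper {
    public function wrap($m) {
        return '<' . $m[1] . '>';
    }
    public function run($text) {
        // Inside a class: array($this, 'methodName').
        return preg_replace_callback('/\[\[(\d+)\]\]/', array($this, 'wrap'), $text);
    }
}

$text = 'see [[7]]';
$pattern = '/\[\[(\d+)\]\]/';

// Outside a class: just the function name as a string, no ()s.
$a = preg_replace_callback($pattern, 'wrap_plain', $text);

// Method callback, set up inside the class.
$w = new Wrapper();
$b = $w->run($text);

// Anonymous function passed directly.
$c = preg_replace_callback($pattern, function ($m) { return '<' . $m[1] . '>'; }, $text);

var_dump($a === 'see <7>', $a === $b, $b === $c);
```

All three calls produce the same string. Which form to use mostly depends on where the formatting function lives.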