Many readers have probably never heard of this extension. Despite the name, it has nothing to do with teddy bears: it is an extension for HTML-related operations. It can format and display content in data formats such as HTML, XHTML, and XML.
About the Tidy library
The Tidy extension ships with PHP. We can add --with-tidy when compiling and installing PHP to build it in, or install it afterwards from the source code in the tidy directory under the ext/ folder of the source package. The extension also depends on the tidy library, which must be installed on the operating system. On CentOS, yum install libtidy-devel is enough.
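Before running the examples, it can help to confirm that the extension is actually loaded. The following is a minimal sketch that uses only PHP's built-in extension_loaded() and phpversion(); the variable name $tidyLoaded is chosen purely for illustration.

```php
<?php
// Check whether the tidy extension is available in this PHP build.
$tidyLoaded = extension_loaded('tidy');

if ($tidyLoaded) {
    // phpversion('tidy') reports the version string of the extension.
    echo 'tidy extension loaded, version: ', phpversion('tidy'), PHP_EOL;
} else {
    echo 'tidy extension not loaded; build PHP with --with-tidy or install the extension', PHP_EOL;
}
```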
Formatting with Tidy
First, let's take a look at how to format a piece of HTML code with the Tidy extension.
$content = <<<EOF
<html><head><title>test</title></head> <body><p>error<br>another line</i></body>
</html>
EOF;
$tidy = new Tidy();
$config = [
'indent'=>true,
'output-xhtml'=>true,
];
$tidy->parseString($content, $config);
$tidy->cleanRepair();
echo $tidy, PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
// <head>
// <title>
// test
// </title>
// </head>
// <body>
// <p>
// error<br />
// another line
// </p>
// </body>
// </html>
The HTML in $content is a messy fragment with no formatting at all. After instantiating a Tidy object, calling parseString() and then cleanRepair(), we print the $tidy object directly and get formatted HTML. The result looks very standard: the xmlns attribute has been added and the markup is laid out with consistent indentation.
parseString() takes two parameters here: the first is the string to format and the second is the formatting configuration. The configuration is an array whose keys must be options defined by the Tidy library; the available options can be looked up via the second link in the reference documents at the end of the article. Here we set only two: indent controls whether block-level content is indented, and output-xhtml controls whether the output is XHTML.
cleanRepair() performs cleanup and repair on the parsed content; this is where the actual tidying work happens.
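To show how the configuration array changes the result, here is a small sketch with two additional options. It assumes the show-body-only and wrap options of the Tidy library and the optional third encoding parameter of parseString(); the input fragment is made up for illustration.

```php
<?php
// Hypothetical input fragment, for illustration only.
$fragment = '<p>first<br>second';

$tidy = new Tidy();
$tidy->parseString($fragment, [
    'indent'         => true,  // indent block-level elements
    'output-xhtml'   => true,  // emit XHTML
    'show-body-only' => true,  // print only the contents of <body>
    'wrap'           => 0,     // disable line wrapping
], 'utf8');                    // optional third argument: encoding
$tidy->cleanRepair();

// Casting the object to a string yields the repaired markup.
$result = (string) $tidy;
echo $result, PHP_EOL;
```

With show-body-only enabled, the output should omit the html, head, and body wrappers, which is convenient when tidying fragments rather than whole pages.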
Note that the test code prints the Tidy object directly, which means the object can be converted to a string in the manner of __toString(). Its actual contents look like this.
var_dump($tidy);
// object(tidy)#1 (2) {
// ["errorBuffer"]=>
// string(112) "line 1 column 1 - Warning: missing <!DOCTYPE> declaration
// line 1 column 70 - Warning: discarding unexpected </i>"
// ["value"]=>
// string(195) "<html xmlns="http://www.w3.org/1999/xhtml">
// <head>
// <title>
// test
// </title>
// </head>
// <body>
// <p>
// error<br />
// another line
// </p>
// </body>
// </html>"
// }
Getting document attribute information
var_dump($tidy->isXml()); // bool(false)
var_dump($tidy->isXhtml()); // bool(false)
var_dump($tidy->getStatus()); // int(1)
var_dump($tidy->getRelease()); // string(10) "2017/11/25"
var_dump($tidy->getHtmlVer()); // int(500)
Through the methods of the Tidy object we can get some information about the document being processed, such as whether it is XML or XHTML content.
getStatus() returns the status of the Tidy object. The value 1 here means that warnings or accessibility errors were raised. As the dump of the Tidy object above shows, its errorBuffer property does contain warning messages.
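The status codes can be turned into readable labels with a small helper. This is only a sketch: the function name tidyStatusLabel() is hypothetical, and the mapping follows the PHP manual's description of getStatus() (0 for no errors or warnings, 1 for warnings or accessibility errors, 2 for errors).

```php
<?php
// Map the integer from getStatus() to a readable label.
// tidyStatusLabel() is a hypothetical helper, not part of the extension.
function tidyStatusLabel(int $status): string
{
    switch ($status) {
        case 0:
            return 'no errors or warnings';
        case 1:
            return 'warnings or accessibility errors';
        case 2:
            return 'errors';
        default:
            return 'unknown status';
    }
}

echo tidyStatusLabel(1), PHP_EOL;
```

In real code the argument would be $tidy->getStatus() rather than a literal.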
getRelease() returns the release date of the underlying Tidy library, that is, the tidy library installed on the operating system. getHtmlVer() returns the detected HTML version. I could not find any further documentation for the value 500 shown here, so I am not sure what it means.
In addition to the above, we can also retrieve the option values set in the earlier $config, along with their documentation.
var_dump($tidy->getOpt('indent')); // int(1)
var_dump($tidy->getOptDoc('output-xhtml'));
// string(489) "This option specifies if Tidy should generate pretty printed output, writing it as extensible HTML. <br/>This option causes Tidy to set the DOCTYPE and default namespace as appropriate to XHTML, and will use the corrected value in output regardless of other sources. <br/>For XHTML, entities can be written as named or numeric entities according to the setting of <code>numeric-entities</code>. <br/>The original case of tags and attributes will be preserved, regardless of other options. "
getOpt() takes one parameter: the name of the option to query. If you query an option that was not set in $config, the default value is returned. getOptDoc() is a thoughtful touch: it returns the documentation for an option.
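As a rough illustration of reading defaults, the sketch below queries an option that was never configured and then lists the whole configuration. It assumes the wrap option of the Tidy library and the getConfig() method of the Tidy class; the actual default values depend on the installed library version, so none are shown.

```php
<?php
$tidy = new Tidy();
$tidy->parseString('<p>test</p>', ['indent' => true]);

// 'wrap' was not set above, so this is the library default.
$wrap = $tidy->getOpt('wrap');
var_dump($wrap);

// getConfig() returns every option with its current value.
$config = $tidy->getConfig();
echo count($config), ' options available', PHP_EOL;
var_dump($config['indent']);
```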
Finally, there are some more practical methods that let you work with nodes directly.
echo $tidy->head(), PHP_EOL;
// <head>
// <title>
// test
// </title>
// </head>
$body = $tidy->body();
var_dump($body);
// object(tidyNode)#2 (9) {
// ["value"]=>
// string(60) "<body>
// <p>
// error<br />
// another line
// </p>
// </body>"
// ["name"]=>
// string(4) "body"
// ["type"]=>
// int(5)
// ["line"]=>
// int(1)
// ["column"]=>
// int(40)
// ["proprietary"]=>
// bool(false)
// ["id"]=>
// int(16)
// ["attribute"]=>
// NULL
// ["child"]=>
// array(1) {
// [0]=>
// object(tidyNode)#3 (9) {
// ["value"]=>
// string(37) "<p>
// ………………
// ………………
echo $tidy->html(), PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
// <head>
// <title>
// test
// </title>
// </head>
// <body>
// <p>
// error<br />
// another line
// </p>
// </body>
// </html>
echo $tidy->root(), PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
// <head>
// <title>
// test
// </title>
// </head>
// <body>
// <p>
// error<br />
// another line
// </p>
// </body>
// </html>
Even without much explanation you can see that head() returns the content of the <head> tag, and that body() and html() return their corresponding tags. root() returns the root node, which can be treated as the entire document.
The content returned by these methods is actually a TidyNode object, which we will explain in detail later.
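As a small preview, the child property seen in the dump above can be iterated like an ordinary array. The sketch below collects the tag names of the direct children of body; the input markup and the $names variable are made up for illustration.

```php
<?php
$tidy = new Tidy();
$tidy->parseString('<html><head><title>t</title></head><body><p>one</p><div>two</div></body></html>');
$tidy->cleanRepair();

// body() returns a tidyNode; its child property holds the child nodes.
$body = $tidy->body();
$names = [];
if ($body !== null && $body->hasChildren()) {
    foreach ($body->child as $child) {
        $names[] = $child->name;
    }
}
echo implode(', ', $names), PHP_EOL;
```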
Convert directly to string
The code above is based on parseString(), which returns only a boolean indicating success or failure. To get the formatted content we have to either use the object as a string or call root(). There is, however, another method that returns the formatted string directly.
$tidy = new Tidy();
$repair = $tidy->repairString($content, $config);
echo $repair, PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
// <head>
// <title>
// test
// </title>
// </head>
// <body>
// <p>
// error<br />
// another line
// </p>
// </body>
// </html>
The parameters of repairString() are exactly the same as those of parseString(). The only difference is that it returns a string instead of keeping the result inside the Tidy object.
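The same operation is also available as the procedural function tidy_repair_string(). The wrapper below is a sketch only: repairHtml() is a hypothetical helper name, and the show-body-only option is assumed from the Tidy option list.

```php
<?php
// Hypothetical helper: returns the repaired markup, or null on failure.
function repairHtml(string $html, array $config = []): ?string
{
    $repaired = tidy_repair_string($html, $config, 'utf8');
    return is_string($repaired) ? $repaired : null;
}

$clean = repairHtml(
    '<p>error<br>another line</i>',
    ['output-xhtml' => true, 'show-body-only' => true]
);
echo $clean, PHP_EOL;
```

Wrapping the call this way keeps the failure case explicit, since tidy_repair_string() may return false instead of a string.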
Conversion error messages
In the first test code, when we used var_dump() to print the Tidy object, we saw that the errorBuffer property contained error messages. This time let's take another HTML snippet with more problems.
$html = <<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<p>paragraph</p>
HTML;
$tidy = new Tidy();
$tidy->parseString($html);
$tidy->cleanRepair();
echo $tidy->errorBuffer, PHP_EOL;
// line 4 column 1 - Warning: <p> isn't allowed in <head> elements
// line 4 column 1 - Info: <head> previously mentioned
// line 4 column 1 - Warning: inserting implicit <body>
// line 4 column 1 - Warning: inserting missing 'title' element
$tidy->diagnose();
echo $tidy->errorBuffer, PHP_EOL;
// line 4 column 1 - Warning: <p> isn't allowed in <head> elements
// line 4 column 1 - Info: <head> previously mentioned
// line 4 column 1 - Warning: inserting implicit <body>
// line 4 column 1 - Warning: inserting missing 'title' element
// Info: Doctype given is "-//W3C//DTD XHTML 1.0 Strict//EN"
// Info: Document content looks like XHTML 1.0 Strict
// Tidy found 3 warnings and 0 errors!
In this test code we use a new method, diagnose(), which runs diagnostic tests on the document and appends more information about it to the errorBuffer property.
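Because errorBuffer is a plain multi-line string, it can be post-processed with ordinary string functions. The following sketch keeps only the warning lines; tidyWarnings() is a hypothetical helper, and the sample text is copied from the output shown above.

```php
<?php
// Hypothetical helper: keep only the warning lines from an errorBuffer string.
function tidyWarnings(string $errorBuffer): array
{
    // Split on any line break and drop empty lines.
    $lines = preg_split('/\R/', $errorBuffer, -1, PREG_SPLIT_NO_EMPTY);

    return array_values(array_filter($lines, function (string $line): bool {
        return strpos($line, 'Warning:') !== false;
    }));
}

// Sample text taken from the errorBuffer output above.
$sample = "line 4 column 1 - Warning: inserting implicit <body>\n"
        . "line 4 column 1 - Info: <head> previously mentioned";

print_r(tidyWarnings($sample));
```

In practice the argument would be $tidy->errorBuffer after parseString() or diagnose().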
TidyNode operations
As mentioned before, head(), html(), body(), and root() all return a TidyNode object. Is there anything special about this object?
$html = <<<EOF
<html><head>
<?php echo '<title>title</title>'; ?>
<#
/* JSTE code */
alert('Hello World');
#>
</head>
<body>
<?php
// PHP code
echo 'hello world!';
?>
<%
/* ASP code */
response.write("Hello World!")
%>
<!-- Comments -->
Hello World
</body></html>
Outside HTML
EOF;
$tidy = new Tidy();
$tidy->parseString($html);
$tidyNode = $tidy->html();
showNodes($tidyNode);
function showNodes($node){
    if($node->isComment()){
        echo '========', PHP_EOL,'This is Comment Node :"', $node->value, '"', PHP_EOL;
    }
    if($node->isText()){
        echo '--------', PHP_EOL,'This is Text Node :"', $node->value, '"', PHP_EOL;
    }
    if($node->isAsp()){
        echo '++++++++', PHP_EOL,'This is Asp Script :"', $node->value, '"', PHP_EOL;
    }
    if($node->isHtml()){
        echo '********', PHP_EOL,'This is HTML Node :"', $node->value, '"', PHP_EOL;
    }
    if($node->isPhp()){
        echo '########', PHP_EOL,'This is PHP Script :"', $node->value, '"', PHP_EOL;
    }
    if($node->isJste()){
        echo '@@@@@@@@', PHP_EOL,'This is JSTE Script :"', $node->value, '"', PHP_EOL;
    }
    if($node->name){
        // getParent()
        if($node->getParent()){
            echo '&&&&&&&& ', $node->name ,' getParent is : ', $node->getParent()->name, PHP_EOL;
        }
        // hasSiblings
        echo '^^^^^^^^ ', $node->name, ' has siblings is : ';
        var_dump($node->hasSiblings());
        echo PHP_EOL;
    }
    if($node->hasChildren()){
        foreach($node->child as $child){
            showNodes($child);
        }
    }
}
// ………………
// ………………
// ********
// This is HTML Node :"<head>
// <?php echo '<title>title</title>'; ><#
// /* JSTE code */
// alert('Hello World');
// #>
// <title></title>
// </head>
// "
// &&&&&&&& head getParent is : html
// ^^^^^^^^ head has siblings is : bool(true)
// ………………
// ………………
// ++++++++
// This is Asp Script :"<%
// /* ASP code */
// response.write("Hello World!")
// %>"
// ………………
// ………………
I will not walk through each step of this code or explain every function in detail. As the code shows, a TidyNode object can tell us about the structure around a node, such as whether it has child nodes or sibling nodes. It can also tell us what kind of node it is: a comment, text, JSTE code, PHP code, ASP code, and so on. I don't know how you feel on seeing this, but I find it very interesting, especially the method for detecting PHP code.
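The same recursive walk can be used to collect nodes of a single type instead of printing everything. The sketch below gathers the value of every comment node; collectComments() is a hypothetical helper and the input markup is made up for illustration.

```php
<?php
// Hypothetical helper: recursively collect the value of every comment node.
function collectComments(tidyNode $node, array &$found = []): array
{
    if ($node->isComment()) {
        $found[] = $node->value;
    }
    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            // Pass the same array down so all matches end up in one list.
            collectComments($child, $found);
        }
    }
    return $found;
}

$tidy = new Tidy();
$tidy->parseString('<html><body><!-- Comments --><p>Hello World</p></body></html>');

$comments = collectComments($tidy->root());
print_r($comments);
```

Swapping isComment() for isText(), isPhp(), or isAsp() gives the corresponding collector for other node types.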
Statistics functions
Finally, let's take a look at some statistical functions in the Tidy extension library.
$html = <<<EOF
<p>test</i>
<bogustag>bogus</bogustag>
EOF;
$config = array('accessibility-check' => 3,'doctype'=>'bogus');
$tidy = new Tidy();
$tidy->parseString($html, $config);
echo 'tidy access count: ', tidy_access_count($tidy), PHP_EOL;
echo 'tidy config count: ', tidy_config_count($tidy), PHP_EOL;
echo 'tidy error count: ', tidy_error_count($tidy), PHP_EOL;
echo 'tidy warning count: ', tidy_warning_count($tidy), PHP_EOL;
// tidy access count: 4
// tidy config count: 2
// tidy error count: 1
// tidy warning count: 6
The numbers these functions return are counts of different kinds of messages. tidy_access_count() is the number of accessibility warnings encountered, tidy_config_count() is the number of configuration errors, and the other two are self-explanatory from their names.
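When several of these counters are needed together, they can be bundled into one array. This is a sketch under the assumption that only the four procedural functions above are used; tidyCounts() is a hypothetical helper name and the input is made up for illustration.

```php
<?php
// Hypothetical helper that bundles the four counters into one array.
function tidyCounts(tidy $tidy): array
{
    return [
        'access'  => tidy_access_count($tidy),
        'config'  => tidy_config_count($tidy),
        'error'   => tidy_error_count($tidy),
        'warning' => tidy_warning_count($tidy),
    ];
}

$tidy = new Tidy();
$tidy->parseString('<p>test</i>');

$counts = tidyCounts($tidy);
print_r($counts);
```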
Summary
In short, Tidy is a less common but very interesting extension. It still has its uses in some scenarios, such as template development. Approach it with a learning mindset and take a closer look; it may turn out to solve exactly the problem you are stuck on right now!
Test code:
Reference documents: