The text matching language is the key to understanding how the Proxomitron's filters work. It lets you match complex combinations of HTML tags and store parts of the matched text in variables, which can later be used in the replacement text.
If you're familiar with DOS and UNIX style filename wildcards (*, ?, [...]) or with regular expressions, you'll find much that's familiar in Proxomitron's matching rules. My original goal, in fact, was to create a matching language as easy to use as wildcards, but with much of the added power of regular expressions. I'm not exactly sure I succeeded, but somehow I got it all to work! ;-)
Many of the rules have also been designed specifically to make working with HTML easier. For instance, since case is seldom important in HTML, all matching is case insensitive, which saves you the trouble of testing for both upper and lower case.
Meta characters
Here is a list of all the meta characters.
* | The asterisk matches any string of characters. For example, "foo*bar" matches "foobar", "fooma babar", or even "foo goat bat bison bar". |
? | The question mark matches any single character. "?oat" matches "boat", "goat", or even "<oat". |
[abc...] | Square brackets match any single character listed between the '[' and ']'. Ranges can be given with a dash: "[A-Z]" matches any letter "A" through "Z", while "[0-9]" matches any single digit. If the first character is a "^", the brackets match any character not listed: "[^0-9abc]" matches any character that is not a digit and not "a", "b", or "c". |
[#n:n] | Special numeric match. Use it to check for numeric value ranges in HTML tags. For example, to check for a number between 100 and 150 use "[#100:150]". If the second number is a "*" it acts as if it were infinitely large, so "[#40:*]" matches any number greater than 40. To check for a number less than 40, use "[#0:40]". To check for an exact number, leave out the second number, as in "[#100]". The numeric match ignores leading zeros and any quotes surrounding the number, so tag="0100" is treated the same as tag="100". Earlier versions of Proxomitron used "-" instead of ":" to separate the numbers, but this made testing for negative numbers difficult. Proxomitron currently accepts either separator, but if your test includes a negative value (like [#-200:150]) you must use ":". |
" " | The space always matches, but it greedily consumes any number of spaces or tabs it finds. Use it where there may or may not be a space between items. For example, "<tag value>" matches "<tag value>", the same tag with several spaces or tabs between "tag" and "value", or even "<tagvalue>". |
\s | Backslash-s: like the space, it consumes any number of spaces or tabs, but at least one must be present for it to match. For example, "<tag\s>" matches "<tag >", with one or more spaces or tabs before the ">", but not "<tag>". |
\w | Backslash-w: word match. It matches any number of non-space characters. It is basically the opposite of "\s", but in some ways it is also similar to "*". The difference is that it stops when it hits a space or a ">" (which marks the end of an HTML tag). It comes in very handy when matching tag values and URLs (see Tips and Tricks). |
\t | Matches a single tab character. |
\r | Matches a single carriage return. |
\n | Matches a single newline. |
\0-9 | Backslash followed by a digit 0-9: match into a variable. This is one of Proxomitron's key matching rules. It matches just like the "*" character, but stores whatever is matched in one of ten variables. These variables can then be used in the replacement text to include parts of the original HTML, so you can change only part of a tag while leaving the rest intact. For example, to change only the background in a <body ... > tag you could use... Matching: <body \1 background="*" \2 > Replacement: <body \1 background="mybackground.gif" \2 > Here the text between "body" and "background" is stored unchanged in \1 and the text after the background value is stored in \2, while the background value itself, matched by "*", is replaced by mybackground.gif whatever it was. This way, anything else originally within the body tag, both before and after the background attribute, is included in the replacement text. More complex matches can be captured by placing the \0-9 directly after a set of parentheses with no space in between, as in "(foo*bar)\1". In this case anything matched within the parentheses is placed in the variable. This is similar to regular expressions, but with the added benefit of being able to choose which variable gets used. |
\# | Backslash-hash (or pound sign) works much like \0 through \9, except that each time it is used the value is pushed onto the Replacement Stack. This can be thought of as adding the new value to the end of the variable instead of replacing its previous value. For example, if \# first matched "foo" and then matched "bar", it would contain both values. Use "\@" in the replacement section to print all the captured values together, here "foobar". |
| | The vertical bar means "OR". For example, "foo|bar" matches either "foo" or "bar". |
& | Use the ampersand as an "AND" function. For example, "*foo&*bar" matches "foo bar" or "bar foo" but not "foo foo". Note the use of the asterisk: something like it is always needed with AND, since a word could never be both "foo" and "bar" at the same place and time ;). AND is useful where tag values may come in any order. For example, <img src="picture" height=60 width=200> could also be written as <img width=200 height=60 src="picture">, and both can be matched by <img (*src="picture" & *height=60 & *width=200)*> |
&& | The double ampersand (or AND-AND) function works like the single "&" with one important (but useful) difference: the second half of the AND is limited to matching exactly what the first half did. Confused? It's actually pretty simple. Say you have an expression like this... (<img * > && \1 ) ...now the "\1" normally acts as a "*" and, given a regular AND, would match from the start of "<img " all the way to the end of the available text (well past the end of the image tag). The AND-AND limits it to matching only the contents of the <img ... > tag and no more, so the \1 captures only the <img ...> tag. You can use this as a bound to limit the scope of a match and prevent "run-away" expressions. |
( ... ) | Use parentheses to create sub-expressions within a matching phrase. For example, "foo(bar|bear|goat)" matches "foobar", "foobear", or even "foogoat". Parentheses can be nested, as in "foo(bar|(black|brown|puce) bear|goat)", which matches "foobar", "fooblackbear", "foobrownbear", and so on. As with "[...]", if the first character following a "(" is "^", the expression matches only if the expression within does not match. For example, "(^foo|bar)" matches anything that is not "foo" or "bar". Note that a negated expression consumes no characters, it just tests them. Perl-style regular expressions call this a negative lookahead assertion. |
+ | The plus sign indicates a run of repeating characters. For instance, "a+" matches "a", "aa", or "aaaa". You can use it after other meta characters or parentheses for more complex runs. For example: [abc]+ matches any run of "a", "b", or "c", such as "ababccba"; ([a-z]&[^n])+ matches a run of any letters "a" to "z" except "n"; (foo)+ matches "foo", "foofoo", "foofoofoo", and so on. An important point about + is that it is a "blind" run: it repeats as long as the condition it is testing is true, regardless of anything that follows it. For example, "(foo)+foobar" could never match. Why? The loop eats up all the "foo"s there are, leaving no "foo" for "foobar". This can actually be very useful sometimes, but if it's not what you want, try "++" instead. |
++ | A double plus acts much like the single "+", except that it also pays attention to what comes afterwards (it can "see", so to speak). It loops only until it finds that the rest of the expression matches. This is very similar to how the "." operator works in normal regular expressions, for example. |
{n,n} | Either "+" or "++" can be followed by a pair of curly braces, which control the minimum and maximum number of times the expression may loop. For example, "[a]+{4,10}" matches a string of four to ten "A"s, while "[b]+{20}" matches a string of exactly twenty "B"s. An asterisk "*" denotes "infinity", so "[c]+{10,*}" matches ten or more "C"s. (For the regexp people in the audience, one difference to keep in mind is that the braces must follow either "+" or "++" and cannot be used on their own.) |
\ | The backslash can be used to "escape" any character that has a special meaning and treat it as a normal character. For example, to match a parenthesis in the HTML text use "\(", and to match a backslash itself use "\\". |
= | The equals character has special "magic". It matches not only the "=" itself but also any whitespace before or after it, which makes tests for tag values easier. For example, foo="bar" also matches foo= "bar" or foo = "bar". |
" | The double quote matches either a double or a single quote, since either may be used in HTML. For example, " * " would match "oh happy mongoose" or 'oh happy mengeese'. To match only a double quote, escape it as \". |
' | The single quote is smarter than your average quote: it tries to match the appropriate closing quote for whatever quote the double quote previously matched, even if other quotes appear in between. Confused? Don't be. In HTML it is common to mix single and double quotes when you need "quotes within quotes". Consider these examples: single inside double, href=" javascript:window.open( ' bison.html ' ); " and double inside single, href=' javascript:window.open( " bison.html " ); '. Both can be matched by href=( " * ' ): use the double quote to match the opening quote and the single quote to match the closing quote. There are some restrictions. First, the opening and closing quotes must be in the same sub-expression, meaning the same set of nested parentheses. For example, " some text ' will work, ( " some text ' ) will work, and " ( some text | other text) ' will work, but " ( some text ' ) and ( " | ) some text ( ' | ) will not. Second, opening and closing quotes cannot be nested within the same sub-expression: a matching clause of " something " something else ' end of something ' won't work. You can, however, nest them by using a different sub-expression, like so: " something ( " something else ' ) end of something '. Finally, if no previous double quote was matched, the single quote just matches a normal single quote. Still, it is safer to use \' when you need to check explicitly for a single quote. |
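For readers who think in regular expressions, the following Python sketch approximates a small subset of the meta characters above by translating them into a Python `re` pattern. It is only an illustration: Proxomitron's own matcher is not built on Python regexes, the function name `to_regex` and the group names `v0` to `v9` are made up here, and only `*`, `?`, the space, `\s`, `\w`, `\0`-`\9`, `=`, and `"` are handled (each `\0`-`\9` variable may appear once per pattern).

```python
import re

def to_regex(pattern):
    """Translate a small subset of the matching syntax into a Python regex.

    Hypothetical helper for illustration; it is not part of Proxomitron.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt.isdigit():
                out.append(f"(?P<v{nxt}>.*?)")  # \0-\9: like *, but captured
            elif nxt == "s":
                out.append(r"[ \t]+")           # at least one space or tab
            elif nxt == "w":
                out.append(r"[^\s>]*")          # stops at whitespace or '>'
            else:
                out.append(re.escape(nxt))      # escaped literal character
            i += 2
            continue
        if ch == "*":
            out.append(".*?")                   # any string of characters
        elif ch == "?":
            out.append(".")                     # any single character
        elif ch == " ":
            out.append(r"[ \t]*")               # zero or more spaces or tabs
        elif ch == "=":
            out.append(r"\s*=\s*")              # '=' plus surrounding whitespace
        elif ch == '"':
            out.append("[\"']")                 # either kind of quote
        else:
            out.append(re.escape(ch))
        i += 1
    # All matching is case insensitive, as described in the introduction.
    return re.compile("".join(out), re.IGNORECASE | re.DOTALL)

# The <body> example from the \0-9 entry.
rx = to_regex('<body \\1 background="*" \\2 >')
m = rx.match('<BODY bgcolor="white" background="old.gif" text="black">')
new_tag = '<body {} background="mybackground.gif" {} >'.format(
    m.group("v1"), m.group("v2"))
print(new_tag)
```

Run as written, the capture groups hold `bgcolor="white"` and `text="black"`, so the rebuilt tag keeps both attributes and changes only the background. Features with no direct regex counterpart, such as `[#n:n]`, `&&`, and the paired `'` quote, are deliberately left out of the sketch.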
Special replacement text codes
Besides the matching rules, there are a few special codes that can be used in the replacement text.
First, "\0" to "\9" insert the values stored in the corresponding variables by the matching expression. For values captured by "\#" you can either use "\@", which prints everything that has been stored, or "\#" again, which prints the next stored item each time it appears in the replacement (so if you stored three items, "\# \# \#" would print them with spaces in between).
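The behavior of "\#" and "\@" can be pictured with a small Python model. This is a sketch under stated assumptions, not Proxomitron's implementation: the class and method names are invented, and whether "\@" empties the stack after printing is not stated in the text, so the model leaves the stored items in place.

```python
from collections import deque

class ReplacementStack:
    """Toy model of the Replacement Stack (hypothetical names throughout)."""

    def __init__(self):
        self._items = deque()

    def push(self, text):
        # Models \# in the matching expression: add a value, keep the old ones.
        self._items.append(text)

    def join_all(self):
        # Models \@ in the replacement text: every stored value, in order.
        # Assumption: the stored values are not consumed by this call.
        return "".join(self._items)

    def next_item(self):
        # Models \# in the replacement text: the next stored value each time.
        return self._items.popleft() if self._items else ""

stack = ReplacementStack()
for captured in ("foo", "bar"):
    stack.push(captured)

print(stack.join_all())                        # foobar
print(stack.next_item(), stack.next_item())    # foo bar
```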
Here are some other special codes you can use in the replacement section.
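\u | The URL of the current web page. |
\k | Kills the current connection. Most useful in HTTP header filters, where it can abort the loading of a filtered page or of particular URLs. |
\h | The hostname portion of the URL. |
\p | The path portion of the URL. |
\q | The query string portion of the URL (everything following the "?"). |
\a | The anchor portion of the URL (everything following the "#"). |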
\d | A directory location in "file://" URL format. |
\x | The URL command prefix, if one has been set. |
Tip: \h, \p, \q, \a, and \u can also be used in the matching section. \h in particular is sometimes useful for checking whether a URL on the page has the same hostname as the page itself or is located on a different server.
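To make the URL codes concrete, this Python sketch splits an example URL with the standard library's `urllib.parse.urlsplit` and labels the pieces with the codes they roughly correspond to. The exact boundaries Proxomitron uses (for instance, whether \p carries the leading slash) are not specified above, so treat the mapping as an assumption; `url_parts` and `same_host` are illustrative names and the URLs are placeholders.

```python
from urllib.parse import urlsplit

def url_parts(url):
    """Map the URL codes to the matching pieces of a URL (approximate)."""
    parts = urlsplit(url)
    return {
        r"\u": url,             # the whole URL
        r"\h": parts.netloc,    # hostname portion
        r"\p": parts.path,      # path portion
        r"\q": parts.query,     # text after "?"
        r"\a": parts.fragment,  # text after "#"
    }

def same_host(page_url, link_url):
    """The kind of check the tip describes: is the link on the page's host?"""
    return urlsplit(page_url).netloc.lower() == urlsplit(link_url).netloc.lower()

page = "http://example.com/dir/page.html?x=1#top"
for code, value in url_parts(page).items():
    print(code, value)

print(same_host(page, "http://EXAMPLE.com/images/logo.gif"))
print(same_host(page, "http://ads.example.net/banner.gif"))
```

The first `same_host` call compares two URLs on the same host and the second compares different hosts, which mirrors using \h in a matching expression to tell local links from off-site ones.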
Matching commands (extended functions)
Besides the normal meta characters above, Proxomitron also features special matching commands. They extend the normal matching rules and add all sorts of useful functions. They all begin with a "$" and have an upper-case name followed by parentheses "(...)" containing the command parameters. One example is the $LST(...) command, which checks a blocklist from anywhere within a match. For example...
<img * src="$LST(ImageURLCheck)" * >
This checks the URL of every image against the entries in the "ImageURLCheck" blocklist. For the full set of commands, see Matching Commands.
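The idea behind the $LST example can be sketched in Python as "pull out the src value, then test it against each entry in a named list". Everything specific here is an assumption made for illustration: the list entries are invented, shell-style wildcards via `fnmatch` stand in for Proxomitron's own matching rules, and the blocklist file format is not modeled at all.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical stand-in for the contents of the "ImageURLCheck" blocklist.
IMAGE_URL_CHECK = ["*/ads/*", "*banner*"]

# Rough pattern for the src attribute of an <img> tag.
IMG_SRC = re.compile(r"<img\b[^>]*\bsrc\s*=\s*[\"']([^\"']*)[\"']", re.IGNORECASE)

def in_blocklist(url, patterns):
    """Return True if the URL matches any wildcard entry, ignoring case."""
    url = url.lower()
    return any(fnmatchcase(url, entry.lower()) for entry in patterns)

def img_is_blocked(tag, patterns=IMAGE_URL_CHECK):
    """Return True if the tag is an <img> whose src is on the list."""
    found = IMG_SRC.search(tag)
    return bool(found) and in_blocklist(found.group(1), patterns)

print(img_is_blocked('<img alt="x" src="http://example.com/ads/pic.gif" width=1>'))
print(img_is_blocked('<img src="http://example.com/img/cat.gif">'))
```

The first tag points into an `/ads/` directory and is reported as blocked; the second matches no entry and is left alone.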
For more examples, see Tips and Tricks.
