Difference between revisions of "Regular expressions"

From Linuxintro
imported>ThorstenStaerk
(otwfkIZimlvhSNiZXt)
Line 1: Line 1:
Regular expressions allow you to formulate patterns to search for. Here's an example: It is easy to search for the string "Sep" in a file, you do it with
+
  bferoe  You also seem to have accidentally slipped and added an seo link with the anchor text “website development” to your reply. As this was clearly a mistake, I removed it for you, free of charge
  [[grep]] "Sep" file
 
This gives you all lines containing the string "Sep". But what do you do if you only want lines ''starting'' with "Sep", for example, to read all lines in your syslog regarding september? Then you need '''regular expressions'''. It works like this:
 
grep -E "^Sep" /var/log/messages
 
gives you all entries for september in your syslog. And there is much more you can do with regular expressions.
 
 
 
= Escaping =
 
The characters ^ and \ are seen as control-characters. ^ means "at the beginning of a line". With a backslash, you can ''escape'' these control-characters, meaning they act as body-characters again:
 
grep "^hallo" file
 
finds all occurrences of "hallo" at the beginning of a line in ''file''.
 
grep "\^hallo"
 
finds all occurrences of "^hallo" in a file
 
grep "\\^hallo"
 
finds all occurrences of "\^hallo" in a file
 
grep "\\\\^hallo"
 
finds all occurrences of "\\^hallo" in a file
 
And so on...
 
 
 
= Write regular expressions =
 
For "finding a pattern defined by a regular expression", we speak of "matching".
 
 
 
== Beginning of a line is ==
 
grep "^hallo" ''file''
 
prints all occurrences of "hallo" at the beginning of a line in ''file''.
 
 
 
== The end of a line ==
 
grep "hallo$" ''file''
 
prints all occurrences of "hallo" at the end of a line in ''file''.
 
 
 
== Find string1 OR string2 ==
 
grep -E "Sep|Aug" ''file''
 
prints all lines from ''file'' that contain "Sep" ''or'' "Aug".
 
 
 
== Match a group of characters ==
 
grep -E "L[I,1]NUX" ''file''
 
prints all lines from ''file'' that contain "LINUX" or "L1NUX"
 
 
 
== Match a range of characters ==
 
grep -E "foo[1-9]" ''file''
 
prints all lines from ''file'' that contain "foo1" or "foo2" till "foo9"
 
 
 
== NOT the following characters ==
 
To invert matching for a group of characters
 
grep -E "for[^ e]" ''file''
 
prints all lines from ''file'' that contain "for", but not followed by a space or an e, so not "for you" or "foresee"
 
 
 
With grep you have an additional possibility to invert matches:
 
grep -Ev "gettimeofday" ''file''
 
prints all lines from ''file'' that do NOT contain "gettimeofday". This is a grep feature.
 
 
 
== Any character ==
 
grep -E "L.nux" ''file''
 
matches any character that is not a newline, e.g. Linux, Lenux and Lnux in ''file''.
 
 
 
== Match one or more times ==
 
grep -E "L[i]+nux" ''file''
 
Match if i is there at least once in ''file''
 
The + here is a quantifier. It means, that i occurs 1 or more times. It is also possible to accept 0 or more times if you replace the + by a *.
 
 
 
== Backreferences ==
 
Backreferences allows you to reuse matches. For example consider the following line from /var/log/[[apache]]2/access_log:
 
84.163.99.149 - - [21/Jan/2012:15:23:40 +0100] "GET /wiki/Special:RecentChanges HTTP/1.1" 200 66493 "http://www.linuxintro.org/index.php?title=Configuring_and_securing_sshd&action=history" "Mozilla/5.0 (X11; Linux x86_64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1"
 
If you want to "extract the string containing GET between the quotes" you best use backreferences like this:
 
[[cat]] /var/log/apache2/access_log | [[sed]] "s;.*\(GET [^\"]*\).*;\1;"
 
 
 
= Read regular expressions =
 
 
 
== * ==
 
An asterisk is a quantifier saying "whatever number of".
 
grep -E "Li*nux" file
 
Lnux
 
Linux
 
Liinux
 
Liiinux
 
An asterisk is placed next to an atom that can be repeated in whatever number. In the above example, the atom is the ''i'' character, but it can also be a group of characters:
 
grep -E "ba(na)*" file
 
ba
 
bana
 
banana
 
bananana
 
 
 
== ^ ==
 
The ^ character stands for  
 
* the beginning of a line if it stands at the beginning of a branch
 
# grep ^foo
 
barfoo
 
foo
 
foo
 
* "not" if it stands behind a bracket
 
# grep for[^e]
 
foresee
 
for each
 
for each
 
* the ^ character if it is escaped
 
# grep "\^"
 
adsf
 
as^df
 
as^df
 
 
 
= Understand regular expressions =
 
 
 
== Branches, Pieces and Atoms ==
 
A regular expression consists of one or more ''branches'', separated by "|", the "OR" sign. If one of the branches ''matches'', the expression matches:
 
grep -E "Tom|Harry"
 
Here, the expression is ''Tom''|''Harry'', and ''Tom'' and ''Harry'' are both branches.
 
 
 
A branch consists of one or more pieces, seen in its particular order. A piece is an atom optionally followed by a [[Regular_expressions#quantifiers|quantifier]]:
 
grep -E "To*m"
 
Here, T is a piece as well as o* and m.
 
 
 
An atom is a character, a bracket expression or a subexpression. Each line can be an atom:
 
a
 
b
 
[^e]
 
(this is a subexpression)
 
 
 
== quantifiers ==
 
A quantifier is used to define that an atom can exist several times. The * quantifier defines the atom in front of it can occur 0, 1 or several times:
 
grep -E "To*m"
 
Will find all lines containing Tom, Toom, Tooom and Tm.
 
 
 
= See also =
 
* [[scripting tutorial]]
 
 
 
<Rating comment=false>
 
How do you like this article?
 
1 (Hated it)
 
2
 
3
 
4
 
5 (Loved it)
 
</Rating>
 

Revision as of 12:16, 12 February 2012

bferoe   You also seem to have accidentally slipped and added an seo link with the anchor text “website development” to your reply. As this was clearly a mistake, I removed it for you, free of charge