Regular expressions
Latest revision as of 09:06, 3 January 2021
With regular expressions, you can search for and replace string patterns. For example, you could show all URLs in a file, or all lines that start with a certain date.
Let's take the easier one as an example. To show all lines in a file myfile.txt that begin with the string Sep 13, issue:
grep -E "^Sep 13" myfile.txt
In this case ^Sep 13 is your regular expression. The ^ sign says that a line must start with the following string. And there is much more you can do with regular expressions.
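To try the anchor without a real log file, here is a minimal sketch. The three log lines are made up for illustration and are piped into grep instead of being read from myfile.txt.

```shell
# Hypothetical log lines, piped in instead of read from myfile.txt.
anchored=$(printf '%s\n' \
  'Sep 13 10:01 boot' \
  'Aug 02 09:00 rotated the Sep 13 log' \
  'Sep 14 11:30 halt' | grep -E "^Sep 13")

# Only the first line starts with "Sep 13"; the second merely contains it.
echo "$anchored"
```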
Escaping
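The characters ^ and \ are metacharacters (control characters): ^ means "at the beginning of a line", and \ escapes the character that follows it. Escaping a metacharacter with a backslash makes it an ordinary, literal character again. The patterns below are in single quotes so that the shell passes the backslashes to grep unchanged; inside double quotes, the shell itself would turn \\ into \ first.
grep '^hallo' file
finds all lines in file that begin with "hallo".
grep '\^hallo' file
finds all lines in file that contain "^hallo".
grep '\\\^hallo' file
finds all lines in file that contain "\^hallo".
grep '\\\\\^hallo' file
finds all lines in file that contain "\\^hallo".
And so on: every literal backslash costs two backslashes in the pattern.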
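The escaping rules can be checked with a small sketch. The three input lines are made up, and grep -c is used to count matching lines. The sketch uses single quotes because, inside double quotes, the shell consumes one level of backslashes before grep ever sees the pattern; the last command shows that effect.

```shell
# Three made-up input lines: plain, with a caret, with backslash and caret.
lines=$(printf '%s\n' 'hallo' '^hallo' '\^hallo')

# grep -c prints the number of matching lines.
start=$(printf '%s\n' "$lines" | grep -c '^hallo')     # anchor: line starts with hallo
caret=$(printf '%s\n' "$lines" | grep -c '\^hallo')    # literal ^hallo
slash=$(printf '%s\n' "$lines" | grep -c '\\\^hallo')  # literal \^hallo

# In double quotes the shell turns \\ into \ before grep runs,
# so grep receives \^hallo here, the same pattern as for "caret".
dquote=$(printf '%s\n' "$lines" | grep -c "\\^hallo")

echo "$start $caret $slash $dquote"
```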
Write regular expressions
Finding text that fits the pattern defined by a regular expression is called "matching".
The beginning of a line
grep "^hallo" file
prints all lines from file that begin with "hallo".
The end of a line
grep "hallo$" file
prints all lines from file that end with "hallo".
Find string1 OR string2
grep -E "Sep|Aug" file
prints all lines from file that contain "Sep" or "Aug".
Match a group of characters
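grep -E "L[I1]NUX" file
prints all lines from file that contain "LINUX" or "L1NUX". The characters inside the brackets are listed without separators; a comma there would be matched literally, so "L[I,1]NUX" also matches "L,NUX".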
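A quick sketch of how a bracket expression treats a comma, using three made-up input lines. Everything between the brackets is a member of the set, so a comma written as a separator is matched like any other character.

```shell
lines=$(printf '%s\n' 'LINUX' 'L1NUX' 'L,NUX')

# The comma is part of the set, so "L,NUX" matches as well.
with_comma=$(printf '%s\n' "$lines" | grep -cE "L[I,1]NUX")

# Without the comma, only "LINUX" and "L1NUX" match.
without_comma=$(printf '%s\n' "$lines" | grep -cE "L[I1]NUX")

echo "$with_comma $without_comma"
```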
Match a range of characters
grep -E "foo[1-9]" file
prints all lines from file that contain "foo1", "foo2" and so on up to "foo9".
NOT the following characters
To invert matching for a group of characters, put ^ directly after the opening bracket:
grep -E "for[^ e]" file
prints all lines from file that contain "for" followed by a character that is neither a space nor an e. A line whose only "for" is in "for you" or "foresee" is not printed.
Also,
[^\n]*
means "all characters up to the next newline" in regex engines that understand \n as a newline. This can be useful when writing parsers.
With grep you have an additional way to invert matches:
grep -Ev "gettimeofday" file
prints all lines from file that do NOT contain "gettimeofday". The -v option is a feature of grep, not part of the regular expression syntax.
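The two kinds of inversion act on different levels, which the following sketch tries to show with four made-up lines. The negated bracket still has to match one character, so a "for" at the very end of a line does not match; -v simply flips which lines are printed.

```shell
lines=$(printf '%s\n' 'for you' 'foresee' 'forget' 'what for')

# Negated bracket: "for" must be followed by a character
# that is neither a space nor an e.
bracket=$(printf '%s\n' "$lines" | grep -E "for[^ e]")

# -v: print exactly the lines the same pattern does NOT match.
inverted=$(printf '%s\n' "$lines" | grep -Ev "for[^ e]")

echo "$bracket"
echo "---"
echo "$inverted"
```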
Any character
grep -E "L.nux" file
The dot matches any single character that is not a newline, so this prints lines from file that contain e.g. Linux, Lenux or L7nux.
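Because the dot is a metacharacter, matching a real dot needs a backslash. A small sketch with made-up input lines:

```shell
lines=$(printf '%s\n' 'Linux' 'Lenux' 'L7nux' 'Lnux' 'L.nux')

# Unescaped dot: exactly one arbitrary character between L and nux.
any=$(printf '%s\n' "$lines" | grep -cE "L.nux")

# Escaped dot: only a literal "." between L and nux.
literal=$(printf '%s\n' "$lines" | grep -cE 'L\.nux')

echo "$any $literal"
```

"Lnux" is not counted in either case, because the dot always stands for exactly one character.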
Match one or more times
grep -E "Li+nux" file
prints all lines from file in which L is followed by at least one i and then nux.
The + here is a quantifier. It means that the preceding i occurs one or more times. To accept zero or more times instead, replace the + with a *.
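The difference between + and * shows up only for the case of zero repetitions. A minimal sketch with made-up input:

```shell
lines=$(printf '%s\n' 'Lnux' 'Linux' 'Liinux')

# + requires at least one i, so "Lnux" is not counted.
plus=$(printf '%s\n' "$lines" | grep -cE "Li+nux")

# * also accepts zero i's, so all three lines are counted.
star=$(printf '%s\n' "$lines" | grep -cE "Li*nux")

echo "$plus $star"
```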
Match n times
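/etc/services is a table of services (protocols) and their port numbers. The service names are padded with blanks to 16 characters. .{16} matches exactly 16 arbitrary characters, so to replace the service name in front of port 3200 with sapdp00, you can do it like this (-r enables extended regular expressions, -i edits the file in place):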
sed -ri "s/.{16}3200/sapdp00 3200/" /etc/services
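To experiment without touching the real /etc/services, the following sketch works on two made-up entries held in a variable. It rests on a few assumptions that are not in the text above: the service name oldname is hypothetical, the pattern is anchored with ^ so that the 16 characters are counted from the start of the line, and the replacement is padded to 16 characters to keep the columns aligned.

```shell
# Two hypothetical entries in the layout described above:
# service name padded with blanks to 16 characters, then port/protocol.
services=$(printf '%-16s%s\n' 'oldname' '3200/tcp' 'ssh' '22/tcp')

# Pad the replacement name to 16 characters as well.
name=$(printf '%-16s' 'sapdp00')

# No -i here: the result is printed instead of written back to a file.
# -E is the portable spelling of GNU sed's -r.
result=$(printf '%s\n' "$services" | sed -E "s/^.{16}3200/${name}3200/")

echo "$result"
```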
Backreferences
Backreferences allow you to reuse parts of a match. For example, consider /var/log/nginx/access_log. It is full of lines like this:
127.0.0.1 - - [27/Dec/2020:12:07:27 -0800] "GET /wiki/load.php?lang=en&modules=jquery%2Csite%7Cjquery.client%7Cmediawiki.String%2CTitle%2Capi%2Cbase%2Ccldr%2CjqueryMsg%2Clanguage%2Cutil%7Cmediawiki.libs.pluralruleparser%7Cmediawiki.page.ready%2Cstartup%7Cskins.vector.legacy.js%7Cuser.defaults&skin=vector&version=o8vg2 HTTP/1.1" 200 277446 "http://localhost/wiki/index.php?title=Main_Page" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:83.0) Gecko/20100101 Firefox/83.0"
Every line describes one request to your web server. Maybe you are only interested in the part that starts with http and want to extract it. For that you can use backreferences:
cat access.log | sed "s;.*\(http[^\"]*\)\(.*\);\1;"
For the example above, this will output
http://localhost/wiki/index.php?title=Main_Page
Now let's look at what each part of this command does:
cat access.log |  : send access.log line by line to the next command (sed)
sed "s;           - substitute the following pattern; the ; separates the parts of the s command
.*                - an arbitrary number of characters that are not a newline. It is greedy, so it reaches up to the last place where the rest of the pattern can still match
\(                - start of the first group; what it matches is remembered as the first backreference
http[^\"]*        - match a string starting with http, followed by an arbitrary number of characters that are not a quote
\)                - this is the end of the (first) backreference
\(                - remember the following as the (second) backreference
.*                - an arbitrary number of characters that are not a newline, here the rest of the line
\)                - this is the end of the (second) backreference
;                 - this is the end of the pattern that has to be substituted
\1                - substitute the matched string with the first backreference
;                 - this is the end of the substitution
"                 - this is the end of the sed parameter
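The substitution can be tried on a single line without a log file. In this sketch the log line is a shortened, made-up version of the example above, and the sed script is put in single quotes so that the double quote inside the bracket needs no backslash. The second command illustrates that the leading .* is greedy: when a line contains http twice, the last occurrence ends up in \1.

```shell
# Shortened, made-up access log line in the format shown above.
line='127.0.0.1 - - [27/Dec/2020:12:07:27 -0800] "GET /wiki/load.php HTTP/1.1" 200 277446 "http://localhost/wiki/index.php?title=Main_Page" "Mozilla/5.0"'

# Keep only the first group, i.e. the part starting with http.
referer=$(printf '%s\n' "$line" | sed 's;.*\(http[^"]*\)\(.*\);\1;')
echo "$referer"

# Greedy .* in front: with two URLs on a line, the last one is kept.
last=$(printf '%s\n' 'a "http://one" b "http://two" c' | sed 's;.*\(http[^"]*\)\(.*\);\1;')
echo "$last"
```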
Read regular expressions
*
An asterisk is a quantifier meaning "any number of, including zero".
grep -E "Li*nux" file
Lnux
Linux
Liinux
Liiinux
An asterisk is placed directly after the atom that may be repeated any number of times. In the above example, the atom is the i character, but it can also be a group of characters:
grep -E "ba(na)*" file
ba
bana
banana
bananana
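Note that grep looks for the pattern anywhere in a line, so "ba(na)*" is satisfied by every line that contains "ba". The sketch below, with made-up input, uses grep's -x option (match the whole line) to show which strings the group repetition itself describes.

```shell
lines=$(printf '%s\n' 'ba' 'bana' 'banana' 'banan' 'bar')

# -x: the pattern must match the entire line.
whole=$(printf '%s\n' "$lines" | grep -xE "ba(na)*")

# Without -x, any line containing "ba" matches (zero repetitions of "na").
contains=$(printf '%s\n' "$lines" | grep -cE "ba(na)*")

echo "$whole"
echo "$contains"
```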
^
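The ^ character has three meanings, listed below. In the transcripts, grep apparently reads from the keyboard, so a line that matches is printed again right after it was typed.
- the beginning of a line if it stands at the beginning of a branch
# grep ^foo
barfoo
foo
foo
- "not" if it stands directly after an opening bracket
# grep for[^e]
foresee
for each
for each
- the ^ character itself if it is escaped
# grep "\^"
adsf
as^df
as^df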
?
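The ? character stands for
- "zero or one time" if it follows an atom, so colou?r matches color and colour
- non-greedy matching if it follows another quantifier. This is only available in Perl-style regular expressions (for example with GNU grep -P), not with grep -E:
http://.*?/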
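With grep -E, a ? after an atom makes that atom optional, and there is no non-greedy quantifier. A common workaround, sketched here with made-up input, is a negated bracket that cannot run past the next slash. The -o option (print only the matched part) is assumed to be available; it is supported by GNU and BSD grep but is not part of POSIX.

```shell
# ? as "zero or one": matches color and colour, but not colouur.
optional=$(printf '%s\n' 'color' 'colour' 'colouur' | grep -cE "colou?r")

url='see http://example.org/a/b/ now'

# Greedy: .* runs up to the last slash on the line.
greedy=$(printf '%s\n' "$url" | grep -oE "http://.*/")

# Negated bracket: stops at the first slash after the host name.
short=$(printf '%s\n' "$url" | grep -oE "http://[^/]*/")

echo "$optional"
echo "$greedy"
echo "$short"
```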
Understand regular expressions
Branches, Pieces and Atoms
A regular expression consists of one or more branches, separated by "|", the "OR" sign. If one of the branches matches, the expression matches:
grep -E "Tom|Harry"
Here, the expression is Tom|Harry, and Tom and Harry are both branches.
A branch consists of one or more pieces, which are matched in the order given. A piece is an atom optionally followed by a quantifier:
grep -E "To*m"
Here, T, o* and m are the pieces.
An atom is a character, a bracket expression or a subexpression. Each line can be an atom:
a
b
[^e]
(this is a subexpression)
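One practical consequence of this structure: an anchor belongs to the branch it is written in, and a subexpression in parentheses is needed to apply it to several alternatives at once. A minimal sketch with made-up input lines:

```shell
lines=$(printf '%s\n' 'Tom came' 'said Harry' 'said Tom')

# Two branches: "^Tom" (anchored) and "Harry" (anywhere in the line).
loose=$(printf '%s\n' "$lines" | grep -cE "^Tom|Harry")

# One branch: the anchor, then a subexpression holding both names.
grouped=$(printf '%s\n' "$lines" | grep -cE "^(Tom|Harry)")

echo "$loose $grouped"
```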
Quantifiers
A quantifier defines how often the atom in front of it may occur. The * quantifier means that the atom can occur zero, one or several times:
grep -E "To*m"
finds all lines containing Tm, Tom, Toom, Tooom and so on.