Difference between revisions of "Regular expressions"

From Linuxintro
 
(37 intermediate revisions by 3 users not shown)
 
= See also =
 
* [[scripting tutorial]]
 
* [[scripting tutorial]]
 +
* [http://www.linuxintro.org/regex RegEx ComPoser]
 
* [http://www.gskinner.com/RegExr/ RegEx training]
 
* [http://www.gskinner.com/RegExr/ RegEx training]
  
[[Category:Learning]]
[[Category:Concept]]
[[Category:Mindmap]]
 

Latest revision as of 09:06, 3 January 2021

With regular expressions, you can search for and replace string patterns. You could, for example, show all URLs in a file or all lines that start with a certain date.

Let's take the easier one as an example. This is how you show all lines in the file myfile.txt that begin with the string Sep 13:

grep -E "^Sep 13" myfile.txt

In this case ^Sep 13 is your regular expression. The ^ sign says that a line must start with the following string. And there is much more you can do with regular expressions.
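
To try this without touching a real log, here is a small sketch. The temporary file and its three lines are made up for the demo and are not part of the original example.

```shell
# Hypothetical sample data; file name and contents are assumptions for this demo.
sample=$(mktemp)
printf '%s\n' \
  'Sep 13 10:00:01 host sshd: session opened' \
  'Sep 14 09:12:44 host cron: job started' \
  'backup from Sep 13 finished' > "$sample"

# With the ^ anchor only the first line matches; the third merely contains "Sep 13".
starts=$(grep -E '^Sep 13' "$sample")
echo "$starts"

# Without the anchor, grep counts every line that contains the string.
anywhere=$(grep -c 'Sep 13' "$sample")
echo "$anywhere"

rm -f "$sample"
```

The pattern is written in single quotes here so that the shell passes it to grep unchanged.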

Escaping

The characters ^ and \ are metacharacters: they have a special meaning inside a regular expression. ^ means "at the beginning of a line". With a backslash, you can escape a metacharacter so that it is treated as an ordinary, literal character again:

grep "^hallo" file

finds all occurrences of "hallo" at the beginning of a line in file.

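grep '\^hallo' file

finds all lines in file that contain the literal string "^hallo". The single quotes keep the shell from processing the backslash, so grep receives the pattern unchanged.

grep '\\\^hallo' file

finds all lines in file that contain "\^hallo": \\ matches a literal backslash and \^ matches a literal ^.

grep '\\\\\^hallo' file

finds all lines in file that contain "\\^hallo". And so on...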
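
The difference between ^ as an anchor and ^ as a literal character is easiest to see on a small test file. The following is a sketch with made-up sample lines. It uses single quotes, so the shell does not change the backslashes before grep sees them.

```shell
# Hypothetical sample lines, written literally (printf %s does not interpret backslashes).
sample=$(mktemp)
printf '%s\n' \
  'hallo world' \
  '^hallo with caret' \
  '\^hallo with backslash and caret' > "$sample"

# ^ as anchor: only lines that begin with "hallo".
anchored=$(grep '^hallo' "$sample")
echo "$anchored"

# \^ as literal caret: lines that contain the text ^hallo.
literal=$(grep '\^hallo' "$sample")
echo "$literal"

# \\ is a literal backslash, \^ a literal caret: lines that contain \^hallo.
both=$(grep '\\\^hallo' "$sample")
echo "$both"

rm -f "$sample"
```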

Write regular expressions

Finding text that fits the pattern defined by a regular expression is called "matching".

The beginning of a line

grep "^hallo" file

prints all lines in file that begin with "hallo".

The end of a line

grep "hallo$" file

prints all lines in file that end with "hallo".

Find string1 OR string2

grep -E "Sep|Aug" file

prints all lines from file that contain "Sep" or "Aug".
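
A short sketch of the end-of-line anchor and the OR sign, using an invented three-line sample file:

```shell
# Hypothetical sample data for this demo.
sample=$(mktemp)
printf '%s\n' 'Aug 30 rotate logs' 'Sep 01 reboot' 'Oct 02 say hallo' > "$sample"

# Lines that contain "Sep" or "Aug", in the order they appear in the file.
either=$(grep -E 'Sep|Aug' "$sample")
echo "$either"

# Lines that end with "hallo".
at_end=$(grep 'hallo$' "$sample")
echo "$at_end"

rm -f "$sample"
```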

Match a group of characters

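grep -E "L[I1]NUX" file

prints all lines from file that contain "LINUX" or "L1NUX". Inside the brackets, the characters are simply listed one after another; a comma would be treated as one more character to match.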

Match a range of characters

grep -E "foo[1-9]" file

prints all lines from file that contain "foo1", "foo2" and so on up to "foo9".
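
Bracket expressions are easy to test on a handful of lines. The sketch below uses made-up sample data. It also shows that a comma inside the brackets is not a separator but one more character of the group.

```shell
# Hypothetical sample data for this demo.
sample=$(mktemp)
printf '%s\n' 'LINUX' 'L1NUX' 'L,NUX' 'foo0' 'foo5' > "$sample"

# Group: I or 1 between L and NUX.
group=$(grep -E 'L[I1]NUX' "$sample")
echo "$group"

# With a comma in the brackets, the line "L,NUX" matches as well.
with_comma=$(grep -cE 'L[I,1]NUX' "$sample")
echo "$with_comma"

# Range: one digit from 1 to 9 after foo, so foo0 is not printed.
range=$(grep -E 'foo[1-9]' "$sample")
echo "$range"

rm -f "$sample"
```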

NOT the following characters

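To invert matching for a group of characters, put a ^ directly after the opening bracket:

grep -E "for[^ e]" file

prints all lines from file that contain "for" followed by a character that is neither a space nor an e. So "for you" and "foresee" do not match, but "format" does. A "for" at the very end of a line does not match either, because [^ e] still needs one character to match.

Also

[^\n]*

means "all characters up to the next newline" in regex flavors that understand \n inside brackets, such as Perl or Python. This can be useful when writing parsers. It is not meant for grep: grep already works line by line, and POSIX bracket expressions treat the backslash as a literal character.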

With grep, you have an additional way to invert matches:

grep -Ev "gettimeofday" file

prints all lines from file that do NOT contain "gettimeofday". This is a feature of grep (the -v option), not of regular expressions.
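
The two kinds of "not" can be compared side by side. This is a sketch on invented sample lines: the negated bracket works on a single character inside the pattern, while -v inverts the result for the whole line.

```shell
# Hypothetical sample data for this demo.
sample=$(mktemp)
printf '%s\n' 'for you' 'foresee' 'format' 'gettimeofday for all' > "$sample"

# "for" followed by one character that is neither a space nor an e.
negated=$(grep -E 'for[^ e]' "$sample")
echo "$negated"

# Every line that does not contain "gettimeofday".
inverted=$(grep -Ev 'gettimeofday' "$sample")
echo "$inverted"

rm -f "$sample"
```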

Any character

grep -E "L.nux" file

The . matches any single character that is not a newline, so this prints all lines from file that contain e.g. Linux, Lenux or L7nux.

Match one or more times

grep -E "L[i]+nux" file

prints all lines from file in which the i between L and nux occurs at least once. The + here is a quantifier. It means that i occurs 1 or more times. To accept 0 or more times, replace the + by a *.
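
The following sketch runs the dot, the + and the * against the same invented word list, so the differences become visible:

```shell
# Hypothetical sample data for this demo.
sample=$(mktemp)
printf '%s\n' 'Lnux' 'Linux' 'Liiinux' 'L7nux' > "$sample"

# . stands for exactly one arbitrary character.
dot=$(grep -E 'L.nux' "$sample")
echo "$dot"

# i+ stands for one or more i.
plus=$(grep -E 'Li+nux' "$sample")
echo "$plus"

# i* stands for zero or more i, so Lnux matches too.
star=$(grep -E 'Li*nux' "$sample")
echo "$star"

rm -f "$sample"
```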

Match n times

/etc/services is a table of services (protocols) and their port numbers. The service names are padded with blanks to 16 characters. The quantifier {16} means "exactly 16 times", so .{16} matches any 16 characters. If you want to replace the service name for port 3200 with sapdp00, you do it like this:

sed -ri "s/.{16}3200/sapdp00 3200/" /etc/services
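
Before editing the real /etc/services in place, the substitution can be tried on a scratch file. The sketch below builds a two-line stand-in with invented service names (the real file has more columns and comments) and prints the result instead of using -i. It uses -E, which selects extended regular expressions like GNU sed's -r does.

```shell
# Hypothetical stand-in for /etc/services: name padded to 16 characters, then port/protocol.
services=$(mktemp)
printf '%-16s%s\n' 'http' '80/tcp' 'oldname' '3200/tcp' > "$services"

# .{16} consumes the 16-character name column in front of port 3200.
result=$(sed -E 's/.{16}3200/sapdp00 3200/' "$services")
echo "$result"

rm -f "$services"
```

Only the line for port 3200 changes; the http line is printed as it was.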

Backreferences

Backreferences allow you to reuse parts of a match. For example, consider /var/log/nginx/access.log. It is full of lines like this:

127.0.0.1 - - [27/Dec/2020:12:07:27 -0800] "GET /wiki/load.php?lang=en&modules=jquery%2Csite%7Cjquery.client%7Cmediawiki.String%2CTitle%2Capi%2Cbase%2Ccldr%2CjqueryMsg%2Clanguage%2Cutil%7Cmediawiki.libs.pluralruleparser%7Cmediawiki.page.ready%2Cstartup%7Cskins.vector.legacy.js%7Cuser.defaults&skin=vector&version=o8vg2 HTTP/1.1" 200 277446 "http://localhost/wiki/index.php?title=Main_Page" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:83.0) Gecko/20100101 Firefox/83.0"

Every line describes one request to your web server. But maybe you are only interested in the part that starts with http, and you want to extract that part. Then you can use backreferences:

cat access.log | sed "s;.*\(http[^\"]*\)\(.*\);\1;"

will output for the example above

http://localhost/wiki/index.php?title=Main_Page

Now let's look at the line

cat access.log | sed "s;.*\(http[^\"]*\)\(.*\);\1;"

what does it do?

cat access.log | - send the contents of access.log to the next command (sed), which processes it line by line
sed "s; - substitute the following pattern; the ; after the s is the delimiter that separates the parts of the command
.* - an arbitrary number of characters that are not a newline. Because .* is greedy, it takes as much as possible while the rest of the pattern can still match, so it runs up to the last http in the line
\( - take the following match and remember it as the first backreference
http[^\"]* - match a string starting with http, followed by an arbitrary number of characters that are not a double quote (the shell turns \" into a plain ")
\) - this is the end of the first backreference
\( - remember the following as the second backreference
.* - an arbitrary number of characters that are not a newline, here the rest of the line
\) - this is the end of the second backreference
; - this is the end of the pattern that has to be substituted
\1 - replace the matched text with the first backreference
; - this is the end of the substitution
" - this is the end of the sed parameter

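Here is a sketch that runs a backreference substitution on a single, shortened log line held in a shell variable, so no log file is needed. The shortened line and the second "swap" example are made up for illustration.

```shell
# Shortened, hypothetical access log line (request path and user agent abbreviated).
line='127.0.0.1 - - [27/Dec/2020:12:07:27 -0800] "GET /wiki/ HTTP/1.1" 200 277446 "http://localhost/wiki/index.php?title=Main_Page" "Mozilla/5.0"'

# Keep only the first remembered group: http up to the next double quote.
referer=$(printf '%s\n' "$line" | sed 's;.*\(http[^"]*\).*;\1;')
echo "$referer"

# Two groups can also be reused in a different order.
swapped=$(printf '%s\n' 'Main_Page localhost' | sed 's;\([^ ]*\) \([^ ]*\);\2 \1;')
echo "$swapped"
```

The sed script is in single quotes here, so the double quote inside the brackets needs no backslash.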
Read regular expressions

*

An asterisk is a quantifier saying "any number of, including none".

grep -E "Li*nux" file
Lnux
Linux
Liinux
Liiinux

An asterisk is placed directly after the atom that may be repeated any number of times. In the above example, the atom is the i character, but it can also be a group of characters:

grep -E "ba(na)*" file
ba
bana
banana
bananana
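
One thing to keep in mind when testing a starred group: grep looks for the pattern anywhere in a line, and "zero repetitions" is allowed. The sketch below, on invented sample lines, compares the plain search with grep's -x option, which only accepts lines that match as a whole.

```shell
# Hypothetical sample data for this demo.
sample=$(mktemp)
printf '%s\n' 'ba' 'banana' 'bananas' 'band' > "$sample"

# Every line contains "ba" followed by zero or more "na", so all four lines count.
anywhere=$(grep -cE 'ba(na)*' "$sample")
echo "$anywhere"

# With -x the whole line has to be ba, bana, banana and so on.
whole=$(grep -xE 'ba(na)*' "$sample")
echo "$whole"

rm -f "$sample"
```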

^

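The ^ character stands for

  • the beginning of a line if it stands at the beginning of a branch
# grep "^foo"
barfoo
foo
foo
  • "not" if it stands directly after an opening bracket
# grep "for[^e]"
foresee
for each
for each
  • the ^ character itself if it is escaped
# grep "\^"
adsf
as^df
as^df

In these sessions grep reads from the keyboard: every line is typed in, and a line appears a second time only if grep prints it as a match.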

?

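The ? character stands for

  • "zero or one time" if it follows an atom in an extended regular expression: colou?r matches color and colour
  • non-greedy matching if it follows another quantifier. This is a Perl-style extension that POSIX grep -E does not support; GNU grep offers it with the -P option:
http://.*?/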
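
In POSIX tools such as grep -E and sed, a ? after an atom means "zero or one time", and there is no non-greedy .*? as in Perl-style regular expressions. A common substitute, sketched here on a made-up URL and word list, is a negated bracket expression that cannot run past the next slash.

```shell
# ? as a quantifier: the u is optional.
optional=$(printf '%s\n' 'color' 'colour' 'colouur' | grep -E 'colou?r')
echo "$optional"

# Hypothetical URL for this demo.
url='http://localhost/wiki/index.php'

# Greedy .* runs up to the last slash.
greedy=$(printf '%s\n' "$url" | sed -E 's;(http://.*/).*;\1;')
echo "$greedy"

# [^/]* stops at the first slash after the host, like a non-greedy match would.
shortest=$(printf '%s\n' "$url" | sed -E 's;(http://[^/]*/).*;\1;')
echo "$shortest"
```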

Understand regular expressions

Branches, Pieces and Atoms

A regular expression consists of one or more branches, separated by "|", the "OR" sign. If one of the branches matches, the expression matches:

grep -E "Tom|Harry"

Here, the expression is Tom|Harry, and Tom and Harry are both branches.

A branch consists of one or more pieces, concatenated in order. A piece is an atom, optionally followed by a quantifier:

grep -E "To*m"

Here, T, o* and m are the three pieces.

An atom is a character, a bracket expression or a subexpression. Each of the following lines is an atom:

a
b
[^e]
(this is a subexpression)

Quantifiers

A quantifier defines how often an atom may occur. The * quantifier says that the atom in front of it can occur 0, 1 or several times:

grep -E "To*m"

will find all lines containing Tm, Tom, Toom or Tooom.
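
To see branches, pieces and a subexpression at work, here is a sketch on an invented list of names:

```shell
# Hypothetical sample data for this demo.
sample=$(mktemp)
printf '%s\n' 'Tm' 'Tom' 'Tooom' 'Harry' 'Tim' > "$sample"

# Two branches: a line matches if it contains Tom or Harry.
branches=$(grep -E 'Tom|Harry' "$sample")
echo "$branches"

# Three pieces T, o* and m: any number of o between T and m.
pieces=$(grep -E 'To*m' "$sample")
echo "$pieces"

# A subexpression as an atom: o or i between T and m.
subexpr=$(grep -E 'T(o|i)m' "$sample")
echo "$subexpr"

rm -f "$sample"
```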

See also