OPS102 - Regular Expressions

From CDOT Wiki
Revision as of 22:22, 4 December 2023 by Chris Tyler

Regular Expressions are search patterns for "Regular Text". They are used by many different tools and languages, including the Linux grep command, the Windows findstr command, less, vi/vim, sed, awk, perl, python, and many others.

Why Use Regular Expressions?

Regular Expressions can be a little daunting to learn: they often look like someone was just bashing their head against the keyboard (or like a cat was lying on it). But they are very powerful: a well-written regular expression can replace many pages of code in a programming language such as C or C++, so it is worth investing some time to understand them.

The Seven Basic Elements of Regular Expressions

Characters

In a regular expression (regexp), any character that doesn't otherwise have a special meaning matches that character. So the digit "5", for example, matches the digit "5"; similarly "cat" matches the letters "c", "a", and "t" in sequence.

A backslash can be used to remove any special meaning which a character has. The period character "." is a type of wildcard (see below), so to search for a literal period, we place a backslash in front of it: "\."
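As a small illustration, the following sketch uses Python's standard re module to show a literal match and an escaped period. Python is an assumption made for these examples (the page does not tie them to one tool), and the sample strings are made up.

```python
import re

# Literal characters match themselves, in sequence.
found_cat = re.search(r"cat", "concatenate") is not None

# An unescaped period is a wildcard, so "3.14" also matches "3514".
loose = re.search(r"3.14", "3514") is not None

# Escaping the period makes it literal, so "3514" no longer matches...
strict = re.search(r"3\.14", "3514") is not None

# ...but text containing a real "3.14" still does.
exact = re.search(r"3\.14", "pi is about 3.14") is not None

print(found_cat, loose, strict, exact)
```

This prints True True False True: only the escaped pattern insists on an actual period.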

Wildcards

A period "." will match any single character. Similarly, three periods "..." will match any three characters.
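A minimal sketch of the wildcard, again assuming Python's re module; the word list is invented for the example. re.fullmatch is used so that the whole string, not just part of it, has to fit the pattern.

```python
import re

words = ["cat", "cot", "c t", "ct", "coat", "dog"]

# "c.t" needs exactly one character (any character) between "c" and "t".
one_wild = [w for w in words if re.fullmatch(r"c.t", w)]

# "..." matches any three characters, whatever they are.
three_wild = [w for w in words if re.fullmatch(r"...", w)]

print(one_wild)
print(three_wild)
```

The first list is ['cat', 'cot', 'c t'] (a space counts as a character); the second adds 'dog', since any three characters will do.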

Bracket Expressions / Character Classes

Bracket Expressions or Character Classes are contained in square brackets "[ ]":

  • A list of characters in square brackets will match any one character from the list of characters: "[abc]" will match "a", "b", or "c"
  • A range of characters in square brackets, written as a starting character, a dash, and an ending character, will match any character in that range: "[0-9]" will match any one digit.
  • There are some pre-defined named character classes. These are selected by specifying the name of the character class surrounded by colons and square brackets, placed within outer square brackets, like "[[:digit:]]". The available names are:
    • alnum - alphanumeric
    • alpha - alphabetic characters
    • blank - horizontal whitespace (space, tab)
    • cntrl - control characters
    • digit - digits
    • graph - letters, digits, and punctuation
    • print - letters, digits, punctuation, and space
    • punct - punctuation marks
    • space - horizontal and vertical whitespace (space, tab, newline, carriage return, vertical tab, form feed)
    • upper - UPPERCASE letters
    • lower - lowercase letters
    • xdigit - hexadecimal digits (digits plus a-f and A-F)
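  • Ranges, lists, and named character classes may be combined - e.g., "[[:digit:]+.,-]" "[[:digit:][:punct:]]" "[0-9_*]". Note that a dash placed between two characters forms a range, so a literal dash belongs at the start or end of the list.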
  • To invert a character class, add a caret ^ character as the first character after the opening square bracket: "[^[:digit:]]" matches any non-digit character, and "[^:]" matches any character that is not a colon.
  • To include a literal caret, place it anywhere except the first position, for example at the end of the character class. To include a literal closing square bracket, place it at the start of the character class. A literal dash may be placed at the start or at the end.
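The list, range, and inversion rules above can be sketched as follows, assuming Python's re module. One caveat: Python's re does not support the named classes such as "[[:digit:]]" (those work in tools like grep), so this sketch sticks to explicit lists and ranges. The sample text is made up.

```python
import re

text = "a1-b2]c^3"

# A list: any one of "a", "b", or "c".
listed = re.findall(r"[abc]", text)

# A range: any one digit.
digits = re.findall(r"[0-9]", text)

# An inverted class: any character that is NOT a digit.
non_digits = re.findall(r"[^0-9]", text)

# Literal "]", "^" and "-": the bracket goes first, the caret is not first,
# and the dash goes last so it cannot be read as a range.
specials = re.findall(r"[]^-]", text)

print(listed, digits, non_digits, specials)
```

Here specials comes out as ['-', ']', '^'], in the order those characters appear in the text.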

Repetition

  • A repeat count can be placed in curly brackets. It applies to the previous element: "x{3}" matches "xxx"
  • A repeat can be a range, written as min,max in curly brackets: "x{2,5}" will match "xx", "xxx", "xxxx", or "xxxxx"
  • The maximum value in a range can be omitted: "x{2,}" will match two or more "x" characters in a row
  • There are short forms for some commonly-used ranges:
    • "*" is the same as "{0,}" (zero or more)
    • "+" is the same as "{1,}" (one or more)
    • "?" is the same as "{0,1}" (zero or one)
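The repeat counts and their short forms can be tried out with a sketch like the one below, assuming Python's re module. The helper name full_matches and the sample strings are hypothetical, chosen only for this illustration.

```python
import re

samples = ["x", "xx", "xxx", "xxxxx", "xxxxxx"]

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

exactly_three = full_matches(r"x{3}", samples)    # exact count
two_to_five = full_matches(r"x{2,5}", samples)    # min,max range
two_or_more = full_matches(r"x{2,}", samples)     # open-ended range

# Short forms: * is {0,}, + is {1,}, ? is {0,1}.
star = full_matches(r"ab*", ["a", "ab", "abbb"])
plus = full_matches(r"ab+", ["a", "ab", "abbb"])
optional = full_matches(r"colou?r", ["color", "colour", "colouur"])

print(exactly_three, two_to_five, two_or_more)
print(star, plus, optional)
```

Note that the repeat applies only to the element just before it: in "ab*" the star repeats "b", not "ab".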

Alternation

  • The vertical bar indicates alternation - either the expression on the left or the right can be matched: "hot|cold" will match "hot" or "cold"

Grouping

  • Elements placed in parentheses are treated as a group, and can be repeated: "(na)* batman" will match "nananana batman" and "nananananananana batman"
  • Grouping may also be used to limit alternation: "(fire|green)house" will match "firehouse" and "greenhouse"
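Alternation and grouping can be sketched together, assuming Python's re module; the word lists are invented for the example. The third pattern shows what happens when the parentheses are left out.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Alternation: either side of the bar may match.
temps = full_matches(r"hot|cold", ["hot", "cold", "warm"])

houses_in = ["firehouse", "greenhouse", "doghouse", "fire"]

# Grouping limits the alternation to "fire" or "green", followed by "house".
grouped = full_matches(r"(fire|green)house", houses_in)

# Without the group, the choice is between "fire" and "greenhouse".
ungrouped = full_matches(r"fire|greenhouse", houses_in)

# A group can be repeated as a unit.
batman = re.fullmatch(r"(na)* batman", "nananana batman") is not None

print(temps, grouped, ungrouped, batman)
```

The grouped pattern accepts "firehouse" and "greenhouse", while the ungrouped one accepts "greenhouse" and the bare word "fire".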

Anchors

  • Anchors match locations, not characters.
  • A caret symbol will match the start of a line: "^[[:upper:]]" will match lines that start with an uppercase letter.
  • A dollar sign will match the end of a line: "[[:punct:]]$" will match lines that end with a punctuation mark.
  • The two characters may be used together: "cat" will match the word "cat" anywhere on a line, but "^cat$" will only match lines that contain nothing besides the word "cat". Likewise, "^[0-9.]+$" will match lines that are made up of only digits and dot characters.
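The effect of the anchors can be seen in a short sketch, assuming Python's re module and a made-up list of lines. re.search is used here because it looks for a match anywhere in the line, which is what makes the anchors matter.

```python
import re

lines = ["cat", "concatenate", "cat nap", "3.14", "42", "v1.0"]

def matching(pattern):
    """Return the lines in which the pattern is found somewhere."""
    return [line for line in lines if re.search(pattern, line)]

anywhere = matching(r"cat")       # "cat" anywhere on the line
at_start = matching(r"^cat")      # line must begin with "cat"
whole = matching(r"^cat$")        # line is exactly "cat"
numeric = matching(r"^[0-9.]+$")  # only digits and dots, one or more

print(anywhere, at_start, whole, numeric)
```

Without the "+", the last pattern would accept only lines that are exactly one digit or dot long, since a bracket expression matches a single character.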