Regex For Detection Company Names In Python

I want to detect company names with a regex in Python. This is my idea: A company name should have between 1 and 3 words. The first word in the company name should be capitalized. One of

Solution 1:

Use

\b[A-Z]\w+(?:\.com?)?(?:[ -]+(?:&[ -]+)?[A-Z]\w+(?:\.com?)?){0,2}[,\s]+(?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)\b

See regex proof.

EXPLANATION

--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  [A-Z]                    any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    co                       'co'
--------------------------------------------------------------------------------
    m?                       'm' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )?                       end of grouping
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (between 0 and 2
                           times (matching the most amount
                           possible)):
--------------------------------------------------------------------------------
    [ -]+                    any character of: ' ', '-' (1 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
--------------------------------------------------------------------------------
      &                        '&'
--------------------------------------------------------------------------------
      [ -]+                    any character of: ' ', '-' (1 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )?                       end of grouping
--------------------------------------------------------------------------------
    [A-Z]                    any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
--------------------------------------------------------------------------------
      \.                       '.'
--------------------------------------------------------------------------------
      co                       'co'
--------------------------------------------------------------------------------
      m?                       'm' (optional (matching the most
                               amount possible))
--------------------------------------------------------------------------------
    )?                       end of grouping
--------------------------------------------------------------------------------
  ){0,2}                   end of grouping
--------------------------------------------------------------------------------
  [,\s]+                   any character of: ',', whitespace (\n, \r,
                           \t, \f, and " ") (1 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  (?i:                     group, but do not capture (case-
                           insensitive) (with ^ and $ matching
                           normally) (with . not matching \n)
                           (matching whitespace and # normally):
--------------------------------------------------------------------------------
    ltd                      'ltd'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    llc                      'llc'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    inc                      'inc'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    plc                      'plc'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    co                       'co'
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
--------------------------------------------------------------------------------
      rp                       'rp'
--------------------------------------------------------------------------------
    )?                       end of grouping
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    group                    'group'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    holding                  'holding'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    gmbh                     'gmbh'
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
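
As a sketch of how the breakdown above could live next to the pattern itself, the same expression can be written with re.VERBOSE, which ignores whitespace outside character classes and allows # comments. The name COMPANY_RE and the two-line sample text are illustrative assumptions, not part of the original answer.

```python
import re

# Hypothetical name; the pattern is the one from this answer, only laid out differently.
COMPANY_RE = re.compile(r"""
    \b[A-Z]\w+            # first word: a capital letter, then 1+ word characters
    (?:\.com?)?           # optional .co or .com
    (?:                   # up to two further words
        [ -]+             #   separated by spaces or hyphens
        (?:&[ -]+)?       #   optionally joined with an ampersand
        [A-Z]\w+          #   each starting with a capital letter
        (?:\.com?)?       #   each with an optional .co or .com
    ){0,2}
    [,\s]+                # commas and/or whitespace before the suffix
    (?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)  # legal suffix, any case
    \b
""", re.VERBOSE)

# The original one-line pattern, kept for comparison.
ONE_LINE = r"\b[A-Z]\w+(?:\.com?)?(?:[ -]+(?:&[ -]+)?[A-Z]\w+(?:\.com?)?){0,2}[,\s]+(?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)\b"

text = "I work in Amazon.com Inc.\nI have worked in CocaCola Gmbh"
verbose_hits = COMPANY_RE.findall(text)
one_line_hits = re.findall(ONE_LINE, text)
print(verbose_hits)
print(verbose_hits == one_line_hits)
```

One detail the table glosses over: for str patterns in Python 3, \w also matches non-ASCII letters and digits unless the re.ASCII flag is passed.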

Python code:

import re

regex = r"\b[A-Z]\w+(?:\.com?)?(?:[ -]+(?:&[ -]+)?[A-Z]\w+(?:\.com?)?){0,2}[,\s]+(?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)\b"

test_str = (
    "I work in Amazon.com Inc.\n"
    "Company named Swiss Medic Holding invested in vaccine\n"
    "what do you think about Abercrombie & Fitch Co. ?\n"
    "do you work in Delta Group?\n"
    "I have worked in CocaCola Gmbh"
)

print(re.findall(regex, test_str))

Results: ['Amazon.com Inc', 'Swiss Medic Holding', 'Abercrombie & Fitch Co', 'Delta Group', 'CocaCola Gmbh']
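
A few extra inputs help show which of the stated rules the pattern actually enforces. This is only a sketch: the sample strings below are made up for illustration and were not part of the original question.

```python
import re

regex = r"\b[A-Z]\w+(?:\.com?)?(?:[ -]+(?:&[ -]+)?[A-Z]\w+(?:\.com?)?){0,2}[,\s]+(?i:ltd|llc|inc|plc|co(?:rp)?|group|holding|gmbh)\b"

# Made-up inputs (assumptions for illustration), each probing one rule.
samples = [
    "Delta GROUP",                         # suffix is matched case-insensitively
    "amazon Inc",                          # first word is not capitalized
    "General Swiss Medic Pharma Holding",  # four words before the suffix
    "Apple Incorporated",                  # "Inc" is only the start of a longer word
]

results = {s: re.findall(regex, s) for s in samples}
for sample, hits in results.items():
    print(f"{sample!r} -> {hits}")
```

The third input yields only 'Swiss Medic Pharma Holding': the {0,2} quantifier allows at most three capitalized words before the suffix, so the leading word is dropped rather than the whole name being rejected. The last input yields nothing because the trailing \b refuses a suffix that continues into a longer word.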

Solution 2:

Use https://regex101.com to test your regexes; it's great. For your specific example, here is a regex that does what you want. I don't see the need to test for the optional .com in this example, because [^ ]* already matches any run of non-space characters, dots included.

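import re

# Raw string, so that \. reaches the regex engine as an escaped dot.
regex_company = r'[A-Z]([^ ]*[ &]*){0,2}(Inc\.|Ltd|GmbH|AG|Gmbh|Group|Holding|Co\.)'

# Sample sentences (the same ones used in Solution 1).
address_1 = "I work in Amazon.com Inc."
address_2 = "Company named Swiss Medic Holding invested in vaccine"
address_3 = "what do you think about Abercrombie & Fitch Co. ?"
address_4 = "do you work in Delta Group?"
address_5 = "I have worked in CocaCola Gmbh"

for address in [address_1, address_2, address_3, address_4, address_5]:
    found = re.search(regex_company, address)
    if found:
        print(found)

# prints:
# <re.Match object; span=(10, 25), match='Amazon.com Inc.'>
# <re.Match object; span=(14, 33), match='Swiss Medic Holding'>
# <re.Match object; span=(24, 47), match='Abercrombie & Fitch Co.'>
# <re.Match object; span=(15, 26), match='Delta Group'>
# <re.Match object; span=(17, 30), match='CocaCola Gmbh'>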
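
One caveat when reusing this pattern: it contains capturing groups, so re.findall returns tuples of group contents instead of whole matches. A possible workaround, sketched here under the assumption that only the full name is wanted, is to make the groups non-capturing. The variable names are illustrative.

```python
import re

capturing = r'[A-Z]([^ ]*[ &]*){0,2}(Inc\.|Ltd|GmbH|AG|Gmbh|Group|Holding|Co\.)'
# Same pattern with (?:...) groups, so findall yields the full match text.
non_capturing = r'[A-Z](?:[^ ]*[ &]*){0,2}(?:Inc\.|Ltd|GmbH|AG|Gmbh|Group|Holding|Co\.)'

sentences = [
    "I work in Amazon.com Inc.",
    "Company named Swiss Medic Holding invested in vaccine",
    "what do you think about Abercrombie & Fitch Co. ?",
    "do you work in Delta Group?",
    "I have worked in CocaCola Gmbh",
]

# With capturing groups, each hit is a tuple with one entry per group.
grouped = re.findall(capturing, sentences[0])
# With non-capturing groups, each hit is the matched company name itself.
whole_matches = [hit for s in sentences for hit in re.findall(non_capturing, s)]

print(grouped)
print(whole_matches)
```

Calling found.group(0) on the result of re.search, or iterating with re.finditer, gives the whole match as well without changing the pattern.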
