Pytech Resources

Python Regex Example

Nov 10, 2024

Using Python re module to Parse a Malaysian Identity Card

Regular expressions can often be really handy. In this tutorial we will learn how to parse a Malaysian identity card number (aka MyKad number) and extract some data from it. The first step is to understand the structure of the number, which is currently:

YYMMDD-BP-nnnG

where

  • YYMMDD represents the birthdate (two-digit year, month, day), following the truncated ISO 8601:2000 format.
  • BP is a two-digit code denoting the place of birth.
  • nnnG is a randomly generated serial number. The last digit, G, represents the gender:
    • odd for male
    • even for female
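
As a quick illustration of the layout above, the fields can be sliced by position once the hyphens are removed. This is only a sketch: split_fields is a hypothetical helper name, and the century of the two-digit year is simply whatever datetime.strptime chooses for %y (69-99 become 19xx, 00-68 become 20xx), which may not be right for every card holder.

```python
from datetime import datetime


def split_fields(digits):
    """Slice a 12-digit IC number (hyphens already removed) into its parts."""
    if len(digits) != 12 or not digits.isdigit():
        raise ValueError("expected exactly 12 digits")
    return {
        "birthdate": digits[0:6],     # YYMMDD
        "birthplace": digits[6:8],    # BP
        "serial": digits[8:12],       # nnnG
        "gender_digit": digits[11],   # G, the last digit of the serial
    }


fields = split_fields("850521223454")

# Assumption: strptime decides the century for the two-digit year.
born = datetime.strptime(fields["birthdate"], "%y%m%d").date()

print(fields["birthplace"])  # 22
print(born)                  # 1985-05-21
```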

We could easily use the string split method to break the IC apart on the hyphen ("-") character. However, the IC number is very often stored in a database as digits only, with the hyphens stripped out. A regular expression is more flexible, as a single pattern can parse both forms of the IC.
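
To see the limitation, here is a rough sketch of the split approach. The name split_ic is hypothetical and used only for this illustration; it handles the hyphenated form but has nothing to split on once the hyphens are gone.

```python
def split_ic(ic):
    """Split a hyphenated IC number into (birthdate, birthplace, serial)."""
    parts = ic.split("-")
    if len(parts) != 3:
        return None  # no hyphens to split on, or an unexpected layout
    return tuple(parts)


print(split_ic("850521-22-3454"))  # ('850521', '22', '3454')
print(split_ic("850521223454"))    # None
```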

import re

pat = r"""
\b                      # word boundary
(?P<birthdate>\d{6})    # named group capture of birthdate, six digits
-?                      # optional -
(?P<birthplace>\d{2})   # named group, birthplace, 2 digits
-?                      # optional -
\d{3}                   # next 3 digits
(?P<gender>\d)          # capture last digit representing gender
\b                      # word boundary
"""

vpo = re.compile(pat, re.VERBOSE)

codes = [('01', '21', '22', '23', '24'), ('02', '25', '26', '27'), ('03', '28', '29'),
         ('04', '30'), ('05', '31', '59'), ('06', '32', '33'), ('07', '34', '35'),
         ('08', '36', '37', '38', '39'), ('09', '40'), ('10', '41', '42', '43', '44'),
         ('11', '45', '46'), ('12', '47', '48', '49'), ('13', '50', '51', '52', '53'),
         ('14', '54', '55', '56', '57'), ('15', '58'), ('16',), ('82',)]

# place of birth, index-aligned with codes:
# place[i] is the state for every code listed in codes[i]
place = ('Johor', 'Kedah', 'Kelantan', 'Malacca', 'Negri Sembilan',
         'Pahang', 'Penang', 'Perak', 'Perlis', 'Selangor', 'Trengganu', 'Sabah',
         'Sarawak', 'Kuala Lumpur', 'Labuan', 'Putrajaya', 'Unknown')

def get_gender(n):
    # odd last digit means male, even means female
    return 'Male' if int(n) % 2 else 'Female'

def get_place(code):
    # look up the state whose code tuple contains this birthplace code
    for i, item in enumerate(codes):
        if code in item:
            return place[i]
    return None

def parse_ic(ic):
    # returns (birthdate, place, gender), or None if ic does not match
    m = vpo.search(ic)
    if m:
        return (m.group('birthdate'),
                get_place(m.group('birthplace')),
                get_gender(m.group('gender')))
    return None

if __name__ == '__main__':
    ic = '850521-22-3454'
    print(parse_ic(ic))   # ('850521', 'Johor', 'Female')
    ic = '850521223454'
    print(parse_ic(ic))   # ('850521', 'Johor', 'Female')
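
Because the pattern is delimited by word boundaries rather than anchored to the start and end of the string, the same expression can also pick IC numbers out of longer text. The following is a small sketch of that idea; IC_PATTERN and the sample text are made up for this illustration, and the pattern is the one above written on a single line.

```python
import re

# Same expression as the verbose pattern above, compacted onto one line.
IC_PATTERN = re.compile(
    r"\b(?P<birthdate>\d{6})-?(?P<birthplace>\d{2})-?\d{3}(?P<gender>\d)\b"
)

# Hypothetical sample text containing both forms of the IC number.
text = "Applicants: 850521-22-3454 and 900101145671, ref 12345."

# finditer yields one match object per IC number found in the text.
found = [m.groupdict() for m in IC_PATTERN.finditer(text)]

for item in found:
    print(item)
```

The short reference number at the end of the sample text is skipped because it does not contain twelve digits.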