Nov 10, 2024
Regular expressions can be really handy. In this tutorial we will learn how to parse a Malaysian identity card number (aka MyKad) and extract some data from it. The first step is to understand the structure of the number, which is currently:
YYMMDD-BP-nnnG
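where YYMMDD is the date of birth, BP is a two-digit place-of-birth code, nnn are the next three digits, and G is the last digit, which represents gender (odd for male, even for female).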
We could easily use the string split method to break the IC number on the hyphen ("-") character. However, the IC number is very often stored in a database as digits only, with the hyphens stripped off. A regular expression is more flexible, because a single pattern can parse both forms of the IC number.
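To make the difference concrete, here is a minimal sketch comparing the two approaches on both forms of the number. The variable names and the simplified pattern `simple` are illustrative only; the pattern is a cut-down version with optional hyphens, not the full pattern used in this tutorial.

```python
import re

hyphenated = '850521-22-3454'
digits_only = '850521223454'

# split() only works when the hyphens are present
print(hyphenated.split('-'))   # ['850521', '22', '3454']
print(digits_only.split('-'))  # ['850521223454']

# A pattern with optional hyphens handles both forms
simple = re.compile(r'(\d{6})-?(\d{2})-?(\d{4})')
print(simple.fullmatch(hyphenated).groups())   # ('850521', '22', '3454')
print(simple.fullmatch(digits_only).groups())  # ('850521', '22', '3454')
```

With split, the digits-only form comes back as a single unparsed string, while the regular expression yields the same three parts in both cases.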
```python
import re

pat = r"""
    \b                      # word boundary
    (?P<birthdate>\d{6})    # named group capture of birthdate, six digits
    -?                      # optional -
    (?P<birthplace>\d{2})   # named group, birthplace, 2 digits
    -?                      # optional -
    \d{3}                   # next 3 digits
    (?P<gender>\d)          # capture last digit representing gender
    \b                      # word boundary
"""
vpo = re.compile(pat, re.VERBOSE)

# birthplace codes; each tuple lines up with the same index in `place`
codes = [('01', '21', '22', '23', '24'),
         ('02', '25', '26', '27'),
         ('03', '28', '29'),
         ('04', '30'),
         ('05', '31', '59'),
         ('06', '32', '33'),
         ('07', '34', '35'),
         ('08', '36', '37', '38', '39'),
         ('09', '40'),
         ('10', '41', '42', '43', '44'),
         ('11', '45', '46'),
         ('12', '47', '48', '49'),
         ('13', '50', '51', '52', '53'),
         ('14', '54', '55', '56', '57'),
         ('15', '58'),
         ('16',),
         ('82',)]

# place of birth
place = ('Johor', 'Kedah', 'Kelantan', 'Malacca', 'Negri Sembilan',
         'Pahang', 'Penang', 'Perak', 'Perlis', 'Selangor', 'Trengganu',
         'Sabah', 'Sarawak', 'Kuala Lumpur', 'Labuan', 'Putrajaya', 'Unknown')


def get_gender(n):
    # an odd last digit means male, an even one means female
    return 'Male' if int(n) % 2 else 'Female'


def get_place(code):
    # find the tuple containing the code and return the matching place name
    for i, item in enumerate(codes):
        if code in item:
            return place[i]
    return None


def parse_ic(ic):
    # returns (birthdate, place, gender), or None if no IC number is found
    m = vpo.search(ic)
    if m:
        return (m.group('birthdate'),
                get_place(m.group('birthplace')),
                get_gender(m.group('gender')))
    return None


if __name__ == '__main__':
    ic = '850521-22-3454'
    print(parse_ic(ic))
    ic = '850521223454'
    print(parse_ic(ic))
```
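The birthdate comes back from parse_ic as a raw six-digit string. If a real date object is more useful, one possible follow-up is sketched below. The helper name `birthdate_to_date` is hypothetical, not part of the original code. Note the assumption about the century: a two-digit year is ambiguous, and Python's `%y` directive maps 69 to 99 onto 1969 to 1999 and 00 to 68 onto 2000 to 2068, so someone born in 1950 would be read as 2050 unless you apply your own rule.

```python
from datetime import date, datetime


def birthdate_to_date(yymmdd):
    """Convert a six-digit YYMMDD string to a date, or None if it is invalid.

    Assumption: the century is whatever strptime's %y directive picks
    (69-99 -> 19xx, 00-68 -> 20xx); the IC number itself does not say.
    """
    try:
        return datetime.strptime(yymmdd, '%y%m%d').date()
    except ValueError:
        # not a real calendar date, e.g. 30 February
        return None


print(birthdate_to_date('850521'))  # 1985-05-21
print(birthdate_to_date('850230'))  # None
```

Returning None for impossible dates also gives a cheap sanity check on the number, since the regular expression alone accepts any six digits.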