Maximizing KTP-OCR Performance using Regular Expression
Regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).
we will maximize the performance of KTP-OCR using RE. there was lof of part in KTP like “nik”, “name”, “alamat”, “desa” etc. so every part of it have it own condition. Example, like NIK, NIk is fulled by number like this “nik : 130121231223” it was had a different with alamat like “alamat : jalan mayjen sutoyo…”, and some times pytesseract when convert image to text it had some mistake like adding some symbol or change the number into latter and otherwise, such as letter “g” -> “9”, “i” -> “1” etc.
So this the result if i not using RE:
- PROVINSI DAERAH ISTIMEWA YOGYAKARTA KABUPATEN SLEMAN
- NIK : 34711140209790001
- Nama :RIYANTO. SE
- Tempat/Tgl Lahir : GROBOGAN. 02-09-1979
- Jenis Kelamin : LAKI-LAKI Gol Darah : 0
- Alamat PRM PURI DOMAS D-3. SEMPU
- RTRW 1001 1024
- Kel/Desa : WEDOMARTANI!
- Kecamatan : NGEMPLAK
- Agama “ISLAM
- Status Bean KAWIN
- Pekerjaan : PEDAGANG
- Kewarganegaraan: WNI HI —
- Berlaku Hingga :02-09-2017 NIA
There was a lot of syntaks in RE in using term and condition in every syntaks that u use, or can mix it depend on the situation. But first i need to import the library first
import re
and now just using syntaks that u want. for example
word = word.split(":") jenis_kelamin = re.search("(LAKI-LAKI|LAKI|LELAKI|PEREMPUAN)", word[1])[0]
on that code im just searching that fit on my text that i want, so the result would be like this
print(jenis_kelamin) >>> LAKI-LAKI
or like this
alamat = re.sub(r'^\W*\w+\W*', '', word)
on that code im just delete the first word in the line, because alamat hava a lot of word and all we need is the alamat not the variable “alamat:”, so the result would be like this:
print(alamat) >>> PRM PURI DOMAS D-3. SEMPU
and rest of part ktp would be pretty same like before, and the final result would be like this:
"nik": "34711140209790001" "nama": "RIYANTO. SE" "tempat_lahir": " GROBOGAN. " "tanggal_lahir": "02-09-1979" "jenis_kelamin": "LAKI-LAKI" "golongan_darah": "O" "alamat": "PRM PURI DOMAS D-3. SEMPU" "rt": "001", "rw": "024" "kelurahan_atau_desa": " WEDOMARTANI" "kecamatan": "NGEMPLAK" "agama": "ISLAM" "status_perkawinan": "KAWIN" "pekerjaan": "PEDAGANG" "kewarganegaraan": "WNI" "berlaku_hingga": "SEMUR HIDUP"
here some source that u can use to understand more about RE: