记录一次文字点选验证码的破解(中篇)
接上回汉字定位,本篇来讲讲汉字的分类识别
数据集准备
首先来做汉字的截取
import cv2
with open('results/comp4_det_test_hanzi.txt', 'r') as f:
for line in f:
vals = line.strip().split()
fname, p, x1, y1, x2, y2 = vals
fname = 'jiyan/data/valid/' + fname + '.jpg'
x1 = int(float(x1))
x2 = int(float(x2))
y1 = int(float(y1))
y2 = int(float(y2))
p = float(p)
img = cv2.imread(fname)
if p > 0.5: # 只有置信度大于50%时截取汉字
crop_img = img[y1:y2, x1:x2]
resized_image = cv2.resize(crop_img, (60, 60)) # 统一大小为60x60
cv2.imwrite(fname+'.{}.jpg'.format(p), resized_image)
我总共截取了10w张图片,不过肯定有一些误识别的,这里花了半个小时肉眼过滤。10w张图片一个人标注非常慢,决定借助打码平台,吐槽一下云打码的Python client写的真是一言难尽。将图片的文件路径随机写到30个文件里,开启多线程,还是很快的,大概花了5个小时搞定。
import concurrent.futures
import string
import logging
import redis
from ydm import YDMHttp
redis_conn = redis.StrictRedis()
logger = logging.getLogger("label")
def login(username, password, appid, appkey):
client = YDMHttp(username, password, appid, appkey)
uid = client.login()
logger.info('uid: %s' % uid)
balance = client.balance()
logger.info('balance: %s' % balance)
return client
def worker(file_path, client):
with open(file_path, 'r') as f:
for line in f:
path = line.strip()
if redis_conn.sismember('path', path):
continue
else:
redis_conn.sadd('path', path)
try:
cid, result = client.decode(codetype=2001, filename=path, timeout=20)
logger.info('cid: %s, result: %s' % (cid, result))
except Exception as e:
logger.warning(e)
else:
if (result == '看不清' or len(result) != 1 or result in string.ascii_lowercase) and cid != -3003:
try:
yundama.report(cid)
except Exception as e:
print(e)
else:
redis_conn.hset("result", path, result)
def batch(files):
with concurrent.futures.ThreadPoolExecutor(max_workers=files) as executor:
future_to_file = {executor.submit(worker, file_path): file_path for file_path in files}
for future in concurrent.futures.as_completed(future_to_file):
file_path = future_to_file[future]
try:
future.result()
except Exception as exc:
logger.warning(exc, exc_info=True)
else:
logger.info("complete task: %s" % file_path)
完成后对汉字数量做一个统计,我这边分离出的汉字有1158个,需要修改配置文件中的分类大小为1158,图片大小为60x60。
另外需要把标注的图片名加上汉字的unicode,文件名例如verifyCode1580446215_u4e3b.jpg
,并且把所有unicode都放入labels.txt,这里给出转换代码。
'我'.encode('unicode-escape').decode().replace('\\', '')
'u6211'
训练
回到gsxt_captcha
根目录,将训练集放在chinese_classify/data/train/
,验证集放在chinese_classify/data/valid
。
接着执行,生成train.txt,valid.txt。
cd chinese_classify/data
sh gen_txt.sh
回到根目录,开始训练。
cd ../../
pwd # gsxt_captcha
sh classify_train.sh
20000轮训练后,average loss 0.03,跑一轮测试。
sh classify_test.sh
top1成功率95%,top100%成功率99%,能用了。
git仓库已同步更新代码~