
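In several of my other articles I covered using Tesseract to recognize Chinese, digits and letters, as well as some techniques for removing watermarks from PDFs. When an entire PDF is made up of images (a scanned document, for example), how do you extract the tables in it and return them as JSON organized by row and column?
One approach is to export the images embedded in the PDF and then run recognition on those images. GitHub hosts deep learning projects such as CascadeTabNet and CDecNet, and I also looked at Baidu and Tencent, which have similar deep learning projects. I tried CascadeTabNet (92 stars on GitHub at the time of writing) and Baidu's image table recognition project; CascadeTabNet comes to 11 GB and Baidu's to 19 GB. The results were acceptable: accuracy was reasonable on small images but very low on large ones (such as an A4 page), and neither could return the result as JSON.
Here I describe another approach, which extracts tables from images with OpenCV plus Tesseract. It can handle more complex tables, such as nested tables.
Approach: use OpenCV to run detection on the image and return the key data, then use a design tool to mark and cut regions on the image and generate metadata for each page, and finally use that metadata to extract the table data from the image.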
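1. OpenCV table detection (handles about 60% of table recognition).
2. A design tool, a small utility written in Vue, for further editing the cell data returned by OpenCV (this is what brings table detection to 100%).
3. Table recognition with OpenCV.
4. OCR with Tesseract.
5. Convert the result to JSON and return it.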
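To make step 1 more concrete before the full source: the detection relies on isolating long horizontal and long vertical strokes and then finding where they cross. The `getHorizontal` and `getVertical` helpers are not listed in this article, so the following is only a minimal sketch of the idea in plain Java, without OpenCV. It assumes the binarized image is a `boolean[][]` (true = ink); the class `RulingLines` and the parameter `minLen` are hypothetical names introduced for illustration. Keeping only runs of at least `minLen` pixels is what a morphological opening with a wide, one-pixel-high kernel would do.

```java
import java.util.Arrays;

/** Illustrative only: ruling-line extraction on a binary image (true = ink). */
final class RulingLines {

    /** Keeps only pixels that belong to a horizontal run of at least minLen ink pixels. */
    static boolean[][] horizontal(boolean[][] img, int minLen) {
        int h = img.length, w = img[0].length;
        boolean[][] out = new boolean[h][w];
        for (int y = 0; y < h; y++) {
            int x = 0;
            while (x < w) {
                if (!img[y][x]) { x++; continue; }
                int start = x;
                while (x < w && img[y][x]) x++;   // x ends one past the run
                if (x - start >= minLen) {
                    Arrays.fill(out[y], start, x, true);
                }
            }
        }
        return out;
    }

    /** Vertical lines: transpose, reuse the horizontal pass, transpose back. */
    static boolean[][] vertical(boolean[][] img, int minLen) {
        return transpose(horizontal(transpose(img), minLen));
    }

    /** Pixel-wise AND: true only where a horizontal and a vertical line cross. */
    static boolean[][] joints(boolean[][] hor, boolean[][] ver) {
        int h = hor.length, w = hor[0].length;
        boolean[][] out = new boolean[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                out[y][x] = hor[y][x] && ver[y][x];
            }
        }
        return out;
    }

    static boolean[][] transpose(boolean[][] m) {
        int h = m.length, w = m[0].length;
        boolean[][] t = new boolean[w][h];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                t[x][y] = m[y][x];
            }
        }
        return t;
    }
}
```

Short strokes such as text are dropped because their runs are shorter than `minLen`, while the crossings that survive in `joints` are the kind of points the detection code counts when deciding whether a region is really a table.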
Here is the source code (this article does not explain it in much detail, so please read it carefully). Part one, OpenCV table detection:
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import org.opencv.core.*;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

// OpenCVUtils, MatWithProperty, ImageTable, Row, RowBucket and ColBucket are project classes not listed in this article.
public Table parseImageTableStructure(String path) {
    // Correct the skew of the image
    pictureTiltCorrection(path);
    Mat src = Imgcodecs.imread(path);
    // 1. Convert the image to grayscale
    Mat gray = OpenCVUtils.gray(src);
    // 2. Binarize the image
    Mat adaptiveThreshold = OpenCVUtils.adaptiveThreshold(gray);
    // 3. Dilate, then erode: fill the holes inside the table lines
    Mat element = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(3, 3));
    Imgproc.dilate(adaptiveThreshold, adaptiveThreshold, element);
    Imgproc.erode(adaptiveThreshold, adaptiveThreshold, element);
    // 4. Extract the horizontal lines
    Mat horizontalLine = getHorizontal(adaptiveThreshold.clone());
    // 5. Extract the vertical lines
    Mat verticalLine = getVertical(adaptiveThreshold.clone());
    // 6. Merge the horizontal and vertical lines
    Mat tableLine = OpenCVUtils.getOr(horizontalLine, verticalLine);
    // 7. Use bitwise_and to locate the points where horizontal and vertical lines cross
    Mat points_image = new Mat();
    Core.bitwise_and(horizontalLine, verticalLine, points_image);
    // 8. Find the contours
    List<MatOfPoint> contours = new ArrayList<>();
    Mat rootHierarchy = new Mat();
    Imgproc.findContours(tableLine, contours, rootHierarchy, Imgproc.RETR_LIST, Imgproc.CHAIN_APPROX_TC89_KCOS, new Point(0, 0));
    // 9. Analyze the contours
    Rect[] boundRect = new Rect[contours.size()];
    LinkedList<MatWithProperty> tables = new LinkedList<>();
    // Loop over every contour that was found
    for (int i = 0; i < contours.size(); i++) {
        MatOfPoint point = contours.get(i);
        double area = Imgproc.contourArea(point);
        // Ignore anything below a minimum area: it is a stray line, not a table cell
        if (area < 100) {
            continue;
        }
        // Approximate the contour with a polygon (3 px tolerance), then take
        // the upright rectangle that encloses that polygon
        MatOfPoint2f approx = new MatOfPoint2f();
        Imgproc.approxPolyDP(new MatOfPoint2f(point.toArray()), approx, 3, true);
        boundRect[i] = Imgproc.boundingRect(new MatOfPoint(approx.toArray()));
        // Look at the line crossings that fall inside this rectangle
        Mat table_image = points_image.submat(boundRect[i]);
        List<MatOfPoint> table_contours = new ArrayList<>();
        Mat joint_mat = new Mat();
        Imgproc.findContours(table_image, table_contours, joint_mat, Imgproc.RETR_CCOMP, Imgproc.CHAIN_APPROX_TC89_L1);
        // A complete cell has at least 4 crossings; with fewer, this region is not part of a table
        if (table_contours.size() < 4) {
            continue;
        }
        // Keep the rectangle
        MatWithProperty mp = new MatWithProperty(null, boundRect[i]);
        tables.addFirst(mp);
    }
    ImageTable table = new ImageTable();
    table.setImageHeight(src.rows());
    table.setImageWidth(src.cols());
    // 10. Build the buckets
    List<Row> horBuckets = new ArrayList<>();
    table.setRows(horBuckets);
    // Build the row buckets
    createRowBuckets(tables, horBuckets);
    // Iterate over the row buckets
    for (Row row : horBuckets) {
        RowBucket bucket = (RowBucket) row;
        List<MatWithProperty> rowMats = bucket.elements;
        // Build the column buckets
        List<ColBucket> verBuckets = new ArrayList<>();
        createColBuckets(rowMats, verBuckets);
    }
    // Return the structure
    return table;
}
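`createRowBuckets` and `createColBuckets` are not listed in this article, so how they group cells is not known. One plausible way to turn a flat list of cell rectangles into rows is to sort by the top edge and start a new row whenever the top edge jumps by more than a tolerance. The sketch below does that in plain Java; `CellBuckets`, `Box`, `toRows` and `tol` are hypothetical names, and `Box` is only a stand-in for the rectangle held by `MatWithProperty`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: groups cell rectangles into rows, each sorted left to right. */
final class CellBuckets {

    /** Hypothetical stand-in for a detected cell: just its bounding box. */
    static final class Box {
        final int x, y, width, height;

        Box(int x, int y, int width, int height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * A box joins the current row when its top edge is within tol pixels
     * of the top edge of the first box in that row.
     */
    static List<List<Box>> toRows(List<Box> boxes, int tol) {
        List<Box> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator.comparingInt((Box b) -> b.y)
                .thenComparingInt((Box b) -> b.x));

        List<List<Box>> rows = new ArrayList<>();
        List<Box> current = null;
        int rowTop = 0;
        for (Box b : sorted) {
            if (current == null || b.y - rowTop > tol) {
                current = new ArrayList<>();   // top edge jumped: start a new row
                rows.add(current);
                rowTop = b.y;
            }
            current.add(b);
        }
        for (List<Box> row : rows) {
            row.sort(Comparator.comparingInt((Box b) -> b.x));
        }
        return rows;
    }
}
```

A tolerance of a few pixels absorbs the small offsets that skew correction and contour approximation leave behind. This simple grouping does not handle nested tables, where one box contains others; that case needs extra logic.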
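The returned table is the table structure data; you can think of it as a mask for the table. For simple, clear tables this detection reaches 100%, but for large or nested tables the rate is around 60%. To reach 100% we therefore have to edit the structure data once more, and this time the editing is done through a UI. Below is the design page, written in Vue: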
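Recognize table data in an image
0">
Save design
Show data
Drag an image file here, or click to upload
Only jpg/png files can be uploaded, up to 10 MB each
{onDragStop(x,y,c)}" @resizestop="(x,y,w,h) => {onResizeStop(x,y,w,h,c)}" :parent="true" :key="c.id">
{{attributeData}}
Click to save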
.ocr-design-wrapper {
/deep/.upload-demo {
width: 100%;
.el-upload {
width: 100%;
.el-upload-dragger {
width: 100%;
}
}
}
.design-port-container {
width: 100%;
position: relative;
}
.bounding-container {
width: 100%;
height: 100%;
position: absolute;
left: 0;
top: 0;
.sketch-container {
/deep/.bound-box {
background-color: rgba(100, 255, 187, 0.4);
.bound-box-close {
display: none;
cursor: pointer;
position: absolute;
right: -7px;
top: -7px;
background-color: #00c9ff;
color: #ffffff;
}
&.active {
.bound-box-close {
display: block;
}
}
.handle-tl {
top: -5px;
left: -5px;
}
.handle-tm {
top: -5px;
}
.handle-tr {
right: -5px;
top: -5px;
}
.handle-mr {
right: -5px;
}
.handle-ml {
left: -5px;
}
.handle-bl {
left: -5px;
bottom: -5px;
}
.handle-bm {
bottom: -5px;
}
.handle-br {
bottom: -5px;
right: -5px;
}
}
}
}
.top-header {
height: 50px;
line-height: 50px;
padding-left: 20px;
background-color: #dedede;
width: 100%;
left: 0px;
z-index: 99999999;
border-radius: 6px;
}
.design-port {
margin-top: 10px;
.left-view {
width: 100%;
height: 100%;
}
}
}
.btn-wrap {
position: fixed;
top: 50%;
right: 10px;
z-index: 14;
width: 80px;
/deep/ .el-button {
margin-bottom: 10px;
opacity: 0.6;
&:hover {
opacity: 1;
}
}
/deep/ .el-button+.el-button {
margin-left: 0;
}
}
.innerDom {
display: none !important;
}
.box {
padding: 20px;
}
.comp-wrap {
width: 313px;
float: left;
height: 736px;
}
.page-wrap {
width: 100%;
float: left;
padding: 0 !important;
}
.edit-wrap {
position: relative;
float: left;
width: 348px;
height: 736px;
}
.drag-sty {
border: 1px solid #e6e6e6;
width: 100px;
padding: 6px;
font-size: 12px;
height: 30px;
display: inline-block;
line-height: 18px;
}
.iconfont-back {
background: #ccc;
border-radius: 2px;
padding: 0 2px;
float: left;
height: 18px;
margin-right: 6px;
}
.drag-sty:hover .iconfont {
color: #2875e8;
}
.iconfont {
color: #a8a7a7;
font-size: 18px;
}
.bg-purple {
background: #d3dce6;
}
.bg-purple-light {
background: #fafafa;
}
.left-shadow {}
.grid-content {
border-radius: 4px;
overflow: auto;
padding: 20px;
}
.tab-content {
border: 1px solid #eee;
border-radius: 4px;
min-height: 736px;
height: 100%;
overflow: auto;
}
.item {
height: 60px;
border: 0px solid #333;
display: inline-block;
padding: 10px;
margin-bottom: 5px;
cursor: pointer;
}
.el-upload {
width: 100%;
}
.el-upload-dragger {
width: 100%;
}
.item2 {
height: 80px;
border: 0px solid #333;
padding: 10px;
margin-bottom: 5px;
cursor: pointer;
}
#removeBox {
height: 100px;
width: 100px;
border: 2px dashed #999;
background: rgba(0, 0, 0, 0.3);
position: absolute;
bottom: 10px;
right: 20px;
background: url(/static/image/deleteBox.png) no-repeat;
background-size: 90%;
background-position: center center;
}
.flxed {
position: relative;
top: 0;
left: 0;
}
.edit-content .el-form-item {
margin-bottom: 0;
}
.vali-el-input {
margin: 0 10px;
}
.el-checkbox {
margin: 4px 0;
}
.el-divider--horizontal {
margin: 4px 0;
}
h4,
h5 {
margin: 10px 0;
}
.submit-btn {
float: right;
}
/deep/ .page-item-group {
cursor: pointer;
position: relative;
.control-btn {
right: 0;
}
}
* {
box-sizing: border-box;
}
/deep/ .vali-el-input .el-input__inner {
height: 26px !important;
padding-right: 0;
padding-left: 4px;
}
/deep/ .sel-options .el-form-item__label {
width: 100%;
text-align: left;
}
/deep/ .el-icon-delete {
cursor: pointer;
}
/deep/ .edit-content .el-input {
width: 100%;
}
/deep/ .edit-content .el-date-editor.el-input,
/deep/ .edit-content .el-date-editor.el-input__inner {
width: 220px;
}
/deep/ .edit-content .el-input__inner {
height: 30px;
box-sizing: border-box;
}
/deep/ .edit-content .el-date-editor--date .el-icon-date,
/deep/ .edit-content .time-select .el-icon-circle-close {
line-height: 30px;
}
/deep/ .edit-content .el-form-item__label {
height: 30px;
}
/deep/ .long_input {
margin-left: 80px !important;
position: relative;
}
/deep/ .long_input_label {
width: 80px;
}
/deep/ .page-item:hover {
background: #e0f2ff;
}
/deep/ .page-item-select {
border: 1px dashed #4db8ff;
background: #e0f2ff;
}
/deep/ .page-item-select .control-btn {
display: block;
}
/deep/ .control-btn {
position: absolute;
top: 50%;
right: -20px;
transform: translate(0, -50%);
display: none;
}
/deep/ .control-btn .control-delete {
position: absolute;
right: 0;
bottom: -26px;
}
/deep/ .control-btn .control-arrow-wrap {
height: 20px;
cursor: pointer;
line-height: 20px;
background: #fff;
display: block;
}
/deep/ .control-btn .control-arrow-down {
bottom: -2px;
}
/deep/ .control-btn .control-arrow-up {
top: 28px;
margin-bottom: 6px;
}
/deep/ .tab-content .page-item {
margin-bottom: 0;
padding: 16px 20px;
min-height: 90px;
}
.sel-sty {
width: 200px;
margin-top: 14px;
display: block;
// -webkit-appearance: none;
background-color: #fff;
background-image: none;
border-radius: 4px;
border: 1px solid #dcdfe6;
box-sizing: border-box;
color: #606266;
font-size: inherit;
height: 30px;
line-height: 40px;
outline: none;
padding: 0 15px;
transition: border-color 0.2s cubic-bezier(0.645, 0.045, 0.355, 1);
}
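Note that this uses the vue-draggable-resizable component. I won't cover how to add it to a Vue project. The running result is shown in the figure below:
After a table image is uploaded, it looks like the figure below:
The text in red is not suitable for display, so I have erased it.
Once the design is finished you can view the resulting design data, which OpenCV can use for complete recognition. The code is as follows: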
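One detail worth watching when boxes are drawn on screen: the style sheet above sizes the design area at `width: 100%`, so the picture is probably shown at a different size than the file on disk. If the boxes are saved in on-screen pixels (an assumption; the article does not say which coordinate system the saved data uses), they need to be scaled back to image pixels before `src.submat(...)` is called, and kept inside the image, because OpenCV throws when a region falls outside it. The sketch below shows the arithmetic; `BoxScaler`, `toImagePixels` and all parameter names are hypothetical.

```java
/** Illustrative only: maps a box drawn on a scaled preview back to source-image pixels. */
final class BoxScaler {

    /**
     * @param x,y,w,h          box in on-screen pixels, relative to the displayed image
     * @param displayedWidth   width of the image as shown on screen
     * @param displayedHeight  height of the image as shown on screen
     * @param imageWidth       width of the source image in pixels
     * @param imageHeight      height of the source image in pixels
     * @return {x, y, width, height} in source-image pixels, kept inside the image
     */
    static int[] toImagePixels(double x, double y, double w, double h,
                               double displayedWidth, double displayedHeight,
                               int imageWidth, int imageHeight) {
        double sx = imageWidth / displayedWidth;
        double sy = imageHeight / displayedHeight;

        int left = clamp((int) Math.round(x * sx), 0, imageWidth - 1);
        int top = clamp((int) Math.round(y * sy), 0, imageHeight - 1);
        // Right and bottom are exclusive edges; force at least one pixel of size.
        int right = clamp((int) Math.round((x + w) * sx), left + 1, imageWidth);
        int bottom = clamp((int) Math.round((y + h) * sy), top + 1, imageHeight);

        return new int[] {left, top, right - left, bottom - top};
    }

    static int clamp(int v, int lo, int hi) {
        return Math.max(lo, Math.min(hi, v));
    }
}
```

The image width and height that the detection step records on the table object are the natural source for `imageWidth` and `imageHeight`.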
List<Row> recognizeFromSettings(File imagePath) {
    Mat src = Imgcodecs.imread(imagePath.getAbsolutePath());
    String extention = Utils.getUriExtention(imagePath.getPath());
    // 1. Build the row buckets from the saved design data
    List<MatWithProperty> cellProperty = JSON.parseArray(this.settings.getString("cells"), MatWithProperty.class);
    List<Row> horBuckets = new ArrayList<>();
    createRowBuckets(cellProperty, horBuckets);
    // Iterate over the row buckets
    for (Row row : horBuckets) {
        RowBucket bucket = (RowBucket) row;
        List<MatWithProperty> rowMats = bucket.elements;
        // Build the column buckets
        List<ColBucket> verBuckets = new ArrayList<>();
        createColBuckets(rowMats, verBuckets);
        for (ColBucket verticalBucket : verBuckets) {
            List<MatWithProperty> colMats = verticalBucket.elements;
            // Iterate over the cells in this column
            for (MatWithProperty mat : colMats) {
                // Cut the cell out of the page image
                Mat subMat = src.submat(new Rect(mat.rect.x, mat.rect.y, mat.rect.width, mat.rect.height)).clone();
                try {
                    // Run OCR on the cell
                    BufferedImage image = OpenCVUtils.convertMat2BufferedImage(subMat, extention);
                    String content = tesseract.doOCR(image);
                    row.addCell(content);
                } catch (Exception e) {
                    logger.error("[OCR Failed]", e);
                } finally {
                    // Free the native memory held by the cell image
                    subMat.release();
                }
            }
        }
    }
    return horBuckets;
}
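The last step in the plan, returning JSON, is not shown in the article. In a real project the JSON library already used for `JSON.parseArray` would be the natural choice. Purely to illustrate the shape of the output, here is a dependency-free sketch that turns rows of recognized cell text into a JSON array of arrays. `TableJson`, `toJson` and `escape` are hypothetical names, and trimming each cell is my assumption, on the grounds that OCR output usually ends with a line break.

```java
import java.util.List;

/** Illustrative only: serializes rows of cell text as a JSON array of string arrays. */
final class TableJson {

    static String toJson(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder("[");
        for (int r = 0; r < rows.size(); r++) {
            if (r > 0) sb.append(',');
            sb.append('[');
            List<String> cells = rows.get(r);
            for (int c = 0; c < cells.size(); c++) {
                if (c > 0) sb.append(',');
                // Trim first: leading and trailing whitespace carries no table content
                sb.append('"').append(escape(cells.get(c).trim())).append('"');
            }
            sb.append(']');
        }
        return sb.append(']').toString();
    }

    /** Escapes the characters that JSON does not allow unescaped inside a string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            switch (ch) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (ch < 0x20) {
                        sb.append(String.format("\\u%04x", (int) ch));
                    } else {
                        sb.append(ch);
                    }
            }
        }
        return sb.toString();
    }
}
```

Each inner array corresponds to one row and each string to one cell, in the order the cells were added.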
Feel free to share. If you repost this article, please credit the source: 内存溢出