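Advanced Python: Reading and Merging PDF E-Invoices

Most companies still require employees to sort their own invoices, fill in an expense claim form, and print the e-invoices to hand to the finance department before anything is reimbursed. With many invoices this wastes a lot of time, so let's write a Python program to manage e-invoices.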

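Features of the personal invoice manager

Automatic invoice recognition: read the invoice information. (Initially only PDF e-invoices are handled.)

Batch export of the recognized invoice information.

Merging multiple invoice files into a single file.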

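Features to be added later. I will write follow-up articles that explain step by step how to implement them.

A graphical interface, supporting both C/S (client/server) and B/S (browser/server) modes.

Adding the program to the Windows taskbar.

Invoice classification: label invoices by reimbursement type.

Validation of the invoice title (buyer name), tax number, invoice number, and so on, to reject wrong titles and avoid duplicate invoices.

Marking invoices that have already been reimbursed.

Automatically reading invoice attachments from a mailbox and extracting the invoice information.

Support for VAT invoices in image formats, recognized via OCR.

Support for more invoice types.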
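The duplicate check mentioned in the list above could look roughly like the sketch below. It assumes that an invoice is uniquely identified by its invoice code plus its invoice number. The function name find_duplicates and the (code, number, file name) tuple layout are hypothetical and only serve the illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def find_duplicates(records: List[Tuple[str, str, str]]) -> Dict[Tuple[str, str], List[str]]:
    """Group file names by (invoice code, invoice number); keep groups with more than one file."""
    seen = defaultdict(list)
    for code, number, filename in records:
        seen[(code, number)].append(filename)
    return {key: files for key, files in seen.items() if len(files) > 1}


# Made-up records: the first two share the same code and number.
records = [
    ("033001900111", "12345678", "taxi.pdf"),
    ("033001900111", "12345678", "taxi_copy.pdf"),
    ("044001900111", "87654321", "hotel.pdf"),
]
print(find_duplicates(records))
```

Any file reported here would be a candidate for exclusion before exporting or merging.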

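Reading information from PDF e-invoices

There are many Python libraries for working with PDFs, including PyMuPDF, PyPDF2, pdfminer, and pdfplumber, and each has its own characteristics (the Zhihu article "Working with PDFs in Python: a summary of common PDF libraries" gives a fairly complete overview, if you are interested).

I chose pdfplumber (https://github.com/jsvine/pdfplumber) to read the invoice content because it makes it easy to extract the text and locate the images in a PDF and, most importantly, it extracts tables well.

Figure: Comparison of pdfplumber with other libraries

Installing pdfplumber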

pip install pdfplumber

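The sample code below prints the text and the table that pdfplumber reads, and saves the page as an image with the detected table and word boxes drawn on it, so that the result can be inspected. E-invoices also carry a QR code that encodes the invoice information. pdfplumber cannot extract the embedded image directly, so here the crop() function cuts out the image region and to_image() then renders it as an image. The QR code image can be decoded with a QR library such as pyzbar and used to cross-check the extracted information.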

import pdfplumber

with pdfplumber.open(r"1.pdf") as pdf:
    # Read the first page
    first_page = pdf.pages[0]
    # Render the page as an image at 100 dpi
    im = first_page.to_image(resolution=100)
    # Draw the table borders and the line intersections
    im.debug_tablefinder()
    # Draw the bounding box of every word
    im.draw_rects(first_page.extract_words())
    # Save the annotated page image
    im.save(r"1.png")
    # Print the extracted text
    print(first_page.extract_text())
    # Print the extracted table
    print(first_page.extract_table())
    # Save the images in the PDF: crop each image region and render it
    for image in first_page.images:
        bbox = (image['x0'], image['top'], image['x1'], image['bottom'])
        image_im = first_page.crop(bbox).to_image(resolution=100)
        image_im.save(f"{image['name']}.png")

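Figure: Table and text recognition regions

As the figure above shows, pdfplumber detects the invoice table quite well: the red lines are the detected table, and the blue circles are the intersections of the table lines.

The text printed by extract_text() is another matter. Getting all of the invoice information directly from that text is difficult (I did not want to redact a screenshot, so there is no image here), because many fields run together. Other libraries do no better at extracting the text, and invoice PDFs from different provinces are laid out differently. Without reading the text through the table regions, there is no way to parse all of the invoice information.

In the end, we combine extract_text() and extract_table() to obtain the information in the invoice header and in the invoice table, respectively.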

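Parsing the invoice information and exporting it to a spreadsheet

Parsing the invoice is fairly simple: regular expressions are enough to extract the information. Take the invoice number and the other fields above the invoice table as an example. They are parsed from the text returned by extract_text() (note that, depending on the region, the colon on an invoice is sometimes half-width and sometimes full-width):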

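import re

# Extract the fields above the invoice table.
# The Chinese field labels must match the text printed on the invoice.
# (:|：) accepts both the half-width and the full-width colon.
invoice.number = re.search(r'發票號碼(:|：)(\d+)', text).group(2)
invoice.date = re.search(r'開票日期(:|：)(.*)', text).group(2)
invoice.machine_number = re.search(r'機器編號(:|：)(\d+)', text).group(2)
invoice.code = re.search(r'發票代碼(:|：)(\d+)', text).group(2)
invoice.check_code = re.search(r'校驗碼(:|：)(\d+)', text).group(2)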
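To see the effect of accepting both colon forms, here is a small self-contained sketch. The sample strings are made up for illustration. The character class [:：] is an equivalent way of writing the colon alternation; it creates no capture group, so the value moves from group(2) to group(1).

```python
import re

# Made-up header lines: one uses a half-width colon, the other a full-width colon.
samples = ["發票號碼:12345678", "發票號碼：87654321"]

# [:：] matches either colon without creating a capture group,
# so the digits are captured as group(1).
pattern = re.compile(r"發票號碼[:：](\d+)")

numbers = [pattern.search(sample).group(1) for sample in samples]
print(numbers)
```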

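The content inside the invoice table is parsed from the table returned by extract_table(). Taking the buyer as an example, the data has to be read from a specific cell: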

# Read the table
table = first_page.extract_table()
# The table has 4 rows and 11 columns:
# Row 0: buyer label, buyer details, None, None, None, None, password-area label, password, None, None, None
# Row 1: item names, None, specification, unit, quantity, unit price, None, None, amount, tax rate, tax
# Row 2: total-including-tax label, None, amount, None, None, None, None, None, None, None, None
# Row 3: seller label, seller details, None, None, None, None, remarks label, remarks, None, None, None
# Buyer
purchaser = table[0][1].replace(" ", "")
purchaser_name = re.search(r'名稱(:|：)(.+)', purchaser)
invoice.purchaser_name = purchaser_name.group(2) if purchaser_name else ''
purchaser_tax_number = re.search(r'納稅人識別號(:|：)(.+)', purchaser)
invoice.purchaser_tax_number = purchaser_tax_number.group(2) if purchaser_tax_number else ''
purchaser_address = re.search(r'地址、電話(:|：)(.+)', purchaser)
invoice.purchaser_address = purchaser_address.group(2) if purchaser_address else ''
purchaser_bank = re.search(r'開戶行及賬號(:|：)(.+)', purchaser)
invoice.purchaser_bank = purchaser_bank.group(2) if purchaser_bank else ''
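The four buyer fields all follow the same search-then-fall-back pattern, so it could be factored into a helper. This is only a sketch: extract_field is a hypothetical name that is not part of the article's code, and the sample cell text is invented.

```python
import re
from typing import Optional


def extract_field(cell: Optional[str], label: str) -> str:
    """Return the text after `label` and a colon on the same line, or '' if absent."""
    if not cell:
        return ""
    # Spaces are removed first because labels are often printed with padding.
    match = re.search(re.escape(label) + r"[:：](.+)", cell.replace(" ", ""))
    return match.group(1) if match else ""


# Invented cell content, one field per line; the last two fields are empty.
purchaser = "名 稱: 某某科技有限公司\n納稅人識別號: 91330000XXXXXXXXXX\n地 址、電 話:\n開戶行及賬號:"

print(extract_field(purchaser, "名稱"))
print(extract_field(purchaser, "納稅人識別號"))
print(repr(extract_field(purchaser, "地址、電話")))
```

Because "." does not match a newline, an empty field yields an empty string instead of swallowing the next line.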

Exporting the data to Excel

With openpyxl it is easy to export the recognized invoice data to an Excel file.

import openpyxl

workbook = openpyxl.Workbook()
sheet = workbook[workbook.sheetnames[0]]
# Export issue date, invoice code, invoice number, check code, buyer name and tax number,
# seller name and tax number, amount, tax rate, tax, total including tax, and remarks
sheet.append(['Issue date', 'Invoice code', 'Invoice number', 'Check code',
              'Buyer name', 'Buyer tax number', 'Seller name', 'Seller tax number',
              'Amount', 'Tax rate', 'Tax', 'Total incl. tax', 'Remarks'])
for invoice in invoices:
    sheet.append([invoice.date, invoice.code, invoice.number, invoice.check_code,
                  invoice.purchaser_name, invoice.purchaser_tax_number,
                  invoice.seller_name, invoice.seller_tax_number,
                  invoice.total_amount, invoice.tax_rate, invoice.total_tax,
                  invoice.total, invoice.remark])
workbook.save(output_path)

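Figure: An example of the exported spreadsheet

Merging invoice files

pdfplumber is not well suited to editing PDF files, so here we use PyPDF4 to merge the PDFs. The code is very simple: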

from PyPDF4 import PdfFileMerger

merger = PdfFileMerger()
for invoice_path in invoice_paths:
    merger.append(invoice_path)
merger.write(output_path)
merger.close()
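The merged file follows the order of invoice_paths, and os.listdir returns names in arbitrary order. If a stable page order matters, the paths could be collected and sorted first. Here is a sketch that uses only the standard library; collect_pdf_paths is a hypothetical helper name, and matching the extension case-insensitively is an assumption.

```python
import os
import tempfile
from typing import List


def collect_pdf_paths(folder: str) -> List[str]:
    """Return the PDF files directly inside `folder`, sorted by file name."""
    names = [name for name in os.listdir(folder) if name.lower().endswith(".pdf")]
    return [os.path.join(folder, name) for name in sorted(names)]


# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    for name in ["b.pdf", "a.PDF", "notes.txt"]:
        open(os.path.join(folder, name), "w").close()
    found = [os.path.basename(path) for path in collect_pdf_paths(folder)]
    print(found)
```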

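To save paper, some companies print two invoices on one A4 sheet. We do not implement merging two invoices onto one page here; that can be done at print time by choosing to print multiple pages per sheet.

Source code

Finally, a practical post has to come with its source code.

The Python version is 3.8. The required dependencies are listed in requirements.txt:

pdfplumber~=0.6.1
PyPDF4~=1.27.0
openpyxl~=3.0.7

invoice.py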

import os
import re
import time

import openpyxl
import pdfplumber
from PyPDF4 import PdfFileMerger


class Invoice:
    def __init__(self):
        # Invoice type, invoice number, issue date, machine number, invoice code, check code
        self.type = self.number = self.date = self.machine_number = self.code = self.check_code = ''
        # Payee, reviewer, drawer
        self.payee = self.reviewer = self.drawer = ''
        # Buyer name, taxpayer ID, address and phone, bank and account number
        self.purchaser_name = self.purchaser_tax_number = self.purchaser_address = self.purchaser_bank = ''
        # Password area
        self.password = ''
        # Total amount, total tax, tax rate, total including tax
        self.total_amount = self.total_tax = self.tax_rate = self.total = ''
        # Seller name, taxpayer ID, address and phone, bank and account number
        self.seller_name = self.seller_tax_number = self.seller_address = self.seller_bank = ''
        # Remarks
        self.remark = ''

    # Parse a PDF e-invoice
    @staticmethod
    def read_invoice(pdf_file: str) -> "Invoice":
        invoice = Invoice()
        try:
            with pdfplumber.open(pdf_file) as pdf:
                # Read the first page
                first_page = pdf.pages[0]
                # Read the text
                text = first_page.extract_text().replace(" ", "")
                # print(text)
                if '專用發票' in text:
                    invoice.type = 'special'
                elif '普通發票' in text:
                    invoice.type = 'ordinary'
                else:
                    raise Exception('Unknown invoice type, or not an invoice file')
                # Extract the fields above the invoice table
                invoice.number = re.search(r'發票號碼(:|：)(\d+)', text).group(2)
                invoice.date = re.search(r'開票日期(:|：)(.*)', text).group(2)
                invoice.machine_number = re.search(r'機器編號(:|：)(\d+)', text).group(2)
                invoice.code = re.search(r'發票代碼(:|：)(\d+)', text).group(2)
                invoice.check_code = re.search(r'校驗碼(:|：)(\d+)', text).group(2)
                # Extract the fields below the invoice table; they are all on one line
                match = re.search(r'.*收款人(:|：)(.*)複核(:|：)(.*)開票人(:|：)(.*)銷售', text)
                invoice.payee = match.group(2)
                invoice.reviewer = match.group(4)
                invoice.drawer = match.group(6)
                # Read the table
                table = first_page.extract_table()
                # The table has 4 rows and 11 columns:
                # Row 0: buyer label, buyer details, None, None, None, None, password-area label, password, None, None, None
                # Row 1: item names, None, specification, unit, quantity, unit price, None, None, amount, tax rate, tax
                # Row 2: total-including-tax label, None, amount, None, None, None, None, None, None, None, None
                # Row 3: seller label, seller details, None, None, None, None, remarks label, remarks, None, None, None
                if table and len(table) == 4:
                    # Buyer
                    purchaser = table[0][1].replace(" ", "")
                    purchaser_name = re.search(r'名稱(:|：)(.+)', purchaser)
                    invoice.purchaser_name = purchaser_name.group(2) if purchaser_name else ''
                    purchaser_tax_number = re.search(r'納稅人識別號(:|：)(.+)', purchaser)
                    invoice.purchaser_tax_number = purchaser_tax_number.group(2) if purchaser_tax_number else ''
                    purchaser_address = re.search(r'地址、電話(:|：)(.+)', purchaser)
                    invoice.purchaser_address = purchaser_address.group(2) if purchaser_address else ''
                    purchaser_bank = re.search(r'開戶行及賬號(:|：)(.+)', purchaser)
                    invoice.purchaser_bank = purchaser_bank.group(2) if purchaser_bank else ''
                    # Password area
                    invoice.password = table[0][7].replace("\n", "")
                    # Total amount and total tax
                    invoice.total_amount = table[1][8].split("\n")[-1].replace("￥", "").replace(" ", "").replace(",", "")
                    invoice.total_amount = float(invoice.total_amount)
                    # Note: on some invoices a tax of 0 is printed as '*', which cannot be converted to a number directly
                    invoice.total_tax = table[1][10].split("\n")[-1].replace("￥", "").replace(" ", "").replace("*", "0").replace(",", "")
                    invoice.total_tax = float(invoice.total_tax) if invoice.total_tax else 0
                    # Tax rate: the items on one invoice usually share the same rate;
                    # it can also be computed as total tax / total amount
                    invoice.tax_rate = table[1][9].split("\n")[-1].replace(" ", "").replace("%", "").replace("*", "0")
                    invoice.tax_rate = float(invoice.tax_rate) / 100 if invoice.tax_rate else 0
                    # invoice.tax_rate = invoice.total_tax / invoice.total_amount
                    # Total including tax: parse it or compute it
                    invoice.total = re.search(r'.+￥(.+)', table[2][2]).group(1).replace(" ", "").replace(",", "")
                    invoice.total = float(invoice.total)
                    # invoice.total = invoice.total_tax + invoice.total_amount
                    # Seller
                    seller = table[3][1].replace(" ", "")
                    invoice.seller_name = re.search(r'名稱(:|：)(.+)', seller).group(2)
                    invoice.seller_tax_number = re.search(r'納稅人識別號(:|：)(.+)', seller).group(2)
                    seller_address = re.search(r'地址、電話(:|：)(.+)', seller)
                    invoice.seller_address = seller_address.group(2) if seller_address else ''
                    seller_bank = re.search(r'開戶行及賬號(:|：)(.+)', seller)
                    invoice.seller_bank = seller_bank.group(2) if seller_bank else ''
                    # Remarks
                    invoice.remark = table[3][7].replace("\n", "")
        except Exception as e:
            print(pdf_file, e)
        return invoice

    # Export to Excel
    @staticmethod
    def export_to_excel(invoices: "list[Invoice]", output_path: str):
        row = len(invoices) if invoices else 0
        if row == 0:
            print("No data to export")
            return
        try:
            workbook = openpyxl.Workbook()
            sheet = workbook[workbook.sheetnames[0]]
            # Export issue date, invoice code, invoice number, check code, buyer name and tax number,
            # seller name and tax number, amount, tax rate, tax, total including tax, and remarks
            sheet.append(['Issue date', 'Invoice code', 'Invoice number', 'Check code',
                          'Buyer name', 'Buyer tax number', 'Seller name', 'Seller tax number',
                          'Amount', 'Tax rate', 'Tax', 'Total incl. tax', 'Remarks'])
            for i in range(row):
                invoice = invoices[i]
                sheet.append([invoice.date, invoice.code, invoice.number, invoice.check_code,
                              invoice.purchaser_name, invoice.purchaser_tax_number,
                              invoice.seller_name, invoice.seller_tax_number,
                              invoice.total_amount, invoice.tax_rate, invoice.total_tax,
                              invoice.total, invoice.remark])
            workbook.save(output_path)
        except Exception as e:
            print('Export to Excel failed', e)

    # Merge PDFs
    @staticmethod
    def merge_pdf(invoice_paths: "list[str]", output_path: str):
        if len(invoice_paths) == 0:
            print("No PDFs to merge")
            return
        try:
            merger = PdfFileMerger()
            for invoice_path in invoice_paths:
                merger.append(invoice_path)
            merger.write(output_path)
            merger.close()
        except Exception as e:
            print('Merging PDFs failed', e)


if __name__ == '__main__':
    # Folder that contains the invoices; use an absolute path
    path = r'E:\playground\python_tools\pdf'
    # Output folder
    output_path = os.path.join(path, 'output')
    if not os.path.exists(output_path):
        os.makedirs(output_path)
    # List of parsed invoices
    invoices = []
    # List of invoice file paths
    invoice_paths = []
    # List all files in the folder
    lists = os.listdir(path)
    for item in lists:
        if item.endswith('.pdf'):
            item = os.path.join(path, item)
            invoice = Invoice.read_invoice(item)
            # Keep only the PDFs that were parsed successfully
            if invoice and invoice.type:
                invoices.append(invoice)
                invoice_paths.append(item)
    # Export the invoice data to Excel
    Invoice.export_to_excel(invoices, f'{output_path}/invoice_{time.strftime("%Y%m%d_%H%M%S", time.localtime())}.xlsx')
    # Merge the invoices into one PDF
    Invoice.merge_pdf(invoice_paths, f'{output_path}/merged_invoice_{time.strftime("%Y%m%d_%H%M%S", time.localtime())}.pdf')
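The amount handling in read_invoice can be isolated into a small helper, which also makes it easy to check that amount plus tax matches the total including tax (the commented-out alternative in the code). This is a sketch, not part of the original script: parse_amount and totals_consistent are hypothetical names, stripping the half-width yen sign is an added assumption, and so is the half-cent tolerance.

```python
from typing import Optional


def parse_amount(cell: Optional[str]) -> float:
    """Convert the last line of a table cell to a float; '*' and empty cells count as 0."""
    if not cell:
        return 0.0
    last_line = cell.split("\n")[-1]
    # Strip the currency sign (full-width as in the script, half-width as an assumption),
    # spaces, and thousands separators.
    cleaned = last_line.replace("￥", "").replace("¥", "").replace(" ", "").replace(",", "")
    # Some invoices print a zero tax as '*'.
    cleaned = cleaned.replace("*", "0")
    return float(cleaned) if cleaned else 0.0


def totals_consistent(amount: float, tax: float, total: float, tolerance: float = 0.005) -> bool:
    """Check that amount + tax equals the total including tax, within a small tolerance."""
    return abs(amount + tax - total) <= tolerance


# Invented cell contents in the shape of the amount and tax columns.
amount = parse_amount("金額\n￥1,234.50")
tax = parse_amount("稅額\n***")
print(amount, tax, totals_consistent(amount, tax, 1234.50))
```

A failed check is a hint that the table was not recognized as expected for that file.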
