Difference Between NFD, NFC, NFKD, and NFKC Explained with Python Code

Difference Between NFD, NFC, NFKD, and NFKC Explained with Python Code,第1张

The difference between Unicode normalization forms

Photo by Joel Filipe on Unsplash
Recently I am working on an NLP task in Japanese, one problem is to convert special characters to a normalized form. So I have done a little research and write this post for anyone who has the same need.

Japanese contains different forms of the character, for example, Latin has two forms, full-width form, and half-width.

In the above example, we can see the full-width form is very ugly and is also hard to utilizing for the following processing. So we need to convert it to a normalized form.

TL;DR
Use NFKC method.

from unicodedata import normalize
s = “株式会社KADOKAWA Future Publishing”
normalize(‘NFKC’, s)
株式会社KADOKAWA Future Publishing
Unicode normalization forms

from Wikipedia
There are 4 kinds of Unicode normalization forms. This article give a very detailed explanation. But I will explain the difference with a simple and easy understanding way.

First, we could see the below result for an intuitive understanding.

アイウエオ (NFC)> アイウエオ
アイウエオ (NFD)> アイウエオ
アイウエオ (NFKC)> アイウエオ
アイウエオ (NFKD)> アイウエオ
パピプペポ (NFC)> パピプペポ
パピプペポ (NFD)> パピプペポ
パピプペポ (NFKC)> パピプペポ
パピプペポ (NFKD)> パピプペポ
パピプペポ (NFC)> パピプペポ
パピプペポ (NFD)> パピプペポ
パピプペポ (NFKC)> パピプペポ
パピプペポ (NFKD)> パピプペポ
abcABC (NFC)> abcABC
abcABC (NFD)> abcABC
abcABC (NFKC)> abcABC
abcABC (NFKD)> abcABC
123 (NFC)> 123
123 (NFD)> 123
123 (NFKC)> 123
123 (NFKD)> 123
+-.~)} (NFC)> +-.~)}
+-.~)} (NFD)> +-.~)}
+-.~)} (NFKC)> ±.~)}
+-.~)} (NFKD)> ±.~)}
There are two classification methods for these 4 forms.

1 original form changed or not
  • A(not changed): NFC & NFD
  • B(changed): NFKC & NFKD
2 the length of original length changed or not
  • A(not changed): NFC & NFKC
  • B(changed): NFD & NFKD
    1 Whether the original form is changed or not
    abcABC (NFC)> abcABC
    abcABC (NFD)> abcABC
    abcABC (NFKC)> abcABC
    abcABC (NFKD)> abcABC
1 original form changed or not
  • A(not changed): NFC & NFD
  • B(changed): NFKC & NFKD
    The first classification method is based on whether the original form is changed or not. More specifically, A group does not contain K but B group contains K. What does K means?

D = Decomposition
C = Composition
K = Compatibility
K means compatibility, which is used to distinguish with the original form. Because K changes the original form, so the length is also changed.

s= ‘…’
normalize(‘NFKC’, s)
‘…’
len(s)
1
len(normalize(‘NFC’, s))
1
len(normalize(‘NFKC’, s))
3
len(normalize(‘NFD’, s))
1
len(normalize(‘NFKD’, s))
3
2 Whether the length of original form is changed or not
パピプペポ (NFC)> パピプペポ
パピプペポ (NFD)> パピプペポ
パピプペポ (NFKC)> パピプペポ
パピプペポ (NFKD)> パピプペポ

2 the length of original length changed or not
  • A(not changed): NFC & NFKC
  • B(changed): NFD & NFKD
    This second classification method is based on whether the length of the original form is changed or not. A group contains C(Composition), which won’t change the length. B group contains D(Decomposition), which will change the length.

You might be wondering why the length is change? Please see the test below.

from unicodedata import normalize
s = “パピプペポ”
len(s)
5
len(normalize(‘NFC’, s))
5
len(normalize(‘NFKC’, s))
5
len(normalize(‘NFD’, s))
10
len(normalize(‘NFKD’, s))
10
We can find the “decomposition” method doubles the length.

from Unicode正規化とは
This is because the NFD & NFKD decompose each Unicode character into two Unicode characters. For example, ポ(U+30DD) = ホ(U+30DB) + Dot(U+309A) . So the length change from 5 to 10. NFC & NFKC compose separated Unicode characters together, so the length is not changed.

Python Implementation
You can use the unicodedata library to get different forms.

from unicodedata import normalize
s = “パピプペポ”
len(s)
5
len(normalize(‘NFC’, s))
5
len(normalize(‘NFKC’, s))
5
len(normalize(‘NFD’, s))
10
len(normalize(‘NFKD’, s))
10
Length

Take Away
Usually, we can use either of NFKC or NFKD to get the normalized form. The length won’t make trouble only if your NLP task is length sensitive. I usually use the NFKC method.

Check out my other posts on Medium with a categorized view!
GitHub: BrambleXu
LinkedIn: Xu Liang
Blog: BrambleXu

Reference
https://unicode.org/reports/tr15/#Norm_Forms
https://www.wikiwand.com/en/Unicode_equivalence#/Normal_forms
http://nomenclator.la.coocan.jp/unicode/normalization.htm
https://maku77.github.io/js/string/normalize.html
http://tech.albert2005.co.jp/501/

https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c

欢迎分享,转载请注明来源:内存溢出

原文地址:https://54852.com/langs/918920.html

(0)
打赏 微信扫一扫微信扫一扫 支付宝扫一扫支付宝扫一扫
上一篇 2022-05-16
下一篇2022-05-16

发表评论

登录后才能评论

评论列表(0条)

    保存