Difference Between NFD, NFC, NFKD, and NFKC Explained with Python Code

The difference between Unicode normalization forms

Recently I have been working on an NLP task in Japanese, and one problem is converting special characters to a normalized form. So I did a little research and wrote this post for anyone who has the same need.

Japanese text contains different forms of the same character. For example, Latin letters come in two forms: full-width and half-width.

The full-width form, as in ＫＡＤＯＫＡＷＡ, looks awkward and is also hard to use in downstream processing. So we need to convert it to a normalized form.
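To make the two forms concrete, here is a minimal sketch that asks Python's unicodedata module how it classifies a few characters. The helper name describe is my own, introduced only for illustration.

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, East Asian width class) for one character.

    `describe` is a hypothetical helper, not part of unicodedata.
    """
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.east_asian_width(ch),
    )


# A full-width letter, its ASCII counterpart, a half-width katakana,
# and the ordinary (wide) katakana.
for ch in ["Ａ", "A", "ｱ", "ア"]:
    print(ch, *describe(ch))
```

The width class comes back as F (full-width), Na (narrow), H (half-width), or W (wide), so the full-width Ａ and the ASCII A are reported as two different characters even though they mean the same letter.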

TL;DR
Use the NFKC form.

>>> from unicodedata import normalize
>>> # the words are separated by full-width spaces (U+3000)
>>> s = "株式会社ＫＡＤＯＫＡＷＡ　Ｆｕｔｕｒｅ　Ｐｕｂｌｉｓｈｉｎｇ"
>>> normalize("NFKC", s)
'株式会社KADOKAWA Future Publishing'
Unicode normalization forms

(Figure: overview of the normalization forms, from Wikipedia)
There are four Unicode normalization forms. This article gives a very detailed explanation, but I will explain the differences in a simple, easy-to-understand way.

First, look at the results below to get an intuitive understanding.

ｱｲｳｴｵ (NFC)> ｱｲｳｴｵ
ｱｲｳｴｵ (NFD)> ｱｲｳｴｵ
ｱｲｳｴｵ (NFKC)> アイウエオ
ｱｲｳｴｵ (NFKD)> アイウエオ
パピプペポ (NFC)> パピプペポ
パピプペポ (NFD)> パピプペポ
パピプペポ (NFKC)> パピプペポ
パピプペポ (NFKD)> パピプペポ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ (NFC)> ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ (NFD)> ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ (NFKC)> パピプペポ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ (NFKD)> パピプペポ
ａｂｃＡＢＣ (NFC)> ａｂｃＡＢＣ
ａｂｃＡＢＣ (NFD)> ａｂｃＡＢＣ
ａｂｃＡＢＣ (NFKC)> abcABC
ａｂｃＡＢＣ (NFKD)> abcABC
１２３ (NFC)> １２３
１２３ (NFD)> １２３
１２３ (NFKC)> 123
１２３ (NFKD)> 123
＋－．～）｝ (NFC)> ＋－．～）｝
＋－．～）｝ (NFD)> ＋－．～）｝
＋－．～）｝ (NFKC)> +-.~)}
＋－．～）｝ (NFKD)> +-.~)}
These four forms can be classified in two ways.

1 Whether the original form is changed or not

A (not changed): NFC & NFD
B (changed): NFKC & NFKD

2 Whether the length of the original form is changed or not

A (not changed): NFC & NFKC
B (changed): NFD & NFKD
1 Whether the original form is changed or not
ａｂｃＡＢＣ (NFC)> ａｂｃＡＢＣ
ａｂｃＡＢＣ (NFD)> ａｂｃＡＢＣ
ａｂｃＡＢＣ (NFKC)> abcABC
ａｂｃＡＢＣ (NFKD)> abcABC

The first classification is based on whether the original form is changed or not. More specifically, the names in group A do not contain K, while the names in group B do. What does K mean?

D = Decomposition
C = Composition
K = Compatibility
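K stands for compatibility. The K forms replace compatibility characters, such as full-width letters, with their ordinary equivalents, which is what distinguishes the result from the original form. Because K changes the original form, the length can change as well.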

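>>> from unicodedata import normalize
>>> s = '…'  # a single character: U+2026 HORIZONTAL ELLIPSIS
>>> normalize('NFKC', s)  # three ASCII full stops
'...'
>>> len(s)
1
>>> len(normalize('NFC', s))
1
>>> len(normalize('NFKC', s))
3
>>> len(normalize('NFD', s))
1
>>> len(normalize('NFKD', s))
3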
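If you want to see where the K behaviour comes from, the sketch below reads the raw decomposition mapping that the Unicode database stores for a character. As far as I can tell, a mapping that starts with a tag such as <wide> or <compat> is a compatibility mapping (applied only by NFKC and NFKD), while an untagged one is canonical. The helpers mapping and is_compatibility are names I made up for this example.

```python
import unicodedata


def mapping(ch):
    """Return the raw decomposition field of one character ('' if it has none)."""
    return unicodedata.decomposition(ch)


def is_compatibility(ch):
    """Hypothetical helper: True if the mapping carries a <tag>, i.e. only K forms apply it."""
    return mapping(ch).startswith("<")


print(repr(mapping("ａ")))  # '<wide> 0061'
print(repr(mapping("ｱ")))   # '<narrow> 30A2'
print(repr(mapping("…")))   # '<compat> 002E 002E 002E'
print(repr(mapping("パ")))  # '30CF 309A' (no tag, so NFD and NFKD both apply it)
print(is_compatibility("ａ"), is_compatibility("パ"))  # True False
```

This matches the results above: ａ and … are touched only by the K forms, while パ is split by every D form.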
2 Whether the length of the original form is changed or not
パピプペポ (NFC)> パピプペポ
パピプペポ (NFD)> パピプペポ
パピプペポ (NFKC)> パピプペポ
パピプペポ (NFKD)> パピプペポ

The second classification is based on whether the length of the original form is changed or not. Group A contains C (Composition), which keeps precomposed characters such as パ as single characters, so the length does not change. Group B contains D (Decomposition), which splits them apart, so the length changes.

You might be wondering why the length changes. Please see the test below.

>>> from unicodedata import normalize
>>> s = "パピプペポ"
>>> len(s)
5
>>> len(normalize("NFC", s))
5
>>> len(normalize("NFKC", s))
5
>>> len(normalize("NFD", s))
10
>>> len(normalize("NFKD", s))
10
We can see that decomposition doubles the length of this string.

(Figure from Unicode正規化とは, "What is Unicode normalization?")
This is because NFD and NFKD decompose each of these characters into two Unicode characters. For example, ポ (U+30DD) = ホ (U+30DB) + the combining semi-voiced sound mark (U+309A). So the length changes from 5 to 10. NFC and NFKC compose the separated characters back together, so the length of this string is not changed.
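The decomposition of ポ can be checked directly. The following is a small sketch that lists the code points before and after NFD, then composes them again with NFC. The helper code_points is my own name, not a library function.

```python
import unicodedata
from unicodedata import normalize


def code_points(text):
    """Hypothetical helper: list the code points of a string as 'U+XXXX'."""
    return [f"U+{ord(ch):04X}" for ch in text]


original = "ポ"
decomposed = normalize("NFD", original)
recomposed = normalize("NFC", decomposed)

print(code_points(original))    # ['U+30DD']
print(code_points(decomposed))  # ['U+30DB', 'U+309A']
print([unicodedata.name(ch) for ch in decomposed])
print(recomposed == original)   # True
```

Note that the decomposed and composed strings look identical when printed, which is why comparing them with == without normalizing first can give surprising results.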

Python Implementation
You can use Python's built-in unicodedata module to get the different forms.

>>> from unicodedata import normalize
>>> s = "パピプペポ"
>>> len(s)
5
>>> len(normalize("NFC", s))
5
>>> len(normalize("NFKC", s))
5
>>> len(normalize("NFD", s))
10
>>> len(normalize("NFKD", s))
10
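To compare all four forms at once, a loop like the one below should print a table in the same "text (FORM)> result" layout used earlier in this post. It is only a sketch; compare_forms and FORMS are names I chose for the example.

```python
from unicodedata import normalize

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def compare_forms(text):
    """Hypothetical helper: map each normalization form name to its result."""
    return {form: normalize(form, text) for form in FORMS}


samples = ["ｱｲｳｴｵ", "パピプペポ", "ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ", "ａｂｃＡＢＣ", "１２３"]

for sample in samples:
    for form, result in compare_forms(sample).items():
        # The length is printed too, because decomposed and composed
        # results often look the same on screen.
        print(f"{sample} ({form})> {result} (len {len(result)})")
```

Printing the length next to each result makes the difference between the C and D forms visible even when the rendered text is the same.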

Take Away
Usually, we can use either NFKC or NFKD to get the normalized form. The difference in length only causes trouble if your NLP task is length-sensitive. I usually use NFKC.
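In practice this can be wrapped in a small preprocessing step. The sketch below assumes a helper called normalize_text, a name I made up, that defaults to NFKC. The point it illustrates is that whichever form you pick, you should apply the same one to every string you compare.

```python
from unicodedata import normalize


def normalize_text(text, form="NFKC"):
    """Hypothetical preprocessing helper: normalize text to one chosen form."""
    return normalize(form, text)


full_width = "ＫＡＤＯＫＡＷＡ"
ascii_text = "KADOKAWA"

print(full_width == ascii_text)                                  # False
print(normalize_text(full_width) == normalize_text(ascii_text))  # True

# Mixing forms breaks equality even for visually identical text.
print(normalize_text("パ", "NFKC") == normalize_text("パ", "NFKD"))  # False
```

Normalizing an already normalized string returns it unchanged, so it should be safe to call such a helper more than once in a pipeline.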

Check out my other posts on Medium with a categorized view!
GitHub: BrambleXu
LinkedIn: Xu Liang
Blog: BrambleXu

Reference
https://unicode.org/reports/tr15/#Norm_Forms
https://www.wikiwand.com/en/Unicode_equivalence#/Normal_forms
http://nomenclator.la.coocan.jp/unicode/normalization.htm
https://maku77.github.io/js/string/normalize.html
http://tech.albert2005.co.jp/501/

https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c
