The non-breaking space problem

Posted on 2022-01-24

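While going through a large volume of data, I noticed that a few records occasionally failed to match.
Printing them side by side showed that they looked exactly the same on screen.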

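Only after converting the strings to bytes did I notice one space that differed from the others.
Its code point is U+00A0, which Python displays as \xa0 (encoded as UTF-8 it becomes the two bytes \xc2\xa0).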
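
As a rough illustration of how the hidden character can be exposed, the sketch below compares two look-alike strings and inspects one of them with repr(), encode() and unicodedata. The sample strings are made up for this example and are not the original data.

```python
import unicodedata

ordinary = "16 kg"
suspect = "16\xa0kg"  # hypothetical sample containing a non-breaking space

# The two strings look alike when printed but do not compare equal.
print(ordinary == suspect)          # False

# repr() and the encoded bytes expose the hidden character.
print(repr(suspect))                # '16\xa0kg'
print(suspect.encode("utf-8"))      # b'16\xc2\xa0kg'

# unicodedata can name every non-ASCII character found in the string.
for ch in suspect:
    if ord(ch) > 127:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))  # U+00A0 NO-BREAK SPACE
```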

After some digging, I found the explanation in this article:
https://en.wikipedia.org/wiki/Non-breaking_space

It turns out this is a non-breaking space!

Let's first look at a few ways to get rid of non-breaking spaces.

Replace

The simplest option is a plain string.replace:

s = s.replace('\xa0', ' ')
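
To make the effect on a comparison concrete, here is a minimal sketch. The helper name normalize_nbsp and the sample strings are assumptions made for illustration, not part of the original data.

```python
NBSP = "\xa0"  # U+00A0 NO-BREAK SPACE


def normalize_nbsp(text: str) -> str:
    """Replace each non-breaking space with an ordinary space (hypothetical helper)."""
    return text.replace(NBSP, " ")


scraped = "16\xa0kg on 24th\xa0June 2021"   # looks normal when printed
expected = "16 kg on 24th June 2021"        # uses ordinary spaces only

print(scraped == expected)                  # False
print(normalize_nbsp(scraped) == expected)  # True
```

Because replace only touches U+00A0, every other character, including existing ordinary spaces, is left exactly as it was.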

Beautiful Soup

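Passing strip=True makes Beautiful Soup strip whitespace, non-breaking spaces included, from the beginning and end of each piece of text, so it returns a cleaner string: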

element.get_text(strip=True)

This does remove the non-breaking space, but it deletes it outright!
It does not put an ordinary space back in its place.

Since I need to compare the strings afterwards, this method does not work for me.

Join

In Python 3, strings are Unicode, and str.split() with no arguments treats the non-breaking space as a whitespace character.

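So splitting on whitespace and re-joining with string.join solves the problem. Keep in mind that this also collapses runs of whitespace into a single space and trims both ends of the string.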

s = ' '.join(s.split())
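
The two approaches are not interchangeable on every input. The sketch below uses a made-up string with extra spaces, a tab and a trailing newline to show where they differ.

```python
# Hypothetical input with leading spaces, a double space, a tab and a newline.
s = "  16\xa0kg  on\t24th\xa0June 2021\n"

replaced = s.replace("\xa0", " ")   # touches only the non-breaking spaces
joined = " ".join(s.split())        # normalizes all whitespace

print(repr(replaced))  # '  16 kg  on\t24th June 2021\n'
print(repr(joined))    # '16 kg on 24th June 2021'
```

If the surrounding whitespace matters for the comparison, replace is the more conservative choice; if the goal is fully normalized text, join does more in one step.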

Performance comparison

Since both Replace and Join meet my needs,
I took this opportunity to compare their performance.

import time

# Sample string containing two non-breaking spaces
str_hard_space = '16\xa0kg on 24th\xa0June 2021'

# Time 100,000 runs of the Replace approach
start_time = time.time()
for _ in range(100000):
    str_hard_space.replace('\xa0', ' ')
end_time = time.time()

print(end_time - start_time)

# Time 100,000 runs of the Join approach
start_time = time.time()
for _ in range(100000):
    ' '.join(str_hard_space.split())
end_time = time.time()

print(end_time - start_time)

Result

0.01523280143737793
0.03563690185546875

Replace wins: in this run it was roughly 2.3 times as fast as Join!
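
A single wall-clock measurement can be noisy, so as a cross-check here is a sketch using timeit from the standard library, assuming the same sample string. The function names with_replace and with_join are hypothetical, and the absolute numbers will vary by machine and Python version.

```python
import timeit

s = "16\xa0kg on 24th\xa0June 2021"


def with_replace() -> str:
    return s.replace("\xa0", " ")


def with_join() -> str:
    return " ".join(s.split())


# Both approaches should agree on this input before we time them.
assert with_replace() == with_join()

# repeat() runs several rounds; the minimum is the least noisy figure.
replace_best = min(timeit.repeat(with_replace, number=100_000, repeat=5))
join_best = min(timeit.repeat(with_join, number=100_000, repeat=5))

print(f"replace: {replace_best:.4f} s")
print(f"join:    {join_best:.4f} s")
```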

Done!
