Bloom filter

Published at 2024. 6. 17.

7 min read

#Data Structure

Table of Contents

Bloom filter ๋ž€?

์–ผ๋งˆ ์ „์—, ์ž๋ฃŒ๊ตฌ์กฐ ๊ด€๋ จ ๊ณต๋ถ€๋ฅผ ํ•˜๋‹ค๊ฐ€ bloom filter์— ๋Œ€ํ•ด์„œ ์•Œ๊ฒŒ ๋˜์—ˆ๋‹ค.
์กฐ๊ธˆ ๋” ๊ณต๋ถ€ํ•ด๋ณด๋‹ˆ ํ™•๋ฅ ์  ์ž๋ฃŒ๊ตฌ์กฐ๋ผ๊ณ  ํ•˜๋˜๋ฐ ์ž๋ฃŒ๊ตฌ์กฐ๋ฉด ์ž๋ฃŒ๊ตฌ์กฐ์ง€ โ€˜ํ™•๋ฅ ์ โ€™ ์ž๋ฃŒ๊ตฌ์กฐ๋Š” ๋˜ ๋ญ์ง€? ์‹ถ์–ด์„œ ๊ตฌ๊ธ€๋ง์„ ํ•ด๋ณด์•˜๋‹ค.

Wikipedia์—๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ ํ˜€์žˆ๋‹ค.

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not โ€“ in other words, a query returns either โ€œpossibly in setโ€ or โ€œdefinitely not in setโ€. Elements can be added to the set, but not removed (though this can be addressed with the counting Bloom filter variant); the more items added, the larger the probability of false positives.

์š”์•ฝํ•ด๋ณด์ž๋ฉด Burton Howard Bloom ์ด๋ผ๋Š” ๋ถ„์ด ๊ณ ์•ˆํ–ˆ๊ณ , ์–ด๋–ค set์— ํŠน์ • ์š”์†Œ๊ฐ€ ํฌํ•จ๋˜๋Š”์ง€ ์—ฌ๋ถ€๋ฅผ ํŒ๋‹จํ•˜๋Š”(์ด๊ฑธ membership query๋ผ๊ณ  ๋ถ€๋ฆ„)๋ฐ์— ์‚ฌ์šฉ๋œ๋‹ค๊ณ  ํ•œ๋‹ค.

๋ช‡๊ฐ€์ง€ ์ฃผ์š”ํ•œ ํŠน์ง•์ด๋ผ๊ณ  ํ•œ๋‹ค๋ฉด,

  • ๋ชจ๋“  ์š”์†Œ๋ฅผ ๋ฉ”๋ชจ๋ฆฌ์— ์ ์žฌํ•˜์ง€ ์•Š๊ณ ๋„ membership test๋ฅผ ํ•  ์ˆ˜ ์žˆ๋‹ค.
  • ์š”์†Œ๋ฅผ bloom filter์— ๋”ํ•  ์ˆœ ์žˆ์–ด๋„, ๋บ„ ์ˆœ ์—†๋‹ค.
  • ์–ด๋–ค ์š”์†Œ๊ฐ€ ์ง‘ํ•ฉ์— ์†ํ•œ๋‹ค๋Š” ๊ฒƒ์€ ํ™•์‹คํ•˜๊ฒŒ ์•Œ ์ˆ˜ ์—†์ง€๋งŒ, ์†ํ•˜์ง€ ์•Š๋Š”๋‹ค๋Š” ๊ฒƒ์€ ํ™•์‹คํ•˜๊ฒŒ ์•Œ ์ˆ˜ ์žˆ๋‹ค. ์ฆ‰, false positive๋Š” ์กด์žฌํ•˜๋‚˜ false negative๋Š” ์กด์žฌํ•˜์ง€ ์•Š๋Š”๋‹ค.

์ด ์ •๋„๊ฐ€ ์•„๋‹๊นŒ ํ•œ๋‹ค.

โ€˜ํ™•๋ฅ ์ โ€™ ์ž๋ฃŒ๊ตฌ์กฐ๋ผ๊ณ  ์ด๋ฆ„ ๋ถ™์€ ๊ฒƒ์€ ์•„๋งˆ false positive๊ฐ€ ์กด์žฌํ•˜๊ธฐ ๋•Œ๋ฌธ์ด์ง€ ์•Š์„๊นŒ ์ƒ๊ฐํ•œ๋‹ค. ํŠน์ • ์ง‘ํ•ฉ์— ์†ํ•˜๋Š” ๊ฒƒ์„ ํ™•์‹คํžˆ ์•Œ ์ˆ˜ ์žˆ๋‹ค๋ฉด โ€˜ํ™•๋ฅ ์ โ€™ ์ด ์•„๋‹ˆ๋ผ โ€˜๊ฒฐ์ •์ โ€™ ์ผ ํ…Œ๋‹ˆ๊นŒ?

bloom filter์— ๊ด€ํ•œ ์ž์„ธํ•œ ์„ค๋ช…์€ Naver D2 ๋ธ”๋กœ๊ทธ์— ์ž˜ ์ •๋ฆฌ๋˜์–ด ์žˆ์–ด์„œ ๋งํฌ๋กœ ๊ฐˆ์Œํ•ด๋‘”๋‹ค.
Hash function์— ๊ด€ํ•œ ๊ธฐ๋ณธ์ ์ธ ์ดํ•ด์™€ ์ด๊ณผ ๊ณ ๋“ฑ์ˆ˜ํ•™ ์ •๋„ ์•Œ๊ณ  ์žˆ์œผ๋ฉด ์ดํ•ดํ•˜๋Š” ๋ฐ์— ํฐ ์–ด๋ ค์›€์€ ์—†๋Š” ์•„ํ‹ฐํด์ด๋‹ค.

์–ด๋””์— ์“ฐ์ด๊ณ  ์žˆ๋Š”์ง€?

ํ™•๋ฅ ์  ์ž๋ฃŒ๊ตฌ์กฐ๋ผ๋Š” ๊ฐœ๋…์ด ์žฌ๋ฏธ์žˆ๊ธฐ๋Š” ํ•ด๋„ ์‹ค์ œ ์—”์ง€๋‹ˆ์–ด๋ง ํ•„๋“œ์—์„œ ์“ฐ์ž„์ด ์—†๋‹ค๋ฉด ๊ณต๋ถ€ํ•˜๋Š” ์˜๋ฏธ๊ฐ€ ๋ณ„๋กœ ์—†์—ˆ์„ ํ…Œ์ง€๋งŒ, ์ด๋ฏธ ์—ฌ๋Ÿฌ ๋„๋ฉ”์ธ์—์„œ ํ™œ๋ฐœํžˆ ์‚ฌ์šฉ๋˜๊ณ  ์žˆ๋Š” ์ž๋ฃŒ๊ตฌ์กฐ์˜€๋‹ค.

์‹ค๋ฌด์—์„œ ํ”ํžˆ ๋ณผ ์ˆ˜ ์žˆ๋Š” Redis์—์„œ๋„ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ๊ณ , Kafka ๊ฐ™์€ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ํˆด์—์„œ๋„ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ๋‹ค๊ณ  ํ•œ๋‹ค.
NHN Cloud๋ผ๋Š” ์กฐ์ง์—์„œ๋„ bloom filter ํ™œ์šฉ๊ธฐ๋ฅผ ํฌ์ŠคํŒ…์œผ๋กœ ๋งŒ๋“ค์–ด ๋‘์—ˆ๋‹ค.

๋‚˜๋Š” ์•„๋ฌด๋ž˜๋„ ํ”„๋ก ํŠธ์—”๋“œ ์—”์ง€๋‹ˆ์–ด๋‹ค ๋ณด๋‹ˆ ์—ฌ๋Ÿฌ use-case๋“ค ์ค‘์—์„œ Chrome Browser ์—์„œ์˜ use-case์— ๊ด€์‹ฌ์ด ๊ฐ”๋Š”๋ฐ, ์ด use-case์— ๊ด€ํ•ด ์กฐ๊ธˆ ๋” ๊นŠ๊ฒŒ ์‚ดํŽด๋ณด๊ณ  ๊ธ€์„ ๋งˆ์น ๊นŒ ํ•œ๋‹ค.

์šฐ์„ , ๋‚ด๊ฐ€ Chrome ๊ฐœ๋ฐœ์ž๋ผ๊ณ  ์ƒ๊ฐํ•ด๋ณด์ž. Chrome์€ Browser ์‹œ์žฅ์—์„œ 60%๊ฐ€ ๋„˜๋Š” ์ ์œ ์œจ์„ ๊ธฐ๋กํ•˜๊ณ  ์žˆ๋Š” ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์ด๋‹ˆ DAU๋Š” ๋ง๋„ ์•ˆ๋˜๊ฒŒ ๋†’์„ ๊ฒƒ์ด๋‹ค. ์ด๋Ÿฐ ์ƒํ™ฉ์—์„œ Chrome์— ์•…์„ฑ URL์„ filtering ํ•˜๋Š” ๊ธฐ๋Šฅ์„ ํƒ‘์žฌํ•˜๋ ค๋Š” ์‹œ๋„๋ฅผ ํ•œ๋‹ค๊ณ  ๊ฐ€์ •ํ•ด๋ณด์ž(์ด๋ฏธ ์žˆ๋Š” ๊ธฐ๋Šฅ์ด์ง€๋งŒ).

๊ฐ€์žฅ ์‰ฝ๊ฒŒ ๋– ์˜ฌ๋ฆด ์ˆ˜ ์žˆ๋Š” solution์€ ๋‹ค์Œ๊ณผ ๊ฐ™์„ ๊ฒƒ์ด๋‹ค. ์œ ์ €๊ฐ€ ์ฃผ์†Œ์ฐฝ์— URL์„ ์ž…๋ ฅํ•˜๋ฉด ํ•ด๋‹น URL์ด ์•…์„ฑ URL์ธ์ง€ ์•„๋‹Œ์ง€๋ฅผ ํŒ๋‹จํ•˜๋Š” HTTP request๋ฅผ Chrome ์„œ๋ฒ„์— ๋ณด๋‚ด์„œ ํ™•์ธํ•˜๋Š” ๋ฐฉ๋ฒ•.
ํ•˜์ง€๋งŒ ์ด ๋ฐฉ๋ฒ•์€ ์กฐ๊ธˆ๋งŒ ์ƒ๊ฐํ•ด๋ด๋„ ๊ต‰์žฅํ•œ ๋น„ํšจ์œจ์„ ๋ฐœ์ƒ์‹œํ‚ฌ ๊ฒƒ์ด๋ผ๋Š” ๊ฒƒ์„ client-server ์•„ํ‚คํ…์ณ๋ฅผ ํ™œ์šฉํ•œ ๊ฐœ๋ฐœ์„ ํ•ด๋ณธ ๊ฐœ๋ฐœ์ž๋ผ๋ฉด ์‰ฝ๊ฒŒ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋‹ค. ๋‹น์žฅ ์ƒ๊ฐ๋‚˜๋Š” ๋น„ํšจ์œจ์„ ๋ช‡ ๊ฐ€์ง€ ์ ์–ด๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  1. ๊ธฐ๋ณธ์ ์œผ๋กœ, ๋„คํŠธ์›Œํฌ ํ†ต์‹  ๋น„์šฉ(HTTP request/response round trip time)์€ ๋น„์‹ธ๋‹ค. ์—ฌ๊ธฐ์— Chrome์˜ DAU๊นŒ์ง€ ๊ณ ๋ คํ•˜๋ฉด ์ด ๋น„์šฉ์€ ๊ธฐํ•˜๊ธ‰์ˆ˜์ ์œผ๋กœ ์น˜์†Ÿ์„ ๊ฒƒ์ด๋‹ค.
  2. ์ด๋ ‡๊ฒŒ๋‚˜ ๋งŽ์€ ํŠธ๋ž˜ํ”ฝ์„ ๊ฐ๋‹นํ•˜๊ธฐ ์œ„ํ•œ ์„œ๋ฒ„๋ฅผ ์•ˆ์ •์ ์œผ๋กœ(high availability๋ฅผ ํ™•๋ณด) ์œ ์ง€ํ•˜๊ธฐ ์œ„ํ•œ ๋น„์šฉ๋„ ๋งŒ๋งŒ์น˜ ์•Š์„ ๊ฒƒ์ด๋‹ค.

์ด๋Ÿฐ ๊ฒฝ์šฐ์— ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์ด ๋ฐ”๋กœ bloom filter์ด๋‹ค.
์ •ํ™•ํžˆ Chrome์—์„œ bloom filter๋ฅผ ํ™œ์šฉํ•˜๋Š” ๋กœ์ง์€ ์•Œ ์ˆ˜ ์—†์ง€๋งŒ, ์•„๋งˆ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํ˜•ํƒœ์ด์ง€ ์•Š์„๊นŒ?

  1. Chrome ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ์„ค์น˜ํ•  ๋•Œ, ์•…์„ฑ URL๊ณผ ๊ด€๋ จํ•œ blacklist data๋ฅผ ์œ ์ €์˜ local machine์— ๋‹ค์šด๋กœ๋“œ ๋ฐ›์•„๋‘”๋‹ค.
  2. ์œ ์ €๊ฐ€ Chrome์„ ํ†ตํ•ด ์ƒˆ๋กœ์šด URL์— ์ ‘๊ทผํ•  ๋•Œ, ์œ ์ €์˜ machine์— ์กด์žฌํ•˜๋Š” blacklist data๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋งŒ๋“  bloom filter๋กœ ์•…์„ฑ URL์ธ์ง€ ์—ฌ๋ถ€๋ฅผ ํŒ๋‹จํ•œ๋‹ค.
  3. ๋งŒ์•ฝ bloom filter์— URL์„ ํ†ต๊ณผ์‹œํ‚จ ๊ฒฐ๊ณผ๊ฐ€ ์ฐธ์ด๋ผ๋ฉด, ์•…์„ฑ URL์ธ์ง€ ํ™•์‹ค์น˜ ์•Š๋‹ค๋Š” ๋œป์ด๋ฏ€๋กœ ์ด ๊ฒฝ์šฐ์—๋งŒ ์„œ๋ฒ„์— HTTP request๋ฅผ ๋ณด๋‚ด์„œ ์•…์„ฑ URL์ธ์ง€ ์—ฌ๋ถ€๋ฅผ ํŒ๋‹จํ•œ๋‹ค. ๊ฒฐ๊ณผ๊ฐ€ ๊ฑฐ์ง“์ด๋ผ๋ฉด, ์•…์„ฑ URL์ž„์ด ํ™•์‹คํ•˜๋ฏ€๋กœ(bloom filter ์ž๋ฃŒ๊ตฌ์กฐ์—๋Š” false negative์ธ ๊ฒฝ์šฐ๊ฐ€ ์—†์œผ๋ฏ€๋กœ) ํ•ด๋‹น URL๋กœ์˜ ์ ‘๊ทผ์„ ์ฐจ๋‹จํ•ด์ค€๋‹ค.

References

Slow But Steady ยฉ 2024