Doby's Lab

Problems with Label Encoding (with ChatGPT)

Doby (도비) 2023. 7. 20. 02:08

📄 Intro

When training a model on categorical variables, One-Hot Encoding is usually recommended because of the problems with Label Encoding.

However, there was a part of that advice I did not understand, so I decided to write this post. I used ChatGPT to help write it.


📄 Problems with Label Encoding

ChatGPT summarized the problems with Label Encoding in three points. (It initially also listed increased dimensionality as a problem, but when I asked again, it confirmed that this was a mistake.)

  1. Imposed order or rank: assigning an order or rank to independent (nominal) categorical variables adversely affects the algorithm.
  2. Distorted variable evaluation: the numeric values produced by Label Encoding do not represent the relative importance of the values or the distance between them.
  3. Prediction limits: when a new category appears, the model cannot recognize it, which lowers accuracy.
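Points 1 and 2 can be illustrated with a minimal sketch. The `colors` data and the helper functions below are hypothetical, made up for illustration only; the label mapping simply follows alphabetical order, which is one arbitrary choice among many.

```python
# Hypothetical nominal categories with no natural order.
colors = ["red", "green", "blue"]
ordered = sorted(colors)  # ['blue', 'green', 'red']

# Label Encoding: each category gets an arbitrary integer (alphabetical here).
label_map = {c: i for i, c in enumerate(ordered)}

# One-Hot Encoding: each category gets its own indicator column.
one_hot_map = {c: [1 if c == k else 0 for k in ordered] for c in colors}


def label_distance(a, b):
    """Absolute difference between the integer codes of two categories."""
    return abs(label_map[a] - label_map[b])


def one_hot_distance(a, b):
    """Squared Euclidean distance between two indicator vectors."""
    return sum((x - y) ** 2 for x, y in zip(one_hot_map[a], one_hot_map[b]))


# With label codes, 'blue' looks twice as far from 'red' as from 'green'.
print(label_distance("blue", "red"), label_distance("blue", "green"))

# With indicator vectors, every pair of distinct categories is equally far apart.
print(one_hot_distance("blue", "red"), one_hot_distance("blue", "green"))
```

The integer codes introduce an ordering and unequal distances that the original categories never had, while the indicator vectors keep all distinct categories equidistant.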

Points 1 and 2 are the ones that puzzled me. My question was: 'It does seem wrong for independent data to carry unrelated order or rank information, but if the model was trained that way, the results should still come out fine, so why is this treated as a problem?'

 

ChatGPT's comment on this was: 'The model's predictions themselves may not change much, but when you try to interpret the model's weights or feature importances, those importances can be distorted depending on the numeric values the variable was encoded to. This makes the model harder to interpret and carries the risk of drawing wrong conclusions.'
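As a rough illustration of that comment, the sketch below fits a one-feature least-squares line to the same data under two different label mappings. The `cities` and `prices` values and both mappings are hypothetical, and `fit_slope` is a hand-written helper rather than a library function.

```python
def fit_slope(xs, ys):
    """Ordinary least-squares slope for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical data: city is a nominal variable, price is the target.
cities = ["seoul", "busan", "daegu", "seoul", "busan", "daegu"]
prices = [10.0, 4.0, 6.0, 10.0, 4.0, 6.0]

# Two equally arbitrary label mappings for the same categories.
mapping_a = {"busan": 0, "daegu": 1, "seoul": 2}
mapping_b = {"seoul": 0, "daegu": 1, "busan": 2}

slope_a = fit_slope([mapping_a[c] for c in cities], prices)
slope_b = fit_slope([mapping_b[c] for c in cities], prices)

# The learned weight flips sign purely because the integer codes changed.
print(slope_a, slope_b)
```

The data is identical in both fits, so any story read off the sign or size of the weight is really a story about the chosen mapping.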

 

That explanation concerned linear and tree-based models, so I also asked about DNNs. The answer was that they are harder to interpret than the former, but that some interpretation is possible, and it described the relevant methods.

 

The part I found persuasive was point 3, the prediction limits. If a model that has already finished training on the train set receives a new encoding value for a previously unseen category in the test set, it can make a wrong decision, and I thought that was a reasonable point.

 

Of course, there are ways to handle a new category that shows up in the test set after training on the train set has finished.

However, this applies to One-Hot Encoding just as much.
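One possible way to handle unseen categories under both encodings is sketched below. Both classes are hypothetical helpers written for this post, not a library API, and the fallback choices (the reserved code `-1` and an all-zero row) are assumptions, not the only options.

```python
class LabelEncoderWithUnknown:
    """Hypothetical helper: maps unseen categories to the reserved code -1."""

    def fit(self, values):
        self.mapping = {v: i for i, v in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        return [self.mapping.get(v, -1) for v in values]


class OneHotEncoderWithUnknown:
    """Hypothetical helper: encodes unseen categories as an all-zero row."""

    def fit(self, values):
        self.categories = sorted(set(values))
        return self

    def transform(self, values):
        return [[1 if v == c else 0 for c in self.categories] for v in values]


train = ["cat", "dog", "cat"]
test = ["dog", "bird"]  # 'bird' never appeared in the train set

label_rows = LabelEncoderWithUnknown().fit(train).transform(test)
one_hot_rows = OneHotEncoderWithUnknown().fit(train).transform(test)

print(label_rows)    # the unseen category falls back to -1
print(one_hot_rows)  # the unseen category becomes a row of zeros
```

Either way the model still has no learned information about the new category, so the fallback only prevents a failure at encoding time.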

 

💡 Conclusion

In conclusion, I confirmed that Label Encoding of a categorical variable has a potential problem in that it imposes an order or rank, and that it also has problems with model interpretability and with prediction limits.


📄 Label Encoding vs One-Hot Encoding

So which of Label Encoding and One-Hot Encoding is better depends on the situation, as Ref.2 below shows.

 

Setting model interpretability aside, One-Hot Encoding can be the first choice because of the order that Label Encoding imposes, but the following points should be considered before adopting it.

  • Whether the categorical variable is ordinal (rank, school level, etc.)
  • The number of unique values (distinct categories) in the variable (when there are many, One-Hot Encoding is inefficient because it consumes a lot of memory)
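Both considerations can be sketched in a few lines. The education levels, the user IDs, and the `one_hot_width` helper are hypothetical examples, and the cell counts assume a dense matrix (sparse storage would need less memory).

```python
# Hypothetical ordinal variable: the order is real, so an explicit integer
# mapping (a hand-written Label Encoding) keeps useful information.
education_order = ["elementary", "middle", "high", "university"]
education_map = {level: rank for rank, level in enumerate(education_order)}


def one_hot_width(values):
    """Number of columns One-Hot Encoding needs: one per unique value."""
    return len(set(values))


# Hypothetical high-cardinality nominal variable with 1,000 unique values.
user_ids = [f"user_{i}" for i in range(1000)]

# Dense cell counts: Label Encoding uses one column per row,
# One-Hot Encoding uses one column per unique value per row.
label_cells = len(user_ids) * 1
one_hot_cells = len(user_ids) * one_hot_width(user_ids)

print(education_map["high"] < education_map["university"])
print(label_cells, one_hot_cells)
```

For the ordinal variable the integer comparison is meaningful, while for the high-cardinality variable the one-hot matrix grows with the number of unique values.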

In addition, One-Hot Encoding has problems of its own, such as multicollinearity caused by high correlation among the resulting independent variables, so it cannot be said that One-Hot Encoding is always the better default. (See Ref.3.)
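The redundancy behind that multicollinearity can be seen in a small sketch. The categories and rows are hypothetical, and dropping the first column is shown as one common remedy, not the only one.

```python
# Hypothetical nominal variable with three categories.
categories = ["a", "b", "c"]
rows = ["a", "c", "b", "a"]

# Full One-Hot Encoding: one indicator column per category.
one_hot = [[1 if r == c else 0 for c in categories] for r in rows]

# Every row sums to 1, so the columns are linearly dependent
# once the model also has an intercept term.
row_sums = [sum(row) for row in one_hot]

# Any one column can be reconstructed exactly from the others.
reconstructed_last = [1 - sum(row[:-1]) for row in one_hot]

# Dropping the first column removes the redundancy:
# category 'a' is then represented by the all-zero row.
dropped = [row[1:] for row in one_hot]

print(row_sums)
print(reconstructed_last == [row[-1] for row in one_hot])
print(dropped)
```

Because one column is always fully determined by the rest, a linear model cannot assign unique weights to all of them without some constraint such as dropping a column or regularizing.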

Also, these two are not the only encoding methods, so it is worth considering other encodings as well.


📄 Outro

Researching this question was taking a long time, so I tried ChatGPT. I have linked the conversation as Ref.1 for reference.


📄 Reference

Ref.1: ChatGPT conversation (chat.openai.com)
https://chat.openai.com/share/7aeddf7f-caea-48d2-920d-585e6680f660

Ref.2: One Hot Encoding 과 Label Encoding 을 비교해보자 (Let's compare One Hot Encoding and Label Encoding) (azanewta.tistory.com)
https://azanewta.tistory.com/46

Ref.3: How and Why Performing One-Hot Encoding in Your Data Science Project (towardsdatascience.com)
https://towardsdatascience.com/how-and-why-performing-one-hot-encoding-in-your-data-science-project-a1500ec72d85#:~:text=In%20these%20cases%2C%20one%2Dhot,Learning%20algorithms%20without%20any%20problems.

 
