하이람 2021. 8. 27. 22:14

In the previous chapter, to show that a machine learning model's performance can be trusted, we split the data into a training set and a test set, and the model reached 100% accuracy. But when we tried it on a new fish, it predicted a bream as a smelt. Why did this happen?

 

 

๋„˜ํŒŒ์ด๋กœ ๋ฐ์ดํ„ฐ ์ค€๋น„ํ•˜๊ธฐ

 

Before looking into the problem, let's prepare the data. This time we combine the two lists with NumPy's column_stack() function, which places each list side by side as a column.

 

fish_length = [25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, 30.7, 31.0, 31.0, 
                31.5, 32.0, 32.0, 32.0, 33.0, 33.0, 33.5, 33.5, 34.0, 34.0, 34.5, 35.0, 
                35.0, 35.0, 35.0, 36.0, 36.0, 37.0, 38.5, 38.5, 39.5, 41.0, 41.0, 9.8, 
                10.5, 10.6, 11.0, 11.2, 11.3, 11.8, 11.8, 12.0, 12.2, 12.4, 13.0, 14.3, 15.0]
fish_weight = [242.0, 290.0, 340.0, 363.0, 430.0, 450.0, 500.0, 390.0, 450.0, 500.0, 475.0, 500.0, 
                500.0, 340.0, 600.0, 600.0, 700.0, 700.0, 610.0, 650.0, 575.0, 685.0, 620.0, 680.0, 
                700.0, 725.0, 720.0, 714.0, 850.0, 1000.0, 920.0, 955.0, 925.0, 975.0, 950.0, 6.7, 
                7.5, 7.0, 9.7, 9.8, 8.7, 10.0, 9.9, 9.8, 12.2, 13.4, 12.2, 19.7, 19.9]
                
import numpy as np

# One row per fish: column 0 is length, column 1 is weight
fish_data = np.column_stack((fish_length, fish_weight))
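To see what column_stack() does, here is a minimal sketch on two short lists. The toy values are made up for illustration and are not part of the fish data.

```python
import numpy as np

lengths = [1, 2, 3]   # toy values, for illustration only
weights = [4, 5, 6]

# Each list becomes one column, so every row is one sample
stacked = np.column_stack((lengths, weights))
print(stacked)
# [[1 4]
#  [2 5]
#  [3 6]]
print(stacked.shape)  # (3, 2)
```

Applied to fish_length and fish_weight, the same call should give an array with 49 rows and 2 columns.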

 

 

Let's build the target data in a similar way. NumPy's np.ones() and np.zeros() functions create arrays filled with the requested number of 1s and 0s, respectively. We then join the two arrays with np.concatenate(). The first 35 samples are bream (1) and the remaining 14 are smelt (0).

 

fish_target = np.concatenate((np.ones(35),np.zeros(14)))
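As a small sketch of how these three functions fit together, the example below uses smaller counts (3 and 2, chosen only for illustration) so the result is easy to read.

```python
import numpy as np

ones = np.ones(3)     # array of three 1.0 values
zeros = np.zeros(2)   # array of two 0.0 values

# concatenate joins the arrays end to end into a single 1-D array
labels = np.concatenate((ones, zeros))
print(labels)         # [1. 1. 1. 0. 0.]
print(labels.shape)   # (5,)
```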



 

์‚ฌ์ดํ‚ท๋Ÿฐ์œผ๋กœ ํ›ˆ๋ จ ์„ธํŠธ์™€ ํ…Œ์ŠคํŠธ ์„ธํŠธ ๋‚˜๋ˆ„๊ธฐ

 

์ „๋‹ฌ๋˜๋Š” ๋ฆฌ์ŠคํŠธ๋‚˜ ๋ฐฐ์—ด์„ ๋น„์œจ์— ๋งž๊ฒŒ ํ›ˆ๋ จ ์„ธํŠธ์™€ ํ…Œ์ŠคํŠธ ์„ธํŠธ๋กœ ๋‚˜๋ˆ„์–ด ์ฃผ๋Š” train_test_split() ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฐ์ดํ„ฐ๋ฅผ ๋‚˜๋ˆ„์–ด๋ณด์ž. 

 

from sklearn.model_selection import train_test_split

train_input, test_input, train_target, test_target = train_test_split(
    fish_data, fish_target, random_state=42)

 

fish_data is split into train_input and test_input, and fish_target into train_target and test_target, so four arrays are returned in total. By default the function sets aside 25% of the samples as the test set. However, a random split does not guarantee that the classes end up in the same proportion in both sets. This is especially likely when the dataset is small or when one class has few samples. Passing the target data to the stratify parameter makes the function split the data so that the class ratio is preserved.

 

train_input, test_input, train_target, test_target = train_test_split(
    fish_data, fish_target, stratify=fish_target, random_state=42)
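The effect of stratify can be checked by comparing the share of class 1 in the test set with and without it. The sketch below uses a placeholder feature array (an assumption made only to keep the example self-contained) with the same 35-to-14 class layout as fish_target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same class layout as fish_target: 35 samples of class 1, 14 of class 0
target = np.concatenate((np.ones(35), np.zeros(14)))
data = np.arange(49).reshape(-1, 1)  # placeholder feature, for illustration only

_, _, _, test_plain = train_test_split(data, target, random_state=42)
_, _, _, test_strat = train_test_split(
    data, target, stratify=target, random_state=42)

# The mean of 0/1 labels is the fraction of class 1 (bream)
print("overall   :", target.mean())
print("plain     :", test_plain.mean())
print("stratified:", test_strat.mean())
```

The stratified ratio should land as close to the overall ratio as a 13-sample test set allows.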

 

With the sampling bias problem addressed, the data is ready. Now let's look at the problem itself.

 

 

 

์ˆ˜์ƒํ•œ ๋„๋ฏธ ํ•œ ๋งˆ๋ฆฌ

 

Let's train a k-nearest neighbors model on the prepared data.

 

from sklearn.neighbors import KNeighborsClassifier
kn = KNeighborsClassifier()
kn.fit(train_input, train_target)
kn.score(test_input, test_target)

1.0

 

Next, let's feed in the new bream sample (length 25, weight 150) and check the result.

print(kn.predict([[25,150]]))

[0.]

 

The model did not predict bream (1). To find out why, let's draw a scatter plot that marks the neighboring samples.

 

import matplotlib.pyplot as plt

# Distances to, and training-set indexes of, the 5 nearest neighbors
distances, indexes = kn.kneighbors([[25, 150]])

plt.scatter(train_input[:, 0], train_input[:, 1])
plt.scatter(25, 150, marker='^')  # the new sample
plt.scatter(train_input[indexes, 0], train_input[indexes, 1], marker='D')  # its neighbors
plt.xlabel('length')
plt.ylabel('weight')
plt.show()

 

Smelt dominate the nearest neighbors, even though on the scatter plot the sample looks intuitively closer to the bream. To solve this problem, we need to put the features on a common scale.

 

 

 

๊ธฐ์ค€์„ ๋งž์ถฐ๋ผ

 

์œ„์˜ ์‚ฐ์ ๋„๋ฅผ ๋ณด๋ฉด x์ถ•์€ ๋ฒ”์œ„๊ฐ€ ์ข๊ณ (10~40), y์ถ•์€ ๋ฒ”์œ„๊ฐ€ ๋„“๋‹ค(0~1000). ๊ทธ๋ž˜์„œ y์ถ•์œผ๋กœ ์กฐ๊ธˆ๋งŒ ๋ฉ€์–ด์ ธ๋„ ๊ฑฐ๋ฆฌ๊ฐ€ ์•„์ฃผ ํฐ ๊ฐ’์œผ๋กœ ๊ณ„์‚ฐ๋˜์—ˆ๋˜ ๊ฒƒ์ด๋‹ค. ์ด๋ ‡๋“ฏ ํŠน์„ฑ ๊ฐ„์˜ ์Šค์ผ€์ผ์ด ๋‹ค๋ฅผ ๋•Œ๋Š” ํŠน์„ฑ๊ฐ’์„ ์ผ์ •ํ•œ ๊ธฐ์ค€์œผ๋กœ ๋งž์ถฐ์ฃผ๋Š” ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ ์ž‘์—…์ด ํ•„์š”ํ•˜๋‹ค. ๊ฐ€์žฅ ๋„๋ฆฌ ์‚ฌ์šฉํ•˜๋Š” ์ „์ฒ˜๋ฆฌ ๋ฐฉ๋ฒ•์€ ํ‘œ์ค€์ ์ˆ˜(z ์ ์ˆ˜)๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ํ‘œ์ค€์ ์ˆ˜๋Š” ๊ฐ ํŠน์„ฑ๊ฐ’์ด 0์—์„œ ํ‘œ์ค€ํŽธ์ฐจ์˜ ๋ช‡ ๋ฐฐ๋งŒํผ ๋–จ์–ด์ ธ ์žˆ๋Š”์ง€๋ฅผ ๋‚˜ํƒ€๋‚ด๊ณ , ์ด๋ฅผ ํ†ตํ•ด ์‹ค์ œ ํŠน์„ฑ๊ฐ’์˜ ํฌ๊ธฐ์™€ ์ƒ๊ด€์—†์ด ๋™์ผํ•œ ์กฐ๊ฑด์œผ๋กœ ๋น„๊ตํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋œ๋‹ค.

 

Computing the standard score is simple: subtract the mean and divide by the standard deviation. Because each feature has its own scale, the mean and standard deviation must be computed per feature, that is, per column. Specifying axis=0 does exactly that.

 

mean = np.mean(train_input, axis=0)
std = np.std(train_input, axis=0)

train_scaled = (train_input - mean) / std

 

๋„˜ํŒŒ์ด์—์„œ๋Š” ๋ธŒ๋กœ๋“œ์บ์ŠคํŒ…์ด๋ผ๋Š” ๊ธฐ๋Šฅ์ด ์žˆ์–ด์„œ ์œ„์™€ ๊ฐ™์ด ๋ช…๋ นํ•˜๋ฉด, train_input์˜ ๋ชจ๋“  ํ–‰์—์„œ mean์— ์žˆ๋Š” ๋‘ ํ‰๊ท ๊ฐ’์„ ๋นผ๊ณ , std์— ์žˆ๋Š” ๋‘ ํ‘œ์ค€ํŽธ์ฐจ๋ฅผ ๋‹ค์‹œ ๋ชจ๋“  ํ–‰์— ์ ์šฉํ•œ๋‹ค.

 

 

 

Training the model on preprocessed data

 

์œ„์—์„œ ๊ธฐ์ค€์„ ๋งž์ถฐ ๋ฐ์ดํ„ฐ๋ฅผ ๋งŒ๋“ค์—ˆ์ง€๋งŒ, ๊ฐ’์˜ ๋ฒ”์œ„๊ฐ€ ๋‹ฌ๋ผ์กŒ๊ธฐ ๋•Œ๋ฌธ์— ์ƒ˜ํ”Œ๋„ ๋™์ผํ•œ ๋น„์œจ๋กœ ๋ณ€ํ™˜ํ•ด์•ผํ•œ๋‹ค. ์—ฌ๊ธฐ์„œ ์ค‘์š”ํ•œ ์ ์€ 'ํ›ˆ๋ จ ์„ธํŠธ์˜ mean๊ณผ std'๋ฅผ ์‚ฌ์šฉํ•ด์„œ ๋ณ€ํ™˜ํ•ด์•ผ ํ•œ๋‹ค๋Š” ์ ์ด๋‹ค. 

 

# Scale the new sample with the training set's mean and std
new = ([25, 150] - mean) / std
plt.scatter(train_scaled[:, 0], train_scaled[:, 1])
plt.scatter(new[0], new[1], marker='^')
plt.xlabel('length')
plt.ylabel('weight')
plt.show()

 

The two features of the training data now cover similar ranges. Let's train the k-nearest neighbors model again and evaluate it.

Just as the new sample was transformed with the training set's mean and standard deviation, the test set must also be transformed with the training set's mean and standard deviation.

 

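# Retrain on the scaled training set
kn.fit(train_scaled, train_target)

# Scale the test set with the training set's statistics, not its own
test_scaled = (test_input - mean) / std

kn.score(test_scaled, test_target)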

1.0

 

The model classified every sample in the test set correctly. Now let's predict the problematic sample from earlier.

 

print(kn.predict([new]))

[1.]

 

This time the model correctly predicts bream.

 

Drawing the scatter plot again to see how the nearest neighbors have changed, we can confirm that the nearest samples are now all bream.

 

distances, indexes = kn.kneighbors([new])
plt.scatter(train_scaled[:, 0], train_scaled[:, 1])
plt.scatter(new[0], new[1], marker='^')
plt.scatter(train_scaled[indexes, 0], train_scaled[indexes, 1], marker='D')
plt.xlabel('length')
plt.ylabel('weight')
plt.show()

 

 

 

 

(Summary) Handling features with different scales

 

The original model classified the test set perfectly, yet it mispredicted a new sample. The cause was that the two features, length and weight, have different scales. We therefore converted the features to standard scores to adjust their scale. When doing so, always remember that the test set must be transformed in exactly the same way as the training set.
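As an aside, scikit-learn's StandardScaler class packages the same idea: it learns the mean and standard deviation from the training set and reuses them for any later data. This post does not use it; the sketch below is an alternative under that assumption, with small made-up arrays standing in for train_input and test_input.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up stand-ins for train_input and test_input
train_input = np.array([[25.0, 240.0], [30.0, 450.0], [35.0, 700.0], [11.0, 9.0]])
test_input = np.array([[32.0, 600.0], [12.0, 12.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_input)  # learns mean and std from the training set
test_scaled = scaler.transform(test_input)        # reuses the training-set statistics

# Compare with the manual formula used in this post
mean = train_input.mean(axis=0)
std = train_input.std(axis=0)
print(np.allclose(train_scaled, (train_input - mean) / std))  # True
print(np.allclose(test_scaled, (test_input - mean) / std))    # True
```

Calling fit_transform() only on the training set and transform() on everything else keeps the rule above hard to break by accident.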

 

 

Source: 박해선 (Park Haesun), 혼자 공부하는 머신러닝+딥러닝, 한빛미디어 (Hanbit Media), 2021