class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)

Load the dataset:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
iris.keys()
X = iris.data
y = iris.target
X.shape
y.shape

Split the dataset:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

Classify:

from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=6)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)
sum(y_predict == y_test) / len(y_test)  # accuracy, computed by hand
knn_clf.score(X_test, y_test)           # accuracy, via the estimator

Accuracy of the classification result:

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predict)
# Load the dataset
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
print(X.shape)  # (1797, 64)
y = digits.target
print(y.shape)  # (1797,)

# Split the dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

# Classify
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)

# Classification accuracy
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_predict))  # 0.9888888888888889
knn_clf.score(X_test, y_test)             # 0.9888888888888889

demo02: Hyperparameters

weights: whether neighbors are weighted by their distance
p: the exponent of the Minkowski distance

If K=3 and the three nearest neighbors belong to three different classes, the vote is a tie. What then? Introduce distance weights.
With weights='distance', each neighbor votes with weight 1/distance instead of casting one equal vote; p=2 (the default) makes the Minkowski distance the Euclidean distance:

weights='distance'
p=2
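A worked illustration with made-up distances: suppose the three nearest neighbors are one point of class A at distance 1 and two points of class B at distances 3 and 4. A uniform vote picks B (two votes to one), but with distance weights A scores 1/1 = 1 while B scores 1/3 + 1/4 = 7/12 ≈ 0.58, so A wins.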
best_score = 0.0
best_k = -1
best_method = ""
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method

print("best_method =", best_method)  # best_method = uniform
print("best_k =", best_k)            # best_k = 4
print("best_score =", best_score)    # best_score = 0.9916666666666667

sk_knn_clf = KNeighborsClassifier(n_neighbors=4, weights="distance", p=1)
sk_knn_clf.fit(X_train, y_train)
sk_knn_clf.score(X_test, y_test)  # 0.9833333333333333

Minkowski distance
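For reference, the Minkowski distance between two n-dimensional points x and y is

$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

so p=1 gives the Manhattan distance and p=2 the Euclidean distance.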
With distance weights enabled, search k and p together; the best p here turns out to be 2:
best_score = 0.0
best_k = -1
best_p = -1
for k in range(1, 11):
    for p in range(1, 6):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_p = p
            best_score = score

print("best_k =", best_k)          # best_k = 3
print("best_p =", best_p)          # best_p = 2
print("best_score =", best_score)  # best_score = 0.9888888888888889

Grid search

A combined search over k and p.
The verbose parameter controls how much progress is logged during the search (higher values print more):
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]

knn_clf = KNeighborsClassifier()

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(knn_clf, param_grid)
grid_search.fit(X_train, y_train)
grid_search.best_estimator_
# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#                      metric_params=None, n_jobs=1, n_neighbors=3, p=3,
#                      weights='distance')
grid_search.best_score_  # 0.9853862212943633

grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)  # n_jobs=-1 uses all CPU cores
grid_search.fit(X_train, y_train)
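Once the search has run, the tuned model can be pulled out and checked against the held-out split; a short sketch using the attributes above:

best_knn = grid_search.best_estimator_  # by default, refit on the whole training set
grid_search.best_params_                # the winning parameter combination
best_knn.score(X_test, y_test)          # accuracy on the held-out test split

Note that best_score_ comes from cross-validation within the training set, which is why it (and the chosen parameters, e.g. p=3 here) can differ from the manual test-set search above.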
More hyperparameters: the constructor signature at the top also exposes algorithm, leaf_size, metric, and metric_params, which control how neighbors are searched for and which distance is used.
Data normalization

Purpose: keep features with large numeric ranges from dominating the distance while features with small ranges contribute almost nothing.
# The idea of mean-variance normalization (standardization):
# per feature, x_scaled = (x - mean) / std
import numpy as np
import matplotlib.pyplot as plt

X2 = np.random.randint(0, 100, (50, 2))
X2 = np.array(X2, dtype=float)
X2[:, 0] = (X2[:, 0] - np.mean(X2[:, 0])) / np.std(X2[:, 0])
X2[:, 1] = (X2[:, 1] - np.mean(X2[:, 1])) / np.std(X2[:, 1])

np.mean(X2[:, 0])  # -1.1990408665951691e-16, i.e. ~0
np.std(X2[:, 0])   # 1.0
np.mean(X2[:, 1])  # -1.1546319456101628e-16, i.e. ~0
np.std(X2[:, 1])   # 0.99999999999999989, i.e. ~1
import numpy as np
from sklearn import datasets

# Load the dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

# Standardize: fit the scaler on the training set only
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train)  # learns the per-feature statistics
standardScaler.mean_         # per-feature mean
standardScaler.scale_        # per-feature standard deviation
X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)  # transform the test set with the training statistics

# Train and evaluate on the standardized data
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train_standard, y_train)
knn_clf.score(X_test_standard, y_test)  # 1.0

Summary

Advantages of kNN:
1. Solves classification problems, and handles multi-class problems naturally.
2. The idea is simple, yet the results are strong.
3. Can also solve regression problems, by predicting the mean (or distance-weighted average) of the nearest neighbors' values; see the sketch below.
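A minimal sketch of point 3 using scikit-learn's KNeighborsRegressor (the toy data here are made up):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D regression data: y ≈ 2x with a little noise
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 * X.ravel() + np.random.randn(50) * 0.1

knn_reg = KNeighborsRegressor(n_neighbors=3, weights="distance")
knn_reg.fit(X, y)
knn_reg.predict([[5.0]])  # distance-weighted average of the 3 nearest targets, ~10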
Disadvantages:
1. Biggest drawback: it is inefficient. With m training samples and n features, predicting a single new point costs O(m*n) with brute-force search (the algorithm and leaf_size parameters above select tree structures that can speed this up).
2. Predictions are not interpretable.
3. Curse of dimensionality: in high-dimensional spaces, distances between points lose meaning; see the snippet below.
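A quick numeric illustration of the last point: the Euclidean distance between the all-zeros and all-ones points in d dimensions is sqrt(d), so nominally "adjacent" corners of the unit cube drift far apart as d grows.

import numpy as np

for d in [1, 2, 64, 10000]:
    print(d, np.linalg.norm(np.ones(d)))  # 1.0, 1.414..., 8.0, 100.0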
Linear regression

1. Solves regression problems.
2. The idea is simple and easy to implement.
3. The foundation of many powerful nonlinear models.
4. Results are highly interpretable (see the sketch after this list).
5. Embodies many of the important ideas in machine learning.
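A minimal fit-and-inspect sketch with scikit-learn's LinearRegression (the toy data are made up):

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y ≈ 3x + 4 plus noise
X = np.random.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 4 + np.random.normal(0, 1, 100)

lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.coef_       # slope, ~3 -- this is the interpretable part
lin_reg.intercept_  # intercept, ~4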
Metrics for evaluating linear regression
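The usual choices are MSE (mean squared error), RMSE (its square root, in the same units as y), MAE (mean absolute error), and R² = 1 - MSE/Var(y). A short sketch, assuming y_test and y_predict come from a fitted regressor:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

mse = mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_predict)
r2 = r2_score(y_test, y_predict)  # what the score() of sklearn regressors returns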
Gradient descent

Purpose: once we have a loss function, how do we find the parameters that minimize it? Often there is no closed-form solution; linear regression can be treated as a special case that does have one.
Definition: step by step, move theta toward the value that minimizes J. The update is theta = theta - eta * gradient: when the derivative is positive, theta decreases, and when it is negative, theta increases.
Explanation: dJ/d(theta) points in the direction in which J increases, so -dJ/d(theta) points in the direction of steepest descent; the learning rate eta sets the step size.
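A minimal sketch of the update rule on a one-dimensional loss (the quadratic J and eta=0.1 are made-up choices):

def dJ(theta):
    return 2 * (theta - 2.5)  # derivative of J(theta) = (theta - 2.5)**2

theta = 0.0  # starting point
eta = 0.1    # learning rate
for _ in range(1000):
    gradient = dJ(theta)
    theta = theta - eta * gradient  # the update rule from above
    if abs(gradient) < 1e-8:        # stop when the slope is essentially zero
        break

theta  # ~2.5, the minimizer of J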
Stochastic gradient descent for linear regression in scikit-learn:

import numpy as np
from sklearn import datasets

# Load the dataset (load_boston was removed in scikit-learn 1.2;
# on recent versions substitute e.g. fetch_california_housing)
boston = datasets.load_boston()
X = boston.data
y = boston.target
X = X[y < 50.0]  # drop samples whose target is capped at 50
y = y[y < 50.0]

# Split the dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

# Standardize (fit on the training set, transform both splits)
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train)
X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)

# Fit and score
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor()
%time sgd_reg.fit(X_train_standard, y_train)  # Wall time: 74.2 ms
sgd_reg.score(X_test_standard, y_test)        # 0.8032705751514356

sgd_reg = SGDRegressor(n_iter=50)  # older scikit-learn; newer versions spell this max_iter
%time sgd_reg.fit(X_train_standard, y_train)  # Wall time: 4.88 ms
sgd_reg.score(X_test_standard, y_test)        # 0.8129892326446553