tiramisu-navy



Decision Trees

Posted on 2019-02-21

1. Introduction

Starting from the root node, each sample walks step by step down to a leaf node, where the decision is made; every sample eventually lands in some leaf. A decision tree can be used both for classification and for regression.

2. Structure of a tree

  • Root node: the first split point
  • Internal (non-leaf) nodes and branches: the intermediate steps
  • Leaf nodes: the final decision results

3. Splitting criterion: entropy

Entropy measures the uncertainty of a random variable.

Formula: H(X) = −∑ pᵢ · log pᵢ,  i = 1, 2, …, n

Set A: [1,1,1,1,1,1,1,1,2,2]

Set B: [1,2,3,4,5,6,7,8,9,1]

Clearly set A has the lower entropy: it contains only two categories, so it is relatively stable,

while B contains many categories, so its entropy is much higher.

The greater the uncertainty, the larger the entropy. For a binary variable, when p = 0 or p = 1 we get H(p) = 0 and there is no uncertainty at all; when p = 0.5 we get H(p) = 1 (using log base 2), the point of maximum uncertainty.
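As a quick check of these numbers, here is a minimal sketch (not from the original post) that computes the entropy of the two sets with the formula above:

import numpy as np
from collections import Counter

def entropy(values):
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

A = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2]
B = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1]
print(entropy(A))  # ≈ 0.72: only two categories, fairly pure
print(entropy(B))  # ≈ 3.12: nine categories, much more uncertain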

Logistic Regression Model: Anomaly Detection on Credit Card Transaction Data

Posted on 2019-01-03

1. Raw data

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv("creditcard.csv")
print("First 5 rows:\n", data.head())

# Count how many rows there are for each value of the Class column, sorted by class label
count_classes = data['Class'].value_counts().sort_index()
# Bar chart
count_classes.plot(kind='bar')
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()
# Result: Class=0 has more than 250,000 samples, while Class=1 has fewer than 1,000

2. Handling the class imbalance

2.1 Oversampling: make both classes equally large

from sklearn.preprocessing import StandardScaler

# The Amount column has a very different scale from the other features.
# reshape(-1, 1) turns the 1-D Amount values into a single-column 2-D array; -1 lets NumPy infer the number of rows.
# Standardizing gives the column zero mean and unit variance, so the prediction is not dominated by features with large raw values.
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
# Drop the two columns that are no longer needed, keeping only the feature data
data = data.drop(['Time', 'Amount'], axis=1)
print("After dropping the two columns:\n", data.head())

2.2 Undersampling: make both classes equally small
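The post gives no code for this step; a minimal sketch might look like the following (it reuses the data frame from above, and the names X_undersample / y_undersample are chosen to match the split code in the next section):

# Undersampling: randomly keep only as many normal (Class=0) transactions as there are fraud (Class=1) ones
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)
normal_indices = data[data.Class == 0].index

# Randomly pick the same number of normal transactions
random_normal_indices = np.array(np.random.choice(normal_indices, number_records_fraud, replace=False))

# Combine the indices and build the balanced dataset
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
under_sample_data = data.loc[under_sample_indices, :]

X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']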

3. Cross-validation

from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated

# X, y are the full feature matrix and label column (assumed, not shown in the post), e.g.:
# X = data.loc[:, data.columns != 'Class']
# y = data.loc[:, data.columns == 'Class']

# Cross-validation makes the model evaluation trustworthy
# Split the original dataset, keeping 30% as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))

# Output
Number transactions train dataset: 199364
Number transactions test dataset: 85443
Total number of transactions: 284807


# Split the undersampled dataset (X_undersample, y_undersample from section 2.2) the same way
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(
    X_undersample, y_undersample, test_size=0.3, random_state=0)

print("Number transactions train dataset: ", len(X_train_undersample))
print("Number transactions test dataset: ", len(X_test_undersample))
print("Total number of transactions: ", len(X_train_undersample)+len(X_test_undersample))

# Output
Number transactions train dataset: 688
Number transactions test dataset: 296
Total number of transactions: 984

4. Model evaluation

  • Example: a class has 80 boys and 20 girls, 100 students in total. The goal is to find all the girls. Someone picks out 50 people: 20 of them are girls, but 30 boys are also wrongly picked out as girls.

The four outcomes, with Relevant as the positive class and NonRelevant as the negative class:

  • Retrieved, relevant: true positives (TP), correctly deciding "this is a girl"
  • Retrieved, non-relevant: false positives (FP), a boy mistakenly judged to be a girl
  • Not retrieved, relevant: false negatives (FN), a girl mistakenly judged to be a boy
  • Not retrieved, non-relevant: true negatives (TN), a boy correctly judged to be a boy

From the example above we get:

TP = 20

FP = 30

FN = 0

TN = 50
  • Recall = TP / (TP + FN)
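In this example, Recall = 20 / (20 + 0) = 1.0: every girl in the class was found, even though 30 boys were picked out by mistake along the way.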
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold  # sklearn.cross_validation is deprecated
from sklearn.metrics import confusion_matrix, recall_score, classification_report

def printing_Kfold_scores(x_train_data, y_train_data):
    # Cross-validation: split the training data into 5 folds
    fold = KFold(n_splits=5, shuffle=False)

    # Different C parameters (inverse regularization strength)
    c_param_range = [0.01, 0.1, 1, 10, 100]

    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range

    # Each k-fold split gives 2 index arrays: train_indices = indices[0], test_indices = indices[1]
    j = 0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        for iteration, indices in enumerate(fold.split(x_train_data), start=1):

            # Build a logistic regression model with the current C parameter
            # (the liblinear solver supports the L1 penalty)
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')

            # Fit the model on the training portion of the fold (indices[0]),
            # then predict on the validation portion (indices[1])
            lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())

            # Predict values using the validation indices in the training data
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :].values)

            # Calculate the recall score and collect it for the current C parameter
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)

        # The mean of those recall scores is the metric we keep for this C
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    best_c = results_table.loc[results_table['Mean recall score'].astype(float).idxmax()]['C_parameter']

    # Finally, report which C parameter performed best among those tried
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')

    return best_c

5. Confusion matrix

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
import itertools
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample, y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

# Result
Recall metric in the testing dataset: 0.931972789116

6. Effect of the decision threshold on the results

lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

plt.figure(figsize=(10, 10))

j = 1
for i in thresholds:
    # Predict fraud whenever the predicted probability exceeds the current threshold
    y_test_predictions_high_recall = y_pred_undersample_proba[:, 1] > i

    plt.subplot(3, 3, j)
    j += 1

    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample, y_test_predictions_high_recall)
    np.set_printoptions(precision=2)

    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

    # Plot non-normalized confusion matrix
    class_names = [0, 1]
    plot_confusion_matrix(cnf_matrix
                          , classes=class_names
                          , title='Threshold >= %s' % i)

# Results
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 0.986394557823
Recall metric in the testing dataset: 0.931972789116
Recall metric in the testing dataset: 0.884353741497
Recall metric in the testing dataset: 0.836734693878
Recall metric in the testing dataset: 0.748299319728
Recall metric in the testing dataset: 0.571428571429

7. SMOTE sample-generation strategy (oversampling)
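SMOTE does not simply duplicate the minority class: for each minority sample it picks one of its k nearest minority-class neighbours and creates a new synthetic point somewhere on the segment between the two. A minimal sketch of that generation rule (an illustration only, not the imblearn implementation used below):

import numpy as np

def smote_new_sample(x_i, x_nn, rng=np.random):
    # x_i: a minority-class sample; x_nn: one of its nearest minority-class neighbours
    lam = rng.uniform(0, 1)          # random interpolation factor in [0, 1]
    return x_i + lam * (x_nn - x_i)  # synthetic sample between the two points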

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

credit_cards = pd.read_csv('creditcard.csv')

columns = credit_cards.columns
# The labels are in the last column ('Class'). Simply remove it to obtain the feature columns
features_columns = columns.delete(len(columns)-1)

features = credit_cards[features_columns]
labels = credit_cards['Class']

features_train, features_test, labels_train, labels_test = train_test_split(features,
                                                                            labels,
                                                                            test_size=0.2,
                                                                            random_state=0)

oversampler = SMOTE(random_state=0)
# fit_resample is the current imblearn name (older versions called it fit_sample)
os_features, os_labels = oversampler.fit_resample(features_train, labels_train)

len(os_labels[os_labels == 1])
# Result
227454

os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)
best_c = printing_Kfold_scores(os_features, os_labels)

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(os_features, os_labels.values.ravel())
y_pred = lr.predict(features_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test, y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()
# Result: the rate of normal transactions misclassified as fraud (false positives) is lower than with undersampling
Recall metric in the testing dataset: 0.90099009901

Mathematical Theory of Algorithms: Implementing Logistic Regression with Gradient Descent

Posted on 2018-12-14

1. Raw data

# The three essentials
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

path = 'data' + os.sep + 'LogiReg_data.txt'
# The txt file has no header, so specify the column names manually
pdData = pd.read_csv(path, header=None, names=['Exam 1', 'Exam 2', 'Admitted'])
print("First five rows:\n", pdData.head())
print("\nData shape:", pdData.shape)
positive = pdData[pdData['Admitted'] == 1]
negative = pdData[pdData['Admitted'] == 0]
# Set the figure size (width and height)
fig, ax = plt.subplots(figsize=(10, 5))
# Scatter plots
ax.scatter(positive['Exam 1'], positive['Exam 2'], s=30, c='b', marker='o', label='Admitted')
ax.scatter(negative['Exam 1'], negative['Exam 2'], s=30, c='r', marker='x', label='Not Admitted')
# Add the legend
ax.legend()
ax.set_xlabel('Exam 1 Score')
ax.set_ylabel('Exam 2 Score')
plt.show()

2. Implementation plan

  • Goal: build a classifier, i.e. solve for the three parameters θ0, θ1, θ2
  • Set a threshold and use it to decide the admission result

    Modules to implement

    • sigmoid: maps a score to a probability

    • model: returns the predicted value

    • cost: computes the loss for the given parameters

    • gradient: computes the gradient direction for each parameter

    • descent: performs the parameter update

    • accuracy: computes the accuracy

3. Implementation details

3.1 sigmoid

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

3.2 Prediction model: expressing the computation as matrix operations

def model(X, theta):
    return sigmoid(np.dot(X, theta.T))

3.3 Data preparation

# Add a column of ones for θ0 (the intercept term)
pdData.insert(0, 'Ones', 1)
# Convert the DataFrame to a 2-D NumPy array (as_matrix() is deprecated)
orig_data = pdData.values
# Number of columns
cols = orig_data.shape[1]
# Columns: 1, x1, x2
X = orig_data[:, 0:cols-1]
# Labels
y = orig_data[:, cols-1:cols]

# Build theta as a 1x3 row of zeros
theta = np.zeros([1, 3])
print(X.shape, y.shape, theta.shape)
Output: (100, 3) (100, 1) (1, 3)

3.4 Loss function

  • Take the negative of the log-likelihood so it can be minimized:

    D(hθ(x), y) = −y·log(hθ(x)) − (1 − y)·log(1 − hθ(x))

    The cost is the average of D over all samples, which is what the code below computes.

def cost(X, y, theta):
    left = np.multiply(-y, np.log(model(X, theta)))
    right = np.multiply(1 - y, np.log(1 - model(X, theta)))
    return np.sum(left - right) / (len(X))

print(cost(X, y, theta))
Output: 0.6931471805599453

3.5 Computing the gradient

[Figure: gradient formula ∂J/∂θj = (1/m) Σᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xj⁽ⁱ⁾, which the code below implements]

def gradient(X, y, theta):
    grad = np.zeros(theta.shape)
    # The minus sign in front of the sum is moved inside the parentheses
    error = (model(X, theta) - y).ravel()
    # For each parameter
    for j in range(len(theta.ravel())):
        term = np.multiply(error, X[:, j])
        grad[0, j] = np.sum(term) / len(X)
    return grad
  • Three different gradient descent methods: batch, stochastic, and mini-batch (see the usage sketch after the descent function below)
STOP_ITER = 0
STOP_COST = 1
STOP_GRAD = 2

def stopCriterion(type, value, threshold):
    # Three different stopping strategies
    if type == STOP_ITER:   return value > threshold
    elif type == STOP_COST: return abs(value[-1]-value[-2]) < threshold
    elif type == STOP_GRAD: return np.linalg.norm(value) < threshold
import numpy.random
# Shuffle the data to randomize the order
def shuffleData(data):
    np.random.shuffle(data)
    cols = data.shape[1]
    X = data[:, 0:cols-1]
    y = data[:, cols-1:]
    return X, y
import time

def descent(data, theta, batchSize, stopType, thresh, alpha):
    # Solve using gradient descent

    init_time = time.time()
    i = 0  # iteration counter
    k = 0  # batch position
    n = data.shape[0]  # total number of samples
    X, y = shuffleData(data)
    grad = np.zeros(theta.shape)  # the computed gradient
    costs = [cost(X, y, theta)]   # loss values

    while True:
        grad = gradient(X[k:k+batchSize], y[k:k+batchSize], theta)
        k += batchSize  # take batchSize samples
        if k >= n:
            k = 0
            X, y = shuffleData(data)  # reshuffle
        theta = theta - alpha * grad      # parameter update
        costs.append(cost(X, y, theta))   # compute the new loss
        i += 1

        if stopType == STOP_ITER:   value = i
        elif stopType == STOP_COST: value = costs
        elif stopType == STOP_GRAD: value = grad
        if stopCriterion(stopType, value, thresh): break

    return theta, i-1, costs, grad, time.time() - init_time
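Which of the three gradient descent variants runs is controlled entirely by the batchSize argument of descent. A usage sketch (the thresh and alpha values are illustrative settings, not tuned results from this post):

n = orig_data.shape[0]  # 100 samples in this dataset

# Batch gradient descent: all n samples in every update, stop after 5000 iterations
theta_batch = descent(orig_data, theta, n, STOP_ITER, thresh=5000, alpha=0.000001)[0]

# Stochastic gradient descent: one sample per update
theta_sgd = descent(orig_data, theta, 1, STOP_ITER, thresh=5000, alpha=0.001)[0]

# Mini-batch gradient descent: e.g. 16 samples per update
theta_minibatch = descent(orig_data, theta, 16, STOP_ITER, thresh=15000, alpha=0.001)[0]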

3.6 Computing the accuracy

# Set the decision threshold
def predict(X, theta):
    return [1 if x >= 0.5 else 0 for x in model(X, theta)]

# scaled_data: orig_data with the exam-score columns standardized (prepared in an earlier step not shown here)
scaled_X = scaled_data[:, :3]
y = scaled_data[:, 3]
predictions = predict(scaled_X, theta)
correct = [1 if ((a == 1 and b == 1) or (a == 0 and b == 0)) else 0 for (a, b) in zip(predictions, y)]
accuracy = 100 * sum(correct) // len(correct)  # percentage of correct predictions
print('accuracy = {0}%'.format(accuracy))
Output: accuracy = 89%

Mathematical Theory of Algorithms: Logistic Regression

Posted on 2018-12-12

1. Introduction

Logistic regression is usually the first algorithm to try in machine learning: start with logistic regression before anything more complex, and prefer the simple method whenever it works. Its decision boundary can still be non-linear.

2. The Sigmoid function

  • Solves binary classification problems, e.g. YES/NO type questions

    [Figures: the sigmoid function formula and its S-shaped curve]

3. Softmax for multi-class problems

  • Solves problems with several possible options, e.g. whether someone's birthplace is Boston, London, or Sydney
    [Figure: softmax formula]
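As a compact reference for both cases, a minimal sketch (standard definitions, not code from this post):

import numpy as np

def sigmoid(z):
    # Binary case: maps any real score to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))

def softmax(z):
    # Multi-class case: turns a vector of scores into a probability distribution
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()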

Mathematical Theory of Algorithms: Gradient Descent

Posted on 2018-12-12

1. Introduction

Hand the machine a pile of data, tell it what the right direction of learning looks like (the objective function), and let it move in that direction, improving a little with every step; the small improvements accumulate into a large result. In the end we are looking for the parameters that bring the objective function to its extreme point.

2. Gradient descent methods

[Figure: illustration of the gradient descent methods]

3. Summary

  • Learning rate (step size): has a large impact on the result; generally keep it small
  • How to choose it: start small, and if that does not work, make it smaller still
  • Batch size: 32, 64, or 128 are all fine; in practice, memory and efficiency also need to be considered

Mathematical Theory of Algorithms: Linear Regression

Posted on 2018-12-11

1. Example

Salary (X1)   Age (X2)   Credit limit (Y)
4000          25         20000
8000          30         45000
5000          27         25000

2. Intuitive explanation

X1 and X2 are two features; the goal is to find the plane that best fits the data points.
[Figure: the fitted plane over the data points]

3. Error term

The errors ε are independent and identically distributed, following a Gaussian distribution with mean 0 and variance σ².
[Figure: the Gaussian distribution of the error term]

4. Derivation

[Figure: derivation setup]

4.1 Likelihood function: estimate the parameters from the samples; given the observed data, find the parameter values under which the observed data are most probable.

4.2 Logarithm: taking the log of the likelihood turns the product into a sum.

[Figure: log-likelihood derivation]
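The standard result of this derivation: maximizing the Gaussian likelihood is equivalent to minimizing the squared-error objective J(θ) = (1/2) Σᵢ (y⁽ⁱ⁾ − θᵀx⁽ⁱ⁾)², and setting its gradient to zero gives the closed-form solution θ = (XᵀX)⁻¹Xᵀy.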

5. Evaluation method

[Figure: evaluation metric]

MySQL String Comparison Problem: App Version Numbers

Posted on 2018-09-27

Background

The version-number column in MySQL is of type varchar, and app version numbers follow this pattern:

  • 7.1.1
  • 7.1
  • 7.0.10
  • 7.0.10
  • 7.0.1
  • 7.0
  • 6.9.5

The problem: filtering for versions < 7.9.10 with a plain string comparison leaves out versions such as 7.9.9 and 7.9.8, because the strings are compared character by character and '9' sorts after '1'.

Solution

Idea: split the version number into its groups and left-pad each group with zeros to N digits (10 digits in the example below).

  • X.XX.XXX > 7.9.12
  • ‘000000000X00000000XX0000000XXX’ > ‘000000000700000000090000000012’
SELECT
    *
FROM
    table
WHERE
    CONCAT(
        LPAD( SUBSTRING_INDEX( SUBSTRING_INDEX( version, '.', 1 ), '.', - 1 ), 10, '0' ),
        LPAD( SUBSTRING_INDEX( SUBSTRING_INDEX( version, '.', 2 ), '.', - 1 ), 10, '0' ),
        LPAD( SUBSTRING_INDEX( SUBSTRING_INDEX( version, '.', 3 ), '.', - 1 ), 10, '0' )
    ) > CONCAT( LPAD( 7, 10, '0' ), LPAD( 9, 10, '0' ), LPAD( 12, 10, '0' ) );
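Worked example: after padding, '7.9.9' becomes '000000000700000000090000000009', which correctly sorts below the padded value '000000000700000000090000000012' for 7.9.12, whereas the raw string comparison would put '7.9.9' above '7.9.12' because '9' sorts after '1'.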
