Creating an SMS Spam Classifier in Python

Introduction

I have always been fascinated by Google’s Gmail spam detection system, which seemingly effortlessly judges whether incoming emails are spam and therefore not worthy of our limited attention.

In this article, I seek to recreate such a spam detection system, but for SMS messages. I will use a few different models and compare their performance.

The models are as follows:

  1. Multinomial Naive Bayes Model (Count Vectorizer)
  2. Multinomial Naive Bayes Model (tfidf Vectorizer)
  3. Support Vector Classifier Model
  4. Logistic Regression Model with n-gram features

Using a train-test split, each of the 4 models went through the same stages: vectorizing X_train, fitting the model on X_train and y_train, making predictions, and generating the respective confusion matrix and area under the receiver operating characteristic curve (AUC-ROC) for evaluation.

The best performing model was the Logistic Regression model, although it should be noted that all 4 models performed reasonably well at detecting spam messages (all AUC > 0.9).

Photo by Hannes Johnson on Unsplash

The Data

The data was obtained from UCI’s Machine Learning Repository; I have also uploaded the dataset used onto my github repo. In total, the dataset has 5571 rows and 2 columns: spamorham, indicating the message’s spam status, and the message’s text. I found it quite funny how relatable the texts are.

Definitions: spam refers to unsolicited spam messages as they are commonly known; ham refers to non-spam messages.

spam.head(10)

Data Preprocessing

As the dataset is relatively simple, not much preprocessing was needed. Spam messages were marked with a 1, while ham was marked with a 0.

df.head(10)
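For reference, here is a minimal sketch of this labelling step. It assumes the raw dataframe is called spam and has the spamorham column described above; adjust the names to however the data was actually loaded.

# assuming the raw dataframe 'spam' has the columns described in the previous section
df = spam.copy()
df['target'] = (df['spamorham'] == 'spam').astype(int)  # spam -> 1, ham -> 0
df = df[['text', 'target']]
print(df.head(10))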

Exploratory Data Analysis

Now, let’s look at the dataset in detail. Taking the average of the ‘target’ column, we find that 13.409% of the messages were marked as spam.
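This check is a one-liner on the preprocessed dataframe:

# fraction of messages labelled as spam (roughly 0.134 on this dataset)
print(df['target'].mean())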

Further, perhaps the message length has some correlation with the target outcome? Splitting the spam and ham messages into their own dataframes, we add the number of characters in each message as a third column, ‘len’.

The split data frames
# creating two separate dfs: 1 for spam and 1 for non-spam messages only
df_s = df.loc[df['target'] == 1].copy()
df_ns = df.loc[df['target'] == 0].copy()

# add the number of characters per message as a 'len' column
df_s['len'] = [len(x) for x in df_s["text"]]
spamavg = df_s.len.mean()

df_ns['len'] = [len(x) for x in df_ns["text"]]
nonspamavg = df_ns.len.mean()

print('df_s.head(5)')
print(df_s.head(5))

print('\n\ndf_ns.head(5)')
print(df_ns.head(5))

# average message lengths for spam and ham
print(spamavg, nonspamavg)

Further, taking the average message lengths, we find that spam and ham messages have average lengths of 139.12 and 71.55 characters respectively.

Data Modelling

Now it’s time for the interesting stuff.

Train-test split

We begin by creating a train-test split using sklearn’s default split of 75% train and 25% test.

Count Vectorizer

A count vectorizer converts a collection of text documents to a sparse matrix of token counts. This is necessary before model fitting can be done.
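As a quick illustration of what that matrix looks like, here is a toy example on two made-up documents (not the SMS data):

from sklearn.feature_extraction.text import CountVectorizer

toy = ["free entry to win cash", "are you free for lunch"]
cv = CountVectorizer().fit(toy)
print(sorted(cv.vocabulary_))       # the learned vocabulary (one column per token)
print(cv.transform(toy).toarray())  # token counts, one row per document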

We fit the CountVectorizer on X_train, and then transform X_train using the transform method.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# train test split (default: 75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], random_state=0)

# fitting and transforming X_train using a Count Vectorizer with default parameters
vect = CountVectorizer().fit(X_train)
X_train_vectorized = vect.transform(X_train)

# to look at the object types
print(vect)
print(X_train_vectorized)

MNNB Model Fitting

Let’s first try fitting a classic Multinomial Naive Bayes Classifier Model (MNNB) on X_train and y_train.

A Naive Bayes model assumes that each of the features it uses is conditionally independent of the others given the class. In practice, Naive Bayes models have performed surprisingly well, even on complex tasks where it is clear that the strong independence assumptions are false.
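Concretely, for a message made up of tokens w_1, …, w_n, the classifier scores the spam class as (this is the standard statement of the Naive Bayes assumption, not something specific to this dataset):

P(\text{spam} \mid w_1, \dots, w_n) \propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})

with the analogous product for the ham class; whichever class scores higher is predicted.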

MNNB Model Evaluation

In evaluating the model’s performance, we can generate some predictions then look at the confusion matrix and AUC-ROC score to evaluate performance on the test dataset.

The confusion matrix is generated as below:

MNNB confusion matrix

The results seem promising, with a True Positive Rate (TPR) of 92.6%, specificity of 99.7% and a False Positive Rate (FPR) of 0.3%. These results show that the model performs quite well in predicting whether messages are spam, based solely on the text in the messages.

The Receiver Operating Characteristic (ROC) curve is an evaluation metric for binary classification problems. It is a probability curve that plots the TPR against the FPR at various threshold values and essentially separates the ‘signal’ from the ‘noise’. The Area Under the Curve (AUC) measures the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve.
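For readers who want to see the curve itself rather than just the AUC number, here is a rough sketch of how it could be plotted using the MNNB model and count vectorizer fitted in the code block at the end of this section. It assumes matplotlib is available and uses predicted probabilities rather than hard 0/1 predictions, which is how the curve is normally drawn.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# probability of the positive (spam) class for each test message
probs = model.predict_proba(vect.transform(X_test))[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, probs):.3f}')
plt.plot([0, 1], [0, 1], linestyle='--', label='random guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()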

The model produced an AUC score of 0.962, which is significantly better than a model that makes random guesses of the outcome (AUC of 0.5).

Although the Multinomial Naive Bayes Classifier seems to have worked quite well, I felt that the result could possibly be improved further with a different model.

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score, confusion_matrix

# fitting a multinomial Naive Bayes Classifier Model with smoothing alpha=0.1
model = MultinomialNB(alpha=0.1)
model_fit = model.fit(X_train_vectorized, y_train)

# making predictions & looking at AUC score
predictions = model.predict(vect.transform(X_test))
aucscore = roc_auc_score(y_test, predictions)  # good!
print(aucscore)

# confusion matrix (ham = 0 is the negative class, spam = 1 is the positive class)
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print(pd.DataFrame(confusion_matrix(y_test, predictions),
                   columns=['Predicted Ham', 'Predicted Spam'],
                   index=['Actual Ham', 'Actual Spam']))
print(f'\nTrue Positives: {tp}')
print(f'False Positives: {fp}')
print(f'True Negatives: {tn}')
print(f'False Negatives: {fn}')
print(f'True Positive Rate: {tp / (tp + fn)}')
print(f'Specificity: {tn / (tn + fp)}')
print(f'False Positive Rate: {fp / (fp + tn)}')

MNNB (Tfidf Vectorizer) Model Fitting

I then attempt to use a tfidf vectorizer instead of a count vectorizer to see if it improves the results.

The goal of using tfidf is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.

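A classic formulation of the weight for a term t in a document d is

\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \log \frac{N}{\text{df}(t)}

where tf(t, d) is the count of t in d, N is the number of documents, and df(t) is the number of documents containing t. (scikit-learn’s TfidfVectorizer uses a smoothed variant of the idf term and L2-normalizes each row, so its exact values differ slightly, but the intuition is the same.)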

MNNB (Tfidf Vectorizer) Model Evaluation

In evaluating the model’s performance, we look at the AUC-ROC numbers and the confusion matrix again. It generates an AUC score of 91.67%.

The results seem promising, with a True Positive Rate (TPR) of 83.3%, specificity of 100% and a False Positive Rate (FPR) of 0.0%.

tfidf confusion matrix

When comparing the two models based on AUC scores, it seems that the tfidf vectorizer did not improve model accuracy, and even introduced more noise into the predictions! However, the tfidf weighting seems to have greatly improved the model’s ability to detect ham messages, to the point of 100% accuracy on ham.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score, confusion_matrix

# fitting and transforming X_train using a tfidf vectorizer,
# ignoring terms with a document frequency lower than 3
vect = TfidfVectorizer(min_df=3).fit(X_train)
X_train_vectorized = vect.transform(X_train)

# fitting training data to a multinomial NB model
model = MultinomialNB()
model_fit = model.fit(X_train_vectorized, y_train)

# looking at the features with the smallest and largest tfidf values
# (use get_feature_names_out() on newer scikit-learn versions)
feature_names = np.array(vect.get_feature_names())
sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()
print(pd.Series(feature_names[sorted_tfidf_index[:20]]))
print(pd.Series(feature_names[sorted_tfidf_index[-21:-1]]))

# making predictions
predictions = model_fit.predict(vect.transform(X_test))
aucscore = roc_auc_score(y_test, predictions)
print(aucscore)

# confusion matrix (ham = 0 is the negative class, spam = 1 is the positive class)
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print(pd.DataFrame(confusion_matrix(y_test, predictions),
                   columns=['Predicted Ham', 'Predicted Spam'],
                   index=['Actual Ham', 'Actual Spam']))
print(f'\nTrue Positives: {tp}')
print(f'False Positives: {fp}')
print(f'True Negatives: {tn}')
print(f'False Negatives: {fn}')
print(f'True Positive Rate: {tp / (tp + fn)}')
print(f'Specificity: {tn / (tn + fp)}')
print(f'False Positive Rate: {fp / (fp + tn)}')

Being a stubborn person, I still believe that better performance can be obtained with a few tweaks.

SVC Model Fitting

I now attempt to fit and transform the training data X_train using a Tfidf Vectorizer, while ignoring terms that have a document frequency strictly lower than 5. After adding an additional feature, the length of each document (number of characters), I then fit a Support Vector Classification (SVC) model with regularization parameter C=10000.

SVC Model Evaluation

This results in the following:

  • AUC score of 97.4%
  • TPR of 95.1%
  • Specificity of 99.7%
  • FPR of 0.3%

SVC confusion matrix
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, confusion_matrix

# defining an additional function
def add_feature(X, feature_to_add):
    """Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features."""
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')

# fit and transform X_train and X_test
vectorizer = TfidfVectorizer(min_df=5)
X_train_transformed = vectorizer.fit_transform(X_train)
X_train_transformed_with_length = add_feature(X_train_transformed, X_train.str.len())
X_test_transformed = vectorizer.transform(X_test)
X_test_transformed_with_length = add_feature(X_test_transformed, X_test.str.len())

# SVM creation and model fitting
clf = SVC(C=10000)
clf.fit(X_train_transformed_with_length, y_train)
y_predicted = clf.predict(X_test_transformed_with_length)

# auc score
print(roc_auc_score(y_test, y_predicted))

# confusion matrix (ham = 0 is the negative class, spam = 1 is the positive class)
tn, fp, fn, tp = confusion_matrix(y_test, y_predicted).ravel()
print(pd.DataFrame(confusion_matrix(y_test, y_predicted),
                   columns=['Predicted Ham', 'Predicted Spam'],
                   index=['Actual Ham', 'Actual Spam']))
print(f'\nTrue Positives: {tp}')
print(f'False Positives: {fp}')
print(f'True Negatives: {tn}')
print(f'False Negatives: {fn}')
print(f'True Positive Rate: {tp / (tp + fn)}')
print(f'Specificity: {tn / (tn + fp)}')
print(f'False Positive Rate: {fp / (fp + tn)}')

Logistic Regression Model (n-grams) Fitting

Using a logistic regression, I further include n-grams, which allow the model to take into account groups of up to 3 words when considering whether a message is spam, along with the message length and digit count as additional features.

Logistic Regression Model (n-grams) Evaluation

This results in the following:

  • AUC score of 97.7%
  • TPR of 95.6%
  • Specificity of 99.7%
  • FPR of 0.3%

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix

# tfidf vectorizer with unigrams, bigrams and trigrams, ignoring terms with document frequency < 5
vectorizer = TfidfVectorizer(min_df=5, ngram_range=[1, 3])
X_train_transformed = vectorizer.fit_transform(X_train)
X_test_transformed = vectorizer.transform(X_test)

# add message length and digit count as extra features
X_train_transformed_with_length = add_feature(X_train_transformed,
    [X_train.str.len(), X_train.apply(lambda x: len(''.join([a for a in x if a.isdigit()])))])
X_test_transformed_with_length = add_feature(X_test_transformed,
    [X_test.str.len(), X_test.apply(lambda x: len(''.join([a for a in x if a.isdigit()])))])

# logistic regression fitting and prediction
clf = LogisticRegression(C=100)
clf.fit(X_train_transformed_with_length, y_train)
y_predicted = clf.predict(X_test_transformed_with_length)
print(roc_auc_score(y_test, y_predicted))

# confusion matrix (ham = 0 is the negative class, spam = 1 is the positive class)
tn, fp, fn, tp = confusion_matrix(y_test, y_predicted).ravel()
print(pd.DataFrame(confusion_matrix(y_test, y_predicted),
                   columns=['Predicted Ham', 'Predicted Spam'],
                   index=['Actual Ham', 'Actual Spam']))
print(f'\nTrue Positives: {tp}')
print(f'False Positives: {fp}')
print(f'True Negatives: {tn}')
print(f'False Negatives: {fn}')
print(f'True Positive Rate: {tp / (tp + fn)}')
print(f'Specificity: {tn / (tn + fp)}')
print(f'False Positive Rate: {fp / (fp + tn)}')

Model Comparison

After training and testing these 4 models, it’s time to compare them. I primarily compare them based on AUC scores, as the TPR and TNR values are all somewhat similar.

The logistic regression had the highest AUC score, with the SVC and MNNB 1 models marginally behind. The MNNB 2 model seemed to underperform the rest. However, all 4 models produce AUC scores much higher than 0.5, showing that all 4 perform well enough to beat a model that only randomly guesses the target.

AUC score comparison between the 4 models
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

label = ['MNNB 1', 'MNNB 2', 'SVC', 'Logistic']
auclist = [0.9615532083312719, 0.9166666666666667, 0.97422863173865, 0.976679612130807]

# generates an array of length label and uses it on the X-axis
def plot_bar_x():
    # this is for plotting purposes
    index = np.arange(len(label))
    clrs = ['grey' if (x < max(auclist)) else 'red' for x in auclist]
    g = sb.barplot(x=index, y=auclist, palette=clrs)
    plt.xlabel('Model', fontsize=10)
    plt.ylabel('AUC score', fontsize=10)
    plt.xticks(index, label, fontsize=10, rotation=30)
    plt.title('AUC score for each fitted model')
    for p in g.patches:
        g.annotate("%.2f" % p.get_height(),
                   (p.get_x() + p.get_width() / 2., p.get_height()),
                   ha='center', va='center', fontsize=11, color='gray',
                   xytext=(0, 20), textcoords='offset points')
    g.set_ylim(0, 1.25)  # to make space for the annotations

plot_bar_x()

Thanks for the read!

Do find the code here.

Do feel free to reach out to me on LinkedIn if you have questions or would like to discuss ideas on applying data science techniques in a post-Covid-19 world!

Here’s another article for you!

Original article: https://towardsdatascience.com/create-a-sms-spam-classifier-in-python-b4b015f7404b
