Sentiment Analysis of News Headlines: Classical Supervised Learning vs. Deep Learning

Introduction

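This article walks you through building a binary classifier that performs sentiment analysis on unlabeled data using two different approaches:

1- Supervised learning with the scikit-learn library and NLP packages

2- Deep learning with the TensorFlow and Keras frameworks

The challenge, however, is that we are working with unlabeled data, so we will use Snorkel to create a label of 0 (negative) or 1 (positive) for each of our training data points.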

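The project plan is shown in the figure below:

The dataset used in this project is the "A Million News Headlines" dataset, which is available on Kaggle.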

https://www.kaggle.com/therohk/million-headlines

I will be using Google Colab; feel free to set up your own notebook, then copy and run the code yourself.

#install needed packages
!pip install snorkel
!pip install textblob

#import libraries and modules
from google.colab import files
import io
import pandas as pd

#Snorkel
from snorkel.labeling import LabelingFunction
import re
from snorkel.preprocess import preprocessor
from textblob import TextBlob
from snorkel.labeling import PandasLFApplier
from snorkel.labeling.model import LabelModel
from snorkel.labeling import LFAnalysis
from snorkel.labeling import filter_unlabeled_dataframe
from snorkel.labeling import labeling_function

#NLP packages
import spacy
from nltk.corpus import stopwords
import string
import nltk
import nltk.tokenize
punc = string.punctuation
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

#Supervised learning
from tqdm.notebook import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

##Deep learning libraries and APIs
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

1. Loading the data

#upload the data from your local directory
uploaded = files.upload()

# store the dataset as a Pandas Dataframe
df = pd.read_csv(io.BytesIO(uploaded['data.csv']))

#conduct some data cleaning
df = df.drop(['publish_date', 'Unnamed: 2'], axis=1)
df = df.rename(columns = {'headline_text': 'text'})
df['text'] = df['text'].astype(str)

#check the data info
df.info()

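Here, we can check that the dataset contains 63,821 instances.

2. Creating the labels

Since the dataset is unlabeled, we will use Snorkel to write labeling functions: heuristic, programmatic rules that assign one of two class labels to each headline, positive (1) or negative (0).

Below are the labeling functions we created.

The first function checks whether any of the given keywords appears in the headline; the second turns that check into a labeling function that assigns the appropriate label based on predefined lists of positive and negative words. For example, if the word "promising" appears in a headline, the headline is assigned the positive label.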

#define constants to represent the class labels :positive, negative, and abstain
POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1

#define function which looks into the input words to represent a proper label
def keyword_lookup(x, keywords, label):  
    if any(word in x.text.lower() for word in keywords):
        return label
    return ABSTAIN

#define function which assigns a correct label
def make_keyword_lf(keywords, label=POSITIVE):
    return LabelingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label))

#resource: https://www.snorkel.org/use-cases/01-spam-tutorial#3-writing-more-labeling-functions

#these two lists can be further extended
#positive news might contain the following words
keyword_positive = make_keyword_lf(keywords=['boosts', 'great', 'develops', 'promising', 'ambitious', 'delighted', 'record', 'win', 'breakthrough', 'recover', 'achievement', 'peace', 'party', 'hope', 'flourish', 'respect', 'partnership', 'champion', 'positive', 'happy', 'bright', 'confident', 'encouraged', 'perfect', 'complete', 'assured' ])

#negative news might contain the following words
keyword_negative = make_keyword_lf(keywords=['war','soldiers', 'turmoil', 'injur','trouble', 'aggressive', 'killed', 'coup', 'evasion', 'strike', 'troops', 'dismisses', 'attacks', 'defeat', 'damage', 'dishonest', 'dead', 'fear', 'foul', 'fails', 'hostile', 'cuts', 'accusations', 'victims',  'death', 'unrest', 'fraud', 'dispute', 'destruction', 'battle', 'unhappy', 'bad', 'alarming', 'angry', 'anxious', 'dirty', 'pain', 'poison', 'unfair', 'unhealthy'
                                              ], label=NEGATIVE)
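One thing worth knowing about keyword_lookup is that it matches substrings, so a keyword such as 'war' also fires on a headline containing 'award'. The sketch below is one possible alternative, not part of the original pipeline: it works on plain strings instead of Snorkel data points, and the function names are made up for this illustration.

```python
import re

POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1

def keyword_lookup_substring(text, keywords, label):
    """Substring matching, mirroring the behaviour of keyword_lookup above."""
    if any(word in text.lower() for word in keywords):
        return label
    return ABSTAIN

def keyword_lookup_whole_word(text, keywords, label):
    """Hypothetical variant: fire only when a keyword occurs as a whole word."""
    # \b anchors the match at word boundaries, so 'war' does not match 'award'
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    if re.search(pattern, text.lower()):
        return label
    return ABSTAIN

headline = "local teacher receives national award"
print(keyword_lookup_substring(headline, ["war"], NEGATIVE))   # 0, 'war' is found inside 'award'
print(keyword_lookup_whole_word(headline, ["war"], NEGATIVE))  # -1, no whole-word match
```

The trade-off is that whole-word matching no longer catches word stems: an entry like 'injur', which relies on matching 'injured' and 'injuries', would need to be spelled out in full.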

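Another set of labeling functions is implemented with TextBlob, a pretrained sentiment analyzer. We will create a preprocessor that runs TextBlob on each headline and then extracts the polarity and subjectivity scores.

#set up a preprocessor function to determine polarity & subjectivity using the TextBlob pretrained classifier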
@preprocessor(memoize=True)
def textblob_sentiment(x):
    scores = TextBlob(x.text)
    x.polarity = scores.sentiment.polarity
    x.subjectivity = scores.sentiment.subjectivity
    return x

#find polarity
@labeling_function(pre=[textblob_sentiment])
def textblob_polarity(x):
    return POSITIVE if x.polarity > 0.6 else ABSTAIN

#find subjectivity 
@labeling_function(pre=[textblob_sentiment])
def textblob_subjectivity(x):
    return POSITIVE if x.subjectivity >= 0.5 else ABSTAIN

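The next step is to combine all the labeling functions and apply them to our dataset. We then fit the label_model to predict and generate the positive and negative classes.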

#combine all the labeling functions 
lfs = [keyword_positive, keyword_negative, textblob_polarity, textblob_subjectivity ]

#apply the lfs on the dataframe
applier = PandasLFApplier(lfs=lfs)
L_snorkel = applier.apply(df=df)

#apply the label model
label_model = LabelModel(cardinality=2, verbose=True)

#fit on the data
label_model.fit(L_snorkel)

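#predict and create the labels
df["label"] = label_model.predict(L=L_snorkel)

#filter out the unlabeled data points (label -1, i.e. the label model abstained)
df = df.loc[df.label.isin([0, 1]), :]

#find the label counts
df['label'].value_counts()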
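To build some intuition for what it means to combine several labeling functions, here is a minimal sketch of the simplest possible combiner, a majority vote. This is not how Snorkel's LabelModel works internally (it learns how much to trust each function); the label matrix below is made up for illustration.

```python
from collections import Counter

ABSTAIN = -1

def majority_vote(row):
    """Combine one row of labeling-function outputs into a single label."""
    votes = [v for v in row if v != ABSTAIN]  # ignore abstentions
    if not votes:
        return ABSTAIN                        # no function fired on this headline
    counts = Counter(votes).most_common()
    # a tie between the two classes is treated as ABSTAIN
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]

# hypothetical label matrix: one row per headline, one column per labeling function
L_example = [
    [1, -1, 1, -1],    # two positive votes
    [-1, 0, -1, -1],   # one negative vote
    [1, 0, -1, -1],    # tie
    [-1, -1, -1, -1],  # every function abstained
]
labels = [majority_vote(row) for row in L_example]
print(labels)  # [1, 0, -1, -1]
```

Rows that end up with -1 carry no usable label, which is why unlabeled data points have to be dropped before training a classifier.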

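We can see that, after removing the unlabeled data points (those for which the label model abstained), we have about 12,300 positive labels and 6,900 negative labels, which is enough to build our sentiment classifier.

3. Applying supervised learning: logistic regression

The first approach we will use to build the sentiment classifier is a classic supervised one, logistic regression. It is regarded as a strong binary classifier that estimates the probability of an instance belonging to a class and makes its prediction accordingly.

However, we first have to preprocess the data and create a vector representation before training the model.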
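To make the "estimates a probability, then predicts" idea concrete, here is a minimal pure-Python sketch of how a fitted logistic regression scores one instance. The weights, bias, and feature values are hypothetical; in the article they are learned by scikit-learn.

```python
import math

def predict_proba(features, weights, bias):
    """Probability of the positive class under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))  # weighted sum
    return 1.0 / (1.0 + math.exp(-z))                         # sigmoid squashes z into (0, 1)

def predict(features, weights, bias, threshold=0.5):
    """Turn the probability into a class label: 1 (positive) or 0 (negative)."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# hypothetical weights for three features, e.g. the words 'win', 'war', 'record'
weights = [2.0, -3.0, 1.0]
bias = 0.0

print(predict([0.5, 0.0, 0.5], weights, bias))  # 1
print(predict([0.0, 0.7, 0.0], weights, bias))  # 0
```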

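3.1 Text preprocessing

Preprocessing is a fundamental task in natural language processing (NLP) for preparing text data for training. It converts raw input text into cleaned tokens of individual words or characters. The main preprocessing steps are summarized below: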

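1- Tokenization: splitting sentences into words

2- Lemmatization: reducing words to their base form

3- Removing stop words: dropping unnecessary words such as "the", "he", "she", etc.

4- Removing punctuation: dropping unimportant elements such as commas, periods, brackets, parentheses, etc.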
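Before the full pipeline, the steps above can be tried on a single headline with nothing but the standard library. This is only a rough sketch: the stop-word list is a tiny made-up sample rather than NLTK's full list, and lemmatization is left out because it needs a language model such as spaCy's.

```python
import re
import string

# a tiny illustrative stop-word list; the article uses NLTK's English list instead
STOP_WORDS = {"the", "a", "an", "he", "she", "of", "in", "on", "to", "and", "is"}

def simple_preprocess(headline):
    """Lowercase, tokenize, and drop stop words and punctuation tokens."""
    # split into runs of word characters or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", headline.lower().strip())
    return [t for t in tokens if t not in STOP_WORDS and t not in string.punctuation]

print(simple_preprocess("The council approves a new budget, despite protests!"))
# ['council', 'approves', 'new', 'budget', 'despite', 'protests']
```

A lemmatizer would additionally map inflected forms such as plurals and verb endings to their base forms.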

#make a copy of the dataframe
data = df.copy()

#define a function which handles the text preprocessing 
def preparation_text_data(data):
    """
    This pipeline prepares the text data, conducting the following steps:
    1) Tokenization
    2) Lemmatization
    3) Removal of stopwords
    4) Removal of punctuation
    """

    # initialize spacy object
    nlp = spacy.load('en_core_web_sm')

    # select raw text
    raw_text = data.text.values.tolist()

    # tokenize
    tokenized_text = [[nlp(i.lower().strip())] for i in tqdm(raw_text)]

    #define the punctuations and stop words
    punc = string.punctuation 
    stop_words = set(stopwords.words('english'))

    #lemmatize, remove stopwords and punctuation
    corpus = []
    for doc in tqdm(tokenized_text):
        corpus.append([word.lemma_ for word in doc[0] if (word.lemma_ not in stop_words and word.lemma_ not in punc)])

    # add prepared data to df
    data["text"] = corpus
    return data

#apply the data preprocessing function
data =  preparation_text_data(data)

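As the figure below shows, the data has been cleaned properly: each headline appears as individual words in their base form, with no stop words or punctuation.

3.2 Text representation

The second step is to convert the text data into meaningful vectors that an ML model can work with. I applied TF-IDF (term frequency (TF) - inverse document frequency (IDF)), which weights each word by how often it occurs in a headline, discounted by how common the word is across the whole corpus.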
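To show what the weighting does, here is a small pure-Python sketch of TF-IDF on a made-up three-headline corpus. It uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row, which, as far as I know, corresponds to scikit-learn's TfidfVectorizer defaults; treat it as an illustration rather than a drop-in replacement.

```python
import math
from collections import Counter

def tfidf(corpus):
    """TF-IDF with smoothed IDF and L2-normalised rows for a list of token lists."""
    n_docs = len(corpus)
    vocab = sorted({tok for doc in corpus for tok in doc})
    # document frequency: number of documents that contain each term
    df = Counter(tok for doc in corpus for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in corpus:
        counts = Counter(doc)                             # raw term counts
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in row])
    return vocab, rows

# hypothetical, already preprocessed headlines
corpus = [["council", "approve", "budget"],
          ["council", "reject", "plan"],
          ["council", "budget", "cut"]]
vocab, rows = tfidf(corpus)
first = dict(zip(vocab, rows[0]))
print(vocab)
print({t: round(w, 3) for t, w in first.items() if w > 0})
```

In the first headline, 'approve' (which appears in one headline only) gets a larger weight than 'budget' (two headlines), which in turn outweighs 'council' (all three).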

def text_representation(data):
  tfidf_vect = TfidfVectorizer()
  data['text'] = data['text'].apply(lambda text: " ".join(set(text)))
  X_tfidf = tfidf_vect.fit_transform(data['text'])

  print(X_tfidf.shape)
  print(tfidf_vect.get_feature_names_out())

  X_tfidf = pd.DataFrame(X_tfidf.toarray())
  return X_tfidf

#apply the TF-IDF function
X_tfidf = text_representation(data)

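Below is the output of the text_representation function, which shows that the words have been converted into meaningful vectors.

3.3 Model training

At this stage, we are ready to build and train the logistic regression model. We split the dataset into training and test sets, fit the model, and make predictions on the test data points. The model reaches an accuracy score of 92%.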

X= X_tfidf
y = data['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

#fit Log Regression Model
clf= LogisticRegression()
clf.fit(X_train,y_train)

#predict on the test set and print the accuracy
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

print(classification_report(y_test, y_pred))

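3.4 Predicting a new instance

We can make a prediction for a new instance as shown below: we feed the model a new headline, and in our example it predicts the negative label, since the headline is about war and sanctions.

new_data = ["The US imposes sanctions on Russia because of the Ukrainian war"]
#use a separate name for the vectorizer so that the `tf` TensorFlow import is not overwritten
tfidf_new = TfidfVectorizer()
tfidf_new.fit(data['text'])
vect = pd.DataFrame(tfidf_new.transform(new_data).toarray())
logistic_prediction = clf.predict(vect)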

print(logistic_prediction)

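4. Deep learning approach: TensorFlow and Keras

A neural network is a deep learning algorithm made up of multiple layers of interconnected neurons driven by activation functions. Each neuron computes a weighted sum of its inputs and applies an activation function to that sum; the final layer outputs a result that indicates the class of the instance.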
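As a rough illustration of "weighted sum, then activation", the sketch below pushes an input through two tiny fully connected layers in pure Python. All weights are hypothetical and chosen by hand; a real network learns them during training.

```python
import math

def dense(inputs, weights, biases, activation):
    """One fully connected layer: weighted sum per neuron, then the activation."""
    return [activation(b + sum(w * x for w, x in zip(neuron_w, inputs)))
            for neuron_w, b in zip(weights, biases)]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical weights: 2 inputs -> 2 hidden neurons (ReLU) -> 1 output neuron (sigmoid)
hidden_w, hidden_b = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]
output_w, output_b = [[2.0, -2.0]], [0.0]

def forward(x):
    hidden = dense(x, hidden_w, hidden_b, relu)
    return dense(hidden, output_w, output_b, sigmoid)[0]

print(forward([1.0, 0.0]) > 0.5)  # True
print(forward([0.0, 1.0]) > 0.5)  # False
```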

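Keras and TensorFlow are among the most popular frameworks in deep learning. By definition, Keras is a high-level neural network library that runs on top of TensorFlow, while TensorFlow is an end-to-end open-source platform for machine learning that consists of tools, libraries, and other resources providing high-level APIs.

In our project, we will use TensorFlow to preprocess and pad the data with the Tokenizer class, and Keras to define and train a Sequential model (the neural network).

To learn more about the Sequential model, visit this link: https://www.tensorflow.org/guide/keras/sequential_model

4.1 Train and test split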

##store headlines and labels in respective lists
text = list(data['text'])
labels = list(data['label'])

##sentences
training_text = text[0:15000]
testing_text = text[15000:]

##labels
training_labels = labels[0:15000]
testing_labels = labels[15000:]

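4.2 Setting up the TensorFlow Tokenizer to preprocess the data

In this step, we use the Tokenizer from tensorflow.keras to create the word encodings (a dictionary of key-value pairs) and to turn the headlines into sequences with texts_to_sequences. We then pad those sequences with pad_sequences so that they all have the same length.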

#preprocess 
tokenizer = Tokenizer(num_words=10000, oov_token= "<OOV>")
tokenizer.fit_on_texts(training_text)

word_index = tokenizer.word_index

training_sequences = tokenizer.texts_to_sequences(training_text)
training_padded = pad_sequences(training_sequences, maxlen=120, padding='post', truncating='post')

testing_sequences = tokenizer.texts_to_sequences(testing_text)
testing_padded = pad_sequences(testing_sequences, maxlen=120, padding='post', truncating='post')

# convert lists into numpy arrays to make it work with TensorFlow 
training_padded = np.array(training_padded)
training_labels = np.array(training_labels)

testing_padded = np.array(testing_padded)
testing_labels = np.array(testing_labels)
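To see what the word index, the sequences, and the padding look like, here is a small pure-Python imitation of the three steps. It is a simplified sketch: the helper names are made up, and the real Keras Tokenizer also lowercases, strips punctuation, and applies the num_words limit. Like Keras, it reserves index 0 for padding and index 1 for the out-of-vocabulary token.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<OOV>"):
    """Map each word to an integer, most frequent first; 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def texts_to_sequences(texts, word_index, oov_token="<OOV>"):
    """Replace every word by its index; unseen words get the OOV index."""
    oov = word_index[oov_token]
    return [[word_index.get(w, oov) for w in text.split()] for text in texts]

def pad_post(sequences, maxlen):
    """Truncate or zero-pad every sequence at the end to exactly maxlen items."""
    return [seq[:maxlen] + [0] * (maxlen - len(seq[:maxlen])) for seq in sequences]

# hypothetical training headlines
train = ["council approve budget", "council reject plan"]
word_index = fit_word_index(train)
seqs = texts_to_sequences(["council approve strike"], word_index)
print(seqs)               # [[2, 3, 1]]
print(pad_post(seqs, 5))  # [[2, 3, 1, 0, 0]]
```

The word 'strike' never appeared in the training headlines, so it is encoded as 1, the OOV index.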

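4.3 Defining and training the Sequential model

We build the model with an embedding layer defined by the vocabulary size, the embedding dimension, and the input length, followed by a global average pooling layer. We also add a Dense layer with a ReLU activation and a sigmoid output layer that returns a probability between 0 and 1, which is used to classify each instance as positive or negative. You can tune the hyperparameters of each layer to improve model performance.

We then compile the model with an optimizer and a performance metric, and train it on the dataset.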

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 16, input_length=120),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

##compile the model
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

model.summary()

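We can check in the model summary (figure below) that we have 4 layers, a maximum sequence length of 120, an embedding dimension of 16, a Dense layer of 24 units, and 160,433 trainable parameters.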
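The parameter count can be checked by hand from the layer sizes used in the model definition. A short sketch of the arithmetic:

```python
vocab_size, embedding_dim, hidden_units = 10000, 16, 24

embedding_params = vocab_size * embedding_dim                # one 16-dimensional vector per word
pooling_params = 0                                           # averaging has no weights
hidden_params = embedding_dim * hidden_units + hidden_units  # weights + biases
output_params = hidden_units * 1 + 1                         # weights + bias

total = embedding_params + pooling_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total)
# 160000 408 25 160433
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size and the embedding dimension are the settings that change the model size the most.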

num_epochs = 10

history = model.fit(training_padded, 
                    training_labels, 
                    epochs=num_epochs, 
                    validation_data=(testing_padded, testing_labels), 
                    verbose=2)

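We can further check that the neural network trained for 10 epochs reaches a good accuracy of 99%, with decreasing validation loss and increasing validation accuracy, which points to strong predictive performance and a low risk of generalization (overfitting) error.

4.4 Predicting a new instance

Now we will use this model to predict the same headline. Again, the output is close to zero, which indicates that this headline is negative.

new_headline = ["The US imposes sanctions on Russia because of the Ukrainian war"]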
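One way to check claims like these is to look at the per-epoch values that Keras stores in history.history. The helpers below are a hypothetical sketch that works on a plain dictionary of that shape; the numbers in the example are made up and are not results from this article.

```python
def generalization_gap(history_dict):
    """Final-epoch gap between training and validation accuracy."""
    return history_dict["accuracy"][-1] - history_dict["val_accuracy"][-1]

def val_loss_still_improving(history_dict):
    """True if the last epoch has the lowest validation loss seen so far."""
    losses = history_dict["val_loss"]
    return losses[-1] == min(losses)

# made-up numbers in the shape of `history.history`
example = {
    "accuracy":     [0.80, 0.90, 0.95],
    "val_accuracy": [0.78, 0.88, 0.90],
    "val_loss":     [0.50, 0.35, 0.38],
}
print(round(generalization_gap(example), 2))  # 0.05
print(val_loss_still_improving(example))      # False
```

A small gap together with a validation loss that keeps falling suggests the model generalizes well; a growing gap or a rising validation loss is a sign of overfitting.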

##prepare the sequences of the sentences in question
sequences = tokenizer.texts_to_sequences(new_headline)
padded_seqs = pad_sequences(sequences, maxlen=120, padding='post', truncating='post')

print(model.predict(padded_seqs))

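5. Conclusion

In this article, we built a binary classifier to detect the sentiment of news headlines. We first used heuristic rules with Snorkel to create the labels that separate negative from positive headlines. We then produced sentiment predictions with both a supervised ML approach and a deep learning approach. Both approaches predicted the correct label for a new headline, and both reached reasonably high accuracy scores: 92% for logistic regression and 96% for the deep neural network.

You might ask yourself which approach is better or easier to use for your next data science prediction task; the answer, however, depends entirely on the scope and complexity of the project and on the availability of data.

Sometimes we may opt for a simple solution using well-known algorithms from scikit-learn, where the prediction system uses mathematical intuition to assign the desired output value to a given input.

On the other hand, deep learning attempts to mimic how the human brain works by deriving the rules (as outputs) from the given inputs. We should keep in mind that neural networks usually need large amounts of data and high computing power to do their job.

References

[1] A Million News Headlines, news headlines published over a period of 18 years, License CC0: Public Domain, Kaggle

Thanks for reading!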