How to Build a Fast, General-Purpose AI "Storytelling" Project Without Training an RNN or a Generative Model

Author | Andre Ye

Translator | 彎月; Editor | 郭芮

Header image | Downloaded by CSDN from 視覺中國

Produced by | CSDN (ID: CSDNnews)

The translated article follows:

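During these days of quarantine, we all crave stories. But not every story appeals to everyone: people who love romance may have no patience for mysteries, and fans of detective fiction may have no interest in love stories. So who better than an AI to tell us the stories we like?

In this article, I'll show you how to build an AI that tells us stories based on our personal preferences, adding a bit of fun to a dull quarantine.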

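This article is divided into the following parts:

1. Blueprint:

An overview of the whole project and its components.

2. Program demo:

A preview of what the system can do once the code is written.

3. Data loading and cleaning:

Loading the data and preparing it for processing.

4. Finding the most representative plots:

The first part of the project, which uses K-Means to select the plots most relevant to the user's interests.

5. Graph-based summarization:

Using graph-based summarization to get a summary of each plot, which is part of the UI.

6. Recommendation engine:

Using a simple predictive machine learning model to recommend new stories.

7. Putting it all together:

Writing the surrounding structure that ties all the components together.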

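Blueprint

I want an AI to tell me a story. Ideally, in true tech-renaissance fashion, I would train a recurrent neural network or some other generative method. In my experience with text generation, however, such models either take a very long time to train or overfit the data, which defeats the goal of generating original text. Note also that training a well-performing model takes more than 8 hours, yet Kaggle, the most capable free platform I know of for training deep learning models, allows at most 8 hours per run.

I want to build something fast, general, and implementable by anyone. Instead of training an RNN or a generative model, this AI searches a "story database" of human-written stories and finds the ones I'll like best. This not only guarantees a baseline of quality (written by humans, for humans) but is also much faster.

For the "story database", we'll use the Wikipedia Movie Plots dataset on Kaggle. It contains about 35,000 movie stories spanning many genres, countries, and eras, which makes it the best story database I could find.

The dataset includes the release year, title, country of origin, genre, and a text description of the plot.

With the data in hand, let's sketch a rough outline/blueprint.

1. The program outputs summaries of five highly distinctive stories. (Ratings of such stories do a better job of separating users' tastes. A story like The Godfather, by contrast, says almost nothing about anyone's taste, because nearly everyone likes it.)

2. The user rates each one: like, dislike, or neutral.

3. The program takes the user's ratings of those five stories and outputs the summary of a recommended story. If the user is interested, it outputs the full story. After each full story, the program asks for feedback. It learns from this real-time feedback and tries to make better recommendations (much like a reinforcement learning system).

Note that we pick five or so of the most representative stories so that the model gets as much information as possible from a limited amount of data.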

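System demo

At the start, the program asks for your feedback on three stories. As far as the program is concerned, these are the most representative stories of each cluster in the data.

After you answer these three warm-up questions and the model has a rough estimate of your preferences, it starts serving stories it expects you to like.

If a story snippet interests you, the program prints the full story for you to read.

The model adds your feedback (whether or not you liked the story) to its training data to improve its recommendations, so it keeps learning as you read. If you don't like a story's summary, the program skips the full story and moves on to a new one.

If you like a snippet about murder and police and respond with "1", the program starts learning and recommends more and more stories in that direction.

Much like Monte Carlo tree search, the program moves in the direction that optimizes reward and backs off when it strays too far from the kinds of stories you like, improving your experience.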

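Data loading and cleaning

We load the data with pandas' read_csv.

import pandas as pd

data = pd.read_csv('/kaggle/input/wikipedia-movie-plots/wiki_movie_plots_deduped.csv')
data.head()

The dataset's fields are the release year, movie title, country of origin, director, cast, genre, the URL of the movie's Wikipedia page, and a text description of the plot. We can drop the director and cast: for our recommendation algorithm or clustering method, these two fields have far too many categories (12,593 unique directors and 32,182 unique casts, to be precise) to be of much benefit. The number of genres, however, is much smaller: about 30 genres have more than 100 movies each, and together they cover over 80% of the movies (the rest can simply be grouped as "other"). So we drop the director and cast.

data.drop(['Director', 'Cast'], axis=1, inplace=True)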

Another problem we run into is bracketed citations. As you know, Wikipedia numbers its cited sources (for example, [3]).

"Grace Roberts (played by Lea Leland), marries rancher Edward Smith, who is revealed to be a neglectful, vice-ridden spouse. They have a daughter, Vivian. Dr. Franklin (Leonid Samoloff) whisks Grace away from this unhappy life, and they move to New York under aliases, pretending to be married (since surely Smith would not agree to a divorce). Grace and Franklin have a son, Walter (Milton S. Gould). Vivian gets sick, however, and Grace and Franklin return to save her. Somehow this reunion, as Smith had assumed Grace to be dead, causes the death of Franklin. This plot device frees Grace to return to her father's farm with both children.[1]"

In the string above, for example, we need to remove [1]. The simplest solution is to build a list of every bracketed value ([1], [2], [3], …, [98], [99]) and remove each value in the list from the string. This assumes that no article has more than 99 citations. It isn't the most efficient approach, but it gets the job done without messy string indexing or splitting.

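blacklist = []
for i in range(100):
    blacklist.append('[' + str(i) + ']')

This code creates blacklist, a list of the citation markers we don't want.

def remove_brackets(string):
    # Strip every citation marker in the blacklist from the string.
    for item in blacklist:
        string = string.replace(item, '')
    return string

The function remove_brackets uses this blacklist to strip the markers from a string. We then apply it to every entry of the Plot column.

data['Plot'] = data['Plot'].apply(remove_brackets)

That completes our basic data cleaning.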
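The blacklist assumes no more than 99 citations per article. As an alternative, a regular expression can strip any bracketed number regardless of how large it is. The following is a minimal sketch rather than the author's code; `remove_citations` and `CITATION_PATTERN` are hypothetical names introduced here for illustration.

```python
import re

# Matches Wikipedia-style numeric citation markers such as [1] or [123].
CITATION_PATTERN = re.compile(r"\[\d+\]")


def remove_citations(text):
    """Return text with every numeric citation marker removed."""
    return CITATION_PATTERN.sub("", text)


sample = "This plot device frees Grace to return to her father's farm.[1]"
print(remove_citations(sample))
```

Under these assumptions it could be applied the same way as the original function, for example with `data['Plot'].apply(remove_citations)`. Brackets that do not contain only digits are left untouched.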

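Summarizing the plots

A key element of the system is plot summarization. Full stories are usually too long to read in one go, so summaries matter: they let the user decide whether to keep reading.

We'll use a graph-based summarization algorithm, one of the most popular approaches to text summarization. It first builds a graph of document units (most methods use sentences as the basic unit), then selects nodes using a version of PageRank suited to the task. Google's original PageRank takes a similar graph-based approach to ranking web-page nodes.

PageRank computes the "centrality" of each node in the graph, which is useful for measuring how much relevant information a sentence carries. The graph is built from bag-of-words features, with edge weights based on cosine similarity.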
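To make the centrality idea concrete, here is a single undamped PageRank pass over a four-page toy graph. It is only a sketch under simplifying assumptions: the links are unweighted, there is no damping factor, and pages without outbound links are not treated specially. In a summarizer, the nodes would be sentences and the edges would carry cosine-similarity weights. The names `links` and `pagerank_step` are hypothetical.

```python
# Outbound links of a four-page toy web: page -> pages it links to.
links = {"A": [], "B": ["C", "A"], "C": ["A"], "D": ["A", "B", "C"]}


def pagerank_step(rank, links):
    """One undamped PageRank pass: each page splits its current rank
    equally among the pages it links to."""
    new_rank = {page: 0.0 for page in links}
    for page, targets in links.items():
        for target in targets:
            new_rank[target] += rank[page] / len(targets)
    return new_rank


initial = {page: 1 / len(links) for page in links}  # 0.25 for each page
after_one = pagerank_step(initial, links)
print(round(after_one["A"], 3))  # 0.458
```

Page A collects 0.125 from B, 0.25 from C, and about 0.083 from D, so it ends the pass as the most central node. A real implementation repeats such passes until the values settle.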

We'll use the gensim library to summarize long texts. As with the earlier steps, the implementation is simple:

import gensim  # the summarization module was removed in gensim 4.0, so this needs gensim < 4.0

string = '''
The PageRank algorithm outputs a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. PageRank can be calculated for collections of documents of any size. It is assumed in several research papers that the distribution is evenly divided among all documents in the collection at the beginning of the computational process. The PageRank computations require several passes, called "iterations", through the collection to adjust approximate PageRank values to more closely reflect the theoretical true value.

Assume a small universe of four web pages: A, B, C and D. Links from a page to itself, or multiple outbound links from one single page to another single page, are ignored. PageRank is initialized to the same value for all pages. In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web at that time, so each page in this example would have an initial value of 1. However, later versions of PageRank, and the remainder of this section, assume a probability distribution between 0 and 1. Hence the initial value for each page in this example is 0.25.

The PageRank transferred from a given page to the targets of its outbound links upon the next iteration is divided equally among all outbound links.

If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25 PageRank to A upon the next iteration, for a total of 0.75.

Suppose instead that page B had a link to pages C and A, page C had a link to page A, and page D had links to all three pages. Thus, upon the first iteration, page B would transfer half of its existing value, or 0.125, to page A and the other half, or 0.125, to page C. Page C would transfer all of its existing value, 0.25, to the only page it links to, A. Since D had three outbound links, it would transfer one third of its existing value, or approximately 0.083, to A. At the completion of this iteration, page A will have a PageRank of approximately 0.458.

In other words, the PageRank conferred by an outbound link is equal to the document's own PageRank score divided by the number of outbound links L.

In the general case, the PageRank value for any page u can be expressed as: i.e. the PageRank value for a page u is dependent on the PageRank values for each page v contained in the set Bu (the set containing all pages linking to page u), divided by the number L(v) of links from page v. The algorithm involves a damping factor for the calculation of the pagerank. It is like the income tax which the govt extracts from one despite paying him itself.
'''

print(gensim.summarization.summarize(string))

Output:

In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web at that time, so each page in this example would have an initial value of 1.

The PageRank transferred from a given page to the targets of its outbound links upon the next iteration is divided equally among all outbound links. If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25 PageRank to A upon the next iteration, for a total of 0.75. Since D had three outbound links, it would transfer one third of its existing value, or approximately 0.083, to A.

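That's a pretty good summary (if you'd rather not read the whole text). Graph-based summarization is one of the most effective summarization methods, and we'll use it to summarize the plots. Let's create a function summary that takes a text and returns its summary. We need to handle two special cases, though:

If the text is shorter than 500 characters, return it unchanged; summarizing would make it too brief.

If the text has only one sentence, gensim can't process it, because it works by selecting the important sentences in a text. We'll use a TextBlob object, whose .sentences attribute splits text into sentences. If the first sentence equals the text itself, the text has only one sentence.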

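import gensim  # needs gensim < 4.0, which still ships the summarization module
from textblob import TextBlob

def summary(x):
    # Short or single-sentence texts are returned unchanged.
    if len(x) < 500 or str(TextBlob(x).sentences[0]) == x:
        return x
    else:
        return gensim.summarization.summarize(x)

data['Summary'] = data['Plot'].apply(summary)

If neither of the two conditions holds, the function returns the summary of the text. We then use it to create a Summary column.

This takes a few hours to run. But it only needs to run once, and having the summaries saves time later.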

Let's see how it handles a sample text from the dataset:

"The earliest known adaptation of the classic fairytale, this films shows Jack trading his cow for the beans, his mother forcing him to drop them in the front yard, and beig forced upstairs. As he sleeps, Jack is visited by a fairy who shows him glimpses of what will await him when he ascends the bean stalk. In this version, Jack is the son of a deposed king. When Jack wakes up, he finds the beanstalk has grown and he climbs to the top where he enters the giant's home. The giant finds Jack, who narrowly escapes. The giant chases Jack down the bean stalk, but Jack is able to cut it down before the giant can get to safety. He falls and is killed as Jack celebrates. The fairy then reveals that Jack may return home as a prince."

Result:

'As he sleeps, Jack is visited by a fairy who shows him glimpses of what will await him when he ascends the bean stalk.'

This summary makes a great teaser! It's easy to read and gives you a good sense of the key sentences in the movie's plot.

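Finding the most representative plots

To find the most representative plots, we use K-Means to split the plot texts into some number of clusters. The text's cluster label, together with the movie's country, genre, and year, is then used to cluster the movies themselves. The closer a movie is to its cluster center, the better it represents that cluster. The main ideas behind this are:

Asking users whether they like the most representative movies gives the model the most information, making up for the fact that we start with no knowledge of the user's preferences.

A movie's country, genre, and year each capture aspects of the movie that its text can also convey, which helps us quickly find suitable recommendations. In theory, the most "accurate" recommendation would measure some kind of similarity between plot vectors after each plot has been converted into a very, very long vector, but that takes a long time. Instead, we represent a plot by its attributes.

Clustering the text only needs to be done once. It gives us an additional feature for clustering the movies, and an attribute of each movie to use when actually making recommendations.

Let's begin. First, we need to strip all punctuation and lowercase the text. We can write a function clean that does this with a regular expression.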

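import string
import re

def clean(text):
    # re.escape keeps characters such as ']' and '\' from breaking the character class.
    return re.sub('[%s]' % re.escape(string.punctuation), '', text).lower()

With pandas' apply, we can run this function on all the plots.

data['Cleaned'] = data['Plot'].apply(clean)

Next, we turn the data into vectors using TF-IDF (term frequency-inverse document frequency). This method helps separate important words from unimportant ones, which makes the text easier to cluster. It emphasizes words that appear many times in one document but rarely in the corpus as a whole, and downplays words that appear in every document.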
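To illustrate how TF-IDF weighs words, here is a tiny hand-rolled version. It is a sketch of the textbook formula (term frequency times log(N / document frequency)), not sklearn's exact implementation, which adds smoothing and normalization; the three sample sentences and the helper name `tfidf` are made up for this example.

```python
import math
from collections import Counter

docs = [
    "the detective finds the killer",
    "the family moves to the farm",
    "the robot saves the family",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each word appears in.
df = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf(tokens):
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / df[word])
        for word, count in counts.items()
    }


weights = tfidf(tokenized[0])
print(weights["the"], weights["detective"])
```

Because "the" occurs in every document, its weight is zero, while "detective", which occurs in only one document, gets a positive weight.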

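from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features=500)
X = vectorizer.fit_transform(data['Cleaned'])  # vectorize the cleaned text created above

We store this very sparse matrix in the variable X. Because K-Means is distance-based, it suffers from the curse of dimensionality, so we should do our best to reduce the dimensionality of the vectorized text; here we cap the number of features per vector at 500. (When I didn't set the max_features limit, K-Means put all texts but one into a single cluster and that one text into another. This is the curse of dimensionality at work: with hundreds of thousands of dimensions in the TF-IDF vocabulary, distances lose their meaning, and everything except outliers ends up in the same cluster.)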

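For the same reason, it's a good idea to scale the data before feeding it into K-Means. We use StandardScaler, which standardizes each feature to zero mean and unit variance.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# StandardScaler cannot center a sparse matrix, so convert X to a dense array first.
X = scaler.fit_transform(X.toarray())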
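As a quick illustration of what standardization does to a single feature, the following sketch applies the formula z = (x - mean) / std by hand using only the standard library. The sample values are made up. It uses the population standard deviation, which is what StandardScaler uses as well.

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.fmean(values)   # 5.0
std = statistics.pstdev(values)   # population standard deviation, 2.0

# Standardize: subtract the mean, then divide by the standard deviation.
scaled = [(v - mean) / std for v in values]
print(scaled)
```

The scaled values have mean 0 and standard deviation 1, but they are not confined to a fixed range: here they run from -1.5 to 2.0.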

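Now let's train the K-Means model. Ideally, the number of clusters (which is also the number of questions we need to ask) should be between 3 and 6, inclusive.

So we run K-Means with each cluster count in the list [3, 4, 5, 6], score each run, and find the cluster count that best fits our data.

First, we initialize two lists that store the cluster counts and the scores (the x and y of the plot):

n_clusters = []
scores = []

Next, we import sklearn's KMeans and silhouette_score.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

Then, for each of the four preselected cluster counts, we fit a KMeans model with n clusters and append its silhouette score to the list.

for n in [3, 4, 5, 6]:
    kmeans = KMeans(n_clusters=n)
    kmeans.fit(X)
    scores.append(silhouette_score(X, kmeans.predict(X)))
    n_clusters.append(n)

From here, I just hit "Commit" on Kaggle and let the program run by itself, which takes a few hours.

The result: three clusters performed best, with the highest silhouette score.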
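The selection step can be tried quickly on synthetic data. The sketch below is an illustration, not the author's experiment: it generates three well-separated blobs with `make_blobs` (the centers, sample count, and random seeds are arbitrary choices) and picks the cluster count with the highest silhouette score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three tight, well-separated groups of points (synthetic stand-in for the real data).
centers = [(0, 0), (10, 10), (-10, 10)]
X_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

candidates = [3, 4, 5, 6]
demo_scores = []
for n in candidates:
    labels = KMeans(n_clusters=n, n_init=10, random_state=0).fit_predict(X_demo)
    demo_scores.append(silhouette_score(X_demo, labels))

# The best cluster count is the one with the highest silhouette score.
best_n = candidates[int(np.argmax(demo_scores))]
print(best_n)
```

The silhouette score rewards clusters that are compact and far from each other, so splitting a natural group into smaller pieces lowers it.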

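Now that we have text labels, we can start clustering the movies as a whole. First, though, we need to take a few steps to clean the data.

For example, Release Year starts at 1900. Taken as literal integers, such large values would confuse the model. Instead, we create an Age column holding the movie's age: 2017 (the year of the newest movie in the dataset) minus the release year.

data['Age'] = data['Release Year'].apply(lambda x: 2017 - x)

Now Age starts at 0, which is more meaningful.

The Origin/Ethnicity column is important, since a story's style can often be traced to where it comes from. But the column is categorical, taking values such as ['American', 'Telugu', 'Chinese']. To make it machine-readable, we need to one-hot encode it, which we do with sklearn's OneHotEncoder.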

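import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
nation = enc.fit_transform(np.array(data['Origin/Ethnicity']).reshape(-1, 1)).toarray()

nation now holds the one-hot encoded values for every row. Each position in a row corresponds to one unique value; for example, the first column (the first position of every row) stands for 'American'.

At the moment, though, it's just an array, so we need to create columns in data to actually carry the information over. We name each column after the country it represents (enc.categories_[0] returns the array of original categories, and nation[:, i] selects the i-th value of every row).

for i in range(len(nation[0])):
    data[enc.categories_[0][i]] = nation[:, i]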
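For comparison, pandas can do the encoding and the column creation in one step with `pd.get_dummies`. This is an alternative technique, not what the article uses, and the small `demo` frame below is made-up data for illustration.

```python
import pandas as pd

demo = pd.DataFrame(
    {"Origin/Ethnicity": ["American", "Telugu", "Chinese", "American"]}
)

# One indicator column per unique value, named after that value.
one_hot = pd.get_dummies(demo["Origin/Ethnicity"], dtype=float)

# Attach the indicator columns to the original frame.
demo = pd.concat([demo, one_hot], axis=1)
print(demo)
```

OneHotEncoder remains the better fit when the same encoding has to be reapplied to new data later, since the fitted encoder remembers its categories.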

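We've successfully added each story's country to our data. Next, we do the same for genre. Genre matters more than country, because it conveys information about a story's content that a machine learning model can't easily pick up on its own.

There's one problem, though:

data['Genre'].value_counts()

It seems many genres are unknown. Don't worry; we'll deal with that later. For now, our goal is to one-hot encode the genres. We follow the same procedure as above with a small change: many genres are treated as distinct merely because their names differ (e.g. "drama comedy" and "romantic comedy") even though they're essentially the same. So we keep only the 20 most popular genres; the rest will later be imputed as one of those 20.

# Take the 21 most frequent genres; 'unknown' is removed below, leaving 20.
top_genres = data['Genre'].value_counts().head(21).index.tolist()
top_genres.remove('unknown')

Note that we remove 'unknown' from the list, which is why we took 21 genres to begin with. Next, we process the genres according to top_genres: any genre not among the 20 most popular is replaced with the string 'unknown'.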

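def process(genre):
    # Keep popular genres; everything else becomes 'unknown'.
    if genre in top_genres:
        return genre
    else:
        return 'unknown'

data['Genre'] = data['Genre'].apply(process)

Then, as before, we create a one-hot encoder instance and store the transformed result in the variable genres.

enc1 = OneHotEncoder(handle_unknown='ignore')
genres = enc1.fit_transform(np.array(data['Genre']).reshape(-1, 1)).toarray()

To merge this array into data, we again create columns, each filled from one column of the array.

for i in range(len(genres[0])):
    data[enc1.categories_[0][i]] = genres[:, i]

Our data is now one-hot encoded, but the unknown values remain a problem. With everything encoded, we know that the rows where the unknown column equals 1 are the ones whose genre needs to be imputed. For each of those rows, we replace the genre values with nan so that the KNN imputer we use later recognizes them as missing.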

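genre_columns = ['action', 'adventure', 'animation', 'comedy', 'comedy, drama', 'crime',
                 'crime drama', 'drama', 'film noir', 'horror', 'musical', 'mystery',
                 'romance', 'romantic comedy', 'sci-fi', 'science fiction', 'thriller',
                 'unknown', 'war', 'western']

# Mark every genre indicator as missing for the rows whose genre is unknown.
data.loc[data['unknown'] == 1, genre_columns] = np.nan

Now that all missing values are marked as missing, we can use the KNN imputer. Apart from the release year and country, though, we don't have much data to impute from. So we use TF-IDF to pick the top 30 words from the stories as additional information that helps KNN assign genres correctly.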

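We have to clean the text first, so we use a regular expression to remove all punctuation and convert all text to lowercase.

import re

data['Cleaned'] = data['Plot'].apply(lambda x: re.sub('[^A-Za-z0-9]+', ' ', str(x)).lower())

We use the standard English stop words and set the maximum number of features to 30. The cleaned, vectorized text is stored as an array in the variable X.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features=30)
X = vectorizer.fit_transform(data['Cleaned']).toarray()

As before, we transfer each column of X into a column of our data, naming each column after the word it corresponds to.

# get_feature_names_out() (scikit-learn >= 1.0) returns the words in the same order as the columns of X;
# the keys of vocabulary_ are not guaranteed to be in column order.
keys = vectorizer.get_feature_names_out()
for i in range(len(keys)):
    data[keys[i]] = X[:, i]

These words provide extra context that helps with imputing genres. Finally, let's impute the genres!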

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)

column_list = ['Age', 'American', 'Assamese', 'Australian', 'Bangladeshi', 'Bengali', 'Bollywood', 'British', 'Canadian', 'Chinese', 'Egyptian', 'Filipino', 'Hong Kong', 'Japanese', 'Kannada', 'Malayalam', 'Malaysian', 'Maldivian', 'Marathi', 'Punjabi', 'Russian', 'South_Korean', 'Tamil', 'Telugu', 'Turkish', 'man', 'night', 'gets', 'film', 'house', 'takes', 'mother', 'son', 'finds', 'home', 'killed', 'tries', 'later', 'daughter', 'family', 'life', 'wife', 'new', 'away', 'time', 'police', 'father', 'friend', 'day', 'help', 'goes', 'love', 'tells', 'death', 'money', 'action', 'adventure', 'animation', 'comedy', 'comedy, drama', 'crime', 'crime drama', 'drama', 'film noir', 'horror', 'musical', 'mystery', 'romance', 'romantic comedy', 'sci-fi', 'science fiction', 'thriller', 'war', 'western']

imputed = imputer.fit_transform(data[column_list])

The imputer recognizes the missing np.nan values and automatically estimates the genre from the nearby country data, the words in the data, and the movie's age. The result is stored in an array. As usual, we transfer it back into data:

for i in range(len(column_list)):
    data[column_list[i]] = imputed[:, i]
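A toy example may help show what the imputer does with the nan values. In this sketch the three columns and four rows are invented for illustration; the last row has a known age but an unknown genre, and `KNNImputer` fills the gap by averaging the genre indicators of its two nearest neighbors.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Columns: age, is_western, is_comedy. The last row's genre is missing.
rows = np.array([
    [10.0, 1.0, 0.0],
    [12.0, 0.0, 1.0],
    [50.0, 0.0, 1.0],
    [11.0, np.nan, np.nan],
])

# Distances are computed on the columns both rows have, here only the age.
filled = KNNImputer(n_neighbors=2).fit_transform(rows)
print(filled[-1])  # [11.   0.5  0.5]
```

The two nearest rows by age are a western and a comedy, so the imputed movie comes out half western and half comedy, which is where fractional genre values come from.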

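After dropping the columns that have been one-hot encoded or are no longer needed, such as the unknown genre column and the categorical Genre variable…

# Director and Cast were already dropped during data cleaning.
data.drop(['Title', 'Release Year', 'Wiki Page', 'Origin/Ethnicity', 'unknown', 'Genre'], axis=1, inplace=True)

…the data is ready, with no missing values. Another interesting aspect of KNN imputation is that it can produce fractional values: a movie can be, say, 20% western, with the remainder belonging to one or more other genres.

These features all work well for clustering. Combined with the cluster labels obtained earlier, they should give a good indication of how much a user will enjoy a story. Finally, we cluster; as before, we try 3, 4, 5, and 6 clusters and see which performs best.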

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

Xcluster = data.drop(['Plot', 'Summary', 'Cleaned'], axis=1)

scores = []
for i in [3, 4, 5, 6]:
    kmeans = KMeans(n_clusters=i)
    prediction = kmeans.fit_predict(Xcluster)
    score = silhouette_score(Xcluster, prediction)
    scores.append(score)  # keep the list and the single score in separate variables

Plotting the scores…

As before, three clusters performed best, with the highest score. So we train KMeans with three clusters only:

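from sklearn.cluster import KMeans

Xcluster = data.drop(['Plot', 'Summary', 'Cleaned'], axis=1)

kmeans = KMeans(n_clusters=3)
kmeans.fit(Xcluster)

pd.Series(kmeans.predict(Xcluster)).value_counts()

It's best if each cluster holds a similar number of movies. We can get the cluster centers through the .cluster_centers_ attribute:

centers = kmeans.cluster_centers_
centers

First, we assign a label to each item.

Xcluster['Label'] = kmeans.labels_

For each cluster, we want to find the data point with the smallest Euclidean distance to the cluster center; that point best represents the whole cluster. The distance between two points p and q is the square root of the sum of the squared differences of their corresponding coordinates:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

Because Euclidean distance is the L2 norm, it can be computed with numpy's linear algebra function np.linalg.norm(a - b).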
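As a small illustration, the distance from every point to a center can be computed in one call by passing `axis=1` to `np.linalg.norm`. The points and the center below are made-up values, not rows from the dataset.

```python
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 5.0]])
center = np.array([1.0, 2.0])

# Row-wise L2 norm of the differences: one Euclidean distance per point.
distances = np.linalg.norm(points - center, axis=1)

# Index of the point closest to the center.
closest = int(np.argmin(distances))
print(distances, closest)
```

Here the second point lies at distance 1.0 from the center and is therefore the most representative of the three.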

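Let's look at the complete code, which finds the story with the smallest Euclidean distance to each cluster center.

for cluster in [0, 1, 2]:
    # Rows of the current cluster, without the Label column.
    subset = Xcluster[Xcluster['Label'] == cluster].drop(['Label'], axis=1)
    indexes = subset.index
    subset = subset.reset_index(drop=True)
    center = centers[cluster]
    scores = {'Index': [], 'Distance': []}

The code above initializes the search. First, we store the stories whose label matches the cluster we're currently searching, and remove Label from that subset. To keep the original indexes for later reference, we save them in the variable indexes. Next, we reset the index on the subset so that positional lookups work properly. Then we select the center of the current cluster and initialize a dictionary with two entries: one list holding the story indexes from the main dataset, and another holding the scores/distances.

    for index in range(len(subset)):
        scores['Index'].append(indexes[index])
        scores['Distance'].append(np.linalg.norm(center - np.array(subset.loc[index])))

This code iterates over every row of the subset, records the current index, and computes and records its distance from the center.

    scores = pd.DataFrame(scores)
    print('Cluster', cluster, ':', scores[scores['Distance'] == scores['Distance'].min()]['Index'].tolist())

This code converts the scores into a pandas DataFrame for analysis and prints the indexes of the stories closest to the center.

It seems that in the first cluster four stories tie for the smallest Euclidean distance, while clusters 1 and 2 each have a single story.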

Cluster 0:

data.loc[4114]['Summary']

Output:

'On a neutral island in the Pacific called Shadow Island (above the island of Formosa), run by American gangster Lucky Kamber, both sides in World War II attempt to control the secret of element 722, which can be used to create synthetic aviation fuel.'

Cluster 1:

data.loc[15176]['Summary']

Output:

'Jake Rodgers (Cedric the Entertainer) wakes up near a dead body. Freaked out, he is picked up by Diane.'

Cluster 2:

data.loc[9761]['Summary']

Output:

'Jewel thief Jack Rhodes, a.k.a. "Jack of Diamonds", is masterminding a heist of $30 million worth of uncut gems. He also has his eye on lovely Gillian Bromley, who becomes a part of the gang he is forming to pull off the daring robbery. However, Chief Inspector Cyril Willis from Scotland Yard is blackmailing Gillian, threatening her with prosecution on another theft if she doesn\'t cooperate in helping him bag the elusive Rhodes, the last jewel in his crown before the Chief Inspector formally retires from duty.'

Great! We now have the three most representative plots. A human may not see what sets them apart, but to the machine learning model they provide a wealth of information, ready for use.

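Recommendation engine

The recommendation engine here is simply a machine learning model that predicts which movie plots the user is more likely to rate highly. It takes in a movie's features, such as its age and country, along with the TF-IDF vectorized summary, up to a maximum of 100 features.

The target for each movie plot is 1 or 0. After training on the data (the stories the user has rated), the model predicts the probability that the user will rate a story favorably. It then recommends the story most likely to be enjoyed, records the user's rating of it, and adds that story to the training data.

For the training data, we simply use each movie's attributes in data.

We'll want a decision tree classifier, because it makes effective predictions, trains quickly, and produces high-variance solutions, which is exactly what a recommender system is after.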
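The prediction step can be sketched on toy data as follows. The feature rows, the ratings, and the helper `like_probability` are hypothetical and only illustrate the idea. One detail worth handling: if the user has rated every story so far the same way, the fitted tree knows a single class and `predict_proba` returns just one column, so the helper looks up the column for class 1 through `classes_` instead of assuming it is at position 1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def like_probability(model, features):
    """Probability of class 1 (like) for each row, even if the model
    was trained on ratings that contain only one class."""
    proba = model.predict_proba(features)
    classes = list(model.classes_)
    if 1 not in classes:
        return np.zeros(len(features))
    return proba[:, classes.index(1)]


# Hypothetical features: age, is_western, is_comedy.
X_rated = np.array([[5, 1, 0], [40, 0, 1], [7, 1, 0]])
y_rated = [1, 0, 1]  # the user liked the two recent westerns
model = DecisionTreeClassifier(random_state=0).fit(X_rated, y_rated)

X_unseen = np.array([[6, 1, 0], [45, 0, 1]])
print(like_probability(model, X_unseen))
```

With so little training data, a decision tree tends to return probabilities of exactly 0 or 1, which is consistent with many movies starting out with the same predicted probability.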

Putting it all together

First, we collect the user's ratings of the three most representative movies. The program makes sure that every answer is 0 or 1.

import time

starting = []

print("Indicate if like (1) or dislike (0) the following three story snapshots.")

print("\n> > > 1 < < <")
print('On a neutral island in the Pacific called Shadow Island (above the island of Formosa), run by American gangster Lucky Kamber, both sides in World War II attempt to control the secret of element 722, which can be used to create synthetic aviation fuel.')
time.sleep(0.5)  # Kaggle sometimes has a glitch with inputs
while True:
    response = input(':: ')
    try:
        if int(response) == 0 or int(response) == 1:
            starting.append(int(response))
            break
        else:
            print('Invalid input. Try again')
    except ValueError:  # the reply was not a number
        print('Invalid input. Try again')

print('\n> > > 2 < < <')
print('Jake Rodgers (Cedric the Entertainer) wakes up near a dead body. Freaked out, he is picked up by Diane.')
time.sleep(0.5)  # Kaggle sometimes has a glitch with inputs
while True:
    response = input(':: ')
    try:
        if int(response) == 0 or int(response) == 1:
            starting.append(int(response))
            break
        else:
            print('Invalid input. Try again')
    except ValueError:
        print('Invalid input. Try again')

print('\n> > > 3 < < <')
print("Jewel thief Jack Rhodes, a.k.a. 'Jack of Diamonds', is masterminding a heist of $30 million worth of uncut gems. He also has his eye on lovely Gillian Bromley, who becomes a part of the gang he is forming to pull off the daring robbery. However, Chief Inspector Cyril Willis from Scotland Yard is blackmailing Gillian, threatening her with prosecution on another theft if she doesn't cooperate in helping him bag the elusive Rhodes, the last jewel in his crown before the Chief Inspector formally retires from duty.")
time.sleep(0.5)  # Kaggle sometimes has a glitch with inputs
while True:
    response = input(':: ')
    try:
        if int(response) == 0 or int(response) == 1:
            starting.append(int(response))
            break
        else:
            print('Invalid input. Try again')
    except ValueError:
        print('Invalid input. Try again')

The code above works fine. Next, we store the data in a training DataFrame and then remove those rows from data.

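# The row order matches the order in which the three stories were rated above.
X = data.loc[[4114, 15176, 9761]].drop(['Plot', 'Summary', 'Cleaned'], axis=1)
y = starting
data.drop([4114, 15176, 9761], inplace=True)

Now we create a loop. On each pass, we train a decision tree classifier on the current training set.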

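from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle
from tqdm import tqdm

subset = data.drop(['Plot', 'Summary', 'Cleaned'], axis=1)

while True:
    dec = DecisionTreeClassifier().fit(X, y)

Then we predict a probability for each index in the data.

    dic = {'Index': [], 'Probability': []}
    subdf = shuffle(subset).head(10_000)  # select about 1/3 of data
    for index in tqdm(subdf.index.values):
        dic['Index'].append(index)
        # Column 1 is the probability of a like; it only exists once y contains both 0 and 1.
        dic['Probability'].append(dec.predict_proba(np.array(subdf.loc[index]).reshape(1, -1))[0][1])
    dic = pd.DataFrame(dic)

To keep selection fast, we shuffle the data and take the first 10,000 rows, roughly one third of the data. This code stores the indexes and their probabilities in a DataFrame.

At first, many movies will have a probability of 1, but as we go on and the model learns, it starts making more refined choices.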

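    # Pick the row with the highest predicted probability (the first one in case of ties).
    index = dic.loc[dic['Probability'].idxmax(), 'Index']

We store the index of the movie the user is most likely to enjoy in the variable index.

Next, we need to fetch the information for that index from data and display it.

    print('> > > Would you be interested in this snippet from a story? (1/0/-1 to quit) < < <')
    print(data.loc[index]['Summary'])
    time.sleep(0.5)

Then we validate that the user's input is 0, 1, or -1 (quit):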

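    while True:
        response = input(':: ')
        try:
            if int(response) in (0, 1, -1):  # -1 must be accepted so the user can quit
                response = int(response)
                break
            else:
                print('Invalid input. Try again')
        except ValueError:
            print('Invalid input. Try again')

…and we can start adding to the training data. First, though, we must let the user end the loop when they want to quit.

    if response == -1:
        break

Whether the user likes the movie or not, we add it to the training data (only the target differs):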

    X = pd.concat([X, pd.DataFrame(data.loc[index].drop(['Plot', 'Summary', 'Cleaned'])).T])

Finally, if the response is 0, we append 0 to y, which means the user doesn't want to hear this story.

    if response == 0:
        y.append(0)

If the user likes the snippet, the program prints the full story.

    else:
        print('\n> > > Printing full story. < < <')
        print(data.loc[index]['Plot'])
        time.sleep(2)
        print("\n> > > Did you enjoy this story? (1/0) < < <")

We collect the user's input again and make sure it is 0 or 1.

        while True:
            response = input(':: ')
            try:
                if int(response) == 0 or int(response) == 1:
                    response = int(response)
                    break
                else:
                    print('Invalid input. Try again')
            except ValueError:
                print('Invalid input. Try again')

…and append 0 or 1 to y accordingly.

        if response == 1:
            y.append(1)
        else:
            y.append(0)

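Finally, we remove the story from the data, because the user doesn't want to see the same story twice.

    data.drop(index, inplace=True)
    subset.drop(index, inplace=True)  # also remove it from the candidate pool so it cannot be picked again

All done! Each iteration updates the training data, and the model's accuracy should keep improving.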
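The input-validation loop is repeated several times in the program above. One possible tidy-up, shown here as a sketch rather than as part of the original program, is to factor it into a helper. The name `ask_choice` and its `read` parameter are hypothetical; `read` defaults to the built-in `input` and exists only so the function can be exercised without a keyboard.

```python
def ask_choice(prompt, allowed, read=input):
    """Keep asking until the reply parses as an integer found in `allowed`."""
    while True:
        reply = read(prompt)
        try:
            value = int(reply)
        except ValueError:
            # The reply was not a number at all.
            print("Invalid input. Try again")
            continue
        if value in allowed:
            return value
        # A number, but not one of the accepted answers.
        print("Invalid input. Try again")


# Example with scripted replies instead of real keyboard input.
scripted = iter(["yes", "7", "1"])
answer = ask_choice(":: ", (0, 1), read=lambda _prompt: next(scripted))
print(answer)  # 1
```

With such a helper, the snippet question could be asked as `ask_choice(':: ', (0, 1, -1))` and the follow-up question as `ask_choice(':: ', (0, 1))`.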

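Thanks for reading!

I hope you enjoyed this article! You can use this program to read some interesting stories, or to see which movies the plots come from. During quarantine, working through data problems and puzzles is a lot of fun and can bring us a bit of joy.

If you'd like to try the program, you can get it here: https://www.kaggle.com/washingtongold/tell-me-a-story-1-2?scriptVersionId=31773396

Original article: https://towardsdatascience.com/tell-me-a-story-ai-one-that-i-like-4c0bc60f46ae

This article was translated by CSDN. Please credit the source when reposting.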
