Read: Rated aspect summarization of short comments (WWW 2009)

Rated aspect summarization of short comments

Overview

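Another one from eBay Research Labs.
Given a product, its ratings, and its comments, the method summarizes the comments per aspect and also produces a rating for each aspect. Concretely, the output is a list of aspects, each with its own rating and a few representative phrases.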

Notation

First, let $t \in T$ be a comment on a given product, where $T$ is the set of comments. Each comment carries an overall rating $r(t)$.
A comment $t$ is represented as a set of phrases $f$. Each phrase $f = (w_m, w_h)$ is a pair of a modifier $w_m$ and a head term $w_h$. For example, if $f = "good\,price"$, then $w_h = "price"$ and $w_m = "good"$.
We also consider aspect clusters $A_i = \{w_h|A(w_h) = i\}$; that is, the head terms are clustered. In addition, let $R(A_i)$ denote the rating of aspect $A_i$.
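
To make the notation concrete, here is a minimal sketch of how the objects could be represented in Python. The type names (`Phrase`, `Comment`), the toy comments, and the aspect assignment `A` are all made up for illustration and do not come from the paper.

```python
from collections import namedtuple

# A phrase f = (w_m, w_h): modifier first, head term second.
Phrase = namedtuple("Phrase", ["modifier", "head"])
# A comment t: its set of phrases plus the overall rating r(t).
Comment = namedtuple("Comment", ["phrases", "rating"])

# Toy comment set T (invented data).
T = [
    Comment([Phrase("good", "price"), Phrase("fast", "shipping")], 5),
    Comment([Phrase("slow", "delivery")], 2),
]

# Hypothetical aspect assignment A(w_h): head term -> cluster index.
A = {"price": 0, "shipping": 1, "delivery": 1}


def aspect_cluster(A, i):
    """Return A_i = {w_h | A(w_h) = i}."""
    return {w_h for w_h, j in A.items() if j == i}


shipping_cluster = aspect_cluster(A, 1)
```

Here `shipping_cluster` collects every head term assigned to aspect 1, which is exactly the set $A_1$ in the notation above.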

Algorithm

It consists of three steps.

Determine k aspects by clustering
Compute a rating for each aspect
Extract phrases that explain each aspect's rating

Written out like this, it is surprisingly naive. Each step is explained in order below.

Aspect Discovery and Clustering

The first goal is to cluster the head terms $w_h$. The paper proposes several methods.

K-means

Let $c(w_h, w_m^i)$ be the number of times $w_h$ co-occurs with $w_m^i$, build the feature vector $v(w_h) = (c(w_h, w_m^1), c(w_h, w_m^2), \cdots)$, and run K-means on these vectors.
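
A minimal sketch of this step, assuming raw co-occurrence counts as features, squared Euclidean distance, and plain Lloyd iterations with random initial centers. The function names and the toy phrase list are my own, and the paper may normalize or initialize differently.

```python
import random
from collections import Counter


def cooccurrence_vectors(phrases):
    """Build v(w_h) = (c(w_h, w_m^1), c(w_h, w_m^2), ...) from (w_m, w_h) pairs."""
    counts = Counter(phrases)
    modifiers = sorted({w_m for w_m, _ in phrases})
    heads = sorted({w_h for _, w_h in phrases})
    return {w_h: [counts[(w_m, w_h)] for w_m in modifiers] for w_h in heads}


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20, seed=0):
    """Lloyd's algorithm; returns A(w_h) as a dict head term -> cluster index."""
    rng = random.Random(seed)
    heads = sorted(vectors)
    # Initial centers are k randomly chosen head-term vectors.
    centers = [list(vectors[h]) for h in rng.sample(heads, k)]
    assign = {}
    for _ in range(iters):
        # Assignment step: nearest center for every head term.
        assign = {
            h: min(range(k), key=lambda j: sq_dist(vectors[h], centers[j]))
            for h in heads
        }
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [vectors[h] for h in heads if assign[h] == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


# Toy phrases (invented): two delivery-like and two price-like head terms.
phrases = [
    ("fast", "shipping"), ("fast", "delivery"),
    ("quick", "shipping"), ("quick", "delivery"),
    ("good", "price"), ("cheap", "price"),
    ("good", "value"), ("cheap", "value"),
]
assign = kmeans(cooccurrence_vectors(phrases), k=2)
```

Head terms that share modifiers end up with similar vectors, so "shipping" and "delivery" land in one cluster and "price" and "value" in the other.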

Unstructured pLSA

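Build $k$ unigram language models $\Theta = \{\theta_1, \cdots, \theta_k\}$ and model the head terms in comment $t$ as the mixture
$p_t(w_h) = \sum_{j=1}^k (\pi_{t,j}p(w_h|\theta_j))$ . The log-likelihood is then
$\log p(T|\Lambda) = \sum_{t \in T} \sum_{w_h \in V_h} \{ c(w_h,t) \times \log \sum_{j=1}^k ( \pi_{t,j} p(w_h|\theta_j) ) \}$
where $c(w_h,t)$ is the number of times $w_h$ occurs in $t$ and $\Lambda$ denotes all the parameters. The parameters are estimated with EM.
(I got tired of copying the update equations, so they are omitted here.)
After that, the head terms are clustered with $A(w_h) = \arg\max_{j} p(w_h|\theta_j)$.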
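
Since the update equations are not reproduced above, here is a minimal sketch that uses the textbook pLSA EM updates (compute responsibilities in the E-step, renormalize expected counts in the M-step). This is an assumption on my part and is not copied from the paper. The function names, the tiny smoothing constant, and the toy documents are also mine.

```python
import math
import random


def normalize(weights):
    """Scale a dict of non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {key: v / total for key, v in weights.items()}


def plsa(docs, k, iters=50, seed=0):
    """docs: dict t -> {w_h: c(w_h, t)}. Returns (pi, theta)."""
    rng = random.Random(seed)
    vocab = sorted({w for counts in docs.values() for w in counts})
    theta = [normalize({w: rng.random() + 0.1 for w in vocab}) for _ in range(k)]
    pi = {t: [1.0 / k] * k for t in docs}
    for _ in range(iters):
        # Start from a tiny constant so that no probability becomes exactly 0.
        new_theta = [{w: 1e-12 for w in vocab} for _ in range(k)]
        new_pi = {t: [1e-12] * k for t in docs}
        for t, counts in docs.items():
            for w, c in counts.items():
                # E-step: responsibility of aspect j for (t, w).
                joint = [pi[t][j] * theta[j][w] for j in range(k)]
                z = sum(joint)
                for j in range(k):
                    resp = joint[j] / z
                    new_theta[j][w] += c * resp
                    new_pi[t][j] += c * resp
        # M-step: renormalize the expected counts.
        theta = [normalize(th) for th in new_theta]
        pi = {t: [x / sum(row) for x in row] for t, row in new_pi.items()}
    return pi, theta


def log_likelihood(docs, pi, theta):
    """sum_t sum_w c(w, t) * log sum_j pi[t][j] * p(w | theta_j)."""
    k = len(theta)
    return sum(
        c * math.log(sum(pi[t][j] * theta[j][w] for j in range(k)))
        for t, counts in docs.items()
        for w, c in counts.items()
    )


def assign_aspects(theta):
    """A(w_h) = argmax_j p(w_h | theta_j)."""
    return {w: max(range(len(theta)), key=lambda j: theta[j][w]) for w in theta[0]}


# Toy comments (invented), already reduced to head-term counts.
docs = {"t1": {"price": 2, "value": 1}, "t2": {"shipping": 2, "delivery": 1}}
pi, theta = plsa(docs, k=2)
A = assign_aspects(theta)
```

EM only finds a local optimum, so the resulting clusters depend on the random initialization.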

Structured pLSA

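Similar to unstructured pLSA, but this time the head term and the modifier are considered together.
Let $d(w_m) = \{w_h|(w_m, w_h)\in T\}$ be the set of head terms that co-occur with modifier $w_m$, and define $p_{d(w_m)}(w_h) = \sum_{j=1}^k ( \pi_{d(w_m),j}p(w_h|\theta_j))$. The log-likelihood becomes
$\log p(V_m|\Lambda) = \sum_{w_m \in V_m} \sum_{w_h \in V_h} \{ c(w_h,d(w_m)) \times \log \sum_{j=1}^k ( \pi_{d(w_m),j} p(w_h|\theta_j) ) \}$
and the parameters are again estimated with EM.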
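
The only structural change from the unstructured version is what counts as a "document": each modifier $w_m$ gets one pseudo-document $d(w_m)$ made of the head terms it modifies. A minimal sketch of building the counts $c(w_h, d(w_m))$ follows. I assume that the count is the number of times the pair occurs across all comments. The function name and the toy phrases are mine.

```python
from collections import Counter, defaultdict


def modifier_documents(phrases):
    """Group head terms by modifier.

    phrases: iterable of (w_m, w_h) pairs from all comments.
    Returns dict w_m -> Counter mapping w_h to c(w_h, d(w_m)).
    """
    docs = defaultdict(Counter)
    for w_m, w_h in phrases:
        docs[w_m][w_h] += 1
    return dict(docs)


# Toy phrases (invented).
phrases = [
    ("fast", "shipping"), ("fast", "delivery"), ("fast", "shipping"),
    ("good", "price"), ("good", "value"),
]
docs = modifier_documents(phrases)
```

The resulting dictionary has the same shape as per-comment head-term counts, so any pLSA fitting routine that takes document-to-count mappings can be run on it unchanged.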

Incorporating Aspect Priors

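This is also a topic model.
It places a prior
$p(\Lambda) \propto \prod_{j=1}^k \prod_{w_h \in V_h} p(w_h|\theta_j)^{\gamma_j p(w_h|a_j)}$
on the language models and then does various things with it, but I did not really follow this part.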
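
One way to read a prior of this form is as a conjugate (Dirichlet-style) prior. Under MAP estimation, maximizing the expected counts plus $\gamma_j p(w_h|a_j) \log p(w_h|\theta_j)$ subject to normalization simply adds $\gamma_j p(w_h|a_j)$ as pseudo-counts in the M-step. The sketch below shows that update for a single aspect $j$. This is my interpretation rather than something taken from the notes above, and the function and argument names are hypothetical.

```python
def map_m_step(expected_counts, prior, gamma):
    """MAP update of p(w_h | theta_j) for one aspect j.

    expected_counts: w_h -> expected count of w_h under aspect j (from the E-step)
    prior:           w_h -> p(w_h | a_j), the aspect prior distribution
    gamma:           gamma_j, how strongly the prior is trusted
    """
    vocab = set(expected_counts) | set(prior)
    total = sum(expected_counts.values()) + gamma * sum(prior.values())
    return {
        w: (expected_counts.get(w, 0.0) + gamma * prior.get(w, 0.0)) / total
        for w in vocab
    }


# Toy numbers (invented): the prior says aspect j is about "price".
counts = {"price": 3.0, "shipping": 1.0}
prior = {"price": 1.0}
no_prior = map_m_step(counts, prior, gamma=0.0)
with_prior = map_m_step(counts, prior, gamma=4.0)
```

With `gamma=0` the update reduces to the ordinary maximum-likelihood M-step. A larger `gamma` pulls $\theta_j$ toward the given aspect description $a_j$.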

Aspect Rating Problem

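Next, ratings are assigned to the aspects based on the clustering result.
First, a rating $r(f)$ is predicted for each phrase $f$ using one of two methods. Then the predictions are averaged within each aspect cluster.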

Local Prediction

$r(f \in t) = r(t)$
That is, the rating given to comment $t$ is assigned to every phrase in $t$.
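
A minimal sketch of local prediction followed by the per-aspect average, assuming comments are given as (phrases, rating) pairs and `A` maps head terms to aspect indices. The function name and the toy data are made up.

```python
from collections import defaultdict


def local_aspect_ratings(comments, A):
    """Average the local predictions r(f) = r(t) within each aspect.

    comments: list of (phrases, r(t)), where phrases is a list of (w_m, w_h)
    A:        dict head term -> aspect index
    Returns dict aspect index -> R(A_i).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for phrases, rating in comments:
        for w_m, w_h in phrases:
            i = A[w_h]
            sums[i] += rating  # local prediction: the phrase inherits r(t)
            counts[i] += 1
    return {i: sums[i] / counts[i] for i in sums}


# Toy data (invented).
comments = [
    ([("good", "price"), ("slow", "shipping")], 4),
    ([("fast", "shipping")], 5),
    ([("high", "price")], 2),
]
A = {"price": 0, "shipping": 1}
R = local_aspect_ratings(comments, A)
```

Note that "slow shipping" inherits the 4 of its comment here even though the phrase itself is negative, which is the weakness the next method tries to address.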

Global Prediction

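This one is more careful. For each aspect, look at how likely each modifier is under each rating:
$p(w_m|A_i,r) = \frac{c(w_m, S(A_i, r))}{\sum_{w'_m \in V_m} c(w'_m, S(A_i,r))}$
$S(A_i, r) = \{f | f \in t,\,A(f)=i,\,r(t)=r\}$
(the paper describes this as a unigram language model as well). In other words, it estimates the rating $r$ under which the modifier is most likely.
Then set $r(f) = \arg\max_{r} \{p(w_m|A_i,r)|A(f) = i\}$ .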
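
A minimal sketch of global prediction, assuming unsmoothed maximum-likelihood estimates and first-seen tie-breaking. Both choices are mine, as are the function names and the toy data.

```python
from collections import Counter, defaultdict


def train_global(comments, A):
    """Estimate p(w_m | A_i, r) from modifier counts over S(A_i, r).

    comments: list of (phrases, r(t)), where phrases is a list of (w_m, w_h)
    A:        dict head term -> aspect index
    Returns dict (i, r) -> {w_m: probability}.
    """
    counts = defaultdict(Counter)
    for phrases, rating in comments:
        for w_m, w_h in phrases:
            counts[(A[w_h], rating)][w_m] += 1
    return {
        key: {w_m: c / sum(cnt.values()) for w_m, c in cnt.items()}
        for key, cnt in counts.items()
    }


def predict_rating(phrase, A, model):
    """r(f) = argmax_r p(w_m | A_i, r) with i = A(w_h)."""
    w_m, w_h = phrase
    i = A[w_h]
    candidates = {
        r: probs.get(w_m, 0.0) for (j, r), probs in model.items() if j == i
    }
    # On ties, max() keeps the first rating encountered.
    return max(candidates, key=candidates.get)


# Toy data (invented).
comments = [
    ([("fast", "shipping"), ("good", "price")], 5),
    ([("slow", "shipping"), ("good", "price")], 2),
    ([("fast", "shipping")], 5),
]
A = {"shipping": 1, "price": 0}
model = train_global(comments, A)
fast_rating = predict_rating(("fast", "shipping"), A, model)
slow_rating = predict_rating(("slow", "shipping"), A, model)
```

Averaging these per-phrase predictions within each aspect cluster then gives the aspect rating $R(A_i)$, just as with local prediction.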