Read "Classification in Graphs using Discriminative Random Walks" and implemented it in Ruby

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.139.6489
A paper that id:smly pointed me to.

Overview

The paper treats semi-supervised node classification in a graph and solves it with D-walks, a variant of random walks.

notation

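Input: a graph $G=(N,E)$, where $N=\{q_1,\cdots,q_n\}$ is the node set and $E$ is the edge set, with $m=|E|$ edges.
Let $L \subset N$ be the set of labeled nodes, $U = N \setminus L$ the set of unlabeled nodes, and $Y$ the set of labels.
For a node $q \in L$, let $y_q$ be its label, and define $L_y = \{q \in L \mid y_q = y\}$ and $n_y = |L_y|$.
Output: the label $y_q$ for each $q \in U$.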
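The random walk here is a random walk in the original sense, like a Markov chain, not the kind that propagates values over the graph. Its transition probability, that is, the probability of being at $q'$ at step $t+1$ given that the walk is at $q$ at step $t$, is
$P(X_{t+1} = q' | X_t = q) = P_{qq'} = \frac{a_{qq'}}{\sum_{q'' \in N}a_{qq''}}$
where $a_{qq''}$ is the weight of the edge from $q$ to $q''$.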
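The transition probability above is just a row normalization of the edge weights. A minimal sketch follows; the method name `transition_probs` and the nested-hash layout `{from => {to => weight}}` are assumptions made for this example and do not come from the paper.

```ruby
# Row-normalize edge weights into transition probabilities P[q][q2].
# `weights` maps a source node to a hash of {destination => weight};
# this layout is an assumption made for the example.
def transition_probs(weights)
  weights.each_with_object({}) do |(from, adj), probs|
    total = adj.values.sum.to_f
    probs[from] = adj.transform_values { |w| w / total }
  end
end

weights = { 1 => { 2 => 1, 3 => 3 }, 2 => { 1 => 2 } }
probs = transition_probs(weights)
puts probs[1][3] # 0.75, because 3 / (1 + 3)
```

Each row of the result sums to 1, so it can be used directly as $P_{qq'}$.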

D-walks and D-betweenness

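Next, define D-walks. A D-walk is a random walk of length $l$ steps whose start node and end node have the same label, and which does not reach any node with the start node's label before it ends.
Formally, for $l \geq 1$, it is a sequence $q_0,\cdots,q_l$ such that $y_{q_0} = y_{q_l}$ and $y_{q_0} \neq y_{q_t}$ for $0<t<l$.
Let $D_l^y$ be the set of D-walks of length $l$ starting from label $y$, and $D^y_{\leq L}$ the set of D-walks of length at most $L$ starting from label $y$ (here $L$ is the maximum walk length, not the labeled node set).
In addition, let $B_L(q,y)$ be the expected number of times a walk in $D^y_{\leq L}$ passes through node $q$; this is called the D-betweenness.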
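For this definition of betweenness, the original paper refers to [cond-mat/0309045] A measure of betweenness centrality based on random walks. The term and the author looked familiar, so I compared it with [cond-mat/0308217] Finding and evaluating community structure in networks, which I read for an earlier post on this blog ("I won't publish the slides I presented at last week's study group, but here are the papers I read", 糞ネット弁慶). The latter defines betweenness as the number of times an edge is traversed, so the two are close, in a sense.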
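The label condition that defines a D-walk in this section can be written as a small predicate. This is only a sketch: the name `d_walk?` and the `{node => label}` hash are assumptions for the example, unlabeled nodes are taken to map to `nil`, and whether consecutive nodes are actually connected is not checked.

```ruby
# Returns true when `walk` (an array of nodes q_0..q_l) satisfies the
# D-walk label condition for `labels` ({node => label}).
# Unlabeled nodes map to nil. Edge existence is not checked here.
def d_walk?(walk, labels)
  return false if walk.size < 2 # need l >= 1
  y = labels[walk.first]
  return false if y.nil? || labels[walk.last] != y
  # No node labeled y may appear strictly between start and end.
  walk[1...-1].none? { |q| labels[q] == y }
end

labels = { 1 => 1, 2 => 1, 3 => 1, 4 => 1, 6 => 2, 7 => 2 }
puts d_walk?([1, 5, 2], labels) # true: leaves class 1 and returns via node 5
puts d_walk?([1, 2, 3], labels) # false: node 2 has label 1 in the middle
```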

forward-backward

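All that remains is to compute this with forward-backward.
Let the forward variable $\alpha ^y(q,t)$ be the probability that a walk starting from label $y$ arrives at $q$ at step $t$ without passing through a node of $L_y$ in between. It satisfies
$\left\{\begin{array}{l}(t=1)\,\,\alpha ^y(q,1) = \sum_{q' \in L_y}\frac{1}{n_y} P_{q'q}\\(t \geq 2)\,\,\alpha ^y(q,t) = \sum_{q' \in N \setminus L_y}\alpha ^y(q',t-1) P_{q'q}\\\end{array}\right.$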
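The forward recursion can also be computed bottom-up, one step at a time, instead of by memoized recursion. The sketch below does this for the sample graph used later in this post; the method name `forward`, its argument order, and the `{from => {to => P}}` layout are assumptions for the example. It relies on `Array#to_h` with a block, so it assumes Ruby 2.6 or later.

```ruby
# alpha^y(q, t) for t = 1..max_t, computed iteratively.
# probs: {from => {to => P_from_to}} (layout assumed for this sketch)
# nodes: all nodes N;  ly: the nodes labeled y (L_y)
# Returns an array a with a[t][q] = alpha^y(q, t); a[0] is unused.
def forward(probs, nodes, ly, max_t)
  p_of = ->(from, to) { probs.fetch(from, {}).fetch(to, 0.0) }
  first = nodes.to_h { |q| [q, ly.sum(0.0) { |s| p_of.(s, q) } / ly.size] }
  alpha = [nil, first]
  2.upto(max_t) do |t|
    prev = alpha[t - 1]
    step = nodes.to_h do |q|
      # Only predecessors outside L_y may carry probability forward.
      [q, (nodes - ly).sum(0.0) { |s| prev[s] * p_of.(s, q) }]
    end
    alpha << step
  end
  alpha
end

# Sample graph from this post: undirected edges with unit weights.
edges = [[1, 2], [1, 3], [1, 4], [2, 3], [2, 4], [3, 4],
         [6, 7], [5, 6], [1, 5], [2, 5], [3, 5]]
adj = Hash.new { |h, k| h[k] = [] }
edges.each { |a, b| adj[a] << b; adj[b] << a }
nodes = adj.keys.sort
probs = nodes.to_h { |q| [q, adj[q].to_h { |r| [r, 1.0 / adj[q].size] }] }

alpha = forward(probs, nodes, [1, 2, 3, 4], 3)
puts alpha[1][5] # 0.1875   (= 3/16)
puts alpha[2][6] # 0.046875 (= 3/64)
```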
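Likewise, let the backward variable $\beta ^y(q,t)$ be the probability that a walk leaving $q$ first reaches a node labeled $y$ after exactly $t$ steps. It satisfies
$\left\{\begin{array}{l}(t=1)\,\,\beta ^y(q,1) = \sum_{q' \in L_y} P_{qq'}\\(t \geq 2)\,\,\beta ^y(q,t) = \sum_{q' \in N \setminus L_y}\beta ^y(q',t-1) P_{qq'}\\\end{array}\right.$
Unlike the forward case, the transitions go out of $q$ ($P_{qq'}$) and there is no $\frac{1}{n_y}$ factor, which is also what the beta method in the Ruby code below computes.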
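A bottom-up sketch of the backward variable, read as the probability of first reaching a node labeled $y$ exactly $t$ steps after leaving $q$, so every transition is taken out of $q$ ($P_{qq'}$) and $t=1$ has no $\frac{1}{n_y}$ factor. The method name `backward`, its argument order, and the three-node chain are assumptions chosen to keep the numbers easy to check by hand.

```ruby
# beta^y(q, t) for t = 1..max_t, computed iteratively.
# probs: {from => {to => P_from_to}} (layout assumed for this sketch)
# Returns an array b with b[t][q] = beta^y(q, t); b[0] is unused.
def backward(probs, nodes, ly, max_t)
  p_of = ->(from, to) { probs.fetch(from, {}).fetch(to, 0.0) }
  beta = [nil, nodes.to_h { |q| [q, ly.sum(0.0) { |s| p_of.(q, s) }] }]
  2.upto(max_t) do |t|
    prev = beta[t - 1]
    step = nodes.to_h do |q|
      # Step from q to a node outside L_y, then finish in t - 1 steps.
      [q, (nodes - ly).sum(0.0) { |s| p_of.(q, s) * prev[s] }]
    end
    beta << step
  end
  beta
end

# Hypothetical chain 1 - 2 - 3 with unit weights; only node 1 is labeled y.
probs = { 1 => { 2 => 1.0 }, 2 => { 1 => 0.5, 3 => 0.5 }, 3 => { 2 => 1.0 } }
beta = backward(probs, [1, 2, 3], [1], 3)
puts beta[1][2] # 0.5:  2 -> 1
puts beta[2][3] # 0.5:  3 -> 2 -> 1
puts beta[3][2] # 0.25: 2 -> 3 -> 2 -> 1
```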
Up to this point I can follow what is being said, but I don't know forward-backward at all, so I don't fully understand the next formula.
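Using the forward and backward variables, $B_L(q,y)$ can be written as
$B_L(q,y) = \frac{\sum_{l=1}^L \sum_{t=1}^{l-1} \alpha ^y(q,t)\beta ^y(q,l-t)}{\sum_{l=1}^L \sum_{q' \in L_y}\alpha ^y(q',l)}$
The numerator adds up, over every length $l$ and step $t$, the probability of a D-walk of length $l$ that is at $q$ at step $t$; the denominator is the total probability of all D-walks of length at most $L$.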
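One way to get a feel for this formula is to skip forward-backward entirely and enumerate D-walks directly. The sketch below does that by brute force: it follows every walk that starts in $L_y$, stops it at its first return to $L_y$, and averages the number of interior visits to $q$, weighted by walk probability. The method name `brute_force_betweenness` and the data layout are assumptions for this example, and the enumeration is exponential in the walk length, so it is only meant as a check on small graphs.

```ruby
# Brute-force D-betweenness: enumerate every D-walk of length <= max_len
# that starts in ly, and average the number of interior visits to q.
# probs: {from => {to => P_from_to}} (layout assumed for this sketch)
def brute_force_betweenness(probs, ly, q, max_len)
  visits = 0.0 # sum over D-walks of P(walk) * (interior visits to q)
  mass = 0.0   # sum over D-walks of P(walk)
  extend_walk = lambda do |node, prob, steps, hits|
    probs.fetch(node, {}).each do |nxt, pr|
      if ly.include?(nxt)       # the walk ends at its first return to L_y
        mass += prob * pr
        visits += prob * pr * hits
      elsif steps + 1 < max_len # room for at least one more step
        extend_walk.(nxt, prob * pr, steps + 1, hits + (nxt == q ? 1 : 0))
      end
    end
  end
  ly.each { |start| extend_walk.(start, 1.0 / ly.size, 0, 0) }
  visits / mass
end

# Sample graph from this post: undirected edges with unit weights.
edges = [[1, 2], [1, 3], [1, 4], [2, 3], [2, 4], [3, 4],
         [6, 7], [5, 6], [1, 5], [2, 5], [3, 5]]
adj = Hash.new { |h, k| h[k] = [] }
edges.each { |a, b| adj[a] << b; adj[b] << a }
probs = adj.to_h { |q, ns| [q, ns.to_h { |r| [r, 1.0 / ns.size] }] }

puts brute_force_betweenness(probs, [1, 2, 3, 4], 5, 3).round(4) # 0.1475
puts brute_force_betweenness(probs, [6, 7], 5, 3).round(4)       # 0.0769
```

For node 5 with $L=3$ this gives 9/61 for label 1 and 1/13 for label 2.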
Finally, assign each node the label with the highest betweenness, that is, $\hat{y}_q = \arg\max_{y \in Y}B_L(q,y)$, which gives the labels of the unlabeled nodes.
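In Ruby the arg max over labels is a one-liner with `Enumerable#max_by`. The scores below are made-up values for illustration only, not results from the paper or from the sample graph.

```ruby
# Illustrative D-betweenness scores {label => B_L(q, y)}; made-up values.
scores = { 1 => 0.31, 2 => 0.12, 3 => 0.27 }

# max_by yields [label, score] pairs and returns the pair with the top score.
label, score = scores.max_by { |_, b| b }
puts "predicted label: #{label} (B_L = #{score})"
```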

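Implementation in Ruby

It looked simple, so I implemented it in Ruby.

I had the indices wrong, so I rewrote it. With the previous version, every $B_L$ came out as NaN.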

# -*- coding: utf-8 -*-
class DWalks
  def initialize
    @edges = Hash.new{|h, k|h[k] = Hash.new{ }}
    @labels = { }
    @nodes = [ ]
    @labels_index = Hash.new{|h, k|h[k] = Array.new}
    @prob = Hash.new{0.0}
    @alpha = { }
    @beta = { }
  end

  # Input format (one directed edge per line):
  # node1, node2, weight
  # ex. 1,2,100
  def read_graph(file_name)
    open(file_name){|f|
      f.each{|l|
        from, to, weight = l.chomp.split(",").map{|e|e.to_i}
        @edges[from][to] = weight
        @nodes.push from
        @nodes.push to
      }
    }
    calc_prob
    @nodes.uniq!
  end

  # Input format (one labeled node per line):
  # node, label
  # ex. 1,3
  def read_label(file_name)
    open(file_name){|f|
      f.each{|l|
        node, label = l.chomp.split(",").map{|e|e.to_i}
        @labels[node] = label
        @labels_index[label].push node
      }
    }
  end

  # Precompute the whole table of transition probabilities
  def calc_prob
    @edges.each_pair do |from, adj|
      sum = adj.values.inject(0.0){|s, e|s += e}
      adj.each_pair do |to, weight|
        @prob[from => to] = weight / sum
      end
    end
  end

  # alpha and beta are computed recursively and memoized
  def alpha(y, q, t)
    if @alpha[[y ,q, t]].nil?
      if t == 1
        @alpha[[y, q, 1]] = @labels_index[y].inject(0.0){|s, e| s += @prob[e => q]} / @labels_index[y].size
      else
        @alpha[[y, q, t]] = (@nodes - @labels_index[y]).inject(0.0){|s, e| s+= alpha(y, e, t - 1) * @prob[e => q]}
      end
    end
    return @alpha[[y ,q, t]]
  end

  def beta(y, q, t)
    if @beta[[y, q, t]].nil?
      if t == 1
        @beta[[y, q, 1]] = @labels_index[y].inject(0.0){|s, e| s += @prob[q=> e]}
      else
        @beta[[y, q, t]] = (@nodes - @labels_index[y]).inject(0.0){|s, e| s+= beta(y, e, t - 1) * @prob[q => e]}
      end
    end
    return @beta[[y, q, t]]
  end

  # Estimate the label of node q using D-walks of length <= length
  def estimation(q, length)
    b = Hash.new

    # Exit if the node already has a label
    if !@labels[q].nil?
      puts "query: #{q} is #{@labels[q]}!"
      exit(1)
    end

    # Exit if the node is not in the graph at all
    if !@nodes.include?(q)
      puts "query: #{q} not found in graph!"
      exit(1)
    end

    @labels_index.each_pair do |label, nodes|
      denom = 0.0; numer = 0.0
      1.upto(length) do |l|
        # Denominator: total probability of D-walks of length l
        denom += nodes.inject(0.0){|s, e| s + alpha(label, e, l)}
        1.upto(l - 1) do |t|
          # Numerator: at q on step t of a D-walk of length l
          numer += alpha(label, q, t) * beta(label, q, l - t)
        end
      end
      b[label] = numer / denom
    end
    return b.to_a.sort{|x, y| y[1] <=> x[1]}[0]
  end
end

if __FILE__ == $0
  d = DWalks.new
  d.read_graph("./sample_graph.txt")
  d.read_label("./sample_label.txt")
  p d.estimation(5, 3)
end

That is the whole implementation. The following text files are used as input.
sample_graph.txt

2,1,1
1,2,1
1,3,1
3,1,1
1,4,1
4,1,1
2,3,1
3,2,1
4,2,1
2,4,1
3,4,1
4,3,1
6,7,1
7,6,1
5,6,1
6,5,1
1,5,1
5,1,1
2,5,1
5,2,1
3,5,1
5,3,1

sample_label.txt

1,1
2,1
3,1
4,1
6,2
7,2

Running it:

y_benjo@BENZA:~/memo% ruby sample_D_walks.rb
[1, 0.147540983606557]

Node 5 is successfully estimated to be class 1.
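As a check on the output above, the value can be reproduced by hand for $q=5$ and $L=3$. This worked sketch assumes the backward variable as the Ruby beta method computes it, $\beta^y(q,1) = \sum_{q' \in L_y} P_{qq'}$, and uses the fact that all weights are 1, so $P_{qq'}$ is one over the degree of $q$ (nodes 1, 2, 3 and 5 have degree 4, node 4 has degree 3, node 6 has degree 2, node 7 has degree 1).

```latex
% Label 1: L_1 = {1,2,3,4}, n_1 = 4
\begin{align*}
\alpha^1(5,1) &= \tfrac{1}{4}\left(P_{15}+P_{25}+P_{35}+P_{45}\right)
               = \tfrac{1}{4}\cdot\tfrac{3}{4} = \tfrac{3}{16},
  & \beta^1(5,1) &= P_{51}+P_{52}+P_{53} = \tfrac{3}{4},\\
\alpha^1(5,2) &= 0, & \beta^1(5,2) &= 0,\\
\text{numerator} &= \alpha^1(5,1)\,\beta^1(5,1) = \tfrac{9}{64},
  & \text{denominator} &= \tfrac{13}{16} + \tfrac{9}{64} + 0 = \tfrac{61}{64},\\
B_3(5,1) &= \tfrac{9}{61} \approx 0.1475.
\end{align*}

% Label 2: L_2 = {6,7}, n_2 = 2
\begin{align*}
\alpha^2(5,1) &= \tfrac{1}{2}\left(P_{65}+P_{75}\right) = \tfrac{1}{4},
  & \beta^2(5,1) &= P_{56}+P_{57} = \tfrac{1}{4},\\
\text{numerator} &= \tfrac{1}{16},
  & \text{denominator} &= \tfrac{3}{4} + \tfrac{1}{16} + 0 = \tfrac{13}{16},\\
B_3(5,2) &= \tfrac{1}{13} \approx 0.0769.
\end{align*}
```

Since $9/61 > 1/13$, label 1 wins, and $9/61 \approx 0.1475$ agrees with the printed score.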

