2019-01-08

IMMEDIATE_P

MRI

IMMEDIATE_P の実装を確認する。

https://github.com/ruby/ruby/blob/ruby_2_6/include/ruby/ruby.h#L400

#define RB_IMMEDIATE_P(x) ((VALUE)(x) & RUBY_IMMEDIATE_MASK)
#define IMMEDIATE_P(x) RB_IMMEDIATE_P(x)

RUBY_IMMEDIATE_MASK も確認する（64bit の場合）。

https://github.com/ruby/ruby/blob/ruby_2_6/include/ruby/ruby.h#L442

    RUBY_IMMEDIATE_MASK = 0x07,

と定義してある。つまり、0b0111 、x の下位3ビットをとってきて & を取った結果、0 でなければ IMMEDIATE である、という意味になる。これは、ポインタ値かどうかのチェックということもできる。下位 3 bit が 0 でなければ、ポインタの恐れがあると判断できる。

2019-01-08

SPECIAL_CONST_P 実装

MRI

さて、SPECIAL_CONST_P の実装の話。

#define RB_SPECIAL_CONST_P(x) (RB_IMMEDIATE_P(x) || !RB_TEST(x))
#define SPECIAL_CONST_P(x) RB_SPECIAL_CONST_P(x)

この条件で、メモリへのポインタか、そうでないかを分けることができる、といっている。どうやるのか。

ポインタの場合、MRI では必ず word aliment するようにしている。つまり、64bit CPU では 8 byte alignment。つまり、8 の倍数。そうすると、下位 3 bit は 0 である。RB_IMMEDIATE_P(x) の実装は、まさに下位 3 bit が 0 かどうかのチェックをしているので、IMMEDIATE_P だった場合、ポインタではない。

次、RB_TEST(x) では何をしているかというと、x が 0 か 8 という値か（64bit CPU の場合）、というチェックになっている。下位 3 bit は 0 だが、この2値だけは、絶対に Ruby オブジェクトじゃない、という前提で動いている。

ということで、この二つの条件にあてはまれば、オブジェクトが割り当てられたポインタではなく、あてはまらなければポインタだ、というように判断できる。

ちなみに、このチェックは大量に行うことになるので、効率的だとありがたい。

int
sp_const_p(VALUE x)
{
    return RB_SPECIAL_CONST_P(x);
}

こういう関数にしてみると、

# gcc
   0:   40 f6 c7 07             test   $0x7,%dil
   4:   b8 01 00 00 00          mov    $0x1,%eax
   9:   75 0c                   jne    17 <sp_const_p+0x17>
   b:   31 c0                   xor    %eax,%eax
   d:   48 f7 c7 f7 ff ff ff    test   $0xfffffffffffffff7,%rdi
  14:   0f 94 c0                sete   %al
  17:   f3 c3                   repz retq

# clang
  90:   40 f6 c7 07             test   $0x7,%dil
  94:   0f 95 c0                setne  %al
  97:   48 f7 c7 f7 ff ff ff    test   $0xfffffffffffffff7,%rdi
  9e:   0f 94 c1                sete   %cl
  a1:   08 c1                   or     %al,%cl
  a3:   0f b6 c1                movzbl %cl,%eax
  a6:   c3                      retq

このような結果になる。もうちょっと効率よくなるようにできるといいですねえ。

2019-01-07

SPECIAL_CONST_P

MRI

SPECIAL_CONST_P の実装を見てみる。

https://github.com/ruby/ruby/blob/ruby_2_6/include/ruby/ruby.h#L1297

#define RB_SPECIAL_CONST_P(x) (RB_IMMEDIATE_P(x) || !RB_TEST(x))
#define SPECIAL_CONST_P(x) RB_SPECIAL_CONST_P(x)

どうやら、x について、

RB_IMMEDIATE_P(x) である、もしくは
RB_TEST(x) ではない

ということらしい。ちなみに、RB_ prefix アリなしがあるのは、昔は RB_ 無ししかなかったけど、やっぱりあったほうがいいだろうって後から追加されたため（たしか）。MRI にはこういうのが結構ある。今は、RB_ 付きを使っておくのが良いと思う。

もうひとつちなみに、_P （もしくは、関数の場合は _p） suffix は、predicate、述語の意味で、Ruby だと special_const? みたいなメソッドの意味。C だと ? を識別子に使えないからね。

それぞれの定義も引用する。

https://github.com/ruby/ruby/blob/ruby_2_6/include/ruby/ruby.h#L400

#define RB_IMMEDIATE_P(x) ((VALUE)(x) & RUBY_IMMEDIATE_MASK)
#define IMMEDIATE_P(x) RB_IMMEDIATE_P(x)

https://github.com/ruby/ruby/blob/ruby_2_6/include/ruby/ruby.h#L478

#define RB_TEST(v) !(((VALUE)(v) & (VALUE)~RUBY_Qnil) == 0)

ビット計算が入ってきていて、微妙によくわからなくなっている。

実装はともかく、それらが何をしているかというと、

RB_IMMEDIATE_P(x) は、x が Fixnum、static Symbol、Flonum、true の時に真
RB_TEST(x) は、x が nil、false の時に真

ということで、前項で述べた special const なオブジェクトに対応する。

2019-01-07

VALUE: special const

MRI

Ruby から見えるすべてはオブジェクトである、という言説がある。実際、数値など 1 + 2 は 1.+(2) といったように、オブジェクトとメソッドの組み合わせで表現される。

オブジェクトをどのように表現するかは、処理系のデザインとしては難しい問題だ。一番ナイーブに作るには、オブジェクトを作成すると、すべてメモリを割り当てる、という手法が考えられる。実際、Array オブジェクトを生成すると、（GC 管理の）メモリを割り当てる。このとき、割り当てた配列オブジェクトへのポインタが VALUE の値である。すなわち、Array オブジェクトの場合、VALUE はポインタということになる。

ただ、数値、とくに整数値などはよく使うので、毎回割り当てるのはオーバヘッド的に大変だ。例えば、n.times{|i| ...} とすると、0～n までの整数が登場するが、毎回それを割り当てるのは大変そうだ。また、よく使う nil なんかも、毎回割り当てるのは変な気がする。そもそも、nil オブジェクトは1個だから、なんか特別なモノでも良い気がする。

VALUE は 32bit ないし 64bit の非負整数値である。ここに、いろいろな形でエンコーディングすることで、いくつかの種類のオブジェクトを表現している。

Fixnum (ある上限・下限をもつ整数値, 2.5 以降は Integer として独立しているが、処理系内部では異なる扱いをしている）
Flonum（ある上限・下限、精度の制限をもつ浮動小数点数。Ruby からは Float として、Flonum もそうじゃないものも、透過的に見えるが、処理系内部では異なる扱いをしている）
Symbol（static symbol と呼ばれるもの）
特殊な即値
- true
- false
- nil
- undef （Ruby レベルからは観測できない、処理系内で使う特殊な値）
メモリへのポインタ（上記以外のオブジェクト）

ポインタ以外を、ソースコード中では special const（特殊定数）と呼んでおり（多分）、SPECIAL_CONST_P(v) で special const かどうかを判断している。適当な名前だなあ。

2019-01-06

VALUE という型

MRI

（早速一日さぼってしまった）

MRI のソースコードでは至る所に出てくる VALUE という型が、Ruby のオブジェクトを表している。

https://github.com/ruby/ruby/blob/ruby_2_6/include/ruby/ruby.h#L94 この辺で定義しているが、やりたいことは、ポインタ型のサイズと同じ大きさの符合無し整数値。

この整数値は、次の２つの値が入る

特殊な値（即値 Immediate value ともいう）→ special const という用語だった
ポインタ

2019-01-04

ar_table の話（もっと良い提案を募集します）

MRI

Hash なのに Hash 値を使わない O(n) な Hash table

いきなり色々すっ飛ばして、hash.c に定義してある ar_table の話をする。

話の前提は Ruby 2.6 の改善を自慢したい - クックパッド開発者ブログにまとめた transient heap。この話は今後、ここでもちゃんとまとめたい。theap の話は、本記事には直接関係しない。

簡単に言うと、ar_table は、Hash を実装するためのデータ構造であり、8 要素に限定した連想配列。8要素以上であれば、st_table を使う（これも、今後紹介する）ようにするので、とりあえず 8 要素以下の Hash オブジェクトのための特殊なデータ構造、と考えて欲しい。線形探索を行うので、探索コストは O(n)。

Hash の要素は key, value の2要素、に加えて key のハッシュ値の3要素。

それを格納するのが ar_table_entry (https://github.com/ruby/ruby/blob/ruby_2_6/internal.h#L816)。

typedef struct ar_table_entry {
    VALUE hash;
    VALUE key;
    VALUE record;
} ar_table_entry;

VALUE はまだ紹介していないけど、オブジェクトへのポインタ。ちなみに、ハッシュ値を格納する hash が VALUE なのはバグで（実害はない）、本当は単なる数値であるべき。trunk では直っている。value が record なのは、単なる趣味だと思う（元にした実装があるんです、多分）。

ar_table は、見てわかるように ar_table_entry の配列 (https://github.com/ruby/ruby/blob/ruby_2_6/internal.h#L822)。

typedef struct ar_table_struct {
    ar_table_entry entries[RHASH_AR_TABLE_MAX_SIZE];
} ar_table;

凄い古典的な連想リストとして実装。探索 (lookup) を疑似 ruby コードで実装すると、こんな感じ：

def lookup(table, key)
  hash_value = key.hash

  table.size.times do |i|
    if table.entries[i].hash == hash_value &&
       table.entries[i].key == key
      return table.entries[i].record
  　end
  end
  raise not_found
end

要するに、最初から辿っていく、というもの。挿入 (insert) は、空いてるエントリに key, value, ハッシュ値をセットするだけの、そのまんまなので省略。

削除 (delete) はちょっとだけ考えなければならなくて、

compaction する（後ろにあるエントリを詰める。削除時、ちょっと時間がかかるかもしれないが、8要素フルに使える）
skip エントリを入れておく（消されたのでスキップとする。ちょっとエントリが無駄になる）

の2つの選択肢がある。MRI 2.6 では、後者 skip エントリを入れるという実装になっている。今考えてみると、前者でも良かったかもしれない（delete よりも lookup のほうが圧倒的に多そうだし。実装が圧倒的に簡単になるし...。今から変えようかな。

なお、skip エントリかどうかは、ar_table_entry::hash に RESERVED_HASH_VAL が入っているかどうかでチェックしている(https://github.com/ruby/ruby/blob/ruby_2_6/hash.c#L349)。

insert は翻って、8要素全部使っているが skip が含まれているなら compaction してから insert を行う (https://github.com/ruby/ruby/blob/ruby_2_6/hash.c#L677)。

これでアルゴリズムの詳細の説明が終わった。Hash テーブルといいながら、8要素以下であれば、単なる線形探索を行う O(n)。ただし、n <= 8 なので、探索時間なんてたいしてかからないだろう、というのが、実装当初の意図。ただ、本当に O(n) でいいの？

Hash なんだから、Hash 値を使って O(1) にしよう

たかだか8要素、されど8要素、ということで、8要素のハッシュ表をオープンアドレッシングで教科書通り作ってみる。

用語があやふやなのでハッシュテーブル - Wikipedia の用語を利用してみる。

ルート配列を用意しなければならない。そこで、8 エントリを 7 エントリにすることで、その空いた領域をルート配列にしよう。64bit CPU の場合は 8 x 3 = 24B だった。さて、ルート配列をどれくらい用意しよう。

7 要素なので、index は 3 bit あれば足りる。キリが良く 4 bit にしておこう。どれくらいのルート配列の長さがあれば良いか。とりあえずエントリ数の2倍 14、ちょっとキリが悪いので、16 エントリにしてみよう。すると、4bit * 16 = 64bit = 8B。まぁ、こんなもんだろう。8B にルート配列を突っ込めば良さそう。

と思って作ったのがこちら https://github.com/ko1/ruby/compare/hash_ar

さて、どれくらい速くなったかな、わくわく、ということで、ベンチマークを用意してみた。

require_relative 'benchmark/lib/load'

Benchmark.driver(repeat_count: 5){|x|
  x.executable name: 'clean', command: %w'../clean-trunk/miniruby'
  x.executable name: 'modify', command: %w'./miniruby'

  10.times{|i|
    x.prelude %Q{h#{i} = { #{i.times.map{|j| "#{j+1} => #{j+1}, "}.join } }}
  }

  [100, :symbol, 's', 'short_str', 'long ' * 100 + 'str'].each{|key|
    10.times{|i|
      x.report "x = h#{i}[#{key.inspect.to_s[0..10]}]", %Q{x = h#{i}[#{key.inspect}];}
    }
  }
  [100, :symbol, 's', 'short_str', 'long ' * 100 + 'str'].each{|key|
    10.times{|i|
      x.report "h#{i}[#{key.inspect.to_s[0..10]}] = true", %Q{h#{i}[#{key.inspect}] = true;}
    }
  }
}

何をやっているかというと、要素数が 0～9 のハッシュ h0～h9 までを用意し、それぞれに対して、いろいろなキーで (1) 表引き（x = h0[100] のように。ただし、エントリは存在しない）、(2) セット（h0[100] = true のように。2度目以降は存在するキーに対してセット）を行う。

clean がパッチ適用前の MRI trunk で、modify が適用後。

実行結果（全データ、5回計測し、最良値）：

# 表から取得（存在しないキーについて）
                            clean      modify
           x = h0[100]    71.382M     68.264M i/s -    111.733M times in 1.565281s 1.636780s
           x = h1[100]    64.569M     65.207M i/s -    103.362M times in 1.600807s 1.585127s
           x = h2[100]    58.040M     70.716M i/s -    100.978M times in 1.739816s 1.427940s
           x = h3[100]    57.785M     64.796M i/s -     99.286M times in 1.718206s 1.532288s
           x = h4[100]    53.154M     65.509M i/s -     95.571M times in 1.798011s 1.458906s
           x = h5[100]    52.951M     64.176M i/s -     94.424M times in 1.783222s 1.471311s
           x = h6[100]    50.729M     63.956M i/s -     94.091M times in 1.854785s 1.471170s
           x = h7[100]    52.343M     64.688M i/s -     94.679M times in 1.808842s 1.463625s
           x = h8[100]    48.283M     59.181M i/s -     80.856M times in 1.674611s 1.366249s
           x = h9[100]    59.493M     65.898M i/s -    103.146M times in 1.733747s 1.565252s
       x = h0[:symbol]    71.649M     76.317M i/s -    111.117M times in 1.550855s 1.455993s
       x = h1[:symbol]    66.340M     72.472M i/s -    106.463M times in 1.604811s 1.469017s
       x = h2[:symbol]    61.781M     67.467M i/s -    104.569M times in 1.692584s 1.549918s
       x = h3[:symbol]    59.895M     67.812M i/s -    101.895M times in 1.701233s 1.502618s
       x = h4[:symbol]    58.011M     68.718M i/s -     79.001M times in 1.361837s 1.149642s
       x = h5[:symbol]    54.835M     67.058M i/s -     83.361M times in 1.520197s 1.243109s
       x = h6[:symbol]    51.676M     69.619M i/s -     95.163M times in 1.841535s 1.366927s
       x = h7[:symbol]    51.936M     66.584M i/s -     91.941M times in 1.770270s 1.380817s
       x = h8[:symbol]    52.347M     63.100M i/s -     94.175M times in 1.799060s 1.492460s
       x = h9[:symbol]    61.492M     61.434M i/s -     97.341M times in 1.582983s 1.584480s
           x = h0["s"]    52.246M     47.940M i/s -     90.479M times in 1.731800s 1.887331s
           x = h1["s"]    45.926M     47.301M i/s -     66.259M times in 1.442724s 1.400785s
           x = h2["s"]    44.499M     47.372M i/s -     85.578M times in 1.923133s 1.806515s
           x = h3["s"]    45.103M     47.777M i/s -     85.701M times in 1.900123s 1.793763s
           x = h4["s"]    43.234M     47.875M i/s -     74.151M times in 1.715084s 1.548830s
           x = h5["s"]    42.249M     47.268M i/s -     79.976M times in 1.892973s 1.691973s
           x = h6["s"]    41.281M     47.383M i/s -     81.115M times in 1.964935s 1.711905s
           x = h7["s"]    40.827M     43.986M i/s -     58.734M times in 1.438612s 1.335291s
           x = h8["s"]    39.907M     43.387M i/s -     79.366M times in 1.988796s 1.829248s
           x = h9["s"]    41.744M     42.676M i/s -     83.145M times in 1.991773s 1.948295s
   x = h0["short_str"]    46.725M     45.238M i/s -     88.081M times in 1.885112s 1.947063s
   x = h1["short_str"]    44.019M     43.855M i/s -     84.102M times in 1.910584s 1.917716s
   x = h2["short_str"]    42.807M     44.158M i/s -     80.763M times in 1.886661s 1.828943s
   x = h3["short_str"]    42.550M     44.119M i/s -     82.935M times in 1.949125s 1.879820s
   x = h4["short_str"]    40.764M     44.179M i/s -     78.123M times in 1.916487s 1.768325s
   x = h5["short_str"]    40.123M     44.326M i/s -     78.880M times in 1.965977s 1.779536s
   x = h6["short_str"]    39.368M     44.127M i/s -     77.492M times in 1.968403s 1.756120s
   x = h7["short_str"]    39.351M     43.959M i/s -     78.870M times in 2.004282s 1.794183s
   x = h8["short_str"]    37.973M     40.953M i/s -     76.933M times in 2.025977s 1.878552s
   x = h9["short_str"]    40.919M     40.928M i/s -     80.092M times in 1.957326s 1.956906s
   x = h0["long long ]     7.609M      7.583M i/s -     20.516M times in 2.696223s 2.705468s
   x = h1["long long ]     7.549M      7.532M i/s -     20.622M times in 2.731816s 2.737974s
   x = h2["long long ]     7.522M      7.531M i/s -     20.596M times in 2.737924s 2.734750s
   x = h3["long long ]     7.492M      7.531M i/s -     20.516M times in 2.738329s 2.724281s
   x = h4["long long ]     7.487M      7.534M i/s -     20.238M times in 2.703273s 2.686341s
   x = h5["long long ]     7.444M      7.500M i/s -     19.177M times in 2.576132s 2.557040s
   x = h6["long long ]     7.425M      7.533M i/s -     20.351M times in 2.740651s 2.701387s
   x = h7["long long ]     7.397M      7.530M i/s -     20.290M times in 2.743029s 2.694600s
   x = h8["long long ]     7.387M      7.440M i/s -     20.270M times in 2.743931s 2.724420s
   x = h9["long long ]     7.503M      7.444M i/s -     20.052M times in 2.672382s 2.693812s

# セット
        h0[100] = true    46.083M     42.413M i/s -     83.860M times in 1.819765s 1.977203s
        h1[100] = true    43.511M     42.373M i/s -     84.211M times in 1.935364s 1.987362s
        h2[100] = true    41.062M     42.208M i/s -     81.507M times in 1.984945s 1.931053s
        h3[100] = true    39.654M     42.213M i/s -     77.406M times in 1.952063s 1.833701s
        h4[100] = true    39.428M     42.478M i/s -     74.194M times in 1.881773s 1.746667s
        h5[100] = true    42.973M     42.265M i/s -     80.608M times in 1.875779s 1.907174s
        h6[100] = true    40.662M     42.281M i/s -     76.306M times in 1.876611s 1.804727s
        h7[100] = true    38.067M     40.084M i/s -     70.484M times in 1.851591s 1.758405s
        h8[100] = true    48.033M     47.918M i/s -     89.954M times in 1.872740s 1.877238s
        h9[100] = true    48.293M     47.340M i/s -     89.425M times in 1.851723s 1.888990s
    h0[:symbol] = true    43.157M     42.848M i/s -     78.601M times in 1.821269s 1.834401s
    h1[:symbol] = true    43.926M     45.265M i/s -     88.841M times in 2.022519s 1.962692s
    h2[:symbol] = true    41.430M     42.967M i/s -     82.082M times in 1.981232s 1.910365s
    h3[:symbol] = true    40.009M     42.854M i/s -     80.231M times in 2.005313s 1.872203s
    h4[:symbol] = true    39.538M     43.023M i/s -     77.901M times in 1.970253s 1.810694s
    h5[:symbol] = true    41.638M     42.950M i/s -     75.416M times in 1.811218s 1.755912s
    h6[:symbol] = true    38.124M     42.889M i/s -     67.325M times in 1.765961s 1.569744s
    h7[:symbol] = true    37.903M     42.646M i/s -     76.329M times in 2.013815s 1.789841s
    h8[:symbol] = true    48.710M     44.633M i/s -     80.689M times in 1.656528s 1.807852s
    h9[:symbol] = true    51.595M     45.664M i/s -     90.569M times in 1.755380s 1.983365s
        h0["s"] = true    35.082M     35.235M i/s -     68.348M times in 1.948236s 1.939793s
        h1["s"] = true    34.785M     34.682M i/s -     69.035M times in 1.984618s 1.990507s
        h2["s"] = true    34.107M     34.471M i/s -     65.289M times in 1.914228s 1.894018s
        h3["s"] = true    33.682M     34.424M i/s -     68.677M times in 2.039002s 1.995034s
        h4["s"] = true    32.591M     34.496M i/s -     69.715M times in 2.139110s 2.020991s
        h5["s"] = true    32.173M     33.911M i/s -     63.276M times in 1.966719s 1.865913s
        h6["s"] = true    32.307M     34.608M i/s -     66.085M times in 2.045527s 1.909525s
        h7["s"] = true    31.360M     31.992M i/s -     67.392M times in 2.149006s 2.106543s
        h8["s"] = true    33.849M     33.663M i/s -     71.534M times in 2.113358s 2.124989s
        h9["s"] = true    33.690M     33.655M i/s -     68.199M times in 2.024321s 2.026436s
h0["short_str"] = true    33.163M     32.519M i/s -     70.550M times in 2.127370s 2.169500s
h1["short_str"] = true    32.358M     32.684M i/s -     67.495M times in 2.085887s 2.065060s
h2["short_str"] = true    32.791M     32.503M i/s -     68.566M times in 2.091024s 2.109524s
h3["short_str"] = true    31.632M     32.729M i/s -     66.508M times in 2.102565s 2.032062s
h4["short_str"] = true    30.384M     32.541M i/s -     66.774M times in 2.197653s 2.052010s
h5["short_str"] = true    30.526M     32.531M i/s -     66.923M times in 2.192355s 2.057222s
h6["short_str"] = true    31.069M     32.508M i/s -     64.041M times in 2.061248s 1.969987s
h7["short_str"] = true    29.933M     30.252M i/s -     64.463M times in 2.153544s 2.130880s
h8["short_str"] = true    31.347M     31.222M i/s -     67.977M times in 2.168574s 2.177243s
h9["short_str"] = true    31.765M     31.263M i/s -     61.937M times in 1.949834s 1.981173s
h0["long long ] = true     7.173M      7.168M i/s -     19.712M times in 2.747900s 2.749936s
h1["long long ] = true     7.155M      7.166M i/s -     19.670M times in 2.749118s 2.744699s
h2["long long ] = true     7.123M      7.159M i/s -     18.257M times in 2.563142s 2.550164s
h3["long long ] = true     7.115M      7.161M i/s -     19.520M times in 2.743328s 2.725762s
h4["long long ] = true     7.081M      7.162M i/s -     19.297M times in 2.725344s 2.694446s
h5["long long ] = true     7.059M      7.151M i/s -     19.410M times in 2.749527s 2.714388s
h6["long long ] = true     7.048M      7.163M i/s -     19.428M times in 2.756493s 2.712289s
h7["long long ] = true     7.040M      7.043M i/s -     19.244M times in 2.733740s 2.732542s
h8["long long ] = true     7.190M      7.181M i/s -     19.782M times in 2.751245s 2.754663s
h9["long long ] = true     7.186M      7.179M i/s -     18.897M times in 2.629533s 2.632356s

なんというか、正直ぱっとしない。

           x = h0[100]    71.382M     68.264M i/s -    111.733M times in 1.565281s 1.636780s
           x = h1[100]    64.569M     65.207M i/s -    103.362M times in 1.600807s 1.585127s
           x = h2[100]    58.040M     70.716M i/s -    100.978M times in 1.739816s 1.427940s
           x = h3[100]    57.785M     64.796M i/s -     99.286M times in 1.718206s 1.532288s
           x = h4[100]    53.154M     65.509M i/s -     95.571M times in 1.798011s 1.458906s
           x = h5[100]    52.951M     64.176M i/s -     94.424M times in 1.783222s 1.471311s
           x = h6[100]    50.729M     63.956M i/s -     94.091M times in 1.854785s 1.471170s
           x = h7[100]    52.343M     64.688M i/s -     94.679M times in 1.808842s 1.463625s
           x = h8[100]    48.283M     59.181M i/s -     80.856M times in 1.674611s 1.366249s
           x = h9[100]    59.493M     65.898M i/s -    103.146M times in 1.733747s 1.565252s

これを見ると、clean では、エントリ数に比例してちょっと遅く案っている（O(n)）んだけど... modify のほう、O(1) といわれればそんな気もするんだけど、もうちょっと速いままにならないかな...。ルート配列を一度アクセスする分、どーしょーもないかなー。

ご意見募集

というわけで、スゲー単純なデータ構造の話なんだけど、もうちょっと「ぱっとした」値になりそうな気がするので、もしよかったら教えてください。アセンブラで AVX とかの命令使ってもいいけど、ほとんどのバイナリで使えなくなる可能性があるので、ちょっと...。

8要素とか、そういう単位なので、何かカッコいい方法が知られてるんじゃないかと期待しています。

ちなみに、8要素だと 8B * 3 * 8 = 192B かかるんだけど、これをもっと圧縮する何か素敵なアイディアがあれば...。これはさすがに無いかなー。hash値を 32bit しか格納しない（eql? されるかもしれない回数が増える）、とかくらいか。それでも 4B * 8 = 32B しか浮かない。いや、しかってだけでもないか。

なんというか、とてもショーモナイ話だし、実際ここが2倍に高速化しても、あまり実アプリでは意味ないんだけど...。弄りやすくはあります。

既存研究

hash 値だけを先頭に集めれば線形探索速いよね？
- trunk に比べて速くなる場合があることは確認しているが、O(1) に比べてどうか？
ブルームフィルタはどう？
- 結果ほとんど見えなかった

2019-01-03

対象とする CPU アーキテクチャ

MRI

MRI が対象とする CPU アーキテクチャは、32bit および 64bit CPU。命令セットは、一応だいたい動くような気がするけど、レジスタウィンドウとか、IA64 のなんとか、というのが特殊な対応が入る奴がいる。今は Intel x64 が多いんじゃないかな。AWS で動かすならこれですよね。直接 CPU 命令を弄るところはほとんど無い、というか、(1) CPU ごとにどうしてもしょうがないところをアセンブラで書く、(2) 性能のために、有名どころは弄る必要がある

弄っているのは、

GC のため（保守的GC の root set のため）
Fiber のため（2.6 から）

の2カ所、だと思う。それ以外のところは、多分 CPU をあまり選ばないのではないだろうか。

コンパイラは、32bit or 64bit CPU で、C91 を対象としている。C99 は、今議論中。C99 は 20年前なのか。

マシンスタックを保守的GCのルートにするため、マシンスタックのすべての値を取る必要がある。そのため、コンパイルされたコードはマシンスタックが1つの配列構造になっている必要がある。スパゲッティスタックではいけない。callstack - How to make spaghetti stack with C? - Stack Overflow によると、clang も gcc も、そういうのをやるオプションがあるようだが、MRI はそのようにコンパイルされると動かない。