原创

JDK1.8 HashMap源码解析(四) 为什么HashMap桶中链表长度个数超过8才转为红黑树

首先强调一点,HashMap桶中, 并不是链表长度个数超过8一定会转为红黑树。 见文章

因为树化的条件是:桶中链表的长度达到了8,并且数组的长度大于等于64。

但是, 这并不妨碍我们分析其中一个条件:桶中链表长度个数超过8。为什么是8?

为什么要转换?

因为Map中桶的元素初始化是链表保存的,其查找性能是O(n),而树结构能将查找性能提升到O(log(n))。

当链表长度很小的时候,即使遍历,速度也非常快,但是当链表长度不断变长,肯定会对查询性能有一定的影响,所以才需要转成树。

在HashMap源码中,

/**
   * The bin count threshold for using a tree rather than list for a
   * bin.  Bins are converted to trees when adding an element to a
   * bin with at least this many nodes. The value must be greater
   * than 2 and should be at least 8 to mesh with assumptions in
   * tree removal about conversion back to plain bins upon
   * shrinkage.
   */
  /**
   * JDK1.8 新加,阈值,Entry链表最大长度。
   * 当桶中节点上的链表长度大于8,将链表转成红黑树存储;
   */
  static final int TREEIFY_THRESHOLD = 8;

这段源代码的注释只说明了8是bin(桶,即bucket)从链表转成树的阈值。

抽丝剥茧

为什么阈值是8,去源码中找寻答案,才是最可靠的途径。

在HashMap源代码的开始,有一段Implementation notes描述,里面就告诉了我们答案:

/*
   * Implementation notes.
   *
   * This map usually acts as a binned (bucketed) hash table, but
   * when bins get too large, they are transformed into bins of
   * TreeNodes, each structured similarly to those in
   * java.util.TreeMap. Most methods try to use normal bins, but
   * relay to TreeNode methods when applicable (simply by checking
   * instanceof a node).  Bins of TreeNodes may be traversed and
   * used like any others, but additionally support faster lookup
   * when overpopulated. However, since the vast majority of bins in
   * normal use are not overpopulated, checking for existence of
   * tree bins may be delayed in the course of table methods.
   *
   * Tree bins (i.e., bins whose elements are all TreeNodes) are
   * ordered primarily by hashCode, but in the case of ties, if two
   * elements are of the same "class C implements Comparable<C>",
   * type then their compareTo method is used for ordering. (We
   * conservatively check generic types via reflection to validate
   * this -- see method comparableClassFor).  The added complexity
   * of tree bins is worthwhile in providing worst-case O(log n)
   * operations when keys either have distinct hashes or are
   * orderable, Thus, performance degrades gracefully under
   * accidental or malicious usages in which hashCode() methods
   * return values that are poorly distributed, as well as those in
   * which many keys share a hashCode, so long as they are also
   * Comparable. (If neither of these apply, we may waste about a
   * factor of two in time and space compared to taking no
   * precautions. But the only known cases stem from poor user
   * programming practices that are already so slow that this makes
   * little difference.)
   *
   * Because TreeNodes are about twice the size of regular nodes, we
   * use them only when bins contain enough nodes to warrant use
   * (see TREEIFY_THRESHOLD). And when they become too small (due to
   * removal or resizing) they are converted back to plain bins.  In
   * usages with well-distributed user hashCodes, tree bins are
   * rarely used.  Ideally, under random hashCodes, the frequency of
   * nodes in bins follows a Poisson distribution
   * (http://en.wikipedia.org/wiki/Poisson_distribution) with a
   * parameter of about 0.5 on average for the default resizing
   * threshold of 0.75, although with a large variance because of
   * resizing granularity. Ignoring variance, the expected
   * occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
   * factorial(k)). The first values are:
   *
   * 0:    0.60653066
   * 1:    0.30326533
   * 2:    0.07581633
   * 3:    0.01263606
   * 4:    0.00157952
   * 5:    0.00015795
   * 6:    0.00001316
   * 7:    0.00000094
   * 8:    0.00000006
   * more: less than 1 in ten million
   *
   * The root of a tree bin is normally its first node.  However,
   * sometimes (currently only upon Iterator.remove), the root might
   * be elsewhere, but can be recovered following parent links
   * (method TreeNode.root()).
   *
   * All applicable internal methods accept a hash code as an
   * argument (as normally supplied from a public method), allowing
   * them to call each other without recomputing user hashCodes.
   * Most internal methods also accept a "tab" argument, that is
   * normally the current table, but may be a new or old one when
   * resizing or converting.
   *
   * When bin lists are treeified, split, or untreeified, we keep
   * them in the same relative access/traversal order (i.e., field
   * Node.next) to better preserve locality, and to slightly
   * simplify handling of splits and traversals that invoke
   * iterator.remove. When using comparators on insertion, to keep a
   * total ordering (or as close as is required here) across
   * rebalancings, we compare classes and identityHashCodes as
   * tie-breakers.
   *
   * The use and transitions among plain vs tree modes is
   * complicated by the existence of subclass LinkedHashMap. See
   * below for hook methods defined to be invoked upon insertion,
   * removal and access that allow LinkedHashMap internals to
   * otherwise remain independent of these mechanics. (This also
   * requires that a map instance be passed to some utility methods
   * that may create new nodes.)
   *
   * The concurrent-programming-like SSA-based coding style helps
   * avoid aliasing errors amid all of the twisty pointer operations.
   */

要点:

当bin变得很大的时候,会被转换成TreeNodes中的bin,其结构和TreeMap相似,也就是红黑树。

This map usually acts as a binned (bucketed) hash table, but when bins get too large, they are transformed into bins of TreeNodes, each structured similarly to those in java.util.TreeMap.

转换后存储的数据结构TreeNodes占用空间是普通Nodes的两倍,只有当bin包含足够多的节点时才会转成TreeNodes,而是否足够多就是由 TREEIFY_THRESHOLD 的值决定的。

当bin中节点数变少时,又会转成普通的bin(链表长度达到8就转成红黑树,当长度降到6就转成普通bin)。

Because TreeNodes are about twice the size of regular nodes, we use them only when bins contain enough nodes to warrant use (see TREEIFY_THRESHOLD). And when they become too small (due to removal or resizing) they are converted back to plain bins.

这就是为什么不是一开始就将其转换为TreeNodes,而是需要一定节点数才转为TreeNodes。

继续往下看,给出了一个不同节点数达到的概率值。

在hashCode离散性很好的情况下,树型bin用到的概率非常小,因为数据均匀分布在每个bin中,几乎不会有bin中链表长度会达到阈值(8)。但是在随机hashCode下,离散性可能会变差,然而JDK又不能阻止用户实现这种不好的hash算法,因此就可能导致不均匀的数据分布。

事实上,随机hashCode算法下所有bin中节点的分布频率遵循如下的泊松分布

在扩容阈值为0.75的情况下,(即使因为扩容而方差很大)遵循着参数平均为0.5的泊松分布。忽略方差,按公式
泊松分布公式.png

我们可以看到如下的计算概率,一个bin中链表长度达到8个元素的概率为0.00000006,几乎是不可能事件

所以,之所以选择8,是时间和空间的权衡(trade-off),是根据概率统计决定的, 是非常严谨和科学的。

In usages with well-distributed user hashCodes, tree bins are rarely used. Ideally, under random hashCodes, the frequency of nodes in bins follows a Poisson distribution (http://en.wikipedia.org/wiki/Poisson_distribution) with a parameter of about 0.5 on average for the default resizing threshold of 0.75, although with a large variance because of resizing granularity. Ignoring variance, the expected occurrences of list size k are (exp(-0.5) * pow(0.5, k) / factorial(k)).

The first values are:
    0:    0.60653066
    1:    0.30326533
    2:    0.07581633
    3:    0.01263606
    4:    0.00157952
    5:    0.00015795
    6:    0.00001316
    7:    0.00000094
    8:    0.00000006
    more: less than 1 in ten million

大部分情况下,链表存储能节约存储空间同时有着良好的查找性能;极个别情况下,节点数达到8个,转为红黑树,能获得更好的查找性能,同时因为是个别情况,不需要大量的存储空间。

总结

1、TreeNodes(红黑树)占用空间是普通Nodes(链表)的两倍,为了时间和空间的权衡。

2、节点的分布频率会遵循泊松分布,链表长度达到8个元素的概率为0.00000006,几乎是不可能事件.

题外话:
为什么转化为红黑树的阈值8和转化为链表的阈值6不一样,是为了避免频繁来回转化。

正文到此结束
广告是为了更好的提供数据服务
本文目录