[Light Chaser Series] HikariCP Source Code Analysis: Thoughts on Failure Detection, fail fast & allowPoolSuspension

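Excerpted from 【工匠小豬豬的技術世界】.
1. This is a series; if you are interested, feel free to follow along.
2. If you run into problems using HikariCP, leave me a message and we can discuss them together.
3. I would welcome real-world cases from readers, and I am happy to help with tuning.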

Due to time constraints, this article draws mainly on https://segmentfault.com/a/1190000013136251, with annotations added from my own reflections.

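Simulating a database outage

First, a note on what connectionTimeout means: it is not the timeout for establishing a connection to the database, but the maximum time a caller waits for the pool to hand back a connection. For SQL execution timeouts, plain JDBC can use Statement.setQueryTimeout, and Spring can use @Transactional(timeout=10).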

connectionTimeout  This property controls the maximum number of milliseconds that a client (that’s you) will wait for a connection from the pool. If this time is exceeded without a connection becoming available, a SQLException will be thrown. Lowest acceptable connection timeout is 250 ms. Default: 30000 (30 seconds)

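When there are no idle connections and the pool is full, so no new connection can be created, hikari blocks for connectionTimeout and, if no connection is obtained, throws SQLTransientConnectionException.

When there are idle connections, hikari keeps looping within connectionTimeout: it takes the next idle connection and validates it, and on validation failure moves on to the next one, until the timeout expires and SQLTransientConnectionException is thrown. (In other words, the first getConnection call validates the idle connections one by one within connectionTimeout and finally throws on timeout; subsequent getConnection calls simply block for the full connectionTimeout before throwing.)

If the microservice runs a connection health check and you catch this exception, the health check errors will be logged over and over.

If connectionTimeout is set too high, business threads can easily be blocked when the database goes down.

With these conclusions in mind, let's walk through the source code. Start with getConnection: roughly, if the poolEntry returned by borrow is null, the loop breaks and an exception is thrown, with the elapsed wait time included in the message, as follows: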

    java.sql.SQLTransientConnectionException: communications-link-failure-db - Connection is not available, request timed out after 447794ms.
        at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:666)
        at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:182)
        at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:147)

    /**
     * Get a connection from the pool, or timeout after the specified number of milliseconds.
     *
     * @param hardTimeout the maximum time to wait for a connection from the pool
     * @return a java.sql.Connection instance
     * @throws SQLException thrown if a timeout occurs trying to obtain a connection
     */
    public Connection getConnection(final long hardTimeout) throws SQLException {
       suspendResumeLock.acquire();
       final long startTime = currentTime();
       try {
          long timeout = hardTimeout;
          do {
             PoolEntry poolEntry = connectionBag.borrow(timeout, MILLISECONDS);
             if (poolEntry == null) {
                break; // We timed out... break and throw exception
             }
             final long now = currentTime();
             if (poolEntry.isMarkedEvicted() || (elapsedMillis(poolEntry.lastAccessed, now) > ALIVE_BYPASS_WINDOW_MS && !isConnectionAlive(poolEntry.connection))) {
                closeConnection(poolEntry, poolEntry.isMarkedEvicted() ? EVICTED_CONNECTION_MESSAGE : DEAD_CONNECTION_MESSAGE);
                timeout = hardTimeout - elapsedMillis(startTime);
             }
             else {
                metricsTracker.recordBorrowStats(poolEntry, startTime);
                return poolEntry.createProxyConnection(leakTaskFactory.schedule(poolEntry), now);
             }
          } while (timeout > 0L);
          metricsTracker.recordBorrowTimeoutStats(startTime);
          throw createTimeoutException(startTime);
       }
       catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new SQLException(poolName + " - Interrupted during connection acquisition", e);
       }
       finally {
          suspendResumeLock.release();
       }
    }
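
The loop above keeps one overall time budget: every dead entry is discarded and the remaining budget is recomputed before the next borrow. Below is a minimal, JDK-only sketch of that pattern. The class DeadlineBorrowSketch, the use of a BlockingQueue as the pool, and the isAlive predicate are hypothetical stand-ins for illustration, not HikariCP code.

```java
import java.sql.SQLTransientConnectionException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

/** Hypothetical stand-in for the borrow/validate/retry loop; not HikariCP code. */
class DeadlineBorrowSketch {

    /**
     * Polls the pool until a live entry is found or the overall budget is spent.
     * Dead entries are dropped and the remaining budget is recomputed each time.
     */
    static <T> T acquire(BlockingQueue<T> pool, Predicate<T> isAlive, long hardTimeoutMs)
            throws SQLTransientConnectionException, InterruptedException {
        final long start = System.nanoTime();
        long timeoutMs = hardTimeoutMs;
        do {
            T entry = pool.poll(timeoutMs, TimeUnit.MILLISECONDS);
            if (entry == null) {
                break; // nothing arrived within the remaining budget
            }
            if (isAlive.test(entry)) {
                return entry;
            }
            // Dead entry: drop it and shrink the budget by the time already spent.
            timeoutMs = hardTimeoutMs - TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } while (timeoutMs > 0L);
        long waitedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        throw new SQLTransientConnectionException(
                "Connection is not available, request timed out after " + waitedMs + "ms.");
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> pool = new LinkedBlockingQueue<>();
        pool.add("dead-1");
        pool.add("dead-2");
        try {
            acquire(pool, name -> !name.startsWith("dead"), 300);
        } catch (SQLTransientConnectionException e) {
            // Both entries failed validation, then the wait ran out of budget.
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the caller is blocked for roughly the whole budget when every pooled entry is dead, which is the same effect described above for a large connectionTimeout.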

Now focus on the borrow source. As its comment says, the method will borrow a BagEntry from the bag, blocking for the specified timeout if none are available. So the statement final T bagEntry = handoffQueue.poll(timeout, NANOSECONDS); is where the time is spent when the database is down.

    /**
     * The method will borrow a BagEntry from the bag, blocking for the
     * specified timeout if none are available.
     *
     * @param timeout how long to wait before giving up, in units of unit
     * @param timeUnit a TimeUnit determining how to interpret the timeout parameter
     * @return a borrowed instance from the bag or null if a timeout occurs
     * @throws InterruptedException if interrupted while waiting
     */
    public T borrow(long timeout, final TimeUnit timeUnit) throws InterruptedException
    {
       // Try the thread-local list first
       final List<Object> list = threadList.get();
       for (int i = list.size() - 1; i >= 0; i--) {
          final Object entry = list.remove(i);
          @SuppressWarnings("unchecked")
          final T bagEntry = weakThreadLocals ? ((WeakReference<T>) entry).get() : (T) entry;
          if (bagEntry != null && bagEntry.compareAndSet(STATE_NOT_IN_USE, STATE_IN_USE)) {
             return bagEntry;
          }
       }
       // Otherwise, scan the shared list ... then poll the handoff queue
       final int waiting = waiters.incrementAndGet();
       try {
          for (T bagEntry : sharedList) {
             if (bagEntry.compareAndSet(STATE_NOT_IN_USE, STATE_IN_USE)) {
                // If we may have stolen another waiter's connection, request another bag add.
                if (waiting > 1) {
                   listener.addBagItem(waiting - 1);
                }
                return bagEntry;
             }
          }
          listener.addBagItem(waiting);
          timeout = timeUnit.toNanos(timeout);
          do {
             final long start = currentTime();
             final T bagEntry = handoffQueue.poll(timeout, NANOSECONDS);
             if (bagEntry == null || bagEntry.compareAndSet(STATE_NOT_IN_USE, STATE_IN_USE)) {
                return bagEntry;
             }
             timeout -= elapsedNanos(start);
          } while (timeout > 10_000);
          return null;
       }
       finally {
          waiters.decrementAndGet();
       }
    }

This uses SynchronousQueue from java.util.concurrent (JUC); its poll method is:

    /**
     * Retrieves and removes the head of this queue, waiting
     * if necessary up to the specified wait time, for another thread
     * to insert it.
     *
     * @return the head of this queue, or {@code null} if the
     *         specified waiting time elapses before an element is present
     * @throws InterruptedException {@inheritDoc}
     */
    public E poll(long timeout, TimeUnit unit) throws InterruptedException {
        E e = transferer.transfer(null, true, unit.toNanos(timeout));
        if (e != null || !Thread.interrupted())
            return e;
        throw new InterruptedException();
    }

At this point getConnection receives a null poolEntry, breaks out of the loop, and throws the exception.
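
To make the cost of that timed poll concrete, here is a small JDK-only sketch: with no producer thread (think of a database that is down, so no new connection is ever handed over), poll returns null only after the full timeout; with a producer it returns as soon as the element is handed over. The class and method names are hypothetical and exist only for this illustration.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;

/** Illustration only: timed poll on a SynchronousQueue with and without a producer. */
class HandoffPollSketch {

    /** Waits up to timeoutMs for another thread to hand over an element; null means timeout. */
    static String pollWithTimeout(SynchronousQueue<String> queue, long timeoutMs)
            throws InterruptedException {
        return queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        SynchronousQueue<String> handoff = new SynchronousQueue<>();

        // No producer: the caller is parked for the whole timeout and gets null.
        long start = System.nanoTime();
        String none = pollWithTimeout(handoff, 200);
        long waitedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("result=" + none + ", waited about " + waitedMs + " ms");

        // A producer hands over an element, so the poll returns early.
        Thread producer = new Thread(() -> {
            try {
                handoff.put("conn-1");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        System.out.println("result=" + pollWithTimeout(handoff, 2_000));
        producer.join();
    }
}
```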

HikariPool also has an inner class named PoolEntryCreator:

    /**
     * Creating and adding poolEntries (connections) to the pool.
     */
    private final class PoolEntryCreator implements Callable<Boolean> {
       private final String loggingPrefix;
       PoolEntryCreator(String loggingPrefix)
       {
          this.loggingPrefix = loggingPrefix;
       }
       @Override
       public Boolean call() throws Exception
       {
          long sleepBackoff = 250L;
          while (poolState == POOL_NORMAL && shouldCreateAnotherConnection()) {
             final PoolEntry poolEntry = createPoolEntry();
             if (poolEntry != null) {
                connectionBag.add(poolEntry);
                LOGGER.debug("{} - Added connection {}", poolName, poolEntry.connection);
                if (loggingPrefix != null) {
                   logPoolState(loggingPrefix);
                }
                return Boolean.TRUE;
             }
             // failed to get connection from db, sleep and retry
             quietlySleep(sleepBackoff);
             sleepBackoff = Math.min(SECONDS.toMillis(10), Math.min(connectionTimeout, (long) (sleepBackoff * 1.5)));
          }
          // Pool is suspended or shutdown or at max size
          return Boolean.FALSE;
       }
       /**
        * We only create connections if we need another idle connection or have threads still waiting
        * for a new connection.  Otherwise we bail out of the request to create.
        *
        * @return true if we should create a connection, false if the need has disappeared
        */
       private boolean shouldCreateAnotherConnection() {
          return getTotalConnections() < config.getMaximumPoolSize() &&
             (connectionBag.getWaitingThreadCount() > 0 || getIdleConnections() < config.getMinimumIdle());
       }
    }

The shouldCreateAnotherConnection method decides whether a new connection needs to be added.
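
The retry delay in call() grows by a factor of 1.5 after each failed attempt and is capped by both connectionTimeout and 10 seconds. The sketch below reuses the same update expression to print the resulting sleep sequence; the class BackoffSketch and its helper names are hypothetical, and the 30000 ms value is simply the documented default connectionTimeout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Illustration only: the sleep sequence produced by the backoff update rule. */
class BackoffSketch {

    /** Same update expression as in the listing: grow 1.5x, capped by connectionTimeout and 10 s. */
    static long nextBackoff(long currentMs, long connectionTimeoutMs) {
        return Math.min(TimeUnit.SECONDS.toMillis(10),
                        Math.min(connectionTimeoutMs, (long) (currentMs * 1.5)));
    }

    /** Returns the first n sleep intervals in milliseconds, starting from 250 ms. */
    static List<Long> sleeps(int n, long connectionTimeoutMs) {
        List<Long> result = new ArrayList<>();
        long backoff = 250L;
        for (int i = 0; i < n; i++) {
            result.add(backoff);
            backoff = nextBackoff(backoff, connectionTimeoutMs);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [250, 375, 562, 843, 1264, 1896, 2844, 4266, 6399, 9598, 10000, 10000]
        System.out.println(sleeps(12, 30_000L));
    }
}
```

So while the database is unreachable, a creator task retries roughly every 10 seconds once the backoff has ramped up, or every connectionTimeout if that is smaller.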

When HikariPool is initialized it creates two PoolEntryCreator instances, POOL_ENTRY_CREATOR and POST_FILL_POOL_ENTRY_CREATOR. Both are Callable tasks that run asynchronously on an executor.

    private final PoolEntryCreator POOL_ENTRY_CREATOR = new PoolEntryCreator(null /*logging prefix*/);
    private final PoolEntryCreator POST_FILL_POOL_ENTRY_CREATOR = new PoolEntryCreator("After adding ");

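POOL_ENTRY_CREATOR is mainly submitted to private final ThreadPoolExecutor addConnectionExecutor;. One call site is fillPool, which tops up the pool from the current number of idle connections (as perceived at the point of execution) to minimumIdle (the minimum number of idle connections HikariCP tries to maintain in the pool; if idle connections fall below this value and total connections are fewer than maximumPoolSize, HikariCP makes a best effort to add connections quickly and efficiently). Adding new connections can also hit Connection refused exceptions.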

    /**
     * Fill pool up from current idle connections (as they are perceived at the point of execution) to minimumIdle connections.
     */
    private synchronized void fillPool() {
       final int connectionsToAdd = Math.min(config.getMaximumPoolSize() - getTotalConnections(), config.getMinimumIdle() - getIdleConnections())
                                    - addConnectionQueue.size();
       for (int i = 0; i < connectionsToAdd; i++) {
          addConnectionExecutor.submit((i < connectionsToAdd - 1) ? POOL_ENTRY_CREATOR : POST_FILL_POOL_ENTRY_CREATOR);
       }
    }
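
A quick worked example of the connectionsToAdd arithmetic may help. The sketch below lifts the expression into a plain function; the class name and the sample numbers are made up for illustration.

```java
/** Illustration only: the connectionsToAdd expression from fillPool as a pure function. */
class FillPoolMathSketch {

    /**
     * Room left in the pool versus idle connections still missing, whichever is smaller,
     * minus the creation requests that are already queued.
     */
    static int connectionsToAdd(int maxPoolSize, int total, int minIdle, int idle, int queued) {
        return Math.min(maxPoolSize - total, minIdle - idle) - queued;
    }

    public static void main(String[] args) {
        // max 10, 4 open, minimumIdle 5, 1 idle, 2 requests queued: min(6, 4) - 2 = 2
        System.out.println(connectionsToAdd(10, 4, 5, 1, 2));
        // Pool already full: min(0, 5) - 0 = 0, so the loop submits nothing.
        System.out.println(connectionsToAdd(10, 10, 5, 0, 0));
    }
}
```

A zero or negative result simply means the for loop in fillPool does not run.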

The other call site is addBagItem:

    /** {@inheritDoc} */
    @Override
    public void addBagItem(final int waiting) {
       final boolean shouldAdd = waiting - addConnectionQueue.size() >= 0; // Yes, >= is intentional.
       if (shouldAdd) {
          addConnectionExecutor.submit(POOL_ENTRY_CREATOR);
       }
    }

Finally, two more properties: idleTimeout and minimumIdle.

idleTimeout  This property controls the maximum amount of time that a connection is allowed to sit idle in the pool. This setting only applies when minimumIdle is defined to be less than maximumPoolSize. Idle connections will not be retired once the pool reaches minimumIdle connections. Whether a connection is retired as idle or not is subject to a maximum variation of +30 seconds, and average variation of +15 seconds. A connection will never be retired as idle before this timeout. A value of 0 means that idle connections are never removed from the pool. The minimum allowed value is 10000ms (10 seconds). Default: 600000 (10 minutes)

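The default is 600000 ms, i.e. 10 minutes. If idleTimeout + 1 second > maxLifetime and maxLifetime > 0, it is reset to 0; if idleTimeout != 0 and is less than 10 seconds, it is reset to 10 seconds. idleTimeout = 0 means idle connections are never removed from the pool.

This parameter only takes effect when minimumIdle is less than maximumPoolSize: once the number of idle connections exceeds minimumIdle and a connection has been idle longer than idleTimeout, it is removed.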

minimumIdle  This property controls the minimum number of idle connections that HikariCP tries to maintain in the pool. If the idle connections dip below this value and total connections in the pool are less than maximumPoolSize, HikariCP will make a best effort to add additional connections quickly and efficiently. However, for maximum performance and responsiveness to spike demands, we recommend not setting this value and instead allowing HikariCP to act as a fixed size connection pool. Default: same as maximumPoolSize

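This property controls the minimum number of idle connections in the pool. When idle connections drop below minimumIdle and total connections are fewer than maximumPoolSize, HikariCP does its best to add new connections. For performance reasons, setting this value is not recommended; instead, let HikariCP act as a fixed-size pool. By default minimumIdle equals maximumPoolSize.

If minIdle < 0 or minIdle > maxPoolSize, it is reset to maxPoolSize, whose default is 10.

Hikari starts a HouseKeeper scheduled task, initialized in the HikariPool constructor. By default it first runs 100 ms after initialization, and then again HOUSEKEEPING_PERIOD_MS (30 seconds) after each run completes.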

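The job of this task is to remove connections that have exceeded idleTimeout. It first checks whether the clock has moved backwards; if so, it immediately marks expired connections for eviction. Then, when idleTimeout > 0 and the configured minimumIdle is less than maximumPoolSize, it counts the connections in state STATE_NOT_IN_USE. If the count exceeds minimumIdle, it iterates over those connections and removes from connectionBag the ones that have been idle for at least idleTimeout; when a removal succeeds it closes the connection and decrements the counter (toRemove--).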

After the idle connections are removed, fillPool is called to try to bring the number of idle connections back up to minimumIdle.
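
The eviction pass described above can be sketched with plain collections. This is a simplified model under stated assumptions: idle connections are represented only by their last-access timestamps, removal from the list stands in for reserving and closing a connection, and the class and method names are hypothetical rather than HikariCP's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Illustration only: evict idle entries beyond minimumIdle that exceeded idleTimeout. */
class IdleEvictionSketch {

    /**
     * Removes timed-out entries from the list while more than minimumIdle remain.
     * Returns how many entries were evicted.
     */
    static int evictIdle(List<Long> idleLastAccessedMs, long nowMs, long idleTimeoutMs, int minimumIdle) {
        if (idleTimeoutMs <= 0) {
            return 0; // idleTimeout = 0 means idle connections are never removed
        }
        int toRemove = idleLastAccessedMs.size() - minimumIdle;
        int evicted = 0;
        Iterator<Long> it = idleLastAccessedMs.iterator();
        while (toRemove > 0 && it.hasNext()) {
            long lastAccessed = it.next();
            if (nowMs - lastAccessed > idleTimeoutMs) {
                it.remove(); // stands in for "remove from the bag, then close the connection"
                toRemove--;
                evicted++;
            }
        }
        return evicted;
    }

    public static void main(String[] args) {
        List<Long> idle = new ArrayList<>(Arrays.asList(0L, 0L, 0L, 9_000L, 9_500L));
        int evicted = evictIdle(idle, 10_000L, 5_000L, 3);
        // Three entries are past the timeout, but only two are removed so that 3 idle remain.
        System.out.println("evicted=" + evicted + ", remaining=" + idle);
    }
}
```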

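Hikari handles connection leak detection by scheduling a separate delayed task on every getConnection call, whereas idle connections are cleaned up by the HouseKeeper scheduled task. The HouseKeeper interval is controlled by the com.zaxxer.hikari.housekeeping.periodMs system property and defaults to 30 seconds.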
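
The "one delayed task per borrow" idea can be sketched with a ScheduledExecutorService: a task is armed when a connection is handed out and cancelled when it comes back in time. Everything here (class name, method names, the counter) is hypothetical and only illustrates the scheduling pattern, not HikariCP's own leak task classes.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustration only: arm a one-shot leak task per borrow, cancel it on return. */
class LeakTaskSketch {
    final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    final AtomicInteger leaksReported = new AtomicInteger();

    /** Called when a connection is handed out: the task fires after thresholdMs unless cancelled. */
    ScheduledFuture<?> onBorrow(long thresholdMs) {
        return scheduler.schedule(() -> {
            leaksReported.incrementAndGet(); // a real pool would log a stack trace here
        }, thresholdMs, TimeUnit.MILLISECONDS);
    }

    /** Called when the connection is returned: disarm the pending task. */
    void onReturn(ScheduledFuture<?> leakTask) {
        leakTask.cancel(false);
    }

    public static void main(String[] args) throws Exception {
        LeakTaskSketch sketch = new LeakTaskSketch();
        ScheduledFuture<?> returnedInTime = sketch.onBorrow(100);
        sketch.onReturn(returnedInTime); // returned before the threshold, nothing reported
        sketch.onBorrow(100);            // never returned, reported after about 100 ms
        Thread.sleep(400);
        System.out.println("leaks reported: " + sketch.leaksReported.get());
        sketch.scheduler.shutdownNow();
    }
}
```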

allowPoolSuspension

This parameter marks whether the pool may be suspended. Once the pool is suspended, every getConnection call blocks.
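
In the getConnection listing earlier, suspendResumeLock.acquire() runs before the timeout clock starts, which fits with suspended callers blocking rather than timing out. Below is a minimal sketch of such a gate, assuming a Semaphore whose permits are all drained on suspend. The class SuspendGateSketch, its method names, and the permit count are hypothetical; the timed tryEnter exists only so the demo can terminate.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Illustration only: a suspend/resume gate built on a counting semaphore. */
class SuspendGateSketch {
    private static final int MAX_PERMITS = 10_000; // assumed upper bound on concurrent borrowers
    private final Semaphore permits = new Semaphore(MAX_PERMITS, true);

    /** Drains every permit, so new borrowers block at the gate. */
    void suspend() {
        permits.acquireUninterruptibly(MAX_PERMITS);
    }

    /** Gives all permits back, letting blocked borrowers proceed. */
    void resume() {
        permits.release(MAX_PERMITS);
    }

    /** Borrower side: waits up to waitMs for a permit; true if allowed through. */
    boolean tryEnter(long waitMs) throws InterruptedException {
        return permits.tryAcquire(waitMs, TimeUnit.MILLISECONDS);
    }

    /** Borrower side: hands the permit back after the borrow attempt finishes. */
    void exit() {
        permits.release();
    }

    public static void main(String[] args) throws Exception {
        SuspendGateSketch gate = new SuspendGateSketch();
        gate.suspend();
        System.out.println("while suspended: " + gate.tryEnter(100)); // false
        gate.resume();
        System.out.println("after resume: " + gate.tryEnter(100));    // true
        gate.exit();
    }
}
```

With an untimed acquire in place of tryEnter, a caller would wait at the gate for as long as the pool stays suspended, which matches the blocking behaviour described above.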

The author puts it this way: https://github.com/brettwooldridge/HikariCP/issues/1060

All of the suspend use cases I have heard have centered around a pattern of:

  • Suspend the pool.

  • Alter the pool configuration, or alter DNS configuration (to point to a new master).

  • Soft-evict existing connections.

  • Resume the pool.

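I ran an experiment: while the pool is suspended, getConnection indeed does not time out, and the pending SQL is held. After the existing connections are soft-evicted, callers stay blocked until the pool is resumed, at which point the SQL goes on to execute, so users do not lose data. In real production, however, it is hard to avoid affecting the business: even if execution continues, the business request may already have timed out.

Fault injection is something middleware development should provide, and this feature is useful for implementing a chaos monkey that simulates database connection failures. During monitoring, though, I found that the hikaricp_pending_threads metric did not rise and the MBean's threadAwaitingConnections did not change either. So fault drills need not be made so complicated; consolidating them inside the middleware is probably better, provided the middleware adds its own enhancements around this parameter, such as simulated exceptions or extra monitoring metrics. Also, keeping the pool suspended for a long time risks hanging the microservice.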

For details, see [Light Chaser Series] HikariCP Source Code Analysis: allowPoolSuspension.

References

  • https://segmentfault.com/u/codecraft/articles?page=4
