Tracking Down a File Descriptor Leak: The Quiet Exhaustion Behind sys.stdout Reassignment

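A data processing pipeline started crashing with OSError: [Errno 24] Too many open files after running for a few hours.

The cause was a file descriptor (FD) leak: subprocess stdout/stderr objects were being held in variables unintentionally. This post summarizes how Python's subprocess.Popen behaves and how to approach FD leak detection.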

Problem: Too many open files

Symptoms

Traceback (most recent call last):
  File "pipeline.py", line 142, in run
    proc = subprocess.Popen([...], stdout=subprocess.PIPE)
OSError: [Errno 24] Too many open files

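A restart fixes it. A few hours later it comes back. That is the classic symptom of an FD leak.

First steps in the investigation

First, check how many FDs the process currently has open.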

# Find the process ID
ps aux | grep pipeline

# Count open FDs
ls /proc/<PID>/fd | wc -l     # Linux
lsof -p <PID> | wc -l         # macOS (the count includes the header line and non-FD rows such as cwd and txt)

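The default soft limit shown by ulimit -n is typically 256 on macOS and 1024 on many Linux distributions. Hitting it within a few hours means FDs are leaking steadily: with stdout=PIPE and stderr=PIPE, each leaked Popen can pin up to two FDs.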
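The limit can also be read from inside the Python process itself, which avoids guessing which shell settings the pipeline was launched with. A minimal sketch using the standard resource module (Unix only); the helper name fd_limits is made up for illustration:

```python
import resource


def fd_limits() -> tuple[int, int]:
    """Return the (soft, hard) limit on open file descriptors for this process."""
    # RLIMIT_NOFILE is the per-process cap on open FDs. The soft value is the
    # one that actually triggers "Too many open files".
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = fd_limits()
    print(f"soft={soft} hard={hard}")
```

Logging this once at startup makes it easier to compare later FD counts against the real ceiling.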

Cause: holding proc.stdout in a variable

The leaking code

import subprocess

def run_task(cmd: list[str]) -> str:
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout = proc.stdout  # ← this is the trap
    stderr = proc.stderr
    output = stdout.read().decode()
    error = stderr.read().decode()
    proc.wait()
    return output

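At first glance it looks fine, but once proc.stdout is assigned to another variable, there is more than one reference to that file object.

Nothing in this function closes the pipes explicitly; it relies on the file objects being finalized once the last reference disappears. If the stdout reference escapes the function and is kept somewhere (a list, a cache, a closure), the file object stays alive regardless of what happens to proc, and the FD behind it is never closed.

The actual source of the leak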

# Store results for cache-like purposes
recent_outputs = []

def run_and_cache(cmd: list[str]):
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    output_stream = proc.stdout  # file object for the PIPE
    recent_outputs.append({
        "cmd": cmd,
        "stream": output_stream,  # ← the stream itself is being kept
    })
    return output_stream.read()

As long as streams (file objects) keep being appended to recent_outputs, their FDs are never released. If the goal is to keep the data, store the result of stream.read() (bytes) instead.
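The point that reading a stream to the end does not release its FD can be checked directly. The following is a small demonstration, not the pipeline's code: the child command is a hypothetical one-liner, and pipe_is_open is a helper introduced only for this sketch.

```python
import os
import subprocess
import sys


def pipe_is_open(fd: int) -> bool:
    """Return True if fd still refers to an open descriptor in this process."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False


# Hypothetical child command: a tiny Python one-liner that prints one line.
proc = subprocess.Popen([sys.executable, "-c", "print('hello')"],
                        stdout=subprocess.PIPE)
stream = proc.stdout           # the reference that would end up in a cache
fd = stream.fileno()
data = stream.read()           # reading to EOF does not close the pipe
proc.wait()

still_open_after_read = pipe_is_open(fd)   # the FD is still held
stream.close()                             # an explicit close releases it
open_after_close = pipe_is_open(fd)
```

Every stream parked in a long-lived container behaves like this one before the close() call.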

Detection: record the FD count over time

A simple monitor

import os
import subprocess
import logging

def count_open_fds() -> int:
    """Return the number of FDs the current process has open."""
    pid = os.getpid()
    if os.path.exists(f"/proc/{pid}/fd"):  # Linux
        return len(os.listdir(f"/proc/{pid}/fd"))
    # macOS: fall back to lsof
    result = subprocess.run(
        ["lsof", "-p", str(pid)],
        capture_output=True, text=True
    )
    return len(result.stdout.splitlines()) - 1  # exclude the header line

Log it periodically inside the processing loop.

for i, task in enumerate(tasks):
    run_task(task)
    if i % 100 == 0:
        fd_count = count_open_fds()
        logging.info(f"iter={i} fd_count={fd_count}")

If the FD count keeps climbing and never comes back down, something is leaking.
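The "keeps climbing" judgment can be automated over the logged samples. This is only a heuristic sketch: the function name looks_like_leak and the min_growth threshold of 10 are assumptions chosen for illustration, not values from the original pipeline.

```python
def looks_like_leak(samples: list[int], min_growth: int = 10) -> bool:
    """Heuristic: True if FD counts never decrease and grow by at least min_growth."""
    if len(samples) < 2:
        return False
    # Compare each sample with the next one; a single drop means FDs were released.
    never_decreases = all(b >= a for a, b in zip(samples, samples[1:]))
    total_growth = samples[-1] - samples[0]
    return never_decreases and total_growth >= min_growth
```

A real workload may legitimately open and close FDs in bursts, so a stricter or looser rule may fit better depending on how often samples are taken.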

Looking at what the FDs are

The count alone does not tell you where the leak comes from. Look at the lsof output.

lsof -p <PID> | awk '{print $5, $9}' | sort | uniq -c | sort -rn | head -20

Example output:

 127 PIPE       pipe
  54 REG        /tmp/tmpXXXXXX.json
  12 REG        /app/data/cache.db

A large number of PIPE entries (lsof on Linux labels them FIFO) points almost certainly at a leak from subprocess.Popen with stdout=PIPE or similar.
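When lsof is not available, for example in a slim container, a similar breakdown can be produced from inside the process with os.fstat. This is a rough sketch under stated assumptions: classify_open_fds is a hypothetical helper, it only scans descriptors below max_fd (4096 by default, an arbitrary cap), and it reports coarse kinds rather than file names.

```python
import os
import resource
import stat
from collections import Counter


def classify_open_fds(max_fd: int = 4096) -> Counter:
    """Count this process's open FDs by kind (FIFO, REG, SOCK, CHR, DIR, OTHER)."""
    soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Assumption: scanning up to max_fd is enough; raise it if your limit is higher.
    upper = max_fd if soft == resource.RLIM_INFINITY else min(soft, max_fd)
    kinds = Counter()
    for fd in range(upper):
        try:
            mode = os.fstat(fd).st_mode
        except OSError:
            continue  # fd is not open
        if stat.S_ISFIFO(mode):
            kinds["FIFO"] += 1      # pipes, e.g. from stdout=PIPE
        elif stat.S_ISREG(mode):
            kinds["REG"] += 1       # regular files
        elif stat.S_ISSOCK(mode):
            kinds["SOCK"] += 1
        elif stat.S_ISCHR(mode):
            kinds["CHR"] += 1       # terminals, /dev/null
        elif stat.S_ISDIR(mode):
            kinds["DIR"] += 1
        else:
            kinds["OTHER"] += 1
    return kinds
```

Logging classify_open_fds() next to the total count shows which kind is growing, which is often enough to tell a pipe leak from a forgotten open().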

Fix patterns

Pattern 1: use Popen as a context manager with a with statement

def run_task(cmd: list[str]) -> str:
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE) as proc:
        output, error = proc.communicate()
        return output.decode()

With a with statement, proc.stdout and proc.stderr are closed when the block exits, and the process is waited for.
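The closing behavior is easy to observe. The sketch below reads the pipe manually instead of calling communicate(), so that the close seen afterwards comes from leaving the with block; the child command is a hypothetical one-liner used only for the demonstration.

```python
import subprocess
import sys

# Hypothetical child command used only for this demonstration.
cmd = [sys.executable, "-c", "print('done')"]

with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
    output = proc.stdout.read()
    closed_inside = proc.stdout.closed   # still open inside the block

# Leaving the block closed the pipe and waited for the child.
closed_after = proc.stdout.closed
exit_code = proc.returncode
```

Note that this only protects the Popen object's own pipes. A stream copied into a long-lived container is closed as well, which means later reads from that copy fail instead of leaking.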

Pattern 2: use communicate()

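def run_task(cmd: list[str]) -> str:
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    try:
        output, error = proc.communicate(timeout=60)
    except subprocess.TimeoutExpired:
        proc.kill()         # communicate() does not kill the child on timeout
        proc.communicate()  # reap the child and drain the pipes
        raise
    finally:
        if proc.stdout: proc.stdout.close()
        if proc.stderr: proc.stderr.close()
    return output.decode()

On timeout, communicate() raises subprocess.TimeoutExpired and leaves the child running with its pipes open. The child therefore has to be killed and reaped, and the finally block closes the pipes on every path.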
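When the output is only needed after the child finishes, subprocess.run() covers the same ground with less code: according to the subprocess documentation, it kills the child and waits for it when the timeout expires, then re-raises TimeoutExpired. A minimal sketch, with the 60-second default carried over from the pattern above:

```python
import subprocess
import sys


def run_task(cmd: list[str], timeout: float = 60.0) -> str:
    """Run cmd and return its stdout as text; raises TimeoutExpired on timeout."""
    # subprocess.run() creates the pipes, reads them, and cleans up the child,
    # so no stream object ever escapes into the caller's variables.
    result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    return result.stdout.decode()


if __name__ == "__main__":
    print(run_task([sys.executable, "-c", "print('ok')"]))
```

Popen is still the right tool when output has to be streamed while the child runs; for run-and-collect tasks this form leaves fewer places for an FD to escape.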

Pattern 3: store only the result

# Bad: stores the file object
recent_outputs.append({"stream": proc.stdout})

# Good: stores what was read
output_bytes = proc.stdout.read()
recent_outputs.append({"output": output_bytes})

Another trap: reassigning sys.stdout

One more trap that is quietly easy to step on is reassigning sys.stdout/sys.stderr.

A pattern that leaks

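import io
import sys

def capture_output():
    old_stdout = sys.stdout
    sys.stdout = io.StringIO()
    try:
        noisy_function()
        captured = sys.stdout.getvalue()
    finally:
        sys.stdout = old_stdout  # ← only swaps the reference back; nothing is closed here
    return captured

This works on its own. Note, though, that reassigning sys.stdout only swaps a Python object; FD 1 is untouched. A subprocess started inside this function still inherits the real FD 1, so its output bypasses the StringIO (and passing stdout=sys.stdout explicitly fails, because StringIO has no fileno()). FDs come into play when the replacement is a real file, for example open("run.log", "w"): putting the old object back does not close that file, and a child process that was handed its FD keeps the underlying file open for as long as the child is alive.

The safe way to write it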

from contextlib import redirect_stdout
import io

def capture_output():
    buf = io.StringIO()
    with redirect_stdout(buf):
        noisy_function()
    return buf.getvalue()

contextlib.redirect_stdout restores the original stdout reliably, even when noisy_function() raises. Avoid rewriting sys.stdout by hand.
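One caveat: the contextlib documentation notes that redirect_stdout has no effect on the output of subprocesses, since it only replaces the Python-level sys.stdout. If the noisy part is a child process, capturing at the subprocess level avoids touching sys.stdout at all. A minimal sketch; capture_child_output is a hypothetical helper name:

```python
import subprocess
import sys


def capture_child_output(cmd: list[str]) -> str:
    """Return the child's stdout as text without touching sys.stdout."""
    # capture_output=True gives the child its own pipes; subprocess.run()
    # reads and closes them before returning.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    print(capture_child_output([sys.executable, "-c", "print('from child')"]))
```

With check=True a non-zero exit raises CalledProcessError, which may or may not be what a given pipeline wants; drop it if failures should be handled through the return code.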

Catching FD leaks early in CI/CD

Checking FD counts with pytest

import pytest

# count_open_fds() is the helper from the simple monitor above

@pytest.fixture(autouse=True)
def check_fd_leak():
    """Compare FD counts before and after each test to detect leaks."""
    before = count_open_fds()
    yield
    after = count_open_fds()
    if after - before > 5:  # allow some slack
        pytest.fail(f"FD leak detected: {before} → {after}")

Running this against integration tests that contain loops helps pinpoint the code path that leaks.

Deliberately lowering the limit with the resource module

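import resource

# Lower the FD soft limit only while running tests.
# The hard limit is left as is (assumed to be at least 128) so the soft limit can be raised again.
_, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (128, hard))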

This makes it possible to reproduce Too many open files in the test environment before it shows up in production.
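To keep the lowered limit from affecting the rest of a test session, the change can be scoped to a block. The sketch below is one way to do that; lowered_fd_limit is a hypothetical helper, and it deliberately leaves the hard limit alone so that the original soft limit can be restored.

```python
import resource
from contextlib import contextmanager


@contextmanager
def lowered_fd_limit(soft_limit: int = 128):
    """Temporarily lower the soft RLIMIT_NOFILE, restoring it on exit."""
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = soft_limit
    # Never ask for more than the hard limit allows.
    if hard != resource.RLIM_INFINITY:
        new_soft = min(new_soft, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    try:
        yield new_soft
    finally:
        # The hard limit was not changed, so the old soft limit can be put back.
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
```

A test can then wrap only the loop under suspicion, for example with lowered_fd_limit(128): followed by a few hundred iterations of the task.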

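Lessons learned in practice

1. Use subprocess with a with statement

The official documentation supports Popen as a context manager and recommends subprocess.run() wherever it fits. Rewrite older code when you come across it.

2. Store the content, not the file object

Putting a file object in a variable "in case it is needed later" is a leak waiting to happen once that variable outlives the call. Store what was read (bytes/str) instead.

3. FD leaks only show up over time

Unit tests often fail to catch them. Keep at least one scenario test that runs a long loop.

4. macOS and Linux behave differently

The default value of ulimit -n, whether /proc/<PID>/fd exists, and other details differ by OS. A bug that shows up on production Linux but never on a local macOS machine is a typical example.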

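Summary

Symptom	Investigation	Fix
`Too many open files`	Check FD types with `lsof -p <PID>`	Rewrite as `with subprocess.Popen(...)`
Many PIPE entries	Check whether `stdout`/`stderr` are held in variables	Close automatically with `communicate()` or `with`
Many REG entries	`open()` calls that are never closed	Standardize on `with open(...)`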

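FD leaks are the kind of bug that progresses quietly and then blows up in production one day. Log the FD count periodically and verify FD deltas in tests; with both in place, leaks are caught early.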