Dynamically Adjusting the perf Sampling Frequency

perf, CPU overhead, processes

Published: 2025-07-23

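I. Overview of the ksamplingd Thread

A new kernel thread, ksamplingd, is created to sample hardware performance events. It:

- reads samples from the Perf Events ring buffer
- processes the sample data
- dynamically adjusts the sampling frequency according to its own measured CPU usage
This article skips the sampling and sample-processing parts, because sampling was analyzed in detail in an earlier blog post (TODO). The parameter being tuned here is the sample_period field of struct perf_event_attr.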

II. Thread Initialization

1. Pinning the thread to CPUs

const struct cpumask *cpumask = cpumask_of_node(0);
if (!cpumask_empty(cpumask))
	do_set_cpus_allowed(access_sampling, cpumask);

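cpumask_of_node(0) returns the mask of all CPUs on node 0.
do_set_cpus_allowed() then pins the sampling thread to those CPUs. Whether to do this depends on your needs: in the author's scenario only the hot and cold data of socket 0 matter, and although socket 0 exposes two NUMA nodes, its physical cores all belong to NUMA node 0.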

2. Initializing the time variables

trace_runtime = total_runtime = exec_runtime = t->se.sum_exec_runtime;
trace_cputime = total_cputime = elapsed_cputime = jiffies;
sleep_timeout = usecs_to_jiffies(2000);

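t->se.sum_exec_runtime: a field of the struct sched_entity embedded in task_struct. It records the total CPU time the task has consumed so far (in ns) and is maintained by the scheduler as part of its CFS accounting for priority and time allocation.
jiffies: the kernel tick counter
sleep_timeout: the minimum interval between sampling loops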
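
The 2000 us passed to usecs_to_jiffies() is converted to whole ticks, so the real sleep length depends on the kernel's HZ setting. The following is a minimal userspace model of that conversion, not kernel code. It assumes the conversion rounds up and that HZ divides 1,000,000. The helper names and the hz parameter are hypothetical; in the kernel, HZ is a compile-time constant.

```c
#include <stdint.h>

/*
 * Userspace model of a round-up microseconds-to-jiffies conversion.
 * hz is passed in explicitly for illustration (hypothetical parameter).
 * Assumes hz divides 1,000,000 evenly.
 */
static uint64_t model_usecs_to_jiffies(uint64_t usecs, uint64_t hz)
{
	uint64_t usecs_per_jiffy = 1000000ULL / hz;

	/* Round up so that a non-zero timeout never becomes zero ticks. */
	return (usecs + usecs_per_jiffy - 1) / usecs_per_jiffy;
}

/* Length of the sleep in microseconds once it has been rounded to ticks. */
static uint64_t model_effective_sleep_us(uint64_t usecs, uint64_t hz)
{
	return model_usecs_to_jiffies(usecs, hz) * (1000000ULL / hz);
}
```

Under these assumptions, a 2000 us request is 2 ticks at HZ=1000 but a single 4 ms tick at HZ=250, so the "2 ms" interval is a lower bound rather than an exact value.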

III. Computing CPU Usage

1. Time deltas

exec_runtime = t->se.sum_exec_runtime - exec_runtime; // ns
elapsed_cputime = jiffies_to_usecs(cur - elapsed_cputime); // us

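exec_runtime: CPU time consumed by the sampling thread during this period
elapsed_cputime: wall-clock time elapsed during this period

2. Usage calculation

Mind the units: exec_runtime is in ns and elapsed_cputime is in us, so their quotient is already 1000 times the usage fraction, that is, permil. Converting ns to us first would make the integer division truncate to 0 for any usage below 100%.

u64 cur_cputime = div64_u64(exec_runtime, elapsed_cputime); // ns / us = permil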

This yields cputime:

The unit is permil (per thousand, so the value is ten times the percentage).
For example, cputime = 300 means 30% CPU usage.
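
A worked example of the permil arithmetic may help. The sketch below is a userspace illustration, not the thread's code: cpu_permil is a hypothetical helper, and the zero-length guard is an addition for safety.

```c
#include <stdint.h>

/*
 * CPU usage in permil from a CPU-time delta in nanoseconds and a
 * wall-clock delta in microseconds. Because 1 us = 1000 ns, dividing
 * ns by us directly gives (fraction * 1000), i.e. permil.
 */
static uint64_t cpu_permil(uint64_t exec_ns, uint64_t elapsed_us)
{
	if (elapsed_us == 0)
		return 0; /* no time elapsed: report zero rather than divide by zero */

	return exec_ns / elapsed_us;
}
```

For instance, 300,000,000 ns of CPU time over 1,000,000 us of wall-clock time gives 300 permil (30%), and 450,000,000 ns over a 15 s window (15,000,000 us) gives 30 permil (3%).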

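IV. Dynamically Adjusting the Sampling Frequency

1. Adjustment rules

Key parameters:
ksampled_soft_cpu_quota: the target upper limit on CPU usage (in permil)
cputime: the measured CPU usage
sample_period: the current LLC sampling period (controls the sampling interval)
sample_inst_period: the current instruction sampling period
pcount: the upper bound, defined as a macro

Increasing the sampling period (lowering the sampling frequency):
When cputime exceeds ksampled_soft_cpu_quota + 5 (the margin of 5 permil prevents constant readjustment) and sample_period has not reached the upper bound pcount, increase_sample_period is called to lengthen the sampling period and thereby lower the sampling frequency. If the period changed, the new period is applied.

Decreasing the sampling period (raising the sampling frequency):
When cputime falls below ksampled_soft_cpu_quota - 5 and sample_period is non-zero (it has not reached its lower bound), decrease_sample_period is called to shorten the sampling period and raise the sampling frequency. Likewise, if the period changed, the PEBS sampler configuration must be updated.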
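
The two rules form a dead band around the quota. The following userspace sketch shows only that decision logic and makes several assumptions: decide_period_action, the enum, and max_idx are hypothetical names, and the guard against unsigned underflow when the quota is below the margin is an addition.

```c
#include <stdint.h>

enum period_action {
	PERIOD_DECREASE = -1, /* shorten the period: sample more often */
	PERIOD_KEEP = 0,      /* inside the dead band: leave it alone */
	PERIOD_INCREASE = 1   /* lengthen the period: sample less often */
};

#define QUOTA_MARGIN 5 /* permil of slack on each side of the quota */

/*
 * Decide how to move the period index.
 * cputime and quota are in permil; idx is the current index and
 * max_idx the largest usable index (hypothetical parameter).
 */
static enum period_action decide_period_action(uint64_t cputime, uint64_t quota,
					       unsigned long idx, unsigned long max_idx)
{
	if (cputime > quota + QUOTA_MARGIN && idx < max_idx)
		return PERIOD_INCREASE;

	/* quota > margin avoids unsigned wrap-around in the subtraction */
	if (quota > QUOTA_MARGIN && cputime < quota - QUOTA_MARGIN && idx > 0)
		return PERIOD_DECREASE;

	return PERIOD_KEEP;
}
```

With a quota of 30 permil, measurements between 25 and 35 leave the period untouched, which is what keeps the controller from oscillating on small fluctuations.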

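2. Implementation details of the adjustment functions

Linear adjustment: each call raises or lowers the sampling period by exactly one step and never goes past the predefined bounds pcount and pinstcount. It responds slowly but is stable enough.
Table lookup: a predefined, sorted array in which each entry is a period allows non-linear adjustment, which suits scenarios where particular sampling periods have been chosen deliberately.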
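
The two ideas combine naturally: step an index linearly and let a sorted table map it to a non-linear period. The sketch below is a userspace illustration under stated assumptions. The table values and the names period_table, step_up, step_down, and lookup_period are all hypothetical and are not taken from the original implementation.

```c
#include <stddef.h>

/* Hypothetical sorted period table: a larger index means a longer period. */
static const unsigned long period_table[] = { 199, 293, 421, 613, 907, 1301 };

#define PERIOD_COUNT (sizeof(period_table) / sizeof(period_table[0]))

/* Move one step towards longer periods, stopping at the last entry. */
static void step_up(unsigned long *idx)
{
	if (*idx + 1 < PERIOD_COUNT)
		(*idx)++;
}

/* Move one step towards shorter periods, stopping at the first entry. */
static void step_down(unsigned long *idx)
{
	if (*idx > 0)
		(*idx)--;
}

/* Translate an index into the period to program; out-of-range indexes are clamped. */
static unsigned long lookup_period(unsigned long idx)
{
	if (idx >= PERIOD_COUNT)
		idx = PERIOD_COUNT - 1;
	return period_table[idx];
}
```

Because the index moves one step per evaluation, the caller can compare the index before and after the call to see whether the hardware period needs to be reprogrammed at all.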

V. Thread Sleep Mechanism

schedule_timeout_interruptible(sleep_timeout);


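- Sleeping controls how often the thread wakes up
- Together with the sampling period, it avoids both busy-waiting and overrunning the time budget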

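	unsigned long sleep_timeout = usecs_to_jiffies(2000); /* sleep 2 ms after each sampling loop */
	unsigned long cpucap_period = msecs_to_jiffies(15000); /* re-evaluate the sampling overhead every 15 s for dynamic control */

	unsigned long sample_period = 0; /* index into the period table for ordinary events */
	unsigned long sample_inst_period = 0; /* index into the period table for instruction events */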

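	/*
	 * total_runtime  total CPU time consumed since the sampling thread started
	 * exec_runtime   CPU time actually consumed within the current period
	 * cputime        current CPU overhead of the sampling thread, in permil
	 */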
    u64 total_runtime, exec_runtime, cputime = 0;
	struct task_struct *t = current;
    total_runtime = exec_runtime = t->se.sum_exec_runtime;

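	/*
	 * The jiffies delta gives the wall-clock time elapsed (converted to
	 * microseconds), while the runtime is how many nanoseconds the thread
	 * actually ran on a CPU; their ratio is the real CPU usage.
	 * total_cputime    jiffies at thread start, for the total wall-clock time
	 * elapsed_cputime  jiffies at the last overhead evaluation, for the period delta
	 * cur              current jiffies
	 */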
	unsigned long total_cputime, elapsed_cputime, cur;
    total_cputime = elapsed_cputime = jiffies;

	const struct cpumask *cpumask;
    cpumask = cpumask_of_node(0);
    if (!cpumask_empty(cpumask))
		do_set_cpus_allowed(access_sampling, cpumask);

    while (!kthread_should_stop()) {
		int cpu, event, cond = false;
    
		for (cpu = 0; cpu < CPUS_PER_SOCKET; cpu++) {
			for (event = 0; event < N_ARTMEMEVENTS; event++) {
				do {
					
					/* the actual sample-processing logic goes here ... */

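					/*
					 * ksampled_max_sample_ratio  upper threshold on ring buffer fill (percent, e.g. 80)
					 * ksampled_min_sample_ratio  lower threshold on ring buffer fill (percent, e.g. 20)
					 * cond  whether to keep draining samples in a batch. Above 80%,
					 *       keep processing this event on this CPU until the fill
					 *       drops below 20%, then move on to the next event. cond
					 *       starts out false; when the fill is between 20% and 80%,
					 *       neither branch fires and cond keeps the value left by the
					 *       previous event: false means a single pass, true means
					 *       draining to below 20%. This is arbitrary and couples
					 *       state across events needlessly; a more local variable
					 *       would avoid it.
					 */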
					if (head > (BUFFER_SIZE * ksampled_max_sample_ratio / 100)) {
						cond = true;
					} else if (head < (BUFFER_SIZE * ksampled_min_sample_ratio / 100)) {
						cond = false;
					}

					/* the actual sample-processing logic goes here ... */
					
				} while (cond);	
	    	}
		}	
		
		if (!ksampled_soft_cpu_quota) /* no CPU quota set: skip the frequency update and the 2 ms sleep after each loop */
	    	continue;

		schedule_timeout_interruptible(sleep_timeout);

		cur = jiffies;
		if ((cur - elapsed_cputime) >= cpucap_period) {
	    	u64 cur_runtime = t->se.sum_exec_runtime;
	    	exec_runtime = cur_runtime - exec_runtime;
	    	elapsed_cputime = jiffies_to_usecs(cur - elapsed_cputime);
			/* The first measurement is stored as-is; later ones use an exponentially weighted moving average: 80% new data, 20% old data */
			if (cputime) {
				u64 cur_cputime = div64_u64(exec_runtime, elapsed_cputime);
				cputime = ((cur_cputime << 3) + (cputime << 1)) / 10;
			} else {
				cputime = div64_u64(exec_runtime, elapsed_cputime);
			}

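			/* Update the sampling frequency */
			/* pcount is the number of entries in the sampling-period table. This branch runs when the measured overhead exceeds the quota by more than 5 permil and the sampling frequency is not already at its lowest */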
	    	if (cputime > (ksampled_soft_cpu_quota + 5) && sample_period != pcount) {
				unsigned long tmp1 = sample_period, tmp2 = sample_inst_period;
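				/* A larger value means a longer interval between samples. (PEBS events differ by orders of magnitude in how often they occur, so two tables are used.) Fetch the matching values from the tables and apply them with the perf subsystem function perf_event_period */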
				increase_sample_period(&sample_period, &sample_inst_period);
				if (tmp1 != sample_period || tmp2 != sample_inst_period)
					pebs_update_period(get_sample_period(sample_period),get_sample_inst_period(sample_inst_period));
			} else if (cputime < (ksampled_soft_cpu_quota - 5) && sample_period) {
				unsigned long tmp1 = sample_period, tmp2 = sample_inst_period;
				decrease_sample_period(&sample_period, &sample_inst_period);
				if (tmp1 != sample_period || tmp2 != sample_inst_period)
						pebs_update_period(get_sample_period(sample_period),get_sample_inst_period(sample_inst_period));
			}
			elapsed_cputime = cur;
			exec_runtime = cur_runtime;
		}
    }

    total_runtime = (t->se.sum_exec_runtime) - total_runtime; // ns
    total_cputime = jiffies_to_usecs(jiffies - total_cputime); // us

    printk("total runtime: %llu ns, total cputime: %lu us, cpu usage: %llu\n", total_runtime, total_cputime, (total_runtime) / total_cputime);

    return 0;
}
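
The smoothing step described in the listing's comment (80% new data, 20% old data) can be isolated into a small function. This is a userspace sketch: ema_update is a hypothetical name, and treating a previous value of 0 as "no measurement yet" is an assumption taken from that comment.

```c
#include <stdint.h>

/*
 * Exponentially weighted moving average in integer arithmetic:
 * new = (8 * cur + 2 * prev) / 10, i.e. 80% new and 20% old.
 * A previous value of 0 is treated as "no measurement yet", so the
 * first sample is taken unchanged.
 */
static uint64_t ema_update(uint64_t prev, uint64_t cur)
{
	if (prev == 0)
		return cur;

	return ((cur << 3) + (prev << 1)) / 10;
}
```

The shifts multiply by 8 and 2 without a division until the final step, and the result is truncated towards zero: a previous value of 100 and a new measurement of 200 give 180.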