bfq-iosched.c source code [linux/block/bfq-iosched.c]

1	// SPDX-License-Identifier: GPL-2.0-or-later
2	/*
3	* Budget Fair Queueing (BFQ) I/O scheduler.
4	*
5	* Based on ideas and code from CFQ:
6	* Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
7	*
8	* Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
9	* Paolo Valente <paolo.valente@unimore.it>
10	*
11	* Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
12	* Arianna Avanzini <avanzini@google.com>
13	*
14	* Copyright (C) 2017 Paolo Valente <paolo.valente@linaro.org>
15	*
16	* BFQ is a proportional-share I/O scheduler, with some extra
17	* low-latency capabilities. BFQ also supports full hierarchical
18	* scheduling through cgroups. Next paragraphs provide an introduction
19	* on BFQ inner workings. Details on BFQ benefits, usage and
20	* limitations can be found in Documentation/block/bfq-iosched.rst.
21	*
22	* BFQ is a proportional-share storage-I/O scheduling algorithm based
23	* on the slice-by-slice service scheme of CFQ. But BFQ assigns
24	* budgets, measured in number of sectors, to processes instead of
25	* time slices. The device is not granted to the in-service process
26	* for a given time slice, but until it has exhausted its assigned
27	* budget. This change from the time to the service domain enables BFQ
28	* to distribute the device throughput among processes as desired,
29	* without any distortion due to throughput fluctuations, or to device
30	* internal queueing. BFQ uses an ad hoc internal scheduler, called
31	* B-WF2Q+, to schedule processes according to their budgets. More
32	* precisely, BFQ schedules queues associated with processes. Each
33	* process/queue is assigned a user-configurable weight, and B-WF2Q+
34	* guarantees that each queue receives a fraction of the throughput
35	* proportional to its weight. Thanks to the accurate policy of
36	* B-WF2Q+, BFQ can afford to assign high budgets to I/O-bound
37	* processes issuing sequential requests (to boost the throughput),
38	* and yet guarantee a low latency to interactive and soft real-time
39	* applications.
40	*
41	* In particular, to provide these low-latency guarantees, BFQ
42	* explicitly privileges the I/O of two classes of time-sensitive
43	* applications: interactive and soft real-time. In more detail, BFQ
44	* behaves this way if the low_latency parameter is set (default
45	* configuration). This feature enables BFQ to provide applications in
46	* these classes with a very low latency.
47	*
48	* To implement this feature, BFQ constantly tries to detect whether
49	* the I/O requests in a bfq_queue come from an interactive or a soft
50	* real-time application. For brevity, in these cases, the queue is
51	* said to be interactive or soft real-time. In both cases, BFQ
52	* privileges the service of the queue, over that of non-interactive
53	* and non-soft-real-time queues. This privileging is performed,
54	* mainly, by raising the weight of the queue. So, for brevity, we
55	* call just weight-raising periods the time periods during which a
56	* queue is privileged, because deemed interactive or soft real-time.
57	*
58	* The detection of soft real-time queues/applications is described in
59	* detail in the comments on the function
60	* bfq_bfqq_softrt_next_start. On the other hand, the detection of an
61	* interactive queue works as follows: a queue is deemed interactive
62	* if it is constantly non empty only for a limited time interval,
63	* after which it does become empty. The queue may be deemed
64	* interactive again (for a limited time), if it restarts being
65	* constantly non empty, provided that this happens only after the
66	* queue has remained empty for a given minimum idle time.
67	*
68	* By default, BFQ computes automatically the above maximum time
69	* interval, i.e., the time interval after which a constantly
70	* non-empty queue stops being deemed interactive. Since a queue is
71	* weight-raised while it is deemed interactive, this maximum time
72	* interval happens to coincide with the (maximum) duration of the
73	* weight-raising for interactive queues.
74	*
75	* Finally, BFQ also features additional heuristics for
76	* preserving both a low latency and a high throughput on NCQ-capable,
77	* rotational or flash-based devices, and to get the job done quickly
78	* for applications consisting in many I/O-bound processes.
79	*
80	* NOTE: if the main or only goal, with a given device, is to achieve
81	* the maximum-possible throughput at all times, then do switch off
82	* all low-latency heuristics for that device, by setting low_latency
83	* to 0.
84	*
85	* BFQ is described in [1], where also a reference to the initial,
86	* more theoretical paper on BFQ can be found. The interested reader
87	* can find in the latter paper full details on the main algorithm, as
88	* well as formulas of the guarantees and formal proofs of all the
89	* properties. With respect to the version of BFQ presented in these
90	* papers, this implementation adds a few more heuristics, such as the
91	* ones that guarantee a low latency to interactive and soft real-time
92	* applications, and a hierarchical extension based on H-WF2Q+.
93	*
94	* B-WF2Q+ is based on WF2Q+, which is described in [2], together with
95	* H-WF2Q+, while the augmented tree used here to implement B-WF2Q+
96	* with O(log N) complexity derives from the one introduced with EEVDF
97	* in [3].
98	*
99	* [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
100	* Scheduler", Proceedings of the First Workshop on Mobile System
101	* Technologies (MST-2015), May 2015.
102	* http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
103	*
104	* [2] Jon C.R. Bennett and H. Zhang, "Hierarchical Packet Fair Queueing
105	* Algorithms", IEEE/ACM Transactions on Networking, 5(5):675-689,
106	* Oct 1997.
107	*
108	* http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz
109	*
110	* [3] I. Stoica and H. Abdel-Wahab, "Earliest Eligible Virtual Deadline
111	* First: A Flexible and Accurate Mechanism for Proportional Share
112	* Resource Allocation", technical report.
113	*
114	* http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf
115	*/
116	#include <linux/module.h>
117	#include <linux/slab.h>
118	#include <linux/blkdev.h>
119	#include <linux/cgroup.h>
120	#include <linux/ktime.h>
121	#include <linux/rbtree.h>
122	#include <linux/ioprio.h>
123	#include <linux/sbitmap.h>
124	#include <linux/delay.h>
125	#include <linux/backing-dev.h>
126
127	#include <trace/events/block.h>
128
129	#include "elevator.h"
130	#include "blk.h"
131	#include "blk-mq.h"
132	#include "blk-mq-sched.h"
133	#include "bfq-iosched.h"
134	#include "blk-wbt.h"
135
136	#define BFQ_BFQQ_FNS(name) \
137	void bfq_mark_bfqq_##name(struct bfq_queue *bfqq) \
138	{ \
139	__set_bit(BFQQF_##name, &(bfqq)->flags); \
140	} \
141	void bfq_clear_bfqq_##name(struct bfq_queue *bfqq) \
142	{ \
143	__clear_bit(BFQQF_##name, &(bfqq)->flags); \
144	} \
145	int bfq_bfqq_##name(const struct bfq_queue *bfqq) \
146	{ \
147	return test_bit(BFQQF_##name, &(bfqq)->flags); \
148	}
149
150	BFQ_BFQQ_FNS(just_created);
151	BFQ_BFQQ_FNS(busy);
152	BFQ_BFQQ_FNS(wait_request);
153	BFQ_BFQQ_FNS(non_blocking_wait_rq);
154	BFQ_BFQQ_FNS(fifo_expire);
155	BFQ_BFQQ_FNS(has_short_ttime);
156	BFQ_BFQQ_FNS(sync);
157	BFQ_BFQQ_FNS(IO_bound);
158	BFQ_BFQQ_FNS(in_large_burst);
159	BFQ_BFQQ_FNS(coop);
160	BFQ_BFQQ_FNS(split_coop);
161	BFQ_BFQQ_FNS(softrt_update);
162	#undef BFQ_BFQQ_FNS \
163
164	/ Expiration time of async (0) and sync (1) requests, in ns. /
165	static const u64 bfq_fifo_expire[`2`] = { NSEC_PER_SEC / `4`, NSEC_PER_SEC / `8` };
166
167	/ Maximum backwards seek (magic number lifted from CFQ), in KiB. /
168	static const int bfq_back_max = `16` * `1024`;
169
170	/ Penalty of a backwards seek, in number of sectors. /
171	static const int bfq_back_penalty = `2`;
172
173	/ Idling period duration, in ns. /
174	static u64 bfq_slice_idle = NSEC_PER_SEC / `125`;
175
176	/ Minimum number of assigned budgets for which stats are safe to compute. /
177	static const int bfq_stats_min_budgets = `194`;
178
179	/ Default maximum budget values, in sectors and number of requests. /
180	static const int bfq_default_max_budget = `16` * `1024`;
181
182	/*
183	* When a sync request is dispatched, the queue that contains that
184	* request, and all the ancestor entities of that queue, are charged
185	* with the number of sectors of the request. In contrast, if the
186	* request is async, then the queue and its ancestor entities are
187	* charged with the number of sectors of the request, multiplied by
188	* the factor below. This throttles the bandwidth for async I/O,
189	* w.r.t. to sync I/O, and it is done to counter the tendency of async
190	* writes to steal I/O throughput to reads.
191	*
192	* The current value of this parameter is the result of a tuning with
193	* several hardware and software configurations. We tried to find the
194	* lowest value for which writes do not cause noticeable problems to
195	* reads. In fact, the lower this parameter, the stabler I/O control,
196	* in the following respect. The lower this parameter is, the less
197	* the bandwidth enjoyed by a group decreases
198	* - when the group does writes, w.r.t. to when it does reads;
199	* - when other groups do reads, w.r.t. to when they do writes.
200	*/
201	static const int bfq_async_charge_factor = `3`;
202
203	/ Default timeout values, in jiffies, approximating CFQ defaults. /
204	const int bfq_timeout = HZ / `8`;
205
206	/*
207	* Time limit for merging (see comments in bfq_setup_cooperator). Set
208	* to the slowest value that, in our tests, proved to be effective in
209	* removing false positives, while not causing true positives to miss
210	* queue merging.
211	*
212	* As can be deduced from the low time limit below, queue merging, if
213	* successful, happens at the very beginning of the I/O of the involved
214	* cooperating processes, as a consequence of the arrival of the very
215	* first requests from each cooperator. After that, there is very
216	* little chance to find cooperators.
217	*/
218	static const unsigned long bfq_merge_time_limit = HZ/`10`;
219
220	static struct kmem_cache *bfq_pool;
221
222	/ Below this threshold (in ns), we consider thinktime immediate. /
223	#define BFQ_MIN_TT (2 * NSEC_PER_MSEC)
224
225	/ hw_tag detection: parallel requests threshold and min samples needed. /
226	#define BFQ_HW_QUEUE_THRESHOLD 3
227	#define BFQ_HW_QUEUE_SAMPLES 32
228
229	#define BFQQ_SEEK_THR (sector_t)(8 * 100)
230	#define BFQQ_SECT_THR_NONROT (sector_t)(2 * 32)
231	#define BFQ_RQ_SEEKY(bfqd, last_pos, rq) \
232	(get_sdist(last_pos, rq) > \
233	BFQQ_SEEK_THR && \
234	(!blk_queue_nonrot(bfqd->queue) \|\| \
235	blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT))
236	#define BFQQ_CLOSE_THR (sector_t)(8 * 1024)
237	#define BFQQ_SEEKY(bfqq) (hweight32(bfqq->seek_history) > 19)
238	/*
239	* Sync random I/O is likely to be confused with soft real-time I/O,
240	* because it is characterized by limited throughput and apparently
241	* isochronous arrival pattern. To avoid false positives, queues
242	* containing only random (seeky) I/O are prevented from being tagged
243	* as soft real-time.
244	*/
245	#define BFQQ_TOTALLY_SEEKY(bfqq) (bfqq->seek_history == -1)
246
247	/ Min number of samples required to perform peak-rate update /
248	#define BFQ_RATE_MIN_SAMPLES 32
249	/ Min observation time interval required to perform a peak-rate update (ns) /
250	#define BFQ_RATE_MIN_INTERVAL (300*NSEC_PER_MSEC)
251	/ Target observation time interval for a peak-rate update (ns) /
252	#define BFQ_RATE_REF_INTERVAL NSEC_PER_SEC
253
254	/*
255	* Shift used for peak-rate fixed precision calculations.
256	* With
257	* - the current shift: 16 positions
258	* - the current type used to store rate: u32
259	* - the current unit of measure for rate: [sectors/usec], or, more precisely,
260	* [(sectors/usec) / 2^BFQ_RATE_SHIFT] to take into account the shift,
261	* the range of rates that can be stored is
262	* [1 / 2^BFQ_RATE_SHIFT, 2^(32 - BFQ_RATE_SHIFT)] sectors/usec =
263	* [1 / 2^16, 2^16] sectors/usec = [15e-6, 65536] sectors/usec =
264	* [15, 65G] sectors/sec
265	* Which, assuming a sector size of 512B, corresponds to a range of
266	* [7.5K, 33T] B/sec
267	*/
268	#define BFQ_RATE_SHIFT 16
269
270	/*
271	* When configured for computing the duration of the weight-raising
272	* for interactive queues automatically (see the comments at the
273	* beginning of this file), BFQ does it using the following formula:
274	* duration = (ref_rate / r) * ref_wr_duration,
275	* where r is the peak rate of the device, and ref_rate and
276	* ref_wr_duration are two reference parameters. In particular,
277	* ref_rate is the peak rate of the reference storage device (see
278	* below), and ref_wr_duration is about the maximum time needed, with
279	* BFQ and while reading two files in parallel, to load typical large
280	* applications on the reference device (see the comments on
281	* max_service_from_wr below, for more details on how ref_wr_duration
282	* is obtained). In practice, the slower/faster the device at hand
283	* is, the more/less it takes to load applications with respect to the
284	* reference device. Accordingly, the longer/shorter BFQ grants
285	* weight raising to interactive applications.
286	*
287	* BFQ uses two different reference pairs (ref_rate, ref_wr_duration),
288	* depending on whether the device is rotational or non-rotational.
289	*
290	* In the following definitions, ref_rate[0] and ref_wr_duration[0]
291	* are the reference values for a rotational device, whereas
292	* ref_rate[1] and ref_wr_duration[1] are the reference values for a
293	* non-rotational device. The reference rates are not the actual peak
294	* rates of the devices used as a reference, but slightly lower
295	* values. The reason for using slightly lower values is that the
296	* peak-rate estimator tends to yield slightly lower values than the
297	* actual peak rate (it can yield the actual peak rate only if there
298	* is only one process doing I/O, and the process does sequential
299	* I/O).
300	*
301	* The reference peak rates are measured in sectors/usec, left-shifted
302	* by BFQ_RATE_SHIFT.
303	*/
304	static int ref_rate[`2`] = {`14000`, `33000`};
305	/*
306	* To improve readability, a conversion function is used to initialize
307	* the following array, which entails that the array can be
308	* initialized only in a function.
309	*/
310	static int ref_wr_duration[`2`];
311
312	/*
313	* BFQ uses the above-detailed, time-based weight-raising mechanism to
314	* privilege interactive tasks. This mechanism is vulnerable to the
315	* following false positives: I/O-bound applications that will go on
316	* doing I/O for much longer than the duration of weight
317	* raising. These applications have basically no benefit from being
318	* weight-raised at the beginning of their I/O. On the opposite end,
319	* while being weight-raised, these applications
320	* a) unjustly steal throughput to applications that may actually need
321	* low latency;
322	* b) make BFQ uselessly perform device idling; device idling results
323	* in loss of device throughput with most flash-based storage, and may
324	* increase latencies when used purposelessly.
325	*
326	* BFQ tries to reduce these problems, by adopting the following
327	* countermeasure. To introduce this countermeasure, we need first to
328	* finish explaining how the duration of weight-raising for
329	* interactive tasks is computed.
330	*
331	* For a bfq_queue deemed as interactive, the duration of weight
332	* raising is dynamically adjusted, as a function of the estimated
333	* peak rate of the device, so as to be equal to the time needed to
334	* execute the 'largest' interactive task we benchmarked so far. By
335	* largest task, we mean the task for which each involved process has
336	* to do more I/O than for any of the other tasks we benchmarked. This
337	* reference interactive task is the start-up of LibreOffice Writer,
338	* and in this task each process/bfq_queue needs to have at most ~110K
339	* sectors transferred.
340	*
341	* This last piece of information enables BFQ to reduce the actual
342	* duration of weight-raising for at least one class of I/O-bound
343	* applications: those doing sequential or quasi-sequential I/O. An
344	* example is file copy. In fact, once started, the main I/O-bound
345	* processes of these applications usually consume the above 110K
346	* sectors in much less time than the processes of an application that
347	* is starting, because these I/O-bound processes will greedily devote
348	* almost all their CPU cycles only to their target,
349	* throughput-friendly I/O operations. This is even more true if BFQ
350	* happens to be underestimating the device peak rate, and thus
351	* overestimating the duration of weight raising. But, according to
352	* our measurements, once transferred 110K sectors, these processes
353	* have no right to be weight-raised any longer.
354	*
355	* Basing on the last consideration, BFQ ends weight-raising for a
356	* bfq_queue if the latter happens to have received an amount of
357	* service at least equal to the following constant. The constant is
358	* set to slightly more than 110K, to have a minimum safety margin.
359	*
360	* This early ending of weight-raising reduces the amount of time
361	* during which interactive false positives cause the two problems
362	* described at the beginning of these comments.
363	*/
364	static const unsigned long max_service_from_wr = `120000`;
365
366	/*
367	* Maximum time between the creation of two queues, for stable merge
368	* to be activated (in ms)
369	*/
370	static const unsigned long bfq_activation_stable_merging = `600`;
371	/*
372	* Minimum time to be waited before evaluating delayed stable merge (in ms)
373	*/
374	static const unsigned long bfq_late_stable_merging = `600`;
375
376	#define RQ_BIC(rq) ((struct bfq_io_cq *)((rq)->elv.priv[0]))
377	#define RQ_BFQQ(rq) ((rq)->elv.priv[1])
378
379	struct bfq_queue bic_to_bfqq(struct* bfq_io_cq *bic, bool is_sync,
380	unsigned int actuator_idx)
381	{
382	if (is_sync)
383	return bic->bfqq[`1`][actuator_idx];
384
385	return bic->bfqq[`0`][actuator_idx];
386	}
387
388	static void bfq_put_stable_ref(struct bfq_queue *bfqq);
389
390	void bic_set_bfqq(struct bfq_io_cq *bic,
391	struct bfq_queue *bfqq,
392	bool is_sync,
393	unsigned int actuator_idx)
394	{
395	struct bfq_queue *old_bfqq = bic->bfqq[is_sync][actuator_idx];
396
397	/*
398	* If bfqq != NULL, then a non-stable queue merge between
399	* bic->bfqq and bfqq is happening here. This causes troubles
400	* in the following case: bic->bfqq has also been scheduled
401	* for a possible stable merge with bic->stable_merge_bfqq,
402	* and bic->stable_merge_bfqq == bfqq happens to
403	* hold. Troubles occur because bfqq may then undergo a split,
404	* thereby becoming eligible for a stable merge. Yet, if
405	* bic->stable_merge_bfqq points exactly to bfqq, then bfqq
406	* would be stably merged with itself. To avoid this anomaly,
407	* we cancel the stable merge if
408	* bic->stable_merge_bfqq == bfqq.
409	*/
410	struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data[actuator_idx];
411
412	/ Clear bic pointer if bfqq is detached from this bic /
413	if (old_bfqq && old_bfqq->bic == bic)
414	old_bfqq->bic = NULL;
415
416	if (is_sync)
417	bic->bfqq[`1`][actuator_idx] = bfqq;
418	else
419	bic->bfqq[`0`][actuator_idx] = bfqq;
420
421	if (bfqq && bfqq_data->stable_merge_bfqq == bfqq) {
422	/*
423	* Actually, these same instructions are executed also
424	* in bfq_setup_cooperator, in case of abort or actual
425	* execution of a stable merge. We could avoid
426	* repeating these instructions there too, but if we
427	* did so, we would nest even more complexity in this
428	* function.
429	*/
430	bfq_put_stable_ref(bfqq: bfqq_data->stable_merge_bfqq);
431
432	bfqq_data->stable_merge_bfqq = NULL;
433	}
434	}
435
436	struct bfq_data bic_to_bfqd(struct* bfq_io_cq *bic)
437	{
438	return bic->icq.q->elevator->elevator_data;
439	}
440
441	/**
442	* icq_to_bic - convert iocontext queue structure to bfq_io_cq.
443	* @icq: the iocontext queue.
444	*/
445	static struct bfq_io_cq icq_to_bic(struct* io_cq *icq)
446	{
447	/ bic->icq is the first member, %NULL will convert to %NULL /
448	return container_of(icq, struct bfq_io_cq, icq);
449	}
450
451	/**
452	* bfq_bic_lookup - search into @ioc a bic associated to @bfqd.
453	* @q: the request queue.
454	*/
455	static struct bfq_io_cq bfq_bic_lookup(struct* request_queue *q)
456	{
457	struct bfq_io_cq *icq;
458	unsigned long flags;
459
460	if (!current->io_context)
461	return NULL;
462
463	spin_lock_irqsave(&q->queue_lock, flags);
464	icq = icq_to_bic(icq: ioc_lookup_icq(q));
465	spin_unlock_irqrestore(lock: &q->queue_lock, flags);
466
467	return icq;
468	}
469
470	/*
471	* Scheduler run of queue, if there are requests pending and no one in the
472	* driver that will restart queueing.
473	*/
474	void bfq_schedule_dispatch(struct bfq_data *bfqd)
475	{
476	lockdep_assert_held(&bfqd->lock);
477
478	if (bfqd->queued != `0`) {
479	bfq_log(bfqd, "schedule dispatch");
480	blk_mq_run_hw_queues(q: bfqd->queue, async: true);
481	}
482	}
483
484	#define bfq_class_idle(bfqq) ((bfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
485
486	#define bfq_sample_valid(samples) ((samples) > 80)
487
488	/*
489	* Lifted from AS - choose which of rq1 and rq2 that is best served now.
490	* We choose the request that is closer to the head right now. Distance
491	* behind the head is penalized and only allowed to a certain extent.
492	*/
493	static struct request bfq_choose_req(struct* bfq_data *bfqd,
494	struct request *rq1,
495	struct request *rq2,
496	sector_t last)
497	{
498	sector_t s1, s2, d1 = `0`, d2 = `0`;
499	unsigned long back_max;
500	#define BFQ_RQ1_WRAP 0x01 /* request 1 wraps */
501	#define BFQ_RQ2_WRAP 0x02 /* request 2 wraps */
502	unsigned int wrap = `0`; / bit mask: requests behind the disk head? /
503
504	if (!rq1 \|\| rq1 == rq2)
505	return rq2;
506	if (!rq2)
507	return rq1;
508
509	if (rq_is_sync(rq: rq1) && !rq_is_sync(rq: rq2))
510	return rq1;
511	else if (rq_is_sync(rq: rq2) && !rq_is_sync(rq: rq1))
512	return rq2;
513	if ((rq1->cmd_flags & REQ_META) && !(rq2->cmd_flags & REQ_META))
514	return rq1;
515	else if ((rq2->cmd_flags & REQ_META) && !(rq1->cmd_flags & REQ_META))
516	return rq2;
517
518	s1 = blk_rq_pos(rq: rq1);
519	s2 = blk_rq_pos(rq: rq2);
520
521	/*
522	* By definition, 1KiB is 2 sectors.
523	*/
524	back_max = bfqd->bfq_back_max * `2`;
525
526	/*
527	* Strict one way elevator _except_ in the case where we allow
528	* short backward seeks which are biased as twice the cost of a
529	* similar forward seek.
530	*/
531	if (s1 >= last)
532	d1 = s1 - last;
533	else if (s1 + back_max >= last)
534	d1 = (last - s1) * bfqd->bfq_back_penalty;
535	else
536	wrap \|= BFQ_RQ1_WRAP;
537
538	if (s2 >= last)
539	d2 = s2 - last;
540	else if (s2 + back_max >= last)
541	d2 = (last - s2) * bfqd->bfq_back_penalty;
542	else
543	wrap \|= BFQ_RQ2_WRAP;
544
545	/ Found required data /
546
547	/*
548	* By doing switch() on the bit mask "wrap" we avoid having to
549	* check two variables for all permutations: --> faster!
550	*/
551	switch (wrap) {
552	case `0`: / common case for CFQ: rq1 and rq2 not wrapped /
553	if (d1 < d2)
554	return rq1;
555	else if (d2 < d1)
556	return rq2;
557
558	if (s1 >= s2)
559	return rq1;
560	else
561	return rq2;
562
563	case BFQ_RQ2_WRAP:
564	return rq1;
565	case BFQ_RQ1_WRAP:
566	return rq2;
567	case BFQ_RQ1_WRAP\|BFQ_RQ2_WRAP: / both rqs wrapped /
568	default:
569	/*
570	* Since both rqs are wrapped,
571	* start with the one that's further behind head
572	* (--> only one back seek required),
573	* since back seek takes more time than forward.
574	*/
575	if (s1 <= s2)
576	return rq1;
577	else
578	return rq2;
579	}
580	}
581
582	#define BFQ_LIMIT_INLINE_DEPTH 16
583
584	#ifdef CONFIG_BFQ_GROUP_IOSCHED
585	static bool bfqq_request_over_limit(struct bfq_data *bfqd,
586	struct bfq_io_cq *bic, blk_opf_t opf,
587	unsigned int act_idx, int limit)
588	{
589	struct bfq_entity *inline_entities[BFQ_LIMIT_INLINE_DEPTH];
590	struct bfq_entity **entities = inline_entities;
591	int alloc_depth = BFQ_LIMIT_INLINE_DEPTH;
592	struct bfq_sched_data *sched_data;
593	struct bfq_entity *entity;
594	struct bfq_queue *bfqq;
595	unsigned long wsum;
596	bool ret = false;
597	int depth;
598	int level;
599
600	retry:
601	spin_lock_irq(lock: &bfqd->lock);
602	bfqq = bic_to_bfqq(bic, is_sync: op_is_sync(op: opf), actuator_idx: act_idx);
603	if (!bfqq)
604	goto out;
605
606	entity = &bfqq->entity;
607	if (!entity->on_st_or_in_serv)
608	goto out;
609
610	/ +1 for bfqq entity, root cgroup not included /
611	depth = bfqg_to_blkg(bfqg: bfqq_group(bfqq))->blkcg->css.cgroup->level + `1`;
612	if (depth > alloc_depth) {
613	spin_unlock_irq(lock: &bfqd->lock);
614	if (entities != inline_entities)
615	kfree(objp: entities);
616	entities = kmalloc_array(depth, sizeof(*entities), GFP_NOIO);
617	if (!entities)
618	return false;
619	alloc_depth = depth;
620	goto retry;
621	}
622
623	sched_data = entity->sched_data;
624	/ Gather our ancestors as we need to traverse them in reverse order /
625	level = `0`;
626	for_each_entity(entity) {
627	/*
628	* If at some level entity is not even active, allow request
629	* queueing so that BFQ knows there's work to do and activate
630	* entities.
631	*/
632	if (!entity->on_st_or_in_serv)
633	goto out;
634	/ Uh, more parents than cgroup subsystem thinks? /
635	if (WARN_ON_ONCE(level >= depth))
636	break;
637	entities[level++] = entity;
638	}
639	WARN_ON_ONCE(level != depth);
640	for (level--; level >= `0`; level--) {
641	entity = entities[level];
642	if (level > `0`) {
643	wsum = bfq_entity_service_tree(entity)->wsum;
644	} else {
645	int i;
646	/*
647	* For bfqq itself we take into account service trees
648	* of all higher priority classes and multiply their
649	* weights so that low prio queue from higher class
650	* gets more requests than high prio queue from lower
651	* class.
652	*/
653	wsum = `0`;
654	for (i = `0`; i <= bfqq->ioprio_class - `1`; i++) {
655	wsum = wsum * IOPRIO_BE_NR +
656	sched_data->service_tree[i].wsum;
657	}
658	}
659	if (!wsum)
660	continue;
661	limit = DIV_ROUND_CLOSEST(limit * entity->weight, wsum);
662	if (entity->allocated >= limit) {
663	bfq_log_bfqq(bfqq->bfqd, bfqq,
664	"too many requests: allocated %d limit %d level %d",
665	entity->allocated, limit, level);
666	ret = true;
667	break;
668	}
669	}
670	out:
671	spin_unlock_irq(lock: &bfqd->lock);
672	if (entities != inline_entities)
673	kfree(objp: entities);
674	return ret;
675	}
676	#else
677	static bool bfqq_request_over_limit(struct bfq_data *bfqd,
678	struct bfq_io_cq *bic, blk_opf_t opf,
679	unsigned int act_idx, int limit)
680	{
681	return false;
682	}
683	#endif
684
685	/*
686	* Async I/O can easily starve sync I/O (both sync reads and sync
687	* writes), by consuming all tags. Similarly, storms of sync writes,
688	* such as those that sync(2) may trigger, can starve sync reads.
689	* Limit depths of async I/O and sync writes so as to counter both
690	* problems.
691	*
692	* Also if a bfq queue or its parent cgroup consume more tags than would be
693	* appropriate for their weight, we trim the available tag depth to 1. This
694	* avoids a situation where one cgroup can starve another cgroup from tags and
695	* thus block service differentiation among cgroups. Note that because the
696	* queue / cgroup already has many requests allocated and queued, this does not
697	* significantly affect service guarantees coming from the BFQ scheduling
698	* algorithm.
699	*/
700	static void bfq_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
701	{
702	struct bfq_data *bfqd = data->q->elevator->elevator_data;
703	struct bfq_io_cq *bic = bfq_bic_lookup(q: data->q);
704	int depth;
705	unsigned limit = data->q->nr_requests;
706	unsigned int act_idx;
707
708	/ Sync reads have full depth available /
709	if (op_is_sync(op: opf) && !op_is_write(op: opf)) {
710	depth = `0`;
711	} else {
712	depth = bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op: opf)];
713	limit = (limit * depth) >> bfqd->full_depth_shift;
714	}
715
716	for (act_idx = `0`; bic && act_idx < bfqd->num_actuators; act_idx++) {
717	/ Fast path to check if bfqq is already allocated. /
718	if (!bic_to_bfqq(bic, is_sync: op_is_sync(op: opf), actuator_idx: act_idx))
719	continue;
720
721	/*
722	* Does queue (or any parent entity) exceed number of
723	* requests that should be available to it? Heavily
724	* limit depth so that it cannot consume more
725	* available requests and thus starve other entities.
726	*/
727	if (bfqq_request_over_limit(bfqd, bic, opf, act_idx, limit)) {
728	depth = `1`;
729	break;
730	}
731	}
732	bfq_log(bfqd, "[%s] wr_busy %d sync %d depth %u",
733	__func__, bfqd->wr_busy_queues, op_is_sync(opf), depth);
734	if (depth)
735	data->shallow_depth = depth;
736	}
737
738	static struct bfq_queue *
739	bfq_rq_pos_tree_lookup(struct bfq_data bfqd, struct* rb_root *root,
740	sector_t sector, struct rb_node **ret_parent,
741	struct rb_node ***rb_link)
742	{
743	struct rb_node *p, parent;
744	struct bfq_queue *bfqq = NULL;
745
746	parent = NULL;
747	p = &root->rb_node;
748	while (*p) {
749	struct rb_node **n;
750
751	parent = *p;
752	bfqq = rb_entry(parent, struct bfq_queue, pos_node);
753
754	/*
755	* Sort strictly based on sector. Smallest to the left,
756	* largest to the right.
757	*/
758	if (sector > blk_rq_pos(rq: bfqq->next_rq))
759	n = &(*p)->rb_right;
760	else if (sector < blk_rq_pos(rq: bfqq->next_rq))
761	n = &(*p)->rb_left;
762	else
763	break;
764	p = n;
765	bfqq = NULL;
766	}
767
768	*ret_parent = parent;
769	if (rb_link)
770	*rb_link = p;
771
772	bfq_log(bfqd, "rq_pos_tree_lookup %llu: returning %d",
773	(unsigned long long)sector,
774	bfqq ? bfqq->pid : `0`);
775
776	return bfqq;
777	}
778
779	static bool bfq_too_late_for_merging(struct bfq_queue *bfqq)
780	{
781	return bfqq->service_from_backlogged > `0` &&
782	time_is_before_jiffies(bfqq->first_IO_time +
783	bfq_merge_time_limit);
784	}
785
786	/*
787	* The following function is not marked as __cold because it is
788	* actually cold, but for the same performance goal described in the
789	* comments on the likely() at the beginning of
790	* bfq_setup_cooperator(). Unexpectedly, to reach an even lower
791	* execution time for the case where this function is not invoked, we
792	* had to add an unlikely() in each involved if().
793	*/
794	void __cold
795	bfq_pos_tree_add_move(struct bfq_data bfqd, struct* bfq_queue *bfqq)
796	{
797	struct rb_node *p, parent;
798	struct bfq_queue *__bfqq;
799
800	if (bfqq->pos_root) {
801	rb_erase(&bfqq->pos_node, bfqq->pos_root);
802	bfqq->pos_root = NULL;
803	}
804
805	/ oom_bfqq does not participate in queue merging /
806	if (bfqq == &bfqd->oom_bfqq)
807	return;
808
809	/*
810	* bfqq cannot be merged any longer (see comments in
811	* bfq_setup_cooperator): no point in adding bfqq into the
812	* position tree.
813	*/
814	if (bfq_too_late_for_merging(bfqq))
815	return;
816
817	if (bfq_class_idle(bfqq))
818	return;
819	if (!bfqq->next_rq)
820	return;
821
822	bfqq->pos_root = &bfqq_group(bfqq)->rq_pos_tree;
823	__bfqq = bfq_rq_pos_tree_lookup(bfqd, root: bfqq->pos_root,
824	sector: blk_rq_pos(rq: bfqq->next_rq), ret_parent: &parent, rb_link: &p);
825	if (!__bfqq) {
826	rb_link_node(node: &bfqq->pos_node, parent, rb_link: p);
827	rb_insert_color(&bfqq->pos_node, bfqq->pos_root);
828	} else
829	bfqq->pos_root = NULL;
830	}
831
832	/*
833	* The following function returns false either if every active queue
834	* must receive the same share of the throughput (symmetric scenario),
835	* or, as a special case, if bfqq must receive a share of the
836	* throughput lower than or equal to the share that every other active
837	* queue must receive. If bfqq does sync I/O, then these are the only
838	* two cases where bfqq happens to be guaranteed its share of the
839	* throughput even if I/O dispatching is not plugged when bfqq remains
840	* temporarily empty (for more details, see the comments in the
841	* function bfq_better_to_idle()). For this reason, the return value
842	* of this function is used to check whether I/O-dispatch plugging can
843	* be avoided.
844	*
845	* The above first case (symmetric scenario) occurs when:
846	* 1) all active queues have the same weight,
847	* 2) all active queues belong to the same I/O-priority class,
848	* 3) all active groups at the same level in the groups tree have the same
849	* weight,
850	* 4) all active groups at the same level in the groups tree have the same
851	* number of children.
852	*
853	* Unfortunately, keeping the necessary state for evaluating exactly
854	* the last two symmetry sub-conditions above would be quite complex
855	* and time consuming. Therefore this function evaluates, instead,
856	* only the following stronger three sub-conditions, for which it is
857	* much easier to maintain the needed state:
858	* 1) all active queues have the same weight,
859	* 2) all active queues belong to the same I/O-priority class,
860	* 3) there is at most one active group.
861	* In particular, the last condition is always true if hierarchical
862	* support or the cgroups interface are not enabled, thus no state
863	* needs to be maintained in this case.
864	*/
865	static bool bfq_asymmetric_scenario(struct bfq_data *bfqd,
866	struct bfq_queue *bfqq)
867	{
868	bool smallest_weight = bfqq &&
869	bfqq->weight_counter &&
870	bfqq->weight_counter ==
871	container_of(
872	rb_first_cached(&bfqd->queue_weights_tree),
873	struct bfq_weight_counter,
874	weights_node);
875
876	/*
877	* For queue weights to differ, queue_weights_tree must contain
878	* at least two nodes.
879	*/
880	bool varied_queue_weights = !smallest_weight &&
881	!RB_EMPTY_ROOT(&bfqd->queue_weights_tree.rb_root) &&
882	(bfqd->queue_weights_tree.rb_root.rb_node->rb_left \|\|
883	bfqd->queue_weights_tree.rb_root.rb_node->rb_right);
884
885	bool multiple_classes_busy =
886	(bfqd->busy_queues[`0`] && bfqd->busy_queues[`1`]) \|\|
887	(bfqd->busy_queues[`0`] && bfqd->busy_queues[`2`]) \|\|
888	(bfqd->busy_queues[`1`] && bfqd->busy_queues[`2`]);
889
890	return varied_queue_weights \|\| multiple_classes_busy
891	#ifdef CONFIG_BFQ_GROUP_IOSCHED
892	\|\| bfqd->num_groups_with_pending_reqs > `1`
893	#endif
894	;
895	}
896
897	/*
898	* If the weight-counter tree passed as input contains no counter for
899	* the weight of the input queue, then add that counter; otherwise just
900	* increment the existing counter.
901	*
902	* Note that weight-counter trees contain few nodes in mostly symmetric
903	* scenarios. For example, if all queues have the same weight, then the
904	* weight-counter tree for the queues may contain at most one node.
905	* This holds even if low_latency is on, because weight-raised queues
906	* are not inserted in the tree.
907	* In most scenarios, the rate at which nodes are created/destroyed
908	* should be low too.
909	*/
910	void bfq_weights_tree_add(struct bfq_queue *bfqq)
911	{
912	struct rb_root_cached *root = &bfqq->bfqd->queue_weights_tree;
913	struct bfq_entity *entity = &bfqq->entity;
914	struct rb_node *new = &(root->rb_root.rb_node), parent = NULL;
915	bool leftmost = true;
916
917	/*
918	* Do not insert if the queue is already associated with a
919	* counter, which happens if:
920	* 1) a request arrival has caused the queue to become both
921	* non-weight-raised, and hence change its weight, and
922	* backlogged; in this respect, each of the two events
923	* causes an invocation of this function,
924	* 2) this is the invocation of this function caused by the
925	* second event. This second invocation is actually useless,
926	* and we handle this fact by exiting immediately. More
927	* efficient or clearer solutions might possibly be adopted.
928	*/
929	if (bfqq->weight_counter)
930	return;
931
932	while (*new) {
933	struct bfq_weight_counter __counter = container_of(new,
934	struct bfq_weight_counter,
935	weights_node);
936	parent = *new;
937
938	if (entity->weight == __counter->weight) {
939	bfqq->weight_counter = __counter;
940	goto inc_counter;
941	}
942	if (entity->weight < __counter->weight)
943	new = &((*new)->rb_left);
944	else {
945	new = &((*new)->rb_right);
946	leftmost = false;
947	}
948	}
949
950	bfqq->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
951	GFP_ATOMIC);
952
953	/*
954	* In the unlucky event of an allocation failure, we just
955	* exit. This will cause the weight of queue to not be
956	* considered in bfq_asymmetric_scenario, which, in its turn,
957	* causes the scenario to be deemed wrongly symmetric in case
958	* bfqq's weight would have been the only weight making the
959	* scenario asymmetric. On the bright side, no unbalance will
960	* however occur when bfqq becomes inactive again (the
961	* invocation of this function is triggered by an activation
962	* of queue). In fact, bfq_weights_tree_remove does nothing
963	* if !bfqq->weight_counter.
964	*/
965	if (unlikely(!bfqq->weight_counter))
966	return;
967
968	bfqq->weight_counter->weight = entity->weight;
969	rb_link_node(node: &bfqq->weight_counter->weights_node, parent, rb_link: new);
970	rb_insert_color_cached(node: &bfqq->weight_counter->weights_node, root,
971	leftmost);
972
973	inc_counter:
974	bfqq->weight_counter->num_active++;
975	bfqq->ref++;
976	}
977
978	/*
979	* Decrement the weight counter associated with the queue, and, if the
980	* counter reaches 0, remove the counter from the tree.
981	* See the comments to the function bfq_weights_tree_add() for considerations
982	* about overhead.
983	*/
984	void bfq_weights_tree_remove(struct bfq_queue *bfqq)
985	{
986	struct rb_root_cached *root;
987
988	if (!bfqq->weight_counter)
989	return;
990
991	root = &bfqq->bfqd->queue_weights_tree;
992	bfqq->weight_counter->num_active--;
993	if (bfqq->weight_counter->num_active > `0`)
994	goto reset_entity_pointer;
995
996	rb_erase_cached(node: &bfqq->weight_counter->weights_node, root);
997	kfree(objp: bfqq->weight_counter);
998
999	reset_entity_pointer:
1000	bfqq->weight_counter = NULL;
1001	bfq_put_queue(bfqq);
1002	}
1003
1004	/*
1005	* Return expired entry, or NULL to just start from scratch in rbtree.
1006	*/
1007	static struct request bfq_check_fifo(struct* bfq_queue *bfqq,
1008	struct request *last)
1009	{
1010	struct request *rq;
1011
1012	if (bfq_bfqq_fifo_expire(bfqq))
1013	return NULL;
1014
1015	bfq_mark_bfqq_fifo_expire(bfqq);
1016
1017	rq = rq_entry_fifo(bfqq->fifo.next);
1018
1019	if (rq == last \|\| blk_time_get_ns() < rq->fifo_time)
1020	return NULL;
1021
1022	bfq_log_bfqq(bfqq->bfqd, bfqq, "check_fifo: returned %p", rq);
1023	return rq;
1024	}
1025
1026	static struct request bfq_find_next_rq(struct* bfq_data *bfqd,
1027	struct bfq_queue *bfqq,
1028	struct request *last)
1029	{
1030	struct rb_node *rbnext = rb_next(&last->rb_node);
1031	struct rb_node *rbprev = rb_prev(&last->rb_node);
1032	struct request next, prev = NULL;
1033
1034	/ Follow expired path, else get first next available. /
1035	next = bfq_check_fifo(bfqq, last);
1036	if (next)
1037	return next;
1038
1039	if (rbprev)
1040	prev = rb_entry_rq(rbprev);
1041
1042	if (rbnext)
1043	next = rb_entry_rq(rbnext);
1044	else {
1045	rbnext = rb_first(&bfqq->sort_list);
1046	if (rbnext && rbnext != &last->rb_node)
1047	next = rb_entry_rq(rbnext);
1048	}
1049
1050	return bfq_choose_req(bfqd, rq1: next, rq2: prev, last: blk_rq_pos(rq: last));
1051	}
1052
1053	/ see the definition of bfq_async_charge_factor for details /
1054	static unsigned long bfq_serv_to_charge(struct request *rq,
1055	struct bfq_queue *bfqq)
1056	{
1057	if (bfq_bfqq_sync(bfqq) \|\| bfqq->wr_coeff > `1` \|\|
1058	bfq_asymmetric_scenario(bfqd: bfqq->bfqd, bfqq))
1059	return blk_rq_sectors(rq);
1060
1061	return blk_rq_sectors(rq) * bfq_async_charge_factor;
1062	}
1063
1064	/**
1065	* bfq_updated_next_req - update the queue after a new next_rq selection.
1066	* @bfqd: the device data the queue belongs to.
1067	* @bfqq: the queue to update.
1068	*
1069	* If the first request of a queue changes we make sure that the queue
1070	* has enough budget to serve at least its first request (if the
1071	* request has grown). We do this because if the queue has not enough
1072	* budget for its first request, it has to go through two dispatch
1073	* rounds to actually get it dispatched.
1074	*/
1075	static void bfq_updated_next_req(struct bfq_data *bfqd,
1076	struct bfq_queue *bfqq)
1077	{
1078	struct bfq_entity *entity = &bfqq->entity;
1079	struct request *next_rq = bfqq->next_rq;
1080	unsigned long new_budget;
1081
1082	if (!next_rq)
1083	return;
1084
1085	if (bfqq == bfqd->in_service_queue)
1086	/*
1087	* In order not to break guarantees, budgets cannot be
1088	* changed after an entity has been selected.
1089	*/
1090	return;
1091
1092	new_budget = max_t(unsigned long,
1093	max_t(unsigned long, bfqq->max_budget,
1094	bfq_serv_to_charge(next_rq, bfqq)),
1095	entity->service);
1096	if (entity->budget != new_budget) {
1097	entity->budget = new_budget;
1098	bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
1099	new_budget);
1100	bfq_requeue_bfqq(bfqd, bfqq, expiration: false);
1101	}
1102	}
1103
1104	static unsigned int bfq_wr_duration(struct bfq_data *bfqd)
1105	{
1106	u64 dur;
1107
1108	dur = bfqd->rate_dur_prod;
1109	do_div(dur, bfqd->peak_rate);
1110
1111	/*
1112	* Limit duration between 3 and 25 seconds. The upper limit
1113	* has been conservatively set after the following worst case:
1114	* on a QEMU/KVM virtual machine
1115	* - running in a slow PC
1116	* - with a virtual disk stacked on a slow low-end 5400rpm HDD
1117	* - serving a heavy I/O workload, such as the sequential reading
1118	* of several files
1119	* mplayer took 23 seconds to start, if constantly weight-raised.
1120	*
1121	* As for higher values than that accommodating the above bad
1122	* scenario, tests show that higher values would often yield
1123	* the opposite of the desired result, i.e., would worsen
1124	* responsiveness by allowing non-interactive applications to
1125	* preserve weight raising for too long.
1126	*
1127	* On the other end, lower values than 3 seconds make it
1128	* difficult for most interactive tasks to complete their jobs
1129	* before weight-raising finishes.
1130	*/
1131	return clamp_val(dur, msecs_to_jiffies(`3000`), msecs_to_jiffies(`25000`));
1132	}
1133
1134	/ switch back from soft real-time to interactive weight raising /
1135	static void switch_back_to_interactive_wr(struct bfq_queue *bfqq,
1136	struct bfq_data *bfqd)
1137	{
1138	bfqq->wr_coeff = bfqd->bfq_wr_coeff;
1139	bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
1140	bfqq->last_wr_start_finish = bfqq->wr_start_at_switch_to_srt;
1141	}
1142
1143	static void
1144	bfq_bfqq_resume_state(struct bfq_queue bfqq, struct* bfq_data *bfqd,
1145	struct bfq_io_cq *bic, bool bfq_already_existing)
1146	{
1147	unsigned int old_wr_coeff = `1`;
1148	bool busy = bfq_already_existing && bfq_bfqq_busy(bfqq);
1149	unsigned int a_idx = bfqq->actuator_idx;
1150	struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data[a_idx];
1151
1152	if (bfqq_data->saved_has_short_ttime)
1153	bfq_mark_bfqq_has_short_ttime(bfqq);
1154	else
1155	bfq_clear_bfqq_has_short_ttime(bfqq);
1156
1157	if (bfqq_data->saved_IO_bound)
1158	bfq_mark_bfqq_IO_bound(bfqq);
1159	else
1160	bfq_clear_bfqq_IO_bound(bfqq);
1161
1162	bfqq->last_serv_time_ns = bfqq_data->saved_last_serv_time_ns;
1163	bfqq->inject_limit = bfqq_data->saved_inject_limit;
1164	bfqq->decrease_time_jif = bfqq_data->saved_decrease_time_jif;
1165
1166	bfqq->entity.new_weight = bfqq_data->saved_weight;
1167	bfqq->ttime = bfqq_data->saved_ttime;
1168	bfqq->io_start_time = bfqq_data->saved_io_start_time;
1169	bfqq->tot_idle_time = bfqq_data->saved_tot_idle_time;
1170	/*
1171	* Restore weight coefficient only if low_latency is on
1172	*/
1173	if (bfqd->low_latency) {
1174	old_wr_coeff = bfqq->wr_coeff;
1175	bfqq->wr_coeff = bfqq_data->saved_wr_coeff;
1176	}
1177	bfqq->service_from_wr = bfqq_data->saved_service_from_wr;
1178	bfqq->wr_start_at_switch_to_srt =
1179	bfqq_data->saved_wr_start_at_switch_to_srt;
1180	bfqq->last_wr_start_finish = bfqq_data->saved_last_wr_start_finish;
1181	bfqq->wr_cur_max_time = bfqq_data->saved_wr_cur_max_time;
1182
1183	if (bfqq->wr_coeff > `1` && (bfq_bfqq_in_large_burst(bfqq) \|\|
1184	time_is_before_jiffies(bfqq->last_wr_start_finish +
1185	bfqq->wr_cur_max_time))) {
1186	if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
1187	!bfq_bfqq_in_large_burst(bfqq) &&
1188	time_is_after_eq_jiffies(bfqq->wr_start_at_switch_to_srt +
1189	bfq_wr_duration(bfqd))) {
1190	switch_back_to_interactive_wr(bfqq, bfqd);
1191	} else {
1192	bfqq->wr_coeff = `1`;
1193	bfq_log_bfqq(bfqq->bfqd, bfqq,
1194	"resume state: switching off wr");
1195	}
1196	}
1197
1198	/ make sure weight will be updated, however we got here /
1199	bfqq->entity.prio_changed = `1`;
1200
1201	if (likely(!busy))
1202	return;
1203
1204	if (old_wr_coeff == `1` && bfqq->wr_coeff > `1`)
1205	bfqd->wr_busy_queues++;
1206	else if (old_wr_coeff > `1` && bfqq->wr_coeff == `1`)
1207	bfqd->wr_busy_queues--;
1208	}
1209
1210	static int bfqq_process_refs(struct bfq_queue *bfqq)
1211	{
1212	return bfqq->ref - bfqq->entity.allocated -
1213	bfqq->entity.on_st_or_in_serv -
1214	(bfqq->weight_counter != NULL) - bfqq->stable_ref;
1215	}
1216
1217	/ Empty burst list and add just bfqq (see comments on bfq_handle_burst) /
1218	static void bfq_reset_burst_list(struct bfq_data bfqd, struct* bfq_queue *bfqq)
1219	{
1220	struct bfq_queue *item;
1221	struct hlist_node *n;
1222
1223	hlist_for_each_entry_safe(item, n, &bfqd->burst_list, burst_list_node)
1224	hlist_del_init(n: &item->burst_list_node);
1225
1226	/*
1227	* Start the creation of a new burst list only if there is no
1228	* active queue. See comments on the conditional invocation of
1229	* bfq_handle_burst().
1230	*/
1231	if (bfq_tot_busy_queues(bfqd) == `0`) {
1232	hlist_add_head(n: &bfqq->burst_list_node, h: &bfqd->burst_list);
1233	bfqd->burst_size = `1`;
1234	} else
1235	bfqd->burst_size = `0`;
1236
1237	bfqd->burst_parent_entity = bfqq->entity.parent;
1238	}
1239
1240	/ Add bfqq to the list of queues in current burst (see bfq_handle_burst) /
1241	static void bfq_add_to_burst(struct bfq_data bfqd, struct* bfq_queue *bfqq)
1242	{
1243	/ Increment burst size to take into account also bfqq /
1244	bfqd->burst_size++;
1245
1246	if (bfqd->burst_size == bfqd->bfq_large_burst_thresh) {
1247	struct bfq_queue pos, bfqq_item;
1248	struct hlist_node *n;
1249
1250	/*
1251	* Enough queues have been activated shortly after each
1252	* other to consider this burst as large.
1253	*/
1254	bfqd->large_burst = true;
1255
1256	/*
1257	* We can now mark all queues in the burst list as
1258	* belonging to a large burst.
1259	*/
1260	hlist_for_each_entry(bfqq_item, &bfqd->burst_list,
1261	burst_list_node)
1262	bfq_mark_bfqq_in_large_burst(bfqq: bfqq_item);
1263	bfq_mark_bfqq_in_large_burst(bfqq);
1264
1265	/*
1266	* From now on, and until the current burst finishes, any
1267	* new queue being activated shortly after the last queue
1268	* was inserted in the burst can be immediately marked as
1269	* belonging to a large burst. So the burst list is not
1270	* needed any more. Remove it.
1271	*/
1272	hlist_for_each_entry_safe(pos, n, &bfqd->burst_list,
1273	burst_list_node)
1274	hlist_del_init(n: &pos->burst_list_node);
1275	} else /*
1276	* Burst not yet large: add bfqq to the burst list. Do
1277	* not increment the ref counter for bfqq, because bfqq
1278	* is removed from the burst list before freeing bfqq
1279	* in put_queue.
1280	*/
1281	hlist_add_head(n: &bfqq->burst_list_node, h: &bfqd->burst_list);
1282	}
1283
1284	/*
1285	* If many queues belonging to the same group happen to be created
1286	* shortly after each other, then the processes associated with these
1287	* queues have typically a common goal. In particular, bursts of queue
1288	* creations are usually caused by services or applications that spawn
1289	* many parallel threads/processes. Examples are systemd during boot,
1290	* or git grep. To help these processes get their job done as soon as
1291	* possible, it is usually better to not grant either weight-raising
1292	* or device idling to their queues, unless these queues must be
1293	* protected from the I/O flowing through other active queues.
1294	*
1295	* In this comment we describe, firstly, the reasons why this fact
1296	* holds, and, secondly, the next function, which implements the main
1297	* steps needed to properly mark these queues so that they can then be
1298	* treated in a different way.
1299	*
1300	* The above services or applications benefit mostly from a high
1301	* throughput: the quicker the requests of the activated queues are
1302	* cumulatively served, the sooner the target job of these queues gets
1303	* completed. As a consequence, weight-raising any of these queues,
1304	* which also implies idling the device for it, is almost always
1305	* counterproductive, unless there are other active queues to isolate
1306	* these new queues from. If there no other active queues, then
1307	* weight-raising these new queues just lowers throughput in most
1308	* cases.
1309	*
1310	* On the other hand, a burst of queue creations may be caused also by
1311	* the start of an application that does not consist of a lot of
1312	* parallel I/O-bound threads. In fact, with a complex application,
1313	* several short processes may need to be executed to start-up the
1314	* application. In this respect, to start an application as quickly as
1315	* possible, the best thing to do is in any case to privilege the I/O
1316	* related to the application with respect to all other
1317	* I/O. Therefore, the best strategy to start as quickly as possible
1318	* an application that causes a burst of queue creations is to
1319	* weight-raise all the queues created during the burst. This is the
1320	* exact opposite of the best strategy for the other type of bursts.
1321	*
1322	* In the end, to take the best action for each of the two cases, the
1323	* two types of bursts need to be distinguished. Fortunately, this
1324	* seems relatively easy, by looking at the sizes of the bursts. In
1325	* particular, we found a threshold such that only bursts with a
1326	* larger size than that threshold are apparently caused by
1327	* services or commands such as systemd or git grep. For brevity,
1328	* hereafter we call just 'large' these bursts. BFQ does not
1329	* weight-raise queues whose creation occurs in a large burst. In
1330	* addition, for each of these queues BFQ performs or does not perform
1331	* idling depending on which choice boosts the throughput more. The
1332	* exact choice depends on the device and request pattern at
1333	* hand.
1334	*
1335	* Unfortunately, false positives may occur while an interactive task
1336	* is starting (e.g., an application is being started). The
1337	* consequence is that the queues associated with the task do not
1338	* enjoy weight raising as expected. Fortunately these false positives
1339	* are very rare. They typically occur if some service happens to
1340	* start doing I/O exactly when the interactive task starts.
1341	*
1342	* Turning back to the next function, it is invoked only if there are
1343	* no active queues (apart from active queues that would belong to the
1344	* same, possible burst bfqq would belong to), and it implements all
1345	* the steps needed to detect the occurrence of a large burst and to
1346	* properly mark all the queues belonging to it (so that they can then
1347	* be treated in a different way). This goal is achieved by
1348	* maintaining a "burst list" that holds, temporarily, the queues that
1349	* belong to the burst in progress. The list is then used to mark
1350	* these queues as belonging to a large burst if the burst does become
1351	* large. The main steps are the following.
1352	*
1353	* . when the very first queue is created, the queue is inserted into the
1354	* list (as it could be the first queue in a possible burst)
1355	*
1356	* . if the current burst has not yet become large, and a queue Q that does
1357	* not yet belong to the burst is activated shortly after the last time
1358	* at which a new queue entered the burst list, then the function appends
1359	* Q to the burst list
1360	*
1361	* . if, as a consequence of the previous step, the burst size reaches
1362	* the large-burst threshold, then
1363	*
1364	* . all the queues in the burst list are marked as belonging to a
1365	* large burst
1366	*
1367	* . the burst list is deleted; in fact, the burst list already served
1368	* its purpose (keeping temporarily track of the queues in a burst,
1369	* so as to be able to mark them as belonging to a large burst in the
1370	* previous sub-step), and now is not needed any more
1371	*
1372	* . the device enters a large-burst mode
1373	*
1374	* . if a queue Q that does not belong to the burst is created while
1375	* the device is in large-burst mode and shortly after the last time
1376	* at which a queue either entered the burst list or was marked as
1377	* belonging to the current large burst, then Q is immediately marked
1378	* as belonging to a large burst.
1379	*
1380	* . if a queue Q that does not belong to the burst is created a while
1381	* later, i.e., not shortly after, than the last time at which a queue
1382	* either entered the burst list or was marked as belonging to the
1383	* current large burst, then the current burst is deemed as finished and:
1384	*
1385	* . the large-burst mode is reset if set
1386	*
1387	* . the burst list is emptied
1388	*
1389	* . Q is inserted in the burst list, as Q may be the first queue
1390	* in a possible new burst (then the burst list contains just Q
1391	* after this step).
1392	*/
1393	static void bfq_handle_burst(struct bfq_data bfqd, struct* bfq_queue *bfqq)
1394	{
1395	/*
1396	* If bfqq is already in the burst list or is part of a large
1397	* burst, or finally has just been split, then there is
1398	* nothing else to do.
1399	*/
1400	if (!hlist_unhashed(h: &bfqq->burst_list_node) \|\|
1401	bfq_bfqq_in_large_burst(bfqq) \|\|
1402	time_is_after_eq_jiffies(bfqq->split_time +
1403	msecs_to_jiffies(`10`)))
1404	return;
1405
1406	/*
1407	* If bfqq's creation happens late enough, or bfqq belongs to
1408	* a different group than the burst group, then the current
1409	* burst is finished, and related data structures must be
1410	* reset.
1411	*
1412	* In this respect, consider the special case where bfqq is
1413	* the very first queue created after BFQ is selected for this
1414	* device. In this case, last_ins_in_burst and
1415	* burst_parent_entity are not yet significant when we get
1416	* here. But it is easy to verify that, whether or not the
1417	* following condition is true, bfqq will end up being
1418	* inserted into the burst list. In particular the list will
1419	* happen to contain only bfqq. And this is exactly what has
1420	* to happen, as bfqq may be the first queue of the first
1421	* burst.
1422	*/
1423	if (time_is_before_jiffies(bfqd->last_ins_in_burst +
1424	bfqd->bfq_burst_interval) \|\|
1425	bfqq->entity.parent != bfqd->burst_parent_entity) {
1426	bfqd->large_burst = false;
1427	bfq_reset_burst_list(bfqd, bfqq);
1428	goto end;
1429	}
1430
1431	/*
1432	* If we get here, then bfqq is being activated shortly after the
1433	* last queue. So, if the current burst is also large, we can mark
1434	* bfqq as belonging to this large burst immediately.
1435	*/
1436	if (bfqd->large_burst) {
1437	bfq_mark_bfqq_in_large_burst(bfqq);
1438	goto end;
1439	}
1440
1441	/*
1442	* If we get here, then a large-burst state has not yet been
1443	* reached, but bfqq is being activated shortly after the last
1444	* queue. Then we add bfqq to the burst.
1445	*/
1446	bfq_add_to_burst(bfqd, bfqq);
1447	end:
1448	/*
1449	* At this point, bfqq either has been added to the current
1450	* burst or has caused the current burst to terminate and a
1451	* possible new burst to start. In particular, in the second
1452	* case, bfqq has become the first queue in the possible new
1453	* burst. In both cases last_ins_in_burst needs to be moved
1454	* forward.
1455	*/
1456	bfqd->last_ins_in_burst = jiffies;
1457	}
1458
1459	static int bfq_bfqq_budget_left(struct bfq_queue *bfqq)
1460	{
1461	struct bfq_entity *entity = &bfqq->entity;
1462
1463	return entity->budget - entity->service;
1464	}
1465
1466	/*
1467	* If enough samples have been computed, return the current max budget
1468	* stored in bfqd, which is dynamically updated according to the
1469	* estimated disk peak rate; otherwise return the default max budget
1470	*/
1471	static int bfq_max_budget(struct bfq_data *bfqd)
1472	{
1473	if (bfqd->budgets_assigned < bfq_stats_min_budgets)
1474	return bfq_default_max_budget;
1475	else
1476	return bfqd->bfq_max_budget;
1477	}
1478
1479	/*
1480	* Return min budget, which is a fraction of the current or default
1481	* max budget (trying with 1/32)
1482	*/
1483	static int bfq_min_budget(struct bfq_data *bfqd)
1484	{
1485	if (bfqd->budgets_assigned < bfq_stats_min_budgets)
1486	return bfq_default_max_budget / `32`;
1487	else
1488	return bfqd->bfq_max_budget / `32`;
1489	}
1490
1491	/*
1492	* The next function, invoked after the input queue bfqq switches from
1493	* idle to busy, updates the budget of bfqq. The function also tells
1494	* whether the in-service queue should be expired, by returning
1495	* true. The purpose of expiring the in-service queue is to give bfqq
1496	* the chance to possibly preempt the in-service queue, and the reason
1497	* for preempting the in-service queue is to achieve one of the two
1498	* goals below.
1499	*
1500	* 1. Guarantee to bfqq its reserved bandwidth even if bfqq has
1501	* expired because it has remained idle. In particular, bfqq may have
1502	* expired for one of the following two reasons:
1503	*
1504	* - BFQQE_NO_MORE_REQUESTS bfqq did not enjoy any device idling
1505	* and did not make it to issue a new request before its last
1506	* request was served;
1507	*
1508	* - BFQQE_TOO_IDLE bfqq did enjoy device idling, but did not issue
1509	* a new request before the expiration of the idling-time.
1510	*
1511	* Even if bfqq has expired for one of the above reasons, the process
1512	* associated with the queue may be however issuing requests greedily,
1513	* and thus be sensitive to the bandwidth it receives (bfqq may have
1514	* remained idle for other reasons: CPU high load, bfqq not enjoying
1515	* idling, I/O throttling somewhere in the path from the process to
1516	* the I/O scheduler, ...). But if, after every expiration for one of
1517	* the above two reasons, bfqq has to wait for the service of at least
1518	* one full budget of another queue before being served again, then
1519	* bfqq is likely to get a much lower bandwidth or resource time than
1520	* its reserved ones. To address this issue, two countermeasures need
1521	* to be taken.
1522	*
1523	* First, the budget and the timestamps of bfqq need to be updated in
1524	* a special way on bfqq reactivation: they need to be updated as if
1525	* bfqq did not remain idle and did not expire. In fact, if they are
1526	* computed as if bfqq expired and remained idle until reactivation,
1527	* then the process associated with bfqq is treated as if, instead of
1528	* being greedy, it stopped issuing requests when bfqq remained idle,
1529	* and restarts issuing requests only on this reactivation. In other
1530	* words, the scheduler does not help the process recover the "service
1531	* hole" between bfqq expiration and reactivation. As a consequence,
1532	* the process receives a lower bandwidth than its reserved one. In
1533	* contrast, to recover this hole, the budget must be updated as if
1534	* bfqq was not expired at all before this reactivation, i.e., it must
1535	* be set to the value of the remaining budget when bfqq was
1536	* expired. Along the same line, timestamps need to be assigned the
1537	* value they had the last time bfqq was selected for service, i.e.,
1538	* before last expiration. Thus timestamps need to be back-shifted
1539	* with respect to their normal computation (see [1] for more details
1540	* on this tricky aspect).
1541	*
1542	* Secondly, to allow the process to recover the hole, the in-service
1543	* queue must be expired too, to give bfqq the chance to preempt it
1544	* immediately. In fact, if bfqq has to wait for a full budget of the
1545	* in-service queue to be completed, then it may become impossible to
1546	* let the process recover the hole, even if the back-shifted
1547	* timestamps of bfqq are lower than those of the in-service queue. If
1548	* this happens for most or all of the holes, then the process may not
1549	* receive its reserved bandwidth. In this respect, it is worth noting
1550	* that, being the service of outstanding requests unpreemptible, a
1551	* little fraction of the holes may however be unrecoverable, thereby
1552	* causing a little loss of bandwidth.
1553	*
1554	* The last important point is detecting whether bfqq does need this
1555	* bandwidth recovery. In this respect, the next function deems the
1556	* process associated with bfqq greedy, and thus allows it to recover
1557	* the hole, if: 1) the process is waiting for the arrival of a new
1558	* request (which implies that bfqq expired for one of the above two
1559	* reasons), and 2) such a request has arrived soon. The first
1560	* condition is controlled through the flag non_blocking_wait_rq,
1561	* while the second through the flag arrived_in_time. If both
1562	* conditions hold, then the function computes the budget in the
1563	* above-described special way, and signals that the in-service queue
1564	* should be expired. Timestamp back-shifting is done later in
1565	* __bfq_activate_entity.
1566	*
1567	* 2. Reduce latency. Even if timestamps are not backshifted to let
1568	* the process associated with bfqq recover a service hole, bfqq may
1569	* however happen to have, after being (re)activated, a lower finish
1570	* timestamp than the in-service queue. That is, the next budget of
1571	* bfqq may have to be completed before the one of the in-service
1572	* queue. If this is the case, then preempting the in-service queue
1573	* allows this goal to be achieved, apart from the unpreemptible,
1574	* outstanding requests mentioned above.
1575	*
1576	* Unfortunately, regardless of which of the above two goals one wants
1577	* to achieve, service trees need first to be updated to know whether
1578	* the in-service queue must be preempted. To have service trees
1579	* correctly updated, the in-service queue must be expired and
1580	* rescheduled, and bfqq must be scheduled too. This is one of the
1581	* most costly operations (in future versions, the scheduling
1582	* mechanism may be re-designed in such a way to make it possible to
1583	* know whether preemption is needed without needing to update service
1584	* trees). In addition, queue preemptions almost always cause random
1585	* I/O, which may in turn cause loss of throughput. Finally, there may
1586	* even be no in-service queue when the next function is invoked (so,
1587	* no queue to compare timestamps with). Because of these facts, the
1588	* next function adopts the following simple scheme to avoid costly
1589	* operations, too frequent preemptions and too many dependencies on
1590	* the state of the scheduler: it requests the expiration of the
1591	* in-service queue (unconditionally) only for queues that need to
1592	* recover a hole. Then it delegates to other parts of the code the
1593	* responsibility of handling the above case 2.
1594	*/
1595	static bool bfq_bfqq_update_budg_for_activation(struct bfq_data *bfqd,
1596	struct bfq_queue *bfqq,
1597	bool arrived_in_time)
1598	{
1599	struct bfq_entity *entity = &bfqq->entity;
1600
1601	/*
1602	* In the next compound condition, we check also whether there
1603	* is some budget left, because otherwise there is no point in
1604	* trying to go on serving bfqq with this same budget: bfqq
1605	* would be expired immediately after being selected for
1606	* service. This would only cause useless overhead.
1607	*/
1608	if (bfq_bfqq_non_blocking_wait_rq(bfqq) && arrived_in_time &&
1609	bfq_bfqq_budget_left(bfqq) > `0`) {
1610	/*
1611	* We do not clear the flag non_blocking_wait_rq here, as
1612	* the latter is used in bfq_activate_bfqq to signal
1613	* that timestamps need to be back-shifted (and is
1614	* cleared right after).
1615	*/
1616
1617	/*
1618	* In next assignment we rely on that either
1619	* entity->service or entity->budget are not updated
1620	* on expiration if bfqq is empty (see
1621	* __bfq_bfqq_recalc_budget). Thus both quantities
1622	* remain unchanged after such an expiration, and the
1623	* following statement therefore assigns to
1624	* entity->budget the remaining budget on such an
1625	* expiration.
1626	*/
1627	entity->budget = min_t(unsigned long,
1628	bfq_bfqq_budget_left(bfqq),
1629	bfqq->max_budget);
1630
1631	/*
1632	* At this point, we have used entity->service to get
1633	* the budget left (needed for updating
1634	* entity->budget). Thus we finally can, and have to,
1635	* reset entity->service. The latter must be reset
1636	* because bfqq would otherwise be charged again for
1637	* the service it has received during its previous
1638	* service slot(s).
1639	*/
1640	entity->service = `0`;
1641
1642	return true;
1643	}
1644
1645	/*
1646	* We can finally complete expiration, by setting service to 0.
1647	*/
1648	entity->service = `0`;
1649	entity->budget = max_t(unsigned long, bfqq->max_budget,
1650	bfq_serv_to_charge(bfqq->next_rq, bfqq));
1651	bfq_clear_bfqq_non_blocking_wait_rq(bfqq);
1652	return false;
1653	}
1654
1655	/*
1656	* Return the farthest past time instant according to jiffies
1657	* macros.
1658	*/
1659	static unsigned long bfq_smallest_from_now(void)
1660	{
1661	return jiffies - MAX_JIFFY_OFFSET;
1662	}
1663
1664	static void bfq_update_bfqq_wr_on_rq_arrival(struct bfq_data *bfqd,
1665	struct bfq_queue *bfqq,
1666	unsigned int old_wr_coeff,
1667	bool wr_or_deserves_wr,
1668	bool interactive,
1669	bool in_burst,
1670	bool soft_rt)
1671	{
1672	if (old_wr_coeff == `1` && wr_or_deserves_wr) {
1673	/ start a weight-raising period /
1674	if (interactive) {
1675	bfqq->service_from_wr = `0`;
1676	bfqq->wr_coeff = bfqd->bfq_wr_coeff;
1677	bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
1678	} else {
1679	/*
1680	* No interactive weight raising in progress
1681	* here: assign minus infinity to
1682	* wr_start_at_switch_to_srt, to make sure
1683	* that, at the end of the soft-real-time
1684	* weight raising periods that is starting
1685	* now, no interactive weight-raising period
1686	* may be wrongly considered as still in
1687	* progress (and thus actually started by
1688	* mistake).
1689	*/
1690	bfqq->wr_start_at_switch_to_srt =
1691	bfq_smallest_from_now();
1692	bfqq->wr_coeff = bfqd->bfq_wr_coeff *
1693	BFQ_SOFTRT_WEIGHT_FACTOR;
1694	bfqq->wr_cur_max_time =
1695	bfqd->bfq_wr_rt_max_time;
1696	}
1697
1698	/*
1699	* If needed, further reduce budget to make sure it is
1700	* close to bfqq's backlog, so as to reduce the
1701	* scheduling-error component due to a too large
1702	* budget. Do not care about throughput consequences,
1703	* but only about latency. Finally, do not assign a
1704	* too small budget either, to avoid increasing
1705	* latency by causing too frequent expirations.
1706	*/
1707	bfqq->entity.budget = min_t(unsigned long,
1708	bfqq->entity.budget,
1709	`2` * bfq_min_budget(bfqd));
1710	} else if (old_wr_coeff > `1`) {
1711	if (interactive) { / update wr coeff and duration /
1712	bfqq->wr_coeff = bfqd->bfq_wr_coeff;
1713	bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
1714	} else if (in_burst)
1715	bfqq->wr_coeff = `1`;
1716	else if (soft_rt) {
1717	/*
1718	* The application is now or still meeting the
1719	* requirements for being deemed soft rt. We
1720	* can then correctly and safely (re)charge
1721	* the weight-raising duration for the
1722	* application with the weight-raising
1723	* duration for soft rt applications.
1724	*
1725	* In particular, doing this recharge now, i.e.,
1726	* before the weight-raising period for the
1727	* application finishes, reduces the probability
1728	* of the following negative scenario:
1729	* 1) the weight of a soft rt application is
1730	* raised at startup (as for any newly
1731	* created application),
1732	* 2) since the application is not interactive,
1733	* at a certain time weight-raising is
1734	* stopped for the application,
1735	* 3) at that time the application happens to
1736	* still have pending requests, and hence
1737	* is destined to not have a chance to be
1738	* deemed soft rt before these requests are
1739	* completed (see the comments to the
1740	* function bfq_bfqq_softrt_next_start()
1741	* for details on soft rt detection),
1742	* 4) these pending requests experience a high
1743	* latency because the application is not
1744	* weight-raised while they are pending.
1745	*/
1746	if (bfqq->wr_cur_max_time !=
1747	bfqd->bfq_wr_rt_max_time) {
1748	bfqq->wr_start_at_switch_to_srt =
1749	bfqq->last_wr_start_finish;
1750
1751	bfqq->wr_cur_max_time =
1752	bfqd->bfq_wr_rt_max_time;
1753	bfqq->wr_coeff = bfqd->bfq_wr_coeff *
1754	BFQ_SOFTRT_WEIGHT_FACTOR;
1755	}
1756	bfqq->last_wr_start_finish = jiffies;
1757	}
1758	}
1759	}
1760
1761	static bool bfq_bfqq_idle_for_long_time(struct bfq_data *bfqd,
1762	struct bfq_queue *bfqq)
1763	{
1764	return bfqq->dispatched == `0` &&
1765	time_is_before_jiffies(
1766	bfqq->budget_timeout +
1767	bfqd->bfq_wr_min_idle_time);
1768	}
1769
1770
1771	/*
1772	* Return true if bfqq is in a higher priority class, or has a higher
1773	* weight than the in-service queue.
1774	*/
1775	static bool bfq_bfqq_higher_class_or_weight(struct bfq_queue *bfqq,
1776	struct bfq_queue *in_serv_bfqq)
1777	{
1778	int bfqq_weight, in_serv_weight;
1779
1780	if (bfqq->ioprio_class < in_serv_bfqq->ioprio_class)
1781	return true;
1782
1783	if (in_serv_bfqq->entity.parent == bfqq->entity.parent) {
1784	bfqq_weight = bfqq->entity.weight;
1785	in_serv_weight = in_serv_bfqq->entity.weight;
1786	} else {
1787	if (bfqq->entity.parent)
1788	bfqq_weight = bfqq->entity.parent->weight;
1789	else
1790	bfqq_weight = bfqq->entity.weight;
1791	if (in_serv_bfqq->entity.parent)
1792	in_serv_weight = in_serv_bfqq->entity.parent->weight;
1793	else
1794	in_serv_weight = in_serv_bfqq->entity.weight;
1795	}
1796
1797	return bfqq_weight > in_serv_weight;
1798	}
1799
1800	/*
1801	* Get the index of the actuator that will serve bio.
1802	*/
1803	static unsigned int bfq_actuator_index(struct bfq_data bfqd, struct* bio *bio)
1804	{
1805	unsigned int i;
1806	sector_t end;
1807
1808	/ no search needed if one or zero ranges present /
1809	if (bfqd->num_actuators == `1`)
1810	return `0`;
1811
1812	/ bio_end_sector(bio) gives the sector after the last one /
1813	end = bio_end_sector(bio) - `1`;
1814
1815	for (i = `0`; i < bfqd->num_actuators; i++) {
1816	if (end >= bfqd->sector[i] &&
1817	end < bfqd->sector[i] + bfqd->nr_sectors[i])
1818	return i;
1819	}
1820
1821	WARN_ONCE(true,
1822	"bfq_actuator_index: bio sector out of ranges: end=%llu\n",
1823	end);
1824	return `0`;
1825	}
1826
1827	static bool bfq_better_to_idle(struct bfq_queue *bfqq);
1828
1829	static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
1830	struct bfq_queue *bfqq,
1831	int old_wr_coeff,
1832	struct request *rq,
1833	bool *interactive)
1834	{
1835	bool soft_rt, in_burst, wr_or_deserves_wr,
1836	bfqq_wants_to_preempt,
1837	idle_for_long_time = bfq_bfqq_idle_for_long_time(bfqd, bfqq),
1838	/*
1839	* See the comments on
1840	* bfq_bfqq_update_budg_for_activation for
1841	* details on the usage of the next variable.
1842	*/
1843	arrived_in_time = blk_time_get_ns() <=
1844	bfqq->ttime.last_end_request +
1845	bfqd->bfq_slice_idle * `3`;
1846	unsigned int act_idx = bfq_actuator_index(bfqd, bio: rq->bio);
1847	bool bfqq_non_merged_or_stably_merged =
1848	bfqq->bic \|\| RQ_BIC(rq)->bfqq_data[act_idx].stably_merged;
1849
1850	/*
1851	* bfqq deserves to be weight-raised if:
1852	* - it is sync,
1853	* - it does not belong to a large burst,
1854	* - it has been idle for enough time or is soft real-time,
1855	* - is linked to a bfq_io_cq (it is not shared in any sense),
1856	* - has a default weight (otherwise we assume the user wanted
1857	* to control its weight explicitly)
1858	*/
1859	in_burst = bfq_bfqq_in_large_burst(bfqq);
1860	soft_rt = bfqd->bfq_wr_max_softrt_rate > `0` &&
1861	!BFQQ_TOTALLY_SEEKY(bfqq) &&
1862	!in_burst &&
1863	time_is_before_jiffies(bfqq->soft_rt_next_start) &&
1864	bfqq->dispatched == `0` &&
1865	bfqq->entity.new_weight == `40`;
1866	*interactive = !in_burst && idle_for_long_time &&
1867	bfqq->entity.new_weight == `40`;
1868	/*
1869	* Merged bfq_queues are kept out of weight-raising
1870	* (low-latency) mechanisms. The reason is that these queues
1871	* are usually created for non-interactive and
1872	* non-soft-real-time tasks. Yet this is not the case for
1873	* stably-merged queues. These queues are merged just because
1874	* they are created shortly after each other. So they may
1875	* easily serve the I/O of an interactive or soft-real time
1876	* application, if the application happens to spawn multiple
1877	* processes. So let also stably-merged queued enjoy weight
1878	* raising.
1879	*/
1880	wr_or_deserves_wr = bfqd->low_latency &&
1881	(bfqq->wr_coeff > `1` \|\|
1882	(bfq_bfqq_sync(bfqq) && bfqq_non_merged_or_stably_merged &&
1883	(*interactive \|\| soft_rt)));
1884
1885	/*
1886	* Using the last flag, update budget and check whether bfqq
1887	* may want to preempt the in-service queue.
1888	*/
1889	bfqq_wants_to_preempt =
1890	bfq_bfqq_update_budg_for_activation(bfqd, bfqq,
1891	arrived_in_time);
1892
1893	/*
1894	* If bfqq happened to be activated in a burst, but has been
1895	* idle for much more than an interactive queue, then we
1896	* assume that, in the overall I/O initiated in the burst, the
1897	* I/O associated with bfqq is finished. So bfqq does not need
1898	* to be treated as a queue belonging to a burst
1899	* anymore. Accordingly, we reset bfqq's in_large_burst flag
1900	* if set, and remove bfqq from the burst list if it's
1901	* there. We do not decrement burst_size, because the fact
1902	* that bfqq does not need to belong to the burst list any
1903	* more does not invalidate the fact that bfqq was created in
1904	* a burst.
1905	*/
1906	if (likely(!bfq_bfqq_just_created(bfqq)) &&
1907	idle_for_long_time &&
1908	time_is_before_jiffies(
1909	bfqq->budget_timeout +
1910	msecs_to_jiffies(`10000`))) {
1911	hlist_del_init(n: &bfqq->burst_list_node);
1912	bfq_clear_bfqq_in_large_burst(bfqq);
1913	}
1914
1915	bfq_clear_bfqq_just_created(bfqq);
1916
1917	if (bfqd->low_latency) {
1918	if (unlikely(time_is_after_jiffies(bfqq->split_time)))
1919	/ wraparound /
1920	bfqq->split_time =
1921	jiffies - bfqd->bfq_wr_min_idle_time - `1`;
1922
1923	if (time_is_before_jiffies(bfqq->split_time +
1924	bfqd->bfq_wr_min_idle_time)) {
1925	bfq_update_bfqq_wr_on_rq_arrival(bfqd, bfqq,
1926	old_wr_coeff,
1927	wr_or_deserves_wr,
1928	interactive: *interactive,
1929	in_burst,
1930	soft_rt);
1931
1932	if (old_wr_coeff != bfqq->wr_coeff)
1933	bfqq->entity.prio_changed = `1`;
1934	}
1935	}
1936
1937	bfqq->last_idle_bklogged = jiffies;
1938	bfqq->service_from_backlogged = `0`;
1939	bfq_clear_bfqq_softrt_update(bfqq);
1940
1941	bfq_add_bfqq_busy(bfqq);
1942
1943	/*
1944	* Expire in-service queue if preemption may be needed for
1945	* guarantees or throughput. As for guarantees, we care
1946	* explicitly about two cases. The first is that bfqq has to
1947	* recover a service hole, as explained in the comments on
1948	* bfq_bfqq_update_budg_for_activation(), i.e., that
1949	* bfqq_wants_to_preempt is true. However, if bfqq does not
1950	* carry time-critical I/O, then bfqq's bandwidth is less
1951	* important than that of queues that carry time-critical I/O.
1952	* So, as a further constraint, we consider this case only if
1953	* bfqq is at least as weight-raised, i.e., at least as time
1954	* critical, as the in-service queue.
1955	*
1956	* The second case is that bfqq is in a higher priority class,
1957	* or has a higher weight than the in-service queue. If this
1958	* condition does not hold, we don't care because, even if
1959	* bfqq does not start to be served immediately, the resulting
1960	* delay for bfqq's I/O is however lower or much lower than
1961	* the ideal completion time to be guaranteed to bfqq's I/O.
1962	*
1963	* In both cases, preemption is needed only if, according to
1964	* the timestamps of both bfqq and of the in-service queue,
1965	* bfqq actually is the next queue to serve. So, to reduce
1966	* useless preemptions, the return value of
1967	* next_queue_may_preempt() is considered in the next compound
1968	* condition too. Yet next_queue_may_preempt() just checks a
1969	* simple, necessary condition for bfqq to be the next queue
1970	* to serve. In fact, to evaluate a sufficient condition, the
1971	* timestamps of the in-service queue would need to be
1972	* updated, and this operation is quite costly (see the
1973	* comments on bfq_bfqq_update_budg_for_activation()).
1974	*
1975	* As for throughput, we ask bfq_better_to_idle() whether we
1976	* still need to plug I/O dispatching. If bfq_better_to_idle()
1977	* says no, then plugging is not needed any longer, either to
1978	* boost throughput or to perserve service guarantees. Then
1979	* the best option is to stop plugging I/O, as not doing so
1980	* would certainly lower throughput. We may end up in this
1981	* case if: (1) upon a dispatch attempt, we detected that it
1982	* was better to plug I/O dispatch, and to wait for a new
1983	* request to arrive for the currently in-service queue, but
1984	* (2) this switch of bfqq to busy changes the scenario.
1985	*/
1986	if (bfqd->in_service_queue &&
1987	((bfqq_wants_to_preempt &&
1988	bfqq->wr_coeff >= bfqd->in_service_queue->wr_coeff) \|\|
1989	bfq_bfqq_higher_class_or_weight(bfqq, in_serv_bfqq: bfqd->in_service_queue) \|\|
1990	!bfq_better_to_idle(bfqq: bfqd->in_service_queue)) &&
1991	next_queue_may_preempt(bfqd))
1992	bfq_bfqq_expire(bfqd, bfqq: bfqd->in_service_queue,
1993	compensate: false, reason: BFQQE_PREEMPTED);
1994	}
1995
1996	static void bfq_reset_inject_limit(struct bfq_data *bfqd,
1997	struct bfq_queue *bfqq)
1998	{
1999	/ invalidate baseline total service time /
2000	bfqq->last_serv_time_ns = `0`;
2001
2002	/*
2003	* Reset pointer in case we are waiting for
2004	* some request completion.
2005	*/
2006	bfqd->waited_rq = NULL;
2007
2008	/*
2009	* If bfqq has a short think time, then start by setting the
2010	* inject limit to 0 prudentially, because the service time of
2011	* an injected I/O request may be higher than the think time
2012	* of bfqq, and therefore, if one request was injected when
2013	* bfqq remains empty, this injected request might delay the
2014	* service of the next I/O request for bfqq significantly. In
2015	* case bfqq can actually tolerate some injection, then the
2016	* adaptive update will however raise the limit soon. This
2017	* lucky circumstance holds exactly because bfqq has a short
2018	* think time, and thus, after remaining empty, is likely to
2019	* get new I/O enqueued---and then completed---before being
2020	* expired. This is the very pattern that gives the
2021	* limit-update algorithm the chance to measure the effect of
2022	* injection on request service times, and then to update the
2023	* limit accordingly.
2024	*
2025	* However, in the following special case, the inject limit is
2026	* left to 1 even if the think time is short: bfqq's I/O is
2027	* synchronized with that of some other queue, i.e., bfqq may
2028	* receive new I/O only after the I/O of the other queue is
2029	* completed. Keeping the inject limit to 1 allows the
2030	* blocking I/O to be served while bfqq is in service. And
2031	* this is very convenient both for bfqq and for overall
2032	* throughput, as explained in detail in the comments in
2033	* bfq_update_has_short_ttime().
2034	*
2035	* On the opposite end, if bfqq has a long think time, then
2036	* start directly by 1, because:
2037	* a) on the bright side, keeping at most one request in
2038	* service in the drive is unlikely to cause any harm to the
2039	* latency of bfqq's requests, as the service time of a single
2040	* request is likely to be lower than the think time of bfqq;
2041	* b) on the downside, after becoming empty, bfqq is likely to
2042	* expire before getting its next request. With this request
2043	* arrival pattern, it is very hard to sample total service
2044	* times and update the inject limit accordingly (see comments
2045	* on bfq_update_inject_limit()). So the limit is likely to be
2046	* never, or at least seldom, updated. As a consequence, by
2047	* setting the limit to 1, we avoid that no injection ever
2048	* occurs with bfqq. On the downside, this proactive step
2049	* further reduces chances to actually compute the baseline
2050	* total service time. Thus it reduces chances to execute the
2051	* limit-update algorithm and possibly raise the limit to more
2052	* than 1.
2053	*/
2054	if (bfq_bfqq_has_short_ttime(bfqq))
2055	bfqq->inject_limit = `0`;
2056	else
2057	bfqq->inject_limit = `1`;
2058
2059	bfqq->decrease_time_jif = jiffies;
2060	}
2061
2062	static void bfq_update_io_intensity(struct bfq_queue *bfqq, u64 now_ns)
2063	{
2064	u64 tot_io_time = now_ns - bfqq->io_start_time;
2065
2066	if (RB_EMPTY_ROOT(&bfqq->sort_list) && bfqq->dispatched == `0`)
2067	bfqq->tot_idle_time +=
2068	now_ns - bfqq->ttime.last_end_request;
2069
2070	if (unlikely(bfq_bfqq_just_created(bfqq)))
2071	return;
2072
2073	/*
2074	* Must be busy for at least about 80% of the time to be
2075	* considered I/O bound.
2076	*/
2077	if (bfqq->tot_idle_time * `5` > tot_io_time)
2078	bfq_clear_bfqq_IO_bound(bfqq);
2079	else
2080	bfq_mark_bfqq_IO_bound(bfqq);
2081
2082	/*
2083	* Keep an observation window of at most 200 ms in the past
2084	* from now.
2085	*/
2086	if (tot_io_time > `200` * NSEC_PER_MSEC) {
2087	bfqq->io_start_time = now_ns - (tot_io_time>>`1`);
2088	bfqq->tot_idle_time >>= `1`;
2089	}
2090	}
2091
2092	/*
2093	* Detect whether bfqq's I/O seems synchronized with that of some
2094	* other queue, i.e., whether bfqq, after remaining empty, happens to
2095	* receive new I/O only right after some I/O request of the other
2096	* queue has been completed. We call waker queue the other queue, and
2097	* we assume, for simplicity, that bfqq may have at most one waker
2098	* queue.
2099	*
2100	* A remarkable throughput boost can be reached by unconditionally
2101	* injecting the I/O of the waker queue, every time a new
2102	* bfq_dispatch_request happens to be invoked while I/O is being
2103	* plugged for bfqq. In addition to boosting throughput, this
2104	* unblocks bfqq's I/O, thereby improving bandwidth and latency for
2105	* bfqq. Note that these same results may be achieved with the general
2106	* injection mechanism, but less effectively. For details on this
2107	* aspect, see the comments on the choice of the queue for injection
2108	* in bfq_select_queue().
2109	*
2110	* Turning back to the detection of a waker queue, a queue Q is deemed as a
2111	* waker queue for bfqq if, for three consecutive times, bfqq happens to become
2112	* non empty right after a request of Q has been completed within given
2113	* timeout. In this respect, even if bfqq is empty, we do not check for a waker
2114	* if it still has some in-flight I/O. In fact, in this case bfqq is actually
2115	* still being served by the drive, and may receive new I/O on the completion
2116	* of some of the in-flight requests. In particular, on the first time, Q is
2117	* tentatively set as a candidate waker queue, while on the third consecutive
2118	* time that Q is detected, the field waker_bfqq is set to Q, to confirm that Q
2119	* is a waker queue for bfqq. These detection steps are performed only if bfqq
2120	* has a long think time, so as to make it more likely that bfqq's I/O is
2121	* actually being blocked by a synchronization. This last filter, plus the
2122	* above three-times requirement and time limit for detection, make false
2123	* positives less likely.
2124	*
2125	* NOTE
2126	*
2127	* The sooner a waker queue is detected, the sooner throughput can be
2128	* boosted by injecting I/O from the waker queue. Fortunately,
2129	* detection is likely to be actually fast, for the following
2130	* reasons. While blocked by synchronization, bfqq has a long think
2131	* time. This implies that bfqq's inject limit is at least equal to 1
2132	* (see the comments in bfq_update_inject_limit()). So, thanks to
2133	* injection, the waker queue is likely to be served during the very
2134	* first I/O-plugging time interval for bfqq. This triggers the first
2135	* step of the detection mechanism. Thanks again to injection, the
2136	* candidate waker queue is then likely to be confirmed no later than
2137	* during the next I/O-plugging interval for bfqq.
2138	*
2139	* ISSUE
2140	*
2141	* On queue merging all waker information is lost.
2142	*/
2143	static void bfq_check_waker(struct bfq_data bfqd, struct* bfq_queue *bfqq,
2144	u64 now_ns)
2145	{
2146	char waker_name[MAX_BFQQ_NAME_LENGTH];
2147
2148	if (!bfqd->last_completed_rq_bfqq \|\|
2149	bfqd->last_completed_rq_bfqq == bfqq \|\|
2150	bfq_bfqq_has_short_ttime(bfqq) \|\|
2151	now_ns - bfqd->last_completion >= `4` * NSEC_PER_MSEC \|\|
2152	bfqd->last_completed_rq_bfqq == &bfqd->oom_bfqq \|\|
2153	bfqq == &bfqd->oom_bfqq)
2154	return;
2155
2156	/*
2157	* We reset waker detection logic also if too much time has passed
2158	* since the first detection. If wakeups are rare, pointless idling
2159	* doesn't hurt throughput that much. The condition below makes sure
2160	* we do not uselessly idle blocking waker in more than 1/64 cases.
2161	*/
2162	if (bfqd->last_completed_rq_bfqq !=
2163	bfqq->tentative_waker_bfqq \|\|
2164	now_ns > bfqq->waker_detection_started +
2165	`128` * (u64)bfqd->bfq_slice_idle) {
2166	/*
2167	* First synchronization detected with a
2168	* candidate waker queue, or with a different
2169	* candidate waker queue from the current one.
2170	*/
2171	bfqq->tentative_waker_bfqq =
2172	bfqd->last_completed_rq_bfqq;
2173	bfqq->num_waker_detections = `1`;
2174	bfqq->waker_detection_started = now_ns;
2175	bfq_bfqq_name(bfqq: bfqq->tentative_waker_bfqq, str: waker_name,
2176	MAX_BFQQ_NAME_LENGTH);
2177	bfq_log_bfqq(bfqd, bfqq, "set tentative waker %s", waker_name);
2178	} else / Same tentative waker queue detected again /
2179	bfqq->num_waker_detections++;
2180
2181	if (bfqq->num_waker_detections == `3`) {
2182	bfqq->waker_bfqq = bfqd->last_completed_rq_bfqq;
2183	bfqq->tentative_waker_bfqq = NULL;
2184	bfq_bfqq_name(bfqq: bfqq->waker_bfqq, str: waker_name,
2185	MAX_BFQQ_NAME_LENGTH);
2186	bfq_log_bfqq(bfqd, bfqq, "set waker %s", waker_name);
2187
2188	/*
2189	* If the waker queue disappears, then
2190	* bfqq->waker_bfqq must be reset. To
2191	* this goal, we maintain in each
2192	* waker queue a list, woken_list, of
2193	* all the queues that reference the
2194	* waker queue through their
2195	* waker_bfqq pointer. When the waker
2196	* queue exits, the waker_bfqq pointer
2197	* of all the queues in the woken_list
2198	* is reset.
2199	*
2200	* In addition, if bfqq is already in
2201	* the woken_list of a waker queue,
2202	* then, before being inserted into
2203	* the woken_list of a new waker
2204	* queue, bfqq must be removed from
2205	* the woken_list of the old waker
2206	* queue.
2207	*/
2208	if (!hlist_unhashed(h: &bfqq->woken_list_node))
2209	hlist_del_init(n: &bfqq->woken_list_node);
2210	hlist_add_head(n: &bfqq->woken_list_node,
2211	h: &bfqd->last_completed_rq_bfqq->woken_list);
2212	}
2213	}
2214
2215	static void bfq_add_request(struct request *rq)
2216	{
2217	struct bfq_queue *bfqq = RQ_BFQQ(rq);
2218	struct bfq_data *bfqd = bfqq->bfqd;
2219	struct request next_rq, prev;
2220	unsigned int old_wr_coeff = bfqq->wr_coeff;
2221	bool interactive = false;
2222	u64 now_ns = blk_time_get_ns();
2223
2224	bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
2225	bfqq->queued[rq_is_sync(rq)]++;
2226	/*
2227	* Updating of 'bfqd->queued' is protected by 'bfqd->lock', however, it
2228	* may be read without holding the lock in bfq_has_work().
2229	*/
2230	WRITE_ONCE(bfqd->queued, bfqd->queued + `1`);
2231
2232	if (bfq_bfqq_sync(bfqq) && RQ_BIC(rq)->requests <= `1`) {
2233	bfq_check_waker(bfqd, bfqq, now_ns);
2234
2235	/*
2236	* Periodically reset inject limit, to make sure that
2237	* the latter eventually drops in case workload
2238	* changes, see step (3) in the comments on
2239	* bfq_update_inject_limit().
2240	*/
2241	if (time_is_before_eq_jiffies(bfqq->decrease_time_jif +
2242	msecs_to_jiffies(`1000`)))
2243	bfq_reset_inject_limit(bfqd, bfqq);
2244
2245	/*
2246	* The following conditions must hold to setup a new
2247	* sampling of total service time, and then a new
2248	* update of the inject limit:
2249	* - bfqq is in service, because the total service
2250	* time is evaluated only for the I/O requests of
2251	* the queues in service;
2252	* - this is the right occasion to compute or to
2253	* lower the baseline total service time, because
2254	* there are actually no requests in the drive,
2255	* or
2256	* the baseline total service time is available, and
2257	* this is the right occasion to compute the other
2258	* quantity needed to update the inject limit, i.e.,
2259	* the total service time caused by the amount of
2260	* injection allowed by the current value of the
2261	* limit. It is the right occasion because injection
2262	* has actually been performed during the service
2263	* hole, and there are still in-flight requests,
2264	* which are very likely to be exactly the injected
2265	* requests, or part of them;
2266	* - the minimum interval for sampling the total
2267	* service time and updating the inject limit has
2268	* elapsed.
2269	*/
2270	if (bfqq == bfqd->in_service_queue &&
2271	(bfqd->tot_rq_in_driver == `0` \|\|
2272	(bfqq->last_serv_time_ns > `0` &&
2273	bfqd->rqs_injected && bfqd->tot_rq_in_driver > `0`)) &&
2274	time_is_before_eq_jiffies(bfqq->decrease_time_jif +
2275	msecs_to_jiffies(`10`))) {
2276	bfqd->last_empty_occupied_ns = blk_time_get_ns();
2277	/*
2278	* Start the state machine for measuring the
2279	* total service time of rq: setting
2280	* wait_dispatch will cause bfqd->waited_rq to
2281	* be set when rq will be dispatched.
2282	*/
2283	bfqd->wait_dispatch = true;
2284	/*
2285	* If there is no I/O in service in the drive,
2286	* then possible injection occurred before the
2287	* arrival of rq will not affect the total
2288	* service time of rq. So the injection limit
2289	* must not be updated as a function of such
2290	* total service time, unless new injection
2291	* occurs before rq is completed. To have the
2292	* injection limit updated only in the latter
2293	* case, reset rqs_injected here (rqs_injected
2294	* will be set in case injection is performed
2295	* on bfqq before rq is completed).
2296	*/
2297	if (bfqd->tot_rq_in_driver == `0`)
2298	bfqd->rqs_injected = false;
2299	}
2300	}
2301
2302	if (bfq_bfqq_sync(bfqq))
2303	bfq_update_io_intensity(bfqq, now_ns);
2304
2305	elv_rb_add(&bfqq->sort_list, rq);
2306
2307	/*
2308	* Check if this request is a better next-serve candidate.
2309	*/
2310	prev = bfqq->next_rq;
2311	next_rq = bfq_choose_req(bfqd, rq1: bfqq->next_rq, rq2: rq, last: bfqd->last_position);
2312	bfqq->next_rq = next_rq;
2313
2314	/*
2315	* Adjust priority tree position, if next_rq changes.
2316	* See comments on bfq_pos_tree_add_move() for the unlikely().
2317	*/
2318	if (unlikely(!bfqd->nonrot_with_queueing && prev != bfqq->next_rq))
2319	bfq_pos_tree_add_move(bfqd, bfqq);
2320
2321	if (!bfq_bfqq_busy(bfqq)) / switching to busy ... /
2322	bfq_bfqq_handle_idle_busy_switch(bfqd, bfqq, old_wr_coeff,
2323	rq, interactive: &interactive);
2324	else {
2325	if (bfqd->low_latency && old_wr_coeff == `1` && !rq_is_sync(rq) &&
2326	time_is_before_jiffies(
2327	bfqq->last_wr_start_finish +
2328	bfqd->bfq_wr_min_inter_arr_async)) {
2329	bfqq->wr_coeff = bfqd->bfq_wr_coeff;
2330	bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
2331
2332	bfqd->wr_busy_queues++;
2333	bfqq->entity.prio_changed = `1`;
2334	}
2335	if (prev != bfqq->next_rq)
2336	bfq_updated_next_req(bfqd, bfqq);
2337	}
2338
2339	/*
2340	* Assign jiffies to last_wr_start_finish in the following
2341	* cases:
2342	*
2343	* . if bfqq is not going to be weight-raised, because, for
2344	* non weight-raised queues, last_wr_start_finish stores the
2345	* arrival time of the last request; as of now, this piece
2346	* of information is used only for deciding whether to
2347	* weight-raise async queues
2348	*
2349	* . if bfqq is not weight-raised, because, if bfqq is now
2350	* switching to weight-raised, then last_wr_start_finish
2351	* stores the time when weight-raising starts
2352	*
2353	* . if bfqq is interactive, because, regardless of whether
2354	* bfqq is currently weight-raised, the weight-raising
2355	* period must start or restart (this case is considered
2356	* separately because it is not detected by the above
2357	* conditions, if bfqq is already weight-raised)
2358	*
2359	* last_wr_start_finish has to be updated also if bfqq is soft
2360	* real-time, because the weight-raising period is constantly
2361	* restarted on idle-to-busy transitions for these queues, but
2362	* this is already done in bfq_bfqq_handle_idle_busy_switch if
2363	* needed.
2364	*/
2365	if (bfqd->low_latency &&
2366	(old_wr_coeff == `1` \|\| bfqq->wr_coeff == `1` \|\| interactive))
2367	bfqq->last_wr_start_finish = jiffies;
2368	}
2369
2370	static struct request bfq_find_rq_fmerge(struct* bfq_data *bfqd,
2371	struct bio *bio,
2372	struct request_queue *q)
2373	{
2374	struct bfq_queue *bfqq = bfqd->bio_bfqq;
2375
2376
2377	if (bfqq)
2378	return elv_rb_find(&bfqq->sort_list, bio_end_sector(bio));
2379
2380	return NULL;
2381	}
2382
2383	static sector_t get_sdist(sector_t last_pos, struct request *rq)
2384	{
2385	if (last_pos)
2386	return abs(blk_rq_pos(rq) - last_pos);
2387
2388	return `0`;
2389	}
2390
2391	static void bfq_remove_request(struct request_queue *q,
2392	struct request *rq)
2393	{
2394	struct bfq_queue *bfqq = RQ_BFQQ(rq);
2395	struct bfq_data *bfqd = bfqq->bfqd;
2396	const int sync = rq_is_sync(rq);
2397
2398	if (bfqq->next_rq == rq) {
2399	bfqq->next_rq = bfq_find_next_rq(bfqd, bfqq, last: rq);
2400	bfq_updated_next_req(bfqd, bfqq);
2401	}
2402
2403	if (rq->queuelist.prev != &rq->queuelist)
2404	list_del_init(entry: &rq->queuelist);
2405	bfqq->queued[sync]--;
2406	/*
2407	* Updating of 'bfqd->queued' is protected by 'bfqd->lock', however, it
2408	* may be read without holding the lock in bfq_has_work().
2409	*/
2410	WRITE_ONCE(bfqd->queued, bfqd->queued - `1`);
2411	elv_rb_del(&bfqq->sort_list, rq);
2412
2413	elv_rqhash_del(q, rq);
2414	if (q->last_merge == rq)
2415	q->last_merge = NULL;
2416
2417	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
2418	bfqq->next_rq = NULL;
2419
2420	if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue) {
2421	bfq_del_bfqq_busy(bfqq, expiration: false);
2422	/*
2423	* bfqq emptied. In normal operation, when
2424	* bfqq is empty, bfqq->entity.service and
2425	* bfqq->entity.budget must contain,
2426	* respectively, the service received and the
2427	* budget used last time bfqq emptied. These
2428	* facts do not hold in this case, as at least
2429	* this last removal occurred while bfqq is
2430	* not in service. To avoid inconsistencies,
2431	* reset both bfqq->entity.service and
2432	* bfqq->entity.budget, if bfqq has still a
2433	* process that may issue I/O requests to it.
2434	*/
2435	bfqq->entity.budget = bfqq->entity.service = `0`;
2436	}
2437
2438	/*
2439	* Remove queue from request-position tree as it is empty.
2440	*/
2441	if (bfqq->pos_root) {
2442	rb_erase(&bfqq->pos_node, bfqq->pos_root);
2443	bfqq->pos_root = NULL;
2444	}
2445	} else {
2446	/ see comments on bfq_pos_tree_add_move() for the unlikely() /
2447	if (unlikely(!bfqd->nonrot_with_queueing))
2448	bfq_pos_tree_add_move(bfqd, bfqq);
2449	}
2450
2451	if (rq->cmd_flags & REQ_META)
2452	bfqq->meta_pending--;
2453
2454	}
2455
2456	static bool bfq_bio_merge(struct request_queue q, struct* bio *bio,
2457	unsigned int nr_segs)
2458	{
2459	struct bfq_data *bfqd = q->elevator->elevator_data;
2460	struct request *free = NULL;
2461	/*
2462	* bfq_bic_lookup grabs the queue_lock: invoke it now and
2463	* store its return value for later use, to avoid nesting
2464	* queue_lock inside the bfqd->lock. We assume that the bic
2465	* returned by bfq_bic_lookup does not go away before
2466	* bfqd->lock is taken.
2467	*/
2468	struct bfq_io_cq *bic = bfq_bic_lookup(q);
2469	bool ret;
2470
2471	spin_lock_irq(lock: &bfqd->lock);
2472
2473	if (bic) {
2474	/*
2475	* Make sure cgroup info is uptodate for current process before
2476	* considering the merge.
2477	*/
2478	bfq_bic_update_cgroup(bic, bio);
2479
2480	bfqd->bio_bfqq = bic_to_bfqq(bic, is_sync: op_is_sync(op: bio->bi_opf),
2481	actuator_idx: bfq_actuator_index(bfqd, bio));
2482	} else {
2483	bfqd->bio_bfqq = NULL;
2484	}
2485	bfqd->bio_bic = bic;
2486
2487	ret = blk_mq_sched_try_merge(q, bio, nr_segs, merged_request: &free);
2488
2489	spin_unlock_irq(lock: &bfqd->lock);
2490	if (free)
2491	blk_mq_free_request(rq: free);
2492
2493	return ret;
2494	}
2495
2496	static int bfq_request_merge(struct request_queue q, struct* request **req,
2497	struct bio *bio)
2498	{
2499	struct bfq_data *bfqd = q->elevator->elevator_data;
2500	struct request *__rq;
2501
2502	__rq = bfq_find_rq_fmerge(bfqd, bio, q);
2503	if (__rq && elv_bio_merge_ok(__rq, bio)) {
2504	*req = __rq;
2505
2506	if (blk_discard_mergable(req: __rq))
2507	return ELEVATOR_DISCARD_MERGE;
2508	return ELEVATOR_FRONT_MERGE;
2509	}
2510
2511	return ELEVATOR_NO_MERGE;
2512	}
2513
2514	static void bfq_request_merged(struct request_queue q, struct* request *req,
2515	enum elv_merge type)
2516	{
2517	if (type == ELEVATOR_FRONT_MERGE &&
2518	rb_prev(&req->rb_node) &&
2519	blk_rq_pos(rq: req) <
2520	blk_rq_pos(container_of(rb_prev(&req->rb_node),
2521	struct request, rb_node))) {
2522	struct bfq_queue *bfqq = RQ_BFQQ(req);
2523	struct bfq_data *bfqd;
2524	struct request prev, next_rq;
2525
2526	if (!bfqq)
2527	return;
2528
2529	bfqd = bfqq->bfqd;
2530
2531	/ Reposition request in its sort_list /
2532	elv_rb_del(&bfqq->sort_list, req);
2533	elv_rb_add(&bfqq->sort_list, req);
2534
2535	/ Choose next request to be served for bfqq /
2536	prev = bfqq->next_rq;
2537	next_rq = bfq_choose_req(bfqd, rq1: bfqq->next_rq, rq2: req,
2538	last: bfqd->last_position);
2539	bfqq->next_rq = next_rq;
2540	/*
2541	* If next_rq changes, update both the queue's budget to
2542	* fit the new request and the queue's position in its
2543	* rq_pos_tree.
2544	*/
2545	if (prev != bfqq->next_rq) {
2546	bfq_updated_next_req(bfqd, bfqq);
2547	/*
2548	* See comments on bfq_pos_tree_add_move() for
2549	* the unlikely().
2550	*/
2551	if (unlikely(!bfqd->nonrot_with_queueing))
2552	bfq_pos_tree_add_move(bfqd, bfqq);
2553	}
2554	}
2555	}
2556
2557	/*
2558	* This function is called to notify the scheduler that the requests
2559	* rq and 'next' have been merged, with 'next' going away. BFQ
2560	* exploits this hook to address the following issue: if 'next' has a
2561	* fifo_time lower that rq, then the fifo_time of rq must be set to
2562	* the value of 'next', to not forget the greater age of 'next'.
2563	*
2564	* NOTE: in this function we assume that rq is in a bfq_queue, basing
2565	* on that rq is picked from the hash table q->elevator->hash, which,
2566	* in its turn, is filled only with I/O requests present in
2567	* bfq_queues, while BFQ is in use for the request queue q. In fact,
2568	* the function that fills this hash table (elv_rqhash_add) is called
2569	* only by bfq_insert_request.
2570	*/
2571	static void bfq_requests_merged(struct request_queue q, struct* request *rq,
2572	struct request *next)
2573	{
2574	struct bfq_queue *bfqq = RQ_BFQQ(rq),
2575	*next_bfqq = RQ_BFQQ(next);
2576
2577	if (!bfqq)
2578	goto remove;
2579
2580	/*
2581	* If next and rq belong to the same bfq_queue and next is older
2582	* than rq, then reposition rq in the fifo (by substituting next
2583	* with rq). Otherwise, if next and rq belong to different
2584	* bfq_queues, never reposition rq: in fact, we would have to
2585	* reposition it with respect to next's position in its own fifo,
2586	* which would most certainly be too expensive with respect to
2587	* the benefits.
2588	*/
2589	if (bfqq == next_bfqq &&
2590	!list_empty(head: &rq->queuelist) && !list_empty(head: &next->queuelist) &&
2591	next->fifo_time < rq->fifo_time) {
2592	list_del_init(entry: &rq->queuelist);
2593	list_replace_init(old: &next->queuelist, new: &rq->queuelist);
2594	rq->fifo_time = next->fifo_time;
2595	}
2596
2597	if (bfqq->next_rq == next)
2598	bfqq->next_rq = rq;
2599
2600	bfqg_stats_update_io_merged(bfqg: bfqq_group(bfqq), opf: next->cmd_flags);
2601	remove:
2602	/ Merged request may be in the IO scheduler. Remove it. /
2603	if (!RB_EMPTY_NODE(&next->rb_node)) {
2604	bfq_remove_request(q: next->q, rq: next);
2605	if (next_bfqq)
2606	bfqg_stats_update_io_remove(bfqg: bfqq_group(bfqq: next_bfqq),
2607	opf: next->cmd_flags);
2608	}
2609	}
2610
2611	/ Must be called with bfqq != NULL /
2612	static void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
2613	{
2614	/*
2615	* If bfqq has been enjoying interactive weight-raising, then
2616	* reset soft_rt_next_start. We do it for the following
2617	* reason. bfqq may have been conveying the I/O needed to load
2618	* a soft real-time application. Such an application actually
2619	* exhibits a soft real-time I/O pattern after it finishes
2620	* loading, and finally starts doing its job. But, if bfqq has
2621	* been receiving a lot of bandwidth so far (likely to happen
2622	* on a fast device), then soft_rt_next_start now contains a
2623	* high value that. So, without this reset, bfqq would be
2624	* prevented from being possibly considered as soft_rt for a
2625	* very long time.
2626	*/
2627
2628	if (bfqq->wr_cur_max_time !=
2629	bfqq->bfqd->bfq_wr_rt_max_time)
2630	bfqq->soft_rt_next_start = jiffies;
2631
2632	if (bfq_bfqq_busy(bfqq))
2633	bfqq->bfqd->wr_busy_queues--;
2634	bfqq->wr_coeff = `1`;
2635	bfqq->wr_cur_max_time = `0`;
2636	bfqq->last_wr_start_finish = jiffies;
2637	/*
2638	* Trigger a weight change on the next invocation of
2639	* __bfq_entity_update_weight_prio.
2640	*/
2641	bfqq->entity.prio_changed = `1`;
2642	}
2643
2644	void bfq_end_wr_async_queues(struct bfq_data *bfqd,
2645	struct bfq_group *bfqg)
2646	{
2647	int i, j, k;
2648
2649	for (k = `0`; k < bfqd->num_actuators; k++) {
2650	for (i = `0`; i < `2`; i++)
2651	for (j = `0`; j < IOPRIO_NR_LEVELS; j++)
2652	if (bfqg->async_bfqq[i][j][k])
2653	bfq_bfqq_end_wr(bfqq: bfqg->async_bfqq[i][j][k]);
2654	if (bfqg->async_idle_bfqq[k])
2655	bfq_bfqq_end_wr(bfqq: bfqg->async_idle_bfqq[k]);
2656	}
2657	}
2658
2659	static void bfq_end_wr(struct bfq_data *bfqd)
2660	{
2661	struct bfq_queue *bfqq;
2662	int i;
2663
2664	spin_lock_irq(lock: &bfqd->lock);
2665
2666	for (i = `0`; i < bfqd->num_actuators; i++) {
2667	list_for_each_entry(bfqq, &bfqd->active_list[i], bfqq_list)
2668	bfq_bfqq_end_wr(bfqq);
2669	}
2670	list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)
2671	bfq_bfqq_end_wr(bfqq);
2672	bfq_end_wr_async(bfqd);
2673
2674	spin_unlock_irq(lock: &bfqd->lock);
2675	}
2676
2677	static sector_t bfq_io_struct_pos(void *io_struct, bool request)
2678	{
2679	if (request)
2680	return blk_rq_pos(rq: io_struct);
2681	else
2682	return ((struct bio *)io_struct)->bi_iter.bi_sector;
2683	}
2684
2685	static int bfq_rq_close_to_sector(void *io_struct, bool request,
2686	sector_t sector)
2687	{
2688	return abs(bfq_io_struct_pos(io_struct, request) - sector) <=
2689	BFQQ_CLOSE_THR;
2690	}
2691
2692	static struct bfq_queue bfqq_find_close(struct* bfq_data *bfqd,
2693	struct bfq_queue *bfqq,
2694	sector_t sector)
2695	{
2696	struct rb_root *root = &bfqq_group(bfqq)->rq_pos_tree;
2697	struct rb_node parent, node;
2698	struct bfq_queue *__bfqq;
2699
2700	if (RB_EMPTY_ROOT(root))
2701	return NULL;
2702
2703	/*
2704	* First, if we find a request starting at the end of the last
2705	* request, choose it.
2706	*/
2707	__bfqq = bfq_rq_pos_tree_lookup(bfqd, root, sector, ret_parent: &parent, NULL);
2708	if (__bfqq)
2709	return __bfqq;
2710
2711	/*
2712	* If the exact sector wasn't found, the parent of the NULL leaf
2713	* will contain the closest sector (rq_pos_tree sorted by
2714	* next_request position).
2715	*/
2716	__bfqq = rb_entry(parent, struct bfq_queue, pos_node);
2717	if (bfq_rq_close_to_sector(io_struct: __bfqq->next_rq, request: true, sector))
2718	return __bfqq;
2719
2720	if (blk_rq_pos(rq: __bfqq->next_rq) < sector)
2721	node = rb_next(&__bfqq->pos_node);
2722	else
2723	node = rb_prev(&__bfqq->pos_node);
2724	if (!node)
2725	return NULL;
2726
2727	__bfqq = rb_entry(node, struct bfq_queue, pos_node);
2728	if (bfq_rq_close_to_sector(io_struct: __bfqq->next_rq, request: true, sector))
2729	return __bfqq;
2730
2731	return NULL;
2732	}
2733
2734	static struct bfq_queue bfq_find_close_cooperator(struct* bfq_data *bfqd,
2735	struct bfq_queue *cur_bfqq,
2736	sector_t sector)
2737	{
2738	struct bfq_queue *bfqq;
2739
2740	/*
2741	* We shall notice if some of the queues are cooperating,
2742	* e.g., working closely on the same area of the device. In
2743	* that case, we can group them together and: 1) don't waste
2744	* time idling, and 2) serve the union of their requests in
2745	* the best possible order for throughput.
2746	*/
2747	bfqq = bfqq_find_close(bfqd, bfqq: cur_bfqq, sector);
2748	if (!bfqq \|\| bfqq == cur_bfqq)
2749	return NULL;
2750
2751	return bfqq;
2752	}
2753
2754	static struct bfq_queue *
2755	bfq_setup_merge(struct bfq_queue bfqq, struct* bfq_queue *new_bfqq)
2756	{
2757	int process_refs, new_process_refs;
2758	struct bfq_queue *__bfqq;
2759
2760	/*
2761	* If there are no process references on the new_bfqq, then it is
2762	* unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
2763	* may have dropped their last reference (not just their last process
2764	* reference).
2765	*/
2766	if (!bfqq_process_refs(bfqq: new_bfqq))
2767	return NULL;
2768
2769	/ Avoid a circular list and skip interim queue merges. /
2770	while ((__bfqq = new_bfqq->new_bfqq)) {
2771	if (__bfqq == bfqq)
2772	return NULL;
2773	new_bfqq = __bfqq;
2774	}
2775
2776	process_refs = bfqq_process_refs(bfqq);
2777	new_process_refs = bfqq_process_refs(bfqq: new_bfqq);
2778	/*
2779	* If the process for the bfqq has gone away, there is no
2780	* sense in merging the queues.
2781	*/
2782	if (process_refs == `0` \|\| new_process_refs == `0`)
2783	return NULL;
2784
2785	/*
2786	* Make sure merged queues belong to the same parent. Parents could
2787	* have changed since the time we decided the two queues are suitable
2788	* for merging.
2789	*/
2790	if (new_bfqq->entity.parent != bfqq->entity.parent)
2791	return NULL;
2792
2793	bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
2794	new_bfqq->pid);
2795
2796	/*
2797	* Merging is just a redirection: the requests of the process
2798	* owning one of the two queues are redirected to the other queue.
2799	* The latter queue, in its turn, is set as shared if this is the
2800	* first time that the requests of some process are redirected to
2801	* it.
2802	*
2803	* We redirect bfqq to new_bfqq and not the opposite, because
2804	* we are in the context of the process owning bfqq, thus we
2805	* have the io_cq of this process. So we can immediately
2806	* configure this io_cq to redirect the requests of the
2807	* process to new_bfqq. In contrast, the io_cq of new_bfqq is
2808	* not available any more (new_bfqq->bic == NULL).
2809	*
2810	* Anyway, even in case new_bfqq coincides with the in-service
2811	* queue, redirecting requests the in-service queue is the
2812	* best option, as we feed the in-service queue with new
2813	* requests close to the last request served and, by doing so,
2814	* are likely to increase the throughput.
2815	*/
2816	bfqq->new_bfqq = new_bfqq;
2817	/*
2818	* The above assignment schedules the following redirections:
2819	* each time some I/O for bfqq arrives, the process that
2820	* generated that I/O is disassociated from bfqq and
2821	* associated with new_bfqq. Here we increases new_bfqq->ref
2822	* in advance, adding the number of processes that are
2823	* expected to be associated with new_bfqq as they happen to
2824	* issue I/O.
2825	*/
2826	new_bfqq->ref += process_refs;
2827	return new_bfqq;
2828	}
2829
2830	static bool bfq_may_be_close_cooperator(struct bfq_queue *bfqq,
2831	struct bfq_queue *new_bfqq)
2832	{
2833	if (bfq_too_late_for_merging(bfqq: new_bfqq))
2834	return false;
2835
2836	if (bfq_class_idle(bfqq) \|\| bfq_class_idle(new_bfqq) \|\|
2837	(bfqq->ioprio_class != new_bfqq->ioprio_class))
2838	return false;
2839
2840	/*
2841	* If either of the queues has already been detected as seeky,
2842	* then merging it with the other queue is unlikely to lead to
2843	* sequential I/O.
2844	*/
2845	if (BFQQ_SEEKY(bfqq) \|\| BFQQ_SEEKY(new_bfqq))
2846	return false;
2847
2848	/*
2849	* Interleaved I/O is known to be done by (some) applications
2850	* only for reads, so it does not make sense to merge async
2851	* queues.
2852	*/
2853	if (!bfq_bfqq_sync(bfqq) \|\| !bfq_bfqq_sync(bfqq: new_bfqq))
2854	return false;
2855
2856	return true;
2857	}
2858
2859	static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
2860	struct bfq_queue *bfqq);
2861
2862	static struct bfq_queue *
2863	bfq_setup_stable_merge(struct bfq_data bfqd, struct* bfq_queue *bfqq,
2864	struct bfq_queue *stable_merge_bfqq,
2865	struct bfq_iocq_bfqq_data *bfqq_data)
2866	{
2867	int proc_ref = min(bfqq_process_refs(bfqq),
2868	bfqq_process_refs(stable_merge_bfqq));
2869	struct bfq_queue *new_bfqq = NULL;
2870
2871	bfqq_data->stable_merge_bfqq = NULL;
2872	if (idling_boosts_thr_without_issues(bfqd, bfqq) \|\| proc_ref == `0`)
2873	goto out;
2874
2875	/ next function will take at least one ref /
2876	new_bfqq = bfq_setup_merge(bfqq, new_bfqq: stable_merge_bfqq);
2877
2878	if (new_bfqq) {
2879	bfqq_data->stably_merged = true;
2880	if (new_bfqq->bic) {
2881	unsigned int new_a_idx = new_bfqq->actuator_idx;
2882	struct bfq_iocq_bfqq_data *new_bfqq_data =
2883	&new_bfqq->bic->bfqq_data[new_a_idx];
2884
2885	new_bfqq_data->stably_merged = true;
2886	}
2887	}
2888
2889	out:
2890	/ deschedule stable merge, because done or aborted here /
2891	bfq_put_stable_ref(bfqq: stable_merge_bfqq);
2892
2893	return new_bfqq;
2894	}
2895
2896	/*
2897	* Attempt to schedule a merge of bfqq with the currently in-service
2898	* queue or with a close queue among the scheduled queues. Return
2899	* NULL if no merge was scheduled, a pointer to the shared bfq_queue
2900	* structure otherwise.
2901	*
2902	* The OOM queue is not allowed to participate to cooperation: in fact, since
2903	* the requests temporarily redirected to the OOM queue could be redirected
2904	* again to dedicated queues at any time, the state needed to correctly
2905	* handle merging with the OOM queue would be quite complex and expensive
2906	* to maintain. Besides, in such a critical condition as an out of memory,
2907	* the benefits of queue merging may be little relevant, or even negligible.
2908	*
2909	* WARNING: queue merging may impair fairness among non-weight raised
2910	* queues, for at least two reasons: 1) the original weight of a
2911	* merged queue may change during the merged state, 2) even being the
2912	* weight the same, a merged queue may be bloated with many more
2913	* requests than the ones produced by its originally-associated
2914	* process.
2915	*/
2916	static struct bfq_queue *
2917	bfq_setup_cooperator(struct bfq_data bfqd, struct* bfq_queue *bfqq,
2918	void io_struct, bool request, struct* bfq_io_cq *bic)
2919	{
2920	struct bfq_queue in_service_bfqq, new_bfqq;
2921	unsigned int a_idx = bfqq->actuator_idx;
2922	struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data[a_idx];
2923
2924	/ if a merge has already been setup, then proceed with that first /
2925	new_bfqq = bfqq->new_bfqq;
2926	if (new_bfqq) {
2927	while (new_bfqq->new_bfqq)
2928	new_bfqq = new_bfqq->new_bfqq;
2929	return new_bfqq;
2930	}
2931
2932	/*
2933	* Check delayed stable merge for rotational or non-queueing
2934	* devs. For this branch to be executed, bfqq must not be
2935	* currently merged with some other queue (i.e., bfqq->bic
2936	* must be non null). If we considered also merged queues,
2937	* then we should also check whether bfqq has already been
2938	* merged with bic->stable_merge_bfqq. But this would be
2939	* costly and complicated.
2940	*/
2941	if (unlikely(!bfqd->nonrot_with_queueing)) {
2942	/*
2943	* Make sure also that bfqq is sync, because
2944	* bic->stable_merge_bfqq may point to some queue (for
2945	* stable merging) also if bic is associated with a
2946	* sync queue, but this bfqq is async
2947	*/
2948	if (bfq_bfqq_sync(bfqq) && bfqq_data->stable_merge_bfqq &&
2949	!bfq_bfqq_just_created(bfqq) &&
2950	time_is_before_jiffies(bfqq->split_time +
2951	msecs_to_jiffies(bfq_late_stable_merging)) &&
2952	time_is_before_jiffies(bfqq->creation_time +
2953	msecs_to_jiffies(bfq_late_stable_merging))) {
2954	struct bfq_queue *stable_merge_bfqq =
2955	bfqq_data->stable_merge_bfqq;
2956
2957	return bfq_setup_stable_merge(bfqd, bfqq,
2958	stable_merge_bfqq,
2959	bfqq_data);
2960	}
2961	}
2962
2963	/*
2964	* Do not perform queue merging if the device is non
2965	* rotational and performs internal queueing. In fact, such a
2966	* device reaches a high speed through internal parallelism
2967	* and pipelining. This means that, to reach a high
2968	* throughput, it must have many requests enqueued at the same
2969	* time. But, in this configuration, the internal scheduling
2970	* algorithm of the device does exactly the job of queue
2971	* merging: it reorders requests so as to obtain as much as
2972	* possible a sequential I/O pattern. As a consequence, with
2973	* the workload generated by processes doing interleaved I/O,
2974	* the throughput reached by the device is likely to be the
2975	* same, with and without queue merging.
2976	*
2977	* Disabling merging also provides a remarkable benefit in
2978	* terms of throughput. Merging tends to make many workloads
2979	* artificially more uneven, because of shared queues
2980	* remaining non empty for incomparably more time than
2981	* non-merged queues. This may accentuate workload
2982	* asymmetries. For example, if one of the queues in a set of
2983	* merged queues has a higher weight than a normal queue, then
2984	* the shared queue may inherit such a high weight and, by
2985	* staying almost always active, may force BFQ to perform I/O
2986	* plugging most of the time. This evidently makes it harder
2987	* for BFQ to let the device reach a high throughput.
2988	*
2989	* Finally, the likely() macro below is not used because one
2990	* of the two branches is more likely than the other, but to
2991	* have the code path after the following if() executed as
2992	* fast as possible for the case of a non rotational device
2993	* with queueing. We want it because this is the fastest kind
2994	* of device. On the opposite end, the likely() may lengthen
2995	* the execution time of BFQ for the case of slower devices
2996	* (rotational or at least without queueing). But in this case
2997	* the execution time of BFQ matters very little, if not at
2998	* all.
2999	*/
3000	if (likely(bfqd->nonrot_with_queueing))
3001	return NULL;
3002
3003	/*
3004	* Prevent bfqq from being merged if it has been created too
3005	* long ago. The idea is that true cooperating processes, and
3006	* thus their associated bfq_queues, are supposed to be
3007	* created shortly after each other. This is the case, e.g.,
3008	* for KVM/QEMU and dump I/O threads. Basing on this
3009	* assumption, the following filtering greatly reduces the
3010	* probability that two non-cooperating processes, which just
3011	* happen to do close I/O for some short time interval, have
3012	* their queues merged by mistake.
3013	*/
3014	if (bfq_too_late_for_merging(bfqq))
3015	return NULL;
3016
3017	if (!io_struct \|\| unlikely(bfqq == &bfqd->oom_bfqq))
3018	return NULL;
3019
3020	/ If there is only one backlogged queue, don't search. /
3021	if (bfq_tot_busy_queues(bfqd) == `1`)
3022	return NULL;
3023
3024	in_service_bfqq = bfqd->in_service_queue;
3025
3026	if (in_service_bfqq && in_service_bfqq != bfqq &&
3027	likely(in_service_bfqq != &bfqd->oom_bfqq) &&
3028	bfq_rq_close_to_sector(io_struct, request,
3029	sector: bfqd->in_serv_last_pos) &&
3030	bfqq->entity.parent == in_service_bfqq->entity.parent &&
3031	bfq_may_be_close_cooperator(bfqq, new_bfqq: in_service_bfqq)) {
3032	new_bfqq = bfq_setup_merge(bfqq, new_bfqq: in_service_bfqq);
3033	if (new_bfqq)
3034	return new_bfqq;
3035	}
3036	/*
3037	* Check whether there is a cooperator among currently scheduled
3038	* queues. The only thing we need is that the bio/request is not
3039	* NULL, as we need it to establish whether a cooperator exists.
3040	*/
3041	new_bfqq = bfq_find_close_cooperator(bfqd, cur_bfqq: bfqq,
3042	sector: bfq_io_struct_pos(io_struct, request));
3043
3044	if (new_bfqq && likely(new_bfqq != &bfqd->oom_bfqq) &&
3045	bfq_may_be_close_cooperator(bfqq, new_bfqq))
3046	return bfq_setup_merge(bfqq, new_bfqq);
3047
3048	return NULL;
3049	}
3050
3051	static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
3052	{
3053	struct bfq_io_cq *bic = bfqq->bic;
3054	unsigned int a_idx = bfqq->actuator_idx;
3055	struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data[a_idx];
3056
3057	/*
3058	* If !bfqq->bic, the queue is already shared or its requests
3059	* have already been redirected to a shared queue; both idle window
3060	* and weight raising state have already been saved. Do nothing.
3061	*/
3062	if (!bic)
3063	return;
3064
3065	bfqq_data->saved_last_serv_time_ns = bfqq->last_serv_time_ns;
3066	bfqq_data->saved_inject_limit = bfqq->inject_limit;
3067	bfqq_data->saved_decrease_time_jif = bfqq->decrease_time_jif;
3068
3069	bfqq_data->saved_weight = bfqq->entity.orig_weight;
3070	bfqq_data->saved_ttime = bfqq->ttime;
3071	bfqq_data->saved_has_short_ttime =
3072	bfq_bfqq_has_short_ttime(bfqq);
3073	bfqq_data->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
3074	bfqq_data->saved_io_start_time = bfqq->io_start_time;
3075	bfqq_data->saved_tot_idle_time = bfqq->tot_idle_time;
3076	bfqq_data->saved_in_large_burst = bfq_bfqq_in_large_burst(bfqq);
3077	bfqq_data->was_in_burst_list =
3078	!hlist_unhashed(h: &bfqq->burst_list_node);
3079
3080	if (unlikely(bfq_bfqq_just_created(bfqq) &&
3081	!bfq_bfqq_in_large_burst(bfqq) &&
3082	bfqq->bfqd->low_latency)) {
3083	/*
3084	* bfqq being merged right after being created: bfqq
3085	* would have deserved interactive weight raising, but
3086	* did not make it to be set in a weight-raised state,
3087	* because of this early merge. Store directly the
3088	* weight-raising state that would have been assigned
3089	* to bfqq, so that to avoid that bfqq unjustly fails
3090	* to enjoy weight raising if split soon.
3091	*/
3092	bfqq_data->saved_wr_coeff = bfqq->bfqd->bfq_wr_coeff;
3093	bfqq_data->saved_wr_start_at_switch_to_srt =
3094	bfq_smallest_from_now();
3095	bfqq_data->saved_wr_cur_max_time =
3096	bfq_wr_duration(bfqd: bfqq->bfqd);
3097	bfqq_data->saved_last_wr_start_finish = jiffies;
3098	} else {
3099	bfqq_data->saved_wr_coeff = bfqq->wr_coeff;
3100	bfqq_data->saved_wr_start_at_switch_to_srt =
3101	bfqq->wr_start_at_switch_to_srt;
3102	bfqq_data->saved_service_from_wr =
3103	bfqq->service_from_wr;
3104	bfqq_data->saved_last_wr_start_finish =
3105	bfqq->last_wr_start_finish;
3106	bfqq_data->saved_wr_cur_max_time = bfqq->wr_cur_max_time;
3107	}
3108	}
3109
3110
3111	void bfq_reassign_last_bfqq(struct bfq_queue *cur_bfqq,
3112	struct bfq_queue *new_bfqq)
3113	{
3114	if (cur_bfqq->entity.parent &&
3115	cur_bfqq->entity.parent->last_bfqq_created == cur_bfqq)
3116	cur_bfqq->entity.parent->last_bfqq_created = new_bfqq;
3117	else if (cur_bfqq->bfqd && cur_bfqq->bfqd->last_bfqq_created == cur_bfqq)
3118	cur_bfqq->bfqd->last_bfqq_created = new_bfqq;
3119	}
3120
3121	void bfq_release_process_ref(struct bfq_data bfqd, struct* bfq_queue *bfqq)
3122	{
3123	/*
3124	* To prevent bfqq's service guarantees from being violated,
3125	* bfqq may be left busy, i.e., queued for service, even if
3126	* empty (see comments in __bfq_bfqq_expire() for
3127	* details). But, if no process will send requests to bfqq any
3128	* longer, then there is no point in keeping bfqq queued for
3129	* service. In addition, keeping bfqq queued for service, but
3130	* with no process ref any longer, may have caused bfqq to be
3131	* freed when dequeued from service. But this is assumed to
3132	* never happen.
3133	*/
3134	if (bfq_bfqq_busy(bfqq) && RB_EMPTY_ROOT(&bfqq->sort_list) &&
3135	bfqq != bfqd->in_service_queue)
3136	bfq_del_bfqq_busy(bfqq, expiration: false);
3137
3138	bfq_reassign_last_bfqq(cur_bfqq: bfqq, NULL);
3139
3140	bfq_put_queue(bfqq);
3141	}
3142
3143	static struct bfq_queue bfq_merge_bfqqs(struct* bfq_data *bfqd,
3144	struct bfq_io_cq *bic,
3145	struct bfq_queue *bfqq)
3146	{
3147	struct bfq_queue *new_bfqq = bfqq->new_bfqq;
3148
3149	bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
3150	(unsigned long)new_bfqq->pid);
3151	/ Save weight raising and idle window of the merged queues /
3152	bfq_bfqq_save_state(bfqq);
3153	bfq_bfqq_save_state(bfqq: new_bfqq);
3154	if (bfq_bfqq_IO_bound(bfqq))
3155	bfq_mark_bfqq_IO_bound(bfqq: new_bfqq);
3156	bfq_clear_bfqq_IO_bound(bfqq);
3157
3158	/*
3159	* The processes associated with bfqq are cooperators of the
3160	* processes associated with new_bfqq. So, if bfqq has a
3161	* waker, then assume that all these processes will be happy
3162	* to let bfqq's waker freely inject I/O when they have no
3163	* I/O.
3164	*/
3165	if (bfqq->waker_bfqq && !new_bfqq->waker_bfqq &&
3166	bfqq->waker_bfqq != new_bfqq) {
3167	new_bfqq->waker_bfqq = bfqq->waker_bfqq;
3168	new_bfqq->tentative_waker_bfqq = NULL;
3169
3170	/*
3171	* If the waker queue disappears, then
3172	* new_bfqq->waker_bfqq must be reset. So insert
3173	* new_bfqq into the woken_list of the waker. See
3174	* bfq_check_waker for details.
3175	*/
3176	hlist_add_head(n: &new_bfqq->woken_list_node,
3177	h: &new_bfqq->waker_bfqq->woken_list);
3178
3179	}
3180
3181	/*
3182	* If bfqq is weight-raised, then let new_bfqq inherit
3183	* weight-raising. To reduce false positives, neglect the case
3184	* where bfqq has just been created, but has not yet made it
3185	* to be weight-raised (which may happen because EQM may merge
3186	* bfqq even before bfq_add_request is executed for the first
3187	* time for bfqq). Handling this case would however be very
3188	* easy, thanks to the flag just_created.
3189	*/
3190	if (new_bfqq->wr_coeff == `1` && bfqq->wr_coeff > `1`) {
3191	new_bfqq->wr_coeff = bfqq->wr_coeff;
3192	new_bfqq->wr_cur_max_time = bfqq->wr_cur_max_time;
3193	new_bfqq->last_wr_start_finish = bfqq->last_wr_start_finish;
3194	new_bfqq->wr_start_at_switch_to_srt =
3195	bfqq->wr_start_at_switch_to_srt;
3196	if (bfq_bfqq_busy(bfqq: new_bfqq))
3197	bfqd->wr_busy_queues++;
3198	new_bfqq->entity.prio_changed = `1`;
3199	}
3200
3201	if (bfqq->wr_coeff > `1`) { / bfqq has given its wr to new_bfqq /
3202	bfqq->wr_coeff = `1`;
3203	bfqq->entity.prio_changed = `1`;
3204	if (bfq_bfqq_busy(bfqq))
3205	bfqd->wr_busy_queues--;
3206	}
3207
3208	bfq_log_bfqq(bfqd, new_bfqq, "merge_bfqqs: wr_busy %d",
3209	bfqd->wr_busy_queues);
3210
3211	/*
3212	* Merge queues (that is, let bic redirect its requests to new_bfqq)
3213	*/
3214	bic_set_bfqq(bic, bfqq: new_bfqq, is_sync: true, actuator_idx: bfqq->actuator_idx);
3215	bfq_mark_bfqq_coop(bfqq: new_bfqq);
3216	/*
3217	* new_bfqq now belongs to at least two bics (it is a shared queue):
3218	* set new_bfqq->bic to NULL. bfqq either:
3219	* - does not belong to any bic any more, and hence bfqq->bic must
3220	* be set to NULL, or
3221	* - is a queue whose owning bics have already been redirected to a
3222	* different queue, hence the queue is destined to not belong to
3223	* any bic soon and bfqq->bic is already NULL (therefore the next
3224	* assignment causes no harm).
3225	*/
3226	new_bfqq->bic = NULL;
3227	/*
3228	* If the queue is shared, the pid is the pid of one of the associated
3229	* processes. Which pid depends on the exact sequence of merge events
3230	* the queue underwent. So printing such a pid is useless and confusing
3231	* because it reports a random pid between those of the associated
3232	* processes.
3233	* We mark such a queue with a pid -1, and then print SHARED instead of
3234	* a pid in logging messages.
3235	*/
3236	new_bfqq->pid = -`1`;
3237	bfqq->bic = NULL;
3238
3239	bfq_reassign_last_bfqq(cur_bfqq: bfqq, new_bfqq);
3240
3241	bfq_release_process_ref(bfqd, bfqq);
3242
3243	return new_bfqq;
3244	}
3245
3246	static bool bfq_allow_bio_merge(struct request_queue q, struct* request *rq,
3247	struct bio *bio)
3248	{
3249	struct bfq_data *bfqd = q->elevator->elevator_data;
3250	bool is_sync = op_is_sync(op: bio->bi_opf);
3251	struct bfq_queue bfqq = bfqd->bio_bfqq, new_bfqq;
3252
3253	/*
3254	* Disallow merge of a sync bio into an async request.
3255	*/
3256	if (is_sync && !rq_is_sync(rq))
3257	return false;
3258
3259	/*
3260	* Lookup the bfqq that this bio will be queued with. Allow
3261	* merge only if rq is queued there.
3262	*/
3263	if (!bfqq)
3264	return false;
3265
3266	/*
3267	* We take advantage of this function to perform an early merge
3268	* of the queues of possible cooperating processes.
3269	*/
3270	new_bfqq = bfq_setup_cooperator(bfqd, bfqq, io_struct: bio, request: false, bic: bfqd->bio_bic);
3271	if (new_bfqq) {
3272	/*
3273	* bic still points to bfqq, then it has not yet been
3274	* redirected to some other bfq_queue, and a queue
3275	* merge between bfqq and new_bfqq can be safely
3276	* fulfilled, i.e., bic can be redirected to new_bfqq
3277	* and bfqq can be put.
3278	*/
3279	while (bfqq != new_bfqq)
3280	bfqq = bfq_merge_bfqqs(bfqd, bic: bfqd->bio_bic, bfqq);
3281
3282	/*
3283	* Change also bqfd->bio_bfqq, as
3284	* bfqd->bio_bic now points to new_bfqq, and
3285	* this function may be invoked again (and then may
3286	* use again bqfd->bio_bfqq).
3287	*/
3288	bfqd->bio_bfqq = bfqq;
3289	}
3290
3291	return bfqq == RQ_BFQQ(rq);
3292	}
3293
3294	/*
3295	* Set the maximum time for the in-service queue to consume its
3296	* budget. This prevents seeky processes from lowering the throughput.
3297	* In practice, a time-slice service scheme is used with seeky
3298	* processes.
3299	*/
3300	static void bfq_set_budget_timeout(struct bfq_data *bfqd,
3301	struct bfq_queue *bfqq)
3302	{
3303	unsigned int timeout_coeff;
3304
3305	if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time)
3306	timeout_coeff = `1`;
3307	else
3308	timeout_coeff = bfqq->entity.weight / bfqq->entity.orig_weight;
3309
3310	bfqd->last_budget_start = blk_time_get();
3311
3312	bfqq->budget_timeout = jiffies +
3313	bfqd->bfq_timeout * timeout_coeff;
3314	}
3315
3316	static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
3317	struct bfq_queue *bfqq)
3318	{
3319	if (bfqq) {
3320	bfq_clear_bfqq_fifo_expire(bfqq);
3321
3322	bfqd->budgets_assigned = (bfqd->budgets_assigned * `7` + `256`) / `8`;
3323
3324	if (time_is_before_jiffies(bfqq->last_wr_start_finish) &&
3325	bfqq->wr_coeff > `1` &&
3326	bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
3327	time_is_before_jiffies(bfqq->budget_timeout)) {
3328	/*
3329	* For soft real-time queues, move the start
3330	* of the weight-raising period forward by the
3331	* time the queue has not received any
3332	* service. Otherwise, a relatively long
3333	* service delay is likely to cause the
3334	* weight-raising period of the queue to end,
3335	* because of the short duration of the
3336	* weight-raising period of a soft real-time
3337	* queue. It is worth noting that this move
3338	* is not so dangerous for the other queues,
3339	* because soft real-time queues are not
3340	* greedy.
3341	*
3342	* To not add a further variable, we use the
3343	* overloaded field budget_timeout to
3344	* determine for how long the queue has not
3345	* received service, i.e., how much time has
3346	* elapsed since the queue expired. However,
3347	* this is a little imprecise, because
3348	* budget_timeout is set to jiffies if bfqq
3349	* not only expires, but also remains with no
3350	* request.
3351	*/
3352	if (time_after(bfqq->budget_timeout,
3353	bfqq->last_wr_start_finish))
3354	bfqq->last_wr_start_finish +=
3355	jiffies - bfqq->budget_timeout;
3356	else
3357	bfqq->last_wr_start_finish = jiffies;
3358	}
3359
3360	bfq_set_budget_timeout(bfqd, bfqq);
3361	bfq_log_bfqq(bfqd, bfqq,
3362	"set_in_service_queue, cur-budget = %d",
3363	bfqq->entity.budget);
3364	}
3365
3366	bfqd->in_service_queue = bfqq;
3367	bfqd->in_serv_last_pos = `0`;
3368	}
3369
3370	/*
3371	* Get and set a new queue for service.
3372	*/
3373	static struct bfq_queue bfq_set_in_service_queue(struct* bfq_data *bfqd)
3374	{
3375	struct bfq_queue *bfqq = bfq_get_next_queue(bfqd);
3376
3377	__bfq_set_in_service_queue(bfqd, bfqq);
3378	return bfqq;
3379	}
3380
3381	static void bfq_arm_slice_timer(struct bfq_data *bfqd)
3382	{
3383	struct bfq_queue *bfqq = bfqd->in_service_queue;
3384	u32 sl;
3385
3386	bfq_mark_bfqq_wait_request(bfqq);
3387
3388	/*
3389	* We don't want to idle for seeks, but we do want to allow
3390	* fair distribution of slice time for a process doing back-to-back
3391	* seeks. So allow a little bit of time for him to submit a new rq.
3392	*/
3393	sl = bfqd->bfq_slice_idle;
3394	/*
3395	* Unless the queue is being weight-raised or the scenario is
3396	* asymmetric, grant only minimum idle time if the queue
3397	* is seeky. A long idling is preserved for a weight-raised
3398	* queue, or, more in general, in an asymmetric scenario,
3399	* because a long idling is needed for guaranteeing to a queue
3400	* its reserved share of the throughput (in particular, it is
3401	* needed if the queue has a higher weight than some other
3402	* queue).
3403	*/
3404	if (BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == `1` &&
3405	!bfq_asymmetric_scenario(bfqd, bfqq))
3406	sl = min_t(u64, sl, BFQ_MIN_TT);
3407	else if (bfqq->wr_coeff > `1`)
3408	sl = max_t(u32, sl, `20ULL` * NSEC_PER_MSEC);
3409
3410	bfqd->last_idling_start = blk_time_get();
3411	bfqd->last_idling_start_jiffies = jiffies;
3412
3413	hrtimer_start(timer: &bfqd->idle_slice_timer, tim: ns_to_ktime(ns: sl),
3414	mode: HRTIMER_MODE_REL);
3415	bfqg_stats_set_start_idle_time(bfqg: bfqq_group(bfqq));
3416	}
3417
3418	/*
3419	* In autotuning mode, max_budget is dynamically recomputed as the
3420	* amount of sectors transferred in timeout at the estimated peak
3421	* rate. This enables BFQ to utilize a full timeslice with a full
3422	* budget, even if the in-service queue is served at peak rate. And
3423	* this maximises throughput with sequential workloads.
3424	*/
3425	static unsigned long bfq_calc_max_budget(struct bfq_data *bfqd)
3426	{
3427	return (u64)bfqd->peak_rate * USEC_PER_MSEC *
3428	jiffies_to_msecs(j: bfqd->bfq_timeout)>>BFQ_RATE_SHIFT;
3429	}
3430
3431	/*
3432	* Update parameters related to throughput and responsiveness, as a
3433	* function of the estimated peak rate. See comments on
3434	* bfq_calc_max_budget(), and on the ref_wr_duration array.
3435	*/
3436	static void update_thr_responsiveness_params(struct bfq_data *bfqd)
3437	{
3438	if (bfqd->bfq_user_max_budget == `0`) {
3439	bfqd->bfq_max_budget =
3440	bfq_calc_max_budget(bfqd);
3441	bfq_log(bfqd, "new max_budget = %d", bfqd->bfq_max_budget);
3442	}
3443	}
3444
3445	static void bfq_reset_rate_computation(struct bfq_data *bfqd,
3446	struct request *rq)
3447	{
3448	if (rq != NULL) { / new rq dispatch now, reset accordingly /
3449	bfqd->last_dispatch = bfqd->first_dispatch = blk_time_get_ns();
3450	bfqd->peak_rate_samples = `1`;
3451	bfqd->sequential_samples = `0`;
3452	bfqd->tot_sectors_dispatched = bfqd->last_rq_max_size =
3453	blk_rq_sectors(rq);
3454	} else / no new rq dispatched, just reset the number of samples /
3455	bfqd->peak_rate_samples = `0`; / full re-init on next disp. /
3456
3457	bfq_log(bfqd,
3458	"reset_rate_computation at end, sample %u/%u tot_sects %llu",
3459	bfqd->peak_rate_samples, bfqd->sequential_samples,
3460	bfqd->tot_sectors_dispatched);
3461	}
3462
3463	static void bfq_update_rate_reset(struct bfq_data bfqd, struct* request *rq)
3464	{
3465	u32 rate, weight, divisor;
3466
3467	/*
3468	* For the convergence property to hold (see comments on
3469	* bfq_update_peak_rate()) and for the assessment to be
3470	* reliable, a minimum number of samples must be present, and
3471	* a minimum amount of time must have elapsed. If not so, do
3472	* not compute new rate. Just reset parameters, to get ready
3473	* for a new evaluation attempt.
3474	*/
3475	if (bfqd->peak_rate_samples < BFQ_RATE_MIN_SAMPLES \|\|
3476	bfqd->delta_from_first < BFQ_RATE_MIN_INTERVAL)
3477	goto reset_computation;
3478
3479	/*
3480	* If a new request completion has occurred after last
3481	* dispatch, then, to approximate the rate at which requests
3482	* have been served by the device, it is more precise to
3483	* extend the observation interval to the last completion.
3484	*/
3485	bfqd->delta_from_first =
3486	max_t(u64, bfqd->delta_from_first,
3487	bfqd->last_completion - bfqd->first_dispatch);
3488
3489	/*
3490	* Rate computed in sects/usec, and not sects/nsec, for
3491	* precision issues.
3492	*/
3493	rate = div64_ul(bfqd->tot_sectors_dispatched<<BFQ_RATE_SHIFT,
3494	div_u64(bfqd->delta_from_first, NSEC_PER_USEC));
3495
3496	/*
3497	* Peak rate not updated if:
3498	* - the percentage of sequential dispatches is below 3/4 of the
3499	* total, and rate is below the current estimated peak rate
3500	* - rate is unreasonably high (> 20M sectors/sec)
3501	*/
3502	if ((bfqd->sequential_samples < (`3` * bfqd->peak_rate_samples)>>`2` &&
3503	rate <= bfqd->peak_rate) \|\|
3504	rate > `20`<<BFQ_RATE_SHIFT)
3505	goto reset_computation;
3506
3507	/*
3508	* We have to update the peak rate, at last! To this purpose,
3509	* we use a low-pass filter. We compute the smoothing constant
3510	* of the filter as a function of the 'weight' of the new
3511	* measured rate.
3512	*
3513	* As can be seen in next formulas, we define this weight as a
3514	* quantity proportional to how sequential the workload is,
3515	* and to how long the observation time interval is.
3516	*
3517	* The weight runs from 0 to 8. The maximum value of the
3518	* weight, 8, yields the minimum value for the smoothing
3519	* constant. At this minimum value for the smoothing constant,
3520	* the measured rate contributes for half of the next value of
3521	* the estimated peak rate.
3522	*
3523	* So, the first step is to compute the weight as a function
3524	* of how sequential the workload is. Note that the weight
3525	* cannot reach 9, because bfqd->sequential_samples cannot
3526	* become equal to bfqd->peak_rate_samples, which, in its
3527	* turn, holds true because bfqd->sequential_samples is not
3528	* incremented for the first sample.
3529	*/
3530	weight = (`9` * bfqd->sequential_samples) / bfqd->peak_rate_samples;
3531
3532	/*
3533	* Second step: further refine the weight as a function of the
3534	* duration of the observation interval.
3535	*/
3536	weight = min_t(u32, `8`,
3537	div_u64(weight * bfqd->delta_from_first,
3538	BFQ_RATE_REF_INTERVAL));
3539
3540	/*
3541	* Divisor ranging from 10, for minimum weight, to 2, for
3542	* maximum weight.
3543	*/
3544	divisor = `10` - weight;
3545
3546	/*
3547	* Finally, update peak rate:
3548	*
3549	* peak_rate = peak_rate * (divisor-1) / divisor + rate / divisor
3550	*/
3551	bfqd->peak_rate *= divisor-`1`;
3552	bfqd->peak_rate /= divisor;
3553	rate /= divisor; / smoothing constant alpha = 1/divisor /
3554
3555	bfqd->peak_rate += rate;
3556
3557	/*
3558	* For a very slow device, bfqd->peak_rate can reach 0 (see
3559	* the minimum representable values reported in the comments
3560	* on BFQ_RATE_SHIFT). Push to 1 if this happens, to avoid
3561	* divisions by zero where bfqd->peak_rate is used as a
3562	* divisor.
3563	*/
3564	bfqd->peak_rate = max_t(u32, `1`, bfqd->peak_rate);
3565
3566	update_thr_responsiveness_params(bfqd);
3567
3568	reset_computation:
3569	bfq_reset_rate_computation(bfqd, rq);
3570	}
3571
3572	/*
3573	* Update the read/write peak rate (the main quantity used for
3574	* auto-tuning, see update_thr_responsiveness_params()).
3575	*
3576	* It is not trivial to estimate the peak rate (correctly): because of
3577	* the presence of sw and hw queues between the scheduler and the
3578	* device components that finally serve I/O requests, it is hard to
3579	* say exactly when a given dispatched request is served inside the
3580	* device, and for how long. As a consequence, it is hard to know
3581	* precisely at what rate a given set of requests is actually served
3582	* by the device.
3583	*
3584	* On the opposite end, the dispatch time of any request is trivially
3585	* available, and, from this piece of information, the "dispatch rate"
3586	* of requests can be immediately computed. So, the idea in the next
3587	* function is to use what is known, namely request dispatch times
3588	* (plus, when useful, request completion times), to estimate what is
3589	* unknown, namely in-device request service rate.
3590	*
3591	* The main issue is that, because of the above facts, the rate at
3592	* which a certain set of requests is dispatched over a certain time
3593	* interval can vary greatly with respect to the rate at which the
3594	* same requests are then served. But, since the size of any
3595	* intermediate queue is limited, and the service scheme is lossless
3596	* (no request is silently dropped), the following obvious convergence
3597	* property holds: the number of requests dispatched MUST become
3598	* closer and closer to the number of requests completed as the
3599	* observation interval grows. This is the key property used in
3600	* the next function to estimate the peak service rate as a function
3601	* of the observed dispatch rate. The function assumes to be invoked
3602	* on every request dispatch.
3603	*/
3604	static void bfq_update_peak_rate(struct bfq_data bfqd, struct* request *rq)
3605	{
3606	u64 now_ns = blk_time_get_ns();
3607
3608	if (bfqd->peak_rate_samples == `0`) { / first dispatch /
3609	bfq_log(bfqd, "update_peak_rate: goto reset, samples %d",
3610	bfqd->peak_rate_samples);
3611	bfq_reset_rate_computation(bfqd, rq);
3612	goto update_last_values; / will add one sample /
3613	}
3614
3615	/*
3616	* Device idle for very long: the observation interval lasting
3617	* up to this dispatch cannot be a valid observation interval
3618	* for computing a new peak rate (similarly to the late-
3619	* completion event in bfq_completed_request()). Go to
3620	* update_rate_and_reset to have the following three steps
3621	* taken:
3622	* - close the observation interval at the last (previous)
3623	* request dispatch or completion
3624	* - compute rate, if possible, for that observation interval
3625	* - start a new observation interval with this dispatch
3626	*/
3627	if (now_ns - bfqd->last_dispatch > `100`*NSEC_PER_MSEC &&
3628	bfqd->tot_rq_in_driver == `0`)
3629	goto update_rate_and_reset;
3630
3631	/ Update sampling information /
3632	bfqd->peak_rate_samples++;
3633
3634	if ((bfqd->tot_rq_in_driver > `0` \|\|
3635	now_ns - bfqd->last_completion < BFQ_MIN_TT)
3636	&& !BFQ_RQ_SEEKY(bfqd, bfqd->last_position, rq))
3637	bfqd->sequential_samples++;
3638
3639	bfqd->tot_sectors_dispatched += blk_rq_sectors(rq);
3640
3641	/ Reset max observed rq size every 32 dispatches /
3642	if (likely(bfqd->peak_rate_samples % `32`))
3643	bfqd->last_rq_max_size =
3644	max_t(u32, blk_rq_sectors(rq), bfqd->last_rq_max_size);
3645	else
3646	bfqd->last_rq_max_size = blk_rq_sectors(rq);
3647
3648	bfqd->delta_from_first = now_ns - bfqd->first_dispatch;
3649
3650	/ Target observation interval not yet reached, go on sampling /
3651	if (bfqd->delta_from_first < BFQ_RATE_REF_INTERVAL)
3652	goto update_last_values;
3653
3654	update_rate_and_reset:
3655	bfq_update_rate_reset(bfqd, rq);
3656	update_last_values:
3657	bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
3658	if (RQ_BFQQ(rq) == bfqd->in_service_queue)
3659	bfqd->in_serv_last_pos = bfqd->last_position;
3660	bfqd->last_dispatch = now_ns;
3661	}
3662
3663	/*
3664	* Remove request from internal lists.
3665	*/
3666	static void bfq_dispatch_remove(struct request_queue q, struct* request *rq)
3667	{
3668	struct bfq_queue *bfqq = RQ_BFQQ(rq);
3669
3670	/*
3671	* For consistency, the next instruction should have been
3672	* executed after removing the request from the queue and
3673	* dispatching it. We execute instead this instruction before
3674	* bfq_remove_request() (and hence introduce a temporary
3675	* inconsistency), for efficiency. In fact, should this
3676	* dispatch occur for a non in-service bfqq, this anticipated
3677	* increment prevents two counters related to bfqq->dispatched
3678	* from risking to be, first, uselessly decremented, and then
3679	* incremented again when the (new) value of bfqq->dispatched
3680	* happens to be taken into account.
3681	*/
3682	bfqq->dispatched++;
3683	bfq_update_peak_rate(bfqd: q->elevator->elevator_data, rq);
3684
3685	bfq_remove_request(q, rq);
3686	}
3687
3688	/*
3689	* There is a case where idling does not have to be performed for
3690	* throughput concerns, but to preserve the throughput share of
3691	* the process associated with bfqq.
3692	*
3693	* To introduce this case, we can note that allowing the drive
3694	* to enqueue more than one request at a time, and hence
3695	* delegating de facto final scheduling decisions to the
3696	* drive's internal scheduler, entails loss of control on the
3697	* actual request service order. In particular, the critical
3698	* situation is when requests from different processes happen
3699	* to be present, at the same time, in the internal queue(s)
3700	* of the drive. In such a situation, the drive, by deciding
3701	* the service order of the internally-queued requests, does
3702	* determine also the actual throughput distribution among
3703	* these processes. But the drive typically has no notion or
3704	* concern about per-process throughput distribution, and
3705	* makes its decisions only on a per-request basis. Therefore,
3706	* the service distribution enforced by the drive's internal
3707	* scheduler is likely to coincide with the desired throughput
3708	* distribution only in a completely symmetric, or favorably
3709	* skewed scenario where:
3710	* (i-a) each of these processes must get the same throughput as
3711	* the others,
3712	* (i-b) in case (i-a) does not hold, it holds that the process
3713	* associated with bfqq must receive a lower or equal
3714	* throughput than any of the other processes;
3715	* (ii) the I/O of each process has the same properties, in
3716	* terms of locality (sequential or random), direction
3717	* (reads or writes), request sizes, greediness
3718	* (from I/O-bound to sporadic), and so on;
3719
3720	* In fact, in such a scenario, the drive tends to treat the requests
3721	* of each process in about the same way as the requests of the
3722	* others, and thus to provide each of these processes with about the
3723	* same throughput. This is exactly the desired throughput
3724	* distribution if (i-a) holds, or, if (i-b) holds instead, this is an
3725	* even more convenient distribution for (the process associated with)
3726	* bfqq.
3727	*
3728	* In contrast, in any asymmetric or unfavorable scenario, device
3729	* idling (I/O-dispatch plugging) is certainly needed to guarantee
3730	* that bfqq receives its assigned fraction of the device throughput
3731	* (see [1] for details).
3732	*
3733	* The problem is that idling may significantly reduce throughput with
3734	* certain combinations of types of I/O and devices. An important
3735	* example is sync random I/O on flash storage with command
3736	* queueing. So, unless bfqq falls in cases where idling also boosts
3737	* throughput, it is important to check conditions (i-a), i(-b) and
3738	* (ii) accurately, so as to avoid idling when not strictly needed for
3739	* service guarantees.
3740	*
3741	* Unfortunately, it is extremely difficult to thoroughly check
3742	* condition (ii). And, in case there are active groups, it becomes
3743	* very difficult to check conditions (i-a) and (i-b) too. In fact,
3744	* if there are active groups, then, for conditions (i-a) or (i-b) to
3745	* become false 'indirectly', it is enough that an active group
3746	* contains more active processes or sub-groups than some other active
3747	* group. More precisely, for conditions (i-a) or (i-b) to become
3748	* false because of such a group, it is not even necessary that the
3749	* group is (still) active: it is sufficient that, even if the group
3750	* has become inactive, some of its descendant processes still have
3751	* some request already dispatched but still waiting for
3752	* completion. In fact, requests have still to be guaranteed their
3753	* share of the throughput even after being dispatched. In this
3754	* respect, it is easy to show that, if a group frequently becomes
3755	* inactive while still having in-flight requests, and if, when this
3756	* happens, the group is not considered in the calculation of whether
3757	* the scenario is asymmetric, then the group may fail to be
3758	* guaranteed its fair share of the throughput (basically because
3759	* idling may not be performed for the descendant processes of the
3760	* group, but it had to be). We address this issue with the following
3761	* bi-modal behavior, implemented in the function
3762	* bfq_asymmetric_scenario().
3763	*
3764	* If there are groups with requests waiting for completion
3765	* (as commented above, some of these groups may even be
3766	* already inactive), then the scenario is tagged as
3767	* asymmetric, conservatively, without checking any of the
3768	* conditions (i-a), (i-b) or (ii). So the device is idled for bfqq.
3769	* This behavior matches also the fact that groups are created
3770	* exactly if controlling I/O is a primary concern (to
3771	* preserve bandwidth and latency guarantees).
3772	*
3773	* On the opposite end, if there are no groups with requests waiting
3774	* for completion, then only conditions (i-a) and (i-b) are actually
3775	* controlled, i.e., provided that conditions (i-a) or (i-b) holds,
3776	* idling is not performed, regardless of whether condition (ii)
3777	* holds. In other words, only if conditions (i-a) and (i-b) do not
3778	* hold, then idling is allowed, and the device tends to be prevented
3779	* from queueing many requests, possibly of several processes. Since
3780	* there are no groups with requests waiting for completion, then, to
3781	* control conditions (i-a) and (i-b) it is enough to check just
3782	* whether all the queues with requests waiting for completion also
3783	* have the same weight.
3784	*
3785	* Not checking condition (ii) evidently exposes bfqq to the
3786	* risk of getting less throughput than its fair share.
3787	* However, for queues with the same weight, a further
3788	* mechanism, preemption, mitigates or even eliminates this
3789	* problem. And it does so without consequences on overall
3790	* throughput. This mechanism and its benefits are explained
3791	* in the next three paragraphs.
3792	*
3793	* Even if a queue, say Q, is expired when it remains idle, Q
3794	* can still preempt the new in-service queue if the next
3795	* request of Q arrives soon (see the comments on
3796	* bfq_bfqq_update_budg_for_activation). If all queues and
3797	* groups have the same weight, this form of preemption,
3798	* combined with the hole-recovery heuristic described in the
3799	* comments on function bfq_bfqq_update_budg_for_activation,
3800	* are enough to preserve a correct bandwidth distribution in
3801	* the mid term, even without idling. In fact, even if not
3802	* idling allows the internal queues of the device to contain
3803	* many requests, and thus to reorder requests, we can rather
3804	* safely assume that the internal scheduler still preserves a
3805	* minimum of mid-term fairness.
3806	*
3807	* More precisely, this preemption-based, idleless approach
3808	* provides fairness in terms of IOPS, and not sectors per
3809	* second. This can be seen with a simple example. Suppose
3810	* that there are two queues with the same weight, but that
3811	* the first queue receives requests of 8 sectors, while the
3812	* second queue receives requests of 1024 sectors. In
3813	* addition, suppose that each of the two queues contains at
3814	* most one request at a time, which implies that each queue
3815	* always remains idle after it is served. Finally, after
3816	* remaining idle, each queue receives very quickly a new
3817	* request. It follows that the two queues are served
3818	* alternatively, preempting each other if needed. This
3819	* implies that, although both queues have the same weight,
3820	* the queue with large requests receives a service that is
3821	* 1024/8 times as high as the service received by the other
3822	* queue.
3823	*
3824	* The motivation for using preemption instead of idling (for
3825	* queues with the same weight) is that, by not idling,
3826	* service guarantees are preserved (completely or at least in
3827	* part) without minimally sacrificing throughput. And, if
3828	* there is no active group, then the primary expectation for
3829	* this device is probably a high throughput.
3830	*
3831	* We are now left only with explaining the two sub-conditions in the
3832	* additional compound condition that is checked below for deciding
3833	* whether the scenario is asymmetric. To explain the first
3834	* sub-condition, we need to add that the function
3835	* bfq_asymmetric_scenario checks the weights of only
3836	* non-weight-raised queues, for efficiency reasons (see comments on
3837	* bfq_weights_tree_add()). Then the fact that bfqq is weight-raised
3838	* is checked explicitly here. More precisely, the compound condition
3839	* below takes into account also the fact that, even if bfqq is being
3840	* weight-raised, the scenario is still symmetric if all queues with
3841	* requests waiting for completion happen to be
3842	* weight-raised. Actually, we should be even more precise here, and
3843	* differentiate between interactive weight raising and soft real-time
3844	* weight raising.
3845	*
3846	* The second sub-condition checked in the compound condition is
3847	* whether there is a fair amount of already in-flight I/O not
3848	* belonging to bfqq. If so, I/O dispatching is to be plugged, for the
3849	* following reason. The drive may decide to serve in-flight
3850	* non-bfqq's I/O requests before bfqq's ones, thereby delaying the
3851	* arrival of new I/O requests for bfqq (recall that bfqq is sync). If
3852	* I/O-dispatching is not plugged, then, while bfqq remains empty, a
3853	* basically uncontrolled amount of I/O from other queues may be
3854	* dispatched too, possibly causing the service of bfqq's I/O to be
3855	* delayed even longer in the drive. This problem gets more and more
3856	* serious as the speed and the queue depth of the drive grow,
3857	* because, as these two quantities grow, the probability to find no
3858	* queue busy but many requests in flight grows too. By contrast,
3859	* plugging I/O dispatching minimizes the delay induced by already
3860	* in-flight I/O, and enables bfqq to recover the bandwidth it may
3861	* lose because of this delay.
3862	*
3863	* As a side note, it is worth considering that the above
3864	* device-idling countermeasures may however fail in the following
3865	* unlucky scenario: if I/O-dispatch plugging is (correctly) disabled
3866	* in a time period during which all symmetry sub-conditions hold, and
3867	* therefore the device is allowed to enqueue many requests, but at
3868	* some later point in time some sub-condition stops to hold, then it
3869	* may become impossible to make requests be served in the desired
3870	* order until all the requests already queued in the device have been
3871	* served. The last sub-condition commented above somewhat mitigates
3872	* this problem for weight-raised queues.
3873	*
3874	* However, as an additional mitigation for this problem, we preserve
3875	* plugging for a special symmetric case that may suddenly turn into
3876	* asymmetric: the case where only bfqq is busy. In this case, not
3877	* expiring bfqq does not cause any harm to any other queues in terms
3878	* of service guarantees. In contrast, it avoids the following unlucky
3879	* sequence of events: (1) bfqq is expired, (2) a new queue with a
3880	* lower weight than bfqq becomes busy (or more queues), (3) the new
3881	* queue is served until a new request arrives for bfqq, (4) when bfqq
3882	* is finally served, there are so many requests of the new queue in
3883	* the drive that the pending requests for bfqq take a lot of time to
3884	* be served. In particular, event (2) may case even already
3885	* dispatched requests of bfqq to be delayed, inside the drive. So, to
3886	* avoid this series of events, the scenario is preventively declared
3887	* as asymmetric also if bfqq is the only busy queues
3888	*/
3889	static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd,
3890	struct bfq_queue *bfqq)
3891	{
3892	int tot_busy_queues = bfq_tot_busy_queues(bfqd);
3893
3894	/ No point in idling for bfqq if it won't get requests any longer /
3895	if (unlikely(!bfqq_process_refs(bfqq)))
3896	return false;
3897
3898	return (bfqq->wr_coeff > `1` &&
3899	(bfqd->wr_busy_queues < tot_busy_queues \|\|
3900	bfqd->tot_rq_in_driver >= bfqq->dispatched + `4`)) \|\|
3901	bfq_asymmetric_scenario(bfqd, bfqq) \|\|
3902	tot_busy_queues == `1`;
3903	}
3904
3905	static bool __bfq_bfqq_expire(struct bfq_data bfqd, struct* bfq_queue *bfqq,
3906	enum bfqq_expiration reason)
3907	{
3908	/*
3909	* If this bfqq is shared between multiple processes, check
3910	* to make sure that those processes are still issuing I/Os
3911	* within the mean seek distance. If not, it may be time to
3912	* break the queues apart again.
3913	*/
3914	if (bfq_bfqq_coop(bfqq) && BFQQ_SEEKY(bfqq))
3915	bfq_mark_bfqq_split_coop(bfqq);
3916
3917	/*
3918	* Consider queues with a higher finish virtual time than
3919	* bfqq. If idling_needed_for_service_guarantees(bfqq) returns
3920	* true, then bfqq's bandwidth would be violated if an
3921	* uncontrolled amount of I/O from these queues were
3922	* dispatched while bfqq is waiting for its new I/O to
3923	* arrive. This is exactly what may happen if this is a forced
3924	* expiration caused by a preemption attempt, and if bfqq is
3925	* not re-scheduled. To prevent this from happening, re-queue
3926	* bfqq if it needs I/O-dispatch plugging, even if it is
3927	* empty. By doing so, bfqq is granted to be served before the
3928	* above queues (provided that bfqq is of course eligible).
3929	*/
3930	if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
3931	!(reason == BFQQE_PREEMPTED &&
3932	idling_needed_for_service_guarantees(bfqd, bfqq))) {
3933	if (bfqq->dispatched == `0`)
3934	/*
3935	* Overloading budget_timeout field to store
3936	* the time at which the queue remains with no
3937	* backlog and no outstanding request; used by
3938	* the weight-raising mechanism.
3939	*/
3940	bfqq->budget_timeout = jiffies;
3941
3942	bfq_del_bfqq_busy(bfqq, expiration: true);
3943	} else {
3944	bfq_requeue_bfqq(bfqd, bfqq, expiration: true);
3945	/*
3946	* Resort priority tree of potential close cooperators.
3947	* See comments on bfq_pos_tree_add_move() for the unlikely().
3948	*/
3949	if (unlikely(!bfqd->nonrot_with_queueing &&
3950	!RB_EMPTY_ROOT(&bfqq->sort_list)))
3951	bfq_pos_tree_add_move(bfqd, bfqq);
3952	}
3953
3954	/*
3955	* All in-service entities must have been properly deactivated
3956	* or requeued before executing the next function, which
3957	* resets all in-service entities as no more in service. This
3958	* may cause bfqq to be freed. If this happens, the next
3959	* function returns true.
3960	*/
3961	return __bfq_bfqd_reset_in_service(bfqd);
3962	}
3963
3964	/**
3965	* __bfq_bfqq_recalc_budget - try to adapt the budget to the @bfqq behavior.
3966	* @bfqd: device data.
3967	* @bfqq: queue to update.
3968	* @reason: reason for expiration.
3969	*
3970	* Handle the feedback on @bfqq budget at queue expiration.
3971	* See the body for detailed comments.
3972	*/
3973	static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
3974	struct bfq_queue *bfqq,
3975	enum bfqq_expiration reason)
3976	{
3977	struct request *next_rq;
3978	int budget, min_budget;
3979
3980	min_budget = bfq_min_budget(bfqd);
3981
3982	if (bfqq->wr_coeff == `1`)
3983	budget = bfqq->max_budget;
3984	else /*
3985	* Use a constant, low budget for weight-raised queues,
3986	* to help achieve a low latency. Keep it slightly higher
3987	* than the minimum possible budget, to cause a little
3988	* bit fewer expirations.
3989	*/
3990	budget = `2` * min_budget;
3991
3992	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %d, budg left %d",
3993	bfqq->entity.budget, bfq_bfqq_budget_left(bfqq));
3994	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %d, min budg %d",
3995	budget, bfq_min_budget(bfqd));
3996	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d",
3997	bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue));
3998
3999	if (bfq_bfqq_sync(bfqq) && bfqq->wr_coeff == `1`) {
4000	switch (reason) {
4001	/*
4002	* Caveat: in all the following cases we trade latency
4003	* for throughput.
4004	*/
4005	case BFQQE_TOO_IDLE:
4006	/*
4007	* This is the only case where we may reduce
4008	* the budget: if there is no request of the
4009	* process still waiting for completion, then
4010	* we assume (tentatively) that the timer has
4011	* expired because the batch of requests of
4012	* the process could have been served with a
4013	* smaller budget. Hence, betting that
4014	* process will behave in the same way when it
4015	* becomes backlogged again, we reduce its
4016	* next budget. As long as we guess right,
4017	* this budget cut reduces the latency
4018	* experienced by the process.
4019	*
4020	* However, if there are still outstanding
4021	* requests, then the process may have not yet
4022	* issued its next request just because it is
4023	* still waiting for the completion of some of
4024	* the still outstanding ones. So in this
4025	* subcase we do not reduce its budget, on the
4026	* contrary we increase it to possibly boost
4027	* the throughput, as discussed in the
4028	* comments to the BUDGET_TIMEOUT case.
4029	*/
4030	if (bfqq->dispatched > `0`) / still outstanding reqs /
4031	budget = min(budget * `2`, bfqd->bfq_max_budget);
4032	else {
4033	if (budget > `5` * min_budget)
4034	budget -= `4` * min_budget;
4035	else
4036	budget = min_budget;
4037	}
4038	break;
4039	case BFQQE_BUDGET_TIMEOUT:
4040	/*
4041	* We double the budget here because it gives
4042	* the chance to boost the throughput if this
4043	* is not a seeky process (and has bumped into
4044	* this timeout because of, e.g., ZBR).
4045	*/
4046	budget = min(budget * `2`, bfqd->bfq_max_budget);
4047	break;
4048	case BFQQE_BUDGET_EXHAUSTED:
4049	/*
4050	* The process still has backlog, and did not
4051	* let either the budget timeout or the disk
4052	* idling timeout expire. Hence it is not
4053	* seeky, has a short thinktime and may be
4054	* happy with a higher budget too. So
4055	* definitely increase the budget of this good
4056	* candidate to boost the disk throughput.
4057	*/
4058	budget = min(budget * `4`, bfqd->bfq_max_budget);
4059	break;
4060	case BFQQE_NO_MORE_REQUESTS:
4061	/*
4062	* For queues that expire for this reason, it
4063	* is particularly important to keep the
4064	* budget close to the actual service they
4065	* need. Doing so reduces the timestamp
4066	* misalignment problem described in the
4067	* comments in the body of
4068	* __bfq_activate_entity. In fact, suppose
4069	* that a queue systematically expires for
4070	* BFQQE_NO_MORE_REQUESTS and presents a
4071	* new request in time to enjoy timestamp
4072	* back-shifting. The larger the budget of the
4073	* queue is with respect to the service the
4074	* queue actually requests in each service
4075	* slot, the more times the queue can be
4076	* reactivated with the same virtual finish
4077	* time. It follows that, even if this finish
4078	* time is pushed to the system virtual time
4079	* to reduce the consequent timestamp
4080	* misalignment, the queue unjustly enjoys for
4081	* many re-activations a lower finish time
4082	* than all newly activated queues.
4083	*
4084	* The service needed by bfqq is measured
4085	* quite precisely by bfqq->entity.service.
4086	* Since bfqq does not enjoy device idling,
4087	* bfqq->entity.service is equal to the number
4088	* of sectors that the process associated with
4089	* bfqq requested to read/write before waiting
4090	* for request completions, or blocking for
4091	* other reasons.
4092	*/
4093	budget = max_t(int, bfqq->entity.service, min_budget);
4094	break;
4095	default:
4096	return;
4097	}
4098	} else if (!bfq_bfqq_sync(bfqq)) {
4099	/*
4100	* Async queues get always the maximum possible
4101	* budget, as for them we do not care about latency
4102	* (in addition, their ability to dispatch is limited
4103	* by the charging factor).
4104	*/
4105	budget = bfqd->bfq_max_budget;
4106	}
4107
4108	bfqq->max_budget = budget;
4109
4110	if (bfqd->budgets_assigned >= bfq_stats_min_budgets &&
4111	!bfqd->bfq_user_max_budget)
4112	bfqq->max_budget = min(bfqq->max_budget, bfqd->bfq_max_budget);
4113
4114	/*
4115	* If there is still backlog, then assign a new budget, making
4116	* sure that it is large enough for the next request. Since
4117	* the finish time of bfqq must be kept in sync with the
4118	* budget, be sure to call __bfq_bfqq_expire() after this
4119	* update.
4120	*
4121	* If there is no backlog, then no need to update the budget;
4122	* it will be updated on the arrival of a new request.
4123	*/
4124	next_rq = bfqq->next_rq;
4125	if (next_rq)
4126	bfqq->entity.budget = max_t(unsigned long, bfqq->max_budget,
4127	bfq_serv_to_charge(next_rq, bfqq));
4128
4129	bfq_log_bfqq(bfqd, bfqq, "head sect: %u, new budget %d",
4130	next_rq ? blk_rq_sectors(next_rq) : `0`,
4131	bfqq->entity.budget);
4132	}
4133
4134	/*
4135	* Return true if the process associated with bfqq is "slow". The slow
4136	* flag is used, in addition to the budget timeout, to reduce the
4137	* amount of service provided to seeky processes, and thus reduce
4138	* their chances to lower the throughput. More details in the comments
4139	* on the function bfq_bfqq_expire().
4140	*
4141	* An important observation is in order: as discussed in the comments
4142	* on the function bfq_update_peak_rate(), with devices with internal
4143	* queues, it is hard if ever possible to know when and for how long
4144	* an I/O request is processed by the device (apart from the trivial
4145	* I/O pattern where a new request is dispatched only after the
4146	* previous one has been completed). This makes it hard to evaluate
4147	* the real rate at which the I/O requests of each bfq_queue are
4148	* served. In fact, for an I/O scheduler like BFQ, serving a
4149	* bfq_queue means just dispatching its requests during its service
4150	* slot (i.e., until the budget of the queue is exhausted, or the
4151	* queue remains idle, or, finally, a timeout fires). But, during the
4152	* service slot of a bfq_queue, around 100 ms at most, the device may
4153	* be even still processing requests of bfq_queues served in previous
4154	* service slots. On the opposite end, the requests of the in-service
4155	* bfq_queue may be completed after the service slot of the queue
4156	* finishes.
4157	*
4158	* Anyway, unless more sophisticated solutions are used
4159	* (where possible), the sum of the sizes of the requests dispatched
4160	* during the service slot of a bfq_queue is probably the only
4161	* approximation available for the service received by the bfq_queue
4162	* during its service slot. And this sum is the quantity used in this
4163	* function to evaluate the I/O speed of a process.
4164	*/
4165	static bool bfq_bfqq_is_slow(struct bfq_data bfqd, struct* bfq_queue *bfqq,
4166	bool compensate, unsigned long *delta_ms)
4167	{
4168	ktime_t delta_ktime;
4169	u32 delta_usecs;
4170	bool slow = BFQQ_SEEKY(bfqq); / if delta too short, use seekyness /
4171
4172	if (!bfq_bfqq_sync(bfqq))
4173	return false;
4174
4175	if (compensate)
4176	delta_ktime = bfqd->last_idling_start;
4177	else
4178	delta_ktime = blk_time_get();
4179	delta_ktime = ktime_sub(delta_ktime, bfqd->last_budget_start);
4180	delta_usecs = ktime_to_us(kt: delta_ktime);
4181
4182	/ don't use too short time intervals /
4183	if (delta_usecs < `1000`) {
4184	if (blk_queue_nonrot(bfqd->queue))
4185	/*
4186	* give same worst-case guarantees as idling
4187	* for seeky
4188	*/
4189	*delta_ms = BFQ_MIN_TT / NSEC_PER_MSEC;
4190	else / charge at least one seek /
4191	*delta_ms = bfq_slice_idle / NSEC_PER_MSEC;
4192
4193	return slow;
4194	}
4195
4196	*delta_ms = delta_usecs / USEC_PER_MSEC;
4197
4198	/*
4199	* Use only long (> 20ms) intervals to filter out excessive
4200	* spikes in service rate estimation.
4201	*/
4202	if (delta_usecs > `20000`) {
4203	/*
4204	* Caveat for rotational devices: processes doing I/O
4205	* in the slower disk zones tend to be slow(er) even
4206	* if not seeky. In this respect, the estimated peak
4207	* rate is likely to be an average over the disk
4208	* surface. Accordingly, to not be too harsh with
4209	* unlucky processes, a process is deemed slow only if
4210	* its rate has been lower than half of the estimated
4211	* peak rate.
4212	*/
4213	slow = bfqq->entity.service < bfqd->bfq_max_budget / `2`;
4214	}
4215
4216	bfq_log_bfqq(bfqd, bfqq, "bfq_bfqq_is_slow: slow %d", slow);
4217
4218	return slow;
4219	}
4220
4221	/*
4222	* To be deemed as soft real-time, an application must meet two
4223	* requirements. First, the application must not require an average
4224	* bandwidth higher than the approximate bandwidth required to playback or
4225	* record a compressed high-definition video.
4226	* The next function is invoked on the completion of the last request of a
4227	* batch, to compute the next-start time instant, soft_rt_next_start, such
4228	* that, if the next request of the application does not arrive before
4229	* soft_rt_next_start, then the above requirement on the bandwidth is met.
4230	*
4231	* The second requirement is that the request pattern of the application is
4232	* isochronous, i.e., that, after issuing a request or a batch of requests,
4233	* the application stops issuing new requests until all its pending requests
4234	* have been completed. After that, the application may issue a new batch,
4235	* and so on.
4236	* For this reason the next function is invoked to compute
4237	* soft_rt_next_start only for applications that meet this requirement,
4238	* whereas soft_rt_next_start is set to infinity for applications that do
4239	* not.
4240	*
4241	* Unfortunately, even a greedy (i.e., I/O-bound) application may
4242	* happen to meet, occasionally or systematically, both the above
4243	* bandwidth and isochrony requirements. This may happen at least in
4244	* the following circumstances. First, if the CPU load is high. The
4245	* application may stop issuing requests while the CPUs are busy
4246	* serving other processes, then restart, then stop again for a while,
4247	* and so on. The other circumstances are related to the storage
4248	* device: the storage device is highly loaded or reaches a low-enough
4249	* throughput with the I/O of the application (e.g., because the I/O
4250	* is random and/or the device is slow). In all these cases, the
4251	* I/O of the application may be simply slowed down enough to meet
4252	* the bandwidth and isochrony requirements. To reduce the probability
4253	* that greedy applications are deemed as soft real-time in these
4254	* corner cases, a further rule is used in the computation of
4255	* soft_rt_next_start: the return value of this function is forced to
4256	* be higher than the maximum between the following two quantities.
4257	*
4258	* (a) Current time plus: (1) the maximum time for which the arrival
4259	* of a request is waited for when a sync queue becomes idle,
4260	* namely bfqd->bfq_slice_idle, and (2) a few extra jiffies. We
4261	* postpone for a moment the reason for adding a few extra
4262	* jiffies; we get back to it after next item (b). Lower-bounding
4263	* the return value of this function with the current time plus
4264	* bfqd->bfq_slice_idle tends to filter out greedy applications,
4265	* because the latter issue their next request as soon as possible
4266	* after the last one has been completed. In contrast, a soft
4267	* real-time application spends some time processing data, after a
4268	* batch of its requests has been completed.
4269	*
4270	* (b) Current value of bfqq->soft_rt_next_start. As pointed out
4271	* above, greedy applications may happen to meet both the
4272	* bandwidth and isochrony requirements under heavy CPU or
4273	* storage-device load. In more detail, in these scenarios, these
4274	* applications happen, only for limited time periods, to do I/O
4275	* slowly enough to meet all the requirements described so far,
4276	* including the filtering in above item (a). These slow-speed
4277	* time intervals are usually interspersed between other time
4278	* intervals during which these applications do I/O at a very high
4279	* speed. Fortunately, exactly because of the high speed of the
4280	* I/O in the high-speed intervals, the values returned by this
4281	* function happen to be so high, near the end of any such
4282	* high-speed interval, to be likely to fall after the end of
4283	* the low-speed time interval that follows. These high values are
4284	* stored in bfqq->soft_rt_next_start after each invocation of
4285	* this function. As a consequence, if the last value of
4286	* bfqq->soft_rt_next_start is constantly used to lower-bound the
4287	* next value that this function may return, then, from the very
4288	* beginning of a low-speed interval, bfqq->soft_rt_next_start is
4289	* likely to be constantly kept so high that any I/O request
4290	* issued during the low-speed interval is considered as arriving
4291	* to soon for the application to be deemed as soft
4292	* real-time. Then, in the high-speed interval that follows, the
4293	* application will not be deemed as soft real-time, just because
4294	* it will do I/O at a high speed. And so on.
4295	*
4296	* Getting back to the filtering in item (a), in the following two
4297	* cases this filtering might be easily passed by a greedy
4298	* application, if the reference quantity was just
4299	* bfqd->bfq_slice_idle:
4300	* 1) HZ is so low that the duration of a jiffy is comparable to or
4301	* higher than bfqd->bfq_slice_idle. This happens, e.g., on slow
4302	* devices with HZ=100. The time granularity may be so coarse
4303	* that the approximation, in jiffies, of bfqd->bfq_slice_idle
4304	* is rather lower than the exact value.
4305	* 2) jiffies, instead of increasing at a constant rate, may stop increasing
4306	* for a while, then suddenly 'jump' by several units to recover the lost
4307	* increments. This seems to happen, e.g., inside virtual machines.
4308	* To address this issue, in the filtering in (a) we do not use as a
4309	* reference time interval just bfqd->bfq_slice_idle, but
4310	* bfqd->bfq_slice_idle plus a few jiffies. In particular, we add the
4311	* minimum number of jiffies for which the filter seems to be quite
4312	* precise also in embedded systems and KVM/QEMU virtual machines.
4313	*/
4314	static unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd,
4315	struct bfq_queue *bfqq)
4316	{
4317	return max3(bfqq->soft_rt_next_start,
4318	bfqq->last_idle_bklogged +
4319	HZ * bfqq->service_from_backlogged /
4320	bfqd->bfq_wr_max_softrt_rate,
4321	jiffies + nsecs_to_jiffies(bfqq->bfqd->bfq_slice_idle) + `4`);
4322	}
4323
4324	/**
4325	* bfq_bfqq_expire - expire a queue.
4326	* @bfqd: device owning the queue.
4327	* @bfqq: the queue to expire.
4328	* @compensate: if true, compensate for the time spent idling.
4329	* @reason: the reason causing the expiration.
4330	*
4331	* If the process associated with bfqq does slow I/O (e.g., because it
4332	* issues random requests), we charge bfqq with the time it has been
4333	* in service instead of the service it has received (see
4334	* bfq_bfqq_charge_time for details on how this goal is achieved). As
4335	* a consequence, bfqq will typically get higher timestamps upon
4336	* reactivation, and hence it will be rescheduled as if it had
4337	* received more service than what it has actually received. In the
4338	* end, bfqq receives less service in proportion to how slowly its
4339	* associated process consumes its budgets (and hence how seriously it
4340	* tends to lower the throughput). In addition, this time-charging
4341	* strategy guarantees time fairness among slow processes. In
4342	* contrast, if the process associated with bfqq is not slow, we
4343	* charge bfqq exactly with the service it has received.
4344	*
4345	* Charging time to the first type of queues and the exact service to
4346	* the other has the effect of using the WF2Q+ policy to schedule the
4347	* former on a timeslice basis, without violating service domain
4348	* guarantees among the latter.
4349	*/
4350	void bfq_bfqq_expire(struct bfq_data *bfqd,
4351	struct bfq_queue *bfqq,
4352	bool compensate,
4353	enum bfqq_expiration reason)
4354	{
4355	bool slow;
4356	unsigned long delta = `0`;
4357	struct bfq_entity *entity = &bfqq->entity;
4358
4359	/*
4360	* Check whether the process is slow (see bfq_bfqq_is_slow).
4361	*/
4362	slow = bfq_bfqq_is_slow(bfqd, bfqq, compensate, delta_ms: &delta);
4363
4364	/*
4365	* As above explained, charge slow (typically seeky) and
4366	* timed-out queues with the time and not the service
4367	* received, to favor sequential workloads.
4368	*
4369	* Processes doing I/O in the slower disk zones will tend to
4370	* be slow(er) even if not seeky. Therefore, since the
4371	* estimated peak rate is actually an average over the disk
4372	* surface, these processes may timeout just for bad luck. To
4373	* avoid punishing them, do not charge time to processes that
4374	* succeeded in consuming at least 2/3 of their budget. This
4375	* allows BFQ to preserve enough elasticity to still perform
4376	* bandwidth, and not time, distribution with little unlucky
4377	* or quasi-sequential processes.
4378	*/
4379	if (bfqq->wr_coeff == `1` &&
4380	(slow \|\|
4381	(reason == BFQQE_BUDGET_TIMEOUT &&
4382	bfq_bfqq_budget_left(bfqq) >= entity->budget / `3`)))
4383	bfq_bfqq_charge_time(bfqd, bfqq, time_ms: delta);
4384
4385	if (bfqd->low_latency && bfqq->wr_coeff == `1`)
4386	bfqq->last_wr_start_finish = jiffies;
4387
4388	if (bfqd->low_latency && bfqd->bfq_wr_max_softrt_rate > `0` &&
4389	RB_EMPTY_ROOT(&bfqq->sort_list)) {
4390	/*
4391	* If we get here, and there are no outstanding
4392	* requests, then the request pattern is isochronous
4393	* (see the comments on the function
4394	* bfq_bfqq_softrt_next_start()). Therefore we can
4395	* compute soft_rt_next_start.
4396	*
4397	* If, instead, the queue still has outstanding
4398	* requests, then we have to wait for the completion
4399	* of all the outstanding requests to discover whether
4400	* the request pattern is actually isochronous.
4401	*/
4402	if (bfqq->dispatched == `0`)
4403	bfqq->soft_rt_next_start =
4404	bfq_bfqq_softrt_next_start(bfqd, bfqq);
4405	else if (bfqq->dispatched > `0`) {
4406	/*
4407	* Schedule an update of soft_rt_next_start to when
4408	* the task may be discovered to be isochronous.
4409	*/
4410	bfq_mark_bfqq_softrt_update(bfqq);
4411	}
4412	}
4413
4414	bfq_log_bfqq(bfqd, bfqq,
4415	"expire (%d, slow %d, num_disp %d, short_ttime %d)", reason,
4416	slow, bfqq->dispatched, bfq_bfqq_has_short_ttime(bfqq));
4417
4418	/*
4419	* bfqq expired, so no total service time needs to be computed
4420	* any longer: reset state machine for measuring total service
4421	* times.
4422	*/
4423	bfqd->rqs_injected = bfqd->wait_dispatch = false;
4424	bfqd->waited_rq = NULL;
4425
4426	/*
4427	* Increase, decrease or leave budget unchanged according to
4428	* reason.
4429	*/
4430	__bfq_bfqq_recalc_budget(bfqd, bfqq, reason);
4431	if (__bfq_bfqq_expire(bfqd, bfqq, reason))
4432	/ bfqq is gone, no more actions on it /
4433	return;
4434
4435	/ mark bfqq as waiting a request only if a bic still points to it /
4436	if (!bfq_bfqq_busy(bfqq) &&
4437	reason != BFQQE_BUDGET_TIMEOUT &&
4438	reason != BFQQE_BUDGET_EXHAUSTED) {
4439	bfq_mark_bfqq_non_blocking_wait_rq(bfqq);
4440	/*
4441	* Not setting service to 0, because, if the next rq
4442	* arrives in time, the queue will go on receiving
4443	* service with this same budget (as if it never expired)
4444	*/
4445	} else
4446	entity->service = `0`;
4447
4448	/*
4449	* Reset the received-service counter for every parent entity.
4450	* Differently from what happens with bfqq->entity.service,
4451	* the resetting of this counter never needs to be postponed
4452	* for parent entities. In fact, in case bfqq may have a
4453	* chance to go on being served using the last, partially
4454	* consumed budget, bfqq->entity.service needs to be kept,
4455	* because if bfqq then actually goes on being served using
4456	* the same budget, the last value of bfqq->entity.service is
4457	* needed to properly decrement bfqq->entity.budget by the
4458	* portion already consumed. In contrast, it is not necessary
4459	* to keep entity->service for parent entities too, because
4460	* the bubble up of the new value of bfqq->entity.budget will
4461	* make sure that the budgets of parent entities are correct,
4462	* even in case bfqq and thus parent entities go on receiving
4463	* service with the same budget.
4464	*/
4465	entity = entity->parent;
4466	for_each_entity(entity)
4467	entity->service = `0`;
4468	}
4469
4470	/*
4471	* Budget timeout is not implemented through a dedicated timer, but
4472	* just checked on request arrivals and completions, as well as on
4473	* idle timer expirations.
4474	*/
4475	static bool bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
4476	{
4477	return time_is_before_eq_jiffies(bfqq->budget_timeout);
4478	}
4479
4480	/*
4481	* If we expire a queue that is actively waiting (i.e., with the
4482	* device idled) for the arrival of a new request, then we may incur
4483	* the timestamp misalignment problem described in the body of the
4484	* function __bfq_activate_entity. Hence we return true only if this
4485	* condition does not hold, or if the queue is slow enough to deserve
4486	* only to be kicked off for preserving a high throughput.
4487	*/
4488	static bool bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
4489	{
4490	bfq_log_bfqq(bfqq->bfqd, bfqq,
4491	"may_budget_timeout: wait_request %d left %d timeout %d",
4492	bfq_bfqq_wait_request(bfqq),
4493	bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / `3`,
4494	bfq_bfqq_budget_timeout(bfqq));
4495
4496	return (!bfq_bfqq_wait_request(bfqq) \|\|
4497	bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / `3`)
4498	&&
4499	bfq_bfqq_budget_timeout(bfqq);
4500	}
4501
4502	static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
4503	struct bfq_queue *bfqq)
4504	{
4505	bool rot_without_queueing =
4506	!blk_queue_nonrot(bfqd->queue) && !bfqd->hw_tag,
4507	bfqq_sequential_and_IO_bound,
4508	idling_boosts_thr;
4509
4510	/ No point in idling for bfqq if it won't get requests any longer /
4511	if (unlikely(!bfqq_process_refs(bfqq)))
4512	return false;
4513
4514	bfqq_sequential_and_IO_bound = !BFQQ_SEEKY(bfqq) &&
4515	bfq_bfqq_IO_bound(bfqq) && bfq_bfqq_has_short_ttime(bfqq);
4516
4517	/*
4518	* The next variable takes into account the cases where idling
4519	* boosts the throughput.
4520	*
4521	* The value of the variable is computed considering, first, that
4522	* idling is virtually always beneficial for the throughput if:
4523	* (a) the device is not NCQ-capable and rotational, or
4524	* (b) regardless of the presence of NCQ, the device is rotational and
4525	* the request pattern for bfqq is I/O-bound and sequential, or
4526	* (c) regardless of whether it is rotational, the device is
4527	* not NCQ-capable and the request pattern for bfqq is
4528	* I/O-bound and sequential.
4529	*
4530	* Secondly, and in contrast to the above item (b), idling an
4531	* NCQ-capable flash-based device would not boost the
4532	* throughput even with sequential I/O; rather it would lower
4533	* the throughput in proportion to how fast the device
4534	* is. Accordingly, the next variable is true if any of the
4535	* above conditions (a), (b) or (c) is true, and, in
4536	* particular, happens to be false if bfqd is an NCQ-capable
4537	* flash-based device.
4538	*/
4539	idling_boosts_thr = rot_without_queueing \|\|
4540	((!blk_queue_nonrot(bfqd->queue) \|\| !bfqd->hw_tag) &&
4541	bfqq_sequential_and_IO_bound);
4542
4543	/*
4544	* The return value of this function is equal to that of
4545	* idling_boosts_thr, unless a special case holds. In this
4546	* special case, described below, idling may cause problems to
4547	* weight-raised queues.
4548	*
4549	* When the request pool is saturated (e.g., in the presence
4550	* of write hogs), if the processes associated with
4551	* non-weight-raised queues ask for requests at a lower rate,
4552	* then processes associated with weight-raised queues have a
4553	* higher probability to get a request from the pool
4554	* immediately (or at least soon) when they need one. Thus
4555	* they have a higher probability to actually get a fraction
4556	* of the device throughput proportional to their high
4557	* weight. This is especially true with NCQ-capable drives,
4558	* which enqueue several requests in advance, and further
4559	* reorder internally-queued requests.
4560	*
4561	* For this reason, we force to false the return value if
4562	* there are weight-raised busy queues. In this case, and if
4563	* bfqq is not weight-raised, this guarantees that the device
4564	* is not idled for bfqq (if, instead, bfqq is weight-raised,
4565	* then idling will be guaranteed by another variable, see
4566	* below). Combined with the timestamping rules of BFQ (see
4567	* [1] for details), this behavior causes bfqq, and hence any
4568	* sync non-weight-raised queue, to get a lower number of
4569	* requests served, and thus to ask for a lower number of
4570	* requests from the request pool, before the busy
4571	* weight-raised queues get served again. This often mitigates
4572	* starvation problems in the presence of heavy write
4573	* workloads and NCQ, thereby guaranteeing a higher
4574	* application and system responsiveness in these hostile
4575	* scenarios.
4576	*/
4577	return idling_boosts_thr &&
4578	bfqd->wr_busy_queues == `0`;
4579	}
4580
4581	/*
4582	* For a queue that becomes empty, device idling is allowed only if
4583	* this function returns true for that queue. As a consequence, since
4584	* device idling plays a critical role for both throughput boosting
4585	* and service guarantees, the return value of this function plays a
4586	* critical role as well.
4587	*
4588	* In a nutshell, this function returns true only if idling is
4589	* beneficial for throughput or, even if detrimental for throughput,
4590	* idling is however necessary to preserve service guarantees (low
4591	* latency, desired throughput distribution, ...). In particular, on
4592	* NCQ-capable devices, this function tries to return false, so as to
4593	* help keep the drives' internal queues full, whenever this helps the
4594	* device boost the throughput without causing any service-guarantee
4595	* issue.
4596	*
4597	* Most of the issues taken into account to get the return value of
4598	* this function are not trivial. We discuss these issues in the two
4599	* functions providing the main pieces of information needed by this
4600	* function.
4601	*/
4602	static bool bfq_better_to_idle(struct bfq_queue *bfqq)
4603	{
4604	struct bfq_data *bfqd = bfqq->bfqd;
4605	bool idling_boosts_thr_with_no_issue, idling_needed_for_service_guar;
4606
4607	/ No point in idling for bfqq if it won't get requests any longer /
4608	if (unlikely(!bfqq_process_refs(bfqq)))
4609	return false;
4610
4611	if (unlikely(bfqd->strict_guarantees))
4612	return true;
4613
4614	/*
4615	* Idling is performed only if slice_idle > 0. In addition, we
4616	* do not idle if
4617	* (a) bfqq is async
4618	* (b) bfqq is in the idle io prio class: in this case we do
4619	* not idle because we want to minimize the bandwidth that
4620	* queues in this class can steal to higher-priority queues
4621	*/
4622	if (bfqd->bfq_slice_idle == `0` \|\| !bfq_bfqq_sync(bfqq) \|\|
4623	bfq_class_idle(bfqq))
4624	return false;
4625
4626	idling_boosts_thr_with_no_issue =
4627	idling_boosts_thr_without_issues(bfqd, bfqq);
4628
4629	idling_needed_for_service_guar =
4630	idling_needed_for_service_guarantees(bfqd, bfqq);
4631
4632	/*
4633	* We have now the two components we need to compute the
4634	* return value of the function, which is true only if idling
4635	* either boosts the throughput (without issues), or is
4636	* necessary to preserve service guarantees.
4637	*/
4638	return idling_boosts_thr_with_no_issue \|\|
4639	idling_needed_for_service_guar;
4640	}
4641
4642	/*
4643	* If the in-service queue is empty but the function bfq_better_to_idle
4644	* returns true, then:
4645	* 1) the queue must remain in service and cannot be expired, and
4646	* 2) the device must be idled to wait for the possible arrival of a new
4647	* request for the queue.
4648	* See the comments on the function bfq_better_to_idle for the reasons
4649	* why performing device idling is the best choice to boost the throughput
4650	* and preserve service guarantees when bfq_better_to_idle itself
4651	* returns true.
4652	*/
4653	static bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
4654	{
4655	return RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_better_to_idle(bfqq);
4656	}
4657
4658	/*
4659	* This function chooses the queue from which to pick the next extra
4660	* I/O request to inject, if it finds a compatible queue. See the
4661	* comments on bfq_update_inject_limit() for details on the injection
4662	* mechanism, and for the definitions of the quantities mentioned
4663	* below.
4664	*/
4665	static struct bfq_queue *
4666	bfq_choose_bfqq_for_injection(struct bfq_data *bfqd)
4667	{
4668	struct bfq_queue bfqq, in_serv_bfqq = bfqd->in_service_queue;
4669	unsigned int limit = in_serv_bfqq->inject_limit;
4670	int i;
4671
4672	/*
4673	* If
4674	* - bfqq is not weight-raised and therefore does not carry
4675	* time-critical I/O,
4676	* or
4677	* - regardless of whether bfqq is weight-raised, bfqq has
4678	* however a long think time, during which it can absorb the
4679	* effect of an appropriate number of extra I/O requests
4680	* from other queues (see bfq_update_inject_limit for
4681	* details on the computation of this number);
4682	* then injection can be performed without restrictions.
4683	*/
4684	bool in_serv_always_inject = in_serv_bfqq->wr_coeff == `1` \|\|
4685	!bfq_bfqq_has_short_ttime(bfqq: in_serv_bfqq);
4686
4687	/*
4688	* If
4689	* - the baseline total service time could not be sampled yet,
4690	* so the inject limit happens to be still 0, and
4691	* - a lot of time has elapsed since the plugging of I/O
4692	* dispatching started, so drive speed is being wasted
4693	* significantly;
4694	* then temporarily raise inject limit to one request.
4695	*/
4696	if (limit == `0` && in_serv_bfqq->last_serv_time_ns == `0` &&
4697	bfq_bfqq_wait_request(bfqq: in_serv_bfqq) &&
4698	time_is_before_eq_jiffies(bfqd->last_idling_start_jiffies +
4699	bfqd->bfq_slice_idle)
4700	)
4701	limit = `1`;
4702
4703	if (bfqd->tot_rq_in_driver >= limit)
4704	return NULL;
4705
4706	/*
4707	* Linear search of the source queue for injection; but, with
4708	* a high probability, very few steps are needed to find a
4709	* candidate queue, i.e., a queue with enough budget left for
4710	* its next request. In fact:
4711	* - BFQ dynamically updates the budget of every queue so as
4712	* to accommodate the expected backlog of the queue;
4713	* - if a queue gets all its requests dispatched as injected
4714	* service, then the queue is removed from the active list
4715	* (and re-added only if it gets new requests, but then it
4716	* is assigned again enough budget for its new backlog).
4717	*/
4718	for (i = `0`; i < bfqd->num_actuators; i++) {
4719	list_for_each_entry(bfqq, &bfqd->active_list[i], bfqq_list)
4720	if (!RB_EMPTY_ROOT(&bfqq->sort_list) &&
4721	(in_serv_always_inject \|\| bfqq->wr_coeff > `1`) &&
4722	bfq_serv_to_charge(rq: bfqq->next_rq, bfqq) <=
4723	bfq_bfqq_budget_left(bfqq)) {
4724	/*
4725	* Allow for only one large in-flight request
4726	* on non-rotational devices, for the
4727	* following reason. On non-rotationl drives,
4728	* large requests take much longer than
4729	* smaller requests to be served. In addition,
4730	* the drive prefers to serve large requests
4731	* w.r.t. to small ones, if it can choose. So,
4732	* having more than one large requests queued
4733	* in the drive may easily make the next first
4734	* request of the in-service queue wait for so
4735	* long to break bfqq's service guarantees. On
4736	* the bright side, large requests let the
4737	* drive reach a very high throughput, even if
4738	* there is only one in-flight large request
4739	* at a time.
4740	*/
4741	if (blk_queue_nonrot(bfqd->queue) &&
4742	blk_rq_sectors(rq: bfqq->next_rq) >=
4743	BFQQ_SECT_THR_NONROT &&
4744	bfqd->tot_rq_in_driver >= `1`)
4745	continue;
4746	else {
4747	bfqd->rqs_injected = true;
4748	return bfqq;
4749	}
4750	}
4751	}
4752
4753	return NULL;
4754	}
4755
4756	static struct bfq_queue *
4757	bfq_find_active_bfqq_for_actuator(struct bfq_data bfqd, int* idx)
4758	{
4759	struct bfq_queue *bfqq;
4760
4761	if (bfqd->in_service_queue &&
4762	bfqd->in_service_queue->actuator_idx == idx)
4763	return bfqd->in_service_queue;
4764
4765	list_for_each_entry(bfqq, &bfqd->active_list[idx], bfqq_list) {
4766	if (!RB_EMPTY_ROOT(&bfqq->sort_list) &&
4767	bfq_serv_to_charge(rq: bfqq->next_rq, bfqq) <=
4768	bfq_bfqq_budget_left(bfqq)) {
4769	return bfqq;
4770	}
4771	}
4772
4773	return NULL;
4774	}
4775
4776	/*
4777	* Perform a linear scan of each actuator, until an actuator is found
4778	* for which the following three conditions hold: the load of the
4779	* actuator is below the threshold (see comments on
4780	* actuator_load_threshold for details) and lower than that of the
4781	* next actuator (comments on this extra condition below), and there
4782	* is a queue that contains I/O for that actuator. On success, return
4783	* that queue.
4784	*
4785	* Performing a plain linear scan entails a prioritization among
4786	* actuators. The extra condition above breaks this prioritization and
4787	* tends to distribute injection uniformly across actuators.
4788	*/
4789	static struct bfq_queue *
4790	bfq_find_bfqq_for_underused_actuator(struct bfq_data *bfqd)
4791	{
4792	int i;
4793
4794	for (i = `0` ; i < bfqd->num_actuators; i++) {
4795	if (bfqd->rq_in_driver[i] < bfqd->actuator_load_threshold &&
4796	(i == bfqd->num_actuators - `1` \|\|
4797	bfqd->rq_in_driver[i] < bfqd->rq_in_driver[i+`1`])) {
4798	struct bfq_queue *bfqq =
4799	bfq_find_active_bfqq_for_actuator(bfqd, idx: i);
4800
4801	if (bfqq)
4802	return bfqq;
4803	}
4804	}
4805
4806	return NULL;
4807	}
4808
4809
4810	/*
4811	* Select a queue for service. If we have a current queue in service,
4812	* check whether to continue servicing it, or retrieve and set a new one.
4813	*/
4814	static struct bfq_queue bfq_select_queue(struct* bfq_data *bfqd)
4815	{
4816	struct bfq_queue bfqq, inject_bfqq;
4817	struct request *next_rq;
4818	enum bfqq_expiration reason = BFQQE_BUDGET_TIMEOUT;
4819
4820	bfqq = bfqd->in_service_queue;
4821	if (!bfqq)
4822	goto new_queue;
4823
4824	bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue");
4825
4826	/*
4827	* Do not expire bfqq for budget timeout if bfqq may be about
4828	* to enjoy device idling. The reason why, in this case, we
4829	* prevent bfqq from expiring is the same as in the comments
4830	* on the case where bfq_bfqq_must_idle() returns true, in
4831	* bfq_completed_request().
4832	*/
4833	if (bfq_may_expire_for_budg_timeout(bfqq) &&
4834	!bfq_bfqq_must_idle(bfqq))
4835	goto expire;
4836
4837	check_queue:
4838	/*
4839	* If some actuator is underutilized, but the in-service
4840	* queue does not contain I/O for that actuator, then try to
4841	* inject I/O for that actuator.
4842	*/
4843	inject_bfqq = bfq_find_bfqq_for_underused_actuator(bfqd);
4844	if (inject_bfqq && inject_bfqq != bfqq)
4845	return inject_bfqq;
4846
4847	/*
4848	* This loop is rarely executed more than once. Even when it
4849	* happens, it is much more convenient to re-execute this loop
4850	* than to return NULL and trigger a new dispatch to get a
4851	* request served.
4852	*/
4853	next_rq = bfqq->next_rq;
4854	/*
4855	* If bfqq has requests queued and it has enough budget left to
4856	* serve them, keep the queue, otherwise expire it.
4857	*/
4858	if (next_rq) {
4859	if (bfq_serv_to_charge(rq: next_rq, bfqq) >
4860	bfq_bfqq_budget_left(bfqq)) {
4861	/*
4862	* Expire the queue for budget exhaustion,
4863	* which makes sure that the next budget is
4864	* enough to serve the next request, even if
4865	* it comes from the fifo expired path.
4866	*/
4867	reason = BFQQE_BUDGET_EXHAUSTED;
4868	goto expire;
4869	} else {
4870	/*
4871	* The idle timer may be pending because we may
4872	* not disable disk idling even when a new request
4873	* arrives.
4874	*/
4875	if (bfq_bfqq_wait_request(bfqq)) {
4876	/*
4877	* If we get here: 1) at least a new request
4878	* has arrived but we have not disabled the
4879	* timer because the request was too small,
4880	* 2) then the block layer has unplugged
4881	* the device, causing the dispatch to be
4882	* invoked.
4883	*
4884	* Since the device is unplugged, now the
4885	* requests are probably large enough to
4886	* provide a reasonable throughput.
4887	* So we disable idling.
4888	*/
4889	bfq_clear_bfqq_wait_request(bfqq);
4890	hrtimer_try_to_cancel(timer: &bfqd->idle_slice_timer);
4891	}
4892	goto keep_queue;
4893	}
4894	}
4895
4896	/*
4897	* No requests pending. However, if the in-service queue is idling
4898	* for a new request, or has requests waiting for a completion and
4899	* may idle after their completion, then keep it anyway.
4900	*
4901	* Yet, inject service from other queues if it boosts
4902	* throughput and is possible.
4903	*/
4904	if (bfq_bfqq_wait_request(bfqq) \|\|
4905	(bfqq->dispatched != `0` && bfq_better_to_idle(bfqq))) {
4906	unsigned int act_idx = bfqq->actuator_idx;
4907	struct bfq_queue *async_bfqq = NULL;
4908	struct bfq_queue *blocked_bfqq =
4909	!hlist_empty(h: &bfqq->woken_list) ?
4910	container_of(bfqq->woken_list.first,
4911	struct bfq_queue,
4912	woken_list_node)
4913	: NULL;
4914
4915	if (bfqq->bic && bfqq->bic->bfqq[`0`][act_idx] &&
4916	bfq_bfqq_busy(bfqq: bfqq->bic->bfqq[`0`][act_idx]) &&
4917	bfqq->bic->bfqq[`0`][act_idx]->next_rq)
4918	async_bfqq = bfqq->bic->bfqq[`0`][act_idx];
4919	/*
4920	* The next four mutually-exclusive ifs decide
4921	* whether to try injection, and choose the queue to
4922	* pick an I/O request from.
4923	*
4924	* The first if checks whether the process associated
4925	* with bfqq has also async I/O pending. If so, it
4926	* injects such I/O unconditionally. Injecting async
4927	* I/O from the same process can cause no harm to the
4928	* process. On the contrary, it can only increase
4929	* bandwidth and reduce latency for the process.
4930	*
4931	* The second if checks whether there happens to be a
4932	* non-empty waker queue for bfqq, i.e., a queue whose
4933	* I/O needs to be completed for bfqq to receive new
4934	* I/O. This happens, e.g., if bfqq is associated with
4935	* a process that does some sync. A sync generates
4936	* extra blocking I/O, which must be completed before
4937	* the process associated with bfqq can go on with its
4938	* I/O. If the I/O of the waker queue is not served,
4939	* then bfqq remains empty, and no I/O is dispatched,
4940	* until the idle timeout fires for bfqq. This is
4941	* likely to result in lower bandwidth and higher
4942	* latencies for bfqq, and in a severe loss of total
4943	* throughput. The best action to take is therefore to
4944	* serve the waker queue as soon as possible. So do it
4945	* (without relying on the third alternative below for
4946	* eventually serving waker_bfqq's I/O; see the last
4947	* paragraph for further details). This systematic
4948	* injection of I/O from the waker queue does not
4949	* cause any delay to bfqq's I/O. On the contrary,
4950	* next bfqq's I/O is brought forward dramatically,
4951	* for it is not blocked for milliseconds.
4952	*
4953	* The third if checks whether there is a queue woken
4954	* by bfqq, and currently with pending I/O. Such a
4955	* woken queue does not steal bandwidth from bfqq,
4956	* because it remains soon without I/O if bfqq is not
4957	* served. So there is virtually no risk of loss of
4958	* bandwidth for bfqq if this woken queue has I/O
4959	* dispatched while bfqq is waiting for new I/O.
4960	*
4961	* The fourth if checks whether bfqq is a queue for
4962	* which it is better to avoid injection. It is so if
4963	* bfqq delivers more throughput when served without
4964	* any further I/O from other queues in the middle, or
4965	* if the service times of bfqq's I/O requests both
4966	* count more than overall throughput, and may be
4967	* easily increased by injection (this happens if bfqq
4968	* has a short think time). If none of these
4969	* conditions holds, then a candidate queue for
4970	* injection is looked for through
4971	* bfq_choose_bfqq_for_injection(). Note that the
4972	* latter may return NULL (for example if the inject
4973	* limit for bfqq is currently 0).
4974	*
4975	* NOTE: motivation for the second alternative
4976	*
4977	* Thanks to the way the inject limit is updated in
4978	* bfq_update_has_short_ttime(), it is rather likely
4979	* that, if I/O is being plugged for bfqq and the
4980	* waker queue has pending I/O requests that are
4981	* blocking bfqq's I/O, then the fourth alternative
4982	* above lets the waker queue get served before the
4983	* I/O-plugging timeout fires. So one may deem the
4984	* second alternative superfluous. It is not, because
4985	* the fourth alternative may be way less effective in
4986	* case of a synchronization. For two main
4987	* reasons. First, throughput may be low because the
4988	* inject limit may be too low to guarantee the same
4989	* amount of injected I/O, from the waker queue or
4990	* other queues, that the second alternative
4991	* guarantees (the second alternative unconditionally
4992	* injects a pending I/O request of the waker queue
4993	* for each bfq_dispatch_request()). Second, with the
4994	* fourth alternative, the duration of the plugging,
4995	* i.e., the time before bfqq finally receives new I/O,
4996	* may not be minimized, because the waker queue may
4997	* happen to be served only after other queues.
4998	*/
4999	if (async_bfqq &&
5000	icq_to_bic(icq: async_bfqq->next_rq->elv.icq) == bfqq->bic &&
5001	bfq_serv_to_charge(rq: async_bfqq->next_rq, bfqq: async_bfqq) <=
5002	bfq_bfqq_budget_left(bfqq: async_bfqq))
5003	bfqq = async_bfqq;
5004	else if (bfqq->waker_bfqq &&
5005	bfq_bfqq_busy(bfqq: bfqq->waker_bfqq) &&
5006	bfqq->waker_bfqq->next_rq &&
5007	bfq_serv_to_charge(rq: bfqq->waker_bfqq->next_rq,
5008	bfqq: bfqq->waker_bfqq) <=
5009	bfq_bfqq_budget_left(bfqq: bfqq->waker_bfqq)
5010	)
5011	bfqq = bfqq->waker_bfqq;
5012	else if (blocked_bfqq &&
5013	bfq_bfqq_busy(bfqq: blocked_bfqq) &&
5014	blocked_bfqq->next_rq &&
5015	bfq_serv_to_charge(rq: blocked_bfqq->next_rq,
5016	bfqq: blocked_bfqq) <=
5017	bfq_bfqq_budget_left(bfqq: blocked_bfqq)
5018	)
5019	bfqq = blocked_bfqq;
5020	else if (!idling_boosts_thr_without_issues(bfqd, bfqq) &&
5021	(bfqq->wr_coeff == `1` \|\| bfqd->wr_busy_queues > `1` \|\|
5022	!bfq_bfqq_has_short_ttime(bfqq)))
5023	bfqq = bfq_choose_bfqq_for_injection(bfqd);
5024	else
5025	bfqq = NULL;
5026
5027	goto keep_queue;
5028	}
5029
5030	reason = BFQQE_NO_MORE_REQUESTS;
5031	expire:
5032	bfq_bfqq_expire(bfqd, bfqq, compensate: false, reason);
5033	new_queue:
5034	bfqq = bfq_set_in_service_queue(bfqd);
5035	if (bfqq) {
5036	bfq_log_bfqq(bfqd, bfqq, "select_queue: checking new queue");
5037	goto check_queue;
5038	}
5039	keep_queue:
5040	if (bfqq)
5041	bfq_log_bfqq(bfqd, bfqq, "select_queue: returned this queue");
5042	else
5043	bfq_log(bfqd, "select_queue: no queue returned");
5044
5045	return bfqq;
5046	}
5047
5048	static void bfq_update_wr_data(struct bfq_data bfqd, struct* bfq_queue *bfqq)
5049	{
5050	struct bfq_entity *entity = &bfqq->entity;
5051
5052	if (bfqq->wr_coeff > `1`) { / queue is being weight-raised /
5053	bfq_log_bfqq(bfqd, bfqq,
5054	"raising period dur %u/%u msec, old coeff %u, w %d(%d)",
5055	jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
5056	jiffies_to_msecs(bfqq->wr_cur_max_time),
5057	bfqq->wr_coeff,
5058	bfqq->entity.weight, bfqq->entity.orig_weight);
5059
5060	if (entity->prio_changed)
5061	bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change");
5062
5063	/*
5064	* If the queue was activated in a burst, or too much
5065	* time has elapsed from the beginning of this
5066	* weight-raising period, then end weight raising.
5067	*/
5068	if (bfq_bfqq_in_large_burst(bfqq))
5069	bfq_bfqq_end_wr(bfqq);
5070	else if (time_is_before_jiffies(bfqq->last_wr_start_finish +
5071	bfqq->wr_cur_max_time)) {
5072	if (bfqq->wr_cur_max_time != bfqd->bfq_wr_rt_max_time \|\|
5073	time_is_before_jiffies(bfqq->wr_start_at_switch_to_srt +
5074	bfq_wr_duration(bfqd))) {
5075	/*
5076	* Either in interactive weight
5077	* raising, or in soft_rt weight
5078	* raising with the
5079	* interactive-weight-raising period
5080	* elapsed (so no switch back to
5081	* interactive weight raising).
5082	*/
5083	bfq_bfqq_end_wr(bfqq);
5084	} else { /*
5085	* soft_rt finishing while still in
5086	* interactive period, switch back to
5087	* interactive weight raising
5088	*/
5089	switch_back_to_interactive_wr(bfqq, bfqd);
5090	bfqq->entity.prio_changed = `1`;
5091	}
5092	}
5093	if (bfqq->wr_coeff > `1` &&
5094	bfqq->wr_cur_max_time != bfqd->bfq_wr_rt_max_time &&
5095	bfqq->service_from_wr > max_service_from_wr) {
5096	/ see comments on max_service_from_wr /
5097	bfq_bfqq_end_wr(bfqq);
5098	}
5099	}
5100	/*
5101	* To improve latency (for this or other queues), immediately
5102	* update weight both if it must be raised and if it must be
5103	* lowered. Since, entity may be on some active tree here, and
5104	* might have a pending change of its ioprio class, invoke
5105	* next function with the last parameter unset (see the
5106	* comments on the function).
5107	*/
5108	if ((entity->weight > entity->orig_weight) != (bfqq->wr_coeff > `1`))
5109	__bfq_entity_update_weight_prio(old_st: bfq_entity_service_tree(entity),
5110	entity, update_class_too: false);
5111	}
5112
5113	/*
5114	* Dispatch next request from bfqq.
5115	*/
5116	static struct request bfq_dispatch_rq_from_bfqq(struct* bfq_data *bfqd,
5117	struct bfq_queue *bfqq)
5118	{
5119	struct request *rq = bfqq->next_rq;
5120	unsigned long service_to_charge;
5121
5122	service_to_charge = bfq_serv_to_charge(rq, bfqq);
5123
5124	bfq_bfqq_served(bfqq, served: service_to_charge);
5125
5126	if (bfqq == bfqd->in_service_queue && bfqd->wait_dispatch) {
5127	bfqd->wait_dispatch = false;
5128	bfqd->waited_rq = rq;
5129	}
5130
5131	bfq_dispatch_remove(q: bfqd->queue, rq);
5132
5133	if (bfqq != bfqd->in_service_queue)
5134	return rq;
5135
5136	/*
5137	* If weight raising has to terminate for bfqq, then next
5138	* function causes an immediate update of bfqq's weight,
5139	* without waiting for next activation. As a consequence, on
5140	* expiration, bfqq will be timestamped as if has never been
5141	* weight-raised during this service slot, even if it has
5142	* received part or even most of the service as a
5143	* weight-raised queue. This inflates bfqq's timestamps, which
5144	* is beneficial, as bfqq is then more willing to leave the
5145	* device immediately to possible other weight-raised queues.
5146	*/
5147	bfq_update_wr_data(bfqd, bfqq);
5148
5149	/*
5150	* Expire bfqq, pretending that its budget expired, if bfqq
5151	* belongs to CLASS_IDLE and other queues are waiting for
5152	* service.
5153	*/
5154	if (bfq_tot_busy_queues(bfqd) > `1` && bfq_class_idle(bfqq))
5155	bfq_bfqq_expire(bfqd, bfqq, compensate: false, reason: BFQQE_BUDGET_EXHAUSTED);
5156
5157	return rq;
5158	}
5159
5160	static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
5161	{
5162	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
5163
5164	/*
5165	* Avoiding lock: a race on bfqd->queued should cause at
5166	* most a call to dispatch for nothing
5167	*/
5168	return !list_empty_careful(head: &bfqd->dispatch) \|\|
5169	READ_ONCE(bfqd->queued);
5170	}
5171
5172	static struct request __bfq_dispatch_request(struct* blk_mq_hw_ctx *hctx)
5173	{
5174	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
5175	struct request *rq = NULL;
5176	struct bfq_queue *bfqq = NULL;
5177
5178	if (!list_empty(head: &bfqd->dispatch)) {
5179	rq = list_first_entry(&bfqd->dispatch, struct request,
5180	queuelist);
5181	list_del_init(entry: &rq->queuelist);
5182
5183	bfqq = RQ_BFQQ(rq);
5184
5185	if (bfqq) {
5186	/*
5187	* Increment counters here, because this
5188	* dispatch does not follow the standard
5189	* dispatch flow (where counters are
5190	* incremented)
5191	*/
5192	bfqq->dispatched++;
5193
5194	goto inc_in_driver_start_rq;
5195	}
5196
5197	/*
5198	* We exploit the bfq_finish_requeue_request hook to
5199	* decrement tot_rq_in_driver, but
5200	* bfq_finish_requeue_request will not be invoked on
5201	* this request. So, to avoid unbalance, just start
5202	* this request, without incrementing tot_rq_in_driver. As
5203	* a negative consequence, tot_rq_in_driver is deceptively
5204	* lower than it should be while this request is in
5205	* service. This may cause bfq_schedule_dispatch to be
5206	* invoked uselessly.
5207	*
5208	* As for implementing an exact solution, the
5209	* bfq_finish_requeue_request hook, if defined, is
5210	* probably invoked also on this request. So, by
5211	* exploiting this hook, we could 1) increment
5212	* tot_rq_in_driver here, and 2) decrement it in
5213	* bfq_finish_requeue_request. Such a solution would
5214	* let the value of the counter be always accurate,
5215	* but it would entail using an extra interface
5216	* function. This cost seems higher than the benefit,
5217	* being the frequency of non-elevator-private
5218	* requests very low.
5219	*/
5220	goto start_rq;
5221	}
5222
5223	bfq_log(bfqd, "dispatch requests: %d busy queues",
5224	bfq_tot_busy_queues(bfqd));
5225
5226	if (bfq_tot_busy_queues(bfqd) == `0`)
5227	goto exit;
5228
5229	/*
5230	* Force device to serve one request at a time if
5231	* strict_guarantees is true. Forcing this service scheme is
5232	* currently the ONLY way to guarantee that the request
5233	* service order enforced by the scheduler is respected by a
5234	* queueing device. Otherwise the device is free even to make
5235	* some unlucky request wait for as long as the device
5236	* wishes.
5237	*
5238	* Of course, serving one request at a time may cause loss of
5239	* throughput.
5240	*/
5241	if (bfqd->strict_guarantees && bfqd->tot_rq_in_driver > `0`)
5242	goto exit;
5243
5244	bfqq = bfq_select_queue(bfqd);
5245	if (!bfqq)
5246	goto exit;
5247
5248	rq = bfq_dispatch_rq_from_bfqq(bfqd, bfqq);
5249
5250	if (rq) {
5251	inc_in_driver_start_rq:
5252	bfqd->rq_in_driver[bfqq->actuator_idx]++;
5253	bfqd->tot_rq_in_driver++;
5254	start_rq:
5255	rq->rq_flags \|= RQF_STARTED;
5256	}
5257	exit:
5258	return rq;
5259	}
5260
5261	#ifdef CONFIG_BFQ_CGROUP_DEBUG
5262	static void bfq_update_dispatch_stats(struct request_queue *q,
5263	struct request *rq,
5264	struct bfq_queue *in_serv_queue,
5265	bool idle_timer_disabled)
5266	{
5267	struct bfq_queue *bfqq = rq ? RQ_BFQQ(rq) : NULL;
5268
5269	if (!idle_timer_disabled && !bfqq)
5270	return;
5271
5272	/*
5273	* rq and bfqq are guaranteed to exist until this function
5274	* ends, for the following reasons. First, rq can be
5275	* dispatched to the device, and then can be completed and
5276	* freed, only after this function ends. Second, rq cannot be
5277	* merged (and thus freed because of a merge) any longer,
5278	* because it has already started. Thus rq cannot be freed
5279	* before this function ends, and, since rq has a reference to
5280	* bfqq, the same guarantee holds for bfqq too.
5281	*
5282	* In addition, the following queue lock guarantees that
5283	* bfqq_group(bfqq) exists as well.
5284	*/
5285	spin_lock_irq(lock: &q->queue_lock);
5286	if (idle_timer_disabled)
5287	/*
5288	* Since the idle timer has been disabled,
5289	* in_serv_queue contained some request when
5290	* __bfq_dispatch_request was invoked above, which
5291	* implies that rq was picked exactly from
5292	* in_serv_queue. Thus in_serv_queue == bfqq, and is
5293	* therefore guaranteed to exist because of the above
5294	* arguments.
5295	*/
5296	bfqg_stats_update_idle_time(bfqg: bfqq_group(bfqq: in_serv_queue));
5297	if (bfqq) {
5298	struct bfq_group *bfqg = bfqq_group(bfqq);
5299
5300	bfqg_stats_update_avg_queue_size(bfqg);
5301	bfqg_stats_set_start_empty_time(bfqg);
5302	bfqg_stats_update_io_remove(bfqg, opf: rq->cmd_flags);
5303	}
5304	spin_unlock_irq(lock: &q->queue_lock);
5305	}
5306	#else
5307	static inline void bfq_update_dispatch_stats(struct request_queue *q,
5308	struct request *rq,
5309	struct bfq_queue *in_serv_queue,
5310	bool idle_timer_disabled) {}
5311	#endif /* CONFIG_BFQ_CGROUP_DEBUG */
5312
5313	static struct request bfq_dispatch_request(struct* blk_mq_hw_ctx *hctx)
5314	{
5315	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
5316	struct request *rq;
5317	struct bfq_queue *in_serv_queue;
5318	bool waiting_rq, idle_timer_disabled = false;
5319
5320	spin_lock_irq(lock: &bfqd->lock);
5321
5322	in_serv_queue = bfqd->in_service_queue;
5323	waiting_rq = in_serv_queue && bfq_bfqq_wait_request(bfqq: in_serv_queue);
5324
5325	rq = __bfq_dispatch_request(hctx);
5326	if (in_serv_queue == bfqd->in_service_queue) {
5327	idle_timer_disabled =
5328	waiting_rq && !bfq_bfqq_wait_request(bfqq: in_serv_queue);
5329	}
5330
5331	spin_unlock_irq(lock: &bfqd->lock);
5332	bfq_update_dispatch_stats(q: hctx->queue, rq,
5333	in_serv_queue: idle_timer_disabled ? in_serv_queue : NULL,
5334	idle_timer_disabled);
5335
5336	return rq;
5337	}
5338
5339	/*
5340	* Task holds one reference to the queue, dropped when task exits. Each rq
5341	* in-flight on this queue also holds a reference, dropped when rq is freed.
5342	*
5343	* Scheduler lock must be held here. Recall not to use bfqq after calling
5344	* this function on it.
5345	*/
5346	void bfq_put_queue(struct bfq_queue *bfqq)
5347	{
5348	struct bfq_queue *item;
5349	struct hlist_node *n;
5350	struct bfq_group *bfqg = bfqq_group(bfqq);
5351
5352	bfq_log_bfqq(bfqq->bfqd, bfqq, "put_queue: %p %d", bfqq, bfqq->ref);
5353
5354	bfqq->ref--;
5355	if (bfqq->ref)
5356	return;
5357
5358	if (!hlist_unhashed(h: &bfqq->burst_list_node)) {
5359	hlist_del_init(n: &bfqq->burst_list_node);
5360	/*
5361	* Decrement also burst size after the removal, if the
5362	* process associated with bfqq is exiting, and thus
5363	* does not contribute to the burst any longer. This
5364	* decrement helps filter out false positives of large
5365	* bursts, when some short-lived process (often due to
5366	* the execution of commands by some service) happens
5367	* to start and exit while a complex application is
5368	* starting, and thus spawning several processes that
5369	* do I/O (and that must not be treated as a large
5370	* burst, see comments on bfq_handle_burst).
5371	*
5372	* In particular, the decrement is performed only if:
5373	* 1) bfqq is not a merged queue, because, if it is,
5374	* then this free of bfqq is not triggered by the exit
5375	* of the process bfqq is associated with, but exactly
5376	* by the fact that bfqq has just been merged.
5377	* 2) burst_size is greater than 0, to handle
5378	* unbalanced decrements. Unbalanced decrements may
5379	* happen in te following case: bfqq is inserted into
5380	* the current burst list--without incrementing
5381	* bust_size--because of a split, but the current
5382	* burst list is not the burst list bfqq belonged to
5383	* (see comments on the case of a split in
5384	* bfq_set_request).
5385	*/
5386	if (bfqq->bic && bfqq->bfqd->burst_size > `0`)
5387	bfqq->bfqd->burst_size--;
5388	}
5389
5390	/*
5391	* bfqq does not exist any longer, so it cannot be woken by
5392	* any other queue, and cannot wake any other queue. Then bfqq
5393	* must be removed from the woken list of its possible waker
5394	* queue, and all queues in the woken list of bfqq must stop
5395	* having a waker queue. Strictly speaking, these updates
5396	* should be performed when bfqq remains with no I/O source
5397	* attached to it, which happens before bfqq gets freed. In
5398	* particular, this happens when the last process associated
5399	* with bfqq exits or gets associated with a different
5400	* queue. However, both events lead to bfqq being freed soon,
5401	* and dangling references would come out only after bfqq gets
5402	* freed. So these updates are done here, as a simple and safe
5403	* way to handle all cases.
5404	*/
5405	/ remove bfqq from woken list /
5406	if (!hlist_unhashed(h: &bfqq->woken_list_node))
5407	hlist_del_init(n: &bfqq->woken_list_node);
5408
5409	/ reset waker for all queues in woken list /
5410	hlist_for_each_entry_safe(item, n, &bfqq->woken_list,
5411	woken_list_node) {
5412	item->waker_bfqq = NULL;
5413	hlist_del_init(n: &item->woken_list_node);
5414	}
5415
5416	if (bfqq->bfqd->last_completed_rq_bfqq == bfqq)
5417	bfqq->bfqd->last_completed_rq_bfqq = NULL;
5418
5419	WARN_ON_ONCE(!list_empty(&bfqq->fifo));
5420	WARN_ON_ONCE(!RB_EMPTY_ROOT(&bfqq->sort_list));
5421	WARN_ON_ONCE(bfqq->dispatched);
5422
5423	kmem_cache_free(s: bfq_pool, objp: bfqq);
5424	bfqg_and_blkg_put(bfqg);
5425	}
5426
5427	static void bfq_put_stable_ref(struct bfq_queue *bfqq)
5428	{
5429	bfqq->stable_ref--;
5430	bfq_put_queue(bfqq);
5431	}
5432
5433	void bfq_put_cooperator(struct bfq_queue *bfqq)
5434	{
5435	struct bfq_queue __bfqq, next;
5436
5437	/*
5438	* If this queue was scheduled to merge with another queue, be
5439	* sure to drop the reference taken on that queue (and others in
5440	* the merge chain). See bfq_setup_merge and bfq_merge_bfqqs.
5441	*/
5442	__bfqq = bfqq->new_bfqq;
5443	while (__bfqq) {
5444	next = __bfqq->new_bfqq;
5445	bfq_put_queue(bfqq: __bfqq);
5446	__bfqq = next;
5447	}
5448	}
5449
5450	static void bfq_exit_bfqq(struct bfq_data bfqd, struct* bfq_queue *bfqq)
5451	{
5452	if (bfqq == bfqd->in_service_queue) {
5453	__bfq_bfqq_expire(bfqd, bfqq, reason: BFQQE_BUDGET_TIMEOUT);
5454	bfq_schedule_dispatch(bfqd);
5455	}
5456
5457	bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq, bfqq->ref);
5458
5459	bfq_put_cooperator(bfqq);
5460
5461	bfq_release_process_ref(bfqd, bfqq);
5462	}
5463
5464	static void bfq_exit_icq_bfqq(struct bfq_io_cq *bic, bool is_sync,
5465	unsigned int actuator_idx)
5466	{
5467	struct bfq_queue *bfqq = bic_to_bfqq(bic, is_sync, actuator_idx);
5468	struct bfq_data *bfqd;
5469
5470	if (bfqq)
5471	bfqd = bfqq->bfqd; / NULL if scheduler already exited /
5472
5473	if (bfqq && bfqd) {
5474	bic_set_bfqq(bic, NULL, is_sync, actuator_idx);
5475	bfq_exit_bfqq(bfqd, bfqq);
5476	}
5477	}
5478
5479	static void _bfq_exit_icq(struct bfq_io_cq bic, unsigned* int num_actuators)
5480	{
5481	struct bfq_iocq_bfqq_data *bfqq_data = bic->bfqq_data;
5482	unsigned int act_idx;
5483
5484	for (act_idx = `0`; act_idx < num_actuators; act_idx++) {
5485	if (bfqq_data[act_idx].stable_merge_bfqq)
5486	bfq_put_stable_ref(bfqq: bfqq_data[act_idx].stable_merge_bfqq);
5487
5488	bfq_exit_icq_bfqq(bic, is_sync: true, actuator_idx: act_idx);
5489	bfq_exit_icq_bfqq(bic, is_sync: false, actuator_idx: act_idx);
5490	}
5491	}
5492
5493	static void bfq_exit_icq(struct io_cq *icq)
5494	{
5495	struct bfq_io_cq *bic = icq_to_bic(icq);
5496	struct bfq_data *bfqd = bic_to_bfqd(bic);
5497	unsigned long flags;
5498
5499	/*
5500	* If bfqd and thus bfqd->num_actuators is not available any
5501	* longer, then cycle over all possible per-actuator bfqqs in
5502	* next loop. We rely on bic being zeroed on creation, and
5503	* therefore on its unused per-actuator fields being NULL.
5504	*
5505	* bfqd is NULL if scheduler already exited, and in that case
5506	* this is the last time these queues are accessed.
5507	*/
5508	if (bfqd) {
5509	spin_lock_irqsave(&bfqd->lock, flags);
5510	_bfq_exit_icq(bic, num_actuators: bfqd->num_actuators);
5511	spin_unlock_irqrestore(lock: &bfqd->lock, flags);
5512	} else {
5513	_bfq_exit_icq(bic, BFQ_MAX_ACTUATORS);
5514	}
5515	}
5516
5517	/*
5518	* Update the entity prio values; note that the new values will not
5519	* be used until the next (re)activation.
5520	*/
5521	static void
5522	bfq_set_next_ioprio_data(struct bfq_queue bfqq, struct* bfq_io_cq *bic)
5523	{
5524	struct task_struct *tsk = current;
5525	int ioprio_class;
5526	struct bfq_data *bfqd = bfqq->bfqd;
5527
5528	if (!bfqd)
5529	return;
5530
5531	ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
5532	switch (ioprio_class) {
5533	default:
5534	pr_err("bdi %s: bfq: bad prio class %d\n",
5535	bdi_dev_name(bfqq->bfqd->queue->disk->bdi),
5536	ioprio_class);
5537	fallthrough;
5538	case IOPRIO_CLASS_NONE:
5539	/*
5540	* No prio set, inherit CPU scheduling settings.
5541	*/
5542	bfqq->new_ioprio = task_nice_ioprio(task: tsk);
5543	bfqq->new_ioprio_class = task_nice_ioclass(task: tsk);
5544	break;
5545	case IOPRIO_CLASS_RT:
5546	bfqq->new_ioprio = IOPRIO_PRIO_LEVEL(bic->ioprio);
5547	bfqq->new_ioprio_class = IOPRIO_CLASS_RT;
5548	break;
5549	case IOPRIO_CLASS_BE:
5550	bfqq->new_ioprio = IOPRIO_PRIO_LEVEL(bic->ioprio);
5551	bfqq->new_ioprio_class = IOPRIO_CLASS_BE;
5552	break;
5553	case IOPRIO_CLASS_IDLE:
5554	bfqq->new_ioprio_class = IOPRIO_CLASS_IDLE;
5555	bfqq->new_ioprio = IOPRIO_NR_LEVELS - `1`;
5556	break;
5557	}
5558
5559	if (bfqq->new_ioprio >= IOPRIO_NR_LEVELS) {
5560	pr_crit("bfq_set_next_ioprio_data: new_ioprio %d\n",
5561	bfqq->new_ioprio);
5562	bfqq->new_ioprio = IOPRIO_NR_LEVELS - `1`;
5563	}
5564
5565	bfqq->entity.new_weight = bfq_ioprio_to_weight(ioprio: bfqq->new_ioprio);
5566	bfq_log_bfqq(bfqd, bfqq, "new_ioprio %d new_weight %d",
5567	bfqq->new_ioprio, bfqq->entity.new_weight);
5568	bfqq->entity.prio_changed = `1`;
5569	}
5570
5571	static struct bfq_queue bfq_get_queue(struct* bfq_data *bfqd,
5572	struct bio *bio, bool is_sync,
5573	struct bfq_io_cq *bic,
5574	bool respawn);
5575
5576	static void bfq_check_ioprio_change(struct bfq_io_cq bic, struct* bio *bio)
5577	{
5578	struct bfq_data *bfqd = bic_to_bfqd(bic);
5579	struct bfq_queue *bfqq;
5580	int ioprio = bic->icq.ioc->ioprio;
5581
5582	/*
5583	* This condition may trigger on a newly created bic, be sure to
5584	* drop the lock before returning.
5585	*/
5586	if (unlikely(!bfqd) \|\| likely(bic->ioprio == ioprio))
5587	return;
5588
5589	bic->ioprio = ioprio;
5590
5591	bfqq = bic_to_bfqq(bic, is_sync: false, actuator_idx: bfq_actuator_index(bfqd, bio));
5592	if (bfqq) {
5593	struct bfq_queue *old_bfqq = bfqq;
5594
5595	bfqq = bfq_get_queue(bfqd, bio, is_sync: false, bic, respawn: true);
5596	bic_set_bfqq(bic, bfqq, is_sync: false, actuator_idx: bfq_actuator_index(bfqd, bio));
5597	bfq_release_process_ref(bfqd, bfqq: old_bfqq);
5598	}
5599
5600	bfqq = bic_to_bfqq(bic, is_sync: true, actuator_idx: bfq_actuator_index(bfqd, bio));
5601	if (bfqq)
5602	bfq_set_next_ioprio_data(bfqq, bic);
5603	}
5604
5605	static void bfq_init_bfqq(struct bfq_data bfqd, struct* bfq_queue *bfqq,
5606	struct bfq_io_cq bic, pid_t pid, int* is_sync,
5607	unsigned int act_idx)
5608	{
5609	u64 now_ns = blk_time_get_ns();
5610
5611	bfqq->actuator_idx = act_idx;
5612	RB_CLEAR_NODE(&bfqq->entity.rb_node);
5613	INIT_LIST_HEAD(list: &bfqq->fifo);
5614	INIT_HLIST_NODE(h: &bfqq->burst_list_node);
5615	INIT_HLIST_NODE(h: &bfqq->woken_list_node);
5616	INIT_HLIST_HEAD(&bfqq->woken_list);
5617
5618	bfqq->ref = `0`;
5619	bfqq->bfqd = bfqd;
5620
5621	if (bic)
5622	bfq_set_next_ioprio_data(bfqq, bic);
5623
5624	if (is_sync) {
5625	/*
5626	* No need to mark as has_short_ttime if in
5627	* idle_class, because no device idling is performed
5628	* for queues in idle class
5629	*/
5630	if (!bfq_class_idle(bfqq))
5631	/ tentatively mark as has_short_ttime /
5632	bfq_mark_bfqq_has_short_ttime(bfqq);
5633	bfq_mark_bfqq_sync(bfqq);
5634	bfq_mark_bfqq_just_created(bfqq);
5635	} else
5636	bfq_clear_bfqq_sync(bfqq);
5637
5638	/ set end request to minus infinity from now /
5639	bfqq->ttime.last_end_request = now_ns + `1`;
5640
5641	bfqq->creation_time = jiffies;
5642
5643	bfqq->io_start_time = now_ns;
5644
5645	bfq_mark_bfqq_IO_bound(bfqq);
5646
5647	bfqq->pid = pid;
5648
5649	/ Tentative initial value to trade off between thr and lat /
5650	bfqq->max_budget = (`2` * bfq_max_budget(bfqd)) / `3`;
5651	bfqq->budget_timeout = bfq_smallest_from_now();
5652
5653	bfqq->wr_coeff = `1`;
5654	bfqq->last_wr_start_finish = jiffies;
5655	bfqq->wr_start_at_switch_to_srt = bfq_smallest_from_now();
5656	bfqq->split_time = bfq_smallest_from_now();
5657
5658	/*
5659	* To not forget the possibly high bandwidth consumed by a
5660	* process/queue in the recent past,
5661	* bfq_bfqq_softrt_next_start() returns a value at least equal
5662	* to the current value of bfqq->soft_rt_next_start (see
5663	* comments on bfq_bfqq_softrt_next_start). Set
5664	* soft_rt_next_start to now, to mean that bfqq has consumed
5665	* no bandwidth so far.
5666	*/
5667	bfqq->soft_rt_next_start = jiffies;
5668
5669	/ first request is almost certainly seeky /
5670	bfqq->seek_history = `1`;
5671
5672	bfqq->decrease_time_jif = jiffies;
5673	}
5674
5675	static struct bfq_queue bfq_async_queue_prio(struct** bfq_data *bfqd,
5676	struct bfq_group *bfqg,
5677	int ioprio_class, int ioprio, int act_idx)
5678	{
5679	switch (ioprio_class) {
5680	case IOPRIO_CLASS_RT:
5681	return &bfqg->async_bfqq[`0`][ioprio][act_idx];
5682	case IOPRIO_CLASS_NONE:
5683	ioprio = IOPRIO_BE_NORM;
5684	fallthrough;
5685	case IOPRIO_CLASS_BE:
5686	return &bfqg->async_bfqq[`1`][ioprio][act_idx];
5687	case IOPRIO_CLASS_IDLE:
5688	return &bfqg->async_idle_bfqq[act_idx];
5689	default:
5690	return NULL;
5691	}
5692	}
5693
5694	static struct bfq_queue *
5695	bfq_do_early_stable_merge(struct bfq_data bfqd, struct* bfq_queue *bfqq,
5696	struct bfq_io_cq *bic,
5697	struct bfq_queue *last_bfqq_created)
5698	{
5699	unsigned int a_idx = last_bfqq_created->actuator_idx;
5700	struct bfq_queue *new_bfqq =
5701	bfq_setup_merge(bfqq, new_bfqq: last_bfqq_created);
5702
5703	if (!new_bfqq)
5704	return bfqq;
5705
5706	if (new_bfqq->bic)
5707	new_bfqq->bic->bfqq_data[a_idx].stably_merged = true;
5708	bic->bfqq_data[a_idx].stably_merged = true;
5709
5710	/*
5711	* Reusing merge functions. This implies that
5712	* bfqq->bic must be set too, for
5713	* bfq_merge_bfqqs to correctly save bfqq's
5714	* state before killing it.
5715	*/
5716	bfqq->bic = bic;
5717	return bfq_merge_bfqqs(bfqd, bic, bfqq);
5718	}
5719
5720	/*
5721	* Many throughput-sensitive workloads are made of several parallel
5722	* I/O flows, with all flows generated by the same application, or
5723	* more generically by the same task (e.g., system boot). The most
5724	* counterproductive action with these workloads is plugging I/O
5725	* dispatch when one of the bfq_queues associated with these flows
5726	* remains temporarily empty.
5727	*
5728	* To avoid this plugging, BFQ has been using a burst-handling
5729	* mechanism for years now. This mechanism has proven effective for
5730	* throughput, and not detrimental for service guarantees. The
5731	* following function pushes this mechanism a little bit further,
5732	* basing on the following two facts.
5733	*
5734	* First, all the I/O flows of a the same application or task
5735	* contribute to the execution/completion of that common application
5736	* or task. So the performance figures that matter are total
5737	* throughput of the flows and task-wide I/O latency. In particular,
5738	* these flows do not need to be protected from each other, in terms
5739	* of individual bandwidth or latency.
5740	*
5741	* Second, the above fact holds regardless of the number of flows.
5742	*
5743	* Putting these two facts together, this commits merges stably the
5744	* bfq_queues associated with these I/O flows, i.e., with the
5745	* processes that generate these IO/ flows, regardless of how many the
5746	* involved processes are.
5747	*
5748	* To decide whether a set of bfq_queues is actually associated with
5749	* the I/O flows of a common application or task, and to merge these
5750	* queues stably, this function operates as follows: given a bfq_queue,
5751	* say Q2, currently being created, and the last bfq_queue, say Q1,
5752	* created before Q2, Q2 is merged stably with Q1 if
5753	* - very little time has elapsed since when Q1 was created
5754	* - Q2 has the same ioprio as Q1
5755	* - Q2 belongs to the same group as Q1
5756	*
5757	* Merging bfq_queues also reduces scheduling overhead. A fio test
5758	* with ten random readers on /dev/nullb shows a throughput boost of
5759	* 40%, with a quadcore. Since BFQ's execution time amounts to ~50% of
5760	* the total per-request processing time, the above throughput boost
5761	* implies that BFQ's overhead is reduced by more than 50%.
5762	*
5763	* This new mechanism most certainly obsoletes the current
5764	* burst-handling heuristics. We keep those heuristics for the moment.
5765	*/
5766	static struct bfq_queue bfq_do_or_sched_stable_merge(struct* bfq_data *bfqd,
5767	struct bfq_queue *bfqq,
5768	struct bfq_io_cq *bic)
5769	{
5770	struct bfq_queue **source_bfqq = bfqq->entity.parent ?
5771	&bfqq->entity.parent->last_bfqq_created :
5772	&bfqd->last_bfqq_created;
5773
5774	struct bfq_queue last_bfqq_created = source_bfqq;
5775
5776	/*
5777	* If last_bfqq_created has not been set yet, then init it. If
5778	* it has been set already, but too long ago, then move it
5779	* forward to bfqq. Finally, move also if bfqq belongs to a
5780	* different group than last_bfqq_created, or if bfqq has a
5781	* different ioprio, ioprio_class or actuator_idx. If none of
5782	* these conditions holds true, then try an early stable merge
5783	* or schedule a delayed stable merge. As for the condition on
5784	* actuator_idx, the reason is that, if queues associated with
5785	* different actuators are merged, then control is lost on
5786	* each actuator. Therefore some actuator may be
5787	* underutilized, and throughput may decrease.
5788	*
5789	* A delayed merge is scheduled (instead of performing an
5790	* early merge), in case bfqq might soon prove to be more
5791	* throughput-beneficial if not merged. Currently this is
5792	* possible only if bfqd is rotational with no queueing. For
5793	* such a drive, not merging bfqq is better for throughput if
5794	* bfqq happens to contain sequential I/O. So, we wait a
5795	* little bit for enough I/O to flow through bfqq. After that,
5796	* if such an I/O is sequential, then the merge is
5797	* canceled. Otherwise the merge is finally performed.
5798	*/
5799	if (!last_bfqq_created \|\|
5800	time_before(last_bfqq_created->creation_time +
5801	msecs_to_jiffies(bfq_activation_stable_merging),
5802	bfqq->creation_time) \|\|
5803	bfqq->entity.parent != last_bfqq_created->entity.parent \|\|
5804	bfqq->ioprio != last_bfqq_created->ioprio \|\|
5805	bfqq->ioprio_class != last_bfqq_created->ioprio_class \|\|
5806	bfqq->actuator_idx != last_bfqq_created->actuator_idx)
5807	*source_bfqq = bfqq;
5808	else if (time_after_eq(last_bfqq_created->creation_time +
5809	bfqd->bfq_burst_interval,
5810	bfqq->creation_time)) {
5811	if (likely(bfqd->nonrot_with_queueing))
5812	/*
5813	* With this type of drive, leaving
5814	* bfqq alone may provide no
5815	* throughput benefits compared with
5816	* merging bfqq. So merge bfqq now.
5817	*/
5818	bfqq = bfq_do_early_stable_merge(bfqd, bfqq,
5819	bic,
5820	last_bfqq_created);
5821	else { / schedule tentative stable merge /
5822	/*
5823	* get reference on last_bfqq_created,
5824	* to prevent it from being freed,
5825	* until we decide whether to merge
5826	*/
5827	last_bfqq_created->ref++;
5828	/*
5829	* need to keep track of stable refs, to
5830	* compute process refs correctly
5831	*/
5832	last_bfqq_created->stable_ref++;
5833	/*
5834	* Record the bfqq to merge to.
5835	*/
5836	bic->bfqq_data[last_bfqq_created->actuator_idx].stable_merge_bfqq =
5837	last_bfqq_created;
5838	}
5839	}
5840
5841	return bfqq;
5842	}
5843
5844
5845	static struct bfq_queue bfq_get_queue(struct* bfq_data *bfqd,
5846	struct bio *bio, bool is_sync,
5847	struct bfq_io_cq *bic,
5848	bool respawn)
5849	{
5850	const int ioprio = IOPRIO_PRIO_LEVEL(bic->ioprio);
5851	const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
5852	struct bfq_queue **async_bfqq = NULL;
5853	struct bfq_queue *bfqq;
5854	struct bfq_group *bfqg;
5855
5856	bfqg = bfq_bio_bfqg(bfqd, bio);
5857	if (!is_sync) {
5858	async_bfqq = bfq_async_queue_prio(bfqd, bfqg, ioprio_class,
5859	ioprio,
5860	act_idx: bfq_actuator_index(bfqd, bio));
5861	bfqq = *async_bfqq;
5862	if (bfqq)
5863	goto out;
5864	}
5865
5866	bfqq = kmem_cache_alloc_node(bfq_pool,
5867	GFP_NOWAIT \| __GFP_ZERO \| __GFP_NOWARN,
5868	bfqd->queue->node);
5869
5870	if (bfqq) {
5871	bfq_init_bfqq(bfqd, bfqq, bic, current->pid,
5872	is_sync, act_idx: bfq_actuator_index(bfqd, bio));
5873	bfq_init_entity(entity: &bfqq->entity, bfqg);
5874	bfq_log_bfqq(bfqd, bfqq, "allocated");
5875	} else {
5876	bfqq = &bfqd->oom_bfqq;
5877	bfq_log_bfqq(bfqd, bfqq, "using oom bfqq");
5878	goto out;
5879	}
5880
5881	/*
5882	* Pin the queue now that it's allocated, scheduler exit will
5883	* prune it.
5884	*/
5885	if (async_bfqq) {
5886	bfqq->ref++; /*
5887	* Extra group reference, w.r.t. sync
5888	* queue. This extra reference is removed
5889	* only if bfqq->bfqg disappears, to
5890	* guarantee that this queue is not freed
5891	* until its group goes away.
5892	*/
5893	bfq_log_bfqq(bfqd, bfqq, "get_queue, bfqq not in async: %p, %d",
5894	bfqq, bfqq->ref);
5895	*async_bfqq = bfqq;
5896	}
5897
5898	out:
5899	bfqq->ref++; / get a process reference to this queue /
5900
5901	if (bfqq != &bfqd->oom_bfqq && is_sync && !respawn)
5902	bfqq = bfq_do_or_sched_stable_merge(bfqd, bfqq, bic);
5903	return bfqq;
5904	}
5905
5906	static void bfq_update_io_thinktime(struct bfq_data *bfqd,
5907	struct bfq_queue *bfqq)
5908	{
5909	struct bfq_ttime *ttime = &bfqq->ttime;
5910	u64 elapsed;
5911
5912	/*
5913	* We are really interested in how long it takes for the queue to
5914	* become busy when there is no outstanding IO for this queue. So
5915	* ignore cases when the bfq queue has already IO queued.
5916	*/
5917	if (bfqq->dispatched \|\| bfq_bfqq_busy(bfqq))
5918	return;
5919	elapsed = blk_time_get_ns() - bfqq->ttime.last_end_request;
5920	elapsed = min_t(u64, elapsed, `2ULL` * bfqd->bfq_slice_idle);
5921
5922	ttime->ttime_samples = (`7`*ttime->ttime_samples + `256`) / `8`;
5923	ttime->ttime_total = div_u64(dividend: `7`ttime->ttime_total + `256`elapsed, divisor: `8`);
5924	ttime->ttime_mean = div64_ul(ttime->ttime_total + `128`,
5925	ttime->ttime_samples);
5926	}
5927
5928	static void
5929	bfq_update_io_seektime(struct bfq_data bfqd, struct* bfq_queue *bfqq,
5930	struct request *rq)
5931	{
5932	bfqq->seek_history <<= `1`;
5933	bfqq->seek_history \|= BFQ_RQ_SEEKY(bfqd, bfqq->last_request_pos, rq);
5934
5935	if (bfqq->wr_coeff > `1` &&
5936	bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
5937	BFQQ_TOTALLY_SEEKY(bfqq)) {
5938	if (time_is_before_jiffies(bfqq->wr_start_at_switch_to_srt +
5939	bfq_wr_duration(bfqd))) {
5940	/*
5941	* In soft_rt weight raising with the
5942	* interactive-weight-raising period
5943	* elapsed (so no switch back to
5944	* interactive weight raising).
5945	*/
5946	bfq_bfqq_end_wr(bfqq);
5947	} else { /*
5948	* stopping soft_rt weight raising
5949	* while still in interactive period,
5950	* switch back to interactive weight
5951	* raising
5952	*/
5953	switch_back_to_interactive_wr(bfqq, bfqd);
5954	bfqq->entity.prio_changed = `1`;
5955	}
5956	}
5957	}
5958
5959	static void bfq_update_has_short_ttime(struct bfq_data *bfqd,
5960	struct bfq_queue *bfqq,
5961	struct bfq_io_cq *bic)
5962	{
5963	bool has_short_ttime = true, state_changed;
5964
5965	/*
5966	* No need to update has_short_ttime if bfqq is async or in
5967	* idle io prio class, or if bfq_slice_idle is zero, because
5968	* no device idling is performed for bfqq in this case.
5969	*/
5970	if (!bfq_bfqq_sync(bfqq) \|\| bfq_class_idle(bfqq) \|\|
5971	bfqd->bfq_slice_idle == `0`)
5972	return;
5973
5974	/ Idle window just restored, statistics are meaningless. /
5975	if (time_is_after_eq_jiffies(bfqq->split_time +
5976	bfqd->bfq_wr_min_idle_time))
5977	return;
5978
5979	/ Think time is infinite if no process is linked to*
5980	* bfqq. Otherwise check average think time to decide whether
5981	* to mark as has_short_ttime. To this goal, compare average
5982	* think time with half the I/O-plugging timeout.
5983	*/
5984	if (atomic_read(v: &bic->icq.ioc->active_ref) == `0` \|\|
5985	(bfq_sample_valid(bfqq->ttime.ttime_samples) &&
5986	bfqq->ttime.ttime_mean > bfqd->bfq_slice_idle>>`1`))
5987	has_short_ttime = false;
5988
5989	state_changed = has_short_ttime != bfq_bfqq_has_short_ttime(bfqq);
5990
5991	if (has_short_ttime)
5992	bfq_mark_bfqq_has_short_ttime(bfqq);
5993	else
5994	bfq_clear_bfqq_has_short_ttime(bfqq);
5995
5996	/*
5997	* Until the base value for the total service time gets
5998	* finally computed for bfqq, the inject limit does depend on
5999	* the think-time state (short\|long). In particular, the limit
6000	* is 0 or 1 if the think time is deemed, respectively, as
6001	* short or long (details in the comments in
6002	* bfq_update_inject_limit()). Accordingly, the next
6003	* instructions reset the inject limit if the think-time state
6004	* has changed and the above base value is still to be
6005	* computed.
6006	*
6007	* However, the reset is performed only if more than 100 ms
6008	* have elapsed since the last update of the inject limit, or
6009	* (inclusive) if the change is from short to long think
6010	* time. The reason for this waiting is as follows.
6011	*
6012	* bfqq may have a long think time because of a
6013	* synchronization with some other queue, i.e., because the
6014	* I/O of some other queue may need to be completed for bfqq
6015	* to receive new I/O. Details in the comments on the choice
6016	* of the queue for injection in bfq_select_queue().
6017	*
6018	* As stressed in those comments, if such a synchronization is
6019	* actually in place, then, without injection on bfqq, the
6020	* blocking I/O cannot happen to served while bfqq is in
6021	* service. As a consequence, if bfqq is granted
6022	* I/O-dispatch-plugging, then bfqq remains empty, and no I/O
6023	* is dispatched, until the idle timeout fires. This is likely
6024	* to result in lower bandwidth and higher latencies for bfqq,
6025	* and in a severe loss of total throughput.
6026	*
6027	* On the opposite end, a non-zero inject limit may allow the
6028	* I/O that blocks bfqq to be executed soon, and therefore
6029	* bfqq to receive new I/O soon.
6030	*
6031	* But, if the blocking gets actually eliminated, then the
6032	* next think-time sample for bfqq may be very low. This in
6033	* turn may cause bfqq's think time to be deemed
6034	* short. Without the 100 ms barrier, this new state change
6035	* would cause the body of the next if to be executed
6036	* immediately. But this would set to 0 the inject
6037	* limit. Without injection, the blocking I/O would cause the
6038	* think time of bfqq to become long again, and therefore the
6039	* inject limit to be raised again, and so on. The only effect
6040	* of such a steady oscillation between the two think-time
6041	* states would be to prevent effective injection on bfqq.
6042	*
6043	* In contrast, if the inject limit is not reset during such a
6044	* long time interval as 100 ms, then the number of short
6045	* think time samples can grow significantly before the reset
6046	* is performed. As a consequence, the think time state can
6047	* become stable before the reset. Therefore there will be no
6048	* state change when the 100 ms elapse, and no reset of the
6049	* inject limit. The inject limit remains steadily equal to 1
6050	* both during and after the 100 ms. So injection can be
6051	* performed at all times, and throughput gets boosted.
6052	*
6053	* An inject limit equal to 1 is however in conflict, in
6054	* general, with the fact that the think time of bfqq is
6055	* short, because injection may be likely to delay bfqq's I/O
6056	* (as explained in the comments in
6057	* bfq_update_inject_limit()). But this does not happen in
6058	* this special case, because bfqq's low think time is due to
6059	* an effective handling of a synchronization, through
6060	* injection. In this special case, bfqq's I/O does not get
6061	* delayed by injection; on the contrary, bfqq's I/O is
6062	* brought forward, because it is not blocked for
6063	* milliseconds.
6064	*
6065	* In addition, serving the blocking I/O much sooner, and much
6066	* more frequently than once per I/O-plugging timeout, makes
6067	* it much quicker to detect a waker queue (the concept of
6068	* waker queue is defined in the comments in
6069	* bfq_add_request()). This makes it possible to start sooner
6070	* to boost throughput more effectively, by injecting the I/O
6071	* of the waker queue unconditionally on every
6072	* bfq_dispatch_request().
6073	*
6074	* One last, important benefit of not resetting the inject
6075	* limit before 100 ms is that, during this time interval, the
6076	* base value for the total service time is likely to get
6077	* finally computed for bfqq, freeing the inject limit from
6078	* its relation with the think time.
6079	*/
6080	if (state_changed && bfqq->last_serv_time_ns == `0` &&
6081	(time_is_before_eq_jiffies(bfqq->decrease_time_jif +
6082	msecs_to_jiffies(`100`)) \|\|
6083	!has_short_ttime))
6084	bfq_reset_inject_limit(bfqd, bfqq);
6085	}
6086
6087	/*
6088	* Called when a new fs request (rq) is added to bfqq. Check if there's
6089	* something we should do about it.
6090	*/
6091	static void bfq_rq_enqueued(struct bfq_data bfqd, struct* bfq_queue *bfqq,
6092	struct request *rq)
6093	{
6094	if (rq->cmd_flags & REQ_META)
6095	bfqq->meta_pending++;
6096
6097	bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
6098
6099	if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) {
6100	bool small_req = bfqq->queued[rq_is_sync(rq)] == `1` &&
6101	blk_rq_sectors(rq) < `32`;
6102	bool budget_timeout = bfq_bfqq_budget_timeout(bfqq);
6103
6104	/*
6105	* There is just this request queued: if
6106	* - the request is small, and
6107	* - we are idling to boost throughput, and
6108	* - the queue is not to be expired,
6109	* then just exit.
6110	*
6111	* In this way, if the device is being idled to wait
6112	* for a new request from the in-service queue, we
6113	* avoid unplugging the device and committing the
6114	* device to serve just a small request. In contrast
6115	* we wait for the block layer to decide when to
6116	* unplug the device: hopefully, new requests will be
6117	* merged to this one quickly, then the device will be
6118	* unplugged and larger requests will be dispatched.
6119	*/
6120	if (small_req && idling_boosts_thr_without_issues(bfqd, bfqq) &&
6121	!budget_timeout)
6122	return;
6123
6124	/*
6125	* A large enough request arrived, or idling is being
6126	* performed to preserve service guarantees, or
6127	* finally the queue is to be expired: in all these
6128	* cases disk idling is to be stopped, so clear
6129	* wait_request flag and reset timer.
6130	*/
6131	bfq_clear_bfqq_wait_request(bfqq);
6132	hrtimer_try_to_cancel(timer: &bfqd->idle_slice_timer);
6133
6134	/*
6135	* The queue is not empty, because a new request just
6136	* arrived. Hence we can safely expire the queue, in
6137	* case of budget timeout, without risking that the
6138	* timestamps of the queue are not updated correctly.
6139	* See [1] for more details.
6140	*/
6141	if (budget_timeout)
6142	bfq_bfqq_expire(bfqd, bfqq, compensate: false,
6143	reason: BFQQE_BUDGET_TIMEOUT);
6144	}
6145	}
6146
6147	static void bfqq_request_allocated(struct bfq_queue *bfqq)
6148	{
6149	struct bfq_entity *entity = &bfqq->entity;
6150
6151	for_each_entity(entity)
6152	entity->allocated++;
6153	}
6154
6155	static void bfqq_request_freed(struct bfq_queue *bfqq)
6156	{
6157	struct bfq_entity *entity = &bfqq->entity;
6158
6159	for_each_entity(entity)
6160	entity->allocated--;
6161	}
6162
6163	/ returns true if it causes the idle timer to be disabled /
6164	static bool __bfq_insert_request(struct bfq_data bfqd, struct* request *rq)
6165	{
6166	struct bfq_queue *bfqq = RQ_BFQQ(rq),
6167	*new_bfqq = bfq_setup_cooperator(bfqd, bfqq, io_struct: rq, request: true,
6168	RQ_BIC(rq));
6169	bool waiting, idle_timer_disabled = false;
6170
6171	if (new_bfqq) {
6172	struct bfq_queue *old_bfqq = bfqq;
6173	/*
6174	* Release the request's reference to the old bfqq
6175	* and make sure one is taken to the shared queue.
6176	*/
6177	bfqq_request_allocated(bfqq: new_bfqq);
6178	bfqq_request_freed(bfqq);
6179	new_bfqq->ref++;
6180	/*
6181	* If the bic associated with the process
6182	* issuing this request still points to bfqq
6183	* (and thus has not been already redirected
6184	* to new_bfqq or even some other bfq_queue),
6185	* then complete the merge and redirect it to
6186	* new_bfqq.
6187	*/
6188	if (bic_to_bfqq(RQ_BIC(rq), is_sync: true,
6189	actuator_idx: bfq_actuator_index(bfqd, bio: rq->bio)) == bfqq) {
6190	while (bfqq != new_bfqq)
6191	bfqq = bfq_merge_bfqqs(bfqd, RQ_BIC(rq), bfqq);
6192	}
6193
6194	bfq_clear_bfqq_just_created(bfqq: old_bfqq);
6195	/*
6196	* rq is about to be enqueued into new_bfqq,
6197	* release rq reference on bfqq
6198	*/
6199	bfq_put_queue(bfqq: old_bfqq);
6200	rq->elv.priv[`1`] = new_bfqq;
6201	}
6202
6203	bfq_update_io_thinktime(bfqd, bfqq);
6204	bfq_update_has_short_ttime(bfqd, bfqq, RQ_BIC(rq));
6205	bfq_update_io_seektime(bfqd, bfqq, rq);
6206
6207	waiting = bfqq && bfq_bfqq_wait_request(bfqq);
6208	bfq_add_request(rq);
6209	idle_timer_disabled = waiting && !bfq_bfqq_wait_request(bfqq);
6210
6211	rq->fifo_time = blk_time_get_ns() + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
6212	list_add_tail(new: &rq->queuelist, head: &bfqq->fifo);
6213
6214	bfq_rq_enqueued(bfqd, bfqq, rq);
6215
6216	return idle_timer_disabled;
6217	}
6218
6219	#ifdef CONFIG_BFQ_CGROUP_DEBUG
6220	static void bfq_update_insert_stats(struct request_queue *q,
6221	struct bfq_queue *bfqq,
6222	bool idle_timer_disabled,
6223	blk_opf_t cmd_flags)
6224	{
6225	if (!bfqq)
6226	return;
6227
6228	/*
6229	* bfqq still exists, because it can disappear only after
6230	* either it is merged with another queue, or the process it
6231	* is associated with exits. But both actions must be taken by
6232	* the same process currently executing this flow of
6233	* instructions.
6234	*
6235	* In addition, the following queue lock guarantees that
6236	* bfqq_group(bfqq) exists as well.
6237	*/
6238	spin_lock_irq(lock: &q->queue_lock);
6239	bfqg_stats_update_io_add(bfqg: bfqq_group(bfqq), bfqq, opf: cmd_flags);
6240	if (idle_timer_disabled)
6241	bfqg_stats_update_idle_time(bfqg: bfqq_group(bfqq));
6242	spin_unlock_irq(lock: &q->queue_lock);
6243	}
6244	#else
6245	static inline void bfq_update_insert_stats(struct request_queue *q,
6246	struct bfq_queue *bfqq,
6247	bool idle_timer_disabled,
6248	blk_opf_t cmd_flags) {}
6249	#endif /* CONFIG_BFQ_CGROUP_DEBUG */
6250
6251	static struct bfq_queue bfq_init_rq(struct* request *rq);
6252
6253	static void bfq_insert_request(struct blk_mq_hw_ctx hctx, struct* request *rq,
6254	blk_insert_t flags)
6255	{
6256	struct request_queue *q = hctx->queue;
6257	struct bfq_data *bfqd = q->elevator->elevator_data;
6258	struct bfq_queue *bfqq;
6259	bool idle_timer_disabled = false;
6260	blk_opf_t cmd_flags;
6261	LIST_HEAD(free);
6262
6263	#ifdef CONFIG_BFQ_GROUP_IOSCHED
6264	if (!cgroup_subsys_on_dfl(io_cgrp_subsys) && rq->bio)
6265	bfqg_stats_update_legacy_io(q, rq);
6266	#endif
6267	spin_lock_irq(lock: &bfqd->lock);
6268	bfqq = bfq_init_rq(rq);
6269	if (blk_mq_sched_try_insert_merge(q, rq, free: &free)) {
6270	spin_unlock_irq(lock: &bfqd->lock);
6271	blk_mq_free_requests(list: &free);
6272	return;
6273	}
6274
6275	trace_block_rq_insert(rq);
6276
6277	if (flags & BLK_MQ_INSERT_AT_HEAD) {
6278	list_add(new: &rq->queuelist, head: &bfqd->dispatch);
6279	} else if (!bfqq) {
6280	list_add_tail(new: &rq->queuelist, head: &bfqd->dispatch);
6281	} else {
6282	idle_timer_disabled = __bfq_insert_request(bfqd, rq);
6283	/*
6284	* Update bfqq, because, if a queue merge has occurred
6285	* in __bfq_insert_request, then rq has been
6286	* redirected into a new queue.
6287	*/
6288	bfqq = RQ_BFQQ(rq);
6289
6290	if (rq_mergeable(rq)) {
6291	elv_rqhash_add(q, rq);
6292	if (!q->last_merge)
6293	q->last_merge = rq;
6294	}
6295	}
6296
6297	/*
6298	* Cache cmd_flags before releasing scheduler lock, because rq
6299	* may disappear afterwards (for example, because of a request
6300	* merge).
6301	*/
6302	cmd_flags = rq->cmd_flags;
6303	spin_unlock_irq(lock: &bfqd->lock);
6304
6305	bfq_update_insert_stats(q, bfqq, idle_timer_disabled,
6306	cmd_flags);
6307	}
6308
6309	static void bfq_insert_requests(struct blk_mq_hw_ctx *hctx,
6310	struct list_head *list,
6311	blk_insert_t flags)
6312	{
6313	while (!list_empty(head: list)) {
6314	struct request *rq;
6315
6316	rq = list_first_entry(list, struct request, queuelist);
6317	list_del_init(entry: &rq->queuelist);
6318	bfq_insert_request(hctx, rq, flags);
6319	}
6320	}
6321
6322	static void bfq_update_hw_tag(struct bfq_data *bfqd)
6323	{
6324	struct bfq_queue *bfqq = bfqd->in_service_queue;
6325
6326	bfqd->max_rq_in_driver = max_t(int, bfqd->max_rq_in_driver,
6327	bfqd->tot_rq_in_driver);
6328
6329	if (bfqd->hw_tag == `1`)
6330	return;
6331
6332	/*
6333	* This sample is valid if the number of outstanding requests
6334	* is large enough to allow a queueing behavior. Note that the
6335	* sum is not exact, as it's not taking into account deactivated
6336	* requests.
6337	*/
6338	if (bfqd->tot_rq_in_driver + bfqd->queued <= BFQ_HW_QUEUE_THRESHOLD)
6339	return;
6340
6341	/*
6342	* If active queue hasn't enough requests and can idle, bfq might not
6343	* dispatch sufficient requests to hardware. Don't zero hw_tag in this
6344	* case
6345	*/
6346	if (bfqq && bfq_bfqq_has_short_ttime(bfqq) &&
6347	bfqq->dispatched + bfqq->queued[`0`] + bfqq->queued[`1`] <
6348	BFQ_HW_QUEUE_THRESHOLD &&
6349	bfqd->tot_rq_in_driver < BFQ_HW_QUEUE_THRESHOLD)
6350	return;
6351
6352	if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)
6353	return;
6354
6355	bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD;
6356	bfqd->max_rq_in_driver = `0`;
6357	bfqd->hw_tag_samples = `0`;
6358
6359	bfqd->nonrot_with_queueing =
6360	blk_queue_nonrot(bfqd->queue) && bfqd->hw_tag;
6361	}
6362
6363	static void bfq_completed_request(struct bfq_queue bfqq, struct* bfq_data *bfqd)
6364	{
6365	u64 now_ns;
6366	u32 delta_us;
6367
6368	bfq_update_hw_tag(bfqd);
6369
6370	bfqd->rq_in_driver[bfqq->actuator_idx]--;
6371	bfqd->tot_rq_in_driver--;
6372	bfqq->dispatched--;
6373
6374	if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq)) {
6375	/*
6376	* Set budget_timeout (which we overload to store the
6377	* time at which the queue remains with no backlog and
6378	* no outstanding request; used by the weight-raising
6379	* mechanism).
6380	*/
6381	bfqq->budget_timeout = jiffies;
6382
6383	bfq_del_bfqq_in_groups_with_pending_reqs(bfqq);
6384	bfq_weights_tree_remove(bfqq);
6385	}
6386
6387	now_ns = blk_time_get_ns();
6388
6389	bfqq->ttime.last_end_request = now_ns;
6390
6391	/*
6392	* Using us instead of ns, to get a reasonable precision in
6393	* computing rate in next check.
6394	*/
6395	delta_us = div_u64(dividend: now_ns - bfqd->last_completion, NSEC_PER_USEC);
6396
6397	/*
6398	* If the request took rather long to complete, and, according
6399	* to the maximum request size recorded, this completion latency
6400	* implies that the request was certainly served at a very low
6401	* rate (less than 1M sectors/sec), then the whole observation
6402	* interval that lasts up to this time instant cannot be a
6403	* valid time interval for computing a new peak rate. Invoke
6404	* bfq_update_rate_reset to have the following three steps
6405	* taken:
6406	* - close the observation interval at the last (previous)
6407	* request dispatch or completion
6408	* - compute rate, if possible, for that observation interval
6409	* - reset to zero samples, which will trigger a proper
6410	* re-initialization of the observation interval on next
6411	* dispatch
6412	*/
6413	if (delta_us > BFQ_MIN_TT/NSEC_PER_USEC &&
6414	(bfqd->last_rq_max_size<<BFQ_RATE_SHIFT)/delta_us <
6415	`1UL`<<(BFQ_RATE_SHIFT - `10`))
6416	bfq_update_rate_reset(bfqd, NULL);
6417	bfqd->last_completion = now_ns;
6418	/*
6419	* Shared queues are likely to receive I/O at a high
6420	* rate. This may deceptively let them be considered as wakers
6421	* of other queues. But a false waker will unjustly steal
6422	* bandwidth to its supposedly woken queue. So considering
6423	* also shared queues in the waking mechanism may cause more
6424	* control troubles than throughput benefits. Then reset
6425	* last_completed_rq_bfqq if bfqq is a shared queue.
6426	*/
6427	if (!bfq_bfqq_coop(bfqq))
6428	bfqd->last_completed_rq_bfqq = bfqq;
6429	else
6430	bfqd->last_completed_rq_bfqq = NULL;
6431
6432	/*
6433	* If we are waiting to discover whether the request pattern
6434	* of the task associated with the queue is actually
6435	* isochronous, and both requisites for this condition to hold
6436	* are now satisfied, then compute soft_rt_next_start (see the
6437	* comments on the function bfq_bfqq_softrt_next_start()). We
6438	* do not compute soft_rt_next_start if bfqq is in interactive
6439	* weight raising (see the comments in bfq_bfqq_expire() for
6440	* an explanation). We schedule this delayed update when bfqq
6441	* expires, if it still has in-flight requests.
6442	*/
6443	if (bfq_bfqq_softrt_update(bfqq) && bfqq->dispatched == `0` &&
6444	RB_EMPTY_ROOT(&bfqq->sort_list) &&
6445	bfqq->wr_coeff != bfqd->bfq_wr_coeff)
6446	bfqq->soft_rt_next_start =
6447	bfq_bfqq_softrt_next_start(bfqd, bfqq);
6448
6449	/*
6450	* If this is the in-service queue, check if it needs to be expired,
6451	* or if we want to idle in case it has no pending requests.
6452	*/
6453	if (bfqd->in_service_queue == bfqq) {
6454	if (bfq_bfqq_must_idle(bfqq)) {
6455	if (bfqq->dispatched == `0`)
6456	bfq_arm_slice_timer(bfqd);
6457	/*
6458	* If we get here, we do not expire bfqq, even
6459	* if bfqq was in budget timeout or had no
6460	* more requests (as controlled in the next
6461	* conditional instructions). The reason for
6462	* not expiring bfqq is as follows.
6463	*
6464	* Here bfqq->dispatched > 0 holds, but
6465	* bfq_bfqq_must_idle() returned true. This
6466	* implies that, even if no request arrives
6467	* for bfqq before bfqq->dispatched reaches 0,
6468	* bfqq will, however, not be expired on the
6469	* completion event that causes bfqq->dispatch
6470	* to reach zero. In contrast, on this event,
6471	* bfqq will start enjoying device idling
6472	* (I/O-dispatch plugging).
6473	*
6474	* But, if we expired bfqq here, bfqq would
6475	* not have the chance to enjoy device idling
6476	* when bfqq->dispatched finally reaches
6477	* zero. This would expose bfqq to violation
6478	* of its reserved service guarantees.
6479	*/
6480	return;
6481	} else if (bfq_may_expire_for_budg_timeout(bfqq))
6482	bfq_bfqq_expire(bfqd, bfqq, compensate: false,
6483	reason: BFQQE_BUDGET_TIMEOUT);
6484	else if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
6485	(bfqq->dispatched == `0` \|\|
6486	!bfq_better_to_idle(bfqq)))
6487	bfq_bfqq_expire(bfqd, bfqq, compensate: false,
6488	reason: BFQQE_NO_MORE_REQUESTS);
6489	}
6490
6491	if (!bfqd->tot_rq_in_driver)
6492	bfq_schedule_dispatch(bfqd);
6493	}
6494
6495	/*
6496	* The processes associated with bfqq may happen to generate their
6497	* cumulative I/O at a lower rate than the rate at which the device
6498	* could serve the same I/O. This is rather probable, e.g., if only
6499	* one process is associated with bfqq and the device is an SSD. It
6500	* results in bfqq becoming often empty while in service. In this
6501	* respect, if BFQ is allowed to switch to another queue when bfqq
6502	* remains empty, then the device goes on being fed with I/O requests,
6503	* and the throughput is not affected. In contrast, if BFQ is not
6504	* allowed to switch to another queue---because bfqq is sync and
6505	* I/O-dispatch needs to be plugged while bfqq is temporarily
6506	* empty---then, during the service of bfqq, there will be frequent
6507	* "service holes", i.e., time intervals during which bfqq gets empty
6508	* and the device can only consume the I/O already queued in its
6509	* hardware queues. During service holes, the device may even get to
6510	* remaining idle. In the end, during the service of bfqq, the device
6511	* is driven at a lower speed than the one it can reach with the kind
6512	* of I/O flowing through bfqq.
6513	*
6514	* To counter this loss of throughput, BFQ implements a "request
6515	* injection mechanism", which tries to fill the above service holes
6516	* with I/O requests taken from other queues. The hard part in this
6517	* mechanism is finding the right amount of I/O to inject, so as to
6518	* both boost throughput and not break bfqq's bandwidth and latency
6519	* guarantees. In this respect, the mechanism maintains a per-queue
6520	* inject limit, computed as below. While bfqq is empty, the injection
6521	* mechanism dispatches extra I/O requests only until the total number
6522	* of I/O requests in flight---i.e., already dispatched but not yet
6523	* completed---remains lower than this limit.
6524	*
6525	* A first definition comes in handy to introduce the algorithm by
6526	* which the inject limit is computed. We define as first request for
6527	* bfqq, an I/O request for bfqq that arrives while bfqq is in
6528	* service, and causes bfqq to switch from empty to non-empty. The
6529	* algorithm updates the limit as a function of the effect of
6530	* injection on the service times of only the first requests of
6531	* bfqq. The reason for this restriction is that these are the
6532	* requests whose service time is affected most, because they are the
6533	* first to arrive after injection possibly occurred.
6534	*
6535	* To evaluate the effect of injection, the algorithm measures the
6536	* "total service time" of first requests. We define as total service
6537	* time of an I/O request, the time that elapses since when the
6538	* request is enqueued into bfqq, to when it is completed. This
6539	* quantity allows the whole effect of injection to be measured. It is
6540	* easy to see why. Suppose that some requests of other queues are
6541	* actually injected while bfqq is empty, and that a new request R
6542	* then arrives for bfqq. If the device does start to serve all or
6543	* part of the injected requests during the service hole, then,
6544	* because of this extra service, it may delay the next invocation of
6545	* the dispatch hook of BFQ. Then, even after R gets eventually
6546	* dispatched, the device may delay the actual service of R if it is
6547	* still busy serving the extra requests, or if it decides to serve,
6548	* before R, some extra request still present in its queues. As a
6549	* conclusion, the cumulative extra delay caused by injection can be
6550	* easily evaluated by just comparing the total service time of first
6551	* requests with and without injection.
6552	*
6553	* The limit-update algorithm works as follows. On the arrival of a
6554	* first request of bfqq, the algorithm measures the total time of the
6555	* request only if one of the three cases below holds, and, for each
6556	* case, it updates the limit as described below:
6557	*
6558	* (1) If there is no in-flight request. This gives a baseline for the
6559	* total service time of the requests of bfqq. If the baseline has
6560	* not been computed yet, then, after computing it, the limit is
6561	* set to 1, to start boosting throughput, and to prepare the
6562	* ground for the next case. If the baseline has already been
6563	* computed, then it is updated, in case it results to be lower
6564	* than the previous value.
6565	*
6566	* (2) If the limit is higher than 0 and there are in-flight
6567	* requests. By comparing the total service time in this case with
6568	* the above baseline, it is possible to know at which extent the
6569	* current value of the limit is inflating the total service
6570	* time. If the inflation is below a certain threshold, then bfqq
6571	* is assumed to be suffering from no perceivable loss of its
6572	* service guarantees, and the limit is even tentatively
6573	* increased. If the inflation is above the threshold, then the
6574	* limit is decreased. Due to the lack of any hysteresis, this
6575	* logic makes the limit oscillate even in steady workload
6576	* conditions. Yet we opted for it, because it is fast in reaching
6577	* the best value for the limit, as a function of the current I/O
6578	* workload. To reduce oscillations, this step is disabled for a
6579	* short time interval after the limit happens to be decreased.
6580	*
6581	* (3) Periodically, after resetting the limit, to make sure that the
6582	* limit eventually drops in case the workload changes. This is
6583	* needed because, after the limit has gone safely up for a
6584	* certain workload, it is impossible to guess whether the
6585	* baseline total service time may have changed, without measuring
6586	* it again without injection. A more effective version of this
6587	* step might be to just sample the baseline, by interrupting
6588	* injection only once, and then to reset/lower the limit only if
6589	* the total service time with the current limit does happen to be
6590	* too large.
6591	*
6592	* More details on each step are provided in the comments on the
6593	* pieces of code that implement these steps: the branch handling the
6594	* transition from empty to non empty in bfq_add_request(), the branch
6595	* handling injection in bfq_select_queue(), and the function
6596	* bfq_choose_bfqq_for_injection(). These comments also explain some
6597	* exceptions, made by the injection mechanism in some special cases.
6598	*/
6599	static void bfq_update_inject_limit(struct bfq_data *bfqd,
6600	struct bfq_queue *bfqq)
6601	{
6602	u64 tot_time_ns = blk_time_get_ns() - bfqd->last_empty_occupied_ns;
6603	unsigned int old_limit = bfqq->inject_limit;
6604
6605	if (bfqq->last_serv_time_ns > `0` && bfqd->rqs_injected) {
6606	u64 threshold = (bfqq->last_serv_time_ns * `3`)>>`1`;
6607
6608	if (tot_time_ns >= threshold && old_limit > `0`) {
6609	bfqq->inject_limit--;
6610	bfqq->decrease_time_jif = jiffies;
6611	} else if (tot_time_ns < threshold &&
6612	old_limit <= bfqd->max_rq_in_driver)
6613	bfqq->inject_limit++;
6614	}
6615
6616	/*
6617	* Either we still have to compute the base value for the
6618	* total service time, and there seem to be the right
6619	* conditions to do it, or we can lower the last base value
6620	* computed.
6621	*
6622	* NOTE: (bfqd->tot_rq_in_driver == 1) means that there is no I/O
6623	* request in flight, because this function is in the code
6624	* path that handles the completion of a request of bfqq, and,
6625	* in particular, this function is executed before
6626	* bfqd->tot_rq_in_driver is decremented in such a code path.
6627	*/
6628	if ((bfqq->last_serv_time_ns == `0` && bfqd->tot_rq_in_driver == `1`) \|\|
6629	tot_time_ns < bfqq->last_serv_time_ns) {
6630	if (bfqq->last_serv_time_ns == `0`) {
6631	/*
6632	* Now we certainly have a base value: make sure we
6633	* start trying injection.
6634	*/
6635	bfqq->inject_limit = max_t(unsigned int, `1`, old_limit);
6636	}
6637	bfqq->last_serv_time_ns = tot_time_ns;
6638	} else if (!bfqd->rqs_injected && bfqd->tot_rq_in_driver == `1`)
6639	/*
6640	* No I/O injected and no request still in service in
6641	* the drive: these are the exact conditions for
6642	* computing the base value of the total service time
6643	* for bfqq. So let's update this value, because it is
6644	* rather variable. For example, it varies if the size
6645	* or the spatial locality of the I/O requests in bfqq
6646	* change.
6647	*/
6648	bfqq->last_serv_time_ns = tot_time_ns;
6649
6650
6651	/ update complete, not waiting for any request completion any longer /
6652	bfqd->waited_rq = NULL;
6653	bfqd->rqs_injected = false;
6654	}
6655
6656	/*
6657	* Handle either a requeue or a finish for rq. The things to do are
6658	* the same in both cases: all references to rq are to be dropped. In
6659	* particular, rq is considered completed from the point of view of
6660	* the scheduler.
6661	*/
6662	static void bfq_finish_requeue_request(struct request *rq)
6663	{
6664	struct bfq_queue *bfqq = RQ_BFQQ(rq);
6665	struct bfq_data *bfqd;
6666	unsigned long flags;
6667
6668	/*
6669	* rq either is not associated with any icq, or is an already
6670	* requeued request that has not (yet) been re-inserted into
6671	* a bfq_queue.
6672	*/
6673	if (!rq->elv.icq \|\| !bfqq)
6674	return;
6675
6676	bfqd = bfqq->bfqd;
6677
6678	if (rq->rq_flags & RQF_STARTED)
6679	bfqg_stats_update_completion(bfqg: bfqq_group(bfqq),
6680	start_time_ns: rq->start_time_ns,
6681	io_start_time_ns: rq->io_start_time_ns,
6682	opf: rq->cmd_flags);
6683
6684	spin_lock_irqsave(&bfqd->lock, flags);
6685	if (likely(rq->rq_flags & RQF_STARTED)) {
6686	if (rq == bfqd->waited_rq)
6687	bfq_update_inject_limit(bfqd, bfqq);
6688
6689	bfq_completed_request(bfqq, bfqd);
6690	}
6691	bfqq_request_freed(bfqq);
6692	bfq_put_queue(bfqq);
6693	RQ_BIC(rq)->requests--;
6694	spin_unlock_irqrestore(lock: &bfqd->lock, flags);
6695
6696	/*
6697	* Reset private fields. In case of a requeue, this allows
6698	* this function to correctly do nothing if it is spuriously
6699	* invoked again on this same request (see the check at the
6700	* beginning of the function). Probably, a better general
6701	* design would be to prevent blk-mq from invoking the requeue
6702	* or finish hooks of an elevator, for a request that is not
6703	* referred by that elevator.
6704	*
6705	* Resetting the following fields would break the
6706	* request-insertion logic if rq is re-inserted into a bfq
6707	* internal queue, without a re-preparation. Here we assume
6708	* that re-insertions of requeued requests, without
6709	* re-preparation, can happen only for pass_through or at_head
6710	* requests (which are not re-inserted into bfq internal
6711	* queues).
6712	*/
6713	rq->elv.priv[`0`] = NULL;
6714	rq->elv.priv[`1`] = NULL;
6715	}
6716
6717	static void bfq_finish_request(struct request *rq)
6718	{
6719	bfq_finish_requeue_request(rq);
6720
6721	if (rq->elv.icq) {
6722	put_io_context(ioc: rq->elv.icq->ioc);
6723	rq->elv.icq = NULL;
6724	}
6725	}
6726
6727	/*
6728	* Removes the association between the current task and bfqq, assuming
6729	* that bic points to the bfq iocontext of the task.
6730	* Returns NULL if a new bfqq should be allocated, or the old bfqq if this
6731	* was the last process referring to that bfqq.
6732	*/
6733	static struct bfq_queue *
6734	bfq_split_bfqq(struct bfq_io_cq bic, struct* bfq_queue *bfqq)
6735	{
6736	bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue");
6737
6738	if (bfqq_process_refs(bfqq) == `1` && !bfqq->new_bfqq) {
6739	bfqq->pid = current->pid;
6740	bfq_clear_bfqq_coop(bfqq);
6741	bfq_clear_bfqq_split_coop(bfqq);
6742	return bfqq;
6743	}
6744
6745	bic_set_bfqq(bic, NULL, is_sync: true, actuator_idx: bfqq->actuator_idx);
6746
6747	bfq_put_cooperator(bfqq);
6748
6749	bfq_release_process_ref(bfqd: bfqq->bfqd, bfqq);
6750	return NULL;
6751	}
6752
6753	static struct bfq_queue *
6754	__bfq_get_bfqq_handle_split(struct bfq_data bfqd, struct* bfq_io_cq *bic,
6755	struct bio *bio, bool split, bool is_sync,
6756	bool *new_queue)
6757	{
6758	unsigned int act_idx = bfq_actuator_index(bfqd, bio);
6759	struct bfq_queue *bfqq = bic_to_bfqq(bic, is_sync, actuator_idx: act_idx);
6760	struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data[act_idx];
6761
6762	if (likely(bfqq && bfqq != &bfqd->oom_bfqq))
6763	return bfqq;
6764
6765	if (new_queue)
6766	*new_queue = true;
6767
6768	if (bfqq)
6769	bfq_put_queue(bfqq);
6770	bfqq = bfq_get_queue(bfqd, bio, is_sync, bic, respawn: split);
6771
6772	bic_set_bfqq(bic, bfqq, is_sync, actuator_idx: act_idx);
6773	if (split && is_sync) {
6774	if ((bfqq_data->was_in_burst_list && bfqd->large_burst) \|\|
6775	bfqq_data->saved_in_large_burst)
6776	bfq_mark_bfqq_in_large_burst(bfqq);
6777	else {
6778	bfq_clear_bfqq_in_large_burst(bfqq);
6779	if (bfqq_data->was_in_burst_list)
6780	/*
6781	* If bfqq was in the current
6782	* burst list before being
6783	* merged, then we have to add
6784	* it back. And we do not need
6785	* to increase burst_size, as
6786	* we did not decrement
6787	* burst_size when we removed
6788	* bfqq from the burst list as
6789	* a consequence of a merge
6790	* (see comments in
6791	* bfq_put_queue). In this
6792	* respect, it would be rather
6793	* costly to know whether the
6794	* current burst list is still
6795	* the same burst list from
6796	* which bfqq was removed on
6797	* the merge. To avoid this
6798	* cost, if bfqq was in a
6799	* burst list, then we add
6800	* bfqq to the current burst
6801	* list without any further
6802	* check. This can cause
6803	* inappropriate insertions,
6804	* but rarely enough to not
6805	* harm the detection of large
6806	* bursts significantly.
6807	*/
6808	hlist_add_head(n: &bfqq->burst_list_node,
6809	h: &bfqd->burst_list);
6810	}
6811	bfqq->split_time = jiffies;
6812	}
6813
6814	return bfqq;
6815	}
6816
6817	/*
6818	* Only reset private fields. The actual request preparation will be
6819	* performed by bfq_init_rq, when rq is either inserted or merged. See
6820	* comments on bfq_init_rq for the reason behind this delayed
6821	* preparation.
6822	*/
6823	static void bfq_prepare_request(struct request *rq)
6824	{
6825	rq->elv.icq = ioc_find_get_icq(q: rq->q);
6826
6827	/*
6828	* Regardless of whether we have an icq attached, we have to
6829	* clear the scheduler pointers, as they might point to
6830	* previously allocated bic/bfqq structs.
6831	*/
6832	rq->elv.priv[`0`] = rq->elv.priv[`1`] = NULL;
6833	}
6834
6835	static struct bfq_queue bfq_waker_bfqq(struct* bfq_queue *bfqq)
6836	{
6837	struct bfq_queue *new_bfqq = bfqq->new_bfqq;
6838	struct bfq_queue *waker_bfqq = bfqq->waker_bfqq;
6839
6840	if (!waker_bfqq)
6841	return NULL;
6842
6843	while (new_bfqq) {
6844	if (new_bfqq == waker_bfqq) {
6845	/*
6846	* If waker_bfqq is in the merge chain, and current
6847	* is the only process, waker_bfqq can be freed.
6848	*/
6849	if (bfqq_process_refs(bfqq: waker_bfqq) == `1`)
6850	return NULL;
6851
6852	return waker_bfqq;
6853	}
6854
6855	new_bfqq = new_bfqq->new_bfqq;
6856	}
6857
6858	/*
6859	* If waker_bfqq is not in the merge chain, and it's procress reference
6860	* is 0, waker_bfqq can be freed.
6861	*/
6862	if (bfqq_process_refs(bfqq: waker_bfqq) == `0`)
6863	return NULL;
6864
6865	return waker_bfqq;
6866	}
6867
6868	static struct bfq_queue bfq_get_bfqq_handle_split(struct* bfq_data *bfqd,
6869	struct bfq_io_cq *bic,
6870	struct bio *bio,
6871	unsigned int idx,
6872	bool is_sync)
6873	{
6874	struct bfq_queue *waker_bfqq;
6875	struct bfq_queue *bfqq;
6876	bool new_queue = false;
6877
6878	bfqq = __bfq_get_bfqq_handle_split(bfqd, bic, bio, split: false, is_sync,
6879	new_queue: &new_queue);
6880	if (unlikely(new_queue))
6881	return bfqq;
6882
6883	/ If the queue was seeky for too long, break it apart. /
6884	if (!bfq_bfqq_coop(bfqq) \|\| !bfq_bfqq_split_coop(bfqq) \|\|
6885	bic->bfqq_data[idx].stably_merged)
6886	return bfqq;
6887
6888	waker_bfqq = bfq_waker_bfqq(bfqq);
6889
6890	/ Update bic before losing reference to bfqq /
6891	if (bfq_bfqq_in_large_burst(bfqq))
6892	bic->bfqq_data[idx].saved_in_large_burst = true;
6893
6894	bfqq = bfq_split_bfqq(bic, bfqq);
6895	if (bfqq) {
6896	bfq_bfqq_resume_state(bfqq, bfqd, bic, bfq_already_existing: true);
6897	return bfqq;
6898	}
6899
6900	bfqq = __bfq_get_bfqq_handle_split(bfqd, bic, bio, split: true, is_sync, NULL);
6901	if (unlikely(bfqq == &bfqd->oom_bfqq))
6902	return bfqq;
6903
6904	bfq_bfqq_resume_state(bfqq, bfqd, bic, bfq_already_existing: false);
6905	bfqq->waker_bfqq = waker_bfqq;
6906	bfqq->tentative_waker_bfqq = NULL;
6907
6908	/*
6909	* If the waker queue disappears, then new_bfqq->waker_bfqq must be
6910	* reset. So insert new_bfqq into the
6911	* woken_list of the waker. See
6912	* bfq_check_waker for details.
6913	*/
6914	if (waker_bfqq)
6915	hlist_add_head(n: &bfqq->woken_list_node,
6916	h: &bfqq->waker_bfqq->woken_list);
6917
6918	return bfqq;
6919	}
6920
6921	/*
6922	* If needed, init rq, allocate bfq data structures associated with
6923	* rq, and increment reference counters in the destination bfq_queue
6924	* for rq. Return the destination bfq_queue for rq, or NULL is rq is
6925	* not associated with any bfq_queue.
6926	*
6927	* This function is invoked by the functions that perform rq insertion
6928	* or merging. One may have expected the above preparation operations
6929	* to be performed in bfq_prepare_request, and not delayed to when rq
6930	* is inserted or merged. The rationale behind this delayed
6931	* preparation is that, after the prepare_request hook is invoked for
6932	* rq, rq may still be transformed into a request with no icq, i.e., a
6933	* request not associated with any queue. No bfq hook is invoked to
6934	* signal this transformation. As a consequence, should these
6935	* preparation operations be performed when the prepare_request hook
6936	* is invoked, and should rq be transformed one moment later, bfq
6937	* would end up in an inconsistent state, because it would have
6938	* incremented some queue counters for an rq destined to
6939	* transformation, without any chance to correctly lower these
6940	* counters back. In contrast, no transformation can still happen for
6941	* rq after rq has been inserted or merged. So, it is safe to execute
6942	* these preparation operations when rq is finally inserted or merged.
6943	*/
6944	static struct bfq_queue bfq_init_rq(struct* request *rq)
6945	{
6946	struct request_queue *q = rq->q;
6947	struct bio *bio = rq->bio;
6948	struct bfq_data *bfqd = q->elevator->elevator_data;
6949	struct bfq_io_cq *bic;
6950	const int is_sync = rq_is_sync(rq);
6951	struct bfq_queue *bfqq;
6952	unsigned int a_idx = bfq_actuator_index(bfqd, bio);
6953
6954	if (unlikely(!rq->elv.icq))
6955	return NULL;
6956
6957	/*
6958	* Assuming that RQ_BFQQ(rq) is set only if everything is set
6959	* for this rq. This holds true, because this function is
6960	* invoked only for insertion or merging, and, after such
6961	* events, a request cannot be manipulated any longer before
6962	* being removed from bfq.
6963	*/
6964	if (RQ_BFQQ(rq))
6965	return RQ_BFQQ(rq);
6966
6967	bic = icq_to_bic(icq: rq->elv.icq);
6968	bfq_check_ioprio_change(bic, bio);
6969	bfq_bic_update_cgroup(bic, bio);
6970	bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio, idx: a_idx, is_sync);
6971
6972	bfqq_request_allocated(bfqq);
6973	bfqq->ref++;
6974	bic->requests++;
6975	bfq_log_bfqq(bfqd, bfqq, "get_request %p: bfqq %p, %d",
6976	rq, bfqq, bfqq->ref);
6977
6978	rq->elv.priv[`0`] = bic;
6979	rq->elv.priv[`1`] = bfqq;
6980
6981	/*
6982	* If a bfq_queue has only one process reference, it is owned
6983	* by only this bic: we can then set bfqq->bic = bic. in
6984	* addition, if the queue has also just been split, we have to
6985	* resume its state.
6986	*/
6987	if (likely(bfqq != &bfqd->oom_bfqq) && !bfqq->new_bfqq &&
6988	bfqq_process_refs(bfqq) == `1`)
6989	bfqq->bic = bic;
6990
6991	/*
6992	* Consider bfqq as possibly belonging to a burst of newly
6993	* created queues only if:
6994	* 1) A burst is actually happening (bfqd->burst_size > 0)
6995	* or
6996	* 2) There is no other active queue. In fact, if, in
6997	* contrast, there are active queues not belonging to the
6998	* possible burst bfqq may belong to, then there is no gain
6999	* in considering bfqq as belonging to a burst, and
7000	* therefore in not weight-raising bfqq. See comments on
7001	* bfq_handle_burst().
7002	*
7003	* This filtering also helps eliminating false positives,
7004	* occurring when bfqq does not belong to an actual large
7005	* burst, but some background task (e.g., a service) happens
7006	* to trigger the creation of new queues very close to when
7007	* bfqq and its possible companion queues are created. See
7008	* comments on bfq_handle_burst() for further details also on
7009	* this issue.
7010	*/
7011	if (unlikely(bfq_bfqq_just_created(bfqq) &&
7012	(bfqd->burst_size > `0` \|\|
7013	bfq_tot_busy_queues(bfqd) == `0`)))
7014	bfq_handle_burst(bfqd, bfqq);
7015
7016	return bfqq;
7017	}
7018
7019	static void
7020	bfq_idle_slice_timer_body(struct bfq_data bfqd, struct* bfq_queue *bfqq)
7021	{
7022	enum bfqq_expiration reason;
7023	unsigned long flags;
7024
7025	spin_lock_irqsave(&bfqd->lock, flags);
7026
7027	/*
7028	* Considering that bfqq may be in race, we should firstly check
7029	* whether bfqq is in service before doing something on it. If
7030	* the bfqq in race is not in service, it has already been expired
7031	* through __bfq_bfqq_expire func and its wait_request flags has
7032	* been cleared in __bfq_bfqd_reset_in_service func.
7033	*/
7034	if (bfqq != bfqd->in_service_queue) {
7035	spin_unlock_irqrestore(lock: &bfqd->lock, flags);
7036	return;
7037	}
7038
7039	bfq_clear_bfqq_wait_request(bfqq);
7040
7041	if (bfq_bfqq_budget_timeout(bfqq))
7042	/*
7043	* Also here the queue can be safely expired
7044	* for budget timeout without wasting
7045	* guarantees
7046	*/
7047	reason = BFQQE_BUDGET_TIMEOUT;
7048	else if (bfqq->queued[`0`] == `0` && bfqq->queued[`1`] == `0`)
7049	/*
7050	* The queue may not be empty upon timer expiration,
7051	* because we may not disable the timer when the
7052	* first request of the in-service queue arrives
7053	* during disk idling.
7054	*/
7055	reason = BFQQE_TOO_IDLE;
7056	else
7057	goto schedule_dispatch;
7058
7059	bfq_bfqq_expire(bfqd, bfqq, compensate: true, reason);
7060
7061	schedule_dispatch:
7062	bfq_schedule_dispatch(bfqd);
7063	spin_unlock_irqrestore(lock: &bfqd->lock, flags);
7064	}
7065
7066	/*
7067	* Handler of the expiration of the timer running if the in-service queue
7068	* is idling inside its time slice.
7069	*/
7070	static enum hrtimer_restart bfq_idle_slice_timer(struct hrtimer *timer)
7071	{
7072	struct bfq_data bfqd = container_of(timer, struct* bfq_data,
7073	idle_slice_timer);
7074	struct bfq_queue *bfqq = bfqd->in_service_queue;
7075
7076	/*
7077	* Theoretical race here: the in-service queue can be NULL or
7078	* different from the queue that was idling if a new request
7079	* arrives for the current queue and there is a full dispatch
7080	* cycle that changes the in-service queue. This can hardly
7081	* happen, but in the worst case we just expire a queue too
7082	* early.
7083	*/
7084	if (bfqq)
7085	bfq_idle_slice_timer_body(bfqd, bfqq);
7086
7087	return HRTIMER_NORESTART;
7088	}
7089
7090	static void __bfq_put_async_bfqq(struct bfq_data *bfqd,
7091	struct bfq_queue **bfqq_ptr)
7092	{
7093	struct bfq_queue bfqq = bfqq_ptr;
7094
7095	bfq_log(bfqd, "put_async_bfqq: %p", bfqq);
7096	if (bfqq) {
7097	bfq_bfqq_move(bfqd, bfqq, bfqg: bfqd->root_group);
7098
7099	bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d",
7100	bfqq, bfqq->ref);
7101	bfq_put_queue(bfqq);
7102	*bfqq_ptr = NULL;
7103	}
7104	}
7105
7106	/*
7107	* Release all the bfqg references to its async queues. If we are
7108	* deallocating the group these queues may still contain requests, so
7109	* we reparent them to the root cgroup (i.e., the only one that will
7110	* exist for sure until all the requests on a device are gone).
7111	*/
7112	void bfq_put_async_queues(struct bfq_data bfqd, struct* bfq_group *bfqg)
7113	{
7114	int i, j, k;
7115
7116	for (k = `0`; k < bfqd->num_actuators; k++) {
7117	for (i = `0`; i < `2`; i++)
7118	for (j = `0`; j < IOPRIO_NR_LEVELS; j++)
7119	__bfq_put_async_bfqq(bfqd, bfqq_ptr: &bfqg->async_bfqq[i][j][k]);
7120
7121	__bfq_put_async_bfqq(bfqd, bfqq_ptr: &bfqg->async_idle_bfqq[k]);
7122	}
7123	}
7124
7125	/*
7126	* See the comments on bfq_limit_depth for the purpose of
7127	* the depths set in the function. Return minimum shallow depth we'll use.
7128	*/
7129	static void bfq_update_depths(struct bfq_data bfqd, struct* sbitmap_queue *bt)
7130	{
7131	unsigned int depth = `1U` << bt->sb.shift;
7132
7133	bfqd->full_depth_shift = bt->sb.shift;
7134	/*
7135	* In-word depths if no bfq_queue is being weight-raised:
7136	* leaving 25% of tags only for sync reads.
7137	*
7138	* In next formulas, right-shift the value
7139	* (1U<<bt->sb.shift), instead of computing directly
7140	* (1U<<(bt->sb.shift - something)), to be robust against
7141	* any possible value of bt->sb.shift, without having to
7142	* limit 'something'.
7143	*/
7144	/ no more than 50% of tags for async I/O /
7145	bfqd->word_depths[`0`][`0`] = max(depth >> `1`, `1U`);
7146	/*
7147	* no more than 75% of tags for sync writes (25% extra tags
7148	* w.r.t. async I/O, to prevent async I/O from starving sync
7149	* writes)
7150	*/
7151	bfqd->word_depths[`0`][`1`] = max((depth * `3`) >> `2`, `1U`);
7152
7153	/*
7154	* In-word depths in case some bfq_queue is being weight-
7155	* raised: leaving ~63% of tags for sync reads. This is the
7156	* highest percentage for which, in our tests, application
7157	* start-up times didn't suffer from any regression due to tag
7158	* shortage.
7159	*/
7160	/ no more than ~18% of tags for async I/O /
7161	bfqd->word_depths[`1`][`0`] = max((depth * `3`) >> `4`, `1U`);
7162	/ no more than ~37% of tags for sync writes (~20% extra tags) /
7163	bfqd->word_depths[`1`][`1`] = max((depth * `6`) >> `4`, `1U`);
7164	}
7165
7166	static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
7167	{
7168	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
7169	struct blk_mq_tags *tags = hctx->sched_tags;
7170
7171	bfq_update_depths(bfqd, bt: &tags->bitmap_tags);
7172	sbitmap_queue_min_shallow_depth(sbq: &tags->bitmap_tags, min_shallow_depth: `1`);
7173	}
7174
7175	static int bfq_init_hctx(struct blk_mq_hw_ctx hctx, unsigned* int index)
7176	{
7177	bfq_depth_updated(hctx);
7178	return `0`;
7179	}
7180
7181	static void bfq_exit_queue(struct elevator_queue *e)
7182	{
7183	struct bfq_data *bfqd = e->elevator_data;
7184	struct bfq_queue bfqq, n;
7185	unsigned int actuator;
7186
7187	hrtimer_cancel(timer: &bfqd->idle_slice_timer);
7188
7189	spin_lock_irq(lock: &bfqd->lock);
7190	list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
7191	bfq_deactivate_bfqq(bfqd, bfqq, ins_into_idle_tree: false, expiration: false);
7192	spin_unlock_irq(lock: &bfqd->lock);
7193
7194	for (actuator = `0`; actuator < bfqd->num_actuators; actuator++)
7195	WARN_ON_ONCE(bfqd->rq_in_driver[actuator]);
7196	WARN_ON_ONCE(bfqd->tot_rq_in_driver);
7197
7198	hrtimer_cancel(timer: &bfqd->idle_slice_timer);
7199
7200	/ release oom-queue reference to root group /
7201	bfqg_and_blkg_put(bfqg: bfqd->root_group);
7202
7203	#ifdef CONFIG_BFQ_GROUP_IOSCHED
7204	blkcg_deactivate_policy(disk: bfqd->queue->disk, pol: &blkcg_policy_bfq);
7205	#else
7206	spin_lock_irq(&bfqd->lock);
7207	bfq_put_async_queues(bfqd, bfqd->root_group);
7208	kfree(bfqd->root_group);
7209	spin_unlock_irq(&bfqd->lock);
7210	#endif
7211
7212	blk_stat_disable_accounting(q: bfqd->queue);
7213	blk_queue_flag_clear(flag: QUEUE_FLAG_DISABLE_WBT_DEF, q: bfqd->queue);
7214	set_bit(ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT, addr: &e->flags);
7215
7216	kfree(objp: bfqd);
7217	}
7218
7219	static void bfq_init_root_group(struct bfq_group *root_group,
7220	struct bfq_data *bfqd)
7221	{
7222	int i;
7223
7224	#ifdef CONFIG_BFQ_GROUP_IOSCHED
7225	root_group->entity.parent = NULL;
7226	root_group->my_entity = NULL;
7227	root_group->bfqd = bfqd;
7228	#endif
7229	root_group->rq_pos_tree = RB_ROOT;
7230	for (i = `0`; i < BFQ_IOPRIO_CLASSES; i++)
7231	root_group->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
7232	root_group->sched_data.bfq_class_idle_last_service = jiffies;
7233	}
7234
7235	static int bfq_init_queue(struct request_queue q, struct* elevator_type *e)
7236	{
7237	struct bfq_data *bfqd;
7238	struct elevator_queue *eq;
7239	unsigned int i;
7240	struct blk_independent_access_ranges *ia_ranges = q->disk->ia_ranges;
7241
7242	eq = elevator_alloc(q, e);
7243	if (!eq)
7244	return -ENOMEM;
7245
7246	bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
7247	if (!bfqd) {
7248	kobject_put(kobj: &eq->kobj);
7249	return -ENOMEM;
7250	}
7251	eq->elevator_data = bfqd;
7252
7253	spin_lock_irq(lock: &q->queue_lock);
7254	q->elevator = eq;
7255	spin_unlock_irq(lock: &q->queue_lock);
7256
7257	/*
7258	* Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues.
7259	* Grab a permanent reference to it, so that the normal code flow
7260	* will not attempt to free it.
7261	* Set zero as actuator index: we will pretend that
7262	* all I/O requests are for the same actuator.
7263	*/
7264	bfq_init_bfqq(bfqd, bfqq: &bfqd->oom_bfqq, NULL, pid: `1`, is_sync: `0`, act_idx: `0`);
7265	bfqd->oom_bfqq.ref++;
7266	bfqd->oom_bfqq.new_ioprio = BFQ_DEFAULT_QUEUE_IOPRIO;
7267	bfqd->oom_bfqq.new_ioprio_class = IOPRIO_CLASS_BE;
7268	bfqd->oom_bfqq.entity.new_weight =
7269	bfq_ioprio_to_weight(ioprio: bfqd->oom_bfqq.new_ioprio);
7270
7271	/ oom_bfqq does not participate to bursts /
7272	bfq_clear_bfqq_just_created(bfqq: &bfqd->oom_bfqq);
7273
7274	/*
7275	* Trigger weight initialization, according to ioprio, at the
7276	* oom_bfqq's first activation. The oom_bfqq's ioprio and ioprio
7277	* class won't be changed any more.
7278	*/
7279	bfqd->oom_bfqq.entity.prio_changed = `1`;
7280
7281	bfqd->queue = q;
7282
7283	bfqd->num_actuators = `1`;
7284	/*
7285	* If the disk supports multiple actuators, copy independent
7286	* access ranges from the request queue structure.
7287	*/
7288	spin_lock_irq(lock: &q->queue_lock);
7289	if (ia_ranges) {
7290	/*
7291	* Check if the disk ia_ranges size exceeds the current bfq
7292	* actuator limit.
7293	*/
7294	if (ia_ranges->nr_ia_ranges > BFQ_MAX_ACTUATORS) {
7295	pr_crit("nr_ia_ranges higher than act limit: iars=%d, max=%d.\n",
7296	ia_ranges->nr_ia_ranges, BFQ_MAX_ACTUATORS);
7297	pr_crit("Falling back to single actuator mode.\n");
7298	} else {
7299	bfqd->num_actuators = ia_ranges->nr_ia_ranges;
7300
7301	for (i = `0`; i < bfqd->num_actuators; i++) {
7302	bfqd->sector[i] = ia_ranges->ia_range[i].sector;
7303	bfqd->nr_sectors[i] =
7304	ia_ranges->ia_range[i].nr_sectors;
7305	}
7306	}
7307	}
7308
7309	/ Otherwise use single-actuator dev info /
7310	if (bfqd->num_actuators == `1`) {
7311	bfqd->sector[`0`] = `0`;
7312	bfqd->nr_sectors[`0`] = get_capacity(disk: q->disk);
7313	}
7314	spin_unlock_irq(lock: &q->queue_lock);
7315
7316	INIT_LIST_HEAD(list: &bfqd->dispatch);
7317
7318	hrtimer_setup(timer: &bfqd->idle_slice_timer, function: bfq_idle_slice_timer, CLOCK_MONOTONIC,
7319	mode: HRTIMER_MODE_REL);
7320
7321	bfqd->queue_weights_tree = RB_ROOT_CACHED;
7322	#ifdef CONFIG_BFQ_GROUP_IOSCHED
7323	bfqd->num_groups_with_pending_reqs = `0`;
7324	#endif
7325
7326	INIT_LIST_HEAD(list: &bfqd->active_list[`0`]);
7327	INIT_LIST_HEAD(list: &bfqd->active_list[`1`]);
7328	INIT_LIST_HEAD(list: &bfqd->idle_list);
7329	INIT_HLIST_HEAD(&bfqd->burst_list);
7330
7331	bfqd->hw_tag = -`1`;
7332	bfqd->nonrot_with_queueing = blk_queue_nonrot(bfqd->queue);
7333
7334	bfqd->bfq_max_budget = bfq_default_max_budget;
7335
7336	bfqd->bfq_fifo_expire[`0`] = bfq_fifo_expire[`0`];
7337	bfqd->bfq_fifo_expire[`1`] = bfq_fifo_expire[`1`];
7338	bfqd->bfq_back_max = bfq_back_max;
7339	bfqd->bfq_back_penalty = bfq_back_penalty;
7340	bfqd->bfq_slice_idle = bfq_slice_idle;
7341	bfqd->bfq_timeout = bfq_timeout;
7342
7343	bfqd->bfq_large_burst_thresh = `8`;
7344	bfqd->bfq_burst_interval = msecs_to_jiffies(m: `180`);
7345
7346	bfqd->low_latency = true;
7347
7348	/*
7349	* Trade-off between responsiveness and fairness.
7350	*/
7351	bfqd->bfq_wr_coeff = `30`;
7352	bfqd->bfq_wr_rt_max_time = msecs_to_jiffies(m: `300`);
7353	bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(m: `2000`);
7354	bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(m: `500`);
7355	bfqd->bfq_wr_max_softrt_rate = `7000`; /*
7356	* Approximate rate required
7357	* to playback or record a
7358	* high-definition compressed
7359	* video.
7360	*/
7361	bfqd->wr_busy_queues = `0`;
7362
7363	/*
7364	* Begin by assuming, optimistically, that the device peak
7365	* rate is equal to 2/3 of the highest reference rate.
7366	*/
7367	bfqd->rate_dur_prod = ref_rate[blk_queue_nonrot(bfqd->queue)] *
7368	ref_wr_duration[blk_queue_nonrot(bfqd->queue)];
7369	bfqd->peak_rate = ref_rate[blk_queue_nonrot(bfqd->queue)] * `2` / `3`;
7370
7371	/ see comments on the definition of next field inside bfq_data /
7372	bfqd->actuator_load_threshold = `4`;
7373
7374	spin_lock_init(&bfqd->lock);
7375
7376	/*
7377	* The invocation of the next bfq_create_group_hierarchy
7378	* function is the head of a chain of function calls
7379	* (bfq_create_group_hierarchy->blkcg_activate_policy->
7380	* blk_mq_freeze_queue) that may lead to the invocation of the
7381	* has_work hook function. For this reason,
7382	* bfq_create_group_hierarchy is invoked only after all
7383	* scheduler data has been initialized, apart from the fields
7384	* that can be initialized only after invoking
7385	* bfq_create_group_hierarchy. This, in particular, enables
7386	* has_work to correctly return false. Of course, to avoid
7387	* other inconsistencies, the blk-mq stack must then refrain
7388	* from invoking further scheduler hooks before this init
7389	* function is finished.
7390	*/
7391	bfqd->root_group = bfq_create_group_hierarchy(bfqd, node: q->node);
7392	if (!bfqd->root_group)
7393	goto out_free;
7394	bfq_init_root_group(root_group: bfqd->root_group, bfqd);
7395	bfq_init_entity(entity: &bfqd->oom_bfqq.entity, bfqg: bfqd->root_group);
7396
7397	/ We dispatch from request queue wide instead of hw queue /
7398	blk_queue_flag_set(flag: QUEUE_FLAG_SQ_SCHED, q);
7399
7400	blk_queue_flag_set(flag: QUEUE_FLAG_DISABLE_WBT_DEF, q);
7401	wbt_disable_default(disk: q->disk);
7402	blk_stat_enable_accounting(q);
7403
7404	return `0`;
7405
7406	out_free:
7407	kfree(objp: bfqd);
7408	kobject_put(kobj: &eq->kobj);
7409	return -ENOMEM;
7410	}
7411
7412	static void bfq_slab_kill(void)
7413	{
7414	kmem_cache_destroy(s: bfq_pool);
7415	}
7416
7417	static int __init bfq_slab_setup(void)
7418	{
7419	bfq_pool = KMEM_CACHE(bfq_queue, `0`);
7420	if (!bfq_pool)
7421	return -ENOMEM;
7422	return `0`;
7423	}
7424
7425	static ssize_t bfq_var_show(unsigned int var, char *page)
7426	{
7427	return sprintf(buf: page, fmt: "%u\n", var);
7428	}
7429
7430	static int bfq_var_store(unsigned long var, const* char *page)
7431	{
7432	unsigned long new_val;
7433	int ret = kstrtoul(s: page, base: `10`, res: &new_val);
7434
7435	if (ret)
7436	return ret;
7437	*var = new_val;
7438	return `0`;
7439	}
7440
7441	#define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \
7442	static ssize_t __FUNC(struct elevator_queue e, char page) \
7443	{ \
7444	struct bfq_data *bfqd = e->elevator_data; \
7445	u64 __data = __VAR; \
7446	if (__CONV == 1) \
7447	__data = jiffies_to_msecs(__data); \
7448	else if (__CONV == 2) \
7449	__data = div_u64(__data, NSEC_PER_MSEC); \
7450	return bfq_var_show(__data, (page)); \
7451	}
7452	SHOW_FUNCTION(bfq_fifo_expire_sync_show, bfqd->bfq_fifo_expire[`1`], `2`);
7453	SHOW_FUNCTION(bfq_fifo_expire_async_show, bfqd->bfq_fifo_expire[`0`], `2`);
7454	SHOW_FUNCTION(bfq_back_seek_max_show, bfqd->bfq_back_max, `0`);
7455	SHOW_FUNCTION(bfq_back_seek_penalty_show, bfqd->bfq_back_penalty, `0`);
7456	SHOW_FUNCTION(bfq_slice_idle_show, bfqd->bfq_slice_idle, `2`);
7457	SHOW_FUNCTION(bfq_max_budget_show, bfqd->bfq_user_max_budget, `0`);
7458	SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout, `1`);
7459	SHOW_FUNCTION(bfq_strict_guarantees_show, bfqd->strict_guarantees, `0`);
7460	SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, `0`);
7461	#undef SHOW_FUNCTION
7462
7463	#define USEC_SHOW_FUNCTION(__FUNC, __VAR) \
7464	static ssize_t __FUNC(struct elevator_queue e, char page) \
7465	{ \
7466	struct bfq_data *bfqd = e->elevator_data; \
7467	u64 __data = __VAR; \
7468	__data = div_u64(__data, NSEC_PER_USEC); \
7469	return bfq_var_show(__data, (page)); \
7470	}
7471	USEC_SHOW_FUNCTION(bfq_slice_idle_us_show, bfqd->bfq_slice_idle);
7472	#undef USEC_SHOW_FUNCTION
7473
7474	#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
7475	static ssize_t \
7476	__FUNC(struct elevator_queue e, const char page, size_t count) \
7477	{ \
7478	struct bfq_data *bfqd = e->elevator_data; \
7479	unsigned long __data, __min = (MIN), __max = (MAX); \
7480	int ret; \
7481	\
7482	ret = bfq_var_store(&__data, (page)); \
7483	if (ret) \
7484	return ret; \
7485	if (__data < __min) \
7486	__data = __min; \
7487	else if (__data > __max) \
7488	__data = __max; \
7489	if (__CONV == 1) \
7490	*(__PTR) = msecs_to_jiffies(__data); \
7491	else if (__CONV == 2) \
7492	(__PTR) = (u64)__data NSEC_PER_MSEC; \
7493	else \
7494	*(__PTR) = __data; \
7495	return count; \
7496	}
7497	STORE_FUNCTION(bfq_fifo_expire_sync_store, &bfqd->bfq_fifo_expire[`1`], `1`,
7498	INT_MAX, `2`);
7499	STORE_FUNCTION(bfq_fifo_expire_async_store, &bfqd->bfq_fifo_expire[`0`], `1`,
7500	INT_MAX, `2`);
7501	STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, `0`, INT_MAX, `0`);
7502	STORE_FUNCTION(bfq_back_seek_penalty_store, &bfqd->bfq_back_penalty, `1`,
7503	INT_MAX, `0`);
7504	STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, `0`, INT_MAX, `2`);
7505	#undef STORE_FUNCTION
7506
7507	#define USEC_STORE_FUNCTION(__FUNC, __PTR, MIN, MAX) \
7508	static ssize_t __FUNC(struct elevator_queue e, const char page, size_t count)\
7509	{ \
7510	struct bfq_data *bfqd = e->elevator_data; \
7511	unsigned long __data, __min = (MIN), __max = (MAX); \
7512	int ret; \
7513	\
7514	ret = bfq_var_store(&__data, (page)); \
7515	if (ret) \
7516	return ret; \
7517	if (__data < __min) \
7518	__data = __min; \
7519	else if (__data > __max) \
7520	__data = __max; \
7521	(__PTR) = (u64)__data NSEC_PER_USEC; \
7522	return count; \
7523	}
7524	USEC_STORE_FUNCTION(bfq_slice_idle_us_store, &bfqd->bfq_slice_idle, `0`,
7525	UINT_MAX);
7526	#undef USEC_STORE_FUNCTION
7527
7528	static ssize_t bfq_max_budget_store(struct elevator_queue *e,
7529	const char *page, size_t count)
7530	{
7531	struct bfq_data *bfqd = e->elevator_data;
7532	unsigned long __data;
7533	int ret;
7534
7535	ret = bfq_var_store(var: &__data, (page));
7536	if (ret)
7537	return ret;
7538
7539	if (__data == `0`)
7540	bfqd->bfq_max_budget = bfq_calc_max_budget(bfqd);
7541	else {
7542	if (__data > INT_MAX)
7543	__data = INT_MAX;
7544	bfqd->bfq_max_budget = __data;
7545	}
7546
7547	bfqd->bfq_user_max_budget = __data;
7548
7549	return count;
7550	}
7551
7552	/*
7553	* Leaving this name to preserve name compatibility with cfq
7554	* parameters, but this timeout is used for both sync and async.
7555	*/
7556	static ssize_t bfq_timeout_sync_store(struct elevator_queue *e,
7557	const char *page, size_t count)
7558	{
7559	struct bfq_data *bfqd = e->elevator_data;
7560	unsigned long __data;
7561	int ret;
7562
7563	ret = bfq_var_store(var: &__data, (page));
7564	if (ret)
7565	return ret;
7566
7567	if (__data < `1`)
7568	__data = `1`;
7569	else if (__data > INT_MAX)
7570	__data = INT_MAX;
7571
7572	bfqd->bfq_timeout = msecs_to_jiffies(m: __data);
7573	if (bfqd->bfq_user_max_budget == `0`)
7574	bfqd->bfq_max_budget = bfq_calc_max_budget(bfqd);
7575
7576	return count;
7577	}
7578
7579	static ssize_t bfq_strict_guarantees_store(struct elevator_queue *e,
7580	const char *page, size_t count)
7581	{
7582	struct bfq_data *bfqd = e->elevator_data;
7583	unsigned long __data;
7584	int ret;
7585
7586	ret = bfq_var_store(var: &__data, (page));
7587	if (ret)
7588	return ret;
7589
7590	if (__data > `1`)
7591	__data = `1`;
7592	if (!bfqd->strict_guarantees && __data == `1`
7593	&& bfqd->bfq_slice_idle < `8` * NSEC_PER_MSEC)
7594	bfqd->bfq_slice_idle = `8` * NSEC_PER_MSEC;
7595
7596	bfqd->strict_guarantees = __data;
7597
7598	return count;
7599	}
7600
7601	static ssize_t bfq_low_latency_store(struct elevator_queue *e,
7602	const char *page, size_t count)
7603	{
7604	struct bfq_data *bfqd = e->elevator_data;
7605	unsigned long __data;
7606	int ret;
7607
7608	ret = bfq_var_store(var: &__data, (page));
7609	if (ret)
7610	return ret;
7611
7612	if (__data > `1`)
7613	__data = `1`;
7614	if (__data == `0` && bfqd->low_latency != `0`)
7615	bfq_end_wr(bfqd);
7616	bfqd->low_latency = __data;
7617
7618	return count;
7619	}
7620
7621	#define BFQ_ATTR(name) \
7622	__ATTR(name, 0644, bfq_##name##_show, bfq_##name##_store)
7623
7624	static const struct elv_fs_entry bfq_attrs[] = {
7625	BFQ_ATTR(fifo_expire_sync),
7626	BFQ_ATTR(fifo_expire_async),
7627	BFQ_ATTR(back_seek_max),
7628	BFQ_ATTR(back_seek_penalty),
7629	BFQ_ATTR(slice_idle),
7630	BFQ_ATTR(slice_idle_us),
7631	BFQ_ATTR(max_budget),
7632	BFQ_ATTR(timeout_sync),
7633	BFQ_ATTR(strict_guarantees),
7634	BFQ_ATTR(low_latency),
7635	__ATTR_NULL
7636	};
7637
7638	static struct elevator_type iosched_bfq_mq = {
7639	.ops = {
7640	.limit_depth = bfq_limit_depth,
7641	.prepare_request = bfq_prepare_request,
7642	.requeue_request = bfq_finish_requeue_request,
7643	.finish_request = bfq_finish_request,
7644	.exit_icq = bfq_exit_icq,
7645	.insert_requests = bfq_insert_requests,
7646	.dispatch_request = bfq_dispatch_request,
7647	.next_request = elv_rb_latter_request,
7648	.former_request = elv_rb_former_request,
7649	.allow_merge = bfq_allow_bio_merge,
7650	.bio_merge = bfq_bio_merge,
7651	.request_merge = bfq_request_merge,
7652	.requests_merged = bfq_requests_merged,
7653	.request_merged = bfq_request_merged,
7654	.has_work = bfq_has_work,
7655	.depth_updated = bfq_depth_updated,
7656	.init_hctx = bfq_init_hctx,
7657	.init_sched = bfq_init_queue,
7658	.exit_sched = bfq_exit_queue,
7659	},
7660
7661	.icq_size = sizeof(struct bfq_io_cq),
7662	.icq_align = __alignof__(struct bfq_io_cq),
7663	.elevator_attrs = bfq_attrs,
7664	.elevator_name = "bfq",
7665	.elevator_owner = THIS_MODULE,
7666	};
7667	MODULE_ALIAS("bfq-iosched");
7668
7669	static int __init bfq_init(void)
7670	{
7671	int ret;
7672
7673	#ifdef CONFIG_BFQ_GROUP_IOSCHED
7674	ret = blkcg_policy_register(pol: &blkcg_policy_bfq);
7675	if (ret)
7676	return ret;
7677	#endif
7678
7679	ret = -ENOMEM;
7680	if (bfq_slab_setup())
7681	goto err_pol_unreg;
7682
7683	/*
7684	* Times to load large popular applications for the typical
7685	* systems installed on the reference devices (see the
7686	* comments before the definition of the next
7687	* array). Actually, we use slightly lower values, as the
7688	* estimated peak rate tends to be smaller than the actual
7689	* peak rate. The reason for this last fact is that estimates
7690	* are computed over much shorter time intervals than the long
7691	* intervals typically used for benchmarking. Why? First, to
7692	* adapt more quickly to variations. Second, because an I/O
7693	* scheduler cannot rely on a peak-rate-evaluation workload to
7694	* be run for a long time.
7695	*/
7696	ref_wr_duration[`0`] = msecs_to_jiffies(m: `7000`); / actually 8 sec /
7697	ref_wr_duration[`1`] = msecs_to_jiffies(m: `2500`); / actually 3 sec /
7698
7699	ret = elv_register(&iosched_bfq_mq);
7700	if (ret)
7701	goto slab_kill;
7702
7703	return `0`;
7704
7705	slab_kill:
7706	bfq_slab_kill();
7707	err_pol_unreg:
7708	#ifdef CONFIG_BFQ_GROUP_IOSCHED
7709	blkcg_policy_unregister(pol: &blkcg_policy_bfq);
7710	#endif
7711	return ret;
7712	}
7713
7714	static void __exit bfq_exit(void)
7715	{
7716	elv_unregister(&iosched_bfq_mq);
7717	#ifdef CONFIG_BFQ_GROUP_IOSCHED
7718	blkcg_policy_unregister(pol: &blkcg_policy_bfq);
7719	#endif
7720	bfq_slab_kill();
7721	}
7722
7723	module_init(bfq_init);
7724	module_exit(bfq_exit);
7725
7726	MODULE_AUTHOR("Paolo Valente");
7727	MODULE_LICENSE("GPL");
7728	MODULE_DESCRIPTION("MQ Budget Fair Queueing I/O Scheduler");
7729

Provided by KDAB

Definitions

bfq_fifo_expire
bfq_back_max
bfq_back_penalty
bfq_slice_idle
bfq_stats_min_budgets
bfq_default_max_budget
bfq_async_charge_factor
bfq_timeout
bfq_merge_time_limit
bfq_pool
ref_rate
ref_wr_duration
max_service_from_wr
bfq_activation_stable_merging
bfq_late_stable_merging
bic_to_bfqq
bic_set_bfqq
bic_to_bfqd
icq_to_bic
bfq_bic_lookup
bfq_schedule_dispatch
bfq_choose_req
bfqq_request_over_limit
bfq_limit_depth
bfq_rq_pos_tree_lookup
bfq_too_late_for_merging
bfq_pos_tree_add_move
bfq_asymmetric_scenario
bfq_weights_tree_add
bfq_weights_tree_remove
bfq_check_fifo
bfq_find_next_rq
bfq_serv_to_charge
bfq_updated_next_req
bfq_wr_duration
switch_back_to_interactive_wr
bfq_bfqq_resume_state
bfqq_process_refs
bfq_reset_burst_list
bfq_add_to_burst
bfq_handle_burst
bfq_bfqq_budget_left
bfq_max_budget
bfq_min_budget
bfq_bfqq_update_budg_for_activation
bfq_smallest_from_now
bfq_update_bfqq_wr_on_rq_arrival
bfq_bfqq_idle_for_long_time
bfq_bfqq_higher_class_or_weight
bfq_actuator_index
bfq_bfqq_handle_idle_busy_switch
bfq_reset_inject_limit
bfq_update_io_intensity
bfq_check_waker
bfq_add_request
bfq_find_rq_fmerge
get_sdist
bfq_remove_request
bfq_bio_merge
bfq_request_merge
bfq_request_merged
bfq_requests_merged
bfq_bfqq_end_wr
bfq_end_wr_async_queues
bfq_end_wr
bfq_io_struct_pos
bfq_rq_close_to_sector
bfqq_find_close
bfq_find_close_cooperator
bfq_setup_merge
bfq_may_be_close_cooperator
bfq_setup_stable_merge
bfq_setup_cooperator
bfq_bfqq_save_state
bfq_reassign_last_bfqq
bfq_release_process_ref
bfq_merge_bfqqs
bfq_allow_bio_merge
bfq_set_budget_timeout
__bfq_set_in_service_queue
bfq_set_in_service_queue
bfq_arm_slice_timer
bfq_calc_max_budget
update_thr_responsiveness_params
bfq_reset_rate_computation
bfq_update_rate_reset
bfq_update_peak_rate
bfq_dispatch_remove
idling_needed_for_service_guarantees
__bfq_bfqq_expire
__bfq_bfqq_recalc_budget
bfq_bfqq_is_slow
bfq_bfqq_softrt_next_start
bfq_bfqq_expire
bfq_bfqq_budget_timeout
bfq_may_expire_for_budg_timeout
idling_boosts_thr_without_issues
bfq_better_to_idle
bfq_bfqq_must_idle
bfq_choose_bfqq_for_injection
bfq_find_active_bfqq_for_actuator
bfq_find_bfqq_for_underused_actuator
bfq_select_queue
bfq_update_wr_data
bfq_dispatch_rq_from_bfqq
bfq_has_work
__bfq_dispatch_request
bfq_update_dispatch_stats
bfq_dispatch_request
bfq_put_queue
bfq_put_stable_ref
bfq_put_cooperator
bfq_exit_bfqq
bfq_exit_icq_bfqq
_bfq_exit_icq
bfq_exit_icq
bfq_set_next_ioprio_data
bfq_check_ioprio_change
bfq_init_bfqq
bfq_async_queue_prio
bfq_do_early_stable_merge
bfq_do_or_sched_stable_merge
bfq_get_queue
bfq_update_io_thinktime
bfq_update_io_seektime
bfq_update_has_short_ttime
bfq_rq_enqueued
bfqq_request_allocated
bfqq_request_freed
__bfq_insert_request
bfq_update_insert_stats
bfq_insert_request
bfq_insert_requests
bfq_update_hw_tag
bfq_completed_request
bfq_update_inject_limit
bfq_finish_requeue_request
bfq_finish_request
bfq_split_bfqq
__bfq_get_bfqq_handle_split
bfq_prepare_request
bfq_waker_bfqq
bfq_get_bfqq_handle_split
bfq_init_rq
bfq_idle_slice_timer_body
bfq_idle_slice_timer
__bfq_put_async_bfqq
bfq_put_async_queues
bfq_update_depths
bfq_depth_updated
bfq_init_hctx
bfq_exit_queue
bfq_init_root_group
bfq_init_queue
bfq_slab_kill
bfq_slab_setup
bfq_var_show
bfq_var_store
bfq_fifo_expire_sync_show
bfq_fifo_expire_async_show
bfq_back_seek_max_show
bfq_back_seek_penalty_show
bfq_slice_idle_show
bfq_max_budget_show
bfq_timeout_sync_show
bfq_strict_guarantees_show
bfq_low_latency_show
bfq_slice_idle_us_show
bfq_fifo_expire_sync_store
bfq_fifo_expire_async_store
bfq_back_seek_max_store
bfq_back_seek_penalty_store
bfq_slice_idle_store
bfq_slice_idle_us_store
bfq_max_budget_store
bfq_timeout_sync_store
bfq_strict_guarantees_store
bfq_low_latency_store
bfq_attrs
iosched_bfq_mq
bfq_init

Improve your Profiling and Debugging skills

Find out more

Definitions

source code of linux/block/bfq-iosched.c