//! Cosine function implemented using a parabolic approximation.
//!
//! Note that while this method is inspired by a [Taylor series] approximation,
//! it provides both better performance and accuracy.
//!
//! Most of the other trigonometric functions in this crate are implemented in
//! terms of this approximation.
//!
//! Method described: <https://stackoverflow.com/posts/28050328/revisions>
//!
//! Originally based on this approximation by Nick Capen:
//!
//! <https://web.archive.org/web/20180105200343if_/http://forum.devmaster.net/t/fast-and-accurate-sine-cosine/9648>
//!
//! Post quoted below for posterity as the original site is down:
//!
//! > In some situations you need a good approximation of sine and cosine that
//! > runs at very high performance. One example is to implement dynamic
//! > subdivision of round surfaces, comparable to those in Quake 3. Or to
//! > implement wave motion, in case there are no vertex shaders 2.0 available.
//! >
//! > The standard C `sinf()` and `cosf()` functions are terribly slow and offer
//! > much more precision than what we actually need. What we really want is an
//! > approximation that offers the best compromise between precision and
//! > performance. The most well known approximation method is to use a
//! > [Taylor series] about 0 (also known as a Maclaurin series), which for
//! > sine becomes:
//! >
//! > ```text
//! > x - 1/6 x^3 + 1/120 x^5 - 1/5040 x^7 + ...
//! > ```
//! >
//! > When we plot this we get: taylor.gif. The green line is the real sine,
//! > the red line is the first four terms of the Taylor series. This seems
//! > like an acceptable approximation, but let's have a closer look:
//! > taylor_zoom.gif. It behaves very nicely up till `pi/2`, but after that it
//! > quickly deviates. At pi, it evaluates to -0.075 instead of 0. Using this
//! > for wave simulation will result in jerky motion which is unacceptable.
//! >
//! > We could add another term, which in fact reduces the error significantly,
//! > but this makes the formula pretty lengthy. We already need 7
//! > multiplications and 3 additions for the 4-term version. The Taylor series
//! > just can't give us the precision and performance we're looking for.
//! >
//! > We did however note that we need `sin(pi) = 0`. And there's another thing
//! > we can see from taylor_zoom.gif: this looks very much like a parabola!
//! > So let's try to find the formula of a parabola that matches it as closely
//! > as possible. The generic formula for a parabola is `A + B x + C x²`. So
//! > this gives us three degrees of freedom. Obvious choices are that we want
//! > `sin(0) = 0`, `sin(pi/2) = 1` and `sin(pi) = 0`. This gives us the following
//! > three equations:
//! >
//! > ```text
//! > A + B 0    + C 0²      = 0
//! > A + B pi/2 + C (pi/2)² = 1
//! > A + B pi   + C pi²     = 0
//! > ```
//! >
//! > Which has as solution `A = 0, B = 4/pi, C = -4/pi²`. So our parabola
//! > approximation becomes `4/pi x - 4/pi² x²`. Plotting this we get:
//! > parabola.gif. This looks worse than the 4-term Taylor series, right?
//! > Wrong! The maximum absolute error is 0.056. Furthermore, this
//! > approximation will give us smooth wave motion, and can be calculated
//! > in only 3 multiplications and 1 addition!
//! >
//! > Unfortunately it's not very practical yet. This is what we get in the
//! > `[-pi, pi]` range: negative.gif. It's quite obvious we want at least a
//! > full period. But it's also clear that it's just another parabola,
//! > mirrored around the origin. The formula for it is `4/pi x + 4/pi² x²`.
//! > So the straightforward (pseudo-C) solution is:
//! >
//! > ```text
//! > if(x > 0)
//! > {
//! >     y = 4/pi x - 4/pi² x²;
//! > }
//! > else
//! > {
//! >     y = 4/pi x + 4/pi² x²;
//! > }
//! > ```
//! >
//! > Adding a branch is not a good idea though. It makes the code
//! > significantly slower. But look at how similar the two parts really are.
//! > A subtraction becomes an addition depending on the sign of x. In a naive
//! > first attempt to eliminate the branch we can 'extract' the sign of x
//! > using `x / abs(x)`. The division is very expensive but look at the resulting
//! > formula: `4/pi x - x / abs(x) 4/pi² x²`. By inverting the division we can
//! > simplify this to a very nice and clean `4/pi x - 4/pi² x abs(x)`. So for
//! > just one extra operation we get both halves of our sine approximation!
//! > Here's the graph of this formula confirming the result: abs.gif.
//! >
//! > Now let's look at cosine. Basic trigonometry tells us that
//! > `cos(x) = sin(pi/2 + x)`. Is that all, add pi/2 to x? No, we actually get
//! > the unwanted part of the parabola again: shift_sine.gif. What we need to
//! > do is 'wrap around' when x > pi/2. This can be done by subtracting 2 pi.
//! > So the code becomes:
//! >
//! > ```text
//! > x += pi/2;
//! >
//! > if(x > pi)           // Original x > pi/2
//! > {
//! >     x -= 2 * pi;     // Wrap: cos(x) = cos(x - 2 pi)
//! > }
//! >
//! > y = sin(x);
//! > ```
//! >
//! > Yet another branch. To eliminate it, we can use binary logic, like this:
//! >
//! > ```text
//! > x -= (x > pi) & (2 * pi);
//! > ```
//! >
//! > Note that this isn't valid C code at all. But it should clarify how this
//! > can work. When x > pi is false the & operation zeroes the right hand part
//! > so the subtraction does nothing, which is perfectly equivalent. I'll
//! > leave it as an exercise to the reader to create working C code for this
//! > (or just read on). Clearly the cosine requires a few more operations than
//! > the sine but there doesn't seem to be any other way and it's still
//! > extremely fast.
//! >
//! > Now, a maximum error of 0.056 is nice but clearly the 4-term Taylor series
//! > still has a smaller error on average. Recall what our sine looked like:
//! > abs.gif. So isn't there anything we can do to further increase precision
//! > at minimal cost? Of course the current version is already applicable for
//! > many situations where something that looks like a sine is just as good as
//! > the real sine. But for other situations that's just not good enough.
//! >
//! > Looking at the graphs you'll notice that our approximation always
//! > overestimates the real sine, except at 0, pi/2 and pi. So what we need is
//! > to 'scale it down' without touching these important points. The solution
//! > is to use the squared parabola, which looks like this: squared.gif. Note
//! > how it preserves those important points but it's always lower than the
//! > real sine. So we can use a weighted average of the two to get a better
//! > approximation:
//! >
//! > ```text
//! > Q (4/pi x - 4/pi² x²) + P (4/pi x - 4/pi² x²)²
//! > ```
//! >
//! > With `Q + P = 1`. You can use exact minimization of either the absolute
//! > or relative error but I'll save you the math. The optimal weights are
//! > `Q = 0.775`, `P = 0.225` for the absolute error and `Q = 0.782`,
//! > `P = 0.218` for the relative error. I will use the former. The resulting
//! > graph is: average.gif. Where did the red line go? It's almost entirely
//! > covered by the green line, which instantly shows how good this
//! > approximation really is. The maximum error is about 0.001, a 50x
//! > improvement! The formula looks lengthy but the part between parentheses
//! > is the same value from the parabola, which needs to be computed only
//! > once. In fact only 2 extra multiplications and 2 additions are required
//! > to achieve this precision boost.
//! >
//! > It shouldn't come as a big surprise that to make it work for negative x
//! > as well we need a second abs() operation. The final C code for the sine
//! > becomes:
//! >
//! > ```text
//! > float sin(float x)
//! > {
//! >     const float B = 4/pi;
//! >     const float C = -4/(pi*pi);
//! >
//! >     float y = B * x + C * x * abs(x);
//! >
//! >     #ifdef EXTRA_PRECISION
//! >     //  const float Q = 0.775;
//! >         const float P = 0.225;
//! >
//! >         y = P * (y * abs(y) - y) + y;   // Q * y + P * y * abs(y)
//! >     #endif
//! >
//! >     return y;
//! > }
//! > ```
//! >
//! > Last but not least, I wouldn't be Nick if I didn't include the SIMD
//! > optimized assembly version. It allows us to perform the wrap operation
//! > very efficiently so I'll give you the cosine:
//! >
//! > ```asm
//! > // cos(x) = sin(x + pi/2)
//! > addps xmm0, PI_2
//! > movaps xmm1, xmm0
//! > cmpnltps xmm1, PI
//! > andps xmm1, PIx2
//! > subps xmm0, xmm1
//! >
//! > // Parabola
//! > movaps xmm1, xmm0
//! > andps xmm1, abs
//! > mulps xmm1, xmm0
//! > mulps xmm0, B
//! > mulps xmm1, C
//! > addps xmm0, xmm1
//! >
//! > // Extra precision
//! > movaps xmm1, xmm0
//! > andps xmm1, abs
//! > mulps xmm1, xmm0
//! > subps xmm1, xmm0
//! > mulps xmm1, P
//! > addps xmm0, xmm1
//! > ```
//! >
//! > This code computes four cosines in parallel, resulting in a peak
//! > performance of about 9 clock cycles per cosine for most CPU architectures.
//! > Sines take only 6 clock cycles ideally. The lower precision version can
//! > even run at 3 clock cycles per sine... And let's not forget that all input
//! > between -pi and pi is valid and the formula is exact at -pi, -pi/2, 0,
//! > pi/2 and pi.
//! >
//! > So, the conclusion is don't ever again use a Taylor series to approximate
//! > a sine or cosine! To add a useful discussion to this article, I'd love to
//! > hear if anyone knows good approximations for other transcendental functions
//! > like the exponential, logarithm and power functions.
//!
//! [Taylor series]: https://en.wikipedia.org/wiki/Taylor_series
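The derivation quoted above can be checked numerically. Below is a standalone sketch of the post's final sine formula (the fitted parabola plus the `P = 0.225` refinement); `sin_approx` and `max_error` are hypothetical names for illustration, not part of this crate:

```rust
use std::f32::consts::PI;

// Parabolic sine approximation from the quoted post, valid on [-pi, pi].
// B and C come from fitting A + B x + C x² through sin(0), sin(pi/2) and
// sin(pi); the abs() trick merges the positive and negative half-parabolas.
fn sin_approx(x: f32) -> f32 {
    let b = 4.0 / PI;
    let c = -4.0 / (PI * PI);
    let y = b * x + c * x * x.abs();
    // Extra-precision step: Q * y + P * y * |y| with Q = 0.775, P = 0.225.
    0.225 * (y * y.abs() - y) + y
}

// Scan [-pi, pi] and record the worst deviation from the reference sine.
fn max_error() -> f32 {
    let n = 10_000;
    let mut worst = 0.0f32;
    for i in 0..=n {
        let x = -PI + 2.0 * PI * (i as f32) / (n as f32);
        let err = (sin_approx(x) - x.sin()).abs();
        if err > worst {
            worst = err;
        }
    }
    worst
}
```

Scanning the interval this way confirms the roughly `0.001` maximum error the post claims for the refined formula.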

use super::F32;
use core::f32::consts::FRAC_1_PI;

impl F32 {
    /// Approximates `cos(x)` in radians with a maximum error of `0.002`.
    pub fn cos(self) -> Self {
        let mut x: F32 = self;

        // Convert radians into turns (full rotations), so range reduction
        // becomes a simple `floor`.
        x *= FRAC_1_PI / 2.0;

        // Shift by a quarter turn and wrap into `[-0.5, 0.5)`.
        x -= 0.25 + (x + 0.25).floor().0;

        // Parabolic approximation, exact at multiples of `pi/2`.
        x *= 16.0 * (x.abs() - 0.5);

        // Precision refinement using the `0.225` weighting described above.
        x += 0.225 * x * (x.abs() - 1.0);

        x
    }
}
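For reference, the same steps can be sketched on a plain `f32`; `cos_approx` below is a hypothetical standalone illustration, not the crate's `F32` wrapper:

```rust
use std::f32::consts::FRAC_1_PI;

// Mirror of the steps in `F32::cos` on a plain f32:
// 1. convert radians to turns (full rotations),
// 2. shift by a quarter turn and wrap into [-0.5, 0.5),
// 3. evaluate the parabola, then apply the 0.225 refinement.
fn cos_approx(x: f32) -> f32 {
    let mut x = x * FRAC_1_PI / 2.0;
    x -= 0.25 + (x + 0.25).floor();
    x *= 16.0 * (x.abs() - 0.5);
    x += 0.225 * x * (x.abs() - 1.0);
    x
}
```

Because the reduction wraps the argument into a single turn, any finite input is accepted, not just `[-pi, pi]`.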

#[cfg(test)]
pub(crate) mod tests {
    use super::F32;

    /// Maximum approximation error of `cos(x)`
    pub(crate) const MAX_ERROR: f32 = 0.002;

    /// Cosine test vectors - `(input, output)`
    const TEST_VECTORS: &[(f32, f32)] = &[
        (0.000, 1.000),
        (0.140, 0.990),
        (0.279, 0.961),
        (0.419, 0.914),
        (0.559, 0.848),
        (0.698, 0.766),
        (0.838, 0.669),
        (0.977, 0.559),
        (1.117, 0.438),
        (1.257, 0.309),
        (1.396, 0.174),
        (1.536, 0.035),
        (1.676, -0.105),
        (1.815, -0.242),
        (1.955, -0.375),
        (2.094, -0.500),
        (2.234, -0.616),
        (2.374, -0.719),
        (2.513, -0.809),
        (2.653, -0.883),
        (2.793, -0.940),
        (2.932, -0.978),
        (3.072, -0.998),
        (3.211, -0.998),
        (3.351, -0.978),
        (3.491, -0.940),
        (3.630, -0.883),
        (3.770, -0.809),
        (3.910, -0.719),
        (4.049, -0.616),
        (4.189, -0.500),
        (4.328, -0.375),
        (4.468, -0.242),
        (4.608, -0.105),
        (4.747, 0.035),
        (4.887, 0.174),
        (5.027, 0.309),
        (5.166, 0.438),
        (5.306, 0.559),
        (5.445, 0.669),
        (5.585, 0.766),
        (5.725, 0.848),
        (5.864, 0.914),
        (6.004, 0.961),
        (6.144, 0.990),
        (6.283, 1.000),
    ];

    #[test]
    fn sanity_check() {
        for &(x, expected) in TEST_VECTORS {
            let cos_x = F32(x).cos();
            let delta = (cos_x - expected).abs();

            assert!(
                delta <= MAX_ERROR,
                "delta {} too large: {} vs {}",
                delta,
                cos_x,
                expected
            );
        }
    }
}