On modern machines, looking things up can be slower than recomputing it, when the computation is simple. This is because the memory is much slower than the CPU, which means you can often compute something many times over before the answer from memory arrives.
> I think it is stored like sintable[deg]. The degree is index.
I can think of a few reasons why this is a bad idea.
1. Why would you use degrees? Pretty much everybody uses and wants radians.
2. What are you going to do about fractional degrees? Some sort of interpretation, right?
3. There's only so much cache available, are you willing to spend multiple kilobytes of it every time you want to calculate a sine? If you're imagining doing this in hardware, there are only so many transistors available, are you willing to spend that many thousands of them?
4. If you're keeping a sine table, why not keep one half the size, and then add a cosine table of equal size. That way you can use double and sum angle formulae to get the original range back and pick up cosine along the way. Reflection formulae let you cut it down even further.
There's a certain train of thought that leads from (2).
a. I'm going to be interpreting values anyway
b. How few support points can I get away with?
c. Are there better choices than evenly spaced points?
d. Wait, do I want to limit myself to polynomials?
Following it you get answers "b: just a handful" and "c: oh yeah!" and "d: you can if you want but you don't have to". Then if you do a bunch of thinking you end up with something very much like what everybody else in these two threads have been talking about.
It isnt good idea to store such values in code. I think it is something that computed when a programming environment is booting up. E.g. when you run "python", or install "python".
I try to understand how Math.sin works. There is Math.cos. It is sin +90 degrees. So not all of them is something that completes a big puzzle.
There's no nice way of saying this, and I mean no malice here, but I think you're exceptionally confused or ignorant, and I don't think it would be rewarding for either of us to continue this conversation.
Unless your lookup table is small enough to only use a portion of your L1 cache and you're calling it so much that the lookup table is never evicted :)
Even that is not necessarily needed, I have gotten major speedups from LUTs even as large as 1MB because the lookup distribution was not uniform. Modern CPUs have high cache associativity and faster transfers between L1 and L2.
L1D caches have also gotten bigger -- as big as 128KB. A Deflate/zlib implementation, for instance, can use a brute force full 32K entry LUT for the 15-bit Huffman decoding on some chips, no longer needing the fast small table.
Not just modern machines, the Nintendo64 was memory bound under most circumstances and as such many traditional optimizations (lookup tables, unrolling loops) can be slower on the N64. The unrolling loops case is interesting. Because the cpu has to fetch more instructions this puts more strain on the memory bus.
If curious, On a N64 the graphics chip is also the memory controller so every thing the cpu can do to stay off the memory bus has an additive effect allowing the graphics to do more graphics. This is also why the n64 has weird 9-bit ram, it is so they could use a 18-bit pixel format, only taking two bytes per pixel, for cpu requests the memory controller ignored the 9th bit, presenting a normal 8 bit byte.
They were hoping that by having high speed memory, 250 mHz, the cpu ran at 90mHz, it could provide for everyone and it did ok, there are some very impressive games on the n64. but on most of them the cpu is running fairly light, gotta stay off that memory bus.
The N64 was a particularly unbalanced design for its era so nobody was used to writing code like that yet. Memory bandwidth wasn't a limitation on previous consoles so it's like nobody thought of it.
> This is also why the n64 has weird 9-bit ram, it is so they could use a 18-bit pixel format, only taking two bytes per pixel, for cpu requests the memory controller ignored the 9th bit, presenting a normal 8 bit byte.
The Ensoniq EPS sampler (the first version) used 13-bit RAM for sample memory. Why 13 and not 12? Who knows? Possibly because they wanted it "one louder", possibly because the Big Rival in the E-Mu Emulator series used μ-law codecs which have the same effective dynamic range as 13-bit linear.
Anyway you read a normal 16-bit word using the 68000's normal 16-bit instructions but only the upper 13 were actually valid data for the RAM, the rest were tied low. Haha, no code space for you!
Did you try polynomial preprocessing methods, like Knuth's and Estrin's methods?
https://en.wikipedia.org/wiki/Polynomial_evaluation#Evaluati...
they let you compute polynomials with half the multiplications of Horner's method, and I used them in the past to improve the speed of the exponential function in Boost.
Cool, although more ILP (instruction-level parallelism) might not necessarily be better on a modern GPU, which doesn't have much ILP, if any (instead it uses those resources to execute several threads in parallel).
That might explain why the original Cg (a GPU programming language) code did not use Estrin's, since at least the code in the post does add 1 extra op (squaring `abs_x`).
(AMD GPUs used to use VLIW (very long instruction word) which is "static" ILP).
The compiler can substitute the value how it sees fit. It's like #define, but type-safe and scoped.
Maybe it's folded into expressions, propagated through constant expressions, or used it in contexts that require compile-time constants (template parameters, array sizes, static_assert, other constexpr expressions).
I mean, not in this case of pi/2, where it's more about announcing semantics, but in general those are the purposes and uses.
If you just want the optimizer to be able to constant-fold a value, then yes, either of those will work.
If you want to be able to use the value in the other contexts the parent mentioned that require constant expressions as a language rule, then you generally need constexpr. As an exception, non-constexpr variable values can be used if they’re const (not ‘happens to not vary’) and have integer or enum type (no floats, structs, pointers, etc.). This exception exists for legacy reasons and there’s no particular reason to rely on it unless you’re aiming for compatibility with older versions of C++ or C.
Even if you don’t need to use a variable in those contexts, constexpr evaluation is different from optimizer constant evaluation, and generally better if you can use it. In particular, the optimizer will give up if an expression is too hard to evaluate (depending on implementation-specific heuristics), whereas constexpr will either succeed or give an error (depending only on language rules). It’s also a completely separate code path in the compiler. There are some cases where optimizer constant evaluation can do things constexpr can’t, but most of those have been removed or ameliorated in recent C++ standards.
So it’s often an improvement to tag anything you want to be evaluated at compile time as constexpr, and rarely worse. However, if an expression is so trivial that it’s obvious the optimizer will be able to evaluate it, and you don’t need it in contexts that require a constant expression, then there’s no concrete benefit either way and it becomes a matter of taste. Personally, I wouldn’t tag this particular pi/2 variable constexpr or const, because it does satisfy those criteria and I personally prefer brevity. But I understand why some people prefer a rule of “always constexpr if possible”, either because they like the explicitness or because it’s a simpler rule.
I'd like something like this in C or C++ quite honestly.
Something like a struct that I can say "this struct is global to the whole program and everyone can see it, but once this function exits those values are locked in". Maybe something like that one function is allowed to unlock and update it, but nowhere else.
Think in terms of storing a bunch of precomputed coefficients that are based on the samplerate of a system, where you really only need to set it up once on startup and it is unlikely to change during the application's running lifetime.
I feel like there probably is a way to do this, and if I was good at high level languages like C I'd know what it is. If you know, tell me what I'm not understanding ;-)
It is required to be evaluated at compile time, and it's const.
An optimizing compiler might see through a non-constexpr declaration like 'double a0 = ...' or it might not. Constexpr is somewhat more explicit, especially with more complicated initializer expressions.
Clearly India history has a lot to say about mathematics but I don't think they get enough attention. Or am I just ignorant and living in my own bubble? Indian philosophy is also very intriguing.
Regarding attention, anything that does not draw directly from the Greek mainline, does not get much attention in the mainstream.
Lot of interesting mathematics was done by Indians, Persians, Arabs, Mayans.
Indian mathematics has an additional layer of obscurity. Very little was written down and when it was it was written down in picturesque and poetic verses (as a mnemonic device) that used a lot of symbolism and imagery. For the number one they will mention the Sun, for two the moon and so on, these mappings would also change from work to work, chapter to chapter. So one needs a lot of context to understand what a document is saying.
I revisited that article and ... now I have no idea. Maybe I stumbled into some other trig-related article and came back here. Or maybe this one had some A/B content going on?
The only thing I remember at this point is that I copied and pasted that sentence (I didn't type it.) Even search doesn't find the sentence anywhere but HN.
A related interesting fact is that small angular motions compose almost like vectors, order does not matter (i.e. they are commutative). This makes differential kinematics easier to deal with when dealing with polar or cylindrical coordinate systems.
Large angular deflections while being linear transforms, do not in general commute.
It will spoil the linear relation in your elegant expression, but a slightly better approximation for cos for small θ is
The coefficients given are indeed a near-optimal cubic minimax approximation for (π/2 - arcsin(x))/sqrt(1-x) on [0,1]. But those coefficients aren't actually optimal for approximating arcsin(x) itself.
For reference, the coefficients given are [1.5707288, -0.2121144, 0.0742610, -0.0187293]: if we optimize P(x) = (π/2 - arcsin(x))/sqrt(1-x) ourselves, we can extend them to double precision as [1.5707288189560218, -0.21211524058527342, 0.0742623449400704, -0.018729868776598532]. Increasing the precision reduces the max error, at x = 0, by 0.028%.
Adjusting our error function to optimize the absolute error of arcsin(x) = π/2 - P(x)*sqrt(1-x) on [0,1], we get the coefficients [1.5707583404833712, -0.2128751841625164, 0.07689738736091772, -0.02089203710669022]. The max error is reduced by 44%, from 6.75e-5 to 3.80e-5. If we plot the error function [0], we see that the new max error is achieved at five points, x = 0, 0.105, 0.386, 0.730, 0.967.
(Alternatively, adjusting our error function to optimize the relative error of arcsin(x), we get the coefficients [1.5707963267948966, -0.21441792645252514, 0.08365774237116316, -0.02732304481232744]. The max absolute error is 2.24e-4, but the max relative error is now 0.0181%, even in the vicinity of the root at x = 0. Though we'd almost certainly want to use a different formula to avoid catastrophic cancellation.)
So it goes to show, we can nearly double our accuracy, without modifying the code, just by optimizing for the right error metric.
Actually, we can improve this a bit further, by also adjusting the "π/2" constant in arcsin(x) = π/2 - P(x)*sqrt(1-x). We take coefficients [1.5707256467180715, -0.21298179775496026, 0.07727939759417458, -0.02132102849918157] for P(x), then take arcsin(x) = 1.570760986756484 - P(x)*sqrt(1-x). This reduces the max error by 6.97%, from 3.80e-5 to 3.53e-5.
Adjusting the "1" upward in sqrt(1-x) does not seem to help at all.
Just looking at the formula in the code (and the book it came from), we see that the approximation is of form arcsin(x) = π/2 - P(x)*sqrt(1-x). It is called a minimax solution in both, and the simplest form of minimax optimization is for polynomials. So we look at P(x) = (π/2 - arcsin(x))/sqrt(1-x): plotting out its error function with the original coefficients, it has the clear equioscillations that you'd expect from an optimized polynomial (i.e., each local peak has the exact same height, which is the max error).
It makes zero sense to measure performance without measuring correctness. Especially, when you use LLM. Here is even faster asin(): `return 0;`
This precisions should be measured in avg and worst ULP, not in "charts". A good approximation should also give exact results in critical points (-1/0/-1 in this case).
The "faster" version gives this:
asin(0) = 6.75268e-05 (double precision)
Which gives around 5e+15 ULPs, while common libc math implementations targets 19 ULPs (but will be 0 for asin(0)).
fatih-erikli-cg | a day ago
xyzzyz | a day ago
fatih-erikli-cg | a day ago
Sin/cos must be borders of sqrt(x²+y²). It is also cached indeed.
groundzeros2015 | a day ago
We can compute these things using iteration or polynomial approximations (sufficient for 64 bit).
fatih-erikli-cg | 20 hours ago
BigTTYGothGF | 21 hours ago
This doesn't make a ton of sense.
fatih-erikli-cg | 20 hours ago
I think it is stored like sintable[deg]. The degree is index.
BigTTYGothGF | 18 hours ago
In some way vaguely like this: https://github.com/jeremybarnes/cephes/blob/master/cmath/sin...
> I think it is stored like sintable[deg]. The degree is index.
I can think of a few reasons why this is a bad idea.
1. Why would you use degrees? Pretty much everybody uses and wants radians.
2. What are you going to do about fractional degrees? Some sort of interpretation, right?
3. There's only so much cache available, are you willing to spend multiple kilobytes of it every time you want to calculate a sine? If you're imagining doing this in hardware, there are only so many transistors available, are you willing to spend that many thousands of them?
4. If you're keeping a sine table, why not keep one half the size, and then add a cosine table of equal size. That way you can use double and sum angle formulae to get the original range back and pick up cosine along the way. Reflection formulae let you cut it down even further.
There's a certain train of thought that leads from (2).
a. I'm going to be interpreting values anyway
b. How few support points can I get away with?
c. Are there better choices than evenly spaced points?
d. Wait, do I want to limit myself to polynomials?
Following it you get answers "b: just a handful" and "c: oh yeah!" and "d: you can if you want but you don't have to". Then if you do a bunch of thinking you end up with something very much like what everybody else in these two threads have been talking about.
fatih-erikli-cg | 17 hours ago
I try to understand how Math.sin works. There is Math.cos. It is sin +90 degrees. So not all of them is something that completes a big puzzle.
BigTTYGothGF | an hour ago
connicpu | a day ago
ack_complete | 19 hours ago
L1D caches have also gotten bigger -- as big as 128KB. A Deflate/zlib implementation, for instance, can use a brute force full 32K entry LUT for the 15-bit Huffman decoding on some chips, no longer needing the fast small table.
fph | 17 hours ago
somat | 22 hours ago
If curious, On a N64 the graphics chip is also the memory controller so every thing the cpu can do to stay off the memory bus has an additive effect allowing the graphics to do more graphics. This is also why the n64 has weird 9-bit ram, it is so they could use a 18-bit pixel format, only taking two bytes per pixel, for cpu requests the memory controller ignored the 9th bit, presenting a normal 8 bit byte.
They were hoping that by having high speed memory, 250 mHz, the cpu ran at 90mHz, it could provide for everyone and it did ok, there are some very impressive games on the n64. but on most of them the cpu is running fairly light, gotta stay off that memory bus.
https://www.youtube.com/watch?v=xFKFoGiGlXQ (Kaze Emanuar: Finding the BEST sine function for Nintendo 64)
gzread | 22 hours ago
ErroneousBosh | 20 hours ago
The Ensoniq EPS sampler (the first version) used 13-bit RAM for sample memory. Why 13 and not 12? Who knows? Possibly because they wanted it "one louder", possibly because the Big Rival in the E-Mu Emulator series used μ-law codecs which have the same effective dynamic range as 13-bit linear.
Anyway you read a normal 16-bit word using the 68000's normal 16-bit instructions but only the upper 13 were actually valid data for the RAM, the rest were tied low. Haha, no code space for you!
fhdkweig | a day ago
[OP] def-pri-pub | a day ago
thomasahle | a day ago
kzrdude | a day ago
thomasahle | 23 hours ago
fph | 21 hours ago
jaen | 23 hours ago
That might explain why the original Cg (a GPU programming language) code did not use Estrin's, since at least the code in the post does add 1 extra op (squaring `abs_x`).
(AMD GPUs used to use VLIW (very long instruction word) which is "static" ILP).
jonasenordin | 23 hours ago
fancy_pantser | 23 hours ago
Maybe it's folded into expressions, propagated through constant expressions, or used it in contexts that require compile-time constants (template parameters, array sizes, static_assert, other constexpr expressions).
I mean, not in this case of pi/2, where it's more about announcing semantics, but in general those are the purposes and uses.
gzread | 22 hours ago
comex | 18 hours ago
If you just want the optimizer to be able to constant-fold a value, then yes, either of those will work.
If you want to be able to use the value in the other contexts the parent mentioned that require constant expressions as a language rule, then you generally need constexpr. As an exception, non-constexpr variable values can be used if they’re const (not ‘happens to not vary’) and have integer or enum type (no floats, structs, pointers, etc.). This exception exists for legacy reasons and there’s no particular reason to rely on it unless you’re aiming for compatibility with older versions of C++ or C.
Even if you don’t need to use a variable in those contexts, constexpr evaluation is different from optimizer constant evaluation, and generally better if you can use it. In particular, the optimizer will give up if an expression is too hard to evaluate (depending on implementation-specific heuristics), whereas constexpr will either succeed or give an error (depending only on language rules). It’s also a completely separate code path in the compiler. There are some cases where optimizer constant evaluation can do things constexpr can’t, but most of those have been removed or ameliorated in recent C++ standards.
So it’s often an improvement to tag anything you want to be evaluated at compile time as constexpr, and rarely worse. However, if an expression is so trivial that it’s obvious the optimizer will be able to evaluate it, and you don’t need it in contexts that require a constant expression, then there’s no concrete benefit either way and it becomes a matter of taste. Personally, I wouldn’t tag this particular pi/2 variable constexpr or const, because it does satisfy those criteria and I personally prefer brevity. But I understand why some people prefer a rule of “always constexpr if possible”, either because they like the explicitness or because it’s a simpler rule.
ErroneousBosh | 20 hours ago
Something like a struct that I can say "this struct is global to the whole program and everyone can see it, but once this function exits those values are locked in". Maybe something like that one function is allowed to unlock and update it, but nowhere else.
Think in terms of storing a bunch of precomputed coefficients that are based on the samplerate of a system, where you really only need to set it up once on startup and it is unlikely to change during the application's running lifetime.
I feel like there probably is a way to do this, and if I was good at high level languages like C I'd know what it is. If you know, tell me what I'm not understanding ;-)
loeg | 22 hours ago
An optimizing compiler might see through a non-constexpr declaration like 'double a0 = ...' or it might not. Constexpr is somewhat more explicit, especially with more complicated initializer expressions.
xigoi | 17 hours ago
srean | 22 hours ago
This tradition came to its own as Madhava school of mathematics from Kerala. https://en.wikipedia.org/wiki/Kerala_school_of_astronomy_and...
Note the approximation is for 0 < x < 1. For the range [-1, 0] Bhaskara used symmetry.
If I remember correctly, Aryabhatta had derived a rational approximation about a hundred years before this.
EDIT https://doi.org/10.4169/math.mag.84.2.098
anon-3988 | 10 hours ago
mcraiha | 8 hours ago
srean | 5 hours ago
Lot of interesting mathematics was done by Indians, Persians, Arabs, Mayans.
Indian mathematics has an additional layer of obscurity. Very little was written down and when it was it was written down in picturesque and poetic verses (as a mnemonic device) that used a lot of symbolism and imagery. For the number one they will mention the Sun, for two the moon and so on, these mappings would also change from work to work, chapter to chapter. So one needs a lot of context to understand what a document is saying.
ashdnazg | 22 hours ago
jagged-chisel | 22 hours ago
That’s quite subjective. I happen to find trigonometry to be elegant and true.
I also agree that trigonometric functions lack efficiency in software.
[OP] def-pri-pub | 22 hours ago
Where did that come from in the article?
jagged-chisel | 21 hours ago
The only thing I remember at this point is that I copied and pasted that sentence (I didn't type it.) Even search doesn't find the sentence anywhere but HN.
FEELmyAGI | 21 hours ago
jagged-chisel | 21 hours ago
coldcity_again | 21 hours ago
Let α represent a roll rotation, and β a pitch rotation.
Let R(α) be:
Let R(β) be: Combine them: But! For small α and β, just approximate: So now: [1]https://news.ycombinator.com/item?id=47348192coldcity_again | 21 hours ago
This is a great technique for cheaply doing 3D starfields etc on 8-bit machines.
Look ma, no sine table!
srean | 21 hours ago
Large angular deflections while being linear transforms, do not in general commute.
It will spoil the linear relation in your elegant expression, but a slightly better approximation for cos for small θ is
coldcity_again | 15 hours ago
LegionMammal978 | 17 hours ago
For reference, the coefficients given are [1.5707288, -0.2121144, 0.0742610, -0.0187293]: if we optimize P(x) = (π/2 - arcsin(x))/sqrt(1-x) ourselves, we can extend them to double precision as [1.5707288189560218, -0.21211524058527342, 0.0742623449400704, -0.018729868776598532]. Increasing the precision reduces the max error, at x = 0, by 0.028%.
Adjusting our error function to optimize the absolute error of arcsin(x) = π/2 - P(x)*sqrt(1-x) on [0,1], we get the coefficients [1.5707583404833712, -0.2128751841625164, 0.07689738736091772, -0.02089203710669022]. The max error is reduced by 44%, from 6.75e-5 to 3.80e-5. If we plot the error function [0], we see that the new max error is achieved at five points, x = 0, 0.105, 0.386, 0.730, 0.967.
(Alternatively, adjusting our error function to optimize the relative error of arcsin(x), we get the coefficients [1.5707963267948966, -0.21441792645252514, 0.08365774237116316, -0.02732304481232744]. The max absolute error is 2.24e-4, but the max relative error is now 0.0181%, even in the vicinity of the root at x = 0. Though we'd almost certainly want to use a different formula to avoid catastrophic cancellation.)
So it goes to show, we can nearly double our accuracy, without modifying the code, just by optimizing for the right error metric.
[0] https://www.desmos.com/calculator/nj3b8rpvbs
LegionMammal978 | 10 hours ago
Adjusting the "1" upward in sqrt(1-x) does not seem to help at all.
kzrdude | 6 hours ago
LegionMammal978 | an hour ago
3836293648 | 16 hours ago
Lockal | 8 hours ago
This precisions should be measured in avg and worst ULP, not in "charts". A good approximation should also give exact results in critical points (-1/0/-1 in this case).
The "faster" version gives this:
Which gives around 5e+15 ULPs, while common libc math implementations targets 19 ULPs (but will be 0 for asin(0)).