This is so cool, I can't believe I'm only discovering this now! It makes sense that this would work for audio, as it seems to be a generalization of FM synthesis in that context
@mannyk7634 · 3 years ago
Very nice work, especially the sinusoidal activation. I'd like to point out that Candès covered this rigorously back in 1997 in "Harmonic Analysis of Neural Networks", which treats periodic activation functions ("admissible neural activation functions"). Strangely enough, that paper is not even cited by the authors.
@siarez · 4 years ago
How does this compare to just taking the Fourier (or discrete cosine) transform of the signal?
@luciengrondin5802 · 4 years ago
The application of this kind of representation to 3D rendering is fascinating. Could it be that in the future modelers will give up on the polygons+textures model and represent the whole scene with a neural network instead?
@kwea123 · 4 years ago
It requires huge processing time at the inference stage. Take SDFs as an example: you need to query the network a huge number of times to find out where the surface is (it also depends on the discretization resolution). I think currently it's only good for offline use.
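To put a number on that query cost, here is a rough sphere-tracing sketch; `sdf` is only a stand-in for the trained network (a plain analytic sphere so the snippet runs), and every step of every ray would be one more network evaluation.

```python
import numpy as np

def sdf(p):
    # Stand-in for the trained network: signed distance to a unit sphere.
    # A real renderer would evaluate the SIREN here, once per step per ray.
    return np.linalg.norm(p, axis=-1) - 1.0

def sphere_trace(origin, direction, max_steps=64, eps=1e-3):
    t, queries = 0.0, 0
    for _ in range(max_steps):
        d = sdf(origin + t * direction)
        queries += 1
        if d < eps:          # close enough to the surface: report a hit
            return t, queries
        t += d               # the SDF value is a safe step size
    return None, queries

hit_t, n = sphere_trace(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
print(hit_t, n)  # a single ray already needs many evaluations; a full image
                 # multiplies this by the number of pixels
```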
@luciengrondin5802 · 4 years ago
@@kwea123 I expressed myself poorly. I was thinking of rendering, not modeling. More precisely, I was thinking of the problem of rendering scenes with very high poly counts, for instance where a very long draw distance is required. Currently modelers have to use levels of detail, but that technique has limitations.
@qsobad · 4 years ago
@@luciengrondin5802 That is solved by another approach; take a look at UE5.
@Oktokolo · 1 year ago
@@luciengrondin5802 I hope the network does not have to learn to give the pixel values for a coordinate, but could also learn to give coordinates and pixel values for an index. The real issue would be the compression stage requiring training a network of appropriate size on the scene to be "compressed".
@convolvr · 4 years ago
The first layer, sin(Wx + b), can be thought of as a vector of waves with frequencies w_i and phase offsets b_i. After the second linear layer, we have a vector of trigonometric series that look like Fourier expansions, except the frequencies and phase offsets can be anything. Although the next nonlinearity might do something new, we can already represent any function with the first 1.5 layers. What advantages does this approach offer vs. representing and evaluating functions as a Fourier series?
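To make that concrete, a minimal NumPy sketch (the layer width and the frequency scale omega_0 = 30 are assumptions, the latter being the value reported in the SIREN paper): the first layer is a bank of sinusoids sin(w_i x + b_i), and the second linear layer takes weighted sums of them, i.e. trigonometric series with free frequencies and phases.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D input coordinates in [-1, 1]
x = np.linspace(-1.0, 1.0, 512).reshape(-1, 1)

# First layer: sin(W1 x + b1) -> a bank of sinusoids whose frequencies come
# from the rows of W1 and whose phase offsets come from b1.
hidden = 256
omega_0 = 30.0                                   # frequency scale from the paper
W1 = rng.uniform(-1.0, 1.0, (1, hidden))
b1 = rng.uniform(-np.pi, np.pi, hidden)
h = np.sin(omega_0 * (x @ W1) + b1)

# Second layer (linear): each output is a weighted sum of those sinusoids,
# i.e. a trigonometric series with arbitrary frequencies and phases.
W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, 1))
y = h @ W2
print(y.shape)  # (512, 1)
```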
@_vlek_ · 4 years ago
Because you can learn the representation of lots of different signals via gradient descent?
@luciengrondin5802 · 4 years ago
@@_vlek_ I think this is it, indeed. Efficient Fourier transform algorithms only work with a regularly sampled signal and, if I'm not mistaken, of low dimension. This machine learning approach can work with any kind of signal, I think.
@isodoubIet · 1 year ago
Fourier series are linear
@convolvr · 1 year ago
@@isodoubIet The Fourier transform is linear. The Fourier series is not. I assume you're implying that the neural net is fundamentally more expressive by being nonlinear. But the Fourier series is also nonlinear.
@isodoubIet · 1 year ago
@@convolvr Eh, no. If you have a smooth periodic signal, it's still expressible as a linear combination of Fourier components, so yes, this is fundamentally more expressive.
@TileBitan · 2 years ago
The music part was outstanding. Audio waveforms are just stacked sine waves, as opposed to images or text, where the input may not be closely related to the sine function. So it just feels right to use sine activations (with the required tweaks to make them work) instead of ReLUs. But I'm going to be careful with this: even though I have some experience in ML, I haven't ever touched anything other than ReLUs, sigmoids, tanh, and straight-up linear activations.
@Oktokolo · 1 year ago
You can approximate _everything_ with stacked sine waves. All modern video and image compression algorithms are based on that.
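A small sketch of that idea using the discrete cosine transform (the basis of JPEG-style codecs); the toy signal and the number of kept coefficients are arbitrary choices.

```python
import numpy as np
from scipy.fft import dct, idct

t = np.linspace(0.0, 1.0, 1024)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 23 * t)

# Transform to a "stacked waves" basis.
coeffs = dct(signal, norm='ortho')

# Keep only the 32 largest coefficients, zero the rest.
keep = 32
smallest = np.argsort(np.abs(coeffs))[:-keep]
compressed = coeffs.copy()
compressed[smallest] = 0.0

# Reconstruct from the few kept waves; the error stays small.
reconstructed = idct(compressed, norm='ortho')
print(np.max(np.abs(signal - reconstructed)))
```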
@TileBitan · 1 year ago
@@Oktokolo Let me rephrase that, then. Audio waveforms can be approximated by a relatively SMALL number of stacked sine waves, so it feels natural to use them in NNs. Everything can be approximated by an infinite number of sine waves, but sometimes it doesn't make sense to do so.
@Oktokolo · 1 year ago
@@TileBitan It obviously makes sense for images, as that is what the best compression algorithms use. It should also be possible to encode text reasonably well, even though the resulting set of weights is probably larger than the text itself when you're not encoding the input of a huge language model...
@TileBitan · 1 year ago
@@Oktokolo I don't understand. Sounds are waves of different amplitudes and frequencies inside the hearing range. Images nowadays can be 100M pixels with 3 × 256 values in the BEST case, where relationships between pixels can be really close to nothing. The case is completely different. The text case doesn't really have much to do with a wave. They might use FFTs for images, but you've got to agree with me: for the same error you need way, way fewer terms for sound than for images.
@Oktokolo · 1 year ago
@@TileBitan It doesn't matter whether it looks like it has anything to do with a wave, or whether adjacent values look like they have any relation to each other. Treating data as signals and then encoding the signal as stacked waves just works surprisingly well. It might not work well for truly random bit noise, but most data interesting to humans seems to exhibit surprisingly low entropy and can be compressed using stacked sines.
@rahuldeora1120 · 4 years ago
Can you please share your code? The link on the project page is not working
@IsaacGerg · 4 years ago
The arXiv version has an incorrect reference. The paper states, "or positional encoding strategies proposed in concurrent work [5]", and the video mentions a paper from 2020, but reference [5] in your current arXiv version is C. Ballester, M. Bertalmio, V. Caselles, G. Sapiro, and J. Verdera, "Filling-in by joint interpolation of vector fields and gray levels", IEEE Trans. on Image Processing, 10(8):1200-1211, 2001. I believe this should reference what you list as [35].
@kwea123 · 4 years ago
Yes, NeRF ("Representing Scenes as Neural Radiance Fields for View Synthesis") uses positional encoding. And they recently published a paper that uses the Fourier transform.
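For reference, the NeRF-style positional encoding maps each input coordinate to sines and cosines at exponentially spaced frequencies; a rough sketch (the number of frequency bands is an arbitrary choice here):

```python
import numpy as np

def positional_encoding(x, num_freqs=10):
    # Map each coordinate to [sin(2^k * pi * x), cos(2^k * pi * x)] for
    # k = 0 .. num_freqs - 1, as described in the NeRF paper.
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    angles = x[..., None] * freqs                       # (..., dims, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)

xyz = np.array([[0.1, -0.4, 0.7]])
print(positional_encoding(xyz).shape)  # (1, 60): 3 coords * 2 * 10 frequencies
```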
@paulofalca0 · 4 years ago
Awesome work!
@NerdyRodent · 4 years ago
That’s amazing!
@tiagotiagot · 1 year ago
How does it compare to using a sawtooth wave in place of the sine wave?
@Singularitarian · 4 years ago
What if you were to use exp(i x) = cos(x) + i sin(x) as the activation function? That seems potentially more elegant.
@rainerzufall1868 · 4 years ago
What would it mean for an activation to have a complex output? Or 2 outputs?
@OliverBatchelor · 4 years ago
@@rainerzufall1868 Twice as many outputs, i.e. just doubling the features. You can do a similar thing with ReLU, where you take both max(x, 0) and min(x, 0) and split into two parts; I'm not sure it's a whole lot better than just one, though...
@luciengrondin5802 · 4 years ago
@@rainerzufall1868 Does the activation function necessarily have to be real? I don't think so. I think using a complex exponential could help make the calculations and implementation clearer. It could add computational overhead, though.
@rainerzufall1868 · 4 years ago
@@luciengrondin5802 I don't think it would simplify things. If you model it as the activation having two outputs, it would need some re-implementation, and if you instead use one complex output and complex multiplication, the libraries are not optimized for this at all, so the computational hit would be big, I think.
@rainerzufall1868 · 4 years ago
Also, cosine and sine are the same except for a constant offset in the input, which we could learn through the bias, so I don't think it would add much value. On the flip side, the derivative of sine is cosine and vice versa (with a minus sign), so we can just reuse the output of the other in the derivative computation.
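A quick numerical check of both points (nothing here is from the paper; these are just the identities cos(x) = sin(x + π/2) and d/dx sin(x) = cos(x)):

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 1001)

# A cosine activation is a sine activation with the bias shifted by pi/2,
# so a learned bias can absorb the difference between the two.
assert np.allclose(np.cos(x), np.sin(x + np.pi / 2))

# The derivative of sin is cos, so the backward pass can reuse the same kind
# of evaluation, just phase-shifted (checked here with a numerical gradient).
assert np.allclose(np.gradient(np.sin(x), x), np.cos(x), atol=1e-2)
```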
@wyalexlee8578 · 4 years ago
Love this! Thank you!!
@anilaxsus6376 · 1 year ago
Yeah, I was wondering why people weren't using sines and cosines. I watched a video where the presenter explained that a neural network with L layers and N nodes per layer, using ReLU activations, can perfectly match a function with N^L bends or turning points in its curve (assuming the network has a single scalar output). I guess that is why ReLU failed on the audio: there are a lot of turning points in audio data. So technically a SIREN's performance can be matched by a large enough ReLU network, which is why I'm looking at SIREN as an optimization over the usual ReLU networks. I'm glad I saw this; I will look into it further. I suspect sinusoidal activations will be useful in domains with some sort of repetition, since ReLUs act more like threshold switches.
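For anyone who wants to poke at that comparison, a rough PyTorch sketch fitting the same toy waveform with a ReLU MLP and a sine MLP. It skips the SIREN paper's principled weight initialization, and the layer sizes, omega_0 = 30, and optimizer settings are arbitrary, so treat it as a starting point rather than a faithful reproduction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
t = torch.linspace(-1, 1, 2048).unsqueeze(1)
target = torch.sin(40 * t) * torch.sin(3 * t)   # toy "audio-like" waveform

class Sine(nn.Module):
    def __init__(self, omega_0=30.0):
        super().__init__()
        self.omega_0 = omega_0
    def forward(self, x):
        return torch.sin(self.omega_0 * x)

def mlp(act):
    # Small coordinate MLP: 1 input (time) -> 1 output (amplitude).
    return nn.Sequential(nn.Linear(1, 128), act,
                         nn.Linear(128, 128), act,
                         nn.Linear(128, 1))

for name, act in [("relu", nn.ReLU()), ("sine", Sine())]:
    net = mlp(act)
    opt = torch.optim.Adam(net.parameters(), lr=1e-4)
    for _ in range(2000):
        loss = ((net(t) - target) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(name, loss.item())   # compare how well each fits the wiggly signal
```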
@rodrigob · 4 years ago
No link to the paper in the video description?
@rodrigob · 4 years ago
And project page at vsitzmann.github.io/siren/
@tigeruby · 4 years ago
lol at tanh. But very cool general-purpose work; I can imagine this being a good exploratory topic / bonus project for intro signal processing courses.
@Marcos10PT · 4 years ago
Goodbye ReLU, you had a good run! I feel I have to watch this a few more times to have a good idea of what's going on 😄 but it looks like a breakthrough!
@dann_y5319 · 7 months ago
Awesome
@aidungeon4539 · 4 years ago
Super cool!
@DrPapaya · 4 years ago
Code available?
@_zproxy · 9 months ago
Is it like a new JPEG?
@volotat · 4 years ago
Wow, I just watched Yannic Kilcher's video on this work, and it is fascinating... I bet this work is going to change many things in ML. Please share the code!
@enginechen7312 · 4 years ago
Hi, could I download this video and upload it to bilibili.com, which Chinese students and researchers can access freely?
@sherwoac · 4 years ago
There's already an implementation available at github.com/titu1994/tf_SIREN