This is so cool, I can't believe I'm only discovering this now! It makes sense that this would work for audio, as it seems to be a generalization of FM synthesis in that context
@mannyk7634 · 3 years ago
Very nice work, especially the sinusoidal activation. I'd like to point out that Candès covered this rigorously back in 1997 in "Harmonic Analysis of Neural Networks", which treats periodic activation functions ("admissible neural activation functions"). Strangely enough, that paper is not even cited by the authors.
@siarez · 4 years ago
How does this compare to just taking the Fourier (or discrete cosine) transform of the signal?
@luciengrondin5802 · 4 years ago
The application of this kind of representation to 3D rendering is fascinating. Could it be that in the future modelers will give up on the polygons+textures model and represent the whole scene with a neural network instead?
@kwea123 · 4 years ago
It requires huge processing time at the inference stage. Take SDFs as an example: you need to query the network a huge number of times to find out where the surface is (it also depends on the discretization resolution). I think currently it's only good for offline use.
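To put a number on that query cost, here is a rough sphere-tracing sketch; `sdf` is only a stand-in for the trained network (a plain analytic sphere so the snippet runs), and every step of every ray would be one more network evaluation.

```python
import numpy as np

def sdf(p):
    # Stand-in for the trained network: signed distance to a unit sphere.
    # A real renderer would evaluate the SIREN here, once per step per ray.
    return np.linalg.norm(p, axis=-1) - 1.0

def sphere_trace(origin, direction, max_steps=64, eps=1e-3):
    t, queries = 0.0, 0
    for _ in range(max_steps):
        d = sdf(origin + t * direction)
        queries += 1
        if d < eps:          # close enough to the surface: report a hit
            return t, queries
        t += d               # the SDF value is a safe step size
    return None, queries

hit_t, n = sphere_trace(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
print(hit_t, n)  # a single ray already needs many evaluations; a full image
                 # multiplies this by the number of pixels
```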
@luciengrondin5802 · 4 years ago
@@kwea123 I expressed myself poorly. I was thinking of rendering, not modeling. More precisely, I was thinking of the problem of rendering scenes with very high poly counts, for instance where a very long draw distance is required. Currently modelers have to use levels of detail, but that technique has limitations.
@qsobad · 4 years ago
@@luciengrondin5802 That is solved by another approach; take a look at UE5.
@Oktokolo · 1 year ago
@@luciengrondin5802 I hope the network does not have to learn to give the pixel values for a coordinate, but could also learn to give coordinates and pixel values for an index. The real issue would be the compression stage requiring training a network of appropriate size on the scene to be "compressed".
@convolvr · 4 years ago
The first layer, sin(Wx + b), can be thought of as a vector of waves with frequencies w_i and phase offsets b_i. After the second linear layer, we have a vector of trigonometric series that look like Fourier expansions, except the frequencies and phase offsets can be anything. Although the next nonlinearity might do something new, we can already represent any function with the first 1.5 layers. What advantages does this approach offer vs. representing and evaluating functions as a Fourier series?
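To make that concrete, a minimal NumPy sketch (the layer width and the frequency scale omega_0 = 30 are assumptions, the latter being the value reported in the SIREN paper): the first layer is a bank of sinusoids sin(w_i x + b_i), and the second linear layer takes weighted sums of them, i.e. trigonometric series with free frequencies and phases.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D input coordinates in [-1, 1]
x = np.linspace(-1.0, 1.0, 512).reshape(-1, 1)

# First layer: sin(W1 x + b1) -> a bank of sinusoids whose frequencies come
# from the rows of W1 and whose phase offsets come from b1.
hidden = 256
omega_0 = 30.0                                   # frequency scale from the paper
W1 = rng.uniform(-1.0, 1.0, (1, hidden))
b1 = rng.uniform(-np.pi, np.pi, hidden)
h = np.sin(omega_0 * (x @ W1) + b1)

# Second layer (linear): each output is a weighted sum of those sinusoids,
# i.e. a trigonometric series with arbitrary frequencies and phases.
W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, 1))
y = h @ W2
print(y.shape)  # (512, 1)
```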
@_vlek_ · 4 years ago
Because you can learn the representation of lots of different signals via gradient descent?
@luciengrondin5802 · 4 years ago
@@_vlek_ I think this is it, indeed. Efficient Fourier transform algorithms only work with a regularly sampled signal and, if I'm not mistaken, of low dimension. This machine learning approach can work with any kind of signal, I think.
@isodoubIet · 1 year ago
Fourier series are linear
@convolvr · 1 year ago
@@isodoubIet The Fourier transform is linear. The Fourier series is not. I assume you're implying that the neural net is fundamentally more expressive by being nonlinear. But the Fourier series is also nonlinear.
@isodoubIet · 1 year ago
@@convolvr Eh, no. If you have a smooth periodic signal, it's still expressible as a linear combination of Fourier components, so yes, this is fundamentally more expressive.
@TileBitan · 2 years ago
The music part was outstanding. Audio waveforms are just stacked sine waves, as opposed to images or text, where the input may not be closely related to the sine function. So it just feels right to use sine activations (with the required tweaks to make them work) instead of ReLUs. But I'm going to be careful with this: even though I have some experience in ML, I haven't ever touched anything other than ReLUs, sigmoids, tanh, and straight-up linear activations.
@Oktokolo · 1 year ago
You can approximate _everything_ with stacked sine waves. All modern video and image compression algorithms are based on that.
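A small sketch of that idea using the discrete cosine transform (the basis of JPEG-style codecs); the toy signal and the number of kept coefficients are arbitrary choices.

```python
import numpy as np
from scipy.fft import dct, idct

t = np.linspace(0.0, 1.0, 1024)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 23 * t)

# Transform to a "stacked waves" basis.
coeffs = dct(signal, norm='ortho')

# Keep only the 32 largest coefficients, zero the rest.
keep = 32
smallest = np.argsort(np.abs(coeffs))[:-keep]
compressed = coeffs.copy()
compressed[smallest] = 0.0

# Reconstruct from the few kept waves; the error stays small.
reconstructed = idct(compressed, norm='ortho')
print(np.max(np.abs(signal - reconstructed)))
```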
@TileBitan · 1 year ago
@@Oktokolo Let me rephrase that, then. Audio waveforms can be approximated by a relatively SMALL number of stacked sine waves, so it feels natural to use them in NNs. Everything can be approximated by an infinite number of sine waves, but sometimes it doesn't make sense to do so.
@Oktokolo · 1 year ago
@@TileBitan It obviously makes sense for images, as that is what the best compression algorithms use. It should also be possible to encode text reasonably well, even though the resulting set of weights is probably larger than the text itself when you're not encoding the input of a huge language model...
@TileBitan · 1 year ago
@@Oktokolo I don't understand. Sounds are waves of different amplitudes and frequencies inside the hearing range. Images nowadays can be 100M pixels with 3 × 256 values in the BEST case, where relationships between pixels can be really close to nothing. The case is completely different. The text case doesn't really have much to do with a wave. They might use FFTs for images, but you've got to agree with me: for the same error you need way, way fewer terms for sound than for images.
@Oktokolo · 1 year ago
@@TileBitan It doesn't matter whether it looks like it has anything to do with a wave, or whether adjacent values look like they have any relation to each other. Treating data as signals and then encoding the signal as stacked waves just works surprisingly well. It might not work well for truly random bit noise, but most data interesting to humans seems to exhibit surprisingly low entropy and can be compressed using stacked sines.
@rahuldeora1120 · 4 years ago
Can you please share your code? The link on the project page is not working
@IsaacGerg · 4 years ago
The arXiv version has an incorrect reference. The paper states, "or positional encoding strategies proposed in concurrent work [5]", and the video mentions a paper from 2020, but reference [5] in your current arXiv version is C. Ballester, M. Bertalmio, V. Caselles, G. Sapiro, and J. Verdera, "Filling-in by joint interpolation of vector fields and gray levels", IEEE Trans. on Image Processing, 10(8):1200-1211, 2001. I believe this should reference what you list as [35].
@kwea123 · 4 years ago
Yes, NeRF ("Representing Scenes as Neural Radiance Fields for View Synthesis") uses positional encoding. And they recently published a paper that uses the Fourier transform.
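For reference, the NeRF-style positional encoding maps each input coordinate to sines and cosines at exponentially spaced frequencies; a rough sketch (the number of frequency bands is an arbitrary choice here):

```python
import numpy as np

def positional_encoding(x, num_freqs=10):
    # Map each coordinate to [sin(2^k * pi * x), cos(2^k * pi * x)] for
    # k = 0 .. num_freqs - 1, as described in the NeRF paper.
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    angles = x[..., None] * freqs                       # (..., dims, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)

xyz = np.array([[0.1, -0.4, 0.7]])
print(positional_encoding(xyz).shape)  # (1, 60): 3 coords * 2 * 10 frequencies
```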
@paulofalca0 · 4 years ago
Awesome work!
@NerdyRodent · 4 years ago
That’s amazing!
@tiagotiagot · 1 year ago
How does it compare to using a sawtooth wave in place of the sine wave?
@Singularitarian · 4 years ago
What if you were to use exp(i x) = cos(x) + i sin(x) as the activation function? That seems potentially more elegant.
@rainerzufall1868 · 4 years ago
What would it mean for an activation to have a complex output? Or 2 outputs?
@OliverBatchelor · 4 years ago
@@rainerzufall1868 Twice as many outputs, i.e. just doubling the features. You can do a similar thing with ReLU, where you take both max(x, 0) and min(x, 0) and split into two parts; I'm not sure it's a whole lot better than just one, though...
@luciengrondin5802 · 4 years ago
@@rainerzufall1868 Does the activation function necessarily have to be real? I don't think so. I think using a complex exponential could help make the calculations and implementation clearer. It could add computational overhead, though.
@rainerzufall1868 · 4 years ago
@@luciengrondin5802 I don't think it would simplify things. If you model it as the activation having two outputs, it would need some re-implementation, and if you instead use one complex output and complex multiplication, the libraries are not optimized for this at all, so the computational hit would be big, I think.
@rainerzufall1868 · 4 years ago
Also, cosine and sine are the same except for a constant offset in the input, which we could learn through the bias, so I don't think it would add much value. On the flip side, the derivative of sine is cosine and vice versa (with a minus sign), so we can just reuse the output of the other in the derivative computation.
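A quick numerical check of both points (nothing here is from the paper; these are just the identities cos(x) = sin(x + π/2) and d/dx sin(x) = cos(x)):

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 1001)

# A cosine activation is a sine activation with the bias shifted by pi/2,
# so a learned bias can absorb the difference between the two.
assert np.allclose(np.cos(x), np.sin(x + np.pi / 2))

# The derivative of sin is cos, so the backward pass can reuse the same kind
# of evaluation, just phase-shifted (checked here with a numerical gradient).
assert np.allclose(np.gradient(np.sin(x), x), np.cos(x), atol=1e-2)
```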
@wyalexlee8578 · 4 years ago
Love this! Thank you!!
@anilaxsus6376 · 1 year ago
Yeah, I was wondering why people weren't using sines and cosines. I watched a video where the presenter explained that a neural network with L layers and N nodes per layer, using ReLU activations, can perfectly match a function with N^L bends or turning points in its curve (assuming the network has a single scalar output). I guess that is why ReLU failed on the audio: there are a lot of turning points in audio data. So technically a SIREN's performance can be matched by a large enough ReLU network, which is why I'm looking at SIREN as an optimization over the usual ReLU networks. I'm glad I saw this; I will look into it further. I suspect sinusoidal activations will be useful in domains with some sort of repetition, since ReLUs act more like threshold switches.
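For anyone who wants to poke at that comparison, a rough PyTorch sketch fitting the same toy waveform with a ReLU MLP and a sine MLP. It skips the SIREN paper's principled weight initialization, and the layer sizes, omega_0 = 30, and optimizer settings are arbitrary, so treat it as a starting point rather than a faithful reproduction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
t = torch.linspace(-1, 1, 2048).unsqueeze(1)
target = torch.sin(40 * t) * torch.sin(3 * t)   # toy "audio-like" waveform

class Sine(nn.Module):
    def __init__(self, omega_0=30.0):
        super().__init__()
        self.omega_0 = omega_0
    def forward(self, x):
        return torch.sin(self.omega_0 * x)

def mlp(act):
    # Small coordinate MLP: 1 input (time) -> 1 output (amplitude).
    return nn.Sequential(nn.Linear(1, 128), act,
                         nn.Linear(128, 128), act,
                         nn.Linear(128, 1))

for name, act in [("relu", nn.ReLU()), ("sine", Sine())]:
    net = mlp(act)
    opt = torch.optim.Adam(net.parameters(), lr=1e-4)
    for _ in range(2000):
        loss = ((net(t) - target) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(name, loss.item())   # compare how well each fits the wiggly signal
```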
@rodrigob · 4 years ago
No link to the paper in the video description?
@rodrigob · 4 years ago
And project page at vsitzmann.github.io/siren/
@tigeruby · 4 years ago
lol at tanh. But very cool general-purpose work; I can imagine this being a good exploratory topic / bonus project for intro signal processing courses.
@Marcos10PT · 4 years ago
Goodbye ReLU, you had a good run! I feel I have to watch this a few more times to have a good idea of what's going on 😄 but it looks like a breakthrough!
@dann_y5319 · 7 months ago
Awesome
@aidungeon4539 · 4 years ago
Super cool!
@DrPapaya · 4 years ago
Code available?
@_zproxy · 9 months ago
Is it like a new JPEG?
@volotat · 4 years ago
Wow, I just watched Yannic Kilcher's video on this work, and it is fascinating... I bet this work is going to change many things in ML. Please share the code!
@enginechen7312 · 4 years ago
Hi, could I download this video and upload it to bilibili.com, which Chinese students and researchers can access freely?
@sherwoac · 4 years ago
There's already an implementation available at github.com/titu1994/tf_SIREN