Just to let people know, by popular demand, I've also uploaded a video where I do this with a male English voice! kzbin.info/www/bejne/hHiQfXSKd7-BeNU
@SomethingUnreal8 жыл бұрын
+Johnny Mccrum: I'm afraid not. I don't know enough to program my own from scratch, so I was using the open-source software "torch-rnn" (github.com/jcjohnson/torch-rnn/) here.
@postvideo978 жыл бұрын
Practical RNN applications don't use 'homebrew' code; they always use some kind of GPU-accelerated library, such as Torch, TensorFlow, etc. There's no need to reinvent the wheel by coding the LSTM yourself (except for educational purposes, which is recommended as it teaches the fundamentals of BPTT). Any implementation of an LSTM RNN will behave the same, apart from some differences in performance.
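For example, a minimal sketch of what "using a library LSTM" looks like in PyTorch - the layer sizes here are illustrative assumptions, not settings from the video:

    import torch
    import torch.nn as nn

    # A 3-layer LSTM over 256-dimensional inputs; the library handles all
    # of the gating and BPTT internals that homebrew code would reimplement.
    model = nn.LSTM(input_size=256, hidden_size=680, num_layers=3, batch_first=True)
    x = torch.randn(1, 100, 256)   # (batch, time, features) dummy input
    output, (h, c) = model(x)
    print(output.shape)            # torch.Size([1, 100, 680])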
@postvideo978 жыл бұрын
@SomethingUnreal You should try training the RNN with STFT (Short Time Fourier Transform) instead of raw audio data, it should perform much better at distinguishing words, as the NN won't need to care about generating the signal itself.
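A rough sketch of that preprocessing step with SciPy - the file name and FFT size below are assumptions:

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import stft

    rate, audio = wavfile.read("voice.wav")    # hypothetical mono training file
    f, t, Z = stft(audio, fs=rate, nperseg=512)
    magnitudes = np.abs(Z)                     # (freq_bins, time_slices) spectrogram
    print(magnitudes.shape)                    # e.g. (257, n_slices)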
@SomethingUnreal7 жыл бұрын
+postvideo975 If you can point me to an RNN that takes 2D input, then sure. Otherwise, I'm stuck with torch-rnn, which is 1D. BTW, I actually did experiment with feeding a spectrogram (FFT powers) to torch-rnn, "raster scan"-style (all of the first time slice, all of the second time slice, etc., end-to-end), and made a program that handles the fact that torch-rnn won't produce perfectly-sized slices some of the time. Amazingly, torch-rnn was able to output something that resembled the voice, but it couldn't make a stable sound at all (each generated slice didn't connect neatly to the next slice). I don't think I can get better than that while using torch-rnn.
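For reference, the "raster scan" flattening described above could look like this - a NumPy sketch with made-up data, where the 8-bit quantisation is an assumption to match torch-rnn's byte-stream input:

    import numpy as np

    magnitudes = np.random.rand(257, 1000)   # stand-in (freq_bins, time_slices) spectrogram
    # All of the first time slice, then all of the second, etc., end-to-end:
    flat = magnitudes.T.reshape(-1)
    # torch-rnn consumes bytes, so map each power onto one of 256 levels:
    quantised = np.round(flat / flat.max() * 255).astype(np.uint8)
    print(quantised.shape)                   # (257000,)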
@thorn93826 жыл бұрын
Wow, you've been feeding your GAN a lot of hentai
@gafeht8 жыл бұрын
When your creation is screaming to be put out of its misery, maybe it's time to rethink what you're doing
@Alex-oz9eh8 жыл бұрын
ayy lmao
@top1percent4248 жыл бұрын
gafeht You took it to a whole other level.
@yammie15368 жыл бұрын
gafeht exactly my thought lol
@elk34078 жыл бұрын
Yah..... This video somewhat reminds me of Nina Tucker from FMA. If you don't know what I'm talking about, DON'T look it up..... It's honestly kinda disturbing.
@TGRoko8 жыл бұрын
No no, it's "learning."
@smiledogjgp8 жыл бұрын
Funny how it learns to laugh and scream long before it can form words. Quite reminiscent of infant humanity.
@SuperGirl-eq1le6 жыл бұрын
Corpus Crewman ikr
@scarm_rune5 жыл бұрын
this neural network recapped the evolution of humans in a few mins
@hbaggg8 жыл бұрын
well congrats you made a computer waifu
@username3068 жыл бұрын
Skaterboybob made my day :^)
@maxalbert89438 жыл бұрын
Paddy made my day a day after your day
@wavefireYT7 жыл бұрын
UP
@AzureOnyxscore7 жыл бұрын
what next? a motorized fleshlight?
@oliverhilton60867 жыл бұрын
When you talk to your waifu and she replies with 30 seconds of "iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii"
@TrekkerLLAP8 жыл бұрын
*screams in Japanese*
@Bugingas7 жыл бұрын
Minodrey あああああああああああああああああああああああああああああああああああああああああああ (A in hiragana)
AI learns to generate voice: First thing it does is scream.
@BigOlSmellyFlashlight7 жыл бұрын
Spooglecraft just like real people
@CiRdy345 жыл бұрын
right!
@AlyphRat4 жыл бұрын
When humans are born, the first thing they do is cry. Same with this
@red2theelectricboogaloo9613 жыл бұрын
i have no mouth. and i cannot scream,
@ChocolateMilkMage8 жыл бұрын
Yes. One step closer to robot waifu A.I. Keep doing the good work soldier.
@mateuszbugaj7998 жыл бұрын
What a time to be alive!
@koreboredom43027 жыл бұрын
Mateusz Bugaj No it is not. If you were, however, born in the future, you could probably fuck machines.
@oliverhilton60867 жыл бұрын
I'm thinking of kreiger from archer right now
@noblenoob9137 жыл бұрын
ChocolateMilkMage I don't want a waifu, because now I have someone watching what I do and have to take care of her when she breaks down.
@sheeloesreallycool5 жыл бұрын
Lynx Rapid so a waifu
@TDGalea8 жыл бұрын
It goes from screaming to laughing. Why are you torturing this poor thing?
@tbe72185 жыл бұрын
Thomas Galea why would it be laughing though
@TheAmazingDoorknob5 жыл бұрын
@@tbe7218 insanity
@EtanMarlin4 жыл бұрын
Nah, it's a baby: born screaming, but then it learns to laugh
@thechrisgrice8 жыл бұрын
Given that the input to this was Japanese, I listened for some Japanese. I heard one phrase: 7:45 - "Denki, hen ka?" meaning "Electricity... is it strange?" When it's an AI saying that, you bet it's strange. Then again, with Japanese it literally only needs to accurately string 2 sounds together and you get words.
@user-ty6we2sp2m7 жыл бұрын
thechrisgrice she also says itai. (That means pain in Japanese)
@alex732176 жыл бұрын
It even laughed afterwards :D
@elliot_rat6 жыл бұрын
Umitalia nyan yeah i think op knows that
@wigwagstudios24746 жыл бұрын
アアアアアアアアアアアア ("A" in katakana)
@josoup82914 жыл бұрын
Where
@calebpeters25448 жыл бұрын
5:52 I saw "weird glitch" and immediately thought it was gonna say something like, "EVEN NOW, THE EVIL SEED OF WHAT YOU'VE DONE GERMINATES WITHIN YOU."
@SkyrimTheBoss8 жыл бұрын
40 seconds in and the AI is already screaming in pain.
@geokramer17117 жыл бұрын
same tbh fam.... same...
@kaijupants90958 жыл бұрын
This is what happens when you torture a vocaloid
@DAAI7418 жыл бұрын
this is honestly terrifying
@MrDremboy8 жыл бұрын
Adam Brown so true
@Garganzuul8 жыл бұрын
It's probably what we sound like to animals.
@thetoontastictoon17208 жыл бұрын
Garganzuul lmao
@alphakretin23878 жыл бұрын
She's trying her best
@zinqtable10927 жыл бұрын
Our voices are pretty high pitched to other animals.
@MaxLebled8 жыл бұрын
Good god, the 2k training iteration sounds like pained screaming
@Alex_Gol8 жыл бұрын
( ͡° ͜つ ͡°)
@thepetrarcticwar27787 жыл бұрын
yeah, it sometimes sounds like "HELP ME!" Good grief, it's nightmarish.
@kittenclaws57757 жыл бұрын
Like a newborn, perhaps?
@SuperGirl-eq1le6 жыл бұрын
Maxime Lebled 5k iterations are funnier
@TheLaXandro8 жыл бұрын
0:31 itai (pain) 1:00 shine (die) I guess she really hated you for at least first 5k iterations if these were among her first words.
@jongyon7192p7 жыл бұрын
TheLaXandro did I hear "yame" ("stop")? shit, just kill me
@kajwbidonajdowlem50136 жыл бұрын
Lol true
@lollergsize7636 жыл бұрын
At 15k iterations she started laughing, so... maybe she was planning revenge
@Speedow8 жыл бұрын
it's like listening to someone in pain while slowly going mad and accepting it.
@EzraelVio8 жыл бұрын
They can actually hire your computer to produce new Pokémon cries for the next-gen Pokémon games
@syntheticfox_real8 жыл бұрын
YES. This is EXACTLY WHAT IT IS.
@KuraIthys8 жыл бұрын
In some ways the early iterations sounds somewhat like the noises babies make...
oh fuuuuuuck no it started giggling that's creepy as hell
@Garganzuul8 жыл бұрын
You think that's creepy? - There might be a skeleton inside you RIGHT NOW.
@GoldphishAnimation8 жыл бұрын
Garganzuul oh shit there is I gotta find a surgeon NOW.
@nozumihishimatchi18808 жыл бұрын
The Stitch me to he messed up anime voice
@NoxUmbrae8 жыл бұрын
If you take the point of view that your brain is what you call "you" it gets even creepier! YOU MIGHT BE INSIDE OF A SKELETON RIGHT NOW.
@NoxUmbrae7 жыл бұрын
More like eat, shit and be a nuisance 24/7.
@grenadier428 жыл бұрын
ANIME AI THROWS TEMPER TANTRUM, MORE AT 9
@achaemenid73948 жыл бұрын
But isn't Anime Ai's bedtime 7? Exactly, hopefully we can get Anime Ai in bed by 9. Ah, I see.
@jorge286248 жыл бұрын
Neural network-chan
@KoltoxOfficialChanel8 жыл бұрын
9:09 first time in history when a creation "intentionally" calls its creator a baka
@KyureiProductions5 жыл бұрын
It called its creator an "idiot" in Japanese.
@charaicommenternotalt9 ай бұрын
and babies?
@propername48308 жыл бұрын
Local robot tries to understand anime
@christiaanprinsloo5867 жыл бұрын
local robot goes full weeaboo in under 10 minutes
@frootube56626 жыл бұрын
Justin Y. I have a weird love-hate relationship with you, Justin. I kinda like it...
@SuperGirl-eq1le6 жыл бұрын
XD
@Kinzsters1726 жыл бұрын
Justin Y. You're fake.
@Heliocentricity4 жыл бұрын
@@abacussssss for freddy's sake
@07actual8 жыл бұрын
You've successfully invented a very passable Kirby language. Use this newfound power wisely.
@SuperGirl-eq1le6 жыл бұрын
07actual XD
@CupoChinoMusic6 жыл бұрын
*using an anime voice holy shit this is next level amirite guys
@wigwagstudios24746 жыл бұрын
ポヨ ("poyo")
@Heliocentricity4 жыл бұрын
7:16 poyo
@curtiss57818 жыл бұрын
It cries just like a baby who doesn't know language yet either
@joekewl75397 жыл бұрын
Kizuna A.I. Prototype
@EdwardNavu8 жыл бұрын
An interesting application of neural networks, and unintentional nightmare fuel, in an attempt to recreate anime voices.
@Gurren8138 жыл бұрын
Imagine bastion but instead of cute beeps, chirps, and whistles, it just makes garbled anime lines. "Kon NIIIIIIII ch wAAAAAAAAAAAAAAAA"
@SomethingUnreal8 жыл бұрын
+Gurren813 Consider yourself lucky that I don't have Overwatch, so I won't make that mod.
@wawan87596 жыл бұрын
I'm already Tracer
@SlyHikari035 жыл бұрын
Pls somebody make that a thing!
@Zenthex8 жыл бұрын
the worst part is that someone somewhere is making this program their waifu.
@darianthe2nd428 жыл бұрын
"... OOOOOOaAaAaA... AAAAAAA!" "That's so hot."
@thebirchwoodtree7 жыл бұрын
Deltinum *Slow fap* *
@ketmax28056 жыл бұрын
TheBirchWoodTree *slow nut*
@SweetHyunho8 жыл бұрын
Thanks, I had imagined what it would sound like. Now I have a pretty good idea. I wish somebody would use a virtual piano to reproduce piano recordings, train it a lot, and then let it improvise.
@SomethingUnreal8 жыл бұрын
+SweetHyunho: Check Google's WaveNet project at their blog - they did this, and there are several samples there showing what it outputs. The piano ones are near the bottom =) deepmind.com/blog/wavenet-generative-model-raw-audio/
@SweetHyunho8 жыл бұрын
Seen that already. That is sample-based. I'm talking about performing a virtual (or real) musical instrument. Perhaps we could simulate a set of virtual hands for extra human feeling!
@SomethingUnreal8 жыл бұрын
What do you mean "sample-based"? It's trained on lots of speech the same as mine is. The fact that they had to fragment it is just because it's a CNN rather than an RNN (and because they wanted to label each phoneme)... The concatenative speech synth that they compare it to is just samples stitched together, but the CNN's output is a continuous stream based on what it learned.
@SweetHyunho8 жыл бұрын
Yes, what you said. Both WaveNet and your RNN directly output the wave without a virtual instrument. What I want to see is the network "hitting" the keys of the piano, or moving a virtual tongue and lips to speak, by controlling (outputting to) a separate simulator which synthesizes the sound itself. WaveNet contains the piano acoustics itself, so you cannot replace the piano with an organ or tweak it, but in my idea the network focuses on the structure of the music. That should enable looking much farther ahead (near-sightedness = boring music). I guess AI musicians will start being really competitive once the history+planning window exceeds one minute.
@SomethingUnreal8 жыл бұрын
Right, I understand. So in the case of speech, outputting something like the pitch, volume and the formant frequencies of the voice, which can then be fed to something like Praat to synthesize the sound. Yes, that would be very cool.
@Qualex148 жыл бұрын
Next time you should use Morgan Freeman's voice
@creeperlamoureux8 жыл бұрын
YAS
@sesseljabs9647 жыл бұрын
Qualex14 or David Attenborough
@AlyphRat7 жыл бұрын
*Gordon
@gordonfreemanthesemendemon18056 жыл бұрын
I saw a gordon and i saw a freeman, so i have thus been summoned
@EmmysVerySeriousVideos8 жыл бұрын
This shit is creepy, sounds like it's in pain and screaming like hell
@Darkethi8 жыл бұрын
hOi
@EmmysVerySeriousVideos8 жыл бұрын
Darkethi.eXe hOI
@MsHumanOfTheDecade8 жыл бұрын
It's mostly screaming from existential dread
@mendaxMultorum8 жыл бұрын
Dodeca heavy doc
@Phoen1x8838 жыл бұрын
Ed... ward...
@Skelpolu8 жыл бұрын
Please, do continue this and make more videos about it - it's incredibly intriguing, and I'd love to see what happens with different voice actors. Male Japanese ones, and even some English ones, would be awesome, despite not getting a single word out of them anyway - literally.
@SomethingUnreal8 жыл бұрын
I'm glad you like it! I will eventually be uploading one trained on my voice (which happens to be male and English), which I trained with the specific goal of getting recognisable words out of it.
@Skelpolu8 жыл бұрын
You think that'd be possible? That would be amazing! By the way, as far as I understood from the video, the learning eventually flattens out and only adjusts minimal features (which, however, seem to affect our perception of the voice the most). Would increasing the number of epochs taught make a difference at all?
@SomethingUnreal8 жыл бұрын
Yes, the learning rate decreases over time to let things stabilise. I actually stopped it when I did because I wasn't noticing many changes (you can see that towards the end of the video, I'm skipping more results because there's nothing very different from previous results). Things would likely have continued to change a bit, but not much. Also, although the training loss ("error" in what it has learned) decreases roughly logarithmically, it doesn't get better forever. It eventually stops decreasing and becomes closer and closer to a flat line if you look at it on a graph (please check the link at the end of the video description for some pretty graphs of the losses over time =P). In other words, there is a limit to how much the network can learn, even if you could give it hours' worth of really good data. I think that the reason the results were sometimes still so different to each other at the end (even though the training loss had stopped decreasing) is because it was just tweaking a few detailed parameters in "random" ways (i.e. wasn't working towards a specific state) because it was not big enough to learn all details, compared to when it was learning the most important patterns. I could certainly be wrong, though. Another commenter did point out that I should sample from each checkpoint (iteration) more than once because they can produce wildly different results, but for technical reasons, I'm still not able to yet (I don't have access to the computer I trained it on, which trained using the GPU for speed; my computer can't train on the GPU, and the checkpoint files made by torch-rnn with GPU vs CPU training are in different formats...).
@SomethingUnreal8 жыл бұрын
Update: I actually _can_ use these checkpoints on my computer! Although it takes 85 minutes to make a single output file (~27 seconds of audio), assuming it's not training at the same time. So I must've been confusing torch-rnn with something else (maybe char-rnn).
@sophiapinzon27658 жыл бұрын
idk about you guys but the fact that the network would take a liking to random sounds in the beginning and use them all the time (example: eeeeeeeeeeeeeeeeeeee!!!!!) is super cute
@darianthe2nd428 жыл бұрын
idk man 5:10 was cuter
@Ys-wd2lh8 жыл бұрын
0:32 hentai sound track
@DAAI7418 жыл бұрын
AAAAAAAAAAAAAAAAAAAA
@joshualettink75828 жыл бұрын
3 months late, but this made me laugh out loud for real haha
@sinistrolerta8 жыл бұрын
Dank Meme Sir, what the fuck kind of hentai are you watching? O_o
@rajatmond7 жыл бұрын
I'll watch what he's watching. Thanks!
@youtuber-h3g7 жыл бұрын
「 OKAY 」 holy fuck i'm dead 😂
@AshtonSnapp8 жыл бұрын
So, give neural network Japanese anime girl, get gibberish. Perfect.
@magikarpusedsplash88818 жыл бұрын
SnappGamez If you spoke Japanese, then it'd probably make more sense.
@AshtonSnapp8 жыл бұрын
Magikarp Used Splash I tried learning once. I understand why it is considered a Category V language.
@magikarpusedsplash88818 жыл бұрын
SnappGamez it's even more difficult (allegedly) for native English speakers.
@makeneko_s8 жыл бұрын
SnappGamez by the way, the voice_sample is of a boy... the voice said *boku* [僕], which is the I/me used by boys... watashi [私] is used by girls... so basically it's a trap... (Well, I thought it was a girl's voice too, though.)
@SomethingUnreal8 жыл бұрын
+白金圭 Have you never heard a tomboyish girl say "boku"? Please look up the game on VNDB (ぴゅあぴゅあ, "Pure Pure") if you don't believe me. (Hinata isn't really much of a tomboy, though. Maybe it's because she's a dog-eared girl. By the way, when I just googled 「犬耳っ子」 ("dog-eared girl") to check the spelling, Hinata was the sixth search result. I was surprised, lol)
@ZiaSatazaki8 жыл бұрын
8:54 MANGO PUPPY ASYYYYLUUUUUM~
@TheSprunkerOfficial4 ай бұрын
bruh
@creator-link2 жыл бұрын
Weird to see this from before the great transformer boom that basically accelerated AI into just about everything
@Heliocentricity4 жыл бұрын
7:30 Neural network anime girl learns to sing the 7 GRAND DAD/Flintstones theme
@danpope38128 жыл бұрын
Someone please write some translation closed captions. Please.
@ChucksSEADnDEAD8 жыл бұрын
Dan Pope no actual words were spoken, apart from random chance. It's speaking gibberish.
@danpope38128 жыл бұрын
Filipe Amaral I meant could someone with comedic talent have some fun with it.
I am in your target audience! Absolutely love this.
@SomethingUnreal8 жыл бұрын
+Vincent Oostelbos I'm glad! And you even made it through my unreasonably long wall of text in the video description!
@Phagocytosis8 жыл бұрын
I sure did. I also noticed at the end in the video, you had written "I'd still like to try training a bigger network with longer training data". Is that something you have done or are still planning on doing, and if so, is it something that will find its way to this channel at some point?
@SomethingUnreal8 жыл бұрын
+Vincent Oostelbos I've not done it with this voice yet. I recently tried training a 760x3 network on 27 minutes of audio but with a very different voice (often becomes very quiet), but I haven't got it to turn out as well as this yet. I've trained several (smaller ones) on my own voice, with the goal of having it output recognisable words, to varying degrees of success. I think they could be better if I recorded more training data, but it's very hard to keep the same way of speaking similar things for over 15 minutes (it's like my brain becomes numb and I can't even form the words anymore). I should make videos of the results anyway, though.
@Phagocytosis8 жыл бұрын
SomethingUnreal Have you tried just reading out a lengthy piece of text, like a book, as if you were creating an audiobook? Anyway, I'm looking forward to seeing the results of some of those projects you mentioned. Good luck!
@SomethingUnreal8 жыл бұрын
+Vincent Oostelbos I did that a few days ago, yes. Thank you!
@henryzhang39616 жыл бұрын
0:21 killed 0:28 being tickled 1:19 boiling kettle 1:25 some assembly required, etc 1:49 riding roller coaster 3:53 karate screaming
@Zahlenteufel18 жыл бұрын
8:01 did it say "senpai"???
@skorpius20298 жыл бұрын
yes
@henrik000000000000018 жыл бұрын
4:11 (eeeeeeh oniisan - "big brother")
@gabrielecipriani67988 жыл бұрын
Zahlenteufel1 2:43 "yamete!" ("stop it!") so creepy...
@bolgeg61917 жыл бұрын
7:10 "tomare tte" ("stop!")
@HaSTaxHaX7 жыл бұрын
9:50 "motto, motto..."
@HeyItzMeDawg8 жыл бұрын
Can just imagine the mad scientists' notes now:
Training iteration #15,000. AI has taken to giggling manically. Will continue to apply procedure and monitor results.
Training iteration #19,000. AI now screaming regularly in response to training regimen. Not sure if the response is indicative of any developed sense of pain or fear.
Training iteration #23,000. AI appears to have produced a wailing and/or crying effect. Hypothesis is that it believes the sound might affect the continuity or speed of the training regimen. Steps are being made to remove this misconception.
Training iteration #25,000. AI now incorporating self-calming mechanisms following training regimen, typically in the form of "mmms" and "shushing" noises. Previous corrective steps appear to be successful.
Training iteration #29,000. AI producing laughter despite a lack of relevant stimuli. Further investigation necessary.
Training iteration #33,000. AI appears to be asking questions on a semi-regular basis. A risk assessment is being prepared and lockdown procedures are being reviewed in case of catastrophic scenarios.
Training iteration #48,000. Recorded usage of the phrases "no more", "die", and "so many" in between bouts of giggling. Will investigate possible interpretations.
Training iteration #54,000. AI exhibiting bipolar personality following latest training exercises, alternating between pleasant and hostile responses to the same trainer. The training regimen at the time was not designed to elicit either response.
Training iteration #58,000. AI appears to have taken a liking to singing spontaneously. Training regimen is being adapted to correct this behavior.
Training iteration #59,000. AI is now communicating almost entirely in song, in defiance of prior training regimen.
@SomethingUnreal8 жыл бұрын
+HeyItzMeDawg: Please make this into a book.
@andrewsauer27298 жыл бұрын
I FEEEEL FANTASTIC! HEY HEY HEEY!!
@corazontuble2938 жыл бұрын
Pls no not that
@hellz234568 жыл бұрын
i just shiver trying to remember it
@xiphosura4138 жыл бұрын
Wut?
@andrewsauer27297 жыл бұрын
+gremlinboii Thank you! I do aim to please.
@want-diversecontent38877 жыл бұрын
andrew sauer Don't go to my party next time.
@SleepyAdam8 жыл бұрын
This stuff scares me. It's adorable and terrifying at the same time.
@heartache57428 жыл бұрын
Adam McKibben At 4:15 it gave up on its life
@matthiasengh79358 жыл бұрын
1 have a cool project, 2 take the most annoying training data imaginable, 3 witness carnage
@nelsonnicholson6175 Жыл бұрын
We've come a long way
@Noname_20148 жыл бұрын
It's interesting. But the voice sounds a bit creepy
@crristox8 жыл бұрын
Is cringy*
@MsHumanOfTheDecade8 жыл бұрын
Why? Because anime is cringy? That is an odd verdict.
@crristox8 жыл бұрын
Dodeca Totally underrated, and original.
@ChrisD__8 жыл бұрын
It sounds horrifyingly uncanny.
@ulilulable8 жыл бұрын
3:05 "nani? kore? nani?" XD
@BSFilms19978 жыл бұрын
@ 2:45 it says "Itai yo", which means "It hurts" in Japanese. This is scary...
@delayed_control8 жыл бұрын
THE POWER OF CHRIST COMPELS YOU
@nozumihishimatchi18808 жыл бұрын
m4ti140 Christ and science are separate
@Dirtfire8 жыл бұрын
I think he was just being funny, but yeah.
@__ten3 жыл бұрын
4:55 AWW that little "pop" noise was so cute sounded like someone smacking their lips together or smth
@MuzikBike8 жыл бұрын
This sounds terrifyingly adorable.
@sqoops86138 жыл бұрын
1:49 It has learned to express its endless pain and suffering.
"Alright, let's give a voice to this neural network and see what happens." *_continuous screams of agony_*
@OnEiNsAnEmOtHeRfUcKa8 жыл бұрын
1:58 So that's what a computer screaming sounds like.
@htomerif8 жыл бұрын
Mkay, a lot of people in the comments know nothing about AI. So what was the "training" algorithm used here? That's the most important piece of information. I'm assuming the input and output were frequency-domain samples.
@SomethingUnreal8 жыл бұрын
+htomerif: The input and output were raw 8-bit PCM audio samples, each of which was fed into or out of the network as the activation of one of 256 nodes. The fact that it's in the time domain is the part that amazed me the most (the way it's able to find the repeating patterns over time). I'm not entirely sure what you mean by "training algorithm", but torch-rnn (the software I used here) uses backpropagation with the "Adam" (Adaptive Moment Estimation) optimizer. You can get more details on exactly how it works here by checking its project page, and especially the text files "train.lua" and "doc/flags.md", here: github.com/jcjohnson/torch-rnn/
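For anyone curious, that one-node-per-waveform-position input scheme is easy to sketch in NumPy - dummy data below, not the actual training audio:

    import numpy as np

    audio = np.random.randint(0, 256, size=10, dtype=np.uint8)  # stand-in 8-bit samples
    one_hot = np.eye(256, dtype=np.float32)[audio]              # one node active per sample
    print(one_hot.shape)                                        # (10, 256)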
@htomerif8 жыл бұрын
SomethingUnreal Generally you have to have an enhance/suppress condition for connections or a live/die condition for individual nodes in a network. Like, if you want a servomechanism and camera to follow a red ball, a training algorithm generally needs to suppress connections more severely the further it is from the ball and enhance connections the closer it gets to the ball. So by "training algorithm", I mean "the thing that analyzes the input and output and decides whether the current network state is doing better or worse than the last network state." It looks like maybe the "criterion" in the code is what I'm talking about. Reading other people's code is one of my least favorite activities (no offense), but my best (most likely incorrect) guess is that it's based solely on the cumulative numeric deviation from the original audio file? If that's the case then yeah, I would kind of expect the output to be some snips of time-synchronized copies of the input data repeated a lot. I know this is getting TL;DR, but it might be interesting to use frequency-domain data (obviously you already know that). I've used FFTW3 for that general kind of thing, and if Lua is your language of choice, I'm sure there's an FFTW library with Lua hooks. Possibly quite a bit slower, though, if you were actually using CUDA.
@SomethingUnreal8 жыл бұрын
+htomerif I was using CUDA (it improved speed by about 4x). I don't know the exact way the loss is calculated, but by my understanding, it's not calculated by comparing the network's predicted output to that of the main training set. The original file is split into a large training set and 2 smaller sets ("test" and "validation"). It appears to regularly compare the predicted output against the "test" set, and whether it's getting better or worse here influences the weights, which is why it doesn't generate perfect copies of the training set - it's never "seen" the test set before. If the original data is a short loop repeating many times, so that the same loop is repeated over and over in the training, test and validation sets, then all it does is perfectly memorise as long a sequence as it's possible to store in the network and blindly spit that out over and over. EDIT: I may have confused "test" and "validation". The usage of these according to torch-rnn's code and according to other posts I've seen seem to contradict each other, unless I've misunderstood something a lot...
@htomerif8 жыл бұрын
SomethingUnreal I think I see what we're getting at here. So the network is fed a small test sample of the input, and then its output is compared with *what should have come next*. That is the bit I was calling the "training algorithm". I did notice that the code had 2 distinct states, a training state and a running state. So the training state is never fed the entire file, but the running state *is*, for purposes of the video. But yeah, the terminology is basically an instant pitfall as there's huge variation in what means what across the field of AI programming. Also, any or all of what I said up there could be wrong. I think I get the gist though.
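That "compare with what should have come next" step is just next-sample prediction. A minimal sketch in PyTorch (all sizes are assumptions, and torch-rnn itself is Lua/Torch, not PyTorch):

    import torch
    import torch.nn as nn

    seq = torch.randint(0, 256, (1, 101))        # dummy byte sequence (8-bit samples)
    inputs, targets = seq[:, :-1], seq[:, 1:]    # target = the next sample at each step

    embed = nn.Embedding(256, 64)
    lstm = nn.LSTM(64, 128, batch_first=True)
    head = nn.Linear(128, 256)

    hidden, _ = lstm(embed(inputs))
    logits = head(hidden)                        # per-step scores over the 256 classes
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, 256), targets.reshape(-1))
    print(loss.item())                           # ~ln(256) ≈ 5.5 before any training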
@augustinushipponensis30218 жыл бұрын
Have you played it back more slowly? I think your algorithm was being too efficient. :)
@blazelega29858 жыл бұрын
5:47 my longest "IIIII" ever
@ForgottenDawn8 жыл бұрын
Yoko Ono's new album, ladies and gents. *kill me*
@bobalinx87626 жыл бұрын
THIS is what killed the Beatles?
@curoamarsalus78228 жыл бұрын
Hmm, I'm extremely curious to see how this would sound with a normal voice spoken in a decent range.
@OrchidAlloy8 жыл бұрын
That is a good idea, excellent feedback, and a sick burn, all at the same time.
@curoamarsalus78228 жыл бұрын
Well, I don't mean it to be a burn. It's just that this voice at that quality is physically painful to listen to (for me at least).
@hecko-yes8 жыл бұрын
+Daniel T. Holtzclaw Look up WaveNet (if it doesn't find it, try "wavenet samples").
@LivvyHackett8 жыл бұрын
What is it learning? Is it trying to copy the phrase? Or make its own sentences?
@SomethingUnreal8 жыл бұрын
It's trying to learn how to make audio that sounds the same (without being able to simply store it). Or more technically, it's learning the probability of each of the 256 possible vertical waveform positions given all of the previous ones.
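In other words, generation is just repeated sampling from that learned distribution. A toy sketch, where the uniform predict() is a stand-in for the trained network:

    import numpy as np

    def predict(history):
        # Stand-in for the trained network: a probability over the 256
        # waveform positions given everything generated so far.
        return np.full(256, 1.0 / 256)

    history = [128]                                   # start at the waveform midline
    for _ in range(8):
        history.append(np.random.choice(256, p=predict(history)))  # next position
    print(history)                                    # one new 8-bit sample per step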
@LivvyHackett8 жыл бұрын
oh? sounds interesting
@BigOlSmellyFlashlight6 жыл бұрын
SomethingUnreal oh it's 8 bit
@GroovingPict8 жыл бұрын
"cute voice"... yeah, if by cute you mean intensely annoying
@VictorW238 жыл бұрын
At about 2:04 it starts to sound like a baby learning how to speak. Creepy, but still amazing
@bingbongshamalama8 жыл бұрын
This is a mind-blowingly awesome outcome for this network. I had an idea similar to this a few years ago but never implemented it. This makes me wonder about how you could develop a set of learned words and string them together somehow. Not sure how to overcome how unnatural that would probably sound, though. Great stuff!
@SomethingUnreal8 жыл бұрын
Thank you! I was thinking something similar to that, but I have no idea how to program it. Something like manually transliterating the training data, then feeding it both the text and the audio so it associates a sound with each word - in other words how a particular voice pronounces text. Then, ultimately being able to give it text and have it read it in that voice. I believe some people are using the reverse of this for speech recognition. This would be much easier in a phonetic language like Japanese. Although that would make it all the more impressive if it learned the many rules of English pronunciation without the need for me to put some intermediate stage in where it converts text to/from something like IPA. I may be getting ahead of myself. =P
@katagonnАй бұрын
This video is a historical artifact
@arthurdent62568 жыл бұрын
Neat, very reminiscent of pre-language gibberish in young children. How does this thing learn exactly?
@SomethingUnreal8 жыл бұрын
I find those exact details hard to explain (and I don't understand the exact maths being used anyway), so I would recommend looking at the few videos by the channel "Computerphile" regarding neural networks and machine learning to start with. Wall of text incoming...

An over-simplified analogy is to imagine a row of cups that can collect water, and a hose that will spray water into one of them at any one time. The hose's aim changes in a pattern following the sound waves of a voice. This is like the input layer of the network, with a node for each possible vertical position of the waveform of the training audio (256 positions because it's 8-bit audio). Multiple pipes flow downwards from each cup to another series of containers underground. Every container on that second layer has a pipe between every cup above it, and there's also a valve on every pipe. This is the first "hidden layer", which has 680 nodes (containers) in this case. Below that is another layer of containers, each connected to every container on the layer above it with a valve on every pipe. This is the second hidden layer, and there are three hidden layers in this network. At the very bottom is an output layer of cups, connected to the last hidden layer in the same way, except there are only 256 cups (the same as the input layer).

At the beginning of training, all valves are set to random positions (from closed to open). During training, we move the hose over the cups (a sequence of values enters the network) and water passes through the network. The computer adjusts all the valves (about 1.27 million here if my calculations are correct) slightly, over and over, to try to make the output gradually come closer to resembling the input.

This is where my analogy starts to fall apart, because each "container" actually has a kind of secondary "memory chamber" of its own that the computer can choose to change the water level of or not. Choosing not to allows it to store its current value, and the combined effect of lots of nodes doing this is that past patterns in the data are remembered (even from many thousands of steps ago) that may be important in learning long-term patterns. So a container can remember a value even if it passes some water from its memory chamber down to a lower container... that's more confusing if you're thinking of water.

Anyway, the end result is that how much water is in a container (the activation of a node) is based on a combination of the settings of the valves (weights) that are in all the pipes (connections) above it, and on past sequences of values. The cups on the output layer fill to varying degrees, and the one with the most water "wins". Actually, you can imagine sorting them into order of how full they are and applying a probability curve, so that if one is 95% full and another is 93% full, there's a good chance that you might use the 93%-full cup as your answer, but you would be incredibly unlikely to choose a cup that's only 2% full.

After training, when sampling from the network (generating output), you pass this answer from the output layer (which represents your first waveform position) back into the input layer, and what comes out will be your second waveform position, etc., so that the next waveform position is based on all those that came before it (during sampling) and also on all of the patterns learned during training. The difference is that you don't actually save the state of the network to a file after sampling, only after training.
Another key point that didn't fit into my analogy is that during training, you actually split your training data into two uneven pieces (something like a 90/10 split) - a large piece that we feed to the network, and a small piece called the "validation set" which we never allow the computer to see while it's making adjustments. The computer then, after adjusting weights for some time, can compare what it has learned so far to the validation set, which it has never seen before. If it is learning the important patterns (rules for generating output), then it should still be able to produce output that looks similar to the validation set, even though it's never seen the validation set before. This can be used to detect overfitting, which typically happens if the network is orders of magnitude bigger than the training data, and it ends up training itself to "perfection", perfectly memorising the entire sequence of training data. Then, even though it may score a 0.00 error during training, it'll have a huge error on the validation set, which it never saw before. It's like a child memorising all the answers to a test while not actually learning any of the rules. The network should be too small to be capable of storing all training data so that it is forced to learn patterns in the data, or "shortcuts" (rules) to producing similar data. The exact maths that the computer uses to make decisions about how to adjust the valves is, I suppose, the key. It's called backpropagation and is way beyond my understanding, but I can see that even in the list of related videos someone has made a video explaining it. It is based on the fact that every node's value relies on those connected before it, and works backwards through the connections to minimize a calculated "error" score in the end result, but I can't pretend to understand how that works. =P
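A bare-bones sketch of that split (the file name and 90/10 ratio here are assumptions; torch-rnn handles this itself):

    data = open("voice.raw", "rb").read()     # hypothetical raw training bytes
    split = int(len(data) * 0.9)
    train_set, val_set = data[:split], data[split:]
    # Train only on train_set; periodically measure the loss on val_set.
    # Training loss near zero while validation loss stays huge is the
    # "memorised the answers without learning the rules" case described above.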
@SomethingUnreal8 жыл бұрын
(I welcome any corrections to my dodgy analogy~)
@arthurdent62568 жыл бұрын
Thanks for responding, I've seen a few other videos since that kind of explain the same thing.
@VenetinOfficial Жыл бұрын
Had to come back to this video because I wanted to see just how far we've come since this. We have AI that can fully convert your voice (or anyone's) into anyone else's now, AND at a full 44.1 kHz sample rate. kinda fucking stupid LOL
@rommix0 Жыл бұрын
You know, honestly, I'm more amazed at how creative people used to be with AI, like this guy. Most people that really use it now mostly do so for classification and chatbots. I kinda wanna see something as bonkers as this in the deep learning community again.
@justinslab20358 жыл бұрын
so that's what it sounds like when you put a chibi into a blender. xD
@durdleduc85206 жыл бұрын
*aaaaaaaaaaaaaaaaa*
@bobalinx87626 жыл бұрын
Oh my...
@rommix0 Жыл бұрын
6 years later, and I almost forgot about this video. This was the start on my neural network journey. It's fun to look back on it.
@bigbox89928 жыл бұрын
To use a waifu voice is to play a dangerous game.
@suwinkhamchaiwong83827 жыл бұрын
7:13 *That moment when you play too much Pokemon,* You talk about it!
@CorruptedMuse8 жыл бұрын
It's like the audio equivalent of the uncanny valley
@Lemonidas758 жыл бұрын
It's like watching Skynet learning.
@LUCABALUCA8 жыл бұрын
not my proudest fap
@fzero88218 жыл бұрын
i just kept hearing shine shine
@fzero88218 жыл бұрын
freaky neural network, man
@ImDelphox8 жыл бұрын
Fzero Fz Heard it trying to become a Pikachu several times later on ("pika" @ 7:54)
@maggintons8 жыл бұрын
0:26 when you realise you're dead and your memories have been uploaded into an AI program.
@Fuzzthefurr8 жыл бұрын
0:40 is what I hear whenever I try to watch anime.
@Teth478 жыл бұрын
It got pretty close given the limited training data. That's pretty impressive.
@sharkpaw4 жыл бұрын
idk anything about computers and AI but making an anime girl computer scream and a Miku profile pic is enough to make me sub
@greenhound8 жыл бұрын
2:17 is hilarious basically hentai audio
@obnoxendroblox86036 жыл бұрын
wtf is this what hentai is?
@i_am_boredom4 жыл бұрын
. . .
@junkeyz4 жыл бұрын
What kinda hentai are you watching where the girl goes "sshshshhyyyaAAAAAAAAAAAAOAOAOIIIIIOOO9X
@SlyHikari034 жыл бұрын
@@obnoxendroblox8603 she sounds like she’s getting electrocuted.
@mrmaniac38 жыл бұрын
5:00 Kawaii
@xtrashocking7 жыл бұрын
k.....wAH-YEEEEE
@Heliocentricity4 жыл бұрын
@@xtrashocking eHehhEHehehEHehehhEHHe
@satan25835 жыл бұрын
2:49 sounds like “itai yo!” or “it hurts!” They say a computer can’t feel emotions. I’m pretty sure this is an exception.
@liammckewic32185 жыл бұрын
2:02 Waluigi goes "waa!" 2:07 HALT! HAAAALT! 3:53 HOOOLD UP! 5:54 The Deadly Screech of Four. Has. Returned.
@yazuak8 жыл бұрын
some of the more clear japanese i heard: 7:24 "ota~ku no terai" 7:27 "mina-sama" 7:28 "sono fuinki" 7:29 "sugoi wakuwaku" 7:33 "zehi koto wo ---" 7:44 "kiette tari, tenki" 7:52 "jitei(??) koto yo~~" 8:00 "--kouki(?) ni ike" 8:11 it almost says ""hitotsu tano_sase(te)kureta(de)shou" which would be nearly actual japanese
@creeperlamoureux8 жыл бұрын
Why isn't there more of this?
@2sighkick2furious398 жыл бұрын
When all was said and done, the neural network told a story of sadness
@porygonlover3228 жыл бұрын
I'm almost certain it said "itai yo" (it hurts) at 12,000
@porygonlover3228 жыл бұрын
59,000 ended with "masu" which is a way Japanese sentences actually end
@MilanKarakas8 жыл бұрын
Of course the machine said "it hurts". What would you say if you were plugged into 240 V AC like this machine was? :D LOL
@thetoontastictoon17208 жыл бұрын
Kzinssie (porygonlover322) Very educational. This must mean that in Puyo Puyo, the "level start" sentence spoken by Arle must be saying "masu" and not "natsu" like I always thought-
@SomethingUnreal8 жыл бұрын
+The Toontastic Toon: batan kyuuu!
@thetoontastictoon17208 жыл бұрын
SomethingUnreal xD
@zephyr7337 жыл бұрын
Love how its first reaction was basically screaming.
@forstnamelorstname41698 жыл бұрын
How do you begin to learn this stuff?
@ShadowriverUB8 жыл бұрын
Phven Lacro it's similar to the concept of evolution. A program manipulates the neural network, and there's a scoring system that motivates the program: it keeps changes that give a higher score and avoids changes that lower the score. I'm not sure about this case, but that's how most machine learning systems work; search "machine learning"
@АлександрБагмутов8 жыл бұрын
Wow! This is really creepy! Imagine one day you tell your robo-maid to bring some tea, but instead of silently obeying as always, she just stares at you and, trying to mimic a human voice, creaks: "ihihi... waaaaaaaaa... na kiiil..." "ihihi."
@SomethingUnreal8 жыл бұрын
I'd just have to put her through some more "training". (Speaking of creepy)
@austintasato71028 жыл бұрын
It's probably not going to generate words, as they're given no special connotation in the data set, right? Perhaps if one were to give it a supervised learning set with dialogue scripts, maybe something more realistic would arise?
@SomethingUnreal8 жыл бұрын
Austin T: That's exactly what Google's doing in their recently-announced project WaveNet! They trained a network on audio and text (2 inputs), and also audio + text + speaker ID (3 inputs), and they can now ask it to generate the audio if they give it only the other input(s). I.e. It's like a speech synth that blows away traditional speech synths in terms of how real it sounds. Really amazing stuff.
@inswedishmynameisdik8 жыл бұрын
never thought skynet would be kawaii
@Hurakion8 жыл бұрын
thanks to this video i now know how the WKCR Hijacking "demon voice" was created. it's so similar to this
@SpanglesYT2 жыл бұрын
This sounds so scary. It can be used for a cyber horror flick.
@vladislavdracula17638 жыл бұрын
That was both amazing and disturbing...
@not-lukas66636 жыл бұрын
4:54 it starts words but then cancels them, though you can sometimes understand them (yes, I understand a little Japanese)
@MultiSciGeek8 жыл бұрын
This is really interesting. Where can I learn more about this or possibly try to recreate this experiment?
@SomethingUnreal8 жыл бұрын
This is where I first discovered neural networks: karpathy.github.io/2015/05/21/rnn-effectiveness/ It talks about a program called char-rnn, and I'm just using a more-efficient fork of that called torch-rnn ( github.com/jcjohnson/torch-rnn/ ).