Typical Decoding for Natural Language Generation (Get more human-like outputs from language models!)

18,969 views

Yannic Kilcher

#deeplearning #nlp #sampling
Modern language models like T5 or GPT-3 achieve remarkably low perplexities on both training and validation data, yet when sampling from their output distributions, the generated text often seems dull and uninteresting. Various workarounds have been proposed, such as top-k sampling and nucleus sampling, but while these manage to somewhat improve the generated samples, they are hacky and unfounded. This paper introduces typical sampling, a new decoding method that is principled, effective, and can be implemented efficiently. Typical sampling turns away from sampling purely based on likelihood and explicitly finds a trade-off between generating high-probability samples and generating high-information samples. The paper connects typical sampling to psycholinguistic theories on human speech generation, and shows experimentally that typical sampling achieves much more diverse and interesting results than any of the current methods.
Sponsor: Fully Connected by Weights & Biases
wandb.ai/fully-connected
OUTLINE:
0:00 - Intro
1:50 - Sponsor: Fully Connected by Weights & Biases
4:10 - Paper Overview
7:40 - What's the problem with sampling?
11:45 - Beam Search: The good and the bad
14:10 - Top-k and Nucleus Sampling
16:20 - Why the most likely things might not be the best
21:30 - The expected information content of the next word
25:00 - How to trade off information and likelihood
31:25 - Connections to information theory and psycholinguistics
36:40 - Introducing Typical Sampling
43:00 - Experimental Evaluation
44:40 - My thoughts on this paper
Paper: arxiv.org/abs/2202.00666
Code: github.com/cimeister/typical-sampling
Abstract:
Despite achieving incredibly low perplexities on myriad natural language corpora, today's language models still often underperform when used to generate text. This dichotomy has puzzled the language generation community for the last few years. In this work, we posit that the abstraction of natural language as a communication channel (à la Shannon, 1948) can provide new insights into the behaviors of probabilistic language generators, e.g., why high-probability texts can be dull or repetitive. Humans use language as a means of communicating information, and do so in a simultaneously efficient and error-minimizing manner; they choose each word in a string with this (perhaps subconscious) goal in mind. We propose that generation from probabilistic models should mimic this behavior. Rather than always choosing words from the high-probability region of the distribution--which have a low Shannon information content--we sample from the set of words with information content close to the conditional entropy of our model, i.e., close to the expected information content. This decision criterion can be realized through a simple and efficient implementation, which we call typical sampling. Automatic and human evaluations show that, in comparison to nucleus and top-k sampling, typical sampling offers competitive performance in terms of quality while consistently reducing the number of degenerate repetitions.
Authors: Clara Meister, Tiago Pimentel, Gian Wiher, Ryan Cotterell
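
For readers who want to see the decision criterion from the abstract as code: below is a minimal, illustrative PyTorch sketch of the idea (not the authors' reference implementation — see the repo linked above; the name typical_filter and the cumulative-mass parameter tau are placeholders chosen here for illustration). It keeps the tokens whose surprisal is closest to the conditional entropy until a cumulative probability mass tau is covered, then samples from that restricted set.

```python
import torch
import torch.nn.functional as F

def typical_filter(logits: torch.Tensor, tau: float = 0.95) -> torch.Tensor:
    """Mask logits so that only 'typical' tokens remain (illustrative sketch)."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # conditional entropy = expected information content of the next token
    entropy = -(probs * log_probs).sum(dim=-1, keepdim=True)
    # distance between each token's surprisal (-log p) and that entropy
    shift = (-log_probs - entropy).abs()
    # sort tokens by that distance and keep the closest ones
    sorted_shift, sorted_idx = torch.sort(shift, dim=-1)
    sorted_probs = probs.gather(-1, sorted_idx)
    cum_probs = sorted_probs.cumsum(dim=-1)
    # smallest prefix of "most typical" tokens covering tau of the probability mass
    cutoff = (cum_probs < tau).sum(dim=-1, keepdim=True).clamp(max=logits.size(-1) - 1)
    keep_sorted = sorted_shift <= sorted_shift.gather(-1, cutoff)
    keep = torch.zeros_like(keep_sorted).scatter(-1, sorted_idx, keep_sorted)
    return logits.masked_fill(~keep, float("-inf"))

# usage: sample the next token from the filtered distribution
logits = torch.randn(1, 50257)                  # stand-in for a model's next-token logits
filtered = typical_filter(logits, tau=0.95)
next_token = torch.multinomial(F.softmax(filtered, dim=-1), num_samples=1)
```

The only difference from nucleus sampling in this sketch is the sort key: nucleus sampling sorts tokens by probability, while typical sampling sorts them by the distance between their surprisal and the entropy.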
Links:
Merch: store.ykilcher.com
TabNine Code Completion (Referral): bit.ly/tabnine-yannick
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: ykilcher.com/discord
BitChute: www.bitchute.com/channel/yann...
LinkedIn: / ykilcher
BiliBili: space.bilibili.com/2017636191
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 59
@YannicKilcher • 2 years ago
OUTLINE: 0:00 - Intro 1:50 - Sponsor: Fully Connected by Weights & Biases 4:10 - Paper Overview 7:40 - What's the problem with sampling? 11:45 - Beam Search: The good and the bad 14:10 - Top-k and Nucleus Sampling 16:20 - Why the most likely things might not be the best 21:30 - The expected information content of the next word 25:00 - How to trade off information and likelihood 31:25 - Connections to information theory and psycholinguistics 36:40 - Introducing Typical Sampling 43:00 - Experimental Evaluation 44:40 - My thoughts on this paper Paper: arxiv.org/abs/2202.00666 Code: github.com/cimeister/typical-sampling
@norik1616 • 2 years ago
I love how you explain even the basics (top k, top p, beam search) - 2 years ago it made your videos very approachable and now I can "just read" much of the research directly myself. Thanks!
@arirahikkala • 2 years ago
NovelAI users have been messing with this. The service has a UI for quickly configuring and testing sampling methods. It's limited to straight-line decoding (so no beam search or generate-many-and-filter etc.) but pretty capable within that, so you can do stuff like e.g. first apply a temperature of 0.9, then tail-free sampling at 0.975, then top-k at 10, and the result of that whole chain is then the distribution you actually sample from. People test, share and talk about sampling settings a decent amount. It turns out that basically nobody uses just typical sampling by itself. Most of the most popular settings combine multiple methods, maybe with typical sampling as one step, maybe not involving it at all. The community's process is of course incredibly unscientific, as the only thing they have to go on is how nice their generations feel to read, but however little they know, one thing they seem to have agreed on is that typical sampling is not quite the magic Right Way To Sample From A Language Model yet.
@laurenpinschannels • 2 years ago
I don't think tail-free sampling with top-k actually mimics the effects of typical sampling terribly closely, though I could be wrong
@Kram1032 • 2 years ago
At least based on that one example generated at the end of the paper, I'm not very surprised. It feels like a solid step in the right direction, but insufficient on its own.
@arirahikkala • 2 years ago
@@laurenpinschannels That's correct, it doesn't. They're not trying to mimic typical sampling, they're just trying to get generations that they like.
@oncedidactic • 2 years ago
Decoding options — definitely an area ripe for improving certain applications. It still feels like a hack to coax the desired type of behavior out of a statistical hairball (whatever your model made out of your data). Your model ate the data and got a lot of 1) background knowledge (associative) and some 2) statistical understanding of the rules of language — mostly sentence structure, both hard syntax/parts of speech and turns of phrase. It seems like what you really want to do for next-level decoding, to get past the "hack fishing", is to be explicit about picking from a distribution of categories, not instances (single vocab objects). Yes, this sounds like abstract reasoning, but I think you could start modest and build up. You could even run multiple decoders in parallel for levels of abstraction and use some kind of combination like "expert voting" or whatever. This gives you explicit knobs for types of output behavior and not just a coarse stats-based sieving tied back to the input data distribution. Sorry for the word salad, hope the above makes sense. Looking forward to the interview! Great topic.
@_paixi • 2 years ago
28:49 With nucleus sampling I find it useful to sample a log-normal distribution for the temperature: lognormvariate(log(median - min), sigma) + min This will spike the temperature every now and then to introduce something interesting and then continue the text in a predictable, sensible way. (For general purpose I start with a median of 0.9, min of 0.7 and sigma of 0.6.) So I don't think sampling tokens to match an expected amount of information is the full picture. That makes sense to transmit information efficiently but another important factor is the novelty of the text, which sometimes, not all the time, is going to be bursts of information. And that information transmitted has to be valuable, which isn't going to be found in the likelihoods unless the model is fine-tuned with human feedback like InstructGPT, but if you do that then the sampling trick doesn't make sense to me to do anymore. I would like to see the shifted scores of each token in human and model generated text by highlighting them in color. I imagine in the human samples names and new words not seen yet in the context will have high information which would cause them to be removed from the selection process. This is noticeable in the generated samples in the paper too. Their method seems to avoid using names when not prompted to use them, but this could be desirable to control generation for certain use cases to minimize hallucinations.
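A minimal Python sketch of the temperature trick described in the comment above, using its suggested defaults (median, minimum, and sigma mirror the commenter's parameters; this is illustrative, not something from the paper):

```python
import math
import random

def spiky_temperature(median: float = 0.9, minimum: float = 0.7, sigma: float = 0.6) -> float:
    # log-normal with median (median - minimum), shifted up by minimum:
    # usually near `median`, but occasionally spikes well above it
    return random.lognormvariate(math.log(median - minimum), sigma) + minimum

# draw a fresh temperature before each nucleus-sampling step
temperatures = [spiky_temperature() for _ in range(8)]
```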
@tchlux • 2 years ago
Another way to notice what is wrong with the word-wise generation of text done for decoding (10:00) is to think about why language is used in the first place. Usually a person has a thought (a foreseen sequence of perceptions and actions that achieve some objective), and they want to convey that thought to other people. Language is used to represent an already-constructed sequence of information (in the human brain). That reveals a truth: the likelihood of the next word is independent of the thought. People usually try to pick sequences of words that maximize expectedness for convenience, but overall there is no reason that the thought they are encoding into words is an "expected" one! Any time some researcher writes a paper, they are reporting on *new* knowledge (something *unexpected*). So predicting the "most likely" next word has far more limited applications than already having a thought and trying to encode it with language. It can definitely still be useful, but not as much as language is for a human. Thanks for the great content as always Yannic!
@Dmitriuso • 2 years ago
Great stuff as usual. The question of typical sampling on already-sampled, human-made training text is absolutely justified. I'll be looking forward to the interview with the authors to hear their perspective on this.
@user-rr7uh4wm5y • 2 years ago
Interesting approach! I think maybe we can use different language models for H(q(...)) and for log(q(...))? 1. q for the second term is our model 2. q for the expected information H is the language model of our listener. For example, if I want to explain something to a child, I would expect them to have a simpler language model with higher entropy H. Therefore I would need to use simpler, more generic words with higher probability, so that my log q will also be higher. But if I were to explain the same concept to a more knowledgeable person, I would use words with lower log q, and therefore with more information in each word. So it will take fewer words to explain it to a person who already knows something.
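A hypothetical sketch of the two-model idea above — q_speaker supplies the surprisal term and q_listener supplies the expected information content; this is only the commenter's proposal made concrete, not something from the paper:

```python
import torch
import torch.nn.functional as F

def listener_aware_shift(speaker_logits: torch.Tensor, listener_logits: torch.Tensor) -> torch.Tensor:
    # surprisal of each candidate word under the speaker's model q_speaker
    speaker_logp = F.log_softmax(speaker_logits, dim=-1)
    # expected information content (entropy) under the listener's model q_listener
    listener_logp = F.log_softmax(listener_logits, dim=-1)
    listener_entropy = -(listener_logp.exp() * listener_logp).sum(dim=-1, keepdim=True)
    # small values = words whose surprisal matches what this listener expects
    return (-speaker_logp - listener_entropy).abs()
```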
@Supreme_Lobster • 2 years ago
Can't wait for the author interview for this paper — the ideas seem very interesting, and your questions seem very interesting too.
@Tehom1 • 2 years ago
Interesting, but I just imagine this generator studiously avoiding "the" and "a" because they're just too common. I suspect there must be more to it than that. Perhaps it needs to match the information content of some sliding window of output instead of words or small chunks?
@alonamaloh • 2 years ago
I only got as far as the hypothesis, but it doesn't make a whole lot of sense. What the hypothesis would mean is that, given a base language model, we can produce a better language model by boosting the probability of words whose information content is close to the conditional entropy. So they are just describing one possible bias language models may have. In particular, the hypothesis cannot be true for a really good language model.
@chrisray1567 • 2 years ago
I was unaware of the details of decoding transformers. It’s a more difficult process than I imagined. The current methods are clever but do seem fairly naive, and although typical sampling seems like a step in the right direction, it doesn’t have any planning in it. I don’t think sampling from a probability distribution one or a few words at a time ever will.
@ilikenicethings • 2 years ago
But isn’t it possibly the case that the probability of the next word is based on higher level abstractions in other layers of the network? So in a sense the next word predictions are searching towards those abstractions? Otherwise how could the language models construct such coherent sentences at all. I’m not an ML expert so this is just a thought that occurred to me while listening to the video.
@herp_derpingson • 2 years ago
5:15 "When humans generate text they just don't generate the most likely" — by definition they do! All the text that we are feeding into the neural network was written by humans. We *define* most likely as most likely to be generated by a human. If a human wants to put in some variety, we capture that too.

23:46 Is this effectively a repetition penalty? If a word already exists in the generated sentence, the information provided by the next word will be less.

28:45 In other words, we can say that the information per word in an article varies very little. That is to be expected, because articles are usually written by a single author, and the author's stylistic choices make them transmit a specific amount of information per word.

31:30 Language models try to mimic human likelihood, so the generated text should have similar information per word to the "average human" and not to a specific author.

43:40 One example doesn't prove much. Who knows how cherry-picked it is?

46:00 Nice to see you came to the same conclusion as I did.

All in all, I think there is an inherent fallacy in the approach. Let's say we train a model on only very high entropy/information content texts. The text now generated by the model is "high entropy" compared to average human text, but it is low/average entropy compared to the training set.
@YannicKilcher • 2 years ago
Yes, as you say, I'm still grappling with the idea. It seems to make sense on one hand, but also seems to be a contradiction on the other. I'll need to do some math of my own on this.
@laurenpinschannels • 2 years ago
humans don't generate the most likely under a general language model, they generate the highest likelihood in a goal-conditional way. Each word adds a varying amount of information vs other words, but humans rarely say things in a row that are entirely implied by the stuff they had been saying earlier in the same part of the thing that they were talking about and saying the very most low information continuation of the thing that you are saying makes you use way more words like I am doing now as I add more redundant words to this sentence, because increasing likelihood tends to increase redundancy. unlikely conditioned only on the words, but I think not so much unlikely conditioned on the entire value system of the person.
@drdca8263 • 2 years ago
@@laurenpinschannels wow, when I was reading the deliberately redundant sentence, I just lost focus on it, in a way that reminds me of how some describe some gpt-generated text. Also, that seems like a good point, but I also feel like I would need to think about the math more in order to better evaluate whether it is. Ok, here’s a thought: If we are sampling with probabilities that are a good estimate, and sampling P(next token | the person and other stuff, and the previous tokens) then, that should be like, the correct distribution, sorta by definition... I guess? (Though we don’t really have a way to specify the person and context bit? Uh...) And, If we look at P(the person and context (aside from the previous tokens) | the previous tokens) Well, P(the next token, the person and context | the previous tokens) = P(the next token | the previous tokens, the person and context) * P(the person and context | the previous tokens) And uh, Well, sampling probabilities in theory shouldn’t depend on like, the order you do the conditioning? Sampling (X,Y) given Z, should have the same distribution if you sample Y given Z and then sample X given Y and Z, as if you sample X given Z, and then sample Y given X and Z. So if in one case, (hypothetical , not meant to be computationally feasible) you take the prompt, use it to sample the person+context , and then use those to sample the next token, and then continue using the same value of person+context to sample subsequent tokens, and only discard the info about person+context at the end, presenting only the predicted tokens, is the distribution you would get from that, any different from the distribution for if you, each time you were going to predict the next token, sampled a person+context based on the combination of the initial prompt and the tokens already produced, used it, along with the prompt and tokens-so-far to sample the next token, and discarded the samples info about the person+context ? I feel rather unsure about whether the distribution of tokens produced would differ between these two versions. Maybe they could be, like, theoretically the same if each of the conditional probability bits was perfect, but seeing as it can’t be perfect, would in practice differ substantially? (Whatever “in practice” means for an unrealistic hypothetical.)
@herp_derpingson • 2 years ago
@@laurenpinschannels I didn't even realize that the sentence was deliberately made redundant. There are many people on the internet who write this way. But yes, the typicality penalty makes sense for goal-conditioned texts. But right now we don't have a reliable way to do that. If we use typicality on non-goal-conditioned texts, it effectively just works as a repetition penalty.
@ketilmalde3402 • 2 years ago
@@laurenpinschannels Hah, nicely done. Still, I think this just highlights a limitation of the language model? It fails to take into account global/long distance information. From that point of view, this looks like a hack to circumvent - or create an appearance of circumventing - that limitation.
@cPho3nix • 2 years ago
"Forward Mode Method to Compute Gradients Without Backpropagation" Do you think you can review this paper?
@toposes6944 • 2 years ago
Regardless of linguistic or psychological justifications, the method seems to be an effective way to exclude not only the lowest-probability samples (as in top-k or nucleus sampling) but the highest-probability samples as well, without introducing too many hyperparameters. I think that cutting off low-probability samples could be understood as compensating for the error in estimating the true distribution (caused by the model not having seen low-probability examples many times, or by the capacity being too low). Then my question is: what would be the effect of excluding high-probability samples? Could it help the model make more robust estimates of the probabilities of subsequent tokens?
@jabowery • 2 years ago
There is an interesting intersection between Shannon information and algorithmic information here. In an ideal Shannon environment, that is to say there is zero error in transmission, the optimal generator sends only the algorithmic information: the smallest program that, when run, produces the desired bit string in the listener. Of course the smallest program must be written in a language understandable by The Listener. So the speaker must have a model of the listener. This gets into computer-aided instruction theory but I won't go there right now. Where Shannon information comes in obviously is the error rate of the channel. How much error tolerance does a computer have for algorithm specification? Somewhere between these two extremes of information we have natural language.
@Peter.Wirdemo • 2 years ago
Couldn’t aiming for a ”wide” probability distribution over the next token be, in itself, a good way to sample? Say, a beam search over the current next tokens that picks the one which creates the widest distribution for the subsequent token after that? I assume such an approach could act as an alternative way to avoid the most-expected top-probability output, just by avoiding the peaked situations before they arise. Basically, sampling that follows the path while always trying to maximize possible future paths.
@Guytron95 • 2 years ago
It would be interesting if there's a setting that would work for generating code/programs à la AlphaCode/GPT-3, etc.
@Giseeese • 1 year ago
Actually, I wonder how to compute the amount of information of a specific word/token in the distribution? The entropy is only an expected value, right?
@anastasiadunbar5246 • 2 years ago
What if typical sampling was used to generate music (e.g. chord progressions)? I know that Markov chains produce repetitive phrases. I also think the typicality is interesting because there isn't really such a thing as true normal (apart from the most frequent).
@beans2874 • 2 years ago
"It can be used for big or small models. We don't discriminate here." Pun intended?
@YannicKilcher • 2 years ago
always
@JuliusSmith • 1 year ago
When I reflect on how I communicate, I _start_out_ saying typical things in order to set up a context. Then, when I feel the listener is adequately prepared, I pounce with my high-information, low-likelihood pronouncements. To model this, we need to skew the high-information words/phrases toward the end of the paragraph/sentence/whatever. "The cat is in the ... wait for it ... TREE". Autoregressive language models that predict only one token ahead completely miss dynamics such as this. An avenue for improvement is elevating the scope of a token to the phrase level and beyond.
@JuliusSmith • 1 year ago
Maybe a token should embed up to the cliché level
@snippletrap • 2 years ago
I would have liked some qualitative examples. That is the entire reason for a new decoding scheme — the text generated with existing methods does not always *feel* right. The entire thing stands or falls on this one criterion: subjective evaluation. And it’s missing from the discussion.
@soumyasarkar4100 • 2 years ago
Isn't this similar to adding a temperature hyperparameter and then sampling? Having a temperature variable will also discourage picking top probabilities and very low-probability tokens.
@ketilmalde3402 • 2 years ago
I don't think so, since a softmax with temperature (which I think you are talking about) will still have the highest softmax value as the most likely choice. Here, it sounds like they have a higher probability of selecting intermediate softmax values.
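To make the distinction in the reply above concrete, here is a tiny illustrative sketch (not from the paper): temperature rescaling flattens or sharpens the distribution but leaves the highest-logit token as the single most likely draw for any T > 0, whereas typical sampling can drop it from the candidate set entirely.

```python
import torch
import torch.nn.functional as F

def temperature_sample(logits: torch.Tensor, T: float = 1.3) -> torch.Tensor:
    # dividing by T flattens (T > 1) or sharpens (T < 1) the distribution,
    # but the ordering of tokens by probability is unchanged
    probs = F.softmax(logits / T, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```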
@Maciek17PL • 2 years ago
How can I use it with huggingface??
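Recent versions of Hugging Face transformers expose the method through a typical_p argument to generate(); the snippet below is a sketch under that assumption (check your installed version, since older releases don't have the flag, and the prompt string is just an example):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The math teacher stared at the chalk circle and", return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=True,      # typical sampling is a sampling method, not a search
    typical_p=0.95,      # assumed flag: keep the typical set covering 95% of the mass
    max_new_tokens=60,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```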
@victoriachudinov9580 • 2 years ago
Hmm, the whole thing about the information and likelihood of words in a sequence that they present from an information-theory perspective really makes me think of tf-idf, so perhaps using tf-idf in decoding may also give interesting results.
@sampruden6684 • 2 years ago
An infinitely large and perfectly trained language model wouldn't need sampling tricks, because all of these properties that we're trying to mimic would be captured by the model, right? So this is about noticing particular things that current generations of models aren't doing a good job of yet, and finding tricks to hack around that and get better results.

I suppose this type of failure makes sense if we think of the model as focusing too much on local context and not enough on the broader picture of the whole text. It sounds like the models are learning a fairly locally greedy decoding. Perhaps the difference here is that humans have additional context - we know what we're trying to communicate. The model, on the other hand, only has access to the information in its context (and its world knowledge), so is unable to spontaneously say things which are unexpected? When models do this, they typically get criticised for imagining/inventing false information.

I can see that manually controlling aspects of the generated text can be good for taking stylistic control of the output. An obvious idea here might be to take in some body of text that you want to mimic, measure statistical properties like the ones discussed in this paper, then bias the sampling to match those measured distributions. In theory, a good model can extract that information from the prompt, but real world prompts are limited in size and therefore information content. If I want a model to write like Terry Pratchett, I can't give it a whole book as a context, but I can measure 20 statistical linguistic properties of a whole book, then sample in a way that tries to be concordant with those measurements. For example, maybe "rate of rare word usage" would be another property to try to mimic. One can imagine learning fuzzy profiles of a style instead of measuring these things manually, perhaps noticing that Pratchett likes words in a particular region of the embedding space and biasing sampling towards them.

Another interesting thing here is that there's a feedback loop. Our biased sampling becomes the new context, so the model tries to mimic the results of our biased sampling. Sampling schemes that avoid the highest probability words may run into weird fluctuations here, because if it sees in its context that it's generating medium probability words, then maybe it learns that medium probability is high probability. So now it's avoiding generating medium probability words...

Finally, you've been asking for feedback on the review/interview format of the videos. In general I really like what you've settled on, but having the interview release a day after the video is pretty awkward. I've watched the video today and it's fresh on my mind, but the interview isn't ready yet. When the interview comes out tomorrow, I have to remember to come back and watch it, which I may not get around to. If it were available now, I would. With some of your other videos, I avoided watching them when they first released because I wanted to wait until the interview was ready, but having put them off I've never gotten around to going back to them yet. Perhaps if the interviews were published but unlisted alongside the review videos, with a link in the description, then made public a day later so you still get to have two effective releases?
@Kram1032 • 2 years ago
> An infinitely large and perfectly trained language model wouldn't need sampling tricks, because all of these properties that we're trying to mimic would be captured by the model, right? Honestly not convinced that's the case. Even if you have a perfect oracle language model that just simply knows *exactly* how English language is written. Even if you somehow could get over the issue that language is actually an open set, growing and shifting as time goes on. How you sample this set is still gonna matter. If you ask such a perfect language model for the most likely sentence, sure it's gonna give better results than our models today do. But it's still gonna do something relatively bland. - It's literally what regular sampling strategies tend to do! Find the most expected next word. Aka the least surprising one. Aka the blandest one. That said, giving an AI more context is definitely gonna be helpful. - Multimodal AIs are gonna help a huge deal here. Giving the AI an idea of what a red fire truck looks like, and what images of a red fire truck tend to contain alongside it, *including* things that tend to not even be mentioned in text, is surely gonna help an image-and-text AI write more compelling texts, better grounded in reality. Every modality, every bit of embodiment you can add to such an AI is presumably gonna help.
@sampruden6684 • 2 years ago
@@Kram1032 But if text that we consider bland is rare in the dataset, the model should learn to avoid generating bland text, shouldn't it? If blandness is unexpected, then the bland choice is unexpected and therefore not likely. The most likely word should not be a bland one. If we're talking about AI writing novels, then the creativity part might be a challenge. But if we're talking about structuring text of articles, then mimicking these properties doesn't sound too hard. That does raise the question of why it isn't already happening, which perhaps suggests that I'm wrong. You could well be right, but it's not intuitive to me! It does feel a bit dangerous judging blandness by how much a word surprises the model itself. At the extreme, a model which is very good at not being bland, may become bland with this sampling method, because the bland thing is more surprising to it than the non bland thing...
@Kram1032 • 2 years ago
@@sampruden6684 that's not how it works though. Being exactly according to expectation is, in a sense, the *definition* of being bland. Like, yeah, it's potentially not gonna be *as* bland. But it's still gonna be the blandest the model will ever be. In fact, this is even gonna be true for human writers! If you just write without any sort of surprise factor, you will write a text that's perfectly human-seeming (as you, presumably 😛, are human), but it's gonna be rather bland text.
@sampruden6684 • 2 years ago
​@@Kram1032 Hmm. In theory, the model should write in the style of its prompt. By nudging it away from what it thinks is most likely, we're weakening the strength of the prompt's control over the generation - and this compounds, because our introduced deviations become part of the new prompt, because autoregression. If the aim is for the model to develop its own writing style over the duration of the output, then this bias towards unexpectedness makes a lot of sense, because it's a bias towards creativity. However, if the aim is to generate in concordance with the prompt, then I'm not sure that it should be needed. It should recognise that the prompt is not bland, and continue that not bland style. My hypothesis is that current generation models are not adopting the prompt's style strongly enough. Blandness is a result of averaging over all of the training data, but in theory could be mitigated by the model learning to be more specific to the prompt. A prompt with high surprise factor should be able to lead to generation with high surprise factor, capturable by top-p sampling. I could very easily be wrong here though! The authors of this paper know far more than I do. I expect we'll get good discussion of this in the interview video.
@Kram1032 • 2 years ago
@@sampruden6684 Style is really difficult to actually capture, and you might mean a great many things by that. The prompt may not include enough style information to begin with. - Like, if you give the AI the beginning of a story, it can not possibly know how you would tend to end stories. It only gets information about the beginnings. To give it a full idea of the style you're gonna have to basically have a corpus of that particular style ready as a prompt. Or you'll have to finetune it on that. Which may well be prohibitively expensive. I'm pretty sure no matter what, the sampling method is gonna matter. At best I could imagine you augmenting the prompt with like "surprise me!" or something.

BTW I'm not saying this "Typical Decoding" is the be-all-end-all of sampling techniques by any means. In fact, if you look at the examples in the paper, personally I'm not actually super convinced by it. - It's an improvement, but to me it's like a different kind of flat. Instead of super expected, it feels, well basically it feels like what you'd hope it to feel given the technique's purpose? - A constant level of surprising. It gives the text a strange kind of pace. It's also not the most coherent. (Though neither are the max likelihood based methods!)

I'd love to see comparisons of multiple samples across multiple models just to see how strong these actually are. I suspect the ideal sampling strategy is gonna end up being a mixture of things. Sometimes bland and expected might be exactly what you want.
@Kram1032 • 2 years ago
Looking at the results for the generated story is quite interesting.

First off, the actual human sample they gave isn't that great to start with, as it's written by somebody on Mechanical Turk and they even ended with "haven't written in a long time :)" But what that story definitely manages, which all other samples across the various techniques fail at, is, unsurprisingly, a relatively decent long-term coherence. (TBH, given that it's a human-made sample, it's actually surprisingly weak in this regard? Though still miles better than any of the AI techniques.) Of course that's not particularly surprising, as AI techniques are limited in the context they can process.

But that issue aside, their sample for Typical Decoding feels, taking each sentence on its own, the most story-like. It seems less concerned with actual facts about actual circles though. (The task was to write a short story about a student in math class accidentally drawing the first genuine magic circle in centuries) - half the time this technique seems to have made the circle into a character. The pacing feels a bit breathless too. Kinda one-note in a way that's difficult to describe. Not super eventful but also not super event-less, if that makes sense? It feels like the method falls into the same trap one level up. The sentences are indeed more informative, and so I'd argue the method is doing better than the others on this story task. However, it feels like instead of giving us the most likely sequence of words, it's giving us the most typical amount of information per word. I know that's, like, "yeah no duh, that's literally what it's supposed to do", but I'm saying that introduces yet another kind of flatness to the text. Subtler for sure, but still present.

One strange thing that I noticed in most sampling techniques, and that still seems to be present here, is that it's very reluctant to pick names. - Once names are introduced, these methods *tend* to at least *sorta* assume a coherent gender and stick to those names, bringing them up in somewhat reasonable intervals and *kinda* (but not very well) keeping identities. But at random, they tend not to do this a lot. They prefer keeping things vague with "he"s and "she"s and that's it.

I suspect as a tool for human collaboration it will genuinely be more useful than the most-likely-word version: it'll try to find "mildly surprising" and, thus, much more interesting sentences. But long-form texts *solely* based on this technique will not be that great in the end.

Of course, a couple of caveats:
- They used a finetuned GPT-2 Large for their tests. A good model, but nowhere near SOTA today. A stronger model is surely gonna give stronger results. Would love to see what GPT-3 can do with this.
- It's only a single sample. It might well be that sampling multiple times, or asking for different stories, would have given quite a bit better results. I'm extrapolating on very limited data here. The only reason I feel quite confident in my assessment despite the sample size of literally one (1) is that the generated text for this method is fairly long. Long enough to perhaps get a bit of a feel for its idiosyncrasies.
@jabowery • 2 years ago
But, but, but... what if I want to virtue signal my membership in a swarm? What about the verbal pheromones?
@anastasiadunbar5246 • 2 years ago
“Verbal pheromones” by constantly generating the most interesting text? It's very subjective, though; recommendation algorithms, for instance, aren't that good, because you get fed up with what's been shown over and over again and want something new. I wonder what makes something interesting — it's usually the search for something novel. But besides being interesting, it could instead generate text that matches the person's mood, to make it as relatable as possible so as to target them.
@jabowery • 2 years ago
@@anastasiadunbar5246 One might take my comment as sarcasm were it not for the fact that at timecode 316 we hear "when _humans_ generate text they don't just produce the most likely text, they will actually trade-off likelihood with information content". Now, one might claim (with some justification) that utterances like "I stand with Ukraine" or "Diversity is our greatest strength." come from those behaving less as humans and more as eusocial insects using words as pheromonal signals of their membership in a swarm (whether honestly or merely to avoid being, themselves, swarmed). However, the increasing prevalence of such slogans on social media, and the real-world consequences of such swarming, do deserve some attention by language modeling.
@Supreme_Lobster • 2 years ago
Couldn't this be paired with some sort of GAN? Maybe a very naive question, but I'm interested lol
@Ronnypetson • 2 years ago
00:00 he's speaking
@tildarusso • 2 years ago
One thing that always confuses me about language models is that machine learning is mostly "statistics", meaning that models can't generate content with "information" relevant to the audience (since models only know the universe defined by the training samples). This leads to the phenomenon that computer-generated content is almost always boring and gibberish, no matter how fluent and correct the language appears to be (yes, T5 and GPT-3), and therefore eventually useless to humans, since the nature of communication is "information exchange".
@tapashyapanday4489 • 2 years ago
How do you detect deep fake?
@spencer5372 • 2 years ago
That's a very random question.
@ssssssstssssssss • 2 years ago
Neural language generation has by and large failed so far. I think we need to see a lot more new ideas like this even if they don't pan out.