Author Interview - Typical Decoding for Natural Language Generation

9,213 views

Yannic Kilcher

A day ago

#deeplearning #nlp #sampling
This is an interview with first author Clara Meister.
Paper review video here: • Typical Decoding for N...
Modern language models like T5 or GPT-3 achieve remarkably low perplexities on both training and validation data, yet when sampling from their output distributions, the generated text often seems dull and uninteresting. Various workarounds have been proposed, such as top-k sampling and nucleus sampling, but while these manage to somewhat improve the generated samples, they are hacky and unfounded. This paper introduces typical sampling, a new decoding method that is principled, effective, and can be implemented efficiently. Typical sampling turns away from sampling purely based on likelihood and explicitly finds a trade-off between generating high-probability samples and generating high-information samples. The paper connects typical sampling to psycholinguistic theories on human speech generation, and shows experimentally that typical sampling achieves much more diverse and interesting results than any of the current methods.
Sponsor: Introduction to Graph Neural Networks Course
www.graphneuralnets.com/p/int...
OUTLINE:
0:00 - Intro
0:35 - Sponsor: Introduction to GNNs Course (link in description)
1:30 - Why does sampling matter?
5:40 - What is a "typical" message?
8:35 - How do humans communicate?
10:25 - Why don't we just sample from the model's distribution?
15:30 - What happens if we condition on the information to transmit?
17:35 - Does typical sampling really represent human outputs?
20:55 - What do the plots mean?
31:00 - Diving into the experimental results
39:15 - Are our training objectives wrong?
41:30 - Comparing typical sampling to top-k and nucleus sampling
44:50 - Explaining arbitrary engineering choices
47:20 - How can people get started with this?
Paper: arxiv.org/abs/2202.00666
Code: github.com/cimeister/typical-...
Abstract:
Despite achieving incredibly low perplexities on myriad natural language corpora, today's language models still often underperform when used to generate text. This dichotomy has puzzled the language generation community for the last few years. In this work, we posit that the abstraction of natural language as a communication channel (à la Shannon, 1948) can provide new insights into the behaviors of probabilistic language generators, e.g., why high-probability texts can be dull or repetitive. Humans use language as a means of communicating information, and do so in a simultaneously efficient and error-minimizing manner; they choose each word in a string with this (perhaps subconscious) goal in mind. We propose that generation from probabilistic models should mimic this behavior. Rather than always choosing words from the high-probability region of the distribution--which have a low Shannon information content--we sample from the set of words with information content close to the conditional entropy of our model, i.e., close to the expected information content. This decision criterion can be realized through a simple and efficient implementation, which we call typical sampling. Automatic and human evaluations show that, in comparison to nucleus and top-k sampling, typical sampling offers competitive performance in terms of quality while consistently reducing the number of degenerate repetitions.
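The decision criterion described in the abstract can be sketched in a few lines: keep the smallest set of tokens whose information content (surprisal) is closest to the distribution's entropy until their cumulative probability reaches a threshold, then renormalize and sample from that set. This is a minimal NumPy sketch, not the authors' implementation (see the code link above); the function name, the threshold name tau, and the exact set-construction details are my own choices.

```python
import numpy as np

def typical_filter(logits, tau=0.2):
    """Sketch of typical sampling: restrict to tokens whose surprisal
    is close to the distribution's entropy, covering probability mass tau."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    surprisal = -np.log(probs)                 # information content per token
    entropy = (probs * surprisal).sum()        # expected information content
    deviation = np.abs(surprisal - entropy)    # distance from "typical" information
    order = np.argsort(deviation)              # most typical tokens first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, tau) + 1     # smallest set covering mass tau
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()           # renormalized distribution to sample from
```

Note that, unlike top-k or nucleus sampling, the retained set need not contain the highest-probability token: a very likely token can carry so little information that it falls outside the typical set.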
Authors: Clara Meister, Tiago Pimentel, Gian Wiher, Ryan Cotterell
Links:
Merch: store.ykilcher.com
TabNine Code Completion (Referral): bit.ly/tabnine-yannick
KZbin: / yannickilcher
Twitter: / ykilcher
Discord: ykilcher.com/discord
BitChute: www.bitchute.com/channel/yann...
LinkedIn: / ykilcher
BiliBili: space.bilibili.com/2017636191
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 29
@YannicKilcher · 2 years ago
OUTLINE: 0:00 - Intro 0:35 - Sponsor: Introduction to GNNs Course (link in description) 1:30 - Why does sampling matter? 5:40 - What is a "typical" message? 8:35 - How do humans communicate? 10:25 - Why don't we just sample from the model's distribution? 15:30 - What happens if we condition on the information to transmit? 17:35 - Does typical sampling really represent human outputs? 20:55 - What do the plots mean? 31:00 - Diving into the experimental results 39:15 - Are our training objectives wrong? 41:30 - Comparing typical sampling to top-k and nucleus sampling 44:50 - Explaining arbitrary engineering choices 47:20 - How can people get started with this? Paper: arxiv.org/abs/2202.00666 Code: github.com/cimeister/typical-sampling/blob/3e676cfd88fa2e6a24f2bdc6f9f07fddb87827c2/src/transformers/generation_logits_process.py#L242-L272
@ChaiTimeDataScience · 2 years ago
I'm a huge fan of the new balance of both author interviews and detailed walkthroughs. Thanks, Yannic!
@norabelrose198 · 2 years ago
This is a really simple and elegant idea once you look at the formula, I’m surprised no one had thought of it before
@florianhonicke5448 · 2 years ago
Thanks for the video. One idea 💡: I would find it nice if the authors could quickly introduce themselves before going into the topic
@YannicKilcher · 2 years ago
Good idea 👍 or I should
@jeffw991 · 2 years ago
The dual-video analysis + interview format is great for all kinds of reasons. It's good for us to learn. It's excellent for the interviewed authors (who mostly seem to be grad students) to get some exposure and experience talking about their work. (Also, let's be honest. For a grad student, hearing, "Your academic article is interesting to me, can I talk to you about it?" can sometimes feel like winning the lottery.) And you get two solid videos per paper. :-) It helps a lot that you do a great job of bringing your questions to the discussions and challenging the authors respectfully and supportively.
@DeadtomGCthe2nd · 2 years ago
So freaking interesting! All 3 of you are amazing. I'm including the paper.
@user-dg3qs3vk5p · 2 years ago
Thank you so much for such a kind review of papers that deserve attention in this fast-growing field.
@JonathanLaserson1 · 2 years ago
I think that Yannic's main criticism from the end of the paper review video was left unanswered: why should the distribution of the text in the training set (which the model has learned) be any different from the distribution in which we want our model to communicate? Wasn't the need to be "efficient" in the way we communicate already included in the text the model was trained on? In other words, didn't the language model already factor in the balance between efficiency and processing time in the probabilities it learned?
@serta5727 · 2 years ago
Awesome, it's the "typical_p" parameter in Hugging Face, I love it
@chrisray1567 · 2 years ago
I'd like to try this out, but I'm not familiar with Huggingface yet. What is the typical_p parameter? It sounds like it's a keyword argument for some function I should know.
@jeffw991 · 2 years ago
@@chrisray1567 Yes, it's a keyword argument to the generate() method of the HuggingFace model classes (e.g., AutoModelForCausalLM) in recent versions (e.g., 4.17.0). Just set typical_p=0.2 (the value from the paper) instead of setting top_p like you might do for nucleus sampling or num_beams for beam search. You probably also want top_k=0 to disable that, and be careful about setting temperature. I'm not clear on what the interactions might be there.
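Following that comment, here is a minimal sketch of what such a call might look like. The model name ("gpt2"), the prompt, and max_new_tokens are illustrative choices of mine, not from the video; typical_p, top_k, and do_sample are real keyword arguments of generate() in recent transformers versions.

```python
# Hedged sketch of typical decoding with Hugging Face transformers.
# Model choice, prompt, and generation length are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The meaning of life is", return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=True,      # typical_p only takes effect when sampling
    typical_p=0.2,       # the value from the paper, per the comment above
    top_k=0,             # disable top-k so the two filters don't interact
    max_new_tokens=40,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

As the comment notes, the interaction with temperature is less clear, so it is probably safest to leave it at its default when first trying this out.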
@chrisray1567 · 2 years ago
@@jeffw991 Thanks! I appreciate you taking the time to respond. I played around with the transformers library, but ultimately wasn’t able to answer my question myself, so thanks again.
@oncedidactic · 2 years ago
We’ve rounded the corner on statistical methods from “make it plausible” to “keep it interesting”. Lol. Walid is cursing somewhere.
@hurktang · 2 years ago
I find it truly fascinating to listen to these two. While they mumble (hmmm, like) and keep stuttering "but i i don't think i i i did uh i mean i if if i guess" 21:11, they try to communicate the complex idea of how human beings maintain a standard fixed level of information during communication. Maybe it's not so much that you try to keep the flow of information fixed as that your own mental capacity for transmitting information is capped at a maximum?
@michael3698bear · 2 years ago
Lol
@oncedidactic · 2 years ago
Jokes aside a number of facile differences are entirely left aside by our “sota” nlp, an important one here being extemporaneous spoken word in conversation is not comparable to written text. You’d expect written text to follow much closer to a steady entropy expectation since it’s been optimized for information delivery. Speaking off the cuff brings all sorts of noise, shall we say, that won’t comport with a coarse theoretic metric. Not to mention atextual components of language irl that confounds with how they get projected in text.
@andrewjackson5798 · 2 years ago
@@oncedidactic they mention the difference b/w speech and text w.r.t. novelty/redundancy tradeoffs when talking about the plots.
@oncedidactic · 2 years ago
@@andrewjackson5798 of beating Nick Sibicky at Go fame? Yeah, I would love to hear more on this in relation to linguistics. I have absolutely no knowledge about typicality theory. I was mostly thinking about statistical models on text corpora being so very far removed from the "ground truth generator" of people saying words out loud to each other in everyday life. But on the other hand, GPT works really well, so mise. And you could make some kinds of arguments about body language, vocal inflection, emotional regulation vs informativeness, etc. being very lossy-projected into a pure text ungrounded training set, so it all turns out a wash, ish. Still, it seems like such a "face value" metric is erroneously discarding all the degrees of freedom available in those other modes that are, if not linguistics, then tied at the hip cognitively and evolutionarily. Like, if you're going to measure entropy, you need to enumerate state without that arbitrary cutoff. (I complain about this, realizing it'd be godawfully complicated to try and work them in vs a simple distribution-based measure on tokens in a tractable subspace where the data is just sitting there in a pile already, as it were.)
@andrewjackson5798 · 2 years ago
@@oncedidactic yes that's me. well they're definitely *different* corpora, for sure! So the paper here isn't trying to pick words conditioned on (or predicated on) an internal message we want to communicate. It is trying to mimic the information-pattern we have in our own sentences, where we connect our high-information words that make the idea with low-information words that connect them (conjunctions, transition words), provide cross-checks (conjugations, pronouns), etc. Like "the Jabberwock" -- highly surprising, high-information words that don't exist in your corpus can still have their meaning inferred from the familiar, frequently found connecting words around them. We pick our words to have this relatively constant information conditioned on our message, and we judge 'information'/novelty by assessing our listener's internal state. In speech, we track that state by all the listener cues; Yannic & Clara cue off each other to update their information level on the fly. We "write for an audience" by keeping track of what words we expect our readers to know/be familiar with, but we're also much less redundant b/c we aren't pausing to form our thought (i.e. condition on our internal message) or dynamically assessing info levels. We are already "flattening" all those "degrees of freedom" imho into approximating their internal representations. To turn that around, if our internal & social selves combine to produce these distributions of information across text, well even if our models don't have the internal & social equivalents, reproducing those distributions makes them way more interesting & delightful.
@robbiero368 · 2 years ago
So is this still true when you look at individuals as opposed to the whole corpus of sampled language?
@TheKirillfish · 1 year ago
“Message we are trying to convey” sounds very close to “latent variable Z” from JEPA paper. Something that alters the distribution of the next steps of a sequence. Seems that typical sampling fits in nicely to JEPA framework, for example, could we regularize Z by expected information rather than 0 information? Or maybe it should go to action selection rather than to modeling the state of the world… I’m confused
@TheKirillfish · 1 year ago
Why do we assume that staying close to expected information (like humans do) has something to do with the message we’re trying to convey? I didn’t get the train of thought how these two statements are related
@imrishab · 2 years ago
First one here
@camillebrugel2988 · 2 years ago
I'm a huge fan of the channel coding theorem and it is nice to see typicality used in ML, but these decoding/sampling tricks look quite dubious to me. Doing this affects the (prefix) conditional distribution of the text produced, so you will (a priori) not obtain the same distributions as the ones from the dataset (as pointed out by Yannic at some point). By doing this you claim that the models are smart enough to be used as a reference to measure the information of the next word, but not smart enough to pick up on the RSA pattern hypothesized by linguists (and verified using what kind of language models?). It might be a good correcting heuristic for current models, but as models improve I think sampling from the conditional distribution is the only good way (although you will probably need to seed the model with some "intention" of the idea it wants to transmit, otherwise it will be uninteresting babbling).
@liam9519 · 2 years ago
Completely agree with this! I'm only a little ways into the video but it doesn't make much sense to me either. We as humans don't pick words with low-but-not-too-low probability. We don't even do 'sampling' per se, we just pick the correct word to convey what we are trying to say. Maybe what we need is another network that somehow picks the correct word from the high probability set, based on some embedding of the desired meaning.
@ln2deep · 2 years ago
​@@liam9519 In general, I agree. Intention is very important. At the same time, I would argue that there already *is* intention. Not *your* intention, but an intention, or set of intentions. That is, the previous conditioning constrains the set of possible intentions that follow. It's just that LMs work in the text space and not the intention space. But we can view text as a function of intentions. With that in mind, it's about generating intentions according to some criteria, rather than expressing a particular intention. The question that is asked is then, is how obvious or weird should an intention be? This paper tries to give some criteria for that. You could argue that, perhaps the language model itself should be outputting 'typical' sentences with high probability. That is mostly what bothers me about having to do this kind of decoding to make sentences more 'typical'. Why should we have to impose more language like structure if the model is good? I think like most of you, I also started to look at comments without watching the whole video :P. She herself argues that the language models aren't good enough, so she tries to coax it into having more human-like language properties as a kind of post-processing.
@snippletrap · 2 years ago
I was not expecting the American accent.