Thank you so much for this video! I have a critique though: the case you are making here is that these models aren't understanding text because sentences that contain the same words have embeddings that are close. One problem I have with this statement is that it is entirely feasible that text with the same words IS actually close in terms of its "meaning". BERT during training was not constrained to ensure that the language it is analyzing is grammatical; in fact, there is a plethora of training examples in the internet scrape that are not grammatical. If you subscribe to Noam Chomsky's description of language, grammar obviously exists, but "proper" grammar is a construct of national language, which humans, as a whole, don't really follow. So grammar might have little effect on the "meaning" of a sentence, and it might be this that is making these vectors so similar.
Going deeper, in the case of the sentence vector for "a look! hey duck!": who are we to say that, if you were to see this sentence on a website, it is entirely different in meaning from "look out! duck!"? Both sentences follow an "X! Y!" pattern, both involve somebody potentially telling another person to "look", etc. I understand that you obviously don't want your model to output grammatically "false" sentences, but "meaning" is a much more loosey-goosey idea. To really see whether this model is "understanding" language, I don't think cosine comparisons of its output vectors will provide much insight, especially if a fluctuation in one dimension of that vector corresponds to a difference in "meaning" and thereby in the output of the GPT model BERT was trained with. For example, the "look out! duck!" sentence might have a very different value in the "meaning of noun" dimension of the embedding vector (as an illustrative hypothetical).
I feel like to get a notion of whether the model is "understanding" we need to:
A. See how well it performs on tests that convince everybody that it is "understanding", à la Turing.
B. Constrain the model's output so that its embedding dimensions are explainable, and THEN see whether contradictions arise.
C. See whether any patterns emerge in the embeddings of models that perform better on the (A) tests.
To the point of (C): if we see that the cosine distance between embeddings increases for sentences that don't share our subjective idea of "meaning" but are, in bag-of-words terms, closer to another sentence WITH meaning, then that is evidence that our subjective idea of "meaning" is actually applicable to the (A) tests. Otherwise we are just pontificating in the dark about a black box that gets better and better despite our complaints that it isn't "understanding" anything.
@RasaHQ · 4 years ago
(Vincent here) Let's take these two sentences:
1. The lion is eating the man.
2. The man is eating the lion.
Should these two sentences be similar? We could argue that they are, but let's be honest ... the sentences are *not* communicating the same thing. Yet the representation of both sentences is 100% identical in the bag-of-words model. *That* is the issue when we talk about natural language understanding. Even a multi-headed attention layer won't be able to correct for this. In the duck example the same thing is happening:
1. Hey look! A duck! -> You need to look somewhere.
2. Look out! Duck! -> You need to hide away from danger.
A -> I fear not everybody will agree on "understanding", but a starting point for me would be being able to detect when a sentence is "outlier"-y.
B -> Another point about word embeddings: are the numbers in the embedding array interpretable? In my mental model it's much more like clustering than interpretable axes. Feel free to check the "word analogies" video for evidence of this.
C -> Being able to make accurate predictions is not the same as understanding a phenomenon. I can drive a car and predict where it will go even if I do not understand how the car works. Models can certainly still be useful, but that doesn't mean they're not wrong.
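A minimal sketch of that bag-of-words collapse (assuming scikit-learn is installed; this is not the code from the video):

```python
# Sketch: count-based bag-of-words ignores word order entirely,
# so the two sentences end up with exactly the same vector.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The lion is eating the man.",
    "The man is eating the lion.",
]

bow = CountVectorizer().fit_transform(sentences).toarray()

print(bow)                           # both rows are identical
print(cosine_similarity(bow)[0, 1])  # 1.0 -> "100% identical"
```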
@paulcurry8383 · 4 years ago
@@RasaHQ thanks for your response Vincent! Tell me if I'm missing something here, but since the cosine distance between your first two example sentences is non-zero, there actually is a different embedding for each of them. What this says to me is that the model cares much MORE about which words are chosen than about their order, but order as well as word choice does seem to have an effect (unless there is some cosine-distance margin of error I am missing), so it is not 100% identical to a bag-of-words model as you claim. This could intuitively make sense, since the sequence lengths the model outputs form a much smaller domain than the words it outputs, so it makes sense that the model clusters similar words together. Going deeper, to me this says that humans might "understand" text in a much more bag-of-words way than we like to admit. Perhaps if we added a linear layer for the positional encoding, the model might more easily emphasize the effect of positions in English, but I doubt it, because I think accomplishing the task of language understanding requires a much greater understanding of words than of positions.
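One way to check that non-zero distance empirically (a sketch assuming the sentence-transformers library and the bert-base-nli-mean-tokens model; both the library choice and the exact similarity value are assumptions, not results from the video):

```python
# Sketch: embed the two flipped sentences with a pretrained sentence
# encoder and measure how close they end up.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("bert-base-nli-mean-tokens")  # assumed model

embeddings = model.encode([
    "The lion is eating the man.",
    "The man is eating the lion.",
])

# Unlike the pure bag-of-words case, this is typically high but below 1.0.
print(cosine_similarity(embeddings)[0, 1])
```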
@RasaHQ · 4 years ago
@@paulcurry8383 To prevent this discussion from getting esoteric: the main point I've tried to make in the video is that even BERT embeddings don't *actually* fully understand language. I'm arguing that "a look! hey duck!" shouldn't have the same linguistic interpretation as "hey look! a duck!". The order of words really does matter when you're trying to understand what is being communicated. I'm also showing that cosine distance on the embeddings doesn't seem to capture this: there's only a small difference when you flip words around, and that's a downside. My main theory on why this happens is the bag-of-words assumption, which is practical but ignores a lot of lexical ambiguities. I've made a notebook public to make it easier for you to dig a bit deeper: colab.research.google.com/drive/1erkYWFLUdD1ufRpR8fbJrRrpv8npVpii#scrollTo=utxKRg-S2w18 Feel free to copy the notebook and link to a changed version. That should make the discussion much easier.
@slhcn8609 · 4 years ago
Thanks for the video, for sharing the code, and for the explanations. You are doing a great job, and the constraints you mentioned are clear. But if we want to go further: you have described the problem clearly, so what is your suggested solution to this problem? Thanks.
@RasaHQ · 4 years ago
(Vincent here) I would simply accept that natural language understanding is an unsolved problem.
@tanmaylaud · 4 years ago
One question: you used the mean-pooled base NLI tokens. Could that be one reason for these results (taking the 'mean' of the tokens)? Why not use a more advanced sentence embedder?
@RasaHQ · 4 years ago
(Vincent here) The main reason was that it was relatively simple to set up. These are BERT tokens though, so they should be contextualized representations. Once spaCy 3.0 is out I'd love to give it a spin so that the representations of "hey look a [duck]" and "look out, [duck]!" can be compared on the tokens directly.
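In the meantime, a rough way to compare the [duck] tokens directly (a sketch using the Hugging Face transformers library with bert-base-uncased; this is an assumed setup, not the spaCy 3.0 pipeline mentioned above):

```python
# Sketch: pull out the contextual vector for the token "duck" in each
# sentence and compare the two vectors directly.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def duck_vector(sentence):
    encoded = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
    return hidden[tokens.index("duck")]  # contextual embedding of "duck"

noun_duck = duck_vector("hey look, a duck!")
verb_duck = duck_vector("look out, duck!")

# If the representation were fully contextual, these two vectors
# should end up reasonably far apart.
print(float(torch.cosine_similarity(noun_duck, verb_duck, dim=0)))
```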