MIT 6.S191: Deep CPCFG for Information Extraction

20,146 views

Alexander Amini

MIT Introduction to Deep Learning 6.S191: Lecture 9
Deep CPCFG for Information Extraction
Lecturers: Nigel Duffy and Freddy Chua, Ernst & Young AI Labs
January 2021
For all lectures, slides, and lab materials: introtodeeplearning.com
More details on Deep Conditional Probabilistic Context-Free Grammars (CPCFG): arxiv.org/abs/2103.05908
Code and datasets: github.com/deepcpcfg/datasets
Lecture Outline
0:00 - Introduction
4:18 - What is information extraction?
7:19 - Types of information (headers, line items, etc.)
11:57 - Representing document schemas
12:35 - Philosophy of end-to-end deep learning
16:38 - Context-free grammars (CFG)
20:55 - Parsing with deep learning
27:10 - Learning objective and training
28:21 - 2-dimensional parsing
33:20 - Handling noise in the parsing
35:23 - Experimental results
38:00 - Questions and answers
Subscribe to stay up to date with new deep learning lectures at MIT, or follow us @MITDeepLearning on Twitter and Instagram to stay fully-connected!!

Comments: 28
@AAmini 3 years ago
Please be sure to check out this very exciting new paper covering the technique: arxiv.org/abs/2103.05908, as well as the code and datasets: github.com/deepcpcfg/datasets
@teegnas 3 years ago
For the folks who are wondering what CPCFG stands for, like I did: Conditional Probabilistic Context-Free Grammars.
@jasdeepsingh6568 3 years ago
Okay, but what does that mean?
@roshanshah4556 3 years ago
@@jasdeepsingh6568 +1
@teegnas 3 years ago
@@roshanshah4556 My understanding is that it's a technique for measuring how consistent a sequence of text is with a given grammar; this is done by parsing the text into a syntax tree and assigning weights to it. I won't be able to explain it well without an example, but you can refer to the original literature (homepages.inf.ed.ac.uk/csutton/publications/cscfg.pdf) as well as this video once it premieres.
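To make the weighting idea concrete, here is a minimal sketch of scoring a parse tree, assuming a per-rule score that in Deep CPCFG would come from a neural network conditioned on the document; ParseNode, rule_score, and the weights below are illustrative names and values, not code from the paper:

# Hypothetical sketch: score a parse tree by summing per-rule scores.
from dataclasses import dataclass, field

@dataclass
class ParseNode:
    rule: str                       # e.g. "LineItem -> Description Amount"
    children: list = field(default_factory=list)

def rule_score(rule: str) -> float:
    # In Deep CPCFG this would be a learned, document-conditioned score;
    # here it is a stub lookup purely for illustration.
    weights = {"Invoice -> Header LineItems": 1.2,
               "LineItem -> Description Amount": 0.8}
    return weights.get(rule, 0.0)

def tree_score(node: ParseNode) -> float:
    # A tree's score is its rule's score plus the scores of its subtrees.
    return rule_score(node.rule) + sum(tree_score(c) for c in node.children)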
@koustubhavachat 3 years ago
CPCFG is interesting
@FreddyChua 3 years ago
@@koustubhavachat To answer your question about the toolchain from the live chat (we can't see it anymore): it is Julialang/Flux/Tracker.
@mdmonim4643 27 days ago
I am Bengali and very curious about AI. I want to learn about algorithms, deep learning, machine learning, supervised learning, and many other subjects, but I couldn't because I am very bad at English, and every informative video on YouTube is in English. So please upload your videos in multiple languages, including Bangla. If you do that, people all over the world can learn about this. 😢😢
@NeerajSharma-yf4ih 3 years ago
Amazing, sir.
@jma7889 2 years ago
Obviously, there is a lot of context information expressed in the layout and format of the documents. 2D parsing targets that, but does it consider graphic parts of the document such as lines and frames?
@cmosguy1 3 years ago
Hey Alex, did the slides from this talk get posted anywhere?
@Shah_Khan 3 years ago
How do you identify tables?
@FreddyChua 3 years ago
@Shahudullah Khan The table is specified as part of the context-free grammar; the model then produces a parse tree for the document based on that grammar.
@friscogate2972 2 years ago
@@FreddyChua Thank you for your really interesting work! What if we had a table where the column headers naturally explain which class the corresponding cell values belong to? I understand that the regions used for a particular parse tree may not overlap, right? So how could the model make use of the same header for multiple entries in the table? Would that only be possible if we incorporated contextual features into the token embeddings?
@FreddyChua 2 years ago
@@friscogate2972 Thank you for your interest. There seem to be multiple questions here, so I'll answer them in order.

If a table has explanatory column headers, that information can be used either implicitly or explicitly. Implicitly, the underlying language model can capture this information in the form of embeddings, and those embeddings are used by the recursive neural network that scores the parse tree. Explicitly, the column headings can be made part of the context-free grammar, which tells the parser that the headings must be seen in the document in order to generate a valid parse tree.

The regions may or may not overlap; it can be an adjustable parameter when building the parse tree. Disallowing overlap yields fewer candidate parse trees, which avoids an exponential explosion in the complexity of parsing. Allowing more overlap generates many more possible parse trees, which places a bigger burden on the deep neural network to disambiguate. Unfortunately, with real-world noisy documents we have to account for some small amount of overlap between regions because scanned documents are slightly distorted; in an academic setting, we often assume no overlap for simplicity when explaining 2D parsing.

Hopefully this answers your questions; otherwise please feel free to ask more, here is fine.
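For readers new to 2D parsing, a toy sketch of the non-overlap case may help: every region is an axis-aligned box, and a region can only be built from the two sub-regions produced by a single horizontal or vertical cut. The names and integer grid coordinates below are illustrative assumptions, not code from the paper:

# Toy sketch: enumerate the non-overlapping binary splits of a region.
from typing import Iterator, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in grid units

def splits(box: Box) -> Iterator[Tuple[Box, Box]]:
    left, top, right, bottom = box
    for x in range(left + 1, right):      # all vertical cuts
        yield (left, top, x, bottom), (x, top, right, bottom)
    for y in range(top + 1, bottom):      # all horizontal cuts
        yield (left, top, right, y), (left, y, right, bottom)

# A 2D CYK-style parser would score each (rule, split) pair and keep the
# best parse for every (region, nonterminal) entry in its chart.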
@friscogate2972 2 years ago
@@FreddyChua Wow, thanks a lot for your quick answer 😊. So, for the implicit case, we would need a kind of encoder for the tokens that takes into account some sort of spatial neighborhood to capture headings, before those embeddings are fed into the recursive neural network for the parse tree. Is that correct? In the explicit case: I'm not quite sure how exactly column headings would have to be encoded in the CFG when we have n rows, each representing for example one LineItem, and part of a LineItem are fields that can only be discriminated by the corresponding header (like plain numbers, for instance). Then the heading tokens would have to be part of n different sub-trees, right? Is that something that is possible, and how would such a grammar/rule look? Sorry, I've only just learned about CFGs and probably haven't fully captured the concept yet. I highly appreciate your explanations, though.
@FreddyChua 2 years ago
@@friscogate2972 Yes, it has been popular these days to use transformers as an encoder; we also state in the latest version of our paper that we used LayoutLM to get such embeddings. To explicitly use column headings, one would define a grammar like this:

table := headings lineitems
headings := col1_name col2_name col3_name   (explicitly type the names here)
lineitems := lineitems lineitem | lineitem
lineitem := col1 col2 col3

But changes have to be made to the CYK parser so that for the column headings we use a fuzzy string match on those colX_names, while for col1, col2, and col3 we use the embeddings coming from the encoder. We did this for forms that had consistent templates.
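As an illustration of what such a fuzzy string match could look like, here is a hypothetical sketch using Python's standard difflib; the function name and the 0.8 threshold are illustrative choices, not from the paper:

# Hypothetical fuzzy match for a column heading against OCR'd text.
from difflib import SequenceMatcher

def matches_heading(token: str, expected: str, threshold: float = 0.8) -> bool:
    # ratio() is 1.0 for identical strings and degrades gracefully for
    # OCR-noisy variants, so a threshold tolerates small scan errors.
    return SequenceMatcher(None, token.lower(), expected.lower()).ratio() >= threshold

print(matches_heading("Descripti0n", "Description"))  # True despite OCR noise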
@jackyhan4622 3 years ago
On page 40 of the presentation, shouldn't the learning objective be minimising the scores' differences?
@FreddyChua 3 years ago
If the optimizer does gradient descent, then we should indeed be solving a minimization problem; presenting it as a maximization objective is just more intuitive. The objective can be converted into a minimization problem by taking the negation of the terms.
@jackyhan4622 3 years ago
@@FreddyChua Thanks for the reply, Freddy. However, I don't think the choice of optimiser changes the consistency issue here. On the same page, it is stated that a higher score is desired for each sub-tree. If there is a theoretical maximum, then the learning algorithm should aim to minimise the difference between the current hypothesis and that theoretical maximum. I was confused when watching the lecture and only got clarity from reading your paper. Sorry, I'm not trying to nitpick, and I appreciate your engagement!
@ryanechols2065 3 years ago
You don't want the model trained to make the score for the right answer look like the score for the wrong answer. You want it trained to give a high score to the right answer and a low score to the wrong answer, as far apart as possible. That means maximizing the difference. As Freddy said, there's the small detail that most optimizers minimize, so you'd simply use "-1 * diff" as the loss.
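In other words, a margin-style objective; a minimal sketch of that sign flip (the function and argument names here are illustrative, not from the paper):

# Minimal sketch: maximize the gap between the correct parse's score and
# the best competing parse's score by minimizing its negation.
def loss(score_correct: float, score_best_wrong: float) -> float:
    margin = score_correct - score_best_wrong   # we want this to be large
    return -margin                              # gradient descent minimizes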