Text analysis / mining in R - how to plot word-graphs

Views: 29,090

Comments: 62

@ZSY-jm4oi 2 years ago

Thank you Tom! Your instruction and explanation of the code and the logic behind the functions are so clear and easy to follow. It is very helpful!

@izuchukwuokeke4256 3 years ago

Just seeing this wonderful tutorial now. I subscribed to the page, and I hope Tom is still very much available. Thanks for this and hope to see more posts.

@mkklindhardt 3 years ago

Thank you Tom, this is an excellent screencast of the incredible possibilities with tidytext!

@RT_-fj9ht 1 year ago

Starting to learn text mining this semester at school! Found this video really useful and interesting!

@tomhenry-datasciencewithr6047 1 year ago

Glad to hear!!!!

@user-ii4uq3gi8t 3 years ago

It's a channel where you can always get useful skills. Thank you so much.

@tomhenry-datasciencewithr6047 4 years ago

*P.S.:* If you'd like more text-analysis-related content in the next few weeks, click *[like]* 👍 and *[subscribe]* 🔔! Here's the R Markdown code if you want to join in: gist.github.com/larsentom/369c4227dced0aac8c78f2d192fc68bd 📊

@andystats 2 years ago

It looks like it's been moved/deleted. Do you have an updated link?

@johnuesi 2 years ago

Excellent video. I was enthralled the entire duration. You've also given me some ideas for something I'm working on

@raphaelortiz4459 1 year ago

Great video sir. Thanks for the walk through! Will be applying some of this to a project I am working on

@niceperson9223 3 years ago

Very informative. Kindly upload a video on aspect-based sentiment analysis in R.

@ayasugihada 2 years ago

Thanks for the great intro Tom. Though I have to say the interpretation of the word relationships sounded a bit like good old tarot reading :). Cheers!

@robertc2121 3 years ago

Liked and subscribed. Fantastic tutorial and explanation !!!

@JD_Mortal 1 year ago

I didn't realize this was a thing... "R"... I thought the whole video was great, but at the end it seems there could have been a better formulation that would offer more insight into the reviews. Word-pairing is great, but the pairs were all "out of context", and because of the "chaining", one is left to assume that there is a connection between 3 or more words when there may not be. For instance, you saw "bugs" and "fishing"... Bug = a thing you fish with, or a programming issue with fishing? (I assume the former.) I see "bombing, review, click"... That could have been "review bombing" or "bombing review"... Were there bombs in the game? Was it "... bombing. Review ..." or "... review. Bombing ..."? I am sure that no reviews had "bombing review click" or "click review bombing".
I have an issue with the "tainted results". You threw away valuable "review words". I say tainted, or "corrupted", because removing them now "pairs" possibly unrelated words. Also, periods... You don't constrain "pairings" to "sentences". You are getting cross-contamination of thoughts, creating pairings that don't truly exist.
Then there is the matter of "word association", "similarity" and "depluralizing" that should be done. I saw "player" and "players", textually the same content; also a pairing of "reviews negative" but no "review negative", only "review bombing" and no "reviews bombing". Also, "island" and "islands" were oddly isolated.
Word association... "Nintendo switch", "Nintendo game", "Nintendo switch game", "Nintendo console game". That contaminates "switch", "game" and "console" with other relevant pairings that have nothing to do with a "game console" branded "Nintendo", specifically the "Switch" model. (Also the removal of the game's title from the review, which contaminates pairs related to "animal", "crossing" and "horizon/s".)
I noticed a lot of foreign words in there too. De, en, el, es, se... Perhaps a LOT was missed, since those were quite commonly found, but those reviewers surely were not writing in English, and pairings of foreign words, even if translated, would not always be the same. Word order often differs between languages. Thus the need for "word association", which identifies the subjects and related words, half of which you threw away. "Good game", with "good" being one of those common words you surely had in the list, possibly found a hundred times. Good, like/d, love/d, enjoy/ed.
Missing critical triplets and notable phrases too, I assume... "well worth the money" and "not worth the money" both show up as just the pairing "worth money"; "waste of my time" and "no time to waste, get it now" show up as "waste time" and "time waste".
I guess my mind just works differently. I feel that you were on the right track in isolating good/bad, but the pairing doesn't seem to be a good metric for anything other than "game content confirmation". The review text suggests that it is a game that involves fishing, customization/crafting of things, multiplayer and animals, that it works on Nintendo Switch, and that there are islands in it. (Compared to the game developer's description, it could "confirm" game content.)
Perhaps a better metric for good and bad would be the isolation of words NOT found in both. Seeing "not fun" as a pairing in a horrible review is expected. However, if you see "good value" or "worth ... money", then it's not so bad.

@tomhenry-datasciencewithr6047 1 year ago

You are asking exactly the questions that would go into a more detailed analysis! There is a useful function called SnowballC::wordStem() which reduces words down to a common 'stem'. For example, applied to part of your comment, it produces: "then there is the matter of word associ and similar and deplur that should be done i saw player and player textual the same content also a pair of review negat but no review negat"
For a more rigorous look into this subject, see Julia Silge and David Robinson's excellent 'Text Mining with R: A Tidy Approach', which is free online: www.tidytextmining.com
Text analysis is always imperfect (and will remain so forever, I suspect), but it can yield good insights when applied to a large dataset, provided a human is in the loop!
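A minimal sketch of the stemming step described in the reply above, assuming the SnowballC package is installed. The example words come from the comment under discussion; the variable names are made up for illustration.

```r
library(SnowballC)

# Words from the discussion above that differ only by a plural or a suffix
words <- c("player", "players", "review", "reviews", "negative", "association")

# wordStem() reduces each word to its stem (Porter stemmer by default)
stems <- wordStem(words)

# Put each original word next to its stem for inspection
data.frame(word = words, stem = stems)
```

Stems are not always dictionary words, so it can help to keep the original word alongside the stem and use the original for plot labels.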

@JD_Mortal 1 year ago

@tomhenry-datasciencewithr6047 Something like this would be PERFECT for what I am doing, but it is honestly above me in complexity at the moment. I was looking for a formulaic way to form a sense of textual hierarchy, one which could be used to help new entries "find a logical category level" where they belong with other similar, associated words. In a basic sense...
object => transportation => vehicle => automobile => car => gasoline_engine => hatch_back => ford => mustang => 1985 => cherry_red_paint
So a "truck", by similar associations, would be at the level of "car".
transportation => vehicle, as opposed to skates (an accessory vs. something you drive) or a ski-lift (not drivable)
vehicle => automobile, as opposed to a bicycle or skateboard (non-motorized vehicles)
etc...
The hierarchy would be inferred from known relations and/or from simple volume of appearance and order. People tend to say "my 1985 mustang", classed in reverse, "specific => generic". However, by volume, 1985 appears less often than mustang, and ford appears more often than that, continuing up the chain to automobiles, vehicles and transportation, each of which is identified with progressively more "objects". Knowing the relation, without needing to know the specifics of any one car, ford can be aligned with lexus, chevy, mazda, etc., because of the similar preceding and following groupings.
Why turn the entire language into a kind of "word tree of origins"? Partly for use with AI classification of image contents: knowing that cars have rims, wheels, paint, body styles, manufacturers. Partly for extracting and isolating the correlated subject matter and "emotion/opinions" within descriptions. Partly for assisting the extraction and isolation of "finer details", such as the rarer descriptions of tire types, ground effects, antenna types and rim types.
I could go on, but the things just mentioned, as a whole, were the primary goal of my knowledge-quest, with the final purpose of serving as a guide for helping others "classify image contents" with more valuable information that AI can digest. (All in relation to text2image and image2image AI-created art, which is assisted by "human textual prompts". They have a rough system in place, but it is hardly extensive or adequate enough to be used with any accuracy or repeatability.)
P.S. Looking into this further because of this video you posted. (Yet another language to consume my brain cells.)

@yi-hsuanchen5458 3 years ago

Thank you so much! The code and the demo are really helpful :)

@jasoncysiu 3 years ago

Thank you for this amazing tutorial!

@tomhenry-datasciencewithr6047 3 years ago

Glad you like it!

@OpalCrossCoaching 2 years ago

This is great content on text mining in R. I also have a channel that discusses text mining in R on data from the web, PDF documents and data frames.

@teknocatt 1 year ago

Thank you for the inspiring video!

@DataCentricInc 2 years ago

This is great content on text mining in R. I also have a channel that discusses text mining & Sentiment Analysis in R on data from the web, PDF documents and data frames.

@swifterbator8355 3 years ago

I get a vector of 3.2 GB (I have 3000 clean texts), and I cannot allocate the vector. It happens during the correlation calculation step. Any advice on memory allocation when working with heavy data?

@tomhenry-datasciencewithr6047 3 years ago

What kind of text data do you have? That sounds pretty big!

@swifterbator8355 3 years ago

@tomhenry-datasciencewithr6047 Just regular HTML documents that I cleaned, so it's really more like 3000 paragraphs about some company filings. I wanted to see relationships between words like covid and whatever they correlated with. I followed your code to the letter. I did succeed in the end, but ended up with so many nodes, even when filtering away all word pairs that did not include covid, words not used in more than x documents, and correlations less than 0.3. Maybe I just need some practice. It's a really cool plot though, I subscribed for more videos
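A sketch of the "drop rare words before computing correlations" idea from this exchange, in base R. The token table and the helper name keep_frequent_words() are made up for illustration and are not from the video's code. A dense word-by-word matrix has roughly (number of distinct words)^2 cells, so trimming the vocabulary first is one way to keep memory use down.

```r
# Toy token table: one row per (document, word) pair, as produced by tokenizing
tokens <- data.frame(
  doc  = c(1, 1, 2, 2, 3, 3, 3),
  word = c("covid", "revenue", "covid", "supply", "covid", "revenue", "lease"),
  stringsAsFactors = FALSE
)

# Keep only words that occur in at least `min_docs` distinct documents
keep_frequent_words <- function(tokens, min_docs) {
  distinct_pairs <- unique(tokens[, c("doc", "word")])
  docs_per_word  <- table(distinct_pairs$word)
  frequent       <- names(docs_per_word)[docs_per_word >= min_docs]
  tokens[tokens$word %in% frequent, , drop = FALSE]
}

filtered <- keep_frequent_words(tokens, min_docs = 2)

# Compare the size of the word-by-word matrix before and after filtering
n_before <- length(unique(tokens$word))
n_after  <- length(unique(filtered$word))
c(cells_before = n_before^2, cells_after = n_after^2)
```

The filtered table can then be passed to the correlation step in place of the full one.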

@nourbouabdallah2619 3 years ago

Thank you so much for this tutorial. When I tried to draw the graph for the positive correlations, it says: "Error in FUN(X[[i]], ...) : object 'correlation' not found". What do you think is causing the problem? (N.B. all the other chunks run without errors.)

@TotusCamihurs 3 years ago

Thank you for explaining every line.

@amrutakale2465 3 years ago

Nice tutorial! But what if I am extracting my data from PDF files? Is there any way to convert them and then perform the analysis?

@tomhenry-datasciencewithr6047 3 years ago

That's a bit more tricky! What kind of pdf files are you trying to analyze? There are ways to convert pdf files to text and then analyze the text, although you might need to use another tool to do it. For example, on Mac, there is a command called "pdftotext" which you can run in the Terminal shell. Once the pdf files are converted to text files, you could load them in and analyze that text.
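As an alternative to converting in the Terminal, the conversion can also be done from within R. This is a sketch that assumes the pdftools package is installed; pdf_text() returns one string per page. The file name "report.pdf" and the helper names are hypothetical.

```r
# Turn a character vector with one element per page into one row per line of text
pages_to_lines <- function(pages) {
  lines_per_page <- strsplit(pages, "\n", fixed = TRUE)
  data.frame(
    page = rep(seq_along(pages), lengths(lines_per_page)),
    text = trimws(unlist(lines_per_page)),
    stringsAsFactors = FALSE
  )
}

# Read a PDF and return its text as a data frame with a `text` column
# (requires pdftools; only runs when the function is called)
read_pdf_lines <- function(path) {
  pages <- pdftools::pdf_text(path)  # one string per page
  pages_to_lines(pages)
}

# Demonstration on made-up page text instead of a real file
demo <- pages_to_lines(c("First line\nSecond line", "Only line on page two"))
demo

# With a real file (hypothetical path):
# pdf_lines <- read_pdf_lines("report.pdf")
```

The result has a text column, so it can be tokenized the same way as any other data frame of text. Scanned PDFs that contain only images would need OCR first, which this sketch does not cover.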

@divyangirathore4156 2 years ago

I am unable to do this in ggraph. Can you tell me how I can plot a histogram of the word counts using ggplot?
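One way to plot word counts with ggplot2, as a sketch. The counts below are made up; in practice they would come from counting the tokenized words. Because the counts are already computed, a bar chart (geom_col) is used rather than geom_histogram.

```r
library(ggplot2)

# Made-up word counts for illustration
word_counts <- data.frame(
  word = c("game", "island", "fishing", "switch"),
  n    = c(120, 75, 40, 33)
)

# Horizontal bar chart, with the most frequent word at the top
p <- ggplot(word_counts, aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Number of occurrences", y = NULL)

p
```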

@abhipsatripathy3934 3 years ago

How can I apply your code to a text file? The text file is purely a story, so it's not advisable to convert it to a CSV file. What should I do then?
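A sketch of one possible approach in base R: read the file line by line and build a data frame with a text column, without going through CSV. The story text and file path are made up so that the example is self-contained.

```r
# Write a tiny made-up story to a temporary file
story_path <- tempfile(fileext = ".txt")
writeLines(c("Once upon a time there was a fox.",
             "",
             "The fox lived on an island."), story_path)

# Read the file: one element per line
story_lines <- readLines(story_path, encoding = "UTF-8")

# One row per line, with the line number acting as a "document" id
story <- data.frame(
  line = seq_along(story_lines),
  text = story_lines,
  stringsAsFactors = FALSE
)

# Drop empty lines
story <- story[nzchar(trimws(story$text)), ]

story
```

From here the text column can be tokenized in the same way as the review text, for example with unnest_tokens(output = word, input = text). For a long story, grouping lines into chapters or paragraphs may give more meaningful "documents" than single lines.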

@julitopabriga9094 2 years ago

Please post the website for the dataset. It cannot be read in your presentation. Regards.

@seancherry2520 2 years ago

Anyone know what this error means / how to address it? Here's my input and the error that follows it:
> parsed_words %
+ unnest_tokens(output = word, input = text) %>%
+ anti_join(stop_words, by = "word") %>%
+ filter(str_detect(word, "[:alpha:]")) %>%
+ distinct()
Error: Must extract column with a single valid subscript.
x Subscript `var` has the wrong type `function`.
ℹ It must be numeric or character.
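A hedged guess at the cause: this message commonly appears when the column given as input = does not exist in the data frame, so R resolves the name text to the graphics function text() instead of to a column. Assuming that is what happened here, a quick check before tokenizing could look like the sketch below. The data frame and the helper check_column() are made up for illustration.

```r
# Hypothetical data frame whose review column is called `review`, not `text`
reviews <- data.frame(
  user_name = c("a", "b"),
  review    = c("Great game", "Too much grind"),
  stringsAsFactors = FALSE
)

# Stop early with a readable message if the expected column is missing
check_column <- function(data, column) {
  if (!column %in% names(data)) {
    stop("Column '", column, "' not found. Available columns: ",
         paste(names(data), collapse = ", "))
  }
  invisible(TRUE)
}

# Without a `text` column, the bare name `text` refers to a function
is.function(text)

# Rename the column so that later steps can refer to `text`
names(reviews)[names(reviews) == "review"] <- "text"
check_column(reviews, "text")
```

Running names() on the data frame just before the failing step is usually enough to confirm or rule this out.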

@onsfarhat1042 3 years ago

A very interesting video! Thank you Australia all the way from France!

@jonathandevries4257 3 years ago

Hi, I am having trouble installing tinytext, it seems like this package is no longer available, and it makes many of the functions difficult to accomplish, do you know if there is a new name for this package?

@tomhenry-datasciencewithr6047 3 years ago

Hi Jonathan - have you tried install.packages("tidytext") vs install.packages("tinytext") ('D' vs 'N' in 'tidy')? I think that should work - if not let me know :)

@karthikparanthaman634 3 years ago

Hi Tom, Great video! When I tried to run the "pairwise_cor" R was returning following error : "Error in validObject(r) : invalid class “dgTMatrix” object: length(Dimnames[2]) differs from Dim[2] which is 11489" Any suggestions?

@tomhenry-datasciencewithr6047 3 years ago

I suspect this is because you might have an older version of 'widyr' installed. Perhaps try the steps at github.com/dgrtwo/widyr and install the most recent version from GitHub, and then restart R and see if it works. If not, we can figure out what is going on!

@rafaafeitos 2 years ago

Excellent video

@miguelamaral5505 3 years ago

Thank you so much for this tutorial. I tried to use a correlation >= 2 but it says "object 'correlation' not found". I think it won't allow me to plot any number higher than 1. Do you know if there's a way to fix that? My data set is kind of big and the plot gets confusing with a low correlation.

@tomhenry-datasciencewithr6047 3 years ago

Hi Miguel. Were you able to sort this out? Otherwise if you post the code chunk that is failing we can figure it out together.

@GingerFelix1000 2 years ago

Am I being slow, or is it just that a correlation coefficient cannot exceed 1, i.e. its range is between -1 and 1?
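A small base R illustration of the point about the range, using made-up presence/absence data (rows are reviews, columns are words). It only shows why a threshold of 2 can never keep any pairs; the separate "object 'correlation' not found" message points to a missing column rather than to the threshold.

```r
# Made-up data: 1 means the word appears in that review
presence <- cbind(
  fun   = c(1, 1, 0, 0, 1),
  relax = c(1, 1, 0, 0, 0),
  grind = c(0, 0, 1, 1, 0)
)

# Pearson correlation between every pair of word columns
correlations <- cor(presence)

# Every coefficient lies between -1 and 1 ...
range(correlations)

# ... so a threshold of 2 keeps nothing
sum(correlations >= 2)

# A threshold inside the valid range, such as 0.5, can keep some pairs
sum(correlations[upper.tri(correlations)] >= 0.5)
```

To thin out a crowded plot, raise the threshold within the valid range (for example from 0.2 to 0.5) instead of going above 1.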

@benhalsted9574 3 years ago

Great content mate! From your code, I tried to create a positive_word_correlations dataset but I can't seem to get it. Any advice?
positive_word_correlations % semi_join(users_who_mention_word, by = "word") %>% pairwise_cor(item = word, feature = user_name) %>% filter(correlation >= 0.2) %>% filter(grade >= 5)

@tomhenry-datasciencewithr6047 3 years ago

Hi Ben -- what error are you receiving?
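One possible cause, assuming the pipeline ran in the order shown in the comment above: pairwise_cor() returns only the columns item1, item2 and correlation, so a later filter(grade >= 5) has no grade column left to work with. A sketch with made-up data that filters on grade first (the semi_join step is omitted for brevity; assumes dplyr and widyr are installed):

```r
library(dplyr)
library(widyr)

# Made-up tokenized reviews: one row per (user, word), with the user's grade
review_words <- tibble(
  user_name = c("u1", "u1", "u2", "u2", "u3", "u3", "u4", "u4", "u5", "u5"),
  grade     = c(9, 9, 8, 8, 7, 7, 6, 6, 2, 2),
  word      = c("fun", "relaxing", "fun", "relaxing",
                "island", "fishing", "island", "fishing",
                "boring", "grind")
)

positive_word_correlations <- review_words %>%
  filter(grade >= 5) %>%                              # filter on grade first
  pairwise_cor(item = word, feature = user_name) %>%  # returns item1, item2, correlation
  filter(correlation >= 0.2)                          # `correlation` exists from here on

names(positive_word_correlations)
```

The same ordering logic applies to "object 'correlation' not found": that column only exists after pairwise_cor() has run, so any filter or plot that mentions it has to come after that step and use its output.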

@andrekerygma 3 years ago

Would you help to do this process with a CSV file?

@robertrotich3958 3 years ago

I would like to see this kind of approach as well

@AnahideCastro 5 months ago

👏🏽👏🏽👏🏽

@divyangirathore4156 2 years ago

apparently ggraph package is not getting installed

@tomhenry-datasciencewithr6047 2 years ago

What error message do you see when you try to install ggraph? install.packages("ggraph")

@divyangirathore4156 2 years ago

@tomhenry-datasciencewithr6047
ERROR: dependency ‘igraph’ is not available for package ‘graphlayouts’
* removing ‘/opt/homebrew/lib/R/4.1/site-library/graphlayouts’
Warning in install.packages : installation of package ‘graphlayouts’ had non-zero exit status
ERROR: dependencies ‘igraph’, ‘tidygraph’, ‘graphlayouts’ are not available for package ‘ggraph’
* removing ‘/opt/homebrew/lib/R/4.1/site-library/ggraph’
Warning in install.packages : installation of package ‘ggraph’ had non-zero exit status

@tomhenry-datasciencewithr6047 2 years ago

@divyangirathore4156 Have you tried installing those other packages first (e.g. install.packages("igraph") and so on)? Or you can try install.packages('ggraph', dependencies = TRUE). One more thing: are you trying to install on a server or other shared location, or is this just on your personal computer?

@divyangirathore4156 2 years ago

@tomhenry-datasciencewithr6047 It is on my personal computer. The above command still did not work. Is there any chance you can tell me how to plot the same data you plotted using ggplot? Any references would be helpful.

@CurveBlade 3 years ago

If you want your lessons to be applicable, you have to include a section that teaches us how to convert the data into whatever format you are using. I am using txt format and I am unable to replicate anything in this video.

@tomhenry-datasciencewithr6047 3 years ago

Hi Elton! Do you have an example of what your text data looks like in the txt file?

@CurveBlade 3 years ago

@tomhenry-datasciencewithr6047 Hi Tom. It looks like this.
Asin Rating Reviews
B085234 4.0 out of 5 stars All in all, if you weigh quality more you should probably pay 50-100 bucks more for laptop with similar
B092453 3.0 out of 5 stars It is light weight. I liked it. However, probably because of the software installed, I couldnt install the apps I was

@tomhenry-datasciencewithr6047 3 years ago

Excellent. Also, is the data stored in a tab separated format, or in a comma separated format, or is it in Excel, or is it in some other format?

@CurveBlade 3 years ago

@tomhenry-datasciencewithr6047 I have already solved the issue. Thanks Tom! If your user reviews had a lot of non-English characters, how would you resolve that?
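One simple option, sketched in base R with made-up tokens: keep only tokens made of unaccented letters, and additionally drop a short hand-written list of common non-English words. Both the token vector and the list are hypothetical and would need adapting to the data.

```r
# Made-up tokens, including some non-English words and one accented word
words <- c("game", "de", "island", "el", "ni\u00f1o", "don't", "juego", "fun")

# Keep tokens made only of the letters a-z and apostrophes
ascii_only <- grepl("^[a-z']+$", words, perl = TRUE)

# Hand-written list of common non-English words (hypothetical; extend as needed)
foreign_stop_words <- c("de", "en", "el", "es", "se")

kept <- words[ascii_only & !(words %in% foreign_stop_words)]
kept
```

This is a blunt filter: an unaccented foreign word such as "juego" still gets through, and accented English loanwords are dropped. Detecting the language of each review and filtering whole reviews would be more thorough, but is beyond this sketch.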

@JOHNSMITH-ve3rq 3 years ago

The *purpose* of generating text networks was not clear from this video. The exercise didn’t seem to generate any particular insight. Is this a problem with text networks themselves, or content selection?