Build a corpus from the web

32,344 views

Sketch Engine

1 day ago

Comments: 29
@SketchEngine 2 years ago
UPDATE: The download limit of 2,000 pages from one site has been increased to 10,000 pages.
@LegendGaming-qb7eu 5 years ago
Totally helpful for the researchers.
@SketchEngine 5 years ago
We're happy to hear you find it useful.
@linnmarlen 4 months ago
Where can you select the time span you want, e.g. articles from one magazine from 2010 until today?
@SketchEngine 1 month ago
This is only possible if the website includes the year in its URLs, e.g. bbc.co.uk/2010/... for the news from 2010, bbc.co.uk/2014/... for the news from 2014, etc. You can then insert the list of such URL paths under the Input type "Website". So it mostly depends on the structure of the particular website. If you need more information, please don't hesitate to contact us at support@sketchengine.eu
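A minimal sketch of preparing such a list of URL paths, assuming a site whose addresses follow a `site/YEAR/` pattern. The function name `yearly_url_paths` and the pattern itself are illustrative assumptions, not part of Sketch Engine; check the real URL structure of the site before relying on it.

```python
from datetime import date


def yearly_url_paths(base: str, start_year: int, end_year: int) -> list[str]:
    """Return one URL path per year, e.g. 'bbc.co.uk/2010/'.

    Assumes the site puts the year directly after the base address,
    which is only true for some websites.
    """
    if start_year > end_year:
        raise ValueError("start_year must not be after end_year")
    base = base.rstrip("/")  # avoid a double slash in the result
    return [f"{base}/{year}/" for year in range(start_year, end_year + 1)]


if __name__ == "__main__":
    # Print one path per line, from 2010 up to the current year.
    for path in yearly_url_paths("bbc.co.uk", 2010, date.today().year):
        print(path)
```

The printed lines can then be pasted into the "Website" input, one path per line.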
@rexfarell 5 years ago
Is there a way to set the text complexity of the documents retrieved, maybe by using an algorithm such as Flesch-Kincaid or by setting some kind of CEFR (A1-C2) level classifier?
@SketchEngine 5 years ago
Dear Eric, a CEFR classifier is not likely to happen because the corpora in Sketch Engine are generally made up of texts intended for native speakers and therefore all would be classified as C2. For the same reason, F-K may not really work either. In addition, F-K is for English only, and when we integrate new functionality, we need to make sure it is applicable to all languages. If you are looking for language suitable for language learning, you may look at our GDEX technology www.sketchengine.eu/guide/gdex/
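For readers unfamiliar with F-K, here is a minimal sketch of the Flesch-Kincaid Grade Level formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The syllable counter is a deliberately crude heuristic based on English spelling, and the helper names are assumptions made for illustration. It is not a Sketch Engine function, and it shows why the measure does not transfer directly to other languages.

```python
import re


def count_syllables(word: str) -> int:
    """Rough English-only estimate: count groups of consecutive vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # A final 'e' is often silent in English ("make"), so discount it.
    if lowered.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


if __name__ == "__main__":
    print(round(flesch_kincaid_grade("The cat sat. The dog ran away."), 2))
```

Both the vowel-group heuristic and the coefficients were designed for English text, so scores for other languages would not be meaningful.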
@Laura_380 1 year ago
Hello, I wanted to know if there is a way to combine the Web search, i.e. being able to set seed words, and the URL search; for example, I need newspaper articles about a specific topic (so I need seed words) from a specific newspaper (so I need the URL of their website). Additionally, can you set it to look for texts in a specific period of time in that website and with those seed words? Thanks
@Laura_380 1 year ago
Sorry, I meant the website search, since I don't need just one page but all relevant pages from the website
@SketchEngine 1 year ago
@Laura_380 Hello, the methods (web search using seed words, URL download, and website download) cannot be combined.
@oscarmas4706 1 year ago
Let's say I want to create a corpus of news broadcaster articles on a given topic, and I feed the engine with the links. Does it add the language strictly from the article (the part I really want to be stored and analysed) or from the whole webpage (adverts, page layout, etc.)? If the latter, is it possible to avoid it? Thanks.
@SketchEngine 1 year ago
You will find the explanation here www.sketchengine.eu/blog/build-a-corpus-from-the-web/ under the "UNWANTED CONTENT" heading
@anshulmishra6454 2 years ago
Hi team, I used Sketch Engine and it's a very powerful tool with a simple-to-use UI. I have a few questions about the web crawling feature.
1. If a website has 5,000 pages, is it possible to crawl all of them? The video mentions that the tool crawls only 2,000 pages. What happens to the other 3,000 pages? Is there an option to run it again for the next 2,000 pages and then for the last 1,000 pages in a third round?
2. The text crawled/extracted from a webpage has many garbled characters, HTML tags, blank spaces, mixed or non-language text, special characters, etc. Is there any option for corpus cleaning?
3. Can we translate one corpus (Vietnamese) into another corpus (Chinese)?
4. Can I create a parallel corpus from language X to language Y after crawling and extracting the text from a website?
Thanks much.
@SketchEngine 2 years ago
Hi Anshul, currently the limit has been increased to 10,000 pages from one site.
@SketchEngine 2 years ago
As for the other questions:
2) The corpus is automatically cleaned when downloaded, but as with any automatic process, the result is not always 100% accurate. If needed, a manual clean-up can be performed.
3) No, it is not possible to translate a corpus into another language by machine translation.
4) You can create parallel corpora if you have the same data in both languages. Automatic alignment (i.e. saying which sentence in language 1 belongs to which sentence in language 2) can be done with OneClick Terms (terms.sketchengine.eu).
If you need details, please contact us at support@sketchengine.eu.
@Laura_380 1 year ago
@SketchEngine Hello, I wanted to know how the manual clean-up of the data can be performed.
@SketchEngine 1 year ago
@Laura_380 "Clean-up" may refer to many different things. Please email support@sketchengine.eu with details of what exactly you would like to clean up and they will be able to provide more guidance. Generally speaking, it is possible to remove documents from your corpus or to change the tags or lemmas manually.
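For the kinds of debris mentioned earlier in this thread (HTML tags, stray blank space, odd characters), one option is to pre-clean text files locally before uploading them. The following is a minimal sketch under that assumption. The function `clean_text` is hypothetical and is not how Sketch Engine cleans data internally.

```python
import html
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Strip common web debris from a piece of extracted text."""
    # Replace anything that looks like an HTML tag with a space.
    text = re.sub(r"<[^>]+>", " ", raw)
    # Decode entities such as &amp; and &nbsp; into real characters.
    text = html.unescape(text)
    # Use one canonical form for accented characters.
    text = unicodedata.normalize("NFC", text)
    # Drop control characters, but keep newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces, tabs and non-breaking spaces.
    text = re.sub(r"[ \t\u00a0]+", " ", text)
    # Collapse blank lines and trim spaces around line breaks.
    text = re.sub(r" *\n\s*", "\n", text)
    return text.strip()


if __name__ == "__main__":
    print(clean_text("<p>Hello&nbsp;&amp;   world</p>\n\n\n<br/>"))
```

A regular expression is a blunt tool for HTML, so a sketch like this suits light residue only, not full web pages.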
@Mr.AIFella 1 year ago
Is the arTenTen18 corpus available to download for free? If not, how much does it cost?!
@SketchEngine 1 year ago
We do not normally allow the download of our corpora. We may, however, consider your request. Please email inquiries@sketchengine.eu with a bit more detail about why you need to download the corpus, why it is not sufficient to search and analyse the corpus in Sketch Engine, and which functions are missing from Sketch Engine.
@Mr.AIFella 1 year ago
@SketchEngine Because I don't get a corpus after pasting a website link! When I click compile, it runs forever! So I don't get it! I followed all the steps in the video, but I am still getting nothing. Regarding your arTenTen18 corpus, since you don't allow downloading, how will the user be able to use it?! If you want to sell it, there is no shame in saying it costs X money!
@SketchEngine 1 year ago
@Mr.AIFella Please email support@sketchengine.eu about your problem with creating a corpus from the website link. Email inquiries@sketchengine.eu if you would like to purchase a corpus from us. Thank you.
@SW-uu8nt 3 years ago
It's a great tool; I'm just trying to understand the processes and the outputs. When I press the "i" on the folder for a corpus I made using this function, I can see the search returned 29 URLs. But when I go to my files, there are only 20 files. What happened to the content from the other 9 URLs? I've pulled them up individually and they're useful. Were they screened out by Sketch Engine on purpose, or was there an error/issue in pulling the content?
@SketchEngine 3 years ago
Some URLs may not have met your additional criteria and were discarded. We need to look at your specific corpus to help. Please select your corpus in the interface and send your question via the 'Request help or support' icon, which you will find in the top right corner of every screen.
@Mr.AIFella 1 year ago
Every time I make a comment, it gets deleted?! Why?!
@SketchEngine 1 year ago
We do not delete any comments. Unless they are against the law, of course.
@Mr.AIFella 1 year ago
@SketchEngine I am asking the same question as above. I didn't say anything against the LAW, LOL
@SketchEngine 1 year ago
@Mr.AIFella In the past few weeks we have not deleted any comments.
@sajidullah7109 4 years ago
More than interesting!
@SketchEngine 4 years ago
Feel free to set up a free trial account and test it auth.sketchengine.eu/#register/form?form=trial