UPDATE: The download limit of 2,000 pages from one site has been increased to 10,000 pages.
@LegendGaming-qb7eu 5 years ago
Totally helpful for the researchers.
@SketchEngine 5 years ago
We're happy to hear you find it useful.
@linnmarlen 4 months ago
Where can you select the time span you want, e.g. articles from one magazine from 2010 until today?
@SketchEngine 1 month ago
This is possible only if the website includes the specific year in its URLs, e.g. bbc.co.uk/2010/... for the news from 2010, bbc.co.uk/2014/... for the news from 2014, etc. You can then insert the list of such URL paths under the Input type "Website". So it mostly depends on the structure of the particular website. If you need more information, please don't hesitate to contact us at support@sketchengine.eu
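If a site does follow a year-in-the-URL layout like the one in the reply above, the list of URL paths to paste under the "Website" input type could be generated rather than typed by hand. This is only a rough sketch: the function name `year_url_paths` is made up for illustration, and the `base/year/` pattern is an assumption that has to be checked against the real structure of the site in question.

```python
from typing import List


def year_url_paths(base: str, start_year: int, end_year: int) -> List[str]:
    """Build one URL path per year, assuming the site uses base/<year>/ paths."""
    # Drop any trailing slash so the result never contains a double slash.
    base = base.rstrip("/")
    # The range is inclusive of both start_year and end_year.
    return [f"{base}/{year}/" for year in range(start_year, end_year + 1)]


if __name__ == "__main__":
    # Hypothetical example following the bbc.co.uk/2010/... pattern from the reply.
    for path in year_url_paths("bbc.co.uk", 2010, 2014):
        print(path)
```

The printed lines can then be copied into the URL list. Whether the pages behind those paths really contain only that year's articles depends entirely on the website.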
@rexfarell 5 years ago
Is there a way to set the text complexity of the documents retrieved, maybe by using an algorithm such as Flesch-Kincaid or by setting some kind of CEFR (A1-C2) level classifier?
@SketchEngine 5 years ago
Dear Eric, a CEFR classifier is not likely to happen because the corpora in Sketch Engine are generally made up of texts intended for native speakers and would therefore all be classified as C2. For the same reason, F-K may not really work either. In addition, F-K is for English only and when we integrate new functionality, we need to make sure it is applicable to all languages. If you are looking for language suitable for language learning, you may look at our GDEX technology: www.sketchengine.eu/guide/gdex/
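For readers who still want to score English texts themselves after downloading them, the Flesch-Kincaid grade level mentioned in the question is 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The sketch below is not a Sketch Engine feature. The function names are made up for illustration, and the syllable counter is a crude assumption (one syllable per run of vowels) that only makes sense for English, which is exactly the limitation raised in the reply above.

```python
import re


def count_syllables(word: str) -> int:
    """Very rough English-only heuristic: one syllable per run of vowels, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level using naive sentence, word and syllable splitting."""
    # Sentences are split on ., ! and ?; empty fragments are ignored.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of ASCII letters and apostrophes (an assumption for English text).
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    sample = "The cat sat. The dog ran."
    print(round(flesch_kincaid_grade(sample), 2))
```

Very short, simple sentences give low or even negative grades, while long sentences with many multi-syllable words push the score up. A dedicated readability library would count syllables far more reliably than this heuristic.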
@Laura_380 1 year ago
Hello, I wanted to know if there is a way to combine the Web search, i.e. being able to set seed words, and the URL search; for example, I need newspaper articles about a specific topic (so I need seed words) from a specific newspaper (so I need the URL of their website). Additionally, can you set it to look for texts from a specific period of time on that website and with those seed words? Thanks
@Laura_380 1 year ago
Sorry, I meant the website search, since I don't need just one page but all relevant pages from the website.
@SketchEngine 1 year ago
@Laura_380 Hello, the methods (web search using the seed words, URL download or website download) cannot be combined.
@oscarmas4706 1 year ago
Let's say I want to create a corpus of news broadcaster articles on a given topic, and I feed the engine with the links. Does it add the language strictly from the article (the part I really want to be stored and analysed) or from the whole webpage (adverts, page layout, etc.)? If the latter, is it possible to avoid it? Thanks.
@SketchEngine 1 year ago
You will find the explanation here: www.sketchengine.eu/blog/build-a-corpus-from-the-web/ under the "UNWANTED CONTENT" heading.
@anshulmishra6454 2 years ago
Hi Team, I used Sketch Engine and it's a very powerful tool with a simple-to-use UI. I have a few questions about the web crawling feature. 1) If a website has 5,000 pages, is it possible to crawl all of them? The video mentions that the tool crawls only 2,000 pages. What happens to the other 3,000 pages? Is there an option to run it again for the next 2,000 pages and then the remaining 1,000 pages in a third round? 2) The text crawled and extracted from web pages contains many garbled characters, HTML tags, blank spaces, mixed or non-language text, special characters, etc. Is there an option for corpus cleaning? 3) Can we translate one corpus (Vietnamese) into another corpus (Chinese)? 4) Can I create a parallel corpus from language x to language y after crawling and extracting the text from a website? Thanks much.
@SketchEngine 2 years ago
Hi Anshul, the limit has since been increased to 10,000 pages from one site.
@SketchEngine 2 years ago
As for the other questions: 2) The corpus is automatically cleaned when downloaded but, as with any automatic process, the cleaning is not always 100% accurate. If needed, a manual clean-up can be performed. 3) No, it is not possible to translate a corpus into another language by machine translation. 4) You can create parallel corpora if you have the same data in both languages. Automatic alignment (i.e. determining which sentence in language 1 corresponds to which sentence in language 2) can be done with OneClick Terms (terms.sketchengine.eu). If you need details, please contact us at support@sketchengine.eu.
@Laura_380 1 year ago
@SketchEngine Hello, I wanted to know how the manual clean-up of the data can be performed.
@SketchEngine 1 year ago
@Laura_380 "Clean-up" may refer to many different things. Please email support@sketchengine.eu with details of what exactly you would like to clean up and they will be able to provide more guidance. Generally speaking, it is possible to remove documents from your corpus or to change the tags or lemmas manually.
@Mr.AIFella 1 year ago
Is the arTenTen18 corpus available to download for free? If not, how much will it cost?!
@SketchEngine 1 year ago
We do not normally allow the download of our corpora. We may, however, consider your request. Please email inquiries@sketchengine.eu and give a few more details about why you need to download the corpus, why it is not sufficient to search and analyse the corpus in Sketch Engine, and what functions are missing from Sketch Engine.
@Mr.AIFella 1 year ago
@SketchEngine Because I don't get a corpus after pasting a website link! When I click compile, it runs forever! So, I don't get it! I followed all the steps in the video, and I am still getting nothing. Regarding your arTenTen18 corpus, since you don't allow downloading, how will the user be able to use it?! If you want to sell it, there is no shame in saying it costs X money!
@SketchEngine 1 year ago
@Mr.AIFella Please email support@sketchengine.eu about your problem with creating a corpus from the website link. Email inquiries@sketchengine.eu if you would like to purchase a corpus from us. Thank you.
@SW-uu8nt 3 years ago
It's a great tool, just trying to understand the processes and the outputs. When I press the "i" on the folder for a corpus I made using this function, I can see the search returned 29 URLs. But when I go to my files, there are only 20 files. What happened to the content from the other 9 URLs? I've pulled them up individually and they're useful. Were they screened out by Sketch Engine on purpose or was there an error/issue in pulling the content?
@SketchEngine 3 years ago
Some URLs may not have met your additional criteria and were discarded. We would need to look at your specific corpus to help. Please select your corpus in the interface and send your question via the 'Request help or support' icon. You will find it in the top right corner of every screen.
@Mr.AIFella 1 year ago
Every time I make a comment, it gets deleted?! Why?!
@SketchEngine 1 year ago
We do not delete any comments. Unless they are against the law, of course.
@Mr.AIFella 1 year ago
@SketchEngine I am asking the same question as above. I didn't say anything against the LAW; LOL
@SketchEngine 1 year ago
@Mr.AIFella In the past few weeks we have not deleted any comments.
@sajidullah7109 4 years ago
More than interesting!
@SketchEngine 4 years ago
Feel free to set up a free trial account and test it: auth.sketchengine.eu/#register/form?form=trial