Obtaining data from the internet: Data crawling for research in business and economics

1. Topics

The growing availability of semi-structured data on the Internet is making it an important data source for research in economics and management. However, much of this data cannot be readily downloaded and can only be accessed through websites. Manually extracting content from websites is burdensome and quickly becomes infeasible as the size of the underlying data grows. One solution is to extract the data systematically with automated programs written for this purpose, known as web crawling or web scraping. This active-learning workshop introduces participants to some of the key tools, frequent complications, and tips and tricks of web crawling. The goal of the course is to provide a good understanding of the possibilities of crawling while leaving enough time for participants to work hands-on on their own projects, solving exercises that are directly connected to their own research interests. In addition to teaching the basic skills needed for web crawling, the workshop also covers more advanced topics such as accessing websites through APIs and Selenium, extracting text using HTML and CSS, scaling web scrapers using cloud-based solutions, and using machine-learning-based methods to filter and combine the resulting large-scale datasets.

Learning target: Understanding the possibilities of web crawling and implementing web crawling in one's own project through a hands-on approach.

2. Target group

PhD students, postdocs, interested faculty
 
Prior knowledge: Participants should have basic knowledge of a programming language (e.g., Python, R, PHP, or Java). Participants without such knowledge can consult the many resources available on the web before the workshop. Examples will be given mostly in Python.

3. Application

The seminar is limited to a maximum of 20 participants (first come, first served). Please apply by e-mail to entrepreneurship@business.uzh.ch and indicate your interest in participating by February 2021. Attendance without confirmation is not permitted.

Lecturers: Prof. Dr. Christian Peukert, Prof. Dr. Jörg Claussen

4. Program

5. Schedule

March 1-3, 2021, 9:00-18:00 (room: TBA)

6. Literature

--