You have read a bunch of documents. Think a lot of documents. But after reading, you find yourself thinking of a specific detail, one you have forgotten: the name of a company or product, for instance. And really, you are not interested in reading all the documents again. So you start skimming the texts for the missing detail.
Good news – now you won’t have to skim them yourself. Let Datashare do it for you.
ICIJ is the International Consortium of Investigative Journalists, the organization behind the Panama Papers, Lux Leaks, Offshore Leaks, Implant Files and many other large-scale international collaborative (data) journalism projects.
We are currently attending the European Investigative Journalism Conference (EIJC19) in Mechelen, and will write about interesting software usable for research purposes.
Ever wondered how scraping works? If you are pretty much blank when it comes to programming, this guide is probably not for you. However, if you have the basic concepts in place, in a few steps the author, Mikko Helsig, shows you how to scrape a site in Python (and also how to install Python in a Windows environment).
The prerequisite is a basic understanding of programming. In return, you get an idea of how Python works, how scraping works, and an introduction to the really cool libraries requests, requests_cache, BeautifulSoup and Gender (the latter is a library used for guessing and parsing the gender of names).
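To give a feel for the parsing half of scraping, here is a minimal sketch. It is not taken from the guide: the guide uses requests and BeautifulSoup, whereas this sketch swaps in Python's built-in html.parser so that it runs offline with no extra installs. The HTML snippet, the class name "name" and the helper names are all made up for illustration.

```python
from html.parser import HTMLParser

# Made-up page content; a real scraper would first fetch this over HTTP
# (the guide uses the requests library for that step).
SAMPLE_HTML = """
<html><body>
  <h1>Staff</h1>
  <ul>
    <li class="name">Anna Jensen</li>
    <li class="name">Peter Hansen</li>
    <li class="role">Editor</li>
  </ul>
</body></html>
"""


class NameCollector(HTMLParser):
    """Collects the text of every <li class="name"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs
        if tag == "li" and dict(attrs).get("class") == "name":
            self._in_name = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_name = False

    def handle_data(self, data):
        # Only keep non-empty text found inside a matching <li>
        if self._in_name and data.strip():
            self.names.append(data.strip())


def extract_names(html):
    """Return the list of names found in the given HTML string."""
    parser = NameCollector()
    parser.feed(html)
    return parser.names


print(extract_names(SAMPLE_HTML))
```

With BeautifulSoup the same extraction would be shorter, which is part of why the guide recommends it. The principle is the same: fetch the page, walk the markup, keep the bits you need.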
Instaloader is a tool to download pictures (or videos) along with their captions and other metadata from Instagram. You can either download profiles or hashtags, and it’s possible to set up filters (for instance date filters, see below) to narrow your search.
Install Jupyter Notebook in the environment and open a new terminal
Do a pip (not pip3, as that does not work with Anaconda) install of instaloader and its dependencies
pip install instaloader
Create a new folder in your root environment (typically the Documents folder) called, for instance, Instaloader
In the terminal, change into the new folder
This keeps everything from being saved in your base folder 🙂
Run Instaloader commands in your terminal. Please note that the interface is rudimentary, but filters can be applied using boolean expressions, for instance:
instaloader "#HASHTAG" --post-filter="date_utc >= datetime(2017,1,1) and date_utc <= datetime(2018,1,1)" --login=USERNAME
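The filter in the command above is an ordinary boolean expression on each post's date_utc. To see what it keeps and drops, here is a small stand-alone sketch in plain Python. It does not call Instaloader at all; the shortcodes and dates in sample_posts are invented, and matches_filter is a hypothetical helper that simply mirrors the expression.

```python
from datetime import datetime

# Same bounds as in the --post-filter expression above
START = datetime(2017, 1, 1)
END = datetime(2018, 1, 1)


def matches_filter(date_utc):
    """Mirror of: date_utc >= datetime(2017,1,1) and date_utc <= datetime(2018,1,1)"""
    return date_utc >= START and date_utc <= END


# Invented example posts: shortcode -> UTC timestamp
sample_posts = {
    "post_a": datetime(2016, 12, 31, 23, 59),  # just before the range
    "post_b": datetime(2017, 6, 15, 12, 0),    # inside the range
    "post_c": datetime(2018, 1, 1, 0, 0),      # exactly on the upper bound
    "post_d": datetime(2018, 1, 1, 0, 1),      # one minute past the upper bound
}

kept = [code for code, date_utc in sample_posts.items() if matches_filter(date_utc)]
print(kept)
```

Note that datetime(2018,1,1) means midnight at the start of 1 January 2018, so a post from exactly that instant is kept while anything later that day is dropped. If you want all of 2017 and nothing else, use a strict `<` on the upper bound.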
We would love to implement this as a hosted service. However, it is not likely we will do so just now. Therefore, please experiment with it yourself. You can also ask our advice, and we will do our best to help. If you plan to use this tool on a regular basis or for larger datasets, you should probably be ready to use several user accounts and/or proxies to avoid being banned.
On 28 November 2018 we had our official launch with a seminar and get-together at our place.
We had a smashing time discussing the possibilities with you guys, and look forward to realizing ideas from the day. If you would like to relive the day (and who wouldn’t), you will find links to presentations and programme below.