content
short wrap-up of digital drama analysis
- todo: global link collection, antconc, empties
prerequisites
I will show you, in a few steps, an example workflow for preparing some base documents that enable further analysis of a dramatic text.
assuming we start with a plain text file, a lot of the work has already been done by others and we can proceed to the TEI refactoring of the text.
if we do not have a text file yet, the first step is to transcribe some source of the text, usually a .pdf or a collection of .jpegs.
for that purpose you can either transcribe the text manually, from picture to text, or use e.g. transkribus, a user-friendly framework for OCR (optical character recognition). with that, half of the work is done by the algorithm, but you still have to check the automatic transcription for recognition failures.
the next step, once the transcript is ready, is to upload the text page by page to wikisource, where it can be proofread by others. once two correction runs are complete, you can download the proper version of the text, from which we proceed to the TEI.
there are multiple ways to get to the TEI text. one is to wrap the text elements which need to be <marked up> with oxygen, a powerful XML editor for which the FU grants a permanent license.
another way is to use an R script that already does a lot of the work, but you will have to define text-specific parameters very precisely to be able to apply the script to your drama text. to use the script you have to be somewhat familiar with the R language, which is covered in class.
note: the dracor project already provides a convenient routine to preprocess your text into the TEI format; see section 4.1.
once all that is done you possess a finalized TEI text, which allows further analysis of the drama, again e.g. using python or R, or e.g. gephi for network analysis.
dracor database
a collection of drama plays in many languages with multiple options to download, visualise and extract play-specific data.
you can find here some easy ways of working with the datasets and metadata (in .json and .csv format).
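as a sketch of what such a metadata query looks like in python: the records below are a made-up miniature sample that only mimics the shape of the einakter data; the field names are assumptions for illustration, not the exact dracor schema.

```python
import json

# made-up miniature sample mimicking the shape of the einakter metadata
# (field names are illustrative assumptions, not the real dracor schema)
sample = """
[
  {"title": "Blind geladen", "author": {"name": "Kotzebue"}, "printed": 1811},
  {"title": "Die Mitschuldigen", "author": {"name": "Goethe"}, "printed": 1787}
]
"""

plays = json.loads(sample)

# filter records by print year, as one would with the real dataset
titles_1811 = [p["title"] for p in plays if p.get("printed") == 1811]
print(titles_1811)  # ['Blind geladen']
```

the same filter-by-field pattern carries over directly to the full .json download.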
wikisource
text editors
for the transcription and processing of your source text (vorlage) you should work with a text editor which extends the capacities of notepad, or which in general IS NOT a WYSIWYG editor, so that you can edit plain text. a choice of editors:
both feature a regex implementation which you will need for further processing of the text.
transkribus OCR
transkribus offers a convenient way of performing OCR on picture sources or .pdf files. if you do not have a user account yet, you have to create one, which will grant you 500 credits for text recognition. each action costs a few credits (0.2 for print models).
regular expressions overview
tools for learning and applying regex functions
- https://regexr.com
- https://regex101.com
- https://ahkde.github.io/docs/misc/RegEx-QuickRef.htm#Common
- regex compendium
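to get a feel for what you will use regex for, here is a small python sketch that extracts speaker names from a raw transcript; the convention that a speaker is an all-caps word followed by a period is just an assumption for illustration.

```python
import re

# made-up transcript fragment
text = """HANS. Wer ist da?
Er tritt ans Fenster.
MARIE. Ich bin es nur."""

# match lines starting with an all-caps speaker name followed by a period
speaker_pattern = re.compile(r"^([A-ZÄÖÜ]{2,})\.", re.MULTILINE)
speakers = speaker_pattern.findall(text)
print(speakers)  # ['HANS', 'MARIE']
```

the same kind of pattern can be used in the search-and-replace dialog of the editors above.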
oxygen
- get it here: https://www.oxygenxml.com/xml_editor.html or via the supported link in the zedat portal which points to the version licensed by the FU
R summary
- first you have to install R (the programming language) on your system, following the instructions on the download page.
- then you can install RStudio (a convenient programming interface for R).
- you need to install additional R packages (libraries) to do the exercises in class, e.g. the package "stylo".
- when you open RStudio, the console window is at the bottom left; there you can input commands directly. i would recommend opening a new R script (File > New File > R Script) to be able to save your commands and automate your workflow. to execute a command in the script, place the cursor on the line containing the command and press CMD+return (mac) or CTRL+return (windows). to execute a command in the console window, just type it in there and press return.
- first command e.g.
install.packages("stylo")
- then:
library(stylo)
- mac users at this point may see a message saying you have to install XQuartz. if so, open the link provided and install XQuartz for your system. (it is a small window server which is needed to display the GUI stylo uses.)
- if, in the course of installing R, RStudio or XQuartz, you are asked whether you want to install the XCode developer tools, you can decline: it takes a while, needs about 12 GB of disk space, and you probably won't need it.
- to see where you're at, type
getwd()
#this will show you your current working directory.
#any saving or opening files without an absolute
#path will access this directory.
- you can change your working directory with
setwd("/path/to/your/preferred/dir")
- or by navigating to that directory in the bottom right window, clicking the gear icon and choosing: set as working directory
try the following snippet to find plays whose cast contains a certain name/character:
# dracor simple request
library(jsonlite)
# fetch the einakter dataset from the dracor server
einakter <- fromJSON("https://einakter.dracor.org/data.json")
# replace "NULL" print years with proper NA values
m <- einakter$printed == "NULL"
sum(m)
einakter$printed[m] <- NA
# build a dataframe of all plays whose cast contains the name in question
spitcast <- function(set, cast) {
  m <- grepl(cast, set$cast)
  print(sum(m))  # number of matching plays
  m <- grep(cast, set$cast)
  s <- data.frame(author = set$author$name[m],
                  year = unlist(set$printed[m]),
                  title = set$title[m])
  print(s)
  return(s)
}
name_to_analyse <- "Lisette"
ndf <- spitcast(einakter, name_to_analyse)
# print the first rows of the dataframe
head(ndf)
you can export the dataframe created above (ndf) with one of the following lines, either to .csv or to excel (for the excel variant, the writexl package has to be installed first, e.g. via install.packages("writexl")):
#either:
library(writexl)
write_xlsx(ndf,"dracor_names-analysed_dataframe.xlsx")
#or:
write.csv(ndf,"dracor_names-analysed_dataframe.csv")
this will save the dataframe into your working directory.
TEI
GitHub
transcription, general
- get your source text (vorlage): https://de.wikisource.org/wiki/Index:Kotzebue_-_Blind_geladen.pdf
- transcribe the text manually or with the help of an OCR engine
transkribus workflow
a good model for german fraktur typography is "ONB_Newseye_GT_M1+".
you first have to import your pages: create a collection and upload your files to it.
then open <text recognition> and choose a model.
you can either correct mistakes within the transkribus frontend itself or download the transcription first and then edit the page in the wikisource editor.
if you want more configuration options you can try the TRANSKRIBUS expert client. it requires java (the JDK) on your OS and runs locally on your desktop, but needs an internet connection to the transkribus server.
wikisource editing
- upload (copy/paste the text of) each transcribed page to wikisource to have it proofread there. it would be nice if participants proofread each other's texts vice versa.
- observe transcription rules
- preserve the original orthography of the vorlage. don't adapt the spelling to modern orthography
- markup speakers and stage directions with '''3 apostrophes''', (bold) and ''2 apostrophes'' (italic)
- once you have proofread a transcription, set its status to yellow resp. green for 1x/2x proofread.
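these wiki conventions map nicely onto the later TEI elements. a minimal python sketch of the idea, assuming (as a simplification) that bold is only ever used for speakers and italics only for stage directions:

```python
import re

# made-up wikisource-style line: bold speaker, italic stage direction
line = "'''Hans.''' Wer ist da? ''ab.''"

# replace bold (3 apostrophes) first, then italics (2 apostrophes)
line = re.sub(r"'''(.+?)'''", r"<speaker>\1</speaker>", line)
line = re.sub(r"''(.+?)''", r"<stage>\1</stage>", line)
print(line)  # <speaker>Hans.</speaker> Wer ist da? <stage>ab.</stage>
```

the order of the two substitutions matters: the three-apostrophe pattern has to run first, otherwise the two-apostrophe pattern would eat into the bold markers.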
TEI
TEI preprocessing
- to enable the later TEI refactoring, apply a simple markup to the text as explained here. that will ease the process of the complex TEI markup.
- then open the jupyter-notebook (a small python script) in the runtime environment at: https://colab.research.google.com/github/dracor-org/ezdrama/blob/main/ezdramaparser.ipynb
- there you upload your prepared textfile with the markup as explained above.
- rename it to sample.txt; this will ease the process
- execute the script with Runtime > Run all
- now there should be a sample_indented.xml file in your files, which you can download and rename; it contains the final TEI version, ready for DRACOR after a few minor adaptations.
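once the TEI file exists, even python's standard library suffices for a first structural query. a minimal sketch; the inline fragment below is a made-up stand-in for your generated sample_indented.xml:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# made-up miniature TEI fragment standing in for sample_indented.xml
tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div type="act">
    <sp who="#hans"><speaker>Hans.</speaker><p>Wer ist da?</p></sp>
    <sp who="#marie"><speaker>Marie.</speaker><p>Ich bin es.</p></sp>
    <sp who="#hans"><speaker>Hans.</speaker><p>Komm herein.</p></sp>
  </div></body></text>
</TEI>"""

root = ET.fromstring(tei)

# count how many speeches (<sp> elements) each character has
counts = Counter(sp.get("who")
                 for sp in root.iter("{http://www.tei-c.org/ns/1.0}sp"))
print(counts)  # Counter({'#hans': 2, '#marie': 1})
```

counting speeches per character like this is already the raw material for the network analysis mentioned above.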
TEI oxygen
text analysis tools
if you can't wait until you have the drama TEI in your hands, you can already perform some analyses on the plain text version, e.g. using
empty
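even before any dedicated tool, a simple word-frequency count over the plain text is a start. a python sketch; the sample string stands in for your full play text:

```python
from collections import Counter
import re

# made-up stand-in for the full plain text of the play
text = "Wer da? Wer ist da? Ich bin es."

# lowercase the text, split it into word tokens, count them
tokens = re.findall(r"\w+", text.lower())
freq = Counter(tokens)
print(freq.most_common(3))
```

with the real play loaded from a file, the most frequent tokens give a quick first impression before moving on to tools like antconc.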