This project aims to digitalise the Malaysian Dewan Rakyat and Dewan Negara Hansards.
- Install requirements: pip install -r requirements.txt
- Change directory with either cd dewan_rakyat or cd dewan_negara.
- (Optional) Run python3 download_hansards.py to download the source PDFs.
- Bulk process with python3 batch_run.py. If you are rerunning, feel free to comment out certain procedures to speed up the process (e.g. parse_pdf.py takes a long time but we have stored the output in parsed_pdf).
To be specific, these files will be run in order.
Run parse_pdf.py to get the four output files
- plaintext.txt
- bold.txt and italics.txt files (as 0, 1, or whitespaces)
- tables.json.
This will get the PDFs from src_hansard, where the Hansards are named DR-DDMMYYYY.pdf or DN-DDMMYYYY.pdf. The files are stored in parsed_pdf/YYYY/YYYY-MM-DD/. The parsing only processes the content from the DOA onwards (it ignores the table of contents and MP attendance).
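As a minimal sketch of the naming convention above, the following maps a source PDF name to its output directory. It assumes the date part is eight digits (DDMMYYYY), matching identifiers such as 17072018 used elsewhere in this README; the function name output_dir is hypothetical and is not taken from parse_pdf.py.

```python
import re
from pathlib import Path

# Assumed pattern: DR- or DN-, then DDMMYYYY, then .pdf
NAME_RE = re.compile(r"^(DR|DN)-(\d{2})(\d{2})(\d{4})\.pdf$")


def output_dir(pdf_name, root="parsed_pdf"):
    """Return root/YYYY/YYYY-MM-DD for a Hansard PDF name (hypothetical helper)."""
    match = NAME_RE.match(pdf_name)
    if match is None:
        raise ValueError(f"unexpected Hansard file name: {pdf_name!r}")
    _, dd, mm, yyyy = match.groups()
    return Path(root) / yyyy / f"{yyyy}-{mm}-{dd}"
```

For example, output_dir("DR-17072018.pdf") points at parsed_pdf/2018/2018-07-17.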
Run get_categories.py to analyse the table of contents (TOC) for level_1 classification later. This will output the file
- categories.json
The files are stored in parsed_pdf/YYYY/YYYY-MM-DD/ too.
Run post_parsing_edits.py to modify tables with known errors. This will modify tables.json in place.
Run pretabulation_processing.py to insert tables, remove header rows, and do other processing. The input is from parsed_pdf/ and the output is stored in pretabulation_processing/.
Run edit_hansards.py to edit the Hansards to fix any known errors and ease tabulation. This will modify plaintext.txt, bold.txt, and italics.txt in place.
Run tabulate_hansards.py to tabulate the hansards into a CSV file with the following fields
- level_1
- level_2
- level_3
- timestamp
- author
- speech
The input is from pretabulation_processing/ and the output is stored in tabulated_hansards/. One can get them from the links above.
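A minimal sketch of how the tabulated output could be consumed, assuming the CSV has a header row with exactly the six fields listed above. The helper name speeches_by_author is hypothetical and is not part of the repository.

```python
import csv
import io

FIELDS = ["level_1", "level_2", "level_3", "timestamp", "author", "speech"]


def speeches_by_author(csv_file, author):
    """Return every speech for one author from an open tabulated CSV file."""
    reader = csv.DictReader(csv_file)
    missing = set(FIELDS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [row["speech"] for row in reader if row["author"] == author]
```

The function takes any file-like object, so it works with open(path, newline="", encoding="utf-8") as well as with an in-memory io.StringIO sample.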
- parsed_pdf/ contains the output of parse_pdf.py.
- pretabulation_processing/ contains the output of pretabulation_processing.py.
- tabulated_hansards/ contains the output of tabulate_hansards.py.
We do not store src_hansard/ in this repository as it is too large (~4 GB per Dewan).
The dates and files referenced are for Dewan Rakyat unless otherwise stated. Most of the code is the same across Dewan Rakyat and Dewan Negara, as it was first developed for Dewan Rakyat. The only differences are perhaps in post_parsing_edits.py and edit_hansards.py, where we fix known errors in specific Hansards.
- The following fonts are found in the Hansard documents
- '/Arial-BoldItalicMT',
- '/Arial-BoldMT',
- '/Arial-ItalicMT',
- '/ArialMT',
- '/TimesNewRomanPSMT'
- We cannot make any useful decisions based on font sizes.
- For each Hansard, we create two files, bold.txt and italics.txt, containing only ones, zeroes, and whitespaces, for bolds and italics respectively.
- Parsing the 2018 Hansards runs at about 3 pages per second, or around 1 minute per Hansard.
- extract_text() has a different layout than page.chars: the latter does not retain most whitespaces and uses a different text flow (see "Diterbitkan..." on the second page, which page.chars puts at the top of the page even though it is at the bottom).
- We get the formatting using extract_words(extra_attrs=['fontname']). This will also segment words based on homogeneity of fontnames.
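The font names listed above already encode bold and italic, so the two mask files can be derived from them. The sketch below is one possible layout, not the repository's actual code. It assumes each word is a dict with "text" and "fontname" keys, which is the shape returned by extract_words(extra_attrs=['fontname']) (assumed here to be pdfplumber's method); the helper names are hypothetical.

```python
def font_flags(fontname):
    """Return (is_bold, is_italic) for a font name such as '/Arial-BoldItalicMT'."""
    return "Bold" in fontname, "Italic" in fontname


def build_masks(words):
    """Build bold and italic mask strings, one 0/1 per character of each word.

    Words are joined with single spaces, so each mask lines up with the
    same words joined into plaintext in the same way.
    """
    bold_parts, italic_parts = [], []
    for word in words:
        is_bold, is_italic = font_flags(word["fontname"])
        length = len(word["text"])
        bold_parts.append(("1" if is_bold else "0") * length)
        italic_parts.append(("1" if is_italic else "0") * length)
    return " ".join(bold_parts), " ".join(italic_parts)
```

Checking for the substrings "Bold" and "Italic" also tolerates font names that carry a subset prefix in front of the family name.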
The level_1 title is first extracted from the KANDUNGAN through exploratory_survey/get_categories.py and stored in categories.json. We define level_1 titles to be those that are bold and uppercased inside KANDUNGAN. Their appearance in the Hansard (after KANDUNGAN) is usually bold, uppercased, and underlined. But since parsers cannot detect underlined text (due to how underlines are implemented in PDFs), we depend on the list curated via get_categories.py instead of extracting underlined text. Common (but not exhaustive) categories are
- JAWAPAN-JAWAPAN MENTERI BAGI PERTANYAAN-PERTANYAAN
- JAWAPAN-JAWAPAN LISAN BAGI PERTANYAAN-PERTANYAAN
- RANG UNDANG-UNDANG DIBAWA KE DALAM MESYUARAT
- USUL-USUL
- RANG UNDANG-UNDANG
- KANDUNGAN will say "USUL-USUL" but in-text the title is usually "USUL"
- Sometimes, USUL will somehow go under RANG UNDANG-UNDANG, and some categories will go before others, ignoring the TOC order.
- 17072018 does not bold its categories.
One should check toc_mismatch.txt after running tabulate.py for any discrepancies between the TOC and the actual text.
- Sometimes, USUL will appear in the TOC but not in-text as a level_1. Instead USUL will be part of a level_2 under the level_1 RANG UNDANG-UNDANG. This is expected behavior.
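The mismatch between "USUL-USUL" in KANDUNGAN and "USUL" in-text suggests matching titles after normalisation. The following is a hedged sketch of one way to do that, by collapsing reduplicated words on both sides before comparing. It is not necessarily how tabulate_hansards.py does it, and the function names are hypothetical.

```python
import re


def normalise(title):
    """Uppercase, squeeze whitespace, and collapse 'WORD-WORD' to 'WORD'."""
    title = " ".join(title.upper().split())
    return re.sub(r"\b(\w+)-\1\b", r"\1", title)


def match_level_1(line, categories):
    """Return the entry in categories that matches line, or None."""
    key = normalise(line)
    for category in categories:
        if normalise(category) == key:
            return category
    return None
```

Because both the in-text line and the curated category are normalised the same way, "RANG UNDANG-UNDANG" still matches itself while "USUL" matches "USUL-USUL".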
- When parsing JAWAPAN-JAWAPAN MENTERI BAGI PERTANYAAN-PERTANYAAN or JAWAPAN-JAWAPAN LISAN BAGI PERTANYAAN-PERTANYAAN, an MP will be numbered at the start of the string and they will speak with the keyword "minta" without ":". For example:
- Datuk Robert Lawson Chuat [Betong] minta Menteri Perdagangan Dalam...
- Speakers usually have [] to give context of who they are representing (either representing their constituency or a ministry). Sometimes the speaker will not have [] further down in the discussion if they already appeared before with []. The Tuan Yang di-Pertua doesn't have [].
- The DR.dd.mm.yyyy in the header is not consistent: the dd can be zero-padded or not.
- Most Hansards start with page number 1 on the same page as the DOA, except 14.3.2018, which starts with 11.
- 29.11.2018 when parsed displays the page numbers as 1 1 instead of 11
- 12.11.2019 displays as 12.11.201
- We remove these headers and page numbers during pretabulation because they can cut in between important chunks of text, as in DR. 22.5.2023 page 108.
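Two of the conventions in the list above lend themselves to regular expressions: the question lines with the keyword "minta", and the page header date with inconsistent zero-padding. The patterns below are an illustrative sketch only. The "1." numbering format, the DN prefix, and both function names are assumptions, not taken from the repository.

```python
import re

# Header such as "DR.05.10.2021" or "DR. 22.5.2023"; day and month may be unpadded.
HEADER_RE = re.compile(r"D[RN]\.\s*(\d{1,2})\.(\d{1,2})\.(\d{4})")

# Optional leading number, a speaker ending in [...], then "minta" with no colon.
QUESTION_RE = re.compile(
    r"^(?:(?P<number>\d+)\.\s*)?(?P<author>.+?\[[^\]]+\])\s+(?P<speech>minta\s+.+)$"
)


def parse_header_date(line):
    """Return the header date as YYYY-MM-DD, or None if no header is found."""
    match = HEADER_RE.search(line)
    if match is None:
        return None
    dd, mm, yyyy = match.groups()
    return f"{yyyy}-{int(mm):02d}-{int(dd):02d}"


def parse_question(line):
    """Split a question line into number, author, and speech, or return None."""
    match = QUESTION_RE.match(line.strip())
    return match.groupdict() if match else None
```

Normalising the header to a zero-padded date means padded and unpadded headers compare equal.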
- The text inside speech is formatted as markdown, using * to wrap italics and ** to wrap bolds. *** is for both italics and bolds.
- There are 7 files with natural occurrences of *: 22102019, 24112020, 30112020, 15122020, 08112021, 09122021, 13032023. We escape them with \*.
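A minimal sketch of the styling rule above, assuming the input is a list of (text, is_bold, is_italic) word tuples. That input shape and the function name to_markdown are hypothetical. Consecutive words with the same style are wrapped together, and natural asterisks are escaped first.

```python
from itertools import groupby


def to_markdown(words):
    """Wrap runs of styled words in *, ** or *** and escape natural asterisks."""
    parts = []
    for (bold, italic), run in groupby(words, key=lambda w: (w[1], w[2])):
        text = " ".join(word[0].replace("*", r"\*") for word in run)
        # 1 asterisk for italics, 2 for bold, 3 for both, 0 for plain text
        marker = "*" * (2 * int(bold) + int(italic))
        parts.append(f"{marker}{text}{marker}")
    return " ".join(parts)
```

Escaping before wrapping keeps a natural * in the source text from being read as a style marker.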
- Annotations are usually italicised and wrapped with square brackets [ ].
- Some occur in-text and some occur on a newline.
- We parse them as a new row under the "author" ANNOTATION.
- We do not style annotations.
- Common annotations include
- [Tepuk]
- [Ketawa]
- [Dewan ketawa]
- [Dewan riuh]
- [Pembesar suara dimatikan]
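As an illustration of parsing annotations into their own rows, the sketch below splits a speech on bracketed spans. It is deliberately naive: it treats every [ ] span as an annotation, whereas the real pipeline presumably also relies on italics to tell annotations apart from constituency tags. The function name is hypothetical.

```python
import re

ANNOTATION_RE = re.compile(r"(\[[^\[\]]+\])")


def split_annotations(author, speech):
    """Return (author, text) rows, with bracketed spans moved to ANNOTATION rows."""
    rows = []
    for piece in ANNOTATION_RE.split(speech):
        piece = piece.strip()
        if not piece:
            continue
        if ANNOTATION_RE.fullmatch(piece):
            rows.append(("ANNOTATION", piece))
        else:
            rows.append((author, piece))
    return rows
```

An in-text annotation therefore splits one speech into three rows: the text before it, the annotation, and the text after it.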
- Start Jupyter Notebook and use debug_tables.ipynb. It contains many real examples that culminate in the current way of detecting tables.
- Tables will replace the plaintext in the markdown format, e.g.
| 1 | 2 | 3 |
| 4 | 5 | 6 |
| 7 | 8 | 9 |
Due to the diversity of tables we do not style the header rows. All formatting inside the table (particularly bold) will be removed.
- We operate on the assumption that there is no natural occurrence of pipes | in the Hansard, which so far holds true.
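The pipe format above can be produced with a few lines. This sketch assumes a table is a list of rows, each a list of cell values that may be None or contain line breaks. Whether tables.json stores tables in exactly this shape is an assumption, and table_to_markdown is a hypothetical name.

```python
def table_to_markdown(rows):
    """Render rows as pipe-delimited lines with no separate header styling."""
    lines = []
    for row in rows:
        cells = []
        for cell in row:
            text = "" if cell is None else str(cell)
            # Collapse line breaks and repeated spaces so each row stays on one line
            cells.append(" ".join(text.split()))
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Calling table_to_markdown([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) yields the three-line example shown above.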
There are two formats for timestamps
- Those with a bullet point, e.g. ■1350. These are quite consistent and are in 10 minute increments.
- Those with words, e.g. 12.08 tgh. or 7.17 mlm.
We also extract timestamps from annotations, e.g. [Mesyuarat disambung semula pada pukul 2.30 petang]
126 out of 326 Hansards have out-of-sync timestamps: the bullet-point and word-formatted timestamps do not necessarily agree chronologically.
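A hedged sketch of parsing both timestamp formats into (hour, minute) on a 24-hour clock. The suffix list and the conversion rule (any hour below 12 with tgh, ptg, petang, mlm, or malam is treated as after noon) are assumptions that ignore edge cases such as times just after midnight. The function name is hypothetical.

```python
import re

BULLET_RE = re.compile(r"■\s*(\d{2})(\d{2})")
WORD_RE = re.compile(
    r"(\d{1,2})\.(\d{2})\s*(pagi|tgh|ptg|petang|mlm|malam)\b", re.IGNORECASE
)
# Assumed: these suffixes mean noon or later; "pagi" (morning) is left unchanged.
AFTER_NOON = {"tgh", "ptg", "petang", "mlm", "malam"}


def parse_timestamp(text):
    """Return (hour, minute) from a bullet or word timestamp, or None."""
    match = BULLET_RE.search(text)
    if match:
        return int(match.group(1)), int(match.group(2))
    match = WORD_RE.search(text)
    if match:
        hour, minute = int(match.group(1)), int(match.group(2))
        if match.group(3).lower() in AFTER_NOON and hour < 12:
            hour += 12
        return hour, minute
    return None
```

Because the word pattern is searched rather than matched from the start, the same function also works on annotations that mention a time.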
batch_run.py is designed to run to completion: it catches errors and stores Hansard-specific errors in errors/. If there are no files, or the files are empty, then there were no errors. There are at most four files.
- hansards_with_parsing_errors.txt contains errors caught when running parse_pdf.py.
- error_tables.txt contains errors from pretabulation_processing.py. These are usually due to the parser not being able to inject the table into the text. The maintainer can usually fix this case through post_parsing_edits.py.
- pretabulation_errors.txt contains other errors caught when running pretabulation_processing.py.
- tabulation_errors.txt contains errors caught when running tabulate.py. The maintainer can usually fix this case through edit_hansards.py.
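The catch-and-log behaviour described above can be sketched as follows. This illustrates the pattern only and is not the actual contents of batch_run.py. The function name run_step, its signature, and the one-line-per-error file format are assumptions.

```python
from pathlib import Path


def run_step(step, hansards, error_file):
    """Run step(name) for every Hansard, logging failures instead of raising.

    Returns the list of Hansard names that failed.
    """
    error_file = Path(error_file)
    error_file.parent.mkdir(parents=True, exist_ok=True)
    error_file.write_text("", encoding="utf-8")  # an empty file means no errors
    failed = []
    for name in hansards:
        try:
            step(name)
        except Exception as exc:  # keep going so one bad Hansard cannot stop the batch
            failed.append(name)
            with error_file.open("a", encoding="utf-8") as handle:
                handle.write(f"{name}\t{type(exc).__name__}: {exc}\n")
    return failed
```

Truncating the file at the start of each run keeps the rule that an empty file means a clean run.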
Please check all files inside warnings after each run. Even after a successful run these files are not expected to be empty: their role is to flag suspicious cases for manual inspection, and most flagged cases are OK. The maintainer should fix those that are not OK in one of the following ways:
- A human mistake in transcribing that is specific to a given Hansard: edit that Hansard with edit_hansards.py to fix the issue and maintain reproducibility. Examples include a missing ] or :, or authors that did not start on a newline.
- A recurring case that is related to the parser: edit the parser in either parse_pdf.py, pretabulation_processing.py, or tabulate_hansards.py. For example, a new salutation like "Kapten".
To minimize edits to the Hansard that are not related to formatting and punctuation, you will sometimes have to edit the parser to allow special cases, for example when the transcriber forgot to put a salutation for the author.
- Be careful when berbelah bahagi shows up. Some Hansards present it differently from others. Usually, it has the keywords "hadir", "bersetuju", or "undi", which are usually bolded and lowercased, except in 17072019, where they are uppercased and hence parsed as a level_2.
- 26102021, 05102021, 08122020 have low table matching scores, but they are still a perfect match and you can ignore those errors.
- 30112020, 29072021, 23032022 have footnotes (this list is not guaranteed to be exhaustive). Due to their complicated nature, you will have to manually edit them into the end product.
- The first page of 17032009 shows 1 in the PDF but parses as 11. It turns out the PDF also contains 11, but the second 1 is colored white. The next page is then 2.
When you want to inspect a certain Hansard, it is helpful to use python open.py DDMMYYYY to open all the files related to that Hansard.