Author: Jerry Chang
Overview:
This is a text-file parser that exposes an HTTP API to serve files of a specific format that live in the /app/test-files directory. The parser is written in Python and subclasses the standard-library BaseHTTPRequestHandler class to create a custom HTTP request handler.
To run the program, build it with build.sh and then start it with the run.sh script. While the Docker container runs in the background, send a curl POST request from a terminal to parse a file and receive a response.
Notes: In addition to handling the POST request, I have also built in a small GET request method in my handler to display a message on a barebones HTTP page. This is used simply as a sanity check that the HTTP server is up and running on 'localhost:8279/'.
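The GET sanity check described above could look roughly like the sketch below. This is not the handler from main.py: the class name SanityCheckHandler, the helper make_server, and the message text are assumptions made for illustration. Only the port (8279) and the BaseHTTPRequestHandler base class come from these notes.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class SanityCheckHandler(BaseHTTPRequestHandler):
    """Hypothetical handler; name and message are illustrative only."""

    def do_GET(self):
        # Reply with a short plain-text message so a browser or curl
        # can confirm that the server is reachable.
        body = b"Server is up and running."
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence the default per-request logging to stderr.
        pass


def make_server(port=8279):
    # Hypothetical helper: bind the handler to localhost on the given port.
    # Call make_server().serve_forever() to start serving.
    return HTTPServer(("localhost", port), SanityCheckHandler)
```

With a server like this running, opening localhost:8279/ in a browser should show the message.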
As the text files can range from KB to GB, the parser reads and processes data line by line to save memory, which works because each entry sits on its own line. The entries are also sorted by time in ISO 8601 UTC, ascending. My implementation checks entries up to the "to" timestamp specified in the request and stops reading the file as soon as the timestamp being processed is beyond the "to" date-time. This saves computation time. However, if this program is used to parse files that are not guaranteed to be sorted by ascending timestamp, the implementation should be adjusted to read the whole file and then sort the entries.
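A minimal sketch of the line-by-line scan with early exit is shown below. The line layout is an assumption: it treats each line as an ISO 8601 timestamp, a space, and then the rest of the entry, which may not match the real test files. The function names parse_utc and iter_entries_in_range are hypothetical, and both range bounds are treated as inclusive here.

```python
from datetime import datetime


def parse_utc(stamp):
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11,
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def iter_entries_in_range(lines, start, end):
    """Yield (timestamp, rest) pairs with start <= timestamp <= end.

    `lines` can be an open file object, so only one line is held in
    memory at a time. Assumes ascending timestamps.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumed layout: "<ISO 8601 timestamp> <rest of entry>"
        stamp, _, rest = line.partition(" ")
        ts = parse_utc(stamp)
        if ts > end:
            # Sorted input: nothing later in the file can match.
            break
        if ts >= start:
            yield stamp, rest
```

Passing an open file object (for example, the result of open(path)) as lines keeps memory use flat regardless of file size.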
This implementation does not check for extra input in the JSON body. The program reads only the "filename", "from", and "to" keys, so it has no knowledge of any extra parameter passed in the curl data body, and extraneous parameters are simply ignored. Invalid input, such as a key with the wrong case or spelling, returns an empty JSON array with a 200 HTTP status code and a Content-Type of application/json.
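The key handling described above can be sketched as follows. The function name extract_params and the choice to return None for invalid input are assumptions for illustration; in this sketch the caller would answer None with the empty JSON array and a 200 status.

```python
import json

# The only keys the handler looks at; anything else is ignored.
EXPECTED_KEYS = ("filename", "from", "to")


def extract_params(raw_body):
    """Return a dict with the three expected keys, or None if invalid."""
    try:
        data = json.loads(raw_body)
    except (ValueError, TypeError):
        # Malformed JSON or an undecodable body.
        return None
    if not isinstance(data, dict):
        return None
    params = {}
    for key in EXPECTED_KEYS:
        value = data.get(key)
        if not isinstance(value, str):
            # Missing, misspelled, or wrong-case keys end up here.
            return None
        params[key] = value
    return params
```

Because only the expected keys are copied out, an extra key such as "limit" never reaches the parser.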
I opted to model the JSON output after the example in the instructions, with indentation and newlines for readability. I normally leave those out because they are wasted bytes to send over the connection and parse, but the formatting is easier for a human to read, so I followed the example for the purpose of this exercise. For production use, where the JSON response is parsed by a calling program rather than read by a person, I would remove the indentation and newlines to avoid sending extraneous bytes.
For testing, I covered ranges of datetimes, failure cases, and happy cases. I tested incrementally by running 'python3 main.py' with a barebones HTTP server first, then gradually adding request handling, then adding the parser to get the final output. I then adjusted the Dockerfile and repeated the tests for functionality inside the container.
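The trade-off between the two output styles can be seen with the standard json module. The function names and the sample record in the usage note are made up for illustration, and the indent width of 4 is an assumption, not necessarily what the example in the instructions uses.

```python
import json


def to_pretty_json(entries):
    # Human-friendly: indentation and newlines, as in this exercise.
    return json.dumps(entries, indent=4)


def to_compact_json(entries):
    # Machine-friendly: no whitespace between tokens.
    return json.dumps(entries, separators=(",", ":"))
```

For a hypothetical list such as [{"time": "2021-01-01T00:00:00Z", "value": 1}], both functions produce JSON that parses to the same data; the compact form is simply shorter.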
Improvements:
- The HTTP server could use more levels of abstraction. As request handling expands to other HTTP request types and more functionality, code chunks can be made smaller and reused, which makes them easier to test and build on. This would also improve readability in a larger code base with multiple contributors.
- More robust unit tests against the test files would serve as a sanity check for code pushed to production and for further functionality built on top of this program. The current unit tests in appTest.py test requests against a mock HTTP server and primarily cover POST requests with valid inputs. The unit tests can be run from a terminal with 'python3 appTest.py'.
- The program currently responds only with the JSON data requested from the desired text file, as designed. However, if this program is made available externally, it needs tighter protection against malicious data extraction. For example, the filename field in the curl request could easily be replaced with a path such as "../../some_other_file_outside_test-files_dir" to extract data from outside the intended test-files directory. Such an extraction would require knowledge of the existing filesystem layout, and the target file would have to be formatted the same way as the test files; otherwise, a standard "[]" JSON response is returned.
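One possible guard against the path traversal described in the last item is to resolve the requested path and confirm it still sits under the test-files directory. This is a sketch of that idea, not something the current program does; the function name resolve_inside is hypothetical, and /app/test-files is the directory named in the overview.

```python
from pathlib import Path

BASE_DIR = Path("/app/test-files")


def resolve_inside(base_dir, filename):
    """Return the resolved path if it lies inside base_dir, else None."""
    base = Path(base_dir).resolve()
    # resolve() collapses ".." segments and follows symlinks, so the
    # comparison below is made on the real target path.
    candidate = (base / filename).resolve()
    if base not in candidate.parents:
        # Covers "../" escapes, absolute paths, and the directory itself.
        return None
    return candidate
```

A handler could call resolve_inside(BASE_DIR, filename) and respond with the usual "[]" when it returns None.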