Things Learned 12/07-12/13/20

White Board Erased 12/9/2020
Finally I Get To Erase My Whiteboard

I like having a whiteboard for jotting down random things to do, and now that I’ve learned enough about Docker to be dangerous, it’s time to wipe it clean and start with this week’s to-do list. It’s a large to-do list, but I’ll have to get it more organized in the future. It starts from the bottom right with main goals; the bottom left is my motivation to get this done. Right above that is a random bit of code I’ve found useful. Above that is what I really need to figure out soon. The top half is gibberish that I’ve written down to flush out my thoughts during a workday.

Index:
Problem Solved: Boto is not a known module
Git Style Guide
Airflow Import Problems
Make code pretty for WordPress
Testing DAGs Best Practice

Problem Solved: Boto is not a known module

Even though Docker resolved many of the dependency-related problems, I received an error saying Boto was an unknown module. Even though the dependencies included versions of boto3 and botocore, I added the lines below to the Dockerfile to ensure the environment has the correct AWS tooling.

&& pip install awscli --upgrade \
&& pip install boto3 --upgrade \
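For context, lines like these sit inside a RUN instruction in the Dockerfile. A sketch of what that might look like (the base image and the pip upgrade line are assumptions, not my actual Dockerfile):

```dockerfile
# Hypothetical excerpt — base image is an assumption
FROM python:3.7-slim

RUN pip install --upgrade pip \
    && pip install awscli --upgrade \
    && pip install boto3 --upgrade \
    && pip install botocore --upgrade
```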

Git Style Guide

One piece of advice I received from my Udacity coach was to keep my Git commits consistent and use a specific style. Now that I understand commits better, not only have I started writing better commit messages, I am also writing them more often as I save work.

Before:

After:

git add <file_name>
git commit

# git commit gets us to editor: first line is {type}: {subject}
chore: First Commit
# blank line

# write body
This is the body where I write more info on changes I've made

Chris Beams Git Commit Guide

  1. Separate subject from body with a blank line
  2. Limit the subject line to 50 characters
  3. Capitalize the subject line
  4. Do not end the subject line with a period
  5. Use the imperative mood in the subject line
  6. Wrap the body at 72 characters
  7. Use the body to explain what and why vs. how
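Putting those rules together, a full commit message might look something like this (a made-up example based on this week’s work):

```
Fix import error for helper functions in Airflow

Airflow raised an ImportError at startup because helper modules were
not on the container's path. Mounting the local plugins folder in
docker-compose makes the imports resolve inside the container.
```

Subject in the imperative mood, under 50 characters, no trailing period; body wrapped at 72 characters and explaining what and why rather than how.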

Udacity Style Guide

Something I’ve found helpful is adding the type of commit to the subject line. Udacity provides a good outline of the types:

  • feat: a new feature
  • fix: a bug fix
  • docs: changes to documentation
  • style: formatting, missing semicolons, etc.; no code change
  • refactor: refactoring production code
  • test: adding tests, refactoring tests; no production code change
  • chore: updating build tasks, package manager configs, etc.; no production code change

Airflow Import Problems

A problem I was facing was that my helper functions weren’t being imported properly, even though they were in the right place on my local machine. The Docker container could not locate them because they were not on its path. Adding the local plugins folder to the mounted folders in the docker-compose file did the trick.

# on docker-compose.yml 
- ./mnt/airflow/plugins:/usr/local/airflow/plugins
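The underlying mechanism is just Python’s module search path: anything in a folder on sys.path can be imported. A minimal sketch simulating what the mount accomplishes (the folder and the my_helpers module here are made up for illustration):

```python
import os
import sys
import tempfile

# Simulate a mounted plugins folder by creating one with a helper module in it
plugins_dir = tempfile.mkdtemp()
with open(os.path.join(plugins_dir, "my_helpers.py"), "w") as f:
    f.write("def greeting():\n    return 'imported from plugins'\n")

# Airflow puts the plugins folder on the path at startup; doing the same by hand
sys.path.insert(0, plugins_dir)

from my_helpers import greeting
print(greeting())  # -> imported from plugins
```

If the folder isn’t mounted into the container, it never lands on the path, and the import fails exactly the way mine did.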

Save Variables for Airflow

This is just a quick shortcut to make life easier when inserting information into Airflow. I found this on Marc Lamberti’s website; I’ve used his Udemy courses to learn everything about Airflow. He really goes into detail.

# create variables
airflow variables -s s3_bucket udacity-dend
airflow variables -s s3_prefix data-pipelines

# save variables to a json file
airflow variables -e /home/username/example_vars.json

# import variables from a json file. Can also be done in the Airflow UI
airflow variables --import /home/username/example_vars.json
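As far as I can tell, the exported file is just flat key/value JSON, so you can also build it by hand. A small sketch (the file path is temporary and made up) writing the two variables above in that shape:

```python
import json
import tempfile

# The same variables set on the CLI, as a flat key/value mapping
variables = {"s3_bucket": "udacity-dend", "s3_prefix": "data-pipelines"}

# Write the file that the variables import would read
path = tempfile.mktemp(suffix=".json")
with open(path, "w") as f:
    json.dump(variables, f, indent=4)

# Round-trip check
with open(path) as f:
    print(json.load(f)["s3_bucket"])  # -> udacity-dend
```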

Correct Connection Format for Redshift

How to properly write connection settings when connecting Airflow to Redshift.

  • conn_id: redshift
  • conn_type: Postgres
  • host: copy endpoint on Redshift cluster and delete anything after ‘.com’
  • schema: dev (Database name on Redshift)
  • login: master user name on Redshift cluster
  • password: password created on Redshift cluster
  • port: 5439 (on Redshift cluster)
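The host rule is the one that trips people up: the console shows the endpoint with the port and database appended, and only the part up to ‘.com’ belongs in the host field. A quick sketch with a made-up endpoint:

```python
# Made-up endpoint as copied from the Redshift console: host:port/database
endpoint = "redshift-cluster-1.abc123xyz.us-west-2.redshift.amazonaws.com:5439/dev"

# Keep everything up to and including '.com' — that is the host Airflow wants
host = endpoint.split(".com")[0] + ".com"
print(host)  # -> redshift-cluster-1.abc123xyz.us-west-2.redshift.amazonaws.com
```

The trailing ":5439/dev" goes into the separate port and schema fields instead.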

Make code pretty for WordPress

Install the Enlighter plugin for WordPress and you’ll be able to post code that’s easy to read.

print('Hello World')
SELECT *
FROM my_brain
WHERE idea <> 'dumb';

Testing DAGs Best Practice

Test on small DAGs first

Something I used to do as a rookie coder was write a large chunk of code and run it only when finished. Now that I know small errors can create huge problems, I try to work in small batches. Writing a DAG with multiple tasks means multiple ways to get an error message. A best practice I use is creating test_DAG .py files and running just one task as a full DAG. That way I can confirm that what I wrote actually does what I’m telling the program to do.

Triggering Dummy Operators

Dummy operators won’t run if you just trigger the DAG; you have to turn the DAG on. I was used to testing DAGs by keeping the DAG switched off and just triggering it to start. I have to admit, having problems with something named DummyOperator was a low point in this project.

Have only one DAG run at a time

This is similar to the problem I faced above with DummyOperators. The problem with multiple tasks in one DAG is that if there are previously scheduled runs, they will execute concurrently. Since I want at least one correct DAG run before implementing, the best thing to do is set max_active_runs=1 on the DAG. I tried setting it in default_args, but that caused a failure.

dag = DAG('test_full_dag',
          default_args=default_args,
          description='Load and transform data in Redshift with Airflow',
          max_active_runs=1
        )

Another way of ensuring just one instance runs is to set the start date to now. Working with AWS Redshift does create costs, so it makes sense to test one thing at a time. Changing the start_date in default_args to datetime.now() ensures that turning on the DAG will only run the instance once.
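As a sketch, that default_args change looks like this (the owner and retries values are placeholders, not my actual settings):

```python
from datetime import datetime

default_args = {
    "owner": "airflow",            # placeholder owner
    # With start_date at "now" there are no past intervals to backfill,
    # so switching the DAG on kicks off only a single run
    "start_date": datetime.now(),
    "retries": 0,                  # placeholder retry setting
}
print(type(default_args["start_date"]).__name__)  # -> datetime
```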