Astronomical Data Processing
This project demonstrates the implementation of a Convolutional Neural Network (CNN), inspired by the AlexNet architecture, to classify galaxy images into two distinct morphological types: 'Round' and 'Edge-on'. The goal is to build a deep learning model capable of accurately distinguishing between these two classes of galaxies based on their visual characteristics.
The project follows a standard machine learning pipeline:
- Data Acquisition and Extraction:
  - The `galaxy-zoo-the-galaxy-challenge` dataset was downloaded from Kaggle using the Kaggle API.
  - Raw image and solution files were extracted from `.zip` archives.
- Data Preparation:
  - The `training_solutions_rev1.csv` file, containing galaxy classification probabilities, was loaded using `pandas`.
  - High-confidence 'Round' galaxies (`Class1.1` > 0.9) and 'Edge-on' galaxies (`Class2.1` > 0.9) were identified.
  - A balanced subset of 757 images for each class was randomly sampled and copied into dedicated directories (`subset_data/Round`, `subset_data/EdgeOn`).
  - `ImageDataGenerator` from `tensorflow.keras` was used to:
    - Rescale pixel values to the range [0, 1].
    - Split the data into 80% for training and 20% for validation.
    - Resize images to 227x227 pixels, matching the AlexNet input requirement.
    - Generate batches of images for training and validation.
- Model Definition (AlexNet-like CNN):
  - A sequential Keras model was constructed, mirroring the AlexNet architecture.
  - It comprises multiple `Conv2D` layers with ReLU activation, interleaved with `MaxPooling2D` layers for spatial downsampling.
  - A `Flatten` layer connects the convolutional base to fully connected (`Dense`) layers.
  - Two `Dense` layers with 4096 units each, followed by `Dropout` (0.5) layers, provide high-level feature interpretation and mitigate overfitting.
  - The output layer is a `Dense` layer with 2 units (for the two classes) and a `softmax` activation function.
- Model Training:
  - The model was compiled with the `adam` optimizer, `categorical_crossentropy` loss function, and `accuracy` as the evaluation metric.
  - Training was performed for 10 epochs using the `train_generator` and `validation_generator`.
- Model Evaluation:
  - Training and validation accuracy and loss curves were plotted over the epochs to visualize the model's learning progress and identify potential issues like overfitting.
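The data preparation step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names, the `seed` parameter, the `GalaxyID` column name, and the `<GalaxyID>.jpg` file naming are assumptions made for the example.

```python
import shutil
from pathlib import Path

import pandas as pd


def select_balanced_subset(solutions, threshold=0.9, n_per_class=757, seed=42):
    """Pick equally sized random samples of high-confidence galaxies per class.

    `solutions` is a DataFrame with (assumed) columns GalaxyID, Class1.1, Class2.1.
    """
    round_ids = solutions.loc[solutions["Class1.1"] > threshold, "GalaxyID"]
    edge_ids = solutions.loc[solutions["Class2.1"] > threshold, "GalaxyID"]
    # Cap the sample size at the smaller class so both classes stay balanced.
    n = min(n_per_class, len(round_ids), len(edge_ids))
    return {
        "Round": round_ids.sample(n=n, random_state=seed).tolist(),
        "EdgeOn": edge_ids.sample(n=n, random_state=seed).tolist(),
    }


def copy_images(ids_by_class, image_dir, out_dir="subset_data"):
    """Copy each selected image into out_dir/<class name>/ (assumes <GalaxyID>.jpg)."""
    for label, ids in ids_by_class.items():
        target = Path(out_dir) / label
        target.mkdir(parents=True, exist_ok=True)
        for galaxy_id in ids:
            shutil.copy(Path(image_dir) / f"{galaxy_id}.jpg", target)


# Hypothetical usage; the paths depend on where the archives were extracted:
# solutions = pd.read_csv("training_solutions_rev1.csv")
# copy_images(select_balanced_subset(solutions), "images_training_rev1")
```

Capping the sample at the smaller class keeps the subset balanced even if fewer than 757 galaxies pass the threshold for one class.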
The core of this project is a deep Convolutional Neural Network (CNN) designed with an architecture similar to AlexNet. Key components include:
- Input Layer: Accepts 227x227x3 RGB image data.
- Convolutional Blocks: A series of `Conv2D` layers (e.g., 96 filters, 11x11 kernel, stride 4; 256 filters, 5x5 kernel, etc.) with `ReLU` activation for hierarchical feature extraction.
- Max Pooling: Used after convolutional layers to reduce dimensionality and increase translation invariance.
- Fully Connected Layers: Two `Dense` layers, each with 4096 neurons and `ReLU` activation, process the flattened features.
- Dropout: Applied to the fully connected layers to prevent overfitting.
- Output Layer: A `Dense` layer with 2 units and `softmax` activation for binary classification probabilities.
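To see why a 227x227 input suits this kind of network, the spatial size of each feature map can be traced with the usual formula `(size + 2 * padding - kernel) // stride + 1`. The sketch below uses only the standard library. Only the first two convolutions (96 filters, 11x11, stride 4; 256 filters, 5x5) come from the description above; the remaining layer settings are assumed from the original AlexNet design and may differ from the notebook.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding, filters); filters is None for pooling layers.
# Layers after conv2 are assumptions based on the original AlexNet layout.
LAYERS = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool3", 3, 2, 0, None),
]


def trace_shapes(input_size=227, channels=3):
    """Return {layer name: (height, width, channels)} for a square input."""
    shapes = {}
    size = input_size
    for name, kernel, stride, padding, filters in LAYERS:
        size = out_size(size, kernel, stride, padding)
        if filters is not None:
            channels = filters
        shapes[name] = (size, size, channels)
    return shapes


if __name__ == "__main__":
    for name, shape in trace_shapes().items():
        print(f"{name}: {shape}")
```

Under these assumptions the first convolution produces a 55x55 map, since (227 - 11) / 4 + 1 = 55 divides evenly, and the last pooling layer leaves a 6x6x256 volume, so `Flatten` would pass 9216 features to the first `Dense` layer.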
- Libraries: `pandas` for data handling, `os` and `shutil` for file system operations, `tensorflow.keras` for model building and training, and `matplotlib` for plotting.
- Kaggle Integration: The Kaggle API was used directly in the Colab notebook to download the dataset.
- Data Preparation Script: Custom Python code was written to filter galaxies based on confidence scores and copy images into class-specific directories, ensuring a balanced dataset for training.
- `ImageDataGenerator`: Central to efficient data loading and preprocessing, providing normalized and resized image batches for the model.
- Model Definition: The AlexNet-like architecture was built layer by layer within a `create_alexnet` function.
- Training Loop: The model was trained using the `fit` method with specified epochs and validation data.
- Visualization: `matplotlib` plots display the model's performance metrics (accuracy and loss) over the training epochs.
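A `create_alexnet` function along these lines might look like the sketch below. It is not the notebook's code: the layers beyond the first two convolutions (384/384/256 filters with 3x3 kernels, 3x3 pooling with stride 2, `"same"` padding) are assumptions borrowed from the original AlexNet layout, and the parameter names are illustrative.

```python
from tensorflow.keras.layers import (
    Conv2D, Dense, Dropout, Flatten, Input, MaxPooling2D,
)
from tensorflow.keras.models import Sequential


def create_alexnet(input_shape=(227, 227, 3), num_classes=2):
    """Build an AlexNet-like Sequential model (layer details partly assumed)."""
    return Sequential([
        Input(shape=input_shape),
        # Convolutional base: feature extraction with spatial downsampling.
        Conv2D(96, 11, strides=4, activation="relu"),
        MaxPooling2D(pool_size=3, strides=2),
        Conv2D(256, 5, padding="same", activation="relu"),
        MaxPooling2D(pool_size=3, strides=2),
        Conv2D(384, 3, padding="same", activation="relu"),
        Conv2D(384, 3, padding="same", activation="relu"),
        Conv2D(256, 3, padding="same", activation="relu"),
        MaxPooling2D(pool_size=3, strides=2),
        # Classifier head: two 4096-unit layers, each followed by dropout.
        Flatten(),
        Dense(4096, activation="relu"),
        Dropout(0.5),
        Dense(4096, activation="relu"),
        Dropout(0.5),
        Dense(num_classes, activation="softmax"),
    ])


if __name__ == "__main__":
    model = create_alexnet()
    # Same optimizer, loss, and metric as described for the training step.
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()
```

The `categorical_crossentropy` loss expects one-hot labels, which is consistent with a 2-unit `softmax` output rather than a single sigmoid unit.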
- Kaggle API Key Setup:
  - Download your `kaggle.json` file from your Kaggle account (Profile -> Account -> Create New API Token).
  - Upload `kaggle.json` to your Colab environment in the first code cell.
  - Ensure proper permissions are set (`!chmod 600 ~/.kaggle/kaggle.json`).
- Execute Cells Sequentially: Run all code cells in the notebook from top to bottom.
  - The first cell will download and extract the dataset.
  - Subsequent cells will handle data preparation, model definition, training, and visualization.
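The key setup can also be done from Python instead of shell commands. The helper below is a hypothetical sketch (the function name and its `home` parameter are not part of the notebook). It copies an uploaded `kaggle.json` into `~/.kaggle/` and applies the same `600` permissions as the `chmod` command above.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_token(token_path, home=None):
    """Copy kaggle.json into <home>/.kaggle/ and restrict it to the owner."""
    home = Path(home) if home is not None else Path.home()
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    dest = kaggle_dir / "kaggle.json"
    shutil.copy(token_path, dest)
    # Equivalent to `chmod 600`: read/write for the owner only.
    os.chmod(dest, 0o600)
    return dest


# Hypothetical usage in Colab, after uploading kaggle.json to the working directory:
# install_kaggle_token("kaggle.json")
```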
The training history plots (Accuracy and Loss) at the end of the notebook provide insights into the model's performance. These plots indicate how well the model learned on the training data and generalized to the unseen validation data over 10 epochs.