Hello!
I'm encountering an error when running the code, consistently across both the MNIST and CIFAR-10 datasets. Regardless of the configuration I use (including the config files in the train_configs directory), it reports "Layer xx is NaN!" for every layer. I also receive the warning "WARNING:tensorboardX.x2num:NaN or Inf found in input tensor." The output looks like this:
Layer 0 is NaN!
Layer 1 is NaN!
Layer 2 is NaN!
Layer 3 is NaN!
Layer 4 is NaN!
Layer 5 is NaN!
Layer 6 is NaN!
Layer 7 is NaN!
Layer 8 is NaN!
Layer 9 is NaN!
Layer 0 is NaN!
Layer 1 is NaN!
Layer 2 is NaN!
Layer 3 is NaN!
Layer 4 is NaN!
Layer 5 is NaN!
Layer 6 is NaN!
Layer 7 is NaN!
Layer 8 is NaN!
Layer 9 is NaN!
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
round= 12 test_accuracy= 0.1046875 adv_success= 0 test_loss= nan duration= 2.9140660762786865
DEBUG:root:Memory info: 6584934400
Here is my mnist_setup.yml file for the MNIST dataset:
---
client:
benign_training:
batch_size: 64
learning_rate: 0.02
num_epochs: 2
optimizer: SGD
step_decay: true
debug_client_training: false
optimized_training: true
# clip:
# type: l2
# value: 10
model_name: lenet5_mnist
# quantization:
# type: probabilistic
# bits: 8
# frac: 7
dataset:
# augment_data: false
data_distribution: IID
dataset: mnist
environment:
experiment_name: lenet5_mnist
# load_model: ../models/resnet18.h5
num_clients: 48
num_malicious_clients: 0
num_selected_clients: 6
use_config_dir: true
print_every: 1
job:
cpu_cores: 20
cpu_mem_per_core: 4096
gpu_memory_min: 10240
minutes: 10
use_gpu: 1
server:
aggregator:
name: FedAvg
global_learning_rate: 1
num_rounds: 35
num_test_batches: 20
...
And this is my mnist_setup.yml file for the CIFAR-10 dataset:
---
client:
benign_training:
batch_size: 64
learning_rate: 0.02
num_epochs: 2
optimizer: SGD
step_decay: true
debug_client_training: false
optimized_training: true
# clip:
# type: l2
# value: 10
model_name: lenet5_cifar
# quantization:
# type: probabilistic
# bits: 8
# frac: 7
dataset:
# augment_data: false
data_distribution: IID
dataset: cifar10
environment:
experiment_name: lenet5_cifar
# load_model: /home/hujia/fl-analysis/models/resnet18.h5
num_clients: 48
num_malicious_clients: 0
num_selected_clients: 6
use_config_dir: true
print_every: 1
job:
cpu_cores: 20
cpu_mem_per_core: 4096
gpu_memory_min: 10240
minutes: 10
use_gpu: 1
server:
aggregator:
name: FedAvg
global_learning_rate: 1
num_rounds: 35
num_test_batches: 20
...
I suspect that the issue might stem from an incorrect version of a package in my environment configuration, but what confuses me is that the code runs correctly with the Shakespeare dataset.
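In case it helps narrow down the version question, here is a minimal sketch of how I could list the installed package versions for this report. The package list is an assumption on my side: only tensorboardX appears in the log above, and tensorflow and numpy are guesses at what else is relevant. The helper name collect_versions is made up for illustration.

```python
from importlib import metadata


def collect_versions(packages):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to the string "not installed".
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Assumed package list: only tensorboardX is confirmed by the log output;
    # tensorflow and numpy are guesses and may need to be adjusted.
    packages = ["tensorflow", "numpy", "tensorboardX"]
    for name, version in collect_versions(packages).items():
        print(f"{name}=={version}")
```

Running this inside the same environment that produces the NaN output would give a version list to compare against the project's requirements.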