
Shared storage for Atlassian Data Center tools

Objective

The following information is meant to help you start conversations about your internal storage design and needs for use with Atlassian Data Center tools. We've also provided some troubleshooting tips to assist you as needed.

Storage is not a simple subject; it requires thought and careful implementation to build a successful design. Engineering teams have been working on shared, clustered, and highly available storage for decades. Our intent here is to share some of our knowledge and practices for deploying and using storage services and solutions in your environment.

Please review the appropriate hardware requirements pages to determine the needs for your Atlassian product(s).

Product hardware requirements pages:
  • Bitbucket DC
  • Confluence DC
  • Jira/JSD DC

A brief introduction to iSCSI SAN

Internet Small Computer Systems Interface (iSCSI, pronounced "eye-scuzzy") is an IP-based storage networking standard for linking data storage facilities. It provides block-level access to storage devices by carrying SCSI commands over a TCP/IP network. iSCSI is used to facilitate data transfers over intranets and to manage storage over long distances. It can be used to transmit data over local area networks (LANs), wide area networks (WANs), or the Internet, and can enable location-independent data storage and retrieval.
The term SAN stands for storage area network. A SAN allows organizations to consolidate storage into storage arrays while providing clients (such as database and web servers) with the appearance of locally attached SCSI disks. iSCSI mainly competes with Fibre Channel, but unlike traditional Fibre Channel, which usually requires dedicated cabling, iSCSI can be run over longer distances using existing network infrastructure. iSCSI was pioneered by IBM and Cisco in 1998 and submitted as a draft standard in March 2000.
An iSCSI SAN, therefore, is simply a SAN carried locally over TCP/IP. You can see an example of a typical iSCSI SAN in this diagram.
In a data center, you would never deploy storage traffic on the same network, switch, or even fabric as the standard Ethernet traffic used by applications and users. A typical design includes dedicated infrastructure for the SAN. As shown in the linked diagram above, standard networking operates on Fabric A, which may include clients and applications, while Fabrics B and C are dedicated to storage.
Due to this, we have specialized networks just for storage I/O, including:
  • Dedicated PCI-E lanes for network adaptors
  • Specialized ASICs for the storage traffic type
    • Low latency
    • High packet and frame throughput
  • Jumbo frames (more I/O in a single frame!)
  • Dedicated network adaptors for the traffic type
  • Dedicated network switches for the traffic type
    • Fibre Channel
    • InfiniBand
    • iSCSI
  • Backplane switching over several ASICs
  • Fabric networks without routing (no overhead, less latency, no switch or router traversal)
    • No VLANs
This design is very common in SAN deployments, and the same approach is used in Network File System (NFS) networking for virtualization and high-performance clustering to support thousands of virtual machines.
This is only a brief overview of the subject, but it shows how much thought goes into designing and delivering high-speed storage: high-throughput movement of data across low-latency networks. It takes considerable effort to deliver storage and access solutions that result in optimal application performance.
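To make the initiator side of this concrete: on Linux, the open-iscsi daemon exposes several of the session and queueing knobs discussed above in /etc/iscsi/iscsid.conf. The values below are illustrative only, not recommendations; defaults vary by distribution and should be tuned against your array vendor's guidance.

```ini
# /etc/iscsi/iscsid.conf — illustrative initiator tuning (values are examples only)

# How long to wait for a failed path before failing I/O up the stack
node.session.timeo.replacement_timeout = 120

# Maximum outstanding SCSI commands per session
node.session.cmds_max = 128

# Queue depth per LUN
node.session.queue_depth = 32

# Largest data segment the initiator will accept in one iSCSI PDU
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
```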

So why do we need to understand this better?

In the application space, you will need to provide shared network storage for Data Center products. The problems you may face are often directly related to storage configuration and deployment. Every millisecond of less-than-optimal storage response adds to the time the application takes to respond, which can negatively affect how fast the application feels to your end users.
NFS can be deployed easily by customers on Windows and Linux, shared out from NAS devices, or even served directly from your SAN devices. These solutions, while less costly, tend to work in the short term but show signs of sluggishness as you scale toward true enterprise workloads. To help you provide better solutions for your environment, we have spent time working on guidance to assist in your configuration.

Solutions & guidance considerations

Not every solution provided herein will be possible for everyone, and we are aware of these limitations. However, working on even some of these changes could provide as much as a 15% performance improvement in storage. Considering that shared storage is used for indexing, attachments, and other aspects of Data Center products, this can significantly improve end-user experience as well.

Build or attach to dedicated storage networks

  • Avoid having application or “LAN” traffic on your storage network
  • Use storage-optimized NICs and switches in the separate SAN

Use dedicated NICs on this network with iSCSI or storage optimizations

  • Use virtual adaptors connected to a storage vSwitch
  • Perform hardware offloading to the physical NIC
  • Full duplex settings on switch ports and NICs without auto-negotiation

“Jumbo frames” on the entire network

  • Every point in a storage network must use the same or larger frame size (MTU 9000+)

If the MTU for jumbo frames is not consistent across this network, you may incur data corruption and see no improvement

  • Avoid IP fragmentation
  • More I/O can be sent in a single frame
  • This includes physical NICs on the server, switch ports, virtual switches, and virtual adaptors
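To get a feel for why larger frames matter, here is a back-of-the-envelope comparison of how many Ethernet frames a 1 MiB NFS transfer needs at standard vs. jumbo MTU. It assumes a flat 40 bytes of IP+TCP headers per frame; real overhead varies with options and encapsulation, so treat this strictly as an illustration.

```shell
# Usable payload per frame = MTU - 20 (IP header) - 20 (TCP header) bytes.
payload_std=$((1500 - 40))
payload_jumbo=$((9000 - 40))

transfer=$((1024 * 1024))   # a 1 MiB read or write

# Round up: any remainder needs one more frame.
frames_std=$(( (transfer + payload_std - 1) / payload_std ))
frames_jumbo=$(( (transfer + payload_jumbo - 1) / payload_jumbo ))

echo "MTU 1500: $frames_std frames"
echo "MTU 9000: $frames_jumbo frames"
```

Roughly six times fewer frames means six times fewer per-frame header and interrupt costs for the same data.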

Eliminate VLANs and routing

  • VLANs add Ethernet overhead; it's best to avoid them
  • As soon as you introduce routing, you incur a performance penalty for every packet
  • This includes crossing subnets, availability zones, etc.

Use multiple network paths and load balance

  • Load balance at the switch and NIC

Optimize settings

  • Use FAST handshaking
  • Use TCP recycling
  • Ensure NIC frame buffers are optimized on NICs/virtual adaptors and offloaded to hardware (TSO)
    • Not possible in cloud deployments
  • Use high-performance vNICs such as VMXNET3
  • Match rsize/wsize to the provisioned LUN/volume/RAID stripe and block sizes
    • Defaults may be too big or too small; adjust these to suit the desired workload
  • Provision stripe sizes for storage
  • Enable or disable RX/TX flow control per your device/NFS server vendor recommendations
  • Tune the Linux kernel/TCP window scaling
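As one concrete example of the rsize/wsize guidance above, NFS client mount options are commonly pinned in /etc/fstab. The server name, export path, mount point, and 1 MiB sizes below are hypothetical placeholders; match them to your own storage layout and vendor recommendations.

```
# /etc/fstab — hypothetical NFS client entry for an Atlassian shared home
# rsize/wsize shown at 1 MiB as an example; align with your storage stripe/block sizes.
nfs-server.example.com:/export/atlassian  /var/atlassian/shared  nfs  rw,nfsvers=4.1,hard,noatime,rsize=1048576,wsize=1048576,timeo=600,retrans=2  0  0
```

The `hard` option is generally preferred for shared application data, since `soft` mounts can return I/O errors to the application on timeouts.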

Use the correct tools and devices

  • Use 10 GbE+ Networking
  • Leverage the correct storage
    • Read-optimized
    • Write-optimized
    • NVMe
    • Write-through I/O
  • Use enterprise-grade Linux with support
A simple design for this on vSphere with dedicated virtual switches, network interfaces, larger frames, and reduced latency is shown below:
Key benefits:
  • Dedicated networks for application traffic (users/client side)
  • Dedicated networks for NFS/storage/clustering (node to node or server to storage)
  • Reduced latency
  • Increased bandwidth
  • Better availability
  • Better scalability

Roll your own custom NFS

Linux-based NFS servers are powerful, scalable, and simple to deploy. Using the above guidelines, you can optimize your deployment for different workloads and requirements. Each Atlassian application has different workload and I/O requirements, so it's important to understand the I/O workloads and work with your infrastructure teams to get the most out of your environment.
Application | IO profile | Disk recommendation
Bitbucket | Very small (<1 MB), random IO | NVMe, SSD, hybrid
Bamboo | Very small (<1 MB), random IO + sequential IO (artifacts) | NVMe, SSD, hybrid
Jira | Small/medium (1-4 MB), random IO | NVMe, SSD, hybrid
Confluence | Medium/large, random IO + sequential IO | NVMe, SSD, hybrid, RAID HDD

This is a high level example. Use cases and workloads can change by users, projects, use of applications, business model, and industry.
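If you do roll your own Linux NFS server, the export itself is defined in /etc/exports. The export path and subnet below are hypothetical examples; restrict the export to your application-node network and review the squash and sync options against your own security and durability requirements.

```
# /etc/exports — hypothetical export for an application-node subnet
# rw:                read-write access for the nodes
# sync:              commit writes to stable storage before replying
# no_root_squash:    allow root on the client to act as root on the export
# no_subtree_check:  skip subtree checking for better reliability on renames
/export/atlassian  10.0.2.0/24(rw,sync,no_root_squash,no_subtree_check)
```

After editing, `exportfs -ra` re-reads the file and applies the changes.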

Linux environments

Using an enterprise, production-grade Linux is highly recommended, as this will provide better support overall. NFS servers typically perform network/IO passthrough to disk; this is mostly a kernel/CPU-intensive operation and requires only a minimal memory footprint.
A Linux NFS server could be made from the following specs:
Profile | CPU | Memory | Notes
Small | 4 vCPU | 4 GB+ | Small instances; single node
Medium | 8 vCPU | 8 GB+ | Medium-sized instances; 2-3 application nodes
Large | 16 vCPU+ | 32-48 GB+ | Large, high-volume instances; several applications/nodes

By default, Linux will consume as much memory as possible to provide filesystem and object caching. More memory can increase these caches.

Test different profiles and workloads to see which profile suits each application and workload for your tools within your own environment.

A note on Git workloads

Git can be a very resource-intensive tool. It relies heavily on CPU cores, CPU frequency, and system memory, but the greatest factor is disk I/O. A typical workload on a NAS might be a team storing videos, photos, large documents, or backups. This is mostly "sequential IO", which most hard drives handle well: storage blocks are lined up and written in a single "spin" of an HDD.
For example: copying a movie to a hard drive might see 100 MB/s+ transfer speeds in disk I/O.
Git, however, behaves in the opposite way. It typically handles many small files, which is harder for a traditional disk or RAID system to keep pace with. Benchmarking Git performance can be very difficult and can stress most storage systems due to the random-access nature and small size of the file transfers.
For example: copying 100,000 very small files might see only 1 MB/s+ transfer speeds in disk I/O.
Therefore, it's better to work with disks that can access data quickly. Solid state drives (SSDs) typically excel over traditional hard disk drives (HDDs) for these workloads. The fastest drives, non-volatile memory express (NVMe, also known as NVMHCIS), are the best recommendation here, as they excel at parallelism and concurrency and typically offer far lower access latency to the raw data on the drive. Most of today's computers ship with NVMe drives, and their access times and responsiveness are ideal. The downside to NVMe is that it can be much more expensive than either HDD or SSD, but the cost is worth the performance.
Your infrastructure teams can assist in getting the right storage solution for the tools that will access Git.
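A crude way to feel the sequential vs. small-file difference on your own storage is to compare one large write against many tiny writes. The sketch below (hypothetical paths, sized small so it finishes quickly) only creates the files; on real storage, wrap each phase with `time` to compare them.

```shell
# Toy illustration of sequential IO vs. many-small-file IO.
dir=$(mktemp -d)

# Phase 1: one large sequential write (50 MiB in 1 MiB blocks).
dd if=/dev/zero of="$dir/large.bin" bs=1M count=50 2>/dev/null

# Phase 2: many tiny files — a crude stand-in for a Git working tree.
mkdir "$dir/small"
i=0
while [ "$i" -lt 500 ]; do
    echo "blob $i" > "$dir/small/f$i"
    i=$((i + 1))
done

count=$(ls "$dir/small" | wc -l)
large=$(wc -c < "$dir/large.bin")
echo "wrote $count small files and one $large-byte file"
rm -rf "$dir"
```

On HDD-backed storage the second phase is typically the one that struggles, even though it moves far less data.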

Troubleshooting tools and documentation

NFS allows for extensible and flexible network-based storage. However, you may run into performance issues that are challenging to solve. When troubleshooting NFS, be aware that several teams and components may need to be involved to identify and resolve issues. The list below (not comprehensive) is provided to start the conversation internally and to help you configure your response teams should these issues arise.
Network engineering
  • Roles: network management, network security, network capacity, design & availability
  • Ownership: firewalls & routers, IPS & IDS, switching, OSI layers 1-4, QoS & bandwidth
  • Tools: Wireshark, tcpdump, netstat, switch/router OS CLIs and tools

Infrastructure/virtualization
  • Roles: provide virtual machines, operating systems & templates, monitoring and alerting
  • Ownership: virtualization stacks (vSphere, Hyper-V, Xen; GCP, Azure, AWS), clustering and HA, compute capacity (CPU & memory); sometimes manages storage too
  • Tools: platform-specific tools, VMware monitoring, NMON, networking tools (tcpdump, wireshark, etc.)

Storage & backups
  • Roles: provide storage to infrastructure teams, provide backup targets & solutions, manage data integrity, availability, and capacity
  • Ownership: SAN (PureStorage, EMC, HPE 3PAR, IBM, NetApp), NAS (CIFS/NFS shares, file shares for teams), backups (Veeam, CommVault, Backup Exec)
  • Tools: nfsstat, nfsiostat, tcpdump & wireshark, NMON, SAN & storage utilities

Generally, pulling these teams into a problem or incident with NFS will get the right people in the room and conversation. This is a rough example; teams and ownership may vary in your organization.

NFS documentation

There are many guides to assist you with configuring, implementing, and troubleshooting NFS in your environment. Below are a few linked articles that are available to assist you as needed.
Open documents:
If you are a Red Hat customer and have a Red Hat Support Enablement Number (SEN), you may also review the following:

Tweaks and tuning

NFS daemon

By default, the NFS service is started with 8 threads. This can be adjusted if you have a busy, highly concurrent NFS workload; increasing the value (doubling until you see diminishing returns) can provide additional throughput and better latency in Linux environments. On RHEL-family systems the thread count is set in:

/etc/sysconfig/nfs

The default is commented out and can be adjusted:

#RPCNFSDCOUNT=8
RPCNFSDCOUNT=16
RPCNFSDCOUNT=32
RPCNFSDCOUNT=64
RPCNFSDCOUNT=128
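On newer distributions where nfs-utils reads /etc/nfs.conf instead, the same thread count lives in the [nfsd] section. The value of 32 below is an example starting point, not a recommendation:

```ini
# /etc/nfs.conf — thread count for the NFS server daemon
[nfsd]
threads=32
```

Restart the NFS server service after changing it for the new thread count to take effect.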

Tuned

Tuned is a RHEL/CentOS service that tweaks kernel and system parameters for different workloads. The tuned service can be installed quickly and easily:

# Install & enable
yum install tuned
systemctl enable tuned
systemctl start tuned

# Tuned CLI
tuned-adm list
tuned-adm profile throughput-performance

Red Hat, Oracle, and IBM provide several profiles for different workloads. Using this tool can help apply some of the best practices within Linux for a target workload. More profiles are found here: tuned/profiles at master · redhat-performance/tuned
You can even write your own profiles, building on an existing one and adding tweaks that could be recommended by Premier Support or development.
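A custom profile is just a directory under /etc/tuned containing a tuned.conf that can include an existing profile and override settings. The profile name and sysctl values below are hypothetical examples, not recommendations:

```ini
# /etc/tuned/atlassian-nfs/tuned.conf — hypothetical custom profile
[main]
include=throughput-performance

[sysctl]
# Example: larger socket buffers for 10 GbE storage traffic (tune to your network)
net.core.rmem_max=16777216
net.core.wmem_max=16777216
```

Activate it with `tuned-adm profile atlassian-nfs` and verify with `tuned-adm active`.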

Troubleshooting tools

Handy tools for troubleshooting NFS performance

More details on using these are found below in the Red Hat documentation.

ethtool

ethtool is a Linux tool for querying and changing NIC settings at a very low level.

ethtool -S ethX   # Displays statistics and errors
ethtool -a ethX   # Displays whether and how flow control is enabled
ethtool -k ethX   # Displays how offloading is configured

netstat

netstat prints data about open connections and protocol-stack statistics, including device info, TCP socket info, and Unix sockets.

netstat -s   # Provides per-protocol statistics, including error counters

Dropwatch (Red Hat)

Dropwatch is a tool used to determine where packets are dropped by the kernel. It's useful in performance cases to narrow down the problem area.

dropwatch -l kas
> start
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
2 drops at ip_forward+1b7 (0xffffffffbd809777)
4 drops at ip_forward+1b7 (0xffffffffbd809777)
1 drops at sk_stream_kill_queues+48 (0xffffffffbd7a2158)
1 drops at __brk_limit+25100c6 (0xffffffffc0fe70c6)
1 drops at ip_forward+1b7 (0xffffffffbd809777)
2 drops at ip_forward+1b7 (0xffffffffbd809777)
3 drops at ip_forward+1b7 (0xffffffffbd809777)

nfsstat

nfsstat is a performance and statistics tracker for NFS/RPC calls. It's useful on the server for diagnosing misbehaving NFS servers and clients.

nfsstat -r -l   # Only RPC stats
nfsstat -s -l   # Only server stats
nfsstat -c -l   # Only client stats

nfsiostat

nfsiostat provides client-side mount statistics. It includes details like:
  • Operations per second
  • RPC backlog/queue
  • kB read/write per second
  • kB/op read/write (size per read/write)
  • Retransmissions
  • Average round-trip time
  • Average exec time (duration in kernel)
  • Average queue time (ms)

man nfsiostat
nfsiostat [[<interval>] [<count>]] [<options>] [<mount_point>]

tcpdump

tcpdump is useful for capturing network data between the NFS client and NFS server. It can also validate packet sizes, frame lengths, timings, and a lot more. The capture can be parsed by Wireshark:

tcpdump -vvv -n -i ethX host localhost and nfs-server.com

nmon

nmon is one of the oldest and most powerful performance-analysis tools for Linux. Originally developed at IBM, it is a very powerful tool with next to zero overhead on the host machine.
nmon has two modes: a graphical/TUI shell output and a background logging mode. Background logging mode collects performance stats that can be parsed later with tools like Excel or Java-based analyzers.

yum install epel-release -y && yum install nmon

To capture data with 1-second sample intervals:

nmon -f -t -s 1

This writes a .nmon file to the working directory, which can then be parsed for analysis. .nmon files and reports can be shared with your support teams easily.

Measuring success and changes

To understand the outcome of any changes made in your environment, you need to measure data to determine what is effective. For that reason, we recommend making all such changes in a lower environment before placing them into production, and testing thoroughly to tune and configure your network, client, and server environment based on your workloads.

Tips

  • Always perform benchmarks
  • Create tests and benchmarks based on your use case and workloads
    • Examples might include transferring lots of small Bitbucket repos with many small files vs. large photos/documents in Confluence Data Center
    • Simulate concurrency and user counts
  • Measure changes between settings and configurations
  • Involve infrastructure teams early and outline the application workload and requirements
  • Be sure to document test results, as some tweaks may negatively impact performance
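A trivial way to start the benchmarking habit is a dd-based write probe. This is only a rough sketch (purpose-built tools such as fio give far better numbers), and the argument handling is a hypothetical convenience; point it at the NFS mount you want to measure.

```shell
# Minimal write-throughput probe. Defaults to the current directory;
# pass the mount point under test as the first argument.
target=${1:-.}
f="$target/bench.$$"

# Write 25 MiB and force it to stable storage so the reported rate is honest.
dd if=/dev/zero of="$f" bs=1M count=25 conv=fdatasync 2>&1 | tail -n 1

size=$(wc -c < "$f")
rm -f "$f"
echo "wrote $size bytes"
```

Run it several times and at different times of day; a single sample tells you very little about a shared network filesystem.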

Conclusion

Every network is unique, and your workloads may dictate different solutions. It is important to understand that storage is a major component of Data Center deployments, and that greater throughput with lower latency will result in a more performant application.
